I built a limited CSV package for parsing data in Mahout at one point.  I
doubt that it was general enough to be helpful here, but the experience
might be.

The thing that *really* made a big difference in speed was to avoid copies
and conversions to String.  To do that, I built a state machine that
operated on bytes to do the parsing from byte arrays.  The parser passed
around offsets only.  Then when converting data, I converted directly from
the original byte array into the target type.  For the most common case (in
my data) of converting to Integers, this eliminated masses of object
allocation.  Because the conversion was special purpose (I assumed UTF-8
encoding and assumed that numbers could only use ASCII range digits), the
conversion to integers was also particularly fast.
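
To make that concrete, here is a rough sketch of the idea.  It is not the
Mahout code; the class and method names are made up for illustration, and
it is far simpler than a real state machine (no quoting, no escapes, no
CR handling, no overflow checks).  It only shows the two points above:
the scanner tracks offsets into the byte array, and the integer
conversion reads ASCII digits straight from those bytes without ever
building a String.

```java
// Hypothetical sketch: names and structure are illustrative only.
final class ByteCsvSketch {

    /**
     * Parses an optionally negative decimal integer from buf[start, end).
     * Assumes ASCII digits only (valid for UTF-8 input restricted to
     * ASCII-range digits). Overflow is not checked.
     */
    static int parseInt(byte[] buf, int start, int end) {
        if (start >= end) {
            throw new NumberFormatException("empty field");
        }
        boolean negative = buf[start] == '-';
        int i = negative ? start + 1 : start;
        if (i == end) {
            throw new NumberFormatException("no digits");
        }
        int value = 0;
        for (; i < end; i++) {
            int digit = buf[i] - '0';
            if (digit < 0 || digit > 9) {
                throw new NumberFormatException("non-digit byte at offset " + i);
            }
            value = value * 10 + digit;
        }
        return negative ? -value : value;
    }

    /**
     * Sums one column of comma-separated, newline-terminated records.
     * Only offsets are tracked; no per-field objects are allocated.
     * Empty fields in the selected column are skipped.
     */
    static long sumColumn(byte[] buf, int column) {
        long sum = 0;
        int fieldStart = 0;   // offset where the current field begins
        int fieldIndex = 0;   // zero-based column of the current field
        for (int i = 0; i <= buf.length; i++) {
            // Treat end of input as an implicit record terminator.
            byte b = (i == buf.length) ? (byte) '\n' : buf[i];
            if (b == ',' || b == '\n') {
                if (fieldIndex == column && i > fieldStart) {
                    sum += parseInt(buf, fieldStart, i);
                }
                fieldIndex = (b == ',') ? fieldIndex + 1 : 0;
                fieldStart = i + 1;
            }
        }
        return sum;
    }
}
```

A caller would read the file into a byte array (or map it) and call
something like ByteCsvSketch.sumColumn(bytes, 1).  The only allocations
are the input buffer itself and any exception thrown on bad input.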

Overall, this made about a 20x difference in speed.  This is not 20%; the
final time was 5% of the original.

On Thu, Mar 15, 2012 at 8:34 AM, sebb <seb...@gmail.com> wrote:

> In my testing, using final class variables for delimiter, escape etc
> (set in ctor) shaves about 1 sec off the time to read the world town
> data file compared with accessing these fields inline through the
> format field.
>
> Average time goes from c. 25.5 to c. 24.5 which is a 4% improvement.
>
> I suspect this is partly because the fetches are currently in loops
> rather than any getter overhead.
>
