I built a limited CSV package for parsing data in Mahout at one point. I doubt that it was general enough to be helpful here, but the experience might be.
The thing that *really* made a big difference in speed was avoiding copies and conversions to String. To do that, I built a state machine that operated on bytes and parsed directly from byte arrays. The parser passed around offsets only. Then, when converting data, I converted directly from the original byte array into the target type.

For the most common case (in my data) of converting to Integers, this eliminated masses of cons'ing (object allocation). Because the conversion was special purpose (I assumed UTF-8 encoding and assumed that numbers could only use ASCII-range digits), the conversion to integers was particularly fast.

Overall, this made about a 20x difference in speed. This is not 20%; the final time was 5% of the original.

On Thu, Mar 15, 2012 at 8:34 AM, sebb <seb...@gmail.com> wrote:

> In my testing, using final class variables for delimiter, escape etc
> (set in ctor) shaves about 1 sec off the time to read the world town
> data file compared with accessing these fields inline through the
> format field.
>
> Average time goes from c. 25.5 to c. 24.5 which is a 4% improvement.
>
> I suspect this is partly because the fetches are currently in loops
> rather than any getter overhead.
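
For illustration, here is a minimal sketch of the byte-offset approach described above. It is not the original Mahout code: the class, interface, and method names are hypothetical, and it assumes a comma delimiter, '\n' record terminators, ASCII digits only, and no quoting or escapes, so the state machine collapses to a single scanning loop. The point is only that fields travel as (start, end) offsets and integers are decoded straight from the bytes, with no String created along the way.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch: parse comma-separated integers straight from a byte array. */
final class ByteCsvSketch {

    /** Receives each field as offsets into the shared buffer; no String is created. */
    interface FieldHandler {
        void field(byte[] buf, int start, int end, boolean lastInRecord);
    }

    /** Converts ASCII digits in buf[start, end) to an int, with an optional leading '-'. */
    static int parseInt(byte[] buf, int start, int end) {
        if (start >= end) throw new NumberFormatException("empty field");
        boolean negative = buf[start] == '-';
        int i = negative ? start + 1 : start;
        if (i >= end) throw new NumberFormatException("no digits");
        int value = 0;
        for (; i < end; i++) {
            int d = buf[i] - '0';
            if (d < 0 || d > 9) throw new NumberFormatException("non-digit byte at offset " + i);
            value = value * 10 + d; // overflow is not checked in this sketch
        }
        return negative ? -value : value;
    }

    /**
     * Scans the buffer once and reports field boundaries as offsets.
     * Assumption: no quoting or escapes; an empty field at the very end
     * of the buffer (after a final comma, with no newline) is ignored.
     */
    static void scan(byte[] buf, int length, FieldHandler handler) {
        int fieldStart = 0;
        for (int i = 0; i < length; i++) {
            byte b = buf[i];
            if (b == ',') {
                handler.field(buf, fieldStart, i, false);
                fieldStart = i + 1;
            } else if (b == '\n') {
                handler.field(buf, fieldStart, i, true);
                fieldStart = i + 1;
            }
        }
        if (fieldStart < length) {
            handler.field(buf, fieldStart, length, true); // final record without newline
        }
    }

    /** Example consumer: sums every integer field in the buffer. */
    static long sum(byte[] data) {
        long[] total = {0};
        scan(data, data.length, (buf, s, e, last) -> total[0] += parseInt(buf, s, e));
        return total[0];
    }

    public static void main(String[] args) {
        byte[] data = "12,-7,300\n4,5,6\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(sum(data));
    }
}
```

A real parser would need more states for quoted fields and escapes, but the handler signature would stay the same: the consumer decides how to decode each byte range, which is where the saving over String-based parsing comes from.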