+1 on this in terms of functionality. We use CSV data stores intensively, so this is a need-to-have for us.
Didn't check the code though.

Manuel

________________________________________
From: Kasper Sørensen [[email protected]]
Sent: 26 July 2013 11:26
To: [email protected]
Subject: Re: [PATCH] Faster CsvDataContext implementation for single-line values

Slight correction. I had left out a bit of functionality for sorting
the fields correctly in the new Row implementation. So the performance
improvement is "only" 60% straight out, based on my tests:

// results with old impl: [13908, 13827, 14577]. Total= 42312
// results with new impl: [9052, 9200, 8193]. Total= 26445

2013/7/26 Kasper Sørensen <[email protected]>:
> Hi everyone,
>
> For one of our applications using MetaModel we have a customer with
> quite large files (100+ M records per file), and reading through them
> takes quite some time, although the CSV module of MetaModel is known
> by us to be one of the fastest modules.
>
> But these particular files (and probably many others) have a
> characteristic that we could utilize for an optimization: they don't
> allow values that span multiple lines. For instance, consider:
>
> name,company
> Kasper Sørensen, Human Inference
> Ankit Kumar, Human Inference
>
> This is a rather normal CSV layout. But our CSV parser also allows
> multiline values (if quoted), like this:
>
> "name","company"
> "Kasper Sørensen","Human
> Inference"
> "Ankit Kumar","Human Inference"
>
> Now, the optimization I had in mind is to delay the actual parsing of
> lines until the point where a value is needed. But this won't work
> with multiline values, since we wouldn't know whether to reserve a
> single line or multiple lines for the delayed/lazy CSV parser. So
> the module is slowed down by a blocking CSV parsing operation for
> each row.
>
> But if we let the user set a flag saying that he only expects/accepts
> single-line values, then we can actually simply read through the file
> with something like a BufferedReader and return Row objects that
> encapsulate the raw String line. The parsing of this line is then
> delayed and can potentially be made multithreaded.
>
> I made a quick prototype patch [1] (still a few improvements to be
> made) of this idea, and my quick'n'dirty tests showed up to ~65%
> performance increase in a multithreaded consumer environment!
>
> I did three runs before and after the improvement on a 30k record
> file. The results are the number of milliseconds used for reading
> through all the values of the file:
>
> // results with old impl: [13908, 13827, 14577]. Total= 42312
>
> // results with new impl: [8567, 8965, 8154]. Total= 25686
>
> The test that I ran is the class called 'CsvBigFileMemoryTest.java'.
>
> What do you guys think? Is it feasible to make an optimization like
> this for a specific type of CSV file?
>
> [1] https://gist.github.com/kaspersorensen/6087230
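To illustrate the idea being discussed, here is a minimal Java sketch of a row that keeps the raw line and only splits it into values on first access. The class and method names (LazyCsvRow, getValue) are hypothetical, not MetaModel's actual Row API, and the naive split stands in for a real quote-aware parser:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.regex.Pattern;

    // Sketch of the delayed-parsing idea. It is safe only under the
    // single-line-values assumption: one raw line == one record.
    class LazyCsvRow {
        private final String rawLine;
        private final char separator;
        private volatile String[] values; // filled in lazily, at most once

        LazyCsvRow(String rawLine, char separator) {
            this.rawLine = rawLine;
            this.separator = separator;
        }

        String getValue(int columnIndex) {
            String[] v = values;
            if (v == null) {
                synchronized (this) {
                    if (values == null) {
                        // Naive split for illustration; a real parser
                        // would also honor quoting and escape characters.
                        values = rawLine.split(
                                Pattern.quote(String.valueOf(separator)), -1);
                    }
                    v = values;
                }
            }
            return v[columnIndex];
        }
    }

    class LazyCsvReaderDemo {
        public static void main(String[] args) throws IOException {
            try (BufferedReader reader =
                    new BufferedReader(new FileReader(args[0]))) {
                reader.readLine(); // skip the header line
                String line;
                while ((line = reader.readLine()) != null) {
                    // The reader thread only does I/O; the actual split
                    // runs when (and on whichever thread) getValue() is
                    // called, so parsing can be spread over consumers.
                    LazyCsvRow row = new LazyCsvRow(line, ',');
                    System.out.println(row.getValue(0));
                }
            }
        }
    }

The point of the sketch is that the single reader thread is reduced to plain line I/O, while the per-row parsing cost moves onto the consuming threads, which is consistent with the ~60-65% gains reported above for a multithreaded consumer.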
