I'll go ahead with the improvement commit then.
2013/8/1 Manuel van den Berg <[email protected]>:
> +1 on this in terms of functionality. We use CSV data stores intensively, so
> this is a need-to-have for us.
>
> Didn't check the code though.
>
> Manuel
> ________________________________________
> From: Kasper Sørensen [[email protected]]
> Sent: 26 July 2013 11:26
> To: [email protected]
> Subject: Re: [PATCH] Faster CsvDataContext implementation for single-line
> values
>
> Slight correction. I had left out a bit of functionality for sorting
> the fields correctly in the new Row implementation. So the performance
> improvement is "only" 60% straight out, based on my tests:
>
> // results with old impl: [13908, 13827, 14577]. Total= 42312
>
> // results with new impl: [9052, 9200, 8193]. Total= 26445
>
>
> 2013/7/26 Kasper Sørensen <[email protected]>:
>> Hi everyone,
>>
>> For one of our applications using MetaModel we have a customer with
>> quite large files (100+ million records per file), and reading through them
>> takes quite some time, although the CSV module of MetaModel is known
>> by us to be one of the fastest modules.
>>
>> But these particular files (and probably many others) have a
>> characteristic that we could utilize to make an optimization:
>> they don't allow values that span multiple lines. For instance,
>> consider:
>>
>> name,company
>> Kasper Sørensen, Human Inference
>> Ankit Kumar, Human Inference
>>
>> This is a rather normal CSV layout. But our CSV parser also allows
>> multiline values (if quoted), like this:
>>
>> "name","company"
>> "Kasper Sørensen","Human
>> Inference"
>> "Ankit Kumar","Human Inference"
>>
>> The optimization I had in mind is to delay the actual parsing of
>> lines until the point where a value is needed. But this won't work with
>> multiline values, since we wouldn't know whether to reserve only a
>> single line or multiple lines for the delayed/lazy CSV parser. So
>> the module is slowed down by a blocking CSV parsing
>> operation for each row.
>>
>> But if we add a flag by which the user declares that he only expects/accepts
>> single-line values, then we can simply read through the file
>> with something like a BufferedReader and return Row objects that
>> encapsulate the raw String line. The parsing of that line is then
>> delayed and can potentially be made multithreaded.
>>
>> I made a quick prototype patch [1] (still a few improvements to be
>> made) of this idea, and my quick-and-dirty tests showed up to ~65%
>> performance increase in a multithreaded consumer environment!
>>
>> I did three runs before and after the improvements on a 30k record
>> file. The results are the number of milliseconds used for reading through
>> all the values of the file:
>>
>> // results with old impl: [13908, 13827, 14577]. Total= 42312
>>
>> // results with new impl: [8567, 8965, 8154]. Total= 25686
>>
>> The test that I ran is the class called 'CsvBigFileMemoryTest.java'.
>>
>> What do you guys think? Is it feasible to make an optimization like
>> this for a specific type of CSV file?
>>
>> [1] https://gist.github.com/kaspersorensen/6087230
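
For readers following along, below is a rough, self-contained sketch of the lazy-parsing idea described in the thread. It is not the actual patch from [1] and not MetaModel's Row API; the LazyRow and SingleLineCsvSketch classes are purely illustrative, and the parser assumes the "single-line values" contract (quotes may wrap a value but never span lines) with no escape-character handling.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class SingleLineCsvSketch {

    /** A row that keeps the raw line and parses it on first access. */
    static final class LazyRow {
        private final String rawLine;
        private final char separator;
        private final char quote;
        private String[] values; // parsed lazily, on the consumer's thread

        LazyRow(String rawLine, char separator, char quote) {
            this.rawLine = rawLine;
            this.separator = separator;
            this.quote = quote;
        }

        String getValue(int index) {
            if (values == null) {
                values = parse(rawLine);
            }
            return values[index];
        }

        // Minimal single-line parser: quoted values are supported, but a
        // quote can never open on one line and close on another.
        private String[] parse(String line) {
            List<String> result = new ArrayList<>();
            StringBuilder current = new StringBuilder();
            boolean inQuotes = false;
            for (int i = 0; i < line.length(); i++) {
                char c = line.charAt(i);
                if (c == quote) {
                    inQuotes = !inQuotes;
                } else if (c == separator && !inQuotes) {
                    result.add(current.toString());
                    current.setLength(0);
                } else {
                    current.append(c);
                }
            }
            result.add(current.toString());
            return result.toArray(new String[0]);
        }
    }

    /** The reader thread only does cheap readLine() calls; no parsing here. */
    static List<LazyRow> readRows(BufferedReader reader) throws IOException {
        List<LazyRow> rows = new ArrayList<>();
        String line;
        while ((line = reader.readLine()) != null) {
            rows.add(new LazyRow(line, ',', '"'));
        }
        return rows;
    }

    public static void main(String[] args) throws IOException {
        String csv = "name,company\nKasper Sørensen,Human Inference\nAnkit Kumar,Human Inference\n";
        List<LazyRow> rows = readRows(new BufferedReader(new StringReader(csv)));
        // Parsing happens here, on access, not during the read loop above.
        System.out.println(rows.get(1).getValue(0)); // prints "Kasper Sørensen"
    }
}

The point of the sketch is the split of work: the read loop stays a tight sequence of readLine() calls, while the per-line tokenization moves into getValue(), where multiple consumer threads can pay for it in parallel.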
