+1 on this in terms of functionality. We use CSV data stores intensively, so this is a need-to-have for us.
Didn't check the code though.

Manuel

________________________________________
From: Kasper Sørensen [[email protected]]
Sent: 26 July 2013 11:26
To: [email protected]
Subject: Re: [PATCH] Faster CsvDataContext implementation for single-line values

Slight correction. I had left out a bit of functionality for sorting
the fields correctly in the new Row implementation. So the performance
improvement is "only" 60% straight out, based on my tests:

// results with old impl: [13908, 13827, 14577]. Total= 42312
// results with new impl: [9052, 9200, 8193]. Total= 26445

2013/7/26 Kasper Sørensen <[email protected]>:
> Hi everyone,
>
> For one of our applications using MetaModel we have a customer with
> quite large files (100+ M records per file), and reading through them
> takes quite some time, although the CSV module of MetaModel is known
> by us to be one of the fastest modules.
>
> But these particular files (and probably many others) have a
> characteristic that we could utilize for an optimization: they don't
> allow values that span multiple lines. For instance, consider:
>
> name,company
> Kasper Sørensen, Human Inference
> Ankit Kumar, Human Inference
>
> This is a rather normal CSV layout. But our CSV parser also allows
> multiline values (if quoted), like this:
>
> "name","company"
> "Kasper Sørensen","Human
> Inference"
> "Ankit Kumar","Human Inference"
>
> Now, the optimization I had in mind is to delay the actual parsing of
> lines until the point where a value is needed. But this won't work
> with multiline values, since we wouldn't know whether to reserve a
> single line or multiple lines for the delayed/lazy CSV parser. So
> the module is slowed down by a blocking CSV parsing operation for
> each row.
>
> But if we let the user set a flag saying that he only expects/accepts
> single-line values, then we can actually simply read through the file
> with something like a BufferedReader and return Row objects that
> encapsulate the raw String line. The parsing of this line is then
> delayed and can potentially be made multithreaded.
>
> I made a quick prototype patch [1] (still a few improvements to be
> made) of this idea, and my quick'n'dirty tests showed up to ~65%
> performance increase in a multithreaded consumer environment!
>
> I did three runs before and after the improvement on a 30k record
> file. The results are the number of milliseconds used for reading
> through all the values of the file:
>
> // results with old impl: [13908, 13827, 14577]. Total= 42312
>
> // results with new impl: [8567, 8965, 8154]. Total= 25686
>
> The test that I ran is the class called 'CsvBigFileMemoryTest.java'.
>
> What do you guys think? Is it feasible to make an optimization like
> this for a specific type of CSV file?
>
> [1] https://gist.github.com/kaspersorensen/6087230
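To illustrate the idea being discussed, here is a minimal Java sketch of a row that keeps the raw line and only splits it into values on first access. The class and method names (LazyCsvRow, getValue) are hypothetical, not MetaModel's actual Row API, and the naive split stands in for a real quote-aware parser:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.regex.Pattern;

    // Sketch of the delayed-parsing idea. It is safe only under the
    // single-line-values assumption: one raw line == one record.
    class LazyCsvRow {
        private final String rawLine;
        private final char separator;
        private volatile String[] values; // filled in lazily, at most once

        LazyCsvRow(String rawLine, char separator) {
            this.rawLine = rawLine;
            this.separator = separator;
        }

        String getValue(int columnIndex) {
            String[] v = values;
            if (v == null) {
                synchronized (this) {
                    if (values == null) {
                        // Naive split for illustration; a real parser
                        // would also honor quoting and escape characters.
                        values = rawLine.split(
                                Pattern.quote(String.valueOf(separator)), -1);
                    }
                    v = values;
                }
            }
            return v[columnIndex];
        }
    }

    class LazyCsvReaderDemo {
        public static void main(String[] args) throws IOException {
            try (BufferedReader reader =
                    new BufferedReader(new FileReader(args[0]))) {
                reader.readLine(); // skip the header line
                String line;
                while ((line = reader.readLine()) != null) {
                    // The reader thread only does I/O; the actual split
                    // runs when (and on whichever thread) getValue() is
                    // called, so parsing can be spread over consumers.
                    LazyCsvRow row = new LazyCsvRow(line, ',');
                    System.out.println(row.getValue(0));
                }
            }
        }
    }

The point of the sketch is that the single reader thread is reduced to plain line I/O, while the per-row parsing cost moves onto the consuming threads, which is consistent with the ~60-65% gains reported above for a multithreaded consumer.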
