I'll go ahead with the improvement commit then.
2013/8/1 Manuel van den Berg <[email protected]>:
> +1 on this in terms of functionality. We use CSV data stores intensively, so
> this is a need-to-have for us.
>
> Didn't check the code though.
>
> Manuel
> ________________________________________
> From: Kasper Sørensen [[email protected]]
> Sent: 26 July 2013 11:26
> To: [email protected]
> Subject: Re: [PATCH] Faster CsvDataContext implementation for single-line
> values
>
> Slight correction. I had left out a bit of functionality for sorting
> the fields correctly in the new Row implementation. So the performance
> improvement is "only" 60% straight out, based on my tests:
>
> // results with old impl: [13908, 13827, 14577]. Total= 42312
>
> // results with new impl: [9052, 9200, 8193]. Total= 26445
>
>
> 2013/7/26 Kasper Sørensen <[email protected]>:
>> Hi everyone,
>>
>> For one of our applications using MetaModel we have a customer with
>> quite large files (100+ million records per file), and reading through them
>> takes quite some time, although the CSV module of MetaModel is known
>> by us to be one of the fastest modules.
>>
>> But these particular files (and probably many others) have a
>> characteristic that we could utilize to make an optimization:
>> they don't allow values that span multiple lines. For instance,
>> consider:
>>
>> name,company
>> Kasper Sørensen, Human Inference
>> Ankit Kumar, Human Inference
>>
>> This is a rather normal CSV layout. But our CSV parser also allows
>> multiline values (if quoted), like this:
>>
>> "name","company"
>> "Kasper Sørensen","Human
>> Inference"
>> "Ankit Kumar","Human Inference"
>>
>> The optimization I had in mind is to delay the actual parsing of
>> lines until the point where a value is needed. But this won't work with
>> multiline values, since we wouldn't know whether to reserve only a
>> single line or multiple lines for the delayed/lazy CSV parser. So
>> the module is slowed down by a blocking CSV parsing
>> operation for each row.
>>
>> But if we add a flag by which the user declares that he only expects/accepts
>> single-line values, then we can simply read through the file
>> with something like a BufferedReader and return Row objects that
>> encapsulate the raw String line. The parsing of that line is then
>> delayed and can potentially be made multithreaded.
>>
>> I made a quick prototype patch [1] (still a few improvements to be
>> made) of this idea, and my quick-and-dirty tests showed up to ~65%
>> performance increase in a multithreaded consumer environment!
>>
>> I did three runs before and after the improvements on a 30k record
>> file. The results are the number of milliseconds used for reading through
>> all the values of the file:
>>
>> // results with old impl: [13908, 13827, 14577]. Total= 42312
>>
>> // results with new impl: [8567, 8965, 8154]. Total= 25686
>>
>> The test that I ran is the class called 'CsvBigFileMemoryTest.java'.
>>
>> What do you guys think? Is it feasible to make an optimization like
>> this for a specific type of CSV file?
>>
>> [1] https://gist.github.com/kaspersorensen/6087230
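
For readers following along, below is a rough, self-contained sketch of the lazy-parsing idea described in the thread. It is not the actual patch from [1] and not MetaModel's Row API; the LazyRow and SingleLineCsvSketch classes are purely illustrative, and the parser assumes the "single-line values" contract (quotes may wrap a value but never span lines) with no escape-character handling.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class SingleLineCsvSketch {

    /** A row that keeps the raw line and parses it on first access. */
    static final class LazyRow {
        private final String rawLine;
        private final char separator;
        private final char quote;
        private String[] values; // parsed lazily, on the consumer's thread

        LazyRow(String rawLine, char separator, char quote) {
            this.rawLine = rawLine;
            this.separator = separator;
            this.quote = quote;
        }

        String getValue(int index) {
            if (values == null) {
                values = parse(rawLine);
            }
            return values[index];
        }

        // Minimal single-line parser: quoted values are supported, but a
        // quote can never open on one line and close on another.
        private String[] parse(String line) {
            List<String> result = new ArrayList<>();
            StringBuilder current = new StringBuilder();
            boolean inQuotes = false;
            for (int i = 0; i < line.length(); i++) {
                char c = line.charAt(i);
                if (c == quote) {
                    inQuotes = !inQuotes;
                } else if (c == separator && !inQuotes) {
                    result.add(current.toString());
                    current.setLength(0);
                } else {
                    current.append(c);
                }
            }
            result.add(current.toString());
            return result.toArray(new String[0]);
        }
    }

    /** The reader thread only does cheap readLine() calls; no parsing here. */
    static List<LazyRow> readRows(BufferedReader reader) throws IOException {
        List<LazyRow> rows = new ArrayList<>();
        String line;
        while ((line = reader.readLine()) != null) {
            rows.add(new LazyRow(line, ',', '"'));
        }
        return rows;
    }

    public static void main(String[] args) throws IOException {
        String csv = "name,company\nKasper Sørensen,Human Inference\nAnkit Kumar,Human Inference\n";
        List<LazyRow> rows = readRows(new BufferedReader(new StringReader(csv)));
        // Parsing happens here, on access, not during the read loop above.
        System.out.println(rows.get(1).getValue(0)); // prints "Kasper Sørensen"
    }
}

The point of the sketch is the split of work: the read loop stays a tight sequence of readLine() calls, while the per-line tokenization moves into getValue(), where multiple consumer threads can pay for it in parallel.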
