Re: [csv] Performance comparison
On Tue, Mar 13, 2012 at 4:33 AM, Ralph Goers wrote:
>> I don't think we should be trying to recode JDK classes.
>
> If the implementations suck, why not?

+1

--
http://www.grobmeier.de
https://www.timeandbill.de
Re: [csv] Performance comparison
After more experiments I'm less enthusiastic about providing an optimized BufferedReader. The result of the performance test differs significantly depending on whether the test is run alone or after all the other unit tests (about 30% slower). When all the tests are executed, the removal of the synchronized blocks in BufferedReader has no visible effect (maybe less than 1%), and the Harmony implementation becomes slower.

Emmanuel Bourg

On 13/03/2012 10:20, Emmanuel Bourg wrote:
> On 13/03/2012 02:47, Niall Pemberton wrote:
>> IMO performance should be taken out of the equation by using the Readable interface[1]. That way the users can use whatever implementation suits them (for example using an underlying buffered InputStream) to change/improve performance.
>
> If you mean that the performance of BufferedReader should be taken out of the equation then I agree. All CSV parsers should be compared with the same input source, otherwise the comparison isn't fair.
>
> Using Readable would be really nice, but that's very low level. We would have to build line reading and mark/reset on top of that, which is almost equivalent to reimplementing BufferedReader.
>
> If [io] could provide a BufferedReader implementation that:
> - takes a Readable in the constructor
> - does not synchronize reads
> - recognizes unicode line separators (and the classic ones)
>
> then I buy it right away!
>
> Emmanuel Bourg
Re: [csv] Performance comparison
On 13/03/2012 02:47, Niall Pemberton wrote:
> IMO performance should be taken out of the equation by using the Readable interface[1]. That way the users can use whatever implementation suits them (for example using an underlying buffered InputStream) to change/improve performance.

If you mean that the performance of BufferedReader should be taken out of the equation then I agree. All CSV parsers should be compared with the same input source, otherwise the comparison isn't fair.

Using Readable would be really nice, but that's very low level. We would have to build line reading and mark/reset on top of that, which is almost equivalent to reimplementing BufferedReader.

If [io] could provide a BufferedReader implementation that:
- takes a Readable in the constructor
- does not synchronize reads
- recognizes unicode line separators (and the classic ones)

then I buy it right away!

Emmanuel Bourg
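To make the three wishes above concrete, here is a rough sketch of what such a reader could look like. It is not an existing [io] class: the name ReadableLineReader, the 8192-char buffer and the choice to rethrow IOException as UncheckedIOException are all assumptions made for the example, and mark/reset is left out.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.CharBuffer;

/** Hypothetical sketch: an unsynchronized line reader on top of any Readable. */
final class ReadableLineReader {
    private final Readable source;
    private final CharBuffer buffer = CharBuffer.allocate(8192); // size is arbitrary
    private boolean eof;

    ReadableLineReader(Readable source) {
        this.source = source;
        buffer.flip(); // start with an empty buffer, ready to be drained
    }

    /** Refills the buffer if it is empty; returns false once the input is exhausted. */
    private boolean fill() {
        while (!buffer.hasRemaining()) {
            if (eof) {
                return false;
            }
            buffer.clear();
            try {
                if (source.read(buffer) < 0) {
                    eof = true;
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            buffer.flip();
        }
        return true;
    }

    /** Returns the next line without its terminator, or null at end of input. No locking. */
    String readLine() {
        if (!fill()) {
            return null;
        }
        StringBuilder line = new StringBuilder();
        while (fill()) {
            char c = buffer.get();
            // LF, NEL (U+0085), LINE SEPARATOR (U+2028), PARAGRAPH SEPARATOR (U+2029)
            if (c == '\n' || c == 0x0085 || c == 0x2028 || c == 0x2029) {
                return line.toString();
            }
            if (c == '\r') {
                // Treat CR LF as one terminator: peek at the next char without consuming it.
                if (fill() && buffer.get(buffer.position()) == '\n') {
                    buffer.get();
                }
                return line.toString();
            }
            line.append(c);
        }
        return line.toString(); // last line had no terminator
    }

    public static void main(String[] args) {
        ReadableLineReader in = new ReadableLineReader(new StringReader("a,b\r\nc,d\ne,f"));
        for (String line = in.readLine(); line != null; line = in.readLine()) {
            System.out.println(line);
        }
    }
}
```

Any Reader is a Readable, so a FileReader or InputStreamReader could be passed in directly. Whether dropping the lock buys anything measurable would have to be benchmarked.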
Re: [csv] Performance comparison
On 13 March 2012 09:01, Emmanuel Bourg wrote:
> On 13/03/2012 01:44, sebb wrote:
>> I don't think we should be trying to recode JDK classes.
>
> I'd rather not, but we have done that in the past. FastDateFormat and StrBuilder come to mind.

And now Java has StringBuilder, which means StrBuilder is perhaps no longer necessary...

> Emmanuel Bourg
Re: [csv] Performance comparison
On 13/03/2012 01:44, sebb wrote:
> I don't think we should be trying to recode JDK classes.

I'd rather not, but we have done that in the past. FastDateFormat and StrBuilder come to mind.

Emmanuel Bourg
Re: [csv] Performance comparison
On Mar 12, 2012, at 5:44 PM, sebb wrote:
> On 13 March 2012 00:29, Emmanuel Bourg wrote:
>> On 13/03/2012 01:25, sebb wrote:
>>> I'm concerned that the CSV code may grow and grow with private versions of code that could be provided by the JDK.
>>>
>>> By all means make sure the code is efficient in the way it uses the JDK classes, but I don't think we should be recoding standard classes.
>>
>> I agree such a class should not live in [csv], but maybe in [io]?
>
> I don't think we should be trying to recode JDK classes.

If the implementations suck, why not?

Ralph
Re: [csv] Performance comparison
On 13 March 2012 01:47, Niall Pemberton wrote:
> On Tue, Mar 13, 2012 at 12:29 AM, Emmanuel Bourg wrote:
>> On 13/03/2012 01:25, sebb wrote:
>>> I'm concerned that the CSV code may grow and grow with private versions of code that could be provided by the JDK.
>>>
>>> By all means make sure the code is efficient in the way it uses the JDK classes, but I don't think we should be recoding standard classes.
>>
>> I agree such a class should not live in [csv], but maybe in [io]?
>
> IMO performance should be taken out of the equation by using the Readable interface[1]. That way the users can use whatever implementation suits them (for example using an underlying buffered InputStream) to change/improve performance.

+1, excellent suggestion.

> Niall
>
> [1] http://docs.oracle.com/javase/7/docs/api/java/lang/Readable.html
>
>> Emmanuel Bourg
Re: [csv] Performance comparison
On Tue, Mar 13, 2012 at 12:29 AM, Emmanuel Bourg wrote:
> On 13/03/2012 01:25, sebb wrote:
>> I'm concerned that the CSV code may grow and grow with private versions of code that could be provided by the JDK.
>>
>> By all means make sure the code is efficient in the way it uses the JDK classes, but I don't think we should be recoding standard classes.
>
> I agree such a class should not live in [csv], but maybe in [io]?

IMO performance should be taken out of the equation by using the Readable interface[1]. That way the users can use whatever implementation suits them (for example using an underlying buffered InputStream) to change/improve performance.

Niall

[1] http://docs.oracle.com/javase/7/docs/api/java/lang/Readable.html

> Emmanuel Bourg
Re: [csv] Performance comparison
On 13 March 2012 00:29, Emmanuel Bourg wrote:
> On 13/03/2012 01:25, sebb wrote:
>> I'm concerned that the CSV code may grow and grow with private versions of code that could be provided by the JDK.
>>
>> By all means make sure the code is efficient in the way it uses the JDK classes, but I don't think we should be recoding standard classes.
>
> I agree such a class should not live in [csv], but maybe in [io]?

I don't think we should be trying to recode JDK classes.

> Emmanuel Bourg
Re: [csv] Performance comparison
On Mar 12, 2012, at 20:30, Emmanuel Bourg wrote:
> On 13/03/2012 01:25, sebb wrote:
>> I'm concerned that the CSV code may grow and grow with private versions of code that could be provided by the JDK.
>>
>> By all means make sure the code is efficient in the way it uses the JDK classes, but I don't think we should be recoding standard classes.
>
> I agree such a class should not live in [csv], but maybe in [io]?

That would be better but we need to think twice before adding code.

Gary

> Emmanuel Bourg
Re: [csv] Performance comparison
On Mar 12, 2012, at 20:25, sebb wrote:
> On 13 March 2012 00:12, Emmanuel Bourg wrote:
>> I kept tinkering with ExtendedBufferedReader and I have some interesting results.
>>
>> First I tried to simplify it by extending java.io.LineNumberReader instead of BufferedReader. The performance decreased by 20%, probably because the class is synchronized internally.
>>
>> But wait, isn't BufferedReader also synchronized? I copied the code of BufferedReader and removed the synchronized blocks. Now the time to parse the file is down to 2652 ms, 28% faster than previously!
>>
>> Of course the code of BufferedReader can't be copied from the JDK due to the license mismatch, so I took the version from Harmony. On my test it is about 4% faster than the JDK counterpart, and the parsing time is now around 2553 ms.
>
> I'm concerned that the CSV code may grow and grow with private versions of code that could be provided by the JDK.
>
> By all means make sure the code is efficient in the way it uses the JDK classes, but I don't think we should be recoding standard classes.

+1

Gary

>> Now Commons CSV can start claiming to be the fastest CSV parser around :)
>>
>> Emmanuel Bourg
>>
>> On 12/03/2012 11:31, Emmanuel Bourg wrote:
>>> I have identified the performance killer, it's the ExtendedBufferedReader. It implements a complex logic to fetch one character ahead, but this extra character is rarely used. I have implemented a simpler look ahead using mark/reset as suggested by Bob Smith in CSV-42 and the performance improved by 30%.
>>>
>>> Now the parsing is down to 3406 ms, and that's almost without touching the parser yet.
>>>
>>> Emmanuel Bourg
>>>
>>> On 11/03/2012 15:05, Emmanuel Bourg wrote:
>>>> Hi,
>>>>
>>>> I compared the performance of Commons CSV with the other CSV parsers available. I took the world cities file from Maxmind as a test file [1], it's a big file of 130 MB with 2.8 million records.
>>>>
>>>> Here are the results obtained on a Core 2 Duo E8400 after several iterations to let the JIT compiler kick in:
>>>>
>>>>     Direct read      750 ms
>>>>     Java CSV        3328 ms
>>>>     Super CSV       3562 ms  (+7%)
>>>>     OpenCSV         3609 ms  (+8.4%)
>>>>     GenJava CSV     3844 ms  (+15.5%)
>>>>     Commons CSV     4656 ms  (+39.9%)
>>>>     Skife CSV       4813 ms  (+44.6%)
>>>>
>>>> I also tried Nuiton CSV and Esperio CSV but I couldn't figure out how to use them.
>>>>
>>>> I haven't analyzed why Commons CSV is slower yet, but it seems there is room for improvement. The memory usage will have to be compared too, I'm looking for a way to measure it.
>>>>
>>>> Emmanuel Bourg
>>>>
>>>> [1] http://www.maxmind.com/download/worldcities/worldcitiespop.txt.gz
Re: [csv] Performance comparison
On 13/03/2012 01:25, sebb wrote:
> I'm concerned that the CSV code may grow and grow with private versions of code that could be provided by the JDK.
>
> By all means make sure the code is efficient in the way it uses the JDK classes, but I don't think we should be recoding standard classes.

I agree such a class should not live in [csv], but maybe in [io]?

Emmanuel Bourg
Re: [csv] Performance comparison
On 13 March 2012 00:12, Emmanuel Bourg wrote:
> I kept tinkering with ExtendedBufferedReader and I have some interesting results.
>
> First I tried to simplify it by extending java.io.LineNumberReader instead of BufferedReader. The performance decreased by 20%, probably because the class is synchronized internally.
>
> But wait, isn't BufferedReader also synchronized? I copied the code of BufferedReader and removed the synchronized blocks. Now the time to parse the file is down to 2652 ms, 28% faster than previously!
>
> Of course the code of BufferedReader can't be copied from the JDK due to the license mismatch, so I took the version from Harmony. On my test it is about 4% faster than the JDK counterpart, and the parsing time is now around 2553 ms.

I'm concerned that the CSV code may grow and grow with private versions of code that could be provided by the JDK.

By all means make sure the code is efficient in the way it uses the JDK classes, but I don't think we should be recoding standard classes.

> Now Commons CSV can start claiming to be the fastest CSV parser around :)
>
> Emmanuel Bourg
>
> On 12/03/2012 11:31, Emmanuel Bourg wrote:
>> I have identified the performance killer, it's the ExtendedBufferedReader. It implements a complex logic to fetch one character ahead, but this extra character is rarely used. I have implemented a simpler look ahead using mark/reset as suggested by Bob Smith in CSV-42 and the performance improved by 30%.
>>
>> Now the parsing is down to 3406 ms, and that's almost without touching the parser yet.
>>
>> Emmanuel Bourg
>>
>> On 11/03/2012 15:05, Emmanuel Bourg wrote:
>>> Hi,
>>>
>>> I compared the performance of Commons CSV with the other CSV parsers available. I took the world cities file from Maxmind as a test file [1], it's a big file of 130 MB with 2.8 million records.
>>>
>>> Here are the results obtained on a Core 2 Duo E8400 after several iterations to let the JIT compiler kick in:
>>>
>>>     Direct read      750 ms
>>>     Java CSV        3328 ms
>>>     Super CSV       3562 ms  (+7%)
>>>     OpenCSV         3609 ms  (+8.4%)
>>>     GenJava CSV     3844 ms  (+15.5%)
>>>     Commons CSV     4656 ms  (+39.9%)
>>>     Skife CSV       4813 ms  (+44.6%)
>>>
>>> I also tried Nuiton CSV and Esperio CSV but I couldn't figure out how to use them.
>>>
>>> I haven't analyzed why Commons CSV is slower yet, but it seems there is room for improvement. The memory usage will have to be compared too, I'm looking for a way to measure it.
>>>
>>> Emmanuel Bourg
>>>
>>> [1] http://www.maxmind.com/download/worldcities/worldcitiespop.txt.gz
Re: [csv] Performance comparison
I kept tinkering with ExtendedBufferedReader and I have some interesting results.

First I tried to simplify it by extending java.io.LineNumberReader instead of BufferedReader. The performance decreased by 20%, probably because the class is synchronized internally.

But wait, isn't BufferedReader also synchronized? I copied the code of BufferedReader and removed the synchronized blocks. Now the time to parse the file is down to 2652 ms, 28% faster than previously!

Of course the code of BufferedReader can't be copied from the JDK due to the license mismatch, so I took the version from Harmony. On my test it is about 4% faster than the JDK counterpart, and the parsing time is now around 2553 ms.

Now Commons CSV can start claiming to be the fastest CSV parser around :)

Emmanuel Bourg

On 12/03/2012 11:31, Emmanuel Bourg wrote:
> I have identified the performance killer, it's the ExtendedBufferedReader. It implements a complex logic to fetch one character ahead, but this extra character is rarely used. I have implemented a simpler look ahead using mark/reset as suggested by Bob Smith in CSV-42 and the performance improved by 30%.
>
> Now the parsing is down to 3406 ms, and that's almost without touching the parser yet.
>
> Emmanuel Bourg
>
> On 11/03/2012 15:05, Emmanuel Bourg wrote:
>> Hi,
>>
>> I compared the performance of Commons CSV with the other CSV parsers available. I took the world cities file from Maxmind as a test file [1], it's a big file of 130 MB with 2.8 million records.
>>
>> Here are the results obtained on a Core 2 Duo E8400 after several iterations to let the JIT compiler kick in:
>>
>>     Direct read      750 ms
>>     Java CSV        3328 ms
>>     Super CSV       3562 ms  (+7%)
>>     OpenCSV         3609 ms  (+8.4%)
>>     GenJava CSV     3844 ms  (+15.5%)
>>     Commons CSV     4656 ms  (+39.9%)
>>     Skife CSV       4813 ms  (+44.6%)
>>
>> I also tried Nuiton CSV and Esperio CSV but I couldn't figure out how to use them.
>>
>> I haven't analyzed why Commons CSV is slower yet, but it seems there is room for improvement. The memory usage will have to be compared too, I'm looking for a way to measure it.
>>
>> Emmanuel Bourg
>>
>> [1] http://www.maxmind.com/download/worldcities/worldcitiespop.txt.gz
Re: [csv] Performance comparison
Yes, this is what I mean. It might be worth a shot. Folks who specialize in parsing have spent much time on these libraries, so it would make sense that they are quite fast. It gets us out of the parsing business.

On Mar 12, 2012 12:41 PM, "Emmanuel Bourg" wrote:
> On 12/03/2012 17:28, James Carman wrote:
>> Would one of the parser libraries not work here?
>
> Are you thinking of something like JavaCC or ANTLR? Not sure it'll be more efficient than a handcrafted parser. The CSV format is simple enough to do it manually.
>
> Emmanuel Bourg
Re: [csv] Performance comparison
On Mon, Mar 12, 2012 at 5:41 PM, Emmanuel Bourg wrote:
> On 12/03/2012 17:28, James Carman wrote:
>> Would one of the parser libraries not work here?
>
> Are you thinking of something like JavaCC or ANTLR? Not sure it'll be more efficient than a handcrafted parser. The CSV format is simple enough to do it manually.

+1

I did the same for my JSON lib... JavaCC et al. are pretty complex. I still struggle to understand everything around OGNL... If they are not necessary, my preference is always to leave such tools out.

> Emmanuel Bourg

--
http://www.grobmeier.de
https://www.timeandbill.de
Re: [csv] Performance comparison
On 12/03/2012 17:28, James Carman wrote:
> Would one of the parser libraries not work here?

Are you thinking of something like JavaCC or ANTLR? Not sure it'll be more efficient than a handcrafted parser. The CSV format is simple enough to do it manually.

Emmanuel Bourg
Re: [csv] Performance comparison
On 12 March 2012 at 17:22, Emmanuel Bourg wrote:
> On 12/03/2012 17:03, Benedikt Ritter wrote:
>> The whole logic behind CSVLexer.nextToken() is very hard to read (IMHO). Maybe some refactoring would help to make it easier to identify bottlenecks?
>
> Yes I started investigating in this direction. I filed a few bugs regarding the behavior of the escaping that aim at clarifying the parser.
>
> I think the nextToken() method should be broken into smaller methods to help the JIT compiler.

I would start by eliminating the Token parameter. You could either create a new token on each method call and return that one instead of reusing the one that gets passed in, or you could use a private field currentToken in CSVLexer. But I think that object creation costs for a data object like Token can be considered irrelevant (so creating one in each method call will not hurt us).

> The JIT does some surprising things, I found that even unused code branches can have an impact on the performance. For example if simpleTokenLexer() is changed to not support escaped characters, the performance improves by 10% (the input has no escaped character). And that's not merely because an if statement was removed. If I add a System.out.println() in this if block that is never called, the performance improves as well.
>
> So any change to the parser will have to be carefully tested. Innocent changes can have a significant impact.
>
> Emmanuel Bourg
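To illustrate the first option (return a fresh token per call rather than fill one passed in), here is a deliberately tiny sketch. Token, Type and SimpleLexer are made up for the example and are not the Commons CSV classes; the lexer only splits one record on commas and ignores quoting and escaping.

```java
/** Hypothetical sketch; Token and SimpleLexer are illustrative, not the Commons CSV classes. */
final class TokenSketch {
    enum Type { TOKEN, EORECORD }

    /** Small immutable data object, cheap to allocate on every call. */
    static final class Token {
        final Type type;
        final String content;

        Token(Type type, String content) {
            this.type = type;
            this.content = content;
        }
    }

    /** Splits a single record on commas; no quoting or escaping is handled. */
    static final class SimpleLexer {
        private final String record;
        private int pos;

        SimpleLexer(String record) {
            this.record = record;
        }

        /** Returns a fresh Token instead of filling one supplied by the caller. */
        Token nextToken() {
            int comma = record.indexOf(',', pos);
            if (comma < 0) {
                Token last = new Token(Type.EORECORD, record.substring(pos));
                pos = record.length();
                return last;
            }
            Token token = new Token(Type.TOKEN, record.substring(pos, comma));
            pos = comma + 1;
            return token;
        }
    }

    public static void main(String[] args) {
        SimpleLexer lexer = new SimpleLexer("Paris,48.85,2.35");
        Token t;
        do {
            t = lexer.nextToken();
            System.out.println(t.type + " " + t.content);
        } while (t.type != Type.EORECORD);
    }
}
```

Whether the extra allocations are really irrelevant on a file with millions of records is exactly the kind of thing that would need to be measured.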
Re: [csv] Performance comparison
Would one of the parser libraries not work here?

On Mar 12, 2012 12:22 PM, "Emmanuel Bourg" wrote:
> On 12/03/2012 17:03, Benedikt Ritter wrote:
>> The whole logic behind CSVLexer.nextToken() is very hard to read (IMHO). Maybe some refactoring would help to make it easier to identify bottlenecks?
>
> Yes I started investigating in this direction. I filed a few bugs regarding the behavior of the escaping that aim at clarifying the parser.
>
> I think the nextToken() method should be broken into smaller methods to help the JIT compiler.
>
> The JIT does some surprising things, I found that even unused code branches can have an impact on the performance. For example if simpleTokenLexer() is changed to not support escaped characters, the performance improves by 10% (the input has no escaped character). And that's not merely because an if statement was removed. If I add a System.out.println() in this if block that is never called, the performance improves as well.
>
> So any change to the parser will have to be carefully tested. Innocent changes can have a significant impact.
>
> Emmanuel Bourg
Re: [csv] Performance comparison
On 12/03/2012 17:03, Benedikt Ritter wrote:
> The whole logic behind CSVLexer.nextToken() is very hard to read (IMHO). Maybe some refactoring would help to make it easier to identify bottlenecks?

Yes I started investigating in this direction. I filed a few bugs regarding the behavior of the escaping that aim at clarifying the parser.

I think the nextToken() method should be broken into smaller methods to help the JIT compiler.

The JIT does some surprising things, I found that even unused code branches can have an impact on the performance. For example if simpleTokenLexer() is changed to not support escaped characters, the performance improves by 10% (the input has no escaped character). And that's not merely because an if statement was removed. If I add a System.out.println() in this if block that is never called, the performance improves as well.

So any change to the parser will have to be carefully tested. Innocent changes can have a significant impact.

Emmanuel Bourg
Re: [csv] Performance comparison
On 12 March 2012 at 11:31, Emmanuel Bourg wrote:
> I have identified the performance killer, it's the ExtendedBufferedReader. It implements a complex logic to fetch one character ahead, but this extra character is rarely used. I have implemented a simpler look ahead using mark/reset as suggested by Bob Smith in CSV-42 and the performance improved by 30%.
>
> Now the parsing is down to 3406 ms, and that's almost without touching the parser yet.

Great work Emmanuel! Looking at my profiler, I can say that 70% of the time is spent in ExtendedBufferedReader.read(). This is no wonder, since read() is the method that does the actual work. However, we should try to minimize accesses to read(). For example isEndOfLine() calls read() twice, and isEndOfLine() gets called 5 times by CSVLexer.nextToken() and its submethods.

The whole logic behind CSVLexer.nextToken() is very hard to read (IMHO). Maybe some refactoring would help to make it easier to identify bottlenecks?

Benedikt

> Emmanuel Bourg
>
> On 11/03/2012 15:05, Emmanuel Bourg wrote:
>> Hi,
>>
>> I compared the performance of Commons CSV with the other CSV parsers available. I took the world cities file from Maxmind as a test file [1], it's a big file of 130 MB with 2.8 million records.
>>
>> Here are the results obtained on a Core 2 Duo E8400 after several iterations to let the JIT compiler kick in:
>>
>>     Direct read      750 ms
>>     Java CSV        3328 ms
>>     Super CSV       3562 ms  (+7%)
>>     OpenCSV         3609 ms  (+8.4%)
>>     GenJava CSV     3844 ms  (+15.5%)
>>     Commons CSV     4656 ms  (+39.9%)
>>     Skife CSV       4813 ms  (+44.6%)
>>
>> I also tried Nuiton CSV and Esperio CSV but I couldn't figure out how to use them.
>>
>> I haven't analyzed why Commons CSV is slower yet, but it seems there is room for improvement. The memory usage will have to be compared too, I'm looking for a way to measure it.
>>
>> Emmanuel Bourg
>>
>> [1] http://www.maxmind.com/download/worldcities/worldcitiespop.txt.gz
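One way to cut down on read() calls in the end-of-line check could look like the sketch below. It is hypothetical and not the actual CSVLexer code: the caller hands over the character it has already read, and the stream is touched again only when that character is CR. The class name EndOfLineSketch and the helper countLineBreaks are invented for the example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

/** Hypothetical sketch, not the actual CSVLexer code. */
final class EndOfLineSketch {
    private final BufferedReader in;

    EndOfLineSketch(BufferedReader in) {
        this.in = in;
    }

    /**
     * Tells whether c, which the caller has already read, ends a line.
     * The stream is only read again when c is CR, to swallow a following LF.
     */
    boolean isEndOfLine(int c) throws IOException {
        if (c == '\r') {
            in.mark(1);
            if (in.read() != '\n') {
                in.reset(); // lone CR: put the character back
            }
            return true;
        }
        return c == '\n';
    }

    /** Demo helper: counts line terminators, treating CR LF as a single one. */
    static int countLineBreaks(String input) {
        BufferedReader reader = new BufferedReader(new StringReader(input));
        EndOfLineSketch lexer = new EndOfLineSketch(reader);
        int breaks = 0;
        try {
            for (int c = reader.read(); c != -1; c = reader.read()) {
                if (lexer.isEndOfLine(c)) {
                    breaks++;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return breaks;
    }

    public static void main(String[] args) {
        System.out.println(countLineBreaks("a,b\r\nc,d\ne,f\r"));
    }
}
```

For ordinary characters this costs a comparison and no extra read at all.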
Re: [csv] Performance comparison
On 12/03/2012 16:44, sebb wrote:
> Java has a PushbackReader class - could that not be used?

I considered it, but it doesn't mix well with line reading. The mark/reset solution is really simple and efficient.

Emmanuel Bourg
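For comparison, this is roughly what a one-character peek with PushbackReader looks like. It is only a sketch; PushbackPeekDemo and its helper are invented names. Note that java.io.PushbackReader has no readLine() method, so line reading would have to be layered on separately, which is the awkward part.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.StringReader;
import java.io.UncheckedIOException;

/** Hypothetical sketch of a peek built on PushbackReader. */
final class PushbackPeekDemo {

    /** Peeks at the next character by reading it and pushing it back. */
    static int peek(PushbackReader in) throws IOException {
        int c = in.read();
        if (c != -1) {
            in.unread(c); // the default pushback buffer holds one character
        }
        return c;
    }

    /** Demo helper: checks that peek() never consumes anything from the input. */
    static boolean peekIsNonConsuming(String input) {
        try (PushbackReader in = new PushbackReader(new StringReader(input))) {
            int peeked;
            do {
                peeked = peek(in);
                if (peeked != in.read()) {
                    return false;
                }
            } while (peeked != -1);
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(peekIsNonConsuming("a,b\r\nc,d"));
    }
}
```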
Re: [csv] Performance comparison
On 12 March 2012 10:31, Emmanuel Bourg wrote:
> I have identified the performance killer, it's the ExtendedBufferedReader. It implements a complex logic to fetch one character ahead, but this extra character is rarely used. I have implemented a simpler look ahead using mark/reset as suggested by Bob Smith in CSV-42 and the performance improved by 30%.

Java has a PushbackReader class - could that not be used?

> Now the parsing is down to 3406 ms, and that's almost without touching the parser yet.
>
> Emmanuel Bourg
>
> On 11/03/2012 15:05, Emmanuel Bourg wrote:
>> Hi,
>>
>> I compared the performance of Commons CSV with the other CSV parsers available. I took the world cities file from Maxmind as a test file [1], it's a big file of 130 MB with 2.8 million records.
>>
>> Here are the results obtained on a Core 2 Duo E8400 after several iterations to let the JIT compiler kick in:
>>
>>     Direct read      750 ms
>>     Java CSV        3328 ms
>>     Super CSV       3562 ms  (+7%)
>>     OpenCSV         3609 ms  (+8.4%)
>>     GenJava CSV     3844 ms  (+15.5%)
>>     Commons CSV     4656 ms  (+39.9%)
>>     Skife CSV       4813 ms  (+44.6%)
>>
>> I also tried Nuiton CSV and Esperio CSV but I couldn't figure out how to use them.
>>
>> I haven't analyzed why Commons CSV is slower yet, but it seems there is room for improvement. The memory usage will have to be compared too, I'm looking for a way to measure it.
>>
>> Emmanuel Bourg
>>
>> [1] http://www.maxmind.com/download/worldcities/worldcitiespop.txt.gz
Re: [csv] Performance comparison
I have identified the performance killer, it's the ExtendedBufferedReader. It implements a complex logic to fetch one character ahead, but this extra character is rarely used. I have implemented a simpler look ahead using mark/reset as suggested by Bob Smith in CSV-42 and the performance improved by 30%.

Now the parsing is down to 3406 ms, and that's almost without touching the parser yet.

Emmanuel Bourg

On 11/03/2012 15:05, Emmanuel Bourg wrote:
> Hi,
>
> I compared the performance of Commons CSV with the other CSV parsers available. I took the world cities file from Maxmind as a test file [1], it's a big file of 130 MB with 2.8 million records.
>
> Here are the results obtained on a Core 2 Duo E8400 after several iterations to let the JIT compiler kick in:
>
>     Direct read      750 ms
>     Java CSV        3328 ms
>     Super CSV       3562 ms  (+7%)
>     OpenCSV         3609 ms  (+8.4%)
>     GenJava CSV     3844 ms  (+15.5%)
>     Commons CSV     4656 ms  (+39.9%)
>     Skife CSV       4813 ms  (+44.6%)
>
> I also tried Nuiton CSV and Esperio CSV but I couldn't figure out how to use them.
>
> I haven't analyzed why Commons CSV is slower yet, but it seems there is room for improvement. The memory usage will have to be compared too, I'm looking for a way to measure it.
>
> Emmanuel Bourg
>
> [1] http://www.maxmind.com/download/worldcities/worldcitiespop.txt.gz
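For readers who have not seen the mark/reset idiom, a one-character look-ahead on top of BufferedReader can be sketched as follows. This is an illustration and not necessarily the code that was committed; the class name LookAheadReader and the demo helper are assumptions.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

/** Hypothetical sketch of a one-character look-ahead built on mark/reset. */
final class LookAheadReader extends BufferedReader {

    LookAheadReader(Reader in) {
        super(in);
    }

    /** Returns the next character without consuming it, or -1 at end of stream. */
    int lookAhead() throws IOException {
        mark(1);        // remember the current position, allow one char of read-ahead
        int c = read();
        reset();        // rewind to the marked position
        return c;
    }

    /** Demo helper: checks that every peeked character equals the one read next. */
    static boolean peekMatchesRead(String input) {
        try (LookAheadReader r = new LookAheadReader(new StringReader(input))) {
            int peeked;
            do {
                peeked = r.lookAhead();
                if (peeked != r.read()) {
                    return false;
                }
            } while (peeked != -1);
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(peekMatchesRead("a,b\r\nc,d"));
    }
}
```

Because the class still extends BufferedReader, readLine() stays available, and the cost of the look-ahead is only paid when it is actually requested.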
Re: [csv] Performance comparison
On 12/03/2012 00:02, Benedikt Ritter wrote:
> I've started to dig my way through the source. I've not done too much performance measuring in my career yet. I would use VisualVM for profiling, if you don't know anything better.

Usually I work with JProfiler, it identifies the "hotspots" pretty well, but I'm not sure if it will produce relevant results on the complex methods of CSVLexer.

> And how about some performance junit tests? They may not be as accurate as a profiler, but they can give you a feeling, whether you are on the right way.

I wrote a quick test locally, but that's not clean enough to be committed. It looks like this:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.io.Reader;

    import junit.framework.TestCase;

    import org.apache.commons.csv.CSVFormat;

    public class PerformanceTest extends TestCase {

        // number of iterations, so that the JIT compiler has time to kick in
        private int max = 10;

        private BufferedReader getReader() throws IOException {
            return new BufferedReader(new FileReader("worldcitiespop.txt"));
        }

        // baseline: read the file line by line without parsing
        public void testReadBigFile() throws Exception {
            for (int i = 0; i < max; i++) {
                BufferedReader in = getReader();
                long t0 = System.currentTimeMillis();
                int count = readAll(in);
                in.close();
                System.out.println("File read in " + (System.currentTimeMillis() - t0) + "ms" + " " + count + " lines");
            }
            System.out.println();
        }

        private int readAll(BufferedReader in) throws IOException {
            int count = 0;
            while (in.readLine() != null) {
                count++;
            }
            return count;
        }

        public void testParseBigFile() throws Exception {
            for (int i = 0; i < max; i++) {
                long t0 = System.currentTimeMillis();
                int count = parseCommonsCSV(getReader());
                System.out.println("File parsed in " + (System.currentTimeMillis() - t0) + "ms with Commons CSV" + " " + count + " lines");
            }
            System.out.println();
        }

        private int parseCommonsCSV(Reader in) {
            CSVFormat format = CSVFormat.DEFAULT.withSurroundingSpacesIgnored(false);

            int count = 0;
            for (String[] record : format.parse(in)) {
                count++;
            }

            return count;
        }
    }

Emmanuel Bourg
Re: [csv] Performance comparison
On 11 March 2012 at 21:21, Emmanuel Bourg wrote:
> On 11/03/2012 16:53, Benedikt Ritter wrote:
>> I have some spare time to help you with this. I'll check out the latest source tonight. Any suggestion where to start?
>
> Hi Benedikt, thank you for helping. You can start by looking at the source of CSVParser to see if anything catches your eye, and then run a profiler to try and identify the performance critical parts that could be improved.

Hi Emmanuel,

I've started to dig my way through the source. I've not done too much performance measuring in my career yet. I would use VisualVM for profiling, if you don't know anything better.

And how about some performance junit tests? They may not be as accurate as a profiler, but they can give you a feeling, whether you are on the right way.

Benedikt

> Emmanuel Bourg
Re: [csv] Performance comparison
On 11/03/2012 16:53, Benedikt Ritter wrote:
> I have some spare time to help you with this. I'll check out the latest source tonight. Any suggestion where to start?

Hi Benedikt, thank you for helping. You can start by looking at the source of CSVParser to see if anything catches your eye, and then run a profiler to try and identify the performance critical parts that could be improved.

Emmanuel Bourg
Re: [csv] Performance comparison
On 11 March 2012 at 15:05, Emmanuel Bourg wrote:
> Hi,
>
> I compared the performance of Commons CSV with the other CSV parsers available. I took the world cities file from Maxmind as a test file [1], it's a big file of 130 MB with 2.8 million records.
>
> Here are the results obtained on a Core 2 Duo E8400 after several iterations to let the JIT compiler kick in:
>
>     Direct read      750 ms
>     Java CSV        3328 ms
>     Super CSV       3562 ms  (+7%)
>     OpenCSV         3609 ms  (+8.4%)
>     GenJava CSV     3844 ms  (+15.5%)
>     Commons CSV     4656 ms  (+39.9%)
>     Skife CSV       4813 ms  (+44.6%)
>
> I also tried Nuiton CSV and Esperio CSV but I couldn't figure out how to use them.
>
> I haven't analyzed why Commons CSV is slower yet, but it seems there is room for improvement. The memory usage will have to be compared too, I'm looking for a way to measure it.

Hey Emmanuel,

I have some spare time to help you with this. I'll check out the latest source tonight. Any suggestion where to start?

Regards,
Benedikt

> Emmanuel Bourg
>
> [1] http://www.maxmind.com/download/worldcities/worldcitiespop.txt.gz
[csv] Performance comparison
Hi,

I compared the performance of Commons CSV with the other CSV parsers available. I took the world cities file from Maxmind as a test file [1], it's a big file of 130 MB with 2.8 million records.

Here are the results obtained on a Core 2 Duo E8400 after several iterations to let the JIT compiler kick in:

    Direct read      750 ms
    Java CSV        3328 ms
    Super CSV       3562 ms  (+7%)
    OpenCSV         3609 ms  (+8.4%)
    GenJava CSV     3844 ms  (+15.5%)
    Commons CSV     4656 ms  (+39.9%)
    Skife CSV       4813 ms  (+44.6%)

I also tried Nuiton CSV and Esperio CSV but I couldn't figure out how to use them.

I haven't analyzed why Commons CSV is slower yet, but it seems there is room for improvement. The memory usage will have to be compared too, I'm looking for a way to measure it.

Emmanuel Bourg

[1] http://www.maxmind.com/download/worldcities/worldcitiespop.txt.gz
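The percentages in the table appear to be relative to the fastest parser (Java CSV, 3328 ms). A small sketch of that arithmetic, with the timings copied from the table and the class and method names invented for the example:

```java
/** Hypothetical helper reproducing the relative slowdowns quoted in the table. */
final class RelativeTimings {

    /** Slowdown of ms relative to baselineMs, in percent, rounded to one decimal. */
    static double percentSlower(long baselineMs, long ms) {
        double percent = 100.0 * (ms - baselineMs) / baselineMs;
        return Math.round(percent * 10) / 10.0;
    }

    public static void main(String[] args) {
        long baseline = 3328; // Java CSV, the fastest parser in the table
        String[] names = {"Super CSV", "OpenCSV", "GenJava CSV", "Commons CSV", "Skife CSV"};
        long[] timings = {3562, 3609, 3844, 4656, 4813};
        for (int i = 0; i < names.length; i++) {
            System.out.println(names[i] + " +" + percentSlower(baseline, timings[i]) + "%");
        }
    }
}
```

For Commons CSV this gives (4656 - 3328) / 3328, or about 39.9%, which matches the figure quoted above.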