Re: [CSV] Performance

2012-03-15 Thread Ted Dunning
On Thu, Mar 15, 2012 at 3:11 PM, Emmanuel Bourg wrote: > ... > 1. Buffer the data in the BufferedReader > 2. Accumulate data in a reusable buffer for the current token > Reusable buffers are usually death in terms of subtle bugs and they rarely actually help that much. The key is to avoid copyi

Re: [CSV] Performance

2012-03-15 Thread Ted Dunning
See the Line and FastLine classes in org.apache.mahout.classifier.sgd.SimpleCsvExamples in the Mahout Examples module. You can see an older version of mahout here. This class hasn't changed in forever. https://github.com/tdunning/mahout/blob/debian-package/examples/src/main/java/org/apache/mahou

Re: [CSV] Performance

2012-03-15 Thread Emmanuel Bourg
On 15/03/2012 18:06, sebb wrote: If I revert it, I still get a better time with Lexer2, though not quite as good an improvement. I ran my perf test; Lexer2 is slower on my system :( The order of the if doesn't change much here. Emmanuel Bourg

Re: [CSV] Performance

2012-03-15 Thread sebb
On 15 March 2012 16:48, Emmanuel Bourg wrote: > Le 15/03/2012 17:42, sebb a écrit : > > >> I also used local vars in the other methods. >> >> See http://people.apache.org/~sebb/CSV/ > > > Thank you. You also reordered the if in simpleTokenLexer, that may explain > the difference. I'll give it a tr

Re: [CSV] Performance

2012-03-15 Thread Emmanuel Bourg
On 15/03/2012 17:42, sebb wrote: I also used local vars in the other methods. See http://people.apache.org/~sebb/CSV/ Thank you. You also reordered the if in simpleTokenLexer; that may explain the difference. I'll give it a try. Emmanuel Bourg

Re: [CSV] Performance

2012-03-15 Thread sebb
On 15 March 2012 16:10, Emmanuel Bourg wrote: > Le 15/03/2012 16:45, sebb a écrit : > > >> So I then tried hauling the format method calls out of the loops into >> final local variables. >> This improves performance (slightly) in both client and server mode. > > > Could you show some code please?

Re: [CSV] Performance

2012-03-15 Thread Emmanuel Bourg
Le 15/03/2012 16:45, sebb a écrit : So I then tried hauling the format method calls out of the loops into final local variables. This improves performance (slightly) in both client and server mode. Could you show some code please? I'm unable to reproduce this. I used local variables in simple

Re: [CSV] Performance

2012-03-15 Thread sebb
On 15 March 2012 13:17, Emmanuel Bourg wrote: > Le 15/03/2012 14:13, sebb a écrit : > > >> Eclipse, so probably client VM? > > > Probably. You can print the java.vm.name system property at the beginning of > the test, that will tell you the VM used. It was client. I've now tried with server, and

Re: [CSV] Performance

2012-03-15 Thread Emmanuel Bourg
Thank you for sharing your experience, Ted. Do you have a link to the code of your parser? I'd like to take a look. Currently the data flow in Commons CSV is: 1. Buffer the data in the BufferedReader 2. Accumulate data in a reusable buffer for the current token 3. Turn the token buffer into a Str

Re: [CSV] Performance

2012-03-15 Thread Ted Dunning
I built a limited CSV package for parsing data in Mahout at one point. I doubt that it was general enough to be helpful here, but the experience might be. The thing that *really* made a big difference in speed was to avoid copies and conversions to String. To do that, I built a state machine tha
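Editor's sketch: one way to avoid copies and conversions to String is to record where each field starts in the character buffer and parse values straight from that buffer. The class below is hypothetical and is not the Mahout code. It assumes unquoted fields and non-negative decimal integers, and it does no validation.

```java
// Hypothetical sketch; this is NOT the Mahout Line/FastLine code.
// Assumes no quoting or escaping and only non-negative decimal integers.
final class FieldScanner {
    private final char[] buf;
    private final int[] starts; // starts[i] = offset of field i; starts[count] = sentinel
    private int count;

    FieldScanner(char[] line, char delimiter) {
        this.buf = line;
        // A line of n chars holds at most n + 1 fields, plus one sentinel slot.
        this.starts = new int[line.length + 2];
        starts[0] = 0;
        count = 1;
        for (int i = 0; i < line.length; i++) {
            if (line[i] == delimiter) {
                starts[count++] = i + 1;
            }
        }
        starts[count] = line.length + 1; // as if a delimiter followed the last field
    }

    int fieldCount() {
        return count;
    }

    /** Parses field i as an int directly from the buffer, without creating a String. */
    int getInt(int i) {
        int end = starts[i + 1] - 1; // exclusive end: skip the delimiter position
        int value = 0;
        for (int p = starts[i]; p < end; p++) {
            value = value * 10 + (buf[p] - '0');
        }
        return value;
    }
}
```

A caller would create one scanner per line (or reuse the offsets array in a real implementation) and ask only for the fields it needs, so untouched fields are never materialized.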

Re: [CSV] Performance

2012-03-15 Thread Emmanuel Bourg
On 15/03/2012 14:13, sebb wrote: Eclipse, so probably client VM? Probably. You can print the java.vm.name system property at the beginning of the test; that will tell you the VM used. Emmanuel Bourg
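Editor's sketch: printing the property could look like the following. The class name VmInfo is hypothetical and is not part of the Commons CSV test code. Both java.vm.name and java.vm.version are standard system properties.

```java
// Hypothetical helper, not part of the Commons CSV performance test.
final class VmInfo {
    /** Returns a one-line description of the running JVM. */
    static String describe() {
        return System.getProperty("java.vm.name")
                + " (" + System.getProperty("java.vm.version") + ")";
    }

    public static void main(String[] args) {
        // Print once before the timed section so each result can be tied to a VM.
        System.out.println("VM: " + describe());
    }
}
```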

Re: [CSV] Performance

2012-03-15 Thread sebb
On 15 March 2012 12:43, Emmanuel Bourg wrote: > Le 15/03/2012 13:34, sebb a écrit : > >> In my testing, using final class variables for delimiter, escape etc >> (set in ctor) shaves about 1 sec off the time to read the world town >> data file compared with accessing these fields inline through the

Re: [CSV] Performance

2012-03-15 Thread Benedikt Ritter
On 15 March 2012 13:50, Gary Gregory wrote: > Can you put your perf test code and resources in SVN so I do not have to > write one, please? > Hi Gary, have a look at http://markmail.org/message/x73i3hl63rjqdyfa (I agree with you that having a clean performance test in SVN would be better) Regar

Re: [CSV] Performance

2012-03-15 Thread Emmanuel Bourg
Le 15/03/2012 13:34, sebb a écrit : In my testing, using final class variables for delimiter, escape etc (set in ctor) shaves about 1 sec off the time to read the world town data file compared with accessing these fields inline through the format field. Average time goes from c. 25.5 to c. 24.5

[CSV] Performance

2012-03-15 Thread sebb
In my testing, using final class variables for delimiter, escape etc (set in ctor) shaves about 1 sec off the time to read the world town data file compared with accessing these fields inline through the format field. Average time goes from c. 25.5 to c. 24.5 which is a 4% improvement. I suspect
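Editor's sketch: the two access patterns being compared might look like this. All class names are hypothetical and this is not the Commons CSV source. Whether the cached version is measurably faster depends on the JIT and the VM mode, and results reported in this thread vary from machine to machine.

```java
// All names are hypothetical; this is NOT the Commons CSV source.
final class Format {
    private final char delimiter;

    Format(char delimiter) {
        this.delimiter = delimiter;
    }

    char getDelimiter() {
        return delimiter;
    }
}

// Variant 1: goes through the format field for every character.
final class InlineLexer {
    private final Format format;

    InlineLexer(Format format) {
        this.format = format;
    }

    int countDelimiters(char[] data) {
        int n = 0;
        for (char c : data) {
            if (c == format.getDelimiter()) {
                n++;
            }
        }
        return n;
    }
}

// Variant 2: copies the value into a final field once, in the constructor.
final class CachedLexer {
    private final char delimiter;

    CachedLexer(Format format) {
        this.delimiter = format.getDelimiter();
    }

    int countDelimiters(char[] data) {
        int n = 0;
        for (char c : data) {
            if (c == delimiter) {
                n++;
            }
        }
        return n;
    }
}
```

Both variants compute the same result; only the path to the delimiter value differs, which is what the timing comparison is about.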

Re: [csv] Performance comparison

2012-03-14 Thread Christian Grobmeier
On Tue, Mar 13, 2012 at 4:33 AM, Ralph Goers wrote: >> I don't think we should be trying to recode JDK classes. > > If the implementations suck, why not? +1 -- http://www.grobmeier.de https://www.timeandbill.de

Re: [csv] Performance comparison

2012-03-14 Thread Emmanuel Bourg
After more experiments I'm less enthusiastic about providing an optimized BufferedReader. The result of the performance test is significantly different if the test is run alone or after all the other unit tests (about 30% slower). When all the tests are executed, the removal of the synchronized

Re: [csv] Performance comparison

2012-03-13 Thread Emmanuel Bourg
On 13/03/2012 02:47, Niall Pemberton wrote: IMO performance should be taken out of the equation by using the Readable interface[1]. That way the users can use whatever implementation suits them (for example using an underlying buffered InputStream) to change/improve performance. If you mean
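Editor's sketch: consuming input through java.lang.Readable, as proposed in the quoted text, could look roughly like this. The class ReadableDemo and its buffer size are assumptions for illustration; this is not a proposed Commons CSV API. Any Reader can be passed in, because Reader implements Readable.

```java
import java.io.IOException;
import java.nio.CharBuffer;

// Hypothetical example; not a proposed Commons CSV API.
final class ReadableDemo {
    /** Drains the Readable and counts occurrences of the target character. */
    static int count(Readable in, char target) throws IOException {
        CharBuffer buf = CharBuffer.allocate(8192); // buffer size is an arbitrary choice
        int n = 0;
        while (in.read(buf) != -1) {   // read(CharBuffer) returns -1 at end of input
            buf.flip();                // switch the buffer from filling to draining
            while (buf.hasRemaining()) {
                if (buf.get() == target) {
                    n++;
                }
            }
            buf.clear();               // make the whole buffer writable again
        }
        return n;
    }
}
```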

Re: [csv] Performance comparison

2012-03-13 Thread sebb
On 13 March 2012 09:01, Emmanuel Bourg wrote: > Le 13/03/2012 01:44, sebb a écrit : > > >> I don't think we should be trying to recode JDK classes. > > > I'd rather not, but we have done that in the past. FastDateFormat and > StrBuilder come to mind. And now Java has StringBuilder, which means St

Re: [csv] Performance comparison

2012-03-13 Thread Emmanuel Bourg
On 13/03/2012 01:44, sebb wrote: I don't think we should be trying to recode JDK classes. I'd rather not, but we have done that in the past. FastDateFormat and StrBuilder come to mind. Emmanuel Bourg

Re: [csv] Performance comparison

2012-03-12 Thread Ralph Goers
On Mar 12, 2012, at 5:44 PM, sebb wrote: > On 13 March 2012 00:29, Emmanuel Bourg wrote: >> Le 13/03/2012 01:25, sebb a écrit : >> >> >>> I'm concerned that the CSV code may grow and grow with private >>> versions of code that could be provided by the JDK. >>> >>> By all means make sure the c

Re: [csv] Performance comparison

2012-03-12 Thread sebb
On 13 March 2012 01:47, Niall Pemberton wrote: > On Tue, Mar 13, 2012 at 12:29 AM, Emmanuel Bourg wrote: >> Le 13/03/2012 01:25, sebb a écrit : >> >> >>> I'm concerned that the CSV code may grow and grow with private >>> versions of code that could be provided by the JDK. >>> >>> By all means mak

Re: [csv] Performance comparison

2012-03-12 Thread Niall Pemberton
On Tue, Mar 13, 2012 at 12:29 AM, Emmanuel Bourg wrote: > Le 13/03/2012 01:25, sebb a écrit : > > >> I'm concerned that the CSV code may grow and grow with private >> versions of code that could be provided by the JDK. >> >> By all means make sure the code is efficient in the way it uses the >> JD

Re: [csv] Performance comparison

2012-03-12 Thread sebb
On 13 March 2012 00:29, Emmanuel Bourg wrote: > Le 13/03/2012 01:25, sebb a écrit : > > >> I'm concerned that the CSV code may grow and grow with private >> versions of code that could be provided by the JDK. >> >> By all means make sure the code is efficient in the way it uses the >> JDK classes,

Re: [csv] Performance comparison

2012-03-12 Thread Gary Gregory
On Mar 12, 2012, at 20:30, Emmanuel Bourg wrote: > Le 13/03/2012 01:25, sebb a écrit : > >> I'm concerned that the CSV code may grow and grow with private >> versions of code that could be provided by the JDK. >> >> By all means make sure the code is efficient in the way it uses the >> JDK classe

Re: [csv] Performance comparison

2012-03-12 Thread Gary Gregory
On Mar 12, 2012, at 20:25, sebb wrote: > On 13 March 2012 00:12, Emmanuel Bourg wrote: >> I kept tickling ExtendedBufferedReader and I have some interesting results. >> >> First I tried to simplify it by extending java.io.LineNumberReader instead >> of BufferedReader. The performance decreased b

Re: [csv] Performance comparison

2012-03-12 Thread Emmanuel Bourg
Le 13/03/2012 01:25, sebb a écrit : I'm concerned that the CSV code may grow and grow with private versions of code that could be provided by the JDK. By all means make sure the code is efficient in the way it uses the JDK classes, but I don't think we should be recoding standard classes. I a

Re: [csv] Performance comparison

2012-03-12 Thread sebb
On 13 March 2012 00:12, Emmanuel Bourg wrote: > I kept tickling ExtendedBufferedReader and I have some interesting results. > > First I tried to simplify it by extending java.io.LineNumberReader instead > of BufferedReader. The performance decreased by 20%, probably because the > class is synchron

Re: [csv] Performance comparison

2012-03-12 Thread Emmanuel Bourg
I kept tickling ExtendedBufferedReader and I have some interesting results. First I tried to simplify it by extending java.io.LineNumberReader instead of BufferedReader. The performance decreased by 20%, probably because the class is synchronized internally. But wait, isn't BufferedReader als

Re: [csv] Performance comparison

2012-03-12 Thread James Carman
Yes, this is what I mean. It might be worth a shot. Folks who specialize in parsing have spent much time on these libraries. It would make sense that they are quite fast. It gets us out of the parsing business. On Mar 12, 2012 12:41 PM, "Emmanuel Bourg" wrote: > On 12/03/2012 17:28, James Carm

Re: [csv] Performance comparison

2012-03-12 Thread Christian Grobmeier
On Mon, Mar 12, 2012 at 5:41 PM, Emmanuel Bourg wrote: > On 12/03/2012 17:28, James Carman wrote: > >> Would one of the parser libraries not work here? > > > Are you thinking of something like JavaCC or ANTLR? Not sure it'll be more > efficient than a handcrafted parser. The CSV format is simple enoug

Re: [csv] Performance comparison

2012-03-12 Thread Emmanuel Bourg
On 12/03/2012 17:28, James Carman wrote: Would one of the parser libraries not work here? Are you thinking of something like JavaCC or ANTLR? Not sure it'll be more efficient than a handcrafted parser. The CSV format is simple enough to do it manually. Emmanuel Bourg

Re: [csv] Performance comparison

2012-03-12 Thread Benedikt Ritter
On 12 March 2012 17:22, Emmanuel Bourg wrote: > On 12/03/2012 17:03, Benedikt Ritter wrote: > > >> The whole logic behind CSVLexer.nextToken() is very hard to read >> (IMHO). Maybe some refactoring would help to make it easier to >> identify bottlenecks? > > > Yes, I started investigating in

Re: [csv] Performance comparison

2012-03-12 Thread James Carman
Would one of the parser libraries not work here? On Mar 12, 2012 12:22 PM, "Emmanuel Bourg" wrote: > On 12/03/2012 17:03, Benedikt Ritter wrote: > > The whole logic behind CSVLexer.nextToken() is very hard to read >> (IMHO). Maybe some refactoring would help to make it easier to >> identify

Re: [csv] Performance comparison

2012-03-12 Thread Emmanuel Bourg
On 12/03/2012 17:03, Benedikt Ritter wrote: The whole logic behind CSVLexer.nextToken() is very hard to read (IMHO). Maybe some refactoring would help to make it easier to identify bottlenecks? Yes, I started investigating in this direction. I filed a few bugs regarding the behavior of th

Re: [csv] Performance comparison

2012-03-12 Thread Benedikt Ritter
Am 12. März 2012 11:31 schrieb Emmanuel Bourg : > I have identified the performance killer, it's the ExtendedBufferedReader. > It implements a complex logic to fetch one character ahead, but this extra > character is rarely used. I have implemented a simpler look ahead using > mark/reset as suggest

Re: [csv] Performance comparison

2012-03-12 Thread Emmanuel Bourg
On 12/03/2012 16:44, sebb wrote: Java has a PushbackReader class - could that not be used? I considered it, but it doesn't mix well with line reading. The mark/reset solution is really simple and efficient. Emmanuel Bourg
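Editor's sketch: for comparison, a one-character peek built on java.io.PushbackReader might look like this. The helper class PushbackPeek is hypothetical. Note that PushbackReader has no readLine() method, which is consistent with the remark above about line reading.

```java
import java.io.IOException;
import java.io.PushbackReader;

// Hypothetical helper for illustration only.
final class PushbackPeek {
    /** Returns the next character without consuming it, or -1 at end of stream. */
    static int peek(PushbackReader in) throws IOException {
        int c = in.read();
        if (c != -1) {
            in.unread(c); // the default pushback buffer holds exactly one character
        }
        return c;
    }
}
```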

Re: [csv] Performance comparison

2012-03-12 Thread sebb
On 12 March 2012 10:31, Emmanuel Bourg wrote: > I have identified the performance killer, it's the ExtendedBufferedReader. > It implements a complex logic to fetch one character ahead, but this extra > character is rarely used. I have implemented a simpler look ahead using > mark/reset as suggeste

Re: [csv] Performance comparison

2012-03-12 Thread Emmanuel Bourg
I have identified the performance killer, it's the ExtendedBufferedReader. It implements a complex logic to fetch one character ahead, but this extra character is rarely used. I have implemented a simpler look ahead using mark/reset as suggested by Bob Smith in CSV-42 and the performance improv
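Editor's sketch: a one-character lookahead built on mark/reset could look roughly like the class below. The name LookAheadReader is hypothetical. This is neither the actual ExtendedBufferedReader nor the change attached to CSV-42, only an illustration of the technique using the standard BufferedReader methods.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;

// Hypothetical class; a sketch only, not the Commons CSV implementation.
final class LookAheadReader extends BufferedReader {
    LookAheadReader(Reader in) {
        super(in);
    }

    /** Returns the next character without consuming it, or -1 at end of stream. */
    int lookAhead() throws IOException {
        mark(1);        // remember the position; one char of read-ahead is enough
        int c = read(); // read one character (or -1 at end of stream)
        reset();        // rewind to the marked position
        return c;
    }
}
```

Because the class still extends BufferedReader, readLine() remains available alongside the lookahead.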

Re: [csv] Performance comparison

2012-03-11 Thread Emmanuel Bourg
On 12/03/2012 00:02, Benedikt Ritter wrote: I've started to dig my way through the source. I've not done too much performance measuring in my career yet. I would use VisualVM for profiling, if you don't know anything better. Usually I work with JProfiler; it identifies the "hotspots" pretty

Re: [csv] Performance comparison

2012-03-11 Thread Benedikt Ritter
Am 11. März 2012 21:21 schrieb Emmanuel Bourg : > Le 11/03/2012 16:53, Benedikt Ritter a écrit : > > >> I have some spare time to help you with this. I'll check out the >> latest source tonight. Any suggestion where to start? > > > Hi Benedikt, thank you for helping. You can start looking at the so

Re: [csv] Performance comparison

2012-03-11 Thread Emmanuel Bourg
On 11/03/2012 16:53, Benedikt Ritter wrote: I have some spare time to help you with this. I'll check out the latest source tonight. Any suggestion where to start? Hi Benedikt, thank you for helping. You can start by looking at the source of CSVParser to see if anything catches your eye, and then run

Re: [csv] Performance comparison

2012-03-11 Thread Benedikt Ritter
Am 11. März 2012 15:05 schrieb Emmanuel Bourg : > Hi, > > I compared the performance of Commons CSV with the other CSV parsers > available. I took the world cities file from Maxmind as a test file [1], > it's a big file of 130M with 2.8 million records. > > Here are the results obtained on a Core 2

[csv] Performance comparison

2012-03-11 Thread Emmanuel Bourg
Hi, I compared the performance of Commons CSV with the other CSV parsers available. I took the world cities file from Maxmind as a test file [1], it's a big file of 130M with 2.8 million records. Here are the results obtained on a Core 2 Duo E8400 after several iterations to let the JIT comp