On Thu, Mar 15, 2012 at 3:11 PM, Emmanuel Bourg wrote:
> ...
> 1. Buffer the data in the BufferedReader
> 2. Accumulate data in a reusable buffer for the current token
>
Reusable buffers are usually death in terms of subtle bugs and they rarely
actually help that much. The key is to avoid copyi
See the Line and FastLine classes
in org.apache.mahout.classifier.sgd.SimpleCsvExamples in the Mahout
Examples module.
You can see an older version of mahout here. This class hasn't changed in
forever.
https://github.com/tdunning/mahout/blob/debian-package/examples/src/main/java/org/apache/mahou
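To illustrate the point about avoiding copies and String conversions, here is a minimal sketch. It is not the Mahout Line or FastLine code; the class and method names are hypothetical, and it assumes unquoted fields holding non-negative decimal integers. The idea is to record field offsets in the existing character buffer and parse values in place instead of creating a String per field.

```java
import java.util.Arrays;

public class FieldOffsets {

    /** Returns the start offset of every field in buf[0, len), split on delimiter. */
    static int[] fieldStarts(char[] buf, int len, char delimiter) {
        int[] starts = new int[8];
        int n = 0;
        starts[n++] = 0; // the first field always starts at offset 0
        for (int i = 0; i < len; i++) {
            if (buf[i] == delimiter) {
                if (n == starts.length) {
                    starts = Arrays.copyOf(starts, n * 2); // grow the offset table
                }
                starts[n++] = i + 1; // next field starts after the delimiter
            }
        }
        return Arrays.copyOf(starts, n);
    }

    /** Parses a non-negative decimal integer from buf[from, to) without building a String. */
    static int parseInt(char[] buf, int from, int to) {
        int value = 0;
        for (int i = from; i < to; i++) {
            value = value * 10 + (buf[i] - '0');
        }
        return value;
    }
}
```

The only allocation per line is the small offset array, and even that could be reused by a caller that accepts the risks of reusable buffers mentioned above.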
On 15/03/2012 18:06, sebb wrote:
If I revert it, I still get a better time with Lexer2, though not
quite as good an improvement.
I ran my perf test, and Lexer2 is slower on my system :( The order of the
if statements doesn't change much here.
Emmanuel Bourg
On 15 March 2012 16:48, Emmanuel Bourg wrote:
> On 15/03/2012 17:42, sebb wrote:
>
>
>> I also used local vars in the other methods.
>>
>> See http://people.apache.org/~sebb/CSV/
>
>
> Thank you. You also reordered the if in simpleTokenLexer, that may explain
> the difference. I'll give it a tr
On 15/03/2012 17:42, sebb wrote:
I also used local vars in the other methods.
See http://people.apache.org/~sebb/CSV/
Thank you. You also reordered the if in simpleTokenLexer, that may
explain the difference. I'll give it a try.
Emmanuel Bourg
On 15 March 2012 16:10, Emmanuel Bourg wrote:
> On 15/03/2012 16:45, sebb wrote:
>
>
>> So I then tried hauling the format method calls out of the loops into
>> final local variables.
>> This improves performance (slightly) in both client and server mode.
>
>
> Could you show some code please?
On 15/03/2012 16:45, sebb wrote:
So I then tried hauling the format method calls out of the loops into
final local variables.
This improves performance (slightly) in both client and server mode.
Could you show some code please? I'm unable to reproduce this. I used
local variables in simple
On 15 March 2012 13:17, Emmanuel Bourg wrote:
> On 15/03/2012 14:13, sebb wrote:
>
>
>> Eclipse, so probably client VM?
>
>
> Probably. You can print the java.vm.name system property at the beginning of
> the test, that will tell you the VM used.
It was client.
I've now tried with server, and
Thank you for sharing your experience, Ted. Do you have a link to the
code of your parser? I'd like to take a look.
Currently the data flow in Commons CSV is:
1. Buffer the data in the BufferedReader
2. Accumulate data in a reusable buffer for the current token
3. Turn the token buffer into a Str
I built a limited CSV package for parsing data in Mahout at one point. I
doubt that it was general enough to be helpful here, but the experience
might be.
The thing that *really* made a big difference in speed was to avoid copies
and conversions to String. To do that, I built a state machine tha
On 15/03/2012 14:13, sebb wrote:
Eclipse, so probably client VM?
Probably. You can print the java.vm.name system property at the
beginning of the test, that will tell you the VM used.
Emmanuel Bourg
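A minimal sketch of that check, assuming it is dropped at the start of the benchmark (the class name VmInfo is hypothetical):

```java
public class VmInfo {

    /** Returns the name of the running JVM as reported by the java.vm.name system property. */
    static String vmName() {
        return System.getProperty("java.vm.name");
    }

    public static void main(String[] args) {
        // Print the VM name once before the timed section so each result can be
        // attributed to the client or the server VM.
        System.out.println("java.vm.name = " + vmName());
    }
}
```

On HotSpot builds of that era the reported name typically contains "Client" or "Server"; the exact string varies by vendor and version.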
On 15 March 2012 12:43, Emmanuel Bourg wrote:
> On 15/03/2012 13:34, sebb wrote:
>
>> In my testing, using final class variables for delimiter, escape etc
>> (set in ctor) shaves about 1 sec off the time to read the world town
>> data file compared with accessing these fields inline through the
On 15 March 2012 13:50, Gary Gregory wrote:
> Can you put your perf test code and resources in SVN so I do not have to
> write one, please?
>
Hi Gary,
have a look at http://markmail.org/message/x73i3hl63rjqdyfa (I agree
with you, that having a clean performance test in SVN would be better)
Regar
On 15/03/2012 13:34, sebb wrote:
In my testing, using final class variables for delimiter, escape etc
(set in ctor) shaves about 1 sec off the time to read the world town
data file compared with accessing these fields inline through the
format field.
Average time goes from c. 25.5 to c. 24.5
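The change described here can be sketched as follows. This is an illustration only, not the Commons CSV source: Format, Lexer, and the counting methods are hypothetical stand-ins. The lexer copies the delimiter out of the format object once, in the constructor, instead of calling through the format field on every character.

```java
public class HoistingSketch {

    /** Hypothetical stand-in for the CSV format object. */
    static final class Format {
        private final char delimiter;

        Format(char delimiter) {
            this.delimiter = delimiter;
        }

        char getDelimiter() {
            return delimiter;
        }
    }

    /** Hypothetical lexer showing both access styles side by side. */
    static final class Lexer {
        private final Format format;
        private final char delimiter; // cached once in the constructor

        Lexer(Format format) {
            this.format = format;
            this.delimiter = format.getDelimiter();
        }

        /** Looks the delimiter up through the format field on every iteration. */
        int countDelimitersViaFormat(CharSequence line) {
            int count = 0;
            for (int i = 0; i < line.length(); i++) {
                if (line.charAt(i) == format.getDelimiter()) {
                    count++;
                }
            }
            return count;
        }

        /** Uses the final field that was set in the constructor. */
        int countDelimitersCached(CharSequence line) {
            int count = 0;
            for (int i = 0; i < line.length(); i++) {
                if (line.charAt(i) == delimiter) {
                    count++;
                }
            }
            return count;
        }
    }
}
```

Both methods return the same result; whether the cached version is measurably faster depends on the VM and on how well the JIT inlines the getter, which is consistent with the mixed results reported in this thread.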
In my testing, using final class variables for delimiter, escape etc
(set in ctor) shaves about 1 sec off the time to read the world town
data file compared with accessing these fields inline through the
format field.
Average time goes from c. 25.5 to c. 24.5 which is a 4% improvement.
I suspect
On Tue, Mar 13, 2012 at 4:33 AM, Ralph Goers wrote:
>> I don't think we should be trying to recode JDK classes.
>
> If the implementations suck, why not?
+1
--
http://www.grobmeier.de
https://www.timeandbill.de
After more experiments I'm less enthusiastic about providing an
optimized BufferedReader. The result of the performance test is
significantly different if the test is run alone or after all the other
unit tests (about 30% slower). When all the tests are executed, the
removal of the synchronized
On 13/03/2012 02:47, Niall Pemberton wrote:
IMO performance should be taken out of the equation by using the
Readable interface[1]. That way the users can use whatever
implementation suits them (for example using an underlying buffered
InputStream) to change/improve performance.
If you mean
On 13 March 2012 09:01, Emmanuel Bourg wrote:
> On 13/03/2012 01:44, sebb wrote:
>
>
>> I don't think we should be trying to recode JDK classes.
>
>
> I'd rather not, but we have done that in the past. FastDateFormat and
> StrBuilder come to mind.
And now Java has StringBuilder, which means St
On 13/03/2012 01:44, sebb wrote:
I don't think we should be trying to recode JDK classes.
I'd rather not, but we have done that in the past. FastDateFormat and
StrBuilder come to mind.
Emmanuel Bourg
On Mar 12, 2012, at 5:44 PM, sebb wrote:
> On 13 March 2012 00:29, Emmanuel Bourg wrote:
>> On 13/03/2012 01:25, sebb wrote:
>>
>>
>>> I'm concerned that the CSV code may grow and grow with private
>>> versions of code that could be provided by the JDK.
>>>
>>> By all means make sure the c
On 13 March 2012 01:47, Niall Pemberton wrote:
> On Tue, Mar 13, 2012 at 12:29 AM, Emmanuel Bourg wrote:
>> On 13/03/2012 01:25, sebb wrote:
>>
>>
>>> I'm concerned that the CSV code may grow and grow with private
>>> versions of code that could be provided by the JDK.
>>>
>>> By all means mak
On Tue, Mar 13, 2012 at 12:29 AM, Emmanuel Bourg wrote:
> On 13/03/2012 01:25, sebb wrote:
>
>
>> I'm concerned that the CSV code may grow and grow with private
>> versions of code that could be provided by the JDK.
>>
>> By all means make sure the code is efficient in the way it uses the
>> JD
On 13 March 2012 00:29, Emmanuel Bourg wrote:
> On 13/03/2012 01:25, sebb wrote:
>
>
>> I'm concerned that the CSV code may grow and grow with private
>> versions of code that could be provided by the JDK.
>>
>> By all means make sure the code is efficient in the way it uses the
>> JDK classes,
On Mar 12, 2012, at 20:30, Emmanuel Bourg wrote:
> On 13/03/2012 01:25, sebb wrote:
>
>> I'm concerned that the CSV code may grow and grow with private
>> versions of code that could be provided by the JDK.
>>
>> By all means make sure the code is efficient in the way it uses the
>> JDK classe
On Mar 12, 2012, at 20:25, sebb wrote:
> On 13 March 2012 00:12, Emmanuel Bourg wrote:
>> I kept tickling ExtendedBufferedReader and I have some interesting results.
>>
>> First I tried to simplify it by extending java.io.LineNumberReader instead
>> of BufferedReader. The performance decreased b
On 13/03/2012 01:25, sebb wrote:
I'm concerned that the CSV code may grow and grow with private
versions of code that could be provided by the JDK.
By all means make sure the code is efficient in the way it uses the
JDK classes, but I don't think we should be recoding standard classes.
I a
On 13 March 2012 00:12, Emmanuel Bourg wrote:
> I kept tickling ExtendedBufferedReader and I have some interesting results.
>
> First I tried to simplify it by extending java.io.LineNumberReader instead
> of BufferedReader. The performance decreased by 20%, probably because the
> class is synchron
I kept tickling ExtendedBufferedReader and I have some interesting results.
First I tried to simplify it by extending java.io.LineNumberReader
instead of BufferedReader. The performance decreased by 20%, probably
because the class is synchronized internally.
But wait, isn't BufferedReader als
Yes this is what I mean. It might be worth a shot. Folks who specialize
in parsing have spent much time on these libraries. It would make sense
that they are quite fast. It gets us out of the parsing business.
On Mar 12, 2012 12:41 PM, "Emmanuel Bourg" wrote:
> On 12/03/2012 17:28, James Carm
On Mon, Mar 12, 2012 at 5:41 PM, Emmanuel Bourg wrote:
> On 12/03/2012 17:28, James Carman wrote:
>
>> Would one of the parser libraries not work here?
>
>
> Are you thinking of something like JavaCC or ANTLR? Not sure it'll be more
> efficient than a handcrafted parser. The CSV format is simple enoug
On 12/03/2012 17:28, James Carman wrote:
Would one of the parser libraries not work here?
Are you thinking of something like JavaCC or ANTLR? Not sure it'll be more
efficient than a handcrafted parser. The CSV format is simple enough to
do it manually.
Emmanuel Bourg
On 12 March 2012 17:22, Emmanuel Bourg wrote:
> On 12/03/2012 17:03, Benedikt Ritter wrote:
>
>
>> The whole logic behind CSVLexer.nextToken() is very hard to read
>> (IMHO). Maybe some refactoring would help to make it easier to
>> identify bottlenecks?
>
>
> Yes I started investigating in
Would one of the parser libraries not work here?
On Mar 12, 2012 12:22 PM, "Emmanuel Bourg" wrote:
> On 12/03/2012 17:03, Benedikt Ritter wrote:
>
> The whole logic behind CSVLexer.nextToken() is very hard to read
>> (IMHO). Maybe some refactoring would help to make it easier to
>> identify
On 12/03/2012 17:03, Benedikt Ritter wrote:
The whole logic behind CSVLexer.nextToken() is very hard to read
(IMHO). Maybe some refactoring would help to make it easier to
identify bottlenecks?
Yes I started investigating in this direction. I filed a few bugs
regarding the behavior of th
On 12 March 2012 11:31, Emmanuel Bourg wrote:
> I have identified the performance killer, it's the ExtendedBufferedReader.
> It implements a complex logic to fetch one character ahead, but this extra
> character is rarely used. I have implemented a simpler look ahead using
> mark/reset as suggest
On 12/03/2012 16:44, sebb wrote:
Java has a PushbackReader class - could that not be used?
I considered it, but it doesn't mix well with line reading. The
mark/reset solution is really simple and efficient.
Emmanuel Bourg
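For readers unfamiliar with the mark/reset approach, a minimal sketch of a one-character look-ahead on top of BufferedReader follows. The class name LookAheadReader and the peekFirst helper are hypothetical, and this is not presented as the code committed for CSV-42.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

public class LookAheadReader extends BufferedReader {

    public LookAheadReader(Reader in) {
        super(in);
    }

    /** Returns the next character without consuming it, or -1 at end of stream. */
    public int lookAhead() throws IOException {
        mark(1);        // remember the current position; one char of read-ahead is enough
        int c = read(); // read the character (or -1)
        reset();        // rewind to the marked position
        return c;
    }

    /** Demo helper: peeks at the first character of text and checks that read() agrees. */
    static int peekFirst(String text) {
        try (LookAheadReader reader = new LookAheadReader(new StringReader(text))) {
            int peeked = reader.lookAhead();
            int consumed = reader.read();
            if (peeked != consumed) {
                throw new IllegalStateException("look-ahead disagreed with read()");
            }
            return peeked;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because the look-ahead is only taken when the parser needs it, the common path is a plain read() with no extra bookkeeping, and readLine() keeps working since no character is held outside the reader's own buffer.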
On 12 March 2012 10:31, Emmanuel Bourg wrote:
> I have identified the performance killer, it's the ExtendedBufferedReader.
> It implements a complex logic to fetch one character ahead, but this extra
> character is rarely used. I have implemented a simpler look ahead using
> mark/reset as suggeste
I have identified the performance killer, it's the
ExtendedBufferedReader. It implements a complex logic to fetch one
character ahead, but this extra character is rarely used. I have
implemented a simpler look ahead using mark/reset as suggested by Bob
Smith in CSV-42 and the performance improv
On 12/03/2012 00:02, Benedikt Ritter wrote:
I've started to dig my way through the source. I've not done too much
performance measuring in my career yet. I would use VisualVM for
profiling, if you don't know anything better.
Usually I work with JProfiler, it identifies the "hotspots" pretty
On 11 March 2012 21:21, Emmanuel Bourg wrote:
> On 11/03/2012 16:53, Benedikt Ritter wrote:
>
>
>> I have some spare time to help you with this. I'll check out the
>> latest source tonight. Any suggestion where to start?
>
>
> Hi Benedikt, thank you for helping. You can start looking at the so
On 11/03/2012 16:53, Benedikt Ritter wrote:
I have some spare time to help you with this. I'll check out the
latest source tonight. Any suggestion where to start?
Hi Benedikt, thank you for helping. You can start looking at the source
of CSVParser if anything catches your eye, and then run
On 11 March 2012 15:05, Emmanuel Bourg wrote:
> Hi,
>
> I compared the performance of Commons CSV with the other CSV parsers
> available. I took the world cities file from Maxmind as a test file [1],
> it's a big file of 130M with 2.8 million records.
>
> Here are the results obtained on a Core 2
Hi,
I compared the performance of Commons CSV with the other CSV parsers
available. I took the world cities file from Maxmind as a test file [1],
it's a big file of 130M with 2.8 million records.
Here are the results obtained on a Core 2 Duo E8400 after several
iterations to let the JIT comp