Re: [julia-users] Re: TinySegmenter benchmark

2015-11-06 Thread Michiaki ARIGA
Finally, I compared the C++ and Go versions using TinySegmenterMaker.
https://github.com/shogo82148/TinySegmenterMaker/pull/10

The resulting times (in seconds, for 100 loops over a text file) were:
Ruby: 132.98 | C++: 48 | Perl: 134 | Node.js: 105.31 | Go: 10.50 | Python: 111.85 | Julia: 11.70

After my blog post, ikawaha optimized the Go version using the same approach
we did, and Go is now faster than Julia.

On Thu, Oct 22, 2015 at 11:45 PM Michiaki ARIGA  wrote:

> Masahiro Nakagawa (a.k.a. repeatedly) pointed out my mistakes in the
> benchmark, so I re-ran it.
>
> Node.js: 9.62 | Python2: 93.08 | Python3: 23.94 | Julia: 1.46 | Ruby: 19.44
>
> - the loop count for Python was 10 times smaller than for the other languages
> - repeatedly optimized the Ruby implementation
> - changed the loop count from 100 to 10
>
> repeatedly also benchmarked a D implementation; I will try it after updating
> to El Capitan :)
> http://repeatedly.github.io/ja/2015/10/tinysegmenter-benchmark-and-d/
>
>
> On Thu, Oct 22, 2015 at 6:28 AM Pontus Stenetorp 
> wrote:
>
>> On 21 October 2015 at 17:49, Stefan Karpinski 
>> wrote:
>> >
>> > That's an excellent performance comparison case study! Nice work,
>> Chezou and nice blog post (the Google translation is pretty readable).
>>
>> Very readable indeed and I am always happy to see more NLP code in
>> Julia!  Keep up the good work!
>>
>> Pontus
>>
>


Re: [julia-users] Re: TinySegmenter benchmark

2015-11-06 Thread Stefan Karpinski
I'm kind of surprised that C++ is so slow. I would imagine that anything
you can do performance-wise in Go or Julia, you ought to be able to do in
C++. Any idea what's going on there?

On Fri, Nov 6, 2015 at 12:00 PM, Michiaki ARIGA  wrote:

> Finally, I compared the C++ and Go versions using TinySegmenterMaker.
> https://github.com/shogo82148/TinySegmenterMaker/pull/10
>
> The resulting times (in seconds, for 100 loops over a text file) were:
> Ruby: 132.98 | C++: 48 | Perl: 134 | Node.js: 105.31 | Go: 10.50 | Python: 111.85 | Julia: 11.70
>
> After my blog post, ikawaha optimized the Go version using the same approach
> we did, and Go is now faster than Julia.
>
> On Thu, Oct 22, 2015 at 11:45 PM Michiaki ARIGA  wrote:
>
>> Masahiro Nakagawa (a.k.a. repeatedly) pointed out my mistakes in the
>> benchmark, so I re-ran it.
>>
>> Node.js: 9.62 | Python2: 93.08 | Python3: 23.94 | Julia: 1.46 | Ruby: 19.44
>>
>> - the loop count for Python was 10 times smaller than for the other languages
>> - repeatedly optimized the Ruby implementation
>> - changed the loop count from 100 to 10
>>
>> repeatedly also benchmarked a D implementation; I will try it after updating
>> to El Capitan :)
>> http://repeatedly.github.io/ja/2015/10/tinysegmenter-benchmark-and-d/
>>
>>
>> On Thu, Oct 22, 2015 at 6:28 AM Pontus Stenetorp 
>> wrote:
>>
>>> On 21 October 2015 at 17:49, Stefan Karpinski 
>>> wrote:
>>> >
>>> > That's an excellent performance comparison case study! Nice work,
>>> Chezou and nice blog post (the Google translation is pretty readable).
>>>
>>> Very readable indeed and I am always happy to see more NLP code in
>>> Julia!  Keep up the good work!
>>>
>>> Pontus
>>>
>>


Re: [julia-users] Re: TinySegmenter benchmark

2015-11-06 Thread Steven G. Johnson
Note that the Go version went one step further and packed the tuples into
64-bit integers where possible. We could do the same thing, though at this
point we seem to be hitting diminishing returns.
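For the curious, the packing could look something like this in Julia. This is a hypothetical sketch (the names `pack`, `unpack`, and `scores` are mine, not from TinySegmenter.jl or the Go port); it relies only on the fact that a Unicode codepoint needs at most 21 bits, so two Chars fit comfortably in a UInt64:

```julia
# Pack a (Char, Char) key into one UInt64, along the lines of what the
# Go port does. Each codepoint fits in 21 bits, so 32-bit halves are
# more than enough for two characters.
pack(a::Char, b::Char) = (UInt64(a) << 32) | UInt64(b)
unpack(x::UInt64) = (Char(x >> 32), Char(x & 0xffffffff))

# The score table can then be keyed on plain integers instead of tuples:
scores = Dict{UInt64,Int}(pack('日', '本') => 100)
```

Integer keys hash a bit faster than tuples, which is presumably where the remaining Go advantage comes from.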

Re: [julia-users] Re: TinySegmenter benchmark

2015-11-06 Thread Steven G. Johnson
The C++ version is basically transcribed from the JavaScript version and
constructs tons of temporary strings.  The key improvement we made (and Go
subsequently adopted) is to use tuples of Char instead for the hash-table keys.

I'm pretty pleased that Julia is within 10% of the Go version, and also that Go 
benefited from our optimization work. 
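As a minimal illustration of that change (hypothetical helper names, not the actual TinySegmenter.jl code): keying the score table on a Tuple{Char,Char} means no temporary String is allocated on each lookup, because small immutable tuples are stack values in Julia:

```julia
# Slow variant: every lookup heap-allocates a temporary String key.
slow_key(c1::Char, c2::Char) = string(c1, c2)

# Fast variant: a Tuple{Char,Char} is an immutable stack value, so
# building the key allocates nothing.
fast_key(c1::Char, c2::Char) = (c1, c2)

scores_str = Dict{String,Int}(slow_key('日', '本') => 100)
scores_tup = Dict{Tuple{Char,Char},Int}(fast_key('日', '本') => 100)

# Same lookup, no temporary String created:
get(scores_tup, fast_key('日', '本'), 0)
```

Since the inner loop does one of these lookups per character of input, avoiding the allocation adds up quickly over a large text.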

Re: [julia-users] Re: TinySegmenter benchmark

2015-10-22 Thread Michiaki ARIGA
Masahiro Nakagawa (a.k.a. repeatedly) pointed out my mistakes in the
benchmark, so I re-ran it.

Node.js: 9.62 | Python2: 93.08 | Python3: 23.94 | Julia: 1.46 | Ruby: 19.44

- the loop count for Python was 10 times smaller than for the other languages
- repeatedly optimized the Ruby implementation
- changed the loop count from 100 to 10

repeatedly also benchmarked a D implementation; I will try it after updating
to El Capitan :)
http://repeatedly.github.io/ja/2015/10/tinysegmenter-benchmark-and-d/


On Thu, Oct 22, 2015 at 6:28 AM Pontus Stenetorp 
wrote:

> On 21 October 2015 at 17:49, Stefan Karpinski 
> wrote:
> >
> > That's an excellent performance comparison case study! Nice work, Chezou
> and nice blog post (the Google translation is pretty readable).
>
> Very readable indeed and I am always happy to see more NLP code in
> Julia!  Keep up the good work!
>
> Pontus
>


[julia-users] Re: TinySegmenter benchmark

2015-10-21 Thread Michiaki Ariga
Thanks to Steven's great help, I learned a lot about optimizing string
operations in Julia.

Finally, I wrote this episode on my blog (in Japanese only, sorry).
http://chezou.hatenablog.com/entry/2015/10/21/234317

-- chezou

On Wednesday, October 21, 2015 at 2:49:56 UTC+9, Steven G. Johnson wrote:
>
> I thought people might be interested in this cross-language benchmark of a 
> realistic application:
>
>  https://github.com/chezou/TinySegmenter.jl/issues/8
>
> TinySegmenter is an algorithm for breaking Japanese text into words, and 
> it has been ported by several authors to different programming languages. 
>  Michiaki Ariga (@chezou) ported it to Julia, and after optimizing it a bit 
> with me he ran some benchmarks comparing the performance to the different 
> TinySegmenter ports.  The resulting times (in seconds) for different 
> languages were:
>
> JavaScript: 121.04 | Python2: 92.85 | Python3: 29.64 | Julia: 12.36 | Ruby: (933+)
>
> The algorithm basically consists of looping over the characters in a 
> string, plugging tuples of consecutive characters into a dictionary of 
> "scores", and spitting out a word break when the score exceeds a threshold. 
>   The biggest speedup in optimizing the Julia code came from using tuples 
> of Char (characters) rather than concatenating the chars into strings 
> (which avoids the need to create and then discard lots of temporary strings 
> by exploiting Julia's fast tuples).
>
> The Julia implementation is also different from the others in that it is 
> the only one that operates completely in-place on the text, without 
> allocating large temporary arrays of characters and character categories, 
> and returns SubStrings rather than copies of the words.  This sped things 
> up only slightly, but saves a lot of memory for a large text.
>
> --SGJ
>
> PS. Also, Julia's ability to explicitly type dictionaries caught a bug in 
> the original implementation, where the author had missed the fact that the 
> グ character is actually formed by two codepoints in Unicode.
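That last point is easy to demonstrate standalone (this is an illustration of the Unicode behavior, not the original buggy code): グ has both a precomposed form and a decomposed form, and only the former is a single Char, so a typed Dict{Tuple{Char,Char},...} cannot silently accept the decomposed "character":

```julia
# グ exists as one precomposed codepoint (U+30B0) and as ク (U+30AF)
# followed by the combining dakuten (U+3099). Both render identically.
g_nfc = "\u30b0"          # one codepoint: グ
g_nfd = "\u30af\u3099"    # two codepoints, same glyph

length(g_nfc)  # 1 — this can be a Char
length(g_nfd)  # 2 — this cannot, so a Dict keyed on Char tuples
               # flags the mistake instead of mis-segmenting silently
```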
>
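The algorithm described in the quoted post maps onto a sketch like the following. This is a toy with a made-up score table, a single-character context, and an invented threshold, nothing like the real TinySegmenter model, but it shows the loop shape and how returning SubStrings avoids copying the input:

```julia
# Toy segmenter in the shape described above: walk the characters,
# look up tuples of adjacent characters in a score table, and emit a
# word break when the score exceeds a threshold. Scores are made up.
const SCORES = Dict{Tuple{Char,Char},Int}(('は', '猫') => 500)
const THRESHOLD = 0

function toy_segment(text::String)
    words = SubString{String}[]
    start = 1            # byte index where the current word begins
    prev = '\0'
    for (i, c) in pairs(text)            # byte index => Char
        if get(SCORES, (prev, c), -100) > THRESHOLD && i > start
            # break before c; a SubString is just a view, no copy
            push!(words, SubString(text, start, prevind(text, i)))
            start = i
        end
        prev = c
    end
    push!(words, SubString(text, start, lastindex(text)))
    return words
end

toy_segment("では猫だ")   # → ["では", "猫だ"]
```

Note the use of byte indices (`pairs`, `prevind`, `lastindex`) rather than character counts, which is what makes the in-place traversal of a UTF-8 String work without building a temporary array of characters.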


Re: [julia-users] Re: TinySegmenter benchmark

2015-10-21 Thread Stefan Karpinski
That's an excellent performance comparison case study! Nice work, Chezou
and nice blog post (the Google translation is pretty readable).

On Wed, Oct 21, 2015 at 10:58 AM, Michiaki Ariga  wrote:

> Thanks to Steven's great help, I learned a lot about optimizing string
> operations in Julia.
>
> Finally, I wrote this episode on my blog (in Japanese only, sorry).
> http://chezou.hatenablog.com/entry/2015/10/21/234317
>
> -- chezou
>
> On Wednesday, October 21, 2015 at 2:49:56 UTC+9, Steven G. Johnson wrote:
>
>> I thought people might be interested in this cross-language benchmark of
>> a realistic application:
>>
>>  https://github.com/chezou/TinySegmenter.jl/issues/8
>>
>> TinySegmenter is an algorithm for breaking Japanese text into words, and
>> it has been ported by several authors to different programming languages.
>> Michiaki Ariga (@chezou) ported it to Julia, and after optimizing it a bit
>> with me he ran some benchmarks comparing the performance to the different
>> TinySegmenter ports.  The resulting times (in seconds) for different
>> languages were:
>>
>> JavaScript: 121.04 | Python2: 92.85 | Python3: 29.64 | Julia: 12.36 | Ruby: (933+)
>>
>> The algorithm basically consists of looping over the characters in a
>> string, plugging tuples of consecutive characters into a dictionary of
>> "scores", and spitting out a word break when the score exceeds a threshold.
>>   The biggest speedup in optimizing the Julia code came from using tuples
>> of Char (characters) rather than concatenating the chars into strings
>> (which avoids the need to create and then discard lots of temporary strings
>> by exploiting Julia's fast tuples).
>>
>> The Julia implementation is also different from the others in that it is
>> the only one that operates completely in-place on the text, without
>> allocating large temporary arrays of characters and character categories,
>> and returns SubStrings rather than copies of the words.  This sped things
>> up only slightly, but saves a lot of memory for a large text.
>>
>> --SGJ
>>
>> PS. Also, Julia's ability to explicitly type dictionaries caught a bug in
>> the original implementation, where the author had missed the fact that the
>> グ character is actually formed by two codepoints in Unicode.
>>
>


Re: [julia-users] Re: TinySegmenter benchmark

2015-10-21 Thread Pontus Stenetorp
On 21 October 2015 at 17:49, Stefan Karpinski  wrote:
>
> That's an excellent performance comparison case study! Nice work, Chezou and 
> nice blog post (the Google translation is pretty readable).

Very readable indeed and I am always happy to see more NLP code in
Julia!  Keep up the good work!

Pontus