That's an excellent performance-comparison case study! Nice work, Chezou, and nice blog post (the Google Translate version is quite readable).
On Wed, Oct 21, 2015 at 10:58 AM, Michiaki Ariga <che...@gmail.com> wrote:

> Thanks to Steven's great help, I learned a lot about optimizing string
> operations in Julia.
>
> Finally, I wrote up this episode on my blog (in Japanese only, sorry).
> http://chezou.hatenablog.com/entry/2015/10/21/234317
>
> -- chezou
>
> On Wednesday, October 21, 2015 at 2:49:56 AM UTC+9, Steven G. Johnson wrote:
>
>> I thought people might be interested in this cross-language benchmark of
>> a realistic application:
>>
>> https://github.com/chezou/TinySegmenter.jl/issues/8
>>
>> TinySegmenter is an algorithm for breaking Japanese text into words, and
>> it has been ported by several authors to different programming languages.
>> Michiaki Ariga (@chezou) ported it to Julia, and after optimizing it a bit
>> with me he ran some benchmarks comparing the performance of the different
>> TinySegmenter ports. The resulting times (in seconds) for the different
>> languages were:
>>
>> JavaScript: 121.04
>> Python 2:    92.85
>> Python 3:    29.64
>> Julia:       12.36
>> Ruby:       (933+)
>>
>> The algorithm basically consists of looping over the characters in a
>> string, plugging tuples of consecutive characters into a dictionary of
>> "scores", and emitting a word break when the score exceeds a threshold.
>> The biggest speedup in optimizing the Julia code came from using tuples
>> of Chars rather than concatenating the characters into strings, which
>> avoids creating and then discarding lots of temporary strings by
>> exploiting Julia's fast tuples.
>>
>> The Julia implementation also differs from the others in that it is the
>> only one that operates completely in place on the text, without
>> allocating large temporary arrays of characters and character categories,
>> and it returns SubStrings rather than copies of the words. This sped
>> things up only slightly, but it saves a lot of memory for a large text.
>>
>> --SGJ
>>
>> PS. Julia's ability to explicitly type dictionaries also caught a bug in
>> the original implementation, where the author had missed the fact that
>> the グ character is actually formed by two codepoints in Unicode.
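
For anyone who wants to experiment with the ideas Steven describes above, here is a minimal sketch of a TinySegmenter-style scoring loop. The feature weights and threshold are invented for illustration (the real package uses thousands of trained weights over character trigrams and character-category features), but the shape of the loop is the same: score each candidate boundary, and break when the score clears the threshold.

# Minimal sketch of a TinySegmenter-style scoring loop.
# SCORES and THRESHOLD are made up for illustration; the real
# package uses a large trained feature dictionary.
const SCORES = Dict{Tuple{Char,Char},Int}(
    ('は', '猫') => 1000,   # hypothetical: break between は and 猫
    ('。', '名') => 1000,   # hypothetical: break after a full stop
)
const THRESHOLD = 500       # hypothetical decision threshold

function segment(text::String)
    words = SubString{String}[]
    start = firstindex(text)
    prev = '\0'
    for i in eachindex(text)
        c = text[i]
        # insert a word break *before* character i when the score
        # of the (previous, current) pair clears the threshold
        if i > start && get(SCORES, (prev, c), 0) > THRESHOLD
            push!(words, SubString(text, start, prevind(text, i)))
            start = i
        end
        prev = c
    end
    push!(words, SubString(text, start, lastindex(text)))
    return words
end

segment("吾輩は猫である。名前はまだない。")
# -> ["吾輩は", "猫である。", "名前はまだない。"]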
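
The Char-tuple trick is easy to see in isolation. This is not code from the package, just a sketch of the difference (and it assumes BenchmarkTools.jl is installed for the timing macros):

# Building a lookup key from three consecutive characters.
# Concatenating them into a String heap-allocates a temporary on
# every lookup; a Tuple{Char,Char,Char} is a plain immutable value
# that hashes with no allocation at all.
scores_str = Dict{String,Int}("abc" => 1)
scores_tup = Dict{Tuple{Char,Char,Char},Int}(('a', 'b', 'c') => 1)

lookup_str(d, a, b, c) = get(d, string(a, b, c), 0)  # allocates a String
lookup_tup(d, a, b, c) = get(d, (a, b, c), 0)        # allocation-free

using BenchmarkTools  # assumed installed; any timing harness works
@btime lookup_str($scores_str, 'a', 'b', 'c')
@btime lookup_tup($scores_tup, 'a', 'b', 'c')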
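
Likewise for the SubString point: a SubString records only a reference to the parent string plus a pair of offsets, so returning the words of a large text never copies character data. A quick illustration (again a sketch, not package code):

text = "吾輩は猫である"
# Byte indices: each of these kanji/kana is 3 bytes of UTF-8,
# so 吾 starts at index 1 and 輩 starts at index 4.
w = SubString(text, 1, 4)
typeof(w)      # SubString{String}: a view, not a copy
w == "吾輩"     # true; it compares like an ordinary string
sizeof(w)      # 6 bytes of UTF-8, none of which were copied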
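
Finally, the PS about グ is worth unpacking, since it is a nice example of a type annotation catching a data bug. グ exists in Unicode both as a single precomposed codepoint and as ク followed by the combining voiced-sound mark U+3099, and the two forms render identically. The sketch below uses current Julia; the 2015 code would have hit the same wall, since the decomposed form is not even a valid Char literal:

nfc = "\u30b0"        # "グ" as one codepoint (precomposed form)
nfd = "\u30af\u3099"  # ク + combining mark; displays identically
length(nfc), length(nfd)   # (1, 2)

# With an explicitly typed Dict{Char,Int}, the decomposed form
# cannot silently become a key: 'ク\u3099' is rejected at parse
# time ("character literal contains multiple characters"), and
# extracting a single Char from the two-codepoint string throws.
scores = Dict{Char,Int}('\u30b0' => 10)   # fine: one codepoint
# only(nfd)   # throws ArgumentError: nfd has two characters, not one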