Thanks to Steven's great help, I learned a lot about optimizing string operations in Julia.
Finally, I wrote about this episode on my blog (in Japanese only, sorry): http://chezou.hatenablog.com/entry/2015/10/21/234317

-- chezou

On Wednesday, October 21, 2015 at 2:49:56 AM UTC+9, Steven G. Johnson wrote:
>
> I thought people might be interested in this cross-language benchmark of a realistic application:
>
> https://github.com/chezou/TinySegmenter.jl/issues/8
>
> TinySegmenter is an algorithm for breaking Japanese text into words, and it has been ported by several authors to different programming languages. Michiaki Ariga (@chezou) ported it to Julia, and after optimizing it a bit with me he ran some benchmarks comparing its performance to the other TinySegmenter ports. The resulting times (in seconds) for the different languages were:
>
>   JavaScript  Python2  Python3  Julia  Ruby
>   121.04      92.85    29.64    12.36  (933+)
>
> The algorithm basically consists of looping over the characters in a string, plugging tuples of consecutive characters into a dictionary of "scores", and emitting a word break when the score exceeds a threshold. The biggest speedup in optimizing the Julia code came from using tuples of Chars rather than concatenating the characters into strings: exploiting Julia's fast tuples avoids creating and then discarding lots of temporary strings.
>
> The Julia implementation also differs from the others in that it is the only one that operates completely in place on the text, without allocating large temporary arrays of characters and character categories, and it returns SubStrings rather than copies of the words. This sped things up only slightly, but it saves a lot of memory for a large text.
>
> --SGJ
>
> PS. Also, Julia's ability to explicitly type dictionaries caught a bug in the original implementation, where the author had missed the fact that the グ character can actually be formed by two codepoints in Unicode.
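For anyone following along, here is a minimal sketch of the general shape of that scoring loop in Julia (written against current Julia syntax rather than the 0.4 of this thread). The features and weights are made up for illustration; the real model scores many more features, such as character categories and trigrams, so treat this as a toy rather than as TinySegmenter itself:

```julia
# Toy TinySegmenter-style loop: score each adjacent pair of characters
# and emit a word break when the score exceeds a threshold.
# SCORES and THRESHOLD are hypothetical stand-ins for the real model.
const SCORES = Dict{Tuple{Char,Char},Int}(
    ('す', '。') => 100,    # made-up weight: favor a break before "。"
    ('で', 'す') => -100,   # made-up weight: avoid breaking inside "です"
)
const THRESHOLD = 0

function toy_segment(text::String)
    words = SubString{String}[]
    start = firstindex(text)
    prev = '\0'
    for (i, c) in pairs(text)
        # Look up the (previous char, current char) bigram; a missing
        # bigram scores 0.
        if i > start && get(SCORES, (prev, c), 0) > THRESHOLD
            push!(words, SubString(text, start, prevind(text, i)))
            start = i
        end
        prev = c
    end
    push!(words, SubString(text, start, lastindex(text)))
    return words
end

toy_segment("これはペンです。")  # -> ["これはペンです", "。"] under these toy weights
```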
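The tuple trick Steven describes is easy to see in isolation. In this minimal comparison (hypothetical one-entry score tables, not the real ones), the string-keyed lookup has to heap-allocate a temporary String on every query, while a Tuple{Char,Char} key is an immutable bits type built without touching the heap:

```julia
const SCORES_STR = Dict{String,Int}("ab" => 1)
const SCORES_TUP = Dict{Tuple{Char,Char},Int}(('a', 'b') => 1)

# Allocates a temporary two-character String for every lookup:
lookup_str(p::Char, c::Char) = get(SCORES_STR, string(p, c), 0)

# Builds the key with no allocation, so there is no garbage to collect:
lookup_tup(p::Char, c::Char) = get(SCORES_TUP, (p, c), 0)
```

In a loop over millions of characters, removing that per-lookup allocation is exactly the kind of change that dominates the profile.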
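The SubString point works the same way: a SubString is a view (parent string plus offset and length) rather than a copy, so every word the segmenter returns shares the bytes of the original text. A quick demonstration with made-up input:

```julia
text = "これはテストです"        # hypothetical input
word = SubString(text, 1, 4)    # 1 and 4 index 'こ' and 'れ' (each kana is 3 bytes)

word == "これ"                  # true: a SubString compares like any other string
pointer(word) == pointer(text)  # true: the view points into the parent's own bytes
```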
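Steven's PS also deserves a concrete illustration, because the underlying Unicode wrinkle is easy to miss: グ exists both as a single precomposed codepoint and as ク followed by a combining voiced sound mark. A Dict declared with Char keys cannot silently accept the two-codepoint form, which is how the explicit typing surfaced the bug. A small sketch (`only` requires Julia 1.4 or later):

```julia
precomposed = "\u30b0"        # "グ" as one codepoint (KATAKANA LETTER GU)
decomposed  = "\u30af\u3099"  # "ク" + combining voiced sound mark; renders as "グ"

length(precomposed)  # 1
length(decomposed)   # 2: two codepoints, even though it displays as one character

# An explicitly typed Dict{Char,Int} forces each key to be a single Char:
Dict{Char,Int}(only(precomposed) => 1)  # fine
only(decomposed)  # throws ArgumentError: the string has two characters, not one
```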