Thanks to Steven's great help, I learned a lot about optimizing string operations in Julia.
Finally, I wrote about this episode on my blog (in Japanese only, sorry): http://chezou.hatenablog.com/entry/2015/10/21/234317

-- chezou

On Wednesday, October 21, 2015 at 2:49:56 AM UTC+9, Steven G. Johnson wrote:
>
> I thought people might be interested in this cross-language benchmark of a realistic application:
>
> https://github.com/chezou/TinySegmenter.jl/issues/8
>
> TinySegmenter is an algorithm for breaking Japanese text into words, and it has been ported by several authors to different programming languages. Michiaki Ariga (@chezou) ported it to Julia, and after optimizing it a bit with me he ran some benchmarks comparing its performance to the other TinySegmenter ports. The resulting times (in seconds) for the different languages were:
>
>   JavaScript  Python2  Python3  Julia  Ruby
>   121.04      92.85    29.64    12.36  (933+)
>
> The algorithm basically consists of looping over the characters in a string, plugging tuples of consecutive characters into a dictionary of "scores", and emitting a word break when the score exceeds a threshold. The biggest speedup in optimizing the Julia code came from using tuples of Chars rather than concatenating the characters into strings: exploiting Julia's fast tuples avoids creating and then discarding lots of temporary strings.
>
> The Julia implementation also differs from the others in that it is the only one that operates completely in place on the text, without allocating large temporary arrays of characters and character categories, and it returns SubStrings rather than copies of the words. This sped things up only slightly, but it saves a lot of memory for a large text.
>
> --SGJ
>
> PS. Also, Julia's ability to explicitly type dictionaries caught a bug in the original implementation, where the author had missed the fact that the グ character can actually be formed by two codepoints in Unicode.
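For anyone following along, here is a minimal sketch of the general shape of that scoring loop in Julia (written against current Julia syntax rather than the 0.4 of this thread). The features and weights are made up for illustration; the real model scores many more features, such as character categories and trigrams, so treat this as a toy rather than as TinySegmenter itself:

```julia
# Toy TinySegmenter-style loop: score each adjacent pair of characters
# and emit a word break when the score exceeds a threshold.
# SCORES and THRESHOLD are hypothetical stand-ins for the real model.
const SCORES = Dict{Tuple{Char,Char},Int}(
    ('す', '。') => 100,    # made-up weight: favor a break before "。"
    ('で', 'す') => -100,   # made-up weight: avoid breaking inside "です"
)
const THRESHOLD = 0

function toy_segment(text::String)
    words = SubString{String}[]
    start = firstindex(text)
    prev = '\0'
    for (i, c) in pairs(text)
        # Look up the (previous char, current char) bigram; a missing
        # bigram scores 0.
        if i > start && get(SCORES, (prev, c), 0) > THRESHOLD
            push!(words, SubString(text, start, prevind(text, i)))
            start = i
        end
        prev = c
    end
    push!(words, SubString(text, start, lastindex(text)))
    return words
end

toy_segment("これはペンです。")  # -> ["これはペンです", "。"] under these toy weights
```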
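The tuple trick Steven describes is easy to see in isolation. In this minimal comparison (hypothetical one-entry score tables, not the real ones), the string-keyed lookup has to heap-allocate a temporary String on every query, while a Tuple{Char,Char} key is an immutable bits type built without touching the heap:

```julia
const SCORES_STR = Dict{String,Int}("ab" => 1)
const SCORES_TUP = Dict{Tuple{Char,Char},Int}(('a', 'b') => 1)

# Allocates a temporary two-character String for every lookup:
lookup_str(p::Char, c::Char) = get(SCORES_STR, string(p, c), 0)

# Builds the key with no allocation, so there is no garbage to collect:
lookup_tup(p::Char, c::Char) = get(SCORES_TUP, (p, c), 0)
```

In a loop over millions of characters, removing that per-lookup allocation is exactly the kind of change that dominates the profile.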
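The SubString point works the same way: a SubString is a view (parent string plus offset and length) rather than a copy, so every word the segmenter returns shares the bytes of the original text. A quick demonstration with made-up input:

```julia
text = "これはテストです"        # hypothetical input
word = SubString(text, 1, 4)    # 1 and 4 index 'こ' and 'れ' (each kana is 3 bytes)

word == "これ"                  # true: a SubString compares like any other string
pointer(word) == pointer(text)  # true: the view points into the parent's own bytes
```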
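Steven's PS also deserves a concrete illustration, because the underlying Unicode wrinkle is easy to miss: グ exists both as a single precomposed codepoint and as ク followed by a combining voiced sound mark. A Dict declared with Char keys cannot silently accept the two-codepoint form, which is how the explicit typing surfaced the bug. A small sketch (`only` requires Julia 1.4 or later):

```julia
precomposed = "\u30b0"        # "グ" as one codepoint (KATAKANA LETTER GU)
decomposed  = "\u30af\u3099"  # "ク" + combining voiced sound mark; renders as "グ"

length(precomposed)  # 1
length(decomposed)   # 2: two codepoints, even though it displays as one character

# An explicitly typed Dict{Char,Int} forces each key to be a single Char:
Dict{Char,Int}(only(precomposed) => 1)  # fine
only(decomposed)  # throws ArgumentError: the string has two characters, not one
```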