That's an excellent performance-comparison case study! Nice work, Chezou, and nice blog post (the Google Translate version is quite readable).
On Wed, Oct 21, 2015 at 10:58 AM, Michiaki Ariga <che...@gmail.com> wrote:

> Thanks to Steven's great help, I learned a lot about optimizing string
> operations in Julia.
>
> Finally, I wrote up this episode on my blog (in Japanese only, sorry).
> http://chezou.hatenablog.com/entry/2015/10/21/234317
>
> -- chezou
>
> On Wednesday, October 21, 2015 at 2:49:56 AM UTC+9, Steven G. Johnson wrote:
>
>> I thought people might be interested in this cross-language benchmark of
>> a realistic application:
>>
>> https://github.com/chezou/TinySegmenter.jl/issues/8
>>
>> TinySegmenter is an algorithm for breaking Japanese text into words, and
>> it has been ported by several authors to different programming languages.
>> Michiaki Ariga (@chezou) ported it to Julia, and after optimizing it a bit
>> with me he ran some benchmarks comparing the performance of the different
>> TinySegmenter ports. The resulting times (in seconds) for the different
>> languages were:
>>
>> JavaScript: 121.04
>> Python 2:    92.85
>> Python 3:    29.64
>> Julia:       12.36
>> Ruby:       (933+)
>>
>> The algorithm basically consists of looping over the characters in a
>> string, plugging tuples of consecutive characters into a dictionary of
>> "scores", and emitting a word break when the score exceeds a threshold.
>> The biggest speedup in optimizing the Julia code came from using tuples
>> of Chars rather than concatenating the characters into strings, which
>> avoids creating and then discarding lots of temporary strings by
>> exploiting Julia's fast tuples.
>>
>> The Julia implementation also differs from the others in that it is the
>> only one that operates completely in place on the text, without
>> allocating large temporary arrays of characters and character categories,
>> and it returns SubStrings rather than copies of the words. This sped
>> things up only slightly, but it saves a lot of memory for a large text.
>>
>> --SGJ
>>
>> PS. Julia's ability to explicitly type dictionaries also caught a bug in
>> the original implementation, where the author had missed the fact that
>> the グ character is actually formed by two codepoints in Unicode.
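
For anyone who wants to experiment with the ideas Steven describes above, here is a minimal sketch of a TinySegmenter-style scoring loop. The feature weights and threshold are invented for illustration (the real package uses thousands of trained weights over character trigrams and character-category features), but the shape of the loop is the same: score each candidate boundary, and break when the score clears the threshold.

# Minimal sketch of a TinySegmenter-style scoring loop.
# SCORES and THRESHOLD are made up for illustration; the real
# package uses a large trained feature dictionary.
const SCORES = Dict{Tuple{Char,Char},Int}(
    ('は', '猫') => 1000,   # hypothetical: break between は and 猫
    ('。', '名') => 1000,   # hypothetical: break after a full stop
)
const THRESHOLD = 500       # hypothetical decision threshold

function segment(text::String)
    words = SubString{String}[]
    start = firstindex(text)
    prev = '\0'
    for i in eachindex(text)
        c = text[i]
        # insert a word break *before* character i when the score
        # of the (previous, current) pair clears the threshold
        if i > start && get(SCORES, (prev, c), 0) > THRESHOLD
            push!(words, SubString(text, start, prevind(text, i)))
            start = i
        end
        prev = c
    end
    push!(words, SubString(text, start, lastindex(text)))
    return words
end

segment("吾輩は猫である。名前はまだない。")
# -> ["吾輩は", "猫である。", "名前はまだない。"]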
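
The Char-tuple trick is easy to see in isolation. This is not code from the package, just a sketch of the difference (and it assumes BenchmarkTools.jl is installed for the timing macros):

# Building a lookup key from three consecutive characters.
# Concatenating them into a String heap-allocates a temporary on
# every lookup; a Tuple{Char,Char,Char} is a plain immutable value
# that hashes with no allocation at all.
scores_str = Dict{String,Int}("abc" => 1)
scores_tup = Dict{Tuple{Char,Char,Char},Int}(('a', 'b', 'c') => 1)

lookup_str(d, a, b, c) = get(d, string(a, b, c), 0)  # allocates a String
lookup_tup(d, a, b, c) = get(d, (a, b, c), 0)        # allocation-free

using BenchmarkTools  # assumed installed; any timing harness works
@btime lookup_str($scores_str, 'a', 'b', 'c')
@btime lookup_tup($scores_tup, 'a', 'b', 'c')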
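
Likewise for the SubString point: a SubString records only a reference to the parent string plus a pair of offsets, so returning the words of a large text never copies character data. A quick illustration (again a sketch, not package code):

text = "吾輩は猫である"
# Byte indices: each of these kanji/kana is 3 bytes of UTF-8,
# so 吾 starts at index 1 and 輩 starts at index 4.
w = SubString(text, 1, 4)
typeof(w)      # SubString{String}: a view, not a copy
w == "吾輩"     # true; it compares like an ordinary string
sizeof(w)      # 6 bytes of UTF-8, none of which were copied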
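
Finally, the PS about グ is worth unpacking, since it is a nice example of a type annotation catching a data bug. グ exists in Unicode both as a single precomposed codepoint and as ク followed by the combining voiced-sound mark U+3099, and the two forms render identically. The sketch below uses current Julia; the 2015 code would have hit the same wall, since the decomposed form is not even a valid Char literal:

nfc = "\u30b0"        # "グ" as one codepoint (precomposed form)
nfd = "\u30af\u3099"  # ク + combining mark; displays identically
length(nfc), length(nfd)   # (1, 2)

# With an explicitly typed Dict{Char,Int}, the decomposed form
# cannot silently become a key: 'ク\u3099' is rejected at parse
# time ("character literal contains multiple characters"), and
# extracting a single Char from the two-codepoint string throws.
scores = Dict{Char,Int}('\u30b0' => 10)   # fine: one codepoint
# only(nfd)   # throws ArgumentError: nfd has two characters, not one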