I see from the analysis and comments on issue 8826 
<https://github.com/JuliaLang/julia/issues/8826#issuecomment-62062525> that 
I am not the first to stumble on this performance gap and that improvements 
are coming.  I understand that the initial emphasis for Julia has been on 
numerical calculations, but it would be nice to Have It All and be able to 
use Julia confidently in problems that mix numerics and string manipulation.
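
For anyone following along, here is a minimal sketch of how the single-lookup suggestion quoted below might be folded into the whole function. It assumes a Julia 1.x-style Dict whose internals still provide the unexported Base.ht_keyindex and a vals field. Both are implementation details rather than a supported API, so this could stop working in any release. The AbstractString argument type and the do-block form of open are my own substitutions, not part of the original code.

```julia
# Sketch only: relies on unexported Dict internals (Base.ht_keyindex, grams.vals).
function unigrams(fn::AbstractString)
    grams = Dict{String,Int32}()
    # The do-block form closes the file even if an error is thrown.
    open(fn) do f
        for line in eachline(f)
            for t in split(line)
                # One hash lookup: slot index if the key exists, otherwise a negative value.
                index = Base.ht_keyindex(grams, t)
                if index > 0
                    grams.vals[index] += 1   # update the count in place
                else
                    grams[t] = 1             # first occurrence: ordinary insert
                end
            end
        end
    end
    return grams
end
```

Calling unigrams("corpus.txt") (a hypothetical file name) returns the token-to-count dictionary. Only tokens seen before take the single-lookup path; each new token still costs a second lookup on insert.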

On Thursday, November 20, 2014 2:11:57 AM UTC-8, Milan Bouchet-Valat wrote:
>
> On Wednesday, November 19, 2014 at 17:05 -0800, Greg Lee wrote: 
> > Is there a faster way to do the following, which builds a dictionary 
> > of unique tokens and counts? 
> >         function unigrams(fn::String) 
> >             grams = Dict{String,Int32}() 
> >             f = open(fn) 
> >             for line in eachline(f) 
> >                 for t in split(line) 
> >                     i = get(grams,t,0) 
> >                     grams[t] = i+1 
> >                 end 
> >             end 
> >             close(f) 
> >             return grams 
> >         end 
> > 
> > 
> > On a file with 1.9M unique tokens, this is 8x slower than Python 
> > written in the same style.  The big hit comes from string keys; using 
> > int keys is closer to Python's performance.  Timings:  Julia 1083s, 
> > Python 126s, C++ 80s. 
> At least you can avoid doing the dict lookup twice (once to get the 
> value, once to set it) for values that are already present in the 
> dictionary, by calling the unexported function Base.ht_keyindex(), like 
> this: 
>         for t in split(line) 
>             index = Base.ht_keyindex(grams, t) 
>             if index > 0 
>                 grams.vals[index] += 1 
>             else 
>                 grams[t] = 1 
>             end 
>         end 
>
> (BTW, this is a trick I'm using to compute frequency tables in 
> Tables.jl; you may also want to use that package directly.) 
>
> Of course this will break if the internal structure of dictionaries 
> changes. Work is going on to allow performing this kind of optimization 
> in a reliable way: 
> https://github.com/JuliaLang/julia/issues/8826#issuecomment-62062525 
>
>
> Regards 
>
