Re: [julia-users] Dict performance with String keys

2014-11-21 Thread Stefan Karpinski
I expect we may be able to swap in a new String representation in a month and let the breakage begin. Any code that relies on the internal representation of Strings being specifically Array{Uint8,1} will break; in particular, code that uses the mutability thereof will fail. The I/O stuff is going to take a lot mo

Re: [julia-users] Dict performance with String keys

2014-11-20 Thread Pontus Stenetorp
On 21 November 2014 03:41, Stefan Karpinski wrote: > > I'm currently working on an overhaul of byte vectors and strings, which will > be followed by an overhaul of I/O (how one typically gets byte vectors). It > will take a bit of time but all things string and I/O related should be much > more ef

Re: [julia-users] Dict performance with String keys

2014-11-20 Thread Stefan Karpinski
I'm currently working on an overhaul of byte vectors and strings, which will be followed by an overhaul of I/O (how one typically gets byte vectors). It will take a bit of time but all things string and I/O related should be much more efficient once I'

Re: [julia-users] Dict performance with String keys

2014-11-20 Thread Greg Lee
I see from the analysis and comments on issue 8826 that I am not the first to stumble on this performance gap and that improvements

Re: [julia-users] Dict performance with String keys

2014-11-20 Thread Milan Bouchet-Valat
On Wednesday, 19 November 2014 at 17:05 -0800, Greg Lee wrote:
> Is there a faster way to do the following, which builds a dictionary
> of unique tokens and counts?
> function unigrams(fn::String)
>     grams = Dict{String,Int32}()
>     f = open(fn)
>     for line i

Re: [julia-users] Dict performance with String keys

2014-11-19 Thread Pontus Stenetorp
On 20 November 2014 10:05, Greg Lee wrote: > > Is there a faster way to do the following, which builds a dictionary of > unique tokens and counts? I share your frustration regarding this. It should be mentioned though that converting tokens to integers is a fairly standard performance hack in Na
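The token-to-integer hack mentioned here can be sketched roughly as follows. This is only an illustration, assuming a current Julia 1.x release rather than the 2014 version discussed in the thread; `index_tokens` and `count_ids` are hypothetical names, not functions from the thread or from Base.

```julia
# Hypothetical helper: assign each distinct token a dense integer id (1, 2, 3, ...)
# and return both the vocabulary and the id sequence.
function index_tokens(tokens)
    vocab = Dict{String,Int}()
    ids = Int[]
    for t in tokens
        # get! inserts the next unused id only if the token has not been seen yet
        id = get!(vocab, t, length(vocab) + 1)
        push!(ids, id)
    end
    return vocab, ids
end

# Hypothetical helper: count ids with a plain vector, no string hashing involved.
function count_ids(ids, n)
    counts = zeros(Int, n)
    for id in ids
        counts[id] += 1
    end
    return counts
end

vocab, ids = index_tokens(split("a b a c"))
counts = count_ids(ids, length(vocab))
```

The string-keyed dictionary is still consulted once per token while building `ids`; the saving, if any, comes from later passes that work on the integer ids alone.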

Re: [julia-users] Dict performance with String keys

2014-11-19 Thread Mike Nolta
https://github.com/JuliaLang/julia/issues/8826
Just curious, what do you get if you replace
    for t in split(line)
with
    words = split(line)
    for i = 1:length(words)
        t = words[i]
?
-Mike
On Wed, Nov 19, 2014 at 10:47 PM, Greg Lee wrote:
> Better but not competitive: 547s wit

Re: [julia-users] Dict performance with String keys

2014-11-19 Thread Greg Lee
Better but not competitive: 547s with symbols. On Wednesday, November 19, 2014 6:51:43 PM UTC-8, tshort wrote: > > You could try using symbols instead of strings. Replace t with symbol(t). > On Nov 19, 2014 8:06 PM, "Greg Lee" > > wrote: > >> Is there a faster way to do the following, which buil

Re: [julia-users] Dict performance with String keys

2014-11-19 Thread Tom Short
You could try using symbols instead of strings. Replace t with symbol(t). On Nov 19, 2014 8:06 PM, "Greg Lee" wrote: > Is there a faster way to do the following, which builds a dictionary of > unique tokens and counts? > > function unigrams(fn::String) > grams = Dict{String,Int32}() > f =
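A minimal sketch of the symbol-keyed variant suggested here, assuming a current Julia 1.x release, where the constructor is spelled `Symbol(t)` rather than the `symbol(t)` of 2014. The name `unigrams_sym` and the choice to read from an `IO` instead of a file name are assumptions made to keep the example self-contained.

```julia
# Count whitespace-separated tokens, keyed by Symbol instead of String.
function unigrams_sym(io::IO)
    grams = Dict{Symbol,Int}()
    for line in eachline(io)
        for t in split(line)
            s = Symbol(t)                    # convert the token to a symbol once
            grams[s] = get(grams, s, 0) + 1  # look up and update using the symbol key
        end
    end
    return grams
end

sym_counts = unigrams_sym(IOBuffer("a b a\nc a b\n"))
```

Whether this is actually faster depends on the Julia version and the data; the thread reports that symbols helped but did not close the gap.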

[julia-users] Dict performance with String keys

2014-11-19 Thread Greg Lee
Is there a faster way to do the following, which builds a dictionary of unique tokens and counts?
function unigrams(fn::String)
    grams = Dict{String,Int32}()
    f = open(fn)
    for line in eachline(f)
        for t in split(line)
            i = get(grams,t,0)
            grams[t] = i+1
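For readers on a current release, the same counting loop might look roughly like the sketch below. It assumes Julia 1.x, reads from any `IO` rather than a file name so that it runs standalone, and uses `Int` counts instead of `Int32`; none of those choices come from the original post, and the rest of the original function is not shown in this listing.

```julia
# Count whitespace-separated tokens from any IO stream.
function unigrams(io::IO)
    grams = Dict{String,Int}()
    for line in eachline(io)
        for t in split(line)
            # split yields SubString values; the Dict converts them to String on insert
            grams[t] = get(grams, t, 0) + 1
        end
    end
    return grams
end

word_counts = unigrams(IOBuffer("a b a\nc a b\n"))
```

To count a file, something like `open(unigrams, "corpus.txt")` would pass the opened stream to the function and close it afterwards; the file name is a placeholder.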