On 16 December 2014 at 04:39, Stefan Karpinski <ste...@karpinski.org> wrote:
> I'm not sure how this technique applies any less to strings than to numbers.
> Can you clarify what your concern is?
I suspect that what he wants to do is to limit the size of a vocabulary (a set of tokens) by imposing an occurrence threshold. This is fairly standard, and what you really want is a sorted dictionary. However, since this is usually just a pre-processing step, I don't even bother optimising and just do something like:

    UNKFORM = "<UNK>"
    unkcutoff = 1

    # First pass: count the occurrences of each token.
    tokoccs = Dict{UTF8String,Int}()
    for tok in data
        tokoccs[tok] = get(tokoccs, tok, 0) + 1
    end

    # Second pass: tokens above the cutoff go into the vocabulary,
    # the rest are collected as unknowns.
    vocab = Set{UTF8String}()
    unks = Set{UTF8String}()
    push!(vocab, UNKFORM)
    for (tok, occs) in tokoccs
        if occs > unkcutoff
            push!(vocab, tok)
        else
            push!(unks, tok)
        end
    end

This is O(n) + O(n), which should be fine unless you want to focus on optimising this specific region of code.

Pontus
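For comparison, here is a sketch of the same counting-and-threshold technique in Python (the variable names are borrowed from the Julia snippet above; the toy `data` corpus is made up purely for illustration):

```python
from collections import Counter

UNKFORM = "<UNK>"
unk_cutoff = 1

# Toy corpus standing in for `data`.
data = ["the", "cat", "sat", "the", "mat", "the", "cat"]

# First pass: count occurrences of each token (O(n)).
tokoccs = Counter(data)

# Second pass: tokens above the cutoff go into the vocabulary,
# the rest are collected as unknowns (O(n) over distinct tokens).
vocab = {UNKFORM}
unks = set()
for tok, occs in tokoccs.items():
    if occs > unk_cutoff:
        vocab.add(tok)
    else:
        unks.add(tok)

# Replace rare tokens in the corpus with the unknown form.
normalized = [tok if tok in vocab else UNKFORM for tok in data]
```

With the cutoff at 1, only tokens seen more than once survive; everything else collapses onto the single `<UNK>` form, which is what bounds the vocabulary size.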