On 16 December 2014 at 04:39, Stefan Karpinski <ste...@karpinski.org> wrote:
>
> I'm not sure how this technique applies any less to strings than to numbers.
> Can you clarify what your concern is?

I suspect that what he wants to do is to limit the size of a
vocabulary (a set of tokens) by imposing an occurrence threshold.  This
is fairly standard, and what you really want is a sorted
dictionary.  However, since this is usually just a pre-processing step,
I don't even bother optimising and just do something like:

    UNKFORM = "<UNK>"  # placeholder for out-of-vocabulary tokens
    unkcutoff = 1      # tokens occurring this often or less are unknowns
    # First pass: count the occurrences of each token.
    tokoccs = Dict{UTF8String,Int}()
    for tok in data
        tokoccs[tok] = get(tokoccs, tok, 0) + 1
    end
    # Second pass: split tokens into the vocabulary and the unknowns.
    vocab = Set{UTF8String}()
    unks = Set{UTF8String}()
    push!(vocab, UNKFORM)
    for (tok, occs) in tokoccs
        if occs > unkcutoff
            push!(vocab, tok)
        else
            push!(unks, tok)
        end
    end

This is O(n) for the counting pass plus O(|V|) for the thresholding
pass (|V| being the number of distinct tokens, at most n), which
should be fine unless you want to focus on optimising this specific
region of code.
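For completeness, here is a minimal self-contained sketch of the same idea, including the step of actually replacing the rare tokens in the stream with UNKFORM. The toy `data` is my own invention, and I use plain `String` here rather than `UTF8String`:

```julia
# Toy token stream; in practice this comes from the corpus.
data = ["the", "cat", "sat", "on", "the", "mat", "the"]
UNKFORM = "<UNK>"  # placeholder for out-of-vocabulary tokens
unkcutoff = 1      # tokens occurring this often or less are unknowns

# Count occurrences of each token.
tokoccs = Dict{String,Int}()
for tok in data
    tokoccs[tok] = get(tokoccs, tok, 0) + 1
end

# Keep only tokens above the cutoff; UNKFORM is always in the vocabulary.
vocab = Set{String}([UNKFORM])
for (tok, occs) in tokoccs
    occs > unkcutoff && push!(vocab, tok)
end

# Replace out-of-vocabulary tokens in the stream.
mapped = [tok in vocab ? tok : UNKFORM for tok in data]
# Only "the" occurs more than once, so everything else maps to "<UNK>".
```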

    Pontus
