I'm speaking only off the top of my head, but my hunch is that it's not worth optimizing this. Yes, the alternative is to store the string's UTF-8 encoding as a byte[]. That will incur overhead in translating back and forth to String wherever one is needed, and my guess is that this cost is big enough to make the change not worthwhile.
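For concreteness, a minimal sketch of what that byte[] approach might look like (hypothetical code, not anything in Mahout; note that a raw byte[] key only has identity equals/hashCode, so a wrapper like this would be needed to use it as a map key at all):

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Sketch: hold a dictionary term as UTF-8 bytes (roughly half the size
// of Java's internal UTF-16 char[] for ASCII-heavy text) and decode
// back to String on demand. The decode in asString() is exactly the
// translation overhead in question.
final class Utf8Term {
    private final byte[] utf8;

    Utf8Term(String term) {
        this.utf8 = term.getBytes(StandardCharsets.UTF_8);
    }

    String asString() {
        return new String(utf8, StandardCharsets.UTF_8);
    }

    // Content-based equality and hashing so instances behave as
    // HashMap keys; this is the part a custom hash function in the
    // map implementation could otherwise supply.
    @Override
    public boolean equals(Object o) {
        return o instanceof Utf8Term
            && Arrays.equals(utf8, ((Utf8Term) o).utf8);
    }

    @Override
    public int hashCode() {
        return Arrays.hashCode(utf8);
    }
}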
The only other idea I have is a trie, which is typically a great data structure for dictionaries like this (a rough sketch follows the quoted message below).

Sean

On Sat, Jan 16, 2010 at 2:10 PM, Robin Anil <robin.a...@gmail.com> wrote:
> Currently, Java strings use double the space of the characters in them
> because it's all UTF-16. A 190MB dictionary file therefore uses around
> 600MB when loaded into a HashMap<String, Integer>. Is there some
> optimization we could do in terms of storing them, while ensuring that
> Chinese, Devanagari and other characters don't get messed up in the
> process?
>
> Some options Benson suggested were: storing just the byte[] form and
> adding the option of supplying the hash function in
> OpenObjectIntHashMap, or even using a UTF-8 string.
>
> Or we could leave this alone. I currently estimate the memory
> requirement for strings using the formula
> 8 * ((int) (num_chars * 2 + 45) / 8)
> when generating the dictionary split for the vectorizer.
>
> Robin
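And here is a minimal sketch of the trie idea mentioned above (again hypothetical, not Mahout code). A HashMap per node is the easiest version to read, but it is heavy; a real implementation would want a compact child representation, such as sorted char arrays or a double-array trie, for the shared-prefix storage to actually translate into memory savings:

import java.util.HashMap;
import java.util.Map;

// Minimal trie mapping dictionary terms to int ids. Shared prefixes
// are stored once, which is where any saving over a
// HashMap<String, Integer> would have to come from.
final class TrieDictionary {
    private static final int NO_ID = -1;

    private static final class Node {
        final Map<Character, Node> children =
            new HashMap<Character, Node>();
        int id = NO_ID;
    }

    private final Node root = new Node();

    void put(String term, int id) {
        Node node = root;
        for (int i = 0; i < term.length(); i++) {
            char c = term.charAt(i);
            Node next = node.children.get(c);
            if (next == null) {
                next = new Node();
                node.children.put(c, next);
            }
            node = next;
        }
        node.id = id;
    }

    // Returns NO_ID if the term is absent. Walks one node per
    // character, so lookup cost is O(term length) regardless of
    // dictionary size.
    int get(String term) {
        Node node = root;
        for (int i = 0; i < term.length(); i++) {
            node = node.children.get(term.charAt(i));
            if (node == null) {
                return NO_ID;
            }
        }
        return node.id;
    }
}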