Currently Java strings use roughly double the space of the characters in
them because everything is stored as UTF-16. A 190MB dictionary file
therefore takes around 600MB when loaded into a HashMap<String, Integer>.
Is there some optimization we could do in how we store them, while
ensuring that Chinese, Devanagari and other characters don't get messed
up in the process?
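Just to illustrate the fidelity concern: encoding to UTF-8 and decoding
back is lossless for any valid String, so non-Latin scripts survive the
round trip. A quick check (the sample terms are only placeholders):

import java.nio.charset.StandardCharsets;

public class Utf8RoundTrip {
    public static void main(String[] args) {
        // Placeholder terms covering a few scripts.
        String[] terms = {"hello", "中文词典", "शब्दकोश"};
        for (String term : terms) {
            byte[] utf8 = term.getBytes(StandardCharsets.UTF_8);
            String back = new String(utf8, StandardCharsets.UTF_8);
            System.out.printf("%s: %d UTF-8 bytes vs %d UTF-16 chars, round-trip ok: %b%n",
                term, utf8.length, term.length(), term.equals(back));
        }
    }
}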

Some options Benson suggested were: storing just the byte[] form and
adding the option of supplying the hash function in OpenObjectIntHashMap,
or even using a UTF-8 string. A rough sketch of the byte[] idea is below.
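For illustration only, here is a minimal sketch of the byte[] approach
using a plain HashMap and a hypothetical Utf8Key wrapper (a real patch
would presumably plug into OpenObjectIntHashMap instead); this is not the
actual proposal, just what the idea could look like:

import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical wrapper so UTF-8 bytes can serve as a map key;
// a raw byte[] uses identity hashCode/equals and won't work directly.
final class Utf8Key {
    private final byte[] bytes;

    Utf8Key(String s) {
        this.bytes = s.getBytes(StandardCharsets.UTF_8);
    }

    @Override
    public int hashCode() {
        return Arrays.hashCode(bytes);
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof Utf8Key && Arrays.equals(bytes, ((Utf8Key) o).bytes);
    }

    @Override
    public String toString() {
        // Decoding UTF-8 gives back the original string, so Chinese,
        // Devanagari etc. are preserved.
        return new String(bytes, StandardCharsets.UTF_8);
    }
}

class Utf8DictionaryDemo {
    public static void main(String[] args) {
        Map<Utf8Key, Integer> dictionary = new HashMap<Utf8Key, Integer>();
        dictionary.put(new Utf8Key("中文"), 0);
        dictionary.put(new Utf8Key("शब्द"), 1);
        System.out.println(dictionary.get(new Utf8Key("中文"))); // 0
    }
}

One caveat: this roughly halves the character payload for ASCII-heavy
dictionaries, but CJK and Devanagari characters take 3 bytes in UTF-8
versus 2 in UTF-16, so the saving depends on the corpus.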

Or we could leave this alone. I currently estimate the memory requirement
for strings using the formula 8 * ((int) (num_chars * 2 + 45) / 8) when
generating the dictionary splits for the vectorizer; a small worked
example is below.
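As a concrete illustration of that estimate (the method name is mine, not
actual vectorizer code):

class StringMemoryEstimate {
    // Rough per-string heap estimate used above: 2 bytes per char for the
    // UTF-16 char[] plus ~45 bytes of String/char[] object overhead,
    // truncated to a multiple of 8 (the JVM's alignment unit).
    static long estimateBytes(int numChars) {
        return 8L * ((numChars * 2 + 45) / 8);
    }

    public static void main(String[] args) {
        System.out.println(estimateBytes(10)); // 64 bytes for a 10-char term
    }
}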

Robin
