Currently Java strings use double the space of the characters they contain, because they are stored internally as UTF-16. A 190MB dictionary file therefore takes roughly 600MB when loaded into a HashMap<String, Integer>. Is there some optimization we could do in how we store them, while ensuring that Chinese, Devanagari and other characters don't get mangled in the process?
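As a rough illustration of the size difference (a minimal, self-contained sketch; the class name and sample words are made up for this example), comparing the UTF-16 payload of a String against its UTF-8 byte[] payload shows the halving for ASCII terms, while Chinese and Devanagari terms stay about the same size either way:

    import java.nio.charset.StandardCharsets;

    public class StringFootprint {
      public static void main(String[] args) {
        // "vector" is ASCII; the others are Chinese and Devanagari terms
        String[] samples = {"vector", "向量", "सदिश"};
        for (String s : samples) {
          int utf16Payload = s.length() * 2;                            // bytes in the backing char[]
          int utf8Payload = s.getBytes(StandardCharsets.UTF_8).length;  // bytes if stored as a UTF-8 byte[]
          System.out.printf("%s: UTF-16 payload = %d bytes, UTF-8 payload = %d bytes%n",
              s, utf16Payload, utf8Payload);
        }
      }
    }

This only counts the character payload, not the per-object and per-array headers, which add a further constant overhead per entry.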
Some options Benson suggested were: storing just the byte[] form and adding the option of supplying the hash function to OpenObjectIntHashMap, or even using a UTF-8 string wrapper. Or we could leave this alone. When generating the dictionary splits for the vectorizer, I currently estimate the memory requirement of a String as 8 * ((num_chars * 2 + 45) / 8) bytes (integer division, i.e. 2 bytes per char plus object overhead, rounded to an 8-byte boundary). A rough sketch of what a byte[]-backed key could look like follows below.

Robin
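For illustration only, here is a minimal sketch of the byte[] / UTF-8 option: a key class that stores the term as UTF-8 bytes and defines value-based equals/hashCode so it can be used in a HashMap or an OpenObjectIntHashMap. The class name Utf8Term and the use of Arrays.hashCode in place of a pluggable hash function are assumptions for this sketch, not anything that exists in the code today:

    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;

    final class Utf8Term {
      private final byte[] bytes;

      Utf8Term(String term) {
        // UTF-8 encoding is lossless for Chinese, Devanagari and all other Unicode text
        this.bytes = term.getBytes(StandardCharsets.UTF_8);
      }

      @Override
      public int hashCode() {
        return Arrays.hashCode(bytes);   // stands in for a user-supplied hash function
      }

      @Override
      public boolean equals(Object o) {
        return o instanceof Utf8Term && Arrays.equals(bytes, ((Utf8Term) o).bytes);
      }

      @Override
      public String toString() {
        // decode back to a String only when the term text is actually needed
        return new String(bytes, StandardCharsets.UTF_8);
      }
    }

With keys like this, the character payload for mostly-ASCII dictionaries drops by roughly half, at the cost of an encode/decode step and the usual per-object and per-array overhead for each key.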