I'm speaking only off the top of my head, but my hunch is that it's not worth optimizing this. Yes, the alternative is to store the string's UTF-8 encoding as a byte[]. That will incur overhead in translating back and forth to String wherever one is needed, and my guess is that this cost is big enough to make the change not worthwhile.
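For concreteness, a minimal sketch of what that byte[] approach might look like (hypothetical code, not anything in Mahout; note that a raw byte[] key only has identity equals/hashCode, so a wrapper like this would be needed to use it as a map key at all):

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Sketch: hold a dictionary term as UTF-8 bytes (roughly half the size
// of Java's internal UTF-16 char[] for ASCII-heavy text) and decode
// back to String on demand. The decode in asString() is exactly the
// translation overhead in question.
final class Utf8Term {
    private final byte[] utf8;

    Utf8Term(String term) {
        this.utf8 = term.getBytes(StandardCharsets.UTF_8);
    }

    String asString() {
        return new String(utf8, StandardCharsets.UTF_8);
    }

    // Content-based equality and hashing so instances behave as
    // HashMap keys; this is the part a custom hash function in the
    // map implementation could otherwise supply.
    @Override
    public boolean equals(Object o) {
        return o instanceof Utf8Term
            && Arrays.equals(utf8, ((Utf8Term) o).utf8);
    }

    @Override
    public int hashCode() {
        return Arrays.hashCode(utf8);
    }
}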
The only other idea I have is a trie, which is typically a great data structure for dictionaries like this (a rough sketch follows the quoted message below).

Sean

On Sat, Jan 16, 2010 at 2:10 PM, Robin Anil <robin.a...@gmail.com> wrote:
> Currently, Java strings use double the space of the characters in them
> because it's all UTF-16. A 190MB dictionary file therefore uses around
> 600MB when loaded into a HashMap<String, Integer>. Is there some
> optimization we could do in terms of storing them, while ensuring that
> Chinese, Devanagari and other characters don't get messed up in the
> process?
>
> Some options Benson suggested were: storing just the byte[] form and
> adding the option of supplying the hash function in
> OpenObjectIntHashMap, or even using a UTF-8 string.
>
> Or we could leave this alone. I currently estimate the memory
> requirement for strings using the formula
> 8 * ((int) (num_chars * 2 + 45) / 8)
> when generating the dictionary split for the vectorizer.
>
> Robin
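And here is a minimal sketch of the trie idea mentioned above (again hypothetical, not Mahout code). A HashMap per node is the easiest version to read, but it is heavy; a real implementation would want a compact child representation, such as sorted char arrays or a double-array trie, for the shared-prefix storage to actually translate into memory savings:

import java.util.HashMap;
import java.util.Map;

// Minimal trie mapping dictionary terms to int ids. Shared prefixes
// are stored once, which is where any saving over a
// HashMap<String, Integer> would have to come from.
final class TrieDictionary {
    private static final int NO_ID = -1;

    private static final class Node {
        final Map<Character, Node> children =
            new HashMap<Character, Node>();
        int id = NO_ID;
    }

    private final Node root = new Node();

    void put(String term, int id) {
        Node node = root;
        for (int i = 0; i < term.length(); i++) {
            char c = term.charAt(i);
            Node next = node.children.get(c);
            if (next == null) {
                next = new Node();
                node.children.put(c, next);
            }
            node = next;
        }
        node.id = id;
    }

    // Returns NO_ID if the term is absent. Walks one node per
    // character, so lookup cost is O(term length) regardless of
    // dictionary size.
    int get(String term) {
        Node node = root;
        for (int i = 0; i < term.length(); i++) {
            node = node.children.get(term.charAt(i));
            if (node == null) {
                return NO_ID;
            }
        }
        return node.id;
    }
}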