I agree the overhead of converting byte[] to and from UTF-8 Strings
probably isn't good for lookup performance.
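To make the overhead concrete, here's a minimal sketch (names are mine,
not from any existing class) of what the byte[]-key approach entails:
byte[] has no value-based equals/hashCode, so keys must be wrapped, and
every String lookup pays a UTF-8 encoding step.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class Utf8KeyDemo {
  // Wrapper giving byte[] value semantics so it can be a map key.
  static final class Utf8Key {
    private final byte[] bytes;
    private final int hash;
    Utf8Key(String s) {
      this.bytes = s.getBytes(StandardCharsets.UTF_8); // encode once per use
      this.hash = Arrays.hashCode(bytes);
    }
    @Override public boolean equals(Object o) {
      return o instanceof Utf8Key && Arrays.equals(bytes, ((Utf8Key) o).bytes);
    }
    @Override public int hashCode() { return hash; }
  }

  public static void main(String[] args) {
    Map<Utf8Key, Integer> dict = new HashMap<>();
    dict.put(new Utf8Key("term"), 42);           // encodes on insert
    Integer id = dict.get(new Utf8Key("term"));  // encodes again on lookup
    System.out.println(id);
  }
}
```

The UTF-8 bytes roughly halve the storage for ASCII-heavy dictionaries,
but each get() re-encodes the probe string, which is the cost in question.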

In line with Sean's suggestion, I've used tries in the past for doing
this sort of string -> integer mapping. They generally perform well
enough for adding entries as well as retrieval. Not nearly as
efficient as a hash, but there is usually enough of a memory savings
to make it worth it. They have the added benefit of making it easy to
do prefix searches, although that isn't a strict requirement here.
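For illustration, a naive trie for this kind of string -> integer mapping
might look like the sketch below (my own toy version; a HashMap per node
is simple but not compact, so a real implementation would use a denser
node layout to get the memory savings mentioned above):

```java
import java.util.HashMap;
import java.util.Map;

public class TrieDemo {
  static final class Trie {
    private final Map<Character, Trie> children = new HashMap<>();
    private int value = -1; // -1 means "no entry terminates at this node"

    // Walk/create one node per character, store the id at the last node.
    void put(String key, int v) {
      Trie node = this;
      for (int i = 0; i < key.length(); i++) {
        node = node.children.computeIfAbsent(key.charAt(i), c -> new Trie());
      }
      node.value = v;
    }

    // Walk the path; return -1 if the key falls off the trie.
    int get(String key) {
      Trie node = this;
      for (int i = 0; i < key.length(); i++) {
        node = node.children.get(key.charAt(i));
        if (node == null) return -1;
      }
      return node.value;
    }
  }

  public static void main(String[] args) {
    Trie t = new Trie();
    t.put("cat", 1);
    t.put("car", 2);
    System.out.println(t.get("cat")); // 1
    System.out.println(t.get("dog")); // -1
  }
}
```

Shared prefixes ("cat"/"car" share the "ca" path) are stored once, which
is where the memory savings over a HashMap<String, Integer> comes from,
and prefix search is just a walk to the prefix node.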

As Oliver suggests, a bloom filter may be an option, but wouldn't a
secondary data structure be required to hold the actual values? Would
false positives really be an issue with a dictionary scale problem?

I presume there's also a need for a compact integer -> string
representation, which can be achieved with string difference
compression: seek to the nearest block boundary below the id (id modulo
some block size), then build up the final string by scanning forward
through the list of incremental changes. IIRC, Lucene does something
like this in its term dictionary.
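A toy version of that scheme (front coding with block anchors; all
names and the block size are mine, not Lucene's): every BLOCK-th string
is stored in full, the rest as (shared-prefix length, suffix), and
get(id) replays the deltas from the preceding anchor. It assumes
strings are added in sorted order so adjacent entries share prefixes.

```java
import java.util.ArrayList;
import java.util.List;

public class FrontCoding {
  static final int BLOCK = 4;
  // Parallel lists: prefixLen.get(i) chars of the previous string,
  // followed by suffix.get(i), reconstruct entry i.
  private final List<Integer> prefixLen = new ArrayList<>();
  private final List<String> suffix = new ArrayList<>();
  private String last = "";

  void add(String s) {
    // Anchors (every BLOCK-th entry) share nothing, so they decode alone.
    int shared = (suffix.size() % BLOCK == 0) ? 0 : commonPrefix(last, s);
    prefixLen.add(shared);
    suffix.add(s.substring(shared));
    last = s;
  }

  String get(int id) {
    int start = id - (id % BLOCK); // seek to the block's anchor
    String s = suffix.get(start);  // anchor is stored in full
    for (int i = start + 1; i <= id; i++) {
      s = s.substring(0, prefixLen.get(i)) + suffix.get(i); // replay delta
    }
    return s;
  }

  private static int commonPrefix(String a, String b) {
    int n = Math.min(a.length(), b.length());
    int i = 0;
    while (i < n && a.charAt(i) == b.charAt(i)) i++;
    return i;
  }

  public static void main(String[] args) {
    FrontCoding fc = new FrontCoding();
    for (String s : new String[] {"app", "apple", "apples", "apply", "banana"}) {
      fc.add(s);
    }
    System.out.println(fc.get(2)); // apples
    System.out.println(fc.get(4)); // banana (a fresh block anchor)
  }
}
```

Lookup cost is bounded by the block size, so BLOCK trades decode speed
against how often a full string must be stored.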

Drew

On Sat, Jan 16, 2010 at 9:15 AM, Sean Owen <sro...@gmail.com> wrote:
> I'm speaking only off the top of my head, but my hunch is that it's not worth
> optimizing this. Yes, the alternative is to store the string's UTF-8
> encoding as a byte[]. That's going to incur overhead in translating
> back and forth to String where needed, and my guess is that's going to
> be big enough to make this not worthwhile.
>
> The only other idea I have is a trie, which is typically a great data
> structure for dictionaries like this.
>
> Sean
>
>
> On Sat, Jan 16, 2010 at 2:10 PM, Robin Anil <robin.a...@gmail.com> wrote:
>> Currently Java strings use double the space of the characters in them because
>> it's all in UTF-16. A 190MB dictionary file therefore uses around 600MB when
>> loaded into a HashMap<String, Integer>.  Is there some optimization we could
>> do in terms of storing them while ensuring that Chinese, Devanagari and other
>> characters don't get messed up in the process?
>>
>> Some options Benson suggested were: storing just the byte[] form and adding
>> the option of supplying the hash function in OpenObjectIntHashMap, or
>> even using a UTF-8 string.
>>
>> Or we could leave this alone. I currently estimate the memory requirement
>> using the formula 8 * ((int) (num_chars * 2 + 45) / 8) for strings when
>> generating the dictionary split for the vectorizer.
>>
>> Robin
>>
>
