Well, there is considerable redundancy in the list of words that could allow massive compression. This is roughly what Lucene is doing. Storing each string incurs substantial overhead that dwarfs even the original size of the strings (overhead is about 50 bytes per string, while the average word length in running English text is 6 characters; a dictionary entry should average somewhat longer). Since we are using UTF-16 internally to store what are essentially ASCII strings for Wikipedia, we have more than 10:1 overhead here.
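For concreteness, the back-of-envelope arithmetic behind that "more than 10:1" figure can be checked in a few lines. The constants here (about 50 bytes of per-object overhead, 6-character average word) are the ones quoted above, not measured values:

```java
// Sketch of the overhead arithmetic from the paragraph above.
// The constants are the figures quoted in the mail, not measurements.
public class DictOverhead {
    public static void main(String[] args) {
        int avgWordChars = 6;           // average word length in running English text
        int perStringOverhead = 50;     // rough per-String object/bookkeeping overhead, bytes
        int utf16Bytes = avgWordChars * 2; // Java stores chars as UTF-16 code units
        int asciiBytes = avgWordChars;     // the underlying content is essentially ASCII

        double ratio = (double) (perStringOverhead + utf16Bytes) / asciiBytes;
        System.out.println(ratio); // comes out above 10, hence "more than 10:1"
    }
}
```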
The classic compression for string collections is to sort and use prefix compression. Accelerated ordered-list searching works in log log N time to search such a list, or a separate array of offsets can allow direct addressing of strings. One particular trick to help with such an arrangement is to encode, with each string, an offset back to where the full prefix can be found.

The Lucene compression is a bit stronger than simple prefix elimination in that it constructs a DFA that recognizes words from the dictionary. This gives you both compression and the function of the hash table rolled into one system. By the time you have walked through the bytes of a word, you have also looked it up, without ever needing to construct the string containing the word.

My suggestion is that we have a show of hands to see who cares (and how much) about keeping Lucene 3 compatibility. If no, or very few, hands go up, moving to a modern representation is probably a good idea.

On Wed, Nov 7, 2012 at 7:58 AM, Sean Owen <[email protected]> wrote:

> This just means "out of memory": the dictionary is too big. It's nothing in
> particular to do with the number, size, or rate of objects allocated. I
> don't know if a different implementation is going to be appreciably
> different in terms of size -- these are already primitive-based specialized
> implementations.
>
> My dumb question is, is this not just a consequence of the implementation
> trying to store an unbounded number (well, up to trillions) of entries in a
> map in memory?
>
> Beyond this I don't know anything about the implementation.
>
>
> On Wed, Nov 7, 2012 at 3:17 PM, Grant Ingersoll <[email protected]>
> wrote:
>
> > Hi,
> >
> > We're hitting OOMs while running vectorization during dictionary loading
> > in TFPartialVectorReducer. We have the dictionary chunk size set to
> > 100 (the minimum) and have about 11M items in the dictionary (bigrams
> > are on), and our heap size is set to 12 GB.
> > We haven't debugged deeply yet, but the OOM routinely occurs in the
> > rehash method:
> >
> > 2012-11-07 04:34:04,750 FATAL org.apache.hadoop.mapred.Child: Error
> > running child : java.lang.OutOfMemoryError: Java heap space
> >     at org.apache.mahout.math.map.OpenObjectIntHashMap.rehash(OpenObjectIntHashMap.java:430)
> >     at org.apache.mahout.math.map.OpenObjectIntHashMap.put(OpenObjectIntHashMap.java:383)
> >     at org.apache.mahout.vectorizer.term.TFPartialVectorReducer.setup(TFPartialVectorReducer.java:131)
> >     at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:174)
> >     at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:649)
> >     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:417)
> >     at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
> >     at java.security.AccessController.doPrivileged(Native Method)
> >     at javax.security.auth.Subject.doAs(Subject.java:416)
> >     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
> >     at org.apache.hadoop.mapred.Child.main(Child.java:249)
> >
> > I'm also guessing (haven't turned on GC logs yet) that the GC simply
> > can't keep up with the allocations, but perhaps there is also a bug
> > somewhere in the dictionary code and the dictionary is corrupt and it's
> > misreading the size. I can share the dictionary privately if anyone
> > wants to look at it, but I can't share it publicly.
> >
> > Has anyone else seen this? Is my understanding correct?
> >
> > I can see a couple of remedies:
> > 1. Pass in an initial capacity and see if we can better control the size
> > of the allocation
> > 2. Switch to Lucene's FST for dictionaries: the tradeoff would be a much
> > smaller dictionary (10 GB of Wikipedia in Lucene is roughly a 250K
> > dictionary size) and very little deserialization (the dictionary is all
> > byte arrays) at the cost of lookups in a given mapper.
> > However, that latter cost would likely be more than made up for by the
> > fact that in most situations, one would only need 1 dictionary chunk,
> > thereby eliminating several MapReduce iterations. The other
> > downside/upside is that we would need to go to Lucene 4.
> >
> > Thoughts?
> >
> > -Grant
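For anyone wanting to play with the idea, the sort-and-prefix-compress scheme described at the top of the thread (front coding) can be sketched in a few lines of Java. This is only an illustration of the technique, not Mahout or Lucene code; the class and encoding format (prefix length, '|', suffix) are made up for the example:

```java
import java.util.ArrayList;
import java.util.List;

// Front coding: a sorted word list is stored as (shared-prefix length, suffix)
// pairs, so each entry only carries the bytes that differ from its predecessor.
public class FrontCoding {

    // Encode a lexicographically sorted list of words.
    static List<String> encode(List<String> sortedWords) {
        List<String> encoded = new ArrayList<>();
        String prev = "";
        for (String w : sortedWords) {
            int p = commonPrefix(prev, w);
            encoded.add(p + "|" + w.substring(p)); // prefix length + unshared suffix
            prev = w;
        }
        return encoded;
    }

    // Length of the common prefix of a and b.
    static int commonPrefix(String a, String b) {
        int n = Math.min(a.length(), b.length());
        int i = 0;
        while (i < n && a.charAt(i) == b.charAt(i)) i++;
        return i;
    }

    // Decode back to the full word list by re-expanding each prefix.
    static List<String> decode(List<String> encoded) {
        List<String> words = new ArrayList<>();
        String prev = "";
        for (String e : encoded) {
            int bar = e.indexOf('|');
            int p = Integer.parseInt(e.substring(0, bar));
            String w = prev.substring(0, p) + e.substring(bar + 1);
            words.add(w);
            prev = w;
        }
        return words;
    }

    public static void main(String[] args) {
        List<String> sorted = List.of("compress", "compression", "compute");
        System.out.println(encode(sorted)); // shared prefixes collapse to lengths
        System.out.println(decode(encode(sorted)));
    }
}
```

Lucene's FST representation goes further than this, sharing suffixes as well as prefixes and folding the lookup into the traversal, but the sketch shows where most of the redundancy in a sorted dictionary goes.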
