Hi,

We're hitting OOMs during dictionary loading in the vectorization step, in 
TFPartialVectorReducer.  We have the dictionary chunk size set to 100 (the 
minimum), about 11M items in the dictionary (bigrams are on), and the heap 
size set to 12 GB.  We haven't debugged deeply yet, but the OOM routinely 
occurs in the rehash method:
2012-11-07 04:34:04,750 FATAL org.apache.hadoop.mapred.Child: Error running child : java.lang.OutOfMemoryError: Java heap space
        at org.apache.mahout.math.map.OpenObjectIntHashMap.rehash(OpenObjectIntHashMap.java:430)
        at org.apache.mahout.math.map.OpenObjectIntHashMap.put(OpenObjectIntHashMap.java:383)
        at org.apache.mahout.vectorizer.term.TFPartialVectorReducer.setup(TFPartialVectorReducer.java:131)
        at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:174)
        at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:649)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:417)
        at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:416)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
        at org.apache.hadoop.mapred.Child.main(Child.java:249)

I'm also guessing (I haven't turned on GC logs yet) that the GC simply can't 
keep up with the allocations, but perhaps there is also a bug somewhere in the 
dictionary code: the dictionary could be corrupt and its size misread.  I can 
share the dictionary privately if anyone wants to look at it, but I can't 
share it publicly.
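
In case it helps, here's the kind of sanity check I have in mind for the 
"misreading the size" theory: count the entries and the max term id in a 
dictionary chunk directly.  This is just a sketch and assumes the chunk is a 
plain SequenceFile of Text/IntWritable pairs (my understanding of what 
seq2sparse writes); the class name is made up:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class DictionaryChunkCheck {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path chunk = new Path(args[0]);                    // e.g. .../dictionary.file-0
    FileSystem fs = chunk.getFileSystem(conf);
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, chunk, conf);
    try {
      Text term = new Text();
      IntWritable id = new IntWritable();
      long count = 0;
      int maxId = -1;
      // Walk the chunk and report how many terms it really holds.
      while (reader.next(term, id)) {
        count++;
        maxId = Math.max(maxId, id.get());
      }
      System.out.println("entries=" + count + " maxId=" + maxId);
    } finally {
      reader.close();
    }
  }
}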

Has anyone else seen this?  Is my understanding correct?

I can see a couple of remedies:
1. Pass in an initial capacity and see if we can better control the size of 
the allocation (rough sketch below).
2. Switch to Lucene's FST for dictionaries.  The tradeoff would be a much 
smaller dictionary (10 GB of Wikipedia in Lucene yields roughly a 250K 
dictionary) and very little deserialization (the dictionary is all byte 
arrays), at the cost of slower lookups in a given mapper.  That latter cost 
would likely be more than made up for by the fact that in most situations one 
would only need a single dictionary chunk, thereby eliminating several 
MapReduce iterations (sketch below as well).  The other downside/upside is 
that we would need to go to Lucene 4.
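
On #1, the sketch I have in mind is just constructing the map with an up-front 
capacity instead of letting it grow from the tiny default.  The 
initial-capacity constructor is an assumption on my part (the usual Colt-style 
signature); I haven't checked what OpenObjectIntHashMap in trunk actually 
exposes, and the class below is made up for illustration:

import org.apache.mahout.math.map.OpenObjectIntHashMap;

public class PreSizedDictionarySketch {

  // With the default constructor the table starts small and every growth step
  // rehashes into a new, larger table while the old one is still reachable,
  // which is exactly where the OOM above is thrown.  Pre-sizing skips that
  // chain of doublings.  (Depending on the map's load factors it may still
  // rehash once near the end; the capacity is a cushion, not a guarantee.)
  public static OpenObjectIntHashMap<String> newDictionary(int expectedEntries) {
    return new OpenObjectIntHashMap<String>(expectedEntries);
  }

  public static void main(String[] args) {
    OpenObjectIntHashMap<String> dict = newDictionary(11000000);
    dict.put("apache mahout", 1);   // bigram term -> id
    System.out.println(dict.get("apache mahout"));
  }
}

If that constructor isn't there, adding one (or plumbing the chunk's entry 
count through from the driver) would be the actual patch.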
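
On #2, the FST idea is roughly the following (Lucene 4.x; the exact 
PositiveIntOutputs.getSingleton signature differs across 4.x point releases, 
and term ids start at 1 here because 0 is the outputs' reserved no-output 
value).  Again, just a sketch, not a patch:

import java.io.IOException;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.IntsRef;
import org.apache.lucene.util.fst.Builder;
import org.apache.lucene.util.fst.FST;
import org.apache.lucene.util.fst.PositiveIntOutputs;
import org.apache.lucene.util.fst.Util;

public class FstDictionarySketch {

  // Terms must be added in sorted order; a TreeMap on String is fine for ASCII
  // terms, but general Unicode would need UTF-8 byte order instead.
  public static FST<Long> build(SortedMap<String, Integer> dictionary) throws IOException {
    PositiveIntOutputs outputs = PositiveIntOutputs.getSingleton(true); // no-arg in later 4.x
    Builder<Long> builder = new Builder<Long>(FST.INPUT_TYPE.BYTE1, outputs);
    IntsRef scratch = new IntsRef();
    for (Map.Entry<String, Integer> e : dictionary.entrySet()) {
      builder.add(Util.toIntsRef(new BytesRef(e.getKey()), scratch), e.getValue().longValue());
    }
    return builder.finish();
  }

  // Returns null if the term isn't in the dictionary.
  public static Integer lookup(FST<Long> fst, String term) throws IOException {
    Long id = Util.get(fst, new BytesRef(term));
    return id == null ? null : Integer.valueOf(id.intValue());
  }

  public static void main(String[] args) throws IOException {
    SortedMap<String, Integer> dict = new TreeMap<String, Integer>();
    dict.put("apache", 1);
    dict.put("apache mahout", 2);   // bigram
    dict.put("mahout", 3);
    FST<Long> fst = build(dict);
    System.out.println(lookup(fst, "apache mahout"));   // prints 2
  }
}

The whole structure is a handful of byte arrays under the hood, which is where 
the small footprint and the cheap (de)serialization come from.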

Thoughts?

-Grant
