Hi,
We're hitting OOMs during dictionary loading while running vectorization in
TFPartialVectorReducer. We have the dictionary chunk size set to 100 (the
minimum), about 11M items in the dictionary (bigrams are on), and the heap
size set to 12 GB. We haven't debugged deeply yet, but the OOM routinely
occurs in the rehash method:
2012-11-07 04:34:04,750 FATAL org.apache.hadoop.mapred.Child: Error running child : java.lang.OutOfMemoryError: Java heap space
    at org.apache.mahout.math.map.OpenObjectIntHashMap.rehash(OpenObjectIntHashMap.java:430)
    at org.apache.mahout.math.map.OpenObjectIntHashMap.put(OpenObjectIntHashMap.java:383)
    at org.apache.mahout.vectorizer.term.TFPartialVectorReducer.setup(TFPartialVectorReducer.java:131)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:174)
    at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:649)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:417)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:416)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
    at org.apache.hadoop.mapred.Child.main(Child.java:249)
I'm also guessing (I haven't turned on GC logs yet) that the GC simply can't
keep up with the allocations, but perhaps there is also a bug somewhere in
the dictionary code: if the dictionary were corrupt, the reducer could be
misreading its size. I can share the dictionary privately if anyone wants to
look at it, but I can't share it publicly.
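(When I do turn on GC logging, it'll be the usual JVM flags via the child
task options, e.g.

  mapred.child.java.opts = -Xmx12g -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps

using the Hadoop 1.x property name.)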
Has anyone else seen this? Is my understanding correct?
I can see a couple of remedies:
1. Pass an initial capacity into the map so we can better control the size of
the allocations (first sketch below).
2. Switch to Lucene's FST for dictionaries (second sketch below). The tradeoff
would be a much smaller dictionary (10 GB of Wikipedia in Lucene yields
roughly a 250K dictionary) and very little deserialization (the dictionary is
all byte arrays), at the cost of slower lookups in a given mapper. That
latter cost, however, would likely be more than made up for by the fact that
in most situations one would only need a single dictionary chunk, thereby
eliminating several MapReduce iterations. The other downside/upside is that
we would need to move to Lucene 4.
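For (1), OpenObjectIntHashMap already has a constructor that takes an initial
capacity, so the fix may be as small as sizing the map up front in
TFPartialVectorReducer.setup(). A rough, untested sketch; the expectedSize
plumbing is hypothetical (we'd have to record the item count when the chunks
are written, or pass it through the job Configuration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Writable;
import org.apache.mahout.common.Pair;
import org.apache.mahout.common.iterator.sequencefile.SequenceFileIterable;
import org.apache.mahout.math.map.OpenObjectIntHashMap;

public final class PresizedDictionaryLoader {

  private PresizedDictionaryLoader() {
  }

  // Load a dictionary chunk into a map sized for all entries up front, so
  // put() never has to grow the table and copy millions of entries through
  // rehash(). expectedSize is the hypothetical plumbing mentioned above.
  public static OpenObjectIntHashMap<String> load(Path dictionaryFile,
                                                  Configuration conf,
                                                  int expectedSize) {
    OpenObjectIntHashMap<String> dictionary =
        new OpenObjectIntHashMap<String>(expectedSize);
    for (Pair<Writable, IntWritable> record
        : new SequenceFileIterable<Writable, IntWritable>(dictionaryFile, true, conf)) {
      dictionary.put(record.getFirst().toString(), record.getSecond().get());
    }
    return dictionary;
  }
}

Note that even pre-sized, the map still holds 11M String keys on the heap, so
this only removes the rehash spike (old and new tables live at once), not the
baseline footprint.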
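For (2), here's roughly what the build/lookup side could look like. This is a
rough, untested sketch against the Lucene 4.0 FST API (Builder,
PositiveIntOutputs, Util); it assumes terms arrive in ascending byte order,
which the Builder requires, and that dictionary ids are non-negative:

import java.io.IOException;
import java.util.Map;
import java.util.SortedMap;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.IntsRef;
import org.apache.lucene.util.fst.Builder;
import org.apache.lucene.util.fst.FST;
import org.apache.lucene.util.fst.PositiveIntOutputs;
import org.apache.lucene.util.fst.Util;

public final class FstDictionary {

  private FstDictionary() {
  }

  // Build an FST mapping term -> dictionary id. The SortedMap supplies the
  // ascending insertion order the Builder requires. (Caveat: String natural
  // order only matches UTF-8 byte order for BMP-only text; in general you'd
  // sort the BytesRefs themselves.)
  public static FST<Long> build(SortedMap<String, Integer> termToId) throws IOException {
    PositiveIntOutputs outputs = PositiveIntOutputs.getSingleton(true); // 4.0 signature
    Builder<Long> builder = new Builder<Long>(FST.INPUT_TYPE.BYTE1, outputs);
    IntsRef scratch = new IntsRef();
    for (Map.Entry<String, Integer> e : termToId.entrySet()) {
      builder.add(Util.toIntsRef(new BytesRef(e.getKey()), scratch),
                  (long) e.getValue().intValue());
    }
    return builder.finish();
  }

  // Returns the id, or null if the term isn't in the dictionary.
  public static Long lookup(FST<Long> fst, String term) throws IOException {
    return Util.get(fst, new BytesRef(term));
  }
}

The finished FST is a single immutable byte array, so a reducer could load it
once in setup() and share it across all its keys, with none of the per-entry
object overhead of the hash map.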
Thoughts?
-Grant