Bah, that's not correct. I do end up keeping each unique ngram for a given (n-1)-gram in memory in the CollocCombiner and CollocReducer to do frequency counting. There's likely a more elegant solution to this.
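Roughly, the pattern looks like the sketch below (a minimal, simplified illustration of what I described, not the actual CollocCombiner code; SimpleGram is a hypothetical stand-in for Mahout's Gram): all distinct ngrams for one key get buffered in a map so their frequencies can be summed before emitting, so a key with a huge number of distinct ngrams will exhaust the heap.

import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

public class CombinerSketch {

  // Simplified, hypothetical stand-in for Mahout's Gram: an ngram plus a frequency.
  static final class SimpleGram {
    final String ngram;
    int frequency;

    SimpleGram(String ngram, int frequency) {
      this.ngram = ngram;
      this.frequency = frequency;
    }
  }

  // Frequency counting as described above: every unique ngram for the given
  // (n-1)-gram key is retained in 'counts' until the iterator is drained.
  static Iterable<SimpleGram> combine(Iterator<SimpleGram> values) {
    Map<String, SimpleGram> counts = new HashMap<String, SimpleGram>();
    while (values.hasNext()) {
      SimpleGram g = values.next();
      SimpleGram seen = counts.get(g.ngram);
      if (seen == null) {
        // Each newly allocated gram stays live for the whole call; with enough
        // distinct ngrams per key, these allocations are what blow the heap.
        counts.put(g.ngram, new SimpleGram(g.ngram, g.frequency));
      } else {
        seen.frequency += g.frequency;
      }
    }
    return counts.values();
  }
}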
On Sun, Feb 28, 2010 at 10:00 AM, Drew Farris <drew.far...@gmail.com> wrote:
> Argh, I'll look into it and see where Grams are kept in memory. There
> really shouldn't be any place where they're retained beyond what's
> needed for a single document. I doubt that there are documents in
> wikipedia that would blow the heap in this way, but I suppose it's
> possible. You're just doing bigrams, or did you end up going up to
> 5-grams?
>
> On Sun, Feb 28, 2010 at 7:50 AM, Robin Anil <robin.a...@gmail.com> wrote:
>> after 9 hours of compute, it failed. It never went past the colloc combiner
>> pass :(
>>
>> reason. I will have to tag drew along to identify the possible cause of this
>> out of memory error
>>
>> java.lang.OutOfMemoryError: Java heap space
>>     at org.apache.mahout.utils.nlp.collocations.llr.Gram.<init>(Gram.java:67)
>>     at org.apache.mahout.utils.nlp.collocations.llr.CollocCombiner.reduce(CollocCombiner.java:62)
>>     at org.apache.mahout.utils.nlp.collocations.llr.CollocCombiner.reduce(CollocCombiner.java:30)
>>     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.combineAndSpill(MapTask.java:921)
>>     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1077)
>>     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:719)
>>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:233)
>>     at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2216)