So one option would be to do the frequency counts in another pass, but I don't really like that idea. I think a compound key / secondary sort would work, so that the ngrams don't have to be tracked in a set.
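Roughly what I have in mind, as a sketch only: GramKey and its GroupComparator below are hypothetical names, not existing Mahout classes, and the wiring to Gram itself is hand-waved. The composite key sorts by the n-1gram head first and the full ngram second, while the grouping comparator groups reduce input on the head alone, so identical ngrams arrive adjacent within a single reduce() call and can be counted as they stream by:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

/**
 * Hypothetical composite key: sorts by the n-1gram head,
 * then by the full ngram.
 */
public class GramKey implements WritableComparable<GramKey> {

  private final Text head = new Text();   // the n-1gram
  private final Text ngram = new Text();  // the full ngram

  public void set(String h, String n) {
    head.set(h);
    ngram.set(n);
  }

  @Override
  public void write(DataOutput out) throws IOException {
    head.write(out);
    ngram.write(out);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    head.readFields(in);
    ngram.readFields(in);
  }

  // Full ordering: head first, then ngram, so identical ngrams
  // for the same head end up adjacent in the reduce input.
  @Override
  public int compareTo(GramKey o) {
    int c = head.compareTo(o.head);
    return c != 0 ? c : ngram.compareTo(o.ngram);
  }

  // Hash on the head alone so the default HashPartitioner sends
  // every ngram for a given n-1gram to the same reducer.
  @Override
  public int hashCode() {
    return head.hashCode();
  }

  @Override
  public boolean equals(Object o) {
    return o instanceof GramKey
        && head.equals(((GramKey) o).head)
        && ngram.equals(((GramKey) o).ngram);
  }

  /** Groups by head only: one reduce() call sees every ngram for a head. */
  public static class GroupComparator extends WritableComparator {
    public GroupComparator() {
      super(GramKey.class, true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
      return ((GramKey) a).head.compareTo(((GramKey) b).head);
    }
  }
}

The grouping comparator would get wired in with JobConf.setOutputValueGroupingComparator(GramKey.GroupComparator.class). Since hashCode() only looks at the head, all ngrams for a given n-1gram land on one reducer, and frequency counting becomes a streaming comparison of the current ngram against the previous one: constant memory per head, no set of unique Grams.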
I will give it a try later today.

On Sunday, February 28, 2010, Drew Farris <drew.far...@gmail.com> wrote:
> Bah, that's not correct. I do end up keeping each unique ngram for a
> given n-1gram in memory in the CollocCombiner and CollocReducer to do
> frequency counting. There's likely a more elegant solution to this.
>
> On Sun, Feb 28, 2010 at 10:00 AM, Drew Farris <drew.far...@gmail.com> wrote:
>> Argh, I'll look into it and see where Grams are kept in memory. There
>> really shouldn't be any place where they're retained beyond what's
>> needed for a single document. I doubt that there are documents in
>> Wikipedia that would blow the heap in this way, but I suppose it's
>> possible. You're just doing bigrams, or did you end up going up to
>> 5-grams?
>>
>> On Sun, Feb 28, 2010 at 7:50 AM, Robin Anil <robin.a...@gmail.com> wrote:
>>> After 9 hours of compute, it failed. It never got past the colloc
>>> combiner pass :(
>>>
>>> Not sure of the reason. I will have to tag Drew along to identify the
>>> possible cause of this out of memory error.
>>>
>>> java.lang.OutOfMemoryError: Java heap space
>>>         at org.apache.mahout.utils.nlp.collocations.llr.Gram.<init>(Gram.java:67)
>>>         at org.apache.mahout.utils.nlp.collocations.llr.CollocCombiner.reduce(CollocCombiner.java:62)
>>>         at org.apache.mahout.utils.nlp.collocations.llr.CollocCombiner.reduce(CollocCombiner.java:30)
>>>         at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.combineAndSpill(MapTask.java:921)
>>>         at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1077)
>>>         at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:719)
>>>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:233)
>>>         at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2216)