Nope, sadly I wasn't doing the secondary sort (ugh) -- working on it now.

Drew
On Sun, Feb 28, 2010 at 3:48 PM, Jake Mannix <jake.man...@gmail.com> wrote:
> I thought you were doing the secondary sort idea? That's certainly the
> way to make sure you need nothing significant kept in memory, and this
> clearly won't scale without that optimization...
>
> I'd say this should get fixed before we release 0.3
>
> -jake
>
> On Sun, Feb 28, 2010 at 7:30 AM, Drew Farris <drew.far...@gmail.com> wrote:
>
>> So one option would be to do the frequency counts in another pass, but
>> I don't really like that idea. I think a compound key / secondary sort
>> would work so that the ngrams don't have to be tracked in a set.
>>
>> I will give it a try later today.
>>
>> On Sunday, February 28, 2010, Drew Farris <drew.far...@gmail.com> wrote:
>> > Bah, that's not correct. I do end up keeping each unique ngram for a
>> > given n-1gram in memory in the CollocCombiner and CollocReducer to do
>> > frequency counting. There's likely a more elegant solution to this.
>> >
>> > On Sun, Feb 28, 2010 at 10:00 AM, Drew Farris <drew.far...@gmail.com> wrote:
>> >> Argh, I'll look into it and see where Grams are kept in memory. There
>> >> really shouldn't be any place where they're retained beyond what's
>> >> needed for a single document. I doubt that there are documents in
>> >> Wikipedia that would blow the heap in this way, but I suppose it's
>> >> possible. You're just doing bigrams, or did you end up going up to
>> >> 5-grams?
>> >>
>> >> On Sun, Feb 28, 2010 at 7:50 AM, Robin Anil <robin.a...@gmail.com> wrote:
>> >>> After 9 hours of compute, it failed. It never went past the colloc
>> >>> combiner pass :(
>> >>>
>> >>> reason. I will have to tag Drew along to identify the possible cause
>> >>> of this out of memory error:
>> >>>
>> >>> java.lang.OutOfMemoryError: Java heap space
>> >>>   at org.apache.mahout.utils.nlp.collocations.llr.Gram.<init>(Gram.java:67)
>> >>>   at org.apache.mahout.utils.nlp.collocations.llr.CollocCombiner.reduce(CollocCombiner.java:62)
>> >>>   at org.apache.mahout.utils.nlp.collocations.llr.CollocCombiner.reduce(CollocCombiner.java:30)
>> >>>   at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.combineAndSpill(MapTask.java:921)
>> >>>   at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1077)
>> >>>   at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:719)
>> >>>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:233)
>> >>>   at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2216)
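
For reference, the compound key / secondary sort being discussed would look roughly like the sketch below: a key that sorts on (head, ngram), a partitioner and grouping comparator that look only at the head, and a reducer that counts consecutive equal ngrams in a single streaming pass. The names (GramKey, HeadPartitioner, HeadGroupingComparator, StreamingCountReducer) are placeholders for illustration and this uses the Hadoop 0.20 mapreduce API -- it is not the actual Mahout colloc code.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;

/** Compound key: the (n-1)gram head we group on plus the full ngram we sort on. */
public class GramKey implements WritableComparable<GramKey> {
  private final Text head = new Text();
  private final Text ngram = new Text();

  public void set(String h, String n) { head.set(h); ngram.set(n); }
  public Text getHead()  { return head; }
  public Text getNgram() { return ngram; }

  @Override public void write(DataOutput out) throws IOException {
    head.write(out); ngram.write(out);
  }
  @Override public void readFields(DataInput in) throws IOException {
    head.readFields(in); ngram.readFields(in);
  }
  @Override public int compareTo(GramKey o) {
    int cmp = head.compareTo(o.head);                  // primary order: the head
    return cmp != 0 ? cmp : ngram.compareTo(o.ngram);  // secondary order: the ngram
  }

  /** Partition on the head only, so every ngram sharing a head lands on one reducer. */
  public static class HeadPartitioner extends Partitioner<GramKey, LongWritable> {
    @Override public int getPartition(GramKey key, LongWritable value, int numPartitions) {
      return (key.getHead().hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
  }

  /** Group on the head only, so one reduce() call sees all of a head's ngrams, already sorted. */
  public static class HeadGroupingComparator extends WritableComparator {
    public HeadGroupingComparator() { super(GramKey.class, true); }
    @Override public int compare(WritableComparable a, WritableComparable b) {
      return ((GramKey) a).getHead().compareTo(((GramKey) b).getHead());
    }
  }

  /**
   * Because equal ngrams now arrive consecutively, the reducer can count them in a
   * single streaming pass instead of accumulating a set of Grams in memory.
   */
  public static class StreamingCountReducer
      extends Reducer<GramKey, LongWritable, Text, LongWritable> {
    @Override protected void reduce(GramKey key, Iterable<LongWritable> values, Context ctx)
        throws IOException, InterruptedException {
      String current = null;
      long count = 0;
      for (LongWritable v : values) {
        // Hadoop refills the key instance as the iterator advances, so getNgram()
        // reflects the ngram that produced the current value.
        String ngram = key.getNgram().toString();
        if (current != null && !current.equals(ngram)) {
          ctx.write(new Text(current), new LongWritable(count));
          count = 0;
        }
        current = ngram;
        count += v.get();
      }
      if (current != null) {
        ctx.write(new Text(current), new LongWritable(count));
      }
    }
  }
}

The job setup would wire these in with job.setPartitionerClass(GramKey.HeadPartitioner.class) and job.setGroupingComparatorClass(GramKey.HeadGroupingComparator.class); with that in place, neither the combiner nor the reducer needs to hold the unique ngrams for a given head in memory.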