So one option would be to do the frequency counts in another pass, but
I don't really like that idea. I think a compound key / secondary sort
would work so that the ngrams don't have to be tracked in a set.
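
To make the compound key idea concrete, here is a rough sketch in plain Java, not the actual Mahout code. It assumes the key carries both the n-1gram (the head) and the full ngram, and that records reach the counting step already sorted on both parts. GramKey, SORT_ORDER and countSorted are hypothetical names invented for this example. In a real Hadoop job the ordering would come from the shuffle (custom sort comparator, with partitioning and grouping on the head only) rather than from an explicit sort call.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class SecondarySortSketch {

    /** Hypothetical compound key: the head (n-1)gram plus the full ngram. */
    static final class GramKey {
        final String head;
        final String ngram;

        GramKey(String head, String ngram) {
            this.head = head;
            this.ngram = ngram;
        }
    }

    /** Order by head first, then by ngram, so identical ngrams arrive adjacent. */
    static final Comparator<GramKey> SORT_ORDER =
        Comparator.comparing((GramKey k) -> k.head).thenComparing(k -> k.ngram);

    /**
     * Walks keys that are already sorted by SORT_ORDER and counts each distinct
     * ngram. Only the current key and a running count are held while counting;
     * no set of ngrams is needed. The returned list stands in for emitting
     * records and exists only to keep the sketch testable.
     */
    static List<String> countSorted(List<GramKey> sortedKeys) {
        List<String> out = new ArrayList<>();
        GramKey current = null;
        int count = 0;
        for (GramKey key : sortedKeys) {
            // A change in the compound key means the previous ngram is complete.
            if (current != null && SORT_ORDER.compare(current, key) != 0) {
                out.add(current.head + "\t" + current.ngram + "\t" + count);
                count = 0;
            }
            current = key;
            count++;
        }
        if (current != null) {
            out.add(current.head + "\t" + current.ngram + "\t" + count);
        }
        return out;
    }

    public static void main(String[] args) {
        List<GramKey> keys = new ArrayList<>();
        keys.add(new GramKey("new", "new york"));
        keys.add(new GramKey("new", "new jersey"));
        keys.add(new GramKey("new", "new york"));
        keys.add(new GramKey("san", "san jose"));
        keys.sort(SORT_ORDER); // stands in for the framework's shuffle sort
        countSorted(keys).forEach(System.out::println);
    }
}
```

The point of the design is that memory use no longer grows with the number of distinct ngrams under one head, which is what the set-based counting in the combiner and reducer runs into.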

I will give it a try later today.

On Sunday, February 28, 2010, Drew Farris <drew.far...@gmail.com> wrote:
> Bah, that's not correct. I do end up keeping each unique ngram for a
> given n-1gram in memory in the CollocCombiner and CollocReducer to do
> frequency counting. There's likely a more elegant solution to this.
>
> On Sun, Feb 28, 2010 at 10:00 AM, Drew Farris <drew.far...@gmail.com> wrote:
>> Argh, I'll look into it and see where Grams are kept in memory. There
>> really shouldn't be any place where they're retained beyond what's
>> needed for a single document. I doubt that there are documents in
>> Wikipedia that would blow the heap in this way, but I suppose it's
>> possible. You're just doing bigrams, or did you end up going up to
>> 5-grams?
>>
>> On Sun, Feb 28, 2010 at 7:50 AM, Robin Anil <robin.a...@gmail.com> wrote:
>>> After 9 hours of compute, it failed. It never went past the colloc combiner
>>> pass :(
>>>
>>> reason. I will have to tag Drew along to identify the possible cause of this
>>> out of memory error
>>>
>>>
>>> java.lang.OutOfMemoryError: Java heap space
>>>        at org.apache.mahout.utils.nlp.collocations.llr.Gram.<init>(Gram.java:67)
>>>        at org.apache.mahout.utils.nlp.collocations.llr.CollocCombiner.reduce(CollocCombiner.java:62)
>>>        at org.apache.mahout.utils.nlp.collocations.llr.CollocCombiner.reduce(CollocCombiner.java:30)
>>>        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.combineAndSpill(MapTask.java:921)
>>>        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1077)
>>>        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:719)
>>>        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:233)
>>>        at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2216)
>>>
>>
>
