Aargh. More EMR woes. I had left a job running before I went to the Hadoop
Summit. It seems it took some 20 minutes for the preprocess step, 20 minutes for
the word count, and then it sat stuck for 10 hours creating the dictionary file.
I am shutting everything down and trying out Cloudera.

Robin

On Mon, Mar 1, 2010 at 2:21 AM, Drew Farris <drew.far...@gmail.com> wrote:

> Nope, sadly I wasn't doing the secondary sort (ugh) -- working on it now.
>
> Drew
>
> On Sun, Feb 28, 2010 at 3:48 PM, Jake Mannix <jake.man...@gmail.com>
> wrote:
> > I thought you were doing the secondary sort idea?  That's certainly the
> > way to make sure nothing significant needs to be kept in memory, and this
> > clearly won't scale without that optimization...
> >
> > I'd say this should get fixed before we release 0.3
> >
> >  -jake
> >
> > On Sun, Feb 28, 2010 at 7:30 AM, Drew Farris <drew.far...@gmail.com>
> wrote:
> >
> >> So one option would be to do the frequency counts in another pass, but
> >> I don't really like that idea. I think a compound key / secondary sort
> >> would work so that the ngrams don't have to be tracked in a set.
> >>
> >> I will give it a try later today.
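
For context, the compound key / secondary sort Drew mentions is usually wired up
along the following lines in the old org.apache.hadoop.mapred API these jobs are
written against. This is a minimal sketch under assumed names (HeadNGramKey,
HeadPartitioner, HeadGroupingComparator), not the actual Mahout classes:

// Illustrative sketch only -- not the actual Mahout code.
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

/** Compound key: primary part is the head (n-1)-gram, secondary part the full ngram. */
public class HeadNGramKey implements WritableComparable<HeadNGramKey> {
  private Text head = new Text();   // grouping/partitioning component
  private Text ngram = new Text();  // sort-only component

  public void set(String headGram, String fullNGram) {
    head.set(headGram);
    ngram.set(fullNGram);
  }

  public Text getHead() { return head; }
  public Text getNGram() { return ngram; }

  public void write(DataOutput out) throws IOException {
    head.write(out);
    ngram.write(out);
  }

  public void readFields(DataInput in) throws IOException {
    head.readFields(in);
    ngram.readFields(in);
  }

  /** Full ordering: by head first, then by ngram, so equal heads arrive contiguous and sorted. */
  public int compareTo(HeadNGramKey other) {
    int cmp = head.compareTo(other.head);
    return cmp != 0 ? cmp : ngram.compareTo(other.ngram);
  }

  /** Partition on the head only, so every ngram sharing a head reaches the same reducer. */
  public static class HeadPartitioner implements Partitioner<HeadNGramKey, Text> {
    public void configure(JobConf job) { }

    public int getPartition(HeadNGramKey key, Text value, int numPartitions) {
      return (key.getHead().hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
  }

  /** Group on the head only, so one reduce() call sees all ngrams for a head as a sorted stream. */
  public static class HeadGroupingComparator extends WritableComparator {
    protected HeadGroupingComparator() {
      super(HeadNGramKey.class, true);
    }

    public int compare(WritableComparable a, WritableComparable b) {
      return ((HeadNGramKey) a).getHead().compareTo(((HeadNGramKey) b).getHead());
    }
  }
}

Wired into the job with conf.setPartitionerClass(HeadNGramKey.HeadPartitioner.class)
and conf.setOutputValueGroupingComparator(HeadNGramKey.HeadGroupingComparator.class),
the reducer then receives all ngrams for a head in one sorted reduce() call and can
count duplicates with a single running counter instead of holding a set of Grams in
memory.
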
> >>
> >> On Sunday, February 28, 2010, Drew Farris <drew.far...@gmail.com>
> wrote:
> >> > Bah, that's not correct. I do end up keeping each unique ngram for a
> >> > given (n-1)-gram in memory in the CollocCombiner and CollocReducer to do
> >> > frequency counting. There's likely a more elegant solution to this.
> >> >
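
To make the memory problem concrete: per-key frequency counting of the kind
described above typically buffers every distinct ngram for the current key
before anything is emitted. A minimal sketch of that pattern, assuming plain
Text keys and values and the old mapred API (this is not the actual
CollocCombiner/CollocReducer code):

// Illustrative only: shows why per-key frequency counting can exhaust the heap.
import java.io.IOException;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class NaiveNGramCountingReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, IntWritable> {

  public void reduce(Text headGram, Iterator<Text> ngrams,
                     OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    // Every distinct ngram seen for this head is held in memory until the
    // iterator is exhausted; with a very common head the map keeps growing.
    Map<String, Integer> counts = new HashMap<String, Integer>();
    while (ngrams.hasNext()) {
      String ngram = ngrams.next().toString();
      Integer c = counts.get(ngram);
      counts.put(ngram, c == null ? 1 : c + 1);
    }
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
      output.collect(new Text(e.getKey()), new IntWritable(e.getValue()));
    }
  }
}

With a very frequent head gram, that map grows with the number of distinct
ngrams sharing the head, which is consistent with the heap exhaustion during
combineAndSpill in the stack trace quoted below.
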
> >> > On Sun, Feb 28, 2010 at 10:00 AM, Drew Farris <drew.far...@gmail.com>
> >> wrote:
> >> >> Argh, I'll look into it and see where Grams are kept in memory. There
> >> >> really shouldn't be any place where they're retained beyond what's
> >> >> needed for a single document. I doubt that there are documents in
> >> >> Wikipedia that would blow the heap in this way, but I suppose it's
> >> >> possible. You're just doing bigrams, or did you end up going up to
> >> >> 5-grams?
> >> >>
> >> >> On Sun, Feb 28, 2010 at 7:50 AM, Robin Anil <robin.a...@gmail.com>
> >> wrote:
> >> >>> After 9 hours of compute, it failed. It never got past the colloc
> >> >>> combiner pass :(
> >> >>>
> >> >>> The reason is below. I will have to tag Drew along to identify the
> >> >>> possible cause of this out-of-memory error:
> >> >>>
> >> >>>
> >> >>> java.lang.OutOfMemoryError: Java heap space
> >> >>>        at org.apache.mahout.utils.nlp.collocations.llr.Gram.<init>(Gram.java:67)
> >> >>>        at org.apache.mahout.utils.nlp.collocations.llr.CollocCombiner.reduce(CollocCombiner.java:62)
> >> >>>        at org.apache.mahout.utils.nlp.collocations.llr.CollocCombiner.reduce(CollocCombiner.java:30)
> >> >>>        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.combineAndSpill(MapTask.java:921)
> >> >>>        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1077)
> >> >>>        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:719)
> >> >>>        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:233)
> >> >>>        at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2216)
> >> >>>
> >> >>
> >> >
> >>
> >
>
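
Until a secondary-sort change lands, a possible stopgap (not a fix, since the
per-key buffering still grows with the data) is to give the child task JVMs
more heap. A minimal sketch, assuming a Hadoop 0.20-era cluster like the one in
the stack trace; the property is the standard mapred.child.java.opts, but the
2 GB value and the HeapStopgap helper are just illustrative:

// Stopgap only: raises the heap of map/reduce child JVMs.
import org.apache.hadoop.mapred.JobConf;

public class HeapStopgap {
  public static JobConf withLargerChildHeap(JobConf conf) {
    // "-Xmx2048m" is an example value, not a recommendation; on EMR the same
    // property can be set through a bootstrap action or job flow configuration.
    conf.set("mapred.child.java.opts", "-Xmx2048m");
    return conf;
  }
}
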
