I have seen the issue you are reporting when running CooccurrencesMapper on a 2M-document corpus on an 80-node cluster. The job would be stuck in CooccurrencesMapper forever.
This has been fixed in 0.9 (I have not had a chance to try it out on the corpus size and cluster I had before), so it would be good if you could try running with 0.9.

P.S. 0.7 is not supported anymore, and Mahout has come a long way since 0.7; please upgrade to 0.9.

On Fri, Sep 26, 2014 at 7:02 PM, Burke Webster <bu...@collectiveip.com> wrote:

> We are currently using 0.7 so that could be the issue. Last I looked, I
> believe we had around 22 million unique bi-grams in the dictionary.
>
> I can look into the newer code and see if that fixes our problems.
>
> On Fri, Sep 26, 2014 at 4:26 PM, Ted Dunning <ted.dunn...@gmail.com>
> wrote:
>
> > Can you say how many words you are seeing?
> >
> > How many unique bigrams?
> >
> > As Suneel asked, which version of Mahout?
> >
> > On Fri, Sep 26, 2014 at 1:23 PM, Burke Webster <bu...@collectiveip.com>
> > wrote:
> >
> > > I've been implementing the RowSimilarityJob on our 40-node cluster and
> > > have run into some serious performance issues.
> > >
> > > I am trying to run the job on a corpus of just over 2 million documents
> > > using bi-grams. When I get to the pairwise similarity step
> > > (CooccurrencesMapper and SimilarityReducer), I am running out of space
> > > on HDFS because the job is generating over 5 terabytes of output data.
> > >
> > > Has anybody else run into similar issues? What other info can I provide
> > > that would be helpful?
> > >
> > > Thanks,
> > > Burke
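For anyone following along, here is a rough sketch of how the job can be invoked on 0.9 while bounding the size of the pairwise-similarity output. This is not from the thread: the option names are from the Mahout 0.9 `rowsimilarity` driver as I recall them, and all paths and numeric values are placeholders to adjust for your corpus.

```shell
# Sketch only (assumptions: Mahout 0.9 on the PATH, TF-IDF vectors already
# produced upstream, e.g. by seq2sparse; all paths are hypothetical).
# --maxSimilaritiesPerRow and --threshold cap how many similarity pairs each
# row emits, which is the usual lever when the CooccurrencesMapper /
# SimilarityReducer step overruns HDFS.
mahout rowsimilarity \
  --input /path/to/tfidf-vectors \
  --output /path/to/row-similarity \
  --similarityClassname SIMILARITY_COSINE \
  --maxSimilaritiesPerRow 100 \
  --threshold 0.1 \
  --excludeSelfSimilarity true \
  --tempDir /path/to/tmp
```

Tightening `--threshold` and lowering `--maxSimilaritiesPerRow` trade recall for output volume; with ~22 million unique bi-grams, an unbounded all-pairs output can easily reach the multi-terabyte range reported above.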