Re: Performance of RowSimilarityJob

Burke Webster Fri, 26 Sep 2014 16:03:47 -0700

We are currently using Mahout 0.7, so that could be the issue.  Last I
looked, we had around 22 million unique bi-grams in the dictionary.

I can look into the newer code and see if that fixes our problems.

On Fri, Sep 26, 2014 at 4:26 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:

> Can you say how many words you are seeing?
>
> How many unique bigrams?
>
> As Suneel asked, which version of Mahout?
>
>
>
> On Fri, Sep 26, 2014 at 1:23 PM, Burke Webster <bu...@collectiveip.com>
> wrote:
>
> > I've been implementing the RowSimilarityJob on our 40-node cluster and
> > have run into some serious performance issues.
> >
> > I am trying to run the job on a corpus of just over 2 million documents
> > using bi-grams.  When I get to the pairwise similarity step
> > (CooccurrencesMapper and SimilarityReducer) I am running out of space on
> > HDFS because the job is generating over 5 terabytes of output data.
> >
> > Has anybody else run into similar issues?  What other info can I provide
> > that would be helpful?
> >
> > Thanks,
> > Burke
> >
>
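
A rough way to see why the pairwise similarity step can produce so much
intermediate data: assuming the mapper emits one record for every pair of
documents that share a bi-gram (an assumption made for illustration, not
taken from the Mahout source), a bi-gram that occurs in n documents
contributes n(n-1)/2 pairs.  The sketch below is in Python; the function
name and the document frequencies are hypothetical, not measurements from
the corpus discussed in this thread.

```python
def estimate_pair_records(doc_freqs):
    """Count document pairs that share a term, summed over all terms.

    doc_freqs: iterable of ints, the number of documents containing each term.
    Assumes one emitted record per unordered document pair per term.
    """
    return sum(n * (n - 1) // 2 for n in doc_freqs)


# Hypothetical frequencies: many rare bi-grams and a single common one.
rare = [2] * 1000      # 1000 bi-grams, each appearing in 2 documents
common = [100_000]     # one bi-gram appearing in 100,000 documents

print(estimate_pair_records(rare))
print(estimate_pair_records(common))
```

Under these assumptions the single common bi-gram accounts for roughly 5
billion pairs, while the 1000 rare ones together account for 1000, so a
handful of very frequent bi-grams could dominate the output size.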
