We are currently using Mahout 0.7, so that could be the issue. Last I looked, we had around 22 million unique bi-grams in the dictionary.
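For a rough sense of why the pairwise step could produce so much output, here is a back-of-envelope sketch. It assumes (this is not confirmed anywhere in this thread) that the step emits one entry per pair of documents sharing a bi-gram, so a bi-gram appearing in df documents contributes df * (df - 1) / 2 pairs. The document frequencies and the bytes-per-pair figure below are made up for illustration and are not measurements from our corpus.

```python
# Back-of-envelope estimate of pairwise co-occurrence output size.
# Assumption: for each column (bi-gram), one entry is emitted for every
# pair of rows (documents) that contain it.


def cooccurrence_pairs(doc_freqs):
    """Total pairs emitted: sum of df * (df - 1) / 2 over all bi-grams."""
    return sum(df * (df - 1) // 2 for df in doc_freqs)


def estimated_bytes(doc_freqs, bytes_per_pair=16):
    """Rough output size; bytes_per_pair is a hypothetical serialized size."""
    return cooccurrence_pairs(doc_freqs) * bytes_per_pair


# Hypothetical document frequencies: a few very common bi-grams plus a
# long tail of rare ones.
doc_freqs = [1_000_000, 100_000, 10_000] + [10] * 1_000

pairs = cooccurrence_pairs(doc_freqs)
terabytes = estimated_bytes(doc_freqs) / 1e12
print(f"{pairs} pairs, roughly {terabytes:.1f} TB")
```

Under this assumption the total is dominated by the handful of most frequent bi-grams, since the cost grows quadratically with document frequency, so the long tail of the 22 million matters much less than the head.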
I can look into the newer code and see if that fixes our problems.

On Fri, Sep 26, 2014 at 4:26 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:

> Can you say how many words you are seeing?
>
> How many unique bigrams?
>
> As Suneel asked, which version of Mahout?
>
> On Fri, Sep 26, 2014 at 1:23 PM, Burke Webster <bu...@collectiveip.com>
> wrote:
>
> > I've been implementing the RowSimilarityJob on our 40-node cluster and
> > have run into some serious performance issues.
> >
> > Trying to run the job on a corpus of just over 2 million documents using
> > bi-grams. When I get to the pairwise similarity step (CooccurrencesMapper
> > and SimilarityReducer) I am running out of space on HDFS because the job
> > is generating over 5 terabytes of output data.
> >
> > Has anybody else run into similar issues? What other info can I provide
> > that would be helpful?
> >
> > Thanks,
> > Burke