I have seen the issue you are reporting when running CooccurrencesMapper on a 2M-document corpus on an 80-node cluster. The job would be stuck in CooccurrencesMapper forever.
This has been fixed in 0.9 (I have not had a chance to try it out on the corpus size and cluster I had before), so it would be good if you could try running with 0.9.

P.S. 0.7 is not supported anymore, and Mahout has come a long way since 0.7; please upgrade to 0.9.

On Fri, Sep 26, 2014 at 7:02 PM, Burke Webster <bu...@collectiveip.com> wrote:

> We are currently using 0.7 so that could be the issue. Last I looked, I
> believe we had around 22 million unique bi-grams in the dictionary.
>
> I can look into the newer code and see if that fixes our problems.
>
> On Fri, Sep 26, 2014 at 4:26 PM, Ted Dunning <ted.dunn...@gmail.com>
> wrote:
>
> > Can you say how many words you are seeing?
> >
> > How many unique bigrams?
> >
> > As Suneel asked, which version of Mahout?
> >
> > On Fri, Sep 26, 2014 at 1:23 PM, Burke Webster <bu...@collectiveip.com>
> > wrote:
> >
> > > I've been implementing the RowSimilarityJob on our 40-node cluster and
> > > have run into some serious performance issues.
> > >
> > > I am trying to run the job on a corpus of just over 2 million documents
> > > using bi-grams. When I get to the pairwise similarity step
> > > (CooccurrencesMapper and SimilarityReducer), I am running out of space
> > > on HDFS because the job is generating over 5 terabytes of output data.
> > >
> > > Has anybody else run into similar issues? What other info can I provide
> > > that would be helpful?
> > >
> > > Thanks,
> > > Burke
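For anyone following along, here is a rough sketch of how the job can be invoked on 0.9 while bounding the size of the pairwise-similarity output. This is not from the thread: the option names are from the Mahout 0.9 `rowsimilarity` driver as I recall them, and all paths and numeric values are placeholders to adjust for your corpus.

```shell
# Sketch only (assumptions: Mahout 0.9 on the PATH, TF-IDF vectors already
# produced upstream, e.g. by seq2sparse; all paths are hypothetical).
# --maxSimilaritiesPerRow and --threshold cap how many similarity pairs each
# row emits, which is the usual lever when the CooccurrencesMapper /
# SimilarityReducer step overruns HDFS.
mahout rowsimilarity \
  --input /path/to/tfidf-vectors \
  --output /path/to/row-similarity \
  --similarityClassname SIMILARITY_COSINE \
  --maxSimilaritiesPerRow 100 \
  --threshold 0.1 \
  --excludeSelfSimilarity true \
  --tempDir /path/to/tmp
```

Tightening `--threshold` and lowering `--maxSimilaritiesPerRow` trade recall for output volume; with ~22 million unique bi-grams, an unbounded all-pairs output can easily reach the multi-terabyte range reported above.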