Re: Performance of RowSimilarityJob

2014-09-27 Thread Burke Webster
Thanks for the feedback, everybody. I'll give 0.9 a run. Thanks! On Sep 26, 2014, at 5:10 PM, Suneel Marthi suneel.mar...@gmail.com wrote: I had seen the issue you are reporting when running CooccurrencesMapper on a 2M-document corpus on an 80-node cluster. The job would be

Re: Performance of RowSimilarityJob

2014-09-26 Thread Suneel Marthi
What's the Mahout version? Please work off of 0.9; there was a performance issue in RSJ (RowSimilarityJob) that was fixed in 0.9. On Fri, Sep 26, 2014 at 4:23 PM, Burke Webster bu...@collectiveip.com wrote: I've been implementing the RowSimilarityJob on our 40-node cluster and have run into some serious

Re: Performance of RowSimilarityJob

2014-09-26 Thread Ted Dunning
Can you say how many words you are seeing? How many unique bigrams? As Suneel asked, which version of Mahout? On Fri, Sep 26, 2014 at 1:23 PM, Burke Webster bu...@collectiveip.com wrote: I've been implementing the RowSimilarityJob on our 40-node cluster and have run into some serious

Re: Performance of RowSimilarityJob

2014-09-26 Thread Burke Webster
We are currently using 0.7, so that could be the issue. Last I looked, I believe we had around 22 million unique bigrams in the dictionary. I can look into the newer code and see if that fixes our problems. On Fri, Sep 26, 2014 at 4:26 PM, Ted Dunning ted.dunn...@gmail.com wrote: Can you say

Re: Performance of RowSimilarityJob

2014-09-26 Thread Ted Dunning
Yeah... that is pretty ancient. On Fri, Sep 26, 2014 at 4:02 PM, Burke Webster bu...@collectiveip.com wrote: We are currently using 0.7, so that could be the issue. Last I looked, I believe we had around 22 million unique bigrams in the dictionary. I can look into the newer code and see if

Re: Performance of RowSimilarityJob

2014-09-26 Thread Suneel Marthi
I had seen the issue you are reporting when running CooccurrencesMapper on a 2M-document corpus on an 80-node cluster. The job would be stuck in CooccurrencesMapper forever. This has been fixed in 0.9 (I have not had a chance to try it out on the size and cluster I had before), so it would be good if