DictionaryVectorizer performance

2014-10-08 Thread Burke Webster
I'm trying to turn a corpus of around 2.3 million docs into sparse vectors as input to RowSimilarityJob, and I seem to be running into some performance issues with DictionaryVectorizer.createDictionaryChunks. It seems that the goal is to assign a number to each "term" (in my case, bi-grams). This is done…
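As context for the step the snippet describes, here is a minimal, self-contained sketch of the dictionary-chunking idea: give each unique term a sequential integer id, and flush the mapping to a new chunk whenever the current one reaches a size limit. This is an illustration, not Mahout's actual DictionaryVectorizer code; the toy terms, the in-memory maps, and the maxTermsPerChunk threshold are hypothetical stand-ins (Mahout's real implementation works over SequenceFiles with a chunk-size-in-megabytes limit).

import java.util.*;

public class DictionaryChunkSketch {
    public static void main(String[] args) {
        // Toy input: extracted bi-grams, with repeats (hypothetical data)
        List<String> terms = Arrays.asList(
            "row similarity", "similarity job", "row similarity", "sparse vector");

        Map<String, Integer> dictionary = new LinkedHashMap<>(); // term -> id
        List<Map<String, Integer>> chunks = new ArrayList<>();
        Map<String, Integer> chunk = new LinkedHashMap<>();
        final int maxTermsPerChunk = 2; // stand-in for a size-in-MB limit

        int nextId = 0;
        for (String term : terms) {
            if (dictionary.containsKey(term)) continue; // number each term once
            dictionary.put(term, nextId);
            chunk.put(term, nextId);
            nextId++;
            if (chunk.size() >= maxTermsPerChunk) { // flush a full chunk
                chunks.add(chunk);
                chunk = new LinkedHashMap<>();
            }
        }
        if (!chunk.isEmpty()) chunks.add(chunk); // flush the remainder

        System.out.println("dictionary: " + dictionary);
        System.out.println("chunks:     " + chunks);
    }
}

With 22 million unique bi-grams (as mentioned later in this thread), the dictionary itself becomes large enough that how it is chunked and distributed starts to matter.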

Re: Performance of RowSimilarityJob

2014-09-27 Thread Burke Webster
> since 0.7, please upgrade to 0.9.
>
> On Fri, Sep 26, 2014 at 7:02 PM, Burke Webster wrote:
>> We are currently using 0.7 so that could be the issue. Last I looked I
>> believe we had around 22 million unique bi-grams in the dictionary.
>>
>> I can…

Re: Performance of RowSimilarityJob

2014-09-26 Thread Burke Webster
> you are seeing?
>
> How many unique bigrams?
>
> As Suneel asked, which version of Mahout?
>
> On Fri, Sep 26, 2014 at 1:23 PM, Burke Webster wrote:
>> I've been implementing the RowSimilarityJob on our 40-node cluster and have
>> run into…

Performance of RowSimilarityJob

2014-09-26 Thread Burke Webster
I've been implementing the RowSimilarityJob on our 40-node cluster and have run into some serious performance issues. I'm trying to run the job on a corpus of just over 2 million documents using bi-grams. When I get to the pairwise similarity step (CooccurrencesMapper and SimilarityReducer) I am running…
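The expensive part of this step is the number of document pairs the cooccurrence pass has to emit: a term that appears in k documents produces on the order of k^2 cooccurring pairs, so very frequent bi-grams dominate the cost, which is why pruning high-document-frequency terms from the dictionary tends to help more than adding nodes. As a rough sketch of the arithmetic the pairwise step computes in aggregate (not Mahout's actual CooccurrencesMapper/SimilarityReducer, and with hypothetical document weights), cosine similarity between two sparse vectors only accumulates over the terms the documents share:

import java.util.*;

public class PairwiseCosineSketch {
    // Cosine similarity over sparse term-weight maps; the dot product only
    // touches terms the two documents share (the "cooccurrences").
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double w = b.get(e.getKey());
            if (w != null) dot += e.getValue() * w; // shared term contributes
            normA += e.getValue() * e.getValue();
        }
        for (double w : b.values()) normB += w * w;
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Hypothetical tf-idf weights for two documents' bi-grams
        Map<String, Double> docA = Map.of("row similarity", 0.8, "sparse vector", 0.3);
        Map<String, Double> docB = Map.of("row similarity", 0.5, "similarity job", 0.9);
        System.out.printf("cos(docA, docB) = %.4f%n", cosine(docA, docB));
    }
}

In the distributed job, the mapper emits the per-term products and the reducer sums and normalizes them, so the per-pair arithmetic above is spread across the cluster rather than done document pair by document pair.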