I'm trying to turn a corpus of around 2.3 million docs into sparse
vectors for input into RowSimilarityJob and seem to be running into some
performance issues with DictionaryVectorizer.createDictionaryChunks.
It seems that the goal is to number each "term" (in my case, bi-grams).
This is done
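
To make that numbering step concrete, here is a rough standalone sketch of
the idea as I understand it: each previously unseen bi-gram gets the next
sequential integer id, so documents can later be encoded as sparse vectors
of (id, weight) entries. This is plain in-memory Java for illustration only,
not Mahout's SequenceFile-based createDictionaryChunks, and the class and
method names are made up.

import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DictionarySketch {

  // Assign each previously unseen term (bi-gram) the next integer id.
  public static Map<String, Integer> buildDictionary(List<List<String>> tokenizedDocs) {
    Map<String, Integer> dictionary = new LinkedHashMap<>();
    for (List<String> doc : tokenizedDocs) {
      for (String term : doc) {
        if (!dictionary.containsKey(term)) {
          dictionary.put(term, dictionary.size());
        }
      }
    }
    return dictionary;
  }
}

With tens of millions of unique bi-grams a single in-memory map like this
clearly won't do, which I assume is why the real implementation writes the
dictionary out in chunks.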
> since 0.7,
> please upgrade to 0.9.
>
> On Fri, Sep 26, 2014 at 7:02 PM, Burke Webster
> wrote:
>
>> We are currently using 0.7 so that could be the issue. Last I looked I
>> believe we had around 22 million unique bi-grams in the dictionary.
>>
>> I can
> you are seeing?
>
> How many unique bigrams?
>
> As Suneel asked, which version of Mahout?
>
>
>
> On Fri, Sep 26, 2014 at 1:23 PM, Burke Webster
> wrote:
>
> > I've been implementing the RowSimilarityJob on our 40-node cluster and have
> > run into
I've been implementing the RowSimilarityJob on our 40-node cluster and have
run into some serious performance issues.
I'm trying to run the job on a corpus of just over 2 million documents using
bi-grams. When I get to the pairwise similarity step (CooccurrencesMapper
and SimilarityReducer) I am running
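
For anyone following along, here is a rough single-machine sketch of the
general shape of that pairwise step as I understand it: for each term, every
pair of documents sharing the term contributes the product of their weights,
and the accumulated sums are then normalized into cosine similarities. This
is illustration only, not Mahout's actual CooccurrencesMapper and
SimilarityReducer code, and the class and method names below are made up.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PairwiseSimilaritySketch {

  // termToDocWeights: term -> (docId -> weight); norms: docId -> vector norm
  // (assumed present for every docId). Returns "i,j" -> cosine similarity
  // for every pair of documents that share at least one term.
  public static Map<String, Double> cosineSimilarities(
      Map<String, Map<Integer, Double>> termToDocWeights,
      Map<Integer, Double> norms) {
    Map<String, Double> dotProducts = new HashMap<>();
    for (Map<Integer, Double> column : termToDocWeights.values()) {
      List<Integer> docs = new ArrayList<>(column.keySet());
      // Quadratic in the number of documents containing this term; this is
      // the part that explodes when a bi-gram occurs in a large fraction of
      // ~2 million documents.
      for (int a = 0; a < docs.size(); a++) {
        for (int b = a + 1; b < docs.size(); b++) {
          int i = docs.get(a), j = docs.get(b);
          String key = Math.min(i, j) + "," + Math.max(i, j);
          dotProducts.merge(key, column.get(i) * column.get(j), Double::sum);
        }
      }
    }
    // Turn accumulated dot products into cosine similarities.
    Map<String, Double> similarities = new HashMap<>();
    for (Map.Entry<String, Double> e : dotProducts.entrySet()) {
      String[] ij = e.getKey().split(",");
      double normProduct =
          norms.get(Integer.parseInt(ij[0])) * norms.get(Integer.parseInt(ij[1]));
      similarities.put(e.getKey(), e.getValue() / normProduct);
    }
    return similarities;
  }
}

If that picture is roughly right, pruning or down-sampling very frequent
bi-grams before the pairwise step should help a lot.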