I'm trying to turn a corpus of around 2.3 million docs into sparse vectors for input into RowSimilarityJob, and I seem to be running into some performance issues with DictionaryVectorizer.createDictionaryChunks. As far as I can tell, its goal is to assign a sequential number to each "term" (in my case, bi-grams). This is done in-memory in a single pass while attempting to enforce a max chunk size.
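
For reference, here's my reading of the current logic, paraphrased from memory rather than copied from the Mahout source (openChunk, distinctTerms, and maxChunkSizeBytes are stand-ins of mine). The point is that one counter advances over every distinct term, so this step stays serial no matter how big the cluster is:

    // My paraphrase of createDictionaryChunks, not the actual source:
    // a single counter over all distinct terms.
    int termId = 0;
    long bytesInChunk = 0;
    int chunk = 0;
    SequenceFile.Writer writer = openChunk(chunk);       // stand-in helper
    for (Text term : distinctTerms) {                    // wordcount output
      if (bytesInChunk > maxChunkSizeBytes) {            // enforce max chunk size
        writer.close();
        writer = openChunk(++chunk);
        bytesInChunk = 0;
      }
      writer.append(term, new IntWritable(termId++));    // term -> sequential id
      bytesInChunk += term.getLength() + Integer.SIZE / 8;
    }
    writer.close();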
Is there a reason we wouldn't update this code to use an approach similar to the one presented here: http://waredingen.nl/monotonically-increasing-row-ids-with-mapredu? Using a special comparator and grouping partitioner would allow us to parallelize this operation across a MapReduce cluster. I'm happy to incorporate these changes (and am testing them locally); I'm just curious whether I'm missing something that forces the current single-threaded approach. Also, if this is better suited for the mahout-dev list, I'm happy to send it there.
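
To make the proposal concrete, here's a rough sketch of how I understand the technique from that post, adapted to term numbering. It's a simplification under my own assumptions: the names (TaggedKey, IdMapper, etc.) are mine, I route each map task's records to partition taskId % numReducers, and I read one term per line via TextInputFormat rather than Mahout's sequence files. Each mapper counts its records and broadcasts its total as a marker key to every reducer; the key's compareTo sorts markers ahead of data, and a grouping comparator hands each reducer its whole partition in a single reduce() call, so it can compute its global starting offset from the markers before assigning consecutive ids:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.io.WritableComparator;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Partitioner;
    import org.apache.hadoop.mapreduce.Reducer;

    public class ParallelTermIds {

      // Composite key: target partition, marker/data flag, source map task, local sequence.
      public static class TaggedKey implements WritableComparable<TaggedKey> {
        static final byte MARKER = 0, DATA = 1;
        int partition; byte type; int taskId; long seq;
        public TaggedKey() {}
        TaggedKey(int partition, byte type, int taskId, long seq) {
          this.partition = partition; this.type = type; this.taskId = taskId; this.seq = seq;
        }
        public void write(DataOutput out) throws IOException {
          out.writeInt(partition); out.writeByte(type); out.writeInt(taskId); out.writeLong(seq);
        }
        public void readFields(DataInput in) throws IOException {
          partition = in.readInt(); type = in.readByte(); taskId = in.readInt(); seq = in.readLong();
        }
        // The "special comparator": markers sort ahead of data, then by task and sequence.
        public int compareTo(TaggedKey o) {
          int c = Integer.compare(partition, o.partition);
          if (c == 0) c = Byte.compare(type, o.type);
          if (c == 0) c = Integer.compare(taskId, o.taskId);
          if (c == 0) c = Long.compare(seq, o.seq);
          return c;
        }
      }

      // Route every record to the partition baked into its key.
      public static class KeyPartitioner extends Partitioner<TaggedKey, Text> {
        @Override public int getPartition(TaggedKey key, Text value, int numPartitions) {
          return key.partition % numPartitions;
        }
      }

      // Group an entire reduce partition into one reduce() call, so the
      // markers (sorted first) are consumed before any data records.
      public static class GroupByPartition extends WritableComparator {
        protected GroupByPartition() { super(TaggedKey.class, true); }
        @Override public int compare(WritableComparable a, WritableComparable b) {
          return Integer.compare(((TaggedKey) a).partition, ((TaggedKey) b).partition);
        }
      }

      public static class IdMapper extends Mapper<LongWritable, Text, TaggedKey, Text> {
        private int taskId, reducers;
        private long count;
        @Override protected void setup(Context ctx) {
          taskId = ctx.getTaskAttemptID().getTaskID().getId();
          reducers = ctx.getNumReduceTasks();
        }
        @Override protected void map(LongWritable offset, Text term, Context ctx)
            throws IOException, InterruptedException {
          // All of this task's data goes to one partition, numbered in input order.
          ctx.write(new TaggedKey(taskId % reducers, TaggedKey.DATA, taskId, count++), term);
        }
        @Override protected void cleanup(Context ctx) throws IOException, InterruptedException {
          // Broadcast this task's total record count to every reducer.
          for (int r = 0; r < reducers; r++)
            ctx.write(new TaggedKey(r, TaggedKey.MARKER, taskId, 0), new Text(Long.toString(count)));
        }
      }

      public static class IdReducer extends Reducer<TaggedKey, Text, LongWritable, Text> {
        @Override protected void reduce(TaggedKey key, Iterable<Text> values, Context ctx)
            throws IOException, InterruptedException {
          int myPartition = key.partition;
          long nextId = 0;
          for (Text v : values) {  // the framework mutates 'key' as we advance the iterator
            if (key.type == TaggedKey.MARKER) {
              // Counts from tasks whose data lands in a lower partition shift our base id.
              if (key.taskId % ctx.getNumReduceTasks() < myPartition)
                nextId += Long.parseLong(v.toString());
            } else {
              ctx.write(new LongWritable(nextId++), v);  // emit (id, term)
            }
          }
        }
      }
    }

The driver would just need job.setPartitionerClass(KeyPartitioner.class), job.setGroupingComparatorClass(GroupByPartition.class), and TaggedKey.class/Text.class as the map output key/value classes. The id ranges come out contiguous because reducer r's base offset is the sum of the counts from every task whose data lands in a lower partition, which is exactly what the markers deliver.

Thanks, Burke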