I had to make the same decision while designing MAHOUT-627, and I chose the stripes representation to avoid network I/O bottlenecks.
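
To make the trade-off concrete, here is a minimal sketch of the two patterns in plain Python. It runs in memory rather than on Hadoop, and it is not the code from MAHOUT-627 or RowSimilarityJob. The function names (`pairs_map`, `stripes_map`, `reduce_pairs`, `reduce_stripes`) and the sample `DOCS` are made up for illustration, and I am assuming that co-occurrence means "two distinct terms appear in the same document".

```python
from collections import Counter, defaultdict
from itertools import permutations

# Hypothetical toy corpus: each document is a list of terms.
DOCS = [["a", "b", "c"], ["a", "b"], ["b", "c", "d"]]


def pairs_map(tokens):
    """'Pairs' pattern: emit one ((term, neighbor), 1) record per co-occurring pair."""
    terms = sorted(set(tokens))
    for term, neighbor in permutations(terms, 2):
        yield (term, neighbor), 1


def stripes_map(tokens):
    """'Stripes' pattern: emit one (term, {neighbor: count}) record per term."""
    terms = sorted(set(tokens))
    for term in terms:
        # The stripe is held in memory until it is emitted.
        stripe = Counter(other for other in terms if other != term)
        yield term, stripe


def reduce_pairs(records):
    """Sum the counts for each (term, neighbor) key."""
    totals = Counter()
    for key, count in records:
        totals[key] += count
    return totals


def reduce_stripes(records):
    """Merge the stripes for each term element-wise."""
    merged = defaultdict(Counter)
    for term, stripe in records:
        merged[term].update(stripe)
    return merged


# Records that would cross the network between the map and reduce phases.
pair_records = [rec for doc in DOCS for rec in pairs_map(doc)]
stripe_records = [rec for doc in DOCS for rec in stripes_map(doc)]

pair_totals = reduce_pairs(pair_records)
stripe_totals = reduce_stripes(stripe_records)

if __name__ == "__main__":
    print("pairs records emitted:  ", len(pair_records))
    print("stripes records emitted:", len(stripe_records))
    print("count for (a, b) via pairs:  ", pair_totals[("a", "b")])
    print("count for (a, b) via stripes:", stripe_totals["a"]["b"])
```

Both variants arrive at the same counts. The difference is that the stripes mapper emits one record per term instead of one per pair, at the cost of holding each term's associative array in memory until it is emitted.
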

On Mon, Jul 18, 2011 at 4:35 PM, Dhruv Kumar <[email protected]> wrote:

> The "pairs" approach, where each co-occurring pair is emitted by the mapper
> as a key-value pair, is I/O heavy but can be run on clusters with small
> per-node memory.
>
> There is another design pattern by Jimmy Lin called "stripes", where the
> co-occurrences per term are aggregated into associative arrays. The mappers
> emit the term as the key and the term's aggregated associative array as
> the value.
>
> The "stripes" approach is less network I/O heavy but uses more memory per
> node because the hashmaps can get quite big.
>
> Can the RowSimilarityJob be redesigned to use the "stripes" design pattern
> as shown in the following paper by Jimmy Lin?
>
> http://www.aclweb.org/anthology-new/D/D08/D08-1044.pdf
>
> On Mon, Jul 18, 2011 at 3:58 PM, Grant Ingersoll <[email protected]> wrote:
>
>> On Jul 18, 2011, at 3:52 PM, Grant Ingersoll wrote:
>> >
>> > I think the big win is to not construe this implementation to be based
>> > on what is in the paper. I'm starting to think we should have two
>> > RowSimilarity jobs. One for algebraic functions and one for those that
>> > are not.
>>
>> And I don't mean two completely separate jobs, just a different
>> mapper/reducer for that phase (and likely a combiner, too).
>>
>> -Grant
