I had to make this decision while designing Mahout-627 and chose the stripes
representation to avoid network I/O bottlenecks.
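
For anyone unfamiliar with the two patterns, here is a minimal sketch of what
each mapper emits for a single row. It is plain Python rather than Mahout or
Hadoop code; the function names and the toy row are hypothetical, and it
assumes the terms within a row are distinct.

```python
from collections import Counter, defaultdict
from itertools import permutations


def pairs_mapper(terms):
    """Pairs pattern: one ((term, neighbor), 1) record per co-occurrence."""
    for a, b in permutations(terms, 2):
        yield (a, b), 1


def stripes_mapper(terms):
    """Stripes pattern: one (term, {neighbor: count}) record per term."""
    stripes = defaultdict(Counter)
    for a, b in permutations(terms, 2):
        stripes[a][b] += 1  # aggregated in mapper memory before emitting
    for term, stripe in stripes.items():
        yield term, dict(stripe)


def stripes_reducer(term, stripes):
    """Element-wise sum of every stripe received for one term."""
    total = Counter()
    for stripe in stripes:
        total.update(stripe)
    return term, dict(total)


row = ["a", "b", "c", "d"]  # toy input, assumed distinct terms
pair_records = list(pairs_mapper(row))      # 12 small records to shuffle
stripe_records = list(stripes_mapper(row))  # 4 larger records to shuffle
```

Both variants carry the same counts. The difference is that pairs shuffles
many tiny records, while stripes shuffles fewer records but has to hold a
whole associative array per term in memory.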

On Mon, Jul 18, 2011 at 4:35 PM, Dhruv Kumar <[email protected]> wrote:

> The "pairs" approach, where the mapper emits each co-occurring pair as a
> key-value pair, is I/O heavy but can run on clusters with little per-node
> memory.
>
> There is another design pattern by Jimmy Lin called "stripes", where the
> co-occurrences per term are aggregated into associative arrays. The mappers
> emit the term as the key and the term's aggregated associative array as
> the value.
>
> The "stripes" approach is less network I/O heavy but uses more memory per
> node because the hashmaps can get quite big.
>
> Can the RowSimilarityJob be redesigned to use the "stripes" design pattern
> as shown in the following paper by Jimmy Lin?
>
> http://www.aclweb.org/anthology-new/D/D08/D08-1044.pdf
>
>
>
>
> On Mon, Jul 18, 2011 at 3:58 PM, Grant Ingersoll <[email protected]> wrote:
>
>>
>> On Jul 18, 2011, at 3:52 PM, Grant Ingersoll wrote:
>> >
>> > I think the big win is to not construe this implementation to be based
>> on what is in the paper.  I'm starting to think we should have two
>> RowSimilarity jobs.  One for algebraic functions and one for those that are
>> not.
>>
>>
>> And I don't mean two completely separate, just a different mapper/reducer
>> for that phase (and likely a combiner, too).
>>
>> -Grant
>>
>>
>
