The "pairs" approach, in which the mapper emits each co-occurring pair as a
key-value pair, is I/O heavy but can run on clusters with little memory per
node.
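
To make the trade-off concrete, here is a minimal in-memory sketch of the
"pairs" idea in plain Python. It is not Mahout or Hadoop code; the names
pairs_mapper and sum_reducer and the toy rows are made up for illustration,
and each row is assumed to list distinct terms.

```python
from collections import defaultdict

def pairs_mapper(row):
    """Emit ((a, b), 1) for every ordered pair of distinct terms in one row."""
    for a in row:
        for b in row:
            if a != b:
                yield (a, b), 1

def sum_reducer(records):
    """Sum the counts per key, as a reducer (or combiner) would."""
    totals = defaultdict(int)
    for key, count in records:
        totals[key] += count
    return dict(totals)

# Toy input: each row is the list of terms that co-occur.
rows = [["a", "b", "c"], ["a", "b"]]

# Every pair is a separate record, so the shuffle volume grows
# quadratically with row length, but the mapper holds almost no state.
emitted = [record for row in rows for record in pairs_mapper(row)]
pair_counts = sum_reducer(emitted)
```

The mapper keeps nothing in memory beyond the current row, which is why this
variant suits nodes with little memory, at the cost of many small records.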

Jimmy Lin describes another design pattern, called "stripes", in which the
co-occurrences for each term are aggregated into an associative array. The
mapper emits the term as the key and the term's aggregated associative
array as the value.
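
A rough in-memory sketch of the "stripes" idea, again in plain Python rather
than Mahout or Hadoop code. The names stripes_mapper and stripes_reducer and
the toy rows are hypothetical, and each row is assumed to list distinct terms.

```python
from collections import Counter, defaultdict

def stripes_mapper(row):
    """Emit (term, stripe), where stripe maps each co-occurring term to a count."""
    for a in row:
        stripe = Counter(b for b in row if b != a)
        if stripe:
            yield a, stripe

def stripes_reducer(records):
    """Merge all stripes for the same term by element-wise addition."""
    merged = defaultdict(Counter)
    for term, stripe in records:
        merged[term].update(stripe)  # Counter.update adds counts
    return {term: dict(stripe) for term, stripe in merged.items()}

# Toy input: each row is the list of terms that co-occur.
rows = [["a", "b", "c"], ["a", "b"]]

# One record per term per row; each record carries a whole associative array.
emitted = [record for row in rows for record in stripes_mapper(row)]
stripes = stripes_reducer(emitted)
```

For these two rows the mapper emits 5 records, where emitting every ordered
pair separately would emit 8. Each record is larger, though, and a stripe has
to fit in memory on the node that builds or merges it.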

The "stripes" approach is lighter on network I/O but uses more memory per
node because the hashmaps can get quite big.

Can the RowSimilarityJob be redesigned to use the "stripes" design pattern
as shown in the following paper by Jimmy Lin?

http://www.aclweb.org/anthology-new/D/D08/D08-1044.pdf



On Mon, Jul 18, 2011 at 3:58 PM, Grant Ingersoll <[email protected]> wrote:

>
> On Jul 18, 2011, at 3:52 PM, Grant Ingersoll wrote:
> >
> > I think the big win is to not construe this implementation to be based on
> what is in the paper.  I'm starting to think we should have two
> RowSimilarity jobs.  One for algebraic functions and one for those who are
> not.
>
>
> And I don't mean two completely separate, just a different mapper/reducer
> for that phase (and likely a combiner, too).
>
> -Grant
>
>
