Surprisingly, both forms are about equally difficult for co-occurrence.
Firstly, both forms are a single map-reduce apart.  Secondly, both forms are
likely the output of a log analysis where the input form is actually more
likely user/item pairs.  From that form, co-occurrence counting is most
easily done by reducing on user, emitting all pairs of items and then
counting them in the traditional way.
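
As a rough illustration of the reduce-on-user approach, here is a minimal
in-memory sketch in Python.  It is not a real map-reduce job; the function
name cooccurrence_counts and the (user, item) tuple input are assumptions
made for the example.

```python
from collections import Counter, defaultdict
from itertools import combinations


def cooccurrence_counts(user_item_pairs):
    """Count how many users each unordered pair of items shares.

    user_item_pairs is assumed to be an iterable of (user, item) tuples,
    as might come out of log analysis.
    """
    # "Reduce on user": collect the distinct items seen by each user.
    items_by_user = defaultdict(set)
    for user, item in user_item_pairs:
        items_by_user[user].add(item)

    # Emit every unordered pair of items for each user, then count the pairs.
    counts = Counter()
    for items in items_by_user.values():
        for a, b in combinations(sorted(items), 2):
            counts[(a, b)] += 1
    return counts
```

In an actual map-reduce setting, the grouping step would be the first
reduce and the pair counting would be an ordinary word-count style job
over the emitted pairs.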

But with very large data sets, even before doing the actual co-occurrence,
it is commonly advisable to reduce to item-major form and down-sample the
users associated with the most common items.  This is similar to the row and
column normalization done in singular value techniques, but is applied to
the original data.
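
A minimal sketch of that down-sampling step, again in plain Python rather
than map-reduce.  The name downsample_popular_items, the fixed cap
max_users_per_item, and uniform random sampling are all assumptions for
illustration; other sampling schemes would fit the description equally well.

```python
import random
from collections import defaultdict


def downsample_popular_items(user_item_pairs, max_users_per_item, seed=0):
    """Keep at most max_users_per_item users for any single item.

    Input and output are (user, item) tuples.  Only items with more users
    than the cap are affected; rarer items pass through unchanged.
    """
    rng = random.Random(seed)

    # Reduce to item-major form: item -> set of users.
    users_by_item = defaultdict(set)
    for user, item in user_item_pairs:
        users_by_item[item].add(user)

    sampled = []
    for item, users in users_by_item.items():
        users = sorted(users)  # stable order so the seed gives repeatable output
        if len(users) > max_users_per_item:
            users = rng.sample(users, max_users_per_item)
        sampled.extend((user, item) for user in users)
    return sampled
```

Capping the most common items bounds the number of pairs any one user
list can contribute, which is where the quadratic blow-up in pair
emission would otherwise come from.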

Map-reduce is pretty impressive though; sampling is not necessary except for
the largest data sets on the smallest clusters.

The biggest surprise I have had in using this sort of data reduction is that
simply emitting all of the item pairs is pretty danged efficient.  There are
clever things to do to avoid so much data motion, but they save surprisingly
little and are much more complex to implement (correctly).

On Tue, Jan 20, 2009 at 4:26 AM, Sean Owen <[email protected]> wrote:

> how can I tell when a line specifies the opposite, item
> followed by user IDs? the former is easier, BTW.
>



-- 
Ted Dunning, CTO
DeepDyve
4600 Bohannon Drive, Suite 220
Menlo Park, CA 94025
www.deepdyve.com
650-324-0110, ext. 738
858-414-0013 (m)
