Surprisingly, both forms are about the same in difficulty for co-occurrence. Firstly, the two forms are only a single map-reduce apart. Secondly, both forms are likely the output of a log analysis where the raw input is more likely to be user/item pairs. From that form, co-occurrence counting is most easily done by reducing on user, emitting all pairs of items and then counting in the usual way.
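
To make the "reduce on user, emit all pairs, count" step concrete, here is a minimal single-machine sketch in Python. It is not an actual map-reduce job; the function name `cooccurrence_counts` and the sample data are made up for illustration, and it assumes each pair is unordered and each user/item pair is counted once.

```python
from collections import Counter, defaultdict
from itertools import combinations


def cooccurrence_counts(pairs):
    """Count how many users each unordered item pair has in common.

    `pairs` is an iterable of (user, item) tuples, as might come out
    of a log analysis. Hypothetical helper, for illustration only.
    """
    # Stand-in for the shuffle: group the items seen by each user.
    items_by_user = defaultdict(set)
    for user, item in pairs:
        items_by_user[user].add(item)

    # Stand-in for the reducer keyed on user: emit every pair of that
    # user's items, then count the emitted pairs.
    counts = Counter()
    for items in items_by_user.values():
        counts.update(combinations(sorted(items), 2))
    return counts


pairs = [
    ("u1", "a"), ("u1", "b"), ("u1", "c"),
    ("u2", "a"), ("u2", "b"),
    ("u3", "b"), ("u3", "c"),
]
counts = cooccurrence_counts(pairs)
```

Sorting each user's items before pairing means ("a", "b") and ("b", "a") land on the same key, so a single count per unordered pair falls out without a second pass.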
But with very large data sets, even before doing the actual co-occurrence, it is commonly advisable to reduce to item-major form and down-sample the users associated with the most common items. This is similar to the row and column normalization done in singular value techniques, but is applied to the original data. Map-reduce is pretty impressive though; sampling is not necessary except for the largest data sets on the smallest clusters.

The biggest surprise I have had in using this sort of data reduction is that simply emitting all of the item pairs is pretty danged efficient. There are clever things to do to avoid so much data motion, but they save surprisingly little and are much more complex to implement (correctly).

On Tue, Jan 20, 2009 at 4:26 AM, Sean Owen <[email protected]> wrote:

> how can I tell when a line specifies the opposite, item
> followed by user IDs? the former is easier, BTW.
>

--
Ted Dunning, CTO
DeepDyve
4600 Bohannon Drive, Suite 220
Menlo Park, CA 94025
www.deepdyve.com
650-324-0110, ext. 738
858-414-0013 (m)
