Hi,

The paper below seems interesting (SIGIR 2010); it uses LSH + Hadoop:

http://www.lsdsir.org/wp-content/uploads/2010/05/lsdsir10-2.pdf
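In case it helps the discussion: below is a minimal, self-contained sketch of the MinHash idea (the LSH family for Jaccard similarity that Gökhan mentions further down in the thread). It is plain Java with illustrative names, not Mahout's code or the paper's:

import java.util.Arrays;
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

/**
 * Minimal MinHash sketch: estimates the Jaccard similarity between two item
 * sets (e.g. the sets of users who rated two items) from short signatures,
 * without comparing the sets element by element. Illustrative only.
 */
public class MinHashExample {

  private static final int PRIME = 2147483647; // 2^31 - 1

  private final int numHashes;
  private final int[] a;
  private final int[] b;

  public MinHashExample(int numHashes, long seed) {
    this.numHashes = numHashes;
    Random random = new Random(seed);
    a = new int[numHashes];
    b = new int[numHashes];
    for (int i = 0; i < numHashes; i++) {
      a[i] = 1 + random.nextInt(PRIME - 1);
      b[i] = random.nextInt(PRIME);
    }
  }

  /** Computes the MinHash signature of a set of user ids. */
  public int[] signature(Set<Integer> userIds) {
    int[] sig = new int[numHashes];
    Arrays.fill(sig, Integer.MAX_VALUE);
    for (int userId : userIds) {
      for (int i = 0; i < numHashes; i++) {
        int h = (int) (((long) a[i] * userId + b[i]) % PRIME);
        if (h < sig[i]) {
          sig[i] = h;
        }
      }
    }
    return sig;
  }

  /** The fraction of matching signature positions approximates the Jaccard similarity. */
  public static double estimateJaccard(int[] sig1, int[] sig2) {
    int matches = 0;
    for (int i = 0; i < sig1.length; i++) {
      if (sig1[i] == sig2[i]) {
        matches++;
      }
    }
    return (double) matches / sig1.length;
  }

  public static void main(String[] args) {
    MinHashExample minHash = new MinHashExample(128, 42L);
    Set<Integer> usersOfItemA = new HashSet<Integer>(Arrays.asList(1, 2, 3, 4, 5, 6));
    Set<Integer> usersOfItemB = new HashSet<Integer>(Arrays.asList(4, 5, 6, 7, 8));
    double estimate = estimateJaccard(minHash.signature(usersOfItemA),
                                      minHash.signature(usersOfItemB));
    System.out.println("estimated Jaccard ~ " + estimate); // true Jaccard is 3/8
  }
}

Each signature position agrees between two sets with probability equal to their Jaccard similarity, so the fraction of matching positions is an unbiased estimate of it, and candidate pairs can be generated by grouping items that share a few signature positions instead of comparing all pairs.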
Alex

On Thu, Aug 12, 2010 at 3:32 PM, Gökhan Çapan <[email protected]> wrote:

> Hi Sebastian,
>
> I remember the birth of this "co-occurrence based distributed
> recommendation" idea, I was there :). That's why I know the general
> concepts behind the idea. However, I haven't looked at the Mahout code
> lately, and I don't know the details of how the co-occurrence computation
> is implemented. Choosing either the pairs or the stripes approach is a
> design choice when implementing a co-occurrence computation (and other
> similar ones). Although the two differ only slightly in implementation,
> the size of the intermediate data and the time complexity of the
> algorithms are very different. If the problem is the size of the
> intermediate data produced while computing the co-occurrence matrix, it's
> worth trying the "other" approach.
>
> In addition, MinHashing may be another option to produce all candidate
> similar pairs according to Jaccard similarity without examining all
> pairs. It can be implemented approximately using MapReduce. Actually, I
> have implemented one to detect near-duplicate documents in a large
> document set, as well as to find similar items according to users' binary
> ratings, and it yields very good results in terms of both accuracy and
> performance. This method is also part of Google's news recommendation
> engine.
>
> On Thu, Aug 12, 2010 at 9:11 PM, Sebastian Schelter
> <[email protected]> wrote:
>
> > Hi Gökhan,
> >
> > I had a quick look at the paper and the "stripes" approach looks
> > interesting, though I'm not sure whether we can apply it to the
> > RowSimilarityJob.
> >
> > I think when we want to improve the performance of the
> > ItemSimilarityJob we should go another path that follows some thoughts
> > by Ted Dunning. IIRC he said that you don't really learn anything new
> > about an item after seeing a certain number of preferences, and thus it
> > should be sufficient to look at a fixed maximum number of them per
> > item. RecommenderJob already limits the number of preferences
> > considered per item with MaybePruneRowsMapper, and ItemSimilarityJob
> > should adopt that option too.
> >
> > The algorithm is based on the paper "Elsayed et al: Pairwise Document
> > Similarity in Large Collections with MapReduce"
> > (http://www.umiacs.umd.edu/~jimmylin/publications/Elsayed_etal_ACL2008_short.pdf),
> > which seems to use the "pairs" approach described in your paper, as far
> > as I can judge. IIRC the authors of the paper remove frequently
> > occurring words to achieve a linear runtime, which would correspond to
> > the "maxNumberOfPreferencesPerItem" approach MAHOUT-460 suggests.
> >
> > --sebastian
> >
> > Am 12.08.2010 16:45, schrieb Gökhan Çapan:
> > > Hi,
> > > I haven't seen the code, but maybe Mahout needs some optimization
> > > while computing item-item co-occurrences. It could be re-implemented
> > > using the "stripes" approach with in-mapper combining if it is not
> > > already. The approach is described in:
> > >
> > > 1. www.aclweb.org/anthology/D/D08/D08-1044.pdf
> > >
> > > If it already is, sorry for the post.
> > >
> > > On Thu, Aug 12, 2010 at 3:51 PM, Charly Lizarralde
> > > <[email protected]> wrote:
> > >
> > >> Sebastian, thanks for the reply. The step name is
> > >> RowSimilarityJob-CooccurrencesMapper-SimilarityReducer, and each map
> > >> task takes around 10 hours to finish.
> > >>
> > >> The reduce task dir
> > >> (var/lib/hadoop-0.20/cache/hadoop/mapred/local/taskTracker/jobcache/job_201008111833_0007/attempt_201008111833_0007_r_000000_0/output)
> > >> contains map output files (files like map_2.out), and each one is
> > >> 5 GB in size.
> > >>
> > >> I have been looking at the code and saw what you describe in the
> > >> e-mail; it makes sense. But 160 GB of intermediate data from a
> > >> 2.6 GB input file still makes me wonder whether something is wrong.
> > >>
> > >> Should I just wait for the patch?
> > >> Thanks again!
> > >> Charly
> > >>
> > >> On Thu, Aug 12, 2010 at 2:34 AM, Sebastian Schelter
> > >> <[email protected]> wrote:
> > >>
> > >>> Hi Charly,
> > >>>
> > >>> can you tell which Map/Reduce step was executed last before you ran
> > >>> out of disk space?
> > >>>
> > >>> I'm not familiar with the Netflix dataset and can only guess what
> > >>> happened, but I would say that you ran out of disk space because
> > >>> ItemSimilarityJob currently uses all preferences to compute the
> > >>> similarities. This makes it scale quadratically in the number of
> > >>> occurrences of the most popular item, which is a bad thing if that
> > >>> number is huge. We need a way to limit the number of preferences
> > >>> considered per item; there is already a ticket for this
> > >>> (https://issues.apache.org/jira/browse/MAHOUT-460) and I plan to
> > >>> provide a patch in the next few days.
> > >>>
> > >>> --sebastian
> > >>>
> > >>> Am 12.08.2010 00:15, schrieb Charly Lizarralde:
> > >>>> Hi, I am testing ItemSimilarityJob with the Netflix data (2.6 GB)
> > >>>> and I have just run out of disk space (160 GB) in my
> > >>>> mapred.local.dir when running RowSimilarityJob.
> > >>>>
> > >>>> Is this normal behaviour? How can I improve this?
> > >>>>
> > >>>> Thanks!
> > >>>> Charly
>
> --
> Gökhan Çapan
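To make the pairs/stripes distinction discussed above concrete, here is a rough sketch of a "stripes" mapper with in-mapper combining for item co-occurrence counting, in the spirit of the paper Gökhan cites. This is illustrative only, not Mahout's RowSimilarityJob; it assumes one input line per user of the form "userId<TAB>item1,item2,...":

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 * Illustrative "stripes" mapper with in-mapper combining for item
 * co-occurrence counting. Not the Mahout implementation; assumes one input
 * line per user of the form "userId<TAB>item1,item2,...".
 */
public class CooccurrenceStripesMapper
    extends Mapper<LongWritable, Text, IntWritable, MapWritable> {

  // stripes[i][j] = number of users seen by this task who have both item i
  // and item j, accumulated across all map() calls (in-mapper combining)
  private final Map<Integer, Map<Integer, Integer>> stripes =
      new HashMap<Integer, Map<Integer, Integer>>();

  @Override
  protected void map(LongWritable key, Text value, Context ctx) {
    String[] parts = value.toString().split("\t");
    if (parts.length < 2) {
      return;
    }
    String[] itemTokens = parts[1].split(",");
    int[] items = new int[itemTokens.length];
    for (int i = 0; i < itemTokens.length; i++) {
      items[i] = Integer.parseInt(itemTokens[i].trim());
    }
    for (int i = 0; i < items.length; i++) {
      Map<Integer, Integer> stripe = stripes.get(items[i]);
      if (stripe == null) {
        stripe = new HashMap<Integer, Integer>();
        stripes.put(items[i], stripe);
      }
      for (int j = 0; j < items.length; j++) {
        if (i != j) {
          Integer count = stripe.get(items[j]);
          stripe.put(items[j], count == null ? 1 : count + 1);
        }
      }
    }
  }

  @Override
  protected void cleanup(Context ctx) throws IOException, InterruptedException {
    // emit one stripe per item; reducers sum the stripes element-wise per item
    for (Map.Entry<Integer, Map<Integer, Integer>> e : stripes.entrySet()) {
      MapWritable stripe = new MapWritable();
      for (Map.Entry<Integer, Integer> c : e.getValue().entrySet()) {
        stripe.put(new IntWritable(c.getKey()), new IntWritable(c.getValue()));
      }
      ctx.write(new IntWritable(e.getKey()), stripe);
    }
  }
}

Instead of emitting one ((itemA, itemB), 1) record per co-occurrence as the pairs approach does, each mapper emits one associative array per item and the reducer sums the stripes element-wise, which shrinks the intermediate data considerably. The trade-off is that the accumulated stripes must fit into mapper memory, which is one more reason why pruning the most frequent items (as MAHOUT-460 proposes) matters for either approach.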
