Hi Gökhan,

I had a quick look at the paper, and the "stripes" approach looks interesting, though I'm not sure whether we can apply it to the RowSimilarityJob.
I think when we want to improve the performance of the ItemSimilarityJob, we should go another path that follows some thoughts by Ted Dunning. IIRC he said that you don't really learn anything new about an item after seeing a certain number of preferences, and thus it should be sufficient to look at a fixed maximum number of them per item. RecommenderJob already limits the number of preferences considered per item with MaybePruneRowsMapper, and ItemSimilarityJob should adopt that option too.

The algorithm is based on the paper "Elsayed et al.: Pairwise Document Similarity in Large Collections with MapReduce" (http://www.umiacs.umd.edu/~jimmylin/publications/Elsayed_etal_ACL2008_short.pdf), which seems to use the "pairs" approach described in your paper, as far as I can judge. IIRC the authors of that paper remove frequently occurring words to achieve a linear runtime, which would correspond to the "maxNumberOfPreferencesPerItem" approach that MAHOUT-460 suggests.

--sebastian

On 12.08.2010 16:45, Gökhan Çapan wrote:
> Hi,
> I haven't seen the code, but maybe Mahout needs some optimization while
> computing item-item co-occurrences. It may be re-implemented using the
> "stripes" approach with in-mapper combining if it is not. It can be found at:
>
> 1. www.aclweb.org/anthology/D/D08/D08-1044.pdf
>
> If it already is, sorry for the post.
>
> On Thu, Aug 12, 2010 at 3:51 PM, Charly Lizarralde <
> [email protected]> wrote:
>
>> Sebastian, thanks for the reply. The step name is
>> RowSimilarityJob-CooccurrencesMapper-SimilarityReducer, and each map task
>> takes around 10 hours to finish.
>>
>> The reduce task dir
>> (var/lib/hadoop-0.20/cache/hadoop/mapred/local/taskTracker/jobcache/job_201008111833_0007/attempt_201008111833_0007_r_000000_0/output)
>> has map output files (files like map_2.out), and each one is 5 GB in size.
>>
>> I have been looking at the code and saw what you describe in the e-mail. It
>> makes sense.
>> Still, 160 GB of intermediate data from a 2.6 GB input file makes me
>> wonder if something is wrong.
>>
>> Should I just wait for the patch?
>> Thanks again!
>> Charly
>>
>> On Thu, Aug 12, 2010 at 2:34 AM, Sebastian Schelter <
>> [email protected]> wrote:
>>
>>> Hi Charly,
>>>
>>> can you tell which Map/Reduce step was executed last before you ran out
>>> of disk space?
>>>
>>> I'm not familiar with the Netflix dataset and can only guess what
>>> happened, but I would say that you ran out of disk space because
>>> ItemSimilarityJob currently uses all preferences to compute the
>>> similarities. This makes it scale in the square of the number of
>>> occurrences of the most popular item, which is a bad thing if that
>>> number is huge. We need a way to limit the number of preferences
>>> considered per item; there is already a ticket for this
>>> (https://issues.apache.org/jira/browse/MAHOUT-460), and I plan to provide
>>> a patch in the next few days.
>>>
>>> --sebastian
>>>
>>> On 12.08.2010 00:15, Charly Lizarralde wrote:
>>>> Hi, I am testing ItemSimilarityJob with the Netflix data (2.6 GB), and I
>>>> have just run out of disk space (160 GB) in my mapred.local.dir when
>>>> running RowSimilarityJob.
>>>>
>>>> Is this normal behaviour? How can I improve this?
>>>>
>>>> Thanks!
>>>> Charly
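For anyone following the "pairs" vs. "stripes" discussion above: here is a minimal toy sketch of the two co-occurrence counting patterns from the Lin/Dyer line of work that Gökhan's link describes. This is illustrative Python, not Mahout's actual Java mappers; the function names and data are made up. It shows that both schemes compute the same counts while "stripes" with in-mapper combining emits far fewer intermediate records, which is exactly the kind of intermediate-data pressure Charly is seeing.

```python
from collections import defaultdict
from itertools import permutations

# Toy "preference" data: each basket lists the items one user interacted with.
user_baskets = [
    ["A", "B", "C"],
    ["A", "B"],
    ["B", "C"],
]

def pairs_mapper(basket):
    """'Pairs' approach: emit one ((i, j), 1) record per ordered co-occurring pair."""
    for i, j in permutations(basket, 2):
        yield (i, j), 1

def stripes_mapper(basket):
    """'Stripes' with in-mapper combining: one (i, {j: count}) stripe per item."""
    stripes = defaultdict(lambda: defaultdict(int))
    for i, j in permutations(basket, 2):
        stripes[i][j] += 1
    for i, stripe in stripes.items():
        yield i, dict(stripe)

def reduce_pairs(records):
    counts = defaultdict(int)
    for key, one in records:
        counts[key] += one
    return dict(counts)

def reduce_stripes(records):
    counts = defaultdict(int)
    for i, stripe in records:
        for j, c in stripe.items():
            counts[(i, j)] += c
    return dict(counts)

pair_records = [r for basket in user_baskets for r in pairs_mapper(basket)]
stripe_records = [r for basket in user_baskets for r in stripes_mapper(basket)]

# Same final counts either way, but "stripes" emits one record per distinct
# item per basket (7 here) instead of one per ordered pair (10 here).
assert reduce_pairs(pair_records) == reduce_stripes(stripe_records)
```

On real data the gap grows with basket size: a user who rated m items contributes m·(m−1) pair records but only m stripes.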
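To put Sebastian's "scales in the square of the number of occurrences of the most popular item" remark into concrete numbers, here is a back-of-the-envelope sketch of why capping preferences per item (the MAHOUT-460 / MaybePruneRowsMapper idea) bounds the intermediate data. The `prune_item_vector` helper and the cap of 500 are purely illustrative assumptions, not the actual patch; the real mapper operates on Mahout vectors, not Python lists.

```python
import random

def cooccurrence_pairs(n):
    """Unordered pairs an entity with n occurrences can form: n choose 2."""
    return n * (n - 1) // 2

def prune_item_vector(prefs, max_prefs_per_item, rng):
    """Down-sample a preference list to at most max_prefs_per_item entries,
    analogous in spirit to MaybePruneRowsMapper (names here are illustrative)."""
    if len(prefs) <= max_prefs_per_item:
        return list(prefs)
    return rng.sample(prefs, max_prefs_per_item)

# A very popular item rated by ~200,000 users can form ~2e10 pairs on its own:
assert cooccurrence_pairs(200_000) == 19_999_900_000

# Capped at an assumed 500 preferences, the same item is bounded and tiny:
assert cooccurrence_pairs(500) == 124_750

rng = random.Random(42)
pruned = prune_item_vector(list(range(200_000)), 500, rng)
assert len(pruned) == 500
```

The design point, following Ted Dunning's observation quoted above, is that past a few hundred preferences the extra observations add little information about an item, so sampling turns the quadratic worst case into a constant per item.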
