Sebastian, thanks for the reply. The step name is: RowSimilarityJob-CooccurrencesMapper-SimilarityReducer, and each map task takes around 10 hours to finish.
The reduce task dir (var/lib/hadoop-0.20/cache/hadoop/mapred/local/taskTracker/jobcache/job_201008111833_0007/attempt_201008111833_0007_r_000000_0/output) contains map output files (like map_2.out), and each one is 5 GB in size. I have been looking at the code and saw what you describe in the e-mail. It makes sense. But 160 GB of intermediate data from a 2.6 GB input file still makes me wonder if something is wrong. Should I just wait for the patch?

Thanks again!
Charly

On Thu, Aug 12, 2010 at 2:34 AM, Sebastian Schelter <[email protected]> wrote:
> Hi Charly,
>
> can you tell which Map/Reduce step was executed last before you ran out
> of disk space?
>
> I'm not familiar with the Netflix dataset and can only guess what
> happened, but I would say that you ran out of disk space because
> ItemSimilarityJob currently uses all preferences to compute the
> similarities. This makes it scale in the square of the number of
> occurrences of the most popular item, which is a bad thing if that
> number is huge. We need a way to limit the number of preferences
> considered per item; there is already a ticket for this
> (https://issues.apache.org/jira/browse/MAHOUT-460) and I plan to provide
> a patch in the next days.
>
> --sebastian
>
> Am 12.08.2010 00:15, schrieb Charly Lizarralde:
> > Hi, I am testing ItemSimilarityJob with Netflix data (2.6 GB) and I have
> > just run out of disk space (160 GB) in my mapred.local.dir when running
> > RowSimilarityJob.
> >
> > Is this normal behaviour? How can I improve this?
> >
> > Thanks!
> > Charly
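For anyone following along: the quadratic blow-up Sebastian describes is easy to reproduce with a back-of-envelope estimate. This is not Mahout code, just a hypothetical sketch of the counting argument — a co-occurrence pass emits one pair for every combination of two preferences that occur together, so a group of n co-occurring preferences contributes n*(n-1)/2 pairs, and a cap on preferences (in the spirit of MAHOUT-460; the parameter name here is made up) bounds the intermediate data:

```python
# Back-of-envelope estimate of emitted co-occurrence pairs (NOT Mahout code).
# A group of n co-occurring preferences yields n*(n-1)/2 item-item pairs,
# so intermediate data grows quadratically with the largest group.

def pairs_emitted(prefs_per_group, cap=None):
    """Total pairs emitted across all groups.

    prefs_per_group: preference count per group (e.g. per user).
    cap: hypothetical limit on preferences considered per group,
         analogous to what MAHOUT-460 proposes.
    """
    total = 0
    for n in prefs_per_group:
        if cap is not None:
            n = min(n, cap)
        total += n * (n - 1) // 2
    return total

# 1000 groups of 200 preferences each:
print(pairs_emitted([200] * 1000))          # 19,900,000 pairs uncapped
print(pairs_emitted([200] * 1000, cap=50))  # 1,225,000 pairs with a cap of 50
```

With realistic Netflix-sized groups, a roughly 16x reduction like the one above is the kind of saving a per-item preference limit would buy, which matches why a 2.6 GB input can explode into 160 GB of intermediate output without one.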
