Sebastian, thanks for the reply. The step name is: RowSimilarityJob-CooccurrencesMapper-SimilarityReducer, and each map task takes around 10 hours to finish.
The reduce task dir (var/lib/hadoop-0.20/cache/hadoop/mapred/local/taskTracker/jobcache/job_201008111833_0007/attempt_201008111833_0007_r_000000_0/output) contains map output files (like map_2.out), and each one is 5 GB in size. I have been looking at the code and saw what you describe in the e-mail. It makes sense. But 160 GB of intermediate data from a 2.6 GB input file still makes me wonder if something is wrong. Should I just wait for the patch?

Thanks again!
Charly

On Thu, Aug 12, 2010 at 2:34 AM, Sebastian Schelter <[email protected]> wrote:
> Hi Charly,
>
> can you tell which Map/Reduce step was executed last before you ran out
> of disk space?
>
> I'm not familiar with the Netflix dataset and can only guess what
> happened, but I would say that you ran out of disk space because
> ItemSimilarityJob currently uses all preferences to compute the
> similarities. This makes it scale in the square of the number of
> occurrences of the most popular item, which is a bad thing if that
> number is huge. We need a way to limit the number of preferences
> considered per item; there is already a ticket for this
> (https://issues.apache.org/jira/browse/MAHOUT-460) and I plan to provide
> a patch in the next days.
>
> --sebastian
>
> Am 12.08.2010 00:15, schrieb Charly Lizarralde:
> > Hi, I am testing ItemSimilarityJob with Netflix data (2.6 GB) and I have
> > just run out of disk space (160 GB) in my mapred.local.dir when running
> > RowSimilarityJob.
> >
> > Is this normal behaviour? How can I improve this?
> >
> > Thanks!
> > Charly
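For anyone following along: the quadratic blow-up Sebastian describes is easy to reproduce with a back-of-envelope estimate. This is not Mahout code, just a hypothetical sketch of the counting argument — a co-occurrence pass emits one pair for every combination of two preferences that occur together, so a group of n co-occurring preferences contributes n*(n-1)/2 pairs, and a cap on preferences (in the spirit of MAHOUT-460; the parameter name here is made up) bounds the intermediate data:

```python
# Back-of-envelope estimate of emitted co-occurrence pairs (NOT Mahout code).
# A group of n co-occurring preferences yields n*(n-1)/2 item-item pairs,
# so intermediate data grows quadratically with the largest group.

def pairs_emitted(prefs_per_group, cap=None):
    """Total pairs emitted across all groups.

    prefs_per_group: preference count per group (e.g. per user).
    cap: hypothetical limit on preferences considered per group,
         analogous to what MAHOUT-460 proposes.
    """
    total = 0
    for n in prefs_per_group:
        if cap is not None:
            n = min(n, cap)
        total += n * (n - 1) // 2
    return total

# 1000 groups of 200 preferences each:
print(pairs_emitted([200] * 1000))          # 19,900,000 pairs uncapped
print(pairs_emitted([200] * 1000, cap=50))  # 1,225,000 pairs with a cap of 50
```

With realistic Netflix-sized groups, a roughly 16x reduction like the one above is the kind of saving a per-item preference limit would buy, which matches why a 2.6 GB input can explode into 160 GB of intermediate output without one.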
