Hi Ted, I have seen some benchmark results comparing different versions of the co-occurrence computation; I will share them if I can find them, today or tomorrow.
On Thu, Aug 12, 2010 at 10:30 PM, Ted Dunning <[email protected]> wrote:
> Jimmy Lin's stripes work was presented at the last Summit and there was
> heated (well, warm and cordial at least) discussion with the Map-reduce
> committers about whether good use of a combiner wouldn't do just as well.
>
> My take-away as a spectator is that a combiner was
>
> a) vastly easier to code
>
> b) would be pretty certain to be within 2x as performant and likely very
> close to the same speed
>
> c) would not need changing each time the underlying map-reduce changed
>
> My conclusion was that combiners were the way to go (for me). Your mileage,
> as always, will vary.
>
> On Thu, Aug 12, 2010 at 7:45 AM, Gökhan Çapan <[email protected]> wrote:
>
> > Hi,
> > I haven't seen the code, but maybe Mahout needs some optimization while
> > computing item-item co-occurrences. It could be re-implemented using the
> > "stripes" approach with in-mapper combining, if it is not already. The
> > approach is described in:
> >
> > 1. www.aclweb.org/anthology/D/D08/D08-1044.pdf
> >
> > If it already is, sorry for the post.
> >
> > On Thu, Aug 12, 2010 at 3:51 PM, Charly Lizarralde <
> > [email protected]> wrote:
> >
> > > Sebastian, thanks for the reply. The step name is
> > > RowSimilarityJob-CooccurrencesMapper-SimilarityReducer, and each map
> > > task takes around 10 hours to finish.
> > >
> > > The reduce task dir
> > > (var/lib/hadoop-0.20/cache/hadoop/mapred/local/taskTracker/jobcache/job_201008111833_0007/attempt_201008111833_0007_r_000000_0/output)
> > > has map output files (files like map_2.out), and each one is 5 GB in size.
> > >
> > > I have been looking at the code and saw what you describe in the e-mail.
> > > It makes sense. But 160 GB of intermediate data from a 2.6 GB input file
> > > still makes me wonder if something is wrong.
> > >
> > > Should I just wait for the patch?
> > > Thanks again!
> > > Charly
> > >
> > > On Thu, Aug 12, 2010 at 2:34 AM, Sebastian Schelter <
> > > [email protected]> wrote:
> > >
> > > > Hi Charly,
> > > >
> > > > can you tell which Map/Reduce step was executed last before you ran out
> > > > of disk space?
> > > >
> > > > I'm not familiar with the Netflix dataset and can only guess what
> > > > happened, but I would say that you ran out of disk space because
> > > > ItemSimilarityJob currently uses all preferences to compute the
> > > > similarities. This makes it scale with the square of the number of
> > > > occurrences of the most popular item, which is a bad thing if that
> > > > number is huge. We need a way to limit the number of preferences
> > > > considered per item; there is already a ticket for this
> > > > (https://issues.apache.org/jira/browse/MAHOUT-460) and I plan to
> > > > provide a patch in the next few days.
> > > >
> > > > --sebastian
> > > >
> > > > Am 12.08.2010 00:15, schrieb Charly Lizarralde:
> > > > > Hi, I am testing ItemSimilarityJob with the Netflix data (2.6 GB) and
> > > > > I have just run out of disk space (160 GB) in my mapred.local.dir
> > > > > when running RowSimilarityJob.
> > > > >
> > > > > Is this normal behaviour? How can I improve this?
> > > > >
> > > > > Thanks!
> > > > > Charly
> >
> > --
> > Gökhan Çapan

--
Gökhan Çapan
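For readers following along, the "stripes" approach with in-mapper combining that Gökhan refers to can be sketched in a few lines. This is an illustrative Python simulation only, not Mahout's actual Java implementation: each item maps to a "stripe" (an associative array of co-occurring item counts), and the mapper accumulates stripes in memory across all of its input records, emitting each stripe once at close instead of emitting one key-value pair per co-occurrence.

```python
from collections import defaultdict, Counter

def map_with_in_mapper_combining(baskets):
    """Mapper with in-mapper combining: accumulate one stripe
    (a Counter of co-occurring items) per item across all input
    records, and emit each stripe once at mapper close, rather
    than emitting a pair for every single co-occurrence."""
    stripes = defaultdict(Counter)  # item -> {other_item: count}
    for basket in baskets:
        for a in basket:
            for b in basket:
                if a != b:
                    stripes[a][b] += 1
    # simulate the mapper's close()/cleanup(): emit accumulated stripes
    return list(stripes.items())

def reduce_stripes(emitted):
    """Reducer: element-wise sum of all stripes for the same item."""
    merged = defaultdict(Counter)
    for item, stripe in emitted:
        merged[item].update(stripe)
    return merged

# toy data: each "basket" is the set of items seen together
baskets = [["a", "b", "c"], ["a", "b"], ["b", "c"]]
counts = reduce_stripes(map_with_in_mapper_combining(baskets))
# counts["a"]["b"] == 2, counts["b"]["c"] == 2, counts["a"]["c"] == 1
```

In a real Hadoop job the accumulation would live in the Mapper instance and the emission in its close()/cleanup() method; the trade-off Ted describes is that a plain combiner achieves much of the same network savings without holding state in the mapper.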
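Sebastian's scaling argument can be made concrete with some back-of-the-envelope arithmetic (illustrative numbers, not the actual Netflix statistics): if an item has n preferences, a pairwise approach that considers every pair of those preferences emits on the order of n*(n-1)/2 intermediate records for that item alone, which is why a very popular item dominates the intermediate data volume and why capping preferences per item (MAHOUT-460) helps.

```python
def pairs_emitted(n_prefs):
    """Number of unordered pairs generated from n preferences
    of a single item: n choose 2."""
    return n_prefs * (n_prefs - 1) // 2

# A hypothetical item with 100,000 preferences alone produces
# ~5 billion intermediate pairs, dwarfing items with modest counts.
for n in (100, 10_000, 100_000):
    print(n, pairs_emitted(n))
```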
