I really must read the papers in order to keep commenting on this thread... if I get the chance to dive into the Job "internals", maybe I can try to write an optimization.
For now, I will buy another disk and retry the tests. Thanks for the replies!
Charly

On Thu, Aug 12, 2010 at 5:12 PM, Ted Dunning <[email protected]> wrote:

> I am sympathetic with the goals of stripes and have not analyzed the
> situation in detail. Instead, I am simply reporting that at least one guy
> with very deep knowledge of the Hadoop map-reduce framework felt that
> similar results could be achieved without quite so much fanciness.
>
> On Thu, Aug 12, 2010 at 1:06 PM, Gökhan Çapan <[email protected]> wrote:
>
> > Hi Ted,
> > The combining phase is also applicable to the stripes approach (as is
> > in-mapper combining). The experiment I remember is from the paper;
> > according to it, though, the stripes approach produces larger
> > intermediate key-value pairs before the combiners. Below is from the
> > paper:
> >
> > "Results demonstrate that the stripes approach is far more efficient
> > than the pairs approach: 666 seconds (11m 6s) compared to 3758 seconds
> > (62m 38s) for the entire APW sub-corpus (improvement by a factor of
> > 5.7). On the entire sub-corpus, the mappers in the pairs approach
> > generated 2.6 billion intermediate key-value pairs totaling 31.2 GB.
> > After the combiners, this was reduced to 1.1 billion key-value pairs,
> > which roughly quantifies the amount of data involved in the shuffling
> > and sorting of the keys. On the other hand, the mappers in the stripes
> > approach generated 653 million intermediate key-value pairs totaling
> > 48.1 GB; after the combiners, only 28.8 million key-value pairs were
> > left. The stripes approach provides more opportunities for combiners
> > to aggregate intermediate results, thus greatly reducing network
> > traffic in the sort and shuffle phase."
> >
> > On Thu, Aug 12, 2010 at 10:34 PM, Gökhan Çapan <[email protected]> wrote:
> >
> > > Hi Ted,
> > > I have seen some benchmark results between different versions of the
> > > co-occurrence computation; I will share them if I can find them,
> > > today or tomorrow.
> > >
> > > On Thu, Aug 12, 2010 at 10:30 PM, Ted Dunning <[email protected]> wrote:
> > >
> > > > Jimmy Lin's stripes work was presented at the last Summit and there
> > > > was heated (well, warm and cordial at least) discussion with the
> > > > Map-reduce committers about whether good use of a combiner wouldn't
> > > > do just as well.
> > > >
> > > > My take-away as a spectator was that a combiner:
> > > >
> > > > a) was vastly easier to code
> > > >
> > > > b) would be pretty certain to be within 2x as performant and likely
> > > > very close to the same speed
> > > >
> > > > c) would not need changing each time the underlying map-reduce
> > > > changed
> > > >
> > > > My conclusion was that combiners were the way to go (for me). Your
> > > > mileage, as always, will vary.
> > > >
> > > > On Thu, Aug 12, 2010 at 7:45 AM, Gökhan Çapan <[email protected]> wrote:
> > > >
> > > > > Hi,
> > > > > I haven't seen the code, but maybe Mahout needs some optimization
> > > > > while computing item-item co-occurrences. It could be
> > > > > re-implemented using the "stripes" approach with in-mapper
> > > > > combining, if it is not already. The approach can be found at:
> > > > >
> > > > > 1. www.aclweb.org/anthology/D/D08/D08-1044.pdf
> > > > >
> > > > > If it already is, sorry for the post.
> > > > >
> > > > > On Thu, Aug 12, 2010 at 3:51 PM, Charly Lizarralde <
> > > > > [email protected]> wrote:
> > > > >
> > > > > > Sebastian, thanks for the reply. The step name is
> > > > > > RowSimilarityJob-CooccurrencesMapper-SimilarityReducer, and
> > > > > > each map task takes around 10 hours to finish.
> > > > > >
> > > > > > The reduce task dir
> > > > > > (var/lib/hadoop-0.20/cache/hadoop/mapred/local/taskTracker/jobcache/job_201008111833_0007/attempt_201008111833_0007_r_000000_0/output)
> > > > > > has map output files (files like map_2.out) and each one is
> > > > > > 5 GB in size.
> > > > > >
> > > > > > I have been looking at the code and saw what you describe in
> > > > > > the e-mail. It makes sense. But 160 GB of intermediate data
> > > > > > from a 2.6 GB input file still makes me wonder if something is
> > > > > > wrong.
> > > > > >
> > > > > > Should I just wait for the patch?
> > > > > > Thanks again!
> > > > > > Charly
> > > > > >
> > > > > > On Thu, Aug 12, 2010 at 2:34 AM, Sebastian Schelter <
> > > > > > [email protected]> wrote:
> > > > > >
> > > > > > > Hi Charly,
> > > > > > >
> > > > > > > can you tell which Map/Reduce step was executed last before
> > > > > > > you ran out of disk space?
> > > > > > >
> > > > > > > I'm not familiar with the Netflix dataset and can only guess
> > > > > > > what happened, but I would say that you ran out of disk space
> > > > > > > because ItemSimilarityJob currently uses all preferences to
> > > > > > > compute the similarities. This makes it scale with the square
> > > > > > > of the number of occurrences of the most popular item, which
> > > > > > > is a bad thing if that number is huge. We need a way to limit
> > > > > > > the number of preferences considered per item; there is
> > > > > > > already a ticket for this
> > > > > > > (https://issues.apache.org/jira/browse/MAHOUT-460) and I plan
> > > > > > > to provide a patch in the next few days.
> > > > > > >
> > > > > > > --sebastian
> > > > > > >
> > > > > > > On 12.08.2010 00:15, Charly Lizarralde wrote:
> > > > > > > > Hi, I am testing ItemSimilarityJob with Netflix data
> > > > > > > > (2.6 GB) and I have just run out of disk space (160 GB) in
> > > > > > > > my mapred.local.dir when running RowSimilarityJob.
> > > > > > > >
> > > > > > > > Is this normal behaviour? How can I improve this?
> > > > > > > >
> > > > > > > > Thanks!
> > > > > > > > Charly
> > > > >
> > > > > --
> > > > > Gökhan Çapan
> > >
> > > --
> > > Gökhan Çapan
> >
> > --
> > Gökhan Çapan
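
For concreteness, here is a minimal sketch of the "stripes" pattern with in-mapper combining that the paper above describes, written against the Hadoop 0.20 mapreduce API. The input format (one basket of co-occurring item IDs per line) and all class and field names are assumptions for illustration; this is not Mahout's actual RowSimilarityJob code.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits one "stripe" (a partial co-occurrence row) per item. Counts are
// accumulated in memory across map() calls and flushed once in cleanup(),
// which is the in-mapper combining part.
public class StripesCooccurrenceMapper
    extends Mapper<LongWritable, Text, Text, MapWritable> {

  private final Map<String, Map<String, Integer>> stripes =
      new HashMap<String, Map<String, Integer>>();

  @Override
  protected void map(LongWritable key, Text value, Context ctx) {
    String[] items = value.toString().trim().split("\\s+");
    for (String a : items) {
      Map<String, Integer> stripe = stripes.get(a);
      if (stripe == null) {
        stripe = new HashMap<String, Integer>();
        stripes.put(a, stripe);
      }
      for (String b : items) {
        if (!a.equals(b)) {
          Integer count = stripe.get(b);
          stripe.put(b, count == null ? 1 : count + 1);
        }
      }
    }
  }

  @Override
  protected void cleanup(Context ctx) throws IOException, InterruptedException {
    // One output record per item, no matter how many input lines mentioned it.
    for (Map.Entry<String, Map<String, Integer>> e : stripes.entrySet()) {
      MapWritable stripe = new MapWritable();
      for (Map.Entry<String, Integer> c : e.getValue().entrySet()) {
        stripe.put(new Text(c.getKey()), new IntWritable(c.getValue()));
      }
      ctx.write(new Text(e.getKey()), stripe);
    }
  }
}

A reducer (and optionally a combiner) would then element-wise sum the stripes for each item. The cost is that the accumulated stripes have to fit in the map task's heap.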
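For comparison, a sketch of the plain "pairs" formulation that leans on a combiner, roughly the alternative Ted describes. Same hypothetical input format; again, the names are illustrative only.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class PairsCooccurrence {

  // Emits one ((itemA, itemB), 1) record per observed co-occurrence.
  public static class PairsMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text pair = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      String[] items = value.toString().trim().split("\\s+");
      for (String a : items) {
        for (String b : items) {
          if (!a.equals(b)) {
            pair.set(a + '\t' + b);
            ctx.write(pair, ONE);
          }
        }
      }
    }
  }

  // Used as both combiner and reducer; the local aggregation it performs on
  // each map task's output is what keeps the shuffle size in check.
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      ctx.write(key, new IntWritable(sum));
    }
  }
}

In the driver, job.setCombinerClass(SumReducer.class) and job.setReducerClass(SumReducer.class) wire it together; that single setCombinerClass call is the "good use of a combiner" the discussion refers to.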
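Finally, on the 160 GB itself: Sebastian's point is that the intermediate data grows with the square of the most popular item's preference count. A cap of the kind MAHOUT-460 asks for might look roughly like the helper below. This is only a guess at the idea, with made-up names and threshold, not the actual patch.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public final class PreferenceSampling {

  private PreferenceSampling() {}

  // Returns at most maxPrefs of the given preferences, chosen uniformly at
  // random, so a very popular item contributes at most maxPrefs^2 pairs
  // instead of the square of its full preference count.
  public static <T> List<T> capPreferences(List<T> prefs, int maxPrefs, Random random) {
    if (prefs.size() <= maxPrefs) {
      return prefs;
    }
    List<T> sampled = new ArrayList<T>(prefs);
    Collections.shuffle(sampled, random);
    return sampled.subList(0, maxPrefs);
  }
}

With a cap of, say, 500 preferences per item, the head items in a dataset like Netflix no longer dominate the intermediate output, which is presumably where most of the disk usage comes from.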
