Hi Ted,

The combining phase is also applicable to the stripes approach (as is in-mapper combining). The experiment I remembered is from the paper; in it, though, the stripes approach actually produced a larger volume of intermediate data before the combiners (48.1 GB in 653 million pairs, versus 31.2 GB in 2.6 billion pairs for the pairs approach). A rough sketch of a stripes mapper with in-mapper combining follows at the end of this message. Below is the relevant passage from the paper:
"*Results demonstrate that the stripes approach is* *far more efficient than the pairs approach: 666 seconds* *(11m 6s) compared to 3758 seconds (62m 38s)* *for the entire APW sub-corpus (improvement by a* *factor of 5.7). On the entire sub-corpus, the mappers* *in the pairs approach generated 2.6 billion intermediate* *key-value pairs totally 31.2 GB. After the* *combiners, this was reduced to 1.1 billion key-value* *pairs, which roughly quantifies the amount of data* *involved in the shuffling and sorting of the keys. On* *the other hand, the mappers in the stripes approach* *generated 653 million intermediate key-value pairs* *totally 48.1 GB; after the combiners, only 28.8 million* *key-value pairs were left. The stripes approach* *provides more opportunities for combiners to aggregate* *intermediate results, thus greatly reducing network* *traffic in the sort and shuffle phase.*" On Thu, Aug 12, 2010 at 10:34 PM, Gökhan Çapan <[email protected]> wrote: > Hi Ted, > > I have seen some benchmark results between different versions of > co-occurrence computation, I will share them if I can find, today or > tomorrow. > > > On Thu, Aug 12, 2010 at 10:30 PM, Ted Dunning <[email protected]>wrote: > >> Jimmy Lin's stripes work was presented at the last Summit and there was >> heated (well, warm and cordial at least) discussion with the Map-reduce >> committers about whether good use of a combiner wouldn't do just as well. >> >> My take-away as a spectator is that a combiner was >> >> a) vastly easier to code >> >> b) would be pretty certain to be within 2x as performant and likely very >> close to the same speed >> >> c) would not need changing each time the underlying map-reduce changed >> >> My conclusion was that combiners were the way to go (for me). Your >> mileage, >> as always, will vary. >> >> On Thu, Aug 12, 2010 at 7:45 AM, Gökhan Çapan <[email protected]> wrote: >> >> > Hi, >> > I haven't seen the code, but maybe Mahout needs some optimization while >> > computing item-item co-occurrences. It may be re-implemented using >> > "stripes" >> > approach using in-mapper combining if it is not. It can be found at: >> > >> > 1. www.aclweb.org/anthology/D/D08/D08-1044.pdf >> > >> > If it already is, sorry for the post. >> > >> > On Thu, Aug 12, 2010 at 3:51 PM, Charly Lizarralde < >> > [email protected]> wrote: >> > >> > > Sebastian, thank's for the reply. The step name is* >> > > :*RowSimilarityJob-CooccurrencesMapper-SimilarityReducer. and each >> > > map task >> > > takes around 10 hours to finish. >> > > >> > > Reduce task dir >> > > >> > > >> > >> (var/lib/hadoop-0.20/cache/hadoop/mapred/local/taskTracker/jobcache/job_201008111833_0007/attempt_201008111833_0007_r_000000_0/output) >> > > has map output files ( files like map_2.out) and each one is 5GB in >> size. >> > > >> > > I have been looking at the code and saw what you describe in the >> e-mail. >> > It >> > > makes sense. But still 160 GB of intermediate info from a 2.6 GB input >> > file >> > > still makes me wonder if something is wrong. >> > > >> > > Should I just wait for the patch? >> > > Thanks again! >> > > Charly >> > > >> > > On Thu, Aug 12, 2010 at 2:34 AM, Sebastian Schelter < >> > > [email protected] >> > > > wrote: >> > > >> > > > Hi Charly, >> > > > >> > > > can you tell which Map/Reduce step was executed last before you ran >> out >> > > > of disk space? 
>> > > >
>> > > > I'm not familiar with the Netflix dataset and can only guess what
>> > > > happened, but I would say that you ran out of disk space because
>> > > > ItemSimilarityJob currently uses all preferences to compute the
>> > > > similarities. This makes it scale with the square of the number of
>> > > > occurrences of the most popular item, which is a bad thing if that
>> > > > number is huge. We need a way to limit the number of preferences
>> > > > considered per item; there is already a ticket for this
>> > > > (https://issues.apache.org/jira/browse/MAHOUT-460) and I plan to
>> > > > provide a patch in the next few days.
>> > > >
>> > > > --sebastian
>> > > >
>> > > > On 12.08.2010 00:15, Charly Lizarralde wrote:
>> > > > > Hi, I am testing ItemSimilarityJob with Netflix data (2.6 GB) and
>> > > > > I have just run out of disk space (160 GB) in my mapred.local.dir
>> > > > > when running RowSimilarityJob.
>> > > > >
>> > > > > Is this normal behaviour? How can I improve this?
>> > > > >
>> > > > > Thanks!
>> > > > > Charly
>> >
>> > --
>> > Gökhan Çapan
>
> --
> Gökhan Çapan

--
Gökhan Çapan
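
For reference, here is a minimal sketch of what a stripes-style mapper with in-mapper combining could look like for item-item co-occurrence counting on Hadoop. It only illustrates the pattern from the paper; the class name and the assumption that each input record holds one user's item IDs as a whitespace-separated string are made up, and this is not Mahout's actual CooccurrencesMapper.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch only: stripes approach with in-mapper combining.
// Instead of emitting one ((itemA, itemB), 1) pair per co-occurrence,
// the mapper accumulates a per-item "stripe" (a map of co-occurring
// item -> count) in memory and emits each stripe once in cleanup().
public class StripesCooccurrenceMapper
    extends Mapper<LongWritable, Text, IntWritable, MapWritable> {

  // In-mapper combiner state: itemId -> (co-occurring itemId -> count)
  private final Map<Integer, Map<Integer, Integer>> stripes =
      new HashMap<Integer, Map<Integer, Integer>>();

  @Override
  protected void map(LongWritable key, Text value, Context context) {
    // Assumption: one user's item IDs per line, whitespace-separated.
    String line = value.toString().trim();
    if (line.isEmpty()) {
      return;
    }
    String[] tokens = line.split("\\s+");
    int[] items = new int[tokens.length];
    for (int i = 0; i < tokens.length; i++) {
      items[i] = Integer.parseInt(tokens[i]);
    }
    for (int i = 0; i < items.length; i++) {
      Map<Integer, Integer> stripe = stripes.get(items[i]);
      if (stripe == null) {
        stripe = new HashMap<Integer, Integer>();
        stripes.put(items[i], stripe);
      }
      for (int j = 0; j < items.length; j++) {
        if (i != j) {
          Integer count = stripe.get(items[j]);
          stripe.put(items[j], count == null ? 1 : count + 1);
        }
      }
    }
  }

  @Override
  protected void cleanup(Context context)
      throws IOException, InterruptedException {
    // Emit each aggregated stripe once, at the end of the map task.
    for (Map.Entry<Integer, Map<Integer, Integer>> e : stripes.entrySet()) {
      MapWritable stripe = new MapWritable();
      for (Map.Entry<Integer, Integer> c : e.getValue().entrySet()) {
        stripe.put(new IntWritable(c.getKey()), new IntWritable(c.getValue()));
      }
      context.write(new IntWritable(e.getKey()), stripe);
    }
  }
}

A real implementation would also have to flush the in-memory stripes to the output when they grow too large, and the reducer would sum the MapWritables element-wise per item; the sketch leaves both out for brevity.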

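Regarding the disk-space issue Sebastian describes in the quoted thread: the growth is quadratic, so an entity that occurs n times gives rise to roughly n * (n - 1) / 2 co-occurrence pairs; for n = 100,000 that is already about 5 billion pairs, which is how a 2.6 GB input can turn into 160 GB of intermediate data. A hypothetical down-sampling helper in the spirit of MAHOUT-460 could look like the following (samplePreferences and maxPrefsConsidered are made-up names, not the actual patch):

import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical sketch of capping the number of preferences considered
// per item (in the spirit of MAHOUT-460, but not the actual patch).
// Bounding the input size per item bounds the co-occurrence pairs per
// item by maxPrefsConsidered * (maxPrefsConsidered - 1) / 2.
public final class PreferenceSampling {

  private PreferenceSampling() {
  }

  public static <T> List<T> samplePreferences(List<T> prefs,
                                              int maxPrefsConsidered,
                                              Random random) {
    if (prefs.size() <= maxPrefsConsidered) {
      return prefs;
    }
    // Keep a uniform random sample of maxPrefsConsidered preferences.
    Collections.shuffle(prefs, random);
    return prefs.subList(0, maxPrefsConsidered);
  }
}

For example, calling samplePreferences(prefsForItem, 500, random) before the pairing step would bound the per-item output no matter how popular the item is.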