Hi Sebastian,

I remember the birth of this "co-occurrence based distributed
recommendation" idea, I was there :). That's why I know the general
concepts behind it. However, I haven't looked at the Mahout code lately,
so I don't know the details of how the co-occurrence computation is
implemented. Choosing between the "pairs" and "stripes" approaches is a
design decision when implementing a co-occurrence computation (and other
similar ones). Although the two differ only slightly in implementation
effort, the size of the intermediate data and the time complexity of the
algorithms are very different. If the problem is the size of the
intermediate data produced while computing the co-occurrence matrix, it's
worth trying the "other" approach.
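To make the difference concrete, here is a toy sketch in plain Python
(not Mahout's Java code; the basket data and function names are
hypothetical). The "pairs" variant emits one record per co-occurring
item pair, while the "stripes" variant combines counts into one
associative array per item before anything would be shuffled, which is
where the intermediate-data savings come from:

```python
from collections import defaultdict
from itertools import permutations

# Hypothetical toy input: each user's list of preferred items.
baskets = [["a", "b", "c"], ["a", "b"], ["b", "c"]]

def cooccur_pairs(baskets):
    """'Pairs' approach: emit one ((i, j), 1) record per co-occurring pair.

    The intermediate data grows with the square of the basket size, since
    every ordered pair is emitted separately.
    """
    counts = defaultdict(int)
    for basket in baskets:
        for i, j in permutations(basket, 2):
            counts[(i, j)] += 1
    return dict(counts)

def cooccur_stripes(baskets):
    """'Stripes' approach: emit one (i, {j: count}) stripe per item.

    Counts are combined in the mapper (in-mapper combining), so far fewer
    and denser records reach the shuffle phase.
    """
    stripes = defaultdict(lambda: defaultdict(int))
    for basket in baskets:
        for i, j in permutations(basket, 2):
            stripes[i][j] += 1
    return {i: dict(s) for i, s in stripes.items()}
```

Both produce the same co-occurrence matrix in the end; the difference is
only in how much intermediate data exists between the map and reduce
phases.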

In addition, MinHashing may be another option: it produces all candidate
similar pairs according to Jaccard similarity without examining all
pairs, and it can be implemented approximately using MapReduce. I have
actually implemented one, both to detect near-duplicate documents in a
large document set and to find similar items according to users' binary
ratings; it yields very good results in terms of both accuracy and
performance. This method is also part of Google's news recommendation
engine.
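For readers unfamiliar with the technique, here is a minimal MinHash
sketch (hypothetical parameter choices; this is not Mahout's, Google's,
or my implementation). Each of k universal hash functions maps every set
element to an integer; the minimum per hash function forms the
signature, and the fraction of positions where two signatures agree
estimates the Jaccard similarity:

```python
import random

# A Mersenne prime used as the hash range for the universal hash family
# h(x) = (a*x + b) mod P.
P = 2_147_483_647

def minhash_signature(items, k=128, seed=42):
    """Compute a k-value MinHash signature for a set of hashable items."""
    rng = random.Random(seed)
    coeffs = [(rng.randrange(1, P), rng.randrange(P)) for _ in range(k)]
    # For each hash function, keep the minimum hash value over the set.
    return [min((a * hash(x) + b) % P for x in items) for a, b in coeffs]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of agreeing positions estimates the Jaccard similarity."""
    matches = sum(1 for va, vb in zip(sig_a, sig_b) if va == vb)
    return matches / len(sig_a)
```

In a full system the signatures would then be banded (locality-sensitive
hashing) so that only items agreeing in at least one band become
candidate pairs, which is what avoids the all-pairs comparison.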

On Thu, Aug 12, 2010 at 9:11 PM, Sebastian Schelter <[email protected]
> wrote:

> Hi Gökhan,
>
> I had a quick look at the paper and the "stripes" approach looks
> interesting though I'm not sure whether we can apply it to the
> RowSimilarityJob.
>
> I think when we want to improve the performance of the ItemSimilarityJob
> we should go another path that follows some thoughts by Ted Dunning.
> IIRC he said that you don't really learn anything new about an item
> after seeing a certain number of preferences and thus it should be
> sufficient to look at a fixed number of them at maximum per item.
> RecommenderJob is already limiting the number of preferences considered
> per item with MaybePruneRowsMapper and ItemSimilarityJob should adapt
> that option too.
>
> The algorithm is based on the paper "Elsayed et al: Pairwise Document
> Similarity in Large Collections with MapReduce"
> (
> http://www.umiacs.umd.edu/~jimmylin/publications/Elsayed_etal_ACL2008_short.pdf
> )
> which seems to use the "pairs" approach described in your paper, as far
> as I can judge. IIRC the authors of the paper remove frequently
> occurring words to achieve a linear runtime, which would correspond to
> the "maxNumberOfPreferencesPerItem" approach MAHOUT-460 suggests.
>
> --sebastian
>
>
>
> Am 12.08.2010 16:45, schrieb Gökhan Çapan:
> > Hi,
> > I haven't seen the code, but maybe Mahout needs some optimization while
> > computing item-item co-occurrences. It may be re-implemented using the
> > "stripes" approach with in-mapper combining if it is not already. The
> > approach is described at:
> >
> >    1. www.aclweb.org/anthology/D/D08/D08-1044.pdf
> >
> > If it already is, sorry for the post.
> >
> > On Thu, Aug 12, 2010 at 3:51 PM, Charly Lizarralde <
> > [email protected]> wrote:
> >
> >
> >> Sebastian, thanks for the reply. The step name is
> >> RowSimilarityJob-CooccurrencesMapper-SimilarityReducer, and each map
> >> task takes around 10 hours to finish.
> >>
> >> The reduce task dir
> >> (var/lib/hadoop-0.20/cache/hadoop/mapred/local/taskTracker/jobcache/job_201008111833_0007/attempt_201008111833_0007_r_000000_0/output)
> >> has map output files (files like map_2.out) and each one is 5 GB in size.
> >>
> >> I have been looking at the code and saw what you describe in the e-mail.
> >> It makes sense. But 160 GB of intermediate data from a 2.6 GB input file
> >> still makes me wonder if something is wrong.
> >>
> >> Should I just wait for the patch?
> >> Thanks again!
> >> Charly
> >>
> >> On Thu, Aug 12, 2010 at 2:34 AM, Sebastian Schelter <
> >> [email protected]
> >>
> >>> wrote:
> >>>
> >>
> >>> Hi Charly,
> >>>
> >>> can you tell which Map/Reduce step was executed last before you ran out
> >>> of disk space?
> >>>
> >>> I'm not familiar with the Netflix dataset and can only guess what
> >>> happened, but I would say that you ran out of diskspace because
> >>> ItemSimilarityJob currently uses all preferences to compute the
> >>> similarities. This makes it scale in the square of the number of
> >>> occurrences of the most popular item, which is a bad thing if that
> >>> number is huge. We need a way to limit the number of preferences
> >>> considered per item, there is already a ticket for this (
> >>> https://issues.apache.org/jira/browse/MAHOUT-460) and I plan to provide
> >>> a patch in the next few days.
> >>>
> >>> --sebastian
> >>>
> >>>
> >>>
> >>> Am 12.08.2010 00:15, schrieb Charly Lizarralde:
> >>>
> >>>> Hi, I am testing ItemSimilarityJob with Netflix data (2.6 GB) and I
> >>>> have just run out of disk space (160 GB) in my mapred.local.dir when
> >>>> running RowSimilarityJob.
> >>>>
> >>>> Is this normal behaviour? How can I improve this?
> >>>>
> >>>> Thanks!
> >>>> Charly
> >>>>
> >>>>
> >>>>
> >>>
> >>>
> >>
> >
> >
> >
>
>


-- 
Gökhan Çapan
