Hi,

The paper below seems interesting (SIGIR 2010); it uses LSH + Hadoop:

http://www.lsdsir.org/wp-content/uploads/2010/05/lsdsir10-2.pdf
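In case it helps the discussion: below is a minimal, self-contained sketch of the MinHash idea (the LSH family for Jaccard similarity that Gökhan mentions further down in the thread). It is plain Java with illustrative names, not Mahout's code or the paper's:

import java.util.Arrays;
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

/**
 * Minimal MinHash sketch: estimates the Jaccard similarity between two item
 * sets (e.g. the sets of users who rated two items) from short signatures,
 * without comparing the sets element by element. Illustrative only.
 */
public class MinHashExample {

  private static final int PRIME = 2147483647; // 2^31 - 1

  private final int numHashes;
  private final int[] a;
  private final int[] b;

  public MinHashExample(int numHashes, long seed) {
    this.numHashes = numHashes;
    Random random = new Random(seed);
    a = new int[numHashes];
    b = new int[numHashes];
    for (int i = 0; i < numHashes; i++) {
      a[i] = 1 + random.nextInt(PRIME - 1);
      b[i] = random.nextInt(PRIME);
    }
  }

  /** Computes the MinHash signature of a set of user ids. */
  public int[] signature(Set<Integer> userIds) {
    int[] sig = new int[numHashes];
    Arrays.fill(sig, Integer.MAX_VALUE);
    for (int userId : userIds) {
      for (int i = 0; i < numHashes; i++) {
        int h = (int) (((long) a[i] * userId + b[i]) % PRIME);
        if (h < sig[i]) {
          sig[i] = h;
        }
      }
    }
    return sig;
  }

  /** The fraction of matching signature positions approximates the Jaccard similarity. */
  public static double estimateJaccard(int[] sig1, int[] sig2) {
    int matches = 0;
    for (int i = 0; i < sig1.length; i++) {
      if (sig1[i] == sig2[i]) {
        matches++;
      }
    }
    return (double) matches / sig1.length;
  }

  public static void main(String[] args) {
    MinHashExample minHash = new MinHashExample(128, 42L);
    Set<Integer> usersOfItemA = new HashSet<Integer>(Arrays.asList(1, 2, 3, 4, 5, 6));
    Set<Integer> usersOfItemB = new HashSet<Integer>(Arrays.asList(4, 5, 6, 7, 8));
    double estimate = estimateJaccard(minHash.signature(usersOfItemA),
                                      minHash.signature(usersOfItemB));
    System.out.println("estimated Jaccard ~ " + estimate); // true Jaccard is 3/8
  }
}

Each signature position agrees between two sets with probability equal to their Jaccard similarity, so the fraction of matching positions is an unbiased estimate of it, and candidate pairs can be generated by grouping items that share a few signature positions instead of comparing all pairs.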
Alex

On Thu, Aug 12, 2010 at 3:32 PM, Gökhan Çapan <[email protected]> wrote:

> Hi Sebastian,
>
> I remember the birth of this "co-occurrence based distributed
> recommendation" idea, I was there :). That's why I know the general
> concepts behind the idea. However, I haven't looked at the Mahout code
> lately, and I don't know the details of how the co-occurrence computation
> is implemented. Choosing either the pairs or the stripes approach is a
> design choice when implementing a co-occurrence computation (and other
> similar ones). Although the two differ only slightly in implementation,
> the size of the intermediate data and the time complexity of the
> algorithms are very different. If the problem is the size of the
> intermediate data produced while computing the co-occurrence matrix, it's
> worth trying the "other" approach.
>
> In addition, MinHashing may be another option to produce all candidate
> similar pairs according to Jaccard similarity without examining all
> pairs. It can be implemented approximately using MapReduce. Actually, I
> have implemented one to detect near-duplicate documents in a large
> document set, as well as to find similar items according to users' binary
> ratings, and it yields very good results in terms of both accuracy and
> performance. This method is also part of Google's news recommendation
> engine.
>
> On Thu, Aug 12, 2010 at 9:11 PM, Sebastian Schelter
> <[email protected]> wrote:
>
> > Hi Gökhan,
> >
> > I had a quick look at the paper and the "stripes" approach looks
> > interesting, though I'm not sure whether we can apply it to the
> > RowSimilarityJob.
> >
> > I think when we want to improve the performance of the
> > ItemSimilarityJob we should go another path that follows some thoughts
> > by Ted Dunning. IIRC he said that you don't really learn anything new
> > about an item after seeing a certain number of preferences, and thus it
> > should be sufficient to look at a fixed maximum number of them per
> > item. RecommenderJob already limits the number of preferences
> > considered per item with MaybePruneRowsMapper, and ItemSimilarityJob
> > should adopt that option too.
> >
> > The algorithm is based on the paper "Elsayed et al: Pairwise Document
> > Similarity in Large Collections with MapReduce"
> > (http://www.umiacs.umd.edu/~jimmylin/publications/Elsayed_etal_ACL2008_short.pdf),
> > which seems to use the "pairs" approach described in your paper, as far
> > as I can judge. IIRC the authors of the paper remove frequently
> > occurring words to achieve a linear runtime, which would correspond to
> > the "maxNumberOfPreferencesPerItem" approach MAHOUT-460 suggests.
> >
> > --sebastian
> >
> > Am 12.08.2010 16:45, schrieb Gökhan Çapan:
> > > Hi,
> > > I haven't seen the code, but maybe Mahout needs some optimization
> > > while computing item-item co-occurrences. It could be re-implemented
> > > using the "stripes" approach with in-mapper combining if it is not
> > > already. The approach is described in:
> > >
> > > 1. www.aclweb.org/anthology/D/D08/D08-1044.pdf
> > >
> > > If it already is, sorry for the post.
> > >
> > > On Thu, Aug 12, 2010 at 3:51 PM, Charly Lizarralde
> > > <[email protected]> wrote:
> > >
> > >> Sebastian, thanks for the reply. The step name is
> > >> RowSimilarityJob-CooccurrencesMapper-SimilarityReducer, and each map
> > >> task takes around 10 hours to finish.
> > >>
> > >> The reduce task dir
> > >> (var/lib/hadoop-0.20/cache/hadoop/mapred/local/taskTracker/jobcache/job_201008111833_0007/attempt_201008111833_0007_r_000000_0/output)
> > >> contains map output files (files like map_2.out), and each one is
> > >> 5 GB in size.
> > >>
> > >> I have been looking at the code and saw what you describe in the
> > >> e-mail; it makes sense. But 160 GB of intermediate data from a
> > >> 2.6 GB input file still makes me wonder whether something is wrong.
> > >>
> > >> Should I just wait for the patch?
> > >> Thanks again!
> > >> Charly
> > >>
> > >> On Thu, Aug 12, 2010 at 2:34 AM, Sebastian Schelter
> > >> <[email protected]> wrote:
> > >>
> > >>> Hi Charly,
> > >>>
> > >>> can you tell which Map/Reduce step was executed last before you ran
> > >>> out of disk space?
> > >>>
> > >>> I'm not familiar with the Netflix dataset and can only guess what
> > >>> happened, but I would say that you ran out of disk space because
> > >>> ItemSimilarityJob currently uses all preferences to compute the
> > >>> similarities. This makes it scale quadratically in the number of
> > >>> occurrences of the most popular item, which is a bad thing if that
> > >>> number is huge. We need a way to limit the number of preferences
> > >>> considered per item; there is already a ticket for this
> > >>> (https://issues.apache.org/jira/browse/MAHOUT-460) and I plan to
> > >>> provide a patch in the next few days.
> > >>>
> > >>> --sebastian
> > >>>
> > >>> Am 12.08.2010 00:15, schrieb Charly Lizarralde:
> > >>>> Hi, I am testing ItemSimilarityJob with the Netflix data (2.6 GB)
> > >>>> and I have just run out of disk space (160 GB) in my
> > >>>> mapred.local.dir when running RowSimilarityJob.
> > >>>>
> > >>>> Is this normal behaviour? How can I improve this?
> > >>>>
> > >>>> Thanks!
> > >>>> Charly
>
> --
> Gökhan Çapan
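To make the pairs/stripes distinction discussed above concrete, here is a rough sketch of a "stripes" mapper with in-mapper combining for item co-occurrence counting, in the spirit of the paper Gökhan cites. This is illustrative only, not Mahout's RowSimilarityJob; it assumes one input line per user of the form "userId<TAB>item1,item2,...":

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 * Illustrative "stripes" mapper with in-mapper combining for item
 * co-occurrence counting. Not the Mahout implementation; assumes one input
 * line per user of the form "userId<TAB>item1,item2,...".
 */
public class CooccurrenceStripesMapper
    extends Mapper<LongWritable, Text, IntWritable, MapWritable> {

  // stripes[i][j] = number of users seen by this task who have both item i
  // and item j, accumulated across all map() calls (in-mapper combining)
  private final Map<Integer, Map<Integer, Integer>> stripes =
      new HashMap<Integer, Map<Integer, Integer>>();

  @Override
  protected void map(LongWritable key, Text value, Context ctx) {
    String[] parts = value.toString().split("\t");
    if (parts.length < 2) {
      return;
    }
    String[] itemTokens = parts[1].split(",");
    int[] items = new int[itemTokens.length];
    for (int i = 0; i < itemTokens.length; i++) {
      items[i] = Integer.parseInt(itemTokens[i].trim());
    }
    for (int i = 0; i < items.length; i++) {
      Map<Integer, Integer> stripe = stripes.get(items[i]);
      if (stripe == null) {
        stripe = new HashMap<Integer, Integer>();
        stripes.put(items[i], stripe);
      }
      for (int j = 0; j < items.length; j++) {
        if (i != j) {
          Integer count = stripe.get(items[j]);
          stripe.put(items[j], count == null ? 1 : count + 1);
        }
      }
    }
  }

  @Override
  protected void cleanup(Context ctx) throws IOException, InterruptedException {
    // emit one stripe per item; reducers sum the stripes element-wise per item
    for (Map.Entry<Integer, Map<Integer, Integer>> e : stripes.entrySet()) {
      MapWritable stripe = new MapWritable();
      for (Map.Entry<Integer, Integer> c : e.getValue().entrySet()) {
        stripe.put(new IntWritable(c.getKey()), new IntWritable(c.getValue()));
      }
      ctx.write(new IntWritable(e.getKey()), stripe);
    }
  }
}

Instead of emitting one ((itemA, itemB), 1) record per co-occurrence as the pairs approach does, each mapper emits one associative array per item and the reducer sums the stripes element-wise, which shrinks the intermediate data considerably. The trade-off is that the accumulated stripes must fit into mapper memory, which is one more reason why pruning the most frequent items (as MAHOUT-460 proposes) matters for either approach.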
