Sean,

You can also do #1. That is what I have used in the past and what I recommend. It achieves a large part of #2, but most importantly, it *directly* addresses the key cost factor in off-line recommendation, since the number of item pairs emitted is proportional to the sum, over users, of the square of each user's item count.
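To make that cost factor concrete, here is a minimal sketch (hypothetical names, not the Mahout API) of why pair emission grows with the sum of squared per-user item counts:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PairCost {

  // Each user with k items emits on the order of k*(k-1)/2 co-occurrence
  // pairs, so total work is proportional to the sum of k^2 over all users.
  static long pairsEmitted(Map<String, List<Integer>> itemsByUser) {
    long total = 0;
    for (List<Integer> items : itemsByUser.values()) {
      long k = items.size();
      total += k * (k - 1) / 2;
    }
    return total;
  }

  public static void main(String[] args) {
    Map<String, List<Integer>> data = new HashMap<>();
    data.put("A", Arrays.asList(1, 2, 3));        // 3 pairs
    data.put("B", Arrays.asList(1, 2, 3, 4, 5));  // 10 pairs
    System.out.println(pairsEmitted(data));       // prints 13
  }
}
```

One prolific user with 10,000 items contributes ~50 million pairs on their own, which is why capping per-user item counts pays off so directly.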
Specifically, I think that each user should have at most N items, and if they have more, their items should be down-sampled to N. I also think that there are some cases where strategy #2 is important even if #1 is implemented. If #1 and #2 are done, then limiting the number of items in each row of the item-item matrix is a matter of convenience. That is #4, which I endorse and which Sebastian has endorsed.

On Sun, Dec 4, 2011 at 5:42 AM, Sean Owen <sro...@gmail.com> wrote:

> To talk about this clearly, let me go back to my example and add to it:
>
> ---
> Say we're recommending for user A. User A is connected to items 1, 2, 3.
> Those items are connected to other users X, Y, Z. And those users in turn
> are connected to items 100, 101, 102, 103.... You can down-sample three
> things:
>
> 1. The 1, 2, 3
> 2. The X, Y, Z
> 3. The 100, 101, 102
> 4. ... the result of down-sampling 1-3, again
> ---
>
> The current implementation samples #2. My proposal samples #2 and #3.
> Sebastian's samples #3. Your proposal does #2 and #4. I believe that doing
> all 4 is redundant. You probably need to do at least #2 and #3 to avoid the
> prolific-user and prolific-item problem.
>
> The reason you are still seeing a fair number of IDs is that #1 is not also
> sampled in my implementation.
>
> I suggest that we still have one solution for this, since it's all small
> variants on the same theme, and that we put it in
> SamplingCandidateItemStrategy.
>
> To me, the remaining question is just: which of these 4 do you want to do?
> I suggest 2, 3, and maybe 1.
> Follow-on question: should we make separately settable limits for each, or
> does this get complex without much use?
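The "cap each user at N items" idea above can be sketched as uniform random down-sampling; the names here are illustrative, not the actual Mahout implementation:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class DownSample {

  // If a user has more than maxItemsPerUser items, keep a uniform random
  // sample of exactly maxItemsPerUser of them; otherwise keep them all.
  static List<Long> sampleUserItems(List<Long> itemIDs, int maxItemsPerUser, Random rng) {
    if (itemIDs.size() <= maxItemsPerUser) {
      return itemIDs;
    }
    List<Long> copy = new ArrayList<>(itemIDs);
    Collections.shuffle(copy, rng);  // uniform permutation, then take a prefix
    return copy.subList(0, maxItemsPerUser);
  }

  public static void main(String[] args) {
    List<Long> items = Arrays.asList(1L, 2L, 3L, 4L, 5L, 6L);
    System.out.println(sampleUserItems(items, 3, new Random()).size());  // prints 3
    System.out.println(sampleUserItems(items, 10, new Random()).size()); // prints 6
  }
}
```

The same cap can be applied at each of Sean's four points (#1-#4); only the collection being sampled changes.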
>
> On Sun, Dec 4, 2011 at 1:04 PM, Daniel Zohar <disso...@gmail.com> wrote:
>
> > I assume the parameter does not affect the possibleItemIDs because of the
> > following line:
> >
> >     max = (int) Math.max(defaultMaxPrefsPerItemConsidered,
> >         userItemCountMultiplier
> >             * Math.log(Math.max(dataModel.getNumUsers(), dataModel.getNumItems())));
> >
> > On Sun, Dec 4, 2011 at 2:59 PM, Daniel Zohar <disso...@gmail.com> wrote:
> >
> > > Sean, your implementation is indeed better than mine, but for some
> > > reason, when I ran it for a user with a lot of interactions, I got 2023
> > > possibleItemIDs (although I used 10,2 in the constructor).
> > >
> > > Sebastian, I will also try to experiment with your patch. I would just
> > > like to add that, in my opinion, as long as 'killing items' has to be
> > > done manually, it is not scalable by definition. I personally would
> > > always prefer to avoid this kind of solution. Also, in my case, the most
> > > popular item has only 3% of users interacting with it, so I suppose
> > > that's not exactly the case here either.
> > >
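For reference, the expression Daniel quotes computes a cap as the larger of a fixed floor and a log-scaled term. A small worked example (parameter values made up, mirroring the "10,2" he mentions) shows how slowly the log term grows:

```java
public class MaxPrefs {

  // Mirrors the quoted expression: the cap is the larger of a fixed floor
  // and a multiplier times the natural log of max(numUsers, numItems).
  static int maxPrefs(int defaultMaxPrefsPerItemConsidered,
                      int userItemCountMultiplier,
                      int numUsers, int numItems) {
    return (int) Math.max(defaultMaxPrefsPerItemConsidered,
        userItemCountMultiplier * Math.log(Math.max(numUsers, numItems)));
  }

  public static void main(String[] args) {
    // With a floor of 10 and multiplier 2, one million users gives
    // 2 * ln(1e6) ~= 27.6, so the cap is 27.
    System.out.println(maxPrefs(10, 2, 1_000_000, 50_000));  // prints 27
    // With only 100 users, 2 * ln(100) ~= 9.2 and the floor of 10 wins.
    System.out.println(maxPrefs(10, 2, 100, 100));           // prints 10
  }
}
```

Because the cap grows only logarithmically with data-set size, it cannot by itself explain a result like 2023 possibleItemIDs; the candidate set must be growing elsewhere, consistent with Sean's point that #1 is not sampled in his implementation.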