Sean,

You can also do #1. That is what I have used in the past and what I recommend. It achieves a large part of #2, but most importantly, it *directly* addresses the key cost factor in off-line recommendation, since the number of item pairs emitted is proportional to the sum, over users, of the square of each user's item count.
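To make that cost factor concrete, here is a minimal sketch (hypothetical names, not the Mahout API) of why pair emission grows with the sum of squared per-user item counts:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PairCost {

  // Each user with k items emits on the order of k*(k-1)/2 co-occurrence
  // pairs, so total work is proportional to the sum of k^2 over all users.
  static long pairsEmitted(Map<String, List<Integer>> itemsByUser) {
    long total = 0;
    for (List<Integer> items : itemsByUser.values()) {
      long k = items.size();
      total += k * (k - 1) / 2;
    }
    return total;
  }

  public static void main(String[] args) {
    Map<String, List<Integer>> data = new HashMap<>();
    data.put("A", Arrays.asList(1, 2, 3));        // 3 pairs
    data.put("B", Arrays.asList(1, 2, 3, 4, 5));  // 10 pairs
    System.out.println(pairsEmitted(data));       // prints 13
  }
}
```

One prolific user with 10,000 items contributes ~50 million pairs on their own, which is why capping per-user item counts pays off so directly.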
Specifically, I think that each user should have at most N items, and if they have more, their items should be down-sampled to N. I also think that there are some cases where strategy #2 is important even if #1 is implemented. If #1 and #2 are done, then limiting the number of items in each row of the item-item matrix is a matter of convenience. That is #4, which I endorse and which Sebastian has endorsed.

On Sun, Dec 4, 2011 at 5:42 AM, Sean Owen <sro...@gmail.com> wrote:

> To talk about this clearly, let me go back to my example and add to it:
>
> ---
> Say we're recommending for user A. User A is connected to items 1, 2, 3.
> Those items are connected to other users X, Y, Z. And those users in turn
> are connected to items 100, 101, 102, 103.... You can down-sample three
> things:
>
> 1. The 1, 2, 3
> 2. The X, Y, Z
> 3. The 100, 101, 102
> 4. ... the result of down-sampling 1-3, again
> ---
>
> The current implementation samples #2. My proposal samples #2 and #3.
> Sebastian's samples #3. Your proposal does #2 and #4. I believe that doing
> all 4 is redundant. You probably need to do at least #2 and #3 to avoid the
> prolific-user and prolific-item problem.
>
> The reason you are still seeing a fair number of IDs is that #1 is not also
> sampled in my implementation.
>
> I suggest that we still have one solution for this, since it's all small
> variants on the same theme, and that we put it in
> SamplingCandidateItemStrategy.
>
> To me, the remaining question is just: which of these 4 do you want to do?
> I suggest 2, 3, and maybe 1.
> Follow-on question: should we make separately settable limits for each, or
> does this get complex without much use?
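The "cap each user at N items" idea above can be sketched as uniform random down-sampling; the names here are illustrative, not the actual Mahout implementation:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class DownSample {

  // If a user has more than maxItemsPerUser items, keep a uniform random
  // sample of exactly maxItemsPerUser of them; otherwise keep them all.
  static List<Long> sampleUserItems(List<Long> itemIDs, int maxItemsPerUser, Random rng) {
    if (itemIDs.size() <= maxItemsPerUser) {
      return itemIDs;
    }
    List<Long> copy = new ArrayList<>(itemIDs);
    Collections.shuffle(copy, rng);  // uniform permutation, then take a prefix
    return copy.subList(0, maxItemsPerUser);
  }

  public static void main(String[] args) {
    List<Long> items = Arrays.asList(1L, 2L, 3L, 4L, 5L, 6L);
    System.out.println(sampleUserItems(items, 3, new Random()).size());  // prints 3
    System.out.println(sampleUserItems(items, 10, new Random()).size()); // prints 6
  }
}
```

The same cap can be applied at each of Sean's four points (#1-#4); only the collection being sampled changes.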
>
> On Sun, Dec 4, 2011 at 1:04 PM, Daniel Zohar <disso...@gmail.com> wrote:
>
> > I assume the parameter does not affect the possibleItemIDs because of the
> > following line:
> >
> >     max = (int) Math.max(defaultMaxPrefsPerItemConsidered,
> >         userItemCountMultiplier
> >             * Math.log(Math.max(dataModel.getNumUsers(), dataModel.getNumItems())));
> >
> > On Sun, Dec 4, 2011 at 2:59 PM, Daniel Zohar <disso...@gmail.com> wrote:
> >
> > > Sean, your implementation is indeed better than mine, but for some
> > > reason, when I ran it for a user with a lot of interactions, I got 2023
> > > possibleItemIDs (although I used 10,2 in the constructor).
> > >
> > > Sebastian, I will also try to experiment with your patch. I would just
> > > like to add that, in my opinion, as long as 'killing items' has to be
> > > done manually, it is not scalable by definition. I personally would
> > > always prefer to avoid this kind of solution. Also, in my case, the most
> > > popular item has only 3% of users interacting with it, so I suppose
> > > that's not exactly the case here either.
> > >
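For reference, the expression Daniel quotes computes a cap as the larger of a fixed floor and a log-scaled term. A small worked example (parameter values made up, mirroring the "10,2" he mentions) shows how slowly the log term grows:

```java
public class MaxPrefs {

  // Mirrors the quoted expression: the cap is the larger of a fixed floor
  // and a multiplier times the natural log of max(numUsers, numItems).
  static int maxPrefs(int defaultMaxPrefsPerItemConsidered,
                      int userItemCountMultiplier,
                      int numUsers, int numItems) {
    return (int) Math.max(defaultMaxPrefsPerItemConsidered,
        userItemCountMultiplier * Math.log(Math.max(numUsers, numItems)));
  }

  public static void main(String[] args) {
    // With a floor of 10 and multiplier 2, one million users gives
    // 2 * ln(1e6) ~= 27.6, so the cap is 27.
    System.out.println(maxPrefs(10, 2, 1_000_000, 50_000));  // prints 27
    // With only 100 users, 2 * ln(100) ~= 9.2 and the floor of 10 wins.
    System.out.println(maxPrefs(10, 2, 100, 100));           // prints 10
  }
}
```

Because the cap grows only logarithmically with data-set size, it cannot by itself explain a result like 2023 possibleItemIDs; the candidate set must be growing elsewhere, consistent with Sean's point that #1 is not sampled in his implementation.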