On Fri, Dec 2, 2011 at 1:53 PM, Sean Owen <[email protected]> wrote:

> On Fri, Dec 2, 2011 at 11:28 AM, Daniel Zohar <[email protected]> wrote:
>
> > I'm already capping it at 100. If this will be my last resort, I will
> > decrease it more :)
> >
>
> This just can't be... 100 item-item similarities takes milliseconds to
> compute. Something else is going on.
> I should make a JIRA to propose my own version of this filtering just to
> make sure we're talking about the same thing.
>

I limit only the possibleItemIDs in my altered version of
SamplingCandidateItemsStrategy. Since TopItems.getTopItems() computes
similarities between every previously interacted item and the set of
'possibleItemIDs', you are correct only when the user has a single
interaction. However, if the user has made 20 interactions, we're talking
about 2000 item-item similarities.
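
To make the arithmetic above concrete, here is a minimal counting sketch. It is not Mahout code: the class and method names are hypothetical, and it only assumes that every candidate item is compared against every item the user has already interacted with.

```java
class SimilarityCountSketch {

    // Hypothetical helper: counts item-item similarity evaluations when each
    // candidate item is compared against each previously interacted item.
    static long similarityEvaluations(int interactedItems, int candidateItems) {
        long evaluations = 0;
        for (int c = 0; c < candidateItems; c++) {
            for (int i = 0; i < interactedItems; i++) {
                evaluations++; // one similarity lookup per (candidate, interacted) pair
            }
        }
        return evaluations;
    }

    public static void main(String[] args) {
        // Candidates capped at 100, as in the thread.
        System.out.println(similarityEvaluations(1, 100));  // prints 100
        System.out.println(similarityEvaluations(20, 100)); // prints 2000
    }
}
```

So the cap of 100 bounds the work per interacted item, not the total work per recommendation.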
> >
> > You know this code way better than I do, so perhaps I am missing something
> > here. But as I see it (and I tested it as well) the users data point
> > remains intact. That's because the preferenceFromUsers Set remains the same
> > while only preferenceForItems is optimized. The main reason it improves
> > performance is because of the bottleneck we diagnosed before -
> > `GenericBooleanPrefDataModel.getNumUsersWithPreferenceFor` which in turn
> > calls `FastIDSet.intersectionSize`. Now, if we know _for sure_ that a user
> > interacted with a single item only, what's the point of checking every time
> > if it had interacted with other items? (I hope I make myself clear)
> > Because in my data set, we have over 80% of users which had a single
> > interaction, it gives such a performance boost. (I believe this case might
> > be more common than one might think in web apps)
> >
>
> Let me propose a better way to address that bottleneck. I think the problem
> is that the intersection computation is dumb, and should really compute
> "larger.intersectionSize(smaller)".
>
> Try ending getNumUsersWithPreferenceFor() with:
>
> return userIDs1.size() < userIDs2.size() ?
>     userIDs2.intersectionSize(userIDs1) :
>     userIDs1.intersectionSize(userIDs2);
>
> It won't produce the same speedup, but it's more correct than omitting this
> data just to get this effect. If it gets 80% of the speedup, that's a great
> win.
>

You nailed it! It improves performance dramatically. Without my last fix,
we're talking about 2s at most; with my fix, it didn't go over 0.5s!
I still don't see any problem with not including 'singleton users' inside
preferenceForItems as long as preferenceFromUsers stays intact. Can you
please elaborate on the problem as you see it? I feel we're having some
kind of communication problem :P
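
For anyone following along, here is a self-contained sketch of the "larger.intersectionSize(smaller)" idea. It uses plain java.util.HashSet rather than FastIDSet, and the names IntersectionSketch and countContained are made up for illustration. It assumes, as the suggestion above implies, that the intersection loop iterates over its argument and probes the receiver, so the cost follows the size of the set being iterated.

```java
import java.util.HashSet;
import java.util.Set;

class IntersectionSketch {

    // Counts the elements of 'probe' that are also in 'lookup'.
    // Cost is proportional to probe.size(), since each contains() is O(1) on average.
    static int countContained(Set<Long> lookup, Set<Long> probe) {
        int count = 0;
        for (Long id : probe) {
            if (lookup.contains(id)) {
                count++;
            }
        }
        return count;
    }

    // Always iterate over the smaller set and probe the larger one.
    static int intersectionSize(Set<Long> a, Set<Long> b) {
        return a.size() < b.size() ? countContained(b, a) : countContained(a, b);
    }

    public static void main(String[] args) {
        Set<Long> popularItemUsers = new HashSet<>();
        for (long userID = 0; userID < 100_000; userID++) {
            popularItemUsers.add(userID);
        }
        Set<Long> rareItemUsers = new HashSet<>();
        rareItemUsers.add(42L);
        rareItemUsers.add(-1L);

        // Two probes instead of 100,000, regardless of argument order.
        System.out.println(intersectionSize(popularItemUsers, rareItemUsers)); // prints 1
        System.out.println(intersectionSize(rareItemUsers, popularItemUsers)); // prints 1
    }
}
```

The result does not depend on which set is iterated, only the amount of work does, which is why this keeps all the data while still removing most of the cost.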
