It's hard to say which of these has the most significant impact at this point. To summarize, we made basically three changes, which together give an enormous improvement in performance:

1. Limit the number of 'possibleItemIDs' returned by SamplingCandidateItemsStrategy
2. Optimize getNumUsersWithPreferenceFor() to use "larger.intersectionSize(smaller)" (see the sketch just below)
3. Exclude 'singleton users' from GenericBooleanPrefDataModel.preferenceForItems
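
To make sure we're talking about the same thing on 2, here is roughly what I mean. Treat it as a sketch rather than the actual patch: the helper class is made up, and it just assumes the usual FastByIDMap<FastIDSet> that maps an itemID to the set of users who interacted with it.

import org.apache.mahout.cf.taste.impl.common.FastByIDMap;
import org.apache.mahout.cf.taste.impl.common.FastIDSet;

// Sketch of change 2: same result as before, but the intersection is always
// computed as larger.intersectionSize(smaller), which was the fast direction here.
final class NumUsersSketch {

  private NumUsersSketch() { }

  static int getNumUsersWithPreferenceFor(FastByIDMap<FastIDSet> preferenceForItems,
                                          long itemID1,
                                          long itemID2) {
    FastIDSet users1 = preferenceForItems.get(itemID1);
    if (users1 == null) {
      return 0;
    }
    FastIDSet users2 = preferenceForItems.get(itemID2);
    if (users2 == null) {
      return 0;
    }
    // Call intersectionSize() on the larger set, passing the smaller one.
    return users1.size() >= users2.size()
        ? users1.intersectionSize(users2)
        : users2.intersectionSize(users1);
  }
}
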
I think we already agreed about 1 & 2 (correct me if I'm wrong). Regarding 3, if we had a Map that mapped itemID to numOfUsers, that might solve the problem and still give a great performance boost (see the rough sketch below the quoted thread). You can see in my previous results that this change cuts query time to a quarter or less (2s vs. 0.5s). What do you think?

On Fri, Dec 2, 2011 at 5:38 PM, Sean Owen <[email protected]> wrote:

> On Fri, Dec 2, 2011 at 2:33 PM, Daniel Zohar <[email protected]> wrote:
>
> > On Fri, Dec 2, 2011 at 1:53 PM, Sean Owen <[email protected]> wrote:
> >
> > > On Fri, Dec 2, 2011 at 11:28 AM, Daniel Zohar <[email protected]> wrote:
> > >
> > > > I'm already capping it at 100. If this will be my last resort, I will decrease it more :)
> > >
> > > This just can't be... 100 item-item similarities take milliseconds to compute. Something else is going on.
> > > I should make a JIRA to propose my own version of this filtering, just to make sure we're talking about the same thing.
> >
> > I limit only the possibleItemIDs in my altered version of SamplingCandidateItemsStrategy.
> > Since TopItems.getTopItems() computes similarities for every previously interacted item against the set of 'possibleItemIDs', you are correct only when the user has a single interaction. However, if the user has made 20 interactions, we're talking about 2000 item-item similarities.
>
> Sure, but why not limit those 2000 interactions to 100?
>
> I'm not sure whether "100" is the optimal number, but clearly, at smaller scale, this all runs quite fast and produces reasonable results. Say that, at some smaller scale, it considers on average X interactions. If you turn this down such that only X interactions are considered here, I don't see why you wouldn't get similar performance and quality.
>
> I was actually considering a patch that would simply apply the max both to the number of users sampled per item and to the number of items from each of those users. If you clamped it at 10, then each item you've rated would produce at most 10*10 interactions. We only limit one of those things now.
>
> Well... maybe with the change below it speeds up further, so that you can get away with more data in less time and it is much easier to find a good tradeoff.
>
> > You nailed it! It improves the performance enormously. Without my last fix, we're talking about 2s max; with my fix, it didn't go over 0.5s!
>
> ... but does this change alone produce a good speedup?
>
> > I still don't see any problem with not including 'singleton users' inside preferenceForItems as long as preferenceFromUsers stays intact. Can you please elaborate more on the problem as you see it? I feel we're having some kind of communication problem :P
>
> The calculations are just wrong then, since you don't have the right user counts per item. Methods that answer that question give the wrong answer; similarity metrics like log-likelihood give slightly wrong results too. At this point I don't think it's good to sacrifice correctness for an optimization when there seems to be (?) a way to have most of the speedup without any correctness problem.
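
And here is a rough sketch of the Map idea for 3 (untested, and the class and method names are placeholders). It assumes preferenceFromUsers still maps every userID, including singletons, to that user's itemIDs, so the counts stay exact even after singleton users are dropped from preferenceForItems.

import org.apache.mahout.cf.taste.impl.common.FastByIDMap;
import org.apache.mahout.cf.taste.impl.common.FastIDSet;
import org.apache.mahout.cf.taste.impl.common.LongPrimitiveIterator;

// Sketch of the itemID -> numUsers idea: build exact per-item user counts from
// the full preferenceFromUsers data, so single-item user counts stay correct
// even when singleton users are excluded from preferenceForItems.
final class ItemUserCountSketch {

  private ItemUserCountSketch() { }

  static FastByIDMap<Integer> countUsersPerItem(FastByIDMap<FastIDSet> preferenceFromUsers) {
    FastByIDMap<Integer> numUsersPerItem = new FastByIDMap<Integer>();
    LongPrimitiveIterator userIDs = preferenceFromUsers.keySetIterator();
    while (userIDs.hasNext()) {
      long userID = userIDs.nextLong();
      FastIDSet itemIDs = preferenceFromUsers.get(userID);
      LongPrimitiveIterator items = itemIDs.iterator();
      while (items.hasNext()) {
        long itemID = items.nextLong();
        Integer count = numUsersPerItem.get(itemID);
        numUsersPerItem.put(itemID, count == null ? 1 : count + 1);
      }
    }
    return numUsersPerItem;
  }

  // The single-item count would then come from this map rather than from
  // preferenceForItems.get(itemID).size().
  static int getNumUsersWithPreferenceFor(FastByIDMap<Integer> numUsersPerItem, long itemID) {
    Integer count = numUsersPerItem.get(itemID);
    return count == null ? 0 : count;
  }
}
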
