I'll commit #2 since that's an obvious win. I'll make a patch with a proposal for #1, which is sort of like what you're doing, just slightly different -- it gives a much stronger cap on the number of interactions considered.
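In case it helps to see the idea behind #2 concretely, here's a rough sketch -- plain java.util.Set standing in for the real set implementation, so treat it as illustration only, not the actual code:

    import java.util.Set;

    // Sketch only: both methods compute the same intersection size,
    // but the second iterates the smaller set and probes the larger.
    final class IntersectionSketch {

      // If 'a' has 1000 IDs and 'b' has 1, this does 1000 hash
      // lookups to find at most one match.
      static int iterateFirstArg(Set<Long> a, Set<Long> b) {
        int count = 0;
        for (Long id : a) {
          if (b.contains(id)) {
            count++;
          }
        }
        return count;
      }

      // larger.intersectionSize(smaller), in effect: the 1000-vs-1
      // case becomes a single hash lookup, with an identical result.
      static int iterateSmaller(Set<Long> a, Set<Long> b) {
        Set<Long> smaller = a.size() <= b.size() ? a : b;
        Set<Long> larger = (smaller == a) ? b : a;
        return iterateFirstArg(smaller, larger);
      }
    }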
For #3: You can't just store counts -- you need the user IDs to compute intersection sizes. If I'm right about why #3 makes it run faster, then #2 should make it run faster by about the same amount -- so #2 subsumes #3. And if that's true, I'd strongly prefer not to do #3, since it would add little and make things incorrect.

I don't think #3 helps because it makes fewer items available as candidates -- it shouldn't. Say user A rates items X, Y and Z, and user B has rated only Y. When recommending for A, we'll find the set of users who have rated Y (A, B and others). When we look at B, we'll add Y as a candidate item (the only item B rated). But Y will be removed at the end, since A has already rated it. So B didn't add any more item-item interactions to consider; if you remove user B, you still have the same work to do.

I think #3 "helps" because it sets the count of users per item, for some items, to 0 instead of 1, which avoids an intersection computation. But that happens later, after you have already gotten the list of candidates. It's not reducing the number of candidates; it's ignoring a lot of them later by assuming their similarity is 0 (when it isn't!).

But that intersection computation doesn't have to be so slow. It's slow to check whether each of 1000 IDs in one set is a member of another set containing 1 ID; checking that 1 ID for membership in a set of 1000 is just one quick hash lookup. Now, I agree that pretending the set has 0 elements instead of 1 is even a little faster still! And perhaps you could argue it's a reasonable approximation. But my guess is that it's barely faster, at the price of breaking correctness.

On Fri, Dec 2, 2011 at 3:58 PM, Daniel Zohar <[email protected]> wrote:

> It's already hard to say what has the most significant impact here. Just to
> summarize, we had basically 3 changes which made an enormous improvement in
> performance:
> 1. Limit the number of 'possibleItemIDs' returned
> by SamplingCandidateStrategy
> 2. Optimizing getNumUsersWithPreferenceFor() with
> "larger.intersectionSize(smaller)"
> 3. Excluding 'singleton users' from
> GenericBooleanPrefDataModel.preferenceForItems
>
> I think we already agreed about 1 & 2 (correct me if I'm wrong).
> Regarding 3, if we had a Map that mapped itemID to numOfUsers, that might
> solve the problem and still give a great performance boost. You can see in
> my previous results that this change diminishes query time to at least a
> quarter (2s vs 0.5s). What do you think?
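P.S. Here's roughly what I mean by needing the user IDs rather than just a per-item count -- made-up names, a sketch and not the actual DataModel code:

    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    final class WhyCountsAreNotEnough {
      // getNumUsersWithPreferenceFor(x, y) is the size of the overlap
      // of the two items' user sets. A Map of itemID -> numOfUsers
      // gives |users(x)| and |users(y)|, but says nothing about how
      // many users the two items share, so the ID sets must be kept.
      static int numUsersWithBoth(Map<Long, Set<Long>> usersByItem,
                                  long x, long y) {
        Set<Long> usersOfX = usersByItem.get(x);
        Set<Long> usersOfY = usersByItem.get(y);
        if (usersOfX == null || usersOfY == null) {
          return 0;
        }
        Set<Long> overlap = new HashSet<>(usersOfX);
        overlap.retainAll(usersOfY);  // needs the IDs themselves
        return overlap.size();
      }
    }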
