Hi,

I ran into some issues with GenericItembasedRecommender this week, which I could only work around by creating a custom ItembasedRecommender implementation. I think the issues are worth discussing here, and I'd be happy to contribute my changes back if we find them useful.

The first issue is with GenericItembasedRecommender.MultiMostSimilarEstimator, which is used to compute the most similar items to a collection of items. The current implementation filters out every candidate item whose similarity to any one of the input items is NaN, so a candidate only survives if it is similar to all of the input items. While this might be algorithmically correct, it very often leads to empty results. Users might, for example, put very different things in a shopping cart, and in my experience using those things as input for mostSimilarItems produces empty results in lots of cases. My workaround was to interpret NaN as 0 when computing the average estimate (and to filter out results with an average of 0 at the end), thus allowing an item to be included in the result if it is similar to at least one of the input items. If we decide to include this, we could either introduce a second mostSimilarItems method or give the existing one a parameter that determines the "exclusion mode", or whatever we might call it.
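
To make the two behaviours concrete, here is a minimal sketch of the averaging step in plain Java. It does not use the Mahout classes; the class name MultiSimilarityEstimate and the flag excludeIfNotSimilarToAll are hypothetical names chosen for illustration, and the real estimator may be structured differently.

```java
/** Hypothetical helper for illustration only; not part of Mahout. */
final class MultiSimilarityEstimate {

  /**
   * @param sims similarity of one candidate item to each input item (NaN = not similar)
   * @param excludeIfNotSimilarToAll true = strict behaviour, false = lenient workaround
   * @return the average similarity, or NaN if the candidate should be filtered out
   */
  static double estimate(double[] sims, boolean excludeIfNotSimilarToAll) {
    if (sims.length == 0) {
      return Double.NaN;
    }
    double sum = 0.0;
    for (double s : sims) {
      if (Double.isNaN(s)) {
        if (excludeIfNotSimilarToAll) {
          return Double.NaN; // strict: a single NaN excludes the candidate
        }
        continue; // lenient: NaN contributes 0 to the sum
      }
      sum += s;
    }
    double average = sum / sims.length;
    // lenient: an average of 0 is filtered out at the end
    if (!excludeIfNotSimilarToAll && average == 0.0) {
      return Double.NaN;
    }
    return average;
  }
}
```

With similarities {0.5, NaN}, the strict mode filters the candidate out, while the lenient mode keeps it with an estimate of 0.25.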

The second issue is a little bit more complicated. A while ago we introduced a component called CandidateItemsStrategy to enable customization of how the initial candidate items that might be recommended to a user are selected. I noticed that we should actually do the same thing for the selection of candidate items for mostSimilarItems, which is currently done in GenericItembasedRecommender.doMostSimilarItems(...). The current selection wastes CPU time, especially when we use precomputed similarities (GenericItemSimilarity or FileItemSimilarity), because we already "know" the possibly similar items. Unfortunately, there's no way to ask ItemSimilarity to directly give you all items similar to a given item (which would be the most efficient approach when dealing with precomputed similarities). I created a small file-based indexing component which can be asked for those, but I'm not too happy with spreading the information about the precomputed similarities. Still, I think we should work on improving the efficiency here, as it turned out to be a performance killer in my use case.
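
As a rough sketch of what such a pluggable selection could look like, the following plain-Java example looks up candidates in an in-memory index of precomputed similarities. The interface SimilarItemsCandidateStrategy and the class PrecomputedCandidateStrategy are hypothetical names, not existing Mahout APIs, and the in-memory map only stands in for the file-based index mentioned above.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical counterpart to CandidateItemsStrategy for mostSimilarItems. */
interface SimilarItemsCandidateStrategy {
  Set<Long> candidateItems(Collection<Long> itemIDs);
}

/** Hypothetical strategy backed by an index of precomputed item pairs. */
final class PrecomputedCandidateStrategy implements SimilarItemsCandidateStrategy {

  // item ID -> IDs of all items with a known (precomputed) similarity to it
  private final Map<Long, Set<Long>> neighbors = new HashMap<>();

  /** Registers a precomputed similarity between two items (symmetric). */
  void addSimilarity(long itemA, long itemB) {
    neighbors.computeIfAbsent(itemA, k -> new TreeSet<>()).add(itemB);
    neighbors.computeIfAbsent(itemB, k -> new TreeSet<>()).add(itemA);
  }

  /** Returns only items known to be similar to at least one input item. */
  @Override
  public Set<Long> candidateItems(Collection<Long> itemIDs) {
    Set<Long> candidates = new TreeSet<>();
    for (Long itemID : itemIDs) {
      Set<Long> known = neighbors.get(itemID);
      if (known != null) {
        candidates.addAll(known);
      }
    }
    candidates.removeAll(itemIDs); // never suggest the input items themselves
    return candidates;
  }
}
```

The point of the design is that the recommender would only score the returned candidates instead of every item in the data model.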

I hope I have made clear what the problems were (and what solutions I propose). I could also supply a patch in the coming weeks, but I wanted to have a discussion first.

--sebastian
