Broadly I think you are welcome to propose a patch. These changes make sense.

I would ignore similarities that are NaN rather than treat them as 0. Such changes are defensible in practice, even if they start to drift away from something theoretically sound.

In my head I have started to think that, when computing an estimate which is conceptually the average of many things, it would be best to rank by something like "average - stddev" rather than the average. While an average is the best estimate, this other value captures some notion of sureness and penalizes estimates based on fewer data points. Conceptually, the value is one that is X% likely to be less than or equal to the real value. (Here, X is 84.1, assuming roughly normal errors.) But this is a digression.

You can add a new bulk method to ItemSimilarity (and UserSimilarity for symmetry) if you feel it is sound and useful.

On Fri, Dec 3, 2010 at 2:52 PM, Sebastian Schelter <[email protected]> wrote:
> Hi,
>
> I ran into some issues with GenericItembasedRecommender this week, which I
> could only work around by creating a custom ItembasedRecommender
> implementation. I think the issues might be worth discussing here and I'd
> look forward to committing back my changes if we find them useful.
>
> The first issue is with
> GenericItembasedRecommender.MultiMostSimilarEstimator, which is used to
> compute the most similar items to a collection of items. The current
> implementation filters out all items that are not similar (having NaN as
> similarity value) to at least one of the input items. While this might be
> algorithmically correct it very often leads to empty results. Users might
> e.g. put very different things in a shopping cart and using those things as
> input for mostSimilarItems produces empty results in lots of cases in my
> experience. My workaround was to interpret NaN as 0 when computing the
> average estimate here (and in the end filtering out results that had 0 as
> average), thus allowing an item to be included in the result if it is
> similar to at least one of the input items.
> If we decide to include this we
> could either introduce a second mostSimilarItems method or make it receive a
> parameter to determine the "exclusion mode" or whatever we might call it.
>
> The second issue is a little bit more complicated. A while ago we
> introduced a component called CandidateItemsStrategy to enable the
> customization of the selection of the initial candidate items that might be
> recommended to a user. I noticed that we actually should do the same thing
> with the selection of candidate items for mostSimilarItems, which is
> currently done in GenericItembasedRecommender.doMostSimilarItems(...). This
> especially wastes CPU time when we use precomputed similarities
> (GenericItemSimilarity or FileItemSimilarity) because we already "know" the
> possibly similar items. Unfortunately there's no way to ask ItemSimilarity
> to directly give you all similar items to an item (which would be by far
> the most efficient approach when dealing with already precomputed
> similarities). I created a small file-based indexing component which can be
> asked for those but I'm not too happy with spreading the information about
> the precomputed similarities. Though I think we should work on improving the
> efficiency here as it turned out to be a performance killer in my use case.
>
> I hope I can make it clear what the problems were (and what solutions I
> propose). I could also supply a patch in the next weeks but I wanted to have
> a discussion first.
>
> --sebastian
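
A rough Java sketch of the two ideas in the reply above (skipping NaN similarities when averaging, and ranking by "average - stddev") might look like the following. It is not Mahout code: the class ConservativeEstimate and its method names are hypothetical, chosen only for illustration. The standard-error variant is an added assumption, one possible way to make the penalty for few data points explicit; the thread itself only says "something like average-stddev".

```java
// Hypothetical helper for illustration only; nothing here exists in Mahout.
final class ConservativeEstimate {

  private ConservativeEstimate() {}

  /** Mean of the non-NaN values, or NaN if every value is NaN. */
  static double meanIgnoringNaN(double[] similarities) {
    double sum = 0.0;
    int count = 0;
    for (double s : similarities) {
      if (!Double.isNaN(s)) { // skip NaN instead of counting it as 0
        sum += s;
        count++;
      }
    }
    return count == 0 ? Double.NaN : sum / count;
  }

  /** Sample standard deviation of the non-NaN values; NaN with fewer than two. */
  static double stdDevIgnoringNaN(double[] similarities) {
    double mean = meanIgnoringNaN(similarities);
    double sumSquares = 0.0;
    int count = 0;
    for (double s : similarities) {
      if (!Double.isNaN(s)) {
        double diff = s - mean;
        sumSquares += diff * diff;
        count++;
      }
    }
    return count < 2 ? Double.NaN : Math.sqrt(sumSquares / (count - 1));
  }

  /** Literal reading of "average - stddev". */
  static double meanMinusStdDev(double[] similarities) {
    return meanIgnoringNaN(similarities) - stdDevIgnoringNaN(similarities);
  }

  /**
   * Assumed variant: subtract the standard error (stddev / sqrt(n)) so the
   * penalty shrinks as more non-NaN data points become available.
   */
  static double meanMinusStdError(double[] similarities) {
    int count = 0;
    for (double s : similarities) {
      if (!Double.isNaN(s)) {
        count++;
      }
    }
    if (count < 2) {
      return Double.NaN; // no spread can be estimated from fewer than two values
    }
    return meanIgnoringNaN(similarities)
        - stdDevIgnoringNaN(similarities) / Math.sqrt(count);
  }
}
```

Both scores return NaN when fewer than two usable similarities exist, so a caller would still need to decide whether to drop such items or fall back to the plain mean. The 84.1% figure corresponds to one standard deviation below the estimate under a normal approximation.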
