Hello Sebastian,

Thanks very much for those explanations, very useful indeed.
For 1) --maxCooccurrencesPerItem, I understand that it's a cap on the number of items that are considered for individual users. This is very clear. I also agree with you that a more intelligent form of sampling could be useful here.

For 2) --maxSimilaritiesPerItem, I'm not so sure that I follow, so I'd like to confirm my understanding with you. Imagine that we have item A, which has been co-rated with items B, C and D. As a user, I have rated items B, C and D, and I'd like to predict my rating for item A. If --maxSimilaritiesPerItem is set to 1, will only my rating for one of B, C and D be taken into account? Similarly, if --maxSimilaritiesPerItem is set to 2, will the values for 2 of B, C and D be taken into account? If that's correct, how is the selection made (e.g. at random, or by frequency of co-rating)?

Thanks,
Kris

2011/7/14 Sebastian Schelter <[email protected]>

> Hi Jack,
>
> Trying to answer your questions as detailed as possible:
>
> Regarding point 2) --maxSimilaritiesPerItem
>
> RecommenderJob uses item-based collaborative filtering to compute the recommendations and is a parallelized implementation of the algorithm presented in [1]. The main idea is to use a "neighbourhood" of similar items that have already been rated by a user to estimate his/her preference towards an unknown item. These similar items are found by comparing the ratings of frequently co-rated items according to some similarity measure. The parameter --maxSimilaritiesPerItem lets you specify the number of similar items per item to consider when estimating preferences towards an unknown item. Usually a small number of items should be sufficient; have a look at [1] for some numbers and experiments.
>
> Regarding point 1) --maxCooccurrencesPerItem
>
> In order to compute the item-item similarities, a naive approach would have to consider all possible pairs of items, which has quadratic complexity and obviously won't scale.
>
> RowSimilarityJob, which is at the heart of both RecommenderJob and ItemSimilarityJob, ensures that only pairs of items that have been co-rated at least once are taken into consideration. This helps a lot in recommendation use cases, as most users have only rated a very small number of items.
>
> However, if you look at the distribution of the number of ratings per user or per item, it will usually follow a heavy-tailed distribution, which means that there is a small number of items ("top sellers") with an exorbitant number of ratings, as well as a small number of users ("power users") that show the same behaviour.
>
> These power users and top sellers might slow down the similarity computation by orders of magnitude (as all pairs of items that have been co-rated have to be considered, which is still quadratic growth) without providing a lot of additional insight. I think Ted wrote a mail to this list some time ago where he confirmed this observation from his experience.
>
> So we need some way to sample down these ratings. This is done in MaybePruneRowsMapper with a very simple heuristic controlled by --maxCooccurrencesPerItem that only looks at the portion of data available to that single mapper instance and might throw away ratings for very frequently rated items.
>
> I think this is a point where a lot of optimization is possible; Mahout should provide support for customizable sampling strategies here, like looking only at the x latest ratings of a user, for example.
>
> --sebastian
>
> [1] Sarwar et al., "Item-Based Collaborative Filtering Recommendation Algorithms", http://portal.acm.org/citation.cfm?id=372071
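To make the estimation step concrete, here is a minimal, self-contained sketch of the weighted-sum prediction from [1] with a cap in the spirit of --maxSimilaritiesPerItem. The class and method names are illustrative, not Mahout's actual code, and the sketch assumes the cap keeps the most similar items, which is one plausible answer to Kris's selection question:

    import java.util.Arrays;
    import java.util.Comparator;

    /** Illustrative weighted-sum prediction, capped like --maxSimilaritiesPerItem. */
    public class CappedPrediction {

      static class Neighbor {
        final double similarity; // similarity between the unknown item and this item
        final double rating;     // the user's rating for this item
        Neighbor(double similarity, double rating) {
          this.similarity = similarity;
          this.rating = rating;
        }
      }

      /** Estimate a rating for an unknown item from at most maxSimilaritiesPerItem neighbors. */
      static double estimate(Neighbor[] ratedNeighbors, int maxSimilaritiesPerItem) {
        // assumption: keep only the most similar items the user has rated
        Arrays.sort(ratedNeighbors,
            Comparator.comparingDouble((Neighbor n) -> n.similarity).reversed());
        int k = Math.min(maxSimilaritiesPerItem, ratedNeighbors.length);

        double weightedSum = 0.0;
        double similaritySum = 0.0;
        for (int i = 0; i < k; i++) {
          weightedSum += ratedNeighbors[i].similarity * ratedNeighbors[i].rating;
          similaritySum += Math.abs(ratedNeighbors[i].similarity);
        }
        return similaritySum == 0.0 ? Double.NaN : weightedSum / similaritySum;
      }

      public static void main(String[] args) {
        // Kris's example: item A co-rated with B, C and D, which the user rated 4, 2 and 5
        Neighbor[] neighbors = {
            new Neighbor(0.9, 4.0),  // B
            new Neighbor(0.3, 2.0),  // C
            new Neighbor(0.7, 5.0)   // D
        };
        // with the cap set to 2, only B and D (the most similar items) contribute:
        System.out.println(estimate(neighbors, 2)); // (0.9*4 + 0.7*5) / (0.9 + 0.7) = 4.4375
      }
    }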
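Similarly, a rough sketch of the per-mapper down-sampling Sebastian describes for --maxCooccurrencesPerItem. Again, the names are illustrative; this captures only the flavour of MaybePruneRowsMapper, not its actual implementation:

    import java.util.HashMap;
    import java.util.Map;

    /** Illustrative per-mapper pruning in the spirit of MaybePruneRowsMapper. */
    public class MaybePrune {

      private final int maxCooccurrencesPerItem;
      // counts only the data this single "mapper" instance has seen so far
      private final Map<Long, Integer> ratingsPerItem = new HashMap<>();

      MaybePrune(int maxCooccurrencesPerItem) {
        this.maxCooccurrencesPerItem = maxCooccurrencesPerItem;
      }

      /** Returns true if a rating for this item should be kept. */
      boolean keepRating(long itemID) {
        int seen = ratingsPerItem.merge(itemID, 1, Integer::sum);
        // very frequently rated items ("top sellers") get capped, which bounds
        // the otherwise quadratic number of co-rated pairs they generate
        return seen <= maxCooccurrencesPerItem;
      }

      public static void main(String[] args) {
        MaybePrune pruner = new MaybePrune(2);
        long itemA = 42L;
        System.out.println(pruner.keepRating(itemA)); // true  (1st rating seen)
        System.out.println(pruner.keepRating(itemA)); // true  (2nd rating seen)
        System.out.println(pruner.keepRating(itemA)); // false (pruned)
      }
    }

Because each mapper counts only its own split of the data, the cap is approximate: globally an item may keep more ratings than the threshold, which matches Sebastian's description of a "very simple heuristic".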
"Itembased Collaborative Filtering Algorithms" > http://portal.acm.org/**citation.cfm?id=372071<http://portal.acm.org/citation.cfm?id=372071> > > > > On 14.07.2011 16:11, Kris Jack wrote: > >> Hello, >> >> I'm trying to get a better understanding of the following 2 RecommenderJob >> parameters: >> 1) --maxCooccurrencesPerItem (integer): Maximum number of cooccurrences >> considered per item (100) >> 2) --maxSimilaritiesPerItem (integer): Maximum number of similarities >> considered per item (100) >> >> Could you please help me to understand these in terms of a recommender job >> where we are trying to recommend items to users? >> >> From what I see, maxCooccurrencesPerItem first gets used in job 4/12 in >> the >> pipeline, the MaybePruneRowsMapper job. Does maxCooccurrencesPerItem >> limit >> the number of cooccurrences that are kept for that item? Is this limit >> within a single user's set of items or globally for all users? For >> example, >> if a user has 100 items then each item can be seen to cooccur with the 99 >> other items. Taking all user libraries, however, assume that it cooccurs >> with 1,000,000 other items. Does maxCooccurrencesPerItem limit the number >> of cooccurrences on a user item set basis or is this applied to the set of >> items with which the item cooccurs with regard to all user libraries? >> Also, >> how is the selection made (most frequent or first found)? >> >> maxSimilaritiesPerItem first gets used in job 7/12 in the pipeline, >> EntriesToVectorsReducer. Does this cap the number of rows that are >> compared >> with one another? Are the rows cooccurrence vectors of items for a given >> user by this point in the process? >> >> Thanks, >> Kris >> >> > -- Dr Kris Jack, http://www.mendeley.com/profiles/kris-jack/
