Hello Sebastian,

Thanks very much for those explanations, very useful indeed.

For 1) --maxCooccurrencesPerItem, I understand that it's a cap on the number
of items that are considered for individual users.  This is very clear.  I
also agree with you that a more intelligent form of sampling could be useful
here.

For 2) --maxSimilaritiesPerItem, I'm not so sure that I follow, so I'd like
to confirm my understanding with you.  Imagine that we have item A, which
has been co-rated with items B, C and D.  As a user, I have rated items B,
C and D, and I'd like to predict my rating for item A.  If
--maxSimilaritiesPerItem is set to 1, will only my rating for one of B, C
and D be taken into account?  Similarly, if --maxSimilaritiesPerItem is set
to 2, will the values for two of B, C and D be taken into account?  If
that's correct, how is the selection made (e.g. at random, by frequency of
co-rating)?
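
To make my question concrete with made-up numbers: suppose sim(A,B) = 0.9,
sim(A,C) = 0.5 and sim(A,D) = 0.1, and suppose (this is part of what I'm
asking) that the cap keeps the largest similarities.  With
--maxSimilaritiesPerItem set to 2, my estimate for A would then, if I read
the weighted scheme in the Sarwar et al. paper you cited correctly, be

  estimate(A) = (0.9 * r(B) + 0.5 * r(C)) / (0.9 + 0.5)

and my rating for D would be ignored entirely.  Is that the right picture?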

Thanks,
Kris



2011/7/14 Sebastian Schelter <[email protected]>

> Hi Jack,
>
> trying to answer your questions in as much detail as possible:
>
> Regarding point 2) --maxSimilaritiesPerItem
>
> RecommenderJob uses item-based collaborative filtering to compute the
> recommendations and is a parallelized implementation of the algorithm
> presented in [1]. The main idea is to use a "neighbourhood" of similar
> items that have already been rated by a user to estimate his/her
> preference towards an unknown item. These similar items are found by
> comparing the ratings of frequently co-rated items according to some
> similarity measure. The parameter --maxSimilaritiesPerItem lets you
> specify the number of similar items per item to consider when estimating
> preferences towards an unknown item. Usually a small number of items
> should be sufficient; have a look at [1] for some numbers and experiments.
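>
> To make that concrete, here is a minimal sketch of the estimation step
> in Java. This is a simplification for illustration, not Mahout's actual
> code; the map-based lookups and class/method names are my own:
>
>   import java.util.Map;
>
>   class EstimationSketch {
>     /** Weighted-sum estimate of a user's preference for an unknown
>      *  item, using only its most similar items (the "neighbourhood",
>      *  at most maxSimilaritiesPerItem entries). */
>     static double estimatePreference(Map<Long, Double> userRatings,
>                                      Map<Long, Double> similarItems) {
>       double weightedSum = 0.0;
>       double similaritySum = 0.0;
>       for (Map.Entry<Long, Double> e : similarItems.entrySet()) {
>         Double rating = userRatings.get(e.getKey());
>         if (rating != null) {  // the user has rated this similar item
>           weightedSum += e.getValue() * rating;
>           similaritySum += Math.abs(e.getValue());
>         }
>       }
>       return similaritySum == 0.0 ? Double.NaN : weightedSum / similaritySum;
>     }
>   }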
>
> Regarding point 1) --maxCooccurrencesPerItem
>
> In order to compute the item-item similarities, a naive approach would
> have to consider all possible pairs of items, which has quadratic
> complexity and obviously won't scale.
>
> RowSimilarityJob, which is at the heart of both RecommenderJob and
> ItemSimilarityJob, ensures that only pairs of items that have been
> co-rated at least once are taken into consideration. This helps a lot in
> recommendation use cases, as most users have rated only a very small
> number of items.
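>
> Sketched in Java, the pattern is roughly this (the idea only, not
> RowSimilarityJob's actual code; the emit callback is a placeholder):
> candidate pairs are only ever generated from the items a single user has
> rated, so items that nobody co-rated never form a pair.
>
>   import java.util.List;
>
>   class CooccurrenceSketch {
>     /** Emit all pairs among one user's rated items. Note the nested
>      *  loops: this is where the quadratic growth per user comes from. */
>     static void emitPairs(List<Long> itemsRatedByUser) {
>       for (int i = 0; i < itemsRatedByUser.size(); i++) {
>         for (int j = i + 1; j < itemsRatedByUser.size(); j++) {
>           emit(itemsRatedByUser.get(i), itemsRatedByUser.get(j));
>         }
>       }
>     }
>
>     static void emit(long itemA, long itemB) {
>       // collect the pair for cooccurrence counting
>     }
>   }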
>
> However, if you look at the distribution of the number of ratings per
> user or per item, it will usually follow a heavy-tailed distribution:
> there is a small number of items ("topsellers") with an exorbitant
> number of ratings, as well as a small number of users ("powerusers")
> that show the same behavior.
>
> These powerusers and topsellers might slow down the similarity
> computation by orders of magnitude without providing much additional
> insight, because all pairs of items co-rated by a user still have to be
> considered, and that number grows quadratically: a user with n ratings
> contributes n*(n-1)/2 pairs, so a single poweruser with 10,000 ratings
> accounts for roughly 50 million pairs. I think Ted wrote a mail to this
> list some time ago where he confirmed this observation from his
> experience.
>
> So we need some way to sample down these ratings. This is done in
> MaybePruneRowsMapper with a very simple heuristic, controlled by
> --maxCooccurrencesPerItem, that only looks at the portion of the data
> available to that single mapper instance and might throw away ratings
> for very frequently rated items.
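>
> The heuristic boils down to something like the following (a rough sketch
> of the idea, not the actual MaybePruneRowsMapper): count, within one
> mapper's slice of the data, how often each item has been seen, and drop
> further ratings once an item exceeds the cap.
>
>   import java.util.HashMap;
>   import java.util.Map;
>
>   class PruningSketch {
>     private final Map<Long, Integer> countsSoFar = new HashMap<Long, Integer>();
>     private final int maxCooccurrencesPerItem;
>
>     PruningSketch(int maxCooccurrencesPerItem) {
>       this.maxCooccurrencesPerItem = maxCooccurrencesPerItem;
>     }
>
>     /** Keep a rating only while this instance has seen the item fewer
>      *  than maxCooccurrencesPerItem times. The view is local to one
>      *  mapper, hence the "maybe" in the name. */
>     boolean keep(long itemID) {
>       Integer seen = countsSoFar.get(itemID);
>       int count = (seen == null) ? 1 : seen + 1;
>       countsSoFar.put(itemID, count);
>       return count <= maxCooccurrencesPerItem;
>     }
>   }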
>
> I think this is a point where a lot of optimization is possible; Mahout
> should provide support for customizable sampling strategies here, like
> looking only at the x latest ratings of a user, for example.
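>
> Such a strategy could look roughly like this (purely hypothetical,
> nothing like it exists in Mahout today; the Rating holder is my own):
>
>   import java.util.Collections;
>   import java.util.Comparator;
>   import java.util.List;
>
>   class LatestRatingsSketch {
>     static class Rating {
>       long itemID;
>       double value;
>       long timestamp;
>     }
>
>     /** Keep only the x most recent ratings of a user. */
>     static List<Rating> sample(List<Rating> ratingsOfUser, int x) {
>       Collections.sort(ratingsOfUser, new Comparator<Rating>() {
>         public int compare(Rating a, Rating b) {
>           // newest first
>           if (a.timestamp > b.timestamp) return -1;
>           if (a.timestamp < b.timestamp) return 1;
>           return 0;
>         }
>       });
>       return ratingsOfUser.subList(0, Math.min(x, ratingsOfUser.size()));
>     }
>   }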
>
>
> --sebastian
>
> [1] Sarwar et al., "Item-Based Collaborative Filtering Recommendation
> Algorithms", http://portal.acm.org/citation.cfm?id=372071
>
>
>
> On 14.07.2011 16:11, Kris Jack wrote:
>
>> Hello,
>>
>> I'm trying to get a better understanding of the following 2 RecommenderJob
>> parameters:
>> 1) --maxCooccurrencesPerItem (integer): Maximum number of cooccurrences
>> considered per item (100)
>> 2) --maxSimilaritiesPerItem (integer): Maximum number of similarities
>> considered per item (100)
>>
>> Could you please help me to understand these in terms of a recommender job
>> where we are trying to recommend items to users?
>>
>> From what I see, maxCooccurrencesPerItem first gets used in job 4/12 in
>> the pipeline, the MaybePruneRowsMapper job.  Does maxCooccurrencesPerItem
>> limit the number of cooccurrences that are kept for that item?  Is this
>> limit applied within a single user's set of items or globally for all
>> users?  For example, if a user has 100 items, then each item can be seen
>> to cooccur with the 99 other items.  Taking all user libraries, however,
>> assume that it cooccurs with 1,000,000 other items.  Does
>> maxCooccurrencesPerItem limit the number of cooccurrences on a per-user
>> item-set basis, or is it applied to the set of items the item cooccurs
>> with across all user libraries?  Also, how is the selection made (most
>> frequent or first found)?
>>
>> maxSimilaritiesPerItem first gets used in job 7/12 in the pipeline,
>> EntriesToVectorsReducer.  Does this cap the number of rows that are
>> compared with one another?  Are the rows cooccurrence vectors of items
>> for a given user by this point in the process?
>>
>> Thanks,
>> Kris
>>
>>
>


-- 
Dr Kris Jack,
http://www.mendeley.com/profiles/kris-jack/
