Yes, log-likelihood is already implemented as a similarity metric
as well, no problem.
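
For reference, here is a minimal standalone sketch of the log-likelihood ratio for the co-occurrence of two items, computed from a 2x2 table of counts. It is only an illustration of the statistic, not the library's own code; the class name, method names, and the meaning I assign to k11..k22 are assumptions made for this example.

```java
/** Standalone sketch of the log-likelihood ratio on a 2x2 co-occurrence table. */
final class LogLikelihoodSketch {

  /** x * ln(x), with the usual convention that 0 * ln(0) = 0. */
  private static double xLogX(long x) {
    return x == 0 ? 0.0 : x * Math.log(x);
  }

  /** Unnormalized entropy: N*ln(N) - sum(k*ln(k)) where N is the sum of the counts. */
  private static double entropy(long... counts) {
    long sum = 0;
    double sumOfXLogX = 0.0;
    for (long count : counts) {
      sumOfXLogX += xLogX(count);
      sum += count;
    }
    return xLogX(sum) - sumOfXLogX;
  }

  /**
   * k11 = users who clicked both items (assumed meaning),
   * k12 = users who clicked only the first item,
   * k21 = users who clicked only the second item,
   * k22 = users who clicked neither.
   */
  static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
    double rowEntropy = entropy(k11 + k12, k21 + k22);
    double columnEntropy = entropy(k11 + k21, k12 + k22);
    double matrixEntropy = entropy(k11, k12, k21, k22);
    // Clamp at zero to absorb tiny negative values caused by rounding.
    return Math.max(0.0, 2.0 * (rowEntropy + columnEntropy - matrixEntropy));
  }

  public static void main(String[] args) {
    // Made-up counts, purely to show the call.
    System.out.println(logLikelihoodRatio(10, 0, 0, 10));
    System.out.println(logLikelihoodRatio(1, 1, 1, 1));
  }
}
```

A table where the two items are independent scores near zero, and the score grows as the items co-occur more often than chance would predict.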

What I need to do is cook up a quick DataModel that reads your file
format. Is it really like this: a user ID followed by the item IDs
associated with it? How can I tell when a line specifies the opposite,
an item followed by user IDs? The former is easier, BTW.

[1234: 324, 555, 333]
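
Parsing one such line could look roughly like the sketch below. It assumes the user-first form, square brackets around the line, a colon after the user ID, and comma-separated item IDs, exactly as in the example above. The class name UserItemsLine is hypothetical and this is not the DataModel itself, just the line-splitting part.

```java
import java.util.ArrayList;
import java.util.List;

/** One parsed line: a user ID and the item IDs associated with it. */
final class UserItemsLine {
  final String userId;
  final List<String> itemIds;

  UserItemsLine(String userId, List<String> itemIds) {
    this.userId = userId;
    this.itemIds = itemIds;
  }

  /** Parses a line such as "[1234: 324, 555, 333]" (user-first form assumed). */
  static UserItemsLine parse(String line) {
    String s = line.trim();
    // Strip the surrounding brackets if both are present.
    if (s.startsWith("[") && s.endsWith("]")) {
      s = s.substring(1, s.length() - 1);
    }
    int colon = s.indexOf(':');
    if (colon < 0) {
      throw new IllegalArgumentException("No ':' separator in line: " + line);
    }
    String userId = s.substring(0, colon).trim();
    List<String> itemIds = new ArrayList<String>();
    for (String token : s.substring(colon + 1).split(",")) {
      String itemId = token.trim();
      if (!itemId.isEmpty()) {
        itemIds.add(itemId);
      }
    }
    return new UserItemsLine(userId, itemIds);
  }

  public static void main(String[] args) {
    UserItemsLine parsed = parse("[1234: 324, 555, 333]");
    System.out.println(parsed.userId + " -> " + parsed.itemIds);
  }
}
```

Nothing in a line like this says whether the leading ID is a user or an item, so the item-first form would need either a separate file or some marker in the data.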

Then the code to build a recommender is basically:

DataModel model = ...; // the custom DataModel I'll make
ItemSimilarity similarity = new LogLikelihoodSimilarity(model);
// can also try ... = new TanimotoCoefficientSimilarity(model);
similarity = new CachingItemSimilarity(similarity); // for speed
Recommender recommender = new GenericItemBasedRecommender(model, similarity);
List<RecommendedItem> recommendations = recommender.recommend("1234", 10);
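
For comparison, the Tanimoto coefficient between two items on binary click data is just the size of the intersection of their user sets divided by the size of the union. The following is a standalone sketch of that definition, not the library's implementation; the class and method names are made up for illustration, and returning 0 for two empty sets is my own convention.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Standalone sketch of the Tanimoto (Jaccard) coefficient on two sets of user IDs. */
final class TanimotoSketch {

  static double tanimoto(Set<String> usersOfItemA, Set<String> usersOfItemB) {
    if (usersOfItemA.isEmpty() && usersOfItemB.isEmpty()) {
      return 0.0; // assumed convention: no data means no similarity
    }
    // Iterate over the smaller set and probe the larger one.
    Set<String> smaller = usersOfItemA.size() <= usersOfItemB.size() ? usersOfItemA : usersOfItemB;
    Set<String> larger = smaller == usersOfItemA ? usersOfItemB : usersOfItemA;
    int intersection = 0;
    for (String user : smaller) {
      if (larger.contains(user)) {
        intersection++;
      }
    }
    int union = usersOfItemA.size() + usersOfItemB.size() - intersection;
    return (double) intersection / union;
  }

  public static void main(String[] args) {
    // Made-up user IDs, purely to show the call.
    Set<String> a = new HashSet<String>(Arrays.asList("u1", "u2", "u3"));
    Set<String> b = new HashSet<String>(Arrays.asList("u2", "u3", "u4"));
    System.out.println(tanimoto(a, b));
  }
}
```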

Your data set is big, but not so large that it couldn't fit in memory,
I think. For now I think this easily runs on a reasonably beefy
machine. It's going to need a lot of RAM to cache lots of item-item
similarities, though, or else that computation will slow things down a lot.
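
To get a feel for the numbers, here is a rough back-of-envelope sketch that counts the distinct item-item pairs for the item counts mentioned in the quoted message below (300 thousand now, 2 million later) and multiplies by an assumed 8 bytes per cached value. The 8 bytes is my assumption for a bare double; any real cache adds per-entry overhead on top, so treat the result as a lower bound.

```java
/** Back-of-envelope estimate of what caching every item-item similarity would cost. */
final class SimilarityCacheEstimate {

  /** Number of unordered pairs of distinct items: n * (n - 1) / 2. */
  static long distinctPairs(long itemCount) {
    return itemCount * (itemCount - 1) / 2;
  }

  /** Raw bytes needed if every pair were cached at the given size per entry. */
  static long bytesForAllPairs(long itemCount, long bytesPerEntry) {
    return distinctPairs(itemCount) * bytesPerEntry;
  }

  public static void main(String[] args) {
    long assumedBytesPerEntry = 8; // assumption: one double per pair, no overhead
    for (long items : new long[] {300000L, 2000000L}) {
      long pairs = distinctPairs(items);
      double gib = bytesForAllPairs(items, assumedBytesPerEntry) / (1024.0 * 1024.0 * 1024.0);
      System.out.println(items + " items: " + pairs + " pairs, about " + Math.round(gib) + " GiB");
    }
  }
}
```

Even at the lower item count this is tens of billions of pairs, so the cache can only ever hold a subset of the pairs and has to be bounded in size.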

Easy enough to try and see how it flies.

Sean

On Tue, Jan 20, 2009 at 5:30 AM, Goel, Ankur <[email protected]> wrote:
> Hi Sean,
>        Thanks for helping out here. The data can be assumed to be in
> either form mentioned below, since both forms are interchangeable:-
>
> [userA: item1, item2, item3 ... ]
> OR
> [item1: userA, userB, userC ..]
>
> Each user and each item is assigned a unique identifier. The ratings can
> be considered binary: 1 if the user clicked on an item and 0 otherwise.
> The thing to note here is that in the case of 0 the item does not exist
> in the user history. So what we have essentially is a sparse
> representation where 0's are not stored at all.
>
> As for which there are more of (users or items): the dataset has a
> relatively high number of users and fewer items. There are around 200 - 300
> thousand unique items, but that is expected to grow to 1 - 2 million. So I
> think an item based recommender sounds like something we can try out.
>
> About Tanimoto measure, I thought of using it in hierarchical clustering
> but Ted suggested it might not solve the purpose. He suggested that we
> can try computing the log-likelihood of co-occurrence of items.
>
> I would like to try out both the item based recommender you suggested
> and also the log-likelihood approach. Do we have the map-red version of
> log-likelihood code in Mahout?
>
> Ted, any thoughts?
>
> Regards
> -Ankur
