Hello,
The Boolean* classes sped things up for me:
UserSimilarity similarity = new BooleanTanimotoCoefficientSimilarity(model);
hood = new NearestNUserNeighborhood(HOOD_SIZE, MIN_SIMILARITY, similarity,
model);
recommender = new BooleanUserGenericUserBasedRecommender(model, hood,
similarity);
Sean did recommend using an item-based recommender when the number of items is
relatively low compared to the number of users, but for now we only have a
Boolean flavour of the user-based recommender in svn.
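
For reference, the Tanimoto coefficient behind the Boolean similarity above is
just intersection size over union size of the two users' item sets. A minimal
standalone sketch follows. It is not Mahout's actual implementation: the class
name TanimotoSketch, the method name, and the choice to return 0.0 for two
empty sets are my own assumptions for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, independent of Mahout's classes.
public class TanimotoSketch {

    // Tanimoto coefficient of two item sets: |A intersect B| / |A union B|.
    // Returns 0.0 when both sets are empty (a convention assumed for this sketch).
    public static double tanimoto(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 0.0;
        }
        // Copy before retainAll so the caller's set is not modified.
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        int unionSize = a.size() + b.size() - intersection.size();
        return (double) intersection.size() / unionSize;
    }

    public static void main(String[] args) {
        // Items clicked by two users; absent items are simply not stored.
        Set<String> userA = new HashSet<>(Arrays.asList("324", "555", "333"));
        Set<String> userB = new HashSet<>(Arrays.asList("555", "333", "777"));
        System.out.println(tanimoto(userA, userB)); // prints 0.5
    }
}
```

Since only set membership matters, there are no preference values to store or
compare, which is presumably where the Boolean* classes save their time.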
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
----- Original Message ----
> From: Sean Owen <[email protected]>
> To: [email protected]
> Sent: Tuesday, January 20, 2009 7:26:07 AM
> Subject: Re: RE: RE: [jira] Commented: (MAHOUT-19) Hierarchial clusterer
>
> Yes, log-likelihood is already implemented as a similarity metric as
> well, no problem.
>
> What I need to do is cook up a quick DataModel that reads your file
> format. Is it really like this: user ID followed by item IDs that are
> associated? How can I tell when a line specifies the opposite, item
> followed by user IDs? The former is easier, BTW.
>
> [1234: 324, 555, 333]
>
> Then the code to build a recommender is basically:
>
> DataModel model = ...; // the custom DataModel I'll make
> ItemSimilarity similarity = new LogLikelihoodSimilarity(model);
> // can also try ... = new TanimotoCoefficientSimilarity(model);
> similarity = new CachingItemSimilarity(similarity); // for speed
> Recommender recommender = new GenericItemBasedRecommender(model, similarity);
> List<RecommendedItem> recommendations = recommender.recommend("1234", 10);
>
> Your data set is big but not so large that it couldn't fit in memory,
> I think. For now I think this easily runs on a reasonably beefy
> machine -- it's going to need a lot of RAM to cache lots of item-item
> similarities or else that computation will slow things down a lot.
>
> Easy enough to try and see how it flies.
>
> Sean
>
> On Tue, Jan 20, 2009 at 5:30 AM, Goel, Ankur wrote:
> > Hi Sean,
> > Thanks for helping out here. The data can be assumed to be in
> > either form mentioned below, since both forms are interchangeable:
> >
> > [userA: item1, item2, item3 ... ]
> > OR
> > [item1: userA, userB, userC ..]
> >
> > Each user and each item is assigned a unique identifier. The ratings can
> > be considered binary: 1 if the user clicked on an item and 0 otherwise.
> > Note that in the case of 0 the item does not exist in the user history.
> > So what we have is essentially a sparse representation where 0's are
> > not stored at all.
> >
> > As for which one is larger (users or items), the dataset has a relatively
> > high number of users and fewer items. There are around 200 - 300 thousand
> > unique items, but this is expected to grow to 1 - 2 million. So I think an
> > item based recommender sounds like something we can try out.
> >
> > About the Tanimoto measure, I thought of using it in hierarchical
> > clustering but Ted suggested it might not serve the purpose. He suggested
> > that we can try computing the log-likelihood of co-occurrence of items.
> >
> > I would like to try out both the item based recommender you suggested
> > and also the log-likelihood approach. Do we have a MapReduce version of
> > the log-likelihood code in Mahout?
> >
> > Ted, any thoughts?
> >
> > Regards
> > -Ankur