Hi Sean,
Thanks for helping out here. The data can be assumed to be in either of
the two forms below, since they are interchangeable:
[userA: item1, item2, item3 ... ]
OR
[item1: userA, userB, userC ..]
Each user and each item is assigned a unique identifier. The ratings can
be considered binary: 1 if the user clicked on an item and 0 otherwise.
One thing to note here is that a 0 means the item does not exist in the
user's history. So what we essentially have is a sparse representation
where 0's are not stored at all.
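A minimal sketch of this sparse representation, assuming hypothetical
identifiers (userA, item1, ...) and plain Python sets rather than any
Mahout data structure. It also shows why the two forms are
interchangeable: inverting one mapping yields the other.

```python
from collections import defaultdict

# Hypothetical click history: user -> set of clicked items.
# An item that is absent from a user's set is an implicit 0.
user_items = {
    "userA": {"item1", "item2", "item3"},
    "userB": {"item1", "item3"},
    "userC": {"item2"},
}


def invert(mapping):
    """Turn a key -> set-of-values mapping into value -> set-of-keys."""
    inverted = defaultdict(set)
    for key, values in mapping.items():
        for value in values:
            inverted[value].add(key)
    return dict(inverted)


# Item-oriented form: item -> set of users who clicked it.
item_users = invert(user_items)
```

Inverting item_users again gives back user_items, so either form can be
treated as the input.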
As for whether we have more users or more items: the dataset has a
relatively high number of users and fewer items. There are around
200 - 300 thousand unique items, but that is expected to grow to
1 - 2 million. So I think an item-based recommender sounds like
something we can try out.
About the Tanimoto measure, I thought of using it in hierarchical
clustering, but Ted suggested it might not serve the purpose. He
suggested that we can try computing the log-likelihood of co-occurrence
of items.
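A rough sketch of what such a score could look like, assuming the
statistic meant here is the log-likelihood ratio (G^2) test on a 2x2
contingency table of item co-occurrence. The function names and the
way the counts are derived from user sets are illustrative assumptions,
not Mahout code.

```python
import math


def x_log_x(x):
    """Return x * ln(x), with the usual convention that 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    """Unnormalized Shannon entropy: N * H, where N is the sum of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def log_likelihood_ratio(k11, k12, k21, k22):
    """G^2 for a 2x2 table.

    k11: users who clicked both items
    k12: users who clicked the first item but not the second
    k21: users who clicked the second item but not the first
    k22: users who clicked neither
    """
    row_entropy = entropy(k11 + k12, k21 + k22)
    col_entropy = entropy(k11 + k21, k12 + k22)
    matrix_entropy = entropy(k11, k12, k21, k22)
    # Clamp at zero to absorb tiny negative values from rounding.
    return max(0.0, 2.0 * (row_entropy + col_entropy - matrix_entropy))


def cooccurrence_llr(users_a, users_b, total_users):
    """Score two items from the sets of users who clicked each one."""
    k11 = len(users_a & users_b)
    k12 = len(users_a - users_b)
    k21 = len(users_b - users_a)
    k22 = total_users - len(users_a | users_b)
    return log_likelihood_ratio(k11, k12, k21, k22)
```

A score near 0 means the two items co-occur about as often as chance
would predict; larger scores mean the co-occurrence is less likely to
be a coincidence.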
I would like to try out both the item-based recommender you suggested
and the log-likelihood approach. Do we have a map-reduce version of the
log-likelihood code in Mahout?
Ted, any thoughts?
Regards
-Ankur
-----Original Message-----
From: Sean Owen [mailto:[email protected]]
Sent: Monday, January 19, 2009 7:02 PM
To: [email protected]
Subject: Re: RE: RE: [jira] Commented: (MAHOUT-19) Hierarchial clusterer
Binary meaning you just have a "yes, the user likes or has seen or
bought the book" versus no relation at all? Yeah that should be a
special case of what CF normally works on, where you have some degree
of preference versus yes/no. In that sense yes it is supported and in
theory should be a lot faster -- in practice it's only easy to gain a
little speedup since to really re-orient the algorithms to take
advantage of this case would take a lot of change.
Thinking it through... I am not sure slope one would work in this
case. It operates on relative differences in ratings across items, and
if all your ratings are "1.0" if they exist at all, then it falls
apart.
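To make the problem concrete, here is a minimal sketch of plain
(unweighted) slope one on binary data. The ratings and function names
are hypothetical, and this is not the Mahout implementation.

```python
from itertools import permutations

# Hypothetical binary data: every rating that exists is 1.0.
ratings = {
    "userA": {"item1": 1.0, "item2": 1.0},
    "userB": {"item1": 1.0, "item3": 1.0},
    "userC": {"item2": 1.0, "item3": 1.0},
}


def average_differences(ratings):
    """Average rating difference (i minus j) over users who rated both."""
    sums, counts = {}, {}
    for prefs in ratings.values():
        for i, j in permutations(prefs, 2):
            sums[(i, j)] = sums.get((i, j), 0.0) + prefs[i] - prefs[j]
            counts[(i, j)] = counts.get((i, j), 0) + 1
    return {pair: sums[pair] / counts[pair] for pair in sums}


def predict(user, item, ratings, diffs):
    """Unweighted slope one estimate of the user's rating for item."""
    prefs = ratings[user]
    estimates = [prefs[j] + diffs[(item, j)] for j in prefs if (item, j) in diffs]
    return sum(estimates) / len(estimates) if estimates else None


diffs = average_differences(ratings)
prediction = predict("userA", "item3", ratings, diffs)
```

Every entry in diffs is 0.0, so every prediction comes out as 1.0 and
there is nothing left to rank items by.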
So perhaps the other algorithms are a better place to start after all.
The binary case does allow you to use fast similarity metrics like the
Tanimoto measure, and if you have a fast similarity metric you
generally have a fast algorithm since most algorithms rely heavily on
computing similarity metrics.
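For reference, a minimal sketch of the Tanimoto measure on binary data,
assuming each item is represented by the set of users who clicked it
(the names are hypothetical):

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) coefficient of two sets: |a & b| / |a | b|."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


# Hypothetical sets of users who clicked each item.
item1_users = {"userA", "userB"}
item2_users = {"userA", "userC"}
similarity = tanimoto(item1_users, item2_users)
```

It only needs set intersection and union sizes, which is why it is
cheap in the binary case.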
Do you have relatively many users, or many items? If you have
relatively few items, an item-based recommender is ideal -- and vice
versa with user-based recommenders.
How is that sounding? What form is your data in? I could send over
rough draft code to try out.
On Mon, Jan 19, 2009 at 12:37 PM, Goel, Ankur <[email protected]>
wrote:
> Yep! I actually want to recommend items of interest, where the item
> depends on the context; say, for an online bookshop it is books. A few
> questions regarding slope one.
> 1. Can it be applied to a binary data setting like mine?
> 2. Do we have an implementation for it in Mahout?
> 3. Will it scale well?