Re: Cosine Similarity and LogLikelihood not helpful for implicit feedback!

2014-09-29 Thread Ted Dunning
How are you using LLR to compute user similarity?  It is normally used to
compute item similarity.

Also, what is your scale?  How many users? How many items? How many
actions per user?



On Mon, Sep 29, 2014 at 6:24 PM, Parimi Rohit wrote:

> Hi,
>
> I am exploring a random-walk-based algorithm for recommender systems, which
> works by propagating users' item preferences over the user-user graph. To
> do this, I have to compute user-user similarity and form a neighborhood.
> I have tried the following three simple techniques to compute the score
> between two users and find the neighborhood.
>
> 1. Score = (items common to users A and B) / (items preferred by A +
> items preferred by B)
> 2. Scoring based on Mahout's Cosine Similarity
> 3. Scoring based on Mahout's LogLikelihood similarity.
>
> My understanding is that similarity based on LogLikelihood is more robust;
> however, I get better results using the naive approach (technique 1 from
> the list above). The problems I am addressing are collaborator
> recommendation, conference recommendation, and reference recommendation,
> and the data has implicit feedback.
>
> So, my question is: are there any cases where the cosine similarity and
> log-likelihood metrics fail to capture similarity? For example, in the
> problems stated above, users collaborate with only a few other users
> (based on area of interest), publish in only a few conferences (again
> based on area of interest), and refer to publications in a specific
> domain. So the preference counts are fairly small compared to other
> domains (music/video, etc.).
>
> Secondly, for CosineSimilarity, should I treat the preferences as boolean
> or use the counts? (I think the log-likelihood metric does not take the
> preference counts into account... correct me if I am wrong.)
>
> Any insight into this is much appreciated.
>
> Thanks,
> Rohit
>
> p.s. Ted, Pat: I am following the discussion on the thread
> "LogLikelihoodSimilarity Calculation" and your answers helped me a lot to
> understand how it works and made me wonder why things are different in my
> case.
>
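
For reference, the three scores quoted above can be sketched in a few
lines of plain Python (toy data, not Mahout's implementations; the LLR
follows the standard 2x2 contingency-table form that Mahout's
LogLikelihoodSimilarity is built on):

    import math

    def naive_overlap(a, b):
        # Technique 1: shared items over the sum of both users' item counts.
        return len(a & b) / (len(a) + len(b))

    def cosine(u, v):
        # Cosine between two sparse vectors given as {item: weight} dicts.
        dot = sum(w * v.get(i, 0.0) for i, w in u.items())
        norm = (math.sqrt(sum(w * w for w in u.values()))
                * math.sqrt(sum(w * w for w in v.values())))
        return dot / norm if norm else 0.0

    def llr(k11, k12, k21, k22):
        # Log-likelihood ratio (Dunning's G^2) for a 2x2 contingency table.
        def ent(*counts):
            total = sum(counts)
            return sum(k * math.log(k / total) for k in counts if k > 0)
        return 2.0 * (ent(k11, k12, k21, k22)
                      - ent(k11 + k12, k21 + k22)
                      - ent(k11 + k21, k12 + k22))

    # Hypothetical implicit-feedback counts for two users.
    a_counts = {"kdd": 3, "icml": 1, "www": 2}
    b_counts = {"kdd": 1, "www": 4, "nips": 2}
    a, b = set(a_counts), set(b_counts)
    n_items = 2393  # total catalogue size (made up for the example)

    k11 = len(a & b)                 # items both users touched
    k12 = len(a - b)                 # items only user A touched
    k21 = len(b - a)                 # items only user B touched
    k22 = n_items - k11 - k12 - k21  # items neither touched

    print("naive    :", naive_overlap(a, b))
    print("cos/bool :", cosine({i: 1.0 for i in a}, {i: 1.0 for i in b}))
    print("cos/count:", cosine(a_counts, b_counts))
    print("LLR      :", llr(k11, k12, k21, k22))

Note that the boolean and count variants of the cosine generally differ,
which bears on the second question above.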


Re: Cosine Similarity and LogLikelihood not helpful for implicit feedback!

2014-09-30 Thread Parimi Rohit
Ted, thanks for your response. Here is some information about the
approach and the datasets:

I am using the ItemSimilarityJob and passing it "itemID, userID,
prefCount" tuples as input, so that it computes user-user similarity
using LLR. I read this approach in an answer to one of the Stack
Overflow questions on calculating user similarity with Mahout.
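
(For clarity, the preprocessing amounts to swapping the first two columns
of the usual userID,itemID,prefCount input, so that the job treats users
as "items" and therefore emits user-user similarities. A rough sketch in
Python, with made-up file names:)

    # Swap the first two columns so ItemSimilarityJob's "item" column
    # holds user IDs; its output is then user-user similarity.
    with open("prefs.csv") as src, open("prefs_transposed.csv", "w") as dst:
        for line in src:
            user_id, item_id, count = line.strip().split(",")
            dst.write(f"{item_id},{user_id},{count}\n")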


Following are the stats for the datasets:

Coauthor dataset:

users = 29189
items = 140091
averageItemsClicked = 15.808660796875536

Conference dataset:

users = 29189
items = 2393
averageItemsClicked = 7.265099866388023

Reference dataset:

users = 29189
items = 201570
averageItemsClicked = 61.08564870327863

By scale, did you mean rating scale? If so, I am using preference counts,
not ratings.

Thanks,
Rohit


Re: Cosine Similarity and LogLikelihood not helpful for implicit feedback!

2014-09-30 Thread Ted Dunning
This is an incredibly tiny dataset.  If you delete singletons, it is likely
to get significantly smaller.

I think that something like LDA might work much better for you. It was
designed to work on small data like this.
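
In case it is useful, here is a quick way to check how much singleton
removal shrinks the data (a Python sketch over hypothetical
userID,itemID,count CSV input; the file name is made up):

    from collections import Counter

    # Count item occurrences, then drop items that appear only once.
    with open("prefs.csv") as f:
        rows = [line.strip().split(",") for line in f]

    item_freq = Counter(item for _, item, _ in rows)
    kept = [r for r in rows if item_freq[r[1]] > 1]
    print(f"kept {len(kept)} of {len(rows)} preference rows")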


Re: Cosine Similarity and LogLikelihood not helpful for implicit feedback!

2014-09-30 Thread Parimi Rohit
Ted,

I know LDA can be used to model text data, but I have never used it in
this setting. Can you give me some pointers on how to apply it here?

Thanks,
Rohit


Re: Cosine Similarity and LogLikelihood not helpful for implicit feedback!

2014-09-30 Thread Ted Dunning
Here is a paper that includes an analysis of voting patterns using LDA.

http://arxiv.org/pdf/math/0604410.pdf
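
One way to carry that idea over to implicit feedback (a sketch of the
general recipe, not something taken from the paper): treat each user as a
"document" whose "words" are the items they touched, fit LDA, and compare
users in topic space. With scikit-learn, and a random matrix standing in
for the real user-item counts:

    import numpy as np
    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.metrics.pairwise import cosine_similarity

    rng = np.random.default_rng(0)
    # Stand-in for real data: a small user-by-item preference-count matrix.
    counts = rng.poisson(0.1, size=(200, 500))

    lda = LatentDirichletAllocation(n_components=20, random_state=0)
    user_topics = lda.fit_transform(counts)  # one topic mixture per user

    # User-user similarity in topic space; row i scores user i's neighbours.
    sims = cosine_similarity(user_topics)
    nearest = np.argsort(-sims[0])[1:11]     # ten nearest neighbours of user 0
    print(nearest)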



Re: Cosine Similarity and LogLikelihood not helpful for implicit feedback!

2014-10-01 Thread Parimi Rohit
Thanks, Ted! Will look into it.

Rohit
