Re: Cosine Similarity and LogLikelihood not helpful for implicit feedback!

2014-10-01 Thread Ted Dunning
Here is a paper that includes an analysis of voting patterns using LDA.

http://arxiv.org/pdf/math/0604410.pdf
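[Editor's note: not from the thread, but as a concrete starting point for "how to apply LDA in this setting": one common approach is to treat each user as a "document" whose "words" are the items they interacted with, then compare users by their inferred topic mixtures. A minimal sketch using scikit-learn; the toy matrix, topic count, and all names are illustrative, not anything from Mahout or the paper above.]

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Toy user-item count matrix: 6 "documents" (users) x 8 "words" (items).
# Real input would be the (userID, itemID, prefCount) data from the thread.
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(6, 8))

lda = LatentDirichletAllocation(n_components=2, random_state=0)
theta = lda.fit_transform(X)  # one row per user: topic mixture, sums to 1

def cosine(u, v):
    # User-user similarity as cosine between topic mixtures.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(theta[0], theta[1]))
```

The topic mixtures give a dense, low-dimensional user representation, which is one reason LDA can behave better than raw co-occurrence on small, sparse data like the datasets below.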



On Tue, Sep 30, 2014 at 7:04 PM, Parimi Rohit rohit.par...@gmail.com
wrote:

 Ted,

 I know LDA can be used to model text data but have never used it in this
 setting. Can you please give me some pointers on how to apply it here?

 Thanks,
 Rohit

 On Tue, Sep 30, 2014 at 4:33 PM, Ted Dunning ted.dunn...@gmail.com
 wrote:

  This is an incredibly tiny dataset.  If you delete singletons, it is
 likely
  to get significantly smaller.
 
  I think that something like LDA might work much better for you. It was
  designed to work on small data like this.
 
 
  On Tue, Sep 30, 2014 at 11:13 AM, Parimi Rohit rohit.par...@gmail.com
  wrote:
 
   Ted, Thanks for your response. Following is the information about the
   approach and the datasets:
  
   I am using the ItemSimilarityJob, passing it (itemID, userID, prefCount)
   tuples as input, to compute user-user similarity using LLR. I read this
   approach in a response to one of the stackoverflow questions on
   calculating user similarity with Mahout.
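[Editor's note: a sketch of the swap trick described above, assuming Mahout's Hadoop CLI; the file and directory names are illustrative. Because the first two input columns are swapped, the job's "item-item" output comes out as user-user similarities.]

```shell
# Input lines are itemID,userID[,prefCount] -- the usual userID,itemID
# columns swapped, so items play the role of users and vice versa.
mahout itemsimilarity \
  --input swapped-prefs.csv \
  --output user-user-similarity \
  --similarityClassname SIMILARITY_LOGLIKELIHOOD \
  --booleanData true
```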
  
  
   Following are the stats for the datasets:
  
   Coauthor dataset:

   users = 29189
   items = 140091
   averageItemsClicked = 15.808660796875536

   Conference dataset:

   users = 29189
   items = 2393
   averageItemsClicked = 7.265099866388023

   Reference dataset:

   users = 29189
   items = 201570
   averageItemsClicked = 61.08564870327863
  
   By scale, did you mean rating scale? If so, I am using preference
   counts, not ratings.
  
   Thanks,
   Rohit
  
  
   On Tue, Sep 30, 2014 at 12:08 AM, Ted Dunning ted.dunn...@gmail.com
   wrote:
  
    How are you using LLR to compute user similarity? It is normally used
    to compute item similarity.

    Also, what is your scale? How many users? How many items? How many
    actions per user?
   
   
   
On Mon, Sep 29, 2014 at 6:24 PM, Parimi Rohit 
 rohit.par...@gmail.com
wrote:
   
 Hi,

 I am exploring a random-walk based algorithm for recommender systems
 which works by propagating the item preferences for users on the
 user-user graph. To do this, I have to compute user-user similarity and
 form a neighborhood. I have tried the following three simple techniques
 to compute the score between two users and find the neighborhood.

 1. Score = (common items between users A and B) / (items preferred by A
    + items preferred by B)
 2. Scoring based on Mahout's Cosine similarity
 3. Scoring based on Mahout's LogLikelihood similarity
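[Editor's note: for concreteness, techniques 1 and 2 on boolean preference sets can be sketched as follows; this is illustrative code, not Mahout's implementation, and the item names are made up.]

```python
import math

def naive_score(a, b):
    # Technique 1: shared items over the sum of both users' item counts
    # (similar to Jaccard, except the overlap is counted in both terms
    # of the denominator).
    return len(a & b) / (len(a) + len(b))

def cosine_boolean(a, b):
    # Technique 2 with preferences treated as 0/1: the dot product is
    # the overlap size; each norm is the sqrt of that user's item count.
    return len(a & b) / math.sqrt(len(a) * len(b))

a = {"paper1", "paper2", "paper3"}
b = {"paper2", "paper3", "paper4", "paper5"}
print(naive_score(a, b))     # 2 / 7
print(cosine_boolean(a, b))  # 2 / sqrt(12)
```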

 My understanding is that similarity based on LogLikelihood is more
 robust; however, I get better results using the naive approach
 (technique 1 from the above list). The problems I am addressing are
 collaborator recommendation, conference recommendation, and reference
 recommendation, and the data has implicit feedback.

 So, my question is: are there any cases where the cosine similarity and
 loglikelihood metrics fail to capture similarity? For example, in the
 problems stated above, users only collaborate with a few other users
 (based on area of interest), publish in only a few conferences (again
 based on area of interest), and refer to publications in a specific
 domain, so the preference counts are fairly small compared to other
 domains (music/video etc.).

 Secondly, for CosineSimilarity, should I treat the preferences as
 boolean or use the counts? (I think the loglikelihood metric does not
 take the preference counts into account; correct me if I am wrong.)
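[Editor's note: on the last point, the log-likelihood ratio is computed from a 2x2 table of presence/absence counts, so it does ignore how many times each preference occurred. A sketch of the usual formulation, following Dunning's G^2 written via entropies; illustrative, not Mahout's exact code.]

```python
import math

def entropy(*counts):
    # Shannon entropy (in nats) of the distribution given by raw counts.
    total = sum(counts)
    return -sum(c / total * math.log(c / total) for c in counts if c > 0)

def llr(k11, k12, k21, k22):
    # k11 = items both users touched, k12/k21 = items only one touched,
    # k22 = items neither touched. G^2 = 2 * N * mutual information.
    n = k11 + k12 + k21 + k22
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return 2 * n * (row + col - mat)

# Two users sharing 20 of ~30 items each, out of 10000 items, score
# much higher than a pair sharing only 1 item.
print(llr(20, 10, 10, 9960))
print(llr(1, 29, 29, 9941))
```

Note that the inputs are item counts, not preference counts: a user who clicked an item 50 times contributes exactly as much to the table as one who clicked it once.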

 Any insight into this is much appreciated.

 Thanks,
 Rohit

 p.s. Ted, Pat: I am following the discussion on the thread
 "LogLikelihoodSimilarity Calculation" and your answers helped me a lot
 to understand how it works and made me wonder why things are different
 in my case.

   
  
 



Re: Cosine Similarity and LogLikelihood not helpful for implicit feedback!

2014-10-01 Thread Parimi Rohit
Thanks Ted! Will look into it.

Rohit


Re: Cosine Similarity and LogLikelihood not helpful for implicit feedback!

2014-09-30 Thread Parimi Rohit
Ted,

I know LDA can be used to model text data but have never used it in this
setting. Can you please give me some pointers on how to apply it here?

Thanks,
Rohit



Re: Cosine Similarity and LogLikelihood not helpful for implicit feedback!

2014-09-29 Thread Ted Dunning
How are you using LLR to compute user similarity? It is normally used to
compute item similarity.

Also, what is your scale? How many users? How many items? How many
actions per user?


