Re: Cosine Similarity and LogLikelihood not helpful for implicit feedback!
Here is a paper that includes an analysis of voting patterns using LDA: http://arxiv.org/pdf/math/0604410.pdf

On Tue, Sep 30, 2014 at 7:04 PM, Parimi Rohit rohit.par...@gmail.com wrote:

Ted, I know LDA can be used to model text data but have never used it in this setting. Can you please give me some pointers on how I can apply it here? Thanks, Rohit

On Tue, Sep 30, 2014 at 4:33 PM, Ted Dunning ted.dunn...@gmail.com wrote:

This is an incredibly tiny dataset. If you delete singletons, it is likely to get significantly smaller. I think that something like LDA might work much better for you. It was designed to work on small data like this.

On Tue, Sep 30, 2014 at 11:13 AM, Parimi Rohit rohit.par...@gmail.com wrote:

Ted, thanks for your response. Here is the information about the approach and the datasets: I am using the ItemSimilarityJob and passing it (itemID, userID, prefCount) tuples as input to compute user-user similarity using LLR. I read about this approach in a response to one of the Stack Overflow questions on calculating user similarity with Mahout.

The stats for the datasets:

Coauthor dataset: users = 29189, items = 140091, averageItemsClicked = 15.808660796875536
Conference dataset: users = 29189, items = 2393, averageItemsClicked = 7.265099866388023
Reference dataset: users = 29189, items = 201570, averageItemsClicked = 61.08564870327863

By scale, did you mean rating scale? If so, I am using preference counts, not ratings. Thanks, Rohit

On Tue, Sep 30, 2014 at 12:08 AM, Ted Dunning ted.dunn...@gmail.com wrote:

How are you using LLR to compute user similarity? It is normally used to compute item similarity. Also, what is your scale? How many users? How many items? How many actions per user?

On Mon, Sep 29, 2014 at 6:24 PM, Parimi Rohit rohit.par...@gmail.com wrote:

Hi, I am exploring a random-walk based algorithm for recommender systems that works by propagating users' item preferences over the user-user graph. To do this, I have to compute user-user similarity and form a neighborhood. I have tried the following three simple techniques to compute the score between two users and find the neighborhood:

1. Score = (items common to users A and B) / (items preferred by A + items preferred by B)
2. Scoring based on Mahout's cosine similarity
3. Scoring based on Mahout's log-likelihood similarity

My understanding is that similarity based on log-likelihood is more robust; however, I get better results using the naive approach (technique 1 above). The problems I am addressing are collaborator recommendation, conference recommendation, and reference recommendation, and the data has implicit feedback.

So, my question is: are there any cases where the cosine similarity and log-likelihood metrics fail to capture similarity? For example, in the problems stated above, users collaborate with only a few other users (based on area of interest), publish in only a few conferences (again based on area of interest), and refer to publications in a specific domain. So the preference counts are fairly small compared to other domains (music/video etc.).

Secondly, for CosineSimilarity, should I treat the preferences as boolean or use the counts? (I think the log-likelihood metric does not take the preference counts into account; correct me if I am wrong.)

Any insight into this is much appreciated. Thanks, Rohit

p.s. Ted, Pat: I am following the discussion on the thread "LogLikelihoodSimilarity Calculation"; your answers helped me a lot to understand how it works and made me wonder why things are different in my case.
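For reference, the three scoring techniques in the thread can be sketched in plain Python. This is a minimal stand-in, not Mahout's actual classes: the LLR function follows the entropy formulation used in Mahout's LogLikelihood (over the 2x2 table of common items, A-only, B-only, neither), and the toy item sets are made up.

```python
from math import log

def xlogx(x):
    # x * log(x), with the convention 0 * log(0) = 0
    return 0.0 if x == 0 else x * log(x)

def entropy(*counts):
    # unnormalized entropy as in Mahout's LogLikelihood helper
    return xlogx(sum(counts)) - sum(xlogx(c) for c in counts)

def llr(k11, k12, k21, k22):
    # Dunning's log-likelihood ratio from the 2x2 co-occurrence table:
    # k11 = both, k12 = A only, k21 = B only, k22 = neither
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return max(0.0, 2.0 * (row + col - mat))

def scores(items_a, items_b, n_items):
    a, b = set(items_a), set(items_b)
    common = len(a & b)
    # 1. naive overlap score from the original post
    naive = common / (len(a) + len(b))
    # 2. cosine similarity over boolean preference vectors
    cosine = common / (len(a) ** 0.5 * len(b) ** 0.5)
    # 3. LLR needs the total item count for the "neither" cell
    g2 = llr(common, len(a) - common, len(b) - common,
             n_items - len(a) - len(b) + common)
    return naive, cosine, g2

# made-up toy users over a hypothetical catalog of 1000 items
naive, cosine, g2 = scores({1, 2, 3, 4}, {3, 4, 5}, n_items=1000)
```

Note that, unlike the naive score and cosine, LLR depends on the total number of items: a small overlap between two sparse users is far more "surprising" in a 200k-item catalog than in a 2k-item one, which is one reason the metrics can rank neighbors differently across the three datasets.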
Re: Cosine Similarity and LogLikelihood not helpful for implicit feedback!
Thanks Ted! Will look into it.

Rohit

On Wed, Oct 1, 2014 at 1:04 AM, Ted Dunning ted.dunn...@gmail.com wrote:

Here is a paper that includes an analysis of voting patterns using LDA: http://arxiv.org/pdf/math/0604410.pdf
Re: Cosine Similarity and LogLikelihood not helpful for implicit feedback!
Ted, I know LDA can be used to model text data but have never used it in this setting. Can you please give me some pointers on how I can apply it here?

Thanks, Rohit

On Tue, Sep 30, 2014 at 4:33 PM, Ted Dunning ted.dunn...@gmail.com wrote:

This is an incredibly tiny dataset. If you delete singletons, it is likely to get significantly smaller. I think that something like LDA might work much better for you. It was designed to work on small data like this.
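One way to act on the LDA suggestion, sketched here with scikit-learn's LatentDirichletAllocation (the thread does not name a library, so this choice is an assumption): treat each user as a "document" whose "words" are item IDs weighted by preference counts, fit LDA, and then compare users by their topic distributions instead of by raw item overlap. The toy count matrix and the choice of two topics are made up.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# rows = users, columns = items, entries = preference counts (made up):
# users 0-1 prefer the first items, users 2-3 the last ones
counts = np.array([
    [3, 1, 0, 0, 0],
    [2, 2, 1, 0, 0],
    [0, 0, 0, 4, 2],
    [0, 0, 1, 3, 3],
])

lda = LatentDirichletAllocation(n_components=2, max_iter=50, random_state=0)
theta = lda.fit_transform(counts)            # user-by-topic matrix
theta = theta / theta.sum(axis=1, keepdims=True)  # ensure rows sum to 1

def topic_sim(u, v):
    # user-user similarity = cosine between topic distributions
    return theta[u] @ theta[v] / (np.linalg.norm(theta[u]) * np.linalg.norm(theta[v]))
```

Smoothing sparse per-user counts into a low-dimensional topic mixture is exactly what makes this attractive for a dataset with only a handful of actions per user: two users can look similar because their items fall in the same topic, even with zero directly shared items.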
Re: Cosine Similarity and LogLikelihood not helpful for implicit feedback!
How are you using LLR to compute user similarity? It is normally used to compute item similarity. Also, what is your scale? How many users? How many items? How many actions per user?
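Rohit's boolean-vs-counts question for cosine similarity can be illustrated with two made-up preference vectors: binarizing discards the count skew, which can change how neighbors rank. (Mahout's log-likelihood similarity, for its part, only looks at which items co-occur, not at the preference values, so Rohit's reading is correct.)

```python
import math

def cosine(u, v):
    # plain cosine similarity between two equal-length vectors
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

a        = [9, 1, 1, 0]   # one dominant preference count (made up)
b_counts = [1, 1, 1, 0]   # same items preferred, flat counts

a_bool = [1 if x else 0 for x in a]
b_bool = [1 if x else 0 for x in b_counts]

count_sim = cosine(a, b_counts)     # < 1: pulled down by the skewed count
bool_sim  = cosine(a_bool, b_bool)  # 1.0: identical item sets
```

With implicit feedback, high counts often reflect exposure rather than strength of preference, so trying both variants (or a damped transform such as log(1 + count)) and evaluating against held-out data is a reasonable way to decide.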