Re: Cosine Similarity and LogLikelihood not helpful for implicit feedback!

2014-10-01 Thread Ted Dunning
Here is a paper that includes an analysis of voting patterns using LDA.

http://arxiv.org/pdf/math/0604410.pdf
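One common way to apply LDA to interaction data like this (an illustration of the general recipe, not necessarily the paper's setup): treat each user as a "document" whose "words" are the IDs of the items the user acted on; LDA's topics then become groups of items that co-occur across users. A minimal sketch that turns (userID, itemID, count) triples into such pseudo-documents (file paths are hypothetical), which any LDA implementation, e.g. Mahout's cvb job after seq2sparse, can consume:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.LinkedHashMap;
import java.util.Map;

public class PrefsToDocs {
  public static void main(String[] args) throws IOException {
    // One pseudo-document per user; "words" are item IDs repeated count times.
    Map<String, StringBuilder> docs = new LinkedHashMap<String, StringBuilder>();
    BufferedReader in = new BufferedReader(new FileReader("prefs.csv"));
    String line;
    while ((line = in.readLine()) != null) {
      String[] f = line.split(",");                 // userID,itemID[,count]
      int count = f.length > 2 ? Integer.parseInt(f[2].trim()) : 1;
      StringBuilder doc = docs.get(f[0]);
      if (doc == null) {
        doc = new StringBuilder();
        docs.put(f[0], doc);
      }
      for (int i = 0; i < count; i++) {
        doc.append("item_").append(f[1].trim()).append(' ');
      }
    }
    in.close();
    PrintWriter out = new PrintWriter(new FileWriter("user_docs.txt"));
    for (Map.Entry<String, StringBuilder> e : docs.entrySet()) {
      out.println(e.getKey() + "\t" + e.getValue().toString().trim());
    }
    out.close();
  }
}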



On Tue, Sep 30, 2014 at 7:04 PM, Parimi Rohit rohit.par...@gmail.com
wrote:

 Ted,

 I know LDA can be used to model text data but have never used it in this
 setting. Can you please give me some pointers on how I can apply it in
 this setting?

 Thanks,
 Rohit

 On Tue, Sep 30, 2014 at 4:33 PM, Ted Dunning ted.dunn...@gmail.com
 wrote:

  This is an incredibly tiny dataset.  If you delete singletons, it is
  likely to get significantly smaller.

  I think that something like LDA might work much better for you. It was
  designed to work on small data like this.
 
 
  On Tue, Sep 30, 2014 at 11:13 AM, Parimi Rohit rohit.par...@gmail.com
  wrote:
 
   Ted, thanks for your response. Following is the information about the
   approach and the datasets:

   I am using the ItemSimilarityJob and passing it (itemID, userID,
   prefCount) tuples as input to compute user-user similarity using LLR. I
   took this approach from an answer to one of the Stack Overflow questions
   on calculating user similarity with Mahout.
  
  
   Following are the stats for the datasets:
  
   Coauthor dataset:
   users = 29189
   items = 140091
   averageItemsClicked = 15.808660796875536

   Conference dataset:
   users = 29189
   items = 2393
   averageItemsClicked = 7.265099866388023

   Reference dataset:
   users = 29189
   items = 201570
   averageItemsClicked = 61.08564870327863
  
   By scale, did you mean rating scale? If so, I am using preference
   counts, not ratings.
  
   Thanks,
   Rohit
  
  
   On Tue, Sep 30, 2014 at 12:08 AM, Ted Dunning ted.dunn...@gmail.com
   wrote:
  
    How are you using LLR to compute user similarity?  It is normally used
    to compute item similarity.

    Also, what is your scale?  How many users? How many items? How many
    actions per user?
   
   
   
On Mon, Sep 29, 2014 at 6:24 PM, Parimi Rohit 
 rohit.par...@gmail.com
wrote:
   
 Hi,

 I am exploring a random-walk-based algorithm for recommender systems
 which works by propagating the item preferences of users along the
 user-user graph. To do this, I have to compute user-user similarity and
 form a neighborhood. I have tried the following three simple techniques
 to compute the score between two users and find the neighborhood (a
 sketch of them appears below):

 1. Score = (items common to users A and B) / (items preferred by A +
 items preferred by B)
 2. Scoring based on Mahout's Cosine similarity
 3. Scoring based on Mahout's LogLikelihood similarity.
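For concreteness, a minimal sketch of techniques 1 and 3 using Mahout's Taste API (the input file name, the user ID, and the neighborhood size of 20 are illustrative assumptions):

import java.io.File;
import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.impl.common.FastIDSet;
import org.apache.mahout.cf.taste.impl.common.LongPrimitiveIterator;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class UserNeighborhoodSketch {

  // Technique 1: items common to A and B, divided by (items of A + items of B).
  static double naiveScore(DataModel model, long a, long b) throws TasteException {
    FastIDSet itemsA = model.getItemIDsFromUser(a);
    FastIDSet itemsB = model.getItemIDsFromUser(b);
    int common = 0;
    LongPrimitiveIterator it = itemsA.iterator();
    while (it.hasNext()) {
      if (itemsB.contains(it.nextLong())) {
        common++;
      }
    }
    return (double) common / (itemsA.size() + itemsB.size());
  }

  public static void main(String[] args) throws Exception {
    // userID,itemID[,pref] per line; LLR ignores the counts and looks only
    // at which items co-occur.
    DataModel model = new FileDataModel(new File("prefs.csv"));
    UserSimilarity llr = new LogLikelihoodSimilarity(model);
    // Technique 3: the 20 most similar users form the neighborhood.
    UserNeighborhood hood = new NearestNUserNeighborhood(20, llr, model);
    long someUser = 1L;
    for (long n : hood.getUserNeighborhood(someUser)) {
      System.out.printf("user %d: llr=%.4f naive=%.4f%n",
          n, llr.userSimilarity(someUser, n), naiveScore(model, someUser, n));
    }
  }
}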

 My understanding is that similarity based on LogLikelihood is more
 robust; however, I get better results using the naive approach
 (technique 1 from the above list). The problems I am addressing are
 collaborator recommendation, conference recommendation and reference
 recommendation, and the data has implicit feedback.

 So, my question is: are there any cases where the cosine similarity and
 log-likelihood metrics fail to capture similarity? For example, in the
 problems stated above, users collaborate with only a few other users
 (based on area of interest), publish in only a few conferences (again
 based on area of interest), and refer to publications in a specific
 domain, so the preference counts are fairly small compared to other
 domains (music/video etc.).

 Secondly, for CosineSimilarity, should I treat the preferences as
 boolean or use the counts? (I think the log-likelihood metric does not
 take the preference counts into account; correct me if I am wrong.)

 Any insight into this is much appreciated.

 Thanks,
 Rohit

 p.s. Ted, Pat: I am following the discussion on the thread
 "LogLikelihoodSimilarity Calculation" and your answers helped me a lot to
 understand how it works and made me wonder why things are different in my
 case.

   
  
 



Re: word weights using BM25

2014-10-01 Thread Arian Pasquali
Hey guys,
I think it is fair to give you some feedback.
I managed to implement BM25+ (http://en.wikipedia.org/wiki/Okapi_BM25)
term scoring in Mahout.
It was straightforward using the current TFIDF implementation as an
example.

Basically, what I did was implement the interface
org.apache.mahout.vectorizer.Weight and create a BM25Converter and a
BM25PartialVectorReducer, similar to TFIDFConverter
(https://builds.apache.org/job/Mahout-Quality/javadoc/org/apache/mahout/vectorizer/tfidf/TFIDFConverter.html)
and TFIDFPartialVectorReducer
(https://builds.apache.org/job/Mahout-Quality/javadoc/org/apache/mahout/vectorizer/tfidf/TFIDFPartialVectorReducer.html)
respectively.
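For readers who want to try the same thing, a minimal sketch of what a BM25+ Weight could look like. This is an illustration, not Arian's actual code, and it assumes Weight's calculate(tf, df, length, numDocs) signature as implemented by Mahout's TFIDF class. Since that signature does not provide the average document length, it has to be supplied up front, e.g. from a prior counting pass:

import org.apache.mahout.vectorizer.Weight;

public class BM25PlusWeight implements Weight {

  private static final double K1 = 1.2;     // term-frequency saturation
  private static final double B = 0.75;     // document-length normalization
  private static final double DELTA = 1.0;  // the "+" in BM25+, a lower bound

  private final double avgDocLength;        // must be computed beforehand

  public BM25PlusWeight(double avgDocLength) {
    this.avgDocLength = avgDocLength;
  }

  @Override
  public double calculate(int tf, int df, int length, int numDocs) {
    // Okapi BM25 idf, smoothed so it stays positive for very common terms.
    double idf = Math.log(1.0 + (numDocs - df + 0.5) / (df + 0.5));
    double norm = 1.0 - B + B * (length / avgDocLength);
    return idf * ((tf * (K1 + 1.0)) / (tf + K1 * norm) + DELTA);
  }
}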

cheers
Arian

Arian Pasquali
http://about.me/arianpasquali

2014-09-24 14:14 GMT+01:00 Arian Pasquali ar...@arianpasquali.com:

 Yes,
 I'm studying his work (http://nlp.uned.es/~jperezi/Lucene-BM25/) and the
 current Mahout TFIDF code.
 Trying to understand how I would port that to MapReduce.
 I'll try to share something if I succeed.

 Arian Pasquali
 http://about.me/arianpasquali

 2014-09-24 5:12 GMT+01:00 Suneel Marthi suneel.mar...@gmail.com:

 Lucene 4.x supports Okapi BM25, so it should be easy to implement.

 On Tue, Sep 23, 2014 at 11:57 PM, Ted Dunning ted.dunn...@gmail.com
 wrote:

  Should be pretty easy. I haven't heard of anyone doing it.
 
  Sent from my iPhone
 
   On Sep 23, 2014, at 18:53, Arian Pasquali ar...@arianpasquali.com
  wrote:
  
   Hi,
    I was wondering if it would be possible to support BM25 term weighting
    by extending Mahout's tf-idf implementation.

    I was curious to know if anyone here has already tried to do so.
    If not, what would be your suggestion for such an implementation in
    Mahout?
  
  
   Arian Pasquali
   http://about.me/arianpasquali
 





Re: word weights using BM25

2014-10-01 Thread Suneel Marthi
How did you implement BM25PartialVectorReducer and BM25Converter? The
present implementations of TFIDFConverter and its reducer are MapReduce,
and Mahout is not accepting any new MapReduce code.




Re: word weights using BM25

2014-10-01 Thread Ted Dunning
Thanks so much for the feedback.  Glad to hear it was straightforward.

But the important question is: how did BM25 work for you?






Re: word weights using BM25

2014-10-01 Thread Arian Pasquali
Hi Ted,

My dataset is a collection of documents in German, and I can say that the
scores seem better compared to my TFIDF scores. Results make more sense
now, especially my bi-grams.




Arian Pasquali
http://about.me/arianpasquali

 



Re: Cosine Similarity and LogLikelihood not helpful for implicit feedback!

2014-10-01 Thread Parimi Rohit
Thanks Ted! Will look into it.

Rohit




Re: word weights using BM25

2014-10-01 Thread Arian Pasquali
Yes Suneel,
indeed, it is in MR fashion.

What exactly do you mean when you say Mahout is not accepting any new
MapReduce code?
Do you mean for submitting a patch?
I'm sure there might be better ways to implement it, but I'm more
interested in the results right now.

What would be your suggestion?

best





Arian Pasquali
http://about.me/arianpasquali

 



Re: word weights using BM25

2014-10-01 Thread Ted Dunning
On Wed, Oct 1, 2014 at 7:52 AM, Arian Pasquali ar...@arianpasquali.com
wrote:

 My dataset is a collection of documents in German, and I can say that the
 scores seem better compared to my TFIDF scores. Results make more sense
 now, especially my bi-grams.


OK.

I will take note.


Re: how to get recommendations by using user-user correlation for the given table in this mail

2014-10-01 Thread Pat Ferrel
First, I agree with Ted that LLR is better. I've tried all of the similarity
methods in Mahout on exactly the same dataset and got far higher
cross-validation scores for LLR. You may still use Pearson with Mahout 0.9 and
1.0, but it is not supported in the Mahout 1.0 Spark jobs.

If you have data in tables, you need to create single interactions. These will
look like:

user1,vendor1,rating
userN,vendorM,rating
...

If you are recommending vendors (not specific services of specific vendors) you
need to map your IDs into IDs that the recommender can ingest. You can’t tell
which of the separate ratings will be used if the same user rated multiple
services of the same vendor, so you should decide which rating you want to use
as input.

You need to translate your IDs into Mahout IDs. Let’s say you go through all of
your users, assign the first one a Mahout ID of integer = 0, then the next
unique user you see will get Mahout ID = 1, and so on. You need to do this for
your items (vendors) as well. So your input to Mahout will look something like
this:

Formatted as (Mahout User ID, Mahout Item ID, rating), your input files will
contain:

0,0,1
0,2000,3
0,4,5
1,3,1
1000,2000,5
…

Then after you run the Mahout item-based recommender you will get back a list
of recommendations for each user. The key will be an integer equal to the
Mahout user ID. The value will be a list of Mahout item IDs with strengths. You
will need to map the Mahout IDs back into your application IDs. Since you are
recommending vendors, the vendors are the items, so map all Mahout item IDs
into your vendor IDs and the Mahout user IDs into your user IDs.
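A minimal sketch of one way to implement that translation (illustrative names, not a Mahout API): hand out the next free integer the first time an external ID is seen, and keep the reverse list for mapping recommendations back. Use one dictionary for users and a separate one for items (vendors) so both ID spaces start at 0:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class IdDictionary {
  private final Map<String, Integer> toMahout = new HashMap<String, Integer>();
  private final List<String> toApplication = new ArrayList<String>();

  // First sight of an external ID assigns the next contiguous integer: 0, 1, 2, ...
  public int toMahoutId(String externalId) {
    Integer id = toMahout.get(externalId);
    if (id == null) {
      id = toApplication.size();
      toMahout.put(externalId, id);
      toApplication.add(externalId);
    }
    return id;
  }

  // Map a Mahout integer ID from the recommendations back to the application ID.
  public String toApplicationId(int mahoutId) {
    return toApplication.get(mahoutId);
  }
}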

On Sep 30, 2014, at 6:55 PM, vinayakb malagatti vinayakbmalaga...@gmail.com 
wrote:

Thank you @Ted, but my guide is suggesting we go with what Pat is
suggesting. @Pat, could you please tell me, if I want to recommend vendors to
the user from the table, how should they be grouped? You also mentioned *your
recs will be returned using the same integer IDs so you will have to
translate your “user1” and “vendor1-service1” into non-negative contiguous
integers*; I don't know about this translation, could you please tell me more
about it?

Thanks and Regards,
Vinayak B


On Tue, Sep 30, 2014 at 10:36 AM, Ted Dunning ted.dunn...@gmail.com wrote:

 Yes.  But I strongly suggest that you not use Pearson Correlation.
 
 Use the LLR similarity to compute indicator actions for each vendor.  Then
 use a user's history of actions to score vendors.  This is not only much
 simpler than what you are asking for, it will also be more accurate.
 
 You should also measure additional actions besides ratings.
 
 
 
 On Mon, Sep 29, 2014 at 6:56 PM, vinayakb malagatti 
 vinayakbmalaga...@gmail.com wrote:
 
 @Pat and @Ted, thank you so much for the reply. I was looking for the
 solution Pat suggested: here I want to suggest to the user vendors he has
 not yet used, by taking the history of that user and comparing it with
 other users who have rated the common vendors. If we take the table:

   - User 1 has rated Vendor 1, Vendor 3 and Vendor 4, and User 2 has
   rated Vendor 1, Vendor 2 and Vendor 3.
   - Common between User 2 and User 1 are Vendor 1 and Vendor 3.
   - Assume that the Pearson correlation between them is nearly 1; hence
   we can recommend Vendor 2 to User 1, which User 1 has not used.

 Can we do it like this using Apache Mahout? If yes, could you please give
 some brief idea?
 
 Thanks and Regards,
 Vinayak B
 
 
 On Tue, Sep 30, 2014 at 2:10 AM, Ted Dunning ted.dunn...@gmail.com
 wrote:
 
 I would recommend that you look at actions other than ratings as well.
 
 Did a user expand and read 1 review?  Did they read 3 reviews?
 
 Did they mark a rating as useful?
 
 Did they ask for contact information?
 
 You know your system better than I possibly could, but using other
 information in addition to ratings is very important for getting the
 highest quality predictive information.
 
 You can start with ratings, but you should push to get other kinds of
 information as much as possible.  Ratings are often given by only a very
 small number of people.  That severely limits how much value you can add
 with a recommendation engine.  At the same time, most people are busy not
 giving you ratings; they are doing lots of other things that tell you what
 they are thinking and reacting to.  If you don't pay attention to that
 additional information, you are handicapping yourself severely.
 
 
 On Mon, Sep 29, 2014 at 9:53 AM, vinayakb malagatti 
 vinayakbmalaga...@gmail.com wrote:
 
 Hi all,

 I have a table in the DB that looks something like this:

 rating table:
 https://docs.google.com/spreadsheets/d/1PrShX7X70PqnfIQg0Dfv6mIHtX1k7KSZHTBfTPMv_Do/edit?usp=drive_web
 
 Thanks and Regards,
 Vinayak B
 
 
 
 



Re: how to get recommendations by using user-user correlation for the given table in this mail

2014-10-01 Thread vinayakb malagatti
Hi Pat,

If I am wrong please correct me: if we take table 2 (user 2), then he rated
vendor 1 - vendor 3.

   1. I am going to assign each user an ID starting from 1 - N.
   2. Vendors will have IDs 601, 602, 603, ...
   3. Services will have IDs 501, 502, 503, ...
   4. If I translate the vendor and service IDs, the combined IDs look like
   601501, 601502, 601503, ... (see the sketch below).
   5. The input to Mahout will be (USER ID, COMBINED ID, RATING).
   6. The output from Mahout will be COMBINED IDs for the user, and again I
   have to separate each COMBINED ID into a Vendor ID and a Service ID.

Is this the correct flow?
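If the combined-ID route is taken, a minimal sketch of the packing and unpacking in steps 4 and 6 (a hypothetical scheme that assumes service IDs always stay below 1000, otherwise IDs collide; note that if only vendors are recommended, as Pat suggests, plain vendor IDs avoid the split entirely):

public class CombinedIds {
  // Step 4: pack vendor and service into one Mahout item ID.
  static long combine(int vendorId, int serviceId) {
    return vendorId * 1000L + serviceId;                                 // (601, 501) -> 601501
  }

  // Step 6: split a recommended combined ID back into its parts.
  static int vendorOf(long combined)  { return (int) (combined / 1000); }  // 601501 -> 601
  static int serviceOf(long combined) { return (int) (combined % 1000); }  // 601501 -> 501
}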


Thanks and Regards,
Vinayak B

