Re: evaluating recommender with boolean prefs

2013-06-07 Thread Sean Owen
In point 1, I don't think I'd say it that way. It's not true that
test/training data is divided by user, because then every user would
be either 100% in the training data or 100% in the test data. Instead,
you hold out part of the data for each user, or at least for some
subset of users. Then you can see whether recs for those users match
the held-out data.

Yes, then you see how well the held-out data matches the predictions
by computing ratios that give you precision/recall.

The key question is really how you choose the test data. It's implicit
data, so one data point is as good as the next. In the framework I
think it just randomly picks a subset of the data. You could also
split by time; that's a defensible way to do it: training data up to
time t and test data after time t.
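
For reference, with boolean prefs a run of the framework's IR evaluator
looks roughly like this (just a sketch along the lines of the book's
example -- the file name, the neighborhood size of 50, precision at 10
and the 10% user sample are all placeholders to adjust for your data):

import java.io.File;

import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.eval.DataModelBuilder;
import org.apache.mahout.cf.taste.eval.IRStatistics;
import org.apache.mahout.cf.taste.eval.RecommenderBuilder;
import org.apache.mahout.cf.taste.eval.RecommenderIRStatsEvaluator;
import org.apache.mahout.cf.taste.impl.common.FastByIDMap;
import org.apache.mahout.cf.taste.impl.eval.GenericRecommenderIRStatsEvaluator;
import org.apache.mahout.cf.taste.impl.model.GenericBooleanPrefDataModel;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericBooleanPrefUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.TanimotoCoefficientSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.model.PreferenceArray;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class BooleanPrefEvalSketch {
  public static void main(String[] args) throws Exception {
    // user,item pairs with no preference values (hypothetical file)
    DataModel model = new FileDataModel(new File("prefs.csv"));

    RecommenderBuilder recommenderBuilder = new RecommenderBuilder() {
      @Override
      public Recommender buildRecommender(DataModel trainingModel) throws TasteException {
        UserSimilarity similarity = new TanimotoCoefficientSimilarity(trainingModel);
        UserNeighborhood neighborhood =
            new NearestNUserNeighborhood(50, similarity, trainingModel);
        return new GenericBooleanPrefUserBasedRecommender(trainingModel, neighborhood, similarity);
      }
    };

    // Keep the per-run training model boolean too, not a generic rating model
    DataModelBuilder dataModelBuilder = new DataModelBuilder() {
      @Override
      public DataModel buildDataModel(FastByIDMap<PreferenceArray> trainingData) {
        return new GenericBooleanPrefDataModel(
            GenericBooleanPrefDataModel.toDataMap(trainingData));
      }
    };

    RecommenderIRStatsEvaluator evaluator = new GenericRecommenderIRStatsEvaluator();
    // For a random sample of 10% of users: hold out some of their prefs as the
    // "relevant" items, recommend from the rest, and score the top 10.
    IRStatistics stats = evaluator.evaluate(
        recommenderBuilder, dataModelBuilder, model, null, 10,
        GenericRecommenderIRStatsEvaluator.CHOOSE_THRESHOLD, 0.1);

    System.out.println("precision@10 = " + stats.getPrecision());
    System.out.println("recall@10    = " + stats.getRecall());
  }
}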

On Fri, Jun 7, 2013 at 7:51 PM, Michael Sokolov
msoko...@safaribooksonline.com wrote:
 I'm trying to evaluate a few different recommenders based on boolean
 preferences.  The Mahout in Action book suggests using a precision/recall metric,
 but I'm not sure I understand what that does, and in particular how it is
 dividing my data into test/train sets.

 What I think I'd like to do is:

 1. Divide the test data by user: identify a set of training data with data
 from 80% of the users, and test using the remaining 20% (say).

 2. Build a similarity model from the training data

 3. For the test users, divide their data in half; a training set and an
 evaluation set.  Then for each test user, use their training data as input
 to the recommender, and see if it recommends the data in the evaluation set
 or not.

 Is this what the precision/recall test is actually doing?

 --
 Michael Sokolov
 Senior Architect
 Safari Books Online



Re: evaluating recommender with boolean prefs

2013-06-07 Thread Koobas
Since I am primarily an HPC person, this is probably a naive question
from the ML perspective. What if, when computing recommendations, we
don't exclude what the user already has, and then see whether the items
he already has end up being recommended back to him (computing some
appropriate metric/ratio)? Wouldn't that be the ultimate evaluator?
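
Concretely, I'm imagining something like this rough sketch against the
Taste interfaces: score every item for a user via estimatePreference,
including the items he already has, and report what fraction of his
existing items land in the top N (the recommender and N are whatever
you plug in):

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.impl.common.FastIDSet;
import org.apache.mahout.cf.taste.impl.common.LongPrimitiveIterator;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.Recommender;

public class SelfRecallCheck {

  private static final class ScoredItem {
    final long itemID;
    final float score;
    ScoredItem(long itemID, float score) { this.itemID = itemID; this.score = score; }
  }

  // What fraction of the user's existing items show up in the top n when we
  // score *every* item, i.e. without excluding what the user already has?
  static double selfRecallAtN(Recommender recommender, long userID, int n)
      throws TasteException {
    DataModel model = recommender.getDataModel();
    FastIDSet known = model.getItemIDsFromUser(userID);

    List<ScoredItem> scored = new ArrayList<ScoredItem>();
    LongPrimitiveIterator it = model.getItemIDs();
    while (it.hasNext()) {
      long itemID = it.nextLong();
      float estimate = recommender.estimatePreference(userID, itemID);
      if (!Float.isNaN(estimate)) {
        scored.add(new ScoredItem(itemID, estimate));
      }
    }
    Collections.sort(scored, new Comparator<ScoredItem>() {
      @Override
      public int compare(ScoredItem a, ScoredItem b) {
        return Float.compare(b.score, a.score);   // highest score first
      }
    });

    int hits = 0;
    int limit = Math.min(n, scored.size());
    for (int i = 0; i < limit; i++) {
      if (known.contains(scored.get(i).itemID)) {
        hits++;
      }
    }
    return known.size() == 0 ? 0.0 : (double) hits / known.size();
  }
}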





Re: evaluating recommender with boolean prefs

2013-06-07 Thread Michael Sokolov

Thanks for your help.

Yes, I think a time-based division of test vs. training probably would
make sense, since that will correspond to our actual intended practice.

But before I worry about that, I seem to have some more fundamental
problem that is giving me 0 precision and 0 recall all the time...



-Mike


--
Michael Sokolov
Senior Architect
Safari Books Online



Re: evaluating recommender with boolean prefs

2013-06-07 Thread Sean Owen
It depends on the algorithm I suppose. In some cases, the
already-known items would always be top recommendations and the test
would tell you nothing. Just like in an RMSE test -- if you already
know the right answers your score is always a perfect 0.

But in some cases I agree you could get some use out of observing
where the algorithm ranks known associations, because they won't
always all be the very first ones.

It raises an interesting question: if the top recommendation wasn't an
already-known association, how do we know it's wrong? We don't. You
rate Star Trek, Star Trek V, and Star Trek IV. Say Star Trek II is
your top recommendation. That's actually probably right, and should be
ranked higher than all your observed associations. (It's a good
movie.) But the test would consider it wrong. In fact, anything that
you haven't interacted with before counts as wrong.

This sort of explains why precision/recall can be really low in these
tests. I would not be surprised if you get 0 in some cases, maybe on
small input. Is it a bad predictor? Maybe, but it's not clear.
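
To make the arithmetic concrete, a toy example (made-up hold-out; only
the held-out item counts as a hit, so the plausible-but-unseen
Star Trek II drags precision down):

import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class PrecisionToy {
  public static void main(String[] args) {
    // Pretend "Star Trek IV" was held out of the user's data for testing
    Set<String> heldOut = new HashSet<String>(Arrays.asList("Star Trek IV"));
    // Top-3 recommendations; "Star Trek II" may well be a good rec, but it
    // isn't in the held-out set, so it scores as a miss
    List<String> top3 = Arrays.asList("Star Trek II", "Star Trek VI", "Star Trek IV");

    int hits = 0;
    for (String item : top3) {
      if (heldOut.contains(item)) {
        hits++;
      }
    }
    System.out.println("precision@3 = " + (double) hits / top3.size());    // ~0.33
    System.out.println("recall      = " + (double) hits / heldOut.size()); // 1.0
  }
}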






Re: evaluating recommender with boolean prefs

2013-06-07 Thread Koobas
On Fri, Jun 7, 2013 at 4:50 PM, Sean Owen sro...@gmail.com wrote:

 It depends on the algorithm I suppose. In some cases, the
 already-known items would always be top recommendations and the test
 would tell you nothing. Just like in an RMSE test -- if you already
 know the right answers your score is always a perfect 0.

It's very much to the point.
ALS works by constructing a low-rank approximation of the original matrix.
We check how good that approximation is by comparing it against the
original.

I see an analogy here in the case of kNN: the suggestions are a model
of your interests and, in a sense, can be used to reconstruct your
original set.


 But in some cases I agree you could get some use out of observing
 where the algorithm ranks known associations, because they won't
 always all be the very first ones.

 It raises an interesting question: if the top recommendation wasn't an
 already-known association, how do we know it's wrong? We don't. You
 rate Star Trek, Star Trek V, and Star Trek IV. Say Star Trek II is
 your top recommendation. That's actually probably right, and should be
 ranked higher than all your observed associations. (It's a good
 movie.) But the test would consider it wrong. In fact, anything that
 you haven't interacted with before counts as wrong.

You can look at it from the other side.
It's not about the ones that are not in your original set.
It's about how good the recommender is at putting back the original
items, as if they had been removed.
Except we would not actually be removing them.
It's the same approach, simply without splitting the input into a
training set and a validation set.
In a sense, the whole set is both the training set and the validation set.
Again, I am not coming from an ML background.
Am I making sense here?





Re: evaluating recommender with boolean prefs

2013-06-07 Thread simon.2.thompson
But why would she want the things she has?




Re: evaluating recommender with boolean prefs

2013-06-07 Thread Sean Owen
Yes, it makes sense in the case of, for example, ALS.
With or without this idea, the more general point is that this result
is still problematic. It is somewhat useful for comparing in a relative
sense; I'd rather have a recommender that ranks my input values
somewhere near the top than near the bottom. But metrics like
precision@5 get hard to interpret, because they are often near 0 even
when things are working reasonably well. Mean average precision
considers the results in a more complete sense, as would AUC.
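
For one user, the computation is roughly this (a sketch of the metric
itself, not something the framework gives you directly as far as I
recall -- average the precision at each rank where a held-out item
appears, and the mean over users is MAP; "ranked" and "relevant" stand
in for a recommender's top-N list and the user's held-out item IDs):

import java.util.List;
import java.util.Set;

public class AveragePrecision {
  static double averagePrecision(List<Long> ranked, Set<Long> relevant) {
    if (relevant.isEmpty()) {
      return 0.0;
    }
    int hits = 0;
    double sum = 0.0;
    for (int k = 0; k < ranked.size(); k++) {
      if (relevant.contains(ranked.get(k))) {
        hits++;
        sum += (double) hits / (k + 1);   // precision at this rank
      }
    }
    return sum / relevant.size();
  }
}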




Re: evaluating recommender with boolean prefs

2013-06-07 Thread Sean Owen
I believe the suggestion is just for purposes of evaluation. You would
not return these items in practice, yes.

Although there are cases where you do want to return known items. For
example, maybe you are modeling user interaction with restaurant
categories. This could be useful because, as soon as you see that I
interact with Chinese and Indian, you may recommend Thai; it might
even be a stronger recommendation than the two known categories.
But I may not want to actually exclude Chinese and Indian from the
list entirely.

On Fri, Jun 7, 2013 at 10:36 PM,  simon.2.thomp...@bt.com wrote:
 But why would she want the things she has?