Re: evaluating recommender with boolean prefs
In point 1, I don't think I'd say it that way. It's not true that test/training data is divided by user, because then every user would be either 100% in the training data or 100% in the test data. Instead you hold out part of the data for each user, or at least for some subset of users. Then you can see whether the recommendations for those users match the held-out data. Yes, you then see how the held-out set matches the predictions by computing ratios that give you precision/recall.

The key question is really how you choose the test data. It's implicit data; one datum is as good as the next. In the framework I think it just randomly picks a subset of the data. You could also split by time; that's a defensible way to do it: training data up to time t and test data after time t.

On Fri, Jun 7, 2013 at 7:51 PM, Michael Sokolov <msoko...@safaribooksonline.com> wrote:
> I'm trying to evaluate a few different recommenders based on boolean preferences. The "in action" book suggests using a precision/recall metric, but I'm not sure I understand what that does, and in particular how it divides my data into test/train sets. What I think I'd like to do is:
>
> 1. Divide the test data by user: identify a set of training data with data from 80% of the users, and test using the remaining 20% (say).
> 2. Build a similarity model from the training data.
> 3. For the test users, divide their data in half: a training set and an evaluation set. Then for each test user, use their training data as input to the recommender, and see whether it recommends the data in the evaluation set or not.
>
> Is this what the precision/recall test is actually doing?
>
> --
> Michael Sokolov
> Senior Architect
> Safari Books Online
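For what it's worth, the hold-out-and-compare procedure above can be sketched without any framework at all. This is not Mahout's evaluator API; all names here are made up for illustration, and the toy most-popular "recommender" exists only to exercise the harness. Both the random per-user hold-out and the split-by-time variant are shown:

```python
import random

def split_per_user(prefs, holdout=0.25, seed=0):
    """prefs: {user: set(items)}. Hold out a fraction of each user's items."""
    rng = random.Random(seed)
    train, test = {}, {}
    for user, items in prefs.items():
        items = sorted(items)
        rng.shuffle(items)
        k = max(1, int(len(items) * holdout))
        test[user] = set(items[:k])   # held-out items to try to recover
        train[user] = set(items[k:])  # what the recommender gets to see
    return train, test

def split_by_time(events, t):
    """events: {user: [(time, item), ...]}. Train before t, test at/after t."""
    train = {u: {i for (ts, i) in evs if ts < t} for u, evs in events.items()}
    test = {u: {i for (ts, i) in evs if ts >= t} for u, evs in events.items()}
    return train, test

def precision_recall_at(recommend, train, test, at=5):
    """recommend(user, train, n) -> ranked item list. Averages P/R over users."""
    p_sum = r_sum = n = 0
    for user in test:
        recs = recommend(user, train, at)
        hits = len(set(recs) & test[user])
        p_sum += hits / at              # precision@at
        r_sum += hits / len(test[user]) # recall against held-out set
        n += 1
    return p_sum / n, r_sum / n

# Toy stand-in recommender: globally most popular items the user lacks.
def most_popular(user, train, n):
    counts = {}
    for items in train.values():
        for i in items:
            counts[i] = counts.get(i, 0) + 1
    ranked = sorted(counts, key=lambda i: -counts[i])
    return [i for i in ranked if i not in train[user]][:n]
```

The held-out items are implicit "relevant" items, and precision/recall are just the overlap ratios between them and the top-n recommendations.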
Re: evaluating recommender with boolean prefs
Since I am primarily an HPC person, this is probably a naive question from the ML perspective. What if, when computing recommendations, we don't exclude what the user already has, and then see whether the items he has end up being recommended to him (computing some appropriate metric/ratio)? Wouldn't that be the ultimate evaluator?

On Fri, Jun 7, 2013 at 2:58 PM, Sean Owen <sro...@gmail.com> wrote:
> In point 1, I don't think I'd say it that way. [...]
Re: evaluating recommender with boolean prefs
Thanks for your help. Yes, I think a time-based division of test v. training probably would make sense, since that will correspond to our actual intended practice. But before I worry about that, I seem to have some more fundamental problem that is giving me 0 precision and 0 recall all the time...

-Mike

On 06/07/2013 02:58 PM, Sean Owen wrote:
> In point 1, I don't think I'd say it that way. [...]
--
Michael Sokolov
Senior Architect
Safari Books Online
Re: evaluating recommender with boolean prefs
It depends on the algorithm, I suppose. In some cases, the already-known items would always be the top recommendations and the test would tell you nothing. It's just like an RMSE test: if you already know the right answers, your score is always a perfect 0.

But in some cases I agree you could get some use out of observing where the algorithm ranks known associations, because in some cases they won't all be the very first ones.

It raises an interesting question: if the top recommendation wasn't an already-known association, how do we know it's wrong? We don't. You rate Star Trek, Star Trek V, and Star Trek IV. Say Star Trek II is your top recommendation. That's actually probably right, and should be ranked higher than all your observed associations. (It's a good movie.) But the test would consider it wrong; in fact, anything you haven't interacted with before is considered wrong. This sort of explains why precision/recall can be really low in these tests. I would not be surprised if you get 0 in some cases, maybe on small input. Is it a bad predictor? Maybe, but it's not clear.

On Fri, Jun 7, 2013 at 8:06 PM, Koobas <koo...@gmail.com> wrote:
> Since I am primarily an HPC person, probably a naive question from the ML perspective. What if, when computing recommendations, we don't exclude what the user already has, and then see if the items he has end up being recommended to him? [...]
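The diagnostic being discussed (don't filter out known items; see where they land) can be sketched as a rank statistic. Everything here is illustrative: `score` stands in for whatever preference estimate the recommender computes, and the function name is made up.

```python
def mean_rank_of_known(score, user, known_items, all_items):
    """Rank ALL items by score (no filtering of known items) and report
    the average 0-based rank of the user's already-known items.
    Lower is better: a recommender that 'puts back' what the user has
    should stack those items near the top."""
    ranked = sorted(all_items, key=lambda i: -score(user, i))
    pos = {item: r for r, item in enumerate(ranked)}
    ranks = [pos[i] for i in known_items]
    return sum(ranks) / len(ranks)
```

As noted above, for some algorithms this number is trivially near-perfect (the known items dominate the scores by construction), so it only discriminates between recommenders for which that is not the case.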
Re: evaluating recommender with boolean prefs
On Fri, Jun 7, 2013 at 4:50 PM, Sean Owen <sro...@gmail.com> wrote:
> It depends on the algorithm I suppose. In some cases, the already-known items would always be top recommendations and the test would tell you nothing. Just like in an RMSE test -- if you already know the right answers your score is always a perfect 0.

It's very much to the point. ALS works by constructing a low-rank approximation of the original matrix. We check how good that approximation is by comparing it against the original. I see an analogy here in the case of kNN: the suggestions are a model of your interests, and in a sense can be used to reconstruct your original set.

> But in some cases I agree you could get some use out of observing where the algorithm ranks known associations [...] In fact anything that you haven't interacted with before is wrong.

You can look at it from the other side. It's not about the ones that are not in your original set; it's about how good the recommender is at putting back the originals, if they were removed. Except we would not actually be removing them. It's the same approach, simply without splitting the input into a training set and a validation set. In a sense, the whole set is both the training set and the validation set. Again, I am not coming from an ML background. Am I making sense here?
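The ALS analogy can be made concrete: build a rank-k approximation of the preference matrix and measure how well it reconstructs the original. The sketch below uses truncated SVD purely as a stand-in for ALS (both produce a low-rank factorization; SVD just makes the example self-contained), and the function name is invented:

```python
import numpy as np

def reconstruction_error(R, k):
    """RMSE between preference matrix R and its rank-k approximation,
    built here via truncated SVD as a stand-in for ALS factorization."""
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    R_hat = (U[:, :k] * s[:k]) @ Vt[:k, :]  # rank-k reconstruction
    return float(np.sqrt(np.mean((R - R_hat) ** 2)))
```

This is the "whole set is both training and validation" view: no entries are removed, and the score measures how faithfully the low-rank model puts the original matrix back.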
Re: evaluating recommender with boolean prefs
But why would she want the things she has?

----- Original Message -----
From: Koobas [mailto:koo...@gmail.com]
Sent: Friday, June 07, 2013 08:06 PM
To: user@mahout.apache.org
Subject: Re: evaluating recommender with boolean prefs

> Since I am primarily an HPC person, probably a naive question from the ML perspective. What if, when computing recommendations, we don't exclude what the user already has, and then see if the items he has end up being recommended to him (compute some appropriate metric / ratio)? Wouldn't that be the ultimate evaluator? [...]
Re: evaluating recommender with boolean prefs
Yes, it makes sense in the case of, for example, ALS. With or without this idea, though, the more general point is that this result is still problematic. It is somewhat useful for comparing in a relative sense; I'd rather have a recommender that stacks my input values somewhere near the top than near the bottom. But metrics like precision@5 get hard to interpret, because they are often near 0 even when things are working reasonably well. Mean average precision considers the results in a more complete sense, as would AUC.

On Fri, Jun 7, 2013 at 10:04 PM, Koobas <koo...@gmail.com> wrote:
> You can look at it from the other side. It's not about the ones that are not in your original set. It's about how good the recommender is in putting back the original, if they were removed. [...]
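Mean average precision rewards the whole ranking rather than a single cutoff, which is why it degrades more gracefully than precision@5. A minimal sketch (names invented; `relevant` would be the held-out items from the split):

```python
def average_precision(ranked, relevant):
    """AP of one ranked recommendation list against a set of held-out
    relevant items: precision at each rank where a relevant item
    appears, averaged over all relevant items."""
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(rec_lists, relevant_sets):
    """MAP over users; rec_lists and relevant_sets are parallel dicts."""
    aps = [average_precision(rec_lists[u], relevant_sets[u]) for u in rec_lists]
    return sum(aps) / len(aps)
```

Unlike precision@5, a relevant item at rank 6 or 20 still contributes something, so two recommenders that both score 0 at a fixed cutoff can still be told apart.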
Re: evaluating recommender with boolean prefs
I believe the suggestion is just for purposes of evaluation; you would not return these items in practice, yes. Although there are cases where you do want to return known items. For example, maybe you are modeling user interaction with restaurant categories. This could be useful, because as soon as you see that I interact with Chinese and Indian, you may recommend Thai; it might even be a stronger recommendation than the two known categories. But I may not want to actually exclude Chinese and Indian from the list entirely.

On Fri, Jun 7, 2013 at 10:36 PM, <simon.2.thomp...@bt.com> wrote:
> But why would she want the things she has?