Re: Spark ml.ALS question -- RegressionEvaluator .evaluate giving ~1.5 output for same train and predict data
This is exactly the core problem in the linked issue. Normally you would use TrainValidationSplit or CrossValidator to do hyper-parameter selection using cross-validation. You could tune the factor size, the regularization parameter, and alpha (for implicit preference data), for example.

Because of the NaN issue you currently cannot use the cross-validators with ALS, so you would have to do the tuning manually (dropping the NaNs from the prediction results, as Krishna says).

On Mon, 25 Jul 2016 at 11:40 Rohit Chaddha wrote:

> Hi Krishna,
>
> Great, I had no idea about this. I tried your suggestion of using
> na.drop() and got an RMSE of 1.5794048211812495.
>
> Any suggestions on how this can be reduced and the model improved?
>
> Regards,
> Rohit
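A minimal sketch of the manual tuning described in this message, assuming objects that follow the Spark ML conventions (`fit`, `transform`, `na.drop()`, `evaluate`). The helper name `manual_grid_search` and its arguments are hypothetical, not part of Spark; lower scores are assumed to be better, as with RMSE.

```python
import itertools
import math


def manual_grid_search(make_estimator, evaluator, train, validation, grid):
    """Try every parameter combination and return (best_params, best_score, results).

    make_estimator(**params) must return an object with .fit(train); the
    fitted model must have .transform(validation), and the result must
    support .na.drop() the way a Spark DataFrame does.
    """
    names = sorted(grid)
    results = []
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        model = make_estimator(**params).fit(train)
        # Drop rows with NaN predictions (ids that were not in the training set).
        predictions = model.transform(validation).na.drop()
        results.append((params, evaluator.evaluate(predictions)))
    # Skip any combination whose score is still NaN.
    scored = [(p, s) for p, s in results if not math.isnan(s)]
    if not scored:
        raise ValueError("every parameter combination produced a NaN score")
    best_params, best_score = min(scored, key=lambda item: item[1])
    return best_params, best_score, results
```

With PySpark, `make_estimator` could be a small wrapper around `ALS(...)` that fixes the column names, `grid` could use keys such as `rank` and `regParam` (plus `alpha` when `implicitPrefs=True`), and `evaluator` could be a `RegressionEvaluator` with `metricName="rmse"`. Spark 2.2 and later also offer `coldStartStrategy="drop"` on ALS, which makes the explicit `na.drop()` unnecessary.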
Re: Spark ml.ALS question -- RegressionEvaluator .evaluate giving ~1.5 output for same train and predict data
Hi Krishna,

Great, I had no idea about this. I tried your suggestion of using na.drop() and got an RMSE of 1.5794048211812495.

Any suggestions on how this can be reduced and the model improved?

Regards,
Rohit

On Mon, Jul 25, 2016 at 4:12 AM, Krishna Sankar wrote:

> Thanks Nick. I also ran into this issue.
> VG, one workaround is to drop the NaN from predictions (df.na.drop()) and
> then use the dataset for the evaluator. In real life, probably detect the
> NaN and recommend most popular on some window.
> HTH.
> Cheers
Re: Spark ml.ALS question -- RegressionEvaluator .evaluate giving ~1.5 output for same train and predict data
Good suggestion, Krishna.

One issue is that this doesn't work with TrainValidationSplit or CrossValidator for parameter tuning. Hence my solution in the PR, which makes it work with the cross-validators.

On Mon, 25 Jul 2016 at 00:42, Krishna Sankar wrote:

> Thanks Nick. I also ran into this issue.
> VG, one workaround is to drop the NaN from predictions (df.na.drop()) and
> then use the dataset for the evaluator. In real life, probably detect the
> NaN and recommend most popular on some window.
> HTH.
> Cheers
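A small illustration of why NaN predictions get in the way of cross-validated tuning, assuming the tuner averages one metric per fold (the helper `mean_metric` is hypothetical). A single fold that scores NaN makes the averaged metric NaN, and NaN cannot be meaningfully compared with other candidates.

```python
import math


def mean_metric(fold_metrics):
    """Average the per-fold metric values, as a cross-validator would."""
    return sum(fold_metrics) / len(fold_metrics)


nan = float("nan")
clean = mean_metric([0.91, 0.95, 0.88])   # every fold produced a finite RMSE
poisoned = mean_metric([0.91, nan, 0.88])  # one fold contained unseen ids

print(math.isnan(clean))     # False
print(math.isnan(poisoned))  # True
# Every ordering comparison against NaN is False, so ranking is not possible.
print(poisoned < clean, poisoned > clean)  # False False
```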
Re: Spark ml.ALS question -- RegressionEvaluator .evaluate giving ~1.5 output for same train and predict data
Great, thanks to both of you. I was struggling with this issue as well.

-Rohit

On Mon, Jul 25, 2016 at 4:12 AM, Krishna Sankar wrote:

> Thanks Nick. I also ran into this issue.
> VG, one workaround is to drop the NaN from predictions (df.na.drop()) and
> then use the dataset for the evaluator. In real life, probably detect the
> NaN and recommend most popular on some window.
> HTH.
> Cheers
Re: Spark ml.ALS question -- RegressionEvaluator .evaluate giving ~1.5 output for same train and predict data
Thanks Nick. I also ran into this issue.

VG, one workaround is to drop the NaN from predictions (df.na.drop()) and then use the dataset for the evaluator. In real life, probably detect the NaN and recommend most popular on some window.

HTH.
Cheers

On Sun, Jul 24, 2016 at 12:49 PM, Nick Pentreath wrote:

> It seems likely that you're running into
> https://issues.apache.org/jira/browse/SPARK-14489 - this occurs when the
> test dataset in the train/test split contains users or items that were not
> in the training set. Hence the model doesn't have computed factors for
> those ids, and ALS 'transform' currently returns NaN for those ids. This in
> turn results in NaN for the evaluator result.
>
> I have a PR open on that issue that will hopefully address this soon.
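The workaround in this message can be sketched in plain Python, independent of Spark. The names `rmse_dropping_nan` and `most_popular`, the `(user, item, timestamp)` tuple layout, and the `since` cutoff used as the "window" are all assumptions made for illustration.

```python
import math
from collections import Counter


def rmse_dropping_nan(pairs):
    """RMSE over (label, prediction) pairs, skipping NaN predictions."""
    kept = [(y, p) for y, p in pairs if not math.isnan(p)]
    if not kept:
        return float("nan")
    return math.sqrt(sum((y - p) ** 2 for y, p in kept) / len(kept))


def most_popular(interactions, n, since):
    """Top-n item ids by interaction count among events with timestamp >= since."""
    counts = Counter(item for _user, item, ts in interactions if ts >= since)
    return [item for item, _count in counts.most_common(n)]


pairs = [(4.0, 3.0), (2.0, float("nan")), (5.0, 5.0)]
print(rmse_dropping_nan(pairs))  # the NaN row is ignored

events = [("u1", "a", 1), ("u2", "a", 5), ("u3", "b", 6),
          ("u1", "b", 7), ("u2", "b", 8), ("u3", "c", 9)]
print(most_popular(events, n=1, since=5))  # fallback for a user with no factors
```

Dropping the NaN rows only changes how the model is scored; the popularity fallback is what a serving path could return when the model has no factors for a user or item.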
Re: Spark ml.ALS question -- RegressionEvaluator .evaluate giving ~1.5 output for same train and predict data
It seems likely that you're running into https://issues.apache.org/jira/browse/SPARK-14489 - this occurs when the test dataset in the train/test split contains users or items that were not in the training set. Hence the model doesn't have computed factors for those ids, and ALS 'transform' currently returns NaN for those ids. This in turn results in NaN for the evaluator result.

I have a PR open on that issue that will hopefully address this soon.

On Sun, 24 Jul 2016 at 17:49 VG wrote:

> ping. Does anyone have some suggestions/advice for me?
> It would be really helpful.
>
> VG
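A plain-Python sketch of the mechanism described in this message: one NaN prediction is enough to turn the whole RMSE into NaN, and the affected rows are the ones whose user or item never appears in the training split. The helper names and the `(user, item, rating)` tuple layout are assumptions for illustration.

```python
import math


def rmse(pairs):
    """Plain RMSE over (label, prediction) pairs; NaN propagates through the sum."""
    return math.sqrt(sum((y - p) ** 2 for y, p in pairs) / len(pairs))


def unseen_ids(train, test):
    """Return (users, items) that appear in test but not in train."""
    train_users = {u for u, _i, _r in train}
    train_items = {i for _u, i, _r in train}
    new_users = {u for u, _i, _r in test} - train_users
    new_items = {i for _u, i, _r in test} - train_items
    return new_users, new_items


train = [("u1", "a", 4.0), ("u2", "a", 3.0), ("u2", "b", 5.0)]
test = [("u1", "b", 4.0), ("u3", "a", 2.0), ("u2", "c", 1.0)]
print(unseen_ids(train, test))  # user u3 and item c have no learned factors

print(rmse([(4.0, 3.5), (2.0, float("nan"))]))  # nan
```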
Re: Spark ml.ALS question -- RegressionEvaluator .evaluate giving ~1.5 output for same train and predict data
ping. Does anyone have some suggestions/advice for me? It would be really helpful.

VG

On Sun, Jul 24, 2016 at 12:19 AM, VG wrote:

> Sean,
>
> I did this just to test the model. When I split my data into 80%
> training and 20% test, I get a Root-mean-square error = NaN
>
> So I am wondering where I might be going wrong.
>
> Regards,
> VG
Re: Spark ml.ALS question -- RegressionEvaluator .evaluate giving ~1.5 output for same train and predict data
Any suggestions / ideas here?

On Sun, Jul 24, 2016 at 12:19 AM, VG wrote:

> Sean,
>
> I did this just to test the model. When I split my data into 80%
> training and 20% test, I get a Root-mean-square error = NaN
>
> So I am wondering where I might be going wrong.
>
> Regards,
> VG
Re: Spark ml.ALS question -- RegressionEvaluator .evaluate giving ~1.5 output for same train and predict data
Sean,

I did this just to test the model. When I split my data into 80% training and 20% test, I get a Root-mean-square error = NaN

So I am wondering where I might be going wrong.

Regards,
VG

On Sun, Jul 24, 2016 at 12:12 AM, Sean Owen wrote:

> No, that's certainly not to be expected. ALS works by computing a much
> lower-rank representation of the input. It would not reproduce the
> input exactly, and you don't want it to -- this would be seriously
> overfit. This is why in general you don't evaluate a model on the
> training set.
Re: Spark ml.ALS question -- RegressionEvaluator .evaluate giving ~1.5 output for same train and predict data
No, that's certainly not to be expected. ALS works by computing a much lower-rank representation of the input. It would not reproduce the input exactly, and you don't want it to -- this would be seriously overfit. This is why in general you don't evaluate a model on the training set.

On Sat, Jul 23, 2016 at 7:37 PM, VG wrote:

> I am trying to run ml.ALS to compute some recommendations.
>
> Just to test, I am using the same dataset for training using ALSModel
> and for predicting the results based on the model.
>
> When I evaluate the result using RegressionEvaluator I get a
> Root-mean-square error = 1.5544064263236066
>
> I think this should be 0. Any suggestions what might be going wrong?
>
> Regards,
> Vipul
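To illustrate the point about low-rank representations made in this reply, here is a toy rank-1 alternating least squares fit in plain Python. It is a sketch under simplifying assumptions (fully observed matrix, rank 1, a plain L2 penalty `reg`), not Spark's implementation; the function names and the sample ratings are made up for the example.

```python
import math


def rank1_als(ratings, reg=0.1, iterations=20):
    """Fit ratings[i][j] ~ u[i] * v[j] by alternating least squares (rank 1)."""
    n_users, n_items = len(ratings), len(ratings[0])
    u = [1.0] * n_users
    v = [1.0] * n_items
    for _ in range(iterations):
        # Hold v fixed and solve the regularized least-squares problem for each u[i].
        vv = sum(x * x for x in v) + reg
        u = [sum(ratings[i][j] * v[j] for j in range(n_items)) / vv
             for i in range(n_users)]
        # Hold u fixed and solve for each v[j].
        uu = sum(x * x for x in u) + reg
        v = [sum(ratings[i][j] * u[i] for i in range(n_users)) / uu
             for j in range(n_items)]
    return u, v


def training_rmse(ratings, u, v):
    """RMSE between the ratings and their rank-1 reconstruction u[i] * v[j]."""
    errors = [(ratings[i][j] - u[i] * v[j]) ** 2
              for i in range(len(u)) for j in range(len(v))]
    return math.sqrt(sum(errors) / len(errors))


ratings = [[5.0, 3.0, 1.0],
           [4.0, 1.0, 5.0],
           [1.0, 5.0, 4.0]]
u, v = rank1_als(ratings)
print(training_rmse(ratings, u, v))  # greater than 0 even on the training data
```

The sample matrix has full rank, so no rank-1 product can reproduce it and the training RMSE stays well above zero. Raising the rank or lowering the regularization would shrink the training error, at the cost of the overfitting mentioned above.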