[jira] [Commented] (SPARK-14489) RegressionEvaluator returns NaN for ALS in Spark ml

2016-07-27 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15395275#comment-15395275
 ] 

Nick Pentreath commented on SPARK-14489:


Thanks for the thoughts Krishna.

# Initially I also thought a flag to ignore NaN in the evaluators would make 
sense. However, frankly, I have never seen (and can't think of) a situation 
where this is desirable, _outside_ of this one, where splitting the dataset 
can result in user/item ids the model has not been trained on (this applies 
in general to "ranking" cases). For all other typical supervised learning 
cases, NaN means either (a) NaN inputs, which the user should deal with in 
the pipeline before training, or (b) a model that has bad coefficients. In 
both these cases, I'd argue that it is correct to return NaN, and not 
desirable to ignore NaN;

> RegressionEvaluator returns NaN for ALS in Spark ml
> ---
>
> Key: SPARK-14489
> URL: https://issues.apache.org/jira/browse/SPARK-14489
> Project: Spark
> Issue Type: Bug
> Components: ML
> Affects Versions: 1.6.0
> Environment: AWS EMR
> Reporter: Boris Clémençon
> Labels: patch
> Original Estimate: 4h
> Remaining Estimate: 4h
>
> When building a Spark ML pipeline containing an ALS estimator, the metrics 
> "rmse", "mse", "r2" and "mae" all return NaN. 
> The reason is in CrossValidator.scala line 109. The K-folds are randomly 
> generated. For large and sparse datasets, there is a significant probability 
> that at least one user of the validation set is missing from the training set, 
> so the transform method generates a few NaN predictions and the 
> RegressionEvaluator metrics become NaN too. 
> Suggestion to fix the bug: remove the NaN values while computing the rmse or 
> other metrics (i.e., remove users or items in the validation set that are 
> missing from the training set). Log a message when this happens.
> Issue SPARK-14153 seems to be the same problem.
> {code:title=Bar.scala|borderStyle=solid}
> val splits = MLUtils.kFold(dataset.rdd, $(numFolds), 0)
> splits.zipWithIndex.foreach { case ((training, validation), splitIndex) =>
>   val trainingDataset = sqlCtx.createDataFrame(training, schema).cache()
>   val validationDataset = sqlCtx.createDataFrame(validation, schema).cache()
>   // multi-model training
>   logDebug(s"Train split $splitIndex with multiple sets of parameters.")
>   val models = est.fit(trainingDataset, epm).asInstanceOf[Seq[Model[_]]]
>   trainingDataset.unpersist()
>   var i = 0
>   while (i < numModels) {
>     // TODO: duplicate evaluator to take extra params from input
>     val metric = eval.evaluate(models(i).transform(validationDataset, epm(i)))
>     logDebug(s"Got metric $metric for model trained with ${epm(i)}.")
>     metrics(i) += metric
>     i += 1
>   }
>   validationDataset.unpersist()
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14489) RegressionEvaluator returns NaN for ALS in Spark ml

2016-07-25 Thread Krishna Sankar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15392293#comment-15392293
 ] 

Krishna Sankar commented on SPARK-14489:


From my experience in the field and with R, a couple of thoughts:
The ALS and the evaluator are doing the right thing, given the information they 
have and without any contextual directives.
1. For the evaluator, as mentioned earlier, similar to what R has, a na.rm flag 
(ignoreNaN=false, to keep the current behavior) would be a good choice. I have 
a suspicion that we would need the ignoreNaN elsewhere as well, for example in 
the CrossValidator.
2. For ALS, in the absence of a directive, we shouldn't calculate a default 
average recommendation or even 0; the current NaN is the right one. Depending 
on the context, an application might decide not to recommend anything, have a 
default recommendation, or even have a dynamically calculated value, e.g. over 
a recent window. So a parameter defaultRecommendation="NaN" or "average" or a 
value would be a good choice to cover all the possibilities. Or the developer 
can use na.fill() for other operations.
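
To make the first suggestion concrete, here is a minimal sketch of an RMSE computation with an R-style na.rm switch. It is plain Scala rather than Spark code, and the object name NaNAwareRmse and the ignoreNaN parameter are hypothetical; RegressionEvaluator is not claimed to expose either of them.

```scala
object NaNAwareRmse {
  /** Root-mean-square error over (prediction, label) pairs.
    *
    * With ignoreNaN = false, a single NaN prediction makes the result NaN,
    * which is the behavior reported in this issue. With ignoreNaN = true,
    * pairs containing a NaN are dropped first (hypothetical flag, modelled
    * on R's na.rm).
    */
  def rmse(pairs: Seq[(Double, Double)], ignoreNaN: Boolean = false): Double = {
    val used =
      if (ignoreNaN) pairs.filterNot { case (p, l) => p.isNaN || l.isNaN }
      else pairs
    if (used.isEmpty) Double.NaN // nothing left to evaluate
    else {
      val sumSq = used.map { case (p, l) => (p - l) * (p - l) }.sum
      math.sqrt(sumSq / used.size)
    }
  }

  def main(args: Array[String]): Unit = {
    // The second pair stands for a user the model never saw during training.
    val pairs = Seq((3.0, 4.0), (Double.NaN, 2.0), (5.0, 5.0))
    println(rmse(pairs))                   // prints NaN
    println(rmse(pairs, ignoreNaN = true)) // sqrt(0.5), about 0.707
  }
}
```

Dropping pairs silently changes what the metric measures, so an implementation along these lines would probably also report how many rows were skipped, as the issue description requests.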




[jira] [Commented] (SPARK-14489) RegressionEvaluator returns NaN for ALS in Spark ml

2016-05-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15270455#comment-15270455
 ] 

Apache Spark commented on SPARK-14489:
--

User 'MLnick' has created a pull request for this issue:
https://github.com/apache/spark/pull/12896




[jira] [Commented] (SPARK-14489) RegressionEvaluator returns NaN for ALS in Spark ml

2016-04-22 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15254680#comment-15254680
 ] 

Seth Hendrickson commented on SPARK-14489:
--

This is an interesting idea. I would say that under the current framework for 
stratified sampling, there is not a performant way to guarantee each split 
contains every user at least once (even if we filter out users with < k * n 
items). In naive stratified sampling, you would simply generate a random key 
for each user, and sort the entire dataset, taking even splits amongst each 
user. I am not sure if that is an acceptable option given how expensive a sort 
over the entire dataset would be. Using ScaSRS might actually be worse in this 
case, if the waitlist is close to the size of the requested sample, since the 
waitlists are collected on the driver. I am not sure what options open up if we 
don't require even splits, but just that each split contains every user, but 
there might be something to that.
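
The naive approach described above can be sketched on a single machine with plain Scala collections. This is only an illustration under stated assumptions: "random key" is read as one random key per rating within each user's stratum, and the names NaiveStratifiedFolds, Rating and split are hypothetical, not Spark APIs. On a cluster, the per-user sort is the expensive step being discussed.

```scala
import scala.util.Random

object NaiveStratifiedFolds {
  final case class Rating(user: Int, item: Int, rating: Double)

  /** Assigns every rating to one of k folds so that each user's ratings are
    * spread as evenly as possible across the folds.
    *
    * Attach a random key to each rating, sort each user's ratings by that
    * key, then deal them out round-robin over the folds.
    */
  def split(ratings: Seq[Rating], k: Int, seed: Long): Vector[Seq[Rating]] = {
    require(k > 0, "k must be positive")
    val rng = new Random(seed)
    val assigned: Seq[(Int, Rating)] = ratings
      .map(r => (rng.nextLong(), r))        // random key per rating
      .groupBy { case (_, r) => r.user }    // one stratum per user
      .values
      .toSeq
      .flatMap { stratum =>
        // Position in the shuffled stratum decides the fold.
        stratum.sortBy(_._1).zipWithIndex.map { case ((_, r), i) => (i % k, r) }
      }
    Vector.tabulate(k)(fold => assigned.filter(_._1 == fold).map(_._2))
  }
}
```

With this scheme a user who has at least k ratings appears in every fold, while a user with fewer than k ratings is necessarily missing from some folds.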




[jira] [Commented] (SPARK-14489) RegressionEvaluator returns NaN for ALS in Spark ml

2016-04-22 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15253536#comment-15253536
 ] 

Nick Pentreath commented on SPARK-14489:


Is naive sampling not an option then for the recommendation setting? In fact we 
don't strictly need to guarantee that the proportions are maintained in each 
sample; I think it would be enough to ensure that each user has at least 1 (or 
n) items in each sample. This would require filtering out users with < k * n 
items, though. Not sure if that makes things simpler or more complex.
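
As a rough illustration of the filtering step, the sketch below keeps only users with at least k * n ratings, where k is the number of folds and n the minimum number of items per user per fold. It is plain Scala over (user, item) pairs, and the names MinRatingsFilter and keepEligible are hypothetical.

```scala
object MinRatingsFilter {
  /** Keeps only the (user, item) pairs of users with at least k * n ratings,
    * so that each of k folds can hold n ratings for every remaining user.
    */
  def keepEligible(ratings: Seq[(Int, Int)], k: Int, n: Int): Seq[(Int, Int)] = {
    // Number of ratings per user.
    val counts: Map[Int, Int] =
      ratings.groupBy(_._1).map { case (user, rs) => user -> rs.size }
    ratings.filter { case (user, _) => counts(user) >= k * n }
  }
}
```

The filter only guarantees that enough ratings exist per user; the split itself still has to distribute them across the folds.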




[jira] [Commented] (SPARK-14489) RegressionEvaluator returns NaN for ALS in Spark ml

2016-04-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15251988#comment-15251988
 ] 

Apache Spark commented on SPARK-14489:
--

User 'MLnick' has created a pull request for this issue:
https://github.com/apache/spark/pull/12577




[jira] [Commented] (SPARK-14489) RegressionEvaluator returns NaN for ALS in Spark ml

2016-04-19 Thread Abou Haydar Elias (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15247710#comment-15247710
 ] 

Abou Haydar Elias commented on SPARK-14489:
---

I totally agree with [~sethah]. I stumbled on this today. It seems that 
when randomly splitting, we can end up with a test set containing a user not 
available in the training set, which can also happen to an item. ALS then 
can't produce predictions for that user and returns NaN instead. We are falling 
into the new user/new item problem. So what can we do here?
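
A quick way to confirm that a given split has this problem is to list the users and items that occur in the test set but not in the training set. The sketch below does this with plain Scala collections; ColdStartCheck, Rating and unseen are hypothetical names used only for illustration.

```scala
object ColdStartCheck {
  final case class Rating(user: Int, item: Int, rating: Double)

  /** Returns the users and the items that occur in `test` but never in
    * `train`. Predictions for rows involving either set come back as NaN
    * from the ALS model discussed in this issue.
    */
  def unseen(train: Seq[Rating], test: Seq[Rating]): (Set[Int], Set[Int]) = {
    val trainUsers = train.map(_.user).toSet
    val trainItems = train.map(_.item).toSet
    val newUsers = test.map(_.user).toSet.diff(trainUsers)
    val newItems = test.map(_.item).toSet.diff(trainItems)
    (newUsers, newItems)
  }
}
```

If both returned sets are empty, NaN metrics on that split must have another cause, such as NaN inputs.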




[jira] [Commented] (SPARK-14489) RegressionEvaluator returns NaN for ALS in Spark ml

2016-04-18 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15246538#comment-15246538
 ] 

Joseph K. Bradley commented on SPARK-14489:
---

I agree that it's unclear what to do with a new item.  I don't think there are 
any good options and would support either not tolerating or ignoring new items.




[jira] [Commented] (SPARK-14489) RegressionEvaluator returns NaN for ALS in Spark ml

2016-04-14 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15242251#comment-15242251
 ] 

Seth Hendrickson commented on SPARK-14489:
--

[~mlnick] I am skeptical that 
[SPARK-8971|https://issues.apache.org/jira/browse/SPARK-8971] applies here. In 
order to guarantee that the user proportions are maintained in each sample, we 
need to use the Scalable Simple Random Sampling (ScaSRS) algorithm. From my 
understanding, this will not work well for small strata like you might 
encounter in a recommendation setting. Say you need to guarantee that a user 
with 100 ratings appears in each of 5 folds. ScaSRS uses a method where it 
only needs to sort the items in a waitlist, whose size depends on the 
probability of acceptance, the sample size, and the desired accuracy. For a 
loose accuracy setting, I compute that the expected waitlist size in this 
scenario is about 20 - the size of the sample! This degrades to naive sampling. 
In this situation, I get the following:

(numRatings, expectedWaitListSize)
(100, 20.36)
(1000, 61.59)
(10000, 192.75)
(100000, 607.75)
(1000000, 1920.18)

I am using [this paper|http://jmlr.org/proceedings/papers/v28/meng13a.pdf] as a 
reference. Perhaps [~mengxr] could clarify since he wrote the paper :D ?




[jira] [Commented] (SPARK-14489) RegressionEvaluator returns NaN for ALS in Spark ml

2016-04-14 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15240718#comment-15240718
 ] 

Nick Pentreath commented on SPARK-14489:


+1 for having CrossValidator be able to handle this in a more principled way by 
doing stratified sampling by, say, one of the input columns (user id, for 
example). This links to SPARK-8971 (which is focused on sampling by class 
label, but I think it can be generalized to sampling by any input column).

Until we have something like this, allowing skipping NaNs in the evaluators is 
perhaps the best option. If we agree I can take a look at that - we could make 
it an "expertParam" setting with an appropriate warning in the doc.

I like the "average user" option in ALS a lot too. We can offer both options, 
and provide some documentation about common use cases for them, as well as 
expanding the ALS examples to illustrate this.
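
One possible reading of the "average user" option is sketched below: an unknown user is scored with the element-wise mean of all trained user factor vectors. This is an assumption made for illustration, not a description of what ALS does; the names AverageUserFallback, averageFactors and predict are hypothetical, and unknown items are deliberately left as NaN.

```scala
object AverageUserFallback {
  /** Element-wise mean of all user factor vectors (the "average user"). */
  def averageFactors(userFactors: Map[Int, Array[Double]]): Array[Double] = {
    require(userFactors.nonEmpty, "need at least one trained user")
    val rank = userFactors.head._2.length
    val sums = new Array[Double](rank)
    for (f <- userFactors.values; j <- 0 until rank) sums(j) += f(j)
    sums.map(_ / userFactors.size)
  }

  /** Dot product of user and item factors. Unknown users are scored with
    * `fallback`; unknown items still yield NaN.
    */
  def predict(
      userFactors: Map[Int, Array[Double]],
      itemFactors: Map[Int, Array[Double]],
      fallback: Array[Double])(user: Int, item: Int): Double =
    itemFactors.get(item) match {
      case None => Double.NaN
      case Some(q) =>
        val p = userFactors.getOrElse(user, fallback)
        p.zip(q).map { case (a, b) => a * b }.sum
    }
}
```

With such a fallback every known item receives the same score for every new user, so the result is a non-personalized ranking rather than a recommendation tailored to that user.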

Finally, is the case for #1 and #2 for a new item different from a new user? It 
may make sense to recommend based on the average user for a new user in the 
absence of any data, but does this make sense for a new item? I'm not sure, 
though it doesn't make as much sense to me.




[jira] [Commented] (SPARK-14489) RegressionEvaluator returns NaN for ALS in Spark ml

2016-04-13 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15240364#comment-15240364
 ] 

Joseph K. Bradley commented on SPARK-14489:
---

(Oh, I had not refreshed the page before commenting, but it looks like my 
comments mesh with Nick's.)

> RegressionEvaluator returns NaN for ALS in Spark ml
> ---
>
> Key: SPARK-14489
> URL: https://issues.apache.org/jira/browse/SPARK-14489
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.6.0
> Environment: AWS EMR
>Reporter: Boris Clémençon 
>  Labels: patch
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> When building a Spark ML pipeline containing an ALS estimator, the metrics 
> "rmse", "mse", "r2" and "mae" all return NaN. 
> The reason is in CrossValidator.scala line 109. The K-folds are randomly 
> generated. For large and sparse datasets, there is a significant probability 
> that at least one user of the validation set is missing in the training set, 
> hence generating a few NaN estimation with transform method and NaN 
> RegressionEvaluator's metrics too. 
> Suggestion to fix the bug: remove the NaN values while computing the rmse or 
> other metrics (ie, removing users or items in validation test that is missing 
> in the learning set). Send logs when this happen.
> Issue SPARK-14153 seems to be the same pbm
> {code:title=Bar.scala|borderStyle=solid}
> val splits = MLUtils.kFold(dataset.rdd, $(numFolds), 0)
> splits.zipWithIndex.foreach { case ((training, validation), splitIndex) =>
>   val trainingDataset = sqlCtx.createDataFrame(training, schema).cache()
>   val validationDataset = sqlCtx.createDataFrame(validation, schema).cache()
>   // multi-model training
>   logDebug(s"Train split $splitIndex with multiple sets of parameters.")
>   val models = est.fit(trainingDataset, epm).asInstanceOf[Seq[Model[_]]]
>   trainingDataset.unpersist()
>   var i = 0
>   while (i < numModels) {
>     // TODO: duplicate evaluator to take extra params from input
>     val metric = eval.evaluate(models(i).transform(validationDataset, epm(i)))
>     logDebug(s"Got metric $metric for model trained with ${epm(i)}.")
>     metrics(i) += metric
>     i += 1
>   }
>   validationDataset.unpersist()
> }
> {code}






[jira] [Commented] (SPARK-14489) RegressionEvaluator returns NaN for ALS in Spark ml

2016-04-13 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15240360#comment-15240360
 ] 

Joseph K. Bradley commented on SPARK-14489:
---

I'd like to try to separate a few issues here based on use cases and suggest the 
"right thing to do" in each case:
* Deploying an ALSModel to make predictions: The model should make best-effort 
predictions, even for new users.  I'd say new users should get recommendations 
based on the average user, for both the explicit and implicit settings.  
Providing a Param which makes the model output NaN for unknown users seems 
reasonable as an additional feature.
* Evaluating an ALSModel on a held-out dataset: This is the same as the first 
case; the model should behave the same way it will when deployed.
* Model tuning using CrossValidator: I'm less sure about this.  Both of your 
suggestions seem reasonable (either returning NaN for missing users and 
ignoring NaN in the evaluator, or making best-effort predictions for all 
users).  I also suspect it would be worthwhile to examine literature to find 
what tends to be best.  E.g., should CrossValidator handle ranking specially by 
doing stratified sampling to divide each user or item's ratings evenly across 
folds of CV?

If we want the evaluator to be able to ignore NaNs, then I'd prefer we keep the 
current behavior as the default and provide a Param which allows users to 
ignore NaNs.  I'd be afraid of linear models not having enough regularization, 
getting NaNs in the coefficients, having all of its predictions ignored by the 
evaluator, etc.

What do you think?
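
The opt-in behavior described above can be sketched in plain Scala. This is 
only an illustration, not Spark's RegressionEvaluator: the object 
NanAwareRmse and the ignoreNaN flag are hypothetical names, and the metric 
is computed on an in-memory Seq of (prediction, label) pairs rather than a 
DataFrame.

```scala
object NanAwareRmse {
  /** RMSE over (prediction, label) pairs.
    *
    * ignoreNaN = false (the default) keeps the current behavior: a single NaN
    * prediction makes the whole metric NaN. ignoreNaN = true drops rows whose
    * prediction is NaN before computing the metric.
    */
  def rmse(pairs: Seq[(Double, Double)], ignoreNaN: Boolean = false): Double = {
    val kept =
      if (ignoreNaN) pairs.filterNot { case (prediction, _) => prediction.isNaN }
      else pairs
    if (kept.isEmpty) Double.NaN // nothing left to evaluate
    else {
      val squaredErrors = kept.map { case (p, label) => (p - label) * (p - label) }
      math.sqrt(squaredErrors.sum / kept.size)
    }
  }
}
```

With the default, a model whose coefficients have gone to NaN still 
produces a NaN metric, which is the failure mode the default is meant to 
keep visible.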




[jira] [Commented] (SPARK-14489) RegressionEvaluator returns NaN for ALS in Spark ml

2016-04-13 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15239847#comment-15239847
 ] 

Nick Pentreath commented on SPARK-14489:


In the live setting you definitely want to recommend something even when your 
model can't compute a recommendation. For evaluation, I'd say it makes more 
sense to split the train/test dataset by say user, such that each user has a 
small proportion of ratings in the test set. This avoids the issue and is quite 
common in the literature.

Given this approach doesn't fit neatly into 
{{CrossValidator}}/{{TrainValidationSplit}} (perhaps something like SPARK-8971 
could help), we could as you say use this as an improvement to at least allow 
using ALS with {{RegressionEvaluator}}.
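
A per-user split of the kind described above might look like the following 
plain-Scala sketch. It is not part of Spark: the Rating case class, the 
PerUserSplit object, and the testFraction and seed parameters are 
hypothetical, and the data is an in-memory Seq rather than a DataFrame.

```scala
import scala.util.Random

object PerUserSplit {
  final case class Rating(user: Int, item: Int, rating: Double)

  /** Splits ratings per user into (train, test).
    *
    * A user with at least two ratings sends roughly testFraction of them
    * (at least one, never all) to the test set. A user with a single rating
    * stays entirely in training, so every test user is known to the model.
    */
  def split(ratings: Seq[Rating], testFraction: Double, seed: Long): (Seq[Rating], Seq[Rating]) = {
    val rng = new Random(seed)
    // Sort by user id so the result is reproducible for a given seed.
    val parts = ratings.groupBy(_.user).toSeq.sortBy(_._1).map { case (_, userRatings) =>
      val n = userRatings.size
      val nTest =
        if (n < 2) 0
        else math.min(n - 1, math.max(1, math.round(n * testFraction).toInt))
      val (test, train) = rng.shuffle(userRatings).splitAt(nTest)
      (train, test)
    }
    (parts.flatMap(_._1), parts.flatMap(_._2))
  }
}
```

This only guarantees coverage on the user side. An item could still appear 
in the test set without appearing in training, so the same idea would have 
to be applied to items as well to rule out NaN predictions entirely.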




[jira] [Commented] (SPARK-14489) RegressionEvaluator returns NaN for ALS in Spark ml

2016-04-08 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15232357#comment-15232357
 ] 

Sean Owen commented on SPARK-14489:
---

NaN, to me, means the result was undefined or uncomputable. However for 
recommenders there's nothing too strange about being asked for a recommendation 
for a new user. For some methods there's a clear answer: a new user with no 
data gets 0 recommendations; 0 is the meaningful default for the implicit case. 
Some kind of global mean is better than nothing for the explicit case. It 
doesn't bias the metrics, as an answer is an answer; some are better than 
others but that's what we're measuring.

As I say the problem with ignoring NaN is that you don't consider these cases, 
but they're legitimate cases where the recommender wasn't able to produce a 
result, and that should be reflected as "bad".

Still, as a stop-gap, assuming NaN is rare, ignoring NaN in the evaluator is 
strictly an improvement since it means you can return some meaningful answer 
instead of none. Later, if the ALS implementation never returns NaN, then this 
behavior in the evaluator doesn't matter anyway. So I'd support that change as 
a local improvement.
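
The defaults suggested above (0 for the implicit case, a global mean for 
the explicit case) can be sketched as a post-processing step in plain 
Scala. The names ColdStartFallback and fillUnknown are hypothetical, and 
this is not how ALSModel behaves; falling back to 0.0 when there are no 
training ratings at all is an assumption made only to keep the sketch 
total.

```scala
object ColdStartFallback {
  /** Replaces NaN predictions with a default value.
    *
    * Implicit feedback: unknown users or items get 0.0.
    * Explicit feedback: they get the global mean of the training ratings
    * (or 0.0 if there are no training ratings, an assumption of this sketch).
    */
  def fillUnknown(
      predictions: Seq[Double],
      trainingRatings: Seq[Double],
      implicitPrefs: Boolean): Seq[Double] = {
    val default =
      if (implicitPrefs || trainingRatings.isEmpty) 0.0
      else trainingRatings.sum / trainingRatings.size
    predictions.map(p => if (p.isNaN) default else p)
  }
}
```

Because every row now carries a number, an evaluator scores the fallback 
like any other prediction, so a poor default shows up as a worse metric 
instead of disappearing from it.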




[jira] [Commented] (SPARK-14489) RegressionEvaluator returns NaN for ALS in Spark ml

2016-04-08 Thread Boris Clémençon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15232297#comment-15232297
 ] 

Boris Clémençon  commented on SPARK-14489:
--

Hi all, 

Perhaps having the transform() method return NaN or some other special value 
(na in DF?) for users who are not in the training set is not a bad idea. 
It makes it clear that the method cannot deal with these users (so far), which 
is good. Using multiple methods to compute the score (like imputing the 
average, 0, or any fixed value) might seem a bit confusing, and it could also 
introduce bias in some metrics like MSE. Making a fix in RegressionEvaluator 
(removing NaN) and logging a warning seems a better option to me. 

Besides, I think it is theoretically possible to evaluate the scores of users 
or items that were not in the training set, just as it is possible to add new 
users to a precomputed PCA system of axes. 

I am just a simple Spark user (and fan), but I could try to clone the project 
and push the modification if required.




[jira] [Commented] (SPARK-14489) RegressionEvaluator returns NaN for ALS in Spark ml

2016-04-08 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15232241#comment-15232241
 ] 

Sean Owen commented on SPARK-14489:
---

You could also argue that the problem is returning NaN for an unknown user when 
making recommendations. I'm not sure where that comes in, but a reasonable way 
to address this would be to always provide a value: an empty list of 
recommendations, an estimated strength of 0 for the implicit case, some kind of 
global average rating for the explicit case.

If the model is allowed to return NaN, and they're ignored by the evaluation 
metric, it is essentially not penalizing the model for 'passing' on a question. 
Making this model return an answer in all cases seems more useful and requires 
no work-around or flag.

 [~dulajrajitha] [~clemencb] what do you think of trying to implement that?
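
The point about not penalizing a model for 'passing' can be made concrete 
with a small plain-Scala sketch. The object PassingModel and the function 
rmseAndCoverage are hypothetical names for illustration; the function 
reports RMSE over the rows that received a prediction together with the 
fraction of rows that did.

```scala
object PassingModel {
  /** Returns (RMSE over non-NaN predictions, fraction of rows answered). */
  def rmseAndCoverage(pairs: Seq[(Double, Double)]): (Double, Double) = {
    val answered = pairs.filterNot { case (prediction, _) => prediction.isNaN }
    val rmse =
      if (answered.isEmpty) Double.NaN
      else math.sqrt(answered.map { case (p, l) => (p - l) * (p - l) }.sum / answered.size)
    val coverage = if (pairs.isEmpty) 0.0 else answered.size.toDouble / pairs.size
    (rmse, coverage)
  }

  // Labels are 4, 2, 5. Model A answers every row, model B passes on the last one.
  val modelA: Seq[(Double, Double)] = Seq((4.0, 4.0), (2.0, 2.0), (1.0, 5.0))
  val modelB: Seq[(Double, Double)] = Seq((4.0, 4.0), (2.0, 2.0), (Double.NaN, 5.0))
}
```

Here rmseAndCoverage(modelA) gives an RMSE of sqrt(16/3), about 2.31, with 
coverage 1.0, while rmseAndCoverage(modelB) gives an RMSE of 0.0 with 
coverage 2/3. If NaN rows are dropped and only the RMSE is reported, model 
B looks perfect precisely because it declined the row it would have got 
wrong.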




[jira] [Commented] (SPARK-14489) RegressionEvaluator returns NaN for ALS in Spark ml

2016-04-08 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15232226#comment-15232226
 ] 

Nick Pentreath commented on SPARK-14489:


This issue would also apply to any ranking-based evaluator for ALS. [~srowen] 
what do you think? A param flag to allow excluding NaNs (with a warning should 
they be encountered) for {{RegressionEvaluator}}?
