[ https://issues.apache.org/jira/browse/SPARK-14489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15392293#comment-15392293 ]
Krishna Sankar commented on SPARK-14489:
----------------------------------------

From my experience in the field and with R, a couple of thoughts. The ALS and the evaluator are doing the right thing with the information they have and without any contextual directives.

1. For the evaluator, as mentioned earlier, a flag similar to R's na.rm (ignoreNaN=false, to keep the current behavior) would be a good choice. I suspect we would need ignoreNaN elsewhere as well, for example in CrossValidator.

2. For ALS, in the absence of a directive, we shouldn't calculate a default average recommendation or even 0; the current NaN is the right one. Depending on the context, an application might decide not to recommend anything, use a default recommendation, or compute a value dynamically, e.g. over a recent window. So a parameter defaultRecommendation="NaN", "average", or a fixed value would be a good choice to cover all the possibilities. Alternatively, the developer can use na.fill() for other operations.

> RegressionEvaluator returns NaN for ALS in Spark ml
> ---------------------------------------------------
>
>                 Key: SPARK-14489
>                 URL: https://issues.apache.org/jira/browse/SPARK-14489
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 1.6.0
>        Environment: AWS EMR
>           Reporter: Boris Clémençon
>             Labels: patch
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> When building a Spark ML pipeline containing an ALS estimator, the metrics "rmse", "mse", "r2" and "mae" all return NaN.
> The reason is in CrossValidator.scala line 109. The K-folds are randomly generated. For large and sparse datasets, there is a significant probability that at least one user in the validation set is missing from the training set, which produces a few NaN estimates from the transform method and hence NaN RegressionEvaluator metrics too.
> Suggestion to fix the bug: remove the NaN values while computing the RMSE or other metrics (i.e., removing users or items in the validation set that are missing from the training set), and emit a log message when this happens.
> Issue SPARK-14153 seems to be the same problem.
> {code:title=Bar.scala|borderStyle=solid}
> val splits = MLUtils.kFold(dataset.rdd, $(numFolds), 0)
> splits.zipWithIndex.foreach { case ((training, validation), splitIndex) =>
>   val trainingDataset = sqlCtx.createDataFrame(training, schema).cache()
>   val validationDataset = sqlCtx.createDataFrame(validation, schema).cache()
>   // multi-model training
>   logDebug(s"Train split $splitIndex with multiple sets of parameters.")
>   val models = est.fit(trainingDataset, epm).asInstanceOf[Seq[Model[_]]]
>   trainingDataset.unpersist()
>   var i = 0
>   while (i < numModels) {
>     // TODO: duplicate evaluator to take extra params from input
>     val metric = eval.evaluate(models(i).transform(validationDataset, epm(i)))
>     logDebug(s"Got metric $metric for model trained with ${epm(i)}.")
>     metrics(i) += metric
>     i += 1
>   }
>   validationDataset.unpersist()
> }
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
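To illustrate the ignoreNaN idea from the comment above, here is a minimal sketch in plain Scala (no Spark dependency). The rmse helper and its ignoreNaN flag are illustrative names, not Spark API; they mimic R's na.rm semantics: with the flag off, one NaN prediction poisons the metric, and with it on, the NaN rows are dropped before computing RMSE.

```scala
// Hypothetical RMSE with an ignoreNaN flag, modeled on R's na.rm.
// Not Spark API -- just a sketch of the proposed evaluator behavior.
def rmse(labels: Seq[Double], predictions: Seq[Double],
         ignoreNaN: Boolean = false): Double = {
  val pairs = labels.zip(predictions)
  // When ignoreNaN is set, drop rows whose prediction is NaN
  // (e.g. users in the validation fold unseen in the training fold).
  val kept = if (ignoreNaN) pairs.filterNot { case (_, p) => p.isNaN } else pairs
  if (kept.isEmpty) Double.NaN
  else math.sqrt(kept.map { case (l, p) => val d = l - p; d * d }.sum / kept.length)
}

val labels      = Seq(3.0, 4.0, 5.0)
val predictions = Seq(3.0, Double.NaN, 5.0) // NaN: cold-start user

println(rmse(labels, predictions))                   // prints NaN (current behavior)
println(rmse(labels, predictions, ignoreNaN = true)) // prints 0.0 (NaN row dropped)
```

In a real Spark pipeline the same effect can be had today by filtering the prediction DataFrame (e.g. with na.drop on the prediction column) before handing it to RegressionEvaluator.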