[ 
https://issues.apache.org/jira/browse/SPARK-14489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Boris Clémençon  updated SPARK-14489:
-------------------------------------
    Description: 
When building a Spark ML pipeline containing an ALS estimator, the metrics 
"rmse", "mse", "r2" and "mae" all return NaN. 

The reason is in CrossValidator.scala line 109. The K-folds are randomly 
generated. For large and sparse datasets, there is a significant probability 
that at least one user of the validation set is missing in the training set, 
hence generating a few NaN estimation with transform method and NaN 
RegressionEvaluator's metrics too. 

Suggestion to fix the bug: remove the NaN values while computing the rmse or 
other metrics (ie, removing users or items in validation test that is missing 
in the learning set). Send logs when this happen.

Issue SPARK-14153 seems to be the same pbm
{code:title=Bar.scala|borderStyle=solid}
    val splits = MLUtils.kFold(dataset.rdd, $(numFolds), 0)
    splits.zipWithIndex.foreach { case ((training, validation), splitIndex) =>
      val trainingDataset = sqlCtx.createDataFrame(training, schema).cache()
      val validationDataset = sqlCtx.createDataFrame(validation, schema).cache()
      // multi-model training
      logDebug(s"Train split $splitIndex with multiple sets of parameters.")
      val models = est.fit(trainingDataset, epm).asInstanceOf[Seq[Model[_]]]
      trainingDataset.unpersist()
      var i = 0
      while (i < numModels) {
        // TODO: duplicate evaluator to take extra params from input
        val metric = eval.evaluate(models(i).transform(validationDataset, 
epm(i)))
        logDebug(s"Got metric $metric for model trained with ${epm(i)}.")
        metrics(i) += metric
        i += 1
      }
      validationDataset.unpersist()
    }
{code}

  was:
When building a Spark ML pipeline containing an ALS estimator, the metrics 
"rmse", "mse", "r2" and "mae" all return NaN. 

The reason is in CrossValidator.scala line 109. The K-folds are randomly 
generated. For large and sparse datasets, there is a significant probability 
that at least one user of the validation set is missing in the training set, 
hence generating a few NaN estimation with transform method and NaN 
RegressionEvaluator's metrics too. 

Suggestion to fix the bug: remove the NaN values while computing the rmse or 
other metrics (ie, removing users or items in validation test that is missing 
in the learning set). Send logs when this happen.

Issue SPARK-14153 seems to the same pbm
{code:title=Bar.scala|borderStyle=solid}
    val splits = MLUtils.kFold(dataset.rdd, $(numFolds), 0)
    splits.zipWithIndex.foreach { case ((training, validation), splitIndex) =>
      val trainingDataset = sqlCtx.createDataFrame(training, schema).cache()
      val validationDataset = sqlCtx.createDataFrame(validation, schema).cache()
      // multi-model training
      logDebug(s"Train split $splitIndex with multiple sets of parameters.")
      val models = est.fit(trainingDataset, epm).asInstanceOf[Seq[Model[_]]]
      trainingDataset.unpersist()
      var i = 0
      while (i < numModels) {
        // TODO: duplicate evaluator to take extra params from input
        val metric = eval.evaluate(models(i).transform(validationDataset, 
epm(i)))
        logDebug(s"Got metric $metric for model trained with ${epm(i)}.")
        metrics(i) += metric
        i += 1
      }
      validationDataset.unpersist()
    }
{code}


> RegressionEvaluator returns NaN for ALS in Spark ml
> ---------------------------------------------------
>
>                 Key: SPARK-14489
>                 URL: https://issues.apache.org/jira/browse/SPARK-14489
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 1.6.0
>         Environment: AWS EMR
>            Reporter: Boris Clémençon 
>              Labels: patch
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> When building a Spark ML pipeline containing an ALS estimator, the metrics 
> "rmse", "mse", "r2" and "mae" all return NaN. 
> The reason is in CrossValidator.scala line 109. The K-folds are randomly 
> generated. For large and sparse datasets, there is a significant probability 
> that at least one user of the validation set is missing in the training set, 
> hence generating a few NaN estimation with transform method and NaN 
> RegressionEvaluator's metrics too. 
> Suggestion to fix the bug: remove the NaN values while computing the rmse or 
> other metrics (ie, removing users or items in validation test that is missing 
> in the learning set). Send logs when this happen.
> Issue SPARK-14153 seems to be the same pbm
> {code:title=Bar.scala|borderStyle=solid}
>     val splits = MLUtils.kFold(dataset.rdd, $(numFolds), 0)
>     splits.zipWithIndex.foreach { case ((training, validation), splitIndex) =>
>       val trainingDataset = sqlCtx.createDataFrame(training, schema).cache()
>       val validationDataset = sqlCtx.createDataFrame(validation, 
> schema).cache()
>       // multi-model training
>       logDebug(s"Train split $splitIndex with multiple sets of parameters.")
>       val models = est.fit(trainingDataset, epm).asInstanceOf[Seq[Model[_]]]
>       trainingDataset.unpersist()
>       var i = 0
>       while (i < numModels) {
>         // TODO: duplicate evaluator to take extra params from input
>         val metric = eval.evaluate(models(i).transform(validationDataset, 
> epm(i)))
>         logDebug(s"Got metric $metric for model trained with ${epm(i)}.")
>         metrics(i) += metric
>         i += 1
>       }
>       validationDataset.unpersist()
>     }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to