[ https://issues.apache.org/jira/browse/SPARK-14489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15392293#comment-15392293 ]
Krishna Sankar commented on SPARK-14489:
----------------------------------------

From my experience in the field and with R, a couple of thoughts. The ALS and the evaluator are doing the right thing with the information they have and without any contextual directives.

1. For the evaluator, as mentioned earlier, a flag similar to R's na.rm (ignoreNaN=false, to keep the current behavior) would be a good choice. I suspect we would need ignoreNaN elsewhere as well, for example in CrossValidator.

2. For ALS, in the absence of a directive, we shouldn't calculate a default average recommendation or even 0; the current NaN is the right one. Depending on the context, an application might decide not to recommend anything, use a default recommendation, or compute a value dynamically, e.g. over a recent window. So a parameter defaultRecommendation="NaN", "average", or a fixed value would be a good choice to cover all the possibilities. Alternatively, the developer can use na.fill() for other operations.

> RegressionEvaluator returns NaN for ALS in Spark ml
> ---------------------------------------------------
>
>                 Key: SPARK-14489
>                 URL: https://issues.apache.org/jira/browse/SPARK-14489
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 1.6.0
>        Environment: AWS EMR
>           Reporter: Boris Clémençon
>             Labels: patch
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> When building a Spark ML pipeline containing an ALS estimator, the metrics "rmse", "mse", "r2" and "mae" all return NaN.
> The reason is in CrossValidator.scala line 109. The K-folds are randomly generated. For large and sparse datasets, there is a significant probability that at least one user in the validation set is missing from the training set, which produces a few NaN estimates from the transform method and hence NaN RegressionEvaluator metrics too.
> Suggestion to fix the bug: remove the NaN values while computing the RMSE or other metrics (i.e., removing users or items in the validation set that are missing from the training set), and emit a log message when this happens.
> Issue SPARK-14153 seems to be the same problem.
> {code:title=Bar.scala|borderStyle=solid}
> val splits = MLUtils.kFold(dataset.rdd, $(numFolds), 0)
> splits.zipWithIndex.foreach { case ((training, validation), splitIndex) =>
>   val trainingDataset = sqlCtx.createDataFrame(training, schema).cache()
>   val validationDataset = sqlCtx.createDataFrame(validation, schema).cache()
>   // multi-model training
>   logDebug(s"Train split $splitIndex with multiple sets of parameters.")
>   val models = est.fit(trainingDataset, epm).asInstanceOf[Seq[Model[_]]]
>   trainingDataset.unpersist()
>   var i = 0
>   while (i < numModels) {
>     // TODO: duplicate evaluator to take extra params from input
>     val metric = eval.evaluate(models(i).transform(validationDataset, epm(i)))
>     logDebug(s"Got metric $metric for model trained with ${epm(i)}.")
>     metrics(i) += metric
>     i += 1
>   }
>   validationDataset.unpersist()
> }
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
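To illustrate the ignoreNaN idea from the comment above, here is a minimal sketch in plain Scala (no Spark dependency). The rmse helper and its ignoreNaN flag are illustrative names, not Spark API; they mimic R's na.rm semantics: with the flag off, one NaN prediction poisons the metric, and with it on, the NaN rows are dropped before computing RMSE.

```scala
// Hypothetical RMSE with an ignoreNaN flag, modeled on R's na.rm.
// Not Spark API -- just a sketch of the proposed evaluator behavior.
def rmse(labels: Seq[Double], predictions: Seq[Double],
         ignoreNaN: Boolean = false): Double = {
  val pairs = labels.zip(predictions)
  // When ignoreNaN is set, drop rows whose prediction is NaN
  // (e.g. users in the validation fold unseen in the training fold).
  val kept = if (ignoreNaN) pairs.filterNot { case (_, p) => p.isNaN } else pairs
  if (kept.isEmpty) Double.NaN
  else math.sqrt(kept.map { case (l, p) => val d = l - p; d * d }.sum / kept.length)
}

val labels      = Seq(3.0, 4.0, 5.0)
val predictions = Seq(3.0, Double.NaN, 5.0) // NaN: cold-start user

println(rmse(labels, predictions))                   // prints NaN (current behavior)
println(rmse(labels, predictions, ignoreNaN = true)) // prints 0.0 (NaN row dropped)
```

In a real Spark pipeline the same effect can be had today by filtering the prediction DataFrame (e.g. with na.drop on the prediction column) before handing it to RegressionEvaluator.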