[jira] [Commented] (SPARK-14489) RegressionEvaluator returns NaN for ALS in Spark ml
[ https://issues.apache.org/jira/browse/SPARK-14489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15395275#comment-15395275 ]

Nick Pentreath commented on SPARK-14489:
----------------------------------------

Thanks for the thoughts, Krishna.

1. Initially I also thought a flag to ignore NaN in the evaluators would make sense. However, I have never seen (and can't think of) a situation where this is desirable _outside_ of this one, where splitting the dataset can result in user/item ids the model has not been trained on (this applies to "ranking" cases in general). In all other typical supervised learning cases, NaN means either (a) NaN inputs, which the user should deal with in the pipeline before training, or (b) a model with bad coefficients. In both of these cases, I'd argue that it is correct to return NaN, and not desirable to ignore it.

> RegressionEvaluator returns NaN for ALS in Spark ml
> ---------------------------------------------------
>
>                 Key: SPARK-14489
>                 URL: https://issues.apache.org/jira/browse/SPARK-14489
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 1.6.0
>         Environment: AWS EMR
>            Reporter: Boris Clémençon
>              Labels: patch
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> When building a Spark ML pipeline containing an ALS estimator, the metrics "rmse", "mse", "r2" and "mae" all return NaN.
> The reason is in CrossValidator.scala line 109. The K-folds are randomly generated. For large and sparse datasets, there is a significant probability that at least one user in the validation set is missing from the training set, so the transform method produces NaN estimates and the RegressionEvaluator's metrics are NaN too.
> Suggestion to fix the bug: drop the NaN values while computing the rmse or the other metrics (i.e., ignore users or items in the validation set that are missing from the training set), and log when this happens.
> Issue SPARK-14153 seems to be the same problem.
> {code:title=Bar.scala|borderStyle=solid}
> val splits = MLUtils.kFold(dataset.rdd, $(numFolds), 0)
> splits.zipWithIndex.foreach { case ((training, validation), splitIndex) =>
>   val trainingDataset = sqlCtx.createDataFrame(training, schema).cache()
>   val validationDataset = sqlCtx.createDataFrame(validation, schema).cache()
>   // multi-model training
>   logDebug(s"Train split $splitIndex with multiple sets of parameters.")
>   val models = est.fit(trainingDataset, epm).asInstanceOf[Seq[Model[_]]]
>   trainingDataset.unpersist()
>   var i = 0
>   while (i < numModels) {
>     // TODO: duplicate evaluator to take extra params from input
>     val metric = eval.evaluate(models(i).transform(validationDataset, epm(i)))
>     logDebug(s"Got metric $metric for model trained with ${epm(i)}.")
>     metrics(i) += metric
>     i += 1
>   }
>   validationDataset.unpersist()
> }
> {code}

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
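To make Nick's distinction concrete, here is a minimal sketch of an RMSE computation with an opt-in NaN filter. This is plain Scala, independent of the actual Spark evaluator API; the ignoreNaN flag is the hypothetical parameter under discussion, not an existing one, and (label, prediction) pairs stand in for the evaluator's input columns.

```scala
// RMSE with a hypothetical opt-in NaN filter. With ignoreNaN = false
// (the current behavior) a single NaN prediction poisons the metric;
// with ignoreNaN = true the unknown-user rows are simply dropped.
def rmse(pairs: Seq[(Double, Double)], ignoreNaN: Boolean = false): Double = {
  val usable =
    if (ignoreNaN) pairs.filterNot { case (_, p) => p.isNaN }
    else pairs
  if (usable.isEmpty) Double.NaN
  else math.sqrt(usable.map { case (l, p) => (l - p) * (l - p) }.sum / usable.size)
}
```

This illustrates why Nick treats the flag as dangerous in the general supervised case: with the filter on, a model that emits NaN for many rows is scored only on the rows it happened to predict.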
[ https://issues.apache.org/jira/browse/SPARK-14489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15392293#comment-15392293 ]

Krishna Sankar commented on SPARK-14489:
----------------------------------------

From my experience in the field and with R, a couple of thoughts. The ALS and the evaluator are doing the right thing with the information they have, absent any contextual directives:

1. For the evaluator, as mentioned earlier, a flag similar to R's na.rm (say ignoreNaN=false, to keep the current behavior) would be a good choice. I suspect we would need ignoreNaN elsewhere as well, for example in the CrossValidator.
2. For ALS, in the absence of a directive, we shouldn't compute a default average recommendation or even 0; the current NaN is the right answer. Depending on the context, an application might decide not to recommend anything, use a default recommendation, or use a dynamically calculated value, e.g. over a recent window. So a parameter defaultRecommendation="NaN" (or "average", or a value) would cover all the possibilities. Alternatively, the developer can use na.fill() for other operations.
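Krishna's proposed defaultRecommendation parameter could behave like the following sketch. This is plain Scala over raw prediction values, not a Spark API; the parameter name and its "NaN" / "average" / numeric-value semantics are his proposal, taken as assumptions here.

```scala
// Post-process predictions according to a hypothetical
// defaultRecommendation parameter: leave NaN as-is (current behavior),
// substitute the average of the known predictions, or substitute a
// fixed numeric value.
def fillPredictions(preds: Seq[Double], defaultRecommendation: String): Seq[Double] = {
  val known = preds.filterNot(_.isNaN)
  defaultRecommendation match {
    case "NaN"     => preds                              // current behavior
    case "average" =>
      val avg = if (known.isEmpty) Double.NaN else known.sum / known.size
      preds.map(p => if (p.isNaN) avg else p)
    case v         => preds.map(p => if (p.isNaN) v.toDouble else p)
  }
}
```

The "average" branch is the coarse global fallback; a per-user or windowed value, as Krishna suggests, would replace `avg` with an application-specific lookup.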
[ https://issues.apache.org/jira/browse/SPARK-14489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15270455#comment-15270455 ]

Apache Spark commented on SPARK-14489:
--------------------------------------

User 'MLnick' has created a pull request for this issue:
https://github.com/apache/spark/pull/12896
[ https://issues.apache.org/jira/browse/SPARK-14489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15254680#comment-15254680 ]

Seth Hendrickson commented on SPARK-14489:
------------------------------------------

This is an interesting idea. I would say that under the current framework for stratified sampling, there is no performant way to guarantee that each split contains every user at least once (even if we filter out users with < k * n items). In naive stratified sampling, you would simply generate a random key for each user and sort the entire dataset, taking even splits for each user. I am not sure that is an acceptable option given how expensive a sort over the entire dataset would be. Using ScaSRS might actually be worse in this case, if the waitlist is close to the size of the requested sample, since the waitlists are collected on the driver. I am not sure what options open up if we don't require even splits, only that each split contains every user, but there might be something to that.
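The naive per-user splitting Seth describes can be sketched in miniature without Spark: shuffle each user's ratings with a seeded RNG, then deal them round-robin into k folds, so every user with at least k ratings lands in every fold. (A distributed version would still need the expensive global sort he mentions; the function name and shape here are illustrative only.)

```scala
import scala.util.Random

// Naive stratified k-fold assignment: group ratings by user id, shuffle
// each user's ratings, and deal them round-robin across the k folds.
// Every user with at least k ratings appears in every fold.
def kFoldByUser[A](ratings: Seq[(Int, A)], k: Int, seed: Long): Map[Int, Seq[(Int, A)]] = {
  val rng = new Random(seed)
  ratings
    .groupBy(_._1)                                   // one stratum per user id
    .toSeq
    .flatMap { case (_, rs) =>
      rng.shuffle(rs).zipWithIndex.map { case (r, i) => (i % k, r) }
    }
    .groupBy(_._1)
    .map { case (fold, rs) => fold -> rs.map(_._2) }
}
```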
[ https://issues.apache.org/jira/browse/SPARK-14489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15253536#comment-15253536 ]

Nick Pentreath commented on SPARK-14489:
----------------------------------------

Is naive sampling not an option, then, for the recommendation setting? In fact we don't strictly need to guarantee that the proportions are maintained in each sample; I think it would be enough to ensure that each user has at least 1 (or n) items in each sample. This would require filtering out users with < k * n items, though. Not sure whether that makes things simpler or more complex.
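The precondition Nick mentions (drop users who cannot contribute n items to each of k folds) is a simple pre-filter. A sketch, with illustrative names only:

```scala
// Keep only the ratings of users who have at least k * n of them, so
// that each of the k folds can receive at least n items per user.
def usersEligibleForKFold[A](ratings: Seq[(Int, A)], k: Int, n: Int): Seq[(Int, A)] = {
  val counts = ratings.groupBy(_._1).map { case (u, rs) => u -> rs.size }
  ratings.filter { case (u, _) => counts(u) >= k * n }
}
```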
[ https://issues.apache.org/jira/browse/SPARK-14489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15251988#comment-15251988 ]

Apache Spark commented on SPARK-14489:
--------------------------------------

User 'MLnick' has created a pull request for this issue:
https://github.com/apache/spark/pull/12577
[ https://issues.apache.org/jira/browse/SPARK-14489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15247710#comment-15247710 ]

Abou Haydar Elias commented on SPARK-14489:
-------------------------------------------

I totally agree with [~sethah]. I stumbled on this today. When randomly splitting, we can end up with a test set containing a user not present in the training set, and the same can happen with an item. Thus ALS can't produce predictions for such a user and returns NaN instead. We are running into the new-user/new-item problem. So what can we do here?
[ https://issues.apache.org/jira/browse/SPARK-14489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15246538#comment-15246538 ]

Joseph K. Bradley commented on SPARK-14489:
-------------------------------------------

I agree that it's unclear what to do with a new item. I don't think there are any good options, and I would support either not tolerating new items or ignoring them.
[ https://issues.apache.org/jira/browse/SPARK-14489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15242251#comment-15242251 ]

Seth Hendrickson commented on SPARK-14489:
------------------------------------------

[~mlnick] I am skeptical that [SPARK-8971|https://issues.apache.org/jira/browse/SPARK-8971] applies here. In order to guarantee that the user proportions are maintained in each sample, we need to use the Scalable Simple Random Sampling (ScaSRS) algorithm. From my understanding, this will not work well for small strata like the ones you encounter in a recommendation setting. Say you need to guarantee that a user with 100 ratings appears in each of 5 folds. ScaSRS only needs to sort the items in a waitlist, whose size depends on the probability of acceptance, the sample size, and the desired accuracy. For a loose accuracy setting, I compute that the expected waitlist size in this scenario is about 20 (the size of the sample itself), so it degrades to naive sampling. I get the following (numRatings, expectedWaitlistSize) pairs: (100, 20.36), (1000, 61.59), (10000, 192.75), (100000, 607.75), (1000000, 1920.18). I am using [this paper|http://jmlr.org/proceedings/papers/v28/meng13a.pdf] as a reference. Perhaps [~mengxr] could clarify, since he wrote the paper :D
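Seth's waitlist sizes grow roughly like the square root of numRatings, the scaling the ScaSRS paper predicts for the waitlist. (The rating counts in his table appear truncated in the archive; they are read here as 10^2 through 10^6, an assumption consistent with the surrounding values, since each reported waitlist size is about sqrt(10) ~= 3.16 times the previous one.) A quick check of that ratio:

```scala
// Seth's reported (numRatings, expectedWaitlistSize) pairs, with the
// rating counts assumed to be successive powers of ten. Each tenfold
// increase in numRatings multiplies the waitlist size by roughly
// sqrt(10), i.e. waitlist ~ O(sqrt(n)).
val waitlist = Seq(100 -> 20.36, 1000 -> 61.59, 10000 -> 192.75,
                   100000 -> 607.75, 1000000 -> 1920.18)
val ratios = waitlist.sliding(2).map { case Seq((_, a), (_, b)) => b / a }.toSeq
```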
[ https://issues.apache.org/jira/browse/SPARK-14489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15240718#comment-15240718 ]

Nick Pentreath commented on SPARK-14489:
----------------------------------------

+1 for having CrossValidator handle this in a more principled way by doing stratified sampling on one of the input columns (user id, for example). This links to SPARK-8971 (which is focused on sampling by class label, but I think it can be generalized to sampling by any input column). Until we have something like that, allowing the evaluators to skip NaNs is perhaps the best option. If we agree, I can take a look at that; we could make it an "expertParam" setting with an appropriate warning in the doc.

I like the "average user" option in ALS a lot too. We can offer both options, provide some documentation about common use cases for them, and expand the ALS examples to illustrate this.

Finally, is the case for #1 and #2 different for a new item than for a new user? It may make sense to recommend based on the average user for a new user in the absence of any data, but I'm not sure that makes as much sense for a new item.
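The "average user" fallback under discussion amounts to: for an unseen user, score items against the element-wise mean of the learned user factor vectors instead of returning NaN. A sketch over raw factor arrays (plain Scala, not the ALSModel API):

```scala
// Predict a rating as the dot product of user and item factors, falling
// back to the element-wise mean of all known user factors (the "average
// user") when the user id was never seen in training.
def predict(userFactors: Map[Int, Array[Double]],
            itemFactors: Map[Int, Array[Double]],
            user: Int, item: Int): Double = {
  val itemVec = itemFactors(item)
  val userVec = userFactors.getOrElse(user, {
    val all = userFactors.values.toSeq
    Array.tabulate(itemVec.length)(j => all.map(_(j)).sum / all.size)
  })
  userVec.zip(itemVec).map { case (u, v) => u * v }.sum
}
```

Nick's closing question maps onto this directly: the symmetric fallback for a new item (mean item factor) is mechanically identical but, as he notes, harder to justify.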
[ https://issues.apache.org/jira/browse/SPARK-14489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15240364#comment-15240364 ]

Joseph K. Bradley commented on SPARK-14489:
-------------------------------------------

(Oh, I had not refreshed the page before commenting, but it looks like my comments mesh with Nick's.)
[ https://issues.apache.org/jira/browse/SPARK-14489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15240360#comment-15240360 ]

Joseph K. Bradley commented on SPARK-14489:
-------------------------------------------

I'd like to separate a few issues here based on use cases and suggest the "right thing to do" in each:
* Deploying an ALSModel to make predictions: The model should make best-effort predictions, even for new users. I'd say new users should get recommendations based on the average user, for both the explicit and implicit settings. Providing a Param which makes the model output NaN for unknown users seems reasonable as an additional feature.
* Evaluating an ALSModel on a held-out dataset: This is the same as the first case; the model should behave the same way it will when deployed.
* Model tuning using CrossValidator: I'm less sure about this. Both of your suggestions seem reasonable (either returning NaN for missing users and ignoring NaN in the evaluator, or making best-effort predictions for all users). It would also be worthwhile to examine the literature for what tends to work best: e.g., should CrossValidator handle ranking specially by doing stratified sampling to divide each user's or item's ratings evenly across the folds?

If we want the evaluator to be able to ignore NaNs, then I'd prefer we keep the current behavior as the default and provide a Param which allows users to ignore NaNs. Otherwise I'd be afraid of, say, a linear model without enough regularization getting NaNs in its coefficients and having all of its predictions silently ignored by the evaluator. What do you think?
[ https://issues.apache.org/jira/browse/SPARK-14489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15239847#comment-15239847 ] Nick Pentreath commented on SPARK-14489: In the live setting you definitely want to recommend something even when your model can't compute a recommendation. For evaluation, I'd say it makes more sense to split the train/test dataset by, say, user, such that each user has a small proportion of their ratings in the test set. This avoids the issue and is quite common in the literature. Given that this approach doesn't fit neatly into {{CrossValidator}}/{{TrainValidationSplit}} (perhaps something like SPARK-8971 could help), we could, as you say, use this as an improvement to at least allow using ALS with {{RegressionEvaluator}}.
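Nick's per-user split can be sketched with a window function over each user's ratings. This is a sketch under assumed column names ({{userId}}, etc.) and an assumed {{testFraction}} parameter, not an existing Spark utility:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, percent_rank, rand}

// Hold out roughly `testFraction` of each user's ratings, so every user
// appearing in the test set was also seen during training. Users with a
// single rating get percent_rank 0 and therefore stay in training.
def perUserSplit(ratings: DataFrame, testFraction: Double): (DataFrame, DataFrame) = {
  val w = Window.partitionBy("userId").orderBy(rand(42L))
  val ranked = ratings.withColumn("rank", percent_rank().over(w))
  val train = ranked.filter(col("rank") < 1.0 - testFraction).drop("rank")
  val test = ranked.filter(col("rank") >= 1.0 - testFraction).drop("rank")
  (train, test)
}
```

Items unseen in training can still leak into the test set this way; a symmetric pass partitioned by item would be needed to rule those out as well.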
[ https://issues.apache.org/jira/browse/SPARK-14489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15232357#comment-15232357 ] Sean Owen commented on SPARK-14489: --- NaN, to me, means the result was undefined or uncomputable. However, for recommenders there's nothing too strange about being asked for a recommendation for a new user. For some methods there's a clear answer: a new user with no data gets 0 recommendations; 0 is the meaningful default for the implicit case, and some kind of global mean is better than nothing for the explicit case. It doesn't bias the metrics, as an answer is an answer; some are better than others, but that's what we're measuring. As I say, the problem with ignoring NaN is that you don't consider these cases, but they're legitimate cases where the recommender wasn't able to produce a result, and that should be reflected as "bad". Still, as a stop-gap, assuming NaN is rare, ignoring NaN in the evaluator is strictly an improvement since it means you can return some meaningful answer instead of none. Later, if the ALS implementation never returns NaN, then this behavior in the evaluator doesn't matter anyway. So I'd support that change as a local improvement.
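The global-mean fallback for the explicit case can be sketched as a post-processing step on the model's output (assuming a {{predictions}} DataFrame with a {{prediction}} column; note that {{na.fill}} on a double column replaces NaN as well as null):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{avg, col, isnan}

// Replace NaN predictions (unknown users/items) with the global mean of
// the known predictions, so every row gets a best-effort answer. For the
// implicit-feedback case, filling with 0.0 would be the analogous default.
def fillWithGlobalMean(predictions: DataFrame): DataFrame = {
  val globalMean = predictions
    .filter(!isnan(col("prediction")))
    .agg(avg(col("prediction")))
    .first()
    .getDouble(0)
  predictions.na.fill(globalMean, Seq("prediction"))
}
```

Unlike dropping NaNs in the evaluator, this keeps every test row in the metric, so the model is still scored on the users it could not personalize for.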
[ https://issues.apache.org/jira/browse/SPARK-14489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15232297#comment-15232297 ] Boris Clémençon commented on SPARK-14489: --- Hi all, Perhaps the transform() method returning NaN or any other particular symbol (na in DF?) to deal with users who are not in the training set is not a bad idea. It makes it clear that the method cannot deal with these users (so far), which is good. Using multiple methods to compute the score might seem a bit confusing (like imputing the average, 0, or any fixed value), and it could also introduce bias in some metrics like MSE. Making a fix in RegressionEvaluator (removing NaN) and sending a warn log seems a better option to me. Besides, I think it is theoretically possible to evaluate the scores of users or items that were not in the training set, just as it is possible to add new users to a precomputed PCA system of axes. I am just a simple Spark user (and fan), but I could try to clone the project and push the modification if required.
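The PCA analogy Boris draws corresponds to "fold-in": projecting a new user onto the precomputed item factors by solving a small ridge regression, u = (VᵀV + λI)⁻¹Vᵀr, over the items that user has rated. A self-contained illustration of the linear algebra (ALS does not expose this; the object and method names here are hypothetical):

```scala
// Fold-in: estimate a new user's latent vector from precomputed item
// factors V (one row per rated item) and the user's ratings r, by solving
// the regularized normal equations (V^T V + lambda*I) u = V^T r.
object FoldIn {
  // Solve a small dense linear system A x = b by Gaussian elimination with
  // partial pivoting (fine for ALS-sized ranks, e.g. k <= 100).
  def solve(a: Array[Array[Double]], b: Array[Double]): Array[Double] = {
    val n = b.length
    val m = a.map(_.clone())
    val y = b.clone()
    for (col <- 0 until n) {
      val p = (col until n).maxBy(r => math.abs(m(r)(col)))
      val tmpRow = m(col); m(col) = m(p); m(p) = tmpRow
      val tmpY = y(col); y(col) = y(p); y(p) = tmpY
      for (r <- col + 1 until n) {
        val f = m(r)(col) / m(col)(col)
        for (c <- col until n) m(r)(c) -= f * m(col)(c)
        y(r) -= f * y(col)
      }
    }
    val x = new Array[Double](n)
    for (r <- n - 1 to 0 by -1) {
      val s = (r + 1 until n).map(c => m(r)(c) * x(c)).sum
      x(r) = (y(r) - s) / m(r)(r)
    }
    x
  }

  // itemFactors: factor vectors of the items the new user rated;
  // ratings: the corresponding ratings; lambda: ALS regularization.
  def foldIn(itemFactors: Array[Array[Double]], ratings: Array[Double], lambda: Double): Array[Double] = {
    val k = itemFactors.head.length
    val ata = Array.tabulate(k, k) { (i, j) =>
      itemFactors.map(v => v(i) * v(j)).sum + (if (i == j) lambda else 0.0)
    }
    val atb = Array.tabulate(k) { i =>
      itemFactors.zip(ratings).map { case (v, r) => v(i) * r }.sum
    }
    solve(ata, atb)
  }
}
```

A predicted rating for any item j is then the dot product of the returned user vector with item j's factor vector, which would let the evaluator score held-out users without retraining.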
[ https://issues.apache.org/jira/browse/SPARK-14489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15232241#comment-15232241 ] Sean Owen commented on SPARK-14489: --- You could also argue that the problem is returning NaN for an unknown user when making recommendations. I'm not sure where that comes in, but a reasonable way to address this would be to always provide a value: an empty list of recommendations, an estimated strength of 0 for the implicit case, or some kind of global average rating for the explicit case. If the model is allowed to return NaN, and NaNs are ignored by the evaluation metric, it is essentially not penalizing the model for 'passing' on a question. Making the model return an answer in all cases seems more useful and requires no work-around or flag. [~dulajrajitha] [~clemencb] what do you think of trying to implement that?
[ https://issues.apache.org/jira/browse/SPARK-14489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15232226#comment-15232226 ] Nick Pentreath commented on SPARK-14489: This issue would also apply to any ranking-based evaluator for ALS. [~srowen] what do you think? A param flag to allow excluding NaNs (with a warning should they be encountered) for {{RegressionEvaluator}}?