[ 
https://issues.apache.org/jira/browse/SPARK-14489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15254680#comment-15254680
 ] 

Seth Hendrickson commented on SPARK-14489:
------------------------------------------

This is an interesting idea. Under the current framework for stratified 
sampling, I don't think there is a performant way to guarantee that each split 
contains every user at least once (even if we filter out users with < k * n 
items). With naive stratified sampling, you would generate a random key for 
each user, sort the entire dataset, and take even splits within each user. I am 
not sure that is an acceptable option, given how expensive a sort over the 
entire dataset would be. ScaSRS might actually be worse in this case if the 
waitlist is close to the size of the requested sample, since the waitlists are 
collected on the driver. I am not sure what options open up if we drop the 
requirement of even splits and only require that each split contains every 
user, but there might be something to that.
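
To make the naive approach concrete, here is a rough sketch of one reading of 
it, written against a local Scala collection rather than an RDD so that it is 
self-contained. The `Rating` case class, the `StratifiedFolds` object and the 
"at least k ratings per user" cutoff are assumptions made for illustration; 
none of them come from Spark. Shuffling within each user plays the role of the 
random key plus sort, and dealing rows out round-robin gives the even splits.

```scala
import scala.util.Random

// Hypothetical record type for illustration; not a Spark class.
final case class Rating(user: Int, item: Int, rating: Double)

object StratifiedFolds {
  /** Pairs each rating with a fold index in [0, k) so that every retained
    * user appears in every fold. Users with fewer than k ratings are dropped
    * (an assumed cutoff; the comment above mentions a different threshold).
    */
  def assignFolds(ratings: Seq[Rating], k: Int, seed: Long): Seq[(Rating, Int)] = {
    require(k >= 2, "need at least two folds")
    val rng = new Random(seed)
    ratings
      .groupBy(_.user)
      .toSeq
      .sortBy(_._1) // fixed user order so a given seed is reproducible
      .filter { case (_, rs) => rs.size >= k }
      .flatMap { case (_, rs) =>
        // Random order within the user, then deal rows out round-robin.
        rng.shuffle(rs).zipWithIndex.map { case (r, i) => (r, i % k) }
      }
  }
}
```

On a distributed dataset the per-user grouping and ordering is exactly the 
full-dataset sort (or shuffle) whose cost is in question above, so this only 
illustrates the guarantee, not a performant way to obtain it.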

> RegressionEvaluator returns NaN for ALS in Spark ml
> ---------------------------------------------------
>
>                 Key: SPARK-14489
>                 URL: https://issues.apache.org/jira/browse/SPARK-14489
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 1.6.0
>         Environment: AWS EMR
>            Reporter: Boris Clémençon 
>              Labels: patch
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> When building a Spark ML pipeline containing an ALS estimator, the metrics 
> "rmse", "mse", "r2" and "mae" all return NaN. 
> The cause is in CrossValidator.scala, line 109: the K folds are generated 
> randomly. For large, sparse datasets, there is a significant probability 
> that at least one user in the validation set is missing from the training 
> set, so the transform method produces some NaN predictions and 
> RegressionEvaluator's metrics become NaN as well. 
> Suggested fix: remove the NaN values when computing the rmse or other 
> metrics (i.e., drop users or items in the validation set that are missing 
> from the training set), and log when this happens.
> Issue SPARK-14153 seems to be the same problem.
> {code:title=Bar.scala|borderStyle=solid}
>     val splits = MLUtils.kFold(dataset.rdd, $(numFolds), 0)
>     splits.zipWithIndex.foreach { case ((training, validation), splitIndex) =>
>       val trainingDataset = sqlCtx.createDataFrame(training, schema).cache()
>       val validationDataset = sqlCtx.createDataFrame(validation, schema).cache()
>       // multi-model training
>       logDebug(s"Train split $splitIndex with multiple sets of parameters.")
>       val models = est.fit(trainingDataset, epm).asInstanceOf[Seq[Model[_]]]
>       trainingDataset.unpersist()
>       var i = 0
>       while (i < numModels) {
>         // TODO: duplicate evaluator to take extra params from input
>         val metric = eval.evaluate(models(i).transform(validationDataset, epm(i)))
>         logDebug(s"Got metric $metric for model trained with ${epm(i)}.")
>         metrics(i) += metric
>         i += 1
>       }
>       validationDataset.unpersist()
>     }
> {code}
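
The fix suggested in the quoted description (drop NaN predictions before 
computing the metric, and report how many were dropped) could look roughly 
like the sketch below. It works on plain (prediction, label) pairs rather than 
a DataFrame to stay self-contained; `NaNSafeRmse` is a hypothetical helper 
name, not part of RegressionEvaluator.

```scala
// Hypothetical helper for illustration; not part of Spark's RegressionEvaluator.
object NaNSafeRmse {
  /** Computes RMSE over pairs whose prediction is not NaN.
    * Returns (rmse, number of dropped pairs) so the caller can log the count.
    * The rmse is NaN only when no usable pair remains.
    */
  def rmse(predictionsAndLabels: Seq[(Double, Double)]): (Double, Int) = {
    val usable = predictionsAndLabels.filterNot { case (p, _) => p.isNaN }
    val dropped = predictionsAndLabels.size - usable.size
    if (usable.isEmpty) (Double.NaN, dropped)
    else {
      val mse = usable.map { case (p, y) => (p - y) * (p - y) }.sum / usable.size
      (math.sqrt(mse), dropped)
    }
  }
}
```

Returning the dropped count alongside the metric keeps the filtering visible: 
a metric computed on a small fraction of the validation set is easy to 
misread if the omissions are silent.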



