[ https://issues.apache.org/jira/browse/SPARK-29812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Aman Omer updated SPARK-29812:
------------------------------
        Parent: SPARK-29818
    Issue Type: Sub-task  (was: Improvement)

> Missing persist on predictionAndLabels in MulticlassClassificationEvaluator
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-29812
>                 URL: https://issues.apache.org/jira/browse/SPARK-29812
>             Project: Spark
>          Issue Type: Sub-task
>          Components: ML
>    Affects Versions: 2.4.3
>            Reporter: Dong Wang
>            Priority: Major
>
> The RDD predictionAndLabels in
> ml.evaluation.MulticlassClassificationEvaluator.evaluate() needs to be persisted.
> When MulticlassMetrics uses predictionAndLabels to initialize its fields, at
> least five actions are executed on predictionAndLabels. Without persist(),
> each of those actions recomputes the RDD from the input dataset.
> {code:scala}
>   override def evaluate(dataset: Dataset[_]): Double = {
>     val schema = dataset.schema
>     SchemaUtils.checkColumnType(schema, $(predictionCol), DoubleType)
>     SchemaUtils.checkNumericType(schema, $(labelCol))
>     // Needs to be persisted
>     val predictionAndLabels =
>       dataset.select(col($(predictionCol)), col($(labelCol)).cast(DoubleType)).rdd.map {
>         case Row(prediction: Double, label: Double) => (prediction, label)
>       }
>     // The initialization uses predictionAndLabels multiple times in different actions.
>     val metrics = new MulticlassMetrics(predictionAndLabels)
>     val metric = $(metricName) match {
>       case "f1" => metrics.weightedFMeasure
>       case "weightedPrecision" => metrics.weightedPrecision
>       case "weightedRecall" => metrics.weightedRecall
>       case "accuracy" => metrics.accuracy
>     }
>     metric
>   }
> {code}
> This issue was reported by our tool CacheCheck, which dynamically detects
> persist()/unpersist() API misuses.
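
For illustration only, the change the report asks for could look roughly like the sketch below. It is not the patch attached to this issue: the object PersistSketch, the helper evaluateWithPersist, its plain String parameters (standing in for the evaluator's Params), the MEMORY_AND_DISK storage level, and the fallback case for unknown metric names are all assumptions made to keep the example self-contained.

```scala
import org.apache.spark.mllib.evaluation.MulticlassMetrics
import org.apache.spark.sql.{Dataset, Row}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DoubleType
import org.apache.spark.storage.StorageLevel

// Hypothetical stand-alone helper, not part of Spark.
object PersistSketch {
  def evaluateWithPersist(
      dataset: Dataset[_],
      predictionCol: String,
      labelCol: String,
      metricName: String): Double = {
    // Persist the RDD so the actions triggered by MulticlassMetrics
    // reuse the cached partitions instead of recomputing them.
    // The storage level is an assumption; the issue does not name one.
    val predictionAndLabels = dataset
      .select(col(predictionCol), col(labelCol).cast(DoubleType))
      .rdd
      .map { case Row(prediction: Double, label: Double) => (prediction, label) }
      .persist(StorageLevel.MEMORY_AND_DISK)

    try {
      val metrics = new MulticlassMetrics(predictionAndLabels)
      // The metric is computed inside the try block, so every action
      // on the RDD has finished before it is unpersisted.
      metricName match {
        case "f1" => metrics.weightedFMeasure
        case "weightedPrecision" => metrics.weightedPrecision
        case "weightedRecall" => metrics.weightedRecall
        case "accuracy" => metrics.accuracy
        case other =>
          throw new IllegalArgumentException(s"Unsupported metric: $other")
      }
    } finally {
      // Release the cached blocks once the metric has been computed.
      predictionAndLabels.unpersist()
    }
  }
}
```

Unpersisting in a finally block keeps the cache from outliving the call even if metric computation fails; whether that trade-off suits the real evaluator is a design decision for the actual patch.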