[ https://issues.apache.org/jira/browse/SPARK-29816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Aman Omer updated SPARK-29816:
------------------------------
    Parent: SPARK-29818
    Issue Type: Sub-task  (was: Improvement)

> Missing persist in mllib.evaluation.BinaryClassificationMetrics.recallByThreshold()
> -----------------------------------------------------------------------------------
>
>                 Key: SPARK-29816
>                 URL: https://issues.apache.org/jira/browse/SPARK-29816
>             Project: Spark
>          Issue Type: Sub-task
>          Components: MLlib
>    Affects Versions: 2.4.3
>            Reporter: Dong Wang
>            Priority: Minor
>
> The RDD returned by scoreAndLabels.combineByKey is computed twice, once for
> sortByKey and once for count(), so it needs to be persisted.
> {code:scala}
>     val counts = scoreAndLabels.combineByKey(
>       createCombiner = (label: Double) => new BinaryLabelCounter(0L, 0L) += label,
>       mergeValue = (c: BinaryLabelCounter, label: Double) => c += label,
>       mergeCombiners = (c1: BinaryLabelCounter, c2: BinaryLabelCounter) => c1 += c2
>     ).sortByKey(ascending = false) // first use
>     val binnedCounts =
>       // Only down-sample if bins is > 0
>       if (numBins == 0) {
>         // Use original directly
>         counts
>       } else {
>         val countsSize = counts.count() // second use
> {code}
> This issue was reported by our tool CacheCheck, which dynamically detects
> misuses of the persist()/unpersist() API.
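
For illustration only, the kind of change the report asks for might look like the sketch below. It is not the patch attached to the issue. The object name `PersistSketch`, the helper `sortedCountsSize`, the sample data, the storage level, and the use of a plain `(Long, Long)` tuple in place of `BinaryLabelCounter` are all assumptions made to keep the example self-contained. It assumes Spark is on the classpath.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Hypothetical sketch: persist the combineByKey result before it is reused.
object PersistSketch {

  // Returns the number of distinct scores, mirroring `counts.count()` in the report.
  def sortedCountsSize(sc: SparkContext, pairs: Seq[(Double, Double)]): Long = {
    val scoreAndLabels = sc.parallelize(pairs)

    // (positives, negatives) per score; a tuple stands in for BinaryLabelCounter.
    val combined = scoreAndLabels.combineByKey[(Long, Long)](
      (label: Double) => if (label > 0.5) (1L, 0L) else (0L, 1L),
      (c: (Long, Long), label: Double) =>
        if (label > 0.5) (c._1 + 1L, c._2) else (c._1, c._2 + 1L),
      (c1: (Long, Long), c2: (Long, Long)) => (c1._1 + c2._1, c1._2 + c2._2)
    ).persist(StorageLevel.MEMORY_AND_DISK) // assumed storage level

    try {
      // First use: sortByKey builds a range partitioner, which may run a
      // sampling job over `combined`.
      val counts = combined.sortByKey(ascending = false)
      // Second use: count() evaluates the sort, reading `combined` again;
      // with persist() the cached partitions can be reused.
      counts.count()
    } finally {
      combined.unpersist()
    }
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("persist-sketch")
      .getOrCreate()
    try {
      val sample = Seq((0.9, 1.0), (0.9, 0.0), (0.4, 1.0), (0.1, 0.0))
      println(s"distinct scores: ${sortedCountsSize(spark.sparkContext, sample)}")
    } finally {
      spark.stop()
    }
  }
}
```

Persisting the intermediate RDD rather than the sorted one is a design choice of this sketch. Which RDD the actual fix should cache, and when it should be unpersisted, depends on how `counts` is used in the rest of BinaryClassificationMetrics.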