Antoine Galataud created SPARK-24875:
----------------------------------------

             Summary: MulticlassMetrics should offer a more efficient way to compute count by label
                 Key: SPARK-24875
                 URL: https://issues.apache.org/jira/browse/SPARK-24875
             Project: Spark
          Issue Type: Improvement
          Components: MLlib
    Affects Versions: 2.3.1
            Reporter: Antoine Galataud


Currently _MulticlassMetrics_ calls _countByValue_() to get the count by class/label:
{code:java}
private lazy val labelCountByClass: Map[Double, Long] = predictionAndLabels.values.countByValue()
{code}
If the input _RDD[(Double, Double)]_ is huge (which can be the case with a large test dataset), this leads to poor execution performance.

One option could be to allow using _countByValueApprox_ instead (this could require adding an extra configuration param to _MulticlassMetrics_).

Note: since there is no equivalent of _MulticlassMetrics_ in the new ML library, I don't know how this could be ported there.
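
To make the proposal concrete, here is a minimal sketch of the two counting strategies side by side. It is not a patch against _MulticlassMetrics_: the object name LabelCountSketch, the helper names exactLabelCounts and approxLabelCounts, and the timeoutMs and confidence parameters are all hypothetical, chosen only for illustration. It assumes a local Spark 2.x installation and uses the existing RDD methods countByValue and countByValueApprox. Rounding the mean of each bound to a Long is also an assumption; a real change would need to decide how to expose the confidence interval.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical helper object, not part of Spark.
object LabelCountSketch {

  // Exact per-label counts, equivalent to the current labelCountByClass.
  def exactLabelCounts(predictionAndLabels: RDD[(Double, Double)]): Map[Double, Long] =
    predictionAndLabels.values.countByValue().toMap

  // Approximate per-label counts. Waits at most timeoutMs milliseconds and
  // returns whatever estimate is available at that point.
  // Assumption: the mean of each BoundedDouble is rounded to a Long, which
  // discards the low/high bounds that countByValueApprox also reports.
  def approxLabelCounts(
      predictionAndLabels: RDD[(Double, Double)],
      timeoutMs: Long,
      confidence: Double): Map[Double, Long] = {
    val partial = predictionAndLabels.values.countByValueApprox(timeoutMs, confidence)
    partial.initialValue.map { case (label, bound) => (label, math.round(bound.mean)) }.toMap
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("LabelCountSketch")
    val sc = new SparkContext(conf)
    try {
      // Toy (prediction, label) pairs; a real test set would be far larger.
      val predictionAndLabels =
        sc.parallelize(Seq((0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (2.0, 2.0)))
      println(exactLabelCounts(predictionAndLabels))
      println(approxLabelCounts(predictionAndLabels, timeoutMs = 1000L, confidence = 0.95))
    } finally {
      sc.stop()
    }
  }
}
```

On a small input the approximate job will typically finish before the timeout, so both calls should agree; the difference only matters when the job is cut short on a large RDD.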