[ https://issues.apache.org/jira/browse/SPARK-24875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16860922#comment-16860922 ]
zhengruifeng commented on SPARK-24875:
--------------------------------------

The <label, score> dataset is usually much smaller than the training dataset containing <features>. If the score data is too large for a simple operation like countByValue, how could you train the model in the first place?
I doubt that applying an approximation is worth it.

> MulticlassMetrics should offer a more efficient way to compute count by label
> -----------------------------------------------------------------------------
>
>                 Key: SPARK-24875
>                 URL: https://issues.apache.org/jira/browse/SPARK-24875
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 2.3.1
>            Reporter: Antoine Galataud
>            Priority: Minor
>
> Currently _MulticlassMetrics_ calls _countByValue_() to get count by class/label
> {code:java}
> private lazy val labelCountByClass: Map[Double, Long] = predictionAndLabels.values.countByValue()
> {code}
> If input _RDD[(Double, Double)]_ is huge (which can be the case with a large test dataset), it will lead to poor execution performance.
> One option could be to allow using _countByValueApprox_ (could require adding an extra configuration param for MulticlassMetrics).
> Note: since there is no equivalent of _MulticlassMetrics_ in new ML library, I don't know how this could be ported there.
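
For reference, the two options discussed in this thread can be put side by side in a minimal sketch. This is not the MulticlassMetrics implementation: the object name `LabelCountSketch`, the helpers `exactCounts` and `approxCounts`, and the timeout and confidence values are hypothetical names and numbers chosen for illustration. Only `countByValue` and `countByValueApprox` are existing RDD methods.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

// Hypothetical helper object, for illustration only.
object LabelCountSketch {

  // Exact count per label: the same call the issue quotes from MulticlassMetrics.
  // The result has one entry per distinct label, so it is small on the driver.
  def exactCounts(predictionAndLabels: RDD[(Double, Double)]): scala.collection.Map[Double, Long] =
    predictionAndLabels.values.countByValue()

  // Approximate count per label: returns whatever estimate is available once
  // timeoutMs has elapsed. Each estimate is a BoundedDouble; only its mean is kept.
  def approxCounts(
      predictionAndLabels: RDD[(Double, Double)],
      timeoutMs: Long,
      confidence: Double): scala.collection.Map[Double, Double] =
    predictionAndLabels.values
      .countByValueApprox(timeoutMs, confidence)
      .initialValue
      .map { case (label, bounded) => label -> bounded.mean }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("label-count-sketch")
      .getOrCreate()

    // Tiny (prediction, label) dataset; the labels are 0.0, 1.0, 1.0, 2.0.
    val predictionAndLabels = spark.sparkContext.parallelize(
      Seq((0.0, 0.0), (1.0, 1.0), (0.0, 1.0), (2.0, 2.0)))

    println(exactCounts(predictionAndLabels))
    // Assumed values: 1 second timeout, 95% confidence.
    println(approxCounts(predictionAndLabels, timeoutMs = 1000L, confidence = 0.95))

    spark.stop()
  }
}
```

Both variants scan the whole RDD unless the timeout cuts the approximate job short, which is why the gain from the approximate version depends on how large the <label, score> data really is.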