[ 
https://issues.apache.org/jira/browse/SPARK-24875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16551640#comment-16551640
 ] 

Antoine Galataud commented on SPARK-24875:
------------------------------------------

True, I was proposing this not as a replacement but as an option (e.g., a 
setUseApproxStats setter on MulticlassMetrics) that wouldn’t be the default. 
Correctness is key, but having an approximate result is better than no result 
at all.
However, there should be better solutions than using countByValueApprox. Open to 
suggestions! 
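
To make the trade-off concrete, here is a rough sketch of what an opt-in approximate path could look like next to the exact one. It is only an illustration: the LabelCounts object, its exact/approx methods and the timeoutMs parameter are hypothetical names, not existing MulticlassMetrics API. The only Spark calls used are RDD.countByValue and RDD.countByValueApprox(timeout, confidence), and the code assumes spark-core is on the classpath.

```scala
import org.apache.spark.rdd.RDD

// Hypothetical helper for illustration only; not part of MulticlassMetrics.
object LabelCounts {

  // Exact counts: waits for a full pass over the data,
  // as MulticlassMetrics does today.
  def exact(predictionAndLabels: RDD[(Double, Double)]): Map[Double, Long] =
    predictionAndLabels.values.countByValue().toMap

  // Approximate counts: waits at most timeoutMs milliseconds, then returns
  // the estimate available at that point (or the final result if the job
  // finished earlier). Each label maps to the rounded mean of its
  // BoundedDouble estimate; the low/high bounds are dropped here.
  def approx(
      predictionAndLabels: RDD[(Double, Double)],
      timeoutMs: Long,
      confidence: Double = 0.95): Map[Double, Long] = {
    val partial = predictionAndLabels.values.countByValueApprox(timeoutMs, confidence)
    partial.initialValue.map { case (label, bound) =>
      label -> math.round(bound.mean)
    }.toMap
  }
}
```

A caller would keep using the exact variant by default and switch to something like LabelCounts.approx(predictionAndLabels, timeoutMs = 10000L) only when explicitly asked to. Note that any metric derived from these counts inherits the approximation error, which is why this should stay opt-in.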

> MulticlassMetrics should offer a more efficient way to compute count by label
> -----------------------------------------------------------------------------
>
>                 Key: SPARK-24875
>                 URL: https://issues.apache.org/jira/browse/SPARK-24875
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 2.3.1
>            Reporter: Antoine Galataud
>            Priority: Minor
>
> Currently _MulticlassMetrics_ calls _countByValue_() to get the count by 
> class/label:
> {code:java}
> private lazy val labelCountByClass: Map[Double, Long] = 
> predictionAndLabels.values.countByValue()
> {code}
> If the input _RDD[(Double, Double)]_ is huge (which can be the case with a large 
> test dataset), this leads to poor execution performance.
> One option could be to allow using _countByValueApprox_ (this could require adding 
> an extra configuration param to MulticlassMetrics).
> Note: since there is no equivalent of _MulticlassMetrics_ in the new ML library, 
> I don't know how this could be ported there.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
