[jira] [Commented] (SPARK-24875) MulticlassMetrics should offer a more efficient way to compute count by label

2019-06-11 Thread zhengruifeng (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16860922#comment-16860922
 ] 

zhengruifeng commented on SPARK-24875:
--------------------------------------

The  dataset is usually much smaller than the training dataset 
containing ,

if the score data is too large for a simple operation like countByValue, how 
could you train the model?

I doubt that applying an approximation is worthwhile.

> MulticlassMetrics should offer a more efficient way to compute count by label
> -----------------------------------------------------------------------------
>
>                 Key: SPARK-24875
>                 URL: https://issues.apache.org/jira/browse/SPARK-24875
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 2.3.1
>            Reporter: Antoine Galataud
>            Priority: Minor
>
> Currently _MulticlassMetrics_ calls _countByValue_() to get count by 
> class/label
> {code:java}
> private lazy val labelCountByClass: Map[Double, Long] = 
> predictionAndLabels.values.countByValue()
> {code}
> If the input _RDD[(Double, Double)]_ is huge (which can be the case with a large 
> test dataset), this leads to poor execution performance.
> One option could be to allow using _countByValueApprox_ (which could require 
> adding an extra configuration param to MulticlassMetrics).
> Note: since there is no equivalent of _MulticlassMetrics_ in the new ML library, 
> I don't know how this could be ported there.
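
For comparison, the two RDD calls named in the description can be sketched side by side. This is a minimal illustration and not code from MulticlassMetrics: the object name LabelCountSketch, the helper names, the sample data, and the timeout and confidence values are all assumptions made for the example. countByValueApprox is time-bounded: it returns whatever estimate is available when the timeout expires, with a confidence interval per value.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical helper object, not part of Spark.
object LabelCountSketch {

  // Exact count per label: the same call MulticlassMetrics makes today.
  def exactCounts(predictionAndLabels: RDD[(Double, Double)]): Map[Double, Long] =
    predictionAndLabels.values.countByValue().toMap

  // Time-bounded estimate per label, returned as (low, mean, high) bounds.
  // timeoutMs and confidence are illustrative parameters chosen by the caller.
  def approxCounts(
      predictionAndLabels: RDD[(Double, Double)],
      timeoutMs: Long,
      confidence: Double): Map[Double, (Double, Double, Double)] = {
    val partial = predictionAndLabels.values.countByValueApprox(timeoutMs, confidence)
    // initialValue is the estimate available once the timeout has elapsed
    // (or the final result if all tasks finished earlier).
    partial.initialValue.map { case (label, bounded) =>
      label -> ((bounded.low, bounded.mean, bounded.high))
    }.toMap
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("label-count-sketch")
    val sc = new SparkContext(conf)
    try {
      // Tiny made-up (prediction, label) pairs, only to exercise the calls.
      val data = sc.parallelize(Seq((0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (2.0, 2.0)))
      println(exactCounts(data))
      println(approxCounts(data, timeoutMs = 1000L, confidence = 0.95))
    } finally {
      sc.stop()
    }
  }
}
```

If every task finishes before the timeout, the initial value is already the final result; otherwise the returned bounds reflect only the partitions processed so far, which is the accuracy trade-off this issue is about.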



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24875) MulticlassMetrics should offer a more efficient way to compute count by label

2018-07-21 Thread Antoine Galataud (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16551640#comment-16551640
 ] 

Antoine Galataud commented on SPARK-24875:
------------------------------------------

True, I was proposing this not as a replacement but as an option (e.g. 
setUseApproxStats on MulticlassMetrics) that wouldn’t be the default. 
Correctness is key, but having an approximate result is better than no result 
at all.
However, there should be better solutions than using countByValueApprox. Open to 
suggestions! 




[jira] [Commented] (SPARK-24875) MulticlassMetrics should offer a more efficient way to compute count by label

2018-07-20 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16551457#comment-16551457
 ] 

Liang-Chi Hsieh commented on SPARK-24875:
-----------------------------------------

Hmm, I think that for the calculation of precision, recall and true/false 
positive rate, we should care only about an exact calculation, not an 
approximate one. So is it reasonable to use countByValueApprox here?
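
To make the concern concrete, here is a small worked sketch of how an error in the per-label count carries over into recall, which divides the true positives for a label by the number of rows carrying that label. The object name RecallSensitivity and all numbers are hypothetical, chosen only for illustration; nothing here is measured from Spark.

```scala
// Hypothetical illustration, not part of Spark.
object RecallSensitivity {

  // Per-label recall: true positives divided by the count of rows with that label.
  def recall(truePositives: Long, labelCount: Double): Double =
    if (labelCount == 0.0) 0.0 else truePositives / labelCount

  def main(args: Array[String]): Unit = {
    val truePositives = 150L
    val exactLabelCount = 200.0
    // Assumed estimate that is 10% below the exact count.
    val estimatedLabelCount = 180.0

    println(recall(truePositives, exactLabelCount))     // 0.75
    println(recall(truePositives, estimatedLabelCount)) // about 0.833
  }
}
```

In this made-up case a 10% underestimate of the label count shifts recall from 0.75 to roughly 0.83, so any approximate mode would need to report its error bounds alongside the metric.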
