[jira] [Commented] (SPARK-24875) MulticlassMetrics should offer a more efficient way to compute count by label
[ https://issues.apache.org/jira/browse/SPARK-24875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16860922#comment-16860922 ]

zhengruifeng commented on SPARK-24875:
--------------------------------------

The dataset is usually much smaller than the training dataset. If the score data is too huge to perform a simple op like countByValue, how could you train the model? I doubt whether it is worth applying an approximation.

> MulticlassMetrics should offer a more efficient way to compute count by label
> -----------------------------------------------------------------------------
>
>                 Key: SPARK-24875
>                 URL: https://issues.apache.org/jira/browse/SPARK-24875
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 2.3.1
>            Reporter: Antoine Galataud
>            Priority: Minor
>
> Currently _MulticlassMetrics_ calls _countByValue_() to get count by
> class/label
> {code:java}
> private lazy val labelCountByClass: Map[Double, Long] =
> predictionAndLabels.values.countByValue()
> {code}
> If input _RDD[(Double, Double)]_ is huge (which can be the case with a large
> test dataset), it will lead to poor execution performance.
> One option could be to allow using _countByValueApprox_ (could require adding
> an extra configuration param for MulticlassMetrics).
> Note: since there is no equivalent of _MulticlassMetrics_ in new ML library,
> I don't know how this could be ported there.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
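For readers comparing the two RDD calls named in the issue description, here is a minimal Scala sketch that runs both on a tiny local dataset. It is an illustration only, not the MulticlassMetrics implementation or a proposed patch. The object name LabelCounts, the helper names exact and approx, and the timeoutMs and confidence values are assumptions made for this example. Only countByValue and countByValueApprox are existing RDD methods.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical helper object for illustration; not part of Spark.
object LabelCounts {

  // Exact count per label, using the same call quoted in the issue.
  def exact(predictionAndLabels: RDD[(Double, Double)]): Map[Double, Long] =
    predictionAndLabels.values.countByValue().toMap

  // Approximate count per label as (low, high) bounds.
  // countByValueApprox returns a PartialResult; initialValue holds whatever
  // estimate was available when the timeout (in milliseconds) expired.
  def approx(
      predictionAndLabels: RDD[(Double, Double)],
      timeoutMs: Long,
      confidence: Double): Map[Double, (Double, Double)] = {
    val partial = predictionAndLabels.values.countByValueApprox(timeoutMs, confidence)
    partial.initialValue.map { case (label, bounded) =>
      (label, (bounded.low, bounded.high))
    }.toMap
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("label-counts-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // (prediction, label) pairs; the labels are the second element.
      val predictionAndLabels = sc.parallelize(
        Seq((0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (2.0, 1.0), (2.0, 2.0)))

      println(exact(predictionAndLabels))
      println(approx(predictionAndLabels, timeoutMs = 1000L, confidence = 0.95))
    } finally {
      sc.stop()
    }
  }
}
```

On a dataset this small the job normally finishes well inside the timeout, so the approximate bounds should collapse onto the exact counts. The approximation only matters when the timeout expires before all tasks complete.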
[jira] [Commented] (SPARK-24875) MulticlassMetrics should offer a more efficient way to compute count by label
[ https://issues.apache.org/jira/browse/SPARK-24875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16551640#comment-16551640 ]

Antoine Galataud commented on SPARK-24875:
------------------------------------------

True, I was proposing this not as a replacement, but as an option (e.g. setUseApproxStats on MulticlassMetrics) that wouldn't be the default. Correctness is key, but having an approximate result is better than no result at all. However, there should be better solutions than using countByValueApprox. Open to suggestions!
[jira] [Commented] (SPARK-24875) MulticlassMetrics should offer a more efficient way to compute count by label
[ https://issues.apache.org/jira/browse/SPARK-24875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16551457#comment-16551457 ]

Liang-Chi Hsieh commented on SPARK-24875:
-----------------------------------------

Hmm, I think that for the calculation of precision, recall, and true/false positive rate, we should only care about an exact calculation, not an approximate one. So is it reasonable to use countByValueApprox here?
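To make the concern about exactness concrete, the following Scala sketch shows how uncertainty in a per-label count would carry over into recall. It assumes, for illustration only, that the true-positive count for a label is exact while the label count is known only as an interval, and that recall is the true-positive count divided by the label count. The object RecallBounds, the helper recallInterval, and the numbers used are hypothetical and not taken from Spark.

```scala
// Hypothetical helper for illustration; not part of Spark.
object RecallBounds {

  // Recall bounds when the true-positive count is exact but the per-label
  // count is only known to lie in [countLow, countHigh].
  // Dividing by the larger count gives the lower recall bound and vice versa.
  def recallInterval(
      truePositives: Long,
      countLow: Double,
      countHigh: Double): (Double, Double) = {
    require(countLow > 0 && countLow <= countHigh, "need 0 < countLow <= countHigh")
    val low = truePositives / countHigh
    val high = math.min(1.0, truePositives / countLow) // recall cannot exceed 1
    (low, high)
  }

  def main(args: Array[String]): Unit = {
    // Made-up numbers: 80 true positives, label count somewhere in [90, 110].
    val (low, high) = recallInterval(truePositives = 80L, countLow = 90.0, countHigh = 110.0)
    println(s"recall lies between $low and $high")
  }
}
```

With these made-up inputs the interval spans 80/110 to 80/90, which is wide enough to change how a model would be judged. That is the practical cost of reporting a metric from approximate counts.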