[ 
https://issues.apache.org/jira/browse/SPARK-31217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17064333#comment-17064333
 ] 

CacheCheck commented on SPARK-31217:
------------------------------------

Besides, I think we should also add persist() APIs to the other metrics classes, 
e.g., for _summary_ in RegressionMetrics.
In the other three metrics classes, i.e., MulticlassMetrics, MultilabelMetrics, 
and RankingMetrics, _predictionAndLabels_ is used by multiple actions during 
object initialization, so it is better to check whether it is already cached. 
If it is not, we should cache it in these classes.
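
A minimal sketch of the "check whether it is cached, otherwise cache it" idea, written from the caller's side. The helper name withCached, the MEMORY_AND_DISK storage level, and the sample data are assumptions made for illustration; none of them are part of Spark or of any proposed patch. It relies only on the existing RDD methods getStorageLevel, persist, and unpersist.

```scala
import org.apache.spark.mllib.evaluation.MulticlassMetrics
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object CachedMetricsSketch {
  // Hypothetical helper (not a Spark API): persist `rdd` only when the caller
  // has not already done so, and release only what this helper persisted.
  def withCached[T, R](rdd: RDD[T])(body: RDD[T] => R): R = {
    val persistedHere = rdd.getStorageLevel == StorageLevel.NONE
    if (persistedHere) rdd.persist(StorageLevel.MEMORY_AND_DISK) // assumed level
    try body(rdd)
    finally if (persistedHere) rdd.unpersist()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("CachedMetricsSketch")
      .getOrCreate()

    // Made-up (prediction, label) pairs, for illustration only.
    val predictionAndLabels: RDD[(Double, Double)] =
      spark.sparkContext.parallelize(Seq(
        (0.0, 0.0), (1.0, 1.0), (1.0, 0.0), (2.0, 2.0)))

    // Both metrics are forced inside the block, while the input is persisted.
    val (accuracy, weightedF1) = withCached(predictionAndLabels) { cached =>
      val metrics = new MulticlassMetrics(cached)
      (metrics.accuracy, metrics.weightedFMeasure)
    }

    println(s"accuracy=$accuracy weightedF1=$weightedF1")
    spark.stop()
  }
}
```

Because the helper unpersists only what it persisted itself, an RDD that the user has already cached keeps its storage level after the metrics are computed.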

> Unnecessary persist on cumulativeCounts in BinaryClassificationMetrics
> ----------------------------------------------------------------------
>
>                 Key: SPARK-31217
>                 URL: https://issues.apache.org/jira/browse/SPARK-31217
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML, MLlib
>    Affects Versions: 2.4.4, 2.4.5
>            Reporter: CacheCheck
>            Priority: Major
>
> In mllib.evaluation.BinaryClassificationMetrics, _cumulativeCounts_ is cached 
> during lazy initialization. But when I run LogisticRegressionSummaryExample and 
> ModelSelectionViaCrossValidationExample, I find that the cached 
> _cumulativeCounts_ is used by only one action during execution. 
> So I think it should not be cached during initialization. Instead, we can add 
> an explicit persist() API to this class, mirroring the existing unpersist() API 
> in BinaryClassificationMetrics that releases the cached _cumulativeCounts_. 



