[ https://issues.apache.org/jira/browse/SPARK-31217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17064333#comment-17064333 ]
CacheCheck commented on SPARK-31217:
------------------------------------

Besides, I think we should also add persist() APIs to the other metrics classes, e.g., for _summary_ in RegressionMetrics. In the other three metrics classes, i.e., MulticlassMetrics, MultilabelMetrics, and RankingMetrics, _predictionAndLabels_ is used by multiple actions during object initialization, so it is better to check whether it is already cached. If it is not, these classes should cache it.

> Unnecessary persist on cumulativeCounts in BinaryClassificationMetrics
> ----------------------------------------------------------------------
>
>                 Key: SPARK-31217
>                 URL: https://issues.apache.org/jira/browse/SPARK-31217
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML, MLlib
>    Affects Versions: 2.4.4, 2.4.5
>            Reporter: CacheCheck
>            Priority: Major
>
> In mllib.evaluation.BinaryClassificationMetrics, _cumulativeCounts_ is cached
> during lazy initialization. But when I run LogisticRegressionSummaryExample and
> ModelSelectionViaCrossValidationExample, I find that the cached
> _cumulativeCounts_ is used by only one action during execution.
> So I think it should not be cached during initialization. Instead, we can add
> an explicit persist() API to this class, mirroring the existing unpersist() API
> in BinaryClassificationMetrics that releases the cached _cumulativeCounts_.
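The "check if it is cached before, and cache it if not" idea from the comment could look roughly like the sketch below. This is only an illustration, not code from Spark or from any patch on this issue: the object CacheIfNeeded and its methods persistIfNeeded and withCached are hypothetical names, and the default storage level is an assumption. The only Spark calls used are RDD.getStorageLevel, RDD.persist, RDD.unpersist, and the StorageLevel constants.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Hypothetical helper, not part of Spark.
object CacheIfNeeded {

  // Persists `rdd` only if no storage level has been assigned yet.
  // Returns true when this call did the persisting, so the caller
  // knows whether it is responsible for releasing the cache later.
  def persistIfNeeded[T](
      rdd: RDD[T],
      level: StorageLevel = StorageLevel.MEMORY_AND_DISK): Boolean = {
    if (rdd.getStorageLevel == StorageLevel.NONE) {
      rdd.persist(level)
      true
    } else {
      false
    }
  }

  // Runs `body` with `rdd` cached, then unpersists it only if this
  // helper created the cache. An RDD the user cached is left alone.
  def withCached[T, A](rdd: RDD[T])(body: RDD[T] => A): A = {
    val persistedHere = persistIfNeeded(rdd)
    try body(rdd)
    finally if (persistedHere) rdd.unpersist()
  }
}
```

Under these assumptions, a metrics class could call persistIfNeeded on _predictionAndLabels_ before running its several actions, and call unpersist() afterwards only when it did the persisting itself. That way a cache the caller set up is never released behind the caller's back.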