[ https://issues.apache.org/jira/browse/SPARK-23686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon resolved SPARK-23686.
----------------------------------
    Resolution: Incomplete

> Make better usage of org.apache.spark.ml.util.Instrumentation
> -------------------------------------------------------------
>
>                 Key: SPARK-23686
>                 URL: https://issues.apache.org/jira/browse/SPARK-23686
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 2.3.0
>            Reporter: Bago Amirbekian
>            Priority: Major
>              Labels: bulk-closed
>
> This Jira is a bit high level and might require subtasks or other jiras for
> more specific tasks.
>
> I've noticed that we don't make the best usage of the instrumentation class.
> Specifically sometimes we bypass the instrumentation class and use the
> debugger instead. For example,
> [https://github.com/apache/spark/blob/9b9827759af2ca3eea146a6032f9165f640ce152/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala#L143]
>
> Also there are some things that might be useful to log in the instrumentation
> class that we currently don't. For example:
>
> * number of training examples
> * mean/var of label (regression)
>
> I know computing these things can be expensive in some cases, but especially
> when this data is already available we can log it for free. For example,
> Logistic Regression Summarizer computes some useful data including numRows
> that we don't log.
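
The idea in the description, logging the number of training examples and the mean/variance of the label when that data is already being scanned, can be sketched roughly as below. This is a minimal illustration in plain Python, not Spark code: `SketchInstrumentation`, `LabelStats`, and `train_with_stats` are hypothetical names invented for this sketch, and `log_named_value` is only a stand-in for whatever the real instrumentation class exposes. It assumes the trainer already iterates over the labels once, so the statistics are accumulated in that same pass (Welford's running update) instead of in a separate scan.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger("instrumentation_sketch")


@dataclass
class LabelStats:
    """Running count, mean, and sum of squared deviations (Welford's update)."""
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0

    def add(self, label: float) -> None:
        self.count += 1
        delta = label - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (label - self.mean)

    @property
    def variance(self) -> float:
        # Sample variance; defined as 0.0 when fewer than two labels were seen.
        return self.m2 / (self.count - 1) if self.count > 1 else 0.0


class SketchInstrumentation:
    """Hypothetical stand-in for an instrumentation logger (NOT Spark's class)."""

    def __init__(self) -> None:
        self.values = {}

    def log_named_value(self, name: str, value) -> None:
        # Keep the value for inspection and emit it through standard logging.
        self.values[name] = value
        logger.info("%s: %s", name, value)


def train_with_stats(labels, instr: SketchInstrumentation) -> LabelStats:
    """Accumulate label statistics during a pass the trainer makes anyway."""
    stats = LabelStats()
    for y in labels:
        stats.add(y)  # a real trainer would also update its model here
    instr.log_named_value("numExamples", stats.count)
    instr.log_named_value("labelMean", stats.mean)
    instr.log_named_value("labelVariance", stats.variance)
    return stats
```

Because the statistics ride along with an existing pass over the data, the extra cost is a few arithmetic operations per row, which is the "log it for free" case the description has in mind.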