Bago Amirbekian created SPARK-23686:
---------------------------------------

             Summary: Make better usage of org.apache.spark.ml.util.Instrumentation
                 Key: SPARK-23686
                 URL: https://issues.apache.org/jira/browse/SPARK-23686
             Project: Spark
          Issue Type: Improvement
          Components: ML
    Affects Versions: 2.3.0
            Reporter: Bago Amirbekian


This Jira is fairly high level and may need subtasks or separate Jiras for 
more specific tasks.

I've noticed that we don't make the best use of the Instrumentation class. 
Specifically, we sometimes bypass the Instrumentation class and use the debugger 
instead. For example, 
[https://github.com/apache/spark/blob/9b9827759af2ca3eea146a6032f9165f640ce152/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala#L143]
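
To illustrate the general idea of routing log lines through a single instrumentation object rather than logging directly, here is a minimal sketch in plain Python. It does not use Spark, and it is not the API of org.apache.spark.ml.util.Instrumentation: the class name SketchInstrumentation, its methods, and the logger name are all hypothetical and chosen only for illustration.

```python
import logging
import uuid


class SketchInstrumentation:
    """Hypothetical stand-in for an instrumentation helper (not Spark's API)."""

    def __init__(self, estimator_name, logger=None):
        self.logger = logger or logging.getLogger("ml.instrumentation")
        # A short random id so that lines from concurrent fits can be told apart.
        self.prefix = f"{estimator_name}-{uuid.uuid4().hex[:8]}: "

    def format(self, message):
        # Every message gets the same per-run prefix.
        return self.prefix + message

    def log_debug(self, message):
        self.logger.debug(self.format(message))

    def log_named_value(self, name, value):
        self.logger.info(self.format(f"{name}={value}"))


instr = SketchInstrumentation("RandomForest")
# Instead of calling logging.debug("numClasses = 3") directly:
instr.log_debug("numClasses = 3")
```

Because every line carries the same prefix, the output of one training run can be filtered out of a shared log, which is lost when a call site logs on its own.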

Also, there are some things that might be useful to log through the 
Instrumentation class that we currently don't. For example:

 * number of training examples
 * mean/variance of the label (regression)

I know computing these things can be expensive in some cases, but when the 
data is already available we can log it for free. For example, the 
Logistic Regression Summarizer computes some useful data, including numRows, 
that we don't log.
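
As a rough illustration of why these statistics can be cheap, the sketch below computes the number of examples and the mean and sample variance of the label in a single pass (Welford's method), then logs them as named values. It is plain Python with no Spark dependency. The function names and the keys numExamples, labelMean and labelVariance are hypothetical, not names used by Spark.

```python
import logging

logger = logging.getLogger("ml.instrumentation")


def label_stats(labels):
    """Return (count, mean, sample variance) of labels in one pass (Welford)."""
    count, mean, m2 = 0, 0.0, 0.0
    for y in labels:
        count += 1
        delta = y - mean
        mean += delta / count
        # m2 accumulates the sum of squared deviations from the running mean.
        m2 += delta * (y - mean)
    variance = m2 / (count - 1) if count > 1 else 0.0
    return count, mean, variance


def log_training_stats(labels, log=logger.info):
    """Log the statistics as name=value lines; the key names are hypothetical."""
    count, mean, variance = label_stats(labels)
    log(f"numExamples={count}")
    log(f"labelMean={mean}")
    log(f"labelVariance={variance}")
    return count, mean, variance
```

If a summarizer has already produced these values during training, only the logging step is needed and the extra pass over the data disappears.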
