[jira] [Comment Edited] (SPARK-23686) Make better usage of org.apache.spark.ml.util.Instrumentation

Joseph K. Bradley (JIRA) Tue, 01 May 2018 17:22:47 -0700

    [ 
https://issues.apache.org/jira/browse/SPARK-23686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16460164#comment-16460164
 ]


Joseph K. Bradley edited comment on SPARK-23686 at 5/2/18 12:21 AM:
--------------------------------------------------------------------

[~yogeshgarg] and [~WeichenXu123] made the good point that some logging occurs 
on executors.  This raises two questions:
* Should we use Instrumentation on executors?
* What levels of logging should we use on executors (in MLlib algorithms)?

I figure it's safe to assume that executor logs are meant more for developers 
than for users.  (Current use in MLlib fits this, e.g., for training of 
trees in https://github.com/apache/spark/pull/21163 )  Those messages all seem 
to be at the DEBUG level, which is not really useful for users.

(UPDATED BELOW)
Since it'd be handy to have prefixes on executor logs too (to link them with 
Estimators), let's use Instrumentation on executors.
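
To make the point concrete: anything captured by a task closure must survive 
Java serialization before it can be used on an executor.  Below is a minimal 
sketch of a serializable, prefix-carrying log handle.  PrefixedLog, its 
constructor arguments, and the prefix layout are all hypothetical stand-ins 
for illustration; this is not Spark's Instrumentation API, and it only formats 
messages rather than calling a real logger.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical stand-in for a serializable instrumentation handle.
// Not Spark's actual class; names and prefix format are assumptions.
final class PrefixedLog implements Serializable {
    private static final long serialVersionUID = 1L;

    // Prefix that ties every message back to one Estimator instance.
    private final String prefix;

    PrefixedLog(String estimatorName, String uid) {
        this.prefix = "[" + estimatorName + "-" + uid + "] ";
    }

    // Builds the line a real logger would emit (e.g., at DEBUG on an executor).
    String format(String message) {
        return prefix + message;
    }

    // Simulates shipping the handle to an executor: serialize, then deserialize.
    static PrefixedLog roundTrip(PrefixedLog log) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(log);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return (PrefixedLog) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("handle is not serializable", e);
        }
    }
}
```

Because the handle holds only a String, the copy that arrives on the executor 
formats messages with the same prefix as the driver-side original, which is 
what lets executor log lines be matched to their Estimator.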



was (Author: josephkb):
[~yogeshgarg] and [~WeichenXu123] made the good point that some logging occurs 
on executors.  This raises two questions:
* Should we use Instrumentation on executors?
* What levels of logging should we use on executors (in MLlib algorithms)?

I figure it's safe to assume that executor logs are meant more for developers 
than for users.  (Current use in MLlib fits this, e.g., for training of 
trees in https://github.com/apache/spark/pull/21163 )  Those messages all seem 
to be at the DEBUG level, which is not really useful for users.

Given that, I recommend:
* We leave Instrumentation non-Serializable to avoid use on executors
* We use regular Logging on executors.

Developers who are debugging algorithms will presumably be running fairly 
isolated tests anyway.

> Make better usage of org.apache.spark.ml.util.Instrumentation
> -------------------------------------------------------------
>
>                 Key: SPARK-23686
>                 URL: https://issues.apache.org/jira/browse/SPARK-23686
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 2.3.0
>            Reporter: Bago Amirbekian
>            Priority: Major
>
> This Jira is a bit high level and might require subtasks or other jiras for 
> more specific tasks.
> I've noticed that we don't make the best use of the instrumentation class. 
> Specifically, we sometimes bypass the instrumentation class and use regular 
> logging instead. For example, 
> [https://github.com/apache/spark/blob/9b9827759af2ca3eea146a6032f9165f640ce152/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala#L143]
> Also, there are some things that might be useful to log in the 
> instrumentation class that we currently don't. For example:
> * number of training examples
> * mean/var of label (regression)
> I know computing these things can be expensive in some cases, but especially 
> when this data is already available we can log it for free. For example, 
> Logistic Regression Summarizer computes some useful data including numRows 
> that we don't log.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
