LucaCanali edited a comment on issue #26953: [SPARK-30306][CORE][PYTHON] 
Instrument Python UDF execution time and metrics using Spark Metrics system
URL: https://github.com/apache/spark/pull/26953#issuecomment-567869272
 
 
   Thanks @HyukjinKwon for taking the time to look at this.
   The functionality proposed in this PR is different from the Python profiler, or at least its intended use is. It is also intended to be lightweight, so that it can be used to measure Spark workloads in production as part of a performance dashboard built on metrics coming from the Spark metrics system.
   
   I'd like to add some additional context on how we intend to use this. 
   - I have described how we implement and use a Spark performance dashboard based on the metrics system in a recent Spark Summit presentation https://databricks.com/session_eu19/performance-troubleshooting-using-apache-spark-metrics and in a blog entry http://db-blog.web.cern.ch/blog/luca-canali/2019-02-performance-dashboard-apache-spark (see also the sink configuration sketch after this list).
   - Over the last couple of years I have helped improve this by adding instrumentation to the Spark metrics system for aggregated task metrics values per executor [SPARK-25228], JVM CPU usage [SPARK-22190], and memory usage [SPARK-27189]. This still does not cover the full space of possible activities and "time usage" by Spark tasks and executors. In particular, the problem I am trying to solve with this PR is that time spent waiting for results to come back from Python workers (typically when executing a UDF) is currently not instrumented, so it shows up in the dashboard as run time "without attribution", while with the metrics implemented here it can be visualized (see [image](https://camo.githubusercontent.com/6bbade0b6d6dceeb75cea27a5041298a3132599c/68747470733a2f2f6973737565732e6170616368652e6f72672f6a6972612f7365637572652f6174746163686d656e742f31323938393230312f50616e6461735544465f54696d655f496e737472756d656e746174696f6e5f416e6e6f74617465642e706e67)).
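   
   As a concrete example of how a dashboard picks these metrics up, here is a minimal sketch of pointing the Spark metrics system at a Graphite-compatible sink, along the lines of the setup described in the talk and blog post above. The host, port, and application name are placeholders, not values from this PR:
   
   ```scala
   // Minimal sketch: route Spark metrics to a Graphite-compatible sink,
   // which a performance dashboard can then read from.
   // Host, port, and reporting period below are placeholder values.
   import org.apache.spark.sql.SparkSession

   val spark = SparkSession.builder()
     .appName("PythonMetricsDashboardExample") // placeholder app name
     .config("spark.metrics.conf.*.sink.graphite.class",
       "org.apache.spark.metrics.sink.GraphiteSink")
     .config("spark.metrics.conf.*.sink.graphite.host", "graphite.example.com")
     .config("spark.metrics.conf.*.sink.graphite.port", "2003")
     .config("spark.metrics.conf.*.sink.graphite.period", "10")
     .config("spark.metrics.conf.*.sink.graphite.unit", "seconds")
     .getOrCreate()
   ```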
   
   As you mentioned, the current PR implements several detailed metrics, which I believe can be useful when troubleshooting, but for general use they can be simplified as you propose, in particular regarding the number and detail of the user-visible metrics exposed in the PythonMetrics source.
   A proposal for a simplified set of metrics to expose in the PythonMetrics source is (a sketch of the source follows the list):
     - BatchCountFromWorker
     - BatchCountToWorker
     - BytesSentToWorker
     - RunAndReadTimeFromWorker
     - WriteTimeToWorker
     - PandasUDFSentRowCount
     - PandasUDFReceivedRowCount
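   
   For illustration, here is a minimal sketch of the shape such a source could take, using the Dropwizard MetricRegistry that backs the Spark metrics system. Note that org.apache.spark.metrics.source.Source is Spark-internal (private[spark]), so this is a sketch of the proposal rather than a user-pluggable implementation, and the class and field names are illustrative:
   
   ```scala
   // Sketch of a PythonMetrics source exposing the simplified counters above.
   // Source is Spark-internal, so this only illustrates the intended shape.
   import com.codahale.metrics.{Counter, MetricRegistry}
   import org.apache.spark.metrics.source.Source

   class PythonMetricsSource extends Source {
     override val sourceName: String = "PythonMetrics"
     override val metricRegistry: MetricRegistry = new MetricRegistry

     // One Dropwizard counter per proposed metric; these would be incremented
     // from the Python runner code paths and reported via the configured sinks.
     val batchCountFromWorker: Counter = metricRegistry.counter("BatchCountFromWorker")
     val batchCountToWorker: Counter = metricRegistry.counter("BatchCountToWorker")
     val bytesSentToWorker: Counter = metricRegistry.counter("BytesSentToWorker")
     val runAndReadTimeFromWorker: Counter = metricRegistry.counter("RunAndReadTimeFromWorker")
     val writeTimeToWorker: Counter = metricRegistry.counter("WriteTimeToWorker")
     val pandasUDFSentRowCount: Counter = metricRegistry.counter("PandasUDFSentRowCount")
     val pandasUDFReceivedRowCount: Counter = metricRegistry.counter("PandasUDFReceivedRowCount")
   }
   ```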
