LucaCanali edited a comment on issue #26953: [SPARK-30306][CORE][PYTHON] Instrument Python UDF execution time and metrics using Spark Metrics system
URL: https://github.com/apache/spark/pull/26953#issuecomment-567869272

Thanks @HyukjinKwon for taking the time for this. The functionality proposed in this PR is different from the Python profiler, or at least its intended use is. It is also intended to be lightweight, so that it can be used to measure Spark workloads in production as part of a performance dashboard built on the metrics coming from the Spark metrics system. I'd like to add some context on how we intend to use this:

- I have described how we implement and use a Spark performance dashboard based on the metrics system in a recent Spark Summit presentation https://databricks.com/session_eu19/performance-troubleshooting-using-apache-spark-metrics and in a blog entry http://db-blog.web.cern.ch/blog/luca-canali/2019-02-performance-dashboard-apache-spark
- Over the last couple of years I have helped improve this area by adding instrumentation to the Spark metrics system for aggregated task metrics values per executor [SPARK-25228], JVM CPU usage [SPARK-22190], and memory usage [SPARK-27189].
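For context, a dashboard like the one described above consumes metrics through a Spark metrics sink configured in `conf/metrics.properties`. A minimal sketch, assuming a Graphite-compatible endpoint (the host and port below are placeholders, not from this PR):

```properties
# Send all Spark metrics (driver, executors, master, worker) to a
# Graphite-compatible endpoint, e.g. InfluxDB with the Graphite plugin.
*.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
*.sink.graphite.host=graphite.example.com
*.sink.graphite.port=2003
*.sink.graphite.period=10
*.sink.graphite.unit=seconds
```

With such a sink in place, any new metrics registered by a source (such as the PythonMetrics source discussed in this PR) would flow to the dashboard without further configuration.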
This still does not cover the full space of possible activities and "time usage" by Spark tasks and executors. In particular, the problem I am trying to solve with this PR is that time spent waiting for results to come back from Python workers (typically when executing a UDF) is currently not instrumented: it shows up in the current dashboard as run time "without attribution", while it can be visualized using the metrics implemented here (see [image](https://camo.githubusercontent.com/6bbade0b6d6dceeb75cea27a5041298a3132599c/68747470733a2f2f6973737565732e6170616368652e6f72672f6a6972612f7365637572652f6174746163686d656e742f31323938393230312f50616e6461735544465f54696d655f496e737472756d656e746174696f6e5f416e6e6f74617465642e706e67)).

As you mentioned, the current PR implements several detailed metrics, which I believe can be useful when troubleshooting, but for general use the set can be simplified as you propose, in particular the number of user-visible metrics and the detail exposed in the PythonMetrics source. A proposal for a simplified set of metrics to expose in the PythonMetrics source is:

- BatchCountFromWorker
- BatchCountToWorker
- BytesSentToWorker
- RunAndReadTimeFromWorker
- WriteTimeToWorker
- PandasUDFSentRowCount
- PandasUDFReceivedRowCount
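To illustrate what these counters would capture, here is a plain-Python sketch (not the actual PythonRunner instrumentation, and the class and method names below are hypothetical): it times the write of a batch to a worker and the combined run-and-read of the result, accumulating values under names that mirror the simplified metric list above.

```python
import time

class PythonMetricsSketch:
    """Toy accumulator mirroring the proposed simplified metric set."""

    def __init__(self):
        self.batch_count_to_worker = 0
        self.batch_count_from_worker = 0
        self.bytes_sent_to_worker = 0
        self.write_time_to_worker_ns = 0
        self.run_and_read_time_from_worker_ns = 0

    def write_batch(self, payload, worker_write):
        # Time only the write to the worker, and count batches and bytes sent.
        start = time.perf_counter_ns()
        worker_write(payload)
        self.write_time_to_worker_ns += time.perf_counter_ns() - start
        self.batch_count_to_worker += 1
        self.bytes_sent_to_worker += len(payload)

    def read_batch(self, worker_read):
        # UDF execution and result transfer are observed together from the
        # JVM side, hence a single "run and read" time, as in the proposal.
        start = time.perf_counter_ns()
        result = worker_read()
        self.run_and_read_time_from_worker_ns += time.perf_counter_ns() - start
        self.batch_count_from_worker += 1
        return result

# Usage with a stand-in "worker" (a pair of plain functions):
metrics = PythonMetricsSketch()
metrics.write_batch(b"some serialized batch", lambda payload: None)
result = metrics.read_batch(lambda: b"result batch")
```

Note that, as in the proposal, write time and run-and-read time are separate counters, so the dashboard can attribute wait-for-Python time distinctly from the time spent feeding data to the workers.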