LucaCanali opened a new pull request #26953: [SPARK-30306][CORE][PYTHON] Instrument Python UDF execution time and metrics using Spark Metrics system URL: https://github.com/apache/spark/pull/26953

### What changes were proposed in this pull request?
This extends Spark instrumentation with metrics aimed at drilling down on the performance of Python code called by Spark: via UDF, Pandas UDF, or mapPartitions. The relevant performance counters, notably execution time, are exposed using the Spark Metrics System (based on the Dropwizard library).

### Why are the changes needed?
This makes it easy to consume the metrics produced by executors, for example using a performance dashboard (this builds on previous work discussed in https://db-blog.web.cern.ch/blog/luca-canali/2019-02-performance-dashboard-apache-spark ). See also the screenshot that compares the existing state (no Python UDF time instrumentation) with the proposed new functionality ![](https://issues.apache.org/jira/secure/attachment/12989201/PandasUDF_Time_Instrumentation_Annotated.png)

### Does this PR introduce any user-facing change?
This PR adds the PythonMetrics source to the Spark Metrics system. The list of implemented metrics has been added to the Monitoring documentation.

### How was this patch tested?
Added relevant tests and manually tested end-to-end on a YARN cluster, using an existing Spark dashboard extended with the metrics proposed here.
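To illustrate the idea, the sketch below simulates the kind of cumulative counters such a metrics source could expose: number of Python UDF invocations and total execution time. This is a simplified, pure-Python illustration; the class and field names (`PythonMetrics`, `udf_calls`, `udf_time_ns`) are hypothetical and do not reflect the PR's actual implementation, which hooks into Spark's Dropwizard-based metrics system.

```python
import time

# Illustrative sketch only: accumulate per-UDF counters the way a
# metrics source might, using a decorator that times each invocation.
class PythonMetrics:
    def __init__(self):
        self.udf_calls = 0    # total number of UDF invocations
        self.udf_time_ns = 0  # cumulative execution time in nanoseconds

    def timed_udf(self, f):
        """Wrap a Python function so each call updates the counters."""
        def wrapper(*args, **kwargs):
            start = time.perf_counter_ns()
            try:
                return f(*args, **kwargs)
            finally:
                self.udf_calls += 1
                self.udf_time_ns += time.perf_counter_ns() - start
        return wrapper

metrics = PythonMetrics()

@metrics.timed_udf
def plus_one(x):
    return x + 1

results = [plus_one(i) for i in range(100)]
print(metrics.udf_calls)  # 100
print(results[:3])        # [1, 2, 3]
```

In the real patch, counters like these are registered with the Spark Metrics System so they can be scraped by a monitoring sink and plotted on a dashboard, rather than read directly from Python objects.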