snmvaughan opened a new pull request, #45123: URL: https://github.com/apache/spark/pull/45123
We currently capture metrics which include the number of files, bytes, and rows for a task, along with the updated partitions. This change captures metrics for each updated partition, reporting the partition sub-paths along with the number of files, bytes, and rows per partition for each task.

### What changes were proposed in this pull request?

- Update the `WriteTaskStatsTracker` implementation to associate a partition with new files during writing, and to track the number of rows written to each file. The final stats now include a map of partitions and the associated stats (number of committed files, bytes, and rows).
- Update the `WriteJobStatsTracker` implementation to capture the partition sub-paths and to publish a new event to the listener bus. The processed stats aggregate the per-partition statistics reported by the executors.
- Add a new `SparkListenerEvent` used to publish the task's collected partition metrics.

### Why are the changes needed?

This increases our understanding of written data by tracking the impact of each task on our datasets.

### Does this PR introduce _any_ user-facing change?

This adds an additional event which provides partition-level data to listeners.

### How was this patch tested?

In addition to the new unit tests, this was run in a Kubernetes environment writing tables with differing partitioning strategies and validating the reported stats. In all cases where partitioning was enabled, we also verified that the aggregated partition metrics matched the existing metrics for number of files, bytes, and rows.

### Was this patch authored or co-authored using generative AI tooling?

No

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
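For readers who want a concrete picture of the aggregation described in the pull request, the following is a minimal, language-neutral sketch in Python. It is not the Spark implementation: `PartitionStats`, `aggregate_partition_stats`, `totals`, and the `dt=...` partition sub-paths are all hypothetical names chosen for illustration. It only models the idea that each task reports a map from partition sub-path to (files, bytes, rows), that the job-level tracker merges those maps, and that the per-partition sums should match the job-wide totals.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class PartitionStats:
    """Hypothetical per-partition counters (not Spark's actual class)."""
    num_files: int = 0
    num_bytes: int = 0
    num_rows: int = 0

    def merge(self, other: "PartitionStats") -> "PartitionStats":
        # Counters are additive, so merging is a field-wise sum.
        return PartitionStats(
            self.num_files + other.num_files,
            self.num_bytes + other.num_bytes,
            self.num_rows + other.num_rows,
        )


def aggregate_partition_stats(task_stats):
    """Merge the per-task partition maps into one job-level map."""
    merged = defaultdict(PartitionStats)
    for per_task in task_stats:
        for partition, stats in per_task.items():
            merged[partition] = merged[partition].merge(stats)
    return dict(merged)


def totals(partition_stats):
    """Sum all partitions, for comparison with the job-wide metrics."""
    total = PartitionStats()
    for stats in partition_stats.values():
        total = total.merge(stats)
    return total


# Two hypothetical tasks; both write to the partition dt=2024-01-02.
task_a = {
    "dt=2024-01-01": PartitionStats(1, 100, 10),
    "dt=2024-01-02": PartitionStats(2, 250, 25),
}
task_b = {"dt=2024-01-02": PartitionStats(1, 50, 5)}

job_stats = aggregate_partition_stats([task_a, task_b])
job_totals = totals(job_stats)
```

The final comparison mirrors the validation mentioned under "How was this patch tested?": summing the aggregated partition metrics should reproduce the existing job-level counts of files, bytes, and rows.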
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org