snmvaughan opened a new pull request, #45123:
URL: https://github.com/apache/spark/pull/45123

   We currently capture write metrics for each task: the number of files, 
bytes, and rows written, along with the list of updated partitions. 
   This change also captures metrics for each updated partition, reporting the 
partition sub-paths along with the number of files, bytes, and rows per 
partition for each task.
   
   ### What changes were proposed in this pull request?
   
   - Update the `WriteTaskStatsTracker` implementation to associate a partition 
with new files during writing, and to track the number of rows written to each 
file. The final stats now include a map of partitions to their associated stats 
(number of committed files, bytes, and rows).
   - Update the `WriteJobStatsTracker` implementation to capture the partition 
sub-paths and to publish a new event to the listener bus. The processed stats 
aggregate the per-partition statistics reported by the executors.
   - Add a new `SparkListenerEvent` used to publish the collected partition 
metrics.
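   
   The bookkeeping described in the list above can be sketched roughly as 
follows. This is a minimal, self-contained illustration and not the code in 
this pull request: `PartitionStats`, `TaskPartitionTracker`, `aggregate`, and 
the `fileSize` lookup are hypothetical names, and the real 
`WriteTaskStatsTracker` and `WriteJobStatsTracker` interfaces are not 
reproduced here.

```scala
// Hypothetical sketch: none of these names come from the pull request.
object PartitionMetricsSketch {

  // Per-partition totals: committed files, bytes, and rows.
  final case class PartitionStats(numFiles: Int, numBytes: Long, numRows: Long) {
    def merge(other: PartitionStats): PartitionStats =
      PartitionStats(
        numFiles + other.numFiles,
        numBytes + other.numBytes,
        numRows + other.numRows)
  }

  // Task side: remembers which partition each new file belongs to
  // and counts the rows written to each file.
  final class TaskPartitionTracker {
    private val partitionOfFile = scala.collection.mutable.Map.empty[String, String]
    private val rowsPerFile = scala.collection.mutable.Map.empty[String, Long]

    def newFile(partition: String, filePath: String): Unit = {
      partitionOfFile(filePath) = partition
      rowsPerFile(filePath) = 0L
    }

    def newRow(filePath: String): Unit =
      rowsPerFile(filePath) = rowsPerFile(filePath) + 1L

    // `fileSize` stands in for a file-system size lookup on committed files.
    def finalStats(fileSize: String => Long): Map[String, PartitionStats] =
      partitionOfFile.toSeq
        .groupBy { case (_, partition) => partition }
        .map { case (partition, files) =>
          val stats = files
            .map { case (path, _) => PartitionStats(1, fileSize(path), rowsPerFile(path)) }
            .reduce(_ merge _)
          partition -> stats
        }
  }

  // Job side: merge the per-partition maps reported by every task.
  def aggregate(perTask: Seq[Map[String, PartitionStats]]): Map[String, PartitionStats] =
    perTask.flatten
      .groupBy { case (partition, _) => partition }
      .map { case (partition, entries) =>
        partition -> entries.map(_._2).reduce(_ merge _)
      }
}
```

   Keeping the merge logic on the stats type means the task-side and job-side 
aggregation share a single definition of how files, bytes, and rows are summed.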
   
   ### Why are the changes needed?
   Per-partition metrics show how each task affects our datasets: which 
partitions it updated and how many files, bytes, and rows it wrote to each.
   
   ### Does this PR introduce _any_ user-facing change?
   This adds an additional event which provides partition-level data to 
listeners.
   
   ### How was this patch tested?
   In addition to the new unit tests, this was run in a Kubernetes environment 
writing tables with differing partitioning strategies and validating the 
reported stats.  In all cases where partitioning was enabled we also verified 
that the aggregated partition metrics matched the existing metrics for number 
of files, bytes, and rows.
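   
   The consistency check described above amounts to summing the per-partition 
stats and comparing the result with the existing totals. A minimal sketch of 
that invariant, assuming hypothetical `Stats` and `matchesTotals` names (the 
actual validation used for this patch is not shown in this description):

```scala
// Hypothetical sketch of the invariant; not the test code from the pull request.
object PartitionMetricsCheck {

  final case class Stats(numFiles: Int, numBytes: Long, numRows: Long)

  // True when the per-partition stats add up to the existing overall metrics.
  def matchesTotals(perPartition: Map[String, Stats], totals: Stats): Boolean = {
    val summed = perPartition.values.foldLeft(Stats(0, 0L, 0L)) { (acc, s) =>
      Stats(acc.numFiles + s.numFiles, acc.numBytes + s.numBytes, acc.numRows + s.numRows)
    }
    summed == totals
  }
}
```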
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


