JoshRosen opened a new pull request, #39763:
URL: https://github.com/apache/spark/pull/39763

   ### What changes were proposed in this pull request?
   
   This PR modifies JsonProtocol to exclude certain redundant accumulator information from Spark event logs, saving space and processing time.
   
   Several event log types contain both TaskMetrics and Accumulables, but there is redundancy in how the TaskMetrics data is stored:
   
   - TaskMetrics are stored in a map called "Task Metrics" which maps from metric names to metric values.
   - An "Accumulables" field contains information on accumulator updates from the task. This field also includes updates from the TaskMetrics internal accumulators: both the update from the task itself and a running "sum-so-far" total across all tasks completed up to that point (see the abridged excerpt below).
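
   To make the redundancy concrete, here is an abridged, hand-written excerpt of a `SparkListenerTaskEnd` event (real events contain many more fields; the values here are made up). The `internal.metrics.executorRunTime` entry under "Accumulables" duplicates the "Executor Run Time" value under "Task Metrics", and its "Value" field additionally carries the running sum-so-far:

   ```json
   {
     "Event": "SparkListenerTaskEnd",
     "Task Info": {
       "Task ID": 0,
       "Accumulables": [
         {
           "ID": 1,
           "Name": "internal.metrics.executorRunTime",
           "Update": 321,
           "Value": 9876,
           "Internal": true,
           "Count Failed Values": true
         }
       ]
     },
     "Task Metrics": {
       "Executor Run Time": 321
     }
   }
   ```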
   
   The redundant task metrics accumulables are not actually used by the Spark 
History Server: I verified this by reading AppStatusListener and 
SQLAppStatusListener.
   
   I believe that this redundancy was introduced back in 
[SPARK-10620](https://issues.apache.org/jira/browse/SPARK-10620) when Spark 
1.x's separate TaskMetrics implementation was replaced by the current 
accumulator-based version.
   
   In this PR, I add logic to exclude the TaskMetrics internal accumulators when writing the "Accumulables" field; the idea is sketched below.
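
   Conceptually, the exclusion amounts to a name-prefix filter like the following sketch (a simplification, not the PR's exact code; it relies on the `internal.metrics.` prefix shared by Spark's internal task-metric accumulator names):

   ```scala
   import org.apache.spark.scheduler.AccumulableInfo

   // Sketch: decide whether an accumulable should be written to the
   // "Accumulables" field. TaskMetrics internal accumulators are recognized
   // by their "internal.metrics." name prefix and dropped unless the
   // escape-hatch config re-enables them.
   def includeAccumulable(
       acc: AccumulableInfo,
       includeTaskMetricsAccumulators: Boolean): Boolean = {
     includeTaskMetricsAccumulators ||
       !acc.name.exists(_.startsWith("internal.metrics."))
   }
   ```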
   
   Although I think it's highly unlikely that third-party non-Spark consumers 
of the event logs would be relying on this redundant information, I have added 
an "escape-hatch" configuration out of an abundance of caution. The new 
`spark.eventLog.includeTaskMetricsAccumulators` configuration (default `false`, 
meaning "exclude the redundant information") can be set to `true` to restore 
the old behavior.
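
   For example, a downstream consumer that still wants the old output could opt back in when building its session; a minimal sketch, assuming this PR's config name:

   ```scala
   import org.apache.spark.sql.SparkSession

   // Opt back into logging TaskMetrics internal accumulators in the
   // "Accumulables" field (restores the pre-change behavior).
   val spark = SparkSession.builder()
     .appName("event-log-compat") // hypothetical app name
     .config("spark.eventLog.enabled", "true")
     .config("spark.eventLog.includeTaskMetricsAccumulators", "true")
     .getOrCreate()
   ```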
   
   
   ### Why are the changes needed?
   
   This change reduces the size of Spark event logs, especially for applications that run many tasks. It should also slightly speed up event log reads and writes (although I haven't tried to quantify this).
   
   ### Does this PR introduce _any_ user-facing change?
   
   No user-facing changes in Spark History Server.
   
   This could be considered a user-facing change from the perspective of third-party code that does its own direct processing of Spark event logs, hence the new config. As noted above, I think it's very unlikely that anyone relies on this redundant information, so I think it's okay to change the default behavior to the one that benefits the majority of users (given that we have an escape-hatch fallback config).
   
   ### How was this patch tested?
   
   New unit tests in `JsonProtocolSuite`.
   
   Manual tests of event log size in `spark-shell` with a job that runs `sc.parallelize(1 to 1000, 1000).count()`. For this toy job, this PR's change shrank the uncompressed event log size by ~15%. The relative size reduction will be even greater once other issues like [SPARK-42206](https://issues.apache.org/jira/browse/SPARK-42206) or [SPARK-42203](https://issues.apache.org/jira/browse/SPARK-42203) are fixed. The relative reduction will be smaller for tasks with many SQL metrics, because those accumulables cannot be excluded.
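
   The manual measurement can be reproduced along these lines (a sketch; the event log directory and exact sizes will vary by environment):

   ```scala
   // Start spark-shell with event logging enabled, e.g.:
   //   ./bin/spark-shell \
   //     --conf spark.eventLog.enabled=true \
   //     --conf spark.eventLog.dir=/tmp/spark-events
   // Then run the toy job:
   sc.parallelize(1 to 1000, 1000).count()
   // After quitting the shell, compare the sizes of the files written to
   // /tmp/spark-events with and without this patch.
   ```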

