GitHub user kayousterhout opened a pull request: https://github.com/apache/spark/pull/16053
[SPARK-17931] Eliminate unnecessary task (de) serialization

## What changes were proposed in this pull request?

In the existing code, there are three layers of serialization involved in sending a task from the scheduler to an executor:

- A Task object is serialized
- The Task object is copied to a byte buffer that also contains serialized information about any additional JARs, files, and Properties needed for the task to execute. This byte buffer is stored as the member variable serializedTask in the TaskDescription class.
- The TaskDescription is serialized (in addition to the serialized task + JARs, the TaskDescription class contains the task ID and other metadata) and sent in a LaunchTask message.

While it *is* necessary to have two layers of serialization, so that the JAR, file, and Property info can be deserialized prior to deserializing the Task object, the third layer of deserialization is unnecessary. This commit eliminates a layer of serialization by moving the JARs, files, and Properties into the TaskDescription class.

## How was this patch tested?

Unit tests

This is a simpler alternative to the approach proposed in #15505. The biggest difference in functionality is that, in that code, all of the serialization occurs in one place (in CoarseGrainedExecutorBackend), whereas this approach maintains the split of serialization that was present in the existing code (some happens in TaskSetManager and some in CoarseGrainedExecutorBackend). I do think there are some benefits to doing all of the serialization in one place (e.g., to time it all, or to enable opportunities for parallelism in the future), but given that we don't take advantage of any of those things currently, it doesn't seem necessary to change (and it's simpler not to change it).
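For reviewers who want a concrete picture of the two layers that remain, here is a rough Python sketch. It is an illustration only, not the Spark code in this patch: the `TaskDescription` fields, the `encode`/`decode` helpers, the length-prefixed layout, and the use of `pickle` are all assumptions made for the example. The point it shows is that the JARs, files, and Properties sit next to the still-serialized task bytes, so the receiver can read them without deserializing the task.

```python
import pickle
import struct
from dataclasses import dataclass
from typing import Dict


@dataclass
class TaskDescription:
    # Hypothetical stand-in for Spark's TaskDescription, not the real class.
    task_id: int
    added_files: Dict[str, int]   # file path -> timestamp (assumed shape)
    added_jars: Dict[str, int]    # JAR path -> timestamp (assumed shape)
    properties: Dict[str, str]
    serialized_task: bytes        # layer 1: the task, already serialized


def encode(desc: TaskDescription) -> bytes:
    # Layer 2: serialize only the metadata, prefix it with its length,
    # and append the task bytes untouched (no re-serialization of the task).
    meta = pickle.dumps(
        (desc.task_id, desc.added_files, desc.added_jars, desc.properties)
    )
    return struct.pack(">I", len(meta)) + meta + desc.serialized_task


def decode(buf: bytes) -> TaskDescription:
    # Read the metadata; the task itself stays as opaque bytes.
    (meta_len,) = struct.unpack_from(">I", buf, 0)
    task_id, files, jars, props = pickle.loads(buf[4:4 + meta_len])
    return TaskDescription(task_id, files, jars, props, buf[4 + meta_len:])


# Scheduler side: serialize the task once, then wrap it.
task = {"stage": 1, "partition": 3}  # placeholder for a real Task object
desc = TaskDescription(
    task_id=42,
    added_files={"data.txt": 1480400625},
    added_jars={"dep.jar": 1480400625},
    properties={"spark.job.description": "demo"},
    serialized_task=pickle.dumps(task),
)

# Executor side: metadata is available before the task is deserialized,
# so dependencies could be fetched at this point.
received = decode(encode(desc))
restored = pickle.loads(received.serialized_task)
```

In this sketch the task bytes are serialized exactly once and copied as-is into the outgoing buffer, which is the property the description above is after.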
cc @shivaram

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/kayousterhout/spark-1 SPARK-17931

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/16053.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #16053

----
commit e3cb71ea57e135975ce3e28caaba314a9a599b86
Author: Kay Ousterhout <kayousterh...@gmail.com>
Date: 2016-11-29T06:23:45Z

    [SPARK-17931] Eliminate unnecessary task (de) serialization
----