GitHub user kayousterhout opened a pull request:

    https://github.com/apache/spark/pull/16053

    [SPARK-17931] Eliminate unnecessary task (de)serialization

    ## What changes were proposed in this pull request?
    
    In the existing code, there are three layers of serialization
    involved in sending a task from the scheduler to an executor:
        - A Task object is serialized
        - The Task object is copied to a byte buffer that also
          contains serialized information about any additional JARs,
          files, and Properties needed for the task to execute. This
          byte buffer is stored as the member variable serializedTask
          in the TaskDescription class.
        - The TaskDescription is serialized (in addition to the serialized
          task + JARs, the TaskDescription class contains the task ID and
          other metadata) and sent in a LaunchTask message.

    While it *is* necessary to have two layers of serialization, so that
    the JAR, file, and Property info can be deserialized prior to
    deserializing the Task object, the third layer of deserialization is
    unnecessary.  This commit eliminates a layer of serialization by moving
    the JARs, files, and Properties into the TaskDescription class.
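
The layering described above can be sketched roughly as follows. This is only
an illustration, not the Scala code in this pull request: it uses Python's
pickle as a stand-in for Spark's serializers, plain dicts in place of the
Task and TaskDescription classes, and every function name here (frame,
unframe, launch_message_before, and so on) is hypothetical.

```python
import pickle
import struct


def frame(*parts: bytes) -> bytes:
    """Concatenate byte strings, each prefixed with a 4-byte big-endian length."""
    return b"".join(struct.pack(">I", len(p)) + p for p in parts)


def unframe(buf: bytes) -> list:
    """Inverse of frame(): split a buffer back into its length-prefixed parts."""
    parts, pos = [], 0
    while pos < len(buf):
        (n,) = struct.unpack_from(">I", buf, pos)
        pos += 4
        parts.append(buf[pos:pos + n])
        pos += n
    return parts


# --- Before: three layers -------------------------------------------------
def launch_message_before(task, deps, task_id):
    task_bytes = pickle.dumps(task)                 # layer 1: the task itself
    # layer 2: task bytes copied into a buffer next to the serialized deps
    serialized_task = frame(pickle.dumps(deps), task_bytes)
    description = {"task_id": task_id, "serialized_task": serialized_task}
    return pickle.dumps(description)                # layer 3: the description


def receive_before(message):
    description = pickle.loads(message)
    deps_bytes, task_bytes = unframe(description["serialized_task"])
    deps = pickle.loads(deps_bytes)   # deps are readable before the task is
    task = pickle.loads(task_bytes)   # deserialized, e.g. to fetch JARs first
    return description["task_id"], deps, task


# --- After: two layers ----------------------------------------------------
def launch_message_after(task, deps, task_id):
    task_bytes = pickle.dumps(task)                 # layer 1: the task itself
    # deps now live directly in the description; no intermediate buffer
    description = {"task_id": task_id, "deps": deps,
                   "serialized_task": task_bytes}
    return pickle.dumps(description)                # layer 2: the description


def receive_after(message):
    description = pickle.loads(message)
    deps = description["deps"]        # still available before the task is
    task = pickle.loads(description["serialized_task"])  # deserialized
    return description["task_id"], deps, task
```

In both versions the receiver sees the dependency information before it
touches the task bytes, which is the property the two remaining layers are
meant to preserve; the second version simply skips the intermediate buffer.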
    
    ## How was this patch tested?
    
    Unit tests
    
    This is a simpler alternative to the approach proposed in #15505.
    
    The biggest functional difference from that approach is that, in that
    code, all of the serialization occurs in one place (in
    CoarseGrainedExecutorBackend), whereas this approach keeps the split
    present in the existing code, where some serialization happens in
    TaskSetManager and some in CoarseGrainedExecutorBackend.  I do think
    there are some benefits to doing all of the serialization in one place
    (e.g., to time it all, or to enable opportunities for parallelism in the
    future), but given that we don't currently take advantage of any of
    those things, changing it doesn't seem necessary (and it's simpler not
    to).
    
    cc @shivaram

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/kayousterhout/spark-1 SPARK-17931

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/16053.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #16053
    
----
commit e3cb71ea57e135975ce3e28caaba314a9a599b86
Author: Kay Ousterhout <kayousterh...@gmail.com>
Date:   2016-11-29T06:23:45Z

    [SPARK-17931] Eliminate unnecessary task (de)serialization
    
    In the existing code, there are three layers of serialization
    involved in sending a task from the scheduler to an executor:
        - A Task object is serialized
        - The Task object is copied to a byte buffer that also
          contains serialized information about any additional JARs,
          files, and Properties needed for the task to execute. This
          byte buffer is stored as the member variable serializedTask
          in the TaskDescription class.
        - The TaskDescription is serialized (in addition to the serialized
          task + JARs, the TaskDescription class contains the task ID and
          other metadata) and sent in a LaunchTask message.
    
    While it *is* necessary to have two layers of serialization, so that
    the JAR, file, and Property info can be deserialized prior to
    deserializing the Task object, the third layer of deserialization is
    unnecessary.  This commit eliminates a layer of serialization by moving
    the JARs, files, and Properties into the TaskDescription class.

----


