GitHub user vanzin opened a pull request: https://github.com/apache/spark/pull/15982
[SPARK-18546][core] Fix merging shuffle spills when using encryption.

The problem exists because it is not possible to simply concatenate encrypted partition data from different spill files: each spilled partition currently carries its own initialization vector (IV) for setting up encryption, while the final merged file should contain a single IV for each merged partition; otherwise iterating over the records becomes very hard.

To fix that, UnsafeShuffleWriter now decrypts the partitions when merging, so that the merged file contains a single IV at the start of each partition's data. Because that cannot be done through the fast transferTo path, UnsafeShuffleWriter falls back to file streams when merging with encryption enabled. A hybrid approach may be possible, using an intermediate direct buffer when reading from the files and encrypting the data, but that is better left for a separate patch.
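In rough terms, the stream-based merge does the following. This is a hedged sketch, not the actual (Java) UnsafeShuffleWriter code: the encrypt/decrypt parameters stand in for SerializerManager-style stream wrappers, and NonClosingOutputStream is an illustrative helper similar to the close-shielding wrapper Spark uses.

    import java.io._
    import com.google.common.io.ByteStreams

    object SpillMergeSketch {
      // Keep closing a per-partition cipher stream from closing the merged
      // file; closing is what flushes the cipher's final block.
      private class NonClosingOutputStream(out: OutputStream)
          extends FilterOutputStream(out) {
        override def close(): Unit = flush()
      }

      def mergeSpillsWithStreams(
          spills: Seq[File],
          // partitionLengths(i)(p) = bytes of partition p inside spill file i
          partitionLengths: Seq[Array[Long]],
          numPartitions: Int,
          encrypt: OutputStream => OutputStream,
          decrypt: InputStream => InputStream,
          mergedFile: File): Unit = {
        val rawOut = new FileOutputStream(mergedFile)
        val ins = spills.map(new FileInputStream(_))
        try {
          for (p <- 0 until numPartitions) {
            // One cipher stream per merged partition => a single IV per partition.
            val partOut = encrypt(new NonClosingOutputStream(rawOut))
            for ((in, i) <- ins.zipWithIndex) {
              // Read exactly this partition's bytes from spill i and decrypt them
              // (each spill slice starts with its own IV), then re-encrypt.
              val partIn = decrypt(ByteStreams.limit(in, partitionLengths(i)(p)))
              ByteStreams.copy(partIn, partOut)
            }
            partOut.close() // finish this partition's ciphertext
          }
        } finally {
          ins.foreach(_.close())
          rawOut.close()
        }
      }
    }

Without encryption the writer can keep splicing the spill files together byte-for-byte via NIO's transferTo; the fallback above is only needed when the bytes have to be transformed in flight.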
As part of the change, I made DiskBlockObjectWriter take a SerializerManager instead of a "wrap stream" closure, since that makes it easier to test the code without having to mock SerializerManager functionality.

Tested with newly added unit tests (UnsafeShuffleWriterSuite for the write side and ExternalAppendOnlyMapSuite for integration), and by running some apps that failed without the fix.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/vanzin/spark SPARK-18546

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/15982.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #15982

----

commit 7ed0d7c0312224252768b6f463603e57ca5e65d4
Author: Marcelo Vanzin <van...@cloudera.com>
Date: 2016-11-20T02:02:56Z

[SPARK-18547][core] Propagate I/O encryption key when executors register.

This change modifies how the encryption key used during shuffle is propagated. Instead of relying on YARN's UserGroupInformation credential propagation, the key is distributed explicitly in the messages exchanged between driver and executors during registration; when RPC encryption is enabled, key propagation is therefore also secure. This allows shuffle encryption to work in non-YARN mode, which makes it easier to write unit tests for the areas of the code affected by the feature.

The key is stored in the SecurityManager; because there are many instances of that class in the code, the key is only guaranteed to exist in the instance managed by the SparkEnv. This path was chosen to avoid storing the key in the SparkConf, which would risk the key being written to disk as part of the configuration (as is done, for example, when starting YARN applications).

Tested by new and existing unit tests (which were moved from the YARN module to core), and by running apps with shuffle encryption enabled.

commit fbd2bcaa44d7dba2a43d4fe2b3ac3830d3388857
Author: Marcelo Vanzin <van...@cloudera.com>
Date: 2016-11-18T01:07:03Z

[SPARK-18546][core] Fix merging shuffle spills when using encryption.

(The commit message is identical to the pull request description above.)
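For illustration, here is a rough sketch of the DiskBlockObjectWriter change described above. It is a simplified approximation of the post-patch constructor, not the actual code: the real class is private[spark], takes several more parameters, and only compiles inside the org.apache.spark package; serializerManager.wrapStream is assumed to layer encryption (and compression) onto the raw file stream.

    import java.io._
    import org.apache.spark.serializer.SerializerManager
    import org.apache.spark.storage.BlockId

    // Simplified sketch of the post-patch writer: it owns a SerializerManager
    // and wraps its own stream, instead of being constructed with an opaque
    // (OutputStream => OutputStream) closure. A test can now hand it a real
    // SerializerManager instead of mocking the closure's behavior.
    class DiskBlockObjectWriterSketch(
        file: File,
        blockId: BlockId,
        serializerManager: SerializerManager) {

      def open(): OutputStream = {
        val raw = new BufferedOutputStream(new FileOutputStream(file))
        // Encryption (and compression) are layered on here by the manager.
        serializerManager.wrapStream(blockId, raw)
      }
    }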
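And a minimal, self-contained sketch of the registration-time key hand-off described in the SPARK-18547 commit message. All types below (IoKeyHolder, RegisterExecutor, RegisteredExecutor) are illustrative stand-ins, not the actual Spark RPC messages or the SecurityManager API.

    import javax.crypto.KeyGenerator

    // Stand-in for the key-holding role the change gives to SecurityManager:
    // one instance per process is authoritative (the one owned by SparkEnv
    // in the real change).
    class IoKeyHolder {
      @volatile private var key: Option[Array[Byte]] = None
      def set(k: Array[Byte]): Unit = { key = Some(k) }
      def get: Option[Array[Byte]] = key
    }

    // Registration messages: the key rides on the driver's reply, so it is
    // covered by RPC encryption when that is enabled.
    case class RegisterExecutor(executorId: String)
    case class RegisteredExecutor(ioEncryptionKey: Option[Array[Byte]])

    object DriverSketch {
      val keys = new IoKeyHolder

      def init(): Unit = {
        // Generate the key once on the driver; it is never placed in
        // SparkConf, which could be written to disk as configuration.
        val gen = KeyGenerator.getInstance("AES")
        gen.init(128)
        keys.set(gen.generateKey().getEncoded)
      }

      def receive(msg: RegisterExecutor): RegisteredExecutor =
        RegisteredExecutor(keys.get)
    }

    object ExecutorSketch {
      val keys = new IoKeyHolder

      def onRegistered(reply: RegisteredExecutor): Unit =
        // Store the key in this process's single authoritative holder
        // (the SparkEnv-managed SecurityManager in the real change).
        reply.ioEncryptionKey.foreach(keys.set)
    }

Because the key travels only inside the registration reply, enabling RPC encryption protects it in transit, and it never touches configuration that may be persisted to disk.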