GitHub user vanzin opened a pull request:

    https://github.com/apache/spark/pull/17295

    [SPARK-19556][core] Do not encrypt block manager data in memory.

    This change modifies the way block data is encrypted to make the more
    common cases faster, while penalizing an edge case. As a side effect
    of the change, all data that goes through the block manager is now
    encrypted only when needed, including the previous path (broadcast
    variables) where that did not happen.
    
    The change works by not encrypting data that is stored in memory: a
    serialized block held in memory is only encrypted once it is evicted
    to disk.
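
To make the idea concrete, here is a minimal sketch of "plaintext in memory, encrypted on eviction". It is not Spark's BlockManager code: the class ToyBlockStore, its method names, the AES/CTR cipher choice and the IV-prefixed file layout are all assumptions made for illustration.

```java
import javax.crypto.Cipher;
import javax.crypto.CipherInputStream;
import javax.crypto.CipherOutputStream;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.SecureRandom;
import java.util.HashMap;
import java.util.Map;

// Hypothetical toy store (not Spark's API): blocks are plaintext while in
// memory and are only encrypted when they are evicted to disk.
class ToyBlockStore {
    private static final int IV_LEN = 16; // AES block size, used as the IV length
    private final SecretKeySpec key;
    private final Path dir;
    private final Map<String, byte[]> memory = new HashMap<>();

    ToyBlockStore(byte[] aesKey, Path dir) {
        this.key = new SecretKeySpec(aesKey, "AES");
        this.dir = dir;
    }

    // In-memory blocks are stored as-is, with no encryption cost.
    void putInMemory(String blockId, byte[] data) { memory.put(blockId, data.clone()); }

    boolean isInMemory(String blockId) { return memory.containsKey(blockId); }

    // Eviction is the only place where encryption happens.
    // Assumed file layout: 16-byte random IV followed by the AES/CTR ciphertext.
    Path evictToDisk(String blockId) {
        byte[] data = memory.remove(blockId);
        if (data == null) throw new IllegalStateException("Block not in memory: " + blockId);
        Path file = dir.resolve(blockId);
        try (OutputStream raw = Files.newOutputStream(file)) {
            byte[] iv = new byte[IV_LEN];
            new SecureRandom().nextBytes(iv);
            raw.write(iv);
            Cipher cipher = Cipher.getInstance("AES/CTR/NoPadding");
            cipher.init(Cipher.ENCRYPT_MODE, key, new IvParameterSpec(iv));
            try (OutputStream enc = new CipherOutputStream(raw, cipher)) {
                enc.write(data);
            }
        } catch (Exception e) {
            throw new IllegalStateException("Eviction failed for " + blockId, e);
        }
        return file;
    }

    // Reads from memory when possible; otherwise decrypts the on-disk copy.
    byte[] read(String blockId) {
        byte[] cached = memory.get(blockId);
        if (cached != null) return cached.clone();
        try (InputStream raw = Files.newInputStream(dir.resolve(blockId))) {
            byte[] iv = new byte[IV_LEN];
            new DataInputStream(raw).readFully(iv);
            Cipher cipher = Cipher.getInstance("AES/CTR/NoPadding");
            cipher.init(Cipher.DECRYPT_MODE, key, new IvParameterSpec(iv));
            try (InputStream dec = new CipherInputStream(raw, cipher)) {
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] buf = new byte[4096];
                int n;
                while ((n = dec.read(buf)) != -1) out.write(buf, 0, n);
                return out.toByteArray();
            }
        } catch (Exception e) {
            throw new IllegalStateException("Read failed for " + blockId, e);
        }
    }
}
```

In this sketch a block that never leaves memory never touches a cipher, which is the common case the change is meant to speed up.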
    
    The penalty comes when transferring that encrypted data from disk. If the
    data ends up in memory again, it is as efficient as before; but if the
    evicted block needs to be transferred directly to a remote executor,
    there is now a performance penalty, since the code uses a custom
    FileRegion implementation to decrypt the data before transferring it.
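
The cost of that path can be illustrated with a small stdlib-only sketch. It does not use Netty's FileRegion and is not the implementation in this PR; the class DecryptingTransfer, its methods and the IV-prefixed AES/CTR file layout are assumptions. The point is only that the file's bytes have to pass through a cipher and a user-space buffer before reaching the target channel, instead of being handed to the channel directly.

```java
import javax.crypto.Cipher;
import javax.crypto.CipherInputStream;
import javax.crypto.CipherOutputStream;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;
import java.io.DataInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.SecureRandom;

// Hypothetical helper (not Netty's FileRegion, not Spark code).
class DecryptingTransfer {
    private static final int IV_LEN = 16;

    // Demo setup only: writes a 16-byte IV followed by the AES/CTR ciphertext.
    static void writeEncrypted(Path file, SecretKeySpec key, byte[] data) {
        try (OutputStream raw = Files.newOutputStream(file)) {
            byte[] iv = new byte[IV_LEN];
            new SecureRandom().nextBytes(iv);
            raw.write(iv);
            Cipher cipher = Cipher.getInstance("AES/CTR/NoPadding");
            cipher.init(Cipher.ENCRYPT_MODE, key, new IvParameterSpec(iv));
            try (OutputStream enc = new CipherOutputStream(raw, cipher)) {
                enc.write(data);
            }
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Streams the decrypted content of the file to the target channel and
    // returns the number of plaintext bytes written. Every byte is copied
    // through the cipher and an intermediate buffer.
    static long transferTo(Path encryptedFile, SecretKeySpec key, WritableByteChannel target) {
        try (InputStream raw = Files.newInputStream(encryptedFile)) {
            byte[] iv = new byte[IV_LEN];
            new DataInputStream(raw).readFully(iv);
            Cipher cipher = Cipher.getInstance("AES/CTR/NoPadding");
            cipher.init(Cipher.DECRYPT_MODE, key, new IvParameterSpec(iv));
            try (ReadableByteChannel src = Channels.newChannel(new CipherInputStream(raw, cipher))) {
                ByteBuffer buf = ByteBuffer.allocate(8192);
                long total = 0;
                while (src.read(buf) != -1) {
                    buf.flip();
                    while (buf.hasRemaining()) total += target.write(buf);
                    buf.clear();
                }
                return total;
            }
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```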
    
    This also means that block data transferred between executors is no
    longer encrypted (and thus relies on the network library's encryption
    support for secrecy). Shuffle blocks are still transferred in encrypted
    form, since the code handles them in a slightly different way. This also
    keeps compatibility with existing external shuffle services, which transfer
    encrypted shuffle blocks, and avoids having to make the external service
    aware of encryption at all.
    
    Another change in the disk store is that it now stores a tiny metadata
    file next to the file holding the block data; this is needed to accurately
    account for the decrypted block size, which may be significantly different
    from the size of the encrypted file on disk.
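
As an illustration of the sidecar idea, the sketch below stores the decrypted size of a block as a single long in a file next to the block file. The ".meta" suffix, the one-long format and the class BlockSizeMeta are assumptions, not the file name or layout used by this PR.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sidecar metadata (naming and format are assumptions).
class BlockSizeMeta {
    // The metadata file lives next to the block file: "<block file name>.meta".
    static Path metaFileFor(Path blockFile) {
        return blockFile.resolveSibling(blockFile.getFileName() + ".meta");
    }

    // Records the size of the block as it will be once decrypted.
    static void write(Path blockFile, long decryptedSize) {
        try (DataOutputStream out =
                 new DataOutputStream(Files.newOutputStream(metaFileFor(blockFile)))) {
            out.writeLong(decryptedSize);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns the recorded decrypted size without reading or decrypting the block.
    static long read(Path blockFile) {
        try (DataInputStream in =
                 new DataInputStream(Files.newInputStream(metaFileFor(blockFile)))) {
            return in.readLong();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With something like this, size accounting can use the recorded value rather than the length of the encrypted file.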
    
    The serialization and deserialization APIs in the SerializerManager no
    longer encrypt automatically; callers need to explicitly wrap their
    streams with an appropriate crypto stream before using these APIs.
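
The caller-side pattern might look roughly like the sketch below, written against the JDK only. ExplicitCryptoWrap and its methods are hypothetical stand-ins rather than SerializerManager's API, and passing the key and IV as raw byte arrays is a simplification: the serialize and deserialize helpers know nothing about encryption, and the caller decides whether to hand them a crypto stream.

```java
import javax.crypto.Cipher;
import javax.crypto.CipherInputStream;
import javax.crypto.CipherOutputStream;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.OutputStream;
import java.security.GeneralSecurityException;

// Hypothetical example (not SerializerManager): serialization is
// encryption-unaware, and callers wrap the stream when they need encryption.
class ExplicitCryptoWrap {
    // Encryption-unaware: writes to whatever stream it is given.
    static void serialize(Object value, OutputStream out) throws IOException {
        ObjectOutputStream oos = new ObjectOutputStream(out);
        oos.writeObject(value);
        oos.flush();
    }

    // Encryption-unaware: reads from whatever stream it is given.
    static Object deserialize(InputStream in) throws IOException, ClassNotFoundException {
        return new ObjectInputStream(in).readObject();
    }

    private static Cipher cipher(int mode, byte[] key, byte[] iv) throws GeneralSecurityException {
        Cipher c = Cipher.getInstance("AES/CTR/NoPadding");
        c.init(mode, new SecretKeySpec(key, "AES"), new IvParameterSpec(iv));
        return c;
    }

    // Caller that does not need encryption: passes the raw stream.
    static byte[] toPlainBytes(Object value) {
        try {
            ByteArrayOutputStream sink = new ByteArrayOutputStream();
            serialize(value, sink);
            return sink.toByteArray();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Caller that needs encryption: wraps the stream explicitly first.
    static byte[] toEncryptedBytes(Object value, byte[] key, byte[] iv) {
        try {
            ByteArrayOutputStream sink = new ByteArrayOutputStream();
            try (OutputStream enc = new CipherOutputStream(sink, cipher(Cipher.ENCRYPT_MODE, key, iv))) {
                serialize(value, enc);
            }
            return sink.toByteArray();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // The reading side mirrors it: wrap with a decrypting stream, then deserialize.
    static Object fromEncryptedBytes(byte[] bytes, byte[] key, byte[] iv) {
        try (InputStream dec = new CipherInputStream(
                 new ByteArrayInputStream(bytes), cipher(Cipher.DECRYPT_MODE, key, iv))) {
            return deserialize(dec);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```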
    
    As a result of these changes, some of the workarounds added in SPARK-19520
    are removed here.
    
    Testing: a new trait ("EncryptionFunSuite") was added that provides an easy
    way to run a test twice, with encryption on and off; broadcast, block
    manager and caching tests were modified to use this new trait so that the
    existing tests exercise both encrypted and non-encrypted paths. I also ran
    some applications with encryption turned on to verify that they still work,
    including streaming tests that failed without the fix for SPARK-19520.
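
The "run every test twice" idea can be sketched outside of ScalaTest as follows. This is not the EncryptionFunSuite trait itself: the class EncryptionRuns, its runTwice method, the use of a plain map as the configuration, and the off-then-on order are assumptions. The configuration key shown is Spark's I/O encryption switch.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical helper (not the EncryptionFunSuite trait): runs one test body
// twice, once with I/O encryption disabled and once with it enabled.
class EncryptionRuns {
    static final String ENCRYPTION_KEY = "spark.io.encryption.enabled";

    // Returns the labels of the runs that were executed, in order.
    static List<String> runTwice(String name, Consumer<Map<String, String>> body) {
        List<String> executed = new ArrayList<>();
        for (boolean enabled : new boolean[] {false, true}) {
            // Each run gets a fresh configuration with only the flag changed.
            Map<String, String> conf = new HashMap<>();
            conf.put(ENCRYPTION_KEY, Boolean.toString(enabled));
            body.accept(conf);
            executed.add(name + " (encryption = " + (enabled ? "on" : "off") + ")");
        }
        return executed;
    }
}
```

A test written once against the supplied configuration then covers both the encrypted and the non-encrypted path.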

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/vanzin/spark SPARK-19556

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17295.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #17295
    
----
commit 3aa752f9becdfe0e35a47d731736d942e3e5b3bf
Author: Marcelo Vanzin <van...@cloudera.com>
Date:   2017-02-10T23:59:51Z

    [SPARK-19556][core] Do not encrypt block manager data in memory.
    

----


