GitHub user vanzin opened a pull request:

    https://github.com/apache/spark/pull/16862

    [SPARK-19520][streaming] Do not encrypt data written to the WAL.

    Spark's I/O encryption uses an ephemeral key for each driver instance.
    So driver B cannot decrypt data written by driver A since it doesn't
    have the correct key.
    
    The write ahead log is used for recovery, thus needs to be readable by
    a different driver. So it cannot be encrypted by Spark's I/O encryption
    code.
    
    The BlockManager APIs used by the WAL code to write data encrypt it
    automatically, so changes are needed so that callers can opt out of
    encryption.
    
    Aside from that, the "putBytes" API in the BlockManager does not do
    encryption, so a separate situation arose where the WAL would write
    unencrypted data to the BM and, when those blocks were read, decryption
    would fail. So the WAL code needs to ask the BM to encrypt that data
    when encryption is enabled; this code is not optimal since it results
    in a (temporary) second copy of the data block in memory, but should be
    OK for now until a more performant solution is added. The non-encryption
    case should not be affected.
    
    Tested with new unit tests, and by running streaming apps that do
    recovery using the WAL data with I/O encryption turned on.
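    The interplay described above can be illustrated with a toy model. This is
    a sketch only, not the real BlockManager or WAL API: the `transform`
    helper and the key values are hypothetical stand-ins for Spark's I/O
    encryption, used just to show why WAL data must stay unencrypted while
    putBytes-style writes must be pre-encrypted.

    ```scala
    // Toy model of ephemeral per-driver encryption keys (hypothetical, not
    // Spark's actual crypto): XOR with a per-driver key stands in for the
    // real I/O encryption.
    object WalEncryptionSketch {
      // Symmetric toy "encryption": applying it twice with the same key
      // round-trips; with a different key it does not.
      def transform(data: Array[Byte], key: Byte): Array[Byte] =
        data.map(b => (b ^ key).toByte)

      def main(args: Array[String]): Unit = {
        val driverAKey: Byte = 0x2a
        val driverBKey: Byte = 0x17 // a restarted driver gets a new ephemeral key
        val record = "wal-record".getBytes("UTF-8")

        // If the WAL payload were encrypted with driver A's key, a
        // recovering driver B could not read it back:
        val encrypted = transform(record, driverAKey)
        val badRecovery = transform(encrypted, driverBKey)
        assert(!java.util.Arrays.equals(badRecovery, record))

        // Blocks handed to the block manager through a putBytes-style API
        // skip the automatic encryption, so when I/O encryption is on the
        // caller must encrypt up front for the read path (which always
        // decrypts) to round-trip correctly:
        val stored = transform(record, driverAKey)   // WAL pre-encrypts for the BM
        val readBack = transform(stored, driverAKey) // BM read path decrypts
        assert(java.util.Arrays.equals(readBack, record))
        println("sketch ok")
      }
    }
    ```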

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/vanzin/spark SPARK-19520

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/16862.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #16862
    
----
commit bdbe267617f14392455881862571561724f7d5de
Author: Marcelo Vanzin <van...@cloudera.com>
Date:   2017-02-08T22:18:14Z

    [SPARK-19520][streaming] Do not encrypt data written to the WAL.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
