[ 
https://issues.apache.org/jira/browse/SPARK-33966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17259454#comment-17259454
 ] 

DB Tsai commented on SPARK-33966:
---------------------------------

cc [~dongjoon] [~chaosun] and [~viirya]

> Two-tier encryption key management
> ----------------------------------
>
>                 Key: SPARK-33966
>                 URL: https://issues.apache.org/jira/browse/SPARK-33966
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 3.1.0
>            Reporter: Gidon Gershinsky
>            Priority: Major
>
> Columnar data formats (Parquet and ORC) have recently added a column 
> encryption capability. The data protection follows the practice of envelope 
> encryption, where the Data Encryption Key (DEK) is freshly generated for each 
> file/column and is encrypted with a master key (or an intermediate key that 
> is in turn encrypted with a master key). The master keys are kept in a 
> centralized Key Management Service (KMS) - meaning that each Spark worker 
> needs to interact with a (typically slow) KMS server. 
> This Jira (and its sub-tasks) introduces an alternative approach that on one 
> hand preserves the best practice of generating fresh encryption keys for each 
> data file/column, and on the other hand allows Spark clusters to have a 
> scalable interaction with a KMS server, by delegating it to the application 
> driver. This is done via two-tier management of the keys, where a random Key 
> Encryption Key (KEK) is generated by the driver, encrypted by the master key 
> in the KMS, and distributed by the driver to the workers, so they can use it 
> to encrypt the DEKs, generated there by Parquet or ORC libraries. In the 
> workers, the KEKs are distributed to the executors/threads in the write path. 
> In the read path, the encrypted KEKs are fetched by workers from file 
> metadata, decrypted via interaction with the driver, and shared among the 
> executors/threads.
> The KEK layer further improves scalability of the key management, because 
> neither the driver nor the workers need to interact with the KMS for each file/column.
> Stand-alone Parquet/ORC libraries (without Spark) and/or other frameworks 
> (e.g., Presto, pandas) must be able to read/decrypt the files 
> written/encrypted by this Spark-driven key management mechanism - and 
> vice-versa. [of course, only if both sides have proper authorisation for 
> using the master keys in the KMS]
> A link to a discussion/design doc is attached.
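
The two-tier scheme described above can be illustrated with a minimal, self-contained sketch. It is not the Spark, Parquet, or ORC API: the class TwoTierKeySketch, its methods, and the kmsCalls counter are hypothetical names, the master key is held in memory only to stand in for a real KMS, and AES-GCM is assumed as the key-wrapping cipher for illustration. The driver wraps one KEK with the master key, workers wrap a fresh DEK per file with that KEK, and the read path unwraps in the reverse order.

```java
import javax.crypto.Cipher;
import javax.crypto.spec.GCMParameterSpec;
import javax.crypto.spec.SecretKeySpec;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.util.Arrays;

// Hypothetical sketch of two-tier key wrapping; not Spark, Parquet, or ORC code.
class TwoTierKeySketch {
    static final SecureRandom RANDOM = new SecureRandom();
    static final int NONCE_BYTES = 12; // common nonce length for AES-GCM
    static final int TAG_BITS = 128;   // GCM authentication tag length
    static int kmsCalls = 0;           // counts simulated KMS round trips

    // Generate a random 128-bit AES key (used for master key, KEK, and DEKs).
    static byte[] randomKey() {
        byte[] key = new byte[16];
        RANDOM.nextBytes(key);
        return key;
    }

    // Encrypt keyToWrap under wrappingKey; output is nonce || ciphertext || tag.
    static byte[] wrap(byte[] wrappingKey, byte[] keyToWrap) {
        try {
            byte[] nonce = new byte[NONCE_BYTES];
            RANDOM.nextBytes(nonce);
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(wrappingKey, "AES"),
                    new GCMParameterSpec(TAG_BITS, nonce));
            byte[] ciphertext = cipher.doFinal(keyToWrap);
            byte[] out = Arrays.copyOf(nonce, NONCE_BYTES + ciphertext.length);
            System.arraycopy(ciphertext, 0, out, NONCE_BYTES, ciphertext.length);
            return out;
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // Reverse of wrap: read the nonce prefix, then decrypt and verify the tag.
    static byte[] unwrap(byte[] wrappingKey, byte[] wrapped) {
        try {
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(Cipher.DECRYPT_MODE, new SecretKeySpec(wrappingKey, "AES"),
                    new GCMParameterSpec(TAG_BITS, wrapped, 0, NONCE_BYTES));
            return cipher.doFinal(wrapped, NONCE_BYTES, wrapped.length - NONCE_BYTES);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // Stand-ins for the KMS: in a real deployment the master key never leaves it.
    static byte[] kmsWrap(byte[] masterKey, byte[] kek) {
        kmsCalls++;
        return wrap(masterKey, kek);
    }

    static byte[] kmsUnwrap(byte[] masterKey, byte[] wrappedKek) {
        kmsCalls++;
        return unwrap(masterKey, wrappedKek);
    }

    public static void main(String[] args) {
        byte[] masterKey = randomKey(); // assumption: simulated KMS master key

        // Driver: generate one KEK and wrap it with the master key (one KMS call).
        byte[] kek = randomKey();
        byte[] wrappedKek = kmsWrap(masterKey, kek);

        // Workers: a fresh DEK per file, wrapped locally with the KEK (no KMS call).
        int files = 1000;
        byte[][] deks = new byte[files][];
        byte[][] wrappedDeks = new byte[files][];
        for (int i = 0; i < files; i++) {
            deks[i] = randomKey();
            wrappedDeks[i] = wrap(kek, deks[i]); // stored in file metadata with wrappedKek
        }

        // Read path: unwrap the KEK once, then unwrap every DEK locally.
        byte[] kekForRead = kmsUnwrap(masterKey, wrappedKek);
        for (int i = 0; i < files; i++) {
            if (!Arrays.equals(deks[i], unwrap(kekForRead, wrappedDeks[i]))) {
                throw new AssertionError("DEK round trip failed for file " + i);
            }
        }
        System.out.println("files=" + files + ", KMS calls=" + kmsCalls);
    }
}
```

In this sketch, writing and reading 1000 files costs two simulated KMS round trips (one wrap, one unwrap) rather than one per file, which is the scalability argument made in the description. How the KEK travels between driver and workers, and how authorisation is checked, is left out here.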



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

