[jira] [Commented] (SPARK-33966) Two-tier encryption key management

2021-04-17 Thread Gidon Gershinsky (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17324421#comment-17324421
 ] 

Gidon Gershinsky commented on SPARK-33966:
--

This Jira (and its subtasks) requires considerable changes and additions in
Spark and the underlying format libraries. We are working on the design, but
neither it nor the implementation will be ready in time for the 3.2.0 release.

> Two-tier encryption key management
> --
>
> Key: SPARK-33966
> URL: https://issues.apache.org/jira/browse/SPARK-33966
> Project: Spark
> Issue Type: New Feature
> Components: SQL
> Affects Versions: 3.2.0
> Reporter: Gidon Gershinsky
> Priority: Major
>
> Columnar data formats (Parquet and ORC) have recently added a column 
> encryption capability. The data protection follows the practice of envelope 
> encryption, where the Data Encryption Key (DEK) is freshly generated for each 
> file/column, and is encrypted with a master key (or with an intermediate key
> that is in turn encrypted with a master key). The master keys are kept in a 
> centralized Key Management Service (KMS) - meaning that each Spark worker 
> needs to interact with a (typically slow) KMS server. 
> This Jira (and its sub-tasks) introduces an alternative approach that, on the
> one hand, preserves the best practice of generating fresh encryption keys for
> each data file/column and, on the other hand, allows Spark clusters to
> interact scalably with a KMS server by delegating that interaction to the
> application driver. This is done via two-tier management of the keys: a random
> Key Encryption Key (KEK) is generated by the driver, encrypted by the master
> key in the KMS, and distributed by the driver to the workers, which use it to
> encrypt the DEKs generated there by the Parquet or ORC libraries. In the write
> path, the workers distribute the KEKs to their executors/threads. In the read
> path, the workers fetch the encrypted KEKs from file metadata, have them
> decrypted via interaction with the driver, and share them among the
> executors/threads.
> The KEK layer further improves the scalability of key management, because
> neither the driver nor the workers need to interact with the KMS for each
> file/column.
> Stand-alone Parquet/ORC libraries (without Spark) and/or other frameworks
> (e.g., Presto, pandas) must be able to read/decrypt the files
> written/encrypted by this Spark-driven key management mechanism, and
> vice versa, provided that both sides have proper authorisation for using the
> master keys in the KMS.
> A link to a discussion/design doc is attached.
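
The two-tier flow described above can be sketched in a few lines of Python.
This is only an illustration under stated assumptions, not the design from the
attached doc and not the Parquet or ORC API: ToyKms, the wrap/unwrap helpers,
the metadata field names and the "mk1" master key ID are all hypothetical, and
the cipher is a standard-library stand-in for real key wrapping such as
AES-GCM. The sketch also assumes the driver unwraps a KEK once and shares the
result, which the description implies but does not spell out.

```python
import hashlib
import hmac
import os


def _keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    # SHA-256 in counter mode: a toy stand-in for a real cipher, not for production.
    out, counter = b"", 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(4, "big")).digest()
        counter += 1
    return out[:length]


def wrap(wrapping_key: bytes, key: bytes) -> bytes:
    # Encrypt `key` under `wrapping_key`; returns nonce || ciphertext || tag.
    nonce = os.urandom(12)
    ct = bytes(a ^ b for a, b in zip(key, _keystream(wrapping_key, nonce, len(key))))
    tag = hmac.new(wrapping_key, nonce + ct, hashlib.sha256).digest()
    return nonce + ct + tag


def unwrap(wrapping_key: bytes, blob: bytes) -> bytes:
    # Verify the tag, then decrypt; fails if the wrong wrapping key is used.
    nonce, ct, tag = blob[:12], blob[12:-32], blob[-32:]
    expected = hmac.new(wrapping_key, nonce + ct, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("wrong key or corrupted blob")
    return bytes(a ^ b for a, b in zip(ct, _keystream(wrapping_key, nonce, len(ct))))


class ToyKms:
    # Hypothetical in-memory KMS that counts round trips; master keys never leave it.
    def __init__(self):
        self._master_keys = {"mk1": os.urandom(32)}
        self.calls = 0

    def wrap(self, master_key_id: str, key: bytes) -> bytes:
        self.calls += 1
        return wrap(self._master_keys[master_key_id], key)

    def unwrap(self, master_key_id: str, blob: bytes) -> bytes:
        self.calls += 1
        return unwrap(self._master_keys[master_key_id], blob)


def driver_prepare_write(kms: ToyKms, master_key_id: str):
    # Driver: generate a random KEK and have the KMS encrypt it (one KMS call).
    kek = os.urandom(32)
    return kek, kms.wrap(master_key_id, kek)


def worker_write_file(kek: bytes, wrapped_kek: bytes, master_key_id: str):
    # Worker: fresh DEK per file/column, wrapped locally with the KEK (no KMS call).
    dek = os.urandom(32)
    metadata = {
        "master_key_id": master_key_id,
        "wrapped_kek": wrapped_kek,
        "wrapped_dek": wrap(kek, dek),
    }
    return dek, metadata


def driver_unwrap_kek(kms: ToyKms, metadata: dict) -> bytes:
    # Driver, read path: decrypt a KEK that a worker fetched from file metadata.
    return kms.unwrap(metadata["master_key_id"], metadata["wrapped_kek"])


def worker_read_file(kek: bytes, metadata: dict) -> bytes:
    # Worker, read path: recover the DEK locally with the shared KEK.
    return unwrap(kek, metadata["wrapped_dek"])


kms = ToyKms()
kek, wrapped_kek = driver_prepare_write(kms, "mk1")
files = [worker_write_file(kek, wrapped_kek, "mk1") for _ in range(100)]
calls_after_write = kms.calls

read_kek = driver_unwrap_kek(kms, files[0][1])
recovered = [worker_read_file(read_kek, md) for _, md in files]
all_match = all(r == dek for r, (dek, _) in zip(recovered, files))
print(calls_after_write, kms.calls, all_match)
```

In this sketch, writing 100 files costs a single KMS round trip and reading
them back costs one more, whereas wrapping every DEK directly with the master
key would need one KMS call per file/column in each direction.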



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33966) Two-tier encryption key management

2021-01-05 Thread DB Tsai (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17259454#comment-17259454
 ] 

DB Tsai commented on SPARK-33966:
-

cc [~dongjoon] [~chaosun] and [~viirya]

> Two-tier encryption key management
> --
>
> Key: SPARK-33966
> URL: https://issues.apache.org/jira/browse/SPARK-33966
> Project: Spark
> Issue Type: New Feature
> Components: SQL
> Affects Versions: 3.1.0
> Reporter: Gidon Gershinsky
> Priority: Major


