[ https://issues.apache.org/jira/browse/SPARK-33966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17259454#comment-17259454 ]
DB Tsai commented on SPARK-33966:
---------------------------------

cc [~dongjoon] [~chaosun] and [~viirya]

> Two-tier encryption key management
> ----------------------------------
>
> Key: SPARK-33966
> URL: https://issues.apache.org/jira/browse/SPARK-33966
> Project: Spark
> Issue Type: New Feature
> Components: SQL
> Affects Versions: 3.1.0
> Reporter: Gidon Gershinsky
> Priority: Major
>
> Columnar data formats (Parquet and ORC) have recently added a column
> encryption capability. The data protection follows the practice of envelope
> encryption, where the Data Encryption Key (DEK) is freshly generated for each
> file/column, and is encrypted with a master key (or with an intermediate key
> that is in turn encrypted with a master key). The master keys are kept in a
> centralized Key Management Service (KMS), meaning that each Spark worker
> needs to interact with a (typically slow) KMS server.
> This Jira (and its sub-tasks) introduces an alternative approach that, on one
> hand, preserves the best practice of generating fresh encryption keys for each
> data file/column, and, on the other hand, allows Spark clusters to have a
> scalable interaction with a KMS server by delegating it to the application
> driver. This is done via two-tier management of the keys, where a random Key
> Encryption Key (KEK) is generated by the driver, encrypted by the master key
> in the KMS, and distributed by the driver to the workers, so they can use it
> to encrypt the DEKs generated there by the Parquet or ORC libraries. In the
> write path, the KEKs are distributed within the workers to the
> executors/threads. In the read path, the encrypted KEKs are fetched by workers
> from file metadata, decrypted via interaction with the driver, and shared
> among the executors/threads.
> The KEK layer further improves scalability of the key management, because
> neither the driver nor the workers need to interact with the KMS for each
> file/column.
> Stand-alone Parquet/ORC libraries (without Spark) and/or other frameworks
> (e.g., Presto, pandas) must be able to read/decrypt the files
> written/encrypted by this Spark-driven key management mechanism, and
> vice versa (of course, only if both sides have proper authorisation for
> using the master keys in the KMS).
> A link to a discussion/design doc is attached.
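
For readers who want to see the two-tier flow in the description concretely, here is a minimal Python sketch. It is not the Spark, Parquet or ORC implementation. Every name in it (`MockKms`, `toy_wrap`, `driver_create_kek`, `worker_write_file`, and so on) is hypothetical. The "wrap" function is a deliberately insecure XOR stand-in for a real key-wrapping cipher, and the KEK cache is an assumed stand-in for the sharing of decrypted KEKs mentioned in the read path. The only thing it tries to show is that the KMS call count stays constant as the number of files grows.

```python
import hashlib
import secrets


def toy_wrap(wrapping_key: bytes, key: bytes) -> bytes:
    """Toy key wrap (XOR with a SHA-256-derived pad). NOT secure; illustration only."""
    nonce = secrets.token_bytes(16)
    pad = hashlib.sha256(wrapping_key + nonce).digest()[: len(key)]
    return nonce + bytes(a ^ b for a, b in zip(key, pad))


def toy_unwrap(wrapping_key: bytes, wrapped: bytes) -> bytes:
    """Inverse of toy_wrap: recompute the pad from the stored nonce and XOR again."""
    nonce, body = wrapped[:16], wrapped[16:]
    pad = hashlib.sha256(wrapping_key + nonce).digest()[: len(body)]
    return bytes(a ^ b for a, b in zip(body, pad))


class MockKms:
    """Hypothetical in-memory KMS that counts how often it is called."""

    def __init__(self):
        self._master_keys = {"mk1": secrets.token_bytes(16)}
        self.calls = 0

    def wrap(self, master_key_id: str, key: bytes) -> bytes:
        self.calls += 1
        return toy_wrap(self._master_keys[master_key_id], key)

    def unwrap(self, master_key_id: str, wrapped: bytes) -> bytes:
        self.calls += 1
        return toy_unwrap(self._master_keys[master_key_id], wrapped)


def driver_create_kek(kms: MockKms, master_key_id: str):
    """Driver side: generate a random KEK and wrap it once with the master key."""
    kek = secrets.token_bytes(16)
    return kek, kms.wrap(master_key_id, kek)


def worker_write_file(kek: bytes, wrapped_kek: bytes):
    """Worker side: fresh DEK per file, wrapped locally with the KEK (no KMS call)."""
    dek = secrets.token_bytes(16)
    metadata = {"wrapped_dek": toy_wrap(kek, dek), "wrapped_kek": wrapped_kek}
    return dek, metadata


def driver_unwrap_kek(kms: MockKms, master_key_id: str, wrapped_kek: bytes, cache: dict) -> bytes:
    """Driver side on read: unwrap each distinct KEK once (the cache is an assumption)."""
    if wrapped_kek not in cache:
        cache[wrapped_kek] = kms.unwrap(master_key_id, wrapped_kek)
    return cache[wrapped_kek]


def worker_read_file(metadata: dict, kek: bytes) -> bytes:
    """Worker side on read: recover the DEK locally from file metadata."""
    return toy_unwrap(kek, metadata["wrapped_dek"])


def demo(num_files: int = 100) -> int:
    """Write and read back num_files files; return the total number of KMS calls."""
    kms = MockKms()
    kek, wrapped_kek = driver_create_kek(kms, "mk1")
    written = [worker_write_file(kek, wrapped_kek) for _ in range(num_files)]
    cache = {}
    for dek, metadata in written:
        read_kek = driver_unwrap_kek(kms, "mk1", metadata["wrapped_kek"], cache)
        assert worker_read_file(metadata, read_kek) == dek
    return kms.calls
```

In this sketch, `demo(100)` and `demo(1)` both make the same number of KMS calls (one wrap on write, one unwrap on read), which is the scalability property the description attributes to the KEK layer. A real deployment would replace `toy_wrap`/`toy_unwrap` with an authenticated cipher and `MockKms` with an actual KMS client.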