Thanks for the feedback to the doc, we are also closely following the Parquet encryption work and would like to have that in Iceberg as soon as possible with the right architecture. Here are some brief thoughts for the points you mentioned in the email, I will add more details in the google doc:
*Key rotation* My initial thought was to consider key rotation as a separated process and DEK rewrapping can be done with a Spark stored procedure, that's why I did not add any detail for it. But your point about the work needed to rewrite and clean up manifests is a really good point that I should fully describe the details. For instance, we can choose to store the encrypted DEKs inside the manifest or as a separated instruction file with a pointer in key_metadata, and there are tradeoffs for those approaches. I will update the doc for these details. *Acceleration of KMS interactions* Thanks for bringing up double wrapping, I was hesitant to mention that in the initial version of the doc because it would add complexity for understanding the overall architecture. And for the use cases I have seen with AWS KMS, people are all using single-wrapping and the service was able to handle generation of millions of DEKs, and it seems like there was no complaint about it. I think the right way to go is to support the 3 common cases: (1) direct DEK ID, (2) KEK ID + encrypted DEK, (3) MEK ID + encrypted KEK + encrypted DEK, and that should be enough to cover most of the use cases with different types of KMS. I will update the encryption spec with more details on that. *DDL clauses for encryption and key rotation* These definitely make sense to me. I will add a list of the DDL clauses I was thinking about to the doc. *Cryptographic integrity of Data Tables* Yes, I think in this doc at least the location and structure of AAD prefix should be discussed, so hopefully we can reach some general consensus for integrity support for Iceberg tables and make sure the right information is in place or can be added later. I am also working on a POC to flush out some details for the aspects described in the doc, I will update in this thread once I publish that. Best, Jack Ye On Tue, Mar 23, 2021 at 5:04 AM Gidon Gershinsky <gg5...@gmail.com> wrote: > Hi Jack, > > We're working on Parquet encryption, which is about to be released in the > upcoming parquet-mr-1.12 version. Recently, we've started to look into its > integration in Iceberg. It became immediately clear we need to take a wider > view that covers other types of encryption in Iceberg (file streams and > ORC); otherwise, we'd end up with a number of silos. > At the time, there was no top-down design for data encryption in Iceberg, > so we've started to tinker with it. But now we can base this on your > document. I really liked it, a solid foundation. > > There are a number of high-level concepts I believe we'd need to add there: > > - Key rotation in Iceberg > (Not just in KMS). The envelope encryption practice requires periodic (or > on-demand) re-wrapping of DEKs with new versions of master keys. KMS > generates the new versions, and keeps the master key history, but the > re-wrapped DEKs need to be updated in Iceberg metadata. If key_metadata is > kept in manifest files, this means all manifest files must be deleted > (because they keep DEKs wrapped with the previous master key version, which > is not safe anymore), and created again with the updated key_metadata > field. We've quickly discussed this with Anton, seems to be feasible, but > there are other alternatives. We need to decide if manifests are the right > place to store all key_metadata; and to design a mechanism (potentially > with a DDL clause) to perform the rotation operation. > > - Acceleration of KMS interactions > KMSs can be very slow, especially when backed by HSMs. Per the doc, "The > KEK is stored in a key management service (KMS) to control access and key > rotation." We should not fetch secret keys from KMS, because this exposes > them; instead, many KMSs allow to wrap/encrypt DEKs inside the KMS server, > without ever exposing the master keys. But since we have to generate a DEK > per file/column, we'll end up with many KMS wrap calls when writing the > data (and many unwrap calls when reading the data). That's why Parquet > encryption uses a concept of double wrapping, where DEKs are wrapped with > KEKs, and KEKs are wrapped with master keys (MEKs). Only MEKs are > stored/managed inside KMS. > > - DDL clauses for encryption and key rotation, such as > ALTER TABLE .. KEY_ROTATION (params) > ALTER TABLE .. ENCRYPT (params): encrypts existing table (with plaintext > files) - Russell's proposal > CREATE TABLE ... ENCRYPTION (params) ; or simply use the TBLPROPERTIES > Btw, we can re-use the joint ORC/Parquet column encryption parameter > format, defined in this jira discussion started by Xinli - > https://issues.apache.org/jira/browse/HIVE-21848 > > - Cryptographic integrity of Data Tables > Besides protecting data confidentiality, we need to protect its integrity > against tampering attacks. This one is a longer term work item, based on > these tickets: > https://github.com/apache/iceberg/issues/44, > https://github.com/apache/iceberg/issues/2060, > https://github.com/apache/iceberg/issues/2073 > We'll work on these at a later stage, after the confidentiality basis is > ready; but we need to make sure the current work on confidentiality enables > (or at least doesn't block) the future integrity work. For example, we can > start using https://github.com/apache/iceberg/issues/2060 sooner rather > than later, for encrypting the Iceberg metadata files and Avro data files. > > That was a high level description, I'll add detailed comments inside the > design googledoc. > > Cheers, Gidon > > > On Mon, Mar 22, 2021 at 7:25 PM Jack Ye <yezhao...@gmail.com> wrote: > >> Hi everyone, >> >> To continue the discussion in the last sync meeting about encryption in >> Iceberg, here is the document for a proposal: >> >> >> https://docs.google.com/document/d/1kkcjr9KrlB9QagRX3ToulG_Rf-65NMSlVANheDNzJq4/edit?usp=sharing >> >> Would be very appreciated for any feedback. >> >> Best, >> Jack Ye >> >