I had a quick look at the new comments, shouldn't be a problem to factor them in.
Re section 5.4 - it describes an issue that is resolved today with a workaround (in the current Parquet code, for regular files). However, the workaround stops working when files are encrypted. Therefore, the new fields are required for parsing encrypted files, and are nice-to-have for parsing regular files. I'll add details in the pr comments (some are already available in the pr itself). Still, for the purpose of this discussion - these fields are indeed related to encryption indirectly, and have a wider applicability. The encryption spec doesn't have to allocate a separate section for them; can mention this point in a sentence or two in the section 5.2. TBD at the pr. Cheers, Gidon. On Mon, Dec 3, 2018 at 9:48 PM Ryan Blue <rb...@netflix.com> wrote: > I added another round of reviews. I didn't find many technical changes, > but I think that several things should be more clear to make it easier for > people to implement the spec. > > The main technical problem I have is the addition of section 5.4 about > vectorization. These extra fields don't appear to be required for > encryption, unless I'm missing something. Can someone explain why they were > added to the encryption doc and not added separately as an optimization for > certain readers? > > rb > > On Sat, Dec 1, 2018 at 3:23 PM Ryan Blue <rb...@netflix.com> wrote: > >> I'll take a look at it again. Sorry for the delay. >> >> On Sat, Dec 1, 2018 at 5:04 AM Gidon Gershinsky <gg5...@gmail.com> wrote: >> >>> Xinli, thanks, this is a team / community effort. E.g., in the latest >>> round, I had many constructive comments from Ryan and Zoltan. >>> The design and document got better as the result. >>> >>> Ryan, its been a while since then. Let us know if the doc is ok now. We >>> will resume the work once the document is signed off. >>> >>> Cheers, Gidon. >>> >>> >>> On Fri, Nov 30, 2018 at 10:48 PM Xinli shang <sha...@uber.com.invalid> >>> wrote: >>> >>>> + 1 (non-binding). The latest one looks good to me! Thanks, Gidon! >>>> >>>> On Thu, Oct 25, 2018 at 4:17 PM Ryan Blue <rb...@netflix.com.invalid> >>>> wrote: >>>> >>>> > Thanks, Gidon! I'll have a look as soon as I get some free time. >>>> > >>>> > On Thu, Oct 25, 2018 at 6:20 AM Gidon Gershinsky <gg5...@gmail.com> >>>> wrote: >>>> > >>>> > > Ryan, >>>> > > >>>> > > Your suggestions have been addressed, please check out the >>>> > > https://github.com/apache/parquet-format/pull/114/files >>>> > > >>>> > > Cheers, Gidon. >>>> > > >>>> > > >>>> > > ---------- Forwarded message --------- >>>> > > From: Gidon Gershinsky <gg5...@gmail.com> >>>> > > Date: Tue, Oct 23, 2018 at 10:32 AM >>>> > > Subject: Re: [VOTE] Modular Encryption design sign-off >>>> > > To: <dev@parquet.apache.org> >>>> > > >>>> > > >>>> > > Regarding the EncryptedColumnChunk, we might be talking about two >>>> > different >>>> > > things, looks like we need a deeper tech dive to sync up and get to >>>> the >>>> > > bottom of this. >>>> > > I'll <> send you a separate mail on this subject, but no doubt we'll >>>> > > figure it out one way or another. >>>> > > >>>> > > It is the only discussion point we have left, since all other >>>> suggestions >>>> > > sound good to me - the md doc can certainly use a substantial >>>> addition of >>>> > > explicit details for every feature >>>> > > of the encryption format, to make it a full specification document. >>>> > > >>>> > > Cheers, Gidon. >>>> > > >>>> > > On Mon, Oct 22, 2018 at 9:36 PM Ryan Blue <rb...@netflix.com.invalid >>>> > >>>> > > wrote: >>>> > > >>>> > > > Using a new magic string (PARE) for files with a plaintext footer >>>> would >>>> > > not >>>> > > > allow legacy readers to parse these files. >>>> > > > This should be clear from reading the spec. I see that the >>>> plaintext >>>> > > footer >>>> > > > section states “Then the footer is written as usual, followed by >>>> …, >>>> > and a >>>> > > > final magic string, ‘PAR1’” but, like the other section, this is >>>> not >>>> > > > clearly called out and nothing states that this is applies to the >>>> > header >>>> > > > magic as well. The spec needs to include a statement like “the >>>> magic >>>> > > bytes >>>> > > > for plaintext footer mode are ‘PAR1’ to allow older readers to >>>> read >>>> > > > projections of the file that do not include encrypted columns”. >>>> > > Similarly, >>>> > > > the other section needs to clearly state what the magic string is >>>> for >>>> > > > encrypted footers. >>>> > > > >>>> > > > Support for multiple encryption algorithms is one of the >>>> > differentiating >>>> > > > features of Parquet encryption, therefore having a second cipher >>>> > upfront >>>> > > is >>>> > > > significant. >>>> > > > >>>> > > > That sounds okay to me since CTR is a subset of GCM so it isn’t >>>> that >>>> > > large >>>> > > > of a change. The spec does need to be explicit about exactly how >>>> the >>>> > two >>>> > > > relate. For example, CTR resources tend to use the term “nonce” >>>> where >>>> > GCM >>>> > > > uses IV, so it needs to be explicitly stated that the GCM IV is >>>> used as >>>> > > the >>>> > > > CTR nonce. >>>> > > > >>>> > > > EncryptedColumnChunk structure will be identical to a >>>> ColumnChunk. The >>>> > > > reasons are: … >>>> > > > >>>> > > > These structures should be defined in the encryption spec. Some >>>> details >>>> > > can >>>> > > > be included by reference, like “the remaining fields match those >>>> in >>>> > > > ColumnMetaData” but there isn’t enough information in the current >>>> spec >>>> > > > draft. >>>> > > > >>>> > > > My point about file_offset is that this field already has a use >>>> and a >>>> > > > definition and this spec should not change its use or purpose. If >>>> you >>>> > > need >>>> > > > a field for the location of an encrypted ColumnMetaData, then >>>> > > > EncryptedColumnChunk should add one. >>>> > > > >>>> > > > EncryptedColumnChunk structure will be identical to a >>>> ColumnChunk. … >>>> > > > ColumnCryptoMetaData is needed in ColumnChunk >>>> > > > >>>> > > > How is EncryptedColumnChunk identical to ColumnChunk if it also >>>> > includes >>>> > > an >>>> > > > extra structure for crypto? >>>> > > > >>>> > > > There are no options 1,2 in the NIST spec section 8.2.2. >>>> > > > >>>> > > > I’m referring to this section: “The random field shall either >>>> consist >>>> > of >>>> > > 1) >>>> > > > an output string of r(i) bits from an approved RBG with a >>>> sufficient >>>> > > > security strength, or 2) the result of applying the r(i)–bit >>>> > incrementing >>>> > > > function to the random field of the preceding IV for the given >>>> key” >>>> > > > >>>> > > > My main point is that reusing any IV with the same key is a >>>> problem. >>>> > The >>>> > > > spec must explicitly state that no IV can be used with the same >>>> key. >>>> > > Sounds >>>> > > > like you’re confident that the approach you’re using is >>>> sufficient, but >>>> > > if >>>> > > > it isn’t stated in the spec then other implementations might do it >>>> > > > incorrectly. Referencing section 8.2.2 or relying on other people >>>> > having >>>> > > > knowledge that you do is not sufficient for a spec. >>>> > > > >>>> > > > I also have a minor concern over the amount of crypto random that >>>> this >>>> > > may >>>> > > > require because Parquet requires so many encryption streams, but >>>> that’s >>>> > > an >>>> > > > implementation detail and doesn’t need to block the spec. >>>> > > > >>>> > > > On Sat, Oct 20, 2018 at 2:22 AM Gidon Gershinsky < >>>> gg5...@gmail.com> >>>> > > wrote: >>>> > > > >>>> > > > > Ryan, >>>> > > > > >>>> > > > > Thank you for the clear and detailed feedback. >>>> > > > > >>>> > > > > 1. Using a new magic string (PARE) for files with a plaintext >>>> footer >>>> > > > would >>>> > > > > not allow legacy readers to parse these files. >>>> > > > > >>>> > > > > 2. Support for multiple encryption algorithms is one of the >>>> > > > differentiating >>>> > > > > features of Parquet encryption, therefore having a second cipher >>>> > > upfront >>>> > > > is >>>> > > > > significant. >>>> > > > > The design/review overhead of adding a CTR-based cipher to GCM >>>> is >>>> > > > minimal. >>>> > > > > In fact, CTR is a subset of GCM, because the latter is built on >>>> CTR >>>> > > (and >>>> > > > > GMAC). >>>> > > > > Since you explicitly mark your suggestion as a minor note, I >>>> propose >>>> > we >>>> > > > > keep this part unchanged. >>>> > > > > >>>> > > > > 3. EncryptedColumnChunk structure will be identical to a >>>> > ColumnChunk. >>>> > > > The >>>> > > > > reasons are: >>>> > > > > - In both EF and PF modes (encrypted/plaintext footer), the >>>> > > > ColumnMetaData >>>> > > > > is not always separated. >>>> > > > > It is kept inside ColumnChunk if the column is plaintext, or >>>> > encrypted >>>> > > > > with the footer key in the EF mode. No reason to separate it >>>> then. >>>> > > > > - ColumnCryptoMetaData is needed in ColumnChunk to support >>>> column >>>> > > > > encryption in PF mode (and in EF mode). >>>> > > > > - We can make the file_offset an optional field. But actually >>>> it is >>>> > not >>>> > > > set >>>> > > > > anyway in today's Parquet, which works only with the optional >>>> > meta_data >>>> > > > > field. So, this reason alone shouldn't be enough to create a new >>>> > Thrift >>>> > > > > structure that fully replicates an existing one, save for >>>> optional vs >>>> > > > > required classifier in one field. >>>> > > > > >>>> > > > > 4. There are no options 1,2 in the NIST spec section 8.2.2. This >>>> > > subject >>>> > > > > had been extensively discussed by community during the design >>>> review: >>>> > > > > - If you mean the free field option there - its inapplicable >>>> for the >>>> > > > 96-bit >>>> > > > > IV we are using. >>>> > > > > Per the spec, 96-bit IV is fully random, 0 free field length. >>>> 96 bit >>>> > > is a >>>> > > > > recommended length for AES-GCM IV, see discussion at >>>> > > > > >>>> > > > > >>>> > > > >>>> > > >>>> > >>>> https://crypto.stackexchange.com/questions/41601/aes-gcm-recommended-iv-size-why-12-bytes/41610 >>>> > > > > eg "*For GCM a 12 byte IV is strongly suggested as other IV >>>> lengths >>>> > > will >>>> > > > > require additional calculations.*" >>>> > > > > - If you mean the 8.2.1 section, with deterministic construction >>>> > > option - >>>> > > > > its highly problematic, since deterministic IV prefix ("fixed >>>> field") >>>> > > > > "*shall >>>> > > > > identify the device, or, more generally, the context for the >>>> instance >>>> > > of >>>> > > > > the authenticated encryption function*". This is not trivial in >>>> a >>>> > > single >>>> > > > > process, and even more challenging in distributed frameworks >>>> like >>>> > Spark >>>> > > > > ("*no >>>> > > > > two distinct devices shall share the same fixed field*"). >>>> Moreover, >>>> > > this >>>> > > > > can be unsafe - a mistake in prefix creation can lead to >>>> identical >>>> > IVs >>>> > > - >>>> > > > > which breaks the encryption. >>>> > > > > - The good news is we don't need these complications. The 8.2.2 >>>> > allows >>>> > > us >>>> > > > > to have 4 billion random IVs for a single key before any >>>> collision >>>> > > would >>>> > > > > happen in IVs that are open to attackers. With Parquet modular >>>> > > > encryption, >>>> > > > > the number is more like 400 billion, because most (>99%) of IVs >>>> are >>>> > > > hidden >>>> > > > > from an attacker - their offset is encrypted with the >>>> column/footer >>>> > > keys. >>>> > > > > Of course, we use the SecureRandom class, seeded properly. >>>> > > > > - Besides checking the NIST spec and crypto forums, I work >>>> directly >>>> > > with >>>> > > > > cipher technology experts - who confirm the Parquet approach to >>>> IV >>>> > > > > generation is the right thing to do in distributed systems. >>>> > > > > >>>> > > > > >>>> > > > > All other suggestions sound good, and are not hard to add to the >>>> > spec. >>>> > > > I'll >>>> > > > > update the document accordingly. >>>> > > > > >>>> > > > > Cheers, Gidon. >>>> > > > > >>>> > > > > On Fri, Oct 19, 2018 at 7:38 PM Ryan Blue >>>> <rb...@netflix.com.invalid >>>> > > >>>> > > > > wrote: >>>> > > > > >>>> > > > > > After thinking about this more last night, I think it also >>>> needs to >>>> > > be >>>> > > > > more >>>> > > > > > careful around IV generation. The spec just says that >>>> > implementations >>>> > > > > > should use RBG-based IV construction, but it needs to be more >>>> > > explicit >>>> > > > > that >>>> > > > > > no two GCM streams use the same IV. If an IV is reused, then >>>> there >>>> > > is a >>>> > > > > > possible plaintext leak when blocks are padded with a >>>> predictable >>>> > > > > pattern. >>>> > > > > > >>>> > > > > > I think the spec should state that an initial IV should be >>>> > generated >>>> > > > > using >>>> > > > > > secure random as described by 8.2.2 option 1 and the IV for >>>> each >>>> > > > > encrypted >>>> > > > > > stream should be generated using 8.2.2 option 2 until there >>>> is a >>>> > > > > collision, >>>> > > > > > when the base IV should be regenerated using 8.2.2 option 1. >>>> That >>>> > > will >>>> > > > > > ensure that each IV is unique, but that Parquet doesn't >>>> consume a >>>> > > huge >>>> > > > > > amount of crypto randomness. >>>> > > > > > >>>> > > > > > rb >>>> > > > > > >>>> > > > > > On Thu, Oct 18, 2018 at 6:15 PM Ryan Blue <rb...@netflix.com> >>>> > wrote: >>>> > > > > > >>>> > > > > > > -1 >>>> > > > > > > >>>> > > > > > > - I don’t think that it contains enough information for >>>> > someone >>>> > > to >>>> > > > > > > successfully implement the proposal. For example, this >>>> > > references >>>> > > > > > thrift >>>> > > > > > > structures like AesGcmV1 and AesGcmCtrV1, but doesn’t >>>> > introduce >>>> > > > them >>>> > > > > > and >>>> > > > > > > explain what they are or detail what they contain. >>>> Similarly, >>>> > > > > > > ColumnCryptoMetaData and its sub-types are not defined. >>>> It >>>> > looks >>>> > > > > like >>>> > > > > > this >>>> > > > > > > spec relies on thrift, but I think that it should be >>>> > > > self-contained. >>>> > > > > > > - It should be more clear exactly how to construct each >>>> GCM or >>>> > > CTR >>>> > > > > > > encrypted buffer. Is it necessary to specify padding for >>>> > > AES-GCM? >>>> > > > I >>>> > > > > > see >>>> > > > > > > from a quick google search that some implementations do, >>>> but >>>> > > > there’s >>>> > > > > > > nothing in this doc. >>>> > > > > > > - I don’t think that this explains enough of the intent >>>> that >>>> > is >>>> > > > > behind >>>> > > > > > > the specification. For example, “Key metadata is a >>>> free-form >>>> > > byte >>>> > > > > > array >>>> > > > > > > that can be used by a reader to retrieve the column >>>> encryption >>>> > > > key” >>>> > > > > > should >>>> > > > > > > be more clear about what this byte array actually >>>> contains or >>>> > > > > > represents. >>>> > > > > > > Will it identify a key that can be fetched from a store? >>>> Will >>>> > it >>>> > > > be >>>> > > > > a >>>> > > > > > > randomly-generated key stored encrypted by another key? >>>> > > > > > > - This changes the definition of ColumnChunk’s >>>> file_offset >>>> > field >>>> > > > and >>>> > > > > > > the location of ColumnMetaData in some cases. Instead of >>>> > > > repurposing >>>> > > > > > thrift >>>> > > > > > > structures, the spec should introduce new ones, like an >>>> > > > > > > EncryptedColumnChunk that preserves the meaning of >>>> fields in >>>> > > > > > ColumnChunk >>>> > > > > > > but separates out ColumnMetaData. >>>> > > > > > > - This contains details that aren’t required, like using >>>> the >>>> > > > > > > “.parquet.encrypted” file extension. >>>> > > > > > > - The only time the new magic bytes are mentioned is >>>> after the >>>> > > > > footer >>>> > > > > > > and encryption metadata, but the diagram shows that the >>>> first >>>> > > > bytes >>>> > > > > > in the >>>> > > > > > > file are updated as well. This is also only in the >>>> encrypted >>>> > > > footer >>>> > > > > > mode. >>>> > > > > > > Should PARE magic bytes be used in plaintext footer mode? >>>> > > > > > > >>>> > > > > > > Minor note: I would also prefer to vote on GCM, leaving out >>>> CTR >>>> > for >>>> > > > now >>>> > > > > > > and adding it once the GCM spec is finished. That way we can >>>> > > > > concentrate >>>> > > > > > on >>>> > > > > > > a single cipher mode instead of thinking about multiple >>>> modes at >>>> > > > once. >>>> > > > > > > >>>> > > > > > > On Tue, Oct 16, 2018 at 2:44 AM Anna Szonyi >>>> > > > > <szo...@cloudera.com.invalid >>>> > > > > > > >>>> > > > > > > wrote: >>>> > > > > > > >>>> > > > > > >> +1 (non-binding) >>>> > > > > > >> >>>> > > > > > >> On Tue, Oct 16, 2018 at 11:14 AM Nandor Kollar >>>> > > > > > >> <nkol...@cloudera.com.invalid> >>>> > > > > > >> wrote: >>>> > > > > > >> >>>> > > > > > >> > +1 (non-binding) >>>> > > > > > >> > On Tue, Oct 16, 2018 at 10:59 AM 俊杰陈 <cjjnj...@gmail.com >>>> > >>>> > > wrote: >>>> > > > > > >> > > >>>> > > > > > >> > > +1 (non-binding) >>>> > > > > > >> > > >>>> > > > > > >> > > >>>> > > > > > >> > > Zoltan Ivanfi <z...@cloudera.com.invalid> 于2018年10月16日周二 >>>> > > > 下午4:46写道: >>>> > > > > > >> > > >>>> > > > > > >> > > > +1 (binding) >>>> > > > > > >> > > > >>>> > > > > > >> > > > Cheers, >>>> > > > > > >> > > > >>>> > > > > > >> > > > Zoltan >>>> > > > > > >> > > > >>>> > > > > > >> > > > On Tue, Oct 16, 2018 at 10:11 AM Gidon Gershinsky < >>>> > > > > > gg5...@gmail.com >>>> > > > > > >> > >>>> > > > > > >> > > > wrote: >>>> > > > > > >> > > > >>>> > > > > > >> > > > > Hello Parquet developers, >>>> > > > > > >> > > > > >>>> > > > > > >> > > > > Per the last sync discussion, it is time to call >>>> for a >>>> > > vote >>>> > > > on >>>> > > > > > the >>>> > > > > > >> > > > Parquet >>>> > > > > > >> > > > > Modular Encryption design sign-off. The design doc >>>> can >>>> > be >>>> > > > > found >>>> > > > > > at >>>> > > > > > >> > the >>>> > > > > > >> > > > > encryption branch of the parquet-format repository, >>>> > > > > > >> > > > > >>>> > > > > > >> > >>>> > > > > > >>>> > > >>>> https://github.com/apache/parquet-format/blob/encryption/Encryption.md >>>> > > > . >>>> > > > > > >> > > > > >>>> > > > > > >> > > > > The design is stable by now. This work had started >>>> 10 >>>> > > months >>>> > > > > > ago, >>>> > > > > > >> has >>>> > > > > > >> > > > been >>>> > > > > > >> > > > > extensively reviewed - and implemented (in Java, >>>> > > partially >>>> > > > in >>>> > > > > > >> C++), >>>> > > > > > >> > by a >>>> > > > > > >> > > > > number of folks from different companies. To >>>> continue >>>> > with >>>> > > > the >>>> > > > > > >> > > > > implementation pull requests, we need the design >>>> to be >>>> > > > > formally >>>> > > > > > >> > signed >>>> > > > > > >> > > > off >>>> > > > > > >> > > > > by the community. >>>> > > > > > >> > > > > >>>> > > > > > >> > > > > Cheers, Gidon >>>> > > > > > >> > > > > >>>> > > > > > >> > > > >>>> > > > > > >> > > >>>> > > > > > >> > > >>>> > > > > > >> > > -- >>>> > > > > > >> > > Thanks & Best Regards >>>> > > > > > >> > >>>> > > > > > >> >>>> > > > > > > >>>> > > > > > > >>>> > > > > > > -- >>>> > > > > > > Ryan Blue >>>> > > > > > > Software Engineer >>>> > > > > > > Netflix >>>> > > > > > > >>>> > > > > > >>>> > > > > > >>>> > > > > > -- >>>> > > > > > Ryan Blue >>>> > > > > > Software Engineer >>>> > > > > > Netflix >>>> > > > > > >>>> > > > > >>>> > > > >>>> > > > >>>> > > > -- >>>> > > > Ryan Blue >>>> > > > Software Engineer >>>> > > > Netflix >>>> > > > >>>> > > >>>> > >>>> > >>>> > -- >>>> > Ryan Blue >>>> > Software Engineer >>>> > Netflix >>>> > >>>> >>>> >>>> -- >>>> Xinli Shang >>>> >>> >> >> -- >> Ryan Blue >> Software Engineer >> Netflix >> > > > -- > Ryan Blue > Software Engineer > Netflix >