ggershinsky commented on PR #41821: URL: https://github.com/apache/arrow/pull/41821#issuecomment-2145109371
> Although I understand the intention of this issue and the corresponding fix, I don't think the design of parquet encryption has included the `_metadata` summary file because it may point to several different encrypted parquet files. It would be great if @ggershinsky could advise on whether there is any flaw in this use case.

Yep, we haven't worked on supporting this (basically, there was no requirement, and the summary file seemed to be heading towards deprecation).

In general, using different encryption keys for different data files is considered good security practice, mainly because there is a limit on the number of crypto operations that can be performed with one key; the scope of a key leak is also smaller. That's why we generate a fresh key for each parquet file in most of the APIs (Parquet, Arrow, Spark, Iceberg, etc).

However, there are obviously some low-level parquet APIs that allow passing the same key to many files. If used carefully (making sure, somehow, not to exceed the limit), this might be ok in some cases. The limit is high (~1 billion pages, something like 10TB-1PB of data), but if it is exceeded, the cipher breaks and the data can be decrypted.

Another option could be to create a separate key for the `_metadata` summary file, and manage it separately from the data file keys.
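
To make the limit concrete, here is a rough back-of-the-envelope sketch of how the page budget maps to a data volume, and how a writer that reuses one key might track what is left. The ~1 billion page figure is the one quoted above; the 10 KB and 1 MB average page sizes and the helper names (`max_bytes_per_key`, `pages_remaining`) are assumptions made for illustration and are not part of any Parquet or Arrow API.

```python
# Approximate per-key page budget quoted in the comment above.
PAGE_LIMIT = 1_000_000_000

TB = 1000 ** 4
PB = 1000 ** 5


def max_bytes_per_key(avg_page_bytes: int, page_limit: int = PAGE_LIMIT) -> int:
    """Upper bound on data encrypted with one key, for an assumed average page size."""
    return avg_page_bytes * page_limit


def pages_remaining(pages_per_file: list[int], page_limit: int = PAGE_LIMIT) -> int:
    """Pages still available under one shared key after writing the given files."""
    return page_limit - sum(pages_per_file)


if __name__ == "__main__":
    # Assumed average page sizes: 10 KB (small pages) and 1 MB (large pages).
    print(max_bytes_per_key(10_000) // TB, "TB with 10 KB pages")
    print(max_bytes_per_key(1_000_000) // PB, "PB with 1 MB pages")

    # Hypothetical page counts for two files already written with the same key.
    print(pages_remaining([400_000_000, 500_000_000]), "pages left under this key")
```

With those assumed page sizes the bound works out to the 10TB-1PB range mentioned above. A writer that shares one key across files would need to keep a counter like this somewhere durable, which is part of why a fresh key per file is the simpler default.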
