ggershinsky commented on pull request #9631:
URL: https://github.com/apache/arrow/pull/9631#issuecomment-831961388


   Hey @GPSnoopy, thanks for the detailed input! It is particularly 
interesting because you use asymmetric encryption of AES keys; we always wanted 
to check the high-level API against such a scenario. As for the 
immediate needs of your use case, I'm sure we'll find a practical solution 
with one of the APIs (more on that later in this comment).
   
   >This crypto library generates the AES key, encrypts it using asymmetric 
keys (obtained via the KMS, driven by an company-internal user provided key 
identifier), adds some extra necessary header information and publishes that to 
Parquet as the key identifier.
   
   This could be mapped rather easily to the high-level API. Basically, it 
requires a developer to implement a method `string wrapKey(byte[] aesKey, 
string masterKeyID)`. There, you could take the AES key (generated by us) and 
the ID of the master key (specified by you for the table/column), obtain the 
asymmetric key with this ID from your KMS, encrypt the AES key with it, add any 
extra header information, and return the result to us as a base64-encoded 
string. We keep it and give this string back to you upon reading, via the 
`byte[] unwrapKey(string wrappedKey, string masterKeyID)` method, which you use 
to decrypt the AES key. Will this conceptually work for you? I know the AES key 
is generated by us, but we do it to make sure that one DEK (data encryption 
key) is not used more than allowed by the NIST spec for GCM, to prevent its 
breakdown.
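   
   To make the flow above concrete, here is a minimal Python sketch of the 
two methods. It is an illustration under assumptions, not the actual interface: 
the class name `AsymmetricKeyWrapper`, the JSON envelope layout, and the 
`encrypt` / `decrypt` callables (stand-ins for your KMS-backed crypto library) 
are all hypothetical. The placeholder functions at the bottom only reverse the 
bytes so that the example runs; they provide no protection.

```python
import base64
import json
import os
from typing import Callable, Optional

# Hypothetical signature for the KMS-backed asymmetric operations:
# (master_key_id, input_bytes) -> output_bytes
CipherFn = Callable[[str, bytes], bytes]


class AsymmetricKeyWrapper:
    """Hypothetical adapter sketching wrapKey / unwrapKey (not the real API)."""

    def __init__(self, encrypt: CipherFn, decrypt: CipherFn,
                 header: Optional[dict] = None):
        self._encrypt = encrypt      # e.g. encrypt with the public key for this ID
        self._decrypt = decrypt      # e.g. decrypt with the private key for this ID
        self._header = header or {}  # any extra header information you need

    def wrap_key(self, aes_key: bytes, master_key_id: str) -> str:
        # Encrypt the library-generated AES key with the master key.
        ciphertext = self._encrypt(master_key_id, aes_key)
        envelope = {
            "header": self._header,
            "key": base64.b64encode(ciphertext).decode("ascii"),
        }
        # Return a single base64-encoded string, stored as key metadata.
        return base64.b64encode(json.dumps(envelope).encode("utf-8")).decode("ascii")

    def unwrap_key(self, wrapped_key: str, master_key_id: str) -> bytes:
        # Reverse of wrap_key: decode the envelope, then decrypt the AES key.
        envelope = json.loads(base64.b64decode(wrapped_key))
        return self._decrypt(master_key_id, base64.b64decode(envelope["key"]))


# Placeholders so the sketch runs end to end. NOT encryption: they only
# reverse the bytes. Replace them with calls into your own crypto library.
def placeholder_encrypt(master_key_id: str, data: bytes) -> bytes:
    return data[::-1]


def placeholder_decrypt(master_key_id: str, data: bytes) -> bytes:
    return data[::-1]


wrapper = AsymmetricKeyWrapper(placeholder_encrypt, placeholder_decrypt,
                               header={"format_version": 1})
dek = os.urandom(16)  # stands in for the AES key generated by the library
wrapped = wrapper.wrap_key(dek, "example-master-key")
assert wrapper.unwrap_key(wrapped, "example-master-key") == dek
```

   The point of the shape is that user authentication, key permissions and 
header handling stay entirely inside your two callables, while key generation 
and usage limits stay on our side.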
   
   >It also deals with user authentication and key permissions.
   >This means that the way we manage Parquet encryption inside the company is 
consistent with the rest of the company; approved by the various security teams.
   
   This is precisely the intent of having a pluggable KMS interface in the 
high-level API; it works like that in other companies.
   
   > Being compatible with other external tools and a de-facto Parquet 
encryption high-level standard is nice, but ultimately the company cares about 
its own sensitive IP. So being compatible with the company ecosystem is higher 
priority than being compatible with Spark (ultimately we will never share 
encrypted files with other companies, kind of the main point).
   
   Yep, I understand. Some companies, though, use both Spark and 
PyArrow/pandas in their data pipelines, or migrate from one to the other.
   
   > The low-level API is internally used by us in both C++ and C#. So why is 
Python different?
   
   No choice with C++: the low-level API has to be implemented in some 
language. With Python, we do have a choice to make the API safer. Also, the 
Python API will be used by a wider set of developers, including users who don't 
have any experience with data encryption.
   
   > ..provide the necessary flexibility for users with use-cases you have not 
anticipated or foreseen... the last point should be carefully considered, as 
it's reflected and used in highly acclaimed libraries and APIs
   
   In general, I'd totally agree. However, this is a somewhat unusual library, 
because it belongs in the field of security, and is supposed to protect 
sensitive data. Unfortunately, its low-level interface can be easily misused by 
inexperienced users, resulting in broken protection.
   
   Now, for practical solutions for your use case. I can think of the 
following options:
   - be an early adopter of the high-level API. A basic C++ version is ready, 
and a basic Python version should be ready soon. Obviously, this is my top 
preference, because it will help the library and its future users in the 
community to benefit from your experience / contribution.
   - use the low-level Python wrapping, developed by Itamar. You already have 
it working, and you have enough key management experience to make it safe in 
your deployment. No need to upstream it to an open source repo that is 
leveraged by many users without experience in data security.
   - open source / expose the low-level API, with warnings (in the hope that 
users will see / heed them). IMHO, this is the least preferable option; I'd 
still like to understand why nothing else would work. 



