Dormant7 opened a new issue, #2871:
URL: https://github.com/apache/datafusion-comet/issues/2871

   ### Describe the bug
   
When using Comet with Parquet Modular Encryption on an S3-compatible 
filesystem (such as S3A), the query fails with a ParquetCryptoRuntimeException: 
Failed to find DecryptionKeyRetriever.
   
   This is caused by a mismatch in the path string used as the cache key for 
the retrieverCache in CometFileKeyUnwrapper.java.
   
   ### Steps to reproduce
   
1. Configure a Spark job with Comet enabled to read encrypted Parquet files 
from an S3A path.
   2. Enable Parquet Modular Encryption using the following configurations, 
which rely on Comet's JNI bridge to Parquet's native crypto framework:

      parquet.crypto.factory.class=org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory
      parquet.encryption.kms.client.class=<My KmsClient implementation>

   3. Run a query that reads the encrypted data.
   
   The job fails with the following exception stack trace:
   Caused by: org.apache.comet.CometNativeException: Parquet error: Java 
exception during key retrieval: 
   org.apache.parquet.crypto.ParquetCryptoRuntimeException: Failed to find 
DecryptionKeyRetriever for path: s3://user1-bucket/pme/...
        at org.apache.comet.parquet.Native.readNextRecordBatch(Native Method)
       ...
   
   By adding debug logs to CometFileKeyUnwrapper.java, I was able to confirm 
the root cause.
   
   The storeDecryptionKeyRetriever method is called by the native (Rust) side 
with a file path that includes the s3a:// scheme. The DecryptionKeyRetriever is 
then cached with this full path as the key.
   
   -- Log from storeDecryptionKeyRetriever:
   Creating new DecryptionKeyRetriever for path: 
s3a://user1-bucket/pme/.../file.parquet
   Later, when the getKey method is called via JNI to retrieve the decryption 
key, the file path argument has been changed to use the s3:// scheme.
   
   -- Log from getKey:
   Getting key for path: s3://user1-bucket/pme/.../file.parquet
   Since "s3a://..." and "s3://..." are different strings, the 
retrieverCache.get() call returns null, which leads to the "Failed to find 
DecryptionKeyRetriever" exception. This indicates an inconsistency in how the 
file path is sourced or represented between the two JNI calls from the native 
side.
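   The failure mode described above can be reproduced in isolation with a 
plain map standing in for the real retrieverCache (the CacheMismatchDemo class 
below is hypothetical, for illustration only):

```java
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical demo: a retriever stored under the s3a:// key is not
// found when looked up under the s3:// key, because the cache compares
// the full URI strings.
public class CacheMismatchDemo {
  private static final ConcurrentHashMap<String, Object> retrieverCache =
      new ConcurrentHashMap<>();

  // Store an entry under storeKey, then look it up under getKey.
  public static boolean lookupHits(String storeKey, String getKey) {
    retrieverCache.clear();
    retrieverCache.put(storeKey, new Object());
    return retrieverCache.get(getKey) != null;
  }

  public static void main(String[] args) {
    // Same object, different scheme: the lookup misses.
    System.out.println(lookupHits(
        "s3a://user1-bucket/pme/file.parquet",
        "s3://user1-bucket/pme/file.parquet")); // prints false
  }
}
```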
   
   The issue can be resolved by normalizing the file path before using it as a 
key in both storeDecryptionKeyRetriever and getKey methods. A robust way to do 
this is to extract only the path component from the URI.
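   A minimal sketch of that normalization, assuming a hypothetical 
normalizeCacheKey helper (this is not the attached patch):

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical sketch: use only the path component of the URI as the
// cache key, so s3a:// and s3:// variants of the same file map to the
// same cache entry.
public class PathNormalizer {
  public static String normalizeCacheKey(String filePath) {
    try {
      URI uri = new URI(filePath);
      String path = uri.getPath();
      // Fall back to the raw string if the URI has no path component.
      return (path == null || path.isEmpty()) ? filePath : path;
    } catch (URISyntaxException e) {
      // Not a parseable URI; use the string as-is.
      return filePath;
    }
  }

  public static void main(String[] args) {
    String a = normalizeCacheKey("s3a://user1-bucket/pme/file.parquet");
    String b = normalizeCacheKey("s3://user1-bucket/pme/file.parquet");
    System.out.println(a);           // prints /pme/file.parquet
    System.out.println(a.equals(b)); // prints true
  }
}
```

   One design caveat: keying on the path alone could collide if two buckets 
contain the same relative path, so combining uri.getHost() with uri.getPath() 
in the key may be safer.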
   
   Here is a proposed change for CometFileKeyUnwrapper.java:
   
   
[CometFileKeyUnwrapper.java](https://github.com/user-attachments/files/24067596/CometFileKeyUnwrapper.java)
   
   This normalization ensures that regardless of the scheme (s3a:// or s3://) 
passed from the native side, the cache lookup will be consistent and successful.
   
   Thank you for looking into this!
   
### Expected behavior
   
   The Spark job should execute successfully, correctly decrypting the Parquet 
file data using the provided KMS client.
   
   Specifically, inside CometFileKeyUnwrapper.java: the retrieverCache.get(filePath) 
call within the getKey method should find the DecryptionKeyRetriever instance 
that was previously stored by the storeDecryptionKeyRetriever method. The 
cache lookup should succeed regardless of minor variations in the file path 
URI scheme (e.g., s3a:// vs. s3://) between the two JNI calls, and the job 
should not throw a ParquetCryptoRuntimeException but proceed to read and 
process the decrypted data.
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

