Kontinuation opened a new issue, #2139:
URL: https://github.com/apache/datafusion-comet/issues/2139

   ### Describe the bug
   
   This is a bug in a feature introduced by 
https://github.com/apache/datafusion-comet/pull/1817.
   
   The S3 object store support for the native parquet reader incorrectly 
url-decodes the path. The path has already been url-decoded, so decoding it 
again corrupts the original path. If the path does not contain escape 
sequences, this is harmless. However, if the S3 path contains escape 
sequences, the extra decoding corrupts it, and we end up either getting an 
error or silently reading the wrong data.
   
   I found S3 paths containing escape sequences when reading a partitioned 
table. The partition key contains a '#' character and the S3 paths for files in 
the partitioned table are something like this:
   
   s3://bucket_name/path/to/data/p_brand=Brand%2321/part-xxxx.parquet
   
   Note that Brand%2321 is part of the original S3 path, not a url-encoded 
form of it. The partition key is Brand#21; the directory names of partitioned 
tables are url-encoded by design to support arbitrary character sequences.
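
To illustrate how such a directory name arises, here is a minimal sketch. It uses Python's `urllib.parse.quote` as a stand-in for the escaping applied when partition directories are written. The actual routine used by Spark may escape a different set of characters, so treat this as an approximation:

```python
from urllib.parse import quote, unquote

# Partition value as stored in the table (contains a '#').
partition_value = "Brand#21"

# Stand-in for the partition-directory escaping: percent-encode every
# character that is not an unreserved URL character.
dir_name = "p_brand=" + quote(partition_value, safe="")
print(dir_name)

# Decoding the value part exactly once recovers the partition value.
recovered = unquote(dir_name.split("=", 1)[1])
print(recovered == partition_value)
```

The escaped directory name is the literal object key in S3, so any reader has to pass it through unchanged.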
   
   If we url-decode this path again, the resulting path will be 
s3://bucket_name/path/to/data/p_brand=Brand#21/part-xxxx.parquet, which is 
different from the original path.
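
The effect can be reproduced outside Comet with a short sketch. It uses Python's `urllib.parse.unquote` purely to illustrate the extra decoding step; it is not the code path used by the native reader, and the key names are the placeholder ones from this report:

```python
from urllib.parse import unquote

# Literal object key as it exists in the bucket; '%23' is part of the key.
stored_key = "path/to/data/p_brand=Brand%2321/part-xxxx.parquet"

# An extra url-decode turns '%23' into '#', producing a different key.
decoded_key = unquote(stored_key)
print(decoded_key)
print(decoded_key == stored_key)

# A key without escape sequences is unaffected, which is why the bug
# only shows up for some paths.
plain_key = "path/to/data/part-xxxx.parquet"
print(unquote(plain_key) == plain_key)
```

A GET for the decoded key fails with NoSuchKey unless an object happens to exist under that name, in which case the wrong data is read.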
   
   ### Steps to reproduce
   
   Count the number of rows in a parquet file with Comet enabled. The S3 
path should contain an escape sequence:
   
   ```python
   
spark.read.parquet("s3://bucket_name/path/to/data/p_brand=Brand%2321/part-xxxx.parquet").count()
   ```
   
   This produces an error:
   
   ```
   Caused by: org.apache.comet.CometNativeException: External: Object at 
location path/to/data/p_brand=Brand#21/part-xxxx.parquet not found: Error 
performing GET 
https://s3.us-west-2.amazonaws.com/.../p_brand%3DBrand%2352/part-xxxx.snappy.parquet
 in 53.743599ms - Server returned non-2xx status code: 404 Not Found: <?xml 
version="1.0" encoding="UTF-8"?>
   <Error><Code>NoSuchKey</Code><Message>The specified key does not 
exist.</Message><Key>path/to/data/p_brand=Brand#21/part-xxxx.parquet</Key><RequestId>R05Q6ASV5FECFQGW</RequestId><HostId>sFNJHdsH0it3d0WbQTczSO5wku4zVzEKXgp0d/K4z1Onj/Sy+m18q54xvYzeu2eRhJ8qz+dIBBE=</HostId></Error>
        at org.apache.comet.parquet.Native.readNextRecordBatch(Native Method)
        at 
org.apache.comet.parquet.NativeBatchReader.loadNextBatch(NativeBatchReader.java:812)
        at 
org.apache.comet.parquet.NativeBatchReader.nextBatch(NativeBatchReader.java:749)
        at 
org.apache.comet.parquet.NativeBatchReader.nextKeyValue(NativeBatchReader.java:707)
        at 
org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
        at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:131)
        at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:286)
        ... 36 more
   ```
   
   ### Expected behavior
   
   The parquet file should be loaded correctly.
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

