Kontinuation opened a new issue, #2139: URL: https://github.com/apache/datafusion-comet/issues/2139
### Describe the bug

This is a bug in a feature introduced by https://github.com/apache/datafusion-comet/pull/1817. The S3 object store support for the native Parquet reader incorrectly URL-decodes the path. The path has already been URL-decoded, so decoding it again corrupts the original path. If the path does not contain escape sequences, the extra decode is harmless. However, if the S3 path contains escape sequences, the path is corrupted and we end up either getting an error or silently reading the wrong data.

I found S3 paths containing escape sequences when reading a partitioned table. The partition key contains a '#' character, and the S3 paths for files in the partitioned table look like this:

s3://bucket_name/path/to/data/p_brand=Brand%2321/part-xxxx.parquet

Note that Brand%2321 is part of the original S3 path, not a URL-encoded form of it. The partition key is Brand#21; the directory names of partitioned tables are URL-encoded by design to support arbitrary character sequences. If this path is URL-decoded again, the resulting path is s3://bucket_name/path/to/data/p_brand=Brand#21/part-xxxx.parquet, which is different from the original path.

### Steps to reproduce

Simply count the number of rows in a Parquet file with Comet enabled.
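The effect of the extra decode on such a path can be checked in isolation, without Spark or S3. The following is only a minimal sketch: it uses Python's standard `urllib.parse.unquote` as a stand-in for the decoding step, which is an assumption for illustration and not the reader's actual code. The variable names are hypothetical.

```python
from urllib.parse import unquote

# Object key as it is stored in S3; "%23" is literally part of the key.
original_key = "path/to/data/p_brand=Brand%2321/part-xxxx.parquet"

# Decoding a key that is already in its final form turns "%23" into "#",
# so the result no longer names the object that exists in the bucket.
decoded_again = unquote(original_key)
print(decoded_again)

# A key without escape sequences is left unchanged by the extra decode,
# which is why the problem only shows up for some paths.
plain_key = "path/to/data/part-xxxx.parquet"
plain_decoded = unquote(plain_key)
print(plain_decoded == plain_key)
```

The first result matches the key reported as missing in the error message, while the second shows why paths without escape sequences still work.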
The S3 path should contain an escape sequence:

```python
spark.read.parquet("s3://bucket_name/path/to/data/p_brand=Brand%2321/part-xxxx.parquet").count()
```

This produces an error:

```
Caused by: org.apache.comet.CometNativeException: External: Object at location path/to/data/p_brand=Brand#21/part-xxxx.parquet not found: Error performing GET https://s3.us-west-2.amazonaws.com/.../p_brand%3DBrand%2352/part-xxxx.snappy.parquet in 53.743599ms - Server returned non-2xx status code: 404 Not Found: <?xml version="1.0" encoding="UTF-8"?>
<Error><Code>NoSuchKey</Code><Message>The specified key does not exist.</Message><Key>path/to/data/p_brand=Brand#21/part-xxxx.parquet</Key><RequestId>R05Q6ASV5FECFQGW</RequestId><HostId>sFNJHdsH0it3d0WbQTczSO5wku4zVzEKXgp0d/K4z1Onj/Sy+m18q54xvYzeu2eRhJ8qz+dIBBE=</HostId></Error>
	at org.apache.comet.parquet.Native.readNextRecordBatch(Native Method)
	at org.apache.comet.parquet.NativeBatchReader.loadNextBatch(NativeBatchReader.java:812)
	at org.apache.comet.parquet.NativeBatchReader.nextBatch(NativeBatchReader.java:749)
	at org.apache.comet.parquet.NativeBatchReader.nextKeyValue(NativeBatchReader.java:707)
	at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:131)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:286)
	... 36 more
```

### Expected behavior

The Parquet file should be loaded correctly.

### Additional context

_No response_

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org