andygrove opened a new issue, #4316:
URL: https://github.com/apache/datafusion-comet/issues/4316

   ## Describe the bug
   
   When the `native_datafusion` scan adapter rejects an incompatible Parquet 
column read, the resulting `SparkError::ParquetSchemaConvert` carries an empty 
`file_path`. The JVM shim translates this to a `SparkException` whose message 
reads:
   
   ```
   Parquet column cannot be converted in file . Column: [a], Expected: int, Found: BINARY.
   ```
   
   (Note the empty path between `in file` and `.`.) Spark's vectorized reader 
populates this path via `FileScanRDD`'s catch block 
(`currentFile.urlEncodedPath`), so its message reads e.g. `... in file 
file:/tmp/.../part-00000.parquet. Column: ...`.
   
   This blocks several Spark SQL tests that extract the path from the message 
and re-open the file (e.g. `ParquetSchemaSuite > schema mismatch failure error 
message for parquet vectorized reader`).
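
To illustrate why an empty path breaks such tests, here is a minimal Rust sketch of the extraction pattern. The actual Spark tests are written in Scala and their exact parsing logic is not shown in this issue, so the helper name `extract_file_path` and the marker strings it searches for are assumptions based only on the message format quoted above.

```rust
/// Hypothetical helper: pull the path out of a message shaped like
/// "... in file <path>. Column: ...".
/// Returns None when the markers are missing or the path is empty.
fn extract_file_path(message: &str) -> Option<&str> {
    const PREFIX: &str = "in file ";
    const SUFFIX: &str = ". Column:";

    // Skip past the first "in file " marker.
    let start = message.find(PREFIX)? + PREFIX.len();
    let rest = &message[start..];

    // The path ends where ". Column:" begins.
    let end = rest.find(SUFFIX)?;
    let path = rest[..end].trim();

    if path.is_empty() {
        None
    } else {
        Some(path)
    }
}
```

With the message produced by the `native_datafusion` path, this returns `None`, so a test that goes on to re-open the file has nothing to open.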
   
   ## Where the gap is
   
   `SparkPhysicalExprAdapter::replace_with_spark_cast` and the deferred 
`RejectOnNonEmpty` expression build the error with `file_path: String::new()` 
because `PhysicalExprAdapterFactory::create` does not receive the file path. 
Fixing this likely requires either:
   
   - Capturing the file path when the per-file adapter is created (would need a 
DataFusion API extension), or
   - Catching `ParquetSchemaConvert` at a higher layer with file context (e.g. 
the parquet `ScanExec`/`FileOpener` wrapper) and re-raising with the path 
filled in.
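
The second option could look roughly like the sketch below. It is not Comet's actual code: the `SparkError` enum here is a simplified stand-in (the real variant's fields may differ), and `with_file_path` and `read_with_file_context` are hypothetical names for a wrapper that would live wherever the per-file path is known.

```rust
use std::fmt;

/// Simplified stand-in for Comet's `SparkError`; field names are assumptions.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum SparkError {
    ParquetSchemaConvert {
        file_path: String,
        column: String,
        expected: String,
        found: String,
    },
    Other(String),
}

impl fmt::Display for SparkError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            SparkError::ParquetSchemaConvert { file_path, column, expected, found } => write!(
                f,
                "Parquet column cannot be converted in file {file_path}. \
                 Column: {column}, Expected: {expected}, Found: {found}."
            ),
            SparkError::Other(msg) => write!(f, "{msg}"),
        }
    }
}

/// Fill in the path only when the adapter left it empty;
/// every other error passes through unchanged.
fn with_file_path(err: SparkError, path: &str) -> SparkError {
    match err {
        SparkError::ParquetSchemaConvert { file_path, column, expected, found }
            if file_path.is_empty() =>
        {
            SparkError::ParquetSchemaConvert {
                file_path: path.to_string(),
                column,
                expected,
                found,
            }
        }
        other => other,
    }
}

/// Hypothetical per-file wrapper: run the read and attach the file
/// path to any schema-conversion error that comes back without one.
fn read_with_file_context<T>(
    path: &str,
    read: impl FnOnce() -> Result<T, SparkError>,
) -> Result<T, SparkError> {
    read().map_err(|e| with_file_path(e, path))
}
```

Filling the path only when it is empty means an error that already carries a path (for example, if the adapter is later extended to receive one) is not overwritten.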
   
   ## Repro
   
   In `./dev/diffs/3.4.3.diff`, the test is currently tagged with 
`IgnoreCometNativeDataFusion`, pointing at this issue. Remove the tag and run:
   
   ```
   ENABLE_COMET=true ENABLE_COMET_ONHEAP=true build/sbt "sql/testOnly *ParquetSchemaSuite -- -z 'schema mismatch failure error message for parquet vectorized reader'"
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

