andygrove commented on PR #3500: URL: https://github.com/apache/datafusion-comet/pull/3500#issuecomment-3892756243
## Root Cause Analysis: `ParquetFieldIdIOSuite` Failure

### Summary

The new `ParquetFieldIdIOSuite` failure is caused by the projection computation change in `parquet_exec.rs`. This affects **all scan modes** (auto, native_iceberg_compat, native_datafusion), not just native_datafusion.

### Key finding: `NativeBatchReader` calls `init_datasource_exec`

The `NativeBatchReader` (used by auto/iceberg_compat mode) initializes via JNI at [`mod.rs:764`](https://github.com/apache/datafusion-comet/blob/dc2b9a459/native/core/src/parquet/mod.rs#L764):

```rust
let scan = init_datasource_exec(
    required_schema,
    Some(data_schema), // always Some
    None,              // partition_schema
    None,              // partition_fields
    object_store_url,
    file_groups,
    None,              // projection_vector is always None
    data_filters,
    ...
);
```

This means `data_schema = Some(...)` and `projection_vector = None` — which is the **exact new code path** added by this PR.

### What changed

**Before (base df52):**

```rust
let base_schema = match (&data_schema, &projection_vector) {
    (Some(schema), Some(_)) => Arc::clone(schema),
    _ => Arc::clone(&required_schema), // ← (Some, None) falls here: uses required_schema, no projection
};
```

**After (this PR):**

```rust
let (base_schema, projection) = match (&data_schema, &projection_vector) {
    (Some(schema), Some(proj)) => (Arc::clone(schema), Some(proj.clone())),
    (Some(schema), None) => {
        // ← NEW: computes projection by NAME matching required_schema → data_schema
        let projection: Vec<usize> = required_schema
            .fields()
            .iter()
            .filter_map(|req_field| {
                schema.fields().iter().position(|data_field| {
                    data_field.name() == req_field.name() // name-based lookup
                })
            })
            .collect();
        (Arc::clone(schema), Some(projection))
    }
    _ => (Arc::clone(&required_schema), None),
};
```

### Why it fails

When Parquet **field IDs** are used, the `required_schema` field names (from Spark's read schema) can differ from the `data_schema` field names (from the Parquet file schema). The name-based `position()` lookup then fails to find matches, producing a projection vector that is shorter than expected (or empty). This causes the downstream `index out of bounds` panic in `currentColumnBatch` when `NativeBatchReader` tries to access columns that do not exist in the batch.

The specific errors observed:

- `CometNativeException: index out of bounds: the len is 0 but the index is 0`
- `CometNativeException: index out of bounds: the len is 1 but the index is 2`
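To make the failure mode concrete, here is a minimal standalone sketch (not code from the PR or the test suite) that applies the same name-based lookup to two schemas whose field names differ, as they can when field-ID mapping is in use. The field names and types are hypothetical, and it only needs the arrow-schema crate:

```rust
use std::sync::Arc;
use arrow_schema::{DataType, Field, Schema};

fn main() {
    // Spark read schema: logical names from the Spark-side schema (hypothetical).
    let required_schema = Arc::new(Schema::new(vec![
        Field::new("a", DataType::Int32, true),
        Field::new("b", DataType::Utf8, true),
    ]));

    // Parquet file schema: physical column names in the file, which can differ from the
    // read schema when columns are matched by field ID rather than by name (hypothetical).
    let data_schema = Arc::new(Schema::new(vec![
        Field::new("random_name_1", DataType::Int32, true),
        Field::new("random_name_2", DataType::Utf8, true),
    ]));

    // Same name-based lookup as the new (Some(schema), None) arm.
    let projection: Vec<usize> = required_schema
        .fields()
        .iter()
        .filter_map(|req_field| {
            data_schema
                .fields()
                .iter()
                .position(|data_field| data_field.name() == req_field.name())
        })
        .collect();

    // No names match, so the projection comes out empty; the scan then produces batches
    // with fewer columns than the Java side expects, hence the index-out-of-bounds errors.
    assert!(projection.is_empty());
    println!("projection = {projection:?}"); // projection = []
}
```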
### Verification

- Run at base commit [`dc2b9a459`](https://github.com/apache/datafusion-comet/actions/runs/21949974183) (without the PR): `ParquetFieldIdIOSuite` **passes** (919ms)
- Run at PR commit [`b08c3a00b`](https://github.com/apache/datafusion-comet/actions/runs/21954911628) (with the PR): `ParquetFieldIdIOSuite` **fails** (144ms)

The "absence of field ids" tests also fail because the name-based projection can produce incorrect column ordering even without field IDs, when `required_schema` and `data_schema` list their columns in different orders.

### Suggested fix

The `(Some(schema), None)` case needs to account for the fact that the column correspondence between `required_schema` and `data_schema` may not be name-based (e.g., field-ID mapping). Consider one of the following (a sketch of option 1 follows the list):

1. Falling back to the original behavior (`required_schema` as base, no projection) when the names do not fully match
2. Using the same column-mapping logic that the Java `NativeBatchReader` uses (which handles field IDs)
3. Only applying the new projection logic when called from `native_datafusion` (not from `NativeBatchReader`)
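For concreteness, this is one way option 1 might look. It is a sketch only: the helper name `choose_base_schema` and its signature are hypothetical, and the actual match in `parquet_exec.rs` is inline and may differ in details. The idea is simply to trust the name-based projection only when every required field resolves to a `data_schema` column, and otherwise keep the pre-PR behavior:

```rust
use arrow_schema::SchemaRef;

// Hypothetical helper illustrating option 1: compute the name-based projection, but
// fall back to (required_schema, no projection) unless all required fields were found.
fn choose_base_schema(
    required_schema: SchemaRef,
    data_schema: Option<SchemaRef>,
    projection_vector: Option<Vec<usize>>,
) -> (SchemaRef, Option<Vec<usize>>) {
    match (data_schema, projection_vector) {
        (Some(schema), Some(proj)) => (schema, Some(proj)),
        (Some(schema), None) => {
            let projection: Vec<usize> = required_schema
                .fields()
                .iter()
                .filter_map(|req_field| {
                    schema
                        .fields()
                        .iter()
                        .position(|data_field| data_field.name() == req_field.name())
                })
                .collect();
            if projection.len() == required_schema.fields().len() {
                // Every required column resolved by name; safe to project.
                (schema, Some(projection))
            } else {
                // Names do not fully match (e.g. field-ID mapping); keep the old behavior.
                (required_schema, None)
            }
        }
        _ => (required_schema, None),
    }
}
```

This should keep the new projection for the native_datafusion path, where names do match, while restoring the previous behavior for `NativeBatchReader`'s field-ID reads.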
