RafaelHerrero opened a new pull request, #21127:
URL: https://github.com/apache/datafusion/pull/21127

   ## Which issue does this PR close?
   
   - Closes #19049.
   
   ## Rationale for this change
   
   We're building a SQL engine on top of DataFusion and hit this while running 
benchmarks: a `UNION ALL` query against Parquet files that carry field metadata 
(like `PARQUET:field_id` or InfluxDB's `iox::column::type`) fails when one 
branch of the union has a NULL literal. In that case, 
`intersect_metadata_for_union` intersects the metadata from the data source 
with the empty metadata from the NULL, and since intersecting anything with an 
empty set gives empty, all metadata gets wiped out.
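   
   To make the failure mode concrete, here is a minimal standalone sketch of a 
plain key/value intersection. `naive_intersect` is a hypothetical helper written 
for illustration, not the DataFusion function; it only assumes that a key 
survives when both sides hold it with the same value.
   
   ```rust
   use std::collections::HashMap;

   /// Hypothetical helper, for illustration only (not DataFusion code).
   /// Keeps a key only if both maps contain it with an identical value.
   fn naive_intersect(
       a: &HashMap<String, String>,
       b: &HashMap<String, String>,
   ) -> HashMap<String, String> {
       a.iter()
           .filter(|(k, v)| b.get(*k) == Some(*v))
           .map(|(k, v)| (k.clone(), v.clone()))
           .collect()
   }
   ```
   
   With this definition, intersecting any metadata map with an empty map 
yields an empty map, which matches the behavior described above.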
   
   Later, when `optimize_projections` prunes columns and `recompute_schema` 
rebuilds the Union schema, the logical schema has `{}` while the physical 
schema still has the original metadata from Parquet. This causes:
   
   ```
   Internal error: Physical input schema should be the same as the one
   converted from logical input schema.
   Differences:
     - field metadata at index 0 [usage_idle]: (physical) {"iox::column::type": 
"..."} vs (logical) {}
   ```
   
   As @erratic-pattern and @alamb discussed in the issue, empty metadata from 
NULL literals isn't saying "this field has no metadata" — it's saying "I don't 
know." It shouldn't erase metadata from branches that actually have it.
   
   I fixed this in `intersect_metadata_for_union` directly rather than patching 
`optimize_projections` or `recompute_schema`, since that's where the bad 
intersection happens and it covers all code paths that derive Union schemas.
   
   ## What changes are included in this PR?
   
   One change to `intersect_metadata_for_union` in 
`datafusion/expr/src/expr.rs`: branches with empty metadata are skipped during 
intersection instead of participating. Non-empty branches still intersect 
normally (conflicting values still get dropped). If every branch is empty, the 
result is empty — same as before.
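   
   The rule described above can be sketched roughly as follows. This is a 
simplified standalone version for readers, not the code in the diff: the name 
`intersect_skipping_empty` and its signature are assumptions made for 
illustration.
   
   ```rust
   use std::collections::HashMap;

   /// Hypothetical sketch of the skip-empty rule (not the DataFusion source).
   fn intersect_skipping_empty<'a>(
       metadatas: impl IntoIterator<Item = &'a HashMap<String, String>>,
   ) -> HashMap<String, String> {
       // Empty metadata means "unknown", so it takes no part in the intersection.
       let mut non_empty = metadatas.into_iter().filter(|m| !m.is_empty());

       // No inputs, or every branch empty: the result is empty.
       let Some(mut intersected) = non_empty.next().cloned() else {
           return HashMap::new();
       };

       // Keep only keys that every remaining branch holds with the same value.
       for metadata in non_empty {
           intersected.retain(|k, v| metadata.get(k) == Some(&*v));
       }
       intersected
   }
   ```
   
   In this sketch the position of the empty branch does not matter, because 
empty maps are filtered out before the first map is chosen as the starting 
point.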
   
   ## Are these changes tested?
   
   Added 7 unit tests for `intersect_metadata_for_union`:
   
   - Same metadata across branches — preserved
   - Conflicting non-empty values — dropped (existing behavior, unchanged)
   - One branch has metadata, other is empty — metadata preserved (the fix)
   - Empty branch comes first — still works
   - All branches empty — empty result
   - Mix of empty and conflicting non-empty — intersects only the non-empty ones
   - No inputs — empty result
   
   The full end-to-end reproduction needs Parquet files with field metadata as 
described in the issue. The unit tests cover the intersection logic directly.
   
   ## Are there any user-facing changes?
   
   No API changes. `UNION ALL` queries combining metadata-carrying sources with 
NULL literals will stop failing with schema mismatch errors.
   

