ethan-tyler opened a new pull request, #20071: URL: https://github.com/apache/datafusion/pull/20071
## Which issue does this PR close?

- Closes #6051

## Rationale for this change

This is the end-to-end plumbing PR to get `input_file_name()` working. I started with an SLT test to define the expected behavior, then built out the plumbing to make it pass. It is scoped to the SELECT list only (the guaranteed pushdown case) per discussion with @alamb and @adriangb, with broader pushdown support to follow once #19538 lands.

## What changes are included in this PR?

Add an `input_file_name()` function that returns the file path for each row by injecting the value at the file opener boundary. The feature is opt-in (active only when the function is referenced), keeps `SELECT *` stable, and errors on unsupported contexts.

**Analyzer rewrite**

- Rewrites `input_file_name()` to the reserved column `__datafusion_input_file_name`
- Annotates `TableScan.projected_schema` only when needed
- Errors on reserved name collisions

**Physical planning + execution**

- Planner enables scan-time injection when the internal field is projected
- `FileScanConfig::open` wraps the opener to append a Utf8 column with the file location to each batch
- Stats, equivalence properties, and schema are updated for the appended field

**Optimizer**

- `OptimizeProjections` handles the internal column safely (prevents an out-of-bounds index)
- Regression test: a reserved column coming from the source schema is not treated as injected

**Scope (V1)**

- Works in the SELECT list only
- Plan-time errors for non-file sources (VALUES/MemTable), joins (ambiguous file origin), and usage outside the SELECT list (WHERE/GROUP BY/ORDER BY/HAVING)

## Are these changes tested?

Yes.

```bash
cargo test -p datafusion-sqllogictest --test sqllogictests -- input_file_name.slt
cargo test -p datafusion-datasource extended_file_columns_inject_input_file_name -q
cargo test -p datafusion-optimizer optimize_projections_keeps_reserved_column_from_source -q
```

The SLT uses CSV for deterministic multi-file assertions. Parquet is supported via the same `FileScanConfig` path, and Parquet-specific SLTs can follow.

## Are there any user-facing changes?

Yes. New 0-arg volatile scalar function: `input_file_name() -> Utf8`

```sql
CREATE EXTERNAL TABLE t STORED AS PARQUET LOCATION '...';
SELECT col1, input_file_name() FROM t;
```

`SELECT *` output is unchanged unless `input_file_name()` is explicitly referenced.
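
For readers unfamiliar with the injection step, here is a minimal, std-only sketch of the idea behind wrapping the opener: every batch read from one file gains a trailing string column that repeats that file's location. This is not the PR's code. `Batch`, `inject_file_name`, and `with_file_name` are hypothetical stand-ins for illustration and do not use the real Arrow or DataFusion types; only the reserved column name is taken from the description above.

```rust
// Hypothetical, simplified stand-ins; these are NOT DataFusion or Arrow types.
// Only the reserved column name below comes from the PR description.
const INPUT_FILE_NAME_COL: &str = "__datafusion_input_file_name";

/// A toy batch: named columns of string values, all of equal length.
#[derive(Debug, Clone, PartialEq)]
struct Batch {
    names: Vec<String>,
    columns: Vec<Vec<String>>,
}

impl Batch {
    /// Row count, derived from the first column (0 if there are no columns).
    /// Assumption of this sketch: the row count is not stored separately.
    fn num_rows(&self) -> usize {
        self.columns.first().map_or(0, |c| c.len())
    }
}

/// Append the reserved column, repeating `file_location` once per row.
fn inject_file_name(mut batch: Batch, file_location: &str) -> Batch {
    let rows = batch.num_rows();
    batch.names.push(INPUT_FILE_NAME_COL.to_string());
    batch.columns.push(vec![file_location.to_string(); rows]);
    batch
}

/// Wrap the batch iterator of a single file so that every batch it yields
/// carries the extra file-name column.
fn with_file_name<I>(batches: I, file_location: String) -> impl Iterator<Item = Batch>
where
    I: Iterator<Item = Batch>,
{
    batches.map(move |b| inject_file_name(b, &file_location))
}
```

Because the column is appended after the existing ones, downstream code that indexes the original columns by position is unaffected in this sketch, which mirrors the stated goal of keeping `SELECT *` output stable when the function is not referenced.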
