DDtKey opened a new issue, #14058: URL: https://github.com/apache/datafusion/issues/14058
### Describe the bug

Affected versions: 42.x, 43.x, 44.x (regression since 41.x)

The `DataFrame::schema` method returns a schema that includes all columns from the joined sources, including columns not present in the final output. This behavior is incorrect and inconsistent with the documented behavior:

> Returns the DFSchema describing the output of this DataFrame.

### To Reproduce

Simple MRE here:

```rust
// Works for datafusion: 41.x and earlier
// Fails for datafusion: 42.x and later (including 44.x)
use datafusion::arrow::util::pretty;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();

    // Create table1
    ctx.sql(
        r#"
        CREATE TABLE table1 AS
        SELECT * FROM (
            VALUES
                (1, 'a'),
                (2, 'b'),
                (3, 'c')
        ) AS t(id, value1)
        "#,
    )
    .await?;

    // Create table2
    ctx.sql(
        r#"
        CREATE TABLE table2 AS
        SELECT * FROM (
            VALUES
                (1, 'x'),
                (3, 'y'),
                (4, 'z')
        ) AS t(id, value2)
        "#,
    )
    .await?;

    // Execute NATURAL JOIN query
    let df = ctx.sql("SELECT * FROM table1 NATURAL JOIN table2").await?;

    // Incorrect schema includes all columns from both tables
    let schema = df.schema().as_arrow().clone();
    println!("Schema: {:?}", schema);

    // Output does not include all columns
    let result = df.collect().await?;
    pretty::print_batches(&result)?;

    let result_schema = result.first().unwrap().schema();
    assert_eq!(&schema, &*result_schema, "Schema mismatch");

    Ok(())
}
```

Deps:

```toml
datafusion = "44.0.0"
tokio = { version = "1", features = ["full"] }
```

### Expected behavior

The schema returned by `DataFrame::schema` should match the structure of the output produced by `collect`, `collect_partitioned`, etc. Specifically:

- Columns excluded from the result of a NATURAL JOIN should not appear in the schema.

___

Or, if this was intended, the documentation should be updated to make clear how to access the output schema.
However, I find the previous behavior correct and useful (e.g. to get the schema before calling methods like `write_parquet`/`csv`/`json`).

### Additional context

This is a regression, as the method previously **worked correctly in version 41.x.x and earlier.**

It also probably points to missing test coverage for particular code paths. In a sense, it's not enough to compare SQL execution results.
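
To make the expected shape concrete, here is a minimal std-only sketch of the column list a `NATURAL JOIN` produces under standard SQL semantics: shared columns appear once, followed by the remaining left and then right columns. This is not DataFusion code; `natural_join_columns` is a hypothetical helper written only for illustration, and it models column names only (no types or qualifiers).

```rust
/// Hypothetical helper (not part of DataFusion): column names produced by
/// `left NATURAL JOIN right` under standard SQL rules.
fn natural_join_columns(left: &[&str], right: &[&str]) -> Vec<String> {
    let mut out: Vec<String> = Vec::new();

    // Shared (join) columns appear once, in left-table order.
    for c in left {
        if right.contains(c) {
            out.push(c.to_string());
        }
    }
    // Then the remaining columns of the left table.
    for c in left {
        if !right.contains(c) {
            out.push(c.to_string());
        }
    }
    // Then the remaining columns of the right table.
    for c in right {
        if !left.contains(c) {
            out.push(c.to_string());
        }
    }
    out
}

fn main() {
    // Same column layout as table1 and table2 in the reproduction.
    let cols = natural_join_columns(&["id", "value1"], &["id", "value2"]);
    println!("{:?}", cols); // prints ["id", "value1", "value2"]
}
```

With the tables from the reproduction, the expectation is that `DataFrame::schema` reports these three columns, matching the batches returned by `collect`, rather than a schema that carries `id` from both inputs.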