alamb opened a new issue, #7746: URL: https://github.com/apache/arrow-datafusion/issues/7746
### Describe the bug

While working to integrate the change in https://github.com/apache/arrow-datafusion/pull/7670 I believe I have found a pre-existing bug that is newly exposed.

Some of our plans (that have pushed down predicates) started generating errors like this:

```
Error while planning query: Optimizer rule 'simplify_expressions' failed caused by Schema error: No field named disk.device. Valid fields are disk.bytes_free.",
```

The relevant part of the initial plan looks like this:

```
Projection: Dictionary(Int32, Utf8("disk")) AS iox::measurement, TimestampNanosecond(0, None) AS time, Utf8(NULL) AS cpu, Int64(NULL) AS count_usage_idle, coalesce_struct(COUNT(disk.bytes_free), Int64(0)) AS count_bytes_free
  Aggregate: groupBy=[[]], aggr=[[COUNT(disk.bytes_free)]]
    Filter: disk.device IS NULL AND Boolean(NULL) OR disk.device IS NOT NULL AND (Boolean(NULL) OR disk.device = Dictionary(Int32, Utf8("disk1s1")))
      TableScan: disk
```

After running `filter_pushdown` and `projection_pushdown` the plan looks like this (note that the scan for `disk` only fetches `bytes_free` even though there is a predicate on `disk.device`):

```
Projection: Dictionary(Int32, Utf8("disk")) AS iox::measurement, TimestampNanosecond(0, None) AS time, Utf8(NULL) AS cpu, Int64(NULL) AS count_usage_idle, coalesce_struct(COUNT(disk.bytes_free), Int64(0)) AS count_bytes_free
  Aggregate: groupBy=[[]], aggr=[[COUNT(disk.bytes_free)]]
    TableScan: disk projection=[bytes_free], full_filters=[disk.device IS NULL AND Boolean(NULL) OR disk.device IS NOT NULL AND (Boolean(NULL) OR disk.device = Dictionary(Int32, Utf8("disk1s1")))]
```

Projecting only `bytes_free` is good, because it means some data sources like Parquet can avoid decoding certain columns 🎉

However, when `simplify_expressions` is called again, it tries to simplify the pushed-down filter expression in terms of the output schema of the `TableScan`, which may not contain all the columns the filter refers to because of the pushed-down projection:
https://github.com/apache/arrow-datafusion/blob/4b2b7dcfc63abfc03b0279abe122c5bdfcca5275/datafusion/optimizer/src/simplify_expressions/simplify_exprs.rs#L68-L71

### To Reproduce

I am working on a self-contained reproducer for DataFusion -- I think it will simply involve optimizing the plan twice.

This patch fixes the error in IOx:

```
--- a/datafusion/optimizer/src/simplify_expressions/simplify_exprs.rs
+++ b/datafusion/optimizer/src/simplify_expressions/simplify_exprs.rs
@@ -65,10 +65,15 @@ impl SimplifyExpressions {
     ) -> Result<LogicalPlan> {
         let schema = if !plan.inputs().is_empty() {
             DFSchemaRef::new(merge_schema(plan.inputs()))
-        } else if let LogicalPlan::TableScan(_) = plan {
+        } else if let LogicalPlan::TableScan(scan) = plan {
             // When predicates are pushed into a table scan, there needs to be
             // a schema to resolve the fields against.
-            Arc::clone(plan.schema())
+
+            // note that some expressions that have been pushed to the scan
+            // can refer to columns that are *NOT* part of the output of the Scan,
+            // so we use the schema of the actual provider itself without any projection applied
+            let schema = DFSchema::try_from_qualified_schema(&scan.table_name, scan.source.schema().as_ref())?;
+            Arc::new(schema)
         } else {
             Arc::new(DFSchema::empty())
         };
```

### Expected behavior

_No response_

### Additional context

_No response_

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
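
For readers who want to see the failure mode in isolation, here is a minimal sketch that does not use DataFusion at all. `ToySchema`, `project`, `resolve`, and `check_filter_columns` are hypothetical names invented for this illustration, not DataFusion APIs. It only models the idea from the report: a filter pushed into a scan can reference a column that the scan's projected output no longer contains, so resolving the filter against the projected schema fails, while resolving it against the unprojected source schema succeeds.

```rust
/// Toy stand-in for a qualified schema: a table name plus its column names.
/// (Hypothetical type for illustration only; not a DataFusion type.)
#[derive(Debug, Clone)]
struct ToySchema {
    table: String,
    columns: Vec<String>,
}

impl ToySchema {
    fn new(table: &str, columns: &[&str]) -> Self {
        ToySchema {
            table: table.to_string(),
            columns: columns.iter().map(|c| c.to_string()).collect(),
        }
    }

    /// Keep only the columns at the given indices, like a pushed-down projection.
    fn project(&self, indices: &[usize]) -> Self {
        ToySchema {
            table: self.table.clone(),
            columns: indices.iter().map(|&i| self.columns[i].clone()).collect(),
        }
    }

    /// Resolve a column name to its index, or return an error message
    /// shaped like the one quoted in the report.
    fn resolve(&self, column: &str) -> Result<usize, String> {
        self.columns.iter().position(|c| c == column).ok_or_else(|| {
            let valid: Vec<String> = self
                .columns
                .iter()
                .map(|c| format!("{}.{}", self.table, c))
                .collect();
            format!(
                "No field named {}.{}. Valid fields are {}.",
                self.table,
                column,
                valid.join(", ")
            )
        })
    }
}

/// Check that every column referenced by a pushed-down filter resolves in `schema`.
fn check_filter_columns(schema: &ToySchema, filter_columns: &[&str]) -> Result<(), String> {
    for column in filter_columns {
        schema.resolve(column)?;
    }
    Ok(())
}

fn main() {
    // Full schema of the source, with no projection applied.
    let source = ToySchema::new("disk", &["device", "bytes_free"]);
    // Output schema of the scan after the projection to `bytes_free` only.
    let scan_output = source.project(&[1]);
    // The pushed-down filter still refers to `device`.
    let filter_columns = ["device"];

    // Resolving against the projected output fails ...
    println!("{:?}", check_filter_columns(&scan_output, &filter_columns));
    // ... while resolving against the unprojected source schema succeeds.
    println!("{:?}", check_filter_columns(&source, &filter_columns));
}
```

The patch in the report makes the same switch inside `SimplifyExpressions`: for a `TableScan` it builds the schema from `scan.source.schema()` instead of from the projected `plan.schema()`.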