alamb opened a new issue, #7746:
URL: https://github.com/apache/arrow-datafusion/issues/7746

   ### Describe the bug
   
   While working to integrate the change in 
https://github.com/apache/arrow-datafusion/pull/7670 I believe I have found a 
pre-existing bug that is newly exposed. 
   
   Some of our plans (that have pushed down predicates) started generating 
errors like this:
   ```
   Error while planning query: Optimizer rule 'simplify_expressions' failed
   caused by Schema error: No field named disk.device. Valid fields are disk.bytes_free.",
   ```
   
   The relevant part of the initial plan looks like this:
   
   ```
       Projection: Dictionary(Int32, Utf8("disk")) AS iox::measurement, TimestampNanosecond(0, None) AS time, Utf8(NULL) AS cpu, Int64(NULL) AS count_usage_idle, coalesce_struct(COUNT(disk.bytes_free), Int64(0)) AS count_bytes_free
         Aggregate: groupBy=[[]], aggr=[[COUNT(disk.bytes_free)]]
           Filter: disk.device IS NULL AND Boolean(NULL) OR disk.device IS NOT NULL AND (Boolean(NULL) OR disk.device = Dictionary(Int32, Utf8("disk1s1")))
             TableScan: disk
   ```
   
   After running filter_pushdown and projection_pushdown, the plan looks like 
this (note that the scan for `disk` only fetches `bytes_free` even though 
there is a predicate on `disk.device`):
   
   ```
       Projection: Dictionary(Int32, Utf8("disk")) AS iox::measurement, TimestampNanosecond(0, None) AS time, Utf8(NULL) AS cpu, Int64(NULL) AS count_usage_idle, coalesce_struct(COUNT(disk.bytes_free), Int64(0)) AS count_bytes_free
         Aggregate: groupBy=[[]], aggr=[[COUNT(disk.bytes_free)]]
           TableScan: disk projection=[bytes_free], full_filters=[disk.device IS NULL AND Boolean(NULL) OR disk.device IS NOT NULL AND (Boolean(NULL) OR disk.device = Dictionary(Int32, Utf8("disk1s1")))]
   ```
   
   This projection for only `bytes_free` is good in that it means some 
datasources like Parquet can avoid decoding certain columns 🎉 
   
   However, when simplify_expressions is called again, it tries to simplify the 
pushed-down filter expression in terms of the output of the `TableScan`, which 
may not have all the columns due to the pushed-down projection: 
   
   
https://github.com/apache/arrow-datafusion/blob/4b2b7dcfc63abfc03b0279abe122c5bdfcca5275/datafusion/optimizer/src/simplify_expressions/simplify_exprs.rs#L68-L71
   
   ### To Reproduce
   
   I am working on a self-contained reproducer for DataFusion -- I think it 
will simply involve optimizing the plan twice. 
   
   This patch fixes the error in IOx:
   
   ```
   --- a/datafusion/optimizer/src/simplify_expressions/simplify_exprs.rs
   +++ b/datafusion/optimizer/src/simplify_expressions/simplify_exprs.rs
   @@ -65,10 +65,15 @@ impl SimplifyExpressions {
        ) -> Result<LogicalPlan> {
            let schema = if !plan.inputs().is_empty() {
                DFSchemaRef::new(merge_schema(plan.inputs()))
   -        } else if let LogicalPlan::TableScan(_) = plan {
   +        } else if let LogicalPlan::TableScan(scan) = plan {
                // When predicates are pushed into a table scan, there needs to be
                // a schema to resolve the fields against.
   -            Arc::clone(plan.schema())
   +
   +            // note that some expressions that have been pushed to the scan
   +            // can refer to columns that are *NOT* part of the output of the Scan,
   +            // so we use the schema of the actual provider itself without any projection applied
   +            let schema = DFSchema::try_from_qualified_schema(&scan.table_name, scan.source.schema().as_ref())?;
   +            Arc::new(schema)
   +            Arc::new(schema)
            } else {
                Arc::new(DFSchema::empty())
            };
   ```
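
To illustrate why the schema choice matters, here is a minimal standalone sketch of the failure mode. It is a toy model and does not use DataFusion's types: `ToyScan`, `resolve_filters`, and `disk_scan` are hypothetical names, and the two-column `disk` table is an assumption based only on the columns named in this report.

```rust
use std::collections::BTreeSet;

/// Toy stand-in for a table scan with pushed-down projection and filters.
/// NOT DataFusion's `TableScan`; every name here is hypothetical.
struct ToyScan {
    table: String,
    /// Every column the underlying provider exposes.
    source_columns: Vec<String>,
    /// Indices into `source_columns` that the scan outputs (None = all).
    projection: Option<Vec<usize>>,
    /// Columns referenced by filters that were pushed into the scan.
    filter_columns: Vec<String>,
}

impl ToyScan {
    fn qualify(&self, column: &str) -> String {
        format!("{}.{}", self.table, column)
    }

    /// Qualified columns of the provider, with no projection applied.
    fn full_schema(&self) -> BTreeSet<String> {
        self.source_columns.iter().map(|c| self.qualify(c)).collect()
    }

    /// Qualified columns the scan actually outputs after projection.
    fn output_schema(&self) -> BTreeSet<String> {
        match &self.projection {
            Some(idx) => idx
                .iter()
                .map(|&i| self.qualify(&self.source_columns[i]))
                .collect(),
            None => self.full_schema(),
        }
    }
}

/// Check that every column used by the pushed-down filters exists in `schema`.
fn resolve_filters(scan: &ToyScan, schema: &BTreeSet<String>) -> Result<(), String> {
    for col in &scan.filter_columns {
        let name = scan.qualify(col);
        if !schema.contains(&name) {
            let valid: Vec<&str> = schema.iter().map(String::as_str).collect();
            return Err(format!(
                "No field named {}. Valid fields are {}.",
                name,
                valid.join(", ")
            ));
        }
    }
    Ok(())
}

/// Assumed shape of the scan above: outputs only `bytes_free`, filters on `device`.
fn disk_scan() -> ToyScan {
    ToyScan {
        table: "disk".to_string(),
        source_columns: vec!["bytes_free".to_string(), "device".to_string()],
        projection: Some(vec![0]),
        filter_columns: vec!["device".to_string()],
    }
}
```

In this model, `resolve_filters(&scan, &scan.output_schema())` on `disk_scan()` returns an error shaped like the one reported above, while `resolve_filters(&scan, &scan.full_schema())` succeeds. That mirrors the idea of the patch: resolve pushed-down filters against the provider's unprojected schema.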
   
   ### Expected behavior
   
   _No response_
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

