Hello,
I hope this is the right place to ask this.
While working on a project based on arrow-datafusion, I came across some
unexpected behavior where a projection was not eliminated as expected, which
broke an assumption made by a custom optimizer rule (I won't go into the
details, as they are not important for this question).
Specifically, I found that a logical plan like:
Projection [col1, col2]
  TableScan projection=[col1, col2, col3], full_filters=[ ... col3 ...]
was not simplified to:
TableScan projection=[col1, col2], full_filters=[... col3 ...]
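For reference, a plan of roughly this shape can be produced as follows. This is
only a sketch: the table name, column names, and file path are placeholders,
and whether the filter actually ends up inside the scan as full_filters depends
on the TableProvider's filter-pushdown support (with the built-in CSV provider
it may remain a separate Filter node), so the exact plan may differ:

use datafusion::error::Result;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();
    // Placeholder table with columns col1, col2, col3.
    ctx.register_csv("t", "data.csv", CsvReadOptions::new()).await?;

    let df = ctx
        .sql("SELECT col1, col2 FROM t WHERE col3 = 1")
        .await?;

    // Print the optimized logical plan to see whether the Projection
    // above the TableScan survives optimization.
    let plan = df.into_optimized_plan()?;
    println!("{}", plan.display_indent());
    Ok(())
}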
This seems to be intentional, as I found this
snippet<https://github.com/apache/arrow-datafusion/blob/c97048d178594b10b813c6bcd1543f157db4ba3f/datafusion/optimizer/src/push_down_projection.rs#L174>
in the optimizer rule:
...
LogicalPlan::TableScan(scan)
    if !scan.projected_schema.fields().is_empty() =>
{
    let mut used_columns: HashSet<Column> = HashSet::new();
    // filter expr may not exist in expr in projection,
    // e.g. TableScan: t1 projection=[bool_col, int_col], full_filters=[t1.id = Int32(1)]:
    // projection=[bool_col, int_col] does not contain `t1.id`.
    exprlist_to_columns(&scan.filters, &mut used_columns)?;
...
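To make sure I am reading this correctly: as I understand it, the effect of
collecting the filter columns is roughly the following (a simplified,
hypothetical illustration of the behavior, not the actual optimizer code):

use std::collections::HashSet;

// Columns kept in the scan's projection: everything the parent Projection
// asks for, plus everything referenced by the scan's full_filters.
fn prune_scan_columns(
    projection_cols: &[&str], // columns the parent Projection requests
    filter_cols: &[&str],     // columns referenced by full_filters
) -> Vec<String> {
    let mut used: HashSet<&str> = projection_cols.iter().copied().collect();
    used.extend(filter_cols.iter().copied()); // the exprlist_to_columns step
    let mut cols: Vec<String> = used.into_iter().map(String::from).collect();
    cols.sort();
    cols
}

fn main() {
    // Mirrors the example plan: Projection [col1, col2] over a scan whose
    // full_filters reference col3.
    let kept = prune_scan_columns(&["col1", "col2"], &["col3"]);
    assert_eq!(kept, vec!["col1", "col2", "col3"]);
    // Because col3 stays in the scan's projection, the scan's schema differs
    // from [col1, col2], and the outer Projection is kept to restore the
    // requested schema.
}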
However, the comment does not explain why we need to keep the extra projection
and return the extra column. After all, the filters inside the scan are
internal to that scan and are evaluated during the scan itself, so they should
not affect the schema of the rest of the plan, right?
I am looking forward to any opinions.
Best regards,
Markus Appel