nuno-faria commented on PR #17197:
URL: https://github.com/apache/datafusion/pull/17197#issuecomment-3191090428
Thanks @adriangb for looking into it.
I found some issues with the fix.
1. It appears to be causing a regression with the regular dynamic filter
pushdown. Now more rows are being returned than what is supposed to. For
example:
```sql
copy (select i as k from generate_series(1, 10000000) as t(i)) to
't1.parquet';
copy (select i as k, i as v from generate_series(1, 10000000) as t(i)) to
't2.parquet';
create external table t1 stored as parquet location 't1.parquet';
create external table t2 stored as parquet location 't2.parquet';
explain analyze select *
from t1
join t2 on t1.k = t2.k
where v = 1;
-- in each run, the number of returned rows by t1 is different
DataSourceExec t1.parquet predicate=DynamicFilterPhysicalExpr [ k@0 >= 1 AND
k@0 <= 1 ]
output_rows=6854272
DataSourceExec t1.parquet predicate=DynamicFilterPhysicalExpr [ k@0 >= 1 AND
k@0 <= 1 ]
output_rows=8951424
DataSourceExec t1.parquet predicate=DynamicFilterPhysicalExpr [ k@0 >= 1 AND
k@0 <= 1 ]
output_rows=7902848
DataSourceExec t1.parquet predicate=DynamicFilterPhysicalExpr [ k@0 >= 1 AND
k@0 <= 1 ]
output_rows=8951424
DataSourceExec t1.parquet predicate=DynamicFilterPhysicalExpr [ k@0 >= 1 AND
k@0 <= 1 ]
output_rows=7902848
```
On main, the number of rows returned by t1 is the minimum expected:
```sql
DataSourceExec t1.parquet predicate=DynamicFilterPhysicalExpr [ k@0 >= 1 AND
k@0 <= 1 ]
output_rows=20480
DataSourceExec t1.parquet predicate=DynamicFilterPhysicalExpr [ k@0 >= 1 AND
k@0 <= 1 ]
output_rows=20480
DataSourceExec t1.parquet predicate=DynamicFilterPhysicalExpr [ k@0 >= 1 AND
k@0 <= 1 ]
output_rows=20480
DataSourceExec t1.parquet predicate=DynamicFilterPhysicalExpr [ k@0 >= 1 AND
k@0 <= 1 ]
output_rows=20480
DataSourceExec t1.parquet predicate=DynamicFilterPhysicalExpr [ k@0 >= 1 AND
k@0 <= 1 ]
output_rows=20480
```
2. With partitioned joins, the dynamic filter is always `true`, while I
think it should be the minimum/maximum of all partial filters. For example:
```sql
explain analyze select *
from t1
join t2 on t1.k = t2.k
where v = 1 or v = 2;
-- all rows are returned
DataSourceExec t1.parquet predicate=DynamicFilterPhysicalExpr [ true ]
output_rows=10000000
```
I think in this case it should be `DynamicFilterPhysicalExpr [ k@0 >= 1 AND
k@0 <= 2 ]`.
However, if we had `v=1 or v=10000000` the filter in this case would be `[
k@0 >= 1 AND k@0 <= 10000000 ]`, which would also return all rows. Wouldn't it
be feasible to build a filter by combining the different partition expressions
with OR? Like so: `[ (k@0 >= 1 AND k@0 <= 1) OR (k@0 >= 10000000 AND k@0 <=
10000000) ]`.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]