[GitHub] [arrow-datafusion] thinkharderdev commented on pull request #3380: RFC: Integrate `RowFilter` into `ParquetExec`

GitBox Wed, 07 Sep 2022 04:09:49 -0700


thinkharderdev commented on PR #3380:
URL: 
https://github.com/apache/arrow-datafusion/pull/3380#issuecomment-1239247569


   > > A separate conceptual question is around optimizing the number of 
distinct filters. In this design we simply assume that we want to break the 
filter into as many distinct predicates as we can but I'm not sure that is 
always the case given that this forces serial evaluation of the filters. I can 
imagine many cases where it would be better to group predicates together for 
evaluation. I didn't want to make the initial implementation too complicated so 
I punted on that for now, but eventually may want to do cost estimation at a 
higher level to determine the optimal grouping.
   > 
   > @thinkharderdev Agree! I remember that each distinct filter will be applied to the 
projected column with a `selection`.
   > 
   > One thing I want to mention: when applying filter pushdown to parquet, 
some `filters exprs` are `partial_filters`, so they will also exist in the `filter 
operator`. I think that, until now, all filters based on min_max have been 
`partial_filters` (is there any situation where pushdown to parquet uses 
`full_filters`? 🤔).
   > 
   > After using this row_filter, I think they could be `full_filters` (we need 
some code changes in the pushdown rule implementation), and then we could eliminate the 
`filters exprs` in the `filter operator`. 🤔 @alamb I think you are familiar with 
this (rewriting the pushed-down expr).
   
   Yes! I think this is the next phase. Once we can push down exact filters to 
the scan, we can represent that in the `ListingTable`. The pushdown doesn't 
actually rewrite the filters. The existing filter `Expr`s just get pushed down, 
and it's actually `PruningPredicate` that rewrites them as min/max filters on 
the statistics. But they all (currently) get pushed down as inexact, which means 
they get executed twice (once in the scan and once again in the filter 
operator). If the optimizer can push down ALL the filters as exact, then we can 
eliminate the `Filter` operator entirely (which also unlocks the possibility of 
pushing the limit down to the scan as well, if there is one).
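
To make the exact/inexact distinction concrete, here is a rough sketch of the decision the optimizer would be making. It is not DataFusion code: `PushDown` is a hypothetical stand-in modeled on `TableProviderFilterPushDown`, and `residual_filters` and `can_push_limit` are names invented for illustration. The assumption is that any filter not pushed down as exact must stay in the `Filter` operator, and that a limit can only move into the scan once nothing is left there.

```rust
// Hypothetical stand-in for DataFusion's `TableProviderFilterPushDown`;
// this sketch is self-contained and does not use the datafusion crate.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PushDown {
    /// The scan cannot use the filter at all.
    Unsupported,
    /// The scan may use the filter (e.g. min/max pruning) but can still
    /// return rows that do not match, so the filter must be re-applied.
    Inexact,
    /// The scan guarantees that every returned row matches the filter.
    Exact,
}

/// Returns the predicates that still have to be evaluated in a `Filter`
/// operator above the scan: everything that was not pushed down as exact.
fn residual_filters<'a>(filters: &[(&'a str, PushDown)]) -> Vec<&'a str> {
    filters
        .iter()
        .filter(|(_, support)| *support != PushDown::Exact)
        .map(|(predicate, _)| *predicate)
        .collect()
}

/// A limit can only be pushed into the scan when no filter remains above
/// it, because a remaining filter could still drop rows after the limit.
fn can_push_limit(filters: &[(&str, PushDown)]) -> bool {
    residual_filters(filters).is_empty()
}
```

With one `Inexact` filter in the list, `residual_filters` returns that predicate and `can_push_limit` is false, which is the situation today. Once every filter is reported as `Exact`, the residual list is empty, so the `Filter` operator can be dropped and the limit can move into the scan.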


