adriangb commented on issue #20325: URL: https://github.com/apache/datafusion/issues/20325#issuecomment-3893737697
> There are two major differences with the pushdown path:
>
> 1. The IO pattern is different (first the data needed for filtering is fetched, and then only the pages for the rows passing are fetched for the projection). Without pushdown, all pages for all columns (both filter and projection) are fetched.
>
> 2. The overhead of evaluating the filter, then selectively decoding only the rows that match for the projection column. Without pushdown, all columns are decoded and then a single filter pass is applied afterwards.

I think a third one is that parallelism is different. `FilterExec` often sits on top of a `RepartitionExec`, so its input is spread across all cores. With predicate pushdown the filter is evaluated inside the scan, so its parallelism is bounded by the number of scan partitions. Unless there are more files than cores (and even then, some files may be smaller or larger than others) or intra-file repartitioning is turned on, the parallelism is going to be lower with predicate pushdown than in a `FilterExec`.
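To make the parallelism point concrete, here is a rough toy model, not DataFusion's actual scheduler. It assumes each file is scanned by exactly one core when the filter is pushed down (no intra-file splitting), and that a `RepartitionExec` spreads rows evenly across cores before a `FilterExec`. The function names and the "rows as cost" unit are made up for illustration, and the model deliberately ignores the IO and decoding savings of pushdown described in the quoted points.

```python
import heapq


def pushdown_filter_makespan(file_rows, cores):
    """Filter runs inside the scan: whole files are assigned to cores.

    Uses a greedy longest-file-first assignment and returns the number of
    rows handled by the busiest core.
    """
    loads = [0] * cores
    heapq.heapify(loads)
    for rows in sorted(file_rows, reverse=True):
        lightest = heapq.heappop(loads)
        heapq.heappush(loads, lightest + rows)
    return max(loads)


def repartitioned_filter_makespan(file_rows, cores):
    """Filter runs after an even repartition: rows are split across cores."""
    total = sum(file_rows)
    return -(-total // cores)  # ceiling division


# Three files of uneven size on an 8-core machine.
skewed = [1000, 100, 100]
skewed_pushdown = pushdown_filter_makespan(skewed, cores=8)
skewed_repartitioned = repartitioned_filter_makespan(skewed, cores=8)

# Sixteen equal files on the same machine: more files than cores.
even = [100] * 16
even_pushdown = pushdown_filter_makespan(even, cores=8)
even_repartitioned = repartitioned_filter_makespan(even, cores=8)
```

In this model the skewed case leaves one core filtering 1000 rows under pushdown versus 150 rows per core after repartitioning, while the even case with more files than cores comes out the same (200 rows per core) either way.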
