nevi-me edited a comment on issue #970: URL: https://github.com/apache/arrow-datafusion/issues/970#issuecomment-943415912
> Specifically, you could push filters down here into the underlying DBMS and avoid materializing the entire table:

I also lean towards this approach. @hu6360567, see the `write_filters` function in https://github.com/TheDataEngine/datafusion-mongo-connector/blob/main/src/lib.rs#L288-L396. However, as you mention, filters alone won't get us very far. With SQL, we still need a way for the source to declare what pushdown it supports (and for datafusion to support pushing aggregates and joins to the source).

> Split optimized plan into sub plans, which depends on where inputs are located (from datasource or output from other sub plan).

I believe we already have this, as each `LogicalPlan` unit is an enum with the type of plan (TableScan, Aggregate, Project, etc.). The `async fn scan()` function allows us to push the filters and projection by converting them into a SQL query based on the dialect of the source. This is based on `LogicalPlan::TableScan` (see the first sketch at the end of this comment for what such a conversion could look like).

> Convert sub plans to SQL, and send sql to each execution node.

I haven't thought about it much lately (I attempted an arrow-datafusion-sql thingy early last year), but the biggest change would be allowing datasources to declare whether they support certain plans. For example, if there's a window function and an aggregation, and say MySQL supports them, there should be a `TableProvider::supports_filter_pushdown` equivalent, which we can call to determine which plans to push down and which to execute on the resulting data.

> But for more complex cases, recursive queries on a DBMS data source (a query on a table that joins two tables in the same database), TableProvider may not be sufficient.

It could be a good exercise to see how Spark implements this. Many years ago, when I was working with a Spark + Impala data source, I saw that certain joins could be pushed to the source via JDBC. It was a multi-source virtualised setup (SAP HANA via Spark; don't ask why), but if 2 or more of the joined tables came from the same catalog/schema in the same data source, they could be pushed down as a joined SQL query instead of only filtering the input tables.

___

So perhaps a good process could be:

* Propose an extended pushdown model on `TableProvider` (see the second sketch at the end of this comment for what that could look like).
* Implement it on a POC datasource. If it's too much work to do it on SQL (given the effort to convert a plan to SQL + dialect considerations), you could do it on the MongoDB repo that I linked. MongoDB supports only very limited LEFT OUTER joins (https://docs.mongodb.com/manual/reference/operator/aggregation/lookup/), aggregates (we'd need to know which aggregates can be pushed; it might be all or nothing), cast projections, and some window functions.

While converting a whole logical plan to SQL might be the easier approach, I don't know if it would be generally suitable, as there are going to be some queries that datafusion executes differently from other SQL engines, and it's better for datasources to have more control over what gets pushed down.
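To make the `scan()` point more concrete, here's a minimal sketch of how pushed-down filters could be rendered into a SQL `WHERE` clause for a source's dialect. Everything below (the simplified `Expr` enum, `expr_to_sql`, `filters_to_where`) is a hypothetical stand-in for illustration, not DataFusion's actual `Expr` type or the connector's `write_filters`:

```rust
// Minimal sketch only: `Expr` is a simplified stand-in for DataFusion's `Expr`,
// and the functions are hypothetical, not the mongo connector's `write_filters`.

#[derive(Debug)]
enum Expr {
    Column(String),
    Literal(i64),
    BinaryOp {
        left: Box<Expr>,
        op: &'static str,
        right: Box<Expr>,
    },
}

/// Render a single expression as SQL text. A real implementation would handle
/// quoting, escaping and dialect differences (e.g. MySQL vs. Postgres).
fn expr_to_sql(expr: &Expr) -> String {
    match expr {
        Expr::Column(name) => format!("\"{}\"", name),
        Expr::Literal(v) => v.to_string(),
        Expr::BinaryOp { left, op, right } => {
            format!("({} {} {})", expr_to_sql(left), op, expr_to_sql(right))
        }
    }
}

/// Combine the filters passed to `scan()` into one WHERE clause; DataFusion
/// passes them as a conjunction, so they are AND-ed together.
fn filters_to_where(filters: &[Expr]) -> String {
    if filters.is_empty() {
        return String::new();
    }
    let parts: Vec<String> = filters.iter().map(expr_to_sql).collect();
    format!("WHERE {}", parts.join(" AND "))
}

fn main() {
    let filters = vec![Expr::BinaryOp {
        left: Box::new(Expr::Column("age".into())),
        op: ">",
        right: Box::new(Expr::Literal(21)),
    }];
    // Prints: SELECT * FROM people WHERE ("age" > 21)
    println!("SELECT * FROM people {}", filters_to_where(&filters));
}
```

A real implementation would walk DataFusion's `Expr` tree and fall back to in-memory filtering for expressions the source can't represent.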
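And here's a rough sketch of what an extended pushdown model on `TableProvider` could look like, in the spirit of `supports_filter_pushdown`. None of these names (`PushDownCapabilities`, `PlanKind`, `PlanPushDown`, `supports_plan_pushdown`) exist in DataFusion; they're made up to illustrate the idea, with an arbitrary capability profile for the example source:

```rust
// Hypothetical sketch: these types are not DataFusion API; they only illustrate
// a `supports_filter_pushdown`-style declaration extended to other plan nodes.

/// Stand-in for the kinds of logical plan nodes a source might be asked to run.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PlanKind {
    Filter,
    Projection,
    Aggregate,
    Join,
    Window,
}

/// Mirrors the spirit of `TableProviderFilterPushDown`: the source runs the node
/// exactly, runs it approximately (DataFusion re-applies it on the results), or
/// not at all.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PlanPushDown {
    Unsupported,
    Inexact,
    Exact,
}

/// Hypothetical extension to `TableProvider`: declare per-node support so the
/// optimizer knows how much of a plan can be converted to the source's dialect.
trait PushDownCapabilities {
    fn supports_plan_pushdown(&self, kind: PlanKind) -> PlanPushDown;
}

/// An example source with an arbitrary capability profile: filters and
/// projections push down exactly, joins and aggregates only partially, and
/// window functions not at all.
struct ExampleSource;

impl PushDownCapabilities for ExampleSource {
    fn supports_plan_pushdown(&self, kind: PlanKind) -> PlanPushDown {
        match kind {
            PlanKind::Filter | PlanKind::Projection => PlanPushDown::Exact,
            PlanKind::Join | PlanKind::Aggregate => PlanPushDown::Inexact,
            PlanKind::Window => PlanPushDown::Unsupported,
        }
    }
}

fn main() {
    let source = ExampleSource;
    for kind in [PlanKind::Filter, PlanKind::Join, PlanKind::Window] {
        println!("{:?} -> {:?}", kind, source.supports_plan_pushdown(kind));
    }
}
```

The `Inexact` variant is the key piece: like `TableProviderFilterPushDown::Inexact`, it lets a source apply an operation approximately while DataFusion re-applies it on the returned batches, which is how partially supported joins or aggregates could be handled.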