nevi-me edited a comment on issue #970:
URL: https://github.com/apache/arrow-datafusion/issues/970#issuecomment-943415912


   > Specifically, you could push filters down here into the underlying DBMS 
and avoid materializing the entire table:
   
   I also lean towards this approach. @hu6360567 see the `write_filters` function in https://github.com/TheDataEngine/datafusion-mongo-connector/blob/main/src/lib.rs#L288-L396. However, as you mention, filter pushdowns alone won't get us very far.
   
   With SQL, we still need a way for the source to declare what pushdown it supports (and for datafusion to support pushing aggregates and joins down to the source).
   
   > Split optimized plan into sub plans, which depends on where inputs are 
located (from datasource or output from other sub plan).
   
   I believe we already have this, as each `LogicalPlan` node is an enum variant identifying the type of plan (TableScan, Aggregate, Projection, etc.).
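   For illustration, here's a rough sketch of walking a plan tree to collect the tables a subtree scans, which is the kind of information a splitter would need in order to group sub-plans by source. It's written against the 2021-era `LogicalPlan` API, so the variant shapes may differ in newer datafusion versions:

   ```rust
   use datafusion::logical_plan::LogicalPlan;

   /// Collect the names of every table scanned under `plan`. If all the
   /// scans in a subtree hit the same external source, that subtree is a
   /// candidate to push down to the source as a single query.
   fn scanned_tables(plan: &LogicalPlan, out: &mut Vec<String>) {
       if let LogicalPlan::TableScan { table_name, .. } = plan {
           out.push(table_name.clone());
       }
       // `inputs()` returns the children of any node (Filter, Aggregate,
       // Join, ...), so a plain recursion visits the whole tree.
       for child in plan.inputs() {
           scanned_tables(child, out);
       }
   }
   ```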
   
   The `async fn scan()` function on `TableProvider` already lets us push the filters and projection down by converting them into a SQL query in the source's dialect. This corresponds to `LogicalPlan::TableScan`.
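   A minimal sketch of that conversion, assuming the scan-time inputs are the table name, the projected column indices, and the filter `Expr`s (the exact `scan` signature has changed across datafusion versions, and `expr_to_sql` below is a hypothetical helper, not an existing API):

   ```rust
   use datafusion::arrow::datatypes::SchemaRef;
   use datafusion::logical_plan::Expr;

   /// Hypothetical helper: render a single datafusion `Expr` as a SQL
   /// predicate. A real connector would match on `Expr::BinaryExpr`,
   /// `Expr::Column`, `Expr::Literal`, ... and emit identifiers and
   /// literals quoted for the target dialect; the `Debug` form used
   /// here is only a readable stand-in, not valid SQL.
   fn expr_to_sql(expr: &Expr) -> String {
       format!("{:?}", expr)
   }

   /// Build a SELECT statement from what datafusion hands to
   /// `TableProvider::scan`.
   fn scan_to_sql(
       table: &str,
       schema: &SchemaRef,
       projection: &Option<Vec<usize>>,
       filters: &[Expr],
   ) -> String {
       // Project only the requested columns, or everything if none given.
       let columns = match projection {
           Some(indices) => indices
               .iter()
               .map(|i| schema.field(*i).name().clone())
               .collect::<Vec<_>>()
               .join(", "),
           None => "*".to_string(),
       };
       let mut sql = format!("SELECT {} FROM {}", columns, table);
       // Each pushed-down filter becomes a conjunct in the WHERE clause.
       if !filters.is_empty() {
           let predicates = filters
               .iter()
               .map(|f| format!("({})", expr_to_sql(f)))
               .collect::<Vec<_>>()
               .join(" AND ");
           sql.push_str(&format!(" WHERE {}", predicates));
       }
       sql
   }
   ```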
   
   > Convert sub plans to SQL, and send sql to each execution node.
   
   I haven't thought about it much lately (I attempted an arrow-datafusion-sql thingy early last year), but the biggest change would be allowing datasources to declare whether they support certain plans. For example, if a plan contains a window function and an aggregation, and the source (say MySQL) supports both, there should be an equivalent of `TableProvider::supports_filter_pushdown` for whole plan nodes, which we could call to determine which plans to push down to the source and which to execute in datafusion on the returned data.
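   To make that concrete, here is a sketch of what such an equivalent could look like. Everything below is hypothetical (neither the trait nor the enum exists in datafusion); it just mirrors the shape of the existing per-filter negotiation:

   ```rust
   use datafusion::error::Result;
   use datafusion::logical_plan::LogicalPlan;

   /// Hypothetical: the provider's answer for a single plan node,
   /// mirroring `TableProviderFilterPushDown` (Unsupported / Inexact /
   /// Exact) but for whole nodes rather than filter expressions.
   pub enum PlanPushDown {
       /// The source cannot run this node; datafusion keeps it.
       Unsupported,
       /// The source runs the node exactly; datafusion drops it and
       /// uses the source's output in its place.
       Exact,
   }

   /// Hypothetical extension trait. The optimizer would walk up from
   /// each `TableScan`, offering the enclosing Filter / Aggregate /
   /// Window / Join node, and stop growing the pushed-down subtree at
   /// the first `Unsupported` answer.
   pub trait PlanPushdownProvider {
       fn supports_plan_pushdown(&self, node: &LogicalPlan) -> Result<PlanPushDown>;
   }
   ```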
   
   > But for more complex cases, like recursive queries on a DBMS data source (a query on a table that joins two tables in the same database), TableProvider may not be sufficient.
   
   It could be a good exercise to see how Spark implements this. Many years ago, when I was working with a Spark + Impala data source, I saw that certain joins could be pushed to the source via JDBC. It was a multi-source virtualised setup (SAP HANA via Spark; don't ask why), but if two or more tables in a join came from the same catalog/schema in the same data source, the join could be pushed down as a single SQL query instead of only filtering the input tables.
   
   ___
   
   So perhaps a good process could be:
   
   * Propose an extended pushdown model on `TableProvider`.
   * Implement it on a POC datasource. If it's too much work to do it on SQL (given the effort of converting a plan to SQL, plus dialect considerations), you could do it on the MongoDB repo that I linked. MongoDB supports very limited joins (only LEFT OUTER joins via `$lookup`: https://docs.mongodb.com/manual/reference/operator/aggregation/lookup/), aggregates (we'd need to know which aggregates can be pushed; it might be all or nothing), cast projections, and some window functions. A sketch of the `$lookup` mapping follows this list.
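   For the MongoDB route, the join part of the model would bottom out in `$lookup`. A sketch of the mapping, with illustrative field names (this helper is hypothetical, not part of the linked connector):

   ```rust
   use mongodb::bson::{doc, Document};

   /// Sketch: render a datafusion equi-join on
   /// `local_field = foreign_field` as a MongoDB `$lookup` stage,
   /// MongoDB's only join, which behaves like a LEFT OUTER join.
   fn join_to_lookup(
       foreign_collection: &str,
       local_field: &str,
       foreign_field: &str,
       output_field: &str,
   ) -> Document {
       doc! {
           "$lookup": {
               "from": foreign_collection,
               "localField": local_field,
               "foreignField": foreign_field,
               // Matched documents land in an array under this field.
               "as": output_field,
           }
       }
   }
   ```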
   
   While converting a whole logical plan to SQL might be the easier approach, I don't know if it would be generally suitable: some queries that datafusion executes will behave differently on other SQL engines, so it's better for datasources to retain control over what gets pushed down.
   
   

