As discussed with one of our user groups: in Spark 3.0, this is called "dynamic
partition pruning".
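
A rough sketch of the idea (the table and column names here are hypothetical):
with a fact table partitioned on "day", Spark 3.0 waits until runtime, collects
the "day" values that survive the filter on the dimension side, and prunes the
fact-table scan to just those partitions instead of scanning them all:

SELECT s.*
FROM "sales" AS s
JOIN "dim_date" AS d ON s."day" = d."day"
WHERE d."is_holiday" = TRUE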

Yang Liu <whilg...@gmail.com> wrote on Friday, March 6, 2020 at 6:12 PM:

> Hi,
>
> I am wondering if Calcite will support "lazy optimization" (execution time
> optimization / runtime optimization).
>
> For example, we want to do an inner join between an Elasticsearch table
> and a MySQL table, like this:
>
> WITH logic_table_2 AS
>   (SELECT _MAP['status'] AS "status",
>           _MAP['user'] AS "user"
>    FROM "es"."insight-by-sql-v3"
>    LIMIT 12345)
> SELECT *
> FROM "insight_user"."user_tab" AS t1
> JOIN logic_table_2 AS t2 ON t1."email" = t2."user"
> WHERE t2."status" = 'fail'
> LIMIT 10
>
> t2 is an ES table and t1 is a MySQL table, and it may generate an execution
> plan like this:
>
> EnumerableProject(id=[$2], name=[$3], email=[$4], desc=[$5],
> is_super=[$6], create_time=[$7], has_all_access=[$8], status=[$0],
> user=[$1])
>   EnumerableLimit(fetch=[10])
>     EnumerableHashJoin(condition=[=($1, $4)], joinType=[inner])
>       ElasticsearchToEnumerableConverter
>         ElasticsearchProject(status=[ITEM($0, 'status')], user=[ITEM($0,
> 'user')])
>           ElasticsearchFilter(condition=[=(ITEM($0, 'status'), 'fail')])
>             ElasticsearchSort(fetch=[12345])
>               ElasticsearchTableScan(table=[[es, insight-by-sql-v3]])
>       JdbcToEnumerableConverter
>         JdbcTableScan(table=[[insight_user, user_tab]])
>
> Since the ES query has a filter, at execution time Calcite will run the ES
> query first to get the build table, then run the JdbcTableScan to get the
> probe table, and finally do the HashJoin.
>
> But since this is an INNER JOIN, there is an implicit filter on the
> subsequent JdbcTableScan:
> ``` t1.email in (select user from t2 where t2.status='fail') ```. If we
> apply this implicit filter, the dataset we handle may become extremely
> small (saving memory) and the query may run much faster, since a full
> JdbcTableScan is always time-consuming. But because Calcite does its
> optimization in the planner phase, this dynamic/lazy optimization seems to
> be missed ...
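>
> As an illustration, if such a runtime rewrite were applied, the probe-side
> JDBC query could effectively become something like the sketch below (the
> literal email values are just placeholders for whatever join keys the ES
> result actually contains at execution time):
>
> SELECT *
> FROM "insight_user"."user_tab"
> WHERE "email" IN ('user-a@example.com', 'user-b@example.com')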
>
> To summarize, serial execution with a "lazy optimization" may be faster
> and use less memory than parallel execution with an optimized execution
> plan, since the former can reduce the dataset we have to handle.
>
> Any ideas?
>
>
>
