Re: "lazy" optimization?

Danny Chan Fri, 06 Mar 2020 05:52:45 -0800

Sorry to say that the Calcite runtime does not support this. The technique,
known as "dynamic partition pruning" or, in Impala, as a "runtime filter",
builds a bloom filter on the join keys of the build-side table and pushes it
down to the probe-side table source, so in some cases it can reduce the data
scanned on the probe side.
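
To make the idea concrete, here is a minimal Python sketch of a bloom-filter
runtime filter in front of a hash join. It is not Calcite or Impala code; the
names BloomFilter and join_with_runtime_filter, the filter sizes, and the
in-memory row lists are all assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit set

    def _positions(self, key):
        # Derive num_hashes bit positions by salting a SHA-256 hash.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


def join_with_runtime_filter(build_rows, probe_rows, build_key, probe_key):
    """Hypothetical helper: inner hash join with a Bloom pre-filter."""
    # 1. Read the (small, already filtered) build side and hash it.
    hash_table = {}
    bloom = BloomFilter()
    for row in build_rows:
        hash_table.setdefault(row[build_key], []).append(row)
        bloom.add(row[build_key])

    # 2. "Push down" the filter: drop probe rows that cannot match.
    #    In a real engine this check would run inside the probe-side scan.
    survivors = [r for r in probe_rows if bloom.might_contain(r[probe_key])]

    # 3. Ordinary hash join on the survivors; the exact lookup removes
    #    any Bloom filter false positives.
    joined = [
        {**p, **b}
        for p in survivors
        for b in hash_table.get(p[probe_key], [])
    ]
    return joined, len(survivors)
```

The function returns the joined rows together with the number of probe rows
that survived the filter, so the reduction is easy to observe. The benefit
only appears when the probe-side source can apply the filter before shipping
rows to the join.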


Yang Liu <whilg...@gmail.com> wrote on Fri, Mar 6, 2020 at 6:54 PM:

> I discussed this with one of our user groups; in Spark 3.0, this is called
> "dynamic partition pruning".
>
> Yang Liu <whilg...@gmail.com> wrote on Fri, Mar 6, 2020 at 6:12 PM:
>
> > Hi,
> >
> > I am wondering if Calcite will support "lazy optimization"
> > (execution time optimization / runtime optimization).
> >
> > For example, we want to do an inner join between an Elasticsearch table
> > and a MySQL table, like this:
> >
> > WITH logic_table_2 AS
> >   (SELECT _MAP['status'] AS "status",
> >           _MAP['user'] AS "user"
> >    FROM "es"."insight-by-sql-v3"
> >    LIMIT 12345)
> > SELECT *
> > FROM "insight_user"."user_tab" AS t1
> > JOIN logic_table_2 AS t2 ON t1."email" = t2."user"
> > WHERE t2."status" = 'fail'
> > LIMIT 10
> >
> > t2 is an ES table and t1 is a MySQL table, and it may generate an
> > execution plan like this:
> >
> > EnumerableProject(id=[$2], name=[$3], email=[$4], desc=[$5], is_super=[$6], create_time=[$7], has_all_access=[$8], status=[$0], user=[$1])
> >   EnumerableLimit(fetch=[10])
> >     EnumerableHashJoin(condition=[=($1, $4)], joinType=[inner])
> >       ElasticsearchToEnumerableConverter
> >         ElasticsearchProject(status=[ITEM($0, 'status')], user=[ITEM($0, 'user')])
> >           ElasticsearchFilter(condition=[=(ITEM($0, 'status'), 'fail')])
> >             ElasticsearchSort(fetch=[12345])
> >               ElasticsearchTableScan(table=[[es, insight-by-sql-v3]])
> >       JdbcToEnumerableConverter
> >         JdbcTableScan(table=[[insight_user, user_tab]])
> >
> > Since the ES query has a filter, at execution time Calcite will run the ES
> > query first to get the build table, then run the JdbcTableScan to get the
> > probe table, and finally do the HashJoin.
> >
> > But since this is an INNER JOIN, there is an implicit filter on the later
> > JdbcTableScan:
> > ``` t1.email in (select user from t2 where t2.status='fail') ```
> > If this implicit filter were applied, the dataset we handle may become
> > extremely small (saving memory) and the query may run much faster, since
> > the full JdbcTableScan is always time-consuming. But since Calcite does its
> > optimization in the planning phase, this dynamic/lazy optimization seems to
> > be missed ...
> >
> > To summarize, serial execution with a "lazy optimization" may be faster
> > and use less memory than parallel execution with an optimized execution
> > plan, since the former can reduce the dataset we handle.
> >
> > Any ideas?
> >
> >
> >
>
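
The implicit filter described in the quoted message can also be sketched
without a bloom filter, by sending the build-side keys to the probe side as an
IN list. This is only an illustration under stated assumptions: sqlite3 stands
in for MySQL, lazy_inner_join and max_keys are hypothetical names, and nothing
here is existing Calcite functionality.

```python
import sqlite3


def lazy_inner_join(build_rows, probe_conn, max_keys=500):
    """Join build_rows (dicts with a "user" key) to user_tab on email."""
    # 1. Build side first: hash it on the join key.
    by_user = {}
    for row in build_rows:
        by_user.setdefault(row["user"], []).append(row)
    keys = list(by_user)
    if not keys:
        return []  # inner join with an empty side is empty; skip the scan

    # 2. Probe side: add the implicit filter only while the key list is
    #    small; otherwise fall back to the full table scan.
    sql = "SELECT email, name FROM user_tab"
    params = []
    if len(keys) <= max_keys:
        placeholders = ", ".join("?" for _ in keys)
        sql += f" WHERE email IN ({placeholders})"
        params = keys

    # 3. Finish the hash join on whatever the probe side returned.
    result = []
    for email, name in probe_conn.execute(sql, params):
        for build_row in by_user.get(email, []):
            result.append({"email": email, "name": name, **build_row})
    return result
```

The max_keys cutoff reflects the trade-off in the discussion: the rewrite pays
off only when the build side is small, and it forces the two scans to run
serially instead of in parallel.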
