This is what I have always wanted to do. It is very challenging, but also very meaningful.
In fact, other projects have already taken this step, such as datafusion. I have an idea whether calcite can be combined with datafusion. Calcite focuses on query optimization, while datafusion focuses on execution. Perhaps the two can complement each other. I know that someone in cmu is making similar attempts, such as optd. I am also making similar attempts. [1] https://github.com/apache/datafusion [2] https://github.com/cmu-db/optd > 2024年8月15日 14:49,Stamatis Zampetakis <[email protected]> 写道: > > Hey Mihai, > > Putting the rules together and building a good query planner is a > pretty challenging task. I don't know if there is a perfect solution > out there but there are certainly some available planners that you can > have a look at. > > Apache Hive's logical optimizer [1] is built using Calcite and has > been running real production workloads for many years now. In terms of > results you can get an idea of the produced plans for TPC-DS queries > by looking at the plans in [2]. > > Best, > Stamatis > > [1] > https://github.com/apache/hive/blob/b09d76e68bfba6be19733d864b3207f95265d11f/ql/src/java/org/apache/hadoop/hive/ql/parse/CalcitePlanner.java#L1609 > [2] > https://github.com/apache/hive/tree/b09d76e68bfba6be19733d864b3207f95265d11f/ql/src/test/results/clientpositive/perf/tpcds30tb/tez > > On Thu, Aug 15, 2024 at 6:02 AM suibianwanwan <[email protected]> > wrote: >> >> I'm trying to reproduce the problem, but the result of my convert Query >> doesn't have the LogicalCorrelate. >> >> >> So I tried to create a test in calcite org.apache.calcite.tools.PlannerTest >> as follows: >> >> >> Planner planner = getPlanner(null); >> final String sql = "SELECT 3 in (SELECT \"deptno\" FROM >> \"emps\")"; >> SqlNode parse = planner.parse(sql); >> SqlNode validate = planner.validate(parse); >> RelNode convert = planner.rel(validate).rel; >> >> relnode as follows: >> LogicalProject.NONE.[](input=LogicalValues#6,exprs=[IN(3, { >> LogicalProject(deptno=[$1]) >> LogicalTableScan(table=[[hr, emps]]) >> })]) >> >> >> >> Do you have a way to reproduce this in calcite? >> >> >> Regards, >> suibianwanwan >> ------------------ Original ------------------ >> From: >> "dev" >> >> <[email protected]>; >> Date: Thu, Aug 15, 2024 06:23 AM >> To: "dev"<[email protected]>; >> >> Subject: Re: Re:Query planner structure >> >> >> >> I have tried this approach, but the decorrelator fails to decorrelate many >> queries. Here is a simple query where after decorrelation there are still >> LogicalCorrelate nodes left: >> >> SELECT 3 in (SELECT empno FROM emp) >> >> Maybe I am missing some optimization steps? >> >> From my reading of the code, the Calcite decorrelator seems to be a >> heuristic decorrelator, which only handles a few types of patterns of >> correlated subqueries. If your query doesn't fall into one of these >> patterns, the decorrelator leaves it unchanged. >> >> For example, this query, after converting it to use LogicalCorrelate as you >> have described, becomes: >> >> LogicalProject(EXPR$0=[IS NOT NULL($1)]), id = 211 >> LogicalCorrelate(correlation=[$cor0], >> joinType=[left], requiredColumns=[{}]), id = 213 >> LogicalValues(tuples=[[{ 0 }]]), >> id = 180 >> LogicalAggregate(group=[{0}]), id >> = 207 >> >> LogicalProject(cs=[true]), id = 205 >> >> LogicalFilter(condition=[=(3, $0)]), id = 203 >> >> LogicalProject(EMPNO=[$0]), id = 201 >> >> LogicalTableScan(table=[[schema, EMP]]), id = 182 >> >> The call to RelDecorrelator.decorrelateQuery(rel, builder) cannot >> decorrelate this query. >> >> Do you have a solution that handles this case? >> >> BTW: I also think that the FILTER_INTO_JOIN rule is unsound , e.g.: >> https://issues.apache.org/jira/browse/CALCITE-5627 >> >> Thank you, >> Mihai >> ________________________________ >> From: suibianwanwan33 <[email protected]> >> Sent: Tuesday, August 13, 2024 10:31 PM >> To: dev <[email protected]> >> Subject: Re:Query planner structure >> >> Hi, this is one of the classic optimization cases in my work. >> First you need to add all CORRELATE optimizations such as >> FILTER_SUB_QUERY_TO_CORRELATE in a HepPlanner, after that you need to do a >> decorrelateQuery, and finally use a HepPlanner to add the >> JOIN_CONDITION_PUSH and the FILTER_INTO_JOIN. my guess is that the Cartesian >> product of this query will be optimized >> >> >> >> >> ------------------&nbsp;Original&nbsp;------------------ >> From: >> >> "dev" >> <[email protected]&gt;; >> Date:&nbsp;Wed, Aug 14, 2024 12:51 PM >> To:&nbsp;"[email protected]"<[email protected]&gt;; >> >> Subject:&nbsp;Query planner structure >> >> >> >> Hello all, >> >> Calcite contains many optimization rules. It is not obvious to me how to >> assemble them into a good planner. Can people recommend tested query >> planners based on Calcite that are open-source and that can be used as a >> starting point? >> >> In particular, I am having trouble making a HEP-planner produce a good plan >> for Q21 from TPC-H; the planner I assembled produces many cartesian products. >> >> For reference, here is the variant of Q21 that I am using: >> >> create view q21 ( >> &nbsp;&nbsp;&nbsp; s_name, >> &nbsp;&nbsp;&nbsp; numwait >> ) as >> select >> &nbsp;&nbsp;&nbsp; s_name, >> &nbsp;&nbsp;&nbsp; count(*) as numwait >> from >> &nbsp;&nbsp;&nbsp; supplier, >> &nbsp;&nbsp;&nbsp; lineitem l1, >> &nbsp;&nbsp;&nbsp; orders, >> &nbsp;&nbsp;&nbsp; nation >> where >> &nbsp;&nbsp;&nbsp; s_suppkey = l1.l_suppkey >> &nbsp;&nbsp;&nbsp; and o_orderkey = l1.l_orderkey >> &nbsp;&nbsp;&nbsp; and o_orderstatus = 'F' >> &nbsp;&nbsp;&nbsp; and l1.l_receiptdate &gt; l1.l_commitdate >> &nbsp;&nbsp;&nbsp; and exists ( >> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; select >> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; >> * >> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; from >> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; >> lineitem l2 >> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; where >> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; >> l2.l_orderkey = l1.l_orderkey >> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; >> and l2.l_suppkey <&gt; l1.l_suppkey >> &nbsp;&nbsp;&nbsp; ) >> &nbsp;&nbsp;&nbsp; and not exists ( >> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; select >> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; >> * >> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; from >> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; >> lineitem l3 >> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; where >> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; >> l3.l_orderkey = l1.l_orderkey >> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; >> and l3.l_suppkey <&gt; l1.l_suppkey >> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; >> and l3.l_receiptdate &gt; l3.l_commitdate >> &nbsp;&nbsp;&nbsp; ) >> &nbsp;&nbsp;&nbsp; and s_nationkey = n_nationkey >> &nbsp;&nbsp;&nbsp; and n_name = 'GERMANY' >> group by >> &nbsp;&nbsp;&nbsp; s_name >> order by >> &nbsp;&nbsp;&nbsp; numwait desc, >> &nbsp;&nbsp;&nbsp; s_name >> LIMIT 100; >> >> Thank you for any suggestions! >> Mihai
