Hi Juri, it's true that the tables in the joins are fully connected via the predicates, but order matters and the concrete order I see can't do without cartesian products: it's joining "company_type" with other tables before joining with "movie_companies", but the only predicate in the where clause around "company_type" is "ct.id = mc.company_type_id", which can't be used in that subtree as "movie_companies" hasn't been joined yet, so basically it's a join ordering "issue" (which could not be an issue at all based on the size of the tables, selectivity of the predicates etc.).
Are you using rules for join ordering like LoptOptimizeJoinRule <https://github.com/apache/calcite/blob/bfbe8930f4ed7ba8da530e862e212a057191cfa3/core/src/main/java/org/apache/calcite/rel/rules/LoptOptimizeJoinRule.java> in your program (the set of rules you use could help people provide a better answer)? If you are using 1.39.0 there is a new join ordering algorithm, you can refer to CALCITE-6846 <https://issues.apache.org/jira/browse/CALCITE-6846> and related PR for more details which should be exhaustive. If you think you have added all the rules and you can't still get a sense of why you end up with a particular plan, you can activate the extended logs around rule applications and transformations to be able to then put breakpoints in the involved rules at the specific step which is generally tricky as rules are called multiple times. You can refer to these slides https://www.slideshare.net/StamatisZampetakis/debugging-planning-issues-using-calcites-builtin-loggers (there is also the full video and other links at https://calcite.apache.org/community/, the talk is "Debugging planning issues using Calcite’s built in loggers"). Best regards, Alessandro On Wed, 26 Mar 2025 at 11:10, Juri Petersen <j...@itu.dk.invalid> wrote: > Hi, > As mentioned by Mads in a previous mail, we are working on a SQL-API in > Apache Wayang. > We are trying to set up experiments with the JOB Benchmark and see that we > have to rewrite queries to explicit INNER JOINS for them to be parsed > correctly. > Since we are planning to do other benchmarks with thousands of queries, > rewriting is not feasible. > > Given this (not-rewritten) query from JOB: > > SELECT MIN(chn.name) AS uncredited_voiced_character, > MIN(t.title) AS russian_movie > FROM postgres.char_name AS chn, > postgres.cast_info AS ci, > postgres.company_name AS cn, > postgres.company_type AS ct, > postgres.movie_companies AS mc, > postgres.role_type AS rt, > postgres.title AS t > WHERE ci.note LIKE '%(voice)%' > AND ci.note LIKE '%(uncredited)%' > AND cn.country_code = '[ru]' > AND rt.role = 'actor' > AND t.production_year > 2005 > AND t.id = mc.movie_id > AND t.id = ci.movie_id > AND ci.movie_id = mc.movie_id > AND chn.id = ci.person_role_id > AND rt.id = ci.role_id > AND cn.id = mc.company_id > AND ct.id = mc.company_type_id; > > We use calcite to get the following tree: > > LogicalAggregate(group=[{}], uncredited_voiced_character=[MIN($0)], > russian_movie=[MIN($1)]) > LogicalProject(name=[$1], title=[$31]) > LogicalFilter(condition=[AND(LIKE($11, '%(voice)%'), LIKE($11, > '%(uncredited)%'), =($16, '[ru]'), =($29, 'actor'), >($34, 2005), =($30, > $24), =($30, $9), =($9, $24), =($0, $10), =($28, $13), =($14, $25), =($21, > $26))]) > LogicalJoin(condition=[true], joinType=[inner]) > LogicalJoin(condition=[true], joinType=[inner]) > LogicalJoin(condition=[true], joinType=[inner]) > LogicalJoin(condition=[true], joinType=[inner]) > LogicalJoin(condition=[true], joinType=[inner]) > LogicalJoin(condition=[true], joinType=[inner]) > LogicalTableScan(table=[[postgres, char_name]]) > LogicalTableScan(table=[[postgres, cast_info]]) > LogicalTableScan(table=[[postgres, company_name]]) > LogicalTableScan(table=[[postgres, company_type]]) > LogicalTableScan(table=[[postgres, movie_companies]]) > LogicalTableScan(table=[[postgres, role_type]]) > LogicalTableScan(table=[[postgres, title]]) > > > I then try to apply the CoreRules.FILTER_INTO_JOIN (tried smart and dumb > version), in order to avoid the cartesian products, hoping to push the join > conditions into the respective LogicalJoins. > Heres the resulting tree: > > LogicalAggregate(group=[{}], uncredited_voiced_character=[MIN($0)], > russian_movie=[MIN($1)]) > LogicalProject(name=[$1], title=[$31]) > LogicalJoin(condition=[=($24, $30)], joinType=[inner]) > LogicalJoin(condition=[=($28, $13)], joinType=[inner]) > LogicalJoin(condition=[AND(=($9, $24), =($14, $25), =($21, $26))], > joinType=[inner]) > LogicalJoin(condition=[true], joinType=[inner]) > LogicalJoin(condition=[true], joinType=[inner]) > LogicalJoin(condition=[=($0, $10)], joinType=[inner]) > LogicalTableScan(table=[[postgres, char_name]]) > LogicalFilter(condition=[AND(LIKE($4, '%(voice)%'), > LIKE($4, '%(uncredited)%'))]) > LogicalTableScan(table=[[postgres, cast_info]]) > LogicalFilter(condition=[=($2, '[ru]')]) > LogicalTableScan(table=[[postgres, company_name]]) > LogicalTableScan(table=[[postgres, company_type]]) > LogicalTableScan(table=[[postgres, movie_companies]]) > LogicalFilter(condition=[=($1, 'actor')]) > LogicalTableScan(table=[[postgres, role_type]]) > LogicalFilter(condition=[>($4, 2005)]) > LogicalTableScan(table=[[postgres, title]]) > > Some of the conditions are pushed down, but we still have remaining > cartesian products and a multi-condition join. > Looking at the input query, I would expect every Join to have a condition, > as there are no unspecified joins, right? > What am I missing or what can we do to deconstruct the multi-conditional > join and avoid cartesian products? > > Thanks in advance for any help! > > Best, > Juri > >