Hello,
I was playing with some statistics information, and I started having weird
results in Calcite execution for correlated aggregations that seems to be
related to BindableTableScan and RelMdColumnUniqueness
Specifically, it seems that RelMdColumnUniqueness does not respect
BindableTableScan’s projections list when asking the statistics about specific
column positions.
For example:
In our system the ID field of a table is considered unique, and this column is
always at column position 0 of the table schema.
The table provides the bitset {0} in the key’s field of the Statistics object
from table.getStatistic.
Lets look at this base query:
select Id, (select count(C0_21) from T2 where C0_21 = T1.Id) from T1
For each row of T1, I want the id, and the number of C0_21’s which match.
We should the same number of rows from this query as T1.
When we provider no key’s information to in the table statistics, we get the
correct results:
ID, Count
1, 2
2, 0
3, 1
4, 2
When we provider the {0} bitset for key’s information in the table statistics,
we do not get the correct results, we get (potentially) repeated rows of T1 for
each match in T2.
ID, Count
1, 1
1, 1
2, 0
3, 1
4, 1
4, 1
Here are some print-outs from the CONVERTED and PLAN_BEFORE_IMPLEMENTATION
hooks:
CONVERTED
LogicalProject(variablesSet=[[$cor0]], Id=[$0], EXPR$1=[$SCALAR_QUERY(
│....LogicalAggregate(group=[{}], EXPR$0=[COUNT($0)])
│....└── LogicalProject(C0_21=[$1])
│........└── LogicalFilter(condition=[=($1, $cor0.ID)])
│............└── QueryTableScan(table=[T2], Schema=[ID:Dimension,
C0_21:Dimension, C0_22:Decimal(3, 3)])
│)])
└── QueryTableScan(table=[T1], Schema=[ID:Dimension, C0_12:Decimal(3, 3),
C0_13:Text, C0_14:Timestamp])
PLAN_BEFORE_IMPLEMENTATION with no key information
EnumerableCalc(expr#0..2=[{inputs}], expr#3=[IS NULL($t2)], expr#4=[0:BIGINT],
expr#5=[CASE($t3, $t4, $t2)], Id=[$t0], EXPR$1=[$t5])
..EnumerableHashJoin(condition=[=($0, $1)], joinType=[left])
....EnumerableInterpreter
......BindableTableScan(table=[[QUERY, T1]], projects=[[0]])
....EnumerableAggregate(group=[{0}], EXPR$0=[COUNT($0)])
......EnumerableInterpreter
........BindableTableScan(table=[[QUERY, T2]], filters=[[IS NOT NULL($1)]],
projects=[[1]])
PLAN_BEFORE_IMPLEMENTATION with {0} key information
EnumerableCalc(expr#0..2=[{inputs}], expr#3=[IS NULL($t2)], expr#4=[0:BIGINT],
expr#5=[CASE($t3, $t4, $t2)], Id=[$t0], EXPR$1=[$t5])
..EnumerableHashJoin(condition=[=($0, $1)], joinType=[left])
....EnumerableInterpreter
......BindableTableScan(table=[[QUERY, T1]], projects=[[0]])
....EnumerableCalc(expr#0=[{inputs}], expr#1=[IS NOT NULL($t0)],
expr#2=[1:BIGINT], expr#3=[0:BIGINT], expr#4=[CASE($t1, $t2, $t3)],
C0_21=[$t0], $f1=[$t4])
......EnumerableInterpreter
........BindableTableScan(table=[[QUERY, T2]], filters=[[IS NOT NULL($1)]],
projects=[[1]])
Notice that the EnumerableAggregate is missing when the key information is
present.
If you set a breakpoint, you can ask
RelMedataQuery.areColumnsUnique(bindableTableScan, {0}) and it will return
true, which is not correct, since column 0 maps to column 1 in the table
(C0_21) which is not unique.
I bet that this fact is then used by AggregateRemoveRule to incorrectly drop
the aggregate, which results in multiple rows from T2 per T1.ID
RelMdColumnUniqueness.areColumnsUnqueue(TableScan) will attempt to unwrap a
ColumnUniqueness.Handler from the table, but one is not given.
I feel like BindableTableScan should provide some wrapper to remap the
projected columns.
We don’t do much configuration on when executing in the “native”/Enumerable
Calcite pathway, so I don’t know if/what configurations we might be missing,
but I don’t think we should need to augment the table statistics depending on
what kind of table scan is provided.
Its also not clear to me how we would provide an alternative
RelMdColumnUniqueness handler for BindableTableScan in this pathway (which is
through CalcitePrepareImpl)
If this is not a configuration issue, I will file a bug JIRA
-Ian Bertolacci