Hi,

I've spent some time stabilizing the costs (see
https://github.com/apache/calcite/pull/1702/commits ), and it looks like we
might want some notion of a "cost unit".

For instance, we want to express that sorting a table with 2 int columns is
cheaper than sorting a table with 22 int columns.
Unfortunately, VolcanoCost is compared by its rows field only, so, for now,
we fold the field count into the cost#rows field by adding something
like "cost += fieldCount * 0.1" :(

Of course, you might say that the cost model is pluggable; however, I would
like to make the default implementation sane.
At least it should be good enough for others to take inspiration from.

What do you think about adding a way to convert a Cost to a double?
For instance, we could add a measurer.measure(cost) that would weigh the
cost, or we could add a method like `double cost#toSeconds()`.
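
A minimal sketch of the measurer idea, with hypothetical names, purely to
make the proposal concrete:

    import org.apache.calcite.plan.RelOptCost;

    /** Hypothetical API: weighs a multi-component cost into one scalar. */
    public interface RelOptCostMeasurer {
      /** Returns a single comparable value, e.g. estimated microseconds,
       * derived from the rows/cpu/io components of the given cost. */
      double measure(RelOptCost cost);
    }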

I guess if we add a unit (e.g. microseconds), then we could even
micro-benchmark different join implementations, and use the measured values
to derive appropriate costs for extra columns and so on.
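
For instance, a JMH skeleton along these lines could be used for the
calibration (class name and plan wiring are hypothetical placeholders):

    import java.util.concurrent.TimeUnit;
    import org.openjdk.jmh.annotations.Benchmark;
    import org.openjdk.jmh.annotations.BenchmarkMode;
    import org.openjdk.jmh.annotations.Mode;
    import org.openjdk.jmh.annotations.OutputTimeUnit;

    /** Hypothetical skeleton: measure join implementations in
     * microseconds so the ratios can back the cost multipliers. */
    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.MICROSECONDS)
    public class JoinCostBenchmark {
      @Benchmark public void hashJoin() {
        // placeholder: execute an EnumerableHashJoin-based plan here
      }

      @Benchmark public void nestedLoopJoin() {
        // placeholder: execute an EnumerableNestedLoopJoin-based plan here
      }
    }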

I fully understand that the cost does not have to be precise; however, it
is sad to have to guesstimate the multiplier for an extra field in a
projection.



Just to recap:
1) I started with making tests parallel <-- this was the main goal
2) Then I ran into EnumerableJoinTest#testSortMergeJoinWithEquiCondition,
which was using static variables
3) Once I fixed EnumerableMergeJoinRule, it turned out the optimizer
started using merge join all over the place
4) That was caused by inappropriate costing of Sort, which I fixed
5) Then I updated the cost functions of EnumerableHashJoin and
EnumerableNestedLoopJoin, but that was still not enough because
ReflectiveSchema was not providing proper statistics
6) Then I updated ReflectiveSchema and Calc to propagate uniqueness and
rowcount metadata (a simplified sketch of the statistics side follows
this list)
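
For context on (5) and (6), the statistics side boils down to tables
exposing a Statistic; a simplified sketch with made-up numbers (not the
actual commit):

    import com.google.common.collect.ImmutableList;
    import org.apache.calcite.rel.type.RelDataType;
    import org.apache.calcite.rel.type.RelDataTypeFactory;
    import org.apache.calcite.schema.Statistic;
    import org.apache.calcite.schema.Statistics;
    import org.apache.calcite.schema.impl.AbstractTable;
    import org.apache.calcite.sql.type.SqlTypeName;
    import org.apache.calcite.util.ImmutableBitSet;

    /** Simplified sketch: a table that reports a row count and a unique
     * key, which is the metadata the join costing depends on. */
    class DependentsTable extends AbstractTable {
      @Override public RelDataType getRowType(RelDataTypeFactory typeFactory) {
        return typeFactory.builder()
            .add("empid", SqlTypeName.INTEGER)
            .add("name", SqlTypeName.VARCHAR)
            .build();
      }

      @Override public Statistic getStatistic() {
        // 2 rows; empid (ordinal 0) is unique -- made-up numbers.
        return Statistics.of(2d, ImmutableList.of(ImmutableBitSet.of(0)));
      }
    }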

All of the above seems more-or-less stable (the plans improved!), and the
remaining failures, for now, are in MaterializationTest.

The problem with those tests is that the cost differences between
NestedLoopJoin and HashJoin are tiny.
For instance, in testJoinMaterialization8:

EnumerableProject(empid=[$2]): rowcount = 6.6, cumulative cost = {105.19999999999999 rows, 82.6 cpu, 0.0 io}, id = 780
  EnumerableHashJoin(condition=[=($1, $4)], joinType=[inner]): rowcount = 6.6, cumulative cost = {98.6 rows, 76.0 cpu, 0.0 io}, id = 779
    EnumerableProject(name=[$0], name0=[CAST($0):VARCHAR]): rowcount = 22.0, cumulative cost = {44.0 rows, 67.0 cpu, 0.0 io}, id = 777
      EnumerableTableScan(table=[[hr, m0]]): rowcount = 22.0, cumulative cost = {22.0 rows, 23.0 cpu, 0.0 io}, id = 152
    EnumerableProject(empid=[$0], name=[$1], name0=[CAST($1):VARCHAR]): rowcount = 2.0, cumulative cost = {4.0 rows, 9.0 cpu, 0.0 io}, id = 778
      EnumerableTableScan(table=[[hr, dependents]]): rowcount = 2.0, cumulative cost = {2.0 rows, 3.0 cpu, 0.0 io}, id = 125

vs

EnumerableProject(empid=[$0]): rowcount = 6.6, cumulative cost = {81.19999999999999 rows, 55.6 cpu, 0.0 io}, id = 778
  EnumerableNestedLoopJoin(condition=[=(CAST($1):VARCHAR, CAST($2):VARCHAR)], joinType=[inner]): rowcount = 6.6, cumulative cost = {74.6 rows, 49.0 cpu, 0.0 io}, id = 777
    EnumerableTableScan(table=[[hr, dependents]]): rowcount = 2.0, cumulative cost = {2.0 rows, 3.0 cpu, 0.0 io}, id = 125
    EnumerableTableScan(table=[[hr, m0]]): rowcount = 22.0, cumulative cost = {22.0 rows, 23.0 cpu, 0.0 io}, id = 152

The second plan looks cheaper to the optimizer; however, the key difference
comes from the three projects in the first plan (the projects account for
6.6+22+2=30.6 rows of cost, while the second plan has only the top 6.6
project, so the extra 22+2=24 matches the 105.2-81.2=24 gap in cumulative
cost exactly).
If I increase the hr.dependents table to 3 rows, then the hash-based plan
becomes cheaper.

To me, both plans look acceptable; however, it is sad to analyze/debug
those differences without being able to tell whether a change is a plan
degradation or acceptable noise.

Vladimir
