A cost unit sounds fine to me, although I would be hesitant to refer to "seconds" or other concrete measurements since there's no easy way to guess how long execution will take on any particular system.

--
Michael Mior
mm...@apache.org
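For what it's worth, a minimal sketch of what such a unit-agnostic conversion could look like, built on the existing RelOptCost interface. The class name and the weights below are made up for illustration only and are not existing Calcite API:

    import org.apache.calcite.plan.RelOptCost;

    /**
     * Hypothetical measurer that collapses a RelOptCost into a single scalar
     * expressed in abstract "cost units", leaving the question of how long
     * execution takes on a particular system out of the default implementation.
     */
    public class WeightedCostMeasurer {
      private final double cpuWeight; // cost units per unit of cpu
      private final double ioWeight;  // cost units per unit of io

      public WeightedCostMeasurer(double cpuWeight, double ioWeight) {
        this.cpuWeight = cpuWeight;
        this.ioWeight = ioWeight;
      }

      /** Weighs the rows, cpu and io components into one comparable number. */
      public double measure(RelOptCost cost) {
        return cost.getRows() + cpuWeight * cost.getCpu() + ioWeight * cost.getIo();
      }
    }
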
On Sat, Jan 4, 2020 at 10:56, Vladimir Sitnikov <sitnikov.vladi...@gmail.com> wrote:
>
> Hi,
>
> I've spent some time on stabilizing the costs (see
> https://github.com/apache/calcite/pull/1702/commits), and it looks like
> we might want to have some notion of a "cost unit".
>
> For instance, we want to express that sorting a table with 2 int columns
> is cheaper than sorting a table with 22 int columns.
> Unfortunately, VolcanoCost is compared by the rows field only, so, for
> now, we express the number of fields in the cost#rows field by adding
> something like "cost += fieldCount * 0.1" :(
>
> Of course, you might say that the cost is pluggable; however, I would
> like to make the default implementation sane.
> At least it should be good enough for inspirational purposes.
>
> What do you think about adding a way to convert a Cost to a double?
> For instance, we could add measurer.measure(cost) that would weigh the
> cost, or we could add a method like `double cost#toSeconds()`.
>
> I guess, if we add a unit (e.g. microsecond), then we could even
> micro-benchmark different join implementations and use the appropriate
> cost values for extra columns and so on.
>
> I fully understand that the cost does not have to be precise; however,
> it is sad to guesstimate the multipliers for an extra field in a
> projection.
>
> Just to recap:
> 1) I started with making tests parallel <-- this was the main goal
> 2) Then I ran into EnumerableJoinTest#testSortMergeJoinWithEquiCondition,
> which was using static variables
> 3) As I fixed EnumerableMergeJoinRule, it turned out the optimizer
> started to use merge join all over the place
> 4) That was caused by inappropriate costing of Sort, which I fixed
> 5) Then I updated the cost functions of EnumerableHashJoin and
> EnumerableNestedLoopJoin, and it was not enough because ReflectiveSchema
> was not providing the proper statistics
> 6) Then I updated ReflectiveSchema and Calc to propagate uniqueness and
> row count metadata.
>
> All of the above seems to be more or less stable (the plans improved!),
> and the failing tests, for now, are in MaterializationTest.
>
> The problem with those tests is that the cost differences between
> NestedLoop and HashJoin are tiny.
> For instance: testJoinMaterialization8
>
> EnumerableProject(empid=[$2]): rowcount = 6.6, cumulative cost = {105.19999999999999 rows, 82.6 cpu, 0.0 io}, id = 780
>   EnumerableHashJoin(condition=[=($1, $4)], joinType=[inner]): rowcount = 6.6, cumulative cost = {98.6 rows, 76.0 cpu, 0.0 io}, id = 779
>     EnumerableProject(name=[$0], name0=[CAST($0):VARCHAR]): rowcount = 22.0, cumulative cost = {44.0 rows, 67.0 cpu, 0.0 io}, id = 777
>       EnumerableTableScan(table=[[hr, m0]]): rowcount = 22.0, cumulative cost = {22.0 rows, 23.0 cpu, 0.0 io}, id = 152
>     EnumerableProject(empid=[$0], name=[$1], name0=[CAST($1):VARCHAR]): rowcount = 2.0, cumulative cost = {4.0 rows, 9.0 cpu, 0.0 io}, id = 778
>       EnumerableTableScan(table=[[hr, dependents]]): rowcount = 2.0, cumulative cost = {2.0 rows, 3.0 cpu, 0.0 io}, id = 125
>
> vs
>
> EnumerableProject(empid=[$0]): rowcount = 6.6, cumulative cost = {81.19999999999999 rows, 55.6 cpu, 0.0 io}, id = 778
>   EnumerableNestedLoopJoin(condition=[=(CAST($1):VARCHAR, CAST($2):VARCHAR)], joinType=[inner]): rowcount = 6.6, cumulative cost = {74.6 rows, 49.0 cpu, 0.0 io}, id = 777
>     EnumerableTableScan(table=[[hr, dependents]]): rowcount = 2.0, cumulative cost = {2.0 rows, 3.0 cpu, 0.0 io}, id = 125
>     EnumerableTableScan(table=[[hr, m0]]): rowcount = 22.0, cumulative cost = {22.0 rows, 23.0 cpu, 0.0 io}, id = 152
>
> The second plan looks cheaper to the optimizer; however, the key
> difference comes from the three projects in the first plan (the projects
> account for 6.6+22+2=30.6 of the cost).
> If I increase the hr.dependents table to 3 rows, then the hash-based
> plan becomes cheaper.
>
> As for me, both plans look acceptable; however, it is sad to analyze and
> debug those differences without being able to tell whether this is a plan
> degradation or whether it is acceptable.
>
> Vladimir
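
To make the "cost += fieldCount * 0.1" workaround mentioned above concrete, a cost function for a sort-like operator could fold the row width into the rows component roughly as follows. This is only a sketch under the assumptions of the quoted mail (the 0.1 multiplier is the guesstimated value from the mail, and the cpu formula is illustrative), not the actual EnumerableSort code:

    import org.apache.calcite.plan.RelOptCost;
    import org.apache.calcite.plan.RelOptPlanner;
    import org.apache.calcite.rel.metadata.RelMetadataQuery;

    // Sketch of a method inside a RelNode subclass such as a Sort implementation:
    @Override public RelOptCost computeSelfCost(RelOptPlanner planner,
        RelMetadataQuery mq) {
      double rowCount = mq.getRowCount(getInput());
      int fieldCount = getRowType().getFieldCount();
      // Comparison and copy work grows with both the row count and the row width.
      double cpu = rowCount * Math.log(rowCount + 1) * fieldCount;
      // Workaround: VolcanoCost is compared by the rows field only, so the row
      // width is folded into "rows" to make wider sorts look more expensive.
      double rows = rowCount + fieldCount * 0.1;
      return planner.getCostFactory().makeCost(rows, cpu, 0);
    }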