A cost unit sounds fine to me, although I would be hesitant to refer to "seconds" or other concrete measurements since there's no easy way to guess how long execution will take on any particular system.

--
Michael Mior
mm...@apache.org
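For what it's worth, a minimal sketch of what such a unit-agnostic conversion could look like, built on the existing RelOptCost interface. The class name and the weights below are made up for illustration only and are not existing Calcite API:

    import org.apache.calcite.plan.RelOptCost;

    /**
     * Hypothetical measurer that collapses a RelOptCost into a single scalar
     * expressed in abstract "cost units", leaving the question of how long
     * execution takes on a particular system out of the default implementation.
     */
    public class WeightedCostMeasurer {
      private final double cpuWeight; // cost units per unit of cpu
      private final double ioWeight;  // cost units per unit of io

      public WeightedCostMeasurer(double cpuWeight, double ioWeight) {
        this.cpuWeight = cpuWeight;
        this.ioWeight = ioWeight;
      }

      /** Weighs the rows, cpu and io components into one comparable number. */
      public double measure(RelOptCost cost) {
        return cost.getRows() + cpuWeight * cost.getCpu() + ioWeight * cost.getIo();
      }
    }
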
On Sat, Jan 4, 2020 at 10:56, Vladimir Sitnikov <sitnikov.vladi...@gmail.com> wrote:
>
> Hi,
>
> I've spent some time on stabilizing the costs (see
> https://github.com/apache/calcite/pull/1702/commits), and it looks like
> we might want to have some notion of a "cost unit".
>
> For instance, we want to express that sorting a table with 2 int columns
> is cheaper than sorting a table with 22 int columns.
> Unfortunately, VolcanoCost is compared by the rows field only, so, for
> now, we express the number of fields in the cost#rows field by adding
> something like "cost += fieldCount * 0.1" :(
>
> Of course, you might say that the cost is pluggable; however, I would
> like to make the default implementation sane.
> At least it should be good enough for inspirational purposes.
>
> What do you think about adding a way to convert a Cost to a double?
> For instance, we could add measurer.measure(cost) that would weigh the
> cost, or we could add a method like `double cost#toSeconds()`.
>
> I guess, if we add a unit (e.g. microsecond), then we could even
> micro-benchmark different join implementations and use the appropriate
> cost values for extra columns and so on.
>
> I fully understand that the cost does not have to be precise; however,
> it is sad to guesstimate the multipliers for an extra field in a
> projection.
>
> Just to recap:
> 1) I started with making tests parallel <-- this was the main goal
> 2) Then I ran into EnumerableJoinTest#testSortMergeJoinWithEquiCondition,
> which was using static variables
> 3) As I fixed EnumerableMergeJoinRule, it turned out the optimizer
> started to use merge join all over the place
> 4) That was caused by inappropriate costing of Sort, which I fixed
> 5) Then I updated the cost functions of EnumerableHashJoin and
> EnumerableNestedLoopJoin, and it was not enough because ReflectiveSchema
> was not providing the proper statistics
> 6) Then I updated ReflectiveSchema and Calc to propagate uniqueness and
> row count metadata.
>
> All of the above seems to be more or less stable (the plans improved!),
> and the failing tests, for now, are in MaterializationTest.
>
> The problem with those tests is that the cost differences between
> NestedLoop and HashJoin are tiny.
> For instance: testJoinMaterialization8
>
> EnumerableProject(empid=[$2]): rowcount = 6.6, cumulative cost = {105.19999999999999 rows, 82.6 cpu, 0.0 io}, id = 780
>   EnumerableHashJoin(condition=[=($1, $4)], joinType=[inner]): rowcount = 6.6, cumulative cost = {98.6 rows, 76.0 cpu, 0.0 io}, id = 779
>     EnumerableProject(name=[$0], name0=[CAST($0):VARCHAR]): rowcount = 22.0, cumulative cost = {44.0 rows, 67.0 cpu, 0.0 io}, id = 777
>       EnumerableTableScan(table=[[hr, m0]]): rowcount = 22.0, cumulative cost = {22.0 rows, 23.0 cpu, 0.0 io}, id = 152
>     EnumerableProject(empid=[$0], name=[$1], name0=[CAST($1):VARCHAR]): rowcount = 2.0, cumulative cost = {4.0 rows, 9.0 cpu, 0.0 io}, id = 778
>       EnumerableTableScan(table=[[hr, dependents]]): rowcount = 2.0, cumulative cost = {2.0 rows, 3.0 cpu, 0.0 io}, id = 125
>
> vs
>
> EnumerableProject(empid=[$0]): rowcount = 6.6, cumulative cost = {81.19999999999999 rows, 55.6 cpu, 0.0 io}, id = 778
>   EnumerableNestedLoopJoin(condition=[=(CAST($1):VARCHAR, CAST($2):VARCHAR)], joinType=[inner]): rowcount = 6.6, cumulative cost = {74.6 rows, 49.0 cpu, 0.0 io}, id = 777
>     EnumerableTableScan(table=[[hr, dependents]]): rowcount = 2.0, cumulative cost = {2.0 rows, 3.0 cpu, 0.0 io}, id = 125
>     EnumerableTableScan(table=[[hr, m0]]): rowcount = 22.0, cumulative cost = {22.0 rows, 23.0 cpu, 0.0 io}, id = 152
>
> The second plan looks cheaper to the optimizer; however, the key
> difference comes from the three projects in the first plan (the projects
> account for 6.6+22+2=30.6 of the cost).
> If I increase the hr.dependents table to 3 rows, then the hash-based
> plan becomes cheaper.
>
> As for me, both plans look acceptable; however, it is sad to analyze and
> debug those differences without being able to tell whether this is a plan
> degradation or whether it is acceptable.
>
> Vladimir
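
To make the "cost += fieldCount * 0.1" workaround mentioned above concrete, a cost function for a sort-like operator could fold the row width into the rows component roughly as follows. This is only a sketch under the assumptions of the quoted mail (the 0.1 multiplier is the guesstimated value from the mail, and the cpu formula is illustrative), not the actual EnumerableSort code:

    import org.apache.calcite.plan.RelOptCost;
    import org.apache.calcite.plan.RelOptPlanner;
    import org.apache.calcite.rel.metadata.RelMetadataQuery;

    // Sketch of a method inside a RelNode subclass such as a Sort implementation:
    @Override public RelOptCost computeSelfCost(RelOptPlanner planner,
        RelMetadataQuery mq) {
      double rowCount = mq.getRowCount(getInput());
      int fieldCount = getRowType().getFieldCount();
      // Comparison and copy work grows with both the row count and the row width.
      double cpu = rowCount * Math.log(rowCount + 1) * fieldCount;
      // Workaround: VolcanoCost is compared by the rows field only, so the row
      // width is folded into "rows" to make wider sorts look more expensive.
      double rows = rowCount + fieldCount * 0.1;
      return planner.getCostFactory().makeCost(rows, cpu, 0);
    }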