Re: [DISCUSS] CALCITE-3656, 3657, 1842: cost improvements, cost units

2020-01-09 Thread Vladimir Sitnikov
Michael>If we want to calibrate

A part of the question is "What should Aggregate#computeSelfCost return?"

A) There's an option to make that method abstract so every sub-class
defines its own cost implementation.
It might be sad, though, since it would likely mean duplicating the same
N*log(N)-style cost formula all over the place.

B) On the other hand, if there's a generic implementation in the Aggregate
class, then it should be aligned
with the Enumerable* / Bindable* / Cassandra* (and so on) costs.

For instance, if EnumerableHashJoin uses "0.5 units per hashed cell", then
Aggregate should use the same units,
so the optimizer can properly tell the cost difference between
EnumerableHashSemiJoin(A, B) and EnumerableHashJoin(A, Aggregate(B)).
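
To make the "same units" point concrete, here is a minimal sketch (the class
name, the helper method, and the 0.5 value are all made up for illustration;
this is not the actual Calcite code):

import org.apache.calcite.plan.RelOptCost;
import org.apache.calcite.plan.RelOptCostFactory;

/** Illustrative sketch only: a shared per-hashed-cell constant so that
 *  hash-based operators stay in the same cost units. */
final class HashCostUnits {
  private HashCostUnits() {}

  /** Assumed cost of hashing a single cell, shared by hash join and hash aggregate. */
  static final double PER_HASHED_CELL = 0.5;

  /** Cost of building a hash table over rowCount rows of fieldCount columns. */
  static RelOptCost hashBuildCost(RelOptCostFactory costFactory,
      double rowCount, int fieldCount) {
    double cpu = rowCount * fieldCount * PER_HASHED_CELL;
    return costFactory.makeCost(rowCount, cpu, 0d);
  }
}

If Aggregate and EnumerableHashJoin both charged PER_HASHED_CELL per hashed
cell, their costs would stay comparable by construction.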

Vladimir


Re: [DISCUSS] CALCITE-3656, 3657, 1842: cost improvements, cost units

2020-01-09 Thread Michael Mior
Having some kind of calibration that could run would be nice :) I
suppose single-block read times on HDDs are likely stable, but we
don't really know where the data is coming from. It could be an HDD, an
SSD, or even a network service with variable latency. So I'm not
convinced we'll ever get estimates that are really accurate. That's
why I'm suggesting we just call things cost units and leave it at
that. If we want to calibrate, what really matters is the ratio
between CPU time and data read time.
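
As a rough illustration of what I mean by the ratio (names and numbers are
invented, nothing Calcite-specific):

/** Back-of-the-envelope calibration sketch; not part of Calcite. */
final class CostCalibration {
  private CostCalibration() {}

  /** Average nanoseconds of a stand-in CPU operation. */
  static double cpuNanosPerOp(int iterations) {
    long checksum = 0;
    long start = System.nanoTime();
    for (int i = 0; i < iterations; i++) {
      checksum += Integer.hashCode(i); // stand-in for per-row CPU work
    }
    long elapsed = System.nanoTime() - start;
    // The checksum check only prevents the loop from being optimized away.
    return checksum == Long.MIN_VALUE ? Double.NaN : (double) elapsed / iterations;
  }

  /** The knob that matters: how many "CPU ops" one block read is worth. */
  static double ioToCpuRatio(double blockReadNanos, int iterations) {
    return blockReadNanos / cpuNanosPerOp(iterations);
  }
}
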
--
Michael Mior
mm...@apache.org

On Sat, Jan 4, 2020 at 15:10, Vladimir Sitnikov wrote:
>
> Technically speaking, single-block read time for HDDs is pretty much
> stable, so the use of seconds might not be that bad.
> However, with seconds it might be complicated to measure CPU-like activity
> (e.g. different machines might execute EnumerableJoin at different rates :( )
>
>
> What if we benchmark a trivial EnumerableCalc(EnumerableTableScan) for a
> table of 100 rows and 10 columns
> and call it a single cost unit?
>
> In other words, we could have an etalon benchmark that takes X seconds and
> we could call it a single cost unit.
>
> For instance, org.apache.calcite.rel.core.Sort#computeSelfCost returns a
> cost.
> Of course, it has an N*log(N) assumption, but which multiplier should it use?
>
> One could measure the wallclock time for the sort, and divide it by the
> time it takes to execute the etalon cost benchmark.
>
> WDYT?
>
> Vladimir


Re: [DISCUSS] CALCITE-3656, 3657, 1842: cost improvements, cost units

2020-01-04 Thread Vladimir Sitnikov
Technically speaking, single-block read time for HDDs is pretty much
stable, so the use of seconds might not be that bad.
However, with seconds it might be complicated to measure CPU-like activity
(e.g. different machines might execute EnumerableJoin at different rates :( )


What if we benchmark a trivial EnumerableCalc(EnumerableTableScan) for a
table of 100 rows and 10 columns
and call it a single cost unit?

In other words, we could have an etalon benchmark that takes X seconds and
we could call it a single cost unit.

For instance, org.apache.calcite.rel.core.Sort#computeSelfCost returns a
cost.
Of course, it has an N*log(N) assumption, but which multiplier should it use?

One could measure the wallclock time for the sort, and divide it by the
time it takes to execute the etalon cost benchmark.
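
In code, the idea would be roughly this (helper names are invented, nothing
here is an actual Calcite API):

/** Sketch of the "etalon" idea: express costs as multiples of a tiny reference query. */
final class EtalonUnits {
  private EtalonUnits() {}

  /** Wallclock nanoseconds of one run of the reference query, e.g.
   *  EnumerableCalc(EnumerableTableScan) over a 100-row x 10-column table. */
  static double etalonNanos(Runnable etalonQuery) {
    long start = System.nanoTime();
    etalonQuery.run();
    return System.nanoTime() - start;
  }

  /** Converts a measured operation (say, sorting N rows) into cost units. */
  static double costUnits(double operationNanos, double etalonNanos) {
    return operationNanos / etalonNanos;
  }
}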

WDYT?

Vladimir


Re: [DISCUSS] CALCITE-3656, 3657, 1842: cost improvements, cost units

2020-01-04 Thread Michael Mior
I understand the cost doesn't have to match the actual execution duration,
and it doesn't really matter whether it does, as long as we can get the
relative ordering of plans roughly right. That's why I'm suggesting
not calling the cost seconds, even if we are trying to roughly
approximate them. But I don't feel that strongly about this.
--
Michael Mior
mm...@apache.org

On Sat, Jan 4, 2020 at 14:18, Vladimir Sitnikov wrote:
>
> Michael>although I would be hesitant to refer to "seconds"
>
> Do you have better ideas?
> If my memory serves me well, PostgreSQL uses seconds as well for cost units.
> OracleDB is using the "single-block read" as the cost unit.
>
> Michael>how long execution will take on any particular system
>
> The idea for the cost is to be able to compare different plans.
> The cost does not have to match the actual execution duration.
>
> Then there might be tuning knobs like "how many seconds a single IO lasts".
>
> Vladimir


Re: [DISCUSS] CALCITE-3656, 3657, 1842: cost improvements, cost units

2020-01-04 Thread Vladimir Sitnikov
Michael>although I would be hesitant to refer to "seconds"

Do you have better ideas?
If my memory serves me well, PostgreSQL uses seconds as well for cost units.
OracleDB is using the "single-block read" as the cost unit.

Michael>how long execution will take on any particular system

The idea for the cost is to be able to compare different plans.
The cost does not have to match the actual execution duration.

Then there might be tuning knobs like "how many seconds a single IO lasts".
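
As a rough sketch of what such knobs could look like (hypothetical names,
loosely modelled on PostgreSQL's cost factors; not an existing Calcite class):

/** Hypothetical tuning knobs; illustrative only. */
final class CostFactors {
  /** Assumed seconds for a single block read (tunable per deployment). */
  double secondsPerIo = 0.001;
  /** Assumed seconds of CPU work per row-level operation (tunable). */
  double secondsPerCpu = 1e-7;

  /** Weighs an abstract (cpu, io) cost into comparable "seconds". */
  double toSeconds(double cpu, double io) {
    return cpu * secondsPerCpu + io * secondsPerIo;
  }
}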

Vladimir


Re: [DISCUSS] CALCITE-3656, 3657, 1842: cost improvements, cost units

2020-01-04 Thread Michael Mior
A cost unit sounds fine to me, although I would be hesitant to refer
to "seconds" or other concrete measurements since there's no easy way
to guess how long execution will take on any particular system.
--
Michael Mior
mm...@apache.org

On Sat, Jan 4, 2020 at 10:56, Vladimir Sitnikov wrote:
>
> Hi,
>
> I've spent some time on stabilizing the costs (see
> https://github.com/apache/calcite/pull/1702/commits ), and it looks like we
> might want to have some notion of "cost unit".
>
> For instance, we want to express that sorting a table with 2 int columns is
> cheaper than sorting a table with 22 int columns.
> Unfortunately, VolcanoCost is compared by the rows field only, so, for now, we
> fold the number of fields into the cost#rows field by adding something
> like "cost += fieldCount * 0.1" :(
>
> Of course, you might say that the cost is pluggable; however, I would like to
> make the default implementation sane.
> At least it should be good enough to serve as an inspiration for other
> implementations.
>
> What do you think about adding a way to convert Cost to double?
> For instance, we could add measurer.measure(cost) that would weigh the cost,
> or we could add a method like `double cost#toSeconds()`.
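>
> Roughly, the shape could be something like (an entirely hypothetical
> interface, just to illustrate the idea):
>
>   /** Hypothetical: collapse a cost into a single comparable number. */
>   public interface CostMeasurer {
>     /** Returns a single weight (e.g. "seconds") for the given cost. */
>     double measure(RelOptCost cost);
>   }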
>
> I guess, if we add a unit (e.g. microsecond), then we could even
> micro-benchmark different join implementations, and use the appropriate
> cost values
> for extra columns and so on.
>
> I fully understand that the cost does not have to be precise; however, it
> is sad to guesstimate the multipliers for an extra field in a projection.
>
>
>
> Just to recap:
> 1) I've started with making tests parallel <-- this was the main goal
> 2) Then I ran into EnumerableJoinTest#testSortMergeJoinWithEquiCondition
> which was using static variables
> 3) As I fixed EnumerableMergeJoinRule, it turned out the optimizer started
> to use merge join all over the place
> 4) It was caused by inappropriate costing of Sort, which I fixed
> 5) Then I updated the cost functions of EnumerableHashJoin and
> EnumerableNestedLoopJoin, and it was not enough because ReflectiveSchema
> was not providing the proper statistics
> 6) Then I updated ReflectiveSchema and Calc to propagate uniqueness and
> rowcount metadata.
>
> All of the above seems to be more or less stable (the plans improved!), and
> the failing tests, for now, are in MaterializationTest.
>
> The problem with those tests is that the cost differences between NestedLoop
> and HashJoin are tiny.
> For instance:
> testJoinMaterialization8
>
> EnumerableProject(empid=[$2]): rowcount = 6.6, cumulative cost = {105.19 rows, 82.6 cpu, 0.0 io}, id = 780
>   EnumerableHashJoin(condition=[=($1, $4)], joinType=[inner]): rowcount = 6.6, cumulative cost = {98.6 rows, 76.0 cpu, 0.0 io}, id = 779
>     EnumerableProject(name=[$0], name0=[CAST($0):VARCHAR]): rowcount = 22.0, cumulative cost = {44.0 rows, 67.0 cpu, 0.0 io}, id = 777
>       EnumerableTableScan(table=[[hr, m0]]): rowcount = 22.0, cumulative cost = {22.0 rows, 23.0 cpu, 0.0 io}, id = 152
>     EnumerableProject(empid=[$0], name=[$1], name0=[CAST($1):VARCHAR]): rowcount = 2.0, cumulative cost = {4.0 rows, 9.0 cpu, 0.0 io}, id = 778
>       EnumerableTableScan(table=[[hr, dependents]]): rowcount = 2.0, cumulative cost = {2.0 rows, 3.0 cpu, 0.0 io}, id = 125
>
> vs
>
> EnumerableProject(empid=[$0]): rowcount = 6.6, cumulative cost = {81.19 rows, 55.6 cpu, 0.0 io}, id = 778
>   EnumerableNestedLoopJoin(condition=[=(CAST($1):VARCHAR, CAST($2):VARCHAR)], joinType=[inner]): rowcount = 6.6, cumulative cost = {74.6 rows, 49.0 cpu, 0.0 io}, id = 777
>     EnumerableTableScan(table=[[hr, dependents]]): rowcount = 2.0, cumulative cost = {2.0 rows, 3.0 cpu, 0.0 io}, id = 125
>     EnumerableTableScan(table=[[hr, m0]]): rowcount = 22.0, cumulative cost = {22.0 rows, 23.0 cpu, 0.0 io}, id = 152
>
> The second plan looks cheaper to the optimizer; however, the key difference
> comes from the three projects in the first plan (the projects account for
> 6.6+22+2=30.6 of the cost).
> If I increase the hr.dependents table to 3 rows, then the hash-based plan
> becomes cheaper.
>
> To me, both plans look acceptable; however, it is sad to analyze/debug
> those differences without being able to tell whether that is a plan
> degradation or whether it is acceptable.
>
> Vladimir