Re: [DISCUSS] Revert [CALCITE-1842] Sort.computeSelfCost() calls makeCost() with arguments in wrong order
Stamatis>HepPlanner does not perform cost-based decisions so for most use-cases I It seems to have some cost-based decisions, so providing ill-defined cost seems wrong. Well, let's see how new VolcanoCost would work, and then we could discuss Hep. Stamatis>Although it is nice to account for CPU and IO in the cost model, it has Stamatis>been shown [1] I don't think the article applies here. Calcite's Volcano optimizer tends to produce lots of Project and Filter relations, and it is important to weight them appropriately, otherwise, it would endlessly generate dumb projects. Stamatis>[1] has also an interesting discussion regarding Postgres Stamatis>cost model. PostgreSQL cost model has _two_ cost values for each relation: startup cost and total cost (see https://www.postgresql.org/docs/12/sql-explain.html ) However, the article does not mention the difference between startup and total cost, thus it does not look like they properly discuss PostgreSQL cost model. I'm not sure how often startup cost is useful, but it might help for cases like "(...) limit 5" For example, currently, we don't really use "offset/limit" for cost computation. And we do not have a clear answer to that. On the other hand, if each relation could supply its "startup cost", then the cost of "limit" would be startupCost + totalCost*(limit / totalRows) Vladimir
Re: [DISCUSS] Revert [CALCITE-1842] Sort.computeSelfCost() calls makeCost() with arguments in wrong order
Thanks for the PR! HepPlanner does not perform cost-based decisions so for most use-cases I don't think it matters a lot what is the cost. For more advanced use-cases, where there might be rules that use CumulativeCost and other kinds of such metadata, the HepPlanner can be configured with another cost factory. I don't see a big reason to change the API of RelOptCost#getRows, the callers who ask for cumulative cost should know what to expect. Apart from that rejected rows and other metrics can be added if necessary. Although it is nice to account for CPU and IO in the cost model, it has been shown [1] that for main memory settings even a simple cost model based entirely on cardinality estimations can work fine for many queries. For those interested [1] has also an interesting discussion regarding Postgres cost model. Best, Stamatis [1] http://www.vldb.org/pvldb/vol9/p204-leis.pdf On Sun, Jan 5, 2020 at 2:12 PM Vladimir Sitnikov < sitnikov.vladi...@gmail.com> wrote: > In the meantime, I've created a PR that updates VolcanoCost: > https://github.com/apache/calcite/pull/1722 > > Vladimir >
Re: [DISCUSS] Revert [CALCITE-1842] Sort.computeSelfCost() calls makeCost() with arguments in wrong order
In the meantime, I've created a PR that updates VolcanoCost: https://github.com/apache/calcite/pull/1722 Vladimir
Re: [DISCUSS] Revert [CALCITE-1842] Sort.computeSelfCost() calls makeCost() with arguments in wrong order
>This of course requires the VolcanoCost to be adapted. What do you think of HepPlanner? It uses RelOptCostImpl.FACTORY by default which explicitly ignores CPU and IO cost factors :(( Regarding cost#rows, there's a problem: cost#rows field does not add well when computing cumulative cost. What if we put the number of _rejected_ rows to the cost#rows field? Then the field would have certain meaning: * If the value is high, the plan is probably rejecting a lot of unrelated rows, thus it is suboptimal * Extra Project/Calc nodes won't artificially increase rows in the cost fields. Currently each Project adds "rows" which is not very good. * It is clear what to put to the rows field: "rejected rows" is more-or-less understandable. For Project it would be 0. * Join/Filter/Calc nodes would show "estimated number of returned rows=X (from metadataquery), rejected rows=Y (from cost)" which would help understanding where the time is spent That is inspired by PostgreSQL's "rows removed by filter" when running explain analyze (which is statement execution + collecting statistics on each execution plan node): http://wiki.postgresql.org/wiki/What's_new_in_PostgreSQL_9.2#Explain_improvements Vladimir
Re: [DISCUSS] Revert [CALCITE-1842] Sort.computeSelfCost() calls makeCost() with arguments in wrong order
>I think we should try to make our cost estimations more realistic in terms of cpu and io and don't try to put everything in rows as it is the case for various operators. This of course requires the VolcanoCost to be adapted. Well. The more I revise costs, the more I incline to that opinion as well. It looks the comparison should be based on cpu+io*CPU_PER_IO. Then 'rows' becomes obsolete :-/ It seems rows duplicates rowCount metadata, thus I am inclined to hide 'rows' from toString output. Probably, toString should show cpu+io*CPU_PER_IO result. Vladimir
Re: [DISCUSS] Revert [CALCITE-1842] Sort.computeSelfCost() calls makeCost() with arguments in wrong order
Hi Vladimir, I think we should leave it as it was. The fact that VolcanoCost does not exploit the cpu and io information is a problem of that class and not of the Sort#computeSelfCost method. Note that the VolcanoPlanner can be configured with a RelOptCostFactory which means that people who use the planner may use another implementation of RelOptCost which does take into account cpu and io metrics. Regarding the more general question, I think we should try to make our cost estimations more realistic in terms of cpu and io and don't try to put everything in rows as it is the case for various operators. This of course requires the VolcanoCost to be adapted. Best, Stamatis On Sun, Dec 29, 2019 at 7:33 PM Vladimir Sitnikov < sitnikov.vladi...@gmail.com> wrote: > Hi, > > I'm inclined to revert > > https://github.com/apache/calcite/commit/48a20668647b5a5e86073ef0e9ce206669ad6867 > Motivation can be found in > > https://issues.apache.org/jira/browse/CALCITE-1842?focusedCommentId=17004696=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17004696 > > WDYT? > > The question there is Sort#computeSelfCost > We have (rows, cpu, io) cost fields, however, most of the time we use just > **rows** to represent the costing. > For instance, EnumerableHashJoin computes the cost and returns (rows, 0, > 0). > > CALCITE-1842 adjusted Sort costing so it moved NLogN to cpu field, and it > makes the sorting virtually free > because the current Volcano is using rows field only when comparing the > costs. > > Unfortunately, CALCITE-1842 has no tests, so I don't really see what was > the problem. > > Vladimir >
[DISCUSS] Revert [CALCITE-1842] Sort.computeSelfCost() calls makeCost() with arguments in wrong order
Hi, I'm inclined to revert https://github.com/apache/calcite/commit/48a20668647b5a5e86073ef0e9ce206669ad6867 Motivation can be found in https://issues.apache.org/jira/browse/CALCITE-1842?focusedCommentId=17004696=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17004696 WDYT? The question there is Sort#computeSelfCost We have (rows, cpu, io) cost fields, however, most of the time we use just **rows** to represent the costing. For instance, EnumerableHashJoin computes the cost and returns (rows, 0, 0). CALCITE-1842 adjusted Sort costing so it moved NLogN to cpu field, and it makes the sorting virtually free because the current Volcano is using rows field only when comparing the costs. Unfortunately, CALCITE-1842 has no tests, so I don't really see what was the problem. Vladimir