Any help with improving the cost model is much appreciated.

I am not sure what you have in mind when you mention real-world statistics,
but I think you could also start with the stats provided by synthetic
datasets such as Foodmart, TPC-H, and TPC-DS.

Foodmart, TPC-H, and TPC-DS come with a rich set of queries (especially
TPC-DS) that you could use to demonstrate that changes in the cost model
are beneficial. The advantage is that you can also run the queries and
validate the improvement.

Even if you want to focus on two-table joins, you could do that by taking
sub-patterns of the queries appearing in TPC-DS, for instance.

In general, apart from the cost model, the default ruleset can also be
improved, and using the TPC-H/TPC-DS query plans as a baseline could help a lot.

All the datasets mentioned previously, and more, are already present in the
repo, so you can rely on them to get started.
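
To make that a bit more concrete, here is a rough, untested sketch of how
one could compare the plans Calcite picks before and after a cost-model
change, using plain JDBC and EXPLAIN. The model file path is just a
placeholder; the table and column names follow the Foodmart dataset:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ExplainBaseline {
  public static void main(String[] args) throws Exception {
    // Placeholder: point this at a model file describing one of the datasets.
    String url = "jdbc:calcite:model=path/to/foodmart-model.json";
    String sql = "explain plan for "
        + "select c.\"city\", count(*) "
        + "from \"foodmart\".\"sales_fact_1997\" as s "
        + "join \"foodmart\".\"customer\" as c "
        + "  on s.\"customer_id\" = c.\"customer_id\" "
        + "group by c.\"city\"";
    try (Connection connection = DriverManager.getConnection(url);
         Statement statement = connection.createStatement();
         ResultSet resultSet = statement.executeQuery(sql)) {
      // EXPLAIN PLAN FOR prints the plan chosen by the optimizer; diffing its
      // output before/after a cost-model change shows whether the plan moved.
      while (resultSet.next()) {
        System.out.println(resultSet.getString(1));
      }
    }
  }
}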

Best,
Stamatis


On Mon, Aug 10, 2020 at 1:48 PM Thomas Rebele <treb...@tibco.com.invalid>
wrote:

> Thank you for your analysis and the references. I would be more interested
> in sub-problem 2.
> A kind of tuning tool would be nice. If a project using Calcite introduces
> custom operators, such a tool could help find a cost model for them.
>
> I would start with a simple case, e.g., deciding whether a particular join
> should be translated into a hash join, merge join, or correlate.
> Depending on the cardinality of the inputs, one algorithm or the other
> might be better. Real-world statistics might help decide which one to
> choose.
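>
> Just to sketch the direction (untested, and the class and column names are
> made up): the measured statistics could be handed to the planner via a
> table's Statistic, which is where the cardinality estimates come from:
>
> import java.util.Collections;
> import org.apache.calcite.rel.type.RelDataType;
> import org.apache.calcite.rel.type.RelDataTypeFactory;
> import org.apache.calcite.schema.Statistic;
> import org.apache.calcite.schema.Statistics;
> import org.apache.calcite.schema.impl.AbstractTable;
> import org.apache.calcite.sql.type.SqlTypeName;
>
> /** Table that reports a measured row count to the planner. */
> public class MeasuredTable extends AbstractTable {
>   private final double measuredRowCount; // e.g. taken from production stats
>
>   public MeasuredTable(double measuredRowCount) {
>     this.measuredRowCount = measuredRowCount;
>   }
>
>   @Override public Statistic getStatistic() {
>     // The row count (and, optionally, unique keys) feeds the cardinality
>     // estimates that drive the hash join / merge join / correlate decision.
>     return Statistics.of(measuredRowCount, Collections.emptyList());
>   }
>
>   @Override public RelDataType getRowType(RelDataTypeFactory typeFactory) {
>     // Made-up columns, just so the sketch is self-contained.
>     return typeFactory.builder()
>         .add("ID", SqlTypeName.INTEGER)
>         .add("NAME", SqlTypeName.VARCHAR)
>         .build();
>   }
> }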
>
> Cordialement / Best Regards,
> *Thomas Rebele* | R&D Developer | 18 rue du 4 septembre, 75002 Paris,
> France | www.tibco.com
>
>
> On Fri, Aug 7, 2020 at 11:32 PM Julian Hyde <jh...@apache.org> wrote:
>
> > Also, check out the paper "Learning to Optimize Join Queries With Deep
> > Reinforcement Learning" by Krishnan, Yang, Goldberg, Hellerstein,
> > Stoica 2019 (which aims to improve join ordering, is evaluated on the
> > join-order benchmark, and uses Calcite as one of its test platforms):
> > https://arxiv.org/pdf/1808.03196.pdf
> >
> > On Fri, Aug 7, 2020 at 2:02 PM Julian Hyde <jh...@apache.org> wrote:
> > >
> > > I consider there to be two fairly independent sub-problems:
> > >
> > > 1. Improve our cardinality estimates (especially for queries with
> > > complex joins and aggregation).
> > >
> > > 2. Calibrate physical operators so that, given a good estimate of the
> > > number of rows they will see, we can come up with a reasonable
> > > estimate of the physical cost (e.g. how long the query will take to
> > > execute, and how much memory it will use).
> > >
> > > For 1, the paper you cite (and the join-order benchmark it introduces)
> > > is an excellent contribution to the field. It inspired me to do work
> > > on profiling [1]. I would encourage you to build on the work I have
> > > already done.
> > >
> > > For 2, I have not done any work personally. An approach would be to
> > > give each physical operator (e.g. EnumerableHashJoin) a cost model
> > > parameterized by certain constants, and then run experiments to
> > > determine the values of those constants empirically. Perhaps we could
> > > write a "TuningTool" that generates an "operator constants file", and
> > > thereby start to formalize the process.
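> > >
> > > As a rough, untested sketch of what "parameterized by certain constants"
> > > could look like (the constants below are exactly what such a tool would
> > > calibrate; the names and the formula are made up):
> > >
> > > import org.apache.calcite.plan.RelOptCost;
> > > import org.apache.calcite.plan.RelOptCostFactory;
> > >
> > > /** Cost formula for a hash join, parameterized by constants that a
> > >  * tuning tool could fit empirically. */
> > > public class HashJoinCostModel {
> > >   private final double buildCostPerRow;  // CPU cost of inserting a row
> > >   private final double probeCostPerRow;  // CPU cost of probing with a row
> > >   private final double bytesPerRow;      // memory/IO footprint per row
> > >
> > >   public HashJoinCostModel(double buildCostPerRow, double probeCostPerRow,
> > >       double bytesPerRow) {
> > >     this.buildCostPerRow = buildCostPerRow;
> > >     this.probeCostPerRow = probeCostPerRow;
> > >     this.bytesPerRow = bytesPerRow;
> > >   }
> > >
> > >   /** Turns the planner's row-count estimates into a cost. */
> > >   public RelOptCost cost(RelOptCostFactory costFactory,
> > >       double buildRowCount, double probeRowCount, double outputRowCount) {
> > >     double cpu = buildRowCount * buildCostPerRow
> > >         + probeRowCount * probeCostPerRow;
> > >     double io = buildRowCount * bytesPerRow;
> > >     return costFactory.makeCost(outputRowCount, cpu, io);
> > >   }
> > > }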
> > >
> > > Julian
> > >
> > > [1] https://www.slideshare.net/julianhyde/data-profiling-in-apache-calcite
> > >
> > > On Fri, Aug 7, 2020 at 6:57 AM Thomas Rebele <treb...@tibco.com.invalid> wrote:
> > > >
> > > > Hi all,
> > > >
> > > > I'm working on basic query optimization. I once stumbled upon a case
> > > > where two operators had the same row count but one had a much higher
> > > > CPU cost. Unfortunately the default cost model only takes the row count
> > > > into account (see [1]). Stamatis had pointed out in another mail that
> > > > the row count might be much more important than the other costs [2].
> > > > However, if there are two possible choices with the same row count, we
> > > > should prefer the one with the least CPU cost. I'm wondering whether
> > > > the assumption that a smaller row count is better in most cases is
> > > > actually correct. Also, what is "better" in this context? The query
> > > > plan with the least execution time? Maybe there's a plan that is just
> > > > <10% slower, but consumes much less CPU/memory/etc.
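> > > >
> > > > For illustration only (a rough sketch, not how VolcanoCost is written):
> > > > the kind of comparison I have in mind would break row-count ties on CPU:
> > > >
> > > > import org.apache.calcite.plan.RelOptCost;
> > > >
> > > > /** Sketch of a plan-cost comparison that breaks row-count ties on CPU. */
> > > > public final class CostComparison {
> > > >   private static final double EPSILON = 1e-5;
> > > >
> > > >   private CostComparison() {}
> > > >
> > > >   /** Returns true if cost {@code a} should be preferred over {@code b}. */
> > > >   public static boolean isPreferable(RelOptCost a, RelOptCost b) {
> > > >     if (Math.abs(a.getRows() - b.getRows()) > EPSILON) {
> > > >       return a.getRows() < b.getRows(); // smaller row count still wins
> > > >     }
> > > >     return a.getCpu() < b.getCpu();     // same row count: cheaper CPU wins
> > > >   }
> > > > }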
> > > >
> > > > So I thought about the cost model in general, and how to improve it. I
> > > > assume that the better the estimated cost corresponds to the real cost,
> > > > the better the optimized plans will be. So the first step would be to
> > > > collect real-world statistics, and the second step would be to adapt
> > > > the cost estimation so that there is a better correspondence. To begin
> > > > with, I would just measure how many rows each RelNode produced and how
> > > > much time it took during query execution. Is there already a way to do
> > > > this in Calcite? Does this make sense at all?
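> > > >
> > > > To sketch what I mean (untested, and I do not know whether this is the
> > > > right hook in Calcite): each operator's Enumerator could be wrapped to
> > > > count rows and accumulate the time spent producing them:
> > > >
> > > > import java.util.concurrent.atomic.AtomicLong;
> > > > import org.apache.calcite.linq4j.Enumerator;
> > > >
> > > > /** Wraps an Enumerator, recording row count and time spent in it. */
> > > > public class ProfilingEnumerator<T> implements Enumerator<T> {
> > > >   private final Enumerator<T> delegate;
> > > >   private final AtomicLong rowCount = new AtomicLong();
> > > >   private final AtomicLong nanos = new AtomicLong();
> > > >
> > > >   public ProfilingEnumerator(Enumerator<T> delegate) {
> > > >     this.delegate = delegate;
> > > >   }
> > > >
> > > >   @Override public T current() {
> > > >     return delegate.current();
> > > >   }
> > > >
> > > >   @Override public boolean moveNext() {
> > > >     long start = System.nanoTime();
> > > >     boolean hasNext = delegate.moveNext();
> > > >     nanos.addAndGet(System.nanoTime() - start);
> > > >     if (hasNext) {
> > > >       rowCount.incrementAndGet();
> > > >     }
> > > >     return hasNext;
> > > >   }
> > > >
> > > >   @Override public void reset() {
> > > >     delegate.reset();
> > > >   }
> > > >
> > > >   @Override public void close() {
> > > >     delegate.close();
> > > >     // Real row count and elapsed time for this operator's output.
> > > >     System.out.printf("rows=%d, time=%d ms%n",
> > > >         rowCount.get(), nanos.get() / 1_000_000);
> > > >   }
> > > > }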
> > > >
> > > > [1] https://github.com/apache/calcite/blob/52a57078ba081b24b9d086ed363c715485d1a519/core/src/main/java/org/apache/calcite/plan/volcano/VolcanoCost.java#L100
> > > > [2] https://15721.courses.cs.cmu.edu/spring2019/papers/24-costmodels/p204-leis.pdf
> > > >
> > > > Cordialement / Best Regards,
> > > > *Thomas Rebele* | R&D Developer | 18 rue du 4 septembre, 75002 Paris,
> > > > France | www.tibco.com
> >
>
