Hi all,

I'm working on basic query optimization. I stumbled on a case where two operators had the same row count, but one had a much higher CPU cost. Unfortunately, the default cost model only takes the row count into account (see [1]). Stamatis pointed out in another mail that the row count might be much more important than the other costs [2]. However, if there are two possible choices with the same row count, we should prefer the one with the lower CPU cost.

I'm also wondering whether the assumption that a smaller row count is better in most cases is actually correct. And what does "better" mean in this context? The query plan with the least execution time? Maybe there's a plan that is less than 10% slower, but consumes much less CPU, memory, etc.
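To make the tie-breaking idea concrete, here is a standalone sketch. The Cost class below is purely illustrative (it is not Calcite's RelOptCost or VolcanoCost): it compares row counts first and falls back to CPU cost only when the row counts are effectively equal.

```java
// Sketch of a cost comparison that prefers the lower row count, but
// tie-breaks on CPU cost when row counts are (nearly) the same.
// The Cost class is illustrative, not Calcite's RelOptCost.
final class Cost {
    final double rowCount;
    final double cpu;

    Cost(double rowCount, double cpu) {
        this.rowCount = rowCount;
        this.cpu = cpu;
    }

    /** Returns true if this cost is less than or equal to {@code other}. */
    boolean isLe(Cost other) {
        final double epsilon = 1e-5; // tolerance for "same" row count
        if (Math.abs(this.rowCount - other.rowCount) > epsilon) {
            return this.rowCount < other.rowCount;
        }
        return this.cpu <= other.cpu; // same row count: tie-break on CPU
    }
}
```

With this, `new Cost(100, 5).isLe(new Cost(100, 50))` holds while the reverse does not, and a plan with fewer rows still wins regardless of CPU cost.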
So I thought about the cost model in general and how to improve it. I assume that the more closely the estimated cost corresponds to the real cost, the better the optimized plans. The first step would therefore be to collect real-world statistics, and the second step to adapt the cost estimation so that there's a better correspondence. To start with, I would just measure, for each RelNode, how many rows ended up in the result and how much time elapsed during query execution. Is there already a way to do this in Calcite? Does this make sense at all?

[1] https://github.com/apache/calcite/blob/52a57078ba081b24b9d086ed363c715485d1a519/core/src/main/java/org/apache/calcite/plan/volcano/VolcanoCost.java#L100
[2] https://15721.courses.cs.cmu.edu/spring2019/papers/24-costmodels/p204-leis.pdf

Cordialement / Best Regards,

*Thomas Rebele* | R&D Developer | 18 rue du 4 septembre, 75002 Paris, France | www.tibco.com
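P.S. To make concrete what I mean by measuring rows and time per operator: something like the wrapper below, one instance per RelNode, around whatever iterator the operator produces. This is a rough standalone sketch, not an existing Calcite API, and the name MeteredIterator is invented.

```java
import java.util.Iterator;

// Illustrative sketch (not an existing Calcite class): wrap the row
// iterator produced by an operator to record how many rows it emits
// and how much time is spent pulling them. Attaching one wrapper per
// RelNode would yield the per-operator statistics described above.
final class MeteredIterator<T> implements Iterator<T> {
    private final Iterator<T> delegate;
    long rowCount;      // rows actually produced by this operator
    long elapsedNanos;  // time spent inside hasNext()/next()

    MeteredIterator(Iterator<T> delegate) {
        this.delegate = delegate;
    }

    @Override public boolean hasNext() {
        long start = System.nanoTime();
        boolean result = delegate.hasNext();
        elapsedNanos += System.nanoTime() - start;
        return result;
    }

    @Override public T next() {
        long start = System.nanoTime();
        T row = delegate.next();
        elapsedNanos += System.nanoTime() - start;
        rowCount++;
        return row;
    }
}
```

After the query finishes, rowCount could be compared against the optimizer's row-count estimate for that node, and elapsedNanos against its estimated CPU cost.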