Hi all,

I'm working on basic query optimization. I stumbled on a case where two operators had the same row count, but one had a much higher CPU cost. Unfortunately, the default cost model only takes the row count into account (see [1]). Stamatis pointed out in another mail that the row count might be much more important than the other costs [2]. However, if there are two possible choices with the same row count, we should prefer the one with the lower CPU cost.

I'm also wondering whether the assumption that a smaller row count is better in most cases is actually correct. And what does "better" mean in this context? The query plan with the least execution time? Maybe there's a plan that is less than 10% slower, but consumes much less CPU, memory, etc.
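To make the tie-breaking idea concrete, here is a standalone sketch. The Cost class below is purely illustrative (it is not Calcite's RelOptCost or VolcanoCost): it compares row counts first and falls back to CPU cost only when the row counts are effectively equal.

```java
// Sketch of a cost comparison that prefers the lower row count, but
// tie-breaks on CPU cost when row counts are (nearly) the same.
// The Cost class is illustrative, not Calcite's RelOptCost.
final class Cost {
    final double rowCount;
    final double cpu;

    Cost(double rowCount, double cpu) {
        this.rowCount = rowCount;
        this.cpu = cpu;
    }

    /** Returns true if this cost is less than or equal to {@code other}. */
    boolean isLe(Cost other) {
        final double epsilon = 1e-5; // tolerance for "same" row count
        if (Math.abs(this.rowCount - other.rowCount) > epsilon) {
            return this.rowCount < other.rowCount;
        }
        return this.cpu <= other.cpu; // same row count: tie-break on CPU
    }
}
```

With this, `new Cost(100, 5).isLe(new Cost(100, 50))` holds while the reverse does not, and a plan with fewer rows still wins regardless of CPU cost.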
So I thought about the cost model in general and how to improve it. I assume that the more closely the estimated cost corresponds to the real cost, the better the optimized plans. The first step would therefore be to collect real-world statistics, and the second step to adapt the cost estimation so that there's a better correspondence. To start with, I would just measure, for each RelNode, how many rows ended up in the result and how much time elapsed during query execution. Is there already a way to do this in Calcite? Does this make sense at all?

[1] https://github.com/apache/calcite/blob/52a57078ba081b24b9d086ed363c715485d1a519/core/src/main/java/org/apache/calcite/plan/volcano/VolcanoCost.java#L100
[2] https://15721.courses.cs.cmu.edu/spring2019/papers/24-costmodels/p204-leis.pdf

Cordialement / Best Regards,

*Thomas Rebele* | R&D Developer | 18 rue du 4 septembre, 75002 Paris, France | www.tibco.com
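P.S. To make concrete what I mean by measuring rows and time per operator: something like the wrapper below, one instance per RelNode, around whatever iterator the operator produces. This is a rough standalone sketch, not an existing Calcite API, and the name MeteredIterator is invented.

```java
import java.util.Iterator;

// Illustrative sketch (not an existing Calcite class): wrap the row
// iterator produced by an operator to record how many rows it emits
// and how much time is spent pulling them. Attaching one wrapper per
// RelNode would yield the per-operator statistics described above.
final class MeteredIterator<T> implements Iterator<T> {
    private final Iterator<T> delegate;
    long rowCount;      // rows actually produced by this operator
    long elapsedNanos;  // time spent inside hasNext()/next()

    MeteredIterator(Iterator<T> delegate) {
        this.delegate = delegate;
    }

    @Override public boolean hasNext() {
        long start = System.nanoTime();
        boolean result = delegate.hasNext();
        elapsedNanos += System.nanoTime() - start;
        return result;
    }

    @Override public T next() {
        long start = System.nanoTime();
        T row = delegate.next();
        elapsedNanos += System.nanoTime() - start;
        rowCount++;
        return row;
    }
}
```

After the query finishes, rowCount could be compared against the optimizer's row-count estimate for that node, and elapsedNanos against its estimated CPU cost.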