On 8/11/21 2:08 PM, Dean Rasheed wrote:
On Wed, 11 Aug 2021 at 00:05, Tomas Vondra
<tomas.von...@enterprisedb.com> wrote:

So with the statistics, the estimate gets a bit worse. The reason is
fairly simple - if you look at the two parts of the OR clause, we get this:

      clause                   actual        no stats    with stats
     ---------------------------------------------------------------
      (A < B and A <> A)            0          331667             1
      not (A < A)             1000000          333333        333333

This clearly shows that the first clause is clearly improved, while the
(A < A) is estimated the same way, because the clause has a single Var
so it's considered to be "simple" so we ignore the MCV selectivity and
just use the simple_sel calculated by clause_selectivity_ext.

And the 333333 and 331667 just happen to be closer to the actual row
count. But that's mostly by luck, clearly.

But now that I think about it, maybe the problem really is in how
statext_mcv_clauselist_selectivity treats this clause - the definition
of "simple" clauses as "has one attnum" was appropriate when only
clauses (Var op Const) were supported. But with (Var op Var) that's
probably not correct anymore.


Hmm, interesting. Clearly the fact that the combined estimate without
extended stats was better was just luck, based on it's large
overestimate of the first clause. But it's also true that a (Var op
Var) clause should not be treated as simple, because "simple" in this
context is meant to be for clauses that are likely to be better
estimated with regular stats, whereas in this case, extended stats
would almost certainly do better on the second clause.

I don't see why extended stats would do better on the second clause. I mean, if you have (A < A) then extended stats pretty much "collapse" into per-column stats. We could get almost the same estimate on single-column MCV list, etc. The reason why that does not happen is that we just treat it as a range clause, and assign it a default 33% estimate.

But we could make that a bit smarter, and assign better estimates to those clauses

  (A < A) => 0.0
  (A = A) => 1.0
  (A <= A) => 1.0

And that'd give us the same estimates, I think. Not sure that's worth it, because (A op A) clauses are probably very rare, OTOH it's cheap.


Perhaps the easiest way to identify simple clauses would be in
statext_is_compatible_clause(), rather than the way it's done now,
because it has the relevant information at hand, so it could be made
to return an extra flag.


Agreed, that seems like a better place to fix this.

This feels like rather an artificial example though. Is there any real
use for this sort of clause?


True. It seems a bit artificial, which is understandable as it came from a synthetic test generating all possible clauses. OTOH, fixing it seems fairly cheap ...


regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Reply via email to