Re: Use extended statistics to estimate (Var op Var) clauses

Tomas Vondra Tue, 10 Aug 2021 16:05:22 -0700

On 8/9/21 9:19 PM, Mark Dilger wrote:

On Jul 20, 2021, at 11:28 AM, Tomas Vondra <[email protected]> 
wrote:

Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
<0001-Handling-Expr-op-Expr-clauses-in-extended-stats-20210720.patch>


Hi Tomas,

I tested this patch against master looking for types of clauses that uniformly 
get worse with the patch applied.  I found some.

The tests are too large to attach, but the scripts that generate them are not.  
To perform the tests:

        git checkout master
        perl ./gentest.pl > src/test/regress/sql/gentest.sql
        cat /dev/null > src/test/regress/expected/gentest.out
        echo "test: gentest" >> src/test/regress/parallel_schedule
        ./configure && make && make check
        cp src/test/regress/results/gentest.out 
src/test/regress/expected/gentest.out
        patch -p 1 < 
0001-Handling-Expr-op-Expr-clauses-in-extended-stats-20210720.patch
        make check
        cat src/test/regress/regression.diffs | perl ./check.pl

This shows patterns of conditions that get worse, such as:

better:0, worse:80:  A < B and A <> A or not A < A
better:0, worse:80:  A < B and not A <= A or A <= A
better:0, worse:80:  A < B or A = A
better:0, worse:80:  A < B or A = A or not A >= A
better:0, worse:80:  A < B or A >= A
better:0, worse:80:  A < B or A >= A and not A <> A
better:0, worse:80:  A < B or not A < A
better:0, worse:80:  A < B or not A <> A
better:0, worse:80:  A < B or not A <> A or A <= A
better:0, worse:80:  A < B or not A >= A or not A < A

It seems things get worse when the conditions contain a column compared against 
itself.  I suspect that is being handled incorrectly.


Thanks for this testing!

I took a quick look, and I think this is mostly due to luck in how the(default) range estimates combine without and with extended statistics.Consider for example this simple example:


create table t (a int, b int);

insert into t select mod(i,10), mod(i,20)
                from generate_series(1,1000000) s(i);

Without stats, the first clauses example is estimated like this:

    explain (timing off, analyze) select * from t
      where (A < B and A <> A) or not A < A;

                            QUERY PLAN
    ----------------------------------------------------------
     Seq Scan on t  (cost=0.00..21925.00 rows=554444 width=8)
                    (actual rows=1000000 loops=1)
       Filter: (((a < b) AND (a <> a)) OR (a >= a))
     Planning Time: 0.054 ms
     Execution Time: 80.485 ms
    (4 rows)

and with MCV on (a,b) it gets estimates like this:

QUERY PLAN

    ----------------------------------------------------------
     Seq Scan on t  (cost=0.00..21925.00 rows=333333 width=8)
                    (actual rows=1000000 loops=1)
       Filter: (((a < b) AND (a <> a)) OR (a >= a))
     Planning Time: 0.152 ms
     Execution Time: 79.917 ms
    (4 rows)

So with the statistics, the estimate gets a bit worse. The reason isfairly simple - if you look at the two parts of the OR clause, we get this:


    clause                   actual        no stats    with stats
   ---------------------------------------------------------------
    (A < B and A <> A)            0          331667             1
    not (A < A)             1000000          333333        333333

This clearly shows that the first clause is clearly improved, while the(A < A) is estimated the same way, because the clause has a single Varso it's considered to be "simple" so we ignore the MCV selectivity andjust use the simple_sel calculated by clause_selectivity_ext.

And the 333333 and 331667 just happen to be closer to the actual rowcount. But that's mostly by luck, clearly.

But now that I think about it, maybe the problem really is in howstatext_mcv_clauselist_selectivity treats this clause - the definitionof "simple" clauses as "has one attnum" was appropriate when onlyclauses (Var op Const) were supported. But with (Var op Var) that'sprobably not correct anymore.

And indeed, commenting out the if condition on line 1933 (so ignoringsimple_sel) and that does improve the estimates for this query. Butperhaps I'm missing something, this needs more thought.



regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Use extended statistics to estimate (Var op Var) clauses

Reply via email to