On 8/9/21 9:19 PM, Mark Dilger wrote:


On Jul 20, 2021, at 11:28 AM, Tomas Vondra <tomas.von...@enterprisedb.com> 
wrote:

Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
<0001-Handling-Expr-op-Expr-clauses-in-extended-stats-20210720.patch>

Hi Tomas,

I tested this patch against master looking for types of clauses that uniformly 
get worse with the patch applied.  I found some.

The tests are too large to attach, but the scripts that generate them are not.  
To perform the tests:

        git checkout master
        perl ./gentest.pl > src/test/regress/sql/gentest.sql
        cat /dev/null > src/test/regress/expected/gentest.out
        echo "test: gentest" >> src/test/regress/parallel_schedule
        ./configure && make && make check
        cp src/test/regress/results/gentest.out 
src/test/regress/expected/gentest.out
        patch -p 1 < 
0001-Handling-Expr-op-Expr-clauses-in-extended-stats-20210720.patch
        make check
        cat src/test/regress/regression.diffs | perl ./check.pl

This shows patterns of conditions that get worse, such as:

better:0, worse:80:  A < B and A <> A or not A < A
better:0, worse:80:  A < B and not A <= A or A <= A
better:0, worse:80:  A < B or A = A
better:0, worse:80:  A < B or A = A or not A >= A
better:0, worse:80:  A < B or A >= A
better:0, worse:80:  A < B or A >= A and not A <> A
better:0, worse:80:  A < B or not A < A
better:0, worse:80:  A < B or not A <> A
better:0, worse:80:  A < B or not A <> A or A <= A
better:0, worse:80:  A < B or not A >= A or not A < A

It seems things get worse when the conditions contain a column compared against 
itself.  I suspect that is being handled incorrectly.


Thanks for this testing!

I took a quick look, and I think this is mostly due to luck in how the (default) range estimates combine without and with extended statistics. Consider for example this simple example:

create table t (a int, b int);

insert into t select mod(i,10), mod(i,20)
                from generate_series(1,1000000) s(i);

Without stats, the first clauses example is estimated like this:

    explain (timing off, analyze) select * from t
      where (A < B and A <> A) or not A < A;

                            QUERY PLAN
    ----------------------------------------------------------
     Seq Scan on t  (cost=0.00..21925.00 rows=554444 width=8)
                    (actual rows=1000000 loops=1)
       Filter: (((a < b) AND (a <> a)) OR (a >= a))
     Planning Time: 0.054 ms
     Execution Time: 80.485 ms
    (4 rows)

and with MCV on (a,b) it gets estimates like this:

QUERY PLAN
    ----------------------------------------------------------
     Seq Scan on t  (cost=0.00..21925.00 rows=333333 width=8)
                    (actual rows=1000000 loops=1)
       Filter: (((a < b) AND (a <> a)) OR (a >= a))
     Planning Time: 0.152 ms
     Execution Time: 79.917 ms
    (4 rows)

So with the statistics, the estimate gets a bit worse. The reason is fairly simple - if you look at the two parts of the OR clause, we get this:

    clause                   actual        no stats    with stats
   ---------------------------------------------------------------
    (A < B and A <> A)            0          331667             1
    not (A < A)             1000000          333333        333333

This clearly shows that the first clause is clearly improved, while the (A < A) is estimated the same way, because the clause has a single Var so it's considered to be "simple" so we ignore the MCV selectivity and just use the simple_sel calculated by clause_selectivity_ext.

And the 333333 and 331667 just happen to be closer to the actual row count. But that's mostly by luck, clearly.

But now that I think about it, maybe the problem really is in how statext_mcv_clauselist_selectivity treats this clause - the definition of "simple" clauses as "has one attnum" was appropriate when only clauses (Var op Const) were supported. But with (Var op Var) that's probably not correct anymore.

And indeed, commenting out the if condition on line 1933 (so ignoring simple_sel) and that does improve the estimates for this query. But perhaps I'm missing something, this needs more thought.


regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Reply via email to