On 25/02/26 09:46, jian he wrote:
On Tue, Jan 27, 2026 at 11:42 PM Matheus Alcantara
<[email protected]> wrote:

Although this make sense to me I see difference in row estimation using
your v2 patch for the following example:

...


V1 patch:

postgres=# select * from check_estimated_rows($$ SELECT FROM t1, t2 WHERE t1.c 
= t2.c GROUP BY t1.c, t2.c $$);
  estimated | actual
-----------+--------
         12 |     12

V2 patch:

postgres=# select * from check_estimated_rows($$ SELECT FROM t1, t2 WHERE t1.c 
= t2.c GROUP BY t1.c, t2.c $$);
  estimated | actual
-----------+--------
        144 |     12

I've also tried to make the partitions of t1 and t2 as foreign tables
and I got the same row estimation difference.

I'm just wondering if we are missing something?

Hi.

in function process_equivalence, we have:
     /*
      * Ensure both input expressions expose the desired collation (their types
      * should be OK already); see comments for canonicalize_ec_expression.
      */
     item1 = canonicalize_ec_expression(item1,
                                        exprType((Node *) item1),
                                        collation);
     item2 = canonicalize_ec_expression(item2,
                                        exprType((Node *) item2),
                                        collation);


Let's simplify the test case.
CREATE COLLATION case_insensitive (provider = icu, locale =
'@colStrength=secondary', deterministic = false);
CREATE DOMAIN d_txt1 AS text collate case_insensitive;
CREATE TABLE t3 (a int, b int, c text);
INSERT INTO t3 SELECT i  % 12, i, to_char(i/50, 'FM0000') FROM
generate_series(0, 599, 2) i;
ANALYZE t3;
CREATE TABLE t4 (a int, b int, c d_txt1);
INSERT INTO t4 SELECT i % 10, i, to_char(i/50, 'FM0000') FROM
generate_series(0, 599, 3) i;
ANALYZE t4;
EXPLAIN SELECT FROM t3, t4 WHERE t3.c = t4.c GROUP BY t3.c, t4.c;

The ``WHERE t3.c = t4.c `` after the function process_equivalence, it will
produce 2 RELABELTYPE node in EquivalenceClass->ec_members->em_expr and your v1
uncondition strip these two RELABELTYPE, exprs_known_equal will retrun true,
therefore for numdistinct it will think "GROUP BY t3.c, t4.c" is the same as
""GROUP BY t3.c".

However if we not unconditionly strip RELABELTYPE, exprs_known_equal will return
false, therefore "GROUP BY t3.c, t4.c", "t3.c", "t4.c" is not identical, so it
multiply these two distinct numbers.  IMHO, That's the reason for
estimate number 144 versus 12.


Ok, that make sense to me, thanks for pointing this out.

So do you think that v3 attached on [1] correctly address this?

Please also see the comments in canonicalize_ec_expression.
Actually, I think the comments in canonicalize_ec_expression discourage strip
RelabelType nodes when RelabelType->resultcollid differs from the collation of
RelabelType->arg.


Agree, thanks.

[1] https://www.postgresql.org/message-id/DFZHIGROJHVS.25OYGENTHBLSM%40gmail.com

--
Matheus Alcantara
EDB: https://www.enterprisedb.com


Reply via email to