On 10/01/23 10:53, David Rowley wrote:

the total cost is the same for both of these, but the execution time
seems to vary quite a bit.

This is really weird, I tried it different ways (to rule out any issues due to

caching) and execution time varied in spite of having same cost.

Maybe we should try and do this for DISTINCT queries if the
distinct_pathkeys match the orderby_pathkeys. That seems a little less
copout-ish. If the ORDER BY is the same as the DISTINCT then it seems
likely that the ORDER BY might opt to use the Unique path for DISTINCT
since it'll already have the correct pathkeys.

However, if the ORDER BY has fewer columns then it might be cheaper to Hash 
Aggregate and
then sort all over again, especially so when the DISTINCT removes a
large proportion of the rows.

Isn't order by pathkeys are always fewer than distinct pathkeys?

distinct pathkeys = order by pathkeys + window functions pathkeys

Again, I got your point which that it is okay to pushdown order by clause

if distinct is doing unique sort. But problem is (atleast from what I am facing),

distinct is not caring about pushed down sortkeys, it goes with hashagg or unique

with some other logic (mentioned below).


Consider following (with distinct clause restriction removed)

if (parse->distinctClause)
{
        List* distinct_pathkeys = make_pathkeys_for_sortclauses(root, 
parse->distinctClause, root->processed_tlist);
        if (!compare_pathkeys(distinct_pathkeys, orderby_pathkeys)==1) // distinct 
key > order by key
                skip = true; // this is used to skip order by pushdown


}

CASE #1:

explain (costs off) select distinct a,b, min(a) over (partition by a), sum (a) 
over (partition by a) from abcd order by a,b;
                        QUERY PLAN
-----------------------------------------------------------
 Sort
   Sort Key: a, b
   ->  HashAggregate
         Group Key: a, b, min(a) OVER (?), sum(a) OVER (?)
         ->  WindowAgg
               ->  Sort
                     Sort Key: a, b
                     ->  Seq Scan on abcd
(8 rows)

explain (costs off) select distinct a,b,c, min(a) over (partition by a), sum 
(a) over (partition by a) from abcd order by a,b,c;
                          QUERY PLAN
--------------------------------------------------------------
 Sort
   Sort Key: a, b, c
   ->  HashAggregate
         Group Key: a, b, c, min(a) OVER (?), sum(a) OVER (?)
         ->  WindowAgg
               ->  Sort
                     Sort Key: a, b, c
                     ->  Seq Scan on abcd
(8 rows)

No matter how many columns are pushed down, it does hashagg.

On the other hand:

CASE #2:

EXPLAIN (costs off) SELECT DISTINCT depname, empno, min(salary) OVER (PARTITION 
BY depname) depminsalary,sum(salary) OVER (PARTITION BY depname) depsalary
FROM empsalary
ORDER BY depname, empno;
                                    QUERY PLAN
----------------------------------------------------------------------------------
 Unique
   ->  Sort
         Sort Key: depname, empno, (min(salary) OVER (?)), (sum(salary) OVER 
(?))
         ->  WindowAgg
               ->  Sort
                     Sort Key: depname, empno
                     ->  Seq Scan on empsalary
(7 rows)

EXPLAIN (costs off) SELECT DISTINCT depname, empno, enroll_date, min(salary) 
OVER (PARTITION BY depname) depminsalary,sum(salary) OVER (PARTITION BY 
depname) depsalary
FROM empsalary
ORDER BY depname, empno, enroll_date;
                                          QUERY PLAN
-----------------------------------------------------------------------------------------------
 Unique
   ->  Sort
         Sort Key: depname, empno, enroll_date, (min(salary) OVER (?)), 
(sum(salary) OVER (?))
         ->  WindowAgg
               ->  Sort
                     Sort Key: depname, empno, enroll_date
                     ->  Seq Scan on empsalary
(7 rows)

It keeps doing Unique.

In both of the cases, compare_pathkeys(distinct_pathkeys, orderby_pathkeys) returns 1

Looking bit further, planner is choosing things correctly.

I could see cost of unique being higher in 1st case and lower in 2nd case.

But the point is, if sort for orderby is pushdown, shouldn't there be some discount on

cost of Unique sort (so that there is more possibility of it being favorable compared to HashAgg in certain cases)?

Again, cost of Unqiue node is taken as cost of sort node as it is, but

for HashAgg, new cost is being computed. If we do incremental sort here (for unique node),

as we have pushed down orderby's, unique cost could be reduced and our optimization could

be made worthwhile (I assume this is what you intended here) in case of distinct.

Eg:

EXPLAIN SELECT DISTINCT depname, empno, enroll_date, min(salary) OVER 
(PARTITION BY depname) depminsalary,sum(salary) OVER (PARTITION BY depname) 
depsalary
FROM empsalary
ORDER BY depname, empno, enroll_date;
                                          QUERY PLAN
-----------------------------------------------------------------------------------------------
 Unique  (cost=1.63..1.78 rows=10 width=56)
   ->  Sort  (cost=1.63..1.66 rows=10 width=56)
         Sort Key: depname, empno, enroll_date, (min(salary) OVER (?)), 
(sum(salary) OVER (?))
         ->  WindowAgg  (cost=1.27..1.47 rows=10 width=56)
               ->  Sort  (cost=1.27..1.29 rows=10 width=48)
                     Sort Key: depname, empno, enroll_date
                     ->  Seq Scan on empsalary  (cost=0.00..1.10 rows=10 
width=48)

depname, empno, enroll_date are presorted but still strict sorting is done on all columns.


Additionally,

the total cost is the same for both of these, but the execution time
seems to vary quite a bit.

Even if I pushdown one or two path keys, end result is same cost (which isn't helping).


--
Regards,
Ankit Kumar Pandey



Reply via email to