On 10/01/23 10:53, David Rowley wrote:
the total cost is the same for both of these, but the execution time
seems to vary quite a bit.
This is really weird, I tried it different ways (to rule out any issues
due to
caching) and execution time varied in spite of having same cost.
Maybe we should try and do this for DISTINCT queries if the
distinct_pathkeys match the orderby_pathkeys. That seems a little less
copout-ish. If the ORDER BY is the same as the DISTINCT then it seems
likely that the ORDER BY might opt to use the Unique path for DISTINCT
since it'll already have the correct pathkeys.
However, if the ORDER BY has fewer columns then it might be cheaper to Hash
Aggregate and
then sort all over again, especially so when the DISTINCT removes a
large proportion of the rows.
Isn't order by pathkeys are always fewer than distinct pathkeys?
distinct pathkeys = order by pathkeys + window functions pathkeys
Again, I got your point which that it is okay to pushdown order by clause
if distinct is doing unique sort. But problem is (atleast from what I am
facing),
distinct is not caring about pushed down sortkeys, it goes with hashagg
or unique
with some other logic (mentioned below).
Consider following (with distinct clause restriction removed)
if (parse->distinctClause)
{
List* distinct_pathkeys = make_pathkeys_for_sortclauses(root,
parse->distinctClause, root->processed_tlist);
if (!compare_pathkeys(distinct_pathkeys, orderby_pathkeys)==1) // distinct
key > order by key
skip = true; // this is used to skip order by pushdown
}
CASE #1:
explain (costs off) select distinct a,b, min(a) over (partition by a), sum (a)
over (partition by a) from abcd order by a,b;
QUERY PLAN
-----------------------------------------------------------
Sort
Sort Key: a, b
-> HashAggregate
Group Key: a, b, min(a) OVER (?), sum(a) OVER (?)
-> WindowAgg
-> Sort
Sort Key: a, b
-> Seq Scan on abcd
(8 rows)
explain (costs off) select distinct a,b,c, min(a) over (partition by a), sum
(a) over (partition by a) from abcd order by a,b,c;
QUERY PLAN
--------------------------------------------------------------
Sort
Sort Key: a, b, c
-> HashAggregate
Group Key: a, b, c, min(a) OVER (?), sum(a) OVER (?)
-> WindowAgg
-> Sort
Sort Key: a, b, c
-> Seq Scan on abcd
(8 rows)
No matter how many columns are pushed down, it does hashagg.
On the other hand:
CASE #2:
EXPLAIN (costs off) SELECT DISTINCT depname, empno, min(salary) OVER (PARTITION
BY depname) depminsalary,sum(salary) OVER (PARTITION BY depname) depsalary
FROM empsalary
ORDER BY depname, empno;
QUERY PLAN
----------------------------------------------------------------------------------
Unique
-> Sort
Sort Key: depname, empno, (min(salary) OVER (?)), (sum(salary) OVER
(?))
-> WindowAgg
-> Sort
Sort Key: depname, empno
-> Seq Scan on empsalary
(7 rows)
EXPLAIN (costs off) SELECT DISTINCT depname, empno, enroll_date, min(salary)
OVER (PARTITION BY depname) depminsalary,sum(salary) OVER (PARTITION BY
depname) depsalary
FROM empsalary
ORDER BY depname, empno, enroll_date;
QUERY PLAN
-----------------------------------------------------------------------------------------------
Unique
-> Sort
Sort Key: depname, empno, enroll_date, (min(salary) OVER (?)),
(sum(salary) OVER (?))
-> WindowAgg
-> Sort
Sort Key: depname, empno, enroll_date
-> Seq Scan on empsalary
(7 rows)
It keeps doing Unique.
In both of the cases, compare_pathkeys(distinct_pathkeys,
orderby_pathkeys) returns 1
Looking bit further, planner is choosing things correctly.
I could see cost of unique being higher in 1st case and lower in 2nd case.
But the point is, if sort for orderby is pushdown, shouldn't there be
some discount on
cost of Unique sort (so that there is more possibility of it being
favorable compared to HashAgg in certain cases)?
Again, cost of Unqiue node is taken as cost of sort node as it is, but
for HashAgg, new cost is being computed. If we do incremental sort here
(for unique node),
as we have pushed down orderby's, unique cost could be reduced and our
optimization could
be made worthwhile (I assume this is what you intended here) in case of
distinct.
Eg:
EXPLAIN SELECT DISTINCT depname, empno, enroll_date, min(salary) OVER
(PARTITION BY depname) depminsalary,sum(salary) OVER (PARTITION BY depname)
depsalary
FROM empsalary
ORDER BY depname, empno, enroll_date;
QUERY PLAN
-----------------------------------------------------------------------------------------------
Unique (cost=1.63..1.78 rows=10 width=56)
-> Sort (cost=1.63..1.66 rows=10 width=56)
Sort Key: depname, empno, enroll_date, (min(salary) OVER (?)),
(sum(salary) OVER (?))
-> WindowAgg (cost=1.27..1.47 rows=10 width=56)
-> Sort (cost=1.27..1.29 rows=10 width=48)
Sort Key: depname, empno, enroll_date
-> Seq Scan on empsalary (cost=0.00..1.10 rows=10
width=48)
depname, empno, enroll_date are presorted but still strict sorting is
done on all columns.
Additionally,
the total cost is the same for both of these, but the execution time
seems to vary quite a bit.
Even if I pushdown one or two path keys, end result is same cost (which
isn't helping).
--
Regards,
Ankit Kumar Pandey