Re: cost_sort vs cost_agg
Thank you Ashutosh. On Fri, Jan 15, 2021 at 7:18 PM Ashutosh Bapat wrote: > On Thu, Jan 14, 2021 at 7:12 PM Andy Fan wrote: > > > > Currently the cost_sort doesn't consider the number of columns to sort, > which > > means the cost of SELECT * FROM t ORDER BY a; equals with the SELECT * > > FROM t ORDER BY a, b; which is obviously wrong. The impact of this is > when we > > choose the plan for SELECT DISTINCT * FROM t ORDER BY c between: > > > > Sort > >Sort Key: c > >-> HashAggregate > > Group Key: c, a, b, d, e, f, g, h, i, j, k, l, m, n > > > > and > > > > Unique > >-> Sort > > Sort Key: c, a, b, d, e, f, g, h, i, j, k, l, m, n > > > > > > Since "Sort (c)" has the same cost as "Sort (c, a, b, d, e, f, g, h, i, > j, k, > > l, m, n)", and Unique node on a sorted input is usually cheaper than > > HashAggregate, so the later one will win usually which might bad at many > > places. > > I can imagine that HashAggregate + Sort will perform better if there > are very few distinct rows but otherwise, Unique on top of Sort would > be a better strategy since it doesn't need two operations. > > Thanks for the hint, I will consider the distinct rows as a factor in the next patch. > > > > Optimizer chose HashAggregate with my patch, but it takes 6s. after set > > enable_hashagg = off, it takes 2s. > > This example actually shows that using Unique is better than > HashAggregate + Sort. May be you want to try with some data which has > very few distinct rows. > > -- Best Regards Andy Fan (https://www.aliyun.com/)
Re: cost_sort vs cost_agg
On Thu, Jan 14, 2021 at 7:12 PM Andy Fan wrote: > > Currently the cost_sort doesn't consider the number of columns to sort, which > means the cost of SELECT * FROM t ORDER BY a; equals with the SELECT * > FROM t ORDER BY a, b; which is obviously wrong. The impact of this is when we > choose the plan for SELECT DISTINCT * FROM t ORDER BY c between: > > Sort >Sort Key: c >-> HashAggregate > Group Key: c, a, b, d, e, f, g, h, i, j, k, l, m, n > > and > > Unique >-> Sort > Sort Key: c, a, b, d, e, f, g, h, i, j, k, l, m, n > > > Since "Sort (c)" has the same cost as "Sort (c, a, b, d, e, f, g, h, i, j, k, > l, m, n)", and Unique node on a sorted input is usually cheaper than > HashAggregate, so the later one will win usually which might bad at many > places. I can imagine that HashAggregate + Sort will perform better if there are very few distinct rows but otherwise, Unique on top of Sort would be a better strategy since it doesn't need two operations. > > Optimizer chose HashAggregate with my patch, but it takes 6s. after set > enable_hashagg = off, it takes 2s. This example actually shows that using Unique is better than HashAggregate + Sort. May be you want to try with some data which has very few distinct rows. -- Best Wishes, Ashutosh Bapat
cost_sort vs cost_agg
Currently the cost_sort doesn't consider the number of columns to sort, which means the cost of SELECT * FROM t ORDER BY a; equals with the SELECT * FROM t ORDER BY a, b; which is obviously wrong. The impact of this is when we choose the plan for SELECT DISTINCT * FROM t ORDER BY c between: Sort Sort Key: c -> HashAggregate Group Key: c, a, b, d, e, f, g, h, i, j, k, l, m, n and Unique -> Sort Sort Key: c, a, b, d, e, f, g, h, i, j, k, l, m, n Since "Sort (c)" has the same cost as "Sort (c, a, b, d, e, f, g, h, i, j, k, l, m, n)", and Unique node on a sorted input is usually cheaper than HashAggregate, so the later one will win usually which might bad at many places. My patch v1 did a simple improvement for cost_sort, which will consider the number of cols to sort. The main part is below: cost_sort: Assert(numSortCols); /* Include the default cost-per-comparison */ + comparison_cost += (2.0 * cpu_operator_cost * numSortCols); However it still chooses a wrong plan in the simple case below. create table wcols (a int , b int, c int, d int, e int, f int, g int, h int, i int, j int, k int, l int, m int, n int); insert into wcols select i, i , i, i , i, i , i, i, i, i, i, i, i, i from generate_series(1, 100)i; select distinct * from wcols order by c; Optimizer chose HashAggregate with my patch, but it takes 6s. after set enable_hashagg = off, it takes 2s. The main reason is both cost_sort and cost_agg doesn't consider the real hash function or real sort function, they use cpu_operator_cost instead. If we really want to fix this issue, shall we a). figure the real pg_proc.oid for sort and hash during planning stage and costing with that? b). in cost_sort, we may consider the nature order of input data for the ordered column as well? c). store the Oids in SortPath and AggPath to avoid the double calculation during createPlan stage? or any better suggestion? Thanks -- Best Regards Andy Fan (https://www.aliyun.com/) v1-0001-Improve-the-cost_sort-v1.patch Description: Binary data