Hi All,

We'd like to share updated performance results for tuple sort after improving 
the benchmark to better stress the sort itself. We added an OFFSET clause to 
the query to eliminate the contribution of data movement to the client (see 
https://www.postgresql.org/message-id/flat/CANWCAZbAmaZ7P%2BARjS97sJLXsBB5CPZyzFgqNDiqe-L%2BBqXzug%40mail.gmail.com).
With this change, flame graphs confirm that the sort contribution is much 
larger than in our previous setup: for example, for 100k rows with 100k 
distinct values using quicksort, tuplesort_performsort now accounts for ~60% 
of the cycles, compared to ~10% before. We also added data points at 10k and 
50k rows.

Setup:
- Randomly generated integers, with varying row count and cardinality 
(controlled by the number of distinct values)
- ORDER BY query with OFFSET equal to the number of rows (a sketch of the 
schema and script follows this list)
- Query executed by pgbench (1 job, 1 client), measuring tps (transactions 
per second)
- Data collected on an AWS m7i.metal-24xl instance, Ubuntu 24.04, GCC 13
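
For reference, a minimal sketch of the benchmark for one data point (100k 
rows, ~100k distinct values); the table and column names are illustrative:

  CREATE TABLE sort_bench (val int);
  INSERT INTO sort_bench
      SELECT (random() * 100000)::int     -- ~100k distinct values
      FROM generate_series(1, 100000);    -- 100k rows

  -- pgbench script (run with -n -f script.sql -c 1 -j 1): the OFFSET
  -- discards the result set, so the measurement is dominated by the
  -- sort rather than by client data transfer.
  SELECT * FROM sort_bench ORDER BY val OFFSET 100000;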

The results below report the measured tps gain for the SIMD-enabled sort 
vs. quicksort:

 10k rows,  10k distinct values:  +92.0%
 10k rows,   1k distinct values:  +61.5%
 10k rows, 100  distinct values:  +31.3%
 10k rows,  10  distinct values:  +10.9%

 50k rows,  50k distinct values: +101.2%
 50k rows,  10k distinct values:  +85.8%
 50k rows, 100  distinct values:  +33.3%
 50k rows,  10  distinct values:  +10.7%

100k rows, 100k distinct values:  +92.0%
100k rows,  10k distinct values:  +67.8%
100k rows, 100  distinct values:  +16.1%
100k rows,  10  distinct values:   -5.8%
  
  1M rows,   1M distinct values:  +49.1%
  1M rows,  10k distinct values:  +12.1%
  1M rows, 100  distinct values:  -17.5%
  1M rows,  10  distinct values:  -31.2%

In summary, no regressions are observed up to 50k rows. At 100k rows, a small 
regression appears in the lowest-cardinality case (10 distinct values). At 1M 
rows, the regressions grow and extend to the 100-distinct-values case, while 
the gains in the high-cardinality cases shrink.

It is straightforward to add dispatch logic at query execution time based on 
row count (to select SIMD sort or quicksort), since the number of tuples is 
known when the in-memory sort runs. Conditions based on cardinality may be 
more complex to implement, and we'd appreciate any recommendations on 
planning-time or execution-time statistics we could use. We propose to start 
with the row-count condition, along the lines of the sketch below.
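
As a rough illustration, here is a minimal standalone sketch of the row-count 
dispatch. simd_sort_int32() and SIMD_SORT_MAX_ROWS are hypothetical 
placeholder names (not existing PostgreSQL symbols), and the threshold is a 
placeholder read off the measurements above, not a tuned value:

  #include <stdint.h>
  #include <stdlib.h>

  #define SIMD_SORT_MAX_ROWS 50000    /* placeholder: no regressions
                                       * observed up to 50k rows */

  extern void simd_sort_int32(int32_t *a, size_t n);   /* SIMD path */

  static int
  cmp_int32(const void *pa, const void *pb)
  {
      int32_t a = *(const int32_t *) pa;
      int32_t b = *(const int32_t *) pb;
      return (a > b) - (a < b);
  }

  static void
  sort_dispatch(int32_t *a, size_t n)
  {
      if (n <= SIMD_SORT_MAX_ROWS)
          simd_sort_int32(a, n);                     /* SIMD sort */
      else
          qsort(a, n, sizeof(int32_t), cmp_int32);   /* quicksort */
  }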
We are developing additional optimizations to improve performance in the 
cases with 1M+ rows. We'll also explore different data distributions and 
share results as they become available.

Thanks!

Luca Giacchino

