On Sat, Mar 17, 2018 at 7:00 PM, Kefan Yang <starord...@gmail.com> wrote:
> What I am trying to say here is that similar optimizations can be applied to
> novel algorithms or other implementations of quicksort.
A novel algorithm is something to avoid here, because novel techniques tend
to only work out for specific datasets and data distributions. In general,
we care about a wide variety of cases, and are very keen to avoid
regressions.

> The paper about "Dual-Pivot Quicksort" is really helpful and it has a Java
> implementation already. I can definitely make use of that.

Cool. Please be careful to pick source material that has a license that is
compatible with the PostgreSQL license.

> Also, I am wondering what's the normal approach to testing if a certain
> sorting implementation brings performance gain in this project?

It varies quite a bit. You can search the list archives for some of this.
For example, Tomas Vondra often tests my sorting patches by subjecting them
to a variety of tests with varied distributions and data types. His results
are then collated in a spreadsheet for easy analysis. Maybe you can
replicate that. The only obvious trend is that everything is tested using
real SQL statements (including CREATE INDEX statements).

In the past, enhancements that were successfully incorporated tended either
to obviously have little or no downside, or to have a very significant
upside that made it worth accepting a small regression in a minority of
less representative cases.

> More specifically, you mentioned that little progress has been made with
> novel algorithmic enhancement. How can we get that conclusion?

That's simply been the trend. It isn't particularly likely that Postgres
will be the project that pioneers some kind of algorithmic enhancement with
broad applicability -- one that could just as easily benefit a general
purpose library sort routine.
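For readers who haven't seen the dual-pivot scheme Kefan refers to, the
partitioning step from Yaroslavskiy's algorithm (the one used by the Java
runtime) can be sketched roughly as below. This is a minimal illustration
for plain ints, not PostgreSQL code, and all names are invented for the
example:

```c
/* Sketch of Yaroslavskiy-style dual-pivot quicksort (illustrative only) */
static void
swap_ints(int *a, int i, int j)
{
    int tmp = a[i];

    a[i] = a[j];
    a[j] = tmp;
}

static void
dual_pivot_qsort(int *a, int lo, int hi)
{
    if (lo >= hi)
        return;

    /* Ensure a[lo] <= a[hi], then use them as the two pivots p <= q */
    if (a[lo] > a[hi])
        swap_ints(a, lo, hi);

    int p = a[lo];
    int q = a[hi];
    int lt = lo + 1;            /* a[lo+1 .. lt-1] < p */
    int gt = hi - 1;            /* a[gt+1 .. hi-1] > q */
    int i = lt;

    /* Partition into three regions: < p, [p, q], > q */
    while (i <= gt)
    {
        if (a[i] < p)
            swap_ints(a, i++, lt++);
        else if (a[i] > q)
            swap_ints(a, i, gt--); /* don't advance i; new a[i] unexamined */
        else
            i++;
    }

    /* Move the pivots into their final positions */
    swap_ints(a, lo, --lt);
    swap_ints(a, hi, ++gt);

    dual_pivot_qsort(a, lo, lt - 1);
    dual_pivot_qsort(a, lt + 1, gt - 1);
    dual_pivot_qsort(a, gt + 1, hi);
}
```

The claimed advantage of this scheme is fewer cache misses on average, not
fewer comparisons, which is part of why its benefit depends heavily on data
distribution and element type.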
If you're interested in algorithmic enhancements in the sense that I
understand the term, then that's the high bar that would have to be met --
it would amount to a new insight in an area that has already been
researched in enormous detail by many people, most of whom don't know much
about database systems in particular.

On the other hand, techniques like abbreviated keys work well because they
effectively exploit things like first normal form, and the fact that a high
start-up cost for the sort is already very likely. It is a technique
specifically tailored for a database system, and for modern hardware
characteristics.

--
Peter Geoghegan
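[To illustrate the abbreviated keys idea described above: pack the leading
bytes of each key into a single integer so that most comparisons become one
machine-word compare, falling back to a full comparison only on ties. This
is a hedged sketch for NUL-terminated C strings, with invented names -- it
is not PostgreSQL's actual SortSupport implementation:]

```c
#include <stdint.h>
#include <string.h>

/*
 * Pack up to the first 8 bytes of s into a uint64, big-endian style, so
 * that unsigned integer order agrees with memcmp()/strcmp() byte order.
 * Shorter strings are padded with zero bytes.
 */
static uint64_t
abbreviate(const char *s)
{
    uint64_t abbrev = 0;

    for (int i = 0; i < 8; i++)
    {
        abbrev <<= 8;
        if (*s)
            abbrev |= (unsigned char) *s++;
    }
    return abbrev;
}

/*
 * Compare using the cheap abbreviated keys first; only when they tie do we
 * pay for the authoritative full-key comparison.
 */
static int
cmp_with_abbrev(uint64_t a_abbrev, const char *a,
                uint64_t b_abbrev, const char *b)
{
    if (a_abbrev < b_abbrev)
        return -1;
    if (a_abbrev > b_abbrev)
        return 1;
    return strcmp(a, b);
}
```

[The abbreviations would be computed once per tuple before the sort, so the
(possibly expensive) per-key setup cost is amortized across the O(n log n)
comparisons -- which is why a high sort start-up cost being acceptable
matters to the technique.]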