On Sat, Mar 17, 2018 at 7:00 PM, Kefan Yang <starord...@gmail.com> wrote:
> What I am trying to say here is that similar optimizations can be applied to
> novel algorithms or other implementations of quicksort.
A novel algorithm is something to avoid here, because novel techniques tend
to only work out for specific datasets and data distributions. In general,
we care about a wide variety of cases, and are very keen to avoid
regressions.

> The paper about "Dual-Pivot Quicksort" is really helpful and it has a Java
> implementation already. I can definitely make use of that.

Cool. Please be careful to pick source material that has a license that is
compatible with the PostgreSQL license.

> Also, I am wondering what's the normal approach to testing if a certain
> sorting implementation brings performance gain in this project?

It varies quite a bit. You can search the list archives for some of this.
For example, Tomas Vondra often tests my sorting patches by subjecting them
to a variety of tests with varied distributions and data types. His results
are then collated in a spreadsheet for easy analysis. Maybe you can
replicate that. The only obvious trend is that everything is tested using
real SQL statements (including CREATE INDEX statements).

In the past, enhancements that were successfully incorporated tended either
to obviously have little or no downside, or to have a very significant
upside that made it worth accepting a small regression in a minority of
less representative cases.

> More specifically, you mentioned that little progress has been made with
> novel algorithmic enhancement. How can we get that conclusion?

That's simply been the trend. It isn't particularly likely that Postgres
will be the project that pioneers some kind of algorithmic enhancement with
broad applicability -- one that could just as easily benefit a general
purpose library sort routine.
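For readers who haven't seen the dual-pivot scheme Kefan refers to, the
partitioning step from Yaroslavskiy's algorithm (the one used by the Java
runtime) can be sketched roughly as below. This is a minimal illustration
for plain ints, not PostgreSQL code, and all names are invented for the
example:

```c
/* Sketch of Yaroslavskiy-style dual-pivot quicksort (illustrative only) */
static void
swap_ints(int *a, int i, int j)
{
    int tmp = a[i];

    a[i] = a[j];
    a[j] = tmp;
}

static void
dual_pivot_qsort(int *a, int lo, int hi)
{
    if (lo >= hi)
        return;

    /* Ensure a[lo] <= a[hi], then use them as the two pivots p <= q */
    if (a[lo] > a[hi])
        swap_ints(a, lo, hi);

    int p = a[lo];
    int q = a[hi];
    int lt = lo + 1;            /* a[lo+1 .. lt-1] < p */
    int gt = hi - 1;            /* a[gt+1 .. hi-1] > q */
    int i = lt;

    /* Partition into three regions: < p, [p, q], > q */
    while (i <= gt)
    {
        if (a[i] < p)
            swap_ints(a, i++, lt++);
        else if (a[i] > q)
            swap_ints(a, i, gt--); /* don't advance i; new a[i] unexamined */
        else
            i++;
    }

    /* Move the pivots into their final positions */
    swap_ints(a, lo, --lt);
    swap_ints(a, hi, ++gt);

    dual_pivot_qsort(a, lo, lt - 1);
    dual_pivot_qsort(a, lt + 1, gt - 1);
    dual_pivot_qsort(a, gt + 1, hi);
}
```

The claimed advantage of this scheme is fewer cache misses on average, not
fewer comparisons, which is part of why its benefit depends heavily on data
distribution and element type.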
If you're interested in algorithmic enhancements in the sense that I
understand the term, then that's the high bar that would have to be met --
it would amount to a new insight in an area that has already been
researched in enormous detail by many people, most of whom don't know much
about database systems in particular.

On the other hand, techniques like abbreviated keys work well because they
effectively exploit things like first normal form, and the fact that a high
start-up cost for the sort is already very likely. It is a technique
specifically tailored for a database system, and for modern hardware
characteristics.

--
Peter Geoghegan
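[To illustrate the abbreviated keys idea described above: pack the leading
bytes of each key into a single integer so that most comparisons become one
machine-word compare, falling back to a full comparison only on ties. This
is a hedged sketch for NUL-terminated C strings, with invented names -- it
is not PostgreSQL's actual SortSupport implementation:]

```c
#include <stdint.h>
#include <string.h>

/*
 * Pack up to the first 8 bytes of s into a uint64, big-endian style, so
 * that unsigned integer order agrees with memcmp()/strcmp() byte order.
 * Shorter strings are padded with zero bytes.
 */
static uint64_t
abbreviate(const char *s)
{
    uint64_t abbrev = 0;

    for (int i = 0; i < 8; i++)
    {
        abbrev <<= 8;
        if (*s)
            abbrev |= (unsigned char) *s++;
    }
    return abbrev;
}

/*
 * Compare using the cheap abbreviated keys first; only when they tie do we
 * pay for the authoritative full-key comparison.
 */
static int
cmp_with_abbrev(uint64_t a_abbrev, const char *a,
                uint64_t b_abbrev, const char *b)
{
    if (a_abbrev < b_abbrev)
        return -1;
    if (a_abbrev > b_abbrev)
        return 1;
    return strcmp(a, b);
}
```

[The abbreviations would be computed once per tuple before the sort, so the
(possibly expensive) per-key setup cost is amortized across the O(n log n)
comparisons -- which is why a high sort start-up cost being acceptable
matters to the technique.]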