On Thu, Aug 7, 2014 at 1:09 AM, Peter Geoghegan <p...@heroku.com> wrote:
> On Wed, Aug 6, 2014 at 10:36 PM, Peter Geoghegan <p...@heroku.com> wrote:
>> This *almost* applies to patched Postgres if you pick a benchmark that
>> is very sympathetic to my patch. To my surprise, work_mem = '10MB'
>> (which results in an external tape sort) is sometimes snapping at the
>> heels of a work_mem = '5GB' setting (which results in an in-memory
>> quicksort).
>
> Note that this was with a default temp_tablespaces setting that wrote
> temp files on my home partition/SSD. With a /dev/shm/ temp tablespace,
> tape sort edges ahead, and has a couple of hundred milliseconds on
> quicksort for this test case. It's actually faster.
I decided to do a benchmark of a very large, sympathetic sort. Again, this was with 8 byte random text strings, but this time with a total of 500 million tuples. This single-column table is about 21 GB. This ran on a dedicated server with 32 GB of RAM. I took all the usual precautions (the cache was warmed, tuples were frozen, etc.).

Master took 02:02:20.5561 to sort the data with a 10 MB work_mem setting, without a ramdisk in temp_tablespaces. With a temp_tablespaces /dev/shm ramdisk, there was only a very small improvement, leaving total execution time at 02:00:58.51878 (a later repeat attempt took 02:00:41.385937) -- an improvement hardly worth bothering with.

Patched Postgres took 00:16:13.228053, again with a work_mem of 10 MB, but no ramdisk. When I changed temp_tablespaces to use the same ramdisk, this went down to 00:11:58.77301, a significant improvement. This is clearly because the data directory was on a filesystem on spinning disks, and more I/O bandwidth (real or simulated) helps external sorts. Since this machine only has 32 GB of RAM, and a significant proportion of that must be used for shared_buffers (8 GB) and the OS cache, I think there is a fair chance that a more capable I/O subsystem could be used to get appreciably better results using otherwise identical hardware.

Comparing like with like, the ramdisk patched run was over 10 times faster than master with the same ramdisk. While that disparity is obviously in itself very significant, I think the disparity in how much faster things were with a ramdisk for patched, but not for master, is also significant.

I'm done with sympathetic cases now. I welcome unsympathetic ones, or more broadly representative large tests. It's hard to come up with a benchmark that isn't either sympathetic or very sympathetic, or a pathological bad case.
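For anyone who wants to set up something of the same shape, keys like the ones above can be generated with a few lines of Python and loaded with COPY (a sketch only -- the alphabet and seed here are arbitrary choices of mine, not the exact generator behind the numbers above):

```python
import random
import string

def random_keys(n, width=8, seed=0):
    """Generate n random fixed-width ASCII strings, one per line,
    suitable for COPYing into a single-column text table."""
    rng = random.Random(seed)
    alphabet = string.ascii_letters + string.digits
    return [''.join(rng.choice(alphabet) for _ in range(width))
            for _ in range(n)]

if __name__ == '__main__':
    # The run above used 500 million rows; print a tiny sample here.
    for key in random_keys(5):
        print(key)
```

Pipe the full output to a file and load it with "COPY tab FROM '/path/to/file'", then sort with "SET work_mem = '10MB'" in effect to get an external sort.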
There is actually a simple enough C program for generating test input, "gensort", which is used by sortbenchmark.org:

http://www.ordinal.com/gensort.html

(I suggest building without support for "the SUMP Pump library", by modifying the Makefile before building.)

What's interesting about gensort is that there is a "skew" option. Without it, I can generate totally random ASCII keys. But with it, there is a tendency for there to be a certain amount of redundancy between keys in their first few bytes. This is intended to limit the effectiveness of abbreviation-type optimizations for sortbenchmark.org "Daytona Sort" entrants:

http://sortbenchmark.org/FAQ-2014.html ("Indy vs. Daytona")

Daytona sort is basically a benchmark that focuses on somewhat commercially representative data (often ASCII data), with sneaky data-aware tricks forbidden, as opposed to totally artificial uniformly distributed random binary data, which is acceptable for their "Indy sort" benchmark that gets to use every trick in the book. They note there that Daytona entrants should "not be overly dependent on the uniform and random distribution of key values in the sort input". They are allowed to be somewhat dependent on it, though - for one thing, keys are always exactly 10 bytes. They must merely "be able to sort the alternate, skewed-keys input data set in an elapsed time of no more than twice the elapsed time of the benchmark entry".

It might be useful for Robert or other reviewers to hold the abbreviated keys patch to a similar standard, possibly by using gensort, or their own modified version. I've shown a sympathetic case that is over 10 times faster, with some other less sympathetic cases that were still pretty good, since tie-breakers were generally able to get away with a cheap memcmp(). There have also been some tests showing pathological bad cases for the optimization.
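To make the skew effect concrete: gensort's real skew algorithm draws leading bytes from a fixed table, but the same effect can be approximated by simply narrowing the alphabet of the leading bytes, and then measuring how often an 8-byte abbreviated key can decide a comparison on its own (a sketch only -- the pool size and key counts are arbitrary choices of mine):

```python
import random

def make_keys(n, skew_bytes=0, seed=1):
    """10-byte keys in the spirit of gensort: with skew_bytes > 0, the
    leading bytes come from a tiny pool, so keys tend to share prefixes.
    (gensort's actual skew draws from a fixed table; this only
    approximates the effect.)"""
    rng = random.Random(seed)
    pool = ['A', 'B']                        # tiny alphabet -> prefix redundancy
    full = [chr(c) for c in range(33, 127)]  # printable ASCII, no space
    keys = []
    for _ in range(n):
        head = ''.join(rng.choice(pool) for _ in range(skew_bytes))
        tail = ''.join(rng.choice(full) for _ in range(10 - skew_bytes))
        keys.append(head + tail)
    return keys

def abbrev_resolves(keys, abbrev_len=8):
    """Fraction of adjacent comparisons in a sorted run that the
    abbreviated key (first abbrev_len bytes) decides without a
    tie-breaker comparison on the full key."""
    s = sorted(keys)
    pairs = list(zip(s, s[1:]))
    decided = sum(1 for a, b in pairs if a[:abbrev_len] != b[:abbrev_len])
    return decided / len(pairs)

if __name__ == '__main__':
    print('random keys:', abbrev_resolves(make_keys(10000, skew_bytes=0)))
    print('skewed keys:', abbrev_resolves(make_keys(10000, skew_bytes=6)))
```

With fully random keys, essentially every adjacent comparison is decided by the abbreviated key alone; with skewed leading bytes, a noticeable fraction falls through to the tie-breaker, which is exactly the behavior a Daytona-style test is meant to exercise.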
The middle ground has now become interesting, and gensort might offer a half-reasonable way to generate tests that are balanced. I've looked at the gensort code, and it seems easy enough to understand and modify for our purposes. You might want to look at multiple cases with this constant modified, for example:

#define SKEW_BYTES 6

since this controls the number of leading bytes that come from a table of skew bytes.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers