The complicated patch I've been working with for a while now is labeled "sequential scan posix fadvise" in the CommitFest queue. There are a lot of parts to that going back to last December, and I've added the most relevant links to the September CommitFest page.

The first message there on this topic is http://archives.postgresql.org/message-id/[EMAIL PROTECTED] which includes a program from Greg Stark that measures how much advising the OS about upcoming reads improves overall transfer speed on a synthetic random-read benchmark. The idea is that you advise the OS about up to n requests at a time, where n ranges from 1 (no prefetch at all) to 8192. As n goes up, the total net bandwidth usually goes up as well. Dividing the bandwidth at any prefetch level by the baseline (1 = no prefetch) gives a speedup multiplier. The program can submit the requests either unsorted or sorted, and the speedup curves have a similar shape (but different magnitude) in both cases.

While not a useful PostgreSQL patch on its own, this program lets you figure out whether the basic idea here (advising the OS about blocks ahead of time to speed up the whole read) works on a particular system, without having to set up a larger test. What I have to report here are results from several systems running both Linux and Solaris with various numbers of disk spindles. The Linux systems use the posix_fadvise call, while the Solaris ones use the aio library.
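
To make that concrete, here's a minimal sketch of the Linux side of the technique: keep a window of POSIX_FADV_WILLNEED hints outstanding ahead of the block currently being read. This is only an illustration of the approach, not the attached prefetch.c; the block size, request array, and helper name are invented for the example:

    /* Sketch: windowed prefetch via posix_fadvise; illustration only */
    #define _XOPEN_SOURCE 600               /* for posix_fadvise */
    #include <fcntl.h>
    #include <unistd.h>

    #define BLOCKSZ 8192

    /*
     * Read nblocks blocks at the given block numbers, keeping up to
     * "window" WILLNEED hints issued ahead of the current read position.
     * window=1 advises each block only just before it's read, which is
     * effectively the n=1 no-prefetch baseline.
     */
    static void
    read_with_prefetch(int fd, const off_t *blocks, int nblocks, int window)
    {
        char    buf[BLOCKSZ];
        int     i,
                ahead = 0;

        for (i = 0; i < nblocks; i++)
        {
            /* top up the advice window before each real read */
            while (ahead < i + window && ahead < nblocks)
            {
                (void) posix_fadvise(fd, blocks[ahead] * BLOCKSZ, BLOCKSZ,
                                     POSIX_FADV_WILLNEED);
                ahead++;
            }
            (void) pread(fd, buf, BLOCKSZ, blocks[i] * BLOCKSZ);
        }
    }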

Using the maximum prefetch working set tested, 8192, here's the speedup multiplier on this benchmark for both sorted and unsorted requests against an 8GB file:

OS              Spindles        Unsorted X      Sorted X
1:Linux         1               2.3             2.1
2:Linux         1               1.5             1.0
3:Solaris       1               2.6             3.0
4:Linux         3               6.3             2.8
5:Linux (Stark) 3               5.3             3.6
6:Linux         10              5.4             4.9
7:Solaris*      48              16.9            9.2

Systems (1)-(3) are standard single-disk workstations with disks of various speeds and sizes. (4) is a 3-disk software RAID0 (on an Areca card in JBOD mode). (5) is the system Greg Stark originally reported his results on, also a 3-disk array of some sort. (6) uses a Sun 2640 disk array in a 10-disk RAID0+1 setup, while (7) is a Sun Fire X4500 with 48 disks in a giant RAID-Z array.

The Linux systems drop the OS cache after each run; they're all running kernel 2.6.18 or later, which has that feature. Solaris system (3) uses the UFS filesystem with the default tuning, which doesn't cache enough data for that to be necessary[1]; the results look very similar to the Linux case even without explicitly dropping the cache.
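
For anyone scripting this themselves, the cache drop between runs amounts to the usual "sync; echo 3 > /proc/sys/vm/drop_caches" idiom, run as root. A C equivalent, as a sketch:

    /* Sketch: drop the Linux page cache between runs (requires root).
     * Same effect as "sync; echo 3 > /proc/sys/vm/drop_caches". */
    #include <stdio.h>
    #include <unistd.h>

    static int
    drop_caches(void)
    {
        FILE   *f;

        sync();                     /* flush dirty pages first */
        if ((f = fopen("/proc/sys/vm/drop_caches", "w")) == NULL)
            return -1;
        fputs("3\n", f);            /* 3 = page cache + dentries and inodes */
        return fclose(f);
    }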

* For (7), the results showed obvious caching (>150MB/s), as I expected from ZFS, which caches aggressively by default. To get useful results given the server's 16GB of RAM, I increased the test file to 64GB, at which point the results looked reasonable.

Comparing with a prefetch working set of 256, which eyeballing my results spreadsheet suggested was the best return on prefetch effort before improvements leveled off, the speedups looked like this:

OS              Spindles        Unsorted X      Sorted X
1:Linux         1               2.3             2.0
2:Linux         1               1.5             0.9
3:Solaris       1               2.5             3.3
4:Linux         3               5.8             2.6
5:Linux (Stark) 3               5.6             3.7
6:Linux         10              5.7             5.1
7:Solaris       48              10.0            7.8

Observations:

-For the most part, the fadvise/aio technique was a significant win even on single-disk systems. The worst case, system (2) with sorted blocks, was essentially break-even within the measurement tolerance here: 94% of the no-prefetch rate, but all of these numbers bounced around by about +/-5%, so I wouldn't read too much into that. In every other case there was at least a 50% speed increase, even with a single disk.

-As Greg Stark suggested, the larger the spindle count, the larger the speedup, and the larger the prefetch size that might make sense. His suggestion to model the user GUC as "effective_spindle_count" looks like a good one; a hypothetical sketch of how that might map to a prefetch depth appears after these observations. The submitted sequential scan fadvise patch uses the earlier preread_pages name for that parameter, which I agree seems less friendly.

-The Solaris aio implementation seems to perform somewhat better relative to no prefetch than the Linux fadvise one (a sketch of the aio approach is also below). I'm left wondering whether that's simply a Solaris vs. Linux difference, some lucky caching on Solaris where the cache isn't completely cleared, or a sign that Linux's aio library might work better than its fadvise call does.
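
For reference, the aio flavor of the same trick looks roughly like this: start an asynchronous read of each upcoming block into a scratch buffer, so the data is already cached by the time the real read arrives. Again a sketch only, with invented names and a fixed slot count, not the code from the attached archive:

    /* Sketch: prefetch one block via POSIX aio (the Solaris path).
     * The prefetch results are simply discarded; the point is just to
     * populate the cache ahead of the real reads. */
    #include <sys/types.h>
    #include <aio.h>
    #include <string.h>

    #define BLOCKSZ 8192
    #define NSLOTS  64              /* max in-flight prefetches */

    static struct aiocb cb[NSLOTS];
    static char         scratch[NSLOTS][BLOCKSZ];

    /* Start an asynchronous read of block "blockno" using slot "slot" */
    static int
    prefetch_block(int fd, off_t blockno, int slot)
    {
        memset(&cb[slot], 0, sizeof(cb[slot]));
        cb[slot].aio_fildes = fd;
        cb[slot].aio_offset = blockno * BLOCKSZ;
        cb[slot].aio_buf    = scratch[slot];
        cb[slot].aio_nbytes = BLOCKSZ;
        return aio_read(&cb[slot]);
    }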
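
As for the effective_spindle_count idea mentioned above: if reads scatter uniformly across n drives, keeping all of them busy takes roughly n * H(n) requests in flight (H(n) being the nth harmonic number), since the last idle drive is the hardest one to hit with a random request. To be clear, this formula is my own hypothetical illustration, not something from the submitted patch:

    /* Hypothetical sketch: translate an effective_spindle_count GUC into
     * a prefetch depth target; not taken from the submitted patch. */
    static int
    prefetch_depth(int effective_spindle_count)
    {
        double  depth = 0.0;
        int     i;

        /* n * H(n) = n/1 + n/2 + ... + n/n */
        for (i = 1; i <= effective_spindle_count; i++)
            depth += (double) effective_spindle_count / i;

        return (int) (depth + 0.5); /* 1 -> 1, 3 -> 6 (5.5), 10 -> 29 */
    }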

The attached archive contains a couple of useful bits for anyone who wants to try this test on their own hardware. I think I've filed off all the rough edges here, and it should now be easy for someone else to run. It includes:

-prefetch.c is a slightly modified version of the original test program. I fixed a couple of minor bugs in the parameter input/output code that only showed up on some platform combinations; the actual prefetch implementation is untouched.

-prefetchtest is a shell script that compiles the program and runs it against a full range of prefetch sizes. Just run it and tell it where you want the test data file to go (with an optional size that defaults to 8GB), and it produces an output file named prefetch-results.csv with all the results in it.

-I included all of the raw data from the various systems I tested, so other testers have baselines to compare against. Also included is an OpenOffice spreadsheet that compares all the results and computes the ratios shown above.

Conclusion: on every system I tested, this approach gave excellent results, which makes me confident that I should see a corresponding speedup in database-level tests that use the same basic technique. I'm not sure whether it makes sense to bundle this test program up somehow so others can use it for similar compatibility tests (I'm thinking of something like contrib/test_fsync); I'll revisit that after the rest of the review.

Next step: I've got two data sets (one generated, one real-world sample) that should demonstrate a useful heap scan prefetch speedup, and a test program that I think will show whether the sequential scan prefetch code works right. Now that I've vetted all the hardware/OS combinations, I hope to squeeze that in this week; I don't need to test every one of them now that I know which systems are the interesting ones.

As far as other platforms go, I should get a Mac OS system to test on in the near future as well (once I have the database tests working; not worth scheduling yet), but since it will only have a single disk, that will basically be a compatibility test rather than a serious performance one. It would be nice to get a report from someone running FreeBSD to see what's needed to make the test script run on that OS.

[1] http://blogs.sun.com/jkshah/entry/postgresql_east_2008_talk_best : Page 8 of the presentation covers just how limited the default UFS cache tuning is.

--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD

Attachment: fadvise-prefetch.tar.gz
