Hi Tomas,

Thank you so much for your report. I have read it with great interest.
Your conclusion sounds reasonable to me. The patchset you call "NTT /
segments" performed as well as the "NTT / buffer" patchset. I had been
worried that calling mmap/munmap for each WAL segment file could have a lot
of overhead, but based on your performance tests, the overhead looks smaller
than I thought. In addition, the "NTT / segments" patchset is more compatible
with current PostgreSQL and friendlier to DBAs, because it uses WAL segment
files and does not introduce any new WAL-related file. I also think that
supporting both file I/O and mmap is better than supporting only mmap. I will
continue my work on the "NTT / segments" patchset to support both ways.

In the following, I answer the "Issues & Questions" you reported.

> While testing the "NTT / segments" patch, I repeatedly managed to crash the
> cluster with errors like this:
>
> 2021-02-28 00:07:21.221 PST client backend [3737139] WARNING: creating
> logfile segment just before mapping; path "pg_wal/00000001000000070000002F"
> 2021-02-28 00:07:21.670 PST client backend [3737142] WARNING: creating
> logfile segment just before mapping; path "pg_wal/000000010000000700000030"
> ...
> 2021-02-28 00:07:21.698 PST client backend [3737145] WARNING: creating
> logfile segment just before mapping; path "pg_wal/000000010000000700000030"
> 2021-02-28 00:07:21.698 PST client backend [3737130] PANIC: could not open
> file "pg_wal/000000010000000700000030": No such file or directory
>
> I do believe this is a thinko in the 0008 patch, which does XLogFileInit in
> XLogFileMap. Notice there are multiple "creating logfile" messages with the
> ..0030 segment, followed by the failure. AFAICS the XLogFileMap may be
> called from multiple backends, so they may call XLogFileInit concurrently,
> likely triggering some sort of race condition. It's a fairly rare issue,
> though - I've only seen it twice from ~20 runs.

Thank you for your report.
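To illustrate the race to myself, I wrote the toy sketch below. It is not
PostgreSQL code - the function names and sizes are made up for illustration -
but it shows why your log contains several "creating logfile segment"
warnings for the same segment: several "backends" each win an unserialized
create-and-install. Exclusive creation, the moral equivalent of honoring
use_existent, lets exactly one worker create the file while the rest reuse
it:

```python
# Toy model of the concurrent segment-creation race (illustration only,
# not PostgreSQL code). Several threads play "backends" that all want the
# same WAL segment file to exist.
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

SEGMENT_SIZE = 1024  # stand-in for a 16MB WAL segment


def create_segment_unconditionally(path, worker):
    """Like use_existent = false: always build a fresh file and install
    it, clobbering whatever another worker installed a moment ago."""
    tmp = "%s.tmp.%d" % (path, worker)
    with open(tmp, "wb") as f:
        f.write(b"\0" * SEGMENT_SIZE)  # zero-fill the new segment
    os.replace(tmp, path)              # unconditional install
    return "created"


def create_segment_exclusively(path, worker):
    """Like use_existent = true: create the file exclusively; if another
    worker already installed it, just reuse the existing file."""
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return "reused"
    with os.fdopen(fd, "wb") as f:
        f.write(b"\0" * SEGMENT_SIZE)
    return "created"


def race(create, workers=8):
    """Let `workers` threads race to provide the same segment file."""
    with tempfile.TemporaryDirectory() as d:
        seg = os.path.join(d, "000000010000000700000030")
        with ThreadPoolExecutor(max_workers=workers) as ex:
            return list(ex.map(lambda i: create(seg, i), range(workers)))


if __name__ == "__main__":
    # Every worker "wins" and re-creates the segment...
    print("unconditional:", race(create_segment_unconditionally))
    # ...whereas with exclusive creation exactly one worker creates it.
    print("exclusive:    ", race(create_segment_exclusively))
```

Of course, the real fix belongs at the XLogFileInit /
InstallXlogFileSegment level; the sketch only shows why the exclusive
path serializes creation.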
I found that the issue is actually in patch 0009, and that it can also cause
WAL loss. I should have set use_existent to true; otherwise,
InstallXlogFileSegment and BasicOpenFile in XLogFileInit can race. I had not
understood that use_existent can be true even when creating a brand-new file
with XLogFileInit. I will fix the issue.

> The other question I have is about WALInsertLockUpdateInsertingAt. 0003
> removes this function, but leaves behind some of the other bits working
> with insert locks and insertingAt. But it does not explain how it works
> without WaitXLogInsertionsToFinish() - how does it ensure that when we
> commit something, all the preceding WAL is "complete" (i.e. written by
> other backends etc.)?

The idea was to wait for *all* the WALInsertLocks to be released, no matter
whether each of them precedes or follows the current insertion. That would
have worked functionally, but on reflection it is not good for performance,
because XLogFileMap in GetXLogBuffer (where WaitXLogInsertionsToFinish was
removed) can block: it can eventually call write() in XLogFileInit. I will
restore the WALInsertLockUpdateInsertingAt function and the related code for
mmap.

Best regards,
Takashi

On Tue, Mar 2, 2021 at 5:40 AM Tomas Vondra
<tomas.von...@enterprisedb.com> wrote:
>
> Hi,
>
> I've performed some additional benchmarking and testing on the patches
> sent on 26/1 [1], and I'd like to share some interesting results.
>
> I did the tests on two different machines, with slightly different
> configurations. Both machines use the same CPU generation with slightly
> different frequency, a different OS (Ubuntu vs. RH), kernel (5.3 vs.
> 4.18) and so on. A more detailed description is in the attached PDF,
> along with the PostgreSQL configuration.
>
> The benchmark is fairly simple - pgbench with scale 500 (fits into
> shared buffers) and 5000 (fits into RAM).
> The runs were just 1 minute each, which is fairly short - it's however
> intentional, because I've done this with both full_page_writes=on/off to
> test how this behaves with many and no FPIs. This models extreme
> behaviors at the beginning and at the end of a checkpoint.
>
> This thread is rather confusing because there are far too many patches
> with overlapping version numbers - even [1] contains two very different
> patches. I'll refer to them as "NTT / buffer" (for the patch using one
> large PMEM buffer) and "NTT / segments" for the patch using regular WAL
> segments.
>
> The attached PDF shows all these results along with charts. The two
> systems have a bit different performance (throughput), but the
> conclusions seem to be mostly the same, so I'll just talk about results
> from one of the systems here (aka "System A").
>
> Note: Those systems are hosted / provided by Intel SDP, and Intel is
> interested in providing access to other devs interested in PMEM.
>
> Furthermore, these patches seem to be very insensitive to WAL segment
> size (unlike the experimental patches I shared some time ago), so I'll
> only show results for one WAL segment size. (Obviously, the NTT / buffer
> patch can't be sensitive to this by definition, as it's not using WAL
> segments at all.)
>
>
> Results
> -------
>
> For scale 500, the results (with full_page_writes=on) look like this:
>
>                       1       8      16      32      48      64
>  ------------------------------------------------------------------
>  master            9411   58833  111453  181681  215552  234099
>  NTT / buffer     10837   77260  145251  222586  255651  264207
>  NTT / segments   11011   76892  145049  223078  255022  269737
>
> So there is a fairly nice speedup - about 30%, which is consistent with
> the results shared before. Moreover, the "NTT / segments" patch performs
> about the same as the "NTT / buffer" patch, which is encouraging.
>
> For scale 5000, the results look like this:
>
>                       1       8      16      32      48      64
>  ------------------------------------------------------------------
>  master            7388   42020   64523   91877  102805  111389
>  NTT / buffer      8650   58018   96314  132440  139512  134228
>  NTT / segments    8614   57286   97173  138435  157595  157138
>
> That's intriguing - the speedup is even higher, almost 40-60% with
> enough clients (16-64). For me this is a bit surprising, because in this
> case the data don't fit into shared_buffers, so extra time needs to be
> spent copying data between RAM and shared_buffers and perhaps even doing
> some writes. So my expectation was that this increases the amount of
> time spent outside XLOG code, thus diminishing the speedup.
>
> Now, let's look at results with full_page_writes=off. For scale 500 the
> results are:
>
>                       1       8      16      32      48      64
>  ------------------------------------------------------------------
>  master           10476   67191  122191  198620  234381  251452
>  NTT / buffer     11119   79530  148580  229523  262142  275281
>  NTT / segments   11528   79004  148978  229714  259798  274753
>
> and on scale 5000:
>
>                       1       8      16      32      48      64
>  ------------------------------------------------------------------
>  master            8192   55870   98451  145097  172377  172907
>  NTT / buffer      9063   62659  110868  161352  173977  164359
>  NTT / segments    9277   63226  112307  166070  171997  158085
>
> That is, the speedup with scale 500 drops to ~10%, and for scale 5000
> it disappears almost entirely.
>
> I'd have expected that without FPIs the patches would actually be more
> effective - so this seems interesting. The conclusion however seems to
> be that the lower the amount of FPIs in the WAL stream, the smaller the
> speedup. Or put differently - it's most effective right after a
> checkpoint, and it decreases during the checkpoint. So in a well-tuned
> system with significant distance between checkpoints, the speedup seems
> to be fairly limited.
>
> This is also consistent with the fact that for scale 5000 (with FPW=on)
> the speedups are much more significant, simply because there are far
> more pages (and thus FPIs). Also, after disabling FPWs the speedup
> almost entirely disappears.
>
> On the second system, the differences are even more significant (see the
> PDF). I suspect this is due to a slightly different hardware config with
> a slower CPU / different PMEM capacity, etc. The overall behavior and
> conclusions are however the same, I think.
>
> Of course, another question is how this will be affected by newer PMEM
> versions with higher performance (e.g. the new generation of Intel PMEM
> should be ~20% faster, from what I hear).
>
>
> Issues & Questions
> ------------------
>
> While testing the "NTT / segments" patch, I repeatedly managed to crash
> the cluster with errors like this:
>
> 2021-02-28 00:07:21.221 PST client backend [3737139] WARNING: creating
> logfile segment just before mapping; path "pg_wal/00000001000000070000002F"
> 2021-02-28 00:07:21.670 PST client backend [3737142] WARNING: creating
> logfile segment just before mapping; path "pg_wal/000000010000000700000030"
> ...
> 2021-02-28 00:07:21.698 PST client backend [3737145] WARNING: creating
> logfile segment just before mapping; path "pg_wal/000000010000000700000030"
> 2021-02-28 00:07:21.698 PST client backend [3737130] PANIC: could not
> open file "pg_wal/000000010000000700000030": No such file or directory
>
> I do believe this is a thinko in the 0008 patch, which does XLogFileInit
> in XLogFileMap. Notice there are multiple "creating logfile" messages
> with the ..0030 segment, followed by the failure. AFAICS the XLogFileMap
> may be called from multiple backends, so they may call XLogFileInit
> concurrently, likely triggering some sort of race condition. It's a
> fairly rare issue, though - I've only seen it twice from ~20 runs.
>
>
> The other question I have is about WALInsertLockUpdateInsertingAt.
> 0003 removes this function, but leaves behind some of the other bits
> working with insert locks and insertingAt. But it does not explain how
> it works without WaitXLogInsertionsToFinish() - how does it ensure that
> when we commit something, all the preceding WAL is "complete" (i.e.
> written by other backends etc.)?
>
>
> Conclusion
> ----------
>
> I do think the "NTT / segments" patch is the most promising way forward.
> It performs about as well as the "NTT / buffer" patch (and both perform
> much better than the experimental patches I shared in January).
>
> The "NTT / buffer" patch seems much more disruptive - it introduces one
> large buffer for WAL, which makes various other tasks more complicated
> (i.e. it needs additional complexity to handle WAL archival, etc.). Are
> there some advantages of this patch (compared to the other patch)?
>
> As for the "NTT / segments" patch, I wonder if we can just rework the
> code like this (to use mmap etc.) or whether we need to support both
> ways (file I/O and mmap). I don't have much experience with many other
> platforms, but it seems quite possible that mmap won't work all that
> well on some of them. So my assumption is we'll need to support both
> file I/O and mmap to make any of this committable, but I may be wrong.
>
>
> [1]
> https://www.postgresql.org/message-id/CAOwnP3Oz4CnKp0-_KU-x5irr9pBqPNkk7pjwZE5Pgo8i1CbFGg%40mail.gmail.com
>
> --
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company

-- 
Takashi Menjo <takashi.me...@gmail.com>