Hi Tomas,

Thank you so much for your report. I have read it with great interest.
Your conclusion sounds reasonable to me. The patchset you call "NTT /
segments" performed as well as the "NTT / buffer" patchset. I had been
worried that calling mmap/munmap for each WAL segment file could have a lot
of overhead, but based on your performance tests, the overhead looks smaller
than I thought. In addition, the "NTT / segments" patchset is more compatible
with current PostgreSQL and friendlier to DBAs, because it uses WAL segment
files and does not introduce any new WAL-related file. I also think that
supporting both file I/O and mmap is better than supporting only mmap. I will
continue my work on the "NTT / segments" patchset to support both ways.

In the following, I answer the "Issues & Questions" you reported.

> While testing the "NTT / segments" patch, I repeatedly managed to crash the
> cluster with errors like this:
>
> 2021-02-28 00:07:21.221 PST client backend [3737139] WARNING: creating
> logfile segment just before mapping; path "pg_wal/00000001000000070000002F"
> 2021-02-28 00:07:21.670 PST client backend [3737142] WARNING: creating
> logfile segment just before mapping; path "pg_wal/000000010000000700000030"
> ...
> 2021-02-28 00:07:21.698 PST client backend [3737145] WARNING: creating
> logfile segment just before mapping; path "pg_wal/000000010000000700000030"
> 2021-02-28 00:07:21.698 PST client backend [3737130] PANIC: could not open
> file "pg_wal/000000010000000700000030": No such file or directory
>
> I do believe this is a thinko in the 0008 patch, which does XLogFileInit in
> XLogFileMap. Notice there are multiple "creating logfile" messages with the
> ..0030 segment, followed by the failure. AFAICS the XLogFileMap may be
> called from multiple backends, so they may call XLogFileInit concurrently,
> likely triggering some sort of race condition. It's a fairly rare issue,
> though - I've only seen it twice from ~20 runs.

Thank you for your report.
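To illustrate the race to myself, I wrote the toy sketch below. It is not
PostgreSQL code - the function names and sizes are made up for illustration -
but it shows why your log contains several "creating logfile segment"
warnings for the same segment: several "backends" each win an unserialized
create-and-install. Exclusive creation, the moral equivalent of honoring
use_existent, lets exactly one worker create the file while the rest reuse
it:

```python
# Toy model of the concurrent segment-creation race (illustration only,
# not PostgreSQL code). Several threads play "backends" that all want the
# same WAL segment file to exist.
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

SEGMENT_SIZE = 1024  # stand-in for a 16MB WAL segment


def create_segment_unconditionally(path, worker):
    """Like use_existent = false: always build a fresh file and install
    it, clobbering whatever another worker installed a moment ago."""
    tmp = "%s.tmp.%d" % (path, worker)
    with open(tmp, "wb") as f:
        f.write(b"\0" * SEGMENT_SIZE)  # zero-fill the new segment
    os.replace(tmp, path)              # unconditional install
    return "created"


def create_segment_exclusively(path, worker):
    """Like use_existent = true: create the file exclusively; if another
    worker already installed it, just reuse the existing file."""
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return "reused"
    with os.fdopen(fd, "wb") as f:
        f.write(b"\0" * SEGMENT_SIZE)
    return "created"


def race(create, workers=8):
    """Let `workers` threads race to provide the same segment file."""
    with tempfile.TemporaryDirectory() as d:
        seg = os.path.join(d, "000000010000000700000030")
        with ThreadPoolExecutor(max_workers=workers) as ex:
            return list(ex.map(lambda i: create(seg, i), range(workers)))


if __name__ == "__main__":
    # Every worker "wins" and re-creates the segment...
    print("unconditional:", race(create_segment_unconditionally))
    # ...whereas with exclusive creation exactly one worker creates it.
    print("exclusive:    ", race(create_segment_exclusively))
```

Of course, the real fix belongs at the XLogFileInit /
InstallXlogFileSegment level; the sketch only shows why the exclusive
path serializes creation.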
I found that the issue is actually in patch 0009, and that it can also cause
WAL loss. I should have set use_existent to true; otherwise,
InstallXlogFileSegment and BasicOpenFile in XLogFileInit can race. I had not
understood that use_existent can be true even when creating a brand-new file
with XLogFileInit. I will fix the issue.

> The other question I have is about WALInsertLockUpdateInsertingAt. 0003
> removes this function, but leaves behind some of the other bits working
> with insert locks and insertingAt. But it does not explain how it works
> without WaitXLogInsertionsToFinish() - how does it ensure that when we
> commit something, all the preceding WAL is "complete" (i.e. written by
> other backends etc.)?

The idea was to wait for *all* the WALInsertLocks to be released, no matter
whether each of them precedes or follows the current insertion. That would
have worked functionally, but on reflection it is not good for performance,
because XLogFileMap in GetXLogBuffer (where WaitXLogInsertionsToFinish was
removed) can block: it can eventually call write() in XLogFileInit. I will
restore the WALInsertLockUpdateInsertingAt function and the related code for
mmap.

Best regards,
Takashi

On Tue, Mar 2, 2021 at 5:40 AM Tomas Vondra
<tomas.von...@enterprisedb.com> wrote:
>
> Hi,
>
> I've performed some additional benchmarking and testing on the patches
> sent on 26/1 [1], and I'd like to share some interesting results.
>
> I did the tests on two different machines, with slightly different
> configurations. Both machines use the same CPU generation with slightly
> different frequency, a different OS (Ubuntu vs. RH), kernel (5.3 vs.
> 4.18) and so on. A more detailed description is in the attached PDF,
> along with the PostgreSQL configuration.
>
> The benchmark is fairly simple - pgbench with scale 500 (fits into
> shared buffers) and 5000 (fits into RAM).
> The runs were just 1 minute each, which is fairly short - it's however
> intentional, because I've done this with both full_page_writes=on/off to
> test how this behaves with many and no FPIs. This models extreme
> behaviors at the beginning and at the end of a checkpoint.
>
> This thread is rather confusing because there are far too many patches
> with overlapping version numbers - even [1] contains two very different
> patches. I'll refer to them as "NTT / buffer" (for the patch using one
> large PMEM buffer) and "NTT / segments" for the patch using regular WAL
> segments.
>
> The attached PDF shows all these results along with charts. The two
> systems have a bit different performance (throughput), but the
> conclusions seem to be mostly the same, so I'll just talk about results
> from one of the systems here (aka "System A").
>
> Note: Those systems are hosted / provided by Intel SDP, and Intel is
> interested in providing access to other devs interested in PMEM.
>
> Furthermore, these patches seem to be very insensitive to WAL segment
> size (unlike the experimental patches I shared some time ago), so I'll
> only show results for one WAL segment size. (Obviously, the NTT / buffer
> patch can't be sensitive to this by definition, as it's not using WAL
> segments at all.)
>
>
> Results
> -------
>
> For scale 500, the results (with full_page_writes=on) look like this:
>
>                       1       8      16      32      48      64
>  ------------------------------------------------------------------
>  master            9411   58833  111453  181681  215552  234099
>  NTT / buffer     10837   77260  145251  222586  255651  264207
>  NTT / segments   11011   76892  145049  223078  255022  269737
>
> So there is a fairly nice speedup - about 30%, which is consistent with
> the results shared before. Moreover, the "NTT / segments" patch performs
> about the same as the "NTT / buffer" patch, which is encouraging.
>
> For scale 5000, the results look like this:
>
>                       1       8      16      32      48      64
>  ------------------------------------------------------------------
>  master            7388   42020   64523   91877  102805  111389
>  NTT / buffer      8650   58018   96314  132440  139512  134228
>  NTT / segments    8614   57286   97173  138435  157595  157138
>
> That's intriguing - the speedup is even higher, almost 40-60% with
> enough clients (16-64). For me this is a bit surprising, because in this
> case the data don't fit into shared_buffers, so extra time needs to be
> spent copying data between RAM and shared_buffers and perhaps even doing
> some writes. So my expectation was that this increases the amount of
> time spent outside XLOG code, thus diminishing the speedup.
>
> Now, let's look at results with full_page_writes=off. For scale 500 the
> results are:
>
>                       1       8      16      32      48      64
>  ------------------------------------------------------------------
>  master           10476   67191  122191  198620  234381  251452
>  NTT / buffer     11119   79530  148580  229523  262142  275281
>  NTT / segments   11528   79004  148978  229714  259798  274753
>
> and on scale 5000:
>
>                       1       8      16      32      48      64
>  ------------------------------------------------------------------
>  master            8192   55870   98451  145097  172377  172907
>  NTT / buffer      9063   62659  110868  161352  173977  164359
>  NTT / segments    9277   63226  112307  166070  171997  158085
>
> That is, the speedup with scale 500 drops to ~10%, and for scale 5000
> it disappears almost entirely.
>
> I'd have expected that without FPIs the patches would actually be more
> effective - so this seems interesting. The conclusion however seems to
> be that the lower the amount of FPIs in the WAL stream, the smaller the
> speedup. Or put differently - it's most effective right after a
> checkpoint, and it decreases during the checkpoint. So in a well-tuned
> system with significant distance between checkpoints, the speedup seems
> to be fairly limited.
>
> This is also consistent with the fact that for scale 5000 (with FPW=on)
> the speedups are much more significant, simply because there are far
> more pages (and thus FPIs). Also, after disabling FPWs the speedup
> almost entirely disappears.
>
> On the second system, the differences are even more significant (see the
> PDF). I suspect this is due to a slightly different hardware config with
> a slower CPU / different PMEM capacity, etc. The overall behavior and
> conclusions are however the same, I think.
>
> Of course, another question is how this will be affected by newer PMEM
> versions with higher performance (e.g. the new generation of Intel PMEM
> should be ~20% faster, from what I hear).
>
>
> Issues & Questions
> ------------------
>
> While testing the "NTT / segments" patch, I repeatedly managed to crash
> the cluster with errors like this:
>
> 2021-02-28 00:07:21.221 PST client backend [3737139] WARNING: creating
> logfile segment just before mapping; path "pg_wal/00000001000000070000002F"
> 2021-02-28 00:07:21.670 PST client backend [3737142] WARNING: creating
> logfile segment just before mapping; path "pg_wal/000000010000000700000030"
> ...
> 2021-02-28 00:07:21.698 PST client backend [3737145] WARNING: creating
> logfile segment just before mapping; path "pg_wal/000000010000000700000030"
> 2021-02-28 00:07:21.698 PST client backend [3737130] PANIC: could not
> open file "pg_wal/000000010000000700000030": No such file or directory
>
> I do believe this is a thinko in the 0008 patch, which does XLogFileInit
> in XLogFileMap. Notice there are multiple "creating logfile" messages
> with the ..0030 segment, followed by the failure. AFAICS the XLogFileMap
> may be called from multiple backends, so they may call XLogFileInit
> concurrently, likely triggering some sort of race condition. It's a
> fairly rare issue, though - I've only seen it twice from ~20 runs.
>
>
> The other question I have is about WALInsertLockUpdateInsertingAt.
> 0003 removes this function, but leaves behind some of the other bits
> working with insert locks and insertingAt. But it does not explain how
> it works without WaitXLogInsertionsToFinish() - how does it ensure that
> when we commit something, all the preceding WAL is "complete" (i.e.
> written by other backends etc.)?
>
>
> Conclusion
> ----------
>
> I do think the "NTT / segments" patch is the most promising way forward.
> It performs about as well as the "NTT / buffer" patch (and both perform
> much better than the experimental patches I shared in January).
>
> The "NTT / buffer" patch seems much more disruptive - it introduces one
> large buffer for WAL, which makes various other tasks more complicated
> (i.e. it needs additional complexity to handle WAL archival, etc.). Are
> there some advantages of this patch (compared to the other patch)?
>
> As for the "NTT / segments" patch, I wonder if we can just rework the
> code like this (to use mmap etc.) or whether we need to support both
> ways (file I/O and mmap). I don't have much experience with many other
> platforms, but it seems quite possible that mmap won't work all that
> well on some of them. So my assumption is we'll need to support both
> file I/O and mmap to make any of this committable, but I may be wrong.
>
>
> [1]
> https://www.postgresql.org/message-id/CAOwnP3Oz4CnKp0-_KU-x5irr9pBqPNkk7pjwZE5Pgo8i1CbFGg%40mail.gmail.com
>
> --
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company

-- 
Takashi Menjo <takashi.me...@gmail.com>