Hi Tomas,

> Hello Takashi-san,
>
> On 3/5/21 9:08 AM, Takashi Menjo wrote:
> > Hi Tomas,
> >
> > Thank you so much for your report. I have read it with great interest.
> >
> > Your conclusion sounds reasonable to me. The patchset you call "NTT /
> > segments" performed as well as the "NTT / buffer" patchset. I had
> > been worried that calling mmap/munmap for each WAL segment file could
> > have a lot of overhead. Based on your performance tests, however, the
> > overhead looks smaller than I thought. In addition, the "NTT / segments"
> > patchset is more compatible with the current PG and friendlier to
> > DBAs, because it uses the existing WAL segment files and does not
> > introduce any new WAL-related file.
> >
>
> I agree. I was actually a bit surprised it performs this well, mostly in
> line with the "NTT / buffer" patchset. I've seen significant issues with
> our simple experimental patches, which however went away with larger WAL
> segments. But the "NTT / segments" patch does not have that issue, so
> either our patches were doing something wrong, or perhaps there was some
> other issue (not sure why larger WAL segments would improve that).
>
> Do these results match your benchmarks? Or are you seeing significantly
> different behavior?

I ran a performance test for "NTT / segments" under the same
conditions as my previous report [1] and added its results there. The
updated graph is attached to this mail. Note that some legends have
been renamed: "Mapped WAL file" to "NTT / simple", and "Non-volatile
WAL buffer" to "NTT / buffer."

The graph tells me that "NTT / segments" performs as well as "NTT /
buffer." This matches the results you reported.

> Do you have any thoughts regarding the impact of full-page writes? So
> far all the benchmarks we did focused on small OLTP transactions on data
> sets that fit into RAM. The assumption was that that's the workload that
> would benefit from this, but maybe that's missing something important
> about workloads producing much larger WAL records? Say, workloads
> working with large BLOBs, bulk loads etc.

I'd say that more work is needed for workloads producing a large
amount of WAL (in the number of records, the size per record, or
both). Based on the case Gang reported, which I have tried to
reproduce in this thread [2][3], the current insertion and flushing
method can be unsuitable for such workloads. That case was for "NTT /
buffer," but I think it also applies to "NTT / segments."

> The other question is whether simply placing WAL on DAX (without any
> code changes) is safe. If it's not, then all the "speedups" are computed
> with respect to an unsafe configuration and so are useless. And BTT should
> be used instead, which would of course produce very different results.

I think it's safe, thanks to the checksum in the header of each WAL
record (xl_crc in struct XLogRecord). In DAX mode, user data (here,
WAL records) is written to the PMEM device in a smaller unit (probably
a byte or a cache line) than the traditional 512-byte disk sector. So
a torn write, in which some bytes in a sector persist while others do
not, can occur on a crash. AFAICS, however, the checksum for WAL
records also covers such torn-write cases.
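
To illustrate, here is a minimal standalone sketch (toy code, not
PostgreSQL code: WalRecord, toy_crc32, and record_is_valid are
hypothetical stand-ins for XLogRecord, PG's CRC-32C, and the record
validation done while reading WAL). Recovery recomputes the checksum
over the record body and compares it with the stored value, so a
record that only partially reached persistence fails the check and
recovery stops at the last good record, whatever the device's write
granularity is:

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define PAYLOAD_LEN 32

/* Toy stand-in for XLogRecord: a length, a checksum, and the body. */
typedef struct WalRecord
{
    uint32_t tot_len;   /* total record length, header included */
    uint32_t crc;       /* checksum over the payload (like xl_crc) */
    char     payload[PAYLOAD_LEN];
} WalRecord;

/* Toy CRC-32 (reflected, poly 0xEDB88320); PostgreSQL uses CRC-32C. */
static uint32_t
toy_crc32(const void *buf, size_t len)
{
    const unsigned char *p = buf;
    uint32_t crc = ~0u;

    while (len--)
    {
        crc ^= *p++;
        for (int i = 0; i < 8; i++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1)));
    }
    return ~crc;
}

/* What recovery does conceptually: recompute the checksum over the
 * record body and compare it with the value stored in the header. */
static int
record_is_valid(const WalRecord *rec)
{
    size_t body_len = rec->tot_len - offsetof(WalRecord, payload);

    return toy_crc32(rec->payload, body_len) == rec->crc;
}

int
main(void)
{
    WalRecord rec;

    rec.tot_len = sizeof(rec);
    memcpy(rec.payload, "COMMIT of transaction 1234 .....", PAYLOAD_LEN);
    rec.crc = toy_crc32(rec.payload, PAYLOAD_LEN);
    printf("intact record valid: %s\n", record_is_valid(&rec) ? "yes" : "no");

    /* Simulate a torn write: pretend the crash happened after only the
     * first 16 bytes of the payload became persistent, so the rest of
     * the record still holds stale bytes. */
    memset(rec.payload + 16, 0xAA, PAYLOAD_LEN - 16);
    printf("torn record valid:   %s\n", record_is_valid(&rec) ? "yes" : "no");
    return 0;
}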

> > I also think that supporting both file I/O and mmap is better than
> > supporting only mmap. I will continue my work on the "NTT / segments"
> > patchset to support both ways.
> >
>
> +1
>
> > In the following, I will answer "Issues & Questions" you reported.
> >
> >
> >> While testing the "NTT / segments" patch, I repeatedly managed to crash 
> >> the cluster with errors like this:
> >>
> >> 2021-02-28 00:07:21.221 PST client backend [3737139] WARNING:  creating logfile segment just before mapping; path "pg_wal/00000001000000070000002F"
> >> 2021-02-28 00:07:21.670 PST client backend [3737142] WARNING:  creating logfile segment just before mapping; path "pg_wal/000000010000000700000030"
> >> ...
> >> 2021-02-28 00:07:21.698 PST client backend [3737145] WARNING:  creating logfile segment just before mapping; path "pg_wal/000000010000000700000030"
> >> 2021-02-28 00:07:21.698 PST client backend [3737130] PANIC:  could not open file "pg_wal/000000010000000700000030": No such file or directory
> >>
> >> I do believe this is a thinko in the 0008 patch, which does XLogFileInit
> >> in XLogFileMap. Notice there are multiple "creating logfile" messages
> >> with the ..0030 segment, followed by the failure. AFAICS XLogFileMap may
> >> be called from multiple backends, so they may call XLogFileInit
> >> concurrently, likely triggering some sort of race condition. It's a
> >> fairly rare issue, though - I've only seen it twice in ~20 runs.
> >
> > Thank you for your report. I found that it is rather patch 0009 that
> > has an issue, one that will also cause WAL loss. I should have set
> > use_existent to true; otherwise InstallXlogFileSegment and BasicOpenFile
> > in XLogFileInit can be racy. I had misunderstood how use_existent should
> > be set because I am creating a brand-new file with XLogFileInit.
> >
> > I will fix the issue.
> >
>
> OK, thanks for looking into this.
>
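For illustration, here is a toy single-process POSIX sketch of the
kind of race involved (hypothetical code, not the patch's actual
code; it compresses the two backends' interleaving into one process
by holding two file descriptors, and error checking is omitted).
Backend A installs the segment and keeps writing through its
descriptor; backend B, unaware of A, installs a fresh file over the
same name, so A's WAL goes to an unlinked inode and is lost:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define SEG "segment_0030"    /* stand-in for a WAL segment name */

int
main(void)
{
    /* Backend A: create a temp file and install it as the segment,
     * keeping the descriptor open for writing WAL. */
    int a = open("xlogtemp.A", O_CREAT | O_RDWR | O_TRUNC, 0600);
    rename("xlogtemp.A", SEG);

    /* Backend B: not told to reuse an existing file (in effect,
     * use_existent = false), it installs its own fresh copy over the
     * very same name, unlinking the inode A holds open. */
    int b = open("xlogtemp.B", O_CREAT | O_RDWR | O_TRUNC, 0600);
    rename("xlogtemp.B", SEG);

    /* Backend A now writes WAL through its descriptor -- into the
     * already-replaced inode, invisible under the segment's name. */
    (void) write(a, "precious WAL record", 19);

    /* What a crash-recovery read of the segment would find: nothing. */
    char buf[32];
    int r = open(SEG, O_RDONLY);
    ssize_t n = read(r, buf, sizeof(buf));
    printf("bytes visible in %s: %zd\n", SEG, n);   /* prints 0 */

    close(a);
    close(b);
    close(r);
    unlink(SEG);
    return 0;
}

IIUC, with use_existent set to true, XLogFileInit keeps and opens a
segment that is already in place instead of replacing it, which
avoids this.
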
> >
> >> The other question I have is about WALInsertLockUpdateInsertingAt. 0003 
> >> removes this function, but leaves
> >> behind some of the other bits working with insert locks and insertingAt. 
> >> But it does not explain how it works without
> >> WaitXLogInsertionsToFinish() - how does it ensure that when we commit 
> >> something, all the preceding WAL is
> >> "complete" (i.e. written by other backends etc.)?
> >
> > It waits for *all* the WALInsertLocks to be released, no matter whether
> > each of them precedes or follows the current insertion.
> >
> > That would have worked functionally, but on reflection it is not good
> > for performance, because XLogFileMap in GetXLogBuffer (where
> > WaitXLogInsertionsToFinish was removed) can block, as it may eventually
> > call write() in XLogFileInit.
> >
> > I will restore the WALInsertLockUpdateInsertingAt function and related
> > code for mmap.
> >
>
> OK. I'm still not entirely sure whether the current version is
> correct, but I'll wait for the reworked version.
>
>
> kind regards

Best regards,
Takashi

[1] https://www.postgresql.org/message-id/caownp3ofofosftmeikqcbmp0ywdjn0kvb4ka_0tj+urq7dt...@mail.gmail.com
[2] https://www.postgresql.org/message-id/byapr11mb344801ff81e9c92a081d3e10e6...@byapr11mb3448.namprd11.prod.outlook.com
[3] https://www.postgresql.org/message-id/CAOwnP3NHAbVFOfAawZPs5ezn57_7fcX%3DKaaQ5YMgirc9rNrijQ%40mail.gmail.com

-- 
Takashi Menjo <takashi.me...@gmail.com>
