Hello.

At Sun, 25 Aug 2019 22:08:43 -0700, Noah Misch <n...@leadboat.com> wrote in <20190826050843.gb3153...@rfd.leadboat.com>
> On Thu, Aug 22, 2019 at 09:06:06PM +0900, Kyotaro Horiguchi wrote:
> > At Mon, 19 Aug 2019 23:03:14 -0700, Noah Misch <n...@leadboat.com> wrote in <20190820060314.ga3086...@rfd.leadboat.com>
> > > On Mon, Aug 19, 2019 at 06:59:59PM +0900, Kyotaro Horiguchi wrote:
> > > > I'm not sure of the point of the behavior. I suppose that the "log"
> > > > is a sequence of new-page records. It also needs to be synced, and
> > > > it is always larger than the file to be synced. I can't think of
> > > > an appropriate threshold without knowing the point.
> > >
> > > Yes, it would be a sequence of new-page records. FlushRelationBuffers()
> > > locks every buffer header containing a buffer of the current database.
> > > The belief has been that writing one page to xlog is cheaper than
> > > FlushRelationBuffers() in a busy system with large shared_buffers.
> >
> > I'm at a loss. The decision between WAL and sync is made at
> > commit time, when we no longer have a pin on a buffer. When
> > emitting WAL, contrary to the assumption, the lock needs to be
> > re-acquired for every page to emit log_newpage. What is worse,
> > we may need to reload evicted buffers. If the file has been
> > CopyFrom'ed, the ring-buffer strategy makes the situation even
> > worse. That doesn't seem cheap at all.
>
> Consider a one-page relfilenode. Doing all the things you list for a single
> page may be cheaper than locking millions of buffer headers.
If I understand you correctly, I would say that *all* buffers that
don't belong to in-transaction-created files are skipped before any
lock is taken, so no lock conflict happens with other backends.
FlushRelationBuffers uses double-checked locking, roughly as follows:

    FlushRelationBuffers_common():
      ...
      if (!islocal)
      {
          for (i = 0; i < NBuffers; i++)
          {
              /* unlocked pre-check; most buffers are skipped here */
              if (RelFileNodeEquals(bufHdr->tag.rnode, rnode))
              {
                  LockBufHdr(bufHdr);
                  /* re-check under the buffer-header lock */
                  if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
                      valid && dirty)
                  {
                      PinBuffer_Locked(bufHdr);
                      LWLockAcquire(...);
                      FlushBuffer(...);
                      ...

128GB of shared buffers contains 16M buffers (128GB / 8kB). On my
perhaps-Windows-Vista-era box, such a loop takes 15 ms. (Since the box
has only 6GB of memory, the test ignores cache effects caused by the
difference in buffer-pool size.) (attached 1)

With WAL emitting, we find every buffer of the file via the buffer
hash, so we pay for buffer-mapping partition locks instead of the
15 ms of purely local latency. That seems worse.

> > If there were any chance for WAL winning on smaller files here, it
> > would be on files smaller than the ring size of the bulk-write
> > strategy (16MB).
>
> Like you, I expect the optimal threshold is less than 16MB, though you should
> benchmark to see. Under the ideal threshold, when a transaction creates a new
> relfilenode just smaller than the threshold, that transaction will be somewhat
> slower than it would be if the threshold were zero. Locking every buffer

I looked closer at this. For a 16MB file, the write-fsync cost is
almost the same as the WAL-emitting cost: about 200 ms on the
Vista-era machine, which has non-performant rotating magnetic disks
and xfs. (attached 2, 3) Although write-fsyncing of a relation file
causes no lock conflict with other backends, WAL emitting delays
other backends' commits by up to that many milliseconds.

In summary, the characteristics of the two methods for a 16MB file
are as follows.
File write:
  - 15 ms of buffer scan without locks (at 128GB shared_buffers)
  + no hash search for a buffer
  = takes locks on the buffers of the file, one by one (to write)
  + plus 200 ms of write-fdatasync (of the whole relation file),
    which doesn't conflict with other backends (except via CPU time
    slots and I/O bandwidth).

WAL write:
  + no buffer scan
  - 2048 (16MB/8kB) partition locks to find every buffer of the
    target file, which can conflict with other backends.
  = takes locks on the buffers of the file, one by one (to take FPWs)
  - plus 200 ms of open(create)-write-fdatasync (of a WAL file of the
    default size), which can delay commits on other backends by up to
    that duration.

> header causes a distributed slow-down for other queries, and protecting the
> latency of non-DDL queries is typically more useful than accelerating
> TRUNCATE, CREATE TABLE, etc. Writing more WAL also slows down other queries;
> beyond a certain relfilenode size, the extra WAL harms non-DDL queries more
> than the buffer scan harms them. That's about where the threshold should be.

If the discussion above is correct, we shouldn't use WAL-write even
for files around 16MB.

For smaller shared_buffers and file sizes, the delays are:

Scanning all buffers takes:
   15 ms  for 128GB shared_buffers
  4.5 ms  for  32GB shared_buffers

fdatasync takes:
  200 ms for 16MB/sync
   51 ms for 1MB/sync
   46 ms for 512kB/sync
   40 ms for 256kB/sync
   37 ms for 128kB/sync
   35 ms for <64kB/sync

These numbers seem reasonable for 5400rpm disks. The threshold seems
to be 64kB on my configuration. It can differ by configuration, but I
think not by much. (I'm not sure about SSDs or in-memory filesystems.)

So for files smaller than 64kB:

File write:
  -- 15 ms of buffer scan without locks
  +  no hash search for a buffer
  =  plus 35 ms of write-fdatasync

WAL write:
  ++ no buffer scan
  -  one partition lock per buffer of the target file, which can
     conflict with other backends (but is ignorable).
  =  plus 35 ms of (open(create)-)write-fdatasync

It's possible that WAL records this small need no time at all for
their own sync. That is the most obvious gain of WAL emitting.
Considering the 5-15 ms of buffer scanning time, 256kB or 512kB are
candidate default thresholds, but it would be safe to use 64kB.

> This should be GUC-controlled, especially since this is back-patch material.

Is a patch of this size back-patchable?

> We won't necessarily pick the best value on the first attempt, and the best
> value could depend on factors like the filesystem, the storage hardware, and
> the database's latency goals. One could define the GUC as an absolute size
> (e.g. 1MB) or as a ratio of shared_buffers (e.g. GUC value of 0.001 means the
> threshold is 1MB when shared_buffers is 1GB). I'm not sure which is better.

I'm not sure whether the knob will show an apparent performance gain,
or whether we can offer criteria for identifying the proper value.
But I'll add this feature in the next version, with a GUC
effective_io_block_size defaulting to 64kB as the threshold. (The
name and the default value are arguable, of course.)

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center