Hello.

At Sun, 25 Aug 2019 22:08:43 -0700, Noah Misch <n...@leadboat.com> wrote in <20190826050843.gb3153...@rfd.leadboat.com>
> On Thu, Aug 22, 2019 at 09:06:06PM +0900, Kyotaro Horiguchi wrote:
> > At Mon, 19 Aug 2019 23:03:14 -0700, Noah Misch <n...@leadboat.com> wrote in <20190820060314.ga3086...@rfd.leadboat.com>
> > > On Mon, Aug 19, 2019 at 06:59:59PM +0900, Kyotaro Horiguchi wrote:
> > > > I'm not sure of the point of the behavior. I suppose that the "log"
> > > > is a sequence of new-page records. It also needs to be synced, and
> > > > it is always larger than the file to be synced. I can't think of
> > > > an appropriate threshold without knowing the point.
> > >
> > > Yes, it would be a sequence of new-page records. FlushRelationBuffers()
> > > locks every buffer header containing a buffer of the current database.
> > > The belief has been that writing one page to xlog is cheaper than
> > > FlushRelationBuffers() in a busy system with large shared_buffers.
> >
> > I'm at a loss. The decision between WAL and sync is made at
> > commit time, when we no longer have a pin on a buffer. When
> > emitting WAL, contrary to the assumption, the lock needs to be
> > re-acquired for every page to emit log_newpage. What is worse,
> > we may need to reload evicted buffers. If the file has been
> > CopyFrom'ed, the ring-buffer strategy makes the situation even
> > worse. That doesn't seem cheap at all.
>
> Consider a one-page relfilenode. Doing all the things you list for a single
> page may be cheaper than locking millions of buffer headers.
If I understand you correctly, I would say that *all* buffers that
don't belong to in-transaction-created files are skipped before any
lock is taken, so no lock conflict happens with other backends.
FlushRelationBuffers uses double-checked locking, roughly as follows:

    FlushRelationBuffers_common():
      ...
      if (!islocal)
      {
          for (i = 0; i < NBuffers; i++)
          {
              /* unlocked pre-check; most buffers are skipped here */
              if (RelFileNodeEquals(bufHdr->tag.rnode, rnode))
              {
                  LockBufHdr(bufHdr);
                  /* re-check under the buffer-header lock */
                  if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
                      valid && dirty)
                  {
                      PinBuffer_Locked(bufHdr);
                      LWLockAcquire(...);
                      FlushBuffer(...);
                      ...

128GB of shared buffers contains 16M buffers (128GB / 8kB). On my
perhaps-Windows-Vista-era box, such a loop takes 15 ms. (Since the box
has only 6GB of memory, the test ignores cache effects caused by the
difference in buffer-pool size.) (attached 1)

With WAL emitting, we find every buffer of the file via the buffer
hash, so we pay for buffer-mapping partition locks instead of the
15 ms of purely local latency. That seems worse.

> > If there were any chance for WAL winning on smaller files here, it
> > would be on files smaller than the ring size of the bulk-write
> > strategy (16MB).
>
> Like you, I expect the optimal threshold is less than 16MB, though you should
> benchmark to see. Under the ideal threshold, when a transaction creates a new
> relfilenode just smaller than the threshold, that transaction will be somewhat
> slower than it would be if the threshold were zero. Locking every buffer

I looked closer at this. For a 16MB file, the write-fsync cost is
almost the same as the WAL-emitting cost: about 200 ms on the
Vista-era machine, which has non-performant rotating magnetic disks
and xfs. (attached 2, 3) Although write-fsyncing of a relation file
causes no lock conflict with other backends, WAL emitting delays
other backends' commits by up to that many milliseconds.

In summary, the characteristics of the two methods for a 16MB file
are as follows.
File write:
  - 15 ms of buffer scan without locks (at 128GB shared_buffers)
  + no hash search for a buffer
  = takes locks on the buffers of the file, one by one (to write)
  + plus 200 ms of write-fdatasync (of the whole relation file),
    which doesn't conflict with other backends (except via CPU time
    slots and I/O bandwidth).

WAL write:
  + no buffer scan
  - 2048 (16MB/8kB) partition locks to find every buffer of the
    target file, which can conflict with other backends.
  = takes locks on the buffers of the file, one by one (to take FPWs)
  - plus 200 ms of open(create)-write-fdatasync (of a WAL file of the
    default size), which can delay commits on other backends by up to
    that duration.

> header causes a distributed slow-down for other queries, and protecting the
> latency of non-DDL queries is typically more useful than accelerating
> TRUNCATE, CREATE TABLE, etc. Writing more WAL also slows down other queries;
> beyond a certain relfilenode size, the extra WAL harms non-DDL queries more
> than the buffer scan harms them. That's about where the threshold should be.

If the discussion above is correct, we shouldn't use WAL-write even
for files around 16MB.

For smaller shared_buffers and file sizes, the delays are:

Scanning all buffers takes:
   15 ms  for 128GB shared_buffers
  4.5 ms  for  32GB shared_buffers

fdatasync takes:
  200 ms for 16MB/sync
   51 ms for 1MB/sync
   46 ms for 512kB/sync
   40 ms for 256kB/sync
   37 ms for 128kB/sync
   35 ms for <64kB/sync

These numbers seem reasonable for 5400rpm disks. The threshold seems
to be 64kB on my configuration. It can differ by configuration, but I
think not by much. (I'm not sure about SSDs or in-memory filesystems.)

So for files smaller than 64kB:

File write:
  -- 15 ms of buffer scan without locks
  +  no hash search for a buffer
  =  plus 35 ms of write-fdatasync

WAL write:
  ++ no buffer scan
  -  one partition lock per buffer of the target file, which can
     conflict with other backends (but is ignorable).
  =  plus 35 ms of (open(create)-)write-fdatasync

It's possible that WAL records this small need no time at all for
their own sync. That is the most obvious gain of WAL emitting.
Considering the 5-15 ms of buffer scanning time, 256kB or 512kB are
candidate default thresholds, but it would be safe to use 64kB.

> This should be GUC-controlled, especially since this is back-patch material.

Is a patch of this size back-patchable?

> We won't necessarily pick the best value on the first attempt, and the best
> value could depend on factors like the filesystem, the storage hardware, and
> the database's latency goals. One could define the GUC as an absolute size
> (e.g. 1MB) or as a ratio of shared_buffers (e.g. GUC value of 0.001 means the
> threshold is 1MB when shared_buffers is 1GB). I'm not sure which is better.

I'm not sure whether the knob will show an apparent performance gain,
or whether we can offer criteria for identifying the proper value.
But I'll add this feature in the next version, with a GUC
effective_io_block_size defaulting to 64kB as the threshold. (The
name and the default value are arguable, of course.)

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center