Hi, On 2021-01-21 23:54:04 +0200, Heikki Linnakangas wrote: > On 21/01/2021 22:36, Andres Freund wrote: > > A quick hack (probably not quite correct!) to evaluate the benefit shows > > that the attached script takes 2m17.223s with the smgrimmedsync and > > 0m22.870s passing skipFsync=false to write/extend. Entirely IO bound in > > the former case, CPU bound in the latter. > > > > Creating lots of tables with indexes (directly or indirectly through > > relations having a toast table) is pretty common, particularly after the > > introduction of partitioning. > > > > > > Thinking through the correctness of replacing smgrimmedsync() with sync > > requests, the potential problems that I can see are: > > > > 1) redo point falls between the log_newpage() and the > > write()/register_dirty_segment() in smgrextend/smgrwrite. > > 2) redo point falls between write() and register_dirty_segment() > > > > But both of these are fine in the context of initially filling a newly > > created relfilenode, as far as I can tell? Otherwise the current > > smgrimmedsync() approach wouldn't be safe either, as far as I can tell? > > Hmm. If the redo point falls between write() and the > register_dirty_segment(), and the checkpointer finishes the whole checkpoint > before register_dirty_segment(), you are not safe. That can't happen with > write from the buffer manager, because the checkpointer would block waiting > for the flush of the buffer to finish.
Hm, right. The easiest way to address that race would be to just record the redo pointer in _bt_leafbuild() and continue to do the smgrimmedsync if a checkpoint started since the start of the index build. Another approach would be to utilize PGPROC.delayChkpt, but I would rather not unnecessarily expand the use of that. It's kind of interesting - in my aio branch I moved the register_dirty_segment() to before the actual asynchronous write (due to availability of the necessary data), which ought to be safe because of the buffer interlocking. But that doesn't work here, or for other places doing writes without going through s_b. It'd be great if we could come up with a general solution, but I don't immediately see anything great. The best I can come up with is adding helper functions to wrap some of the complexity for "unbuffered" writes of doing an immedsync iff the redo pointer changed. Something very roughly like typedef struct UnbufferedWriteState { XLogRecPtr redo; uint64 numwrites;} UnbufferedWriteState; void unbuffered_prep(UnbufferedWriteState* state); void unbuffered_write(UnbufferedWriteState* state, ...); void unbuffered_extend(UnbufferedWriteState* state, ...); void unbuffered_finish(UnbufferedWriteState* state); which wouldn't just do the dance to avoid the immedsync() if possible, but also took care of PageSetChecksumInplace() (and PageEncryptInplace() if we get that [1]). Greetings, Andres Freund [1] https://www.postgresql.org/message-id/20210112193431.2edcz776qjen7kao%40alap3.anarazel.de