Hi,

Every nbtree index build currently does an smgrimmedsync at the end:
/*
 * Read tuples in correct sort order from tuplesort, and load them into
 * btree leaves.
 */
static void
_bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
...
	/*
	 * When we WAL-logged index pages, we must nonetheless fsync index files.
	 * Since we're building outside shared buffers, a CHECKPOINT occurring
	 * during the build has no way to flush the previously written data to
	 * disk (indeed it won't know the index even exists).  A crash later on
	 * would replay WAL from the checkpoint, therefore it wouldn't replay our
	 * earlier WAL entries.  If we do not fsync those pages here, they might
	 * still not be on disk when the crash occurs.
	 */
	if (wstate->btws_use_wal)
	{
		RelationOpenSmgr(wstate->index);
		smgrimmedsync(wstate->index->rd_smgr, MAIN_FORKNUM);
	}

In cases where we create lots of small indexes, e.g. because of an initial
schema load, partition creation or something like that, this turns out to be
a major limiting factor (unless one turns fsync off).

One way to address that would be to put newly built indexes into shared
buffers (using a strategy, to avoid blowing out the whole cache), instead of
using smgrwrite() etc. directly. But that's a discussion with more complex
tradeoffs.

What I wonder is why the issue addressed in the comment copied above can't be
addressed more efficiently using sync requests, like we do for other writes.
It's possibly a bit more complicated than just passing skipFsync=false to
smgrwrite()/smgrextend(), but it should be quite doable.

A quick hack (probably not quite correct!) to evaluate the benefit shows that
the attached script takes 2m17.223s with the smgrimmedsync() and 0m22.870s
when passing skipFsync=false to write/extend (a rough sketch of the kind of
change I mean is at the end of this mail). It is entirely IO bound in the
former case and CPU bound in the latter.

Creating lots of tables with indexes (directly or indirectly through relations
having a toast table) is pretty common, particularly after the introduction of
partitioning.

Thinking through the correctness of replacing smgrimmedsync() with sync
requests, the potential problems that I can see are:

1) The redo point falls between the log_newpage() and the
   write()/register_dirty_segment() in smgrextend()/smgrwrite().

2) The redo point falls between the write() and register_dirty_segment().

But both of these are fine in the context of initially filling a newly created
relfilenode, as far as I can tell; otherwise the current smgrimmedsync()
approach wouldn't be safe either.
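To make that more concrete, here is a rough sketch of the kind of change I
mean, roughly against the current nbtsort.c (untested, and not necessarily
identical to the quick hack used for the numbers above). The idea is to let
md.c queue a sync request for the checkpointer via register_dirty_segment()
for each write/extend, instead of fsyncing the whole relation at the end:

	/*
	 * In _bt_blwritepage(): pass skipFsync=false, so that md.c registers a
	 * sync request for the segment instead of relying on a later
	 * smgrimmedsync().  The zero-filling smgrextend() call earlier in the
	 * function would need the same treatment.
	 */
	if (blkno == wstate->btws_pages_written)
	{
		/* extending the file... */
		smgrextend(wstate->index->rd_smgr, MAIN_FORKNUM, blkno,
				   (char *) page, false);	/* skipFsync: was true */
		wstate->btws_pages_written++;
	}
	else
	{
		/* overwriting a block we zero-filled before */
		smgrwrite(wstate->index->rd_smgr, MAIN_FORKNUM, blkno,
				  (char *) page, false);	/* skipFsync: was true */
	}

With that in place, the smgrimmedsync() block quoted at the top of this mail
should not be needed in _bt_load() anymore: the checkpointer would fsync the
index file as part of the next checkpoint, the same way it does for writes
that go through shared buffers.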
Greetings,

Andres Freund

Attachment: createlots.sql (application/sql)