+1 to the idea. I don't see a reason why checkpointer has to do all of that. Keeping checkpoint to minimal essential work helps servers recover faster in the event of a crash.
RemoveOldXlogFiles is also an O(N) operation that can at least be avoided during the end of recovery (CHECKPOINT_END_OF_RECOVERY) checkpoint. When a sufficient number of WAL files accumulated and the previous checkpoint did not get a chance to cleanup, this can increase the unavailability of the server. RemoveOldXlogFiles(_logSegNo, RedoRecPtr, recptr); On Wed, Dec 1, 2021 at 12:24 PM Bossart, Nathan <bossa...@amazon.com> wrote: > Hi hackers, > > Thanks to 61752af, SyncDataDirectory() can make use of syncfs() to > avoid individually syncing all database files after a crash. However, > as noted earlier this year [0], there are still a number of O(n) tasks > that affect startup and checkpointing that I'd like to improve. > Below, I've attempted to summarize each task and to offer ideas for > improving matters. I'll likely split each of these into its own > thread, given there is community interest for such changes. > > 1) CheckPointSnapBuild(): This function loops through > pg_logical/snapshots to remove all snapshots that are no longer > needed. If there are many entries in this directory, this can take > a long time. The note above this function indicates that this is > done during checkpoints simply because it is convenient. IIUC > there is no requirement that this function actually completes for a > given checkpoint. My current idea is to move this to a new > maintenance worker. > 2) CheckPointLogicalRewriteHeap(): This function loops through > pg_logical/mappings to remove old mappings and flush all remaining > ones. IIUC there is no requirement that the "remove old mappings" > part must complete for a given checkpoint, but the "flush all > remaining" portion allows replay after a checkpoint to only "deal > with the parts of a mapping that have been written out after the > checkpoint started." Therefore, I think we should move the "remove > old mappings" part to a new maintenance worker (probably the same > one as for 1), and we should consider using syncfs() for the "flush > all remaining" part. (I suspect the main argument against the > latter will be that it could cause IO spikes.) > 3) RemovePgTempFiles(): This step can delay startup if there are many > temporary files to individually remove. This step is already > optionally done after a crash via the remove_temp_files_after_crash > GUC. I propose that we have startup move the temporary file > directories aside and create new ones, and then a separate worker > (probably the same one from 1 and 2) could clean up the old files. > 4) StartupReorderBuffer(): This step deletes logical slot data that > has been spilled to disk. This code appears to be written to avoid > deleting different types of files in these directories, but AFAICT > there shouldn't be any other files. Therefore, I think we could do > something similar to 3 (i.e., move the directories aside during > startup and clean them up via a new maintenance worker). > > I realize adding a new maintenance worker might be a bit heavy-handed, > but I think it would be nice to have somewhere to offload tasks that > really shouldn't impact startup and checkpointing. I imagine such a > process would come in handy down the road, too. WDYT? > > Nathan > > [0] https://postgr.es/m/32B59582-AA6C-4609-B08F-2256A271F7A5%40amazon.com > >