On Tue, Mar 16, 2021 at 4:29 PM Fujii Masao <masao.fu...@oss.nttdata.com> wrote:
>
> On 2021/03/16 8:15, Thomas Munro wrote:
> > On Tue, Mar 16, 2021 at 3:30 AM Paul Guo <gu...@vmware.com> wrote:
> >> By the way, there is a usual case that we could skip fsync: A fsync-ed
> >> already standby generated by pg_rewind/pg_basebackup.
> >> The state of those standbys are surely not
> >> DB_SHUTDOWNED/DB_SHUTDOWNED_IN_RECOVERY, so the
> >> pgdata directory is fsync-ed again during startup when starting those pg
> >> instances. We could ask users to not fsync
> >> during pg_rewind&pg_basebackup, but we probably want to just fsync some
> >> files in pg_rewind (see [1]), so better
> >> let the startup process skip the unnecessary fsync? As to the solution,
> >> using guc or writing something in some files like
> >> backup_label(?) does not seem to be good ideas since
> >> 1. Use guc, we still expect fsync after real crash recovery so we need to
> >> reset the guc also need to specify pgoptions in pg_ctl command.
> >> 2. Write some hint information to files like backup_label(?) in
> >> pg_rewind/pg_basebackup, but people might
> >> copy the pgdata directory and then we still need fsync.
> >> The only one simple solution I can think out is to let user touch a file
> >> to hint startup, before starting the pg instance.
> >
> > As a thought experiment only, I wonder if there is a way to make your
> > touch-a-special-signal-file scheme more reliable and less dangerous
> > (considering people might copy the signal file around or otherwise
> > screw this up). It seems to me that invalidation is the key, and
> > "unlink the signal file after the first crash recovery" isn't good
> > enough. Hmm What if the file contained a fingerprint containing...
> > let's see... checkpoint LSN, hostname, MAC address, pgdata path, ...
The hostname, MAC address, or pgdata path (or, e.g., the inode of a file?)
might be the same after VM cloning or directory copying, though that is not
usual. I cannot figure out a stable solution that makes the information go
out of date after VM/directory cloning/copying, so the simplest way seems to
be to leave the decision (i.e., touching a file) to users, instead of having
pg_rewind/pg_basebackup write the information automatically.

> > (add more seasoning to taste), and then also some flags to say what is
> > known to be fully fsync'd already: the WAL, pgdata but only as far as
> > changes up to the checkpoint LSN, or all of pgdata? Then you could be
> > conservative for a non-match, but skip the extra work in some common
> > cases like pg_basebackup, as long as you trust the fingerprint scheme
> > not to produce false positives. Or something like that...
> >
> > I'm not too keen to invent clever new schemes for PG14, though. This
> > sync_after_crash=syncfs scheme is pretty simple, and has the advantage
> > that it's very cheap to do it extra redundant times assuming nothing
> > else is creating new dirty kernel pages in serious quantities. Is
> > that useful enough? In particular it avoids the dreaded "open
> > 1,000,000 uncached files over high latency network storage" problem.
> >
> > I don't want to add a hypothetical sync_after_crash=none, because it
> > seems like generally a bad idea. We already have a
> > running-with-scissors mode you could use for that: fsync=off.
>
> I heard that some backup tools sync the database directory when restoring it.
> I guess that those who use such tools might want the option to disable such
> startup sync (i.e., sync_after_crash=none) because it's not necessary.

This scenario seems to support the file-touching solution, since we do not
have an automatic way to skip the fsync.
I thought about using sync_after_crash=none to fix my issue, but as I said,
we would need to reset the GUC afterwards, since we still expect fsync/syncfs
after a second crash.

> They can skip that sync by fsync=off. But if they just want to skip only that
> startup sync and make subsequent recovery (or standby server) work with
> fsync=on, they would need to shutdown the server after that startup sync
> finishes, enable fsync, and restart the server. In this case, since the server
> is restarted with the state=DB_SHUTDOWNED_IN_RECOVERY, the startup sync
> would not be performed. This procedure is tricky. So IMO supporting

This seems to make the process complex. From a product design perspective,
it does not seem attractive.

> sync_after_crash=none would be helpful for this case and simple.

Regards,
Paul Guo (VMware)
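
P.S. To make the two ideas discussed above concrete, here is a rough sketch
(plain Python, not PostgreSQL code) of how a one-shot hint file combined with
a fingerprint check might drive the "do we need the startup sync?" decision.
Everything in it is an assumption for illustration only: the hint file name,
the helper functions, and the choice of SHA-256 are hypothetical, and only a
subset of the pg_control state names is modelled, with arbitrary numeric
values.

```python
import hashlib
import json
import os
from enum import Enum


class DBState(Enum):
    # Subset of the pg_control state names; numeric values are arbitrary here.
    DB_SHUTDOWNED = 1
    DB_SHUTDOWNED_IN_RECOVERY = 2
    DB_IN_CRASH_RECOVERY = 3
    DB_IN_ARCHIVE_RECOVERY = 4
    DB_IN_PRODUCTION = 5


HINT_FILE = "skip_startup_sync.hint"  # hypothetical file name


def fingerprint(checkpoint_lsn, hostname, mac_address, pgdata_path):
    """Hash the identifying fields suggested in the thread into one token."""
    fields = [checkpoint_lsn, hostname, mac_address, pgdata_path]
    return hashlib.sha256(json.dumps(fields).encode("utf-8")).hexdigest()


def write_hint(pgdata, token):
    """What a tool (or a user) might do after it has synced pgdata itself."""
    with open(os.path.join(pgdata, HINT_FILE), "w", encoding="utf-8") as f:
        f.write(token)


def startup_sync_needed(pgdata, state, current_token):
    """Return True when the startup process should sync pgdata."""
    if state in (DBState.DB_SHUTDOWNED, DBState.DB_SHUTDOWNED_IN_RECOVERY):
        return False  # cleanly shut down: no startup sync is performed
    hint = os.path.join(pgdata, HINT_FILE)
    try:
        with open(hint, encoding="utf-8") as f:
            recorded = f.read().strip()
    except OSError:
        return True  # no readable hint: be conservative and sync
    os.unlink(hint)  # one-shot: a later crash must sync again
    return recorded != current_token  # mismatch (e.g. copied pgdata): sync
```

A matching hint skips the sync exactly once, and a missing or mismatched hint
falls back to syncing. Note that this sketch has the same weakness described
above: a cloned VM or a copied directory can reproduce every fingerprint
field, so a stale hint could still match.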