Re: fdatasync performance problem with large number of DB files

Paul Guo Tue, 16 Mar 2021 02:44:50 -0700

On Tue, Mar 16, 2021 at 4:29 PM Fujii Masao <[email protected]> wrote:
>
> On 2021/03/16 8:15, Thomas Munro wrote:
> > On Tue, Mar 16, 2021 at 3:30 AM Paul Guo <[email protected]> wrote:
> >> By the way, there is a usual case that we could skip fsync: A fsync-ed
> >> already standby generated by pg_rewind/pg_basebackup.
> >> The state of those standbys are surely not
> >> DB_SHUTDOWNED/DB_SHUTDOWNED_IN_RECOVERY, so the pgdata directory is
> >> fsync-ed again during startup when starting those pg instances. We
> >> could ask users to not fsync during pg_rewind&pg_basebackup, but we
> >> probably want to just fsync some files in pg_rewind (see [1]), so
> >> better let the startup process skip the unnecessary fsync? As to the
> >> solution, using guc or writing something in some files like
> >> backup_label(?) does not seem to be good ideas since
> >> 1. Use guc, we still expect fsync after real crash recovery so we need
> >>    to reset the guc also need to specify pgoptions in pg_ctl command.
> >> 2. Write some hint information to files like backup_label(?) in
> >>    pg_rewind/pg_basebackup, but people might copy the pgdata directory
> >>    and then we still need fsync.
> >> The only one simple solution I can think out is to let user touch a
> >> file to hint startup, before starting the pg instance.
> >
> > As a thought experiment only, I wonder if there is a way to make your
> > touch-a-special-signal-file scheme more reliable and less dangerous
> > (considering people might copy the signal file around or otherwise
> > screw this up).  It seems to me that invalidation is the key, and
> > "unlink the signal file after the first crash recovery" isn't good
> > enough.  Hmm  What if the file contained a fingerprint containing...
> > let's see... checkpoint LSN, hostname, MAC address, pgdata path, ...


The hostname, MAC address, or pgdata path (or, e.g., the inode of a file?)
might still be the same after VM cloning or directory copying, though that
is not usual. I cannot figure out a reliable scheme that makes the
information go stale after VM/directory cloning/copying, so the simplest
way seems to be to leave the decision (i.e. touching a file) to users,
instead of having pg_rewind/pg_basebackup write the information
automatically.
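
To make the touch-a-file idea concrete, here is a minimal sketch in C of how
startup could consume such a hint exactly once. The file name
skip_startup_sync.signal and the function consume_skip_sync_hint() are made
up for illustration; nothing like them exists in PostgreSQL. A real patch
would also need to make the unlink durable (fsync the parent directory), and
this sketch does nothing about the copied-signal-file concern raised above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical file name; PostgreSQL does not define it. */
#define SKIP_SYNC_HINT_FILE "skip_startup_sync.signal"

/*
 * Return true if the user-created hint file existed in pgdata, removing it
 * in the same step so that the hint applies to one startup only.
 * Any failure (file missing, path too long, unlink error) returns false,
 * which the caller should treat as "do the usual sync".
 */
bool
consume_skip_sync_hint(const char *pgdata)
{
    char path[4096];
    int  len;

    assert(pgdata != NULL);

    len = snprintf(path, sizeof(path), "%s/%s", pgdata, SKIP_SYNC_HINT_FILE);
    if (len < 0 || (size_t) len >= sizeof(path))
        return false;           /* be conservative: sync as usual */

    /* unlink() both tests for the file and removes it. */
    return unlink(path) == 0;
}
```

With this shape, the user touches the file once after pg_rewind or
pg_basebackup, the first startup skips the sync, and any later crash
recovery finds no file and syncs as usual.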

> > (add more seasoning to taste), and then also some flags to say what is
> > known to be fully fsync'd already: the WAL, pgdata but only as far as
> > changes up to the checkpoint LSN, or all of pgdata?  Then you could be
> > conservative for a non-match, but skip the extra work in some common
> > cases like pg_basebackup, as long as you trust the fingerprint scheme
> > not to produce false positives.  Or something like that...
> >
> > I'm not too keen to invent clever new schemes for PG14, though.  This
> > sync_after_crash=syncfs scheme is pretty simple, and has the advantage
> > that it's very cheap to do it extra redundant times assuming nothing
> > else is creating new dirty kernel pages in serious quantities.  Is
> > that useful enough?  In particular it avoids the dreaded "open
> > 1,000,000 uncached files over high latency network storage" problem.
> >
> > I don't want to add a hypothetical sync_after_crash=none, because it
> > seems like generally a bad idea.  We already have a
> > running-with-scissors mode you could use for that: fsync=off.
>
> I heard that some backup tools sync the database directory when restoring it.
> I guess that those who use such tools might want the option to disable such
> startup sync (i.e., sync_after_crash=none) because it's not necessary.

This scenario seems to support the file-touching solution, since we do not
have an automatic way to skip the fsync. I thought about using
sync_after_crash=none to fix my issue, but as I said we would need to reset
the GUC afterwards, since we still expect fsync/syncfs after the 2nd crash.

> They can skip that sync by fsync=off. But if they just want to skip only that
> startup sync and make subsequent recovery (or standby server) work with
> fsync=on, they would need to shutdown the server after that startup sync
> finishes, enable fsync, and restart the server. In this case, since the server
> is restarted with the state=DB_SHUTDOWNED_IN_RECOVERY, the startup sync
> would not be performed. This procedure is tricky. So IMO supporting

This makes the process complex. From a product design perspective, it does
not seem attractive.

> sync_after_crash=none would be helpful for this case and simple.

Regards,
Paul Guo (Vmware)
