Hi,

On 2020-01-19 09:52:21 +1300, Thomas Munro wrote:
> On Sun, Jan 19, 2020 at 3:08 AM Justin Pryzby <pry...@telsasoft.com> wrote:
> > As I understand, the first thing that happens syncing every file in the data
> > dir, like in initdb --sync.  These instances were both 5+TB on zfs, with
> > compression, so that's slow, but tolerable, and at least understandable, and
> > with visible progress in ps.
> >
> > The 2nd stage replays WAL.  strace show's it's occasionally running
> > sync_file_range, and I think recovery might've been several times faster if
> > we'd just dumped the data at the OS ASAP, fsync once per file.  In fact,
> > I've just kill -9 the recovery process and edited the config to disable
> > this lest it spend all night in recovery.
> 
> Does sync_file_range() even do anything for non-mmap'd files on ZFS?

Good point. Next time it might be worthwhile to use strace -T to see
whether the sync_file_range calls actually take meaningful time.
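
If attaching strace isn't convenient, roughly the same question can be
answered in-process.  A rough sketch, assuming Linux with glibc; the
helper names (timed_sync_file_range, demo_dirty_then_sync) and the /tmp
path are made up for illustration and are not PostgreSQL code.  To say
anything about ZFS, the temp file would have to live on the filesystem
being tested:

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* glibc exposes sync_file_range() under this */
#endif
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper: issue one sync_file_range() call and report how
 * long the call itself took.  Returns the call's return value (0 or -1);
 * *elapsed_ns is filled in either way.
 */
int
timed_sync_file_range(int fd, off_t offset, off_t nbytes,
                      unsigned int flags, int64_t *elapsed_ns)
{
    struct timespec before, after;
    int rc;

    assert(elapsed_ns != NULL);

    clock_gettime(CLOCK_MONOTONIC, &before);
    rc = sync_file_range(fd, offset, nbytes, flags);
    clock_gettime(CLOCK_MONOTONIC, &after);

    *elapsed_ns = (int64_t) (after.tv_sec - before.tv_sec) * 1000000000LL
        + (after.tv_nsec - before.tv_nsec);
    return rc;
}

/*
 * Hypothetical demo: dirty nbytes of a temp file with write(), then time
 * a SYNC_FILE_RANGE_WRITE over that range.  Returns -2 if the setup
 * failed, otherwise whatever sync_file_range() returned.
 */
int
demo_dirty_then_sync(size_t nbytes, int64_t *elapsed_ns)
{
    char path[] = "/tmp/sfr_demo_XXXXXX";   /* assumed location */
    char *buf;
    int fd, rc;

    fd = mkstemp(path);
    if (fd < 0)
        return -2;
    unlink(path);               /* file goes away once fd is closed */

    buf = calloc(1, nbytes);
    if (buf == NULL || write(fd, buf, nbytes) != (ssize_t) nbytes)
    {
        free(buf);
        close(fd);
        return -2;
    }
    free(buf);

    rc = timed_sync_file_range(fd, 0, (off_t) nbytes,
                               SYNC_FILE_RANGE_WRITE, elapsed_ns);
    close(fd);
    return rc;
}
```

Calling demo_dirty_then_sync(8192, &ns) and comparing ns across
filesystems gives a crude per-call cost; it says nothing about whether
the data actually reached storage.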


> Non-mmap'd ZFS data is not in the Linux page cache, and I think
> sync_file_range() works at that level.  At a guess, there'd need to be
> a new VFS file_operation so that ZFS could get a callback to handle
> data in its ARC.

Yea, it requires the pages to be in the pagecache to do anything:

int sync_file_range(struct file *file, loff_t offset, loff_t nbytes,
                    unsigned int flags)
{
...

        if (flags & SYNC_FILE_RANGE_WRITE) {
                int sync_mode = WB_SYNC_NONE;

                if ((flags & SYNC_FILE_RANGE_WRITE_AND_WAIT) ==
                             SYNC_FILE_RANGE_WRITE_AND_WAIT)
                        sync_mode = WB_SYNC_ALL;

                ret = __filemap_fdatawrite_range(mapping, offset, endbyte,
                                                 sync_mode);
                if (ret < 0)
                        goto out;
        }

and then

int __filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
                                loff_t end, int sync_mode)
{
        int ret;
        struct writeback_control wbc = {
                .sync_mode = sync_mode,
                .nr_to_write = LONG_MAX,
                .range_start = start,
                .range_end = end,
        };

        if (!mapping_cap_writeback_dirty(mapping) ||
            !mapping_tagged(mapping, PAGECACHE_TAG_DIRTY))
                return 0;

which means that if there are no pages in the pagecache for the relevant
range, it'll just finish here.  *Iff* there are some, say because
something else mmap()ed a section, it'd potentially call into the
address_space->writepages() callback.  So it's possible to emulate
enough state for ZFS or such to still have sync_file_range() call into it
(by setting up a pseudo map tagged as dirty), but it's not really the
normal path.
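
To make the control flow above concrete, here is a userspace toy model
of the two excerpts.  It is only a sketch: the toy_* names and the
struct are invented for illustration, and the only things taken from
Linux are the shape of the checks and the flag values (1, 2, 4).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Same numeric values as the Linux SYNC_FILE_RANGE_* flags. */
#define SFR_WAIT_BEFORE     1u
#define SFR_WRITE           2u
#define SFR_WAIT_AFTER      4u
#define SFR_WRITE_AND_WAIT  (SFR_WAIT_BEFORE | SFR_WRITE | SFR_WAIT_AFTER)

enum toy_sync_mode { TOY_SYNC_NONE, TOY_SYNC_ALL };

/* Invented stand-in for the bits of struct address_space that matter here. */
struct toy_mapping
{
    bool can_writeback;         /* models mapping_cap_writeback_dirty() */
    bool tagged_dirty;          /* models PAGECACHE_TAG_DIRTY being set */
    int writepages_calls;       /* how often the filesystem was called */
    enum toy_sync_mode last_mode;
};

/* Models the early return in __filemap_fdatawrite_range(). */
int
toy_fdatawrite_range(struct toy_mapping *m, enum toy_sync_mode mode)
{
    if (!m->can_writeback || !m->tagged_dirty)
        return 0;               /* nothing dirty: filesystem never called */

    m->writepages_calls++;      /* stands in for ->writepages() */
    m->last_mode = mode;
    return 0;
}

/* Models the SYNC_FILE_RANGE_WRITE branch of sync_file_range(). */
int
toy_sync_file_range(struct toy_mapping *m, unsigned int flags)
{
    assert(m != NULL);

    if (flags & SFR_WRITE)
    {
        enum toy_sync_mode mode = TOY_SYNC_NONE;

        /* All three bits must be set, hence the == comparison. */
        if ((flags & SFR_WRITE_AND_WAIT) == SFR_WRITE_AND_WAIT)
            mode = TOY_SYNC_ALL;

        return toy_fdatawrite_range(m, mode);
    }
    return 0;
}
```

With tagged_dirty = false (the case where the data only lives in the
ARC), toy_sync_file_range() returns 0 and writepages_calls stays at 0.
Only a mapping that is tagged dirty ever reaches the callback.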

Greetings,

Andres Freund

