While reading pg_rewind code I found two things could speed up pg_rewind.
Attached are the patches.

First one: pg_rewind would fsync the whole pgdata directory on the target by 
default,
but that is a waste since usually just part of the files/directories on
the target are modified. Other files on the target should have been flushed
since pg_rewind requires a clean shutdown before doing the real work. This
would help the scenario that the target postgres instance includes millions of
files, which has been seen in a real environment.

There are several things that may need further discussions:

1. PG_FLUSH_DATA_WORKS was introduced as "Define PG_FLUSH_DATA_WORKS if we have 
an implementation for pg_flush_data”,
    but now the code guarded by it is just pre_sync_fname() relevant so we 
might want
    to rename it as HAVE_PRE_SYNC kind of name?

2. Pre_sync_fname() implementation

    The code looks like this:
  #if defined(HAVE_SYNC_FILE_RANGE)
      (void) sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE);
  #elif defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED)
      (void) posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);

    I’m a bit suspicious about calling posix_fadvise() with POSIX_FADV_DONTNEED.
    I did not check the Linux Kernel code but according to the man
    page I suspect that this option might cause the kernel tends to evict the 
related kernel
    pages from the page cache, which might not be something we expect. This is
    not a big issue since sync_file_range() should exist on many widely used 
Linux.

    Also I’m not sure how much we could benefit from the pre_sync code. Also 
note if the
    directory has a lot of files or the IO is fast, pre_sync_fname() might slow 
down
    the process instead. The reasons are: If there are a lot of files it is 
possible that we need
    to read the already-synced-and-evicted inode from disk (by open()-ing) 
after rewinding since
   the inode cache in Linux Kernel is limited; also if the IO is faster and 
kernel do background
   dirty page flush quickly, pre_sync_fname() might just waste cpu cycles.

   A better solution might be launch a separate pthread and do fsync one by one
   when pg_rewind finishes handling one file. pg_basebackup could use the 
solution also.

   Anyway this is independent of this patch.

Second one is use copy_file_range() for the local rewind case to replace 
read()+write().
This introduces copy_file_range() check and HAVE_COPY_FILE_RANGE so other
code could use copy_file_range() if needed. copy_file_range() was introduced
In high-version Linux Kernel, in low-version Linux or other Unix-like OS mmap()
might be better than read()+write() but copy_file_range() is more interesting
given that it could skip the data copying in some file systems - this could 
benefit more
on Linux fs on network-based block storage.

Regards,
Paul

Attachment: 0001-Fsync-the-affected-files-directories-only-in-pg_rewi.patch
Description: 0001-Fsync-the-affected-files-directories-only-in-pg_rewi.patch

Attachment: 0002-Use-copy_file_range-for-file-copying-in-pg_rewind.patch
Description: 0002-Use-copy_file_range-for-file-copying-in-pg_rewind.patch

Reply via email to