While reading pg_rewind code I found two things could speed up pg_rewind. Attached are the patches.
First one: pg_rewind would fsync the whole pgdata directory on the target by default, but that is a waste since usually just part of the files/directories on the target are modified. Other files on the target should have been flushed since pg_rewind requires a clean shutdown before doing the real work. This would help the scenario that the target postgres instance includes millions of files, which has been seen in a real environment. There are several things that may need further discussions: 1. PG_FLUSH_DATA_WORKS was introduced as "Define PG_FLUSH_DATA_WORKS if we have an implementation for pg_flush_data”, but now the code guarded by it is just pre_sync_fname() relevant so we might want to rename it as HAVE_PRE_SYNC kind of name? 2. Pre_sync_fname() implementation The code looks like this: #if defined(HAVE_SYNC_FILE_RANGE) (void) sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE); #elif defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED) (void) posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED); I’m a bit suspicious about calling posix_fadvise() with POSIX_FADV_DONTNEED. I did not check the Linux Kernel code but according to the man page I suspect that this option might cause the kernel tends to evict the related kernel pages from the page cache, which might not be something we expect. This is not a big issue since sync_file_range() should exist on many widely used Linux. Also I’m not sure how much we could benefit from the pre_sync code. Also note if the directory has a lot of files or the IO is fast, pre_sync_fname() might slow down the process instead. The reasons are: If there are a lot of files it is possible that we need to read the already-synced-and-evicted inode from disk (by open()-ing) after rewinding since the inode cache in Linux Kernel is limited; also if the IO is faster and kernel do background dirty page flush quickly, pre_sync_fname() might just waste cpu cycles. A better solution might be launch a separate pthread and do fsync one by one when pg_rewind finishes handling one file. pg_basebackup could use the solution also. Anyway this is independent of this patch. Second one is use copy_file_range() for the local rewind case to replace read()+write(). This introduces copy_file_range() check and HAVE_COPY_FILE_RANGE so other code could use copy_file_range() if needed. copy_file_range() was introduced In high-version Linux Kernel, in low-version Linux or other Unix-like OS mmap() might be better than read()+write() but copy_file_range() is more interesting given that it could skip the data copying in some file systems - this could benefit more on Linux fs on network-based block storage. Regards, Paul
0001-Fsync-the-affected-files-directories-only-in-pg_rewi.patch
Description: 0001-Fsync-the-affected-files-directories-only-in-pg_rewi.patch
0002-Use-copy_file_range-for-file-copying-in-pg_rewind.patch
Description: 0002-Use-copy_file_range-for-file-copying-in-pg_rewind.patch