Hi,

Here's a much more polished and cleaned-up version of the patches,
fixing all the issues I'm aware of, and with various pieces merged into
more cohesive patches (instead of keeping them separate to make the
changes/evolution more obvious).

I decided to reorder the changes like this:

1) block alignment - As I said earlier, I think we should definitely do
this, even if only to make future improvements possible. After chatting
about this with Robert off-list a bit, he pointed out that I forgot to
skip the header alignment for files without any blocks, so this version
fixes that.

2) add clone/copy_file_range for the case that copies whole files. This
is pretty simple, but it also adds the --clone/--copy-file-range options
and similar infrastructure. The one slightly annoying bit is that we now
have the ifdef stuff in two places - when parsing the options, and then
in the copy_file_XXX methods - and the pg_fatal() calls should not be
reachable in practice. But that doesn't seem harmful, and might be a
useful protection against someone calling a function that does nothing.

This also merges the original two parts, where the first one only did
this cloning/CoW stuff when a checksum did not need to be calculated,
and the second extended it to those cases too (by also reading the data,
but still doing the copy the old way).

I think this is the right way - if that's not desirable, it's easy to
either add --no-manifest or not use the CoW options. Or just not change
the checksum type. There's no other way.

3) add copy_file_range to write_reconstructed_file, by using roughly the
same logic/reasoning as (2). If --copy-file-range is specified and a
checksum should be calculated, the data is read for the checksum, but
the copy is done using copy_file_range.

I did rework the write_reconstructed_file() flow a bit, because
tracking what exactly needs to be read/written in each of the cases
(which copy method, zeroed block, checksum calculated) made the flow
really difficult to follow. Instead I introduced functions to
read/write a block, and call them from the different places.

I think this is much better, and it also makes the following part
dealing with batches of blocks a much easier / smaller change.

4) prefetching - This is mostly unrelated to the CoW stuff, but it has
tremendous benefits, especially for ZFS. I haven't been able to tune ZFS
to get decent performance without it, and ISTM it'd be rather unusable
for backup purposes.

5) batching in write_reconstructed_file - This benefits especially the
copy_file_range case, where I've seen it yield massive speedups (see
the message from Monday for better data).

6) batching for prefetch - Similar logic to (5), but for fadvise. This
is the only part where I'm not entirely sure whether it actually helps
or not. Needs some more analysis, but I'm including it for completeness.


I do think the parts (1)-(4) are in pretty good shape, good enough to
make them committable in a day or two. I see it mostly as a matter of
testing and comment improvements rather than code changes.

(5) is in pretty good shape too, but I'd like to maybe simplify and
refine the write_reconstructed_file changes a bit more. I don't think
it's broken, but it feels a bit too cumbersome.

Not sure about (6) yet.

I changed how I think about this a bit - I don't really see the CoW copy
methods as necessarily faster than the regular copy (even though they
can be, with (5)). I think the main benefits are the space savings,
enabled by patches (1)-(3). If (4) and (5) get in, that's a bonus, but
even without them I don't think the performance is an issue - everything
has a cost.


On 4/3/24 15:39, Jakub Wartak wrote:
> On Mon, Apr 1, 2024 at 9:46 PM Tomas Vondra
> <tomas.von...@enterprisedb.com> wrote:
>>
>> Hi,
>>
>> I've been running some benchmarks and experimenting with various stuff,
>> trying to improve the poor performance on ZFS, and the regression on XFS
>> when using copy_file_range. And oh boy, did I find interesting stuff ...
> 
> [..]
> 
> Congratulations on great results!
> 
>> 4) after each pg_combinebackup run to pg_verifybackup, start the cluster
>> to finish recovery, run pg_checksums --check (to check the patches don't
>> produce something broken)
> 
> I've performed some follow-up small testing on all patches mentioned
> here  (1..7), with the earlier developed nano-incremental-backup-tests
> that helped detect some issues for Robert earlier during original
> development. They all went fine in both cases:
> - no special options when using pg_combinebackup
> - using pg_combinebackup --copy-file-range --manifest-checksums=NONE
> 
> Those were:
> test_across_wallevelminimal.sh
> test_full_pri__incr_stby__restore_on_pri.sh
> test_full_pri__incr_stby__restore_on_stby.sh
> test_full_stby__incr_stby__restore_on_pri.sh
> test_full_stby__incr_stby__restore_on_stby.sh
> test_incr_after_timelineincrease.sh
> test_incr_on_standby_after_promote.sh
> test_many_incrementals_dbcreate_duplicateOID.sh
> test_many_incrementals_dbcreate_filecopy_NOINCR.sh
> test_many_incrementals_dbcreate_filecopy.sh
> test_many_incrementals_dbcreate.sh
> test_many_incrementals.sh
> test_multixact.sh
> test_pending_2pc.sh
> test_reindex_and_vacuum_full.sh
> test_repro_assert_RP.sh
> test_repro_assert.sh
> test_standby_incr_just_backup.sh
> test_stuck_walsum.sh
> test_truncaterollback.sh
> test_unlogged_table.sh
> 
>> Now to the findings ....
>>

Thanks. Would be great if you could run this on the attached version of
the patches, ideally for each of them independently, to make sure
nothing gets broken and then fixed somewhere along the way.

>>
>> 1) block alignment
> 
> [..]
> 
>> And I think we probably want to do this now, because this affects all
>> tools dealing with incremental backups - even if someone writes a custom
>> version of pg_combinebackup, it will have to deal with misaligned data.
>> Perhaps there might be something like pg_basebackup that "transforms"
>> the data received from the server (and also the backup manifest), but
>> that does not seem like a great direction.
> 
> If anything is on the table, then I think in the far future
> pg_refresh_standby_using_incremental_backup_from_primary would be the
> only other tool using the format ?
> 

Possibly, but I was thinking more about backup solutions using the same
format, but doing the client-side differently. Essentially, something
that would still use the server side to generate incremental backups,
but replace pg_combinebackup to do this differently (stream the data
somewhere else, index it somehow, or whatever).

>> 2) prefetch
>> -----------
> [..]
>> I think this means we may need a "--prefetch" option, that'd force
>> prefetching, probably both before pread and copy_file_range. Otherwise
>> people on ZFS are doomed and will have poor performance.
> 
> Right, we could optionally cover in the docs later-on various options
> to get the performance (on XFS use $this, but without $that and so
> on). It's kind of madness dealing with all those performance
> variations.
> 

Yeah, something like that. I'm not sure we want to talk too much about
individual filesystems in our docs, because those things evolve over
time too. And also this depends on how large the increment is. If it
only modifies 1% of the blocks, then 99% will come from the full backup,
and the sequential prefetch should do OK (well, maybe not on ZFS). But
as the incremental backup gets larger / is more random, the prefetch is
more and more important.

> Another idea: remove that 128 posix_fadvise() hardcode in 0002 and a
> getopt variant like: --prefetch[=HOWMANY] with 128 being as default ?
> 

I did think about that, but there's a dependency on the batching. If
we're prefetching ~1MB of data, we may need to prefetch up to ~1MB
ahead, because otherwise we might want to read 1MB and find only a tiny
part of it prefetched. I was thinking maybe we could skip the
sequential parts, but ZFS does need that.

So I don't think we can just allow users to set arbitrary values, at
least not without also tweaking the batch. Or maybe 1MB batches are too
large, and we should use something smaller? I need to think about this a
bit more ...


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
From aae6423b6f0e7955a19d0ff28ec64071e808ca0d Mon Sep 17 00:00:00 2001
From: Tomas Vondra <to...@2ndquadrant.com>
Date: Sun, 31 Mar 2024 14:55:29 +0200
Subject: [PATCH v20240403 1/6] Properly align blocks in incremental backups

Align blocks stored in incremental files to BLCKSZ (typically 8K), so
that the incremental backups can work well with CoW filesystems.

The features based on block sharing in CoW filesystems require the
blocks to be aligned to the filesystem page size, but the format
of the incremental files did not meet that requirement as the files
have variable-length header and the data for the blocks was stored
right after that.

This would prevent features like deduplication or block sharing built
into some file systems (e.g. ZFS, BTRFS, XFS) from working well.

This commit pads the file header with \0 to a multiple of BLCKSZ,
which means the block data is aligned to BLCKSZ too. The padding
is applied only to files with block data, so files with just the
header are left small. For files with blocks this adds a bit of
overhead, but as the number of blocks increases the overhead gets
negligible very quickly. Moreover, as the padding is \0 bytes, it
does compress extremely well.
---
 src/backend/backup/basebackup.c             | 26 ++++++++++++++
 src/backend/backup/basebackup_incremental.c | 39 ++++++++++++++++++---
 src/bin/pg_combinebackup/reconstruct.c      |  8 +++++
 src/include/backup/basebackup_incremental.h |  1 +
 4 files changed, 70 insertions(+), 4 deletions(-)

diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index 5fea86ad0fd..b956df072d5 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -1623,6 +1623,8 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
 	{
 		unsigned	magic = INCREMENTAL_MAGIC;
 		size_t		header_bytes_done = 0;
+		char		padding[BLCKSZ];
+		size_t		paddinglen;
 
 		/* Emit header data. */
 		push_to_sink(sink, &checksum_ctx, &header_bytes_done,
@@ -1635,6 +1637,23 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
 					 incremental_blocks,
 					 sizeof(BlockNumber) * num_incremental_blocks);
 
+		/*
+		 * Add padding to align header to a multiple of BLCKSZ, but only if
+		 * the incremental file has some blocks. If there are no blocks we
+		 * don't want to make the file unnecessarily large, as that might make
+		 * some filesystem optimizations impossible.
+		 */
+		if (num_incremental_blocks > 0 && (header_bytes_done % BLCKSZ != 0))
+		{
+			paddinglen = (BLCKSZ - (header_bytes_done % BLCKSZ));
+
+			memset(padding, 0, paddinglen);
+			bytes_done += paddinglen;
+
+			push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+						 padding, paddinglen);
+		}
+
 		/* Flush out any data still in the buffer so it's again empty. */
 		if (header_bytes_done > 0)
 		{
@@ -1748,6 +1767,13 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
 		blkno += cnt / BLCKSZ;
 		bytes_done += cnt;
 
+		/*
+		 * Make sure incremental files with block data are properly aligned
+		 * (header is a multiple of BLCKSZ, blocks are BLCKSZ too).
+		 */
+		Assert(!((incremental_blocks != NULL && num_incremental_blocks > 0) &&
+				 (bytes_done % BLCKSZ != 0)));
+
 		/* Archive the data we just read. */
 		bbsink_archive_contents(sink, cnt);
 
diff --git a/src/backend/backup/basebackup_incremental.c b/src/backend/backup/basebackup_incremental.c
index 990b2872eaf..9cd7d9e15ae 100644
--- a/src/backend/backup/basebackup_incremental.c
+++ b/src/backend/backup/basebackup_incremental.c
@@ -902,6 +902,36 @@ GetFileBackupMethod(IncrementalBackupInfo *ib, const char *path,
 	return BACK_UP_FILE_INCREMENTALLY;
 }
 
+/*
+ * Compute the size for a header of an incremental file containing a given
+ * number of blocks. The header is rounded to a multiple of BLCKSZ, but
+ * only if the file will store some block data.
+ */
+size_t
+GetIncrementalHeaderSize(unsigned num_blocks_required)
+{
+	size_t		result;
+
+	/* Make sure we're not going to overflow. */
+	Assert(num_blocks_required <= RELSEG_SIZE);
+
+	/*
+	 * Three four byte quantities (magic number, truncation block length,
+	 * block count) followed by block numbers.
+	 */
+	result = 3 * sizeof(uint32) + (sizeof(BlockNumber) * num_blocks_required);
+
+	/*
+	 * Round the header size to a multiple of BLCKSZ - when not a multiple
+	 * of BLCKSZ, add the missing fraction of a block. But do this only if
+	 * the file will store data for some blocks, otherwise keep it small.
+	 */
+	if ((num_blocks_required > 0) && (result % BLCKSZ != 0))
+		result += BLCKSZ - (result % BLCKSZ);
+
+	return result;
+}
+
 /*
  * Compute the size for an incremental file containing a given number of blocks.
  */
@@ -914,11 +944,12 @@ GetIncrementalFileSize(unsigned num_blocks_required)
 	Assert(num_blocks_required <= RELSEG_SIZE);
 
 	/*
-	 * Three four byte quantities (magic number, truncation block length,
-	 * block count) followed by block numbers followed by block contents.
+	 * Header with three four byte quantities (magic number, truncation block
+	 * length, block count) followed by block numbers, rounded to a multiple
+	 * of BLCKSZ (for files with block data), followed by block contents.
 	 */
-	result = 3 * sizeof(uint32);
-	result += (BLCKSZ + sizeof(BlockNumber)) * num_blocks_required;
+	result = GetIncrementalHeaderSize(num_blocks_required);
+	result += BLCKSZ * num_blocks_required;
 
 	return result;
 }
diff --git a/src/bin/pg_combinebackup/reconstruct.c b/src/bin/pg_combinebackup/reconstruct.c
index 41f06bb26b5..33c6da02a8c 100644
--- a/src/bin/pg_combinebackup/reconstruct.c
+++ b/src/bin/pg_combinebackup/reconstruct.c
@@ -472,6 +472,14 @@ make_incremental_rfile(char *filename)
 		sizeof(rf->truncation_block_length) +
 		sizeof(BlockNumber) * rf->num_blocks;
 
+	/*
+	 * Round header length to a multiple of BLCKSZ, so that blocks contents
+	 * are properly aligned. Only do this when the file actually has data for
+	 * some blocks.
+	 */
+	if ((rf->num_blocks > 0) && ((rf->header_length % BLCKSZ) != 0))
+		rf->header_length += (BLCKSZ - (rf->header_length % BLCKSZ));
+
 	return rf;
 }
 
diff --git a/src/include/backup/basebackup_incremental.h b/src/include/backup/basebackup_incremental.h
index ae5feb4b320..6ba9c9035e2 100644
--- a/src/include/backup/basebackup_incremental.h
+++ b/src/include/backup/basebackup_incremental.h
@@ -51,5 +51,6 @@ extern FileBackupMethod GetFileBackupMethod(IncrementalBackupInfo *ib,
 											BlockNumber *relative_block_numbers,
 											unsigned *truncation_block_length);
 extern size_t GetIncrementalFileSize(unsigned num_blocks_required);
+extern size_t GetIncrementalHeaderSize(unsigned num_blocks_required);
 
 #endif
-- 
2.44.0

From 82df50f7f92fa83d2f59ff96c46677b53abd3465 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <to...@2ndquadrant.com>
Date: Tue, 19 Mar 2024 15:16:29 +0100
Subject: [PATCH v20240403 2/6] Allow copying files using clone/copy_file_range

Adds options --clone/--copy-file-range allowing files to be copied
using these accelerated methods, usually faster than the block-by-block
copy. This only applies to files that did not change and can be copied
as a whole.

These new copy methods may not be available on all platforms, in
which case the command throws an error (immediately, even if no
files could be copied as a whole). This early failure seems better
than failing when trying to copy the first file, after performing
a lot of work on earlier files.

If the requested copy method is available, but the file needs other
processing (e.g. calculating a checksum of a different type), the
file is copied using the requested method, but it is also read and
processed, with the data discarded. Depending on the filesystem this
may be more expensive than just performing the simple copy, but it
does allow the CoW benefits.
---
 doc/src/sgml/ref/pg_combinebackup.sgml      |  46 +++++
 src/bin/pg_combinebackup/copy_file.c        | 202 ++++++++++++++++----
 src/bin/pg_combinebackup/copy_file.h        |  18 +-
 src/bin/pg_combinebackup/pg_combinebackup.c |  47 ++++-
 src/bin/pg_combinebackup/reconstruct.c      |   5 +-
 src/bin/pg_combinebackup/reconstruct.h      |   3 +-
 6 files changed, 277 insertions(+), 44 deletions(-)

diff --git a/doc/src/sgml/ref/pg_combinebackup.sgml b/doc/src/sgml/ref/pg_combinebackup.sgml
index 6f90dba281f..70310c31985 100644
--- a/doc/src/sgml/ref/pg_combinebackup.sgml
+++ b/doc/src/sgml/ref/pg_combinebackup.sgml
@@ -185,6 +185,52 @@ PostgreSQL documentation
       </listitem>
      </varlistentry>
 
+     <varlistentry>
+      <term><option>--clone</option></term>
+      <listitem>
+       <para>
+        Use efficient file cloning (also known as <quote>reflinks</quote> on
+        some systems) instead of copying files to the new data directory,
+        which can result in near-instantaneous copying of the data files.
+       </para>
+
+       <para>
+        File cloning may be used only when the checksums match in the backup
+        manifests. If a backup manifest is not available or does not contain
+        a checksum of the right type, the file will be copied block-by-block
+        as if the <option>--clone</option> option was not specified.
+       </para>
+
+       <para>
+        File cloning is only supported on some operating systems and file
+        systems.  If it is selected but not supported, the
+        <application>pg_combinebackup</application> run will error.  At present,
+        it is supported on Linux (kernel 4.5 or later) with Btrfs and XFS (on
+        file systems created with reflink support), and on macOS with APFS.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry>
+      <term><option>--copy-file-range</option></term>
+      <listitem>
+       <para>
+        Use the <function>copy_file_range</function> system call for efficient
+        copying.  On some file systems this gives results similar to
+        <option>--clone</option>, sharing physical disk blocks, while on others
+        it may still copy blocks, but do so via an optimized path.  At present,
+        it is supported on Linux and FreeBSD.
+       </para>
+
+       <para>
+        <function>copy_file_range</function> may be used only when the
+        checksums match in the backup manifests. If a backup manifest is not
+        available or does not contain a checksum of the right type, the file will
+        be copied block-by-block as if the option was not specified.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry>
        <term><option>-V</option></term>
        <term><option>--version</option></term>
diff --git a/src/bin/pg_combinebackup/copy_file.c b/src/bin/pg_combinebackup/copy_file.c
index e6d2423278a..8e8c02e58f8 100644
--- a/src/bin/pg_combinebackup/copy_file.c
+++ b/src/bin/pg_combinebackup/copy_file.c
@@ -14,6 +14,11 @@
 #include <copyfile.h>
 #endif
+#ifdef __linux__
+#include <sys/ioctl.h>
+#include <linux/fs.h>
+#endif
 #include <fcntl.h>
+#include <limits.h>
 #include <sys/stat.h>
 #include <unistd.h>
 
@@ -24,8 +25,15 @@
 static void copy_file_blocks(const char *src, const char *dst,
 							 pg_checksum_context *checksum_ctx);
 
+static void copy_file_clone(const char *src, const char *dst,
+							pg_checksum_context *checksum_ctx);
+
+static void copy_file_by_range(const char *src, const char *dst,
+							   pg_checksum_context *checksum_ctx);
+
 #ifdef WIN32
-static void copy_file_copyfile(const char *src, const char *dst);
+static void copy_file_copyfile(const char *src, const char *dst,
+							   pg_checksum_context *checksum_ctx);
 #endif
 
 /*
@@ -35,8 +43,13 @@ static void copy_file_copyfile(const char *src, const char *dst);
  */
 void
 copy_file(const char *src, const char *dst,
-		  pg_checksum_context *checksum_ctx, bool dry_run)
+		  pg_checksum_context *checksum_ctx, bool dry_run,
+		  CopyMethod copy_method)
 {
+	char   *strategy_name = NULL;
+	void	(*strategy_implementation) (const char *, const char *,
+										pg_checksum_context *checksum_ctx) = NULL;
+
 	/*
 	 * In dry-run mode, we don't actually copy anything, nor do we read any
 	 * data from the source file, but we do verify that we can open it.
@@ -51,61 +64,94 @@ copy_file(const char *src, const char *dst,
 			pg_fatal("could not close \"%s\": %m", src);
 	}
 
-	/*
-	 * If we don't need to compute a checksum, then we can use any special
-	 * operating system primitives that we know about to copy the file; this
-	 * may be quicker than a naive block copy.
-	 */
-	if (checksum_ctx->type == CHECKSUM_TYPE_NONE)
-	{
-		char	   *strategy_name = NULL;
-		void		(*strategy_implementation) (const char *, const char *) = NULL;
-
 #ifdef WIN32
-		strategy_name = "CopyFile";
-		strategy_implementation = copy_file_copyfile;
+	copy_method = COPY_METHOD_COPYFILE;
 #endif
 
-		if (strategy_name != NULL)
-		{
-			if (dry_run)
-				pg_log_debug("would copy \"%s\" to \"%s\" using strategy %s",
-							 src, dst, strategy_name);
-			else
-			{
-				pg_log_debug("copying \"%s\" to \"%s\" using strategy %s",
-							 src, dst, strategy_name);
-				(*strategy_implementation) (src, dst);
-			}
-			return;
-		}
+	/* Pick the copy strategy implementation, and its name for log messages. */
+	switch (copy_method)
+	{
+		case COPY_METHOD_CLONE:
+			strategy_name = "clone";
+			strategy_implementation = copy_file_clone;
+			break;
+		case COPY_METHOD_COPY:
+			/* leave NULL for simple block-by-block copy */
+			strategy_implementation = copy_file_blocks;
+			break;
+		case COPY_METHOD_COPY_FILE_RANGE:
+			strategy_name = "copy_file_range";
+			strategy_implementation = copy_file_by_range;
+			break;
+#ifdef WIN32
+		case COPY_METHOD_COPYFILE:
+			strategy_name = "CopyFile";
+			strategy_implementation = copy_file_copyfile;
+			break;
+#endif
 	}
 
-	/*
-	 * Fall back to the simple approach of reading and writing all the blocks,
-	 * feeding them into the checksum context as we go.
-	 */
 	if (dry_run)
 	{
-		if (checksum_ctx->type == CHECKSUM_TYPE_NONE)
+		if (strategy_name)
+			pg_log_debug("would copy \"%s\" to \"%s\" using strategy %s",
+						 src, dst, strategy_name);
+		else
 			pg_log_debug("would copy \"%s\" to \"%s\"",
 						 src, dst);
-		else
-			pg_log_debug("would copy \"%s\" to \"%s\" and checksum with %s",
-						 src, dst, pg_checksum_type_name(checksum_ctx->type));
 	}
 	else
 	{
-		if (checksum_ctx->type == CHECKSUM_TYPE_NONE)
+		if (strategy_name)
+			pg_log_debug("copying \"%s\" to \"%s\" using strategy %s",
+						 src, dst, strategy_name);
+		else if (checksum_ctx->type == CHECKSUM_TYPE_NONE)
 			pg_log_debug("copying \"%s\" to \"%s\"",
 						 src, dst);
 		else
 			pg_log_debug("copying \"%s\" to \"%s\" and checksumming with %s",
 						 src, dst, pg_checksum_type_name(checksum_ctx->type));
-		copy_file_blocks(src, dst, checksum_ctx);
+
+		strategy_implementation(src, dst, checksum_ctx);
 	}
 }
 
+/*
+ * Calculate checksum for src file.
+ */
+static void
+checksum_file(const char *src, pg_checksum_context *checksum_ctx)
+{
+	int			src_fd;
+	uint8	   *buffer;
+	const int	buffer_size = 50 * BLCKSZ;
+	ssize_t		rb;
+	unsigned	offset = 0;
+
+	/* bail out if no checksum needed */
+	if (checksum_ctx->type == CHECKSUM_TYPE_NONE)
+		return;
+
+	if ((src_fd = open(src, O_RDONLY | PG_BINARY, 0)) < 0)
+		pg_fatal("could not open file \"%s\": %m", src);
+
+	buffer = pg_malloc(buffer_size);
+
+	while ((rb = read(src_fd, buffer, buffer_size)) > 0)
+	{
+		if (pg_checksum_update(checksum_ctx, buffer, rb) < 0)
+			pg_fatal("could not update checksum of file \"%s\"", src);
+
+		offset += rb;
+	}
+
+	if (rb < 0)
+		pg_fatal("could not read file \"%s\": %m", src);
+
+	pg_free(buffer);
+	close(src_fd);
+}
+
 /*
  * Copy a file block by block, and optionally compute a checksum as we go.
  */
@@ -156,14 +202,100 @@
 	close(dest_fd);
 }
 
+/*
+ * copy_file_clone
+ *		Clones/reflinks a file from src to dest.
+ */
+static void
+copy_file_clone(const char *src, const char *dest,
+				pg_checksum_context *checksum_ctx)
+{
+#if defined(HAVE_COPYFILE) && defined(COPYFILE_CLONE_FORCE)
+	if (copyfile(src, dest, NULL, COPYFILE_CLONE_FORCE) < 0)
+		pg_fatal("error while cloning file \"%s\" to \"%s\": %m", src, dest);
+#elif defined(__linux__) && defined(FICLONE)
+	{
+		int			src_fd;
+		int			dest_fd;
+
+		if ((src_fd = open(src, O_RDONLY | PG_BINARY, 0)) < 0)
+			pg_fatal("could not open file \"%s\": %m", src);
+
+		if ((dest_fd = open(dest, O_RDWR | O_CREAT | O_EXCL | PG_BINARY,
+							pg_file_create_mode)) < 0)
+			pg_fatal("could not create file \"%s\": %m", dest);
+
+		if (ioctl(dest_fd, FICLONE, src_fd) < 0)
+		{
+			int			save_errno = errno;
+
+			unlink(dest);
+
+			pg_fatal("error while cloning file \"%s\" to \"%s\": %s",
+					 src, dest, strerror(save_errno));
+		}
+
+		close(src_fd);
+		close(dest_fd);
+	}
+#else
+	pg_fatal("file cloning not supported on this platform");
+#endif
+
+	/* if needed, calculate checksum of the file */
+	checksum_file(src, checksum_ctx);
+}
+
+/*
+ * copy_file_by_range
+ *		Copies a file from src to dest using copy_file_range system call.
+ */
+static void
+copy_file_by_range(const char *src, const char *dest,
+				   pg_checksum_context *checksum_ctx)
+{
+#if defined(HAVE_COPY_FILE_RANGE)
+	int			src_fd;
+	int			dest_fd;
+	ssize_t		nbytes;
+
+	if ((src_fd = open(src, O_RDONLY | PG_BINARY, 0)) < 0)
+		pg_fatal("could not open file \"%s\": %m", src);
+
+	if ((dest_fd = open(dest, O_RDWR | O_CREAT | O_EXCL | PG_BINARY,
+						pg_file_create_mode)) < 0)
+		pg_fatal("could not create file \"%s\": %m", dest);
+
+	do
+	{
+		nbytes = copy_file_range(src_fd, NULL, dest_fd, NULL, SSIZE_MAX, 0);
+		if (nbytes < 0)
+			pg_fatal("error while copying file range from \"%s\" to \"%s\": %m",
+					 src, dest);
+	} while (nbytes > 0);
+
+	close(src_fd);
+	close(dest_fd);
+#else
+	pg_fatal("copy_file_range not supported on this platform");
+#endif
+
+	/* if needed, calculate checksum of the file */
+	checksum_file(src, checksum_ctx);
+}
+
 #ifdef WIN32
 static void
-copy_file_copyfile(const char *src, const char *dst)
+copy_file_copyfile(const char *src, const char *dst,
+				   pg_checksum_context *checksum_ctx)
 {
 	if (CopyFile(src, dst, true) == 0)
 	{
 		_dosmaperr(GetLastError());
 		pg_fatal("could not copy \"%s\" to \"%s\": %m", src, dst);
 	}
+
+	/* if needed, calculate checksum of the file */
+	checksum_file(src, checksum_ctx);
 }
 #endif							/* WIN32 */
diff --git a/src/bin/pg_combinebackup/copy_file.h b/src/bin/pg_combinebackup/copy_file.h
index 0f6bc09403f..9615595d887 100644
--- a/src/bin/pg_combinebackup/copy_file.h
+++ b/src/bin/pg_combinebackup/copy_file.h
@@ -11,9 +11,25 @@
 #ifndef COPY_FILE_H
 #define COPY_FILE_H
 
+#include "c.h"
 #include "common/checksum_helper.h"
+#include "common/file_utils.h"
+
+/*
+ * Enumeration to denote copy modes
+ */
+typedef enum CopyMethod
+{
+	COPY_METHOD_CLONE,
+	COPY_METHOD_COPY,
+	COPY_METHOD_COPY_FILE_RANGE,
+#ifdef WIN32
+	COPY_METHOD_COPYFILE,
+#endif
+} CopyMethod;
 
 extern void copy_file(const char *src, const char *dst,
-					  pg_checksum_context *checksum_ctx, bool dry_run);
+					  pg_checksum_context *checksum_ctx, bool dry_run,
+					  CopyMethod copy_method);
 
 #endif							/* COPY_FILE_H */
diff --git a/src/bin/pg_combinebackup/pg_combinebackup.c b/src/bin/pg_combinebackup/pg_combinebackup.c
index 74f8be9eeac..4f3d7747ce6 100644
--- a/src/bin/pg_combinebackup/pg_combinebackup.c
+++ b/src/bin/pg_combinebackup/pg_combinebackup.c
@@ -69,6 +69,7 @@ typedef struct cb_options
 	pg_checksum_type manifest_checksums;
 	bool		no_manifest;
 	DataDirSyncMethod sync_method;
+	CopyMethod	copy_method;
 } cb_options;
 
 /*
@@ -129,6 +130,8 @@ main(int argc, char *argv[])
 		{"manifest-checksums", required_argument, NULL, 1},
 		{"no-manifest", no_argument, NULL, 2},
 		{"sync-method", required_argument, NULL, 3},
+		{"clone", no_argument, NULL, 4},
+		{"copy-file-range", no_argument, NULL, 5},
 		{NULL, 0, NULL, 0}
 	};
 
@@ -156,6 +159,7 @@ main(int argc, char *argv[])
 	memset(&opt, 0, sizeof(opt));
 	opt.manifest_checksums = CHECKSUM_TYPE_CRC32C;
 	opt.sync_method = DATA_DIR_SYNC_METHOD_FSYNC;
+	opt.copy_method = COPY_METHOD_COPY;
 
 	/* process command-line options */
 	while ((c = getopt_long(argc, argv, "dnNPo:T:",
@@ -192,6 +196,12 @@ main(int argc, char *argv[])
 				if (!parse_sync_method(optarg, &opt.sync_method))
 					exit(1);
 				break;
+			case 4:
+				opt.copy_method = COPY_METHOD_CLONE;
+				break;
+			case 5:
+				opt.copy_method = COPY_METHOD_COPY_FILE_RANGE;
+				break;
 			default:
 				/* getopt_long already emitted a complaint */
 				pg_log_error_hint("Try \"%s --help\" for more information.", progname);
@@ -213,6 +223,35 @@ main(int argc, char *argv[])
 	if (opt.no_manifest)
 		opt.manifest_checksums = CHECKSUM_TYPE_NONE;
 
+	/* Check that the platform supports the requested copy method. */
+	if (opt.copy_method == COPY_METHOD_CLONE)
+	{
+#if (defined(HAVE_COPYFILE) && defined(COPYFILE_CLONE_FORCE)) || \
+	(defined(__linux__) && defined(FICLONE))
+
+		if (opt.dry_run)
+			pg_log_debug("would use cloning to copy files");
+		else
+			pg_log_debug("will use cloning to copy files");
+
+#else
+		pg_fatal("file cloning not supported on this platform");
+#endif
+	}
+	else if (opt.copy_method == COPY_METHOD_COPY_FILE_RANGE)
+	{
+#if defined(HAVE_COPY_FILE_RANGE)
+
+		if (opt.dry_run)
+			pg_log_debug("would use copy_file_range to copy blocks");
+		else
+			pg_log_debug("will use copy_file_range to copy blocks");
+
+#else
+		pg_fatal("copy_file_range not supported on this platform");
+#endif
+	}
+
 	/* Read the server version from the final backup. */
 	version = read_pg_version_file(argv[argc - 1]);
 
@@ -696,6 +735,8 @@ help(const char *progname)
 			 "                            use algorithm for manifest checksums\n"));
 	printf(_("      --no-manifest         suppress generation of backup manifest\n"));
 	printf(_("      --sync-method=METHOD  set method for syncing files to disk\n"));
+	printf(_("      --clone               clone (reflink) instead of copying files\n"));
+	printf(_("      --copy-file-range     copy using copy_file_range() syscall\n"));
 	printf(_("  -?, --help                show this help, then exit\n"));
 
 	printf(_("\nReport bugs to <%s>.\n"), PACKAGE_BUGREPORT);
@@ -937,7 +978,8 @@ process_directory_recursively(Oid tsoid,
 											  &checksum_length,
 											  &checksum_payload,
 											  opt->debug,
-											  opt->dry_run);
+											  opt->dry_run,
+											  opt->copy_method);
 		}
 		else
 		{
@@ -993,7 +1035,8 @@ process_directory_recursively(Oid tsoid,
 
 			/* Actually copy the file. */
 			snprintf(ofullpath, MAXPGPATH, "%s/%s", ofulldir, de->d_name);
-			copy_file(ifullpath, ofullpath, &checksum_ctx, opt->dry_run);
+			copy_file(ifullpath, ofullpath, &checksum_ctx, opt->dry_run,
+					  opt->copy_method);
 
 			/*
 			 * If copy_file() performed a checksum calculation for us, then
diff --git a/src/bin/pg_combinebackup/reconstruct.c b/src/bin/pg_combinebackup/reconstruct.c
index 33c6da02a8c..57d8622b714 100644
--- a/src/bin/pg_combinebackup/reconstruct.c
+++ b/src/bin/pg_combinebackup/reconstruct.c
@@ -90,7 +90,8 @@ reconstruct_from_incremental_file(char *input_filename,
 								  int *checksum_length,
 								  uint8 **checksum_payload,
 								  bool debug,
-								  bool dry_run)
+								  bool dry_run,
+								  CopyMethod copy_method)
 {
 	rfile	  **source;
 	rfile	   *latest_source = NULL;
@@ -319,7 +320,7 @@ reconstruct_from_incremental_file(char *input_filename,
 	 */
 	if (copy_source != NULL)
 		copy_file(copy_source->filename, output_filename,
-				  &checksum_ctx, dry_run);
+				  &checksum_ctx, dry_run, copy_method);
 	else
 	{
 		write_reconstructed_file(input_filename, output_filename,
diff --git a/src/bin/pg_combinebackup/reconstruct.h b/src/bin/pg_combinebackup/reconstruct.h
index 8e33a8a95a0..532cd81052d 100644
--- a/src/bin/pg_combinebackup/reconstruct.h
+++ b/src/bin/pg_combinebackup/reconstruct.h
@@ -28,6 +28,7 @@ extern void reconstruct_from_incremental_file(char *input_filename,
 											  int *checksum_length,
 											  uint8 **checksum_payload,
 											  bool debug,
-											  bool dry_run);
+											  bool dry_run,
+											  CopyMethod copy_method);
 
 #endif
-- 
2.44.0

From ab5b31d36136b417eee6412e26f6f912b5c00443 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <to...@2ndquadrant.com>
Date: Sat, 23 Mar 2024 18:26:21 +0100
Subject: [PATCH v20240403 3/6] Use copy_file_range in write_reconstructed_file

When reconstructing files from incremental backups, we can use
copy_file_range to copy the block data, instead of the regular method
that simply reads/writes the data.

To calculate checksums of the reconstructed files, this uses the
same strategy as when copying whole files - the data is copied using
copy_file_range, but the data is also read for the checksum and then
discarded. The difference is that while for cloning whole files the
checksum calculation is usually not needed, when reconstructing the
files it needs to happen almost always (the only exception is when
the user specified --no-manifest).

Depending on the filesystem, this may be more expensive than the simple
copy method, but it maintains the CoW benefits. If this overhead is not
acceptable, the right choice is to not use the option.
---
 src/bin/pg_combinebackup/reconstruct.c | 123 +++++++++++++++++++------
 1 file changed, 94 insertions(+), 29 deletions(-)

diff --git a/src/bin/pg_combinebackup/reconstruct.c b/src/bin/pg_combinebackup/reconstruct.c
index 57d8622b714..c21b4ed09ab 100644
--- a/src/bin/pg_combinebackup/reconstruct.c
+++ b/src/bin/pg_combinebackup/reconstruct.c
@@ -59,8 +59,13 @@ static void write_reconstructed_file(char *input_filename,
 									 off_t *offsetmap,
 									 pg_checksum_context *checksum_ctx,
 									 bool debug,
-									 bool dry_run);
+									 bool dry_run,
+									 CopyMethod copy_method);
 static void read_bytes(rfile *rf, void *buffer, unsigned length);
+static void write_block(int wfd, char *output_filename,
+						uint8 *buffer,
+						pg_checksum_context *checksum_ctx);
+static void read_block(rfile *s, off_t off, uint8 *buffer);
 
 /*
  * Reconstruct a full file from an incremental file and a chain of prior
@@ -325,7 +330,8 @@ reconstruct_from_incremental_file(char *input_filename,
 	{
 		write_reconstructed_file(input_filename, output_filename,
 								 block_length, sourcemap, offsetmap,
-								 &checksum_ctx, debug, dry_run);
+								 &checksum_ctx, debug, dry_run,
+								 copy_method);
 		debug_reconstruction(n_prior_backups + 1, source, dry_run);
 	}
 
@@ -536,7 +542,8 @@ write_reconstructed_file(char *input_filename,
 						 off_t *offsetmap,
 						 pg_checksum_context *checksum_ctx,
 						 bool debug,
-						 bool dry_run)
+						 bool dry_run,
+						 CopyMethod copy_method)
 {
 	int			wfd = -1;
 	unsigned	i;
@@ -646,38 +653,57 @@ write_reconstructed_file(char *input_filename,
 			 * uninitialized block, so just zero-fill it.
 			 */
 			memset(buffer, 0, BLCKSZ);
-		}
-		else
-		{
-			int			rb;
 
-			/* Read the block from the correct source, except if dry-run. */
-			rb = pg_pread(s->fd, buffer, BLCKSZ, offsetmap[i]);
-			if (rb != BLCKSZ)
-			{
-				if (rb < 0)
-					pg_fatal("could not read file \"%s\": %m", s->filename);
-				else
-					pg_fatal("could not read file \"%s\": read only %d of %d bytes at offset %llu",
-							 s->filename, rb, BLCKSZ,
-							 (unsigned long long) offsetmap[i]);
-			}
+			/* Write out the block, update checksum if needed. */
+			write_block(wfd, output_filename, buffer, checksum_ctx);
+
+			continue;
 		}
 
-		/* Write out the block. */
-		if ((wb = write(wfd, buffer, BLCKSZ)) != BLCKSZ)
+		/* copy the block using either read/write or copy_file_range */
+		if (copy_method != COPY_METHOD_COPY_FILE_RANGE)
 		{
-			if (wb < 0)
-				pg_fatal("could not write file \"%s\": %m", output_filename);
-			else
-				pg_fatal("could not write file \"%s\": wrote only %d of %d bytes",
-						 output_filename, wb, BLCKSZ);
+			/*
+			 * Read the block from the correct source file, and then write
+			 * it out, possibly with checksum update.
+			 */
+			read_block(s, offsetmap[i], buffer);
+			write_block(wfd, output_filename, buffer, checksum_ctx);
 		}
+		else	/* use copy_file_range */
+		{
+			/* copy_file_range modifies the passed offset, so make a copy */
+			off_t	off = offsetmap[i];
+			size_t	nwritten = 0;
+
+			/*
+			 * Retry until we've written all the bytes (the offset is updated
+			 * by copy_file_range, and so is the wfd file offset).
+			 */
+			do
+			{
+				wb = copy_file_range(s->fd, &off, wfd, NULL, BLCKSZ - nwritten, 0);
+
+				if (wb < 0)
+					pg_fatal("error while copying file range from \"%s\" to \"%s\": %m",
+							 input_filename, output_filename);
+
+				nwritten += wb;
+
+			} while (BLCKSZ > nwritten);
+
+			/* when checksum calculation not needed, we're done */
+			if (checksum_ctx->type == CHECKSUM_TYPE_NONE)
+				continue;
 
-		/* Update the checksum computation. */
-		if (pg_checksum_update(checksum_ctx, buffer, BLCKSZ) < 0)
-			pg_fatal("could not update checksum of file \"%s\"",
-					 output_filename);
+			/* read the block and update the checksum */
+			read_block(s, offsetmap[i], buffer);
+
+			/* Update the checksum computation. */
+			if (pg_checksum_update(checksum_ctx, buffer, BLCKSZ) < 0)
+				pg_fatal("could not update checksum of file \"%s\"",
+						 output_filename);
+		}
 	}
 
 	/* Debugging output. */
@@ -693,3 +719,42 @@ write_reconstructed_file(char *input_filename,
 	if (wfd >= 0 && close(wfd) != 0)
 		pg_fatal("could not close \"%s\": %m", output_filename);
 }
+
+static void
+write_block(int fd, char *output_filename,
+			uint8 *buffer, pg_checksum_context *checksum_ctx)
+{
+	int	wb;
+
+	if ((wb = write(fd, buffer, BLCKSZ)) != BLCKSZ)
+	{
+		if (wb < 0)
+			pg_fatal("could not write file \"%s\": %m", output_filename);
+		else
+			pg_fatal("could not write file \"%s\": wrote only %d of %d bytes",
+					 output_filename, wb, BLCKSZ);
+	}
+
+	/* Update the checksum computation. */
+	if (pg_checksum_update(checksum_ctx, buffer, BLCKSZ) < 0)
+		pg_fatal("could not update checksum of file \"%s\"",
+				 output_filename);
+}
+
+static void
+read_block(rfile *s, off_t off, uint8 *buffer)
+{
+	int			rb;
+
+	/* Read the block from the correct source, except if dry-run. */
+	rb = pg_pread(s->fd, buffer, BLCKSZ, off);
+	if (rb != BLCKSZ)
+	{
+		if (rb < 0)
+			pg_fatal("could not read file \"%s\": %m", s->filename);
+		else
+			pg_fatal("could not read file \"%s\": read only %d of %d bytes at offset %llu",
+					 s->filename, rb, BLCKSZ,
+					 (unsigned long long) off);
+	}
+}
-- 
2.44.0

From fae69832c727913c12e930cd40898e11a2475bc4 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <to...@2ndquadrant.com>
Date: Sun, 31 Mar 2024 22:39:19 +0200
Subject: [PATCH v20240403 4/6] Prefetch blocks read by
 write_reconstructed_file

When write_reconstructed_file() reconstructs a file from incremental
backups, it reads data from multiple files. In most cases the access
is fairly sequential and the kernel read-ahead does a good job, but
the pattern may easily be too random due to skipping many blocks in
each backup. Also, read-ahead may not work that well on some
filesystems (e.g. ZFS).

This adds a --prefetch option to force explicit prefetching. When
used, the system prefetches the next 128 blocks (1MB). This value
is mostly arbitrary, but not entirely - it needs to be high enough
to cover the 1MB chunks used in one of the following patches.

XXX It would be possible to allow specifying a custom value, e.g.
"--prefetch N" where N would be 0-1024 or something like that. But
that seems unnecessary for now, so perhaps a future improvement.
---
 doc/src/sgml/ref/pg_combinebackup.sgml      |  11 ++
 src/bin/pg_combinebackup/pg_combinebackup.c |  11 +-
 src/bin/pg_combinebackup/reconstruct.c      | 138 +++++++++++++++++++-
 src/bin/pg_combinebackup/reconstruct.h      |   3 +-
 4 files changed, 157 insertions(+), 6 deletions(-)

diff --git a/doc/src/sgml/ref/pg_combinebackup.sgml b/doc/src/sgml/ref/pg_combinebackup.sgml
index 70310c31985..491747f5bbb 100644
--- a/doc/src/sgml/ref/pg_combinebackup.sgml
+++ b/doc/src/sgml/ref/pg_combinebackup.sgml
@@ -231,6 +231,17 @@ PostgreSQL documentation
       </listitem>
      </varlistentry>
 
+     <varlistentry>
+      <term><option>--prefetch</option></term>
+      <listitem>
+       <para>
+        Enable asynchronous prefetching of blocks. If this option is not
+        specified, it's up to the operating system to prefetch data based
+        on access pattern.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry>
        <term><option>-V</option></term>
        <term><option>--version</option></term>
diff --git a/src/bin/pg_combinebackup/pg_combinebackup.c b/src/bin/pg_combinebackup/pg_combinebackup.c
index 4f3d7747ce6..9d7c6cc0c75 100644
--- a/src/bin/pg_combinebackup/pg_combinebackup.c
+++ b/src/bin/pg_combinebackup/pg_combinebackup.c
@@ -35,6 +35,9 @@
 #define INCREMENTAL_PREFIX			"INCREMENTAL."
 #define INCREMENTAL_PREFIX_LENGTH	(sizeof(INCREMENTAL_PREFIX) - 1)
 
+/* Default prefetch target 1MB (for 8K blocks). */
+#define	PREFETCH_TARGET	128
+
 /*
  * Tracking for directories that need to be removed, or have their contents
  * removed, if the operation fails.
@@ -68,6 +71,7 @@ typedef struct cb_options
 	cb_tablespace_mapping *tsmappings;
 	pg_checksum_type manifest_checksums;
 	bool		no_manifest;
+	int			prefetch_target;
 	DataDirSyncMethod sync_method;
 	CopyMethod	copy_method;
 } cb_options;
@@ -132,6 +136,7 @@ main(int argc, char *argv[])
 		{"sync-method", required_argument, NULL, 3},
 		{"clone", no_argument, NULL, 4},
 		{"copy-file-range", no_argument, NULL, 5},
+		{"prefetch", no_argument, NULL, 6},
 		{NULL, 0, NULL, 0}
 	};
 
@@ -202,6 +207,9 @@ main(int argc, char *argv[])
 			case 5:
 				opt.copy_method = COPY_METHOD_COPY_FILE_RANGE;
 				break;
+			case 6:
+				opt.prefetch_target = PREFETCH_TARGET;
+				break;
 			default:
 				/* getopt_long already emitted a complaint */
 				pg_log_error_hint("Try \"%s --help\" for more information.", progname);
@@ -979,7 +987,8 @@ process_directory_recursively(Oid tsoid,
 											  &checksum_payload,
 											  opt->debug,
 											  opt->dry_run,
-											  opt->copy_method);
+											  opt->copy_method,
+											  opt->prefetch_target);
 		}
 		else
 		{
diff --git a/src/bin/pg_combinebackup/reconstruct.c b/src/bin/pg_combinebackup/reconstruct.c
index c21b4ed09ab..d63a060e473 100644
--- a/src/bin/pg_combinebackup/reconstruct.c
+++ b/src/bin/pg_combinebackup/reconstruct.c
@@ -60,13 +60,35 @@ static void write_reconstructed_file(char *input_filename,
 									 pg_checksum_context *checksum_ctx,
 									 bool debug,
 									 bool dry_run,
-									 CopyMethod copy_method);
+									 CopyMethod copy_method,
+									 int prefetch_target);
 static void read_bytes(rfile *rf, void *buffer, unsigned length);
 static void write_block(int wfd, char *output_filename,
 						uint8 *buffer,
 						pg_checksum_context *checksum_ctx);
 static void read_block(rfile *s, off_t off, uint8 *buffer);
 
+/*
+ * state of the asynchronous prefetcher
+ */
+typedef struct prefetch_state
+{
+	unsigned	next_block;			/* block to prefetch next */
+	int			prefetch_distance;	/* current distance (<= target) */
+	int			prefetch_target;	/* target distance */
+
+	/* prefetch statistics - number of prefetched blocks */
+	unsigned	prefetch_blocks;
+} prefetch_state;
+
+static void prefetch_init(prefetch_state *state,
+						  int prefetch_target);
+static void prefetch_blocks(prefetch_state *state,
+							unsigned block,
+							unsigned block_length,
+							rfile **sourcemap,
+							off_t *offsetmap);
+
 /*
  * Reconstruct a full file from an incremental file and a chain of prior
  * backups.
@@ -96,7 +118,8 @@ reconstruct_from_incremental_file(char *input_filename,
 								  uint8 **checksum_payload,
 								  bool debug,
 								  bool dry_run,
-								  CopyMethod copy_method)
+								  CopyMethod copy_method,
+								  int prefetch_target)
 {
 	rfile	  **source;
 	rfile	   *latest_source = NULL;
@@ -331,7 +354,7 @@ reconstruct_from_incremental_file(char *input_filename,
 		write_reconstructed_file(input_filename, output_filename,
 								 block_length, sourcemap, offsetmap,
 								 &checksum_ctx, debug, dry_run,
-								 copy_method);
+								 copy_method, prefetch_target);
 		debug_reconstruction(n_prior_backups + 1, source, dry_run);
 	}
 
@@ -543,11 +566,16 @@ write_reconstructed_file(char *input_filename,
 						 pg_checksum_context *checksum_ctx,
 						 bool debug,
 						 bool dry_run,
-						 CopyMethod copy_method)
+						 CopyMethod copy_method,
+						 int prefetch_target)
 {
 	int			wfd = -1;
 	unsigned	i;
 	unsigned	zero_blocks = 0;
+	prefetch_state	prefetch;
+
+	/* initialize the block prefetcher */
+	prefetch_init(&prefetch, prefetch_target);
 
 	/* Debugging output. */
 	if (debug)
@@ -645,6 +673,9 @@ write_reconstructed_file(char *input_filename,
 		if (dry_run)
 			continue;
 
+		/* do prefetching if enabled */
+		prefetch_blocks(&prefetch, i, block_length, sourcemap, offsetmap);
+
 		/* Read or zero-fill the block as appropriate. */
 		if (s == NULL)
 		{
@@ -715,6 +746,14 @@ write_reconstructed_file(char *input_filename,
 			pg_log_debug("zero-filled %u blocks", zero_blocks);
 	}
 
+	if (prefetch.prefetch_blocks > 0)
+	{
+		/* print how many blocks we prefetched / skipped */
+		pg_log_debug("prefetch blocks %u (%u skipped)",
+					 prefetch.prefetch_blocks,
+					 (block_length - prefetch.prefetch_blocks));
+	}
+
 	/* Close the output file. */
 	if (wfd >= 0 && close(wfd) != 0)
 		pg_fatal("could not close \"%s\": %m", output_filename);
@@ -758,3 +797,94 @@ read_block(rfile *s, off_t off, uint8 *buffer)
 					 (unsigned long long) off);
 	}
 }
+
+/*
+ * prefetch_init
+ *		Initializes state of the prefetcher.
+ *
+ * Initialize state of the prefetcher, to start with the first block and maximum
+ * prefetch distance (prefetch_target=0 means prefetching disabled). The actual
+ * prefetch distance will gradually increase, until it reaches the target.
+ */
+static void
+prefetch_init(prefetch_state *state, int prefetch_target)
+{
+	Assert(prefetch_target >= 0);
+
+	state->next_block = 0;
+	state->prefetch_distance = 0;
+	state->prefetch_target = prefetch_target;
+	state->prefetch_blocks = 0;
+}
+
+/*
+ * prefetch_blocks
+ *		Perform asynchronous prefetching of blocks to be reconstructed next.
+ *
+ * Initiates asynchronous prefetch of to be reconstructed blocks.
+ *
+ * current_block - The block to be reconstructed in the current loop, right
+ * after the prefetching. This means we're potentially prefetching blocks
+ * in the range [current_block+1, current_block+current_dinstance], with
+ * both values inclusive.
+ *
+ * block_length - Number of blocks to reconstruct, also length of sourcemap
+ * and offsetmap arrays.
+ */
+static void
+prefetch_blocks(prefetch_state *state, unsigned current_block,
+				unsigned block_length, rfile **sourcemap, off_t* offsetmap)
+{
+#ifdef USE_PREFETCH
+	unsigned	max_block;
+
+	/* bail out if prefetching not enabled */
+	if (state->prefetch_target == 0)
+		return;
+
+	/* bail out if we've already prefetched the last block */
+	if (state->next_block == block_length)
+		return;
+
+	/* gradually increase prefetch distance until the target */
+	state->prefetch_distance = Min(state->prefetch_distance + 1,
+								   state->prefetch_target);
+
+	/*
+	 * Where should we start prefetching? We don't want to prefetch blocks
+	 * that we've already prefetched, that's pointless. And we also don't
+	 * want to prefetch the block we're just about to read. This can't
+	 * overflow because we know (current_block < block_length).
+	 */
+	state->next_block = Max(state->next_block, current_block + 1);
+
+	/*
+	 * How far to prefetch? Calculate the first block to not prefetch, i.e.
+	 * right after [current_block + current_distance]. It's possible the
+	 * second part overflows and wraps around, but in that case we just
+	 * don't prefetch a couple pages at the end.
+	 *
+	 * XXX But this also shouldn't be possible, thanks to the check of
+	 * (next_block == block_length) at the very beginning.
+	 */
+	max_block = Min(block_length, current_block + state->prefetch_distance + 1);
+
+	while (state->next_block < max_block)
+	{
+		rfile  *f = sourcemap[state->next_block];
+		off_t	off = offsetmap[state->next_block];
+
+		state->next_block++;
+
+		if (f == NULL)
+			continue;
+
+		state->prefetch_blocks += 1;
+
+		/* We ignore errors because this is only a hint.*/
+		(void) posix_fadvise(f->fd, off, BLCKSZ, POSIX_FADV_WILLNEED);
+	}
+
+	Assert(state->next_block <= block_length);
+#endif
+}
diff --git a/src/bin/pg_combinebackup/reconstruct.h b/src/bin/pg_combinebackup/reconstruct.h
index 532cd81052d..d71a80dfed1 100644
--- a/src/bin/pg_combinebackup/reconstruct.h
+++ b/src/bin/pg_combinebackup/reconstruct.h
@@ -29,6 +29,7 @@ extern void reconstruct_from_incremental_file(char *input_filename,
 											  uint8 **checksum_payload,
 											  bool debug,
 											  bool dry_run,
-											  CopyMethod copy_method);
+											  CopyMethod copy_method,
+											  int prefetch_target);
 
 #endif
-- 
2.44.0

From 12664b2d25f8e92c1da8525975b46541e7d1ceda Mon Sep 17 00:00:00 2001
From: Tomas Vondra <to...@2ndquadrant.com>
Date: Mon, 1 Apr 2024 00:37:59 +0200
Subject: [PATCH v20240403 5/6] Try copying larger chunks of data from the same
 file

When reconstructing a file from incremental backups, try copying data
in larger chunks, not individual blocks, to reduce overhead. This
applies to all copy methods, including copy_file_range() - or rather
especially that, as the overhead may be particularly significant.

This is implemented by looking for runs of up to 128 blocks (1MB) to
be copied from the same source file, and processing them at once.

This commit only applies this to copy_file_range; the read/write copy
and checksum calculation are still done block-by-block.

Try copying larger chunks of data from the same file (cleanup)

This is primarily a cleanup/simplification of the previous commit,
to make the code cleaner and easier to understand.

It also extends the batching to the regular read/write calls, and
checksum calculation (it only worked for copy_file_range before).
---
 src/bin/pg_combinebackup/reconstruct.c | 91 ++++++++++++++++----------
 1 file changed, 57 insertions(+), 34 deletions(-)

diff --git a/src/bin/pg_combinebackup/reconstruct.c b/src/bin/pg_combinebackup/reconstruct.c
index d63a060e473..fe2e60d244a 100644
--- a/src/bin/pg_combinebackup/reconstruct.c
+++ b/src/bin/pg_combinebackup/reconstruct.c
@@ -63,10 +63,10 @@ static void write_reconstructed_file(char *input_filename,
 									 CopyMethod copy_method,
 									 int prefetch_target);
 static void read_bytes(rfile *rf, void *buffer, unsigned length);
-static void write_block(int wfd, char *output_filename,
-						uint8 *buffer,
-						pg_checksum_context *checksum_ctx);
-static void read_block(rfile *s, off_t off, uint8 *buffer);
+static void write_blocks(int wfd, char *output_filename,
+						 uint8 *buffer, int nblocks,
+						 pg_checksum_context *checksum_ctx);
+static void read_blocks(rfile *s, off_t off, uint8 *buffer, int nblocks);
 
 /*
  * state of the asynchronous prefetcher
@@ -570,7 +570,7 @@ write_reconstructed_file(char *input_filename,
 						 int prefetch_target)
 {
 	int			wfd = -1;
-	unsigned	i;
+	unsigned	next_idx;
 	unsigned	zero_blocks = 0;
 	prefetch_state	prefetch;
 
@@ -653,20 +653,43 @@ write_reconstructed_file(char *input_filename,
 		pg_fatal("could not open file \"%s\": %m", output_filename);
 
 	/* Read and write the blocks as required. */
-	for (i = 0; i < block_length; ++i)
+	next_idx = 0;
+	while (next_idx < block_length)
 	{
-		uint8		buffer[BLCKSZ];
-		rfile	   *s = sourcemap[i];
+#define	BLOCK_COUNT(first, last)	((last) - (first) + 1)
+#define	BATCH_SIZE		128			/* 1MB */
+		uint8		buffer[BATCH_SIZE * BLCKSZ];
+		int			first_idx = next_idx;
+		int			last_idx = next_idx;
+		rfile	   *s = sourcemap[first_idx];
 		int			wb;
+		int			nblocks;
+
+		/*
+		 * Determine the range of blocks coming from the same source file,
+		 * but not more than BATCH_SIZE blocks (1MB) at a time. The range starts
+		 * at first_idx, ends with last_idx (both are inclusive).
+		 */
+		while ((last_idx + 1 < block_length) &&			/* valid block */
+			   (sourcemap[last_idx+1] == s) &&			/* same file */
+			   (BLOCK_COUNT(first_idx, last_idx) < BATCH_SIZE))	/* 1MB */
+			last_idx += 1;
+
+		/* Calculate batch size, set start of the next loop. */
+		nblocks = BLOCK_COUNT(first_idx, last_idx);
+		next_idx += nblocks;
+
+		Assert(nblocks <= BATCH_SIZE);
+		Assert(next_idx == (last_idx + 1));
 
 		/* Update accounting information. */
 		if (s == NULL)
-			++zero_blocks;
+			zero_blocks += nblocks;
 		else
 		{
-			s->num_blocks_read++;
+			s->num_blocks_read += nblocks;
 			s->highest_offset_read = Max(s->highest_offset_read,
-										 offsetmap[i] + BLCKSZ);
+										 offsetmap[last_idx] + BLCKSZ);
 		}
 
 		/* Skip the rest of this in dry-run mode. */
@@ -674,7 +697,7 @@ write_reconstructed_file(char *input_filename,
 			continue;
 
 		/* do prefetching if enabled */
-		prefetch_blocks(&prefetch, i, block_length, sourcemap, offsetmap);
+		prefetch_blocks(&prefetch, last_idx, block_length, sourcemap, offsetmap);
 
 		/* Read or zero-fill the block as appropriate. */
 		if (s == NULL)
@@ -685,26 +708,26 @@ write_reconstructed_file(char *input_filename,
 			 */
 			memset(buffer, 0, BLCKSZ);
 
-			/* Write out the block, update checksum if needed. */
-			write_block(wfd, output_filename, buffer, checksum_ctx);
+			/* Write out the block(s), update checksum if needed. */
+			write_blocks(wfd, output_filename, buffer, nblocks, checksum_ctx);
 
 			continue;
 		}
 
-		/* copy the block using either read/write or copy_file_range */
+		/* now copy the blocks using either read/write or copy_file_range */
 		if (copy_method != COPY_METHOD_COPY_FILE_RANGE)
 		{
 			/*
-			 * Read the block from the correct source file, and then write
-			 * it out, possibly with checksum update.
+			 * Read the batch of blocks from the correct source file, and
+			 * then write them out, possibly with checksum update.
 			 */
-			read_block(s, offsetmap[i], buffer);
-			write_block(wfd, output_filename, buffer, checksum_ctx);
+			read_blocks(s, offsetmap[first_idx], buffer, nblocks);
+			write_blocks(wfd, output_filename, buffer, nblocks, checksum_ctx);
 		}
 		else	/* use copy_file_range */
 		{
 			/* copy_file_range modifies the passed offset, so make a copy */
-			off_t	off = offsetmap[i];
+			off_t	off = offsetmap[first_idx];
 			size_t	nwritten = 0;
 
 			/*
@@ -713,7 +736,7 @@ write_reconstructed_file(char *input_filename,
 			 */
 			do
 			{
-				wb = copy_file_range(s->fd, &off, wfd, NULL, BLCKSZ - nwritten, 0);
+				wb = copy_file_range(s->fd, &off, wfd, NULL, (BLCKSZ * nblocks) - nwritten, 0);
 
 				if (wb < 0)
 					pg_fatal("error while copying file range from \"%s\" to \"%s\": %m",
@@ -721,17 +744,17 @@ write_reconstructed_file(char *input_filename,
 
 				nwritten += wb;
 
-			} while (BLCKSZ > nwritten);
+			} while ((nblocks * BLCKSZ) > nwritten);
 
 			/* when checksum calculation not needed, we're done */
 			if (checksum_ctx->type == CHECKSUM_TYPE_NONE)
 				continue;
 
-			/* read the block and update the checksum */
-			read_block(s, offsetmap[i], buffer);
+			/* read the block(s) and update the checksum */
+			read_blocks(s, offsetmap[first_idx], buffer, nblocks);
 
 			/* Update the checksum computation. */
-			if (pg_checksum_update(checksum_ctx, buffer, BLCKSZ) < 0)
+			if (pg_checksum_update(checksum_ctx, buffer, (nblocks * BLCKSZ)) < 0)
 				pg_fatal("could not update checksum of file \"%s\"",
 						 output_filename);
 		}
@@ -760,40 +783,40 @@ write_reconstructed_file(char *input_filename,
 }
 
 static void
-write_block(int fd, char *output_filename,
-			uint8 *buffer, pg_checksum_context *checksum_ctx)
+write_blocks(int fd, char *output_filename,
+			 uint8 *buffer, int nblocks, pg_checksum_context *checksum_ctx)
 {
 	int	wb;
 
-	if ((wb = write(fd, buffer, BLCKSZ)) != BLCKSZ)
+	if ((wb = write(fd, buffer, nblocks * BLCKSZ)) != (nblocks * BLCKSZ))
 	{
 		if (wb < 0)
 			pg_fatal("could not write file \"%s\": %m", output_filename);
 		else
 			pg_fatal("could not write file \"%s\": wrote only %d of %d bytes",
-					 output_filename, wb, BLCKSZ);
+					 output_filename, wb, (nblocks * BLCKSZ));
 	}
 
 	/* Update the checksum computation. */
-	if (pg_checksum_update(checksum_ctx, buffer, BLCKSZ) < 0)
+	if (pg_checksum_update(checksum_ctx, buffer, (nblocks * BLCKSZ)) < 0)
 		pg_fatal("could not update checksum of file \"%s\"",
 				 output_filename);
 }
 
 static void
-read_block(rfile *s, off_t off, uint8 *buffer)
+read_blocks(rfile *s, off_t off, uint8 *buffer, int nblocks)
 {
 	int			rb;
 
 	/* Read the block from the correct source, except if dry-run. */
-	rb = pg_pread(s->fd, buffer, BLCKSZ, off);
-	if (rb != BLCKSZ)
+	rb = pg_pread(s->fd, buffer, (nblocks * BLCKSZ), off);
+	if (rb != (nblocks * BLCKSZ))
 	{
 		if (rb < 0)
 			pg_fatal("could not read file \"%s\": %m", s->filename);
 		else
 			pg_fatal("could not read file \"%s\": read only %d of %d bytes at offset %llu",
-					 s->filename, rb, BLCKSZ,
+					 s->filename, rb, (nblocks * BLCKSZ),
 					 (unsigned long long) off);
 	}
 }
-- 
2.44.0

From 9dc68c22ad7f2933c66810a03e6d7ad3b9848a86 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <to...@2ndquadrant.com>
Date: Tue, 2 Apr 2024 17:50:58 +0200
Subject: [PATCH v20240403 6/6] Try prefetching larger chunks of data from the
 same file

Similarly to performing copy/read/write in larger chunks, it may be
beneficial to do the same for posix_fadvise.

There's a caveat - this depends on seeing sufficiently large runs
of blocks in the prefetch method, which depends on the copy batch.
Once the distance stops growing, the prefetch window will be the same
as the copy batch. If we copy 8 blocks, the prefetch will also see
8 new blocks. If we copy 128 blocks, prefetch will also see 128 new
blocks. And so on.
---
 src/bin/pg_combinebackup/reconstruct.c | 38 ++++++++++++++++++++------
 1 file changed, 29 insertions(+), 9 deletions(-)

diff --git a/src/bin/pg_combinebackup/reconstruct.c b/src/bin/pg_combinebackup/reconstruct.c
index fe2e60d244a..33f3baa25fd 100644
--- a/src/bin/pg_combinebackup/reconstruct.c
+++ b/src/bin/pg_combinebackup/reconstruct.c
@@ -77,7 +77,8 @@ typedef struct prefetch_state
 	int			prefetch_distance;	/* current distance (<= target) */
 	int			prefetch_target;	/* target distance */
 
-	/* prefetch statistics - number of prefetched blocks */
+	/* prefetch statistics - number of requests and prefetched blocks */
+	unsigned	prefetch_count;
 	unsigned	prefetch_blocks;
 } prefetch_state;
 
@@ -772,8 +773,8 @@ write_reconstructed_file(char *input_filename,
 	if (prefetch.prefetch_blocks > 0)
 	{
 		/* print how many blocks we prefetched / skipped */
-		pg_log_debug("prefetch blocks %u (%u skipped)",
-					 prefetch.prefetch_blocks,
+		pg_log_debug("prefetch requests %u blocks %u (%u skipped)",
+					 prefetch.prefetch_count, prefetch.prefetch_blocks,
 					 (block_length - prefetch.prefetch_blocks));
 	}
 
@@ -835,8 +836,13 @@ prefetch_init(prefetch_state *state, int prefetch_target)
 	Assert(prefetch_target >= 0);
 
 	state->next_block = 0;
-	state->prefetch_distance = 0;
+
+	/* XXX Disables the gradual ramp-up, but we're reading data in batches and
+	 * we probably need to cover the whole next batch at once. */
+	state->prefetch_distance = prefetch_target;
 	state->prefetch_target = prefetch_target;
+
+	state->prefetch_count = 0;
 	state->prefetch_blocks = 0;
 }
 
@@ -859,6 +865,7 @@ prefetch_blocks(prefetch_state *state, unsigned current_block,
 				unsigned block_length, rfile **sourcemap, off_t* offsetmap)
 {
 #ifdef USE_PREFETCH
+	/* end of prefetch range (first block to not prefetch) */
 	unsigned	max_block;
 
 	/* bail out if prefetching not enabled */
@@ -894,18 +901,31 @@ prefetch_blocks(prefetch_state *state, unsigned current_block,
 
 	while (state->next_block < max_block)
 	{
-		rfile  *f = sourcemap[state->next_block];
-		off_t	off = offsetmap[state->next_block];
+		/* range to prefetch in this round */
+		int			nblocks = 0;
+		unsigned	block = state->next_block;
+
+		rfile  *f = sourcemap[block];
+		off_t	off = offsetmap[block];
+
+		/* find the last block in this prefetch range */
+		while ((block + nblocks < max_block) &&
+			   (sourcemap[block + nblocks] == f))
+			nblocks++;
+
+		Assert(nblocks <= state->prefetch_distance);
 
-		state->next_block++;
+		/* remember how far we prefetched, even for f=NULL */
+		state->next_block = block + nblocks;
 
 		if (f == NULL)
 			continue;
 
-		state->prefetch_blocks += 1;
+		state->prefetch_blocks += nblocks;
+		state->prefetch_count += 1;
 
 		/* We ignore errors because this is only a hint.*/
-		(void) posix_fadvise(f->fd, off, BLCKSZ, POSIX_FADV_WILLNEED);
+		(void) posix_fadvise(f->fd, off, (nblocks * BLCKSZ), POSIX_FADV_WILLNEED);
 	}
 
 	Assert(state->next_block <= block_length);
-- 
2.44.0
