pg_upgrade --copy-file-range

2023-06-02 Thread Thomas Munro
Hello,

I was just in a pg_upgrade unconference session at PGCon where the
lack of $SUBJECT came up.  This system call gives the kernel the
option to use fast block cloning on XFS, ZFS (as of very recently),
etc, and works on Linux and FreeBSD.  It's probably much the same as
--clone mode on COW file systems, except that is Linux-only.  On
overwrite file systems (ie not copy-on-write, like ext4), it may also
be able to push copies down to storage hardware/network file systems.

There was something like this in the nearby large files patch set, but
in that version it just magically did it when available in --copy
mode.  Now I think the user should have to have to opt in with
--copy-file-range, and simply to error out if it fails.  It may not
work in some cases -- for example, the man page says that older Linux
systems can fail with EXDEV when you try to copy across file systems,
while newer systems will do something less efficient but still
sensible internally; also I saw a claim that some older versions had
weird bugs.  Better to just expose the raw functionality and let users
say when they want it and read the error if it fail, I think.
From 571e68a2948c5bff9fa1d66f382c859fc6606829 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Fri, 2 Jun 2023 13:35:54 -0400
Subject: [PATCH] Add --copy-file-range option to pg_upgrade.

The copy_file_range() system call is available on at least Linux and
FreeBSD, and asks the kernel to use efficient ways to copy ranges of a
file.  Options available to the kernel include sharing block ranges
(similar to --clone mode), and pushing down block copies to the storage
layer.
---
 configure  |  2 +-
 configure.ac   |  1 +
 doc/src/sgml/ref/pgupgrade.sgml| 13 +
 meson.build|  1 +
 src/bin/pg_upgrade/check.c |  1 +
 src/bin/pg_upgrade/file.c  | 43 ++
 src/bin/pg_upgrade/option.c| 10 +++
 src/bin/pg_upgrade/pg_upgrade.h|  3 +++
 src/bin/pg_upgrade/relfilenumber.c |  8 ++
 src/include/pg_config.h.in |  3 +++
 src/tools/msvc/Solution.pm |  1 +
 11 files changed, 85 insertions(+), 1 deletion(-)

diff --git a/configure b/configure
index 1b415142d1..a620e049fa 100755
--- a/configure
+++ b/configure
@@ -15700,7 +15700,7 @@ fi
 LIBS_including_readline="$LIBS"
 LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'`
 
-for ac_func in backtrace_symbols copyfile getifaddrs getpeerucred inet_pton kqueue mbstowcs_l memset_s posix_fallocate ppoll pthread_is_threaded_np setproctitle setproctitle_fast strchrnul strsignal syncfs sync_file_range uselocale wcstombs_l
+for ac_func in backtrace_symbols copyfile copy_file_range getifaddrs getpeerucred inet_pton kqueue mbstowcs_l memset_s posix_fallocate ppoll pthread_is_threaded_np setproctitle setproctitle_fast strchrnul strsignal syncfs sync_file_range uselocale wcstombs_l
 do :
   as_ac_var=`$as_echo "ac_cv_func_$ac_func" | $as_tr_sh`
 ac_fn_c_check_func "$LINENO" "$ac_func" "$as_ac_var"
diff --git a/configure.ac b/configure.ac
index 09558ada0f..69b9256037 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1794,6 +1794,7 @@ LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'`
 AC_CHECK_FUNCS(m4_normalize([
 	backtrace_symbols
 	copyfile
+	copy_file_range
 	getifaddrs
 	getpeerucred
 	inet_pton
diff --git a/doc/src/sgml/ref/pgupgrade.sgml b/doc/src/sgml/ref/pgupgrade.sgml
index 7816b4c685..9180513307 100644
--- a/doc/src/sgml/ref/pgupgrade.sgml
+++ b/doc/src/sgml/ref/pgupgrade.sgml
@@ -240,6 +240,19 @@ PostgreSQL documentation
   
  
 
+ 
+  --copy-file-range
+  
+   
+Use the copy_file_range system call for efficient
+copying.  On some file systems this gives results similar to
+--clone, sharing physical disk blocks, while on others
+it may still copy blocks, but do so via an optimized path.  At present,
+it is supported on Linux and FreeBSD.
+   
+  
+ 
+
  
   -?
   --help
diff --git a/meson.build b/meson.build
index 16b2e86646..322d8f822d 100644
--- a/meson.build
+++ b/meson.build
@@ -2404,6 +2404,7 @@ func_checks = [
   ['backtrace_symbols', {'dependencies': [execinfo_dep]}],
   ['clock_gettime', {'dependencies': [rt_dep, posix4_dep], 'define': false}],
   ['copyfile'],
+  ['copy_file_range'],
   # gcc/clang's sanitizer helper library provides dlopen but not dlsym, thus
   # when enabling asan the dlopen check doesn't notice that -ldl is actually
   # required. Just checking for dlsym() ought to suffice.
diff --git a/src/bin/pg_upgrade/check.c b/src/bin/pg_upgrade/check.c
index 64024e3b9e..8c4e56a568 100644
--- a/src/bin/pg_upgrade/check.c
+++ b/src/bin/pg_upgrade/check.c
@@ -199,6 +199,7 @@ check_new_cluster(void)
 			check_file_clone();
 			break;
 		case TRANSFER_MODE_COPY:
+		case TRANSFER_MODE_COPY_FILE_RANGE:
 			break;
 		case TRANSFER_MODE_LINK:
 			check_hard_link();
diff --

Re: pg_upgrade --copy-file-range

2024-03-05 Thread Peter Eisentraut

On 05.01.24 13:40, Jakub Wartak wrote:

Random patch notes:
- main meat is in v3-0002*, I hope i did not screw something seriously
- in worst case: it is opt-in through switch, so the user always can
stick to the classic copy
- no docs so far
- pg_copyfile_offload_supported() should actually be fixed if it is a
good path forward
- pgindent actually indents larger areas of code that I would like to,
any ideas or is it ok?
- not tested on Win32/MacOS/FreeBSD
- i've tested pg_upgrade manually and it seems to work and issue
correct syscalls, however some tests are failing(?). I haven't
investigated why yet due to lack of time.


Something is wrong with the pgindent in your patch set.  Maybe you used 
a wrong version.  You should try to fix that, because it is hard to 
process your patch with that amount of unrelated reformatting.


As far as I can tell, the original pg_upgrade patch has been ready to 
commit since October.  Unless Thomas has any qualms that have not been 
made explicit in this thread, I suggest we move ahead with that.


And then Jakub could rebase his patch set on top of that.  It looks like 
if the formatting issues are fixed, the remaining pg_combinebackup 
support isn't that big.






Re: pg_upgrade --copy-file-range

2024-03-05 Thread Thomas Munro
On Wed, Mar 6, 2024 at 2:43 AM Peter Eisentraut  wrote:
> As far as I can tell, the original pg_upgrade patch has been ready to
> commit since October.  Unless Thomas has any qualms that have not been
> made explicit in this thread, I suggest we move ahead with that.

pg_upgrade --copy-file-range pushed.  The only change I made was to
remove the EINTR retry condition which was subtly wrong and actually
not needed here AFAICS.  (Erm, maybe I did have an unexpressed qualm
about some bug reports unfolding around that time about corruption
linked to copy_file_range that might have spooked me but those seem to
have been addressed.)

> And then Jakub could rebase his patch set on top of that.  It looks like
> if the formatting issues are fixed, the remaining pg_combinebackup
> support isn't that big.

+1

I'll also go and rebase CREATE DATABASE ... STRATEGY=file_clone[1].

[1] 
https://www.postgresql.org/message-id/flat/CA%2BhUKGLM%2Bt%2BSwBU-cHeMUXJCOgBxSHLGZutV5zCwY4qrCcE02w%40mail.gmail.com




Re: pg_upgrade --copy-file-range

2023-12-22 Thread Peter Eisentraut

On 13.11.23 08:15, Peter Eisentraut wrote:

On 08.10.23 07:15, Thomas Munro wrote:

About your patch:

I think you should have a "check" function called from
check_new_cluster().  That check function can then also handle the "not
supported" case, and you don't need to handle that in
parseCommandLine().  I suggest following the clone example for these,
since the issues there are very similar.


Done.


This version looks good to me.

Tiny nit:  You copy-and-pasted "%s/PG_VERSION.clonetest"; maybe choose a 
different suffix.


Thomas, are you planning to proceed with this patch?





Re: pg_upgrade --copy-file-range

2023-12-22 Thread Thomas Munro
On Sat, Dec 23, 2023 at 9:40 AM Peter Eisentraut  wrote:
> On 13.11.23 08:15, Peter Eisentraut wrote:
> > On 08.10.23 07:15, Thomas Munro wrote:
> >>> About your patch:
> >>>
> >>> I think you should have a "check" function called from
> >>> check_new_cluster().  That check function can then also handle the "not
> >>> supported" case, and you don't need to handle that in
> >>> parseCommandLine().  I suggest following the clone example for these,
> >>> since the issues there are very similar.
> >>
> >> Done.
> >
> > This version looks good to me.
> >
> > Tiny nit:  You copy-and-pasted "%s/PG_VERSION.clonetest"; maybe choose a
> > different suffix.
>
> Thomas, are you planning to proceed with this patch?

Yes.  Sorry for being slow... got stuck working on an imminent new
version of streaming read.  I will be defrosting my commit bit and
committing this one and a few things shortly.

As it happens I was just thinking about this particular patch because
I suddenly had a strong urge to teach pg_combinebackup to use
copy_file_range.  I wonder if you had the same idea...




Re: pg_upgrade --copy-file-range

2023-12-23 Thread Michael Paquier
On Sat, Dec 23, 2023 at 09:52:59AM +1300, Thomas Munro wrote:
> As it happens I was just thinking about this particular patch because
> I suddenly had a strong urge to teach pg_combinebackup to use
> copy_file_range.  I wonder if you had the same idea...

Yeah, +1.  That would make copy_file_blocks() more efficient where the
code is copying 50 blocks in batches because it needs to reassign
checksums to the blocks copied.
--
Michael


signature.asc
Description: PGP signature


Re: pg_upgrade --copy-file-range

2023-10-07 Thread Thomas Munro
On Mon, Jul 3, 2023 at 7:47 PM Peter Eisentraut  wrote:
> When we added --clone, copy_file_range() was available, but the problem
> was that it was hard for the user to predict whether you'd get the fast
> clone behavior or the slow copy behavior.  That's the kind of thing you
> want to know when planning and testing your upgrade.  At the time, there
> were patches passed around in Linux kernel circles that would have been
> able to enforce cloning via the flags argument of copy_file_range(), but
> that didn't make it to the mainline.
>
> So, yes, being able to specify exactly which copy mechanism to use makes
> sense, so that users can choose the tradeoffs.

Thanks for looking.  Yeah, it is quite inconvenient for planning
purposes that it is hard for a user to know which internal strategy it
uses, but that's the interface we have (and clearly "flags" is
reserved for future usage so that might still evolve..).

> About your patch:
>
> I think you should have a "check" function called from
> check_new_cluster().  That check function can then also handle the "not
> supported" case, and you don't need to handle that in
> parseCommandLine().  I suggest following the clone example for these,
> since the issues there are very similar.

Done.
From 9ea1c3fc39a47f634a4fffd1ff1c9b9cf0299d65 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Fri, 2 Jun 2023 13:35:54 -0400
Subject: [PATCH v2] Add --copy-file-range option to pg_upgrade.

The copy_file_range() system call is available on at least Linux and
FreeBSD, and asks the kernel to use efficient ways to copy ranges of a
file.  Options available to the kernel include sharing block ranges
(similar to --clone mode), and pushing down block copies to the storage
layer.

Reviewed-by: Peter Eisentraut 
Discussion: https://postgr.es/m/CA%2BhUKGKe7Hb0-UNih8VD5UNZy5-ojxFb3Pr3xSBBL8qj2M2%3DdQ%40mail.gmail.com
---
 configure  |  2 +-
 configure.ac   |  1 +
 doc/src/sgml/ref/pgupgrade.sgml| 13 +
 meson.build|  1 +
 src/bin/pg_upgrade/check.c |  3 ++
 src/bin/pg_upgrade/file.c  | 78 ++
 src/bin/pg_upgrade/option.c|  7 ++-
 src/bin/pg_upgrade/pg_upgrade.h|  4 ++
 src/bin/pg_upgrade/relfilenumber.c |  8 +++
 src/include/pg_config.h.in |  3 ++
 src/tools/msvc/Solution.pm |  1 +
 11 files changed, 119 insertions(+), 2 deletions(-)

diff --git a/configure b/configure
index d47e0f8b26..2076b19a1b 100755
--- a/configure
+++ b/configure
@@ -15578,7 +15578,7 @@ fi
 LIBS_including_readline="$LIBS"
 LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'`
 
-for ac_func in backtrace_symbols copyfile getifaddrs getpeerucred inet_pton kqueue mbstowcs_l memset_s posix_fallocate ppoll pthread_is_threaded_np setproctitle setproctitle_fast strchrnul strsignal syncfs sync_file_range uselocale wcstombs_l
+for ac_func in backtrace_symbols copyfile copy_file_range getifaddrs getpeerucred inet_pton kqueue mbstowcs_l memset_s posix_fallocate ppoll pthread_is_threaded_np setproctitle setproctitle_fast strchrnul strsignal syncfs sync_file_range uselocale wcstombs_l
 do :
   as_ac_var=`$as_echo "ac_cv_func_$ac_func" | $as_tr_sh`
 ac_fn_c_check_func "$LINENO" "$ac_func" "$as_ac_var"
diff --git a/configure.ac b/configure.ac
index 440b08d113..d0d31dd91e 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1767,6 +1767,7 @@ LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'`
 AC_CHECK_FUNCS(m4_normalize([
 	backtrace_symbols
 	copyfile
+	copy_file_range
 	getifaddrs
 	getpeerucred
 	inet_pton
diff --git a/doc/src/sgml/ref/pgupgrade.sgml b/doc/src/sgml/ref/pgupgrade.sgml
index f17fdb1ba5..8275cc067b 100644
--- a/doc/src/sgml/ref/pgupgrade.sgml
+++ b/doc/src/sgml/ref/pgupgrade.sgml
@@ -263,6 +263,19 @@ PostgreSQL documentation
   
  
 
+ 
+  --copy-file-range
+  
+   
+Use the copy_file_range system call for efficient
+copying.  On some file systems this gives results similar to
+--clone, sharing physical disk blocks, while on others
+it may still copy blocks, but do so via an optimized path.  At present,
+it is supported on Linux and FreeBSD.
+   
+  
+ 
+
  
   -?
   --help
diff --git a/meson.build b/meson.build
index 862c955453..20e7327e9e 100644
--- a/meson.build
+++ b/meson.build
@@ -2415,6 +2415,7 @@ func_checks = [
   ['backtrace_symbols', {'dependencies': [execinfo_dep]}],
   ['clock_gettime', {'dependencies': [rt_dep], 'define': false}],
   ['copyfile'],
+  ['copy_file_range'],
   # gcc/clang's sanitizer helper library provides dlopen but not dlsym, thus
   # when enabling asan the dlopen check doesn't notice that -ldl is actually
   # required. Just checking for dlsym() ought to suffice.
diff --git a/src/bin/pg_upgrade/check.c b/src/bin/pg_upgrade/check.c
index 21a0ff9e42..4a615edb62 100644
--- a/src/bin/pg_upgrade/check.c
+++ b/src/bin/pg_upgrade/check

Re: pg_upgrade --copy-file-range

2023-11-12 Thread Peter Eisentraut

On 08.10.23 07:15, Thomas Munro wrote:

About your patch:

I think you should have a "check" function called from
check_new_cluster().  That check function can then also handle the "not
supported" case, and you don't need to handle that in
parseCommandLine().  I suggest following the clone example for these,
since the issues there are very similar.


Done.


This version looks good to me.

Tiny nit:  You copy-and-pasted "%s/PG_VERSION.clonetest"; maybe choose a 
different suffix.





Re: pg_upgrade --copy-file-range

2023-07-03 Thread Peter Eisentraut

On 02.06.23 21:30, Thomas Munro wrote:

I was just in a pg_upgrade unconference session at PGCon where the
lack of $SUBJECT came up.  This system call gives the kernel the
option to use fast block cloning on XFS, ZFS (as of very recently),
etc, and works on Linux and FreeBSD.  It's probably much the same as
--clone mode on COW file systems, except that is Linux-only.  On
overwrite file systems (ie not copy-on-write, like ext4), it may also
be able to push copies down to storage hardware/network file systems.

There was something like this in the nearby large files patch set, but
in that version it just magically did it when available in --copy
mode.  Now I think the user should have to have to opt in with
--copy-file-range, and simply to error out if it fails.  It may not
work in some cases -- for example, the man page says that older Linux
systems can fail with EXDEV when you try to copy across file systems,
while newer systems will do something less efficient but still
sensible internally; also I saw a claim that some older versions had
weird bugs.  Better to just expose the raw functionality and let users
say when they want it and read the error if it fail, I think.


When we added --clone, copy_file_range() was available, but the problem 
was that it was hard for the user to predict whether you'd get the fast 
clone behavior or the slow copy behavior.  That's the kind of thing you 
want to know when planning and testing your upgrade.  At the time, there 
were patches passed around in Linux kernel circles that would have been 
able to enforce cloning via the flags argument of copy_file_range(), but 
that didn't make it to the mainline.


So, yes, being able to specify exactly which copy mechanism to use makes 
sense, so that users can choose the tradeoffs.


About your patch:

I think you should have a "check" function called from 
check_new_cluster().  That check function can then also handle the "not 
supported" case, and you don't need to handle that in 
parseCommandLine().  I suggest following the clone example for these, 
since the issues there are very similar.






Re: pg_upgrade --copy-file-range

2024-03-19 Thread Tomas Vondra
Hi,

I took a quick look at the remaining part adding copy_file_range to
pg_combinebackup. The patch no longer applies, so I had to rebase it.
Most of the issues were trivial, but I had to fix a couple missing
prototypes - I added them to copy_file.h/c, mostly.

0001 is the minimal rebase + those fixes

0002 has a couple review comments in copy_file, and it also undoes a lot
of unnecessary formatting changes (already pointed out by Peter a couple
days ago).

A couple review comments:

1) AFAIK opt_errinfo() returns pointer to the local "buf" variable.

2) I wonder if we even need opt_errinfo(). I'm not sure it actually
makes anything simpler.

3) I think it'd be nice to make CopyFileMethod more consistent with
transferMode in pg_upgrade.h (I mean, it seems wise to make the naming
more consistent, it's probably not worth unifying this somehow).

4) I wonder how we came up with copying the files by 50 blocks, but I
now realize it's been like this before this patch. I only noticed
because the patch adds a comment before buffer_size calculation.

5) I dislike the renaming of copy_file_blocks to pg_copyfile. The new
name is way more generic / less descriptive - it's clear it copies the
file block by block (well, in chunks). pg_copyfile is pretty vague.

6) This leaves behind copy_file_copyfile, which is now unused.

7) The patch reworks how combinebackup deals with alternative copy
implementations - instead of setting strategy_implementation and calling
that, the decisions now happen in pg_copyfile_offload with a lot of
conditions / ifdef / defined ... I find it pretty hard to understand and
reason about. I liked the strategy_implementation approach, as it forces
us to keep each method in a separate function.

Perhaps there's a reason why that doesn't work for copy_file_range? But
in that case this needs much clearer comments.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL CompanyFrom 39f42eee4c6f50d106672afe108294ee59082500 Mon Sep 17 00:00:00 2001
From: Tomas Vondra 
Date: Tue, 19 Mar 2024 15:34:18 +0100
Subject: [PATCH v20240319 2/2] review and cleanup

---
 src/bin/pg_combinebackup/copy_file.c|   3 +
 src/bin/pg_combinebackup/copy_file.h|   1 +
 src/bin/pg_combinebackup/pg_combinebackup.c | 197 +++-
 src/bin/pg_combinebackup/reconstruct.c  | 105 ++-
 src/bin/pg_combinebackup/reconstruct.h  |  19 +-
 5 files changed, 190 insertions(+), 135 deletions(-)

diff --git a/src/bin/pg_combinebackup/copy_file.c b/src/bin/pg_combinebackup/copy_file.c
index 16e26b4f573..f45670dd47c 100644
--- a/src/bin/pg_combinebackup/copy_file.c
+++ b/src/bin/pg_combinebackup/copy_file.c
@@ -77,6 +77,8 @@ opt_errinfo(const char *addon_errmsg)
 		return "";
 
 	strcpy(buf, " ");
+
+	/* XXX isn't this broken? this returns pointer to local variable */
 	return strncat(buf, addon_errmsg, sizeof(buf) - 2);
 }
 
@@ -93,6 +95,7 @@ pg_copyfile(const char *src, const char *dest, const char *addon_errmsg,
 	int			dest_fd;
 	uint8	   *buffer;
 
+	/* XXX where does the 50 blocks come from? larger/smaller? */
 	/* copy in fairly large chunks for best efficiency */
 	const int	buffer_size = 50 * BLCKSZ;
 
diff --git a/src/bin/pg_combinebackup/copy_file.h b/src/bin/pg_combinebackup/copy_file.h
index 2797a340055..f4d0ac47d0e 100644
--- a/src/bin/pg_combinebackup/copy_file.h
+++ b/src/bin/pg_combinebackup/copy_file.h
@@ -15,6 +15,7 @@
 #include "common/checksum_helper.h"
 #include "common/file_utils.h"
 
+/* XXX do we even want this? how does pg_upgrade to this? */
 typedef enum CopyFileMethod
 {
 	PG_COPYFILE_FALLBACK = 0x1,
diff --git a/src/bin/pg_combinebackup/pg_combinebackup.c b/src/bin/pg_combinebackup/pg_combinebackup.c
index 1455360d81c..8fa7827c563 100644
--- a/src/bin/pg_combinebackup/pg_combinebackup.c
+++ b/src/bin/pg_combinebackup/pg_combinebackup.c
@@ -99,10 +99,15 @@ static void cleanup_directories_atexit(void);
 static void create_output_directory(char *dirname, cb_options *opt);
 static void help(const char *progname);
 static bool parse_oid(char *s, Oid *result);
-static void process_directory_recursively(
-		  Oid tsoid, char *input_directory, char *output_directory,
-		  char *relative_path, int n_prior_backups, char **prior_backup_dirs,
-		  manifest_data **manifests, manifest_writer *mwriter, cb_options *opt);
+static void process_directory_recursively(Oid tsoid,
+		  char *input_directory,
+		  char *output_directory,
+		  char *relative_path,
+		  int n_prior_backups,
+		  char **prior_backup_dirs,
+		  manifest_data **manifests,
+		  manifest_writer *mwriter,
+		  cb_options *opt);
 static int	read_pg_version_file(char *directory);
 static void remember_to_cleanup_directory(char *target_path, bool rmtopdir);
 static void reset_directory_cleanup_list(void);
@@ -156,8 +161,8 @@ main(int argc, char *argv[])
 	opt.copy_method = 0;
 
 	/* process co

Re: pg_upgrade --copy-file-range

2024-03-20 Thread Jakub Wartak
Hi Tomas,

> I took a quick look at the remaining part adding copy_file_range to
> pg_combinebackup. The patch no longer applies, so I had to rebase it.
> Most of the issues were trivial, but I had to fix a couple missing
> prototypes - I added them to copy_file.h/c, mostly.
>
> 0001 is the minimal rebase + those fixes
>
> 0002 has a couple review comments in copy_file, and it also undoes a lot
> of unnecessary formatting changes (already pointed out by Peter a couple
> days ago).
>

Thank you very much for this! As discussed privately, I'm not in
position right now to pursue this further at this late stage (at least
for v17, which would require an aggressive schedule ). My plan was
more for v18 after Peter's email, due to other obligations. But if you
have cycles and want to continue, please do so without hesitation -
I'll try to chime in a long way to test and review for sure.

> A couple review comments:
>
> 1) AFAIK opt_errinfo() returns pointer to the local "buf" variable.
>
> 2) I wonder if we even need opt_errinfo(). I'm not sure it actually
> makes anything simpler.

Yes, as it stands it's broken (somewhat I've missed gcc warning),
should be pg_malloc(). I hardly remember, but I wanted to avoid code
duplication. No strong opinion, maybe that's a different style, I'll
adapt as necessary.

> 3) I think it'd be nice to make CopyFileMethod more consistent with
> transferMode in pg_upgrade.h (I mean, it seems wise to make the naming
> more consistent, it's probably not worth unifying this somehow).
>
> 4) I wonder how we came up with copying the files by 50 blocks, but I
> now realize it's been like this before this patch. I only noticed
> because the patch adds a comment before buffer_size calculation.

It looks like it was like that before pg_upgrade even was moved into
the core. 400kB is indeed bit strange value, so we can leave it as it
is or make the COPY_BUF_SIZ 128kb - see [1] (i've double checked cp(1)
uses still 128kB today), or maybe just stick to something like 256 or
512 kBs.

> 5) I dislike the renaming of copy_file_blocks to pg_copyfile. The new
> name is way more generic / less descriptive - it's clear it copies the
> file block by block (well, in chunks). pg_copyfile is pretty vague.
>
> 6) This leaves behind copy_file_copyfile, which is now unused.
>
> 7) The patch reworks how combinebackup deals with alternative copy
> implementations - instead of setting strategy_implementation and calling
> that, the decisions now happen in pg_copyfile_offload with a lot of
> conditions / ifdef / defined ... I find it pretty hard to understand and
> reason about. I liked the strategy_implementation approach, as it forces
> us to keep each method in a separate function.

Well some context (maybe it was my mistake to continue in this
./thread rather starting a new one): my plan was 3-in-1: in the
original proposal (from Jan) to provide CoW as generic facility for
other to use - in src/common/file_utils.c as per
v3-0002-Confine-various-OS-copy-on-write-and-other-copy-a.patch - to
unify & confine CoW methods and their quirkiness between
pg_combinebackup and pg_upgrade and other potential CoW uses too. That
was before Thomas M. pushed CoW just for pg_upgrade as
d93627bcbe5001750e7611f0e637200e2d81dcff. I had this idea back then to
have pg_copyfile() [normal blocks copy] and
pg_copyfile_offload_supported(),
pg_copyfile_offload(PG_COPYFILE_IOCTL_FICLONE ,
PG_COPYFILE_COPY_FILE_RANGE,
PG_COPYFILE_who_has_idea_what_they_come_up_with_in_future). In Your's
version of the patch it's local to pg_combinebackup, so it might make
no sense after all. If you look at the pg_upgrade and pg_combinebackup
they both have code duplication with lots of those ifs/IFs (assuming
user wants to have it as drop-in [--clone/--copy/--copyfile] and
platform may / may not have it). I've even considered
--cow=ficlone|copy_file_range to sync both tools from CLI arguments
point of view, but that would break backwards compatibility, so I did
not do that.

Also there's a problem with pg_combinebackup's strategy_implementation
that it actually cannot on its own decide (I think) which CoW to use
or not. There were some longer discussions that settled on one thing
(for pg_upgrade): it's the user who is in control HOW the copy gets
done (due to potential issues in OS CoW() implementations where e.g.
if NFS would be involved on one side). See pg_upgrade
--clone/--copy/--copy-file-range/--sync-method options. I wanted to
stick to that, so pg_combinebackup also needs to give the same options
to the user.

That's was for the historical context, now you wrote "it's probably
not worth unifying this somehow" few sentences earlier, so my take is
the following: we can just concentrate on getting the
copy_file_range() and ioctl_ficlone to pg_combinebackup at the price
of duplicating some complexity for now (in short to start with clear
plate , it doesn't necessary needs to be my patch as base if we think
it's worthwhile for v17 - or stick to your rework

Re: pg_upgrade --copy-file-range

2024-03-22 Thread Tomas Vondra
Here's a patch reworked along the lines from a couple days ago.

The primary goals were to add clone/copy_file_range while minimizing
unnecessary disruption, and overall cleanup of the patch. I'm not saying
it's committable, but I think the patch is much easier to understand.

The main change is that this abandons the idea of handling all possible
cases in a single function that looks like a maze of ifdefs, and instead
separates each case into it's own function and the decision happens much
earlier. This is pretty much exactly what pg_upgrade does, BTW.

There's maybe an argument that these functions could be unified and
moved to a library in src/common - I can imagine doing that, but I don't
think it's required. The functions are pretty trivial wrappers, and it's
not like we expect many more callers. And there's probably stuff we'd
need to keep out of that library (e.g. the decision which copy/clone
methods are available / should be used or error reporting). So it
doesn't seem worth it, at least for now.

There's one question, though. As it stands, there's a bit of asymmetry
between handling CopyFile() on WIN32 and the clone/copy_file_range on
other platforms). On WIN32, we simply automatically switch to CopyFile
automatically, if we don't need to calculate checksum. But with the
other methods, error out if the user requests those and we need to
calculate the checksum.

The asymmetry comes from the fact there's no option to request CopyFile
on WIN32, and we feel confident it's always the right thing to do (safe,
faster). We don't seem to know that for the other methods, so the user
has to explicitly request those. And if the user requests --clone, for
example, it'd be wrong to silently fallback to plain copy.

Still, I wonder if this might cause some undesirable issues during
restores. But I guess that's why we have --dry-run.

This asymmetry also shows a bit in the code - the CopyFile is coded and
called a bit differently from the other methods. FWIW I abandoned the
approach with "strategy" and just use a switch on CopyMode enum, just
like pg_upgrade does.

There's a couple more smaller changes:

- Addition of docs for --clone/--copy-file-range (shameless copy from
pg_upgrade docs).

- Removal of opt_errinfo - not only was it buggy, I think the code is
actually cleaner without it.

- Removal of EINTR retry condition from copy_file_range handling (this
is what Thomas ended up for pg_upgrade while committing that part).

Put together, this cuts the patch from ~40kB to ~15kB (most of this is
due to the cleanup of unnecessary whitespace changes, though).

I think to make this committable, this requires some review and testing,
ideally on a range of platforms.

One open question is how to allow testing this. For pg_upgrade we now
have PG_TEST_PG_UPGRADE_MODE, which can be set to e.g. "--clone". I
wonder if we should add PG_TEST_PG_COMBINEBACKUP_MODE ...


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL CompanyFrom 7b6a6fe1b555647109caec2817f9200ac8fe9db9 Mon Sep 17 00:00:00 2001
From: Tomas Vondra 
Date: Tue, 19 Mar 2024 15:16:29 +0100
Subject: [PATCH v20240322] pg_combinebackup - allow using
 clone/copy_file_range

---
 doc/src/sgml/ref/pg_combinebackup.sgml  |  34 +
 src/bin/pg_combinebackup/copy_file.c| 157 +++-
 src/bin/pg_combinebackup/copy_file.h|  18 ++-
 src/bin/pg_combinebackup/pg_combinebackup.c |  26 +++-
 src/bin/pg_combinebackup/reconstruct.c  |   5 +-
 src/bin/pg_combinebackup/reconstruct.h  |   5 +-
 6 files changed, 202 insertions(+), 43 deletions(-)

diff --git a/doc/src/sgml/ref/pg_combinebackup.sgml b/doc/src/sgml/ref/pg_combinebackup.sgml
index 8a0a600c2b2..60a60e3fae6 100644
--- a/doc/src/sgml/ref/pg_combinebackup.sgml
+++ b/doc/src/sgml/ref/pg_combinebackup.sgml
@@ -185,6 +185,40 @@ PostgreSQL documentation
   
  
 
+ 
+  --clone
+  
+   
+Use efficient file cloning (also known as reflinks on
+some systems) instead of copying files to the new cluster.  This can
+result in near-instantaneous copying of the data files, giving the
+speed advantages of -k/--link while
+leaving the old cluster untouched.
+   
+
+   
+File cloning is only supported on some operating systems and file
+systems.  If it is selected but not supported, the
+pg_combinebackup run will error.  At present,
+it is supported on Linux (kernel 4.5 or later) with Btrfs and XFS (on
+file systems created with reflink support), and on macOS with APFS.
+   
+  
+ 
+
+ 
+  --copy-file-range
+  
+   
+Use the copy_file_range system call for efficient
+copying.  On some file systems this gives results similar to
+--clone, sharing physical disk blocks, while on others
+it may still copy blocks, but do so via an optimized path.  At present,
+it is supported on Linu

Re: pg_upgrade --copy-file-range

2024-03-22 Thread Robert Haas
On Fri, Mar 22, 2024 at 10:40 AM Tomas Vondra
 wrote:
> There's one question, though. As it stands, there's a bit of asymmetry
> between handling CopyFile() on WIN32 and the clone/copy_file_range on
> other platforms). On WIN32, we simply automatically switch to CopyFile
> automatically, if we don't need to calculate checksum. But with the
> other methods, error out if the user requests those and we need to
> calculate the checksum.

That seems completely broken. copy_file() needs to have the ability to
calculate a checksum if one is required; when one isn't required, it
can do whatever it likes. So we should always fall back to the
block-by-block method if we need a checksum. Whatever option the user
specified should only be applied when we don't need a checksum.

Consider, for example:

pg_basebackup -D sunday -c fast --manifest-checksums=CRC32C
pg_basebackup -D monday -c fast --manifest-checksums=SHA224
--incremental sunday/backup_manifest
pg_combinebackup sunday monday -o tuesday --manifest-checksums=CRC32C --clone

Any files that are copied in their entirety from Sunday's backup can
be cloned, if we have support for cloning. But any files copied from
Monday's backup will need to be re-checksummed, since the checksum
algorithms don't match. With what you're describing, it sounds like
pg_combinebackup would just fail in this case; I don't think that's
the behavior that the user is going to want.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: pg_upgrade --copy-file-range

2024-03-22 Thread Tomas Vondra
On 3/22/24 17:42, Robert Haas wrote:
> On Fri, Mar 22, 2024 at 10:40 AM Tomas Vondra
>  wrote:
>> There's one question, though. As it stands, there's a bit of asymmetry
>> between handling CopyFile() on WIN32 and the clone/copy_file_range on
>> other platforms). On WIN32, we simply automatically switch to CopyFile
>> automatically, if we don't need to calculate checksum. But with the
>> other methods, error out if the user requests those and we need to
>> calculate the checksum.
> 
> That seems completely broken. copy_file() needs to have the ability to
> calculate a checksum if one is required; when one isn't required, it
> can do whatever it likes. So we should always fall back to the
> block-by-block method if we need a checksum. Whatever option the user
> specified should only be applied when we don't need a checksum.
> 
> Consider, for example:
> 
> pg_basebackup -D sunday -c fast --manifest-checksums=CRC32C
> pg_basebackup -D monday -c fast --manifest-checksums=SHA224
> --incremental sunday/backup_manifest
> pg_combinebackup sunday monday -o tuesday --manifest-checksums=CRC32C --clone
> 
> Any files that are copied in their entirety from Sunday's backup can
> be cloned, if we have support for cloning. But any files copied from
> Monday's backup will need to be re-checksummed, since the checksum
> algorithms don't match. With what you're describing, it sounds like
> pg_combinebackup would just fail in this case; I don't think that's
> the behavior that the user is going to want.
> 

Right, this will happen:

  pg_combinebackup: error: unable to use accelerated copy when manifest
  checksums are to be calculated. Use --no-manifest

Are you saying we should just silently override the copy method and do
the copy block by block? I'm not strongly opposed to that, but it feels
wrong to just ignore that the user explicitly requested cloning, and I'm
not sure why should this be different from any other case when the user
requests incompatible combination of options and/or options that are not
supported on the current configuration.

Why not just to tell the user to use the correct parameters, i.e. either
remove --clone or add --no-manifest?

FWIW I now realize it actually fails a bit earlier than I thought - when
parsing the options, not in copy_file. But then some checks (if a given
copy method is supported) happen in the copy functions. I wonder if it'd
be better/possible to do all of that in one place, not sure.

Also, the message only suggests to use --no-manifest. It probably should
suggest removing --clone too.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: pg_upgrade --copy-file-range

2024-03-22 Thread Robert Haas
On Fri, Mar 22, 2024 at 1:22 PM Tomas Vondra
 wrote:
> Right, this will happen:
>
>   pg_combinebackup: error: unable to use accelerated copy when manifest
>   checksums are to be calculated. Use --no-manifest
>
> Are you saying we should just silently override the copy method and do
> the copy block by block?

Yes.

> I'm not strongly opposed to that, but it feels
> wrong to just ignore that the user explicitly requested cloning, and I'm
> not sure why should this be different from any other case when the user
> requests incompatible combination of options and/or options that are not
> supported on the current configuration.

I don't feel like copying block-by-block when that's needed to compute
a checksum is ignoring what the user requested. I mean, if we'd had to
perform reconstruction rather than copying an entire file, we would
have done that regardless of whether --clone had been there, and I
don't see why the need-to-compute-a-checksum case is any different. I
think we should view a flag like --clone as specifying how to copy a
file when we don't need to do anything but copy it. I don't think it
should dictate that we're not allowed to perform other processing when
that other processing is required.

>From my point of view, this is not a case of incompatible options
having been specified. If you specify run pg_basebackup with both
--format=p and --format=t, those are incompatible options; the backup
can be done one way or the other, but not both at once. But here
there's no such conflict. Block-by-block copying and fast-copying can
happen as part of the same operation, as in the example that I showed,
where some files need the block-by-block copying and some can be
fast-copied. The user is entitled to specify which fast-copying method
they would like to have used for the files where fast-copying is
possible without getting a failure just because it isn't possible for
every single file.

Or to say it the other way around, if there's 1 file that needs to be
copied block by block and 99 files that can be fast-copied, you want
to force the user to the block-by-block method for all 100 files. I
want to force the use of the block-by-block method for the 1 file
where that's the only valid method, and let them choose what they want
to do for the other 99.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: pg_upgrade --copy-file-range

2024-03-22 Thread Thomas Munro
Hmm, this discussion seems to assume that we only use
copy_file_range() to copy/clone whole segment files, right?  That's
great and may even get most of the available benefit given typical
databases with many segments of old data that never changes, but... I
think copy_write_range() allows us to go further than the other
whole-file clone techniques: we can stitch together parts of an old
backup segment file and an incremental backup to create a new file.
If you're interested in minimising disk use while also removing
dependencies on the preceding chain of backups, then it might make
sense to do that even if you *also* have to read the data to compute
the checksums, I think?  That's why I mentioned it: if
copy_file_range() (ie sub-file-level block sharing) is a solution in
search of a problem, has the world ever seen a better problem than
pg_combinebackup?




Re: pg_upgrade --copy-file-range

2024-03-23 Thread Robert Haas
On Fri, Mar 22, 2024 at 8:26 PM Thomas Munro  wrote:
> Hmm, this discussion seems to assume that we only use
> copy_file_range() to copy/clone whole segment files, right?  That's
> great and may even get most of the available benefit given typical
> databases with many segments of old data that never changes, but... I
> think copy_write_range() allows us to go further than the other
> whole-file clone techniques: we can stitch together parts of an old
> backup segment file and an incremental backup to create a new file.
> If you're interested in minimising disk use while also removing
> dependencies on the preceding chain of backups, then it might make
> sense to do that even if you *also* have to read the data to compute
> the checksums, I think?  That's why I mentioned it: if
> copy_file_range() (ie sub-file-level block sharing) is a solution in
> search of a problem, has the world ever seen a better problem than
> pg_combinebackup?

That makes sense; it's just a different part of the code than I
thought we were talking about.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: pg_upgrade --copy-file-range

2024-03-23 Thread Tomas Vondra


On 3/22/24 19:40, Robert Haas wrote:
> On Fri, Mar 22, 2024 at 1:22 PM Tomas Vondra
>  wrote:
>> Right, this will happen:
>>
>>   pg_combinebackup: error: unable to use accelerated copy when manifest
>>   checksums are to be calculated. Use --no-manifest
>>
>> Are you saying we should just silently override the copy method and do
>> the copy block by block?
> 
> Yes.
> 
>> I'm not strongly opposed to that, but it feels
>> wrong to just ignore that the user explicitly requested cloning, and I'm
>> not sure why should this be different from any other case when the user
>> requests incompatible combination of options and/or options that are not
>> supported on the current configuration.
> 
> I don't feel like copying block-by-block when that's needed to compute
> a checksum is ignoring what the user requested. I mean, if we'd had to
> perform reconstruction rather than copying an entire file, we would
> have done that regardless of whether --clone had been there, and I
> don't see why the need-to-compute-a-checksum case is any different. I
> think we should view a flag like --clone as specifying how to copy a
> file when we don't need to do anything but copy it. I don't think it
> should dictate that we're not allowed to perform other processing when
> that other processing is required.
> 
> From my point of view, this is not a case of incompatible options
> having been specified. If you specify run pg_basebackup with both
> --format=p and --format=t, those are incompatible options; the backup
> can be done one way or the other, but not both at once. But here
> there's no such conflict. Block-by-block copying and fast-copying can
> happen as part of the same operation, as in the example that I showed,
> where some files need the block-by-block copying and some can be
> fast-copied. The user is entitled to specify which fast-copying method
> they would like to have used for the files where fast-copying is
> possible without getting a failure just because it isn't possible for
> every single file.
> 
> Or to say it the other way around, if there's 1 file that needs to be
> copied block by block and 99 files that can be fast-copied, you want
> to force the user to the block-by-block method for all 100 files. I
> want to force the use of the block-by-block method for the 1 file
> where that's the only valid method, and let them choose what they want
> to do for the other 99.
> 

OK, that makes sense. Here's a patch that should work like this - in
copy_file we check if we need to calculate checksums, and either use the
requested copy method, or fall back to the block-by-block copy.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL CompanyFrom 558321b7ee10fa3902aaed1d1a08857865a232bb Mon Sep 17 00:00:00 2001
From: Tomas Vondra 
Date: Tue, 19 Mar 2024 15:16:29 +0100
Subject: [PATCH v20240323] pg_combinebackup - allow using
 clone/copy_file_range

---
 doc/src/sgml/ref/pg_combinebackup.sgml  |  34 +
 src/bin/pg_combinebackup/copy_file.c| 156 +++-
 src/bin/pg_combinebackup/copy_file.h|  18 ++-
 src/bin/pg_combinebackup/pg_combinebackup.c |  18 ++-
 src/bin/pg_combinebackup/reconstruct.c  |   5 +-
 src/bin/pg_combinebackup/reconstruct.h  |   5 +-
 6 files changed, 192 insertions(+), 44 deletions(-)

diff --git a/doc/src/sgml/ref/pg_combinebackup.sgml b/doc/src/sgml/ref/pg_combinebackup.sgml
index 8a0a600c2b2..60a60e3fae6 100644
--- a/doc/src/sgml/ref/pg_combinebackup.sgml
+++ b/doc/src/sgml/ref/pg_combinebackup.sgml
@@ -185,6 +185,40 @@ PostgreSQL documentation
   
  
 
+ 
+  --clone
+  
+   
+Use efficient file cloning (also known as reflinks on
+some systems) instead of copying files to the new cluster.  This can
+result in near-instantaneous copying of the data files, giving the
+speed advantages of -k/--link while
+leaving the old cluster untouched.
+   
+
+   
+File cloning is only supported on some operating systems and file
+systems.  If it is selected but not supported, the
+pg_combinebackup run will error.  At present,
+it is supported on Linux (kernel 4.5 or later) with Btrfs and XFS (on
+file systems created with reflink support), and on macOS with APFS.
+   
+  
+ 
+
+ 
+  --copy-file-range
+  
+   
+Use the copy_file_range system call for efficient
+copying.  On some file systems this gives results similar to
+--clone, sharing physical disk blocks, while on others
+it may still copy blocks, but do so via an optimized path.  At present,
+it is supported on Linux and FreeBSD.
+   
+  
+ 
+
  
-V
--version
diff --git a/src/bin/pg_combinebackup/copy_file.c b/src/bin/pg_combinebackup/copy_file.c
index e6d2423278a..0874679e53a 100644
--- a/src/bin/pg_combinebackup/copy_file.c
+++ b/src/bin/pg_combinebackup/cop

Re: pg_upgrade --copy-file-range

2024-03-23 Thread Tomas Vondra
On 3/23/24 13:38, Robert Haas wrote:
> On Fri, Mar 22, 2024 at 8:26 PM Thomas Munro  wrote:
>> Hmm, this discussion seems to assume that we only use
>> copy_file_range() to copy/clone whole segment files, right?  That's
>> great and may even get most of the available benefit given typical
>> databases with many segments of old data that never changes, but... I
>> think copy_write_range() allows us to go further than the other
>> whole-file clone techniques: we can stitch together parts of an old
>> backup segment file and an incremental backup to create a new file.
>> If you're interested in minimising disk use while also removing
>> dependencies on the preceding chain of backups, then it might make
>> sense to do that even if you *also* have to read the data to compute
>> the checksums, I think?  That's why I mentioned it: if
>> copy_file_range() (ie sub-file-level block sharing) is a solution in
>> search of a problem, has the world ever seen a better problem than
>> pg_combinebackup?
> 
> That makes sense; it's just a different part of the code than I
> thought we were talking about.
> 

Yeah, that's in write_reconstructed_file() and the patch does not touch
that at all. I agree it would be nice to use copy_file_range() in this
part too, and it doesn't seem it'd be that hard to do, I think.

It seems we'd just need a "fork" that either calls pread/pwrite or
copy_file_range, depending on checksums and what was requested.

BTW is there a reason why the code calls "write" and not "pg_pwrite"?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: pg_upgrade --copy-file-range

2024-03-23 Thread Tomas Vondra


On 3/23/24 14:47, Tomas Vondra wrote:
> On 3/23/24 13:38, Robert Haas wrote:
>> On Fri, Mar 22, 2024 at 8:26 PM Thomas Munro  wrote:
>>> Hmm, this discussion seems to assume that we only use
>>> copy_file_range() to copy/clone whole segment files, right?  That's
>>> great and may even get most of the available benefit given typical
>>> databases with many segments of old data that never changes, but... I
>>> think copy_write_range() allows us to go further than the other
>>> whole-file clone techniques: we can stitch together parts of an old
>>> backup segment file and an incremental backup to create a new file.
>>> If you're interested in minimising disk use while also removing
>>> dependencies on the preceding chain of backups, then it might make
>>> sense to do that even if you *also* have to read the data to compute
>>> the checksums, I think?  That's why I mentioned it: if
>>> copy_file_range() (ie sub-file-level block sharing) is a solution in
>>> search of a problem, has the world ever seen a better problem than
>>> pg_combinebackup?
>>
>> That makes sense; it's just a different part of the code than I
>> thought we were talking about.
>>
> 
> Yeah, that's in write_reconstructed_file() and the patch does not touch
> that at all. I agree it would be nice to use copy_file_range() in this
> part too, and it doesn't seem it'd be that hard to do, I think.
> 
> It seems we'd just need a "fork" that either calls pread/pwrite or
> copy_file_range, depending on checksums and what was requested.
> 

Here's a patch to use copy_file_range in write_reconstructed_file too,
when requested/possible. One thing that I'm not sure about is whether to
do pg_fatal() if --copy-file-range but the platform does not support it.
This is more like what pg_upgrade does, but maybe we should just ignore
what the user requested and fallback to the regular copy (a bit like
when having to do a checksum for some files). Or maybe the check should
just happen earlier ...

I've been thinking about what Thomas wrote - that maybe it'd be good to
do copy_file_range() even when calculating the checksum, and I think he
may be right. But the current patch does not do that, and while it
doesn't seem very difficult to do (at least when reconstructing the file
from incremental backups) I don't have a very good intuition whether
it'd be a win or not in typical cases.

I have a naive question about the checksumming - if we used a
merkle-tree-like scheme, i.e. hashing blocks and not hashes of blocks,
wouldn't that allow calculating the hashes even without having to read
the blocks, making copy_file_range more efficient? Sure, it's more
complex, but a well known scheme. (OK, now I realized it'd mean we can't
use tools like sha224sum to hash the files and compare to manifest. I
guess that's why we don't do this ...)


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL CompanyFrom d2b717d14638632c25d0e6919f5cd40333e9ebd0 Mon Sep 17 00:00:00 2001
From: Tomas Vondra 
Date: Sat, 23 Mar 2024 18:26:21 +0100
Subject: [PATCH v20240323-2 2/2] write_reconstructed_file

---
 src/bin/pg_combinebackup/reconstruct.c | 32 +++---
 1 file changed, 29 insertions(+), 3 deletions(-)

diff --git a/src/bin/pg_combinebackup/reconstruct.c b/src/bin/pg_combinebackup/reconstruct.c
index f5c7af8a23c..4de92894bed 100644
--- a/src/bin/pg_combinebackup/reconstruct.c
+++ b/src/bin/pg_combinebackup/reconstruct.c
@@ -59,7 +59,8 @@ static void write_reconstructed_file(char *input_filename,
 	 off_t *offsetmap,
 	 pg_checksum_context *checksum_ctx,
 	 bool debug,
-	 bool dry_run);
+	 bool dry_run,
+	 CopyMode copy_mode);
 static void read_bytes(rfile *rf, void *buffer, unsigned length);
 
 /*
@@ -325,7 +326,8 @@ reconstruct_from_incremental_file(char *input_filename,
 	{
 		write_reconstructed_file(input_filename, output_filename,
  block_length, sourcemap, offsetmap,
- &checksum_ctx, debug, dry_run);
+ &checksum_ctx, debug, dry_run,
+ copy_method);
 		debug_reconstruction(n_prior_backups + 1, source, dry_run);
 	}
 
@@ -528,7 +530,8 @@ write_reconstructed_file(char *input_filename,
 		 off_t *offsetmap,
 		 pg_checksum_context *checksum_ctx,
 		 bool debug,
-		 bool dry_run)
+		 bool dry_run,
+		 CopyMode copy_mode)
 {
 	int			wfd = -1;
 	unsigned	i;
@@ -630,6 +633,29 @@ write_reconstructed_file(char *input_filename,
 		if (dry_run)
 			continue;
 
+		/*
+		 * If requested, copy the block using copy_file_range.
+		 *
+		 * We can'd do this if the block needs to be zero-filled or when we
+		 * need to update checksum.
+		 */
+		if ((copy_mode == COPY_MODE_COPY_FILE_RANGE) &&
+			(s != NULL) && (checksum_ctx->type == CHECKSUM_TYPE_NONE))
+		{
+#if defined(HAVE_COPY_FILE_RANGE)
+			do
+			{
+wb = copy_file_range(s->fd, &offsetmap[i], wfd, NULL, BLCKSZ, 0);
+if (wb < 0)
+	pg_fatal("error while copying file ran

Re: pg_upgrade --copy-file-range

2024-03-25 Thread Robert Haas
On Sat, Mar 23, 2024 at 9:37 AM Tomas Vondra
 wrote:
> OK, that makes sense. Here's a patch that should work like this - in
> copy_file we check if we need to calculate checksums, and either use the
> requested copy method, or fall back to the block-by-block copy.

+Use efficient file cloning (also known as reflinks on
+some systems) instead of copying files to the new cluster.  This can

new cluster -> output directory

I think your version kind of messes up the debug logging. In my
version, every call to copy_file() would emit either "would copy
\"%s\" to \"%s\" using strategy %s" and "copying \"%s\" to \"%s\"
using strategy %s". In your version, the dry_run mode emits a string
similar to the former, but creates separate translatable strings for
each copy method instead of using the same one with a different value
of %s. In non-dry-run mode, I think your version loses the debug
logging altogether.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: pg_upgrade --copy-file-range

2024-03-25 Thread Robert Haas
On Sat, Mar 23, 2024 at 9:47 AM Tomas Vondra
 wrote:
> BTW is there a reason why the code calls "write" and not "pg_pwrite"?

I think it's mostly because I learned to code a really long time ago.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: pg_upgrade --copy-file-range

2024-03-26 Thread Jakub Wartak
On Sat, Mar 23, 2024 at 6:57 PM Tomas Vondra
 wrote:

> On 3/23/24 14:47, Tomas Vondra wrote:
> > On 3/23/24 13:38, Robert Haas wrote:
> >> On Fri, Mar 22, 2024 at 8:26 PM Thomas Munro  
> >> wrote:
[..]
> > Yeah, that's in write_reconstructed_file() and the patch does not touch
> > that at all. I agree it would be nice to use copy_file_range() in this
> > part too, and it doesn't seem it'd be that hard to do, I think.
> >
> > It seems we'd just need a "fork" that either calls pread/pwrite or
> > copy_file_range, depending on checksums and what was requested.
> >
>
> Here's a patch to use copy_file_range in write_reconstructed_file too,
> when requested/possible. One thing that I'm not sure about is whether to
> do pg_fatal() if --copy-file-range but the platform does not support it.
[..]

Hi Tomas, so I gave a go to the below patches today:
- v20240323-2-0001-pg_combinebackup-allow-using-clone-copy_.patch
- v20240323-2-0002-write_reconstructed_file.patch

My assessment:

v20240323-2-0001-pg_combinebackup-allow-using-clone-copy_.patch -
looks like more or less good to go
v20240323-2-0002-write_reconstructed_file.patch - needs work and
without that clone/copy_file_range() good effects are unlikely

Given Debian 12, ~100GB DB, (pgbench -i -s 7000 , and some additional
tables with GiST and GIN indexes , just to see more WAL record types)
and with backups sizes in MB like that:

106831  full
2823incr.1 # captured after some time with pgbench -R 100
165 incr.2 # captured after some time with pgbench -R 100

Test cmd: rm -rf outtest; sync; sync ; sync; echo 3 | sudo tee
/proc/sys/vm/drop_caches ; time /usr/pgsql17/bin/pg_combinebackup -o
outtest full incr.1 incr.2

Test results of various copies on small I/O constrained XFS device:
normal copy: 31m47.407s
--clone copy: error: file cloning not supported on this platform (it's
due #ifdef of having COPY_FILE_RANGE available)
--copy-file-range: aborted, as it was taking too long , I was
expecting it to accelerate, but it did not... obviously this is the
transparent failover in case of calculating checksums...
--manifest-checksums=NONE --copy-file-range: BUG, it keep on appending
to just one file e.g. outtest/base/5/16427.29 with 200GB+ ?? and ended
up with ENOSPC [more on this later]
--manifest-checksums=NONE --copy-file-range without v20240323-2-0002: 27m23.887s
--manifest-checksums=NONE --copy-file-range with v20240323-2-0002 and
loop-fix: 5m1.986s but it creates corruption as it stands

Issues:

1. https://cirrus-ci.com/task/5937513653600256?logs=mingw_cross_warning#L327
compains about win32/mingw:

[15:47:27.184] In file included from copy_file.c:22:
[15:47:27.184] copy_file.c: In function ‘copy_file’:
[15:47:27.184] ../../../src/include/common/logging.h:134:6: error:
this statement may fall through [-Werror=implicit-fallthrough=]
[15:47:27.184]   134 |   if (unlikely(__pg_log_level <= PG_LOG_DEBUG)) \
[15:47:27.184]   |  ^
[15:47:27.184] copy_file.c:96:5: note: in expansion of macro ‘pg_log_debug’
[15:47:27.184]96 | pg_log_debug("would copy \"%s\" to \"%s\"
(copy_file_range)",
[15:47:27.184]   | ^~~~
[15:47:27.184] copy_file.c:99:4: note: here
[15:47:27.184]99 |case COPY_MODE_COPYFILE:
[15:47:27.184]   |^~~~
[15:47:27.184] cc1: all warnings being treated as errors

2. I do not know what's the consensus between --clone and
--copy-file-range , but if we have #ifdef FICLONE clone_works() #elif
HAVE_COPY_FILE_RANGE copy_file_range_only_works() then we should also
apply the same logic to the --help so that --clone is not visible
there (for consistency?). Also the "error: file cloning not supported
on this platform " is technically incorrect, Linux does support
ioctl(FICLONE) and copy_file_range(), but we are just not choosing one
over another (so technically it is "available"). Nitpicking I know.

3. [v20240323-2-0002-write_reconstructed_file.patch]: The mentioned
ENOSPACE spiral-of-death-bug symptoms are like that:

strace:
copy_file_range(8, [697671680], 9, NULL, 8192, 0) = 8192
copy_file_range(8, [697679872], 9, NULL, 8192, 0) = 8192
copy_file_range(8, [697688064], 9, NULL, 8192, 0) = 8192
copy_file_range(8, [697696256], 9, NULL, 8192, 0) = 8192
copy_file_range(8, [697704448], 9, NULL, 8192, 0) = 8192
copy_file_range(8, [697712640], 9, NULL, 8192, 0) = 8192
copy_file_range(8, [697720832], 9, NULL, 8192, 0) = 8192
copy_file_range(8, [697729024], 9, NULL, 8192, 0) = 8192
copy_file_range(8, [697737216], 9, NULL, 8192, 0) = 8192
copy_file_range(8, [697745408], 9, NULL, 8192, 0) = 8192
copy_file_range(8, [697753600], 9, NULL, 8192, 0) = 8192
copy_file_range(8, [697761792], 9, NULL, 8192, 0) = 8192
copy_file_range(8, [697769984], 9, NULL, 8192, 0) = 8192

Notice that dest_off_t (poutoff) is NULL.

(gdb) where
#0  0x7f2cd56f6733 in copy_file_range (infd=8,
pinoff=pinoff@entry=0x7f2cd53f54e8, outfd=outfd@entry=9,
poutoff=poutoff@entry=0x0,
length=length@entry=8192, flags=flags@entry=0) at
../sysdeps/unix/sys

Re: pg_upgrade --copy-file-range

2024-03-26 Thread Tomas Vondra
On 3/25/24 15:31, Robert Haas wrote:
> On Sat, Mar 23, 2024 at 9:37 AM Tomas Vondra
>  wrote:
>> OK, that makes sense. Here's a patch that should work like this - in
>> copy_file we check if we need to calculate checksums, and either use the
>> requested copy method, or fall back to the block-by-block copy.
> 
> +Use efficient file cloning (also known as reflinks on
> +some systems) instead of copying files to the new cluster.  This can
> 
> new cluster -> output directory
> 

Ooops, forgot to fix this. Will do in next version.

> I think your version kind of messes up the debug logging. In my
> version, every call to copy_file() would emit either "would copy
> \"%s\" to \"%s\" using strategy %s" and "copying \"%s\" to \"%s\"
> using strategy %s". In your version, the dry_run mode emits a string
> similar to the former, but creates separate translatable strings for
> each copy method instead of using the same one with a different value
> of %s. In non-dry-run mode, I think your version loses the debug
> logging altogether.
> 

Yeah. Sorry for not being careful enough about that, I was focusing on
the actual copy logic and forgot about this.

The patch I shared a couple minutes ago should fix this, effectively
restoring the original debug behavior. I liked the approach with calling
strategy_implementation a bit more, I wonder if it'd be better to go
back to that for the "accelerated" copy methods, somehow.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: pg_upgrade --copy-file-range

2024-03-28 Thread Robert Haas
On Tue, Mar 26, 2024 at 2:09 PM Tomas Vondra
 wrote:
> The patch I shared a couple minutes ago should fix this, effectively
> restoring the original debug behavior. I liked the approach with calling
> strategy_implementation a bit more, I wonder if it'd be better to go
> back to that for the "accelerated" copy methods, somehow.

Somehow I don't see this patch?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: pg_upgrade --copy-file-range

2024-03-28 Thread Tomas Vondra



On 3/28/24 21:45, Robert Haas wrote:
> On Tue, Mar 26, 2024 at 2:09 PM Tomas Vondra
>  wrote:
>> The patch I shared a couple minutes ago should fix this, effectively
>> restoring the original debug behavior. I liked the approach with calling
>> strategy_implementation a bit more, I wonder if it'd be better to go
>> back to that for the "accelerated" copy methods, somehow.
> 
> Somehow I don't see this patch?
> 

It's here:

https://www.postgresql.org/message-id/90866c27-265a-4adb-89d0-18c8dd22bc19%40enterprisedb.com

I did change the subject to reflect that it's no longer about
pg_upgrade, maybe that breaks the threading for you somehow?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company