file cloning in pg_upgrade and CREATE DATABASE

2018-02-20 Thread Peter Eisentraut
Here is another attempt at implementing file cloning for pg_upgrade and
CREATE DATABASE.  The idea is to take advantage of file systems that can
make copy-on-write clones, which would make the copy run much faster.
For pg_upgrade, this will give the performance of --link mode without
the associated drawbacks.

There have been patches proposed previously [0][1].  The concerns there
were mainly that they required a Linux-specific ioctl() call and only
worked for Btrfs.

Some new things have happened since then:

- XFS has (optional) reflink support.  This file system is probably more
widely used than Btrfs.

- Linux and glibc have a proper function to do this now.

- APFS on macOS supports file cloning.

So altogether this feature will be more widely usable and less ugly to
implement.  Note, however, that you will currently need literally the
latest glibc release, so it probably won't be accessible right now
unless you are using Fedora 28 for example.  (This is the
copy_file_range() function that had us recently rename the same function
in pg_rewind.)

Some example measurements:

6 GB database, pg_upgrade unpatched 30 seconds, patched 3 seconds (XFS
and APFS)

similar for a CREATE DATABASE from a large template

Even if you don't have a file system with cloning support, the special
library calls make copying faster.  For example, on APFS, in this
example, an unpatched CREATE DATABASE takes 30 seconds, with the library
call (but without cloning) it takes 10 seconds.

For amusement/bewilderment, without the recent flush optimization on
APFS, this takes 2 minutes 30 seconds.  I suppose this optimization will
now actually obsolete, since macOS will no longer hit that code.


[0]:
https://www.postgresql.org/message-id/flat/513C0E7C.5080606%40socialserve.com

[1]:
https://www.postgresql.org/message-id/flat/20140213030731.GE4831%40momjian.us
-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From 56b5b574f6d900d5eb4932be499cf3bae0e7ba86 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Tue, 20 Feb 2018 10:41:16 -0500
Subject: [PATCH] Use file cloning in pg_upgrade and CREATE DATABASE

For file copying in pg_upgrade and CREATE DATABASE, use special file
cloning calls if available.  This makes the copying faster and more
space efficient.  For pg_upgrade, this achieves speed similar to --link
mode without the associated drawbacks.

On Linux, use copy_file_range().  This supports file cloning
automatically on Btrfs and XFS (if formatted with reflink support).  On
macOS, use copyfile(), which supports file cloning on APFS.

Even on file systems without cloning/reflink support, this is faster
than the existing code, because it avoids copying the file contents out
of kernel space and allows the OS to apply other optimizations.
---
 configure  |  2 +-
 configure.in   |  2 +-
 doc/src/sgml/ref/pgupgrade.sgml| 11 
 src/backend/storage/file/copydir.c | 55 +-
 src/bin/pg_upgrade/file.c  | 37 -
 src/include/pg_config.h.in |  6 +
 6 files changed, 104 insertions(+), 9 deletions(-)

diff --git a/configure b/configure
index 7dcca506f8..eb8b321723 100755
--- a/configure
+++ b/configure
@@ -13079,7 +13079,7 @@ fi
 LIBS_including_readline="$LIBS"
 LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'`
 
-for ac_func in cbrt clock_gettime dlopen fdatasync getifaddrs getpeerucred 
getrlimit mbstowcs_l memmove poll posix_fallocate pstat pthread_is_threaded_np 
readlink setproctitle setsid shm_open symlink sync_file_range utime utimes 
wcstombs_l
+for ac_func in cbrt clock_gettime copy_file_range copyfile dlopen fdatasync 
getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll posix_fallocate pstat 
pthread_is_threaded_np readlink setproctitle setsid shm_open symlink 
sync_file_range utime utimes wcstombs_l
 do :
   as_ac_var=`$as_echo "ac_cv_func_$ac_func" | $as_tr_sh`
 ac_fn_c_check_func "$LINENO" "$ac_func" "$as_ac_var"
diff --git a/configure.in b/configure.in
index 4d26034579..dfe3507b25 100644
--- a/configure.in
+++ b/configure.in
@@ -1425,7 +1425,7 @@ PGAC_FUNC_WCSTOMBS_L
 LIBS_including_readline="$LIBS"
 LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'`
 
-AC_CHECK_FUNCS([cbrt clock_gettime dlopen fdatasync getifaddrs getpeerucred 
getrlimit mbstowcs_l memmove poll posix_fallocate pstat pthread_is_threaded_np 
readlink setproctitle setsid shm_open symlink sync_file_range utime utimes 
wcstombs_l])
+AC_CHECK_FUNCS([cbrt clock_gettime copy_file_range copyfile dlopen fdatasync 
getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll posix_fallocate pstat 
pthread_is_threaded_np readlink setproctitle setsid shm_open symlink 
sync_file_range utime utimes wcs

Re: file cloning in pg_upgrade and CREATE DATABASE

2018-02-21 Thread Robert Haas
On Tue, Feb 20, 2018 at 10:00 PM, Peter Eisentraut
 wrote:
> Some example measurements:
>
> 6 GB database, pg_upgrade unpatched 30 seconds, patched 3 seconds (XFS
> and APFS)
>
> similar for a CREATE DATABASE from a large template
>
> Even if you don't have a file system with cloning support, the special
> library calls make copying faster.  For example, on APFS, in this
> example, an unpatched CREATE DATABASE takes 30 seconds, with the library
> call (but without cloning) it takes 10 seconds.

Nice results.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: file cloning in pg_upgrade and CREATE DATABASE

2018-02-21 Thread Tomas Vondra
On 02/21/2018 04:00 AM, Peter Eisentraut wrote:
> ...
> 
> Some example measurements:
> 
> 6 GB database, pg_upgrade unpatched 30 seconds, patched 3 seconds (XFS
> and APFS)
> 
> similar for a CREATE DATABASE from a large template
> 

Nice improvement, of course. How does that affect performance on the
cloned database? If I understand this correctly, it essentially enables
CoW on the files, so what's the overhead on that? It'd be unfortunate to
speed up CREATE DATABASE only to get degraded performance later.

In any case, I find this interesting mainly for pg_upgrade use case. On
running systems I think the main issue with CREATE DATABASE is that it
forces a checkpoint.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: file cloning in pg_upgrade and CREATE DATABASE

2018-02-22 Thread Peter Eisentraut
On 2/21/18 18:57, Tomas Vondra wrote:
> Nice improvement, of course. How does that affect performance on the
> cloned database? If I understand this correctly, it essentially enables
> CoW on the files, so what's the overhead on that? It'd be unfortunate to
> speed up CREATE DATABASE only to get degraded performance later.

I ran a little test (on APFS and XFS): Create a large (unlogged) table,
copy the database, then delete everything from the table in the copy.
That should need to CoW all the blocks.  It has about the same
performance with cloning, possibly slightly faster.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: file cloning in pg_upgrade and CREATE DATABASE

2018-03-19 Thread Michael Paquier
On Tue, Feb 20, 2018 at 10:00:04PM -0500, Peter Eisentraut wrote:
> Some new things have happened since then:
> 
> - XFS has (optional) reflink support.  This file system is probably more
> widely used than Btrfs.

Btrfs is still in development, there are I think no many people who
would use it in production.

> - Linux and glibc have a proper function to do this now.
> 
> - APFS on macOS supports file cloning.

So copyfile() is only part of macos?  I am not able to find references
in FreeBSD, NetBSD or OpenBSD, but I may be missing something.

> So altogether this feature will be more widely usable and less ugly to
> implement.  Note, however, that you will currently need literally the
> latest glibc release, so it probably won't be accessible right now
> unless you are using Fedora 28 for example.  (This is the
> copy_file_range() function that had us recently rename the same function
> in pg_rewind.)

For reference, Debian SID is using glibc 2.27.  ArchLinux is still on
2.26.

> Some example measurements:
> 
> 6 GB database, pg_upgrade unpatched 30 seconds, patched 3 seconds (XFS
> and APFS)

Interesting.  I'll try to test that on an XFS partition and see if I can
see a difference.  For now I have just read through the patch.

+#ifdef HAVE_COPYFILE
+   if (copyfile(fromfile, tofile, NULL,
+#ifdef COPYFILE_CLONE
+COPYFILE_CLONE
+#else
+COPYFILE_DATA
+#endif
+   ) < 0)
+   ereport(ERROR,
+   (errcode_for_file_access(),
+errmsg("could not copy file \"%s\" to \"%s\": %m", 
fromfile, tofile)));
+#else
copy_file(fromfile, tofile);
+#endif

Any backend-side callers of copy_file() would not benefit from
copyfile() on OSX.  Shouldn't all that handling be inside copy_file(),
similarly to what your patch actually does for pg_upgrade?  I think that
you should also consider fcopyfile() instead of copyfile() as it works
directly on the file descriptors and share the same error handling as
the others.
--
Michael


signature.asc
Description: PGP signature


Re: file cloning in pg_upgrade and CREATE DATABASE

2018-03-19 Thread Michael Paquier
On Mon, Mar 19, 2018 at 04:06:36PM +0900, Michael Paquier wrote:
> Any backend-side callers of copy_file() would not benefit from
> copyfile() on OSX.  Shouldn't all that handling be inside copy_file(),
> similarly to what your patch actually does for pg_upgrade?  I think that
> you should also consider fcopyfile() instead of copyfile() as it works
> directly on the file descriptors and share the same error handling as
> the others.

Two other things I have noticed as well:
1) src/bin/pg_rewind/copy_fetch.c could benefit from similar speed-ups I
think when copying data from source to target using the local mode of
pg_rewind.  This could really improve cases where new relations are
added after a promotion.
2) XLogFileCopy() uses a copy logic as well.  For large segments things
could be improved, however we need to be careful about filling in the
end of segments with zeros.
--
Michael


signature.asc
Description: PGP signature


Re: file cloning in pg_upgrade and CREATE DATABASE

2018-07-12 Thread Thomas Munro
On Wed, Feb 21, 2018 at 4:00 PM, Peter Eisentraut
 wrote:
> - XFS has (optional) reflink support.  This file system is probably more
> widely used than Btrfs.
>
> - Linux and glibc have a proper function to do this now.
>
> - APFS on macOS supports file cloning.

TIL that Solaris 11.4 (closed) ZFS supports reflink() too.  Sadly,
it's not in OpenZFS though I see numerous requests and discussions...
(Of course you can just clone the whole filesystem and then pg_upgrade
the clone in-place).

-- 
Thomas Munro
http://www.enterprisedb.com



Re: file cloning in pg_upgrade and CREATE DATABASE

2018-07-13 Thread Peter Eisentraut
On 13.07.18 07:09, Thomas Munro wrote:
> On Wed, Feb 21, 2018 at 4:00 PM, Peter Eisentraut
>  wrote:
>> - XFS has (optional) reflink support.  This file system is probably more
>> widely used than Btrfs.
>>
>> - Linux and glibc have a proper function to do this now.
>>
>> - APFS on macOS supports file cloning.
> 
> TIL that Solaris 11.4 (closed) ZFS supports reflink() too.  Sadly,
> it's not in OpenZFS though I see numerous requests and discussions...

I look forward to your FreeBSD patch then. ;-)

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: file cloning in pg_upgrade and CREATE DATABASE

2018-07-13 Thread Michael Paquier
On Fri, Jul 13, 2018 at 10:22:21AM +0200, Peter Eisentraut wrote:
> On 13.07.18 07:09, Thomas Munro wrote:
>> TIL that Solaris 11.4 (closed) ZFS supports reflink() too.  Sadly,
>> it's not in OpenZFS though I see numerous requests and discussions...
> 
> I look forward to your FreeBSD patch then. ;-)

+1.
--
Michael


signature.asc
Description: PGP signature


Re: file cloning in pg_upgrade and CREATE DATABASE

2018-07-16 Thread Michael Paquier
On Wed, Jun 06, 2018 at 11:58:14AM -0400, Peter Eisentraut wrote:
> 
>  The setting always requires the use of relinks.  If
>  they are not supported, the pg_upgrade run
>  will abort.  Use this in production to limit the upgrade run time.
>  The setting auto uses reflinks when available,
>  otherwise it falls back to a normal copy.  This is the default.  The
>  setting never prevents use of reflinks and always
>  uses a normal copy.  This can be useful to ensure that the upgraded
>  cluster has its disk space fully allocated and not shared with the old
>  cluster.
> 

Hm...  I am wondering if we actually want the "auto" mode where we make
the option smarter automatically.  I am afraid of users trying to use it
and being surprised that there is no gain while they expected some.  I
would rather switch that to an on/off switch, which defaults to "off",
or simply what is available now.  huge_pages=try was a bad idea as the
result is not deterministic, so I would not have more of that...

Putting CloneFile and check_reflink in a location that other frontend
binaries could use would be nice, like pg_rewind.
--
Michael


signature.asc
Description: PGP signature


Re: file cloning in pg_upgrade and CREATE DATABASE

2018-07-17 Thread Peter Eisentraut
On 17.07.18 08:58, Michael Paquier wrote:
> Hm...  I am wondering if we actually want the "auto" mode where we make
> the option smarter automatically.  I am afraid of users trying to use it
> and being surprised that there is no gain while they expected some.  I
> would rather switch that to an on/off switch, which defaults to "off",
> or simply what is available now.  huge_pages=try was a bad idea as the
> result is not deterministic, so I would not have more of that...

Why do you think that was a bad idea?  Doing the best possible job by
default without requiring explicit configuration in each case seems like
an attractive feature.

> Putting CloneFile and check_reflink in a location that other frontend
> binaries could use would be nice, like pg_rewind.

This could be done in subsequent patches, but the previous iteration of
this patch for CREATE DATABASE integration already showed that each of
those cases needs separate consideration.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: file cloning in pg_upgrade and CREATE DATABASE

2018-06-06 Thread Peter Eisentraut
I have made a major revision of this patch.

I have removed all the changes to CREATE DATABASE.  That was too
contentious and we got lost in unrelated details there.  The real
benefit is for pg_upgrade.

Another point was that for pg_upgrade use a user would like to know
beforehand whether reflinking would be used, which was not possible with
the copy_file_range() API.  So here I have switched to using the ioctl()
call directly.

So the new interface is that pg_upgrade has a new option
--reflink={always,auto,never}.  (This option name is adapted from GNU
cp.)  From the documentation:


 The setting always requires the use of relinks.  If
 they are not supported, the pg_upgrade run
 will abort.  Use this in production to limit the upgrade run time.
 The setting auto uses reflinks when available,
 otherwise it falls back to a normal copy.  This is the default.  The
 setting never prevents use of reflinks and always
 uses a normal copy.  This can be useful to ensure that the upgraded
 cluster has its disk space fully allocated and not shared with the old
 cluster.


Also, pg_upgrade --check will check whether the selected option would work.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From c39e40640e70e8fc4b90e762b985201a1ce9f912 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Tue, 5 Jun 2018 17:24:53 -0400
Subject: [PATCH v3] pg_upgrade: Allow use of file cloning

For file copying in pg_upgrade, allow using special file cloning calls
if available.  This makes the copying faster and more space efficient.
This achieves speed similar to --link mode without the associated
drawbacks.

Add an option --reflink to select whether file cloning is turned on,
off, or automatic.  Automatic is the default.

On Linux, file cloning is supported on Btrfs and XFS (if formatted with
reflink support).  On macOS, file cloning is supported on APFS.
---
 configure|   2 +-
 configure.in |   2 +-
 doc/src/sgml/ref/pgupgrade.sgml  |  33 +
 src/bin/pg_upgrade/check.c   |   2 +
 src/bin/pg_upgrade/file.c| 123 +++
 src/bin/pg_upgrade/option.c  |  14 
 src/bin/pg_upgrade/pg_upgrade.h  |  15 
 src/bin/pg_upgrade/relfilenode.c |  31 +++-
 src/include/pg_config.h.in   |   3 +
 9 files changed, 220 insertions(+), 5 deletions(-)

diff --git a/configure b/configure
index 3d219c802b..a0eebf7462 100755
--- a/configure
+++ b/configure
@@ -14827,7 +14827,7 @@ fi
 LIBS_including_readline="$LIBS"
 LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'`
 
-for ac_func in cbrt clock_gettime dlopen fdatasync getifaddrs getpeerucred 
getrlimit mbstowcs_l memmove poll posix_fallocate pstat pthread_is_threaded_np 
readlink setproctitle setsid shm_open symlink sync_file_range utime utimes 
wcstombs_l
+for ac_func in cbrt clock_gettime copyfile dlopen fdatasync getifaddrs 
getpeerucred getrlimit mbstowcs_l memmove poll posix_fallocate pstat 
pthread_is_threaded_np readlink setproctitle setsid shm_open symlink 
sync_file_range utime utimes wcstombs_l
 do :
   as_ac_var=`$as_echo "ac_cv_func_$ac_func" | $as_tr_sh`
 ac_fn_c_check_func "$LINENO" "$ac_func" "$as_ac_var"
diff --git a/configure.in b/configure.in
index 862d8b128d..73632bee91 100644
--- a/configure.in
+++ b/configure.in
@@ -1528,7 +1528,7 @@ PGAC_FUNC_WCSTOMBS_L
 LIBS_including_readline="$LIBS"
 LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'`
 
-AC_CHECK_FUNCS([cbrt clock_gettime dlopen fdatasync getifaddrs getpeerucred 
getrlimit mbstowcs_l memmove poll posix_fallocate pstat pthread_is_threaded_np 
readlink setproctitle setsid shm_open symlink sync_file_range utime utimes 
wcstombs_l])
+AC_CHECK_FUNCS([cbrt clock_gettime copyfile dlopen fdatasync getifaddrs 
getpeerucred getrlimit mbstowcs_l memmove poll posix_fallocate pstat 
pthread_is_threaded_np readlink setproctitle setsid shm_open symlink 
sync_file_range utime utimes wcstombs_l])
 
 AC_REPLACE_FUNCS(fseeko)
 case $host_os in
diff --git a/doc/src/sgml/ref/pgupgrade.sgml b/doc/src/sgml/ref/pgupgrade.sgml
index 6dafb404a1..01a426f714 100644
--- a/doc/src/sgml/ref/pgupgrade.sgml
+++ b/doc/src/sgml/ref/pgupgrade.sgml
@@ -182,6 +182,39 @@ Options
   display version information, then exit
  
 
+ 
+  
--reflink={always|auto|never}
+  
+   
+Determines whether pg_upgrade, when in copy
+mode, should use efficient file cloning (also known as
+reflinks) on some operating systems and file systems.
+This can result in near-instantaneous copying of the data files,
+giving the speed advantages of
+-k/--link while leaving the old
+cluster untouched.
+   
+
+   
+The setting always requires the use of relinks.  If
+they are not supported, the pg_upgrade run
+will abort.  Use this in production to limit the upgrade run time

Re: file cloning in pg_upgrade and CREATE DATABASE

2018-06-08 Thread Robert Haas
On Wed, Jun 6, 2018 at 11:58 AM, Peter Eisentraut
 wrote:
> --reflink={always,auto,never}.  (This option name is adapted from GNU
...
>  The setting always requires the use of relinks.  If

Is it supposed to be relinks or reflinks?  The two lines above don't agree.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: file cloning in pg_upgrade and CREATE DATABASE

2018-06-11 Thread Peter Eisentraut
On 6/8/18 14:06, Robert Haas wrote:
> Is it supposed to be relinks or reflinks?  The two lines above don't agree.

It's supposed to be "reflinks".  I'll fix that.

I have also used the more general term "cloning" in the documentation.
We can discuss which term we should use more.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: file cloning in pg_upgrade and CREATE DATABASE

2018-11-07 Thread Peter Eisentraut
On 19/10/2018 01:50, Michael Paquier wrote:
> On Thu, Oct 18, 2018 at 11:59:00PM +0200, Peter Eisentraut wrote:
>> New patch that removes all the various reflink modes and adds a new
>> option --clone that works similar to --link.  I think it's much cleaner now.
> 
> Thanks Peter for the new version.
> 
> +
> + {"clone", no_argument, NULL, 1},
> +
>   {NULL, 0, NULL, 0}
> 
> Why newlines here?

fixed

> @@ -293,6 +300,7 @@ usage(void)
>   printf(_("  -U, --username=NAME   cluster superuser (default 
> \"%s\")\n"), os_info.user);
>   printf(_("  -v, --verbose enable verbose internal 
> logging\n"));
>   printf(_("  -V, --version display version information, 
> then exit\n"));
> + printf(_("  --clone   clone instead of copying 
> files to new cluster\n"));
>   printf(_("  -?, --helpshow this help, then 
> exit\n"));
>   printf(_("\n"
> 
> An idea for a one-letter option could be -n.  This patch can live
> without.

-n is often used for something like "dry run", so it didn't go for that
here.  I suspect the cloning will remain a marginal option for some
time, so having only a long option is acceptable.

> + pg_fatal("error while cloning relation \"%s.%s\": could not open file 
> \"%s\": %s\n",
> +  schemaName, relName, src, strerror(errno));
> 
> The tail of those error messages "could not open file" and "could not
> create file" are already available as translatable error messages.
> Would it be better to split those messages in two for translators?  One
> is generated with pg_fatal("error while cloning relation \"%s.%s\":
> could not open file \"%s\": %s\n",
> +  schemaName, relName, src, strerror(errno));

I think this is too complicated for a few messages.

> Those are all minor points, the patch looks good to me.

Committed, thanks.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: file cloning in pg_upgrade and CREATE DATABASE

2018-11-07 Thread Tom Lane
Peter Eisentraut  writes:
> Committed, thanks.

It seems unfortunate that there is no test coverage in the committed
patch.  I realize that it may be hard to test given the filesystem
dependency, but how will we know if it works at all?

regards, tom lane



Re: file cloning in pg_upgrade and CREATE DATABASE

2018-11-07 Thread Michael Paquier
On Wed, Nov 07, 2018 at 06:42:00PM +0100, Peter Eisentraut wrote:
> -n is often used for something like "dry run", so it didn't go for that
> here.  I suspect the cloning will remain a marginal option for some
> time, so having only a long option is acceptable.

Good point.  I didn't consider this, and it is true that --dry-run is
used, pg_rewind being one example.
--
Michael


signature.asc
Description: PGP signature


Re: file cloning in pg_upgrade and CREATE DATABASE

2018-11-07 Thread Peter Eisentraut
On 07/11/2018 19:03, Tom Lane wrote:
> Peter Eisentraut  writes:
>> Committed, thanks.
> 
> It seems unfortunate that there is no test coverage in the committed
> patch.  I realize that it may be hard to test given the filesystem
> dependency, but how will we know if it works at all?

You can use something like

PG_UPGRADE_OPTS=--clone make -C src/bin/pg_upgrade check

(--link is similarly untested.)

If we do get the TAP tests for pg_upgrade set up, we can create smaller
and faster test cases for this.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: file cloning in pg_upgrade and CREATE DATABASE

2018-08-31 Thread Peter Eisentraut
rebased patch, no functionality changes

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>From 5e0f60e9cf182063f2f711251430d79282be1f93 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Sat, 1 Sep 2018 07:05:02 +0200
Subject: [PATCH v4] pg_upgrade: Allow use of file cloning

For file copying in pg_upgrade, allow using special file cloning calls
if available.  This makes the copying faster and more space efficient.
This achieves speed similar to --link mode without the associated
drawbacks.

Add an option --reflink to select whether file cloning is turned on,
off, or automatic.  Automatic is the default.

On Linux, file cloning is supported on Btrfs and XFS (if formatted with
reflink support).  On macOS, file cloning is supported on APFS.
---
 configure|   2 +-
 configure.in |   2 +-
 doc/src/sgml/ref/pgupgrade.sgml  |  33 +
 src/bin/pg_upgrade/check.c   |   2 +
 src/bin/pg_upgrade/file.c| 123 +++
 src/bin/pg_upgrade/option.c  |  14 
 src/bin/pg_upgrade/pg_upgrade.h  |  15 
 src/bin/pg_upgrade/relfilenode.c |  31 +++-
 src/include/pg_config.h.in   |   3 +
 9 files changed, 220 insertions(+), 5 deletions(-)

diff --git a/configure b/configure
index dd94c5bbab..a324dd9ff7 100755
--- a/configure
+++ b/configure
@@ -15060,7 +15060,7 @@ fi
 LIBS_including_readline="$LIBS"
 LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'`
 
-for ac_func in cbrt clock_gettime dlopen fdatasync getifaddrs getpeerucred 
getrlimit mbstowcs_l memmove poll posix_fallocate pstat pthread_is_threaded_np 
readlink setproctitle setproctitle_fast setsid shm_open symlink sync_file_range 
utime utimes wcstombs_l
+for ac_func in cbrt clock_gettime copyfile dlopen fdatasync getifaddrs 
getpeerucred getrlimit mbstowcs_l memmove poll posix_fallocate pstat 
pthread_is_threaded_np readlink setproctitle setproctitle_fast setsid shm_open 
symlink sync_file_range utime utimes wcstombs_l
 do :
   as_ac_var=`$as_echo "ac_cv_func_$ac_func" | $as_tr_sh`
 ac_fn_c_check_func "$LINENO" "$ac_func" "$as_ac_var"
diff --git a/configure.in b/configure.in
index 3280afa0da..614262fc52 100644
--- a/configure.in
+++ b/configure.in
@@ -1544,7 +1544,7 @@ PGAC_FUNC_WCSTOMBS_L
 LIBS_including_readline="$LIBS"
 LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'`
 
-AC_CHECK_FUNCS([cbrt clock_gettime dlopen fdatasync getifaddrs getpeerucred 
getrlimit mbstowcs_l memmove poll posix_fallocate pstat pthread_is_threaded_np 
readlink setproctitle setproctitle_fast setsid shm_open symlink sync_file_range 
utime utimes wcstombs_l])
+AC_CHECK_FUNCS([cbrt clock_gettime copyfile dlopen fdatasync getifaddrs 
getpeerucred getrlimit mbstowcs_l memmove poll posix_fallocate pstat 
pthread_is_threaded_np readlink setproctitle setproctitle_fast setsid shm_open 
symlink sync_file_range utime utimes wcstombs_l])
 
 AC_REPLACE_FUNCS(fseeko)
 case $host_os in
diff --git a/doc/src/sgml/ref/pgupgrade.sgml b/doc/src/sgml/ref/pgupgrade.sgml
index d51146d641..d994218c44 100644
--- a/doc/src/sgml/ref/pgupgrade.sgml
+++ b/doc/src/sgml/ref/pgupgrade.sgml
@@ -182,6 +182,39 @@ Options
   display version information, then exit
  
 
+ 
+  
--reflink={always|auto|never}
+  
+   
+Determines whether pg_upgrade, when in copy
+mode, should use efficient file cloning (also known as
+reflinks) on some operating systems and file systems.
+This can result in near-instantaneous copying of the data files,
+giving the speed advantages of
+-k/--link while leaving the old
+cluster untouched.
+   
+
+   
+The setting always requires the use of reflinks.  If
+they are not supported, the pg_upgrade run
+will abort.  Use this in production to limit the upgrade run time.
+The setting auto uses reflinks when available,
+otherwise it falls back to a normal copy.  This is the default.  The
+setting never prevents use of reflinks and always
+uses a normal copy.  This can be useful to ensure that the upgraded
+cluster has its disk space fully allocated and not shared with the old
+cluster.
+   
+
+   
+At present, reflinks are supported on Linux (kernel 4.5 or later) with
+Btrfs and XFS (on file systems created with reflink support, which is
+not the default for XFS at this writing), and on macOS with APFS.
+   
+  
+ 
+
  
   -?
   --help
diff --git a/src/bin/pg_upgrade/check.c b/src/bin/pg_upgrade/check.c
index 5a78d603dc..eb1f18180a 100644
--- a/src/bin/pg_upgrade/check.c
+++ b/src/bin/pg_upgrade/check.c
@@ -151,6 +151,8 @@ check_new_cluster(void)
 
if (user_opts.transfer_mode == TRANSFER_MODE_LINK)
check_hard_link();
+   else if (user_opts.transfer_mode == TRANSF

Re: file cloning in pg_upgrade and CREATE DATABASE

2018-09-25 Thread Michael Paquier
Hi Peter,

On Sat, Sep 01, 2018 at 07:28:37AM +0200, Peter Eisentraut wrote:
> rebased patch, no functionality changes

Could you rebase once again?  I am going through the patch and wanted to
test pg_upgrade on Linux with XFS, but it does not apply anymore.

Thanks,
--
Michael


signature.asc
Description: PGP signature


Re: file cloning in pg_upgrade and CREATE DATABASE

2018-09-27 Thread Peter Eisentraut
On 26/09/2018 08:44, Michael Paquier wrote:
> On Sat, Sep 01, 2018 at 07:28:37AM +0200, Peter Eisentraut wrote:
>> rebased patch, no functionality changes
> 
> Could you rebase once again?  I am going through the patch and wanted to
> test pg_upgrade on Linux with XFS, but it does not apply anymore.

attached

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>From 9a58fc2589e50d69b4b158ea5e8f3898483290d0 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Thu, 27 Sep 2018 22:42:33 +0200
Subject: [PATCH v5] pg_upgrade: Allow use of file cloning

For file copying in pg_upgrade, allow using special file cloning calls
if available.  This makes the copying faster and more space efficient.
This achieves speed similar to --link mode without the associated
drawbacks.

Add an option --reflink to select whether file cloning is turned on,
off, or automatic.  Automatic is the default.

On Linux, file cloning is supported on Btrfs and XFS (if formatted with
reflink support).  On macOS, file cloning is supported on APFS.
---
 configure|   2 +-
 configure.in |   2 +-
 doc/src/sgml/ref/pgupgrade.sgml  |  33 +
 src/bin/pg_upgrade/check.c   |   2 +
 src/bin/pg_upgrade/file.c| 123 +++
 src/bin/pg_upgrade/option.c  |  14 
 src/bin/pg_upgrade/pg_upgrade.h  |  15 
 src/bin/pg_upgrade/relfilenode.c |  31 +++-
 src/include/pg_config.h.in   |   3 +
 9 files changed, 220 insertions(+), 5 deletions(-)

diff --git a/configure b/configure
index 6414ec1ea6..ae6f1a2e17 100755
--- a/configure
+++ b/configure
@@ -15100,7 +15100,7 @@ fi
 LIBS_including_readline="$LIBS"
 LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'`
 
-for ac_func in cbrt clock_gettime fdatasync getifaddrs getpeerucred getrlimit 
mbstowcs_l memmove poll posix_fallocate ppoll pstat pthread_is_threaded_np 
readlink setproctitle setproctitle_fast setsid shm_open symlink sync_file_range 
utime utimes wcstombs_l
+for ac_func in cbrt clock_gettime copyfile fdatasync getifaddrs getpeerucred 
getrlimit mbstowcs_l memmove poll posix_fallocate ppoll pstat 
pthread_is_threaded_np readlink setproctitle setproctitle_fast setsid shm_open 
symlink sync_file_range utime utimes wcstombs_l
 do :
   as_ac_var=`$as_echo "ac_cv_func_$ac_func" | $as_tr_sh`
 ac_fn_c_check_func "$LINENO" "$ac_func" "$as_ac_var"
diff --git a/configure.in b/configure.in
index 158d5a1ac8..265faf1b99 100644
--- a/configure.in
+++ b/configure.in
@@ -1571,7 +1571,7 @@ PGAC_FUNC_WCSTOMBS_L
 LIBS_including_readline="$LIBS"
 LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'`
 
-AC_CHECK_FUNCS([cbrt clock_gettime fdatasync getifaddrs getpeerucred getrlimit 
mbstowcs_l memmove poll posix_fallocate ppoll pstat pthread_is_threaded_np 
readlink setproctitle setproctitle_fast setsid shm_open symlink sync_file_range 
utime utimes wcstombs_l])
+AC_CHECK_FUNCS([cbrt clock_gettime copyfile fdatasync getifaddrs getpeerucred 
getrlimit mbstowcs_l memmove poll posix_fallocate ppoll pstat 
pthread_is_threaded_np readlink setproctitle setproctitle_fast setsid shm_open 
symlink sync_file_range utime utimes wcstombs_l])
 
 AC_REPLACE_FUNCS(fseeko)
 case $host_os in
diff --git a/doc/src/sgml/ref/pgupgrade.sgml b/doc/src/sgml/ref/pgupgrade.sgml
index d51146d641..d994218c44 100644
--- a/doc/src/sgml/ref/pgupgrade.sgml
+++ b/doc/src/sgml/ref/pgupgrade.sgml
@@ -182,6 +182,39 @@ Options
   display version information, then exit
  
 
+ 
+  
--reflink={always|auto|never}
+  
+   
+Determines whether pg_upgrade, when in copy
+mode, should use efficient file cloning (also known as
+reflinks) on some operating systems and file systems.
+This can result in near-instantaneous copying of the data files,
+giving the speed advantages of
+-k/--link while leaving the old
+cluster untouched.
+   
+
+   
+The setting always requires the use of reflinks.  If
+they are not supported, the pg_upgrade run
+will abort.  Use this in production to limit the upgrade run time.
+The setting auto uses reflinks when available,
+otherwise it falls back to a normal copy.  This is the default.  The
+setting never prevents use of reflinks and always
+uses a normal copy.  This can be useful to ensure that the upgraded
+cluster has its disk space fully allocated and not shared with the old
+cluster.
+   
+
+   
+At present, reflinks are supported on Linux (kernel 4.5 or later) with
+Btrfs and XFS (on file systems created with reflink support, which is
+not the default for XFS at this writing), and on macOS with APFS.
+   
+  
+ 
+
  
   -?
   --help
diff --git a/src/bin/pg_upgrade/check.c b/src/bin/pg_upgrade/check.c
index 5a78d603dc..eb1f18

Re: file cloning in pg_upgrade and CREATE DATABASE

2018-09-27 Thread Michael Paquier
On Thu, Sep 27, 2018 at 11:10:08PM +0200, Peter Eisentraut wrote:
> On 26/09/2018 08:44, Michael Paquier wrote:
>> Could you rebase once again?  I am going through the patch and wanted to
>> test pg_upgrade on Linux with XFS, but it does not apply anymore.
> 
> attached

Thanks for the rebase.  At the end I got my hands on only an APFS using
a mac.  I ran a test with an instance holding a database with pgbench at
scaling factor 500, which gives close to 6.5GB.  The results are nice on
my laptop:
- --reflink=never runs in 15s
- --reflink=always runs in 4s
So that's a very nice gain!

+static bool cloning_ok = true;
+
+pg_log(PG_VERBOSE, "copying \"%s\" to \"%s\"\n",
+   old_file, new_file);
+if (cloning_ok &&
+!cloneFile(old_file, new_file, map->nspname, map->relname, true))
+{
+pg_log(PG_VERBOSE, "cloning not supported, switching to copying\n");
+cloning_ok = false;
+copyFile(old_file, new_file, map->nspname, map->relname);
+}
+else
+copyFile(old_file, new_file, map->nspname, map->relname);

This part overlaps with the job that check_reflink() already does.
Wouldn't it be more simple to have check_reflink do a one-time check
with the automatic mode, enforcing to REFLINK_NEVER if cloning test did
not pass when REFLINK_AUTO is used?  This would simplify
transfer_relfile().

The --help output of pg_upgrade has not been updated.

I am not a fan of the --reflink interface to be honest, even if this
maps to what cp offers, particularly because there is already the --link
mode, and that the new option interacts with it.  Here is an idea of
interface with an option named, named say, --transfer-method:
- "link", maps to the existing --link, which is just kept as a
deprecated alias.
- "clone" is the new mode you propose.
- "copy" is the default, and copies directly files.  This automatic mode
also makes the implementation around transfer_relfile more difficult to
apprehend in my opinion, and I would think that all the different
transfer modes ought to be maintained within it.  pg_upgrade.h also has
logic for such transfer modes.

After that, the implementation of cloneFile() looks logically correct to
me.
--
Michael


signature.asc
Description: PGP signature


Re: file cloning in pg_upgrade and CREATE DATABASE

2018-09-30 Thread Michael Paquier
On Fri, Sep 28, 2018 at 02:19:53PM +0900, Michael Paquier wrote:
> Thanks for the rebase.  At the end I got my hands on only an APFS using
> a mac.  I ran a test with an instance holding a database with pgbench at
> scaling factor 500, which gives close to 6.5GB.  The results are nice on
> my laptop:
> - --reflink=never runs in 15s
> - --reflink=always runs in 4s
> So that's a very nice gain!

Please note that I have moved the patch to CF 2018-11, as the last
review is very recent.
--
Michael


signature.asc
Description: PGP signature


Re: file cloning in pg_upgrade and CREATE DATABASE

2018-10-02 Thread Peter Eisentraut
On 28/09/2018 07:19, Michael Paquier wrote:
> +static bool cloning_ok = true;
> +
> +pg_log(PG_VERBOSE, "copying \"%s\" to \"%s\"\n",
> +   old_file, new_file);
> +if (cloning_ok &&
> +!cloneFile(old_file, new_file, map->nspname, map->relname, true))
> +{
> +pg_log(PG_VERBOSE, "cloning not supported, switching to copying\n");
> +cloning_ok = false;
> +copyFile(old_file, new_file, map->nspname, map->relname);
> +}
> +else
> +copyFile(old_file, new_file, map->nspname, map->relname);
> 
> This part overlaps with the job that check_reflink() already does.
> Wouldn't it be more simple to have check_reflink do a one-time check
> with the automatic mode, enforcing to REFLINK_NEVER if cloning test did
> not pass when REFLINK_AUTO is used?  This would simplify
> transfer_relfile().

I'll look into that.

> The --help output of pg_upgrade has not been updated.

will fix

> I am not a fan of the --reflink interface to be honest, even if this
> maps to what cp offers, particularly because there is already the --link
> mode, and that the new option interacts with it.  Here is an idea of
> interface with an option named, named say, --transfer-method:
> - "link", maps to the existing --link, which is just kept as a
> deprecated alias.
> - "clone" is the new mode you propose.
> - "copy" is the default, and copies directly files.  This automatic mode
> also makes the implementation around transfer_relfile more difficult to
> apprehend in my opinion, and I would think that all the different
> transfer modes ought to be maintained within it.  pg_upgrade.h also has
> logic for such transfer modes.

I can see the argument for that.  But I don't understand where the
automatic mode fits into this.  I would like to keep all three modes
from my patch: copy, clone-if-possible, clone-or-fail, unless you want
to argue against that.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: file cloning in pg_upgrade and CREATE DATABASE

2018-10-02 Thread Michael Paquier
On Tue, Oct 02, 2018 at 02:31:35PM +0200, Peter Eisentraut wrote:
> I can see the argument for that.  But I don't understand where the
> automatic mode fits into this.  I would like to keep all three modes
> from my patch: copy, clone-if-possible, clone-or-fail, unless you want
> to argue against that.

I'd like to argue against that :)

There could be an argument for having an automatic more within this
scheme, still I am not really a fan of this.  When somebody integrates
pg_upgrade within an upgrade framework, they would likely test if
cloning actually works, bumping immediately on a failure, no?  I'd like
to think that copy should be the default, cloning being available as an
option.  Cloning is not supported on many filesystems anyway..
--
Michael


signature.asc
Description: PGP signature


Re: file cloning in pg_upgrade and CREATE DATABASE

2018-10-02 Thread Alvaro Herrera
On 2018-Oct-03, Michael Paquier wrote:

> There could be an argument for having an automatic more within this
> scheme, still I am not really a fan of this.  When somebody integrates
> pg_upgrade within an upgrade framework, they would likely test if
> cloning actually works, bumping immediately on a failure, no?  I'd like
> to think that copy should be the default, cloning being available as an
> option.  Cloning is not supported on many filesystems anyway..

I'm not clear what interface are you proposing.  Maybe they would just
use the clone-or-fail mode, and note whether it fails?  If that's not
it, please explain.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: file cloning in pg_upgrade and CREATE DATABASE

2018-10-02 Thread Michael Paquier
On Tue, Oct 02, 2018 at 10:35:02PM -0300, Alvaro Herrera wrote:
> I'm not clear what interface are you proposing.  Maybe they would just
> use the clone-or-fail mode, and note whether it fails?  If that's not
> it, please explain.

Okay.  What I am proposing is to not have any kind of automatic mode to
keep the code simple, with a new option called --transfer-mode, able to
do three things:
- "link", which is the equivalent of the existing --link.
- "copy", the default and the current behavior, which copies files.
- "clone", the new mode proposed in this thread.
--
Michael


signature.asc
Description: PGP signature


Re: file cloning in pg_upgrade and CREATE DATABASE

2018-10-02 Thread Alvaro Herrera
On 2018-Oct-03, Michael Paquier wrote:

> On Tue, Oct 02, 2018 at 10:35:02PM -0300, Alvaro Herrera wrote:
> > I'm not clear what interface are you proposing.  Maybe they would just
> > use the clone-or-fail mode, and note whether it fails?  If that's not
> > it, please explain.
> 
> Okay.  What I am proposing is to not have any kind of automatic mode to
> keep the code simple, with a new option called --transfer-mode, able to
> do three things:
> - "link", which is the equivalent of the existing --link.
> - "copy", the default and the current behavior, which copies files.
> - "clone", the new mode proposed in this thread.

I see.  Peter is proposing to have a fourth mode, essentially
--transfer-mode=clone-or-copy.

Thinking about a generic tool that wraps pg_upgrade (say, Debian's
wrapper for it) this makes sense: just use the fastest non-destructive
mode available.  Which ones are available depends on what the underlying
filesystem is, so it's not up to the tool's writer to decide which to
use ahead of time.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: file cloning in pg_upgrade and CREATE DATABASE

2018-10-03 Thread Michael Paquier
On Tue, Oct 02, 2018 at 11:07:06PM -0300, Alvaro Herrera wrote:
> I see.  Peter is proposing to have a fourth mode, essentially
> --transfer-mode=clone-or-copy.

Yes, a mode which depends on what the file system supports.  Perhaps
"safe" or "fast" could be another name, in the shape of the fastest
method available which does not destroy the existing cluster's data.

> Thinking about a generic tool that wraps pg_upgrade (say, Debian's
> wrapper for it) this makes sense: just use the fastest non-destructive
> mode available.  Which ones are available depends on what the underlying
> filesystem is, so it's not up to the tool's writer to decide which to
> use ahead of time.

This could have merit.  Now, it seems to me that we have two separate
concepts here, which should be addressed separately:
1) cloning file mode, which is the new actual feature.
2) automatic mode, which is a subset of the copy mode and the new clone
mode.  What I am not sure when it comes to that stuff is if we should
consider in some way the link mode as being part of this automatic
selection concept..
--
Michael


signature.asc
Description: PGP signature


Re: file cloning in pg_upgrade and CREATE DATABASE

2018-10-10 Thread Bruce Momjian
On Tue, Oct  2, 2018 at 11:07:06PM -0300, Alvaro Herrera wrote:
> On 2018-Oct-03, Michael Paquier wrote:
> 
> > On Tue, Oct 02, 2018 at 10:35:02PM -0300, Alvaro Herrera wrote:
> > > I'm not clear what interface are you proposing.  Maybe they would just
> > > use the clone-or-fail mode, and note whether it fails?  If that's not
> > > it, please explain.
> > 
> > Okay.  What I am proposing is to not have any kind of automatic mode to
> > keep the code simple, with a new option called --transfer-mode, able to
> > do three things:
> > - "link", which is the equivalent of the existing --link.
> > - "copy", the default and the current behavior, which copies files.
> > - "clone", the new mode proposed in this thread.
> 
> I see.  Peter is proposing to have a fourth mode, essentially
> --transfer-mode=clone-or-copy.

Uh, if you use --link, and the two data directories are on different
file systems, it fails.  No one has ever asked for link-or-copy, so why
are we considering clone-or-copy?  Are we going to need
link-or-clone-or-copy too?  I do realize that clone and copy have
non-destructive behavior on the old cluster once started, so it does
make some sense to merge them, unlike link.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +



Re: file cloning in pg_upgrade and CREATE DATABASE

2018-10-11 Thread Peter Eisentraut
On 10/10/2018 21:50, Bruce Momjian wrote:
>> I see.  Peter is proposing to have a fourth mode, essentially
>> --transfer-mode=clone-or-copy.
> 
> Uh, if you use --link, and the two data directories are on different
> file systems, it fails.  No one has ever asked for link-or-copy, so why
> are we considering clone-or-copy?  Are we going to need
> link-or-clone-or-copy too?  I do realize that clone and copy have
> non-destructive behavior on the old cluster once started, so it does
> make some sense to merge them, unlike link.

I'm OK to get rid of the clone-or-copy mode.  I can see the confusion.

The original reason for this behavior was that the Linux implementation
used copy_file_range(), which does clone-or-copy without any way to
control it.  But the latest patch doesn't use that anymore, so we don't
really need it, if it's controversial.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: file cloning in pg_upgrade and CREATE DATABASE

2018-10-18 Thread Peter Eisentraut
On 11/10/2018 16:50, Peter Eisentraut wrote:
> On 10/10/2018 21:50, Bruce Momjian wrote:
>>> I see.  Peter is proposing to have a fourth mode, essentially
>>> --transfer-mode=clone-or-copy.
>>
>> Uh, if you use --link, and the two data directories are on different
>> file systems, it fails.  No one has ever asked for link-or-copy, so why
>> are we considering clone-or-copy?  Are we going to need
>> link-or-clone-or-copy too?  I do realize that clone and copy have
>> non-destructive behavior on the old cluster once started, so it does
>> make some sense to merge them, unlike link.
> 
> I'm OK to get rid of the clone-or-copy mode.  I can see the confusion.

New patch that removes all the various reflink modes and adds a new
option --clone that works similar to --link.  I think it's much cleaner now.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From 8b7285e61e425fad4d07a011efe2ccd7fd280d54 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Thu, 18 Oct 2018 23:09:59 +0200
Subject: [PATCH v6] pg_upgrade: Allow use of file cloning

Add another transfer mode --clone to pg_upgrade (besides the existing
--link and the default copy), using special file cloning calls.  This
makes the file transfer faster and more space efficient, achieving
speed similar to --link mode without the associated drawbacks.

On Linux, file cloning is supported on Btrfs and XFS (if formatted with
reflink support).  On macOS, file cloning is supported on APFS.
---
 configure|  2 +-
 configure.in |  1 +
 doc/src/sgml/ref/pgupgrade.sgml  | 37 ++---
 src/bin/pg_upgrade/check.c   | 13 -
 src/bin/pg_upgrade/file.c| 90 
 src/bin/pg_upgrade/option.c  |  8 +++
 src/bin/pg_upgrade/pg_upgrade.h  |  6 ++-
 src/bin/pg_upgrade/relfilenode.c | 44 ++--
 src/include/pg_config.h.in   |  3 ++
 9 files changed, 179 insertions(+), 25 deletions(-)

diff --git a/configure b/configure
index 43ae8c869d..5a7aa02b27 100755
--- a/configure
+++ b/configure
@@ -15130,7 +15130,7 @@ fi
 LIBS_including_readline="$LIBS"
 LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'`
 
-for ac_func in cbrt clock_gettime fdatasync getifaddrs getpeerucred getrlimit 
mbstowcs_l memmove poll posix_fallocate ppoll pstat pthread_is_threaded_np 
readlink setproctitle setproctitle_fast setsid shm_open strchrnul symlink 
sync_file_range utime utimes wcstombs_l
+for ac_func in cbrt clock_gettime copyfile fdatasync getifaddrs getpeerucred 
getrlimit mbstowcs_l memmove poll posix_fallocate ppoll pstat 
pthread_is_threaded_np readlink setproctitle setproctitle_fast setsid shm_open 
strchrnul symlink sync_file_range utime utimes wcstombs_l
 do :
   as_ac_var=`$as_echo "ac_cv_func_$ac_func" | $as_tr_sh`
 ac_fn_c_check_func "$LINENO" "$ac_func" "$as_ac_var"
diff --git a/configure.in b/configure.in
index 519ecd5e1e..7f22bcee75 100644
--- a/configure.in
+++ b/configure.in
@@ -1602,6 +1602,7 @@ LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 
's/-lreadline//g'`
 AC_CHECK_FUNCS(m4_normalize([
cbrt
clock_gettime
+   copyfile
fdatasync
getifaddrs
getpeerucred
diff --git a/doc/src/sgml/ref/pgupgrade.sgml b/doc/src/sgml/ref/pgupgrade.sgml
index d51146d641..e616ea9dbc 100644
--- a/doc/src/sgml/ref/pgupgrade.sgml
+++ b/doc/src/sgml/ref/pgupgrade.sgml
@@ -182,6 +182,25 @@ Options
   display version information, then exit
  
 
+ 
+  --clone
+  
+   
+Use efficient file cloning (also known as reflinks)
+instead of copying files to the new cluster.  This can result in
+near-instantaneous copying of the data files, giving the speed
+advantages of -k/--link while
+leaving the old cluster untouched.
+   
+
+   
+At present, reflinks are supported on Linux (kernel 4.5 or later) with
+Btrfs and XFS (on file systems created with reflink support, which is
+not the default for XFS at this writing), and on macOS with APFS.
+   
+  
+ 
+
  
   -?
   --help
@@ -340,7 +359,7 @@ Run pg_upgrade
  Always run the pg_upgrade binary of the new 
server, not the old one.
  pg_upgrade requires the specification of the 
old and new cluster's
  data and executable (bin) directories. You can also 
specify
- user and port values, and whether you want the data files linked
+ user and port values, and whether you want the data files linked or cloned
  instead of the default copy behavior.
 
 
@@ -351,8 +370,12 @@ Run pg_upgrade
  once you start the new cluster after the upgrade.  Link mode also
  requires that the old and new cluster data directories be in the
  same file system.  (Tablespaces and pg_wal can be on
- different file systems.)  See pg_upgrade --help for a 
full
- list of options.
+ different file sy

Re: file cloning in pg_upgrade and CREATE DATABASE

2018-10-18 Thread Michael Paquier
On Thu, Oct 18, 2018 at 11:59:00PM +0200, Peter Eisentraut wrote:
> New patch that removes all the various reflink modes and adds a new
> option --clone that works similar to --link.  I think it's much cleaner now.

Thanks Peter for the new version.

+
+   {"clone", no_argument, NULL, 1},
+
{NULL, 0, NULL, 0}

Why newlines here?

@@ -293,6 +300,7 @@ usage(void)
printf(_("  -U, --username=NAME   cluster superuser (default 
\"%s\")\n"), os_info.user);
printf(_("  -v, --verbose enable verbose internal 
logging\n"));
printf(_("  -V, --version display version information, 
then exit\n"));
+   printf(_("  --clone   clone instead of copying 
files to new cluster\n"));
printf(_("  -?, --helpshow this help, then 
exit\n"));
printf(_("\n"

An idea for a one-letter option could be -n.  This patch can live
without.

+ pg_fatal("error while cloning relation \"%s.%s\": could not open file 
\"%s\": %s\n",
+  schemaName, relName, src, strerror(errno));

The tail of those error messages "could not open file" and "could not
create file" are already available as translatable error messages.
Would it be better to split those messages in two for translators?  One
is generated with pg_fatal("error while cloning relation \"%s.%s\":
could not open file \"%s\": %s\n",
+  schemaName, relName, src, strerror(errno));

The tail of those error messages "could not open file" and "could not
create file" are already available as translatable error messages.
Would it be better to split those messages in two for translators?  One
is generated with PG_VERBOSE to mention that cloning checks are
done, and the second one is fatal with the actual errors.

Those are all minor points, the patch looks good to me.
--
Michael


signature.asc
Description: PGP signature


Re: file cloning in pg_upgrade and CREATE DATABASE

2018-03-19 Thread Michael Paquier
On Mon, Mar 19, 2018 at 04:14:15PM +0900, Michael Paquier wrote:
> Two other things I have noticed as well:
> 1) src/bin/pg_rewind/copy_fetch.c could benefit from similar speed-ups I
> think when copying data from source to target using the local mode of
> pg_rewind.  This could really improve cases where new relations are
> added after a promotion.
> 2) XLogFileCopy() uses a copy logic as well.  For large segments things
> could be improved, however we need to be careful about filling in the
> end of segments with zeros.

I have been thinking about this patch over the night, and here is a list
of bullet points which would be nice to tackle:
- Remove the current diff in copydir.
- Extend copy_file so as it is able to use fcopyfile.
- Move the work done in pg_upgrade into a common API which can as well
be used by pg_rewind as well.  One place would be to have a
frontend-only API in src/common which does the leg work.  I would
recommend working only on file descriptors as well for consistency with
copy_file_range.
- Add proper wait events for the backend calls.  Those are missing for
copy_file_range and copyfile.
- For XLogFileCopy, the problem may be trickier as the tail of a segment
is filled with zeroes, so dropping it from the first version of the
patch sounds wiser.

Patch is switched as waiting on author, I have set myself as a
reviewer.

Thanks,
--
Michael


signature.asc
Description: PGP signature


Re: file cloning in pg_upgrade and CREATE DATABASE

2018-03-20 Thread Peter Eisentraut
On 3/19/18 22:58, Michael Paquier wrote:
> I have been thinking about this patch over the night, and here is a list
> of bullet points which would be nice to tackle:
> - Remove the current diff in copydir.

done

> - Extend copy_file so as it is able to use fcopyfile.

fcopyfile() does not support cloning.  (This is not documented.)

> - Move the work done in pg_upgrade into a common API which can as well
> be used by pg_rewind as well.  One place would be to have a
> frontend-only API in src/common which does the leg work.  I would
> recommend working only on file descriptors as well for consistency with
> copy_file_range.

pg_upgrade copies files, whereas pg_rewind needs to copy file ranges.
So I don't think this is going to be a good match.

We could add support for using Linux copy_file_range() in pg_rewind, but
that would work a bit differently.  I also don't have a good sense of
how to test the performance of that.

Another thing to think about is that we go through some trouble to
initialize new WAL files so that the disk space is fully allocated.  If
we used file cloning calls in pg_rewind, that would potentially
invalidate some of that.  At least, we'd have to think through this more
carefully.

> - Add proper wait events for the backend calls.  Those are missing for
> copy_file_range and copyfile.

done

> - For XLogFileCopy, the problem may be trickier as the tail of a segment
> is filled with zeroes, so dropping it from the first version of the
> patch sounds wiser.

Seems like a possible follow-on project.  But see also under pg_rewind
above.

Another oddity is that pg_upgrade uses CopyFile() on Windows, but the
backend does not.  The Git log shows that the backend used to use
CopyFile(), but that was then removed when the generic code was added,
but when pg_upgrade was imported, it came with the CopyFile() call.

I suspect the CopyFile() call can be quite a bit faster, so we should
consider adding it back in.  Or if not, remove it from pg_upgrade.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From bd8fe105f6b1c64098e344c4a7d0fc9c94d2e31d Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Tue, 20 Mar 2018 10:21:47 -0400
Subject: [PATCH v2] Use file cloning in pg_upgrade and CREATE DATABASE

For file copying in pg_upgrade and CREATE DATABASE, use special file
cloning calls if available.  This makes the copying faster and more
space efficient.  For pg_upgrade, this achieves speed similar to --link
mode without the associated drawbacks.  Other backend users of copydir.c
will also take advantage of these changes, but the performance
improvement will probably not be as noticeable there.

On Linux, use copy_file_range().  This supports file cloning
automatically on Btrfs and XFS (if formatted with reflink support).  On
macOS, use copyfile(), which supports file cloning on APFS.

Even on file systems without cloning/reflink support, this is faster
than the existing code, because it avoids copying the file contents out
of kernel space and allows the OS to apply other optimizations.
---
 configure  |  2 +-
 configure.in   |  2 +-
 doc/src/sgml/monitoring.sgml   |  8 +++-
 doc/src/sgml/ref/pgupgrade.sgml| 11 +
 src/backend/postmaster/pgstat.c|  3 ++
 src/backend/storage/file/copydir.c | 84 ++
 src/backend/storage/file/reinit.c  |  3 +-
 src/bin/pg_upgrade/file.c  | 56 +++--
 src/include/pg_config.h.in |  6 +++
 src/include/pgstat.h   |  1 +
 10 files changed, 141 insertions(+), 35 deletions(-)

diff --git a/configure b/configure
index 3943711283..f27c78f63a 100755
--- a/configure
+++ b/configure
@@ -13085,7 +13085,7 @@ fi
 LIBS_including_readline="$LIBS"
 LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'`
 
-for ac_func in cbrt clock_gettime dlopen fdatasync getifaddrs getpeerucred 
getrlimit mbstowcs_l memmove poll posix_fallocate pstat pthread_is_threaded_np 
readlink setproctitle setsid shm_open symlink sync_file_range utime utimes 
wcstombs_l
+for ac_func in cbrt clock_gettime copy_file_range copyfile dlopen fdatasync 
getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll posix_fallocate pstat 
pthread_is_threaded_np readlink setproctitle setsid shm_open symlink 
sync_file_range utime utimes wcstombs_l
 do :
   as_ac_var=`$as_echo "ac_cv_func_$ac_func" | $as_tr_sh`
 ac_fn_c_check_func "$LINENO" "$ac_func" "$as_ac_var"
diff --git a/configure.in b/configure.in
index 1babdbb755..7eb8673753 100644
--- a/configure.in
+++ b/configure.in
@@ -1428,7 +1428,7 @@ PGAC_FUNC_WCSTOMBS_L
 LIBS_including_readline="$LIBS"
 LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'`

Re: file cloning in pg_upgrade and CREATE DATABASE

2018-03-21 Thread Michael Paquier
On Tue, Mar 20, 2018 at 10:55:04AM -0400, Peter Eisentraut wrote:
> On 3/19/18 22:58, Michael Paquier wrote:
>> - Extend copy_file so as it is able to use fcopyfile.
> 
> fcopyfile() does not support cloning.  (This is not documented.)

You are right.  I have been reading the documentation here to get an
idea as I don't have a macos system at hand:
https://www.unix.com/man-page/osx/3/fcopyfile/

However I have bumped into that:
http://www.openradar.me/30706426

Future versions will be visibly fixed.

>> - Move the work done in pg_upgrade into a common API which can as well
>> be used by pg_rewind as well.  One place would be to have a
>> frontend-only API in src/common which does the leg work.  I would
>> recommend working only on file descriptors as well for consistency with
>> copy_file_range.
> 
> pg_upgrade copies files, whereas pg_rewind needs to copy file ranges.
> So I don't think this is going to be a good match.
>
> We could add support for using Linux copy_file_range() in pg_rewind, but
> that would work a bit differently.  I also don't have a good sense of
> how to test the performance of that.

One simple way to test that would be to limit the time it takes to scan
the WAL segments on the target so as the filemap is computed quickly,
and create many, say gigabyte-size relations on the promoted source
which will need to be copied from the source to the target.

> Another thing to think about is that we go through some trouble to
> initialize new WAL files so that the disk space is fully allocated.  If
> we used file cloning calls in pg_rewind, that would potentially
> invalidate some of that.  At least, we'd have to think through this more
> carefully.

Agreed.  Let's keep in mind such things but come with a sane, first cut
of this patch based on the time remaining in this commit fest.

>> - Add proper wait events for the backend calls.  Those are missing for
>> copy_file_range and copyfile.
> 
> done

+ CopyFileCopy
+ Waiting for a file copy operation (if the copying is done by
+ an operating system call rather than as separate read and write
+ operations).
CopyFileCopy is... Redundant.  Perhaps CopyFileSystem or CopyFileRange?

>> - For XLogFileCopy, the problem may be trickier as the tail of a segment
>> is filled with zeroes, so dropping it from the first version of the
>> patch sounds wiser.
> 
> Seems like a possible follow-on project.  But see also under pg_rewind
> above.

No objections to do that in the future for both.

> Another oddity is that pg_upgrade uses CopyFile() on Windows, but the
> backend does not.  The Git log shows that the backend used to use
> CopyFile(), but that was then removed when the generic code was added,
> but when pg_upgrade was imported, it came with the CopyFile() call.

You mean 558730ac, right?

> I suspect the CopyFile() call can be quite a bit faster, so we should
> consider adding it back in.  Or if not, remove it from pg_upgrade.

Hm.  The proposed patch also removes an important property of what
happens now in copy_file: the copied files are periodically synced to
avoid spamming the cache, so for some loads wouldn't this cause a
performance regression?

At least on Linux it is possible to rely on sync_file_range which is
called via pg_flush_data, so it seems to me that we ought to roughly
keep the loop working on FLUSH_DISTANCE, and replace the calls of
read/write by copy_file_range.  copyfile is only able to do a complete
file copy, so we would also lose this property as well on Linux.  Even
for Windows using CopyFile would be a step backwards for the backend.

pg_upgrade is different though as it copies files fully, so using both
copyfile and copy_file_range makes sense.

At the end, it seems to me that using copy_file_range has some values as
you save a set of read/write calls, but copyfile comes with its
limitations, which I think will cause side issues, so I would recommend
dropping it from a first cut of the patch for the backend.
--
Michael


signature.asc
Description: PGP signature


Re: file cloning in pg_upgrade and CREATE DATABASE

2018-03-23 Thread Bruce Momjian
I think this documentation change:

+   leaving the old cluster untouched.  At present, this is supported 
on Linux
-

would be better by changing "untouched" to "unmodified".

Also, it would be nice if users could easily know if pg_upgrade is going
to use COW or not because it might affect whether they choose --link or
not.  Right now it seems unclear how a user would know.  Can we have
pg_upgrade --check perhaps output something.  Can we also have the
pg_upgrade status display indicate that too, e.g. change

Copying user relation files

to

Copying (copy-on-write) user relation files

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +



Re: file cloning in pg_upgrade and CREATE DATABASE

2018-03-25 Thread Peter Eisentraut
On 3/23/18 13:16, Bruce Momjian wrote:
> Also, it would be nice if users could easily know if pg_upgrade is going
> to use COW or not because it might affect whether they choose --link or
> not.  Right now it seems unclear how a user would know.  Can we have
> pg_upgrade --check perhaps output something.  Can we also have the
> pg_upgrade status display indicate that too, e.g. change
> 
>   Copying user relation files
> 
> to
> 
>   Copying (copy-on-write) user relation files

That would be nice, but we don't have a way to tell that, AFAICT.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: file cloning in pg_upgrade and CREATE DATABASE

2018-03-25 Thread Peter Eisentraut
On 3/21/18 22:38, Michael Paquier wrote:
> At least on Linux it is possible to rely on sync_file_range which is
> called via pg_flush_data, so it seems to me that we ought to roughly
> keep the loop working on FLUSH_DISTANCE, and replace the calls of
> read/write by copy_file_range.  copyfile is only able to do a complete
> file copy, so we would also lose this property as well on Linux.

I have shown earlier in the thread that copy_file_range in one go is
still better than doing it in pieces.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: file cloning in pg_upgrade and CREATE DATABASE

2018-03-25 Thread Michael Paquier
On Sun, Mar 25, 2018 at 09:33:38PM -0400, Peter Eisentraut wrote:
> On 3/21/18 22:38, Michael Paquier wrote:
>> At least on Linux it is possible to rely on sync_file_range which is
>> called via pg_flush_data, so it seems to me that we ought to roughly
>> keep the loop working on FLUSH_DISTANCE, and replace the calls of
>> read/write by copy_file_range.  copyfile is only able to do a complete
>> file copy, so we would also lose this property as well on Linux.
> 
> I have shown earlier in the thread that copy_file_range in one go is
> still better than doing it in pieces.

f8c183a has introduced the optimization that your patch is removing,
which was discussed on this thread:
https://www.postgresql.org/message-id/flat/4B78906A.7020309%40mark.mielke.cc
I am not much into the internals of copy_file_range, but isn't there a
risk to have a large range of blocks copied to discard potentially
useful blocks from the OS cache?  That's what this patch makes me worry
about.  Performance is good, but on a system where the OS cache is
heavily used for a set of hot blocks this could cause performance side
effects that I think we canot neglect.

Another thing is that 71d6d07 allowed a couple of database commands to
be more sensitive to interruptions.  With large databases used as a base
template it seems to me that this would cause the interruptions to be
less responsive.
--
Michael


signature.asc
Description: PGP signature


Re: file cloning in pg_upgrade and CREATE DATABASE

2018-03-30 Thread Peter Eisentraut
On 3/26/18 02:15, Michael Paquier wrote:
> f8c183a has introduced the optimization that your patch is removing,
> which was discussed on this thread:
> https://www.postgresql.org/message-id/flat/4B78906A.7020309%40mark.mielke.cc

Note that that thread is from 2010 and talks about creation of a
database from the standard template being too slow on spinning rust,
because we fsync too often.  I think we have moved well past that
problem size.

I have run some more tests on both macOS and Linux with ext4, and my
results are that the bigger the flush distance, the better.  Before we
made the adjustments for APFS, we had a flush size of 64kB, now it's 1MB
and 32MB on macOS.  In my tests, I see 256MB as the best across both
platforms, and not flushing early at all is only minimally worse.

You can measure this to death, and this obviously doesn't apply equally
on all systems and configurations, but clearly some of the old
assumptions from 8 years ago are no longer applicable.

> I am not much into the internals of copy_file_range, but isn't there a
> risk to have a large range of blocks copied to discard potentially
> useful blocks from the OS cache?  That's what this patch makes me worry
> about.  Performance is good, but on a system where the OS cache is
> heavily used for a set of hot blocks this could cause performance side
> effects that I think we canot neglect.

How would we go about assessing that?  It's possible, but if
copy_file_range() really blows away all your in-use cache, that would be
surprising.

> Another thing is that 71d6d07 allowed a couple of database commands to
> be more sensitive to interruptions.  With large databases used as a base
> template it seems to me that this would cause the interruptions to be
> less responsive.

The maximum file size that we copy is 1GB and that nowadays takes maybe
10 seconds.  I think that would be an acceptable response time.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: file cloning in pg_upgrade and CREATE DATABASE

2018-04-05 Thread Peter Eisentraut
I think we have raised a number of interesting issues here which require
more deeper consideration.  So I suggest to set this patch to Returned
with feedback.

Btw., I just learned that copy_file_range() only works on files on the
same device.  So more arrangements will need to be made for that.

> I have run some more tests on both macOS and Linux with ext4, and my> results 
> are that the bigger the flush distance, the better.  Before
we> made the adjustments for APFS, we had a flush size of 64kB, now it's
1MB> and 32MB on macOS.  In my tests, I see 256MB as the best across
both> platforms, and not flushing early at all is only minimally worse.
Based on this, I suggest that we set the flush distance to 32MB on all
platforms.  Not only is it faster, it avoids having different settings
on some platforms.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services