Re: [HACKERS] pg_rewind failure by file deletion in source server
On 08/03/2015 07:01 AM, Michael Paquier wrote: On Sun, Aug 2, 2015 at 4:01 AM, Heikki Linnakangas wrote: Perhaps it's best if we copy all the WAL files from source in copy-mode, but not in libpq mode. Regarding old WAL files in the target, it's probably best to always leave them alone. They should do no harm, and as a general principle it's best to avoid destroying evidence. It'd be nice to get some fix for this for alpha2, so I'll commit a fix to do that on Monday, unless we come to a different conclusion before that. +1. Both things sound like a good plan to me. I had some trouble implementing that. Recovery seemed to get confused sometimes, when it didn't find some of the WAL files in pg_xlog directory, even though it could fetch them through streaming replication. I'll have to investigate that further, but in the meantime, to have some fix in place for alpha2, I committed an even simpler fix for the immediate issue that pg_xlog is a symlink: just pretend that pg_xlog is a normal directory, even when it's a symlink. I'll continue to investigate what was wrong with my initial attempt. And it would be nice to avoid copying the pre-allocated WAL files from the source, because it's really unnecessary. But this fixes the immediate problem that pg_rewind didn't work at all if pg_xlog was a symlink. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
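The interim fix Heikki describes, treating pg_xlog as a normal directory even when it is a symlink, amounts to one special case in the directory traversal. The sketch below illustrates that rule only; it is not pg_rewind's actual C implementation, and `walk_data_dir` is a hypothetical name:

```python
import os

def walk_data_dir(pgdata):
    """Yield file paths under pgdata, skipping symlinks as pg_rewind does,
    except for pg_xlog, which is followed as if it were a plain directory
    (sketch of the workaround described in the message above)."""
    for name in sorted(os.listdir(pgdata)):
        path = os.path.join(pgdata, name)
        # Special case: pretend pg_xlog is a normal directory even when
        # it is a symlink, instead of ignoring it like other symlinks.
        if os.path.islink(path) and name != "pg_xlog":
            continue
        if os.path.isdir(path):
            for sub in walk_data_dir(path):
                yield sub
        else:
            yield path
```

With this rule, a data directory whose pg_xlog is symlinked to another volume is traversed normally, while other symlinks (e.g. a config file linked from outside the data directory) are still left alone.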
Re: [HACKERS] pg_rewind failure by file deletion in source server
On Fri, Jul 17, 2015 at 12:28 PM, Michael Paquier michael.paqu...@gmail.com wrote: On Wed, Jul 1, 2015 at 9:31 PM, Fujii Masao masao.fu...@gmail.com wrote: On Wed, Jul 1, 2015 at 2:21 AM, Heikki Linnakangas hlinn...@iki.fi wrote: On 06/29/2015 09:44 AM, Michael Paquier wrote: On Mon, Jun 29, 2015 at 4:55 AM, Heikki Linnakangas wrote: But we'll still need to handle the pg_xlog symlink case somehow. Perhaps it would be enough to special-case pg_xlog for now. Well, sure, pg_rewind does not copy the soft links either way. Now it would be nice to have an option to be able to recreate the soft link of at least pg_xlog even if it can be scripted as well after a run. Hmm. I'm starting to think that pg_rewind should ignore pg_xlog entirely. In any non-trivial scenarios, just copying all the files from pg_xlog isn't enough anyway, and you need to set up a recovery.conf after running pg_rewind that contains a restore_command or primary_conninfo, to fetch the WAL. So you can argue that by not copying pg_xlog automatically, we're actually doing a favour to the DBA, by forcing him to set up the recovery.conf file correctly. Because if you just test simple scenarios where not much time has passed between the failover and running pg_rewind, it might be enough to just copy all the WAL currently in pg_xlog, but it would not be enough if more time had passed and not all the required WAL is present in pg_xlog anymore. And by not copying the WAL, we can avoid some copying, as restore_command or streaming replication will only copy what's needed, while pg_rewind would copy all WAL it can find the target's data directory. pg_basebackup also doesn't include any WAL, unless you pass the --xlog option. It would be nice to also add an optional --xlog option to pg_rewind, but with pg_rewind it's possible that all the required WAL isn't present in the pg_xlog directory anymore, so you wouldn't always achieve the same effect of making the backup self-contained. So, I propose the attached. 
It makes pg_rewind ignore the pg_xlog directory in both the source and the target. If pg_xlog is simply ignored, some old WAL files may remain in target server. Don't these old files cause the subsequent startup of target server as new standby to fail? That is, it's the case where the WAL file with the same name but different content exists both in target and source. If that's harmful, pg_rewind also should remove the files in pg_xlog of target server. This would reduce usability. The rewound node will replay WAL from the previous checkpoint where WAL forked up to the minimum recovery point of source node where pg_rewind has been run. Hence if we remove completely the contents of pg_xlog we'd lose a portion of the logs that need to be replayed until timeline is switched on the rewound node when recovering it (while streaming from the promoted standby, whatever). Even if we remove the WAL files in *target server, we don't lose any files in *source server that we will need to replay later. I don't really see why recycled segments would be a problem, as that's perhaps what you are referring to, but perhaps I am missing something. Please imagine the case where the WAL files with the same name were created in both servers after the fork. Their contents may be different. After pg_rewind is executed successfully, the rewound server (i.e., target server) should retrieve and replay that WAL file from the *source* server. But the problem is that the rewound server tries to replay the WAL file from its local copy, since the file exists locally (even if primary_conninfo is specified). Regards, -- Fujii Masao
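The setup the thread keeps coming back to, a recovery.conf on the rewound node so it can fetch the WAL that pg_rewind no longer copies, would look roughly like this. Host, user, and archive path are placeholders, not values from the thread:

```
# recovery.conf on the rewound (old master) node -- illustrative values only
standby_mode = 'on'

# fetch WAL via streaming replication from the promoted standby
primary_conninfo = 'host=new-master port=5432 user=replicator'

# and/or restore archived segments; the archive location is an assumption
restore_command = 'cp /mnt/wal_archive/%f "%p"'

# follow the timeline switch performed at promotion
recovery_target_timeline = 'latest'
```

As discussed above, required WAL files could alternatively be copied by hand into the target's pg_xlog before startup.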
Re: [HACKERS] pg_rewind failure by file deletion in source server
On Sun, Aug 2, 2015 at 4:01 AM, Heikki Linnakangas wrote: Perhaps it's best if we copy all the WAL files from source in copy-mode, but not in libpq mode. Regarding old WAL files in the target, it's probably best to always leave them alone. They should do no harm, and as a general principle it's best to avoid destroying evidence. It'd be nice to get some fix for this for alpha2, so I'll commit a fix to do that on Monday, unless we come to a different conclusion before that. +1. Both things sound like a good plan to me. -- Michael
Re: [HACKERS] pg_rewind failure by file deletion in source server
On 07/17/2015 06:28 AM, Michael Paquier wrote: On Wed, Jul 1, 2015 at 9:31 PM, Fujii Masao masao.fu...@gmail.com wrote: On Wed, Jul 1, 2015 at 2:21 AM, Heikki Linnakangas hlinn...@iki.fi wrote: On 06/29/2015 09:44 AM, Michael Paquier wrote: On Mon, Jun 29, 2015 at 4:55 AM, Heikki Linnakangas wrote: But we'll still need to handle the pg_xlog symlink case somehow. Perhaps it would be enough to special-case pg_xlog for now. Well, sure, pg_rewind does not copy the soft links either way. Now it would be nice to have an option to be able to recreate the soft link of at least pg_xlog even if it can be scripted as well after a run. Hmm. I'm starting to think that pg_rewind should ignore pg_xlog entirely. In any non-trivial scenarios, just copying all the files from pg_xlog isn't enough anyway, and you need to set up a recovery.conf after running pg_rewind that contains a restore_command or primary_conninfo, to fetch the WAL. So you can argue that by not copying pg_xlog automatically, we're actually doing a favour to the DBA, by forcing him to set up the recovery.conf file correctly. Because if you just test simple scenarios where not much time has passed between the failover and running pg_rewind, it might be enough to just copy all the WAL currently in pg_xlog, but it would not be enough if more time had passed and not all the required WAL is present in pg_xlog anymore. And by not copying the WAL, we can avoid some copying, as restore_command or streaming replication will only copy what's needed, while pg_rewind would copy all WAL it can find the target's data directory. pg_basebackup also doesn't include any WAL, unless you pass the --xlog option. It would be nice to also add an optional --xlog option to pg_rewind, but with pg_rewind it's possible that all the required WAL isn't present in the pg_xlog directory anymore, so you wouldn't always achieve the same effect of making the backup self-contained. So, I propose the attached. 
It makes pg_rewind ignore the pg_xlog directory in both the source and the target. If pg_xlog is simply ignored, some old WAL files may remain in target server. Don't these old files cause the subsequent startup of target server as new standby to fail? That is, it's the case where the WAL file with the same name but different content exists both in target and source. If that's harmful, pg_rewind also should remove the files in pg_xlog of target server. This would reduce usability. The rewound node will replay WAL from the previous checkpoint where WAL forked up to the minimum recovery point of source node where pg_rewind has been run. Hence if we remove completely the contents of pg_xlog we'd lose a portion of the logs that need to be replayed until timeline is switched on the rewound node when recovering it (while streaming from the promoted standby, whatever). I don't really see why recycled segments would be a problem, as that's perhaps what you are referring to, but perhaps I am missing something. Hmm. My thinking was that you need to set up restore_command or primary_conninfo anyway, to fetch the old WAL, so there's no need to copy any WAL. But there's a problem with that: you might have WAL files in the source server that haven't been archived yet, and you need them to recover the rewound target node. That's OK for libpq mode, I think, as the server is still running and presumably you can fetch the WAL with streaming replication, but for copy-mode, that's not a good assumption. You might be relying on a WAL archive, and the file might not be archived yet. Perhaps it's best if we copy all the WAL files from source in copy-mode, but not in libpq mode. Regarding old WAL files in the target, it's probably best to always leave them alone. They should do no harm, and as a general principle it's best to avoid destroying evidence. 
It'd be nice to get some fix for this for alpha2, so I'll commit a fix to do that on Monday, unless we come to a different conclusion before that. - Heikki
Re: [HACKERS] pg_rewind failure by file deletion in source server
On Wed, Jul 1, 2015 at 9:31 PM, Fujii Masao masao.fu...@gmail.com wrote: On Wed, Jul 1, 2015 at 2:21 AM, Heikki Linnakangas hlinn...@iki.fi wrote: On 06/29/2015 09:44 AM, Michael Paquier wrote: On Mon, Jun 29, 2015 at 4:55 AM, Heikki Linnakangas wrote: But we'll still need to handle the pg_xlog symlink case somehow. Perhaps it would be enough to special-case pg_xlog for now. Well, sure, pg_rewind does not copy the soft links either way. Now it would be nice to have an option to be able to recreate the soft link of at least pg_xlog even if it can be scripted as well after a run. Hmm. I'm starting to think that pg_rewind should ignore pg_xlog entirely. In any non-trivial scenarios, just copying all the files from pg_xlog isn't enough anyway, and you need to set up a recovery.conf after running pg_rewind that contains a restore_command or primary_conninfo, to fetch the WAL. So you can argue that by not copying pg_xlog automatically, we're actually doing a favour to the DBA, by forcing him to set up the recovery.conf file correctly. Because if you just test simple scenarios where not much time has passed between the failover and running pg_rewind, it might be enough to just copy all the WAL currently in pg_xlog, but it would not be enough if more time had passed and not all the required WAL is present in pg_xlog anymore. And by not copying the WAL, we can avoid some copying, as restore_command or streaming replication will only copy what's needed, while pg_rewind would copy all WAL it can find the target's data directory. pg_basebackup also doesn't include any WAL, unless you pass the --xlog option. It would be nice to also add an optional --xlog option to pg_rewind, but with pg_rewind it's possible that all the required WAL isn't present in the pg_xlog directory anymore, so you wouldn't always achieve the same effect of making the backup self-contained. So, I propose the attached. It makes pg_rewind ignore the pg_xlog directory in both the source and the target. 
If pg_xlog is simply ignored, some old WAL files may remain in target server. Don't these old files cause the subsequent startup of target server as new standby to fail? That is, it's the case where the WAL file with the same name but different content exists both in target and source. If that's harmful, pg_rewind also should remove the files in pg_xlog of target server. This would reduce usability. The rewound node will replay WAL from the previous checkpoint where WAL forked up to the minimum recovery point of source node where pg_rewind has been run. Hence if we remove completely the contents of pg_xlog we'd lose a portion of the logs that need to be replayed until timeline is switched on the rewound node when recovering it (while streaming from the promoted standby, whatever). I don't really see why recycled segments would be a problem, as that's perhaps what you are referring to, but perhaps I am missing something. Attached is a rebased version of the previous patch to ignore the contents of pg_xlog/ when rewinding. -- Michael 20150717_pgrewind_ignore_xlog.patch Description: binary/octet-stream
Re: [HACKERS] pg_rewind failure by file deletion in source server
On Wed, Jul 1, 2015 at 2:21 AM, Heikki Linnakangas hlinn...@iki.fi wrote: On 06/29/2015 09:44 AM, Michael Paquier wrote: On Mon, Jun 29, 2015 at 4:55 AM, Heikki Linnakangas wrote: But we'll still need to handle the pg_xlog symlink case somehow. Perhaps it would be enough to special-case pg_xlog for now. Well, sure, pg_rewind does not copy the soft links either way. Now it would be nice to have an option to be able to recreate the soft link of at least pg_xlog even if it can be scripted as well after a run. Hmm. I'm starting to think that pg_rewind should ignore pg_xlog entirely. In any non-trivial scenarios, just copying all the files from pg_xlog isn't enough anyway, and you need to set up a recovery.conf after running pg_rewind that contains a restore_command or primary_conninfo, to fetch the WAL. So you can argue that by not copying pg_xlog automatically, we're actually doing a favour to the DBA, by forcing him to set up the recovery.conf file correctly. Because if you just test simple scenarios where not much time has passed between the failover and running pg_rewind, it might be enough to just copy all the WAL currently in pg_xlog, but it would not be enough if more time had passed and not all the required WAL is present in pg_xlog anymore. And by not copying the WAL, we can avoid some copying, as restore_command or streaming replication will only copy what's needed, while pg_rewind would copy all WAL it can find the target's data directory. pg_basebackup also doesn't include any WAL, unless you pass the --xlog option. It would be nice to also add an optional --xlog option to pg_rewind, but with pg_rewind it's possible that all the required WAL isn't present in the pg_xlog directory anymore, so you wouldn't always achieve the same effect of making the backup self-contained. So, I propose the attached. It makes pg_rewind ignore the pg_xlog directory in both the source and the target. If pg_xlog is simply ignored, some old WAL files may remain in target server. 
Don't these old files cause the subsequent startup of target server as new standby to fail? That is, it's the case where the WAL file with the same name but different content exists both in target and source. If that's harmful, pg_rewind also should remove the files in pg_xlog of target server. Regards, -- Fujii Masao
Re: [HACKERS] pg_rewind failure by file deletion in source server
On Wed, Jul 1, 2015 at 2:21 AM, Heikki Linnakangas hlinn...@iki.fi wrote: Hmm. I'm starting to think that pg_rewind should ignore pg_xlog entirely. In any non-trivial scenarios, just copying all the files from pg_xlog isn't enough anyway, and you need to set up a recovery.conf after running pg_rewind that contains a restore_command or primary_conninfo, to fetch the WAL. So you can argue that by not copying pg_xlog automatically, we're actually doing a favour to the DBA, by forcing him to set up the recovery.conf file correctly. Because if you just test simple scenarios where not much time has passed between the failover and running pg_rewind, it might be enough to just copy all the WAL currently in pg_xlog, but it would not be enough if more time had passed and not all the required WAL is present in pg_xlog anymore. And by not copying the WAL, we can avoid some copying, as restore_command or streaming replication will only copy what's needed, while pg_rewind would copy all WAL it can find in the target's data directory. pg_basebackup also doesn't include any WAL, unless you pass the --xlog option. It would be nice to also add an optional --xlog option to pg_rewind, but with pg_rewind it's possible that all the required WAL isn't present in the pg_xlog directory anymore, so you wouldn't always achieve the same effect of making the backup self-contained. Those are very convincing arguments. So, I propose the attached. It makes pg_rewind ignore the pg_xlog directory in both the source and the target. Minor thing: s/continous/continuous. Apart from that, this patch looks sane to me. (btw, it seems to me that we still have a race condition with pg_tablespace_location). -- Michael
Re: [HACKERS] pg_rewind failure by file deletion in source server
On 06/29/2015 09:44 AM, Michael Paquier wrote: On Mon, Jun 29, 2015 at 4:55 AM, Heikki Linnakangas wrote: But we'll still need to handle the pg_xlog symlink case somehow. Perhaps it would be enough to special-case pg_xlog for now. Well, sure, pg_rewind does not copy the soft links either way. Now it would be nice to have an option to be able to recreate the soft link of at least pg_xlog even if it can be scripted as well after a run. Hmm. I'm starting to think that pg_rewind should ignore pg_xlog entirely. In any non-trivial scenarios, just copying all the files from pg_xlog isn't enough anyway, and you need to set up a recovery.conf after running pg_rewind that contains a restore_command or primary_conninfo, to fetch the WAL. So you can argue that by not copying pg_xlog automatically, we're actually doing a favour to the DBA, by forcing him to set up the recovery.conf file correctly. Because if you just test simple scenarios where not much time has passed between the failover and running pg_rewind, it might be enough to just copy all the WAL currently in pg_xlog, but it would not be enough if more time had passed and not all the required WAL is present in pg_xlog anymore. And by not copying the WAL, we can avoid some copying, as restore_command or streaming replication will only copy what's needed, while pg_rewind would copy all WAL it can find in the target's data directory. pg_basebackup also doesn't include any WAL, unless you pass the --xlog option. It would be nice to also add an optional --xlog option to pg_rewind, but with pg_rewind it's possible that all the required WAL isn't present in the pg_xlog directory anymore, so you wouldn't always achieve the same effect of making the backup self-contained. So, I propose the attached. It makes pg_rewind ignore the pg_xlog directory in both the source and the target. 
- Heikki diff --git a/doc/src/sgml/ref/pg_rewind.sgml b/doc/src/sgml/ref/pg_rewind.sgml index 32dc83f..bb1d640 100644 --- a/doc/src/sgml/ref/pg_rewind.sgml +++ b/doc/src/sgml/ref/pg_rewind.sgml @@ -49,12 +49,13 @@ PostgreSQL documentation para The result is equivalent to replacing the target data directory with the - source one. All files are copied, including configuration files. The - advantage of applicationpg_rewind/ over taking a new base backup, or - tools like applicationrsync/, is that applicationpg_rewind/ does - not require reading through all unchanged files in the cluster. That makes - it a lot faster when the database is large and only a small portion of it - differs between the clusters. + source one. All files in the data directory are copied, including + configuration files, but excluding the WAL log directory, + filenamepg_xlog/. The advantage of applicationpg_rewind/ over + taking a new base backup, or tools like applicationrsync/, is that + applicationpg_rewind/ does not require reading through all unchanged + files in the cluster. That makes it a lot faster when the database is + large and only a small portion of it differs between the clusters. /para para @@ -74,12 +75,12 @@ PostgreSQL documentation When the target server is started up for the first time after running applicationpg_rewind/, it will go into recovery mode and replay all WAL generated in the source server after the point of divergence. - If some of the WAL was no longer available in the source server when - applicationpg_rewind/ was run, and therefore could not be copied by - applicationpg_rewind/ session, it needs to be made available when the - target server is started up. That can be done by creating a - filenamerecovery.conf/ file in the target data directory with a - suitable varnamerestore_command/. + Like after restoring from a base backup using continous archiving, + the source server's WAL must be made available to the rewound server. 
+ This can be done by creating a filenamerecovery.conf/ file in the + target data directory with a suitable varnamerestore_command/ or + varnameprimary_conninfo/ line, or by copying the required WAL files + manually into filenamepg_xlog/ in the target data directory. /para /refsect1 diff --git a/src/bin/pg_rewind/RewindTest.pm b/src/bin/pg_rewind/RewindTest.pm index 5219ec9..82e3f4c 100644 --- a/src/bin/pg_rewind/RewindTest.pm +++ b/src/bin/pg_rewind/RewindTest.pm @@ -284,14 +284,12 @@ sub run_pg_rewind # Keep a temporary postgresql.conf for master node or it would be # overwritten during the rewind. - copy( - $test_master_datadir/postgresql.conf, - $testroot/master-postgresql.conf.tmp); + copy($test_master_datadir/postgresql.conf, + $testroot/master-postgresql.conf.tmp); # Now run pg_rewind if ($test_mode eq local) { - # Do rewind using a local pgdata as source # Stop the master and be ready to perform the rewind system_or_bail( @@ -306,10 +304,22 @@ sub run_pg_rewind $log_path, '21'); ok($result, 'pg_rewind
Re: [HACKERS] pg_rewind failure by file deletion in source server
On Mon, Jun 29, 2015 at 3:46 AM, Heikki Linnakangas hlinn...@iki.fi wrote: On 06/24/2015 09:43 AM, Michael Paquier wrote: Attached is a new set of patches. Except for the last ones that address one issue of pg_rewind (symlink management when streaming PGDATA), all the others introduce if_not_exists options for the functions of genfile.c. The pg_rewind stuff could be more polished though. Feel free to comment. I've committed the additional option to the functions in genfile.c (I renamed it to missing_ok, for clarity), and the pg_rewind changes to use that option. I ended up refactoring the patch quite a bit, so if you could double-check what I committed to make sure I didn't break anything, that would be great. Thanks, the new patch looks far better than what I did. I noticed a couple of typos in the docs though: - s/behaviour/behavior, behavior is American English spelling, and it is the one used elsewhere as well, hence I guess that it makes sense to use it. - s/an non-existent/a non-existent - pg_proc.h is still using if_not_exists as in my patch (you corrected it to use missing_ok). Those are fixed as 0001 in the attached set. I didn't commit the tablespace or symlink handling changes yet, will review those separately. Thanks. Attached are rebased versions that take into account your previous changes as well (like the variable renaming, wrapper function usage, etc). I also added missing_ok to pg_readlink for consistency, and rebased the fix of pg_rewind for soft links with 0005. Note that 0005 does not use pg_tablespace_location(), still the patch series includes an implementation of missing_ok for it. Feel free to use it if necessary. I also didn't commit the new regression test yet. It would indeed be nice to have one, but I think it was a few bricks shy of a load. It should work in a freshly initdb'd system, but not necessarily on an existing installation. 
First, it relied on the fact that postgresql.conf.auto exists, but a DBA might remove that if he wants to make sure the feature is not used. Secondly, it relied on the fact that pg_twophase is empty, but there is no guarantee of that either. Third, the error messages included in the expected output, e.g. "No such file or directory", depend on the operating system and locale. And finally, it'd be nice to test more things, in particular the behaviour of different offsets and lengths to pg_read_binary_file(), although an incomplete test would be better than no test at all. Portability is going to really reduce the test range; the only things that we could test are: - NULL results returned when missing_ok = true (with a dummy file name/link/directory) as missing_ok = false would choke depending on the platform as you mentioned. - Ensure that those functions called by users without superuser rights fail properly. - Path format errors for each function, like that: =# select pg_ls_dir('..'); ERROR: 42501: path must be in or below the current directory LOCATION: convert_and_check_filename, genfile.c:78 For tests on pg_read_binary_file, do you think that there is one file of PGDATA that we could use for scanning? I cannot think of one off the top of my head now (postgresql.conf or any configuration files could be overridden by the user so they are out of scope, PG_VERSION is an additional maintenance burden). Well, I think that those things are still worth testing in any case and I think that you think so as well. If you are fine with that I could wrap up a patch; that's better than nothing for sure. Thoughts? Now we could have a TAP test for this stuff, where we could have a custom PGDATA with some dummy files that we know will be empty for example, still it seems like overkill to me for this purpose, initdb is rather costly in this facility. And on small machines. 
Regards, -- Michael From 378bd8af5c27e7e20b038f05b693b95e9871ef0b Mon Sep 17 00:00:00 2001 From: Michael Paquier mich...@otacoo.com Date: Mon, 29 Jun 2015 13:38:03 +0900 Subject: [PATCH 1/5] Fix some typos and an incorrect naming introduced by cb2acb1 --- doc/src/sgml/func.sgml| 4 ++-- src/include/catalog/pg_proc.h | 2 +- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index d49cd43..8b28e29 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -17851,7 +17851,7 @@ postgres=# SELECT * FROM pg_xlogfile_name_offset(pg_stop_backup()); para All of these functions take an optional parametermissing_ok/ parameter, -which specifies the behaviour when the file or directory does not exist. +which specifies the behavior when the file or directory does not exist. If literaltrue/literal, the function returns NULL (except functionpg_ls_dir/, which returns an empty result set). If literalfalse/, an error is raised. The default is literalfalse/. @@ -17867,7 +17867,7 @@ postgres=# SELECT * FROM pg_xlogfile_name_offset(pg_stop_backup());
Re: [HACKERS] pg_rewind failure by file deletion in source server
On Mon, Jun 29, 2015 at 3:53 PM, Vladimir Borodin r...@simply.name wrote: On 28 June 2015, at 21:46, Heikki Linnakangas hlinn...@iki.fi wrote: I've committed the additional option to the functions in genfile.c (I renamed it to missing_ok, for clarity), and the pg_rewind changes to use that option. And since it changes the API it would not be back-ported to 9.4, right? Those API changes are not going to be back-patched to 9.4 as this is basically a new feature. For 9.4's pg_rewind, the problem does not concern this mailing list anyway, so let's discuss it directly on the project page on github. Just mentioning, but I am afraid that we are going to need a set of TRY/CATCH blocks with a wrapper function in plpgsql that does the failure legwork here, or to export those functions in an extension with some copy/paste as all the necessary routines of genfile.c are not exposed externally, and to be sure that those functions are installed on the source server. I am not sure that I will be able to look at that in the short term unfortunately... Hence patches are welcome there and I will happily review, argue or integrate them if needed. The non-streaming option is not impacted in any case, so it's not like pg_rewind cannot be used at all on 9.4 or 9.3 instances. Regards, -- Michael
Re: [HACKERS] pg_rewind failure by file deletion in source server
On 06/26/2015 10:10 PM, Andres Freund wrote: On 2015-06-26 15:07:59 -0400, Robert Haas wrote: I realize that the recent fsync fiasco demonstrated that people keep files not readable by PG in the data directory It wasn't unreadable files that were the primary problem, it was files with read-only permissions, no? Right. "oops, I can't read this, that's probably OK" just does not seem good enough. Agreed. After thinking about this some more, I think it'd be acceptable if we just fail, if there are any non-writeable files in the data directory. The typical scenario is that postgresql.conf, or an SSL cert file, is a symlink to outside the data directory. It seems reasonable to require that the DBA just removes the symlink before running pg_rewind, and restores it afterwards if appropriate. In many cases, you would *not* want to overwrite your config files, SSL cert files, etc., so the DBA will need to script backing up and restoring those anyway. (It's a fair question whether pg_rewind should be copying those files in the first place. I've opted for yes, so that it's easy to explain the behaviour of pg_rewind: the end result is the same as if you took a new base backup from the source server. Whatever files you want to backup up before you re-initialize from a base backup, you should also backup with pg_rewind.) But we'll still need to handle the pg_xlog symlink case somehow. Perhaps it would be enough to special-case pg_xlog for now. - Heikki
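The "just fail" policy Heikki proposes can be sketched as a pre-flight scan of the data directory. This is an illustration, not pg_rewind's code: the real tool discovers the problem by attempting to open files, while this sketch inspects the owner-write permission bit directly (which also behaves consistently when run as root). `refuse_if_unwritable` is a hypothetical name:

```python
import os
import stat

def refuse_if_unwritable(pgdata):
    """Scan pgdata and raise PermissionError if any regular file lacks the
    owner-write bit, so a rewind can fail up front instead of half-way
    through (sketch of the policy discussed above)."""
    problems = []
    for root, dirs, files in os.walk(pgdata):
        for name in files:
            path = os.path.join(root, name)
            st = os.lstat(path)
            if stat.S_ISREG(st.st_mode) and not (st.st_mode & stat.S_IWUSR):
                problems.append(path)
    if problems:
        raise PermissionError("non-writable files present: %s" % problems)
```

With such a check, the DBA is forced to remove read-only files (or the symlinks pointing at them) before the rewind starts, rather than discovering a partial rewind afterwards.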
Re: [HACKERS] pg_rewind failure by file deletion in source server
On 06/24/2015 09:43 AM, Michael Paquier wrote: Attached is a new set of patches. Except for the last ones that address one issue of pg_rewind (symlink management when streaming PGDATA), all the others introduce if_not_exists options for the functions of genfile.c. The pg_rewind stuff could be more polished though. Feel free to comment. I've committed the additional option to the functions in genfile.c (I renamed it to missing_ok, for clarity), and the pg_rewind changes to use that option. I ended up refactoring the patch quite a bit, so if you could double-check what I committed to make sure I didn't break anything, that would be great. I didn't commit the tablespace or symlink handling changes yet, will review those separately. I also didn't commit the new regression test yet. It would indeed be nice to have one, but I think it was a few bricks shy of a load. It should work in a freshly initdb'd system, but not necessarily on an existing installation. First, it relied on the fact that postgresql.conf.auto exists, but a DBA might remove that if he wants to make sure the feature is not used. Secondly, it relied on the fact that pg_twophase is empty, but there is no guarantee of that either. Third, the error messages included in the expected output, e.g. "No such file or directory", depend on the operating system and locale. And finally, it'd be nice to test more things, in particular the behaviour of different offsets and lengths to pg_read_binary_file(), although an incomplete test would be better than no test at all. - Heikki
Re: [HACKERS] pg_rewind failure by file deletion in source server
On Wed, Jun 24, 2015 at 9:41 PM, Michael Paquier michael.paqu...@gmail.com wrote: On Wed, Jun 24, 2015 at 3:43 PM, Michael Paquier wrote: I just realized another problem: We recently learned the hard way that some people have files in the data directory that are not writeable by the 'postgres' user (http://www.postgresql.org/message-id/20150523172627.ga24...@msg.df7cb.de). pg_rewind will try to overwrite all files it doesn't recognize as relation files, so it's going to fail on those. A straightforward fix would be to first open the destination file in read-only mode, and compare its contents, and only open the file in write mode if it has changed. It would still fail when the files really differ, but I think that's acceptable.

If I am missing nothing, two code paths need to be patched here: copy_file_range and receiveFileChunks. copy_file_range is straightforward. Now wouldn't it be better to write the contents into a temporary file, compare their content, and then switch if necessary for receiveFileChunks?

After sleeping on it, I have been looking at this issue again and came up with the patch attached. Instead of checking if the content of the target and the source file are the same, meaning that we would still need to fetch chunk content from the server in stream mode, I think that it is more robust to check if the target file can be opened and check for EACCES on failure, bypassing it if the process does not have permissions on it. The patch contains a test case as well, and is independent of the rest sent upthread. Thoughts?

That seems scary as heck to me. Suppose that you run pg_rewind and SELinux decides, due to some labeling problem, to deny access to some files. So we just skip those. Then the user tries to start the server and maybe it works (since the postgres executable is labeled differently) or maybe it fails and the user runs restorecon and then it works. Kaboom, your database is corrupt.

I realize that the recent fsync fiasco demonstrated that people keep files not readable by PG in the data directory, but that still seems loopy to me. I realize that we can't de-support that in the back branches because people have existing configurations that we can't just break. But maybe we should say, look, that's just not compatible with pg_rewind, because where the integrity of your data is concerned, "oops, I can't read this, that's probably OK" just does not seem good enough.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
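The "straightforward fix" quoted above, opening the destination read-only first and only reopening for writing when the contents actually differ, can be sketched at whole-file granularity as follows (hypothetical helper name; the real pg_rewind code paths work on ranges and chunks, not whole files):

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Write 'len' bytes to 'path' only if the current contents differ.
 * Returns 0 if the file was already identical (no write attempted,
 * so a read-only file with matching content is left alone),
 * 1 if the file was (re)written, -1 on write failure.
 */
int write_if_changed(const char *path, const char *data, size_t len)
{
    FILE   *f = fopen(path, "rb");

    if (f != NULL)
    {
        char   *old = malloc(len);
        size_t  n = fread(old, 1, len, f);
        /* identical only if same length AND same bytes */
        int     same = (n == len && fgetc(f) == EOF &&
                        memcmp(old, data, len) == 0);

        free(old);
        fclose(f);
        if (same)
            return 0;           /* never opened for writing */
    }
    f = fopen(path, "wb");      /* differs (or unreadable): rewrite */
    if (f == NULL)
        return -1;
    fwrite(data, 1, len, f);
    fclose(f);
    return 1;
}
```

Note that this still fails (returns -1) when a non-writable file really does differ, which is exactly the behaviour Heikki called acceptable and Robert prefers over silently skipping.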
Re: [HACKERS] pg_rewind failure by file deletion in source server
On Fri, Jun 26, 2015 at 3:10 PM, Andres Freund and...@anarazel.de wrote: On 2015-06-26 15:07:59 -0400, Robert Haas wrote: I realize that the recent fsync fiasco demonstrated that people keep files not readable by PG in the data directory It wasn't unreadable files that were the primary problem, it was files with read only permissions, no? Yes, that's true. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] pg_rewind failure by file deletion in source server
On 2015-06-26 15:07:59 -0400, Robert Haas wrote: I realize that the recent fsync fiasco demonstrated that people keep files not readable by PG in the data directory It wasn't unreadable files that were the primary problem, it was files with read only permissions, no? oops, I can't read this, that's probably OK just does not seem good enough. Agreed.
Re: [HACKERS] pg_rewind failure by file deletion in source server
On Wed, Jun 24, 2015 at 3:43 PM, Michael Paquier wrote: I just realized another problem: We recently learned the hard way that some people have files in the data directory that are not writeable by the 'postgres' user (http://www.postgresql.org/message-id/20150523172627.ga24...@msg.df7cb.de). pg_rewind will try to overwrite all files it doesn't recognize as relation files, so it's going to fail on those. A straightforward fix would be to first open the destination file in read-only mode, and compare its contents, and only open the file in write mode if it has changed. It would still fail when the files really differ, but I think that's acceptable.

If I am missing nothing, two code paths need to be patched here: copy_file_range and receiveFileChunks. copy_file_range is straightforward. Now wouldn't it be better to write the contents into a temporary file, compare their content, and then switch if necessary for receiveFileChunks?

After sleeping on it, I have been looking at this issue again and came up with the patch attached. Instead of checking if the content of the target and the source file are the same, meaning that we would still need to fetch chunk content from the server in stream mode, I think that it is more robust to check if the target file can be opened and check for EACCES on failure, bypassing it if the process does not have permissions on it. The patch contains a test case as well, and is independent of the rest sent upthread. Thoughts?
-- Michael

diff --git a/src/bin/pg_rewind/copy_fetch.c b/src/bin/pg_rewind/copy_fetch.c
index 224fad1..ef830ac 100644
--- a/src/bin/pg_rewind/copy_fetch.c
+++ b/src/bin/pg_rewind/copy_fetch.c
@@ -174,7 +174,8 @@ copy_file_range(const char *path, off_t begin, off_t end, bool trunc)
 	if (lseek(srcfd, begin, SEEK_SET) == -1)
 		pg_fatal("could not seek in source file: %s\n", strerror(errno));
 
-	open_target_file(path, trunc);
+	if (!open_target_file(path, trunc))
+		return;
 
 	while (end - begin > 0)
 	{
diff --git a/src/bin/pg_rewind/file_ops.c b/src/bin/pg_rewind/file_ops.c
index c2d8aa1..1f68855 100644
--- a/src/bin/pg_rewind/file_ops.c
+++ b/src/bin/pg_rewind/file_ops.c
@@ -40,17 +40,17 @@ static void remove_target_symlink(const char *path);
  * Open a target file for writing. If 'trunc' is true and the file already
  * exists, it will be truncated.
  */
-void
+bool
 open_target_file(const char *path, bool trunc)
 {
 	int			mode;
 
 	if (dry_run)
-		return;
+		return true;
 
 	if (dstfd != -1 && !trunc &&
 		strcmp(path, &dstpath[strlen(datadir_target) + 1]) == 0)
-		return;					/* already open */
+		return true;			/* already open */
 
 	close_target_file();
 
@@ -60,9 +60,19 @@ open_target_file(const char *path, bool trunc)
 	if (trunc)
 		mode |= O_TRUNC;
 	dstfd = open(dstpath, mode, 0600);
+
+	/* ignore errors for unreadable files */
+	if (dstfd < 0 && errno == EACCES)
+	{
+		pg_log(PG_PROGRESS, "could not open target file \"%s\", bypassing: %s\n",
+			   dstpath, strerror(errno));
+		return false;
+	}
 	if (dstfd < 0)
 		pg_fatal("could not open target file \"%s\": %s\n",
 				 dstpath, strerror(errno));
+
+	return true;
 }
 
 /*
diff --git a/src/bin/pg_rewind/file_ops.h b/src/bin/pg_rewind/file_ops.h
index f68c71d..e437cb1 100644
--- a/src/bin/pg_rewind/file_ops.h
+++ b/src/bin/pg_rewind/file_ops.h
@@ -12,7 +12,7 @@
 
 #include "filemap.h"
 
-extern void open_target_file(const char *path, bool trunc);
+extern bool open_target_file(const char *path, bool trunc);
 extern void write_target_range(char *buf, off_t begin, size_t size);
 extern void close_target_file(void);
 extern void truncate_target_file(const char *path, off_t newsize);
diff --git a/src/bin/pg_rewind/libpq_fetch.c b/src/bin/pg_rewind/libpq_fetch.c
index df71069..41c096b 100644
--- a/src/bin/pg_rewind/libpq_fetch.c
+++ b/src/bin/pg_rewind/libpq_fetch.c
@@ -283,7 +283,8 @@ receiveFileChunks(const char *sql)
 		pg_log(PG_DEBUG, "received chunk for file \"%s\", offset %d, size %d\n",
 			   filename, chunkoff, chunksize);
 
-		open_target_file(filename, false);
+		if (!open_target_file(filename, false))
+			continue;
 
 		write_target_range(chunk, chunkoff, chunksize);
 
@@ -406,8 +407,8 @@ libpq_executeFileMap(filemap_t *map)
 
 			case FILE_ACTION_COPY:
 				/* Truncate the old file out of the way, if any */
-				open_target_file(entry->path, true);
-				fetch_file_range(entry->path, 0, entry->newsize);
+				if (open_target_file(entry->path, true))
+					fetch_file_range(entry->path, 0, entry->newsize);
 				break;
 
 			case FILE_ACTION_TRUNCATE:
diff --git a/src/bin/pg_rewind/pg_rewind.c b/src/bin/pg_rewind/pg_rewind.c
index 032301f..77dddbe 100644
--- a/src/bin/pg_rewind/pg_rewind.c
+++ b/src/bin/pg_rewind/pg_rewind.c
@@ -496,7 +496,8 @@ createBackupLabel(XLogRecPtr startpoint, TimeLineID starttli, XLogRecPtr checkpo
 		pg_fatal("backup label buffer too small\n");	/* shouldn't happen */
 
 	/* TODO: move old file out of the way, if any. */
-	open_target_file("backup_label", true);		/* BACKUP_LABEL_FILE */
Re: [HACKERS] pg_rewind failure by file deletion in source server
On 06/23/2015 07:51 AM, Michael Paquier wrote: So... Attached are a set of patches dedicated at fixing this issue:

Thanks for working on this!

- 0001, add if_not_exists to pg_tablespace_location, returning NULL if path does not exist
- 0002, same with pg_stat_file, returning NULL if file does not exist
- 0003, same with pg_read_*file. I added them to all the existing functions for consistency.
- 0004, pg_ls_dir extended with if_not_exists and include_dot_dirs (thanks Robert for the naming!)
- 0005, as things get complex, a set of regression tests aimed at covering those things. pg_tablespace_location is platform-dependent, so there are no tests for it.
- 0006, the fix for pg_rewind, using what has been implemented before.

With these patches, pg_read_file() will return NULL for any failure to open the file, which makes pg_rewind assume that the file doesn't exist in the source server, and it will remove the file from the destination. That's dangerous; those functions should check specifically for ENOENT.

There's still a small race condition with tablespaces. If you run CREATE TABLESPACE in the source server while pg_rewind is running, it's possible that the recursive query that pg_rewind uses sees the symlink in the pg_tblspc/ directory, but its snapshot doesn't see the row in pg_tablespace yet. It will think that the symlink is a regular file, try to read it, and fail (if we checked for ENOENT).

Actually, I think we need to try to deal with symlinks a bit harder. Currently, pg_rewind assumes that anything in pg_tblspc that has a matching row in pg_tablespace is a symlink, and nothing else is. I think we also need to handle other symlinks to directories. I just noticed that pg_rewind fails miserably if pg_xlog is a symlink, because of that:

The servers diverged at WAL position 0/3023F08 on timeline 1.
Rewinding from last common checkpoint at 0/260 on timeline 1
data-master//pg_xlog is not a directory
Failure, exiting

I think we need to add a column to pg_stat_file output, to indicate symbolic links, and add a pg_readlink() function. That still leaves a race condition if the type of a file changes, i.e. a file is deleted and a directory with the same name is created in its place, but that seems acceptable. I don't think PostgreSQL ever does such a thing, so that could only happen if you mess with the data directory manually while the server is running.

I just realized another problem: We recently learned the hard way that some people have files in the data directory that are not writeable by the 'postgres' user (http://www.postgresql.org/message-id/20150523172627.ga24...@msg.df7cb.de). pg_rewind will try to overwrite all files it doesn't recognize as relation files, so it's going to fail on those. A straightforward fix would be to first open the destination file in read-only mode, and compare its contents, and only open the file in write mode if it has changed. It would still fail when the files really differ, but I think that's acceptable.

I note that pg_rewind doesn't need to distinguish between an empty and a non-existent directory, so it's quite silly for it to pass include_dot_dirs=true, and then filter out "." and ".." from the result set. The documentation should mention the main reason for including "." and "..": to distinguish between an empty and a non-existent directory.

- Heikki
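The pg_stat_file column and pg_readlink() function proposed above would essentially surface what lstat(2) and readlink(2) already report: classifying a directory entry without following the final symlink. A minimal sketch (hypothetical helper names, not the proposed server API):

```c
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Classify a path the way a symlink-aware pg_stat_file would need to.
 * lstat(2), unlike stat(2), does not follow the final symlink, so a
 * symlink to a directory is reported as a link, not as a directory.
 */
const char *classify_path(const char *path)
{
    struct stat st;

    if (lstat(path, &st) != 0)
        return "missing";
    if (S_ISLNK(st.st_mode))
        return "symlink";
    if (S_ISDIR(st.st_mode))
        return "directory";
    return "file";
}

/* A minimal pg_readlink() analogue: fetch the link target. */
int read_link_target(const char *path, char *buf, size_t buflen)
{
    ssize_t n = readlink(path, buf, buflen - 1);

    if (n < 0)
        return -1;
    buf[n] = '\0';              /* readlink() does not NUL-terminate */
    return 0;
}
```

With something like this exposed over the connection, pg_rewind could recreate a symlinked pg_xlog (or tablespace link) on the target instead of mistaking it for a regular file or plain directory.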
Re: [HACKERS] pg_rewind failure by file deletion in source server
On 06/23/2015 05:03 PM, Fujii Masao wrote: On Tue, Jun 23, 2015 at 9:19 PM, Heikki Linnakangas hlinn...@iki.fi wrote: On 06/23/2015 07:51 AM, Michael Paquier wrote: So... Attached are a set of patches dedicated at fixing this issue: Thanks for working on this! - 0001, add if_not_exists to pg_tablespace_location, returning NULL if path does not exist - 0002, same with pg_stat_file, returning NULL if file does not exist - 0003, same with pg_read_*file. I added them to all the existing functions for consistency. - 0004, pg_ls_dir extended with if_not_exists and include_dot_dirs (thanks Robert for the naming!) - 0005, as things get complex, a set of regression tests aimed to covering those things. pg_tablespace_location is platform-dependent, so there are no tests for it. - 0006, the fix for pg_rewind, using what has been implemented before. With thes patches, pg_read_file() will return NULL for any failure to open the file, which makes pg_rewind to assume that the file doesn't exist in the source server, and will remove the file from the destination. That's dangerous, those functions should check specifically for ENOENT. I'm wondering if using pg_read_file() to copy the file from source server is reasonable. ISTM that it has two problems as follows. 1. It cannot read very large file like 1GB file. So if such large file was created in source server after failover, pg_rewind would not be able to copy the file. No? pg_read_binary_file() handles large files just fine. It cannot return more than 1GB in one call, but you can call it several times and retrieve the file in chunks. That's what pg_rewind does, except for reading the control file, which is known to be small. 2. Many users may not allow a remote client to connect to the PostgreSQL server as a superuser for some security reasons. IOW, there would be no entry in pg_hba.conf for such connection. In this case, pg_rewind always fails because pg_read_file() needs superuser privilege. No? 
I'm tempted to implement the replication command version of pg_read_file(). That is, it reads and sends the data like the BASE_BACKUP replication command does...

Yeah, that would definitely be nice. Peter suggested it back in January (http://www.postgresql.org/message-id/54ac4801.7050...@gmx.net). I think it's way too late to do that for 9.5, however. I'm particularly worried that if we design the required API in a rush, we're not going to get it right, and will have to change it again soon. That might be difficult in a minor release. Using pg_read_file() and friends is quite flexible, even though we just found out that they're not quite flexible enough right now (the ENOENT problem).

- Heikki
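The chunked retrieval Heikki describes, repeated pg_read_binary_file(path, offset, length) calls with each call capped at 1GB, follows the usual fetch-loop shape. Sketched here against a local file standing in for the server round-trip (hypothetical names; the real pg_rewind batches its fetch ranges through a temporary table rather than looping one call at a time):

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Fetch one chunk: at most 'length' bytes starting at 'offset'.
 * Stands in for one pg_read_binary_file(path, offset, length) call.
 */
static size_t fetch_chunk(const char *path, long offset,
                          size_t length, char *out)
{
    FILE   *f = fopen(path, "rb");
    size_t  n = 0;

    if (f != NULL)
    {
        fseek(f, offset, SEEK_SET);
        n = fread(out, 1, length, f);
        fclose(f);
    }
    return n;
}

/*
 * Retrieve a file of unknown size by looping over fixed-size chunks,
 * growing the destination buffer as chunks arrive.
 */
char *fetch_whole_file(const char *path, size_t chunk_size, size_t *total)
{
    char   *result = NULL;
    char   *chunk = malloc(chunk_size);
    size_t  n;

    *total = 0;
    do
    {
        n = fetch_chunk(path, (long) *total, chunk_size, chunk);
        result = realloc(result, *total + n + 1);
        memcpy(result + *total, chunk, n);
        *total += n;
    } while (n == chunk_size);  /* a short read means we hit EOF */
    result[*total] = '\0';
    free(chunk);
    return result;
}
```

The same loop shape works for any per-call size cap; only chunk_size changes.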
Re: [HACKERS] pg_rewind failure by file deletion in source server
On Tue, Jun 23, 2015 at 9:19 PM, Heikki Linnakangas hlinn...@iki.fi wrote: On 06/23/2015 07:51 AM, Michael Paquier wrote: So... Attached are a set of patches dedicated at fixing this issue: Thanks for working on this! - 0001, add if_not_exists to pg_tablespace_location, returning NULL if path does not exist - 0002, same with pg_stat_file, returning NULL if file does not exist - 0003, same with pg_read_*file. I added them to all the existing functions for consistency. - 0004, pg_ls_dir extended with if_not_exists and include_dot_dirs (thanks Robert for the naming!) - 0005, as things get complex, a set of regression tests aimed to covering those things. pg_tablespace_location is platform-dependent, so there are no tests for it. - 0006, the fix for pg_rewind, using what has been implemented before. With thes patches, pg_read_file() will return NULL for any failure to open the file, which makes pg_rewind to assume that the file doesn't exist in the source server, and will remove the file from the destination. That's dangerous, those functions should check specifically for ENOENT. I'm wondering if using pg_read_file() to copy the file from source server is reasonable. ISTM that it has two problems as follows. 1. It cannot read very large file like 1GB file. So if such large file was created in source server after failover, pg_rewind would not be able to copy the file. No? 2. Many users may not allow a remote client to connect to the PostgreSQL server as a superuser for some security reasons. IOW, there would be no entry in pg_hba.conf for such connection. In this case, pg_rewind always fails because pg_read_file() needs superuser privilege. No? I'm tempting to implement the replication command version of pg_read_file(). That is, it reads and sends the data like BASE_BACKUP replication command does... 
Regards, -- Fujii Masao
Re: [HACKERS] pg_rewind failure by file deletion in source server
On Tue, Jun 23, 2015 at 11:21 PM, Heikki Linnakangas hlinn...@iki.fi wrote: On 06/23/2015 05:03 PM, Fujii Masao wrote: On Tue, Jun 23, 2015 at 9:19 PM, Heikki Linnakangas hlinn...@iki.fi wrote: On 06/23/2015 07:51 AM, Michael Paquier wrote: So... Attached are a set of patches dedicated at fixing this issue: Thanks for working on this! - 0001, add if_not_exists to pg_tablespace_location, returning NULL if path does not exist - 0002, same with pg_stat_file, returning NULL if file does not exist - 0003, same with pg_read_*file. I added them to all the existing functions for consistency. - 0004, pg_ls_dir extended with if_not_exists and include_dot_dirs (thanks Robert for the naming!) - 0005, as things get complex, a set of regression tests aimed to covering those things. pg_tablespace_location is platform-dependent, so there are no tests for it. - 0006, the fix for pg_rewind, using what has been implemented before. With thes patches, pg_read_file() will return NULL for any failure to open the file, which makes pg_rewind to assume that the file doesn't exist in the source server, and will remove the file from the destination. That's dangerous, those functions should check specifically for ENOENT. I'm wondering if using pg_read_file() to copy the file from source server is reasonable. ISTM that it has two problems as follows. 1. It cannot read very large file like 1GB file. So if such large file was created in source server after failover, pg_rewind would not be able to copy the file. No? pg_read_binary_file() handles large files just fine. It cannot return more than 1GB in one call, but you can call it several times and retrieve the file in chunks. That's what pg_rewind does, except for reading the control file, which is known to be small. Yeah, you're right. I found that pg_rewind creates a temporary table to fetch the file in chunks. This would prevent pg_rewind from using the *hot standby* server as a source server at all. 
This is of course a limitation of pg_rewind, but we might want to alleviate it in the future.

2. Many users may not allow a remote client to connect to the PostgreSQL server as a superuser for some security reasons. IOW, there would be no entry in pg_hba.conf for such connection. In this case, pg_rewind always fails because pg_read_file() needs superuser privilege. No?

I'm tempted to implement the replication command version of pg_read_file(). That is, it reads and sends the data like the BASE_BACKUP replication command does...

Yeah, that would definitely be nice. Peter suggested it back in January (http://www.postgresql.org/message-id/54ac4801.7050...@gmx.net). I think it's way too late to do that for 9.5, however. I'm particularly worried that if we design the required API in a rush, we're not going to get it right, and will have to change it again soon. That might be difficult in a minor release. Using pg_read_file() and friends is quite flexible, even though we just found out that they're not quite flexible enough right now (the ENOENT problem).

I agree that it's too late to do what I said... But just using pg_read_file() cannot address the #2 problem that I pointed out in my previous email. Also, requiring a superuser privilege for pg_rewind really conflicts with the motivation why we added the replication privilege. So should we change pg_read_file() so that even a replication user can read the file? Or should a replication-user version of pg_read_file() be implemented?

Regards, -- Fujii Masao
Re: [HACKERS] pg_rewind failure by file deletion in source server
On Wed, Jun 24, 2015 at 1:40 AM, Fujii Masao masao.fu...@gmail.com wrote: On Tue, Jun 23, 2015 at 11:21 PM, Heikki Linnakangas hlinn...@iki.fi wrote: On 06/23/2015 05:03 PM, Fujii Masao wrote: On Tue, Jun 23, 2015 at 9:19 PM, Heikki Linnakangas hlinn...@iki.fi wrote: On 06/23/2015 07:51 AM, Michael Paquier wrote: So... Attached are a set of patches dedicated at fixing this issue: Thanks for working on this! - 0001, add if_not_exists to pg_tablespace_location, returning NULL if path does not exist - 0002, same with pg_stat_file, returning NULL if file does not exist - 0003, same with pg_read_*file. I added them to all the existing functions for consistency. - 0004, pg_ls_dir extended with if_not_exists and include_dot_dirs (thanks Robert for the naming!) - 0005, as things get complex, a set of regression tests aimed to covering those things. pg_tablespace_location is platform-dependent, so there are no tests for it. - 0006, the fix for pg_rewind, using what has been implemented before. With thes patches, pg_read_file() will return NULL for any failure to open the file, which makes pg_rewind to assume that the file doesn't exist in the source server, and will remove the file from the destination. That's dangerous, those functions should check specifically for ENOENT. I'm wondering if using pg_read_file() to copy the file from source server is reasonable. ISTM that it has two problems as follows. 1. It cannot read very large file like 1GB file. So if such large file was created in source server after failover, pg_rewind would not be able to copy the file. No? pg_read_binary_file() handles large files just fine. It cannot return more than 1GB in one call, but you can call it several times and retrieve the file in chunks. That's what pg_rewind does, except for reading the control file, which is known to be small. Yeah, you're right. I found that pg_rewind creates a temporary table to fetch the file in chunks. 
This would prevent pg_rewind from using the *hot standby* server as a source server at all. This is of course a limitation of pg_rewind, but we might want to alleviate it in the future. This is something that a replication command could address properly. 2. Many users may not allow a remote client to connect to the PostgreSQL server as a superuser for some security reasons. IOW, there would be no entry in pg_hba.conf for such connection. In this case, pg_rewind always fails because pg_read_file() needs superuser privilege. No? I'm tempting to implement the replication command version of pg_read_file(). That is, it reads and sends the data like BASE_BACKUP replication command does... Yeah, that would definitely be nice. Peter suggested it back in January (http://www.postgresql.org/message-id/54ac4801.7050...@gmx.net). I think it's way too late to do that for 9.5, however. I'm particularly worried that if we design the required API in a rush, we're not going to get it right, and will have to change it again soon. That might be difficult in a minor release. Using pg_read_file() and friends is quite flexible, even though we just find out that they're not quite flexible enough right now (the ENOENT problem). I agree that it's too late to do what I said... But just using pg_read_file() cannot address the #2 problem that I pointed in my previous email. Also requiring a superuser privilege on pg_rewind really conflicts with the motivation why we added replication privilege. So we should change pg_read_file() so that even replication user can read the file? From the security prospective, a replication user can take a base backup so it can already retrieve easily the contents of PGDATA. Hence I guess that it would be fine. However, what about cases where pg_hba.conf authorizes access to a given replication user via psql and blocks it for the replication protocol? 
We could say that the OP should not give out replication access that easily, but in this case the user would have access to the content of PGDATA even if he should not. Is that unrealistic?

Or should a replication-user version of pg_read_file() be implemented?

You mean a new function? How is it different from authorizing pg_read_file usage for a replication user? Honestly, I can live with this superuser restriction in 9.5, and come back to the replication user restriction in 9.6 once things cool down a bit.

-- Michael
Re: [HACKERS] pg_rewind failure by file deletion in source server
On Wed, Jun 24, 2015 at 3:36 AM, Michael Paquier michael.paqu...@gmail.com wrote: On Wed, Jun 24, 2015 at 1:40 AM, Fujii Masao masao.fu...@gmail.com wrote: On Tue, Jun 23, 2015 at 11:21 PM, Heikki Linnakangas hlinn...@iki.fi wrote: On 06/23/2015 05:03 PM, Fujii Masao wrote: On Tue, Jun 23, 2015 at 9:19 PM, Heikki Linnakangas hlinn...@iki.fi wrote: On 06/23/2015 07:51 AM, Michael Paquier wrote: So... Attached are a set of patches dedicated at fixing this issue: Thanks for working on this! - 0001, add if_not_exists to pg_tablespace_location, returning NULL if path does not exist - 0002, same with pg_stat_file, returning NULL if file does not exist - 0003, same with pg_read_*file. I added them to all the existing functions for consistency. - 0004, pg_ls_dir extended with if_not_exists and include_dot_dirs (thanks Robert for the naming!) - 0005, as things get complex, a set of regression tests aimed to covering those things. pg_tablespace_location is platform-dependent, so there are no tests for it. - 0006, the fix for pg_rewind, using what has been implemented before. With thes patches, pg_read_file() will return NULL for any failure to open the file, which makes pg_rewind to assume that the file doesn't exist in the source server, and will remove the file from the destination. That's dangerous, those functions should check specifically for ENOENT. I'm wondering if using pg_read_file() to copy the file from source server is reasonable. ISTM that it has two problems as follows. 1. It cannot read very large file like 1GB file. So if such large file was created in source server after failover, pg_rewind would not be able to copy the file. No? pg_read_binary_file() handles large files just fine. It cannot return more than 1GB in one call, but you can call it several times and retrieve the file in chunks. That's what pg_rewind does, except for reading the control file, which is known to be small. Yeah, you're right. 
I found that pg_rewind creates a temporary table to fetch the file in chunks. This would prevent pg_rewind from using the *hot standby* server as a source server at all. This is of course a limitation of pg_rewind, but we might want to alleviate it in the future. This is something that a replication command could address properly. 2. Many users may not allow a remote client to connect to the PostgreSQL server as a superuser for some security reasons. IOW, there would be no entry in pg_hba.conf for such connection. In this case, pg_rewind always fails because pg_read_file() needs superuser privilege. No? I'm tempting to implement the replication command version of pg_read_file(). That is, it reads and sends the data like BASE_BACKUP replication command does... Yeah, that would definitely be nice. Peter suggested it back in January (http://www.postgresql.org/message-id/54ac4801.7050...@gmx.net). I think it's way too late to do that for 9.5, however. I'm particularly worried that if we design the required API in a rush, we're not going to get it right, and will have to change it again soon. That might be difficult in a minor release. Using pg_read_file() and friends is quite flexible, even though we just find out that they're not quite flexible enough right now (the ENOENT problem). I agree that it's too late to do what I said... But just using pg_read_file() cannot address the #2 problem that I pointed in my previous email. Also requiring a superuser privilege on pg_rewind really conflicts with the motivation why we added replication privilege. So we should change pg_read_file() so that even replication user can read the file? From the security prospective, a replication user can take a base backup so it can already retrieve easily the contents of PGDATA. Hence I guess that it would be fine. However, what about cases where pg_hba.conf authorizes access to a given replication user via psql and blocks it for the replication protocol? 
We could say that the OP should not give out replication access that easily, but in this case the user would have access to the content of PGDATA even if he should not. Is that unrealistic?

The most realistic case is that both source and target servers have a pg_hba.conf containing the following authentication setting regarding replication. That is, each server allows the other to use the replication user to connect via the replication protocol.

# TYPE  DATABASE     USER     ADDRESS    METHOD
host    replication  repuser  X.X.X.X/Y  md5

This case makes me think that allowing even a replication user to call pg_read_file() may not be a good solution for us. Because in that case the replication user needs to log in to a real database like postgres to call pg_read_file(), but usually there would be no entry in pg_hba.conf allowing a replication user to connect to any real database. So if we want to address this problem, replication command version of
Re: [HACKERS] pg_rewind failure by file deletion in source server
On Fri, Jun 19, 2015 at 9:22 PM, Robert Haas robertmh...@gmail.com wrote: On Fri, Jun 19, 2015 at 8:18 AM, Robert Haas robertmh...@gmail.com wrote: On Fri, Jun 19, 2015 at 12:14 AM, Michael Paquier michael.paqu...@gmail.com wrote: Listing the directories with pg_ls_dir() has the same problem. (After some discussion on IM with Heikki on this one). This is actually more tricky because pg_ls_dir() does not return '.' or '..' that we could use to identify that the directory actually exists or not when it is empty. Hence I think that we should add two options to pg_ls_dir: - include_self, default to false. If set to true, '.' is added in the list of items. - if_not_exists, to bypass error that a folder does not exist, default at false. If if_not_exists = true and include_self = true, returning only '.' would mean that the folder exist but that it is empty. If if_not_exists = true and include_self = false, no rows are returned. We could as well ERROR as well if both options are set like that. I am fine with any of them as long as behavior is properly documented. Including '.' to distinguish between an empty directory and a nonexistent one seems like an unnecessarily complicated and non-obvious API. How about just one additional parameter bool *exists. If NULL and no directory, ERROR, else on return set *exists to true or false. Err, wait. You're talking about an SQL function, heh heh. So that won't work. Maybe what you proposed is the best we can do, then, although I would suggest that you rename the include_self parameter to something like include_dot_dirs and return both . and ... Returning only . seems like it will seem weird to people. So... Attached are a set of patches dedicated at fixing this issue: - 0001, add if_not_exists to pg_tablespace_location, returning NULL if path does not exist - 0002, same with pg_stat_file, returning NULL if file does not exist - 0003, same with pg_read_*file. I added them to all the existing functions for consistency. 
- 0004, pg_ls_dir extended with if_not_exists and include_dot_dirs (thanks Robert for the naming!) - 0005, as things get complex, a set of regression tests aimed at covering those things. pg_tablespace_location is platform-dependent, so there are no tests for it. - 0006, the fix for pg_rewind, using what has been implemented before. Attached is an updated test case triggering the issue (rewind_test.bash), with the small patch attached that adds a pg_sleep call in pg_rewind.c (20150623_pg_rewind_sleep.patch). I imagine that this is a bug people are going to meet in the field easily, particularly with temporary relation files or temporary XLOG files. Regards, -- Michael

From 454fa9e67c67621b0832d7fbd90153328c99197b Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@otacoo.com>
Date: Tue, 23 Jun 2015 11:45:49 +0900
Subject: [PATCH 1/6] Extend pg_tablespace_location with if_not_exists option

This is useful for code paths that prefer to receive an empty response
instead of an ERROR if the tablespace specified does not exist.
---
 doc/src/sgml/func.sgml        |  8 --
 src/backend/utils/adt/misc.c  | 60 ++-
 src/include/catalog/pg_proc.h |  2 ++
 src/include/utils/builtins.h  |  1 +
 4 files changed, 57 insertions(+), 14 deletions(-)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 650051b..7929e7e 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -15844,9 +15844,13 @@ SELECT pg_type_is_visible('myschema.widget'::regtype);
       <entry>get the set of database OIDs that have objects in the tablespace</entry>
      </row>
      <row>
-      <entry><literal><function>pg_tablespace_location(<parameter>tablespace_oid</parameter>)</function></literal></entry>
+      <entry><literal><function>pg_tablespace_location(<parameter>tablespace_oid</parameter> [, <parameter>if_not_exists</parameter> <type>boolean</type>])</function></literal></entry>
       <entry><type>text</type></entry>
-      <entry>get the path in the file system that this tablespace is located in</entry>
+      <entry>
+       Get the path in the file system that this tablespace is located in.
+       If <parameter>if_not_exists</parameter> is true, <literal>NULL</literal>
+       is returned if the tablespace does not exist instead of an error.
+      </entry>
      </row>
      <row>
       <entry><literal><function>pg_typeof(<parameter>any</parameter>)</function></literal></entry>
diff --git a/src/backend/utils/adt/misc.c b/src/backend/utils/adt/misc.c
index c0495d9..034728d 100644
--- a/src/backend/utils/adt/misc.c
+++ b/src/backend/utils/adt/misc.c
@@ -336,17 +336,24 @@ pg_tablespace_databases(PG_FUNCTION_ARGS)
 	SRF_RETURN_DONE(funcctx);
 }
 
 /*
- * pg_tablespace_location - get location for a tablespace
+ * Wrapper function for pg_tablespace_location and
+ * pg_tablespace_location_extended.
  */
-Datum
-pg_tablespace_location(PG_FUNCTION_ARGS)
+static Datum
+tablespace_location_wrapper(FunctionCallInfo fcinfo)
 {
 	Oid			tablespaceOid = PG_GETARG_OID(0);
 	char
Re: [HACKERS] pg_rewind failure by file deletion in source server
On Fri, Jun 19, 2015 at 8:18 AM, Robert Haas robertmh...@gmail.com wrote: On Fri, Jun 19, 2015 at 12:14 AM, Michael Paquier michael.paqu...@gmail.com wrote: Listing the directories with pg_ls_dir() has the same problem. (After some discussion on IM with Heikki on this one). This is actually more tricky because pg_ls_dir() does not return '.' or '..' that we could use to identify that the directory actually exists or not when it is empty. Hence I think that we should add two options to pg_ls_dir: - include_self, default to false. If set to true, '.' is added in the list of items. - if_not_exists, to bypass the error when a folder does not exist, default at false. If if_not_exists = true and include_self = true, returning only '.' would mean that the folder exists but is empty. If if_not_exists = true and include_self = false, no rows are returned. We could also ERROR if both options are set like that. I am fine with any of them as long as the behavior is properly documented. Including '.' to distinguish between an empty directory and a nonexistent one seems like an unnecessarily complicated and non-obvious API. How about just one additional parameter bool *exists. If NULL and no directory, ERROR, else on return set *exists to true or false. Err, wait. You're talking about an SQL function, heh heh. So that won't work. Maybe what you proposed is the best we can do, then, although I would suggest that you rename the include_self parameter to something like include_dot_dirs and return both '.' and '..'. Returning only '.' seems like it will seem weird to people. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
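[Editorial note: at this point in the thread the signature is still being debated, but the proposed semantics can be sketched as follows. The argument order is hypothetical and assumes a server patched with both options:]

```sql
-- Existing but empty directory: with include_dot_dirs = true,
-- '.' and '..' are returned, so the result is non-empty.
SELECT pg_ls_dir('some_empty_dir', true, true);

-- Missing directory: with if_not_exists = true, no ERROR is raised
-- and zero rows come back, distinguishable from the empty case above.
SELECT pg_ls_dir('no_such_dir', true, true);
```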
Re: [HACKERS] pg_rewind failure by file deletion in source server
On Fri, Jun 19, 2015 at 12:14 AM, Michael Paquier michael.paqu...@gmail.com wrote: Listing the directories with pg_ls_dir() has the same problem. (After some discussion on IM with Heikki on this one). This is actually more tricky because pg_ls_dir() does not return '.' or '..' that we could use to identify that the directory actually exists or not when it is empty. Hence I think that we should add two options to pg_ls_dir: - include_self, default to false. If set to true, '.' is added in the list of items. - if_not_exists, to bypass the error when a folder does not exist, default at false. If if_not_exists = true and include_self = true, returning only '.' would mean that the folder exists but is empty. If if_not_exists = true and include_self = false, no rows are returned. We could also ERROR if both options are set like that. I am fine with any of them as long as the behavior is properly documented. Including '.' to distinguish between an empty directory and a nonexistent one seems like an unnecessarily complicated and non-obvious API. How about just one additional parameter bool *exists. If NULL and no directory, ERROR, else on return set *exists to true or false. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] pg_rewind failure by file deletion in source server
On 06/16/2015 02:04 AM, Michael Paquier wrote: In order to reduce the risk of failure to a minimum and to preserve the performance of the tool when using --source-server, I think that we should add some check using pg_stat_file to see if a file is still present or not, and if it is missing we can safely skip it thanks to minRecoveryPoint. Now the problem is that pg_stat_file fails automatically if the file targeted is missing. Hence, to avoid a bunch of round trips with the server with one call to pg_stat_file per file, I think that we should add some if_not_exists option to pg_stat_file, defaulting to false, to skip the error related to the file missing and have it return NULL in this case. Then we could use this filter on the file path in libpq_executeFileMap() to fetch only the file chunks that actually exist on the server. You'll also need to add the option to pg_read_binary_file, though, because even if you do a test with pg_stat_file() just before reading the file, there's a race condition: someone might still delete the file between pg_stat_file() and pg_read_file(). Listing the directories with pg_ls_dir() has the same problem. As does pg_tablespace_location(). Note that we could as well use some plpgsql-ing to do the same, but the extension of pg_stat_file looks more useful to me. Thoughts? Hmm. You'll need to add the option to all of those functions. Maybe it's nevertheless the simplest approach. - Heikki
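[Editorial note: to make the point concrete, a hedged sketch of the option applied across these functions. The signatures are hypothetical here; the trailing boolean is the proposed if_not_exists flag:]

```sql
-- Proposed: NULL instead of ERROR when the file is gone.
SELECT pg_stat_file('base/16384/16385_fsm', true);

-- The same flag is needed on the read side: even after a successful
-- stat, the file can vanish before the read, so the read function
-- itself must be able to return NULL for a missing file.
SELECT pg_read_binary_file('base/16384/16385_fsm', 0, 8192, true);
```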
Re: [HACKERS] pg_rewind failure by file deletion in source server
On Thu, Jun 18, 2015 at 10:55 PM, Heikki Linnakangas hlinn...@iki.fi wrote: On 06/16/2015 02:04 AM, Michael Paquier wrote: In order to reduce the risk of failure to a minimum and to preserve the performance of the tool when using --source-server, I think that we should add some check using pg_stat_file to see if a file is still present or not, and if it is missing we can safely skip it thanks to minRecoveryPoint. Now the problem is that pg_stat_file fails automatically if the file targeted is missing. Hence, to avoid a bunch of round trips with the server with one call to pg_stat_file per file, I think that we should add some if_not_exists option to pg_stat_file, defaulting to false, to skip the error related to the file missing and have it return NULL in this case. Then we could use this filter on the file path in libpq_executeFileMap() to fetch only the file chunks that actually exist on the server. You'll also need to add the option to pg_read_binary_file, though, because even if you do a test with pg_stat_file() just before reading the file, there's a race condition: someone might still delete the file between pg_stat_file() and pg_read_file(). I propose to return NULL values if the file does not exist and if_not_exists = true for both of them. Does that sound fine? Listing the directories with pg_ls_dir() has the same problem. (After some discussion on IM with Heikki on this one). This is actually more tricky because pg_ls_dir() does not return '.' or '..' that we could use to identify that the directory actually exists or not when it is empty. Hence I think that we should add two options to pg_ls_dir: - include_self, default to false. If set to true, '.' is added in the list of items. - if_not_exists, to bypass the error when a folder does not exist, default at false. If if_not_exists = true and include_self = true, returning only '.' would mean that the folder exists but is empty. If if_not_exists = true and include_self = false, no rows are returned. 
We could also ERROR if both options are set like that. I am fine with any of them as long as the behavior is properly documented. As does pg_tablespace_location(). NULL if the tablespace path does not exist anymore. Is that fine? Note that we could as well use some plpgsql-ing to do the same, but the extension of pg_stat_file looks more useful to me. Thoughts? Hmm. You'll need to add the option to all of those functions. Maybe it's nevertheless the simplest approach. With plpgsql you could use a try/catch/raising block to do the work. But it still looks better to me to have alternative options with the in-core functions. I am fine with spending time on all those things and providing test cases, let's just get a precise picture of what we want first. Regards, -- Michael
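[Editorial note: the plpgsql alternative mentioned here would look roughly like the sketch below. The _safe function name is made up for illustration; a missing file surfaces as SQLSTATE 58P01, condition name undefined_file:]

```sql
-- Sketch of the plpgsql workaround: swallow the undefined_file error
-- instead of extending the in-core C functions.
CREATE OR REPLACE FUNCTION pg_read_binary_file_safe(
    path text, off bigint, len bigint)
RETURNS bytea LANGUAGE plpgsql AS $$
BEGIN
    RETURN pg_read_binary_file(path, off, len);
EXCEPTION WHEN undefined_file THEN
    RETURN NULL;  -- file vanished on the source: report NULL, not ERROR
END;
$$;
```

The downside, as noted in the message above, is that every caller (including pg_rewind's generated queries) has to be taught about the wrapper, which is why extending the built-in functions looks cleaner.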
Re: [HACKERS] pg_rewind failure by file deletion in source server
On Fri, Jun 12, 2015 at 9:02 PM, Fujii Masao wrote: You want to draft a patch? Should I? Please feel free to try that! :) OK, so attached are a patch and a test case able to easily trigger the error. Apply the patch and run the test case to reproduce the following failure: ERROR: could not open file base/16384/16385_fsm for reading: No such file or directory STATEMENT: SELECT path, begin, pg_read_binary_file(path, begin, len) AS chunk FROM fetchchunks The patch adds a call to pg_usleep after the list of files from the source server has been fetched with libpq in pg_rewind.c to allow time to run some DROP actions, like DROP DATABASE, DROP TABLE, etc. in order to trigger the error easily. In order to reduce the risk of failure to a minimum and to preserve the performance of the tool when using --source-server, I think that we should add some check using pg_stat_file to see if a file is still present or not, and if it is missing we can safely skip it thanks to minRecoveryPoint. Now the problem is that pg_stat_file fails automatically if the file targeted is missing. Hence, to avoid a bunch of round trips with the server with one call to pg_stat_file per file, I think that we should add some if_not_exists option to pg_stat_file, defaulting to false, to skip the error related to the file missing and have it return NULL in this case. Then we could use this filter on the file path in libpq_executeFileMap() to fetch only the file chunks that actually exist on the server. Note that we could as well use some plpgsql-ing to do the same, but the extension of pg_stat_file looks more useful to me. Thoughts? -- Michael Attachments: rewind_test.bash, 20150616_pgrewind_sleep.patch
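[Editorial note: with such an option in place, the failing query shown above could tolerate concurrent deletions. A sketch, assuming pg_read_binary_file() gains the proposed if_not_exists argument:]

```sql
SELECT path, begin,
       pg_read_binary_file(path, begin, len, true) AS chunk
FROM fetchchunks;
-- Rows where chunk IS NULL correspond to files deleted on the source
-- after the file map was built; pg_rewind can skip those safely,
-- since the rewound node replays the deletion from WAL anyway.
```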
Re: [HACKERS] pg_rewind failure by file deletion in source server
On Thu, Jun 11, 2015 at 5:48 PM, Fujii Masao masao.fu...@gmail.com wrote: On Thu, Jun 11, 2015 at 2:14 PM, Michael Paquier michael.paqu...@gmail.com wrote: On Thu, Jun 11, 2015 at 1:51 AM, Fujii Masao masao.fu...@gmail.com wrote: Shouldn't pg_rewind ignore that failure of operation? If the file is not found in source server, the file doesn't need to be copied to destination server obviously. So ISTM that pg_rewind safely can skip copying that file. Thought? I think that you should fail. Let's imagine that the master to be rewound has removed a relation file before being stopped cleanly after its standby has been promoted that was here at the last checkpoint before forking, and that the standby still has the relation file after promotion. You should be able to copy it to be able to replay WAL on it. If the standby has removed a file in the file map after taking the file map, I guess that the best thing to do is fail because the file that should be here for the rewound node cannot be fetched. In this case, why do you think that the file should exist in the old master? Even if it doesn't exist, ISTM that the old master can safely replay the WAL records related to the file when it restarts. So what's the problem if the file doesn't exist in the old master? Well, some user may want to rewind the master down to the point where WAL forked, and then recover it immediately when a consistent point is reached just at restart instead of replugging it into the cluster. In this case I think that you need the relation file of the dropped relation to get a consistent state. That's still cheaper than recreating a node from a fresh base backup in some cases, particularly if the last base backup taken is far in the past for this cluster. Documentation should be made clearer about that with a better error message... I'm wondering how we can recover (or rewind again) the old master from that error. 
This also would need to be documented if we decide not to fix any code regarding this problem... FWIW, here is a scenario able to trigger the error with 1 master (port 5432, data at ~/data/5432) and 1 standby (port 5433, data at ~/data/5433).

$ psql -c 'create table aa as select generate_series(1,100)'
# Promote standby
$ pg_ctl promote -D ~/data/5433/
# Drop table on master
$ psql -c 'drop table aa'
DROP TABLE
$ pg_ctl stop -D ~/data/5432/

At this point there is no more relation file on the master for 'aa', while it is still present on the standby. Running pg_rewind at this point will work, and the relation file would be copied from the promoted standby to the master.

$ lldb -- pg_rewind -D 5432 --source-server=port=5433 dbname=postgres

Break in pg_rewind after fetchSourceFileList() and before replaying the changes from the block map, drop table 'aa' on the standby and checkpoint it; the source file list is then inconsistent and pg_rewind will fail. This can only happen with --source-server; with --source-pgdata the source cluster is shut down, so its files cannot disappear mid-run. Adding a sleep() of a couple of seconds in pg_rewind may be better to trigger the error directly ;), with DROP DATABASE for example. Regards, -- Michael
Re: [HACKERS] pg_rewind failure by file deletion in source server
On Fri, Jun 12, 2015 at 3:17 PM, Michael Paquier michael.paqu...@gmail.com wrote: On Thu, Jun 11, 2015 at 5:48 PM, Fujii Masao masao.fu...@gmail.com wrote: On Thu, Jun 11, 2015 at 2:14 PM, Michael Paquier michael.paqu...@gmail.com wrote: On Thu, Jun 11, 2015 at 1:51 AM, Fujii Masao masao.fu...@gmail.com wrote: Shouldn't pg_rewind ignore that failure of operation? If the file is not found in source server, the file doesn't need to be copied to destination server obviously. So ISTM that pg_rewind safely can skip copying that file. Thought? I think that you should fail. Let's imagine that the master to be rewound has removed a relation file before being stopped cleanly after its standby has been promoted that was here at the last checkpoint before forking, and that the standby still has the relation file after promotion. You should be able to copy it to be able to replay WAL on it. If the standby has removed a file in the file map after taking the file map, I guess that the best thing to do is fail because the file that should be here for the rewound node cannot be fetched. In this case, why do you think that the file should exist in the old master? Even if it doesn't exist, ISTM that the old master can safely replay the WAL records related to the file when it restarts. So what's the problem if the file doesn't exist in the old master? Well, some user may want to rewind the master down to the point where WAL forked, and then recover it immediately when a consistent point is reached just at restart instead of replugging it into the cluster. In this case I think that you need the relation file of the dropped relation to get a consistent state. That's still cheaper than recreating a node from a fresh base backup in some cases, particularly if the last base backup taken is far in the past for this cluster. So it's the case where a user wants to recover old master up to the point BEFORE the file in question is deleted in new master. 
At that point, since the file must exist, pg_rewind should fail if the file cannot be copied from new master. Is my understanding right? As far as I read the code of pg_rewind, ISTM that your scenario never happens. Because pg_rewind sets the minimum recovery point to the latest WAL location in new master, i.e., AFTER the file is deleted. So old master cannot stop recovering before the file is deleted in new master. If the recovery stops at that point, it fails because the minimum recovery point is not reached yet. IOW, after pg_rewind runs, the old master has to replay the WAL records which were generated by the deletion of the file in the new master. So it's okay if the old master doesn't have the file after pg_rewind runs, I think. Regards, -- Fujii Masao
Re: [HACKERS] pg_rewind failure by file deletion in source server
On Fri, Jun 12, 2015 at 4:29 PM, Michael Paquier michael.paqu...@gmail.com wrote: On Fri, Jun 12, 2015 at 3:50 PM, Fujii Masao masao.fu...@gmail.com wrote: On Fri, Jun 12, 2015 at 3:17 PM, Michael Paquier michael.paqu...@gmail.com wrote: On Thu, Jun 11, 2015 at 5:48 PM, Fujii Masao masao.fu...@gmail.com wrote: On Thu, Jun 11, 2015 at 2:14 PM, Michael Paquier michael.paqu...@gmail.com wrote: On Thu, Jun 11, 2015 at 1:51 AM, Fujii Masao masao.fu...@gmail.com wrote: Shouldn't pg_rewind ignore that failure of operation? If the file is not found in source server, the file doesn't need to be copied to destination server obviously. So ISTM that pg_rewind safely can skip copying that file. Thought? I think that you should fail. Let's imagine that the master to be rewound has removed a relation file before being stopped cleanly after its standby has been promoted that was here at the last checkpoint before forking, and that the standby still has the relation file after promotion. You should be able to copy it to be able to replay WAL on it. If the standby has removed a file in the file map after taking the file map, I guess that the best thing to do is fail because the file that should be here for the rewound node cannot be fetched. In this case, why do you think that the file should exist in the old master? Even if it doesn't exist, ISTM that the old master can safely replay the WAL records related to the file when it restarts. So what's the problem if the file doesn't exist in the old master? Well, some user may want to rewind the master down to the point where WAL forked, and then recover it immediately when a consistent point is reached just at restart instead of replugging it into the cluster. In this case I think that you need the relation file of the dropped relation to get a consistent state. That's still cheaper than recreating a node from a fresh base backup in some cases, particularly if the last base backup taken is far in the past for this cluster. 
So it's the case where a user wants to recover old master up to the point BEFORE the file in question is deleted in new master. At that point, since the file must exist, pg_rewind should fail if the file cannot be copied from new master. Is my understanding right? Yep. We are on the same line. As far as I read the code of pg_rewind, ISTM that your scenario never happens. Because pg_rewind sets the minimum recovery point to the latest WAL location in new master, i.e., AFTER the file is deleted. So old master cannot stop recovering before the file is deleted in new master. If the recovery stops at that point, it fails because the minimum recovery point is not reached yet. IOW, after pg_rewind runs, the old master has to replay the WAL records which were generated by the deletion of the file in the new master. So it's okay if the old master doesn't have the file after pg_rewind runs, I think. Ah, right. I withdraw, indeed what I thought cannot happen:

/*
 * Update control file of target. Make it ready to perform archive
 * recovery when restarting.
 *
 * minRecoveryPoint is set to the current WAL insert location in the
 * source server. Like in an online backup, it's important that we recover
 * all the WAL that was generated while we copied the files over.
 */

So a rewound node will replay WAL up to the current insert location of the source, and will fail at recovery if the recovery target is older than this insert location. You want to draft a patch? Should I? Please feel free to try that! :) I think that we should have a test case as well in pg_rewind/t/. Maybe. Regards, -- Fujii Masao
Re: [HACKERS] pg_rewind failure by file deletion in source server
On Thu, Jun 11, 2015 at 2:14 PM, Michael Paquier michael.paqu...@gmail.com wrote: On Thu, Jun 11, 2015 at 1:51 AM, Fujii Masao masao.fu...@gmail.com wrote: Shouldn't pg_rewind ignore that failure of operation? If the file is not found in source server, the file doesn't need to be copied to destination server obviously. So ISTM that pg_rewind safely can skip copying that file. Thought? I think that you should fail. Let's imagine that the master to be rewound has removed a relation file before being stopped cleanly after its standby has been promoted that was here at the last checkpoint before forking, and that the standby still has the relation file after promotion. You should be able to copy it to be able to replay WAL on it. If the standby has removed a file in the file map after taking the file map, I guess that the best thing to do is fail because the file that should be here for the rewound node cannot be fetched. In this case, why do you think that the file should exist in the old master? Even if it doesn't exist, ISTM that the old master can safely replay the WAL records related to the file when it restarts. So what's the problem if the file doesn't exist in the old master? Documentation should be made clearer about that with a better error message... I'm wondering how we can recover (or rewind again) the old master from that error. This also would need to be documented if we decide not to fix any code regarding this problem... Regards, -- Fujii Masao
[HACKERS] pg_rewind failure by file deletion in source server
Hi, While testing pg_rewind, I got the following error and pg_rewind failed. $ pg_rewind -D ... --source-server=... -P ERROR: could not open file base/13243/16384 for reading: No such file or directory STATEMENT: SELECT path, begin, pg_read_binary_file(path, begin, len) AS chunk FROM fetchchunks As far as I read the pg_rewind code, ISTM that the file deletion in source server while pg_rewind is running can cause pg_rewind to fail. That is, at first pg_rewind picks up the files to copy (or do some actions) and creates the file map. Then it performs the actual operation (e.g., file copy from source to dest) according to the file map. The problem can happen if the source server deletes the file listed in the file map before pg_rewind performs the actual operations. The copy of the file must fail because it's not found in source server, and then pg_rewind exits with an error. Shouldn't pg_rewind ignore that failure of operation? If the file is not found in source server, the file doesn't need to be copied to destination server obviously. So ISTM that pg_rewind safely can skip copying that file. Thought? Regards, -- Fujii Masao
Re: [HACKERS] pg_rewind failure by file deletion in source server
On Thu, Jun 11, 2015 at 1:51 AM, Fujii Masao masao.fu...@gmail.com wrote: Shouldn't pg_rewind ignore that failure of operation? If the file is not found in source server, the file doesn't need to be copied to destination server obviously. So ISTM that pg_rewind safely can skip copying that file. Thought? I think that you should fail. Let's imagine that the master to be rewound has removed a relation file before being stopped cleanly after its standby has been promoted that was here at the last checkpoint before forking, and that the standby still has the relation file after promotion. You should be able to copy it to be able to replay WAL on it. If the standby has removed a file in the file map after taking the file map, I guess that the best thing to do is fail because the file that should be here for the rewound node cannot be fetched. Documentation should be made clearer about that with a better error message... -- Michael