Re: [Patch] pg_rewind: options to use restore_command from recovery.conf or command line
On 07.03.2019 10:26, David Steele wrote: On 3/6/19 5:38 PM, Andrey Borodin wrote: The new patch is much smaller (less than 400 lines) and works as advertised. There's a typo "retreive" there. Ough, corrected this in three different places. Not my word, definitely. Thanks! These lines look a little suspicious: char postgres_exec_path[MAXPGPATH], postgres_cmd[MAXPGPATH], cmd_output[MAX_RESTORE_COMMAND]; Is it supposed to be any difference between MAXPGPATH and MAX_RESTORE_COMMAND? Yes, it was supposed to be, but after your message I double-checked everything and figured out that we use MAXPGPATH for the final restore_command build (with all aliases replaced). Thus, there is no need for a separate constant; I have replaced it with MAXPGPATH. This patch appears to need attention from the author so I have marked it Waiting on Author. I hope I have addressed all issues in the new patch version, which is attached. Also, I have added a more detailed explanation of the new functionality to the multi-line commit message. Regards, -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company >From 9770cab4909a3cd98c2db2b8a9fa4af1fedd4614 Mon Sep 17 00:00:00 2001 From: Alexey Kondratov Date: Tue, 19 Feb 2019 19:14:53 +0300 Subject: [PATCH v5] pg_rewind: options to use restore_command from command line or cluster config Previously, when pg_rewind could not find required WAL files in the target data directory, the rewind process would fail. One had to manually figure out which of the required WAL files had already been moved to the archival storage and copy them back. This patch adds the possibility to specify restore_command via a command line option or to use the one specified inside postgresql.conf. The specified restore_command will be used for automatic retrieval of missing WAL files from archival storage. 
--- doc/src/sgml/ref/pg_rewind.sgml | 30 - src/bin/pg_rewind/parsexlog.c | 161 +- src/bin/pg_rewind/pg_rewind.c | 96 ++- src/bin/pg_rewind/pg_rewind.h | 7 +- src/bin/pg_rewind/t/001_basic.pl | 4 +- src/bin/pg_rewind/t/002_databases.pl | 4 +- src/bin/pg_rewind/t/003_extrafiles.pl | 4 +- src/bin/pg_rewind/t/RewindTest.pm | 84 +- 8 files changed, 370 insertions(+), 20 deletions(-) diff --git a/doc/src/sgml/ref/pg_rewind.sgml b/doc/src/sgml/ref/pg_rewind.sgml index 53a64ee29e..90e3f22f97 100644 --- a/doc/src/sgml/ref/pg_rewind.sgml +++ b/doc/src/sgml/ref/pg_rewind.sgml @@ -67,8 +67,10 @@ PostgreSQL documentation ancestor. In the typical failover scenario where the target cluster was shut down soon after the divergence, this is not a problem, but if the target cluster ran for a long time after the divergence, the old WAL - files might no longer be present. In that case, they can be manually - copied from the WAL archive to the pg_wal directory, or + files might no longer be present. In that case, they can be automatically + copied by pg_rewind from the WAL archive to the + pg_wal directory if either -r or + -R option is specified, or fetched on startup by configuring or . The use of pg_rewind is not limited to failover, e.g. a standby @@ -200,6 +202,30 @@ PostgreSQL documentation + + -r + --use-postgresql-conf + + +Use restore_command in the postgresql.conf to +retrieve missing in the target pg_wal directory +WAL files from the WAL archive. + + + + + + -R restore_command + --restore-command=restore_command + + +Specifies the restore_command to use for retrieval of the missing +in the target pg_wal directory WAL files from +the WAL archive. 
+ + + + --debug diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c index e19c265cbb..6be6dab7e0 100644 --- a/src/bin/pg_rewind/parsexlog.c +++ b/src/bin/pg_rewind/parsexlog.c @@ -12,6 +12,7 @@ #include "postgres_fe.h" #include +#include #include "pg_rewind.h" #include "filemap.h" @@ -45,6 +46,7 @@ static char xlogfpath[MAXPGPATH]; typedef struct XLogPageReadPrivate { const char *datadir; + const char *restoreCommand; int tliIndex; } XLogPageReadPrivate; @@ -53,6 +55,9 @@ static int SimpleXLogPageRead(XLogReaderState *xlogreader, int reqLen, XLogRecPtr targetRecPtr, char *readBuf, TimeLineID *pageTLI); +static int RestoreArchivedWAL(const char *path, const char *xlogfname, + off_t expectedSize, const char *restoreCommand); + /* * Read WAL from the datadir/pg_wal, starting from 'startpoint' on timeline * index 'tliIndex' in target timeline history, until 'endpoint'. Make note of @@ -60,7 +65,7 @@ static int SimpleX
Re: [Patch] pg_rewind: options to use restore_command from recovery.conf or command line
On 26.03.2019 11:19, Michael Paquier wrote: + * This is a simplified and adapted to frontend version of + * RestoreArchivedFile function from transam/xlogarchive.c + */ +static int +RestoreArchivedWAL(const char *path, const char *xlogfname, I don't think that we should have duplicates for that, so I would recommend refactoring the code so as a unique code path is taken by both, especially since the user can fetch the command from postgresql.conf. This comment has been here since the beginning of my work on this patch and is now rather misleading. Even if we do not take into account obvious differences like error reporting, different log levels based on many conditions, cleanup options, and the check for standby mode, restore_command execution during backend recovery and during pg_rewind has a very important difference. If it fails in the backend, then, as stated in the comment 'Remember, we rollforward UNTIL the restore fails so failure here is just part of the process' -- it is OK. By contrast, if pg_rewind fails to recover some required WAL segment, it definitely means the end of the entire process, since we will fail at finding the last common checkpoint or extracting the page map. The only part we can share is constructing restore_command with alias replacement. However, even there the logic is slightly different, since we do not need the %r alias for pg_rewind. The only use case of %r in restore_command I know of is pg_standby, which does not seem to be a case for pg_rewind. I have tried to move this part to common code, but it becomes full of conditions and less concise. Please correct me if I am wrong, but it seems that there are enough differences to keep this function separate, doesn't it? Why two options? Wouldn't actually be enough use-postgresql-conf to do the job? 
Note that "postgres" should always be installed if pg_rewind is present because it is a backend-side utility, so while I don't like adding a dependency to other binaries in one binary, having an option to pass out a command directly via the command line of pg_rewind stresses me more. I am not familiar enough with DBA scenarios, where -R option may be useful, but I was asked a few times for that. I can only speculate that for example someone may want to run freshly rewinded cluster as master, not replica, so its config may differ from replica's one, where restore_command is surely intended to be. Thus, it is easier to leave master's config at the place and just specify restore_command as command line argument. Don't we need to worry about signals interrupting the restore command? It seems to me that some refactoring from the stuff in xlogarchive.c would be in order. Thank you for pointing me to this place again. Previously, I thought that we should not care about it, since if restore_command was not successful due to any reason, then rewind failed, so we will stop and exit at upper levels. However, if it was due to a signal, then some of next messages may be misleading, if e.g. user manually interrupted it for some reason. So that, I added a similar check here as well. Updated version of patch is attached. -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company >From 9e00f7a7696a88f350e1e328a9758ab85631c813 Mon Sep 17 00:00:00 2001 From: Alexey Kondratov Date: Tue, 19 Feb 2019 19:14:53 +0300 Subject: [PATCH v6] pg_rewind: options to use restore_command from command line or cluster config Previously, when pg_rewind could not find required WAL files in the target data directory the rewind process would fail. One had to manually figure out which of required WAL files have already moved to the archival storage and copy them back. 
This patch adds possibility to specify restore_command via command line option or use one specified inside postgresql.conf. Specified restore_command will be used for automatic retrieval of missing WAL files from archival storage. --- doc/src/sgml/ref/pg_rewind.sgml | 30 - src/bin/pg_rewind/parsexlog.c | 167 +- src/bin/pg_rewind/pg_rewind.c | 96 ++- src/bin/pg_rewind/pg_rewind.h | 7 +- src/bin/pg_rewind/t/001_basic.pl | 4 +- src/bin/pg_rewind/t/002_databases.pl | 4 +- src/bin/pg_rewind/t/003_extrafiles.pl | 4 +- src/bin/pg_rewind/t/RewindTest.pm | 84 - 8 files changed, 376 insertions(+), 20 deletions(-) diff --git a/doc/src/sgml/ref/pg_rewind.sgml b/doc/src/sgml/ref/pg_rewind.sgml index 53a64ee29e..90e3f22f97 100644 --- a/doc/src/sgml/ref/pg_rewind.sgml +++ b/doc/src/sgml/ref/pg_rewind.sgml @@ -67,8 +67,10 @@ PostgreSQL documentation ancestor. In the typical failover scenario where the target cluster was shut down soon after the divergence, this is not a problem, but if the target cluster ran for a long time after the divergence, the old WAL - f
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
Hi Tomas, On 14.01.2019 21:23, Tomas Vondra wrote: Attached is an updated patch series, merging fixes and changes to TAP tests proposed by Alexey. I've merged the fixes into the appropriate patches, and I've kept the TAP changes / new tests as separate patches towards the end of the series. I had problems applying this patch along with the 2pc streaming one to the current master, but everything applied well on 97c39498e5. Regression tests pass. What I personally do not like in the current TAP test set is that you have added "WITH (streaming=on)" to all tests, including the old non-streaming ones. It is unclear which mechanism is tested there: streaming, but those transactions probably do not hit the memory limit, so it depends on default server parameters; or non-streaming, but then what is the need for (streaming=on)? I would prefer to add (streaming=on) only to the new tests, where it is clearly necessary. I'm a bit unhappy with two aspects of the current patch series: 1) We now track schema changes in two ways - using the pre-existing schema_sent flag in RelationSyncEntry, and the (newly added) flag in ReorderBuffer. While those options are used for regular vs. streamed transactions, fundamentally it's the same thing and so having two competing ways seems like a bad idea. Not sure what's the best way to resolve this, though. Yes, sure, when I found problems with streaming of extensive DDL, I added the new flag in the simplest way, and it worked. Now, the old schema_sent flag is per-relation, while the new one, is_schema_sent, is per top-level transaction. If I get it correctly, the former seems to be more economical, since the new schema is sent only if we are streaming a change for a relation whose schema is outdated. In contrast, in the latter case we will send the new schema even if there are no new changes belonging to this relation. I guess it would be better to stick to the old behavior. 
I will try to investigate how to better use it in the streaming mode as well. 2) We've removed quite a few asserts, particularly ensuring sanity of cmin/cmax values. To some extent that's expected, because by allowing decoding of in-progress transactions relaxes some of those rules. But I'd be much happier if some of those asserts could be reinstated, even if only in a weaker form. Asserts have been removed from two places: (1) HeapTupleSatisfiesHistoricMVCC, which seems inevitable, since we are touching the essence of the MVCC visibility rules, when trying to decode an in-progress transaction, and (2) ReorderBufferBuildTupleCidHash, which is probably not related directly to the topic of the ongoing patch, since Arseny Sher faced the same issue with simple repetitive DDL decoding [1] recently. Not many, but I agree, that replacing them with some softer asserts would be better, than just removing, especially point 1). [1] https://www.postgresql.org/message-id/flat/874l9p8hyw.fsf%40ars-thinkpad Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company
Re: Too rigorous assert in reorderbuffer.c
Hi, On 31.01.2019 9:21, Arseny Sher wrote: My colleague Alexander Lakhin has noticed an assertion failure in reorderbuffer.c:1330. Here is a simple snippet reproducing it:

SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
create table t(k int);
begin;
savepoint a;
alter table t alter column k type text;
rollback to savepoint a;
alter table t alter column k type bigint;
commit;
SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');

I just want to add that I accidentally discovered the same issue while testing Tomas's large transaction streaming patch [1], and had to remove this assert to get things working. I thought that it was somehow related to the streaming mode and did not test the same query alone. [1] https://www.postgresql.org/message-id/flat/874l9p8hyw.fsf%40ars-thinkpad Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company
Re: [Patch] pg_rewind: options to use restore_command from recovery.conf or command line
On 21.01.2019 23:50, a.kondra...@postgrespro.ru wrote: Thank you for the review! I have updated the patch according to your comments and remarks. Please, find new version attached. During the self-reviewing of the code and tests, I discovered some problems with build on Windows. New version of the patch is attached and it fixes this issue as well as includes some minor code revisions. Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company >From 99c6d94f37a797400d41545a271ff111b92e9361 Mon Sep 17 00:00:00 2001 From: Alexey Kondratov Date: Fri, 21 Dec 2018 14:00:30 +0300 Subject: [PATCH] pg_rewind: options to use restore_command from postgresql.conf or command line. --- doc/src/sgml/ref/pg_rewind.sgml | 30 +- src/backend/Makefile | 4 +- src/backend/commands/extension.c | 1 + src/backend/utils/misc/.gitignore | 1 - src/backend/utils/misc/Makefile | 8 - src/backend/utils/misc/guc.c | 434 +-- src/bin/pg_rewind/Makefile| 2 +- src/bin/pg_rewind/parsexlog.c | 166 +- src/bin/pg_rewind/pg_rewind.c | 100 +++- src/bin/pg_rewind/pg_rewind.h | 10 +- src/bin/pg_rewind/t/001_basic.pl | 4 +- src/bin/pg_rewind/t/002_databases.pl | 4 +- src/bin/pg_rewind/t/003_extrafiles.pl | 4 +- src/bin/pg_rewind/t/RewindTest.pm | 93 +++- src/common/.gitignore | 1 + src/common/Makefile | 9 +- src/{backend/utils/misc => common}/guc-file.l | 518 -- src/include/common/guc-file.h | 50 ++ src/include/utils/guc.h | 39 +- src/tools/msvc/Mkvcbuild.pm | 7 +- src/tools/msvc/clean.bat | 2 +- 21 files changed, 973 insertions(+), 514 deletions(-) delete mode 100644 src/backend/utils/misc/.gitignore rename src/{backend/utils/misc => common}/guc-file.l (60%) create mode 100644 src/include/common/guc-file.h diff --git a/doc/src/sgml/ref/pg_rewind.sgml b/doc/src/sgml/ref/pg_rewind.sgml index 53a64ee29e..0c2441afa7 100644 --- a/doc/src/sgml/ref/pg_rewind.sgml +++ b/doc/src/sgml/ref/pg_rewind.sgml @@ -67,8 +67,10 @@ PostgreSQL documentation ancestor. 
In the typical failover scenario where the target cluster was shut down soon after the divergence, this is not a problem, but if the target cluster ran for a long time after the divergence, the old WAL - files might no longer be present. In that case, they can be manually - copied from the WAL archive to the pg_wal directory, or + files might no longer be present. In that case, they can be automatically + copied by pg_rewind from the WAL archive to the + pg_wal directory if either -r or + -R option is specified, or fetched on startup by configuring or . The use of pg_rewind is not limited to failover, e.g. a standby @@ -200,6 +202,30 @@ PostgreSQL documentation + + -r + --use-postgresql-conf + + +Use restore_command in the postgresql.conf to +retreive missing in the target pg_wal directory +WAL files from the WAL archive. + + + + + + -R restore_command + --restore-command=restore_command + + +Specifies the restore_command to use for retrieval of the missing +in the target pg_wal directory WAL files from +the WAL archive. 
+ + + + --debug diff --git a/src/backend/Makefile b/src/backend/Makefile index 478a96db9b..721cb57e89 100644 --- a/src/backend/Makefile +++ b/src/backend/Makefile @@ -186,7 +186,7 @@ distprep: $(MAKE) -C replication repl_gram.c repl_scanner.c syncrep_gram.c syncrep_scanner.c $(MAKE) -C storage/lmgr lwlocknames.h lwlocknames.c $(MAKE) -C utils distprep - $(MAKE) -C utils/misc guc-file.c + $(MAKE) -C common guc-file.c $(MAKE) -C utils/sort qsort_tuple.c @@ -307,7 +307,7 @@ maintainer-clean: distclean replication/syncrep_scanner.c \ storage/lmgr/lwlocknames.c \ storage/lmgr/lwlocknames.h \ - utils/misc/guc-file.c \ + common/guc-file.c \ utils/sort/qsort_tuple.c diff --git a/src/backend/commands/extension.c b/src/backend/commands/extension.c index daf3f51636..195eb8a821 100644 --- a/src/backend/commands/extension.c +++ b/src/backend/commands/extension.c @@ -50,6 +50,7 @@ #include "commands/defrem.h" #include "commands/extension.h" #include "commands/schemacmds.h" +#include "common/guc-file.h" #include "funcapi.h" #include "mb/pg_wchar.h" #include "miscadmin.h" diff --git a/src/backend/utils/misc/.gitignore b/src/backend/utils/misc/.gitignore deleted file mode 10064
Re: [Patch] pg_rewind: options to use restore_command from recovery.conf or command line
Hi! On 09.02.2019 14:31, Andrey Borodin wrote: Here's a typo in postgreslq.conf + fprintf(stderr, _("%s: option -r/--use-postgresql-conf is specified, but postgreslq.conf is absent in the target directory\n"), Fixed, thanks. I am not attaching a new version of the patch for just one typo; maybe there will be some more remarks from others. Besides this, I think you can switch the patch to "Ready for committer". check-world is passing on macbook, docs are here, feature is implemented and tested. OK, cfbot [1] does not complain about anything on Linux and Windows as well, so I am setting it to "Ready for committer" for the next commitfest. [1] http://cfbot.cputube.org/alexey-kondratov.html Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company
Re: Logical replication and restore from pg_basebackup
Hi Dmitry, On 11.02.2019 17:39, Dmitry Vasiliev wrote: What is the scope of logical replication if I cannot make recovery from pg_basebackup? No, you can, but there are some things to keep in mind: 1) I could be wrong, but usage of pgbench in such a test seems to be a bad idea, since it drops and creates tables from scratch when -i is passed. However, if I recall correctly, pub/sub slots use the OIDs of relations, so I expect that you should get only the initial sync data on the replica and the last pgbench results on the master. 2) Next, the 'srsubstate' check works only for the initial sync. After that you should poll the master's replication slot lsn for 'pg_current_wal_lsn() <= replay_lsn'. Please find attached a slightly modified version of your test (also in a gist [1]), which works just fine. You should replace %username% with your current username, since I did not run it as the postgres user. [1] https://gist.github.com/ololobus/a8a11f11eb67dfa1b6a95bff5e8f0096 Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company logical-replication-test.sh Description: application/shellscript
Re: [Patch] pg_rewind: options to use restore_command from recovery.conf or command line
Hi Andres, Thank you for your feedback. On 16.02.2019 6:41, Andres Freund wrote: It sounds like a seriously bad idea to use a different parser for pg_rewind. Why don't you just use postgres for it? As in /path/to/postgres -D /path/to/datadir/ -C shared_buffers ? Initially, when I started working on this patch, recovery options were not a part of GUCs, so it was not possible. Now recovery.conf is a part of postgresql.conf, and postgres -C only reads the config files, initializes GUCs, prints the required parameter, and shuts down. Thus, it seems like an acceptable solution to me, though I am still a little afraid of starting up a server that is meant to be shut down during the rewind process, even for such a short period of time. What I am most concerned about is that pg_rewind has always been a standalone utility, so you were able to simply rewind one data directory relative to another without needing any other postgres binaries. If we rely on postgres -C, this would be tricky in some cases: - the end user must always ensure that the postmaster binaries are available; - even so, the appropriate postgres executable may be absent from the ENV/PATH; - the locations of pg_rewind and postgres may be arbitrary depending on the distribution, which may be custom as well. I cannot propose a reliable way of detecting the path to the postgres executable without directly asking users to provide it via PATH, a command line option, etc. If someone can suggest anything, it would be possible to make the patch simpler in some way, but I have always wanted to keep pg_rewind standalone and as simple as possible for end users. Anyway, I do not currently use a different parser for pg_rewind. A few versions back I made guc-file.l common for frontend/backend, so technically speaking it is the same parser the postmaster uses; only a small amount of sophisticated error reporting is wrapped with IFDEF. But if we go for that, that part of the patch *NEEDS* to be split into a separate commit/patch. 
It's too hard to see functional changes otherwise. Yes, sure, please find attached new version of the patch set consisting of two separated patches. First is for making guc-file.l common between frontend/backend and second one is for adding new options into pg_rewind. + if (restore_ok) + { + xlogreadfd = open(xlogfpath, O_RDONLY | PG_BINARY, 0); + + if (xlogreadfd < 0) + { + printf(_("could not open restored from archive file \"%s\": %s\n"), xlogfpath, + strerror(errno)); + return -1; + } + else + pg_log(PG_DEBUG, "using restored from archive version of file \"%s\"\n", xlogfpath); + } + else + { + printf(_("could not restore file \"%s\" from archive: %s\n"), xlogfname, + strerror(errno)); + return -1; + } } } I suggest moving this to a separate function. OK, I have slightly refactored and simplified this part. All checks of the recovered file have been moved into RestoreArchivedWAL. Hope it looks better now. Isn't this entirely broken? restore_command could be set in a different file no? Maybe I got it wrong, but I do not think so. Since recovery options are now a part of GUCs, restore_command may be only set inside postgresql.conf or any files/subdirs which are included there to take an effect, isn't it? Parser will walk postgresql.conf with all includes recursively and should eventually find it, if it was set. 
Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company >From c012e1e1149d04abc39bb4099fe1e18a4cd2ca2d Mon Sep 17 00:00:00 2001 From: Alexey Kondratov Date: Mon, 18 Feb 2019 12:23:37 +0300 Subject: [PATCH v3 2/2] Options to use restore_command with pg_rewind --- doc/src/sgml/ref/pg_rewind.sgml | 30 - src/bin/pg_rewind/Makefile| 2 +- src/bin/pg_rewind/parsexlog.c | 163 +- src/bin/pg_rewind/pg_rewind.c | 100 +++- src/bin/pg_rewind/pg_rewind.h | 10 +- src/bin/pg_rewind/t/001_basic.pl | 4 +- src/bin/pg_rewind/t/002_databases.pl | 4 +- src/bin/pg_rewind/t/003_extrafiles.pl | 4 +- src/bin/pg_rewind/t/RewindTest.pm | 93 ++- 9 files changed, 388 insertions(+), 22 deletions(-) diff --git a/doc/src/sgml/ref/pg_rewind.sgml b/doc/src/sgml/ref/pg_rewind.sgml index 53a6
Re: [Patch] pg_rewind: options to use restore_command from recovery.conf or command line
On 18.02.2019 19:49, Alvaro Herrera wrote: On 16.02.2019 6:41, Andres Freund wrote: It sounds like a seriously bad idea to use a different parser for pg_rewind. Why don't you just use postgres for it? As in /path/to/postgres -D /path/to/datadir/ -C shared_buffers ? Eh, this is what I suggested in this thread four months ago, though I didn't remember at the time that aaa6e1def292 had already introduced -C in 2011. It's definitely the way to go ... all this messing about with the parser is insane. Yes, but four months ago recovery options were not a part of GUCs. OK, if you and Andres are surely negative about solution with parser, then I will work out this one with postgres -C and come back till the next commitfest. I found that something similar is already used in pg_ctl and there is a mechanism for finding valid executables in exec.c. So it does not seem to be a big deal at the first sight. Thanks for replies! Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company
Re: [Patch] pg_rewind: options to use restore_command from recovery.conf or command line
Hi, I will work out this one with postgres -C and come back till the next commitfest. I found that something similar is already used in pg_ctl and there is a mechanism for finding valid executables in exec.c. So it does not seem to be a big deal at the first sight. I have reworked the patch, please find new version attached. It is 3 times as smaller than the previous one and now touches pg_rewind's code only. Tests are also slightly refactored in order to remove duplicated code. Execution of postgres -C is used for restore_command retrieval (if -r is passed) as being suggested. Otherwise everything works as before. Andres, Alvaro does it make sense now? Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company >From 4c8f5c228e089e7e72835ae5c409a5bc8425ab15 Mon Sep 17 00:00:00 2001 From: Alexey Kondratov Date: Tue, 19 Feb 2019 19:14:53 +0300 Subject: [PATCH v4] pg_rewind: options to use restore_command from command line or cluster config --- doc/src/sgml/ref/pg_rewind.sgml | 30 - src/bin/pg_rewind/parsexlog.c | 161 +- src/bin/pg_rewind/pg_rewind.c | 98 +++- src/bin/pg_rewind/pg_rewind.h | 7 +- src/bin/pg_rewind/t/001_basic.pl | 4 +- src/bin/pg_rewind/t/002_databases.pl | 4 +- src/bin/pg_rewind/t/003_extrafiles.pl | 4 +- src/bin/pg_rewind/t/RewindTest.pm | 84 +- 8 files changed, 372 insertions(+), 20 deletions(-) diff --git a/doc/src/sgml/ref/pg_rewind.sgml b/doc/src/sgml/ref/pg_rewind.sgml index 53a64ee29e..0c2441afa7 100644 --- a/doc/src/sgml/ref/pg_rewind.sgml +++ b/doc/src/sgml/ref/pg_rewind.sgml @@ -67,8 +67,10 @@ PostgreSQL documentation ancestor. In the typical failover scenario where the target cluster was shut down soon after the divergence, this is not a problem, but if the target cluster ran for a long time after the divergence, the old WAL - files might no longer be present. In that case, they can be manually - copied from the WAL archive to the pg_wal directory, or + files might no longer be present. 
In that case, they can be automatically + copied by pg_rewind from the WAL archive to the + pg_wal directory if either -r or + -R option is specified, or fetched on startup by configuring or . The use of pg_rewind is not limited to failover, e.g. a standby @@ -200,6 +202,30 @@ PostgreSQL documentation + + -r + --use-postgresql-conf + + +Use restore_command in the postgresql.conf to +retreive missing in the target pg_wal directory +WAL files from the WAL archive. + + + + + + -R restore_command + --restore-command=restore_command + + +Specifies the restore_command to use for retrieval of the missing +in the target pg_wal directory WAL files from +the WAL archive. + + + + --debug diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c index e19c265cbb..5978ec9b99 100644 --- a/src/bin/pg_rewind/parsexlog.c +++ b/src/bin/pg_rewind/parsexlog.c @@ -12,6 +12,7 @@ #include "postgres_fe.h" #include +#include #include "pg_rewind.h" #include "filemap.h" @@ -45,6 +46,7 @@ static char xlogfpath[MAXPGPATH]; typedef struct XLogPageReadPrivate { const char *datadir; + const char *restoreCommand; int tliIndex; } XLogPageReadPrivate; @@ -53,6 +55,9 @@ static int SimpleXLogPageRead(XLogReaderState *xlogreader, int reqLen, XLogRecPtr targetRecPtr, char *readBuf, TimeLineID *pageTLI); +static int RestoreArchivedWAL(const char *path, const char *xlogfname, + off_t expectedSize, const char *restoreCommand); + /* * Read WAL from the datadir/pg_wal, starting from 'startpoint' on timeline * index 'tliIndex' in target timeline history, until 'endpoint'. 
Make note of @@ -60,7 +65,7 @@ static int SimpleXLogPageRead(XLogReaderState *xlogreader, */ void extractPageMap(const char *datadir, XLogRecPtr startpoint, int tliIndex, - XLogRecPtr endpoint) + XLogRecPtr endpoint, const char *restore_command) { XLogRecord *record; XLogReaderState *xlogreader; @@ -69,6 +74,7 @@ extractPageMap(const char *datadir, XLogRecPtr startpoint, int tliIndex, private.datadir = datadir; private.tliIndex = tliIndex; + private.restoreCommand = restore_command; xlogreader = XLogReaderAllocate(WalSegSz, &SimpleXLogPageRead, &private); if (xlogreader == NULL) @@ -156,7 +162,7 @@ readOneRecord(const char *datadir, XLogRecPtr ptr, int tliIndex) void findLastCheckpoint(const char *datadir, XLogRecPtr forkptr, int tliIndex, XLogRecPtr *lastchkptrec, TimeLineID *lastchkpttli, - XLogRecPtr *lastchkptredo) + XLogRecPtr *lastchkptredo, const char *restoreCommand) {
Re: 2019-03 CF Summary / Review - Tranche #2
On 16.02.2019 8:45, Andres Freund wrote: - pg_rewind: options to use restore_command from recovery.conf or command line WOA: Was previously marked as RFC, but I don't see how it is. Possibly can be finished, but does require a good bit more work. Just sent new version of the patch to the thread [1], which removes all unnecessary complexity. I am willing to address all new issues during 2019-03 CF if any. [1] https://www.postgresql.org/message-id/c9cfabce-8fb6-493f-68ec-e0a72d957bf4%40postgrespro.ru Thanks -- Alexey Kondratov
Probably misleading comments or lack of tests in autoHeld portals management
Hi hackers, I am trying to figure out the current cursor/portal management and life cycle in Postgres. There are two if conditions for autoHeld portals: - 'if (portal->autoHeld)' inside AtAbort_Portals at portalmem.c:802; - '|| portal->autoHeld' inside AtCleanup_Portals at portalmem.c:871. Their removal does not seem to affect anything; make check-world passes. I have tried configure --with-perl/--with-python, which should exercise autoHeld portals, but nothing changed. This seems expected to me, since the autoHeld flag is always set along with createSubid=InvalidSubTransactionId inside HoldPinnedPortals, so the single check 'createSubid == InvalidSubTransactionId' should be enough. However, the comments are rather misleading: (1) portal.h:126 confirms my guess: 'If the portal is held over from a previous transaction, both subxids are InvalidSubTransactionId'; (2) while portalmem.c:797 states: 'This is similar to the case of a cursor from a previous transaction, but it could also be that the cursor was auto-held in this transaction, so it wants to live on'. I have tried, but could not build an example of a valid query for the case described in (2), and it is definitely absent from the regression tests. Am I missing something? I added Peter to cc, since he is the committer of 056a5a3, where autoHeld was introduced; maybe it will be easier for him to recall the context. Anyway, sorry for the noise if this question is actually trivial. Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company diff --git a/src/backend/utils/mmgr/portalmem.c b/src/backend/utils/mmgr/portalmem.c index a92b4541bd..841d88df76 100644 --- a/src/backend/utils/mmgr/portalmem.c +++ b/src/backend/utils/mmgr/portalmem.c @@ -798,8 +798,6 @@ AtAbort_Portals(void) * cursor from a previous transaction, but it could also be that the * cursor was auto-held in this transaction, so it wants to live on. 
*/ - if (portal->autoHeld) - continue; /* * If it was created in the current transaction, we can't do normal @@ -868,7 +866,7 @@ AtCleanup_Portals(void) * Do nothing to cursors held over from a previous transaction or * auto-held ones. */ - if (portal->createSubid == InvalidSubTransactionId || portal->autoHeld) + if (portal->createSubid == InvalidSubTransactionId) { Assert(portal->status != PORTAL_ACTIVE); Assert(portal->resowner == NULL);
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On 18.12.2018 1:28, Tomas Vondra wrote: 4) There was a problem with marking a top-level transaction as having catalog changes if one of its subtransactions has them. It was causing a problem with DDL statements just after a subtransaction start (savepoint), so data from new columns was not replicated. 5) Similar issue with schema sending. You send the schema only once per sub/transaction (IIRC), while we have to update the schema on each catalog change: invalidation execution, snapshot rebuild, adding new tuple cids. So I ended up adding an is_schema_send flag to ReorderBufferTXN, since it is easy to set it inside RB and read it in the output plugin. Probably we should choose a better place for this flag. Hmm. Can you share an example of how to trigger these issues? Test cases inside 014_stream_tough_ddl.pl and the old ones (with the streaming=true option added) should reproduce all these issues. In general, it happens in a txn like: INSERT SAVEPOINT ALTER TABLE ... ADD COLUMN INSERT where the second insert may see a stale version of the catalog. Interesting. Any idea where the extra overhead in this particular case comes from? It's hard to deduce that from the single flame graph, when I don't have anything to compare it with (i.e. the flame graph for the "normal" case). I guess the bottleneck is disk operations. You can check the logical_repl_worker_new_perf.svg flame graph: disk reads (~9%) and writes (~26%) take around 35% of CPU time in total. For comparison, please see the attached flame graph for the following transaction: INSERT INTO large_text SELECT (SELECT string_agg('x', ',') FROM generate_series(1, 2000)) FROM generate_series(1, 100); Execution Time: 44519.816 ms Time: 98333,642 ms (01:38,334) where disk IO is only ~7-8% in total. So we get very roughly the same ~x4-5 performance drop here. JFYI, I am using a machine with SSD for the tests. Therefore, it may help to write changes on the receiver in bigger chunks, rather than each change separately.
So I'm not particularly worried, but I'll look into that. I'd be much more worried if there was measurable overhead in cases when there's no streaming happening (either because it's disabled or the memory limit was not hit). What I have also just found is that if a table row is large enough to be TOASTed, e.g.: INSERT INTO large_text SELECT (SELECT string_agg('x', ',') FROM generate_series(1, 100)) FROM generate_series(1, 1000); then the logical_work_mem limit is not hit, and we neither stream nor spill this transaction to disk, even though it is still large. In contrast, the transaction above (with 100 smaller rows), being comparable in size, is streamed. I am not sure that it is easy to add proper accounting of TOAST-able columns, but it seems worth it. -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
Hi Tomas, I'm a bit confused by the changes to TAP tests. Per the patch summary, some .pl files get renamed (not sure why), a new one is added, etc. I added a new TAP test case, added the streaming=true option to the old stream_* ones, and incremented the streaming test numbers (+2) because of the collision between 009_matviews.pl / 009_stream_simple.pl and 010_truncate.pl / 010_stream_subxact.pl. At least in the previous version of the patch they were under the same numbers. Nothing special, but for simplicity please find my new TAP test attached separately. So I've instead enabled streaming subscriptions in all tests, which with this patch produces two failures: Test Summary Report --- t/004_sync.pl (Wstat: 7424 Tests: 1 Failed: 0) Non-zero exit status: 29 Parse errors: Bad plan. You planned 7 tests but ran 1. t/011_stream_ddl.pl (Wstat: 256 Tests: 2 Failed: 1) Failed test: 2 Non-zero exit status: 1 So yeah, there's more stuff to fix. But I can't directly apply your fixes because the updated patches are somewhat different. The fixes should apply cleanly to the previous version of your patch. Also, I am not sure that it is a good idea to simply enable streaming subscriptions in all tests (e.g. the pre-streaming-patch t/004_sync.pl), since then they do not exercise the non-streaming code. Interesting. Any idea where the extra overhead in this particular case comes from? It's hard to deduce that from the single flame graph, when I don't have anything to compare it with (i.e. the flame graph for the "normal" case). I guess the bottleneck is disk operations. You can check the logical_repl_worker_new_perf.svg flame graph: disk reads (~9%) and writes (~26%) take around 35% of CPU time in total. For comparison, please see the attached flame graph for the following transaction: INSERT INTO large_text SELECT (SELECT string_agg('x', ',') FROM generate_series(1, 2000)) FROM generate_series(1, 100); Execution Time: 44519.816 ms Time: 98333,642 ms (01:38,334) where disk IO is only ~7-8% in total.
So we get very roughly the same ~x4-5 performance drop here. JFYI, I am using a machine with SSD for the tests. Therefore, it may help to write changes on the receiver in bigger chunks, rather than each change separately. I/O is certainly a possible culprit, although we should be using buffered I/O, and there certainly are not any fsyncs here. So I'm not sure why it would be cheaper to do the writes in batches. BTW does this mean you see the overhead on the apply side? Or are you running this on a single machine, and it's difficult to decide? I run this on a single machine, but the walsender and worker each utilize almost 100% of a CPU all the time, and on the apply side I/O syscalls take about 1/3 of CPU time. I am still not sure, but to me this result links the performance drop to problems on the receiver side. Writing in batches was just a hypothesis; to validate it, I performed a test with a large transaction consisting of a smaller number of wide rows. This test was streamed too, but did not exhibit any significant performance drop, so the hypothesis seems valid. Anyway, I do not have other reasonable ideas besides that right now. Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company 0xx_stream_tough_ddl.pl Description: Perl program
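The write-batching hypothesis discussed above can be illustrated with a toy shell sketch. This is not the apply-worker code path — just a demonstration, under made-up file names, that writing the same data as many tiny write syscalls versus one buffered write yields identical bytes, while the per-change variant pays a syscall per write:

```shell
# Toy sketch of the write-batching hypothesis. Not PostgreSQL code --
# just an illustration that a "per-change" stream of 1-byte writes and
# a single "batched" 4096-byte write produce the same file contents,
# while the former issues far more syscalls.
tmp=$(mktemp -d)

# "Per-change": 4096 writes of 1 byte each (one syscall per byte).
dd if=/dev/zero of="$tmp/per_change" bs=1 count=4096 2>/dev/null

# "Batched": a single 4096-byte write.
dd if=/dev/zero of="$tmp/batched" bs=4096 count=1 2>/dev/null

if cmp -s "$tmp/per_change" "$tmp/batched"; then
    result=identical
else
    result=different
fi
echo "$result"
rm -r "$tmp"
```

Timing the two dd invocations (e.g. with `time`) makes the per-syscall overhead visible even without fsyncs, which is the effect being hypothesized for the apply side.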
Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly
Hi Hackers, I would like to propose a change which allows CLUSTER, VACUUM FULL and REINDEX to change the relation tablespace on the fly. All of these commands rebuild relation filenodes from scratch, so it seems natural to allow specifying a new location for them. It may be helpful when a server runs out of disk space: you can attach a new partition and perform e.g. VACUUM FULL, which will free some space and move data to the new location at the same time. Otherwise, you cannot complete VACUUM FULL unless you have up to 2x the relation's disk space free on a single partition. Please find attached a patch which extends CLUSTER, VACUUM FULL and REINDEX with additional options: REINDEX [ ( VERBOSE ) ] { INDEX | TABLE } name [ SET TABLESPACE new_tablespace ] CLUSTER [VERBOSE] table_name [ USING index_name ] [ SET TABLESPACE new_tablespace ] CLUSTER [VERBOSE] [ SET TABLESPACE new_tablespace ] VACUUM ( FULL [, ...] ) [ SET TABLESPACE new_tablespace ] [ table_and_columns [, ...] ] VACUUM FULL [ FREEZE ] [ VERBOSE ] [ ANALYZE ] [ SET TABLESPACE new_tablespace ] [ table_and_columns [, ...] ] Thereby I have a few questions: 1) What do you think about this concept in general? 2) Is SET TABLESPACE an appropriate syntax for this functionality? I also considered a plain TABLESPACE keyword, which seems misleading, and a WITH (options) clause as in CREATE SUBSCRIPTION ... WITH (options). I preferred SET TABLESPACE, since the same syntax is currently used in ALTER to change the tablespace, but maybe someone will have a better idea. 3) I was not able to update the grammar for VACUUM FULL to accept SET TABLESPACE after table_and_columns and completely get rid of shift/reduce conflicts. I guess this happens because table_and_columns is optional and may be of variable length, but I have no idea how to deal with it. Any thoughts?
Regards -- Alexey Kondratov Postgres Professionalhttps://www.postgrespro.com Russian Postgres Company >From 0d971ce85f62baca7f6f713fa75a1bc20e09b3a2 Mon Sep 17 00:00:00 2001 From: Alexey Kondratov Date: Fri, 21 Dec 2018 14:54:10 +0300 Subject: [PATCH] Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace. --- doc/src/sgml/ref/cluster.sgml | 13 ++- doc/src/sgml/ref/reindex.sgml | 10 ++ doc/src/sgml/ref/vacuum.sgml | 12 ++ src/backend/catalog/index.c | 128 ++ src/backend/commands/cluster.c| 26 +++-- src/backend/commands/indexcmds.c | 23 +++- src/backend/commands/tablecmds.c | 59 +- src/backend/commands/vacuum.c | 39 ++- src/backend/parser/gram.y | 62 +-- src/backend/tcop/utility.c| 16 ++- src/include/catalog/index.h | 4 +- src/include/commands/cluster.h| 2 +- src/include/commands/defrem.h | 6 +- src/include/commands/tablecmds.h | 2 + src/include/commands/vacuum.h | 2 + src/include/nodes/parsenodes.h| 3 + src/test/regress/input/tablespace.source | 43 src/test/regress/output/tablespace.source | 57 ++ 18 files changed, 424 insertions(+), 83 deletions(-) diff --git a/doc/src/sgml/ref/cluster.sgml b/doc/src/sgml/ref/cluster.sgml index 4da60d8d56..6e61587809 100644 --- a/doc/src/sgml/ref/cluster.sgml +++ b/doc/src/sgml/ref/cluster.sgml @@ -21,8 +21,8 @@ PostgreSQL documentation -CLUSTER [VERBOSE] table_name [ USING index_name ] -CLUSTER [VERBOSE] +CLUSTER [VERBOSE] table_name [ USING index_name ] [ SET TABLESPACE new_tablespace ] +CLUSTER [VERBOSE] [ SET TABLESPACE new_tablespace ] @@ -99,6 +99,15 @@ CLUSTER [VERBOSE] + +new_tablespace + + + The name of the specific tablespace to store clustered relations. 
+ + + + VERBOSE diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml index 47cef987d4..661820c1e2 100644 --- a/doc/src/sgml/ref/reindex.sgml +++ b/doc/src/sgml/ref/reindex.sgml @@ -22,6 +22,7 @@ PostgreSQL documentation REINDEX [ ( VERBOSE ) ] { INDEX | TABLE | SCHEMA | DATABASE | SYSTEM } name +REINDEX [ ( VERBOSE ) ] { INDEX | TABLE } name [ SET TABLESPACE new_tablespace ] @@ -151,6 +152,15 @@ REINDEX [ ( VERBOSE ) ] { INDEX | TABLE | SCHEMA | DATABASE | SYSTEM } + +new_tablespace + + + The name of the specific tablespace to store rebuilt indexes. + + + + VERBOSE diff --git a/doc/src/sgml/ref/vacuum.sgml b/doc/src/sgml/ref/vacuum.sgml index fd911f5776..b4e3c59e1f 100644 --- a/doc/src/sgml/ref/vacuum.sgml +++ b/doc/src/sgml/ref/vacuum.sgml @@ -23,6 +23,8 @@ PostgreSQL documentation VACUUM [ ( option [, ...] ) ] [ table_and_columns [, ...] ] VACUUM [ FULL ] [ FREEZE ] [ VERBOSE ] [ ANALYZE ] [ table_and_columns [, ...] ] +VACUUM ( FULL [, ...] ) [ SET TABLESPACE new_tablespace ] [ table_and_colu
Re: [Patch] pg_rewind: options to use restore_command from recovery.conf or command line
Hi Dmitry, On 30.11.2018 19:04, Dmitry Dolgov wrote: Just to confirm, the patch can still be applied without conflicts and passes all the tests. Also, I like the original motivation for the feature; it sounds pretty useful. For now I'm moving it to the next CF. Thanks! Although note that I have slightly updated the patch to handle the recent merge of recovery.conf into GUCs and postgresql.conf [1]; the new patch is attached. - Reusing the GUC parser is something I would avoid as well. Not worth the complexity. Yes, I don't like it either. I will try to make guc-file.l frontend safe. Any success with that? I looked into it and found that currently guc-file.c is built as part of guc.c, so it seems even more complicated to unbind guc-file.c from the backend. Thus, my plan for proceeding with the patch is: 1) Add guc-file.h and build guc-file.c separately from guc.c 2) Put guc-file.l / guc-file.h into common/* 3) Isolate all backend-specific calls in guc-file.l with #ifdef FRONTEND I am not sure whether this work is worth doing, though, compared with the redundancy of simply adding a frontend-safe copy of the guc-file.l lexer. If anyone has thoughts on this, I would be glad to receive comments. [1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=2dedf4d9a899b36d1a8ed29be5efbd1b31a8fe85 Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company >From 521f62872d4e95cd02ddb535b8320256ff5e90cc Mon Sep 17 00:00:00 2001 From: Alexey Kondratov Date: Fri, 21 Dec 2018 14:00:30 +0300 Subject: [PATCH] pg_rewind: options to use restore_command from postgresql.conf or command line.
--- src/bin/pg_rewind/Makefile| 5 +- src/bin/pg_rewind/RewindTest.pm | 46 +- src/bin/pg_rewind/guc-file-fe.h | 40 ++ src/bin/pg_rewind/guc-file-fe.l | 776 ++ src/bin/pg_rewind/parsexlog.c | 182 +- src/bin/pg_rewind/pg_rewind.c | 91 ++- src/bin/pg_rewind/pg_rewind.h | 10 +- src/bin/pg_rewind/t/001_basic.pl | 3 +- src/bin/pg_rewind/t/002_databases.pl | 3 +- src/bin/pg_rewind/t/003_extrafiles.pl | 3 +- src/tools/msvc/Mkvcbuild.pm | 1 + 11 files changed, 1141 insertions(+), 19 deletions(-) create mode 100644 src/bin/pg_rewind/guc-file-fe.h create mode 100644 src/bin/pg_rewind/guc-file-fe.l diff --git a/src/bin/pg_rewind/Makefile b/src/bin/pg_rewind/Makefile index 2bcfcc61af..a0f5f97544 100644 --- a/src/bin/pg_rewind/Makefile +++ b/src/bin/pg_rewind/Makefile @@ -15,11 +15,12 @@ subdir = src/bin/pg_rewind top_builddir = ../../.. include $(top_builddir)/src/Makefile.global -override CPPFLAGS := -I$(libpq_srcdir) -DFRONTEND $(CPPFLAGS) +override CPPFLAGS := -I. -I$(srcdir) -I$(libpq_srcdir) -DFRONTEND $(CPPFLAGS) LDFLAGS_INTERNAL += $(libpq_pgport) OBJS = pg_rewind.o parsexlog.o xlogreader.o datapagemap.o timeline.o \ fetch.o file_ops.o copy_fetch.o libpq_fetch.o filemap.o logging.o \ + guc-file-fe.o \ $(WIN32RES) EXTRA_CLEAN = xlogreader.c @@ -32,6 +33,8 @@ pg_rewind: $(OBJS) | submake-libpq submake-libpgport xlogreader.c: % : $(top_srcdir)/src/backend/access/transam/% rm -f $@ && $(LN_S) $< . 
+distprep: guc-file-fe.c + install: all installdirs $(INSTALL_PROGRAM) pg_rewind$(X) '$(DESTDIR)$(bindir)/pg_rewind$(X)' diff --git a/src/bin/pg_rewind/RewindTest.pm b/src/bin/pg_rewind/RewindTest.pm index 3d07da5d94..b43c18a8c3 100644 --- a/src/bin/pg_rewind/RewindTest.pm +++ b/src/bin/pg_rewind/RewindTest.pm @@ -39,7 +39,9 @@ use Carp; use Config; use Exporter 'import'; use File::Copy; -use File::Path qw(rmtree); +use File::Glob ':bsd_glob'; +use File::Path qw(remove_tree make_path); +use File::Spec::Functions 'catpath'; use IPC::Run qw(run); use PostgresNode; use TestLib; @@ -250,6 +252,48 @@ sub run_pg_rewind ], 'pg_rewind remote'); } + elsif ($test_mode eq "archive") + { + + # Do rewind using a local pgdata as source and + # specified directory with target WALs archive. + my $wals_archive_dir = catpath(${TestLib::tmp_check}, 'master_wals_archive'); + my $test_master_datadir = $node_master->data_dir; + my @wal_files = bsd_glob catpath($test_master_datadir, 'pg_wal', '000*'); + my $restore_command; + + remove_tree($wals_archive_dir); + make_path($wals_archive_dir) or die; + + # Move all old master WAL files to the archive. + # Old master should be stopped at this point. + foreach my $wal_file (@wal_files) + { + move($wal_file, "$wals_archive_dir/") or die; + } + + if ($windows_os) + { + $restore_command = "copy $wals_archive_dir\\\%f \%p"; + } + else + { + $restore_command = "cp $wals_archive_dir/\%f \%p"; + } + + # Stop the new master and be ready to perform the rewind. + $node_standby->stop; + command_ok( + [ +'pg_rewind', +
Re: [Patch] pg_rewind: options to use restore_command from recovery.conf or command line
Greetings, - Reusing the GUC parser is something I would avoid as well. Not worth the complexity. Yes, I don't like it either. I will try to make guc-file.l frontend safe. Any success with that? I looked into it and found that currently guc-file.c is built as part of guc.c, so it seems even more complicated to unbind guc-file.c from the backend. Thus, my plan for proceeding with the patch is: 1) Add guc-file.h and build guc-file.c separately from guc.c 2) Put guc-file.l / guc-file.h into common/* 3) Isolate all backend-specific calls in guc-file.l with #ifdef FRONTEND I am not sure whether this work is worth doing, though, compared with the redundancy of simply adding a frontend-safe copy of the guc-file.l lexer. If anyone has thoughts on this, I would be glad to receive comments. I have finally worked it out. Now there is a common version of guc-file.l, and guc-file.c is built separately from guc.c. I had to use a limited number of #ifndef FRONTEND blocks, mostly to replace ereport calls. Also, ProcessConfigFile and ProcessConfigFileInternal have been moved into guc.c explicitly, as they are backend-specific. To me this solution looks much more concise and neat. Please find the new version of the patch attached. The TAP tests have been updated as well to handle restore_command specified either on the command line or in postgresql.conf. Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company >From 8a6c9f89f45c9568d95e05b0586d1cc54905e6de Mon Sep 17 00:00:00 2001 From: Alexey Kondratov Date: Fri, 21 Dec 2018 14:00:30 +0300 Subject: [PATCH] pg_rewind: options to use restore_command from postgresql.conf or command line.
--- src/backend/Makefile | 4 +- src/backend/commands/extension.c | 1 + src/backend/utils/misc/Makefile | 8 - src/backend/utils/misc/guc.c | 434 +-- src/bin/pg_rewind/Makefile| 2 +- src/bin/pg_rewind/RewindTest.pm | 96 +++- src/bin/pg_rewind/parsexlog.c | 182 ++- src/bin/pg_rewind/pg_rewind.c | 91 +++- src/bin/pg_rewind/pg_rewind.h | 12 +- src/bin/pg_rewind/t/001_basic.pl | 4 +- src/bin/pg_rewind/t/002_databases.pl | 4 +- src/bin/pg_rewind/t/003_extrafiles.pl | 4 +- src/common/Makefile | 7 +- src/{backend/utils/misc => common}/guc-file.l | 514 -- src/include/common/guc-file.h | 50 ++ src/include/utils/guc.h | 39 +- src/tools/msvc/Mkvcbuild.pm | 2 +- src/tools/msvc/clean.bat | 2 +- 18 files changed, 952 insertions(+), 504 deletions(-) rename src/{backend/utils/misc => common}/guc-file.l (60%) create mode 100644 src/include/common/guc-file.h diff --git a/src/backend/Makefile b/src/backend/Makefile index 25eb043941..ddbe2f3fce 100644 --- a/src/backend/Makefile +++ b/src/backend/Makefile @@ -186,7 +186,7 @@ distprep: $(MAKE) -C replication repl_gram.c repl_scanner.c syncrep_gram.c syncrep_scanner.c $(MAKE) -C storage/lmgr lwlocknames.h lwlocknames.c $(MAKE) -C utils distprep - $(MAKE) -C utils/misc guc-file.c + $(MAKE) -C common guc-file.c $(MAKE) -C utils/sort qsort_tuple.c @@ -307,7 +307,7 @@ maintainer-clean: distclean replication/syncrep_scanner.c \ storage/lmgr/lwlocknames.c \ storage/lmgr/lwlocknames.h \ - utils/misc/guc-file.c \ + common/guc-file.c \ utils/sort/qsort_tuple.c diff --git a/src/backend/commands/extension.c b/src/backend/commands/extension.c index 31dcfe7b11..ec0367d068 100644 --- a/src/backend/commands/extension.c +++ b/src/backend/commands/extension.c @@ -47,6 +47,7 @@ #include "commands/defrem.h" #include "commands/extension.h" #include "commands/schemacmds.h" +#include "common/guc-file.h" #include "funcapi.h" #include "mb/pg_wchar.h" #include "miscadmin.h" diff --git a/src/backend/utils/misc/Makefile b/src/backend/utils/misc/Makefile index 
a53fcdf188..2e6a879c46 100644 --- a/src/backend/utils/misc/Makefile +++ b/src/backend/utils/misc/Makefile @@ -25,11 +25,3 @@ override CPPFLAGS += -DPG_KRB_SRVTAB='"$(krb_srvtab)"' endif include $(top_srcdir)/src/backend/common.mk - -# guc-file is compiled as part of guc -guc.o: guc-file.c - -# Note: guc-file.c is not deleted by 'make clean', -# since we want to ship it in distribution tarballs. -clean: - @rm -f lex.yy.c diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index 6fe1939881..a866503186 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -41,6 +41,7 @@ #include "commands/vacuum.h" #include "commands/variable.h" #include "commands/trigg
Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly
Hi, Thank you all for the replies. ALTER TABLE already has a lot of logic that is oriented towards being able to do multiple things at the same time. If we added CLUSTER, VACUUM FULL, and REINDEX to that set, then you could, say, change a data type, cluster, and change tablespaces all in a single SQL command. That's a great observation. Indeed, I thought that ALTER TABLE executes all actions sequentially one by one, e.g. in the case of ALTER TABLE test_int CLUSTER ON test_int_idx, SET TABLESPACE test_tblspc; it executes CLUSTER and THEN executes SET TABLESPACE. However, if I understand correctly, ALTER TABLE is rather smart, so in such a case it follows these steps: 1) It only saves the new tablespace Oid during prepare phase 1, without doing actual work; 2) It only executes mark_index_clustered during phase 2, again without actual work done; 3) It finally rewrites the relation during phase 3, where CLUSTER and SET TABLESPACE are effectively performed. That would be cool, but probably a lot of work. :-( But is it? ALTER TABLE is already doing one kind of table rewrite during phase 3, and CLUSTER is just a different kind of table rewrite (which happens to REINDEX), and VACUUM FULL is just a special case of CLUSTER. Maybe what we need is an ALTER TABLE variant that executes CLUSTER's table rewrite during phase 3 instead of its ad-hoc table rewrite. According to the ALTER TABLE example above, this already exists for CLUSTER. As for REINDEX, I think it's valuable to move the tablespace together with the reindexing. You can already do it with the CREATE INDEX CONCURRENTLY recipe we recommend, of course; but REINDEX CONCURRENTLY is not going to provide that, and it seems worth doing. Maybe I am missing something, but according to the docs REINDEX CONCURRENTLY does not exist yet; DROP then CREATE CONCURRENTLY is suggested instead. Thus, we would have to add REINDEX CONCURRENTLY first, but that is a matter for a different patch, I guess. Even for plain REINDEX that seems useful.
-- Michael To summarize: 1) Alvaro and Michael agreed that REINDEX with a tablespace move may be useful. This is done in the patch attached to my initial email. Adding REINDEX to ALTER TABLE as a new action seems quite questionable to me and not entirely semantically correct; ALTER TABLE already looks bulky. 2) If I am correct, 'ALTER TABLE ... CLUSTER ON ..., SET TABLESPACE ...' does exactly what I wanted to add to CLUSTER in my patch, so probably no work is necessary here. 3) VACUUM FULL. It seems that we can add a special case 'ALTER TABLE ... VACUUM FULL, SET TABLESPACE ...', which will follow roughly the same path as CLUSTER ON, but without any specific index. The relation should be rewritten in the new tablespace during phase 3. What do you think? Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company
[Patch] pg_rewind: options to use restore_command from recovery.conf or command line
Hi hackers, Currently Postgres has options for continuous WAL file archiving, which is quite often used along with a master-replica setup. Suppose the worst has happened, and it is time to get your old master back and synchronize it with the new master (ex-replica) using pg_rewind. However, the required WAL files may already be archived, and pg_rewind will fail. You can copy these files back manually, but it is difficult to figure out which ones you need. Either way, it complicates building a failover system with automatic failure recovery. I think it would be a good idea to allow pg_rewind to look for restore_command in the target data directory's recovery.conf, or to accept it as a command line argument. Then pg_rewind can use it to get missing WAL files from the archive. I have had a few talks with DBAs and came to the conclusion that this is a highly requested feature. I prepared a proof-of-concept patch (please find it attached), which does exactly what I described above. I played with it a little and it seems to be working; the tests were updated accordingly to verify this archive retrieval functionality too. The patch is relatively simple except for one part: if we want to parse recovery.conf (with all possible includes, etc.) and get restore_command, then we should use the guc-file.l parser, which is heavily tied to the backend, e.g. in the error reporting parts. So I copied it and made a frontend-safe version, guc-file-fe.l. Personally, I don't think it's a good idea, but nothing else came to mind. It is also possible to keep only one option -- passing restore_command as a command line argument. What do you think?
-- Alexey Kondratov Postgres Professional: https://www.postgrespro.com Russian Postgres Company diff --combined src/bin/pg_rewind/Makefile index a22fef1352,2bcfcc61af..00 --- a/src/bin/pg_rewind/Makefile +++ b/src/bin/pg_rewind/Makefile @@@ -20,7 -20,6 +20,7 @@@ LDFLAGS_INTERNAL += $(libpq_pgport OBJS = pg_rewind.o parsexlog.o xlogreader.o datapagemap.o timeline.o \ fetch.o file_ops.o copy_fetch.o libpq_fetch.o filemap.o logging.o \ + guc-file-fe.o \ $(WIN32RES) EXTRA_CLEAN = xlogreader.c diff --combined src/bin/pg_rewind/RewindTest.pm index 8dc39dbc05,1dce56d035..00 --- a/src/bin/pg_rewind/RewindTest.pm +++ b/src/bin/pg_rewind/RewindTest.pm @@@ -40,7 -40,6 +40,7 @@@ use Config use Exporter 'import'; use File::Copy; use File::Path qw(rmtree); +use File::Glob; use IPC::Run qw(run); use PostgresNode; use TestLib; @@@ -249,41 -248,6 +249,41 @@@ sub run_pg_rewin "--no-sync" ], 'pg_rewind remote'); + } + elsif ($test_mode eq "archive") + { + + # Do rewind using a local pgdata as source and + # specified directory with target WALs archive. + my $wals_archive_dir = "${TestLib::tmp_check}/master_wals_archive"; + my $test_master_datadir = $node_master->data_dir; + my @wal_files = glob "$test_master_datadir/pg_wal/000*"; + my $restore_command; + + rmtree($wals_archive_dir); + mkdir($wals_archive_dir) or die; + + # Move all old master WAL files to the archive. + # Old master should be stopped at this point. + foreach my $wal_file (@wal_files) + { + move($wal_file, "$wals_archive_dir/") or die; + } + + $restore_command = "cp $wals_archive_dir/\%f \%p"; + + # Stop the new master and be ready to perform the rewind. 
+ $node_standby->stop; + command_ok( + [ +'pg_rewind', +"--debug", +"--source-pgdata=$standby_pgdata", +"--target-pgdata=$master_pgdata", +"--no-sync", +"-R", $restore_command + ], + 'pg_rewind archive'); } else { diff --combined src/bin/pg_rewind/parsexlog.c index 11a9c26cd2,40028471bf..00 --- a/src/bin/pg_rewind/parsexlog.c +++ b/src/bin/pg_rewind/parsexlog.c @@@ -12,7 -12,6 +12,7 @@@ #include "postgres_fe.h" #include +#include #include "pg_rewind.h" #include "filemap.h" @@@ -46,10 -45,7 +46,10 @@@ static char xlogfpath[MAXPGPATH] typedef struct XLogPageReadPrivate { const char *datadir; + const char *restoreCommand; int tliIndex; + XLogRecPtr oldrecptr; + TimeLineID oldtli; } XLogPageReadPrivate; static int SimpleXLogPageRead(XLogReaderState *xlogreader, @@@ -57,10 -53,6 +57,10 @@@ int reqLen, XLogRecPtr targetRecPtr, char *readBuf, TimeLineID *pageTLI); +static bool RestoreArchivedWAL(const char *path, const char *xlogfname, + off_t expectedSize, const char *restoreCommand, + const char *lastRestartPointFname); + /* * Read WAL from the datadir/pg_wal, starting from 'startpoint' on timeline * index 'tliIndex' in target timeline history, until 'endpoint'. Make note of @@@ -68,19 -60,15 +68,19 @@@ */ void extract
Re: [Patch] pg_rewind: options to use restore_command from recovery.conf or command line
Hi Andrey, Thank you for your reply. I think it is better to load restore_command from recovery.conf. Yes, it seems to be the most natural way. That's why I needed this rewritten (mostly copy-pasted) frontend-safe version of the parser (guc-file.l). I didn't actually try the patch yet, but the idea seems interesting. Will you add it to the commitfest? I am willing to add it to the November commitfest, but I have some concerns regarding the frontend version of the GUC parser. It is probably possible to refactor guc-file.l so that it can be used on both the frontend and backend; however, that requires #ifdefs and mocking up ereport for the frontend, which is a bit ugly. -- Alexey Kondratov Postgres Professional: https://www.postgrespro.com Russian Postgres Company
Re: [Patch] pg_rewind: options to use restore_command from recovery.conf or command line
On 22.10.2018 20:19, Alvaro Herrera wrote: I didn't actually try the patch yet, but the idea seems interesting. Will you add it to the commitfest? I am willing to add it to the November commitfest, but I have some concerns regarding the frontend version of the GUC parser. It is probably possible to refactor guc-file.l so that it can be used on both the frontend and backend; however, that requires #ifdefs and mocking up ereport for the frontend, which is a bit ugly. Hmm, I remember we had a project to have a new postmaster option that would report the value of some GUC option, so instead of parsing the file in the frontend, you'd invoke the backend to do the parsing. But I don't know what became of that ... A brief search of the mailing list does not turn up anything relevant, but the project seems pretty straightforward at first sight. Of course, recovery.conf options are not GUCs either ... that's another pending patch. We do have some backend mock-ups for frontends, e.g. in pg_waldump; plus palloc is already implemented in libpgcommon. I don't know if what you need to compile the lexer is a project much bigger than finishing the other two patches I mention. This topic, in contrast, is long-lived; there are several threads going back to 2011. I have found Michael's, Simon's, and Fujii's patches and Greg Smith's proposal (see, e.g., [1, 2]). If I understand correctly, the main point is that if we turn all options in recovery.conf into GUCs, then it becomes possible to set them inside postgresql.conf and get rid of recovery.conf. However, that breaks backward compatibility and brings some other issues noted by Heikki (https://www.postgresql.org/message-id/5152f778.2070...@vmware.com), while keeping both options is redundant and ambiguous. Thus, though everyone agreed that the recovery.conf options should be turned into GUCs, there is still no consensus on the details.
I don't think I know the Postgres architecture well enough to restart this discussion, but thank you for pointing me in this direction; it was quite interesting from a historical perspective. I will check guc-file.l again; maybe it is not so painful to make it frontend-safe too. [1] https://www.postgresql.org/message-id/flat/CAHGQGwHi%3D4GV6neLRXF7rexTBkjhcAEqF9_xq%2BtRvFv2bVd59w%40mail.gmail.com [2] https://www.postgresql.org/message-id/flat/CA%2BU5nMKyuDxr0%3D5PSen1DZJndauNdz8BuSREau%3DScN-7DZ9acA%40mail.gmail.com -- Alexey Kondratov Postgres Professional: https://www.postgrespro.com Russian Postgres Company
Re: [Patch] pg_rewind: options to use restore_command from recovery.conf or command line
Hi Andrey, Will you add this patch to CF? I'm going to review it. Best regards, Andrey Borodin Here it is https://commitfest.postgresql.org/20/1849/ -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company
Re: [Patch] pg_rewind: options to use restore_command from recovery.conf or command line
Something that we could think about is directly providing a command to pg_rewind via the command line. In my patch I added this option too. One can pass restore_command via the -R option, e.g.: pg_rewind -P --target-pgdata=/path/to/master/pg_data --source-pgdata=/path/to/standby/pg_data -R 'cp /path/to/wals_archive/%f %p' Another possibility would be to have a separate tool which scans a data folder and fetches by itself a range of wanted WAL segments. Currently in the patch, with the dry-run option (-n), pg_rewind only fetches the missing WALs needed to build the file map, while not touching any data files. So I guess it behaves exactly as you described, and we do not need a separate tool. -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company
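For readers unfamiliar with the alias syntax, here is a minimal shell sketch of what a restore_command such as 'cp /path/to/wals_archive/%f %p' amounts to once the caller substitutes the aliases (%f: requested WAL file name, %p: destination path). All directories and the destination layout here are hypothetical stand-ins, not the exact paths pg_rewind uses:

```shell
# Minimal sketch of restore_command alias substitution. The directories
# are temporary stand-ins for the real archive and data directory, and
# the destination path is an illustrative guess, not pg_rewind's exact
# layout.
archive=$(mktemp -d)    # plays the role of /path/to/wals_archive
pgdata=$(mktemp -d)     # plays the role of the target data directory
mkdir -p "$pgdata/pg_wal"

wal=000000010000000000000003
printf 'fake wal segment' > "$archive/$wal"

# With restore_command = 'cp /path/to/wals_archive/%f %p', the caller
# expands the aliases before running the command:
f=$wal                      # %f -> WAL file name being requested
p=$pgdata/pg_wal/$wal       # %p -> path to restore the file to
cp "$archive/$f" "$p"

if [ -s "$p" ]; then restored=yes; else restored=no; fi
echo "restored: $restored"
```

The same substitution model explains why %r (the last restart-point file name, used for archive cleanup) also has to be handled if the command is taken from recovery.conf.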
Re: [Patch] pg_rewind: options to use restore_command from recovery.conf or command line
On 30.10.2018 06:01, Michael Paquier wrote: On Mon, Oct 29, 2018 at 12:09:21PM +0300, Alexey Kondratov wrote: Currently in the patch, with the dry-run option (-n), pg_rewind only fetches missing WALs to be able to build the file map and doesn't touch any data files. So I guess it behaves exactly as you described and we do not need a separate tool. Makes sense perhaps. Fetching only WAL segments which are needed for the file map is critical, as you don't want to spend bandwidth for nothing. Now, I look at your patch, and I can see things to complain about, at least three at a short glance: - The TAP test added will fail on Windows. Thank you for this. The build on Windows had been broken as well. I fixed it in the new version of the patch, please find it attached. - Simply copy-pasting RestoreArchivedWAL() from the backend code to pg_rewind is not an acceptable option. You don't care about %r either in this case. According to the docs [1], %r is a valid alias and may be used in restore_command too, so if we take restore_command from recovery.conf it might be there. If we just drop it, then restore_command may stop working. Though I do not know real-life examples of restore_command with %r, we should treat it in the expected way (as the backend does), of course if we want an option to take it from recovery.conf. - Reusing the GUC parser is something I would avoid as well. Not worth the complexity. Yes, I don't like it either. I will try to make guc-file.l frontend-safe. 
[1] https://www.postgresql.org/docs/11/archive-recovery-settings.html -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company diff --git a/src/bin/pg_rewind/Makefile b/src/bin/pg_rewind/Makefile index 2bcfcc61af2e..a22fef1352b9 100644 --- a/src/bin/pg_rewind/Makefile +++ b/src/bin/pg_rewind/Makefile @@ -20,6 +20,7 @@ LDFLAGS_INTERNAL += $(libpq_pgport) OBJS = pg_rewind.o parsexlog.o xlogreader.o datapagemap.o timeline.o \ fetch.o file_ops.o copy_fetch.o libpq_fetch.o filemap.o logging.o \ + guc-file-fe.o \ $(WIN32RES) EXTRA_CLEAN = xlogreader.c diff --git a/src/bin/pg_rewind/RewindTest.pm b/src/bin/pg_rewind/RewindTest.pm index 1dce56d0352e..a5499c7027b1 100644 --- a/src/bin/pg_rewind/RewindTest.pm +++ b/src/bin/pg_rewind/RewindTest.pm @@ -39,7 +39,9 @@ use Carp; use Config; use Exporter 'import'; use File::Copy; -use File::Path qw(rmtree); +use File::Glob ':bsd_glob'; +use File::Path qw(remove_tree make_path); +use File::Spec::Functions 'catpath'; use IPC::Run qw(run); use PostgresNode; use TestLib; @@ -249,6 +251,48 @@ sub run_pg_rewind ], 'pg_rewind remote'); } + elsif ($test_mode eq "archive") + { + + # Do rewind using a local pgdata as source and + # specified directory with target WALs archive. + my $wals_archive_dir = catpath(${TestLib::tmp_check}, 'master_wals_archive'); + my $test_master_datadir = $node_master->data_dir; + my @wal_files = bsd_glob catpath($test_master_datadir, 'pg_wal', '000*'); + my $restore_command; + + remove_tree($wals_archive_dir); + make_path($wals_archive_dir) or die; + + # Move all old master WAL files to the archive. + # Old master should be stopped at this point. + foreach my $wal_file (@wal_files) + { + move($wal_file, "$wals_archive_dir/") or die; + } + + if ($windows_os) + { + $restore_command = "copy $wals_archive_dir\\\%f \%p"; + } + else + { + $restore_command = "cp $wals_archive_dir/\%f \%p"; + } + + # Stop the new master and be ready to perform the rewind. 
+ $node_standby->stop; + command_ok( + [ +'pg_rewind', +"--debug", +"--source-pgdata=$standby_pgdata", +"--target-pgdata=$master_pgdata", +"--no-sync", +"-R", $restore_command + ], + 'pg_rewind archive'); + } else { diff --git a/src/bin/pg_rewind/guc-file-fe.h b/src/bin/pg_rewind/guc-file-fe.h new file mode 100644 index ..cf480b806ae5 --- /dev/null +++ b/src/bin/pg_rewind/guc-file-fe.h @@ -0,0 +1,40 @@ +#ifndef PG_REWIND_GUC_FILE_FE_H +#define PG_REWIND_GUC_FILE_FE_H + +#include "c.h" + +#define RECOVERY_COMMAND_FILE "recovery.conf" + +/* + * Parsing the configuration file(s) will return a list of name-value pairs + * with source location info. We also abuse this data structure to carry + * error reports about the config files. An entry reporting an error will + * have errmsg != NULL, and might have NULLs for name, value, and/or filename. + * + * If "ignore" is true, don't attempt to apply the item (it might be an error + * report, or an item we determined to be duplicate). "applied" is set true + * if we successfully applied, or could have applied, the setting. + */ +typedef struct ConfigVariable +{ + char *name; + char *value; + char
Re: [HACKERS] GSOC'17 project introduction: Parallel COPY execution with errors handling
On Fri, Dec 1, 2017 at 1:58 AM, Alvaro Herrera wrote: > On a *very* quick look, please use an enum to return from NextCopyFrom > rather than 'int'. The chunks that change bool to int are very > odd-looking. This would move the comment that explains the value from > copy.c to copy.h, obviously. Also, you seem to be using non-ASCII dashes > in the descriptions of those values; please don't. I will fix it, thank you. > > Or maybe I misunderstood the patch completely. > I hope so. Here are my thoughts on how it all works, please correct me where I am wrong: 1) First, I have simply changed the ereport level to WARNING for specific validations (extra or missing columns, etc.) if the IGNORE_ERRORS option is used. All these checks are inside NextCopyFrom. Thus, this patch performs here pretty much the same as before, except that it is possible to skip bad lines, and this part should be safe as well. 2) About PG_TRY/CATCH. I use it to catch only one specific function call inside NextCopyFrom, InputFunctionCall, which is used just to parse a datatype from the input string. I have no idea how WAL write or trigger errors could get here. All of this is done before actually forming a tuple, putting it into the heap, firing insert-related triggers, etc. I am not trying to catch all errors during the row processing, only input data errors. So why is it unsafe? Best, Alexey
Re: [Patch] pg_rewind: options to use restore_command from recovery.conf or command line
Hi Thomas, On 01.07.2019 15:02, Thomas Munro wrote: Hi Alexey, This no longer applies. Since the Commitfest is starting now, could you please rebase it? Thank you for the reminder. A rebased version of the patch is attached. I've also modified my logging code in order to obey the new unified logging system for command-line programs committed by Peter (cc8d415117). Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company >From f5f359274322020c2338b5b494f6327eaa61c0e1 Mon Sep 17 00:00:00 2001 From: Alexey Kondratov Date: Tue, 19 Feb 2019 19:14:53 +0300 Subject: [PATCH v7] pg_rewind: options to use restore_command from command line or cluster config Previously, when pg_rewind could not find required WAL files in the target data directory the rewind process would fail. One had to manually figure out which of required WAL files have already moved to the archival storage and copy them back. This patch adds possibility to specify restore_command via command line option or use one specified inside postgresql.conf. Specified restore_command will be used for automatic retrieval of missing WAL files from archival storage. --- doc/src/sgml/ref/pg_rewind.sgml | 30 - src/bin/pg_rewind/parsexlog.c | 164 +- src/bin/pg_rewind/pg_rewind.c | 92 ++- src/bin/pg_rewind/pg_rewind.h | 6 +- src/bin/pg_rewind/t/001_basic.pl | 4 +- src/bin/pg_rewind/t/002_databases.pl | 4 +- src/bin/pg_rewind/t/003_extrafiles.pl | 4 +- src/bin/pg_rewind/t/RewindTest.pm | 84 - 8 files changed, 371 insertions(+), 17 deletions(-) diff --git a/doc/src/sgml/ref/pg_rewind.sgml b/doc/src/sgml/ref/pg_rewind.sgml index 4d91eeb0ff..746c07e4df 100644 --- a/doc/src/sgml/ref/pg_rewind.sgml +++ b/doc/src/sgml/ref/pg_rewind.sgml @@ -67,8 +67,10 @@ PostgreSQL documentation ancestor. 
In the typical failover scenario where the target cluster was shut down soon after the divergence, this is not a problem, but if the target cluster ran for a long time after the divergence, the old WAL - files might no longer be present. In that case, they can be manually - copied from the WAL archive to the pg_wal directory, or + files might no longer be present. In that case, they can be automatically + copied by pg_rewind from the WAL archive to the + pg_wal directory if either -r or + -R option is specified, or fetched on startup by configuring or . The use of pg_rewind is not limited to failover, e.g. a standby @@ -202,6 +204,30 @@ PostgreSQL documentation + + -r + --use-postgresql-conf + + +Use restore_command in the postgresql.conf to +retrieve missing in the target pg_wal directory +WAL files from the WAL archive. + + + + + + -R restore_command + --restore-command=restore_command + + +Specifies the restore_command to use for retrieval of the missing +in the target pg_wal directory WAL files from +the WAL archive. + + + + --debug diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c index 287af60c4e..d1de08320c 100644 --- a/src/bin/pg_rewind/parsexlog.c +++ b/src/bin/pg_rewind/parsexlog.c @@ -12,6 +12,7 @@ #include "postgres_fe.h" #include +#include #include "pg_rewind.h" #include "filemap.h" @@ -44,6 +45,7 @@ static char xlogfpath[MAXPGPATH]; typedef struct XLogPageReadPrivate { const char *datadir; + const char *restoreCommand; int tliIndex; } XLogPageReadPrivate; @@ -52,6 +54,9 @@ static int SimpleXLogPageRead(XLogReaderState *xlogreader, int reqLen, XLogRecPtr targetRecPtr, char *readBuf, TimeLineID *pageTLI); +static int RestoreArchivedWAL(const char *path, const char *xlogfname, + off_t expectedSize, const char *restoreCommand); + /* * Read WAL from the datadir/pg_wal, starting from 'startpoint' on timeline * index 'tliIndex' in target timeline history, until 'endpoint'. 
Make note of @@ -59,7 +64,7 @@ static int SimpleXLogPageRead(XLogReaderState *xlogreader, */ void extractPageMap(const char *datadir, XLogRecPtr startpoint, int tliIndex, - XLogRecPtr endpoint) + XLogRecPtr endpoint, const char *restore_command) { XLogRecord *record; XLogReaderState *xlogreader; @@ -68,6 +73,7 @@ extractPageMap(const char *datadir, XLogRecPtr startpoint, int tliIndex, private.datadir = datadir; private.tliIndex = tliIndex; + private.restoreCommand = restore_command; xlogreader = XLogReaderAllocate(WalSegSz, &SimpleXLogPageRead, &private); if (xlogreader == NULL) @@ -155,7 +161,7 @@ readOneRecord(const char *datadir, XLogRecPtr ptr, int tliIndex) void findLastCheckpoint(const char *datadir, XLogRecPtr forkp
Fix two issues after moving to unified logging system for command-line utils
Hi hackers, I have found two minor issues with the unified logging system for command-line programs (committed by Peter, cc8d415117) while rebasing my pg_rewind patch: 1) a forgotten new-line symbol in a pg_fatal call inside pg_rewind, which will cause the following Assert in common/logging.c to fire: Assert(fmt[strlen(fmt) - 1] != '\n'); It seems not to be a problem for a production Postgres installation without asserts, but it should be removed for sanity. 2) swapped progname <-> full_path in initdb.c's setup_bin_paths call [1], while the logging message remained the same. So the output will be rather misleading, since pg_ctl and pg_dumpall use the previous order. Attached is a small patch that fixes these issues. [1] https://github.com/postgres/postgres/commit/cc8d41511721d25d557fc02a46c053c0a602fed0#diff-c4414062a0071ec15df504d39a6df705R2500 Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company >From 2ea4a17ecc8f9bd57bb676f684fb729279339534 Mon Sep 17 00:00:00 2001 From: Alexey Kondratov Date: Mon, 1 Jul 2019 18:11:25 +0300 Subject: [PATCH v1] Fix usage of unified logging pg_log_* in pg_rewind and initdb --- src/bin/initdb/initdb.c | 2 +- src/bin/pg_rewind/pg_rewind.c | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c index 2ef179165b..70273be783 100644 --- a/src/bin/initdb/initdb.c +++ b/src/bin/initdb/initdb.c @@ -2497,7 +2497,7 @@ setup_bin_paths(const char *argv0) pg_log_error("The program \"postgres\" is needed by %s but was not found in the\n" "same directory as \"%s\".\n" "Check your installation.", - full_path, progname); + progname, full_path); else pg_log_error("The program \"postgres\" was found by \"%s\"\n" "but was not the same version as %s.\n" diff --git a/src/bin/pg_rewind/pg_rewind.c b/src/bin/pg_rewind/pg_rewind.c index 6e77201be6..d378053de4 100644 --- a/src/bin/pg_rewind/pg_rewind.c +++ b/src/bin/pg_rewind/pg_rewind.c @@ -555,7 
+555,7 @@ getTimelineHistory(ControlFileData *controlFile, int *nentries) else if (controlFile == &ControlFile_target) histfile = slurpFile(datadir_target, path, NULL); else - pg_fatal("invalid control file\n"); + pg_fatal("invalid control file"); history = rewind_parseTimeLineHistory(histfile, tli, nentries); pg_free(histfile); base-commit: 95bbe5d82e428db342fa3ec60b95f1b9873741e5 -- 2.17.1
Re: Conflict handling for COPY FROM
On 28.06.2019 16:12, Alvaro Herrera wrote: On Wed, Feb 20, 2019 at 7:04 PM Andres Freund wrote: Or even just return it as a row. CopyBoth is relatively widely supported these days. I think generating a warning about it also sufficiently meets its purpose of notifying the user about skipped records with the existing logging facility, and we use it for a similar purpose in other places too. The difference I see is the number of warnings that can be generated. Warnings seem useless for this purpose. I'm with Andres: returning rows would make this a fine feature. If the user wants the rows in a table as Andrew suggests, she can wrap the whole thing in an insert. I agree with the previous commentators that returning rows will make this feature more versatile. Though having the possibility to simply skip conflicting/malformed rows is worth doing, from my perspective. However, pushing every single skipped row to the client as a separate WARNING will be too much for a bulk import. So maybe just overall stats about the number of skipped rows will be enough? Also, I would prefer having an option to ignore all errors, e.g. with the option ERROR_LIMIT set to -1, because it is rather difficult to estimate the number of future errors if you are playing with some badly structured data, while always setting it to some huge value looks ugly. Anyway, below are some issues with the existing code after a brief review of the patch: 1) Calculation of processed rows isn't correct (I've checked). You do it in two places, and - processed++; + if (!cstate->error_limit) + processed++; is never incremented if ERROR_LIMIT is specified and no errors occurred/no constraints exist, so the result will always be 0. However, if a primary column with constraints exists, then processed is calculated correctly, since another code path is used: + if (specConflict) + { + ... + } + else + processed++; I would prefer this calculation in a single place (as it was before the patch) for simplicity and in order to avoid such problems. 
2) This ExecInsertIndexTuples call is only executed now if ERROR_LIMIT is specified and was exceeded, which doesn't seem to be correct, does it? - if (resultRelInfo->ri_NumIndices > 0) + if (resultRelInfo->ri_NumIndices > 0 && cstate->error_limit == 0) recheckIndexes = ExecInsertIndexTuples(myslot, 3) Trailing whitespaces added to error messages and tests for some reason: + ereport(WARNING, + (errcode(ERRCODE_BAD_COPY_FILE_FORMAT), + errmsg("skipping \"%s\" --- missing data for column \"%s\" ", + ereport(ERROR, + (errcode(ERRCODE_BAD_COPY_FILE_FORMAT), + errmsg("missing data for column \"%s\" ", -ERROR: missing data for column "e" +ERROR: missing data for column "e" CONTEXT: COPY x, line 1: "2000 230 23 23" -ERROR: missing data for column "e" +ERROR: missing data for column "e" CONTEXT: COPY x, line 1: "2001 231 \N \N" Otherwise, the patch applies/compiles cleanly and regression tests are passed. Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company
Re: [Patch] pg_rewind: options to use restore_command from recovery.conf or command line
On 26.07.2019 20:43, Liudmila Mantrova wrote: I would like to suggest a couple of changes to docs and comments, please see the attachment. The "...or fetched on startup" part also seems wrong here, but it's not a part of your patch, so I'm going to ask about it on psql-docs separately. Agreed, thanks a lot! Yes, "...or fetched on startup" looks a bit confusing to me, since the whole paragraph is about the target server before running pg_rewind, but this statement is more about the target server started for the first time after running pg_rewind, which is discussed in the next paragraph. It might also be useful to reword the following error messages: - "using restored from archive version of file \"%s\"" - "could not open restored from archive file \"%s\" We could probably say something like "could not open file \"%s\" restored from WAL archive" instead. I have reworded these and some similar messages, thanks. A new patch with the changed messages is attached. On a more general note, I wonder if everyone is happy with the --using-postgresql-conf option name, or whether we should continue searching for a narrower term. Unfortunately, I don't have any better suggestions right now, but I believe it should be clear that its purpose is to fetch missing WAL files for the target. What do you think? I don't like it either, but this one was my best guess then. Maybe --restore-target-wal instead of --using-postgresql-conf would be better? And --target-restore-command instead of --restore-command, if we want to specify that this is the restore_command for the target server? 
Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company >From 328ed78356e2b270ffe4c84baa462eb6b8e6befb Mon Sep 17 00:00:00 2001 From: Alexey Kondratov Date: Tue, 19 Feb 2019 19:14:53 +0300 Subject: [PATCH v9] pg_rewind: options to use restore_command from command line or cluster config Previously, when pg_rewind could not find required WAL files in the target data directory the rewind process would fail. One had to manually figure out which of required WAL files have already moved to the archival storage and copy them back. This patch adds possibility to specify restore_command via command line option or use one specified inside postgresql.conf. Specified restore_command will be used for automatic retrieval of missing WAL files from archival storage. --- doc/src/sgml/ref/pg_rewind.sgml | 49 +++- src/bin/pg_rewind/parsexlog.c | 164 +- src/bin/pg_rewind/pg_rewind.c | 92 ++- src/bin/pg_rewind/pg_rewind.h | 6 +- src/bin/pg_rewind/t/001_basic.pl | 4 +- src/bin/pg_rewind/t/002_databases.pl | 4 +- src/bin/pg_rewind/t/003_extrafiles.pl | 4 +- src/bin/pg_rewind/t/RewindTest.pm | 84 - 8 files changed, 386 insertions(+), 21 deletions(-) diff --git a/doc/src/sgml/ref/pg_rewind.sgml b/doc/src/sgml/ref/pg_rewind.sgml index 52a1caa246..d5a14a2e08 100644 --- a/doc/src/sgml/ref/pg_rewind.sgml +++ b/doc/src/sgml/ref/pg_rewind.sgml @@ -66,11 +66,12 @@ PostgreSQL documentation can be found either on the target timeline, the source timeline, or their common ancestor. In the typical failover scenario where the target cluster was shut down soon after the divergence, this is not a problem, but if the - target cluster ran for a long time after the divergence, the old WAL - files might no longer be present. In that case, they can be manually - copied from the WAL archive to the pg_wal directory, or - fetched on startup by configuring or - . 
The use of + target cluster ran for a long time after the divergence, its old WAL + files might no longer be present. In this case, you can manually copy them + from the WAL archive to the pg_wal directory, or run + pg_rewind with the -r or + -R option to automatically retrieve them from the WAL + archive. The use of pg_rewind is not limited to failover, e.g. a standby server can be promoted, run some write transactions, and then rewinded to become a standby again. @@ -202,6 +203,39 @@ PostgreSQL documentation + + -r + --use-postgresql-conf + + +Use the restore_command defined in +postgresql.conf to retrieve WAL files from +the WAL archive if these files are no longer available in the +pg_wal directory of the target cluster. + + +This option cannot be used together with --restore-command. + + + + + + -R restore_command + --restore-command=restore_command + + +Specifies the restore_command to use for retrieving +WAL files from the WAL archive if these files are no longer available +in the pg_wal directory of the target cluster. + + +If restore
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
Stream + spill:
  1kk | 5.9  | 18   | x3
  3kk | 19.5 | 52.4 | x2.7
  5kk | 33.3 | 86.7 | x2.86

Stream + BGW pool:
  1kk | 6    | 12   | x2
  3kk | 18.5 | 30.5 | x1.65
  5kk | 35.6 | 53.9 | x1.51

It seems that the overhead added by a synchronous replica is lower by 2-3 times compared with Postgres master and streaming with spilling. Therefore, the original patch eliminated the delay before large transaction processing starts on the sender, while this additional patch speeds up the applier side. Although the overall speed-up is surely measurable, there is room for improvement yet: 1) Currently bgworkers are only spawned on demand, without some initial pool, and are never stopped. Maybe we should create a small pool on replication start and offload some of the idle bgworkers if they exceed some limit? 2) Probably we can track somehow that an incoming change has conflicts with some of the xacts being processed, so we can wait for specific bgworkers only in that case? 3) Since the communication between the main logical apply worker and each bgworker from the pool is a 'single producer / single consumer' problem, it is probably possible to wait and set/check flags without locks, using just atomics. What do you think about this concept in general? Any concerns and criticism are welcome! Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company P.S. This patch should be applicable to your last patch set. I would rebase it against master, but it depends on the 2pc patch, which I don't know well enough. 
>From 11c7549d2732f2f983d4548a81cd509dd7e41ec4 Mon Sep 17 00:00:00 2001 From: Alexey Kondratov Date: Wed, 28 Aug 2019 15:26:50 +0300 Subject: [PATCH 11/11] BGWorkers pool for streamed transactions apply without spilling on disk --- src/backend/postmaster/bgworker.c|3 + src/backend/postmaster/pgstat.c |3 + src/backend/replication/logical/proto.c | 17 +- src/backend/replication/logical/worker.c | 1780 +++--- src/include/pgstat.h |1 + src/include/replication/logicalproto.h |4 +- src/include/replication/logicalworker.h |1 + 7 files changed, 933 insertions(+), 876 deletions(-) diff --git a/src/backend/postmaster/bgworker.c b/src/backend/postmaster/bgworker.c index f5db5a8c4a..6860df07ca 100644 --- a/src/backend/postmaster/bgworker.c +++ b/src/backend/postmaster/bgworker.c @@ -129,6 +129,9 @@ static const struct }, { "ApplyWorkerMain", ApplyWorkerMain + }, + { + "LogicalApplyBgwMain", LogicalApplyBgwMain } }; diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c index e5a4d147a7..b32994784f 100644 --- a/src/backend/postmaster/pgstat.c +++ b/src/backend/postmaster/pgstat.c @@ -3637,6 +3637,9 @@ pgstat_get_wait_ipc(WaitEventIPC w) case WAIT_EVENT_HASH_GROW_BUCKETS_REINSERTING: event_name = "Hash/GrowBuckets/Reinserting"; break; + case WAIT_EVENT_LOGICAL_APPLY_WORKER_READY: + event_name = "LogicalApplyWorkerReady"; + break; case WAIT_EVENT_LOGICAL_SYNC_DATA: event_name = "LogicalSyncData"; break; diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c index 4bec9fe8b5..954ce7343a 100644 --- a/src/backend/replication/logical/proto.c +++ b/src/backend/replication/logical/proto.c @@ -789,14 +789,11 @@ logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn, pq_sendint64(out, txn->commit_time); } -TransactionId +void logicalrep_read_stream_commit(StringInfo in, LogicalRepCommitData *commit_data) { - TransactionId xid; uint8 flags; - xid = pq_getmsgint(in, 4); - /* read flags (unused for now) */ flags = 
pq_getmsgbyte(in); @@ -807,8 +804,6 @@ logicalrep_read_stream_commit(StringInfo in, LogicalRepCommitData *commit_data) commit_data->commit_lsn = pq_getmsgint64(in); commit_data->end_lsn = pq_getmsgint64(in); commit_data->committime = pq_getmsgint64(in); - - return xid; } void @@ -823,13 +818,3 @@ logicalrep_
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
all xacts have been committed on master since the streamed one started, because we do not start streaming immediately, but only after the logical_work_mem limit is hit. I have performed some tests with conflicting xacts and it seems that it's not a problem, since the locking mechanism in Postgres guarantees that if there would be some deadlocks, they will happen earlier on master. So if some records hit the WAL, it is safe to apply them sequentially. Am I wrong? Anyway, I'm going to double-check the safety of this part later. Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company
Probably misleading comments or lack of tests in autoHeld portals management
Hi hackers, I am trying to figure out the current cursors/portals management and life cycle in Postgres. There are two if conditions for autoHeld portals: - 'if (portal->autoHeld)' inside AtAbort_Portals at portalmem.c:802; - '|| portal->autoHeld' inside AtCleanup_Portals at portalmem.c:871. Their removal does not seem to affect anything; make check-world passes. I have tried configure --with-perl/--with-python, which should be a case for autoHeld portals, but nothing changed. This seems expected to me, since the autoHeld flag is always set along with createSubid=InvalidSubTransactionId inside HoldPinnedPortals, so the single check 'createSubid == InvalidSubTransactionId' should be enough. However, the comment sections are rather misleading: (1) portal.h:126 confirms my guess 'If the portal is held over from a previous transaction, both subxids are InvalidSubTransactionId'; (2) while portalmem.c:797 states 'This is similar to the case of a cursor from a previous transaction, but it could also be that the cursor was auto-held in this transaction, so it wants to live on'. I have tried, but could not build an example of a valid query for the case described in (2), and it is definitely absent in the regression tests. Am I missing something? Added Peter to cc, since he is a committer of 056a5a3, where autoHeld was introduced. Maybe it will be easier for him to recall the context. Anyway, sorry for the noise if this question is actually trivial. Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company diff --git a/src/backend/utils/mmgr/portalmem.c b/src/backend/utils/mmgr/portalmem.c index a92b4541bd..841d88df76 100644 --- a/src/backend/utils/mmgr/portalmem.c +++ b/src/backend/utils/mmgr/portalmem.c @@ -798,8 +798,6 @@ AtAbort_Portals(void) * cursor from a previous transaction, but it could also be that the * cursor was auto-held in this transaction, so it wants to live on. 
*/ - if (portal->autoHeld) - continue; /* * If it was created in the current transaction, we can't do normal @@ -868,7 +866,7 @@ AtCleanup_Portals(void) * Do nothing to cursors held over from a previous transaction or * auto-held ones. */ - if (portal->createSubid == InvalidSubTransactionId || portal->autoHeld) + if (portal->createSubid == InvalidSubTransactionId) { Assert(portal->status != PORTAL_ACTIVE); Assert(portal->resowner == NULL);
+ + + + --debug diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c index e19c265cbb..6be6dab7e0 100644 --- a/src/bin/pg_rewind/parsexlog.c +++ b/src/bin/pg_rewind/parsexlog.c @@ -12,6 +12,7 @@ #include "postgres_fe.h" #include +#include #include "pg_rewind.h" #include "filemap.h" @@ -45,6 +46,7 @@ static char xlogfpath[MAXPGPATH]; typedef struct XLogPageReadPrivate { const char *datadir; + const char *restoreCommand; int tliIndex; } XLogPageReadPrivate; @@ -53,6 +55,9 @@ static int SimpleXLogPageRead(XLogReaderState *xlogreader, int reqLen, XLogRecPtr targetRecPtr, char *readBuf, TimeLineID *pageTLI); +static int RestoreArchivedWAL(const char *path, const char *xlogfname, + off_t expectedSize, const char *restoreCommand); + /* * Read WAL from the datadir/pg_wal, starting from 'startpoint' on timeline * index 'tliIndex' in target timeline history, until 'endpoint'. Make note of @@ -60,7 +65,7 @@ static int SimpleX
Re: [Patch] pg_rewind: options to use restore_command from recovery.conf or command line
On 26.03.2019 11:19, Michael Paquier wrote: + * This is a simplified and adapted to frontend version of + * RestoreArchivedFile function from transam/xlogarchive.c + */ +static int +RestoreArchivedWAL(const char *path, const char *xlogfname, I don't think that we should have duplicates for that, so I would recommend refactoring the code so that a unique code path is taken by both, especially since the user can fetch the command from postgresql.conf. This comment has been here since the beginning of my work on this patch, and it is now rather misleading. Even if we do not take into account obvious differences like error reporting, different log levels based on many conditions, cleanup options, and the check for standby mode, restore_command execution during backend recovery and during pg_rewind differs in one very important way. If it fails in the backend, then, as stated in the comment 'Remember, we rollforward UNTIL the restore fails so failure here is just part of the process', it is OK. By contrast, if pg_rewind fails to recover some required WAL segment, that definitely means the end of the entire process, since we would then fail at finding the last common checkpoint or extracting the page map. The only part we can share is constructing restore_command with alias replacement. However, even there the logic is slightly different, since we do not need the %r alias for pg_rewind. The only use case of %r in restore_command I know of is pg_standby, which does not seem to be a use case for pg_rewind. I have tried to move this part into common code, but it became full of conditions and less concise. Please correct me if I am wrong, but it seems that there are enough differences to keep this function separate, doesn't it? Why two options? Wouldn't use-postgresql-conf actually be enough to do the job?
Note that "postgres" should always be installed if pg_rewind is present because it is a backend-side utility, so while I don't like adding a dependency on other binaries in one binary, having an option to pass a command directly via the command line of pg_rewind stresses me more. I am not familiar enough with the DBA scenarios where the -R option may be useful, but I have been asked for it a few times. I can only speculate that, for example, someone may want to run a freshly rewound cluster as a master rather than a replica, so its config may differ from the replica's, which is where restore_command is surely intended to live. Thus, it is easier to leave the master's config in place and just specify restore_command as a command line argument. Don't we need to worry about signals interrupting the restore command? It seems to me that some refactoring from the stuff in xlogarchive.c would be in order. Thank you for pointing me to this place again. Previously, I thought that we should not care about it, since if restore_command was unsuccessful for any reason then the rewind failed, so we would stop and exit at upper levels. However, if the failure was due to a signal, then some of the next messages may be misleading, e.g. if the user manually interrupted it for some reason. So I added a similar check here as well. Updated version of the patch is attached. -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company >From 9e00f7a7696a88f350e1e328a9758ab85631c813 Mon Sep 17 00:00:00 2001 From: Alexey Kondratov Date: Tue, 19 Feb 2019 19:14:53 +0300 Subject: [PATCH v6] pg_rewind: options to use restore_command from command line or cluster config Previously, when pg_rewind could not find required WAL files in the target data directory, the rewind process would fail. One had to manually figure out which of the required WAL files had already been moved to archival storage and copy them back.
This patch adds possibility to specify restore_command via command line option or use one specified inside postgresql.conf. Specified restore_command will be used for automatic retrieval of missing WAL files from archival storage. --- doc/src/sgml/ref/pg_rewind.sgml | 30 - src/bin/pg_rewind/parsexlog.c | 167 +- src/bin/pg_rewind/pg_rewind.c | 96 ++- src/bin/pg_rewind/pg_rewind.h | 7 +- src/bin/pg_rewind/t/001_basic.pl | 4 +- src/bin/pg_rewind/t/002_databases.pl | 4 +- src/bin/pg_rewind/t/003_extrafiles.pl | 4 +- src/bin/pg_rewind/t/RewindTest.pm | 84 - 8 files changed, 376 insertions(+), 20 deletions(-) diff --git a/doc/src/sgml/ref/pg_rewind.sgml b/doc/src/sgml/ref/pg_rewind.sgml index 53a64ee29e..90e3f22f97 100644 --- a/doc/src/sgml/ref/pg_rewind.sgml +++ b/doc/src/sgml/ref/pg_rewind.sgml @@ -67,8 +67,10 @@ PostgreSQL documentation ancestor. In the typical failover scenario where the target cluster was shut down soon after the divergence, this is not a problem, but if the target cluster ran for a long time after the divergence, the old WAL - f
Re: [Patch] pg_rewind: options to use restore_command from recovery.conf or command line
On 2020-02-26 22:03, Alexander Korotkov wrote: On Tue, Feb 25, 2020 at 1:48 PM Alexander Korotkov wrote: I think usage of chmod() deserves a comment. As I understand it, default permissions are sufficient for work, but we need to set them to satisfy the 'check PGDATA permissions' test. I've added this comment myself. Thanks for doing it yourself; I was going to answer tonight, but that would obviously have been too late. I've also fixed some indentation. Patch now looks good to me. I'm going to push it if no objections. I think that the docs should be corrected. Previously Michael was against the phrase 'restore_command defined in the postgresql.conf', since it could also be defined in any config file included from there. We corrected it in the pg_rewind --help output, but now the docs say: +Use the restore_command defined in +postgresql.conf to retrieve WAL files from +the WAL archive if these files are no longer available in the +pg_wal directory of the target cluster. Probably it should be something like: +Use the restore_command defined in +the target cluster configuration to retrieve WAL files from +the WAL archive if these files are no longer available in the +pg_wal directory. Here only the text split changed: -* Ignore restore_command when not in archive recovery (meaning -* we are in crash recovery). + * Ignore restore_command when not in archive recovery (meaning we are in +* crash recovery). Should we do so in this patch? I think that this trailing dot is not necessary here: + pg_log_debug("using config variable restore_command=\'%s\'.", restore_command); If you agree, then attached is a patch with all the corrections above. It is made with the default git format-patch format, but yours were in a slightly different format, so I was only able to apply them with git am --patch-format=stgit.
-- Alexey Kondratov Postgres Professional https://www.postgrespro.com The Russian Postgres Company From fa2fc359dd9852afc608663fa32733e800652ffa Mon Sep 17 00:00:00 2001 From: Alexander Korotkov Date: Tue, 25 Feb 2020 02:22:45 +0300 Subject: [PATCH v17] pg_rewind: Add options to restore WAL files from archive Currently, pg_rewind fails when it could not find required WAL files in the target data directory. One have to manually figure out which WAL files are required and copy them back from archive. This commit implements new pg_rewind options, which allow pg_rewind to automatically retrieve missing WAL files from archival storage. The restore_command option is read from postgresql.conf. Discussion: https://postgr.es/m/a3acff50-5a0d-9a2c-b3b2-ee36168955c1%40postgrespro.ru Author: Alexey Kondratov Reviewed-by: Michael Paquier, Andrey Borodin, Alvaro Herrera Reviewed-by: Andres Freund, Alexander Korotkov --- doc/src/sgml/ref/pg_rewind.sgml | 28 -- src/backend/access/transam/xlogarchive.c | 58 + src/bin/pg_rewind/parsexlog.c| 33 ++- src/bin/pg_rewind/pg_rewind.c| 77 ++-- src/bin/pg_rewind/pg_rewind.h| 6 +- src/bin/pg_rewind/t/001_basic.pl | 3 +- src/bin/pg_rewind/t/RewindTest.pm| 67 +- src/common/Makefile | 2 + src/common/archive.c | 97 + src/common/fe_archive.c | 106 +++ src/include/common/archive.h | 21 + src/include/common/fe_archive.h | 18 src/tools/msvc/Mkvcbuild.pm | 8 +- 13 files changed, 443 insertions(+), 81 deletions(-) create mode 100644 src/common/archive.c create mode 100644 src/common/fe_archive.c create mode 100644 src/include/common/archive.h create mode 100644 src/include/common/fe_archive.h diff --git a/doc/src/sgml/ref/pg_rewind.sgml b/doc/src/sgml/ref/pg_rewind.sgml index 42d29edd4e..64a6942031 100644 --- a/doc/src/sgml/ref/pg_rewind.sgml +++ b/doc/src/sgml/ref/pg_rewind.sgml @@ -66,11 +66,11 @@ PostgreSQL documentation can be found either on the target timeline, the source timeline, or their common ancestor. 
In the typical failover scenario where the target cluster was shut down soon after the divergence, this is not a problem, but if the - target cluster ran for a long time after the divergence, the old WAL - files might no longer be present. In that case, they can be manually - copied from the WAL archive to the pg_wal directory, or - fetched on startup by configuring or - . The use of + target cluster ran for a long time after the divergence, its old WAL + files might no longer be present. In this case, you can manually copy them + from the WAL archive to the pg_wal directory, or run + pg_rewind with the -c option to + automatically retrieve them from the WAL archive. The use of
Re: [Patch] pg_rewind: options to use restore_command from recovery.conf or command line
On 2020-02-27 04:52, Michael Paquier wrote: On Thu, Feb 27, 2020 at 12:43:55AM +0300, Alexander Korotkov wrote: Regarding the text split change, it was made by pgindent. I didn't notice it belongs to an unchanged part of the code. Sure, we shouldn't include this in the patch. I have read through v17 (not tested, sorry), and spotted a couple of issues that need to be addressed. + "--source-pgdata=$standby_pgdata", + "--target-pgdata=$master_pgdata", + "--no-sync", "--no-ensure-shutdown", FWIW, I think that perl indenting would reshape this part. I would recommend running src/tools/pgindent/pgperltidy and ./src/tools/perlcheck/pgperlcritic before commit. Thanks, formatted this part with perltidy. It also modified RecursiveCopy's indentation. Pgperlcritic has no complaints about this file. BTW, when executed on the whole project, pgperltidy modifies dozens of perl files and even pgindent itself. + * Copyright (c) 2020, PostgreSQL Global Development Group Wouldn't it be better to just use the full copyright here? I mean the following: Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group Portions Copyright (c) 1994, The Regents of the University of California I think so; it contains some older code parts, so it is better to use the unified copyright. +++ b/src/common/archive.c [...] +#include "postgres.h" + +#include "common/archive.h" This is incorrect. All files shared between the backend and the frontend in src/common/ have to include the following set of headers: #ifndef FRONTEND #include "postgres.h" #else #include "postgres_fe.h" #endif +++ b/src/common/fe_archive.c [...] +#include "postgres_fe.h" This is incomplete. The following piece should be added: #ifndef FRONTEND #error "This file is not expected to be compiled for backend code" #endif Fixed both. + snprintf(postgres_cmd, sizeof(postgres_cmd), "%s -D %s -C restore_command", +postgres_exec_path, datadir_target); + I think that this is missing proper quoting.
Yep, added the same quoting as in pg_upgrade/options. I would rename ConstructRestoreCommand() to BuildRestoreCommand() while on it.. OK, shorter is better. I think that it would be saner to check the return status of ConstructRestoreCommand() in xlogarchive.c as a sanity check, with an elog(ERROR) if not 0, as that should never happen. Added. New version of the patch is attached. Thanks again for your review. Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com The Russian Postgres Company From c775a2e40e405474f6ecef35843d276d43fb462f Mon Sep 17 00:00:00 2001 From: Alexander Korotkov Date: Tue, 25 Feb 2020 02:22:45 +0300 Subject: [PATCH v18] pg_rewind: Add options to restore WAL files from archive Currently, pg_rewind fails when it could not find required WAL files in the target data directory. One have to manually figure out which WAL files are required and copy them back from archive. This commit implements new pg_rewind options, which allow pg_rewind to automatically retrieve missing WAL files from archival storage. The restore_command option is read from postgresql.conf. 
Discussion: https://postgr.es/m/a3acff50-5a0d-9a2c-b3b2-ee36168955c1%40postgrespro.ru Author: Alexey Kondratov Reviewed-by: Michael Paquier, Andrey Borodin, Alvaro Herrera Reviewed-by: Andres Freund, Alexander Korotkov --- doc/src/sgml/ref/pg_rewind.sgml | 28 -- src/backend/access/transam/xlogarchive.c | 60 ++-- src/bin/pg_rewind/parsexlog.c| 33 ++- src/bin/pg_rewind/pg_rewind.c| 77 +++- src/bin/pg_rewind/pg_rewind.h| 6 +- src/bin/pg_rewind/t/001_basic.pl | 3 +- src/bin/pg_rewind/t/RewindTest.pm| 66 +- src/common/Makefile | 2 + src/common/archive.c | 102 + src/common/fe_archive.c | 111 +++ src/include/common/archive.h | 22 + src/include/common/fe_archive.h | 19 src/tools/msvc/Mkvcbuild.pm | 8 +- 13 files changed, 457 insertions(+), 80 deletions(-) create mode 100644 src/common/archive.c create mode 100644 src/common/fe_archive.c create mode 100644 src/include/common/archive.h create mode 100644 src/include/common/fe_archive.h diff --git a/doc/src/sgml/ref/pg_rewind.sgml b/doc/src/sgml/ref/pg_rewind.sgml index 42d29edd4e..64a6942031 100644 --- a/doc/src/sgml/ref/pg_rewind.sgml +++ b/doc/src/sgml/ref/pg_rewind.sgml @@ -66,11 +66,11 @@ PostgreSQL documentation can be found either on the target timeline, the source timeline, or their common ancestor. In the
Re: [Patch] pg_rewind: options to use restore_command from recovery.conf or command line
On 2020-02-27 16:41, Alexey Kondratov wrote: New version of the patch is attached. Thanks again for your review. Last patch (v18) got a conflict with one of today commits (05d8449e73). Rebased version is attached. -- Alexey Kondratov Postgres Professional https://www.postgrespro.com The Russian Postgres Company From ea93b52b298d80aac547735c5917386b37667595 Mon Sep 17 00:00:00 2001 From: Alexander Korotkov Date: Tue, 25 Feb 2020 02:22:45 +0300 Subject: [PATCH v19] pg_rewind: Add options to restore WAL files from archive Currently, pg_rewind fails when it could not find required WAL files in the target data directory. One have to manually figure out which WAL files are required and copy them back from archive. This commit implements new pg_rewind options, which allow pg_rewind to automatically retrieve missing WAL files from archival storage. The restore_command option is read from postgresql.conf. Discussion: https://postgr.es/m/a3acff50-5a0d-9a2c-b3b2-ee36168955c1%40postgrespro.ru Author: Alexey Kondratov Reviewed-by: Michael Paquier, Andrey Borodin, Alvaro Herrera Reviewed-by: Andres Freund, Alexander Korotkov --- doc/src/sgml/ref/pg_rewind.sgml | 28 -- src/backend/access/transam/xlogarchive.c | 60 ++-- src/bin/pg_rewind/parsexlog.c| 33 ++- src/bin/pg_rewind/pg_rewind.c| 77 +++- src/bin/pg_rewind/pg_rewind.h| 6 +- src/bin/pg_rewind/t/001_basic.pl | 3 +- src/bin/pg_rewind/t/RewindTest.pm| 66 +- src/common/Makefile | 2 + src/common/archive.c | 102 + src/common/fe_archive.c | 111 +++ src/include/common/archive.h | 22 + src/include/common/fe_archive.h | 19 src/tools/msvc/Mkvcbuild.pm | 8 +- 13 files changed, 457 insertions(+), 80 deletions(-) create mode 100644 src/common/archive.c create mode 100644 src/common/fe_archive.c create mode 100644 src/include/common/archive.h create mode 100644 src/include/common/fe_archive.h diff --git a/doc/src/sgml/ref/pg_rewind.sgml b/doc/src/sgml/ref/pg_rewind.sgml index 42d29edd4e..64a6942031 100644 --- 
a/doc/src/sgml/ref/pg_rewind.sgml +++ b/doc/src/sgml/ref/pg_rewind.sgml @@ -66,11 +66,11 @@ PostgreSQL documentation can be found either on the target timeline, the source timeline, or their common ancestor. In the typical failover scenario where the target cluster was shut down soon after the divergence, this is not a problem, but if the - target cluster ran for a long time after the divergence, the old WAL - files might no longer be present. In that case, they can be manually - copied from the WAL archive to the pg_wal directory, or - fetched on startup by configuring or - . The use of + target cluster ran for a long time after the divergence, its old WAL + files might no longer be present. In this case, you can manually copy them + from the WAL archive to the pg_wal directory, or run + pg_rewind with the -c option to + automatically retrieve them from the WAL archive. The use of pg_rewind is not limited to failover, e.g. a standby server can be promoted, run some write transactions, and then rewinded to become a standby again. @@ -232,6 +232,19 @@ PostgreSQL documentation + + -c + --restore-target-wal + + +Use the restore_command defined in +the target cluster configuration to retrieve WAL files from +the WAL archive if these files are no longer available in the +pg_wal directory. + + + + --debug @@ -318,7 +331,10 @@ GRANT EXECUTE ON function pg_catalog.pg_read_binary_file(text, bigint, bigint, b history forked off from the target cluster. For each WAL record, record each data block that was touched. This yields a list of all the data blocks that were changed in the target cluster, after the - source cluster forked off. + source cluster forked off. If some of the WAL files are no longer + available, try re-running pg_rewind with + the -c option to search for the missing files in + the WAL archive. 
diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c index 188b73e752..f78a7e8f02 100644 --- a/src/backend/access/transam/xlogarchive.c +++ b/src/backend/access/transam/xlogarchive.c @@ -21,6 +21,7 @@ #include "access/xlog.h" #include "access/xlog_internal.h" +#include "common/archive.h" #include "miscadmin.h" #include "postmaster/startup.h" #include "replication/walsender.h" @@ -55,9 +56,6 @@ RestoreArchivedFile(char *path, const char *xlogfname, char xlogpath[MAXPGPATH]; char xlogRestoreCmd[MAXPGPATH]; c
Re: [Patch] pg_rewind: options to use restore_command from recovery.conf or command line
On 2020-02-28 09:43, Michael Paquier wrote: On Thu, Feb 27, 2020 at 06:29:34PM +0300, Alexey Kondratov wrote: On 2020-02-27 16:41, Alexey Kondratov wrote: > > New version of the patch is attached. Thanks again for your review. > Last patch (v18) got a conflict with one of today's commits (05d8449e73). Rebased version is attached. The shape of the patch is getting better. I have found some issues when reading through the patch, but nothing huge. + printf(_(" -c, --restore-target-wal use restore_command in target config\n")); + printf(_(" to retrieve WAL files from archive\n")); [...] {"progress", no_argument, NULL, 'P'}, + {"restore-target-wal", no_argument, NULL, 'c'}, It may be better to reorder that alphabetically. Sure, I put it in order. However, the recent -R option is out of order too. + if (rc != 0) + /* Sanity check, should never happen. */ + elog(ERROR, "failed to build restore_command due to missing parameters"); No point in having this comment IMO. I would prefer to keep it, since there are plenty of similar comments near Asserts and elogs all over Postgres. Otherwise it may look like a valid error state. It may be obvious now, but for someone who is not aware of the BuildRestoreCommand refactoring it may not be. So from my perspective there is nothing wrong with this extra one-line comment. +/* logging support */ +#define pg_fatal(...) do { pg_log_fatal(__VA_ARGS__); exit(1); } while(0) Actually, I don't think it is a good idea to name this pg_fatal(), as we have the same thing in pg_rewind, so it could be confusing. I have added explicit exit(1) calls, since pg_fatal was used only twice in archive.c. Probably pg_log_fatal from common/logging should obey the same logic as the FATAL log level in the backend and exit the process, but for now including pg_rewind.h inside archive.c or vice versa does not look like a solution.
- while ((c = getopt_long(argc, argv, "D:nNPR", long_options, &option_index)) != -1) + while ((c = getopt_long(argc, argv, "D:nNPRc", long_options, &option_index)) != -1) Alphabetical order here. Done. + rmdir($node_master->archive_dir); rmtree() is used in all our other tests. Done. There was an unobvious logic that rmdir only deletes empty directories, which is true in the case of archive_dir in that test, but I have unified it for consistency. + pg_log_error("archive file \"%s\" has wrong size: %lu instead of %lu, %s", +xlogfname, (unsigned long) stat_buf.st_size, +(unsigned long) expectedSize, strerror(errno)); I think that the error message should be reworded: "unexpected WAL file size for \"%s\": %lu instead of %lu". Please note that there is no need for strerror() here at all, as errno should be 0. +if (xlogfd < 0) +pg_log_error("could not open file \"%s\" restored from archive: %s\n", + xlogpath, strerror(errno)); [...] +pg_log_error("could not stat file \"%s\" restored from archive: %s", +xlogpath, strerror(errno)); No need for strerror() as you can just use %m. And no need for the extra newline at the end as pg_log_* routines do that by themselves. + pg_log_error("could not restore file \"%s\" from archive\n", +xlogfname); No need for a newline here. Thanks, I have cleaned up these log statements. -- Alexey Kondratov Postgres Professional https://www.postgrespro.com The Russian Postgres Company From ba20808ffddf3fe2eefe96d3385697fb6583ce9a Mon Sep 17 00:00:00 2001 From: Alexander Korotkov Date: Tue, 25 Feb 2020 02:22:45 +0300 Subject: [PATCH v20] pg_rewind: Add options to restore WAL files from archive Currently, pg_rewind fails when it could not find required WAL files in the target data directory. One have to manually figure out which WAL files are required and copy them back from archive. This commit implements new pg_rewind options, which allow pg_rewind to automatically retrieve missing WAL files from archival storage. 
The restore_command option is read from postgresql.conf. Discussion: https://postgr.es/m/a3acff50-5a0d-9a2c-b3b2-ee36168955c1%40postgrespro.ru Author: Alexey Kondratov Reviewed-by: Michael Paquier, Andrey Borodin, Alvaro Herrera Reviewed-by: Andres Freund, Alexander Korotkov --- doc/src/sgml/ref/pg_rewind.sgml | 28 -- src/backend/access/transam/xlogarchive.c | 60 ++-- src/bin/pg_rewind/parsexlog.c| 33 ++- src/bin/pg_rewind/pg_rewind.c| 77 ++- src/bin/pg_rewind/pg_rewind.h| 6 +- src/bin/pg_rewind
Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly
On 2020-02-11 19:48, Justin Pryzby wrote: For your v7 patch, which handles REINDEX to a new tablespace, I have a few minor comments: + * the relation will be rebuilt. If InvalidOid is used, the default => should say "current", not default? Yes, it keeps the current index tablespace in that case, thanks. +++ b/doc/src/sgml/ref/reindex.sgml +TABLESPACE ... +class="parameter">new_tablespace => I saw you split the description of TABLESPACE from new_tablespace based on a comment earlier in the thread, but I suggest that the descriptions for these should be merged, like: + +TABLESPACEnew_tablespace + + + Allow specification of a tablespace where all rebuilt indexes will be created. + Cannot be used with "mapped" relations. If SCHEMA, + DATABASE or SYSTEM are specified, then + all unsuitable relations will be skipped and a single WARNING + will be generated. + + + It sounds good to me, but here I am just following the structure used everywhere else. The documentation of ALTER TABLE/DATABASE, REINDEX and many others describes each literal/parameter in a separate entry, e.g. new_tablespace. So I would prefer to keep it as it is for now. The existing patch is very natural, especially the parts in the original patch handling vacuum full and cluster. Those were removed to concentrate on REINDEX, and based on comments that it might be nice if ALTER handled CLUSTER and VACUUM FULL. On a separate thread, I brought up the idea of ALTER using clustered order. Tom pointed out some issues with my implementation, but didn't like the idea, either. So I suggest re-including the CLUSTER/VAC FULL parts as a separate 0002 patch, the same way they were originally implemented.
BTW, I think if "ALTER" were updated to support REINDEX (to allow multiple operations at once), it might be either: |ALTER INDEX i SET TABLESPACE , REINDEX -- to reindex a single index on a given tblspc or |ALTER TABLE tbl REINDEX USING INDEX TABLESPACE spc; -- to reindex all inds on table, inds moved to a given tblspc "USING INDEX TABLESPACE" is already used for ALTER..ADD column/table CONSTRAINT. Yes, I also think that allowing REINDEX/CLUSTER/VACUUM FULL to put the resulting relation in a different tablespace is a very natural operation. However, I made a couple of attempts to integrate the latter two with ALTER TABLE and failed, since it is already complex enough. I am still willing to proceed with it, but I am not sure how soon that will be. Anyway, a new version is attached. It is rebased to resolve conflicts with a recent fix for REINDEX CONCURRENTLY on temp relations, and includes this small comment fix. Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com The Russian Postgres Company From d2b7a5fa2e11601759b47af0c142a7824ef907a2 Mon Sep 17 00:00:00 2001 From: Alexey Kondratov Date: Mon, 30 Dec 2019 20:00:37 +0300 Subject: [PATCH v8] Allow REINDEX to change tablespace REINDEX already does a full relation rewrite; this patch adds the possibility to specify a new tablespace where the new relfilenode will be created.
--- doc/src/sgml/ref/reindex.sgml | 24 +- src/backend/catalog/index.c | 75 -- src/backend/commands/cluster.c| 2 +- src/backend/commands/indexcmds.c | 96 --- src/backend/commands/tablecmds.c | 2 +- src/backend/nodes/copyfuncs.c | 1 + src/backend/nodes/equalfuncs.c| 1 + src/backend/parser/gram.y | 14 ++-- src/backend/tcop/utility.c| 6 +- src/bin/psql/tab-complete.c | 6 ++ src/include/catalog/index.h | 7 +- src/include/commands/defrem.h | 6 +- src/include/nodes/parsenodes.h| 1 + src/test/regress/input/tablespace.source | 49 src/test/regress/output/tablespace.source | 66 15 files changed, 323 insertions(+), 33 deletions(-) diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml index c54a7c420d4..0628c94bb1e 100644 --- a/doc/src/sgml/ref/reindex.sgml +++ b/doc/src/sgml/ref/reindex.sgml @@ -21,7 +21,7 @@ PostgreSQL documentation -REINDEX [ ( option [, ...] ) ] { INDEX | TABLE | SCHEMA | DATABASE | SYSTEM } [ CONCURRENTLY ] name +REINDEX [ ( option [, ...] ) ] { INDEX | TABLE | SCHEMA | DATABASE | SYSTEM } [ CONCURRENTLY ] name [ TABLESPACE new_tablespace ] where option can be one of: @@ -174,6 +174,28 @@ REINDEX [ ( option [, ...] ) ] { IN + +TABLESPACE + + + This specifies a tablespace, where all rebuilt indexes will be created. + Cannot be used with "mapped" relations. If SCHEMA, + DATABASE or SYSTEM is specified, then + all unsuitable relations will be skipped and a single WARNING + will be generated. +
Re: [Patch] pg_rewind: options to use restore_command from recovery.conf or command line
On 2020-03-02 07:53, Michael Paquier wrote: + * For fixed-size files, the caller may pass the expected size as an + * additional crosscheck on successful recovery. If the file size is not + * known, set expectedSize = 0. + */ +int +RestoreArchivedWALFile(const char *path, const char *xlogfname, + off_t expectedSize, const char *restoreCommand) Actually, expectedSize is IMO a bad idea, because any caller of this routine passing down zero could be trapped with an incorrect file size. So let's remove the behavior where it is possible to bypass this sanity check. We don't need it in pg_rewind either. OK, sounds reasonable, but just to be clear: I will remove only the possibility of bypassing this sanity check (with 0), but leave the expectedSize argument intact. We still need it, since pg_rewind takes WalSegSz from the ControlFile and should pass it on, am I right? + /* Remove trailing newline */ + if (strchr(cmd_output, '\n') != NULL) + *strchr(cmd_output, '\n') = '\0'; It seems to me that what you are looking for here is pg_strip_crlf(). Thinking harder, we have pipe_read_line() in src/common/exec.c which does the exact same job.. pg_strip_crlf fits well, but would you mind if I also made pipe_read_line external in this patch? - /* -* construct the command to be executed -*/ Perhaps you meant "build" here. Actually, the verb 'construct' has historically been used for archive/restore commands (see also xlogarchive.c and pgarch.c), but it should be 'build' in (fe_)archive.c, since we have BuildRestoreCommand there now. All other remarks are clear to me, so I will fix them in the next patch version, thanks. Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com The Russian Postgres Company
Re: [Patch] pg_rewind: options to use restore_command from recovery.conf or command line
On 04.03.2020 10:45, Michael Paquier wrote: On Mon, Mar 02, 2020 at 08:59:49PM +0300, Alexey Kondratov wrote: All other remarks are clear to me, so I will fix them in the next patch version, thanks. Already done as per the attached, with a new routine named getRestoreCommand() and more done. Many thanks for doing that. I went through the diff between v21 and v20. Most of the changes look good to me. - * Functions for finding and validating executable files + * Functions for finding and validating from executables files There is probably something missing here. Finding and validating what? And 'executables files' does not seem to be correct either. + # First, remove all the content in the archive directory, + # as RecursiveCopy::copypath does not support copying to + # existing directories. I think that 'remove all the content' is not completely correct in this case. We are simply removing the archive directory. There is no content there yet, so 'First, remove the archive directory...' should be fine. - I did not actually get why you don't check for a missing command when using wait_result_is_any_signal. In this case I'd think that it is better to exit immediately as follow-up calls would just fail. Believe it or not, I put 'false' there intentionally. The idea was that if the reason is a signal, then maybe the user got tired of waiting and killed the restore_command process themselves, or something like that, so it is better to exit immediately. If it was a missing command, then there is no hurry, so we can go further and complain that the attempt to recover the WAL segment has failed. Actually, I guess there is no big difference whether we include a missing command here or not. There is no complicated logic further on, compared to the real recovery process in Postgres, where we cannot simply return false in that case.
- The code was rather careless about error handling and RestoreArchivedWALFile(), and it seemed to me that it is rather pointless to report an extra message "could not restore file \"%s\" from archive" on top of the other error. Probably you mean the several pg_log_error calls not followed by 'return -1;'. Yes, I did that to fall through to the end of the function and show this extra message, but I agree that there is not much sense in doing so. Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company
Re: [Patch] pg_rewind: options to use restore_command from recovery.conf or command line
On 05.03.2020 09:24, Michael Paquier wrote: On Wed, Mar 04, 2020 at 08:14:20PM +0300, Alexey Kondratov wrote: - I did not actually get why you don't check for a missing command when using wait_result_is_any_signal. In this case I'd think that it is better to exit immediately as follow-up calls would just fail. Believe it or not, I put 'false' there intentionally. The idea was that if the reason is a signal, then maybe the user got tired of waiting and killed the restore_command process themselves, or something like that, so it is better to exit immediately. If it was a missing command, then there is no hurry, so we can go further and complain that the attempt to recover the WAL segment has failed. Actually, I guess there is no big difference whether we include a missing command here or not. There is no complicated logic further on, compared to the real recovery process in Postgres, where we cannot simply return false in that case. On the contrary, it seems to me that the difference is very important. Imagine for example a frontend tool which calls RestoreArchivedWALFile in a loop, and that this one fails because the command called is missing. This tool would keep looping for nothing. So checking for a missing command and leaving immediately would be more helpful for the user. Can you think about scenarios where it would make sense to be able to loop in this case instead of failing? OK, I still had in mind pg_rewind as the only user of this routine. Now it is part of src/common, and I can imagine a hypothetical tool that polls the archive, waiting for a specific WAL segment to become available. In this case 'command not found' is definitely the end of the game, while the absence of a segment is an expected error, so we can continue looping. Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company
Re: Conflict handling for COPY FROM
On 09.03.2020 15:34, Surafel Temesgen wrote: Okay, attached is a rebased patch with it + Portal portal = NULL; ... + portal = GetPortalByName(""); + SetRemoteDestReceiverParams(dest, portal); I think that you do not need this, since you are using a ready DestReceiver. The whole idea of passing the DestReceiver down to CopyFrom was to avoid that code. This unnamed portal is created in exec_simple_query [1] and has already been set on the DestReceiver there [2]. Maybe I am missing something, but I have just removed this code and everything works just fine. [1] https://github.com/postgres/postgres/blob/0a42a2e9/src/backend/tcop/postgres.c#L1178 [2] https://github.com/postgres/postgres/blob/0a42a2e9/src/backend/tcop/postgres.c#L1226 Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company
Re: [Patch] pg_rewind: options to use restore_command from recovery.conf or command line
On 12.03.2020 07:39, Michael Paquier wrote: I'd like to commit the refactoring piece in 0001 tomorrow, then let's move on with the rest as of 0002. If more comments and docs are needed for archive.c, let's continue discussing that. I just went through both patches and realized that I cannot work out the semantics of splitting frontend code between common and fe_utils. This applies only to 0002, where we introduce fe_archive.c. Should it be placed into fe_utils alongside the recent recovery_gen.c also used by pg_rewind? This is frontend-only code intended to be used by frontend applications, so fe_utils feels like the right place, doesn't it? I just tried to do so and everything went fine, so it seems that there are no obstacles from the build system. BTW, most of 'common' is genuinely common code, with only four exceptions like logging.c, which is frontend-only. Is it there for historical reasons only, or something else? Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company
Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly
Hi Justin, On 09.03.2020 23:04, Justin Pryzby wrote: On Sat, Feb 29, 2020 at 08:53:04AM -0600, Justin Pryzby wrote: On Sat, Feb 29, 2020 at 03:35:27PM +0300, Alexey Kondratov wrote: Anyway, a new version is attached. It is rebased in order to resolve conflicts with a recent fix of REINDEX CONCURRENTLY + temp relations, and includes this small comment fix. Thanks for rebasing - I actually started to do that yesterday. I extracted the bits from your original 0001 patch which handled CLUSTER and VACUUM FULL. I don't think there's any interest in combining that with ALTER anymore. On another thread (1), I tried to implement that, and Tom pointed out a problem with the implementation, but also didn't like the idea. I'm including some proposed fixes, but didn't yet update the docs, errors or tests for that. (I'm including your v8 untouched in hopes of not messing up the cfbot). My fixes avoid an issue if you try to REINDEX onto pg_default, I think due to moving system toast indexes. I was able to avoid this issue by adding a call to GetNewRelFileNode, even though that's already called by RelationSetNewRelfilenode(). Not sure if there's a better way, or if it's worth Alexey's v3 patch which added a tablespace param to RelationSetNewRelfilenode. Do you have any understanding of what exactly causes this error? I have tried to debug it a little bit, but still cannot figure out why we need this extra GetNewRelFileNode() call and the mechanism by which it helps. You probably mean the v4 patch. Yes, interestingly, if we do everything at once inside RelationSetNewRelfilenode(), then there is no issue at all with: REINDEX DATABASE template1 TABLESPACE pg_default; It feels like I am doing monkey coding here, so I want to understand it better :) The current logic allows moving all the indexes and toast indexes, but I think we should use IsSystemRelation() unless allow_system_table_mods, like the existing behavior of ALTER.
template1=# ALTER TABLE pg_extension_oid_index SET tablespace pg_default; ERROR: permission denied: "pg_extension_oid_index" is a system catalog template1=# REINDEX INDEX pg_extension_oid_index TABLESPACE pg_default; REINDEX Yeah, we definitely should obey the same rules as ALTER TABLE / INDEX in my opinion. Finally, I think CLUSTER is missing permission checks. It looks like relation_is_movable was factored out, but I don't see how that helps? I did this relation_is_movable refactoring in order to share the same check between REINDEX + TABLESPACE and ALTER INDEX + SET TABLESPACE. Then I realized that REINDEX already has its own temp tables check and does mapped relations validation in multiple places, so I just added global tablespace checks instead. Thus, relation_is_movable seems to be outdated right now. Probably, we will have to do another refactoring here once all the proper validations have been accumulated in this patch set. Alexey, I'm hoping to hear back if you think these changes are ok or if you'll publish a new version of the patch addressing the crash I reported. Or if you're too busy, maybe someone else can adopt the patch (I can help). Sorry for the late response, I was not going to abandon this patch, but was a bit busy last month. Many thanks for your review and fixups! There are some inconsistencies like mentions of SET TABLESPACE in error messages and so on. I am going to refactor and include your fixes 0003-0004 into 0001 and 0002, but keep 0005 separated for now, since this part requires more understanding IMO (and comparison with the v4 implementation). This way, I am going to prepare a clearer patch set by the middle of next week. I will be glad to receive more feedback from you then. Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company
Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly
Hi Surafel, Thank you for looking at the patch! On 17.09.2019 14:04, Surafel Temesgen wrote: * There is a NOWAIT option in ALTER INDEX; is there a reason not to have a similar option here? Currently in Postgres SET TABLESPACE always comes with a [ NOWAIT ] option, so I think it is worth adding this option here for convenience. Added in the new version. * SET TABLESPACE command is not documented Actually, the new_tablespace parameter was documented, but I've added a more detailed section for SET TABLESPACE too. * There are multiple checks for whether the relation is a temporary table of another session, one in check_relation_is_movable and another independently Yes, and there is a comment section in the code describing why. There is a repeated bunch of checks verifying whether the relation is movable or not, so I put it into a separate function -- check_relation_is_movable. However, if we want to do only REINDEX, then some of them are excessive, so only the RELATION_IS_OTHER_TEMP check is used. Thus, RELATION_IS_OTHER_TEMP is never executed twice, just different code paths. *+ char *tablespacename; calling it new_tablespacename will make it consistent with other places OK, changed, although I don't think it is important, since this is the only tablespace variable there. * The patch didn't apply cleanly http://cfbot.cputube.org/patch_24_2269.log The patch is rebased and attached with all the fixes described above.
Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company >From 7a19b1fd945502ad55f1fa9e61c3014d8715e404 Mon Sep 17 00:00:00 2001 From: Alexey Kondratov Date: Wed, 18 Sep 2019 15:22:04 +0300 Subject: [PATCH v2] Allow REINDEX and REINDEX CONCURRENTLY to SET TABLESPACE --- doc/src/sgml/ref/reindex.sgml | 25 + src/backend/catalog/index.c | 109 ++ src/backend/commands/cluster.c| 2 +- src/backend/commands/indexcmds.c | 38 +--- src/backend/commands/tablecmds.c | 59 +++- src/backend/parser/gram.y | 29 -- src/backend/tcop/utility.c| 22 - src/include/catalog/index.h | 7 +- src/include/commands/defrem.h | 6 +- src/include/commands/tablecmds.h | 2 + src/include/nodes/parsenodes.h| 2 + src/test/regress/input/tablespace.source | 32 +++ src/test/regress/output/tablespace.source | 44 + 13 files changed, 308 insertions(+), 69 deletions(-) diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml index 10881ab03a..192243e58f 100644 --- a/doc/src/sgml/ref/reindex.sgml +++ b/doc/src/sgml/ref/reindex.sgml @@ -22,6 +22,7 @@ PostgreSQL documentation REINDEX [ ( VERBOSE ) ] { INDEX | TABLE | SCHEMA | DATABASE | SYSTEM } [ CONCURRENTLY ] name +REINDEX [ ( VERBOSE ) ] { INDEX | TABLE } name [ SET TABLESPACE new_tablespace [NOWAIT] ] @@ -165,6 +166,30 @@ REINDEX [ ( VERBOSE ) ] { INDEX | TABLE | SCHEMA | DATABASE | SYSTEM } [ CONCURR + +SET TABLESPACE + + + This specifies a tablespace, where all rebuilt indexes will be created. + Can be used only with REINDEX INDEX and + REINDEX TABLE, since the system indexes are not + movable, but SCHEMA, DATABASE or + SYSTEM very likely will has one. If the + NOWAIT option is specified then the command will fail + if it is unable to acquire all of the locks required immediately. + + + + + +new_tablespace + + + The name of the specific tablespace to store rebuilt indexes. 
+ + + + VERBOSE diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c index 54288a498c..715abfdf65 100644 --- a/src/backend/catalog/index.c +++ b/src/backend/catalog/index.c @@ -1194,7 +1194,8 @@ index_create(Relation heapRelation, * on. This is called during concurrent reindex processing. */ Oid -index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId, const char *newName) +index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId, + Oid tablespaceOid, const char *newName) { Relation indexRelation; IndexInfo *oldInfo, @@ -1324,7 +1325,7 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId, const char newInfo, indexColNames, indexRelation->rd_rel->relam, - indexRelation->rd_rel->reltablespace, + tablespaceOid ? tablespaceOid : indexRelation->rd_rel->reltablespace, indexRelation->rd_indcollation, indclass->values, indcoloptions->values, @@ -3297,16 +3298,22 @@ IndexGetRelation(Oid indexId, bool missing_ok) * reindex_index - This routine is used to recreate a single index */ void -reindex_index(Oid indexId, bool skip_constraint_checks, char persistence, +reind
Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly
Hi Michael, Thank you for your comments. On 19.09.2019 7:43, Michael Paquier wrote: On Wed, Sep 18, 2019 at 03:46:20PM +0300, Alexey Kondratov wrote: Currently in Postgres SET TABLESPACE always comes with a [ NOWAIT ] option, so I think it is worth adding this option here for convenience. Added in the new version. It seems to me that it would be good to keep the patch as simple as possible for its first version, and split it into two if you would like to add this new option instead of bundling both together. This makes the review of one and the other more simple. OK, it makes sense. I would also prefer the first patch to be as simple as possible, but adding this NOWAIT option required only a few dozen lines, so I just bundled everything together. Anyway, I will split the patches if we decide to keep the [ SET TABLESPACE ... [NOWAIT] ] grammar. Anyway, regarding the grammar, is SET TABLESPACE really our best choice here? What about: - TABLESPACE = foo, in parenthesis only? - Only using TABLESPACE, without SET at the end of the query? SET is used in ALTER TABLE per the set of subqueries available there, but that's not the case of REINDEX. I like the SET TABLESPACE grammar, because it already exists and is used in both ALTER TABLE and ALTER INDEX. Thus, if we later add 'ALTER INDEX index_name REINDEX SET TABLESPACE' (as was proposed earlier in the thread), then it will be consistent with 'REINDEX index_name SET TABLESPACE'. If we use just plain TABLESPACE, then it may be misleading in the following cases: - REINDEX TABLE table_name TABLESPACE tablespace_name - REINDEX (TABLESPACE = tablespace_name) TABLE table_name since it may be read as 'reindex all indexes of table_name that are stored in tablespace_name', couldn't it? However, I have rather limited experience with Postgres, so I don't insist.
+-- check that all relations moved to new tablespace +SELECT relname FROM pg_class +WHERE reltablespace=(SELECT oid FROM pg_tablespace WHERE spcname='regress_tblspace') +AND relname IN ('regress_tblspace_test_tbl_idx'); +relname +--- + regress_tblspace_test_tbl_idx +(1 row) Just to check one relation you could use \d with the relation (index or table) name. Yes, \d outputs the tablespace name if it differs from pg_default, but it shows other information in addition, which is not necessary here. Also, its output is more likely to change later, which may lead to failing tests. This query output is more or less stable, and new relations may easily be added to the tests if we later add tablespace changing to CLUSTER/VACUUM FULL. I can change the test to use \d, but I am not sure that it would reduce the test output length or be helpful for supporting future tests. - if (RELATION_IS_OTHER_TEMP(iRel)) - ereport(ERROR, - (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), -errmsg("cannot reindex temporary tables of other - sessions"))) I would keep the order of this operation in order with CheckTableNotInUse(). Sure, I hadn't noticed that I reordered these operations, thanks. -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company
Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly
On 19.09.2019 16:21, Robert Haas wrote: On Thu, Sep 19, 2019 at 12:43 AM Michael Paquier wrote: It seems to me that it would be good to keep the patch as simple as possible for its first version, and split it into two if you would like to add this new option instead of bundling both together. This makes the review of one and the other more simple. Anyway, regarding the grammar, is SET TABLESPACE really our best choice here? What about: - TABLESPACE = foo, in parenthesis only? - Only using TABLESPACE, without SET at the end of the query? SET is used in ALTER TABLE per the set of subqueries available there, but that's not the case of REINDEX. So, earlier in this thread, I suggested making this part of ALTER TABLE, and several people seemed to like that idea. Did we have a reason for dropping that approach? If we add this option to REINDEX, then in 'ALTER TABLE tb_name action1, REINDEX SET TABLESPACE tbsp_name, action3' the REINDEX action will be just a direct alias for 'REINDEX TABLE tb_name SET TABLESPACE tbsp_name'. So it seems practical to do this for REINDEX first. The only concern I have against adding REINDEX to ALTER TABLE in this context is that it will allow the user to write such a chimera: ALTER TABLE tb_name REINDEX SET TABLESPACE tbsp_name, SET TABLESPACE tbsp_name; when they want to move both the table and all the indexes. Because the simple ALTER TABLE tb_name REINDEX, SET TABLESPACE tbsp_name; looks ambiguous. Should it change the tablespace of the table, the indexes, or both? -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company
Re: Conflict handling for COPY FROM
Hi Surafel, On 16.07.2019 10:08, Surafel Temesgen wrote: I also added an option to ignore all errors when ERROR is set to -1 Great! The patch still applies cleanly (tested on e1c8743e6c), but I've got some problems using more elaborate tests. First of all, there is definitely a problem with the grammar. In the docs ERROR is defined as an option, and COPY test FROM '/path/to/copy-test-simple.csv' ERROR -1; works just fine, but if the modern 'WITH (...)' syntax is used: COPY test FROM '/path/to/copy-test-simple.csv' WITH (ERROR -1); ERROR: option "error" not recognized while 'WITH (error_limit -1)' works again. This happens because COPY supports both the modern and the very old syntax: * In the preferred syntax the options are comma-separated * and use generic identifiers instead of keywords. The pre-9.0 * syntax had a hard-wired, space-separated set of options. So I see several options here: 1) Everything is left as is, but then the docs should be updated to reflect that error_limit is required for the modern syntax. 2) However, why do we have to support the old syntax here? I guess it exists for backward compatibility only, but this is a completely new feature. So maybe just 'WITH (error_limit 42)' would be enough? 3) You may also simply change the internal option name from 'error_limit' to 'error', or the SQL keyword from 'ERROR' to 'ERROR_LIMIT'. I would prefer the second option. Next, you use DestRemoteSimple for returning conflicting tuples back: + dest = CreateDestReceiver(DestRemoteSimple); + dest->rStartup(dest, (int) CMD_SELECT, tupDesc); However, printsimple supports a very limited subset of built-in types, so CREATE TABLE large_test (id integer primary key, num1 bigint, num2 double precision); COPY large_test FROM '/path/to/copy-test.tsv'; COPY large_test FROM '/path/to/copy-test.tsv' ERROR 3; fails with the following error 'ERROR: unsupported type OID: 701', which seems very confusing from the end user's perspective. I've tried to switch to DestRemote, but couldn't figure it out quickly.
Finally, I cannot see how we could ever get into this validation: + else if (strcmp(defel->defname, "error_limit") == 0) + { + if (cstate->ignore_error) + ereport(ERROR, + (errcode(ERRCODE_SYNTAX_ERROR), + errmsg("conflicting or redundant options"), + parser_errposition(pstate, defel->location))); + cstate->error_limit = defGetInt64(defel); + cstate->ignore_error = true; + if (cstate->error_limit == -1) + cstate->ignore_all_error = true; + } If cstate->ignore_error is set, then we have already processed the options list, since this is the only place where it is set. So we should never get into this ereport, should we? Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company
Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly
On 20.09.2019 19:38, Alvaro Herrera wrote: On 2019-Sep-19, Robert Haas wrote: So, earlier in this thread, I suggested making this part of ALTER TABLE, and several people seemed to like that idea. Did we have a reason for dropping that approach? Hmm, my own reading of that was to add tablespace changing abilities to ALTER TABLE *in addition* to this patch, not instead of it. That was my understanding too. On 20.09.2019 11:26, Jose Luis Tallon wrote: On 20/9/19 4:06, Michael Paquier wrote: Personally, I don't find this idea very attractive as ALTER TABLE is already complicated enough with all the subqueries we already support in the command, all the logic we need to maintain to make combinations of those subqueries in a minimum number of steps, and also the number of bugs we have seen because of the amount of complication present. Yes, but please keep the other options: As it is, CLUSTER, VACUUM FULL and REINDEX already rewrite the table in full; being able to write the result to a different tablespace than the one the original object was stored in enables a whole world of very interesting possibilities, including a quick way out of a "so little disk space available that vacuum won't work properly" situation --- which I'm sure MANY users will appreciate, including me Yes, sure, that was my main motivation. The first message in the thread contains a patch, which adds SET TABLESPACE support to all of CLUSTER, VACUUM FULL and REINDEX. However, an idea came up to integrate CLUSTER/VACUUM FULL with ALTER TABLE and do their work + all the ALTER TABLE stuff in a single table rewrite. I've dug into this a little bit and ended up with some architectural questions and concerns [1]. So I decided to start with a simple REINDEX patch. Anyway, I've followed Michael's advice and split the last patch into two: 1) Adds all the main functionality, but with the simplified 'REINDEX INDEX [ CONCURRENTLY ] ... [ TABLESPACE ...
]' grammar; 2) Adds a more sophisticated syntax with '[ SET TABLESPACE ... [ NOWAIT ] ]'. Patch 1 contains all the docs and tests and may be applied/committed separately or together with 2, which is fully optional. Recent merge conflicts and reindex_index validations order are also fixed in the attached version. [1] https://www.postgresql.org/message-id/6b2a5c4de19f111ef24b63428033bb67%40postgrespro.ru Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company >From 4f06996f1e86dee389cb0f901cb83dba77c2abd8 Mon Sep 17 00:00:00 2001 From: Alexey Kondratov Date: Tue, 24 Sep 2019 12:29:57 +0300 Subject: [PATCH v3 1/2] Allow REINDEX and REINDEX CONCURRENTLY to change TABLESPACE --- doc/src/sgml/ref/reindex.sgml | 23 ++ src/backend/catalog/index.c | 99 --- src/backend/commands/cluster.c| 2 +- src/backend/commands/indexcmds.c | 34 +--- src/backend/commands/tablecmds.c | 59 -- src/backend/parser/gram.y | 21 +++-- src/backend/tcop/utility.c| 16 +++- src/include/catalog/index.h | 7 +- src/include/commands/defrem.h | 6 +- src/include/commands/tablecmds.h | 2 + src/include/nodes/parsenodes.h| 1 + src/test/regress/input/tablespace.source | 31 +++ src/test/regress/output/tablespace.source | 41 ++ 13 files changed, 279 insertions(+), 63 deletions(-) diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml index 10881ab03a..96c9363ad9 100644 --- a/doc/src/sgml/ref/reindex.sgml +++ b/doc/src/sgml/ref/reindex.sgml @@ -22,6 +22,7 @@ PostgreSQL documentation REINDEX [ ( VERBOSE ) ] { INDEX | TABLE | SCHEMA | DATABASE | SYSTEM } [ CONCURRENTLY ] name +REINDEX [ ( VERBOSE ) ] { INDEX | TABLE } [ CONCURRENTLY ] name [ TABLESPACE new_tablespace ] @@ -165,6 +166,28 @@ REINDEX [ ( VERBOSE ) ] { INDEX | TABLE | SCHEMA | DATABASE | SYSTEM } [ CONCURR + +TABLESPACE + + + This specifies a tablespace, where all rebuilt indexes will be created. 
+ Can be used only with REINDEX INDEX and + REINDEX TABLE, since the system indexes are not + movable, but SCHEMA, DATABASE or + SYSTEM very likely will has one. + + + + + +new_tablespace + + + The name of the specific tablespace to store rebuilt indexes. + + + + VERBOSE diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c index 098732cc4a..b2fed5dc75 100644 --- a/src/backend/catalog/index.c +++ b/src/backend/catalog/index.c @@ -1239,7 +1239,8 @@ index_create(Relation heapRelation, * on. This is called during concurrent reindex processing. */ Oid -index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId, const char *newName) +index_concurrently_create_copy(Rela
Re: [Patch] pg_rewind: options to use restore_command from recovery.conf or command line
On 01.08.2019 19:53, Alexey Kondratov wrote: On 26.07.2019 20:43, Liudmila Mantrova wrote: On a more general note, I wonder if everyone is happy with the --using-postgresql-conf option name, or we should continue searching for a narrower term. Unfortunately, I don't have any better suggestions right now, but I believe it should be clear that its purpose is to fetch missing WAL files for the target. What do you think? I don't like it either, but this one was my best guess then. Maybe --restore-target-wal instead of --using-postgresql-conf would be better? And --target-restore-command instead of --restore-command, if we want to specify that this is the restore_command for the target server? As Alvaro correctly pointed out in the nearby thread [1], we've got an interference regarding the -R command line argument. I agree that it's a good idea to reserve -R for writing the recovery configuration, to be consistent with pg_basebackup, so I've updated my patch to use other letters: 1. -c/--restore-target-wal --- to use restore_command from postgresql.conf 2. -C/--target-restore-command --- to pass restore_command as a command line argument The updated and rebased patch is attached. However, now I'm wondering, do we actually need 1. as a separate option rather than having it enabled by default? I cannot imagine a situation where restore_command is set in postgresql.conf and someone prefers pg_rewind to fail instead of fetching the missing WALs automatically, but maybe there are some cases? [1] https://www.postgresql.org/message-id/20190925174812.GA4916%40alvherre.pgsql -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company >From d7e1041c756b79e6e4636be1b0337453db8a7457 Mon Sep 17 00:00:00 2001 From: Alexey Kondratov Date: Tue, 19 Feb 2019 19:14:53 +0300 Subject: [PATCH v10] pg_rewind: options to use restore_command from command line or cluster config Previously, when pg_rewind could not find required WAL files in the target data directory the rewind process would fail.
One had to manually figure out which of required WAL files have already moved to the archival storage and copy them back. This patch adds possibility to specify restore_command via command line option or use one specified inside postgresql.conf. Specified restore_command will be used for automatic retrieval of missing WAL files from archival storage. --- doc/src/sgml/ref/pg_rewind.sgml | 49 +++- src/bin/pg_rewind/parsexlog.c | 164 +- src/bin/pg_rewind/pg_rewind.c | 112 +++--- src/bin/pg_rewind/pg_rewind.h | 6 +- src/bin/pg_rewind/t/001_basic.pl | 4 +- src/bin/pg_rewind/t/002_databases.pl | 4 +- src/bin/pg_rewind/t/003_extrafiles.pl | 4 +- src/bin/pg_rewind/t/RewindTest.pm | 84 - 8 files changed, 396 insertions(+), 31 deletions(-) diff --git a/doc/src/sgml/ref/pg_rewind.sgml b/doc/src/sgml/ref/pg_rewind.sgml index ac142d22fc..27c662cc83 100644 --- a/doc/src/sgml/ref/pg_rewind.sgml +++ b/doc/src/sgml/ref/pg_rewind.sgml @@ -66,11 +66,12 @@ PostgreSQL documentation can be found either on the target timeline, the source timeline, or their common ancestor. In the typical failover scenario where the target cluster was shut down soon after the divergence, this is not a problem, but if the - target cluster ran for a long time after the divergence, the old WAL - files might no longer be present. In that case, they can be manually - copied from the WAL archive to the pg_wal directory, or - fetched on startup by configuring or - . The use of + target cluster ran for a long time after the divergence, its old WAL + files might no longer be present. In this case, you can manually copy them + from the WAL archive to the pg_wal directory, or run + pg_rewind with the -c or + -C option to automatically retrieve them from the WAL + archive. The use of pg_rewind is not limited to failover, e.g. a standby server can be promoted, run some write transactions, and then rewinded to become a standby again. 
@@ -202,6 +203,39 @@ PostgreSQL documentation + + -c + --restore-target-wal + + +Use the restore_command defined in +postgresql.conf to retrieve WAL files from +the WAL archive if these files are no longer available in the +pg_wal directory of the target cluster. + + +This option cannot be used together with --target-restore-command. + + + + + + -C restore_command + --target-restore-command=restore_command + + +Specifies the restore_command to use for retrieving +WAL files from the WAL archive if these files are no longer available +in the pg_wal directory of the target cluster. + + +If restore_command is already set in +postgresql.conf, you c
Re: Two pg_rewind patches (auto generate recovery conf and ensure clean shutdown)
On 27.09.2019 6:27, Paul Guo wrote: Secondarily, I see no reason to test connstr_source rather than just "conn" in the other patch; doing it the other way is more natural, since it's that thing that's tested as an argument. pg_rewind.c: Please put the new #include line keeping the alphabetical order. Agreed to the above suggestions. I attached the v9. I went through the remaining two patches and they seem to be very clear and concise. However, there are two points I could complain about: 1) Maybe I've missed it somewhere in the thread above, but currently pg_rewind allows running with both -R and --source-pgdata. In that case the -R option is just swallowed and neither standby.signal, nor postgresql.auto.conf is written, which is reasonable though. Should it be stated somehow in the docs that the -R option always has to go together with --source-server? Or should pg_rewind notify the user that the options are incompatible and no recovery configuration will be written? 2) Are you going to leave the -R option completely without TAP tests? Attached is a small patch, which tests the -R option along with the existing 'remote' case. If needed, it may be split into two separate cases. First, it tests that pg_rewind is able to succeed with minimal permissions according to Michael's patch d9f543e [1]. Next, it checks the presence of standby.signal and adds the REPLICATION permission to rewind_user to test that the new standby is able to start with the generated recovery configuration.
[1] https://github.com/postgres/postgres/commit/d9f543e9e9be15f92abdeaf870e57ef289020191 Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company >From 8c607794f259cd4dec0fa6172b69d62e6468bee3 Mon Sep 17 00:00:00 2001 From: Alexey Kondratov Date: Fri, 27 Sep 2019 14:30:57 +0300 Subject: [PATCH v9 3/3] Test new standby start with generated config during pg_rewind remote --- src/bin/pg_rewind/t/001_basic.pl | 2 +- src/bin/pg_rewind/t/002_databases.pl | 2 +- src/bin/pg_rewind/t/003_extrafiles.pl | 2 +- src/bin/pg_rewind/t/004_pg_xlog_symlink.pl | 2 +- src/bin/pg_rewind/t/RewindTest.pm | 11 ++- 5 files changed, 14 insertions(+), 5 deletions(-) diff --git a/src/bin/pg_rewind/t/001_basic.pl b/src/bin/pg_rewind/t/001_basic.pl index 115192170e..c3293e93df 100644 --- a/src/bin/pg_rewind/t/001_basic.pl +++ b/src/bin/pg_rewind/t/001_basic.pl @@ -1,7 +1,7 @@ use strict; use warnings; use TestLib; -use Test::More tests => 10; +use Test::More tests => 11; use FindBin; use lib $FindBin::RealBin; diff --git a/src/bin/pg_rewind/t/002_databases.pl b/src/bin/pg_rewind/t/002_databases.pl index f1eb4fe1d2..1db534c0dc 100644 --- a/src/bin/pg_rewind/t/002_databases.pl +++ b/src/bin/pg_rewind/t/002_databases.pl @@ -1,7 +1,7 @@ use strict; use warnings; use TestLib; -use Test::More tests => 6; +use Test::More tests => 7; use FindBin; use lib $FindBin::RealBin; diff --git a/src/bin/pg_rewind/t/003_extrafiles.pl b/src/bin/pg_rewind/t/003_extrafiles.pl index c4040bd562..f4710440fc 100644 --- a/src/bin/pg_rewind/t/003_extrafiles.pl +++ b/src/bin/pg_rewind/t/003_extrafiles.pl @@ -3,7 +3,7 @@ use strict; use warnings; use TestLib; -use Test::More tests => 4; +use Test::More tests => 5; use File::Find; diff --git a/src/bin/pg_rewind/t/004_pg_xlog_symlink.pl b/src/bin/pg_rewind/t/004_pg_xlog_symlink.pl index ed1ddb6b60..639eeb9c91 100644 --- a/src/bin/pg_rewind/t/004_pg_xlog_symlink.pl +++ b/src/bin/pg_rewind/t/004_pg_xlog_symlink.pl @@ -14,7 +14,7 @@ 
if ($windows_os) } else { - plan tests => 4; + plan tests => 5; } use FindBin; diff --git a/src/bin/pg_rewind/t/RewindTest.pm b/src/bin/pg_rewind/t/RewindTest.pm index 68b6004e94..fcc48cb1d9 100644 --- a/src/bin/pg_rewind/t/RewindTest.pm +++ b/src/bin/pg_rewind/t/RewindTest.pm @@ -266,9 +266,18 @@ sub run_pg_rewind [ 'pg_rewind', "--debug", "--source-server",$standby_connstr, -"--target-pgdata=$master_pgdata", "--no-sync" +"--target-pgdata=$master_pgdata", "-R", "--no-sync" ], 'pg_rewind remote'); + + # Check that standby.signal has been created. + ok(-e "$master_pgdata/standby.signal"); + + # Now, when pg_rewind apparently succeeded with minimal permissions, + # add REPLICATION privilege. So we could test that new standby + # is able to connect to the new master with generated config. + $node_standby->psql( + 'postgres', "ALTER ROLE rewind_user WITH REPLICATION;"); } else { -- 2.17.1
Re: Two pg_rewind patches (auto generate recovery conf and ensure clean shutdown)
On 27.09.2019 17:28, Alvaro Herrera wrote: + # Now, when pg_rewind apparently succeeded with minimal permissions, + # add REPLICATION privilege. So we could test that new standby + # is able to connect to the new master with generated config. + $node_standby->psql( + 'postgres', "ALTER ROLE rewind_user WITH REPLICATION;"); I think this had better use safe_psql. Yes, indeed. On 30.09.2019 10:07, Paul Guo wrote: 2) Are you going to leave the -R option completely without TAP tests? Attached is a small patch, which tests the -R option along with the existing 'remote' case. If needed, it may be split into two separate cases. First, it tests that pg_rewind is able to succeed with minimal permissions according to Michael's patch d9f543e [1]. Next, it checks the presence of standby.signal and adds the REPLICATION permission to rewind_user to test that the new standby is able to start with the generated recovery configuration. [1] https://github.com/postgres/postgres/commit/d9f543e9e9be15f92abdeaf870e57ef289020191 It seems that we could also disable the recovery info setting code for the 'remote' test case? - my $port_standby = $node_standby->port; - $node_master->append_conf( - 'postgresql.conf', qq( -primary_conninfo='port=$port_standby' -)); + if ($test_mode ne "remote") + { + my $port_standby = $node_standby->port; + $node_master->append_conf( + 'postgresql.conf', + qq(primary_conninfo='port=$port_standby')); - $node_master->set_standby_mode(); + $node_master->set_standby_mode(); + } Yeah, it makes sense. It is excessive for remote if we add '-R' there. I've updated and attached my test-adding patch.
-- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company >From b38bc7d71f7e7d68d66d3bf9af4e6371445aeab2 Mon Sep 17 00:00:00 2001 From: Alexey Kondratov Date: Fri, 27 Sep 2019 14:30:57 +0300 Subject: [PATCH v10 3/3] Test new standby start with generated config during pg_rewind remote --- src/bin/pg_rewind/t/001_basic.pl | 2 +- src/bin/pg_rewind/t/002_databases.pl | 2 +- src/bin/pg_rewind/t/003_extrafiles.pl | 2 +- src/bin/pg_rewind/t/004_pg_xlog_symlink.pl | 2 +- src/bin/pg_rewind/t/RewindTest.pm | 27 +++--- 5 files changed, 23 insertions(+), 12 deletions(-) diff --git a/src/bin/pg_rewind/t/001_basic.pl b/src/bin/pg_rewind/t/001_basic.pl index 115192170e..c3293e93df 100644 --- a/src/bin/pg_rewind/t/001_basic.pl +++ b/src/bin/pg_rewind/t/001_basic.pl @@ -1,7 +1,7 @@ use strict; use warnings; use TestLib; -use Test::More tests => 10; +use Test::More tests => 11; use FindBin; use lib $FindBin::RealBin; diff --git a/src/bin/pg_rewind/t/002_databases.pl b/src/bin/pg_rewind/t/002_databases.pl index f1eb4fe1d2..1db534c0dc 100644 --- a/src/bin/pg_rewind/t/002_databases.pl +++ b/src/bin/pg_rewind/t/002_databases.pl @@ -1,7 +1,7 @@ use strict; use warnings; use TestLib; -use Test::More tests => 6; +use Test::More tests => 7; use FindBin; use lib $FindBin::RealBin; diff --git a/src/bin/pg_rewind/t/003_extrafiles.pl b/src/bin/pg_rewind/t/003_extrafiles.pl index c4040bd562..f4710440fc 100644 --- a/src/bin/pg_rewind/t/003_extrafiles.pl +++ b/src/bin/pg_rewind/t/003_extrafiles.pl @@ -3,7 +3,7 @@ use strict; use warnings; use TestLib; -use Test::More tests => 4; +use Test::More tests => 5; use File::Find; diff --git a/src/bin/pg_rewind/t/004_pg_xlog_symlink.pl b/src/bin/pg_rewind/t/004_pg_xlog_symlink.pl index ed1ddb6b60..639eeb9c91 100644 --- a/src/bin/pg_rewind/t/004_pg_xlog_symlink.pl +++ b/src/bin/pg_rewind/t/004_pg_xlog_symlink.pl @@ -14,7 +14,7 @@ if ($windows_os) } else { - plan tests => 4; + plan tests => 5; } use FindBin; diff --git 
a/src/bin/pg_rewind/t/RewindTest.pm b/src/bin/pg_rewind/t/RewindTest.pm index 68b6004e94..2b45c2789c 100644 --- a/src/bin/pg_rewind/t/RewindTest.pm +++ b/src/bin/pg_rewind/t/RewindTest.pm @@ -149,7 +149,7 @@ sub start_master # Create custom role which is used to run pg_rewind, and adjust its # permissions to the minimum necessary. - $node_master->psql( + $node_master->safe_psql( 'postgres', " CREATE ROLE rewind_user LOGIN; GRANT EXECUTE ON function pg_catalog.pg_ls_dir(text, boolean, boolean) @@ -266,9 +266,18 @@ sub run_pg_rewind [ 'pg_rewind', "--debug", "--source-server",$standby_connstr, -"--target-pgdata=$master_pgdata", "--no-sync" +"--target-pgdata=$master_pgdata", "-R", "--no-sync" ], 'pg_rewind remote'); + + # Check
Re: Two pg_rewind patches (auto generate recovery conf and ensure clean shutdown)
Hi Alvaro, On 30.09.2019 20:13, Alvaro Herrera wrote: OK, I pushed this patch as well as Alexey's test patch. It all works for me, and the coverage report shows that we're doing the new thing ... though only in the case that rewind *is* required. There is no test to verify the case where rewind is *not* required. I guess it'd also be good to test the case when we throw the new error, if only for completeness ... I've directly followed your guess and tried to elaborate the pg_rewind test cases and... It seems I've caught a few bugs: 1) --dry-run actually wasn't completely 'dry'. It updated the target control file, which could cause subsequent pg_rewind calls to fail after dry-run ones. 2) The --no-ensure-shutdown flag was broken: it simply didn't turn off this new feature. 3) --write-recovery-conf didn't obey the --dry-run flag. Thus, it was definitely a good idea to add new tests. Two patches are attached: 1) The first one fixes all the issues above; 2) The second one slightly increases pg_rewind's overall code coverage from 74% to 78.6%. Should I put this fix on the next commitfest? Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company P.S. My apologies that I missed two of these bugs during review.
>From 7286e31ab0ebf50bb4ab460dd81b82f1c5989272 Mon Sep 17 00:00:00 2001 From: Alexey Kondratov Date: Wed, 2 Oct 2019 19:24:46 +0300 Subject: [PATCH v1 1/2] Fix functionality of pg_rewind --dry-run and --no-ensure-shutdown options Branch: pg-rewind-fixes --- src/bin/pg_rewind/pg_rewind.c | 8 +--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/src/bin/pg_rewind/pg_rewind.c b/src/bin/pg_rewind/pg_rewind.c index a7fd9e0cab..1a7fb5242b 100644 --- a/src/bin/pg_rewind/pg_rewind.c +++ b/src/bin/pg_rewind/pg_rewind.c @@ -101,7 +101,7 @@ main(int argc, char **argv) {"write-recovery-conf", no_argument, NULL, 'R'}, {"source-pgdata", required_argument, NULL, 1}, {"source-server", required_argument, NULL, 2}, - {"no-ensure-shutdown", no_argument, NULL, 44}, + {"no-ensure-shutdown", no_argument, NULL, 4}, {"version", no_argument, NULL, 'V'}, {"dry-run", no_argument, NULL, 'n'}, {"no-sync", no_argument, NULL, 'N'}, @@ -435,13 +435,15 @@ main(int argc, char **argv) ControlFile_new.minRecoveryPoint = endrec; ControlFile_new.minRecoveryPointTLI = endtli; ControlFile_new.state = DB_IN_ARCHIVE_RECOVERY; - update_controlfile(datadir_target, &ControlFile_new, do_sync); + + if (!dry_run) + update_controlfile(datadir_target, &ControlFile_new, do_sync); if (showprogress) pg_log_info("syncing target data directory"); syncTargetDirectory(); - if (writerecoveryconf) + if (!dry_run && writerecoveryconf) WriteRecoveryConfig(conn, datadir_target, GenerateRecoveryConfig(conn, NULL)); base-commit: df86e52cace2c4134db51de6665682fb985f3195 -- 2.17.1 >From 28fdd2fa58af718d8a894cb3c3d8f9b2cdf6759e Mon Sep 17 00:00:00 2001 From: Alexey Kondratov Date: Wed, 2 Oct 2019 19:25:27 +0300 Subject: [PATCH v1 2/2] Increase pg_rewind code coverage Branch: pg-rewind-fixes --- src/bin/pg_rewind/t/001_basic.pl | 2 +- src/bin/pg_rewind/t/002_databases.pl | 2 +- src/bin/pg_rewind/t/003_extrafiles.pl | 2 +- src/bin/pg_rewind/t/004_pg_xlog_symlink.pl | 2 +- src/bin/pg_rewind/t/005_same_timeline.pl | 27 ++ 
src/bin/pg_rewind/t/RewindTest.pm | 33 -- 6 files changed, 55 insertions(+), 13 deletions(-) diff --git a/src/bin/pg_rewind/t/001_basic.pl b/src/bin/pg_rewind/t/001_basic.pl index c3293e93df..a1659460ec 100644 --- a/src/bin/pg_rewind/t/001_basic.pl +++ b/src/bin/pg_rewind/t/001_basic.pl @@ -1,7 +1,7 @@ use strict; use warnings; use TestLib; -use Test::More tests => 11; +use Test::More tests => 14; use FindBin; use lib $FindBin::RealBin; diff --git a/src/bin/pg_rewind/t/002_databases.pl b/src/bin/pg_rewind/t/002_databases.pl index 1db534c0dc..921c4434f5 100644 --- a/src/bin/pg_rewind/t/002_databases.pl +++ b/src/bin/pg_rewind/t/002_databases.pl @@ -1,7 +1,7 @@ use strict; use warnings; use TestLib; -use Test::More tests => 7; +use Test::More tests => 10; use FindBin; use lib $FindBin::RealBin; diff --git a/src/bin/pg_rewind/t/003_extrafiles.pl b/src/bin/pg_rewind/t/003_extrafiles.pl index f4710440fc..bce5b47148 100644 --- a/src/bin/pg_rewind/t/003_extrafiles.pl +++ b/src/bin/pg_rewind/t/003_extrafiles.pl @@ -3,7 +3,7 @@ use strict; use warnings; use TestLib; -use Test::More tests => 5; +use Test::More tests => 8; use File::Find; diff --git a/src/bin/pg_rewind/t/004_pg_xlog_symlink.pl b/src/bin/pg_rewind/t/004_pg_xlog_symlink.pl index 639eeb9c91..a501be8f78 100644 --- a/src/bin/pg_rewind/t/004_pg_xlo
Re: Two pg_rewind patches (auto generate recovery conf and ensure clean shutdown)
On 03.10.2019 6:07, Michael Paquier wrote: On Wed, Oct 02, 2019 at 08:28:09PM +0300, Alexey Kondratov wrote: I've directly followed your guess and tried to elaborate pg_rewind test cases and... It seems I've caught a few bugs: 1) --dry-run actually wasn't completely 'dry'. It did update target controlfile, which could cause repetitive pg_rewind calls to fail after dry-run ones. I have just paid attention to this thread, but this is a bug which goes down to 12 actually so let's treat it independently of the rest. The control file was not written thanks to the safeguards in write_target_range() in past versions, but the recent refactoring around control file handling broke that promise. Another thing which is not completely exact is the progress reporting which should be reported even if the dry-run mode runs. That's less critical, but let's make things consistent. I also thought about v12, though didn't check whether it's affected. Patch 0001 also forgot that recovery.conf should not be written either when no rewind is needed. Yes, definitely, I forgot this code path, thanks. I have reworked your first patch as per the attached. What do you think about it? The part with the control file needs to go down to v12, and I would likely split that into two commits on HEAD: one for the control file and a second for the recovery.conf portion with the fix for --no-ensure-shutdown to keep a cleaner history. It looks fine to me except for the progress reporting part. It now adds PG_CONTROL_FILE_SIZE to fetch_done. However, I cannot find where the control file is included into the filemap and fetch_size, or counted during calculate_totals(). Maybe I've missed something, but now it looks like we report something that wasn't planned for progress reporting, doesn't it? + # Check that incompatible options error out.
+ command_fails( + [ + 'pg_rewind', "--debug", + "--source-pgdata=$standby_pgdata", + "--target-pgdata=$master_pgdata", "-R", + "--no-ensure-shutdown" + ], + 'pg_rewind local with -R'); Incompatible options had better be checked within a separate perl script? We generally do that for the other binaries. Yes, it makes sense. I've reworked the patch with tests and added a couple of extra cases. -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company >From 9e828e311dc7c216e5bfb1936022be4f7fd3805f Mon Sep 17 00:00:00 2001 From: Alexey Kondratov Date: Thu, 3 Oct 2019 12:37:26 +0300 Subject: [PATCH v2 2/2] Increase pg_rewind code coverage Branch: pg-rewind-fixes --- src/bin/pg_rewind/t/001_basic.pl | 2 +- src/bin/pg_rewind/t/002_databases.pl | 2 +- src/bin/pg_rewind/t/003_extrafiles.pl | 2 +- src/bin/pg_rewind/t/004_pg_xlog_symlink.pl | 2 +- src/bin/pg_rewind/t/005_same_timeline.pl | 32 +--- src/bin/pg_rewind/t/006_actions.pl | 61 ++ src/bin/pg_rewind/t/RewindTest.pm | 20 ++- 7 files changed, 107 insertions(+), 14 deletions(-) create mode 100644 src/bin/pg_rewind/t/006_actions.pl diff --git a/src/bin/pg_rewind/t/001_basic.pl b/src/bin/pg_rewind/t/001_basic.pl index c3293e93df..1ba1648af6 100644 --- a/src/bin/pg_rewind/t/001_basic.pl +++ b/src/bin/pg_rewind/t/001_basic.pl @@ -1,7 +1,7 @@ use strict; use warnings; use TestLib; -use Test::More tests => 11; +use Test::More tests => 13; use FindBin; use lib $FindBin::RealBin; diff --git a/src/bin/pg_rewind/t/002_databases.pl b/src/bin/pg_rewind/t/002_databases.pl index 1db534c0dc..57674ff4b3 100644 --- a/src/bin/pg_rewind/t/002_databases.pl +++ b/src/bin/pg_rewind/t/002_databases.pl @@ -1,7 +1,7 @@ use strict; use warnings; use TestLib; -use Test::More tests => 7; +use Test::More tests => 9; use FindBin; use lib $FindBin::RealBin; diff --git a/src/bin/pg_rewind/t/003_extrafiles.pl b/src/bin/pg_rewind/t/003_extrafiles.pl index f4710440fc..16c92cb2d6 100644 --- 
a/src/bin/pg_rewind/t/003_extrafiles.pl +++ b/src/bin/pg_rewind/t/003_extrafiles.pl @@ -3,7 +3,7 @@ use strict; use warnings; use TestLib; -use Test::More tests => 5; +use Test::More tests => 7; use File::Find; diff --git a/src/bin/pg_rewind/t/004_pg_xlog_symlink.pl b/src/bin/pg_rewind/t/004_pg_xlog_symlink.pl index 639eeb9c91..6dabd11db6 100644 --- a/src/bin/pg_rewind/t/004_pg_xlog_symlink.pl +++ b/src/bin/pg_rewind/t/004_pg_xlog_symlink.pl @@ -14,7 +14,7 @@ if ($windows_os) } else { - plan tests => 5; + plan tests => 7; } use FindBin; diff --git a/src/bin/pg_rewind/t/005_same_timeline.pl b/src/bin/pg_rewind/t/005_same_timeline.pl index 40dbc44caa..089466a
Re: Two pg_rewind patches (auto generate recovery conf and ensure clean shutdown)
On 04.10.2019 11:37, Michael Paquier wrote: On Thu, Oct 03, 2019 at 12:43:37PM +0300, Alexey Kondratov wrote: On 03.10.2019 6:07, Michael Paquier wrote: I have reworked your first patch as per the attached. What do you think about it? The part with the control file needs to go down to v12, and I would likely split that into two commits on HEAD: one for the control file and a second for the recovery.conf portion with the fix for --no-ensure-shutdown to keep a cleaner history. It looks fine to me except for the progress reporting part. It now adds PG_CONTROL_FILE_SIZE to fetch_done. However, I cannot find where the control file is included into the filemap and fetch_size, or counted during calculate_totals(). Maybe I've missed something, but now it looks like we report something that wasn't planned for progress reporting, doesn't it? Right. The pre-12 code actually handled that incorrectly, as it assumed that any files written through file_ops.c should be part of the progress. So I went with the simplest solution, and backpatched this part with 6f3823b. I have also committed the set of fixes for the new options so that we have a better base of work than what's on HEAD currently. Great, thanks. Regarding the tests, adding a --dry-run command is a good idea. However, I think that there is more value in automating the use of single-user mode in the tests, as that's more critical from the point of view of a rewind run, and stopping the cluster with immediate mode causes, as expected, the next --dry-run command to fail. Another thing is that I think that we should use -F with --single. This makes recovery faster, and the target data folder is synced at the end of pg_rewind anyway. Using the long option names makes the tests easier to follow in this case, so I have switched -R to --write-recovery-conf. Some comments and the docs have been using some confusing wording, so I have reworked what I found (like many "it" in a single sentence referring to different things).
I agree with all the points. Shutting down the target server using 'immediate' mode is a good way to test ensureCleanShutdown automatically. Regarding all the set of incompatible options, we have much more of that after the initial option parsing, so I think that we should group all the cheap ones together. Let's tackle that as a separate patch. We can also just check after --no-ensure-shutdown directly in RewindTest.pm as I have switched the cluster to not be cleanly shut down anymore to stress the automatic recovery path, and trigger that before running pg_rewind for the local and remote mode. Attached is an updated patch with all I found. What do you think? I've checked your patch, but it seems that it cannot be applied as is, since it e.g. adds a comment to 005_same_timeline.pl without actually changing the test. So I've slightly modified your patch and tried to fit both dry-run and ensureCleanShutdown testing together. It works just fine and fails immediately if any of the recent fixes is reverted. I still think that dry-run testing is worth adding, since it helped to catch this v12 refactoring issue, but feel free to throw it away if it isn't committable right now, of course. As for incompatible options and sanity checks testing, yes, I agree that it is a matter for a different patch. I attached it as a separate WIP patch just for history. Maybe I will try to gather more cases there later.
-- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company >From 6e5667edcad6b037004288635a7ae0eda40d4262 Mon Sep 17 00:00:00 2001 From: Alexey Kondratov Date: Fri, 4 Oct 2019 17:14:12 +0300 Subject: [PATCH v3 1/2] Improve functionality, docs and tests of -R, --no-ensure-shutdown and --dry-run options Branch: pg-rewind-fixes --- doc/src/sgml/ref/pg_rewind.sgml| 10 +-- src/bin/pg_rewind/pg_rewind.c | 19 +++--- src/bin/pg_rewind/t/001_basic.pl | 2 +- src/bin/pg_rewind/t/002_databases.pl | 2 +- src/bin/pg_rewind/t/003_extrafiles.pl | 2 +- src/bin/pg_rewind/t/004_pg_xlog_symlink.pl | 2 +- src/bin/pg_rewind/t/005_same_timeline.pl | 32 +++--- src/bin/pg_rewind/t/RewindTest.pm | 71 +- 8 files changed, 103 insertions(+), 37 deletions(-) diff --git a/doc/src/sgml/ref/pg_rewind.sgml b/doc/src/sgml/ref/pg_rewind.sgml index fbf454803b..42d29edd4e 100644 --- a/doc/src/sgml/ref/pg_rewind.sgml +++ b/doc/src/sgml/ref/pg_rewind.sgml @@ -169,12 +169,14 @@ PostgreSQL documentation --no-ensure-shutdown -pg_rewind verifies that the target server -is cleanly shutdown before rewinding; by default, if it isn't, it -starts the server in single-user mode to complete crash recovery. +pg_rewind requires that the target server +is cleanly shut down before rewinding. By default,
Re: Two pg_rewind patches (auto generate recovery conf and ensure clean shutdown)
On 07.10.2019 4:06, Michael Paquier wrote: On Fri, Oct 04, 2019 at 05:21:25PM +0300, Alexey Kondratov wrote: I've checked your patch, but it seems that it cannot be applied as is, since it e.g. adds a comment to 005_same_timeline.pl without actually changing the test. So I've slightly modified your patch and tried to fit both dry-run and ensureCleanShutdown testing together. It works just fine and fails immediately if any of the recent fixes is reverted. I still think that dry-run testing is worth adding, since it helped to catch this v12 refactoring issue, but feel free to throw it away if it isn't committable right now, of course. I can guarantee the last patch I sent can be applied on top of HEAD: https://www.postgresql.org/message-id/20191004083721.ga1...@paquier.xyz Yes, it did, but my comment was about these lines: diff --git a/src/bin/pg_rewind/t/005_same_timeline.pl b/src/bin/pg_rewind/t/005_same_timeline.pl index 40dbc44caa..df469d3939 100644 --- a/src/bin/pg_rewind/t/005_same_timeline.pl +++ b/src/bin/pg_rewind/t/005_same_timeline.pl @@ -1,3 +1,7 @@ +# +# Test that running pg_rewind with the source and target clusters +# on the same timeline runs successfully. +# You have added this new comment section, but kept the old one, which was pretty much the same [1]. Regarding the rest, I have hacked my way through as per the attached. The previous set of patches did the following, which looked either overkill or not necessary: - Why running test 005 with the remote mode? OK, it was definitely an overkill, since remote control file fetch will also be tested in any other remote test case. - --dry-run coverage is basically the same with the local and remote modes, so it seems like a waste of resource to run it for all the tests and all the modes. My point was to test --dry-run + --write-recovery-conf in remote, since the latter may cause the recovery configuration to be written without doing any actual work, due to some wrong refactoring for example.
- There is no need for the script checking for options combinations to initialize a data folder. It is important to design the tests to be cheap and meaningful. Yes, I agree, moving some of those tests to just a 001_basic seems to be a proper optimization. Patch v3-0002 also had a test to make sure that the source server is shut down cleanly before using it. I have included that part as well, as the flow feels right. So, Alexey, what do you think? It looks good to me. Two minor remarks: + # option combinations. As the code paths taken by those tests + # does not change for the "local" and "remote" modes, just run them I am far from being fluent in English, but should it be 'do not change' instead? +command_fails( + [ + 'pg_rewind', '--target-pgdata', + $primary_pgdata, '--source-pgdata', + $standby_pgdata, 'extra_arg1' + ], Here and below I would prefer traditional options ordering "'--key', 'value'". It should be easier to recognize from the reader's perspective: +command_fails( + [ + 'pg_rewind', + '--target-pgdata', $primary_pgdata, + '--source-pgdata', $standby_pgdata, + 'extra_arg1' + ], [1] https://github.com/postgres/postgres/blob/caa078353ecd1f3b3681c0d4fa95ad4bb8c2308a/src/bin/pg_rewind/t/005_same_timeline.pl#L15 -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On 22.10.2019 20:22, Tomas Vondra wrote: On Tue, Oct 22, 2019 at 11:01:48AM +0530, Dilip Kumar wrote: On Tue, Oct 22, 2019 at 10:46 AM Amit Kapila wrote: In general, yours and Alexey's test results show that there is merit in having workers apply such transactions. OTOH, as noted above [1], we are also worried about the performance of Rollbacks if we follow that approach. I am not sure how much we need to worry about Rollbacks if commits are faster, but can we think of recording the changes in memory and only write to a file if the changes are above a certain threshold? I think that might help saving I/O in many cases. I am not very sure if we do that how much additional workers can help, but they might still help. I think we need to do some tests and experiments to figure out what is the best approach? What do you think? I agree with the point. I think we might need to do some small changes and test to see what could be the best method to handle the streamed changes at the subscriber end. Tomas, Alexey, do you have any thoughts on this matter? I think it is important that we figure out the way to proceed in this patch. [1] - https://www.postgresql.org/message-id/b25ce80e-f536-78c8-d5c8-a5df3e230785%40postgrespro.ru I think the patch should do the simplest thing possible, i.e. what it does today. Otherwise we'll never get it committed. I have to agree with Tomas that keeping things as simple as possible should be the main priority right now. Otherwise, the entire patch set will pass the next release cycle without being committed even partially. At the same time, it resolves an important problem from my perspective. It moves I/O overhead from primary to replica using large transaction streaming, which is a nice-to-have feature I guess. Later it would be possible to replace the logical apply worker with a bgworkers pool in a separate patch, if we decide that it is a viable solution.
Anyway, regarding Amit's questions: - I doubt that maintaining a separate buffer on the apply side before spilling to disk would help enough. We already have ReorderBuffer with the logical_work_mem limit, and if we exceeded that limit on the sender side, then most probably we exceed it on the applier side as well, except for the case when this new buffer size is significantly higher than logical_work_mem to keep multiple open xacts. - I still think that we should optimize the database for commits, not rollbacks. The bgworkers pool is dramatically slower for a rollbacks-only load, though at least twice as fast for commits-only. I do not know how it will perform with a real-life load, but this drawback may be inappropriate for a general-purpose database like Postgres. - Tomas' implementation of streaming with spilling does not have this bias between commits/aborts. However, it has a noticeable performance drop (~x5 slower compared with master [1]) for a large transaction consisting of many small rows, although it is not an order of magnitude slower. Another thing is that about a year ago I found some problems with MVCC/visibility and fixed them somehow [1]. If I understand correctly, Tomas adapted some of those fixes into his patch set, but I think that this part should be reviewed carefully again. I would be glad to check it, but now I am a little bit confused by all the patch set variants in the thread. Which is the last one? Is it still dependent on 2PC decoding? [1] https://www.postgresql.org/message-id/flat/40c38758-04b5-74f4-c963-cf300f9e5dff%40postgrespro.ru#98d06fefc88122385dacb2f03f7c30f7 Thanks for moving this patch forward! -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company
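The buffering strategy Amit floats above (record streamed changes in memory and write to a file only once they exceed a certain threshold) can be sketched in a few lines of C. This is purely an illustration of the idea under discussion, not the ReorderBuffer API; all names here (ChangeBuffer, cb_*, SPILL_THRESHOLD) are invented for the example.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical sketch of "spill past a threshold" buffering: changes
 * are appended to an in-memory buffer, and a spill file is opened only
 * when the accumulated size exceeds SPILL_THRESHOLD.
 */
#define SPILL_THRESHOLD 1024	/* bytes kept in memory before spilling */

typedef struct ChangeBuffer
{
	char	   *mem;			/* in-memory change buffer */
	size_t		used;
	size_t		cap;
	FILE	   *spill;			/* spill file, NULL until threshold exceeded */
} ChangeBuffer;

static void
cb_init(ChangeBuffer *cb)
{
	cb->mem = malloc(SPILL_THRESHOLD);
	cb->used = 0;
	cb->cap = SPILL_THRESHOLD;
	cb->spill = NULL;
}

static void
cb_add(ChangeBuffer *cb, const char *data, size_t len)
{
	/* Fast path: still under the in-memory threshold. */
	if (cb->spill == NULL && cb->used + len <= cb->cap)
	{
		memcpy(cb->mem + cb->used, data, len);
		cb->used += len;
		return;
	}

	/* Threshold exceeded: open the spill file and flush memory first. */
	if (cb->spill == NULL)
	{
		cb->spill = tmpfile();
		fwrite(cb->mem, 1, cb->used, cb->spill);
		cb->used = 0;
	}
	fwrite(data, 1, len, cb->spill);
}

static int
cb_spilled(const ChangeBuffer *cb)
{
	return cb->spill != NULL;
}
```

Small transactions would then never touch the disk on the apply side, at the price of extra memory per open transaction, which is exactly the trade-off being debated above.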
Free port choosing freezes when PostgresNode::use_tcp is used on BSD systems
Hi Hackers, Inside PostgresNode.pm there is a free port choosing routine --- get_free_port(). The comment section there says: # On non-Linux, non-Windows kernels, binding to 127.0.0/24 addresses # other than 127.0.0.1 might fail with EADDRNOTAVAIL. And this is absolutely true: on BSD-like systems (macOS and FreeBSD tested) it hangs, looping through the entire port range over and over when $PostgresNode::use_tcp = 1 is set, since bind fails with: # Checking port 52208 # bind: 127.0.0.1 52208 # bind: 127.0.0.2 52208 bind: Can't assign requested address To reproduce, just apply reproduce.diff and try to run 'make -C src/bin/pg_ctl check'. This is not the case with standard Postgres tests, since TestLib.pm chooses unix sockets automatically everywhere outside Windows. However, we got into this problem when we tried to run a custom TAP test that required TCP to run stably. That said, if this really can happen, why not just skip binding to 127.0.0/24 addresses other than 127.0.0.1 outside of Linux/Windows, as per the attached patch_PostgresNode.diff? Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company
diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm index db47a97d196..9add9bde2a4 100644 --- a/src/test/perl/PostgresNode.pm +++ b/src/test/perl/PostgresNode.pm @@ -1203,7 +1203,7 @@ sub get_free_port if ($found == 1) { foreach my $addr (qw(127.0.0.1), -$use_tcp ? qw(127.0.0.2 127.0.0.3 0.0.0.0) : ()) +$use_tcp && ($^O eq "linux" || $TestLib::windows_os) ?
qw(127.0.0.2 127.0.0.3 0.0.0.0) : ()) { if (!can_bind($addr, $port)) { diff --git a/src/bin/pg_ctl/t/001_start_stop.pl b/src/bin/pg_ctl/t/001_start_stop.pl index b1e419f02e9..c25c0793537 100644 --- a/src/bin/pg_ctl/t/001_start_stop.pl +++ b/src/bin/pg_ctl/t/001_start_stop.pl @@ -11,6 +11,8 @@ use Test::More tests => 24; my $tempdir = TestLib::tempdir; my $tempdir_short = TestLib::tempdir_short; +$PostgresNode::use_tcp = 1; + program_help_ok('pg_ctl'); program_version_ok('pg_ctl'); program_options_handling_ok('pg_ctl');
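For illustration, the probe that get_free_port() performs for each candidate address can be sketched standalone in C. The can_bind() below is a hypothetical analogue of the Perl helper of the same name, not PostgreSQL code; on BSD-like kernels it is the bind() to aliases such as 127.0.0.2 that fails with EADDRNOTAVAIL unless the alias has been configured, which is why the patch above restricts those addresses to Linux and Windows.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Sketch of the per-address probe: try to bind a TCP socket to the
 * given loopback address and report whether it succeeded.  On BSD-like
 * systems this returns false for 127.0.0.2/127.0.0.3 out of the box
 * (bind() sets errno to EADDRNOTAVAIL).
 */
static int
can_bind(const char *addr, unsigned short port)
{
	struct sockaddr_in sa;
	int			fd;
	int			ok;

	memset(&sa, 0, sizeof(sa));
	sa.sin_family = AF_INET;
	sa.sin_port = htons(port);
	if (inet_pton(AF_INET, addr, &sa.sin_addr) != 1)
		return 0;

	fd = socket(AF_INET, SOCK_STREAM, 0);
	if (fd < 0)
		return 0;
	ok = (bind(fd, (struct sockaddr *) &sa, sizeof(sa)) == 0);
	close(fd);
	return ok;
}
```

A port scanner built on such a probe must treat "cannot bind any alias" as a platform property rather than "port busy", or it will loop over the whole range forever, which is exactly the hang described above.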
Misuse of TimestampDifference() in the autoprewarm feature of pg_prewarm
Hi Hackers, Today I have accidentally noticed that the autoprewarm feature of pg_prewarm used TimestampDifference()'s results in the wrong way. First, it used the *seconds* result from it as *milliseconds*. This caused it to dump autoprewarm.blocks ~every second with the default setting of autoprewarm_interval = 300s. Here is a log excerpt with debug output in this case: ``` 2020-11-09 19:09:00.162 MSK [85328] LOG: dumping autoprewarm.blocks 2020-11-09 19:09:01.161 MSK [85328] LOG: dumping autoprewarm.blocks 2020-11-09 19:09:02.160 MSK [85328] LOG: dumping autoprewarm.blocks 2020-11-09 19:09:03.159 MSK [85328] LOG: dumping autoprewarm.blocks ``` After fixing this issue I have noticed that it still dumps blocks twice at each timeout (here I set autoprewarm_interval to 15s): ``` 2020-11-09 19:18:59.692 MSK [85662] LOG: dumping autoprewarm.blocks 2020-11-09 19:18:59.700 MSK [85662] LOG: dumping autoprewarm.blocks 2020-11-09 19:19:14.694 MSK [85662] LOG: dumping autoprewarm.blocks 2020-11-09 19:19:14.704 MSK [85662] LOG: dumping autoprewarm.blocks ``` This happens because at timeout time we were using continue, while we actually still have to wait the entire autoprewarm_interval after a successful dump. I have fixed both issues in the attached patches and also added a minimalistic TAP test as the first one to verify that this automatic dumping still works after refactoring. I put Robert in CC, since he is the author of this feature. What do you think? Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company
From 6d4bab7f21c3661dd4dd5a0de7e097b1de3f642c Mon Sep 17 00:00:00 2001 From: Alexey Kondratov Date: Mon, 9 Nov 2020 19:24:55 +0300 Subject: [PATCH v1 3/3] pg_prewarm: refactor autoprewarm waits Previously it was dumping twice at every timeout time.
--- contrib/pg_prewarm/autoprewarm.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/contrib/pg_prewarm/autoprewarm.c b/contrib/pg_prewarm/autoprewarm.c index b18a065ed5..f52c83de1e 100644 --- a/contrib/pg_prewarm/autoprewarm.c +++ b/contrib/pg_prewarm/autoprewarm.c @@ -238,7 +238,9 @@ autoprewarm_main(Datum main_arg) { last_dump_time = GetCurrentTimestamp(); apw_dump_now(true, false); -continue; + +/* We have to sleep even after a successfull dump */ +delay_in_ms = autoprewarm_interval * 1000; } /* Sleep until the next dump time. */ -- 2.19.1 From 8793b8beb6a5c1ae730f1fffb09dff64c83bc631 Mon Sep 17 00:00:00 2001 From: Alexey Kondratov Date: Mon, 9 Nov 2020 19:12:00 +0300 Subject: [PATCH v1 2/3] pg_prewarm: fix autoprewarm_interval behaviour. Previously it misused seconds from TimestampDifference() as milliseconds, so it was dumping autoprewarm.blocks ~every second event with default autoprewarm_interval = 300s. --- contrib/pg_prewarm/autoprewarm.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/contrib/pg_prewarm/autoprewarm.c b/contrib/pg_prewarm/autoprewarm.c index d3dec6e3ec..b18a065ed5 100644 --- a/contrib/pg_prewarm/autoprewarm.c +++ b/contrib/pg_prewarm/autoprewarm.c @@ -231,7 +231,7 @@ autoprewarm_main(Datum main_arg) autoprewarm_interval * 1000); TimestampDifference(GetCurrentTimestamp(), next_dump_time, &secs, &usecs); - delay_in_ms = secs + (usecs / 1000); + delay_in_ms = secs * 1000 + (usecs / 1000); /* Perform a dump if it's time. 
*/ if (delay_in_ms <= 0) -- 2.19.1 From 31dc30c97861afae9c34852afc5a5b1c91bbeadc Mon Sep 17 00:00:00 2001 From: Alexey Kondratov Date: Mon, 9 Nov 2020 19:04:10 +0300 Subject: [PATCH v1 1/3] pg_prewarm: add tap test for autoprewarm feature --- contrib/pg_prewarm/Makefile | 2 + contrib/pg_prewarm/t/001_autoprewarm.pl | 51 + 2 files changed, 53 insertions(+) create mode 100644 contrib/pg_prewarm/t/001_autoprewarm.pl diff --git a/contrib/pg_prewarm/Makefile b/contrib/pg_prewarm/Makefile index b13ac3c813..9cfde8c4e4 100644 --- a/contrib/pg_prewarm/Makefile +++ b/contrib/pg_prewarm/Makefile @@ -10,6 +10,8 @@ EXTENSION = pg_prewarm DATA = pg_prewarm--1.1--1.2.sql pg_prewarm--1.1.sql pg_prewarm--1.0--1.1.sql PGFILEDESC = "pg_prewarm - preload relation data into system buffer cache" +TAP_TESTS = 1 + ifdef USE_PGXS PG_CONFIG = pg_config PGXS := $(shell $(PG_CONFIG) --pgxs) diff --git a/contrib/pg_prewarm/t/001_autoprewarm.pl b/contrib/pg_prewarm/t/001_autoprewarm.pl new file mode 100644 index 00..b564c29931 --- /dev/null +++ b/contrib/pg_prewarm/t/001_autoprewarm.pl @@ -0,0 +1,51 @@ +# +# Check that pg_prewarm can dump blocks from shared buffers +# to PGDATA/autoprewarm.blocks. +# + +use strict; +use Test::More; +use TestLib; +use Time::HiRes qw(usleep); +use warnings; + +use PostgresNode; + +plan tests => 3; + +my $node = get_new_node("node"); +$node->init; +$node->append_conf( +'postgresql.conf', qq( +sha
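The unit mix-up fixed by the patches above is easy to state standalone. Here is a sketch (illustrative only, not the actual PostgreSQL code) of the buggy and the corrected conversion from a TimestampDifference()-style sec/usec pair into milliseconds:

```c
/*
 * Illustration of the autoprewarm interval bug: converting the
 * (secs, usecs) pair returned by TimestampDifference() to milliseconds.
 */

/*
 * Buggy form: seconds are added unscaled, so a 300 s interval comes out
 * as ~300 ms and the dump fires roughly every second.
 */
static long
diff_to_ms_buggy(long secs, int usecs)
{
	return secs + (usecs / 1000);
}

/* Fixed form: scale seconds to milliseconds before adding. */
static long
diff_to_ms_fixed(long secs, int usecs)
{
	return secs * 1000 + (usecs / 1000);
}
```

This is also the shape a generic helper for callers that want milliseconds would take, which is what the follow-up discussion below converges on with TimestampDifferenceMilliseconds().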
Re: Misuse of TimestampDifference() in the autoprewarm feature of pg_prewarm
On 2020-11-09 21:53, Tom Lane wrote: Alexey Kondratov writes: After fixing this issue I have noticed that it still dumps blocks twice at each timeout (here I set autoprewarm_interval to 15s): ... This happens because at timeout time we were using continue, but actually we still have to wait the entire autoprewarm_interval after successful dumping. I don't think your 0001 is correct. It would be okay if apw_dump_now() could be counted on to take negligible time, but we shouldn't assume that, should we? Yes, it seems so, if I understand you correctly. I had doubts about the possibility of pg_ctl exiting earlier than the dumping process. Now I have added an explicit wait for the dump file into the test. I agree that the "continue" seems a bit bogus, because it's skipping the ResetLatch call at the bottom of the loop; it's not quite clear to me whether that's a good thing or not. But the general idea of the existing code seems to be to loop around and make a fresh calculation of how-long-to-wait, and that doesn't seem wrong. I have left the last patch intact, since it resolves the 'double dump' issue, but I agree with your point about the existing logic of the code, although it is a bit broken. So I have to think more about how to fix it in a better way. 0002 seems like a pretty clear bug fix, though I wonder if this is exactly what we want to do going forward. It seems like a very large fraction of the callers of TimestampDifference would like to have the value in msec, which means we're doing a whole lot of expensive and error-prone arithmetic to break down the difference to sec/usec and then put it back together again. Let's get rid of that by inventing, say, TimestampDifferenceMilliseconds(...). Yeah, I ran into this problem after a bug in another extension, pg_wait_sampling. I have attached 0002, which implements TimestampDifferenceMilliseconds(), so 0003 just uses this new function to solve the initial issues. If it looks good to you, then we can switch all similar callers to it.
BTW, I see another bug of a related ilk. Look what postgres_fdw/connection.c is doing: TimestampDifference(now, endtime, &secs, &microsecs); /* To protect against clock skew, limit sleep to one minute. */ cur_timeout = Min(60000, secs * USECS_PER_SEC + microsecs); /* Sleep until there's something to do */ wc = WaitLatchOrSocket(MyLatch, WL_LATCH_SET | WL_SOCKET_READABLE | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH, PQsocket(conn), cur_timeout, PG_WAIT_EXTENSION); WaitLatchOrSocket's timeout is measured in msec not usec. I think the comment about "clock skew" is complete BS, and the Min() calculation was put in as a workaround by somebody observing that the sleep waited too long, but not understanding why. I wonder how much trouble one can get into with all these unit conversions. Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company
From c79de17014753b311858b4570ca475f713328c62 Mon Sep 17 00:00:00 2001 From: Alexey Kondratov Date: Mon, 9 Nov 2020 19:24:55 +0300 Subject: [PATCH v2 4/4] pg_prewarm: refactor autoprewarm waits Previously it was dumping twice at every timeout time. --- contrib/pg_prewarm/autoprewarm.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/contrib/pg_prewarm/autoprewarm.c b/contrib/pg_prewarm/autoprewarm.c index e5bd130bc8..872c7d51b1 100644 --- a/contrib/pg_prewarm/autoprewarm.c +++ b/contrib/pg_prewarm/autoprewarm.c @@ -236,7 +236,9 @@ autoprewarm_main(Datum main_arg) { last_dump_time = GetCurrentTimestamp(); apw_dump_now(true, false); -continue; + +/* We have to sleep even after a successful dump */ +delay_in_ms = autoprewarm_interval * 1000; } /* Sleep until the next dump time. */ -- 2.19.1
From c38c07708d57d6dec5a8a1697ca9c9810ad4d7ce Mon Sep 17 00:00:00 2001 From: Alexey Kondratov Date: Mon, 9 Nov 2020 19:12:00 +0300 Subject: [PATCH v2 3/4] pg_prewarm: fix autoprewarm_interval behaviour.
Previously it misused seconds from TimestampDifference() as milliseconds, so it was dumping autoprewarm.blocks ~every second even with the default autoprewarm_interval = 300s. --- contrib/pg_prewarm/autoprewarm.c | 8 +++- 1 file changed, 3 insertions(+), 5 deletions(-) diff --git a/contrib/pg_prewarm/autoprewarm.c b/contrib/pg_prewarm/autoprewarm.c index d3dec6e3ec..e5bd130bc8 100644 --- a/contrib/pg_prewarm/autoprewarm.c +++ b/contrib/pg_prewarm/autoprewarm.c @@ -222,16 +222,14 @@ autoprewarm_main(Datum main_arg) { long delay_in_ms = 0; TimestampTz next_dump_time = 0; - long secs = 0; - int usecs = 0; /* Compute the nex
Re: Misuse of TimestampDifference() in the autoprewarm feature of pg_prewarm
On 2020-11-09 23:25, Tom Lane wrote: Alexey Kondratov writes: On 2020-11-09 21:53, Tom Lane wrote: 0002 seems like a pretty clear bug fix, though I wonder if this is exactly what we want to do going forward. It seems like a very large fraction of the callers of TimestampDifference would like to have the value in msec, which means we're doing a whole lot of expensive and error-prone arithmetic to break down the difference to sec/usec and then put it back together again. Let's get rid of that by inventing, say, TimestampDifferenceMilliseconds(...). Yeah, I got into this problem after a bug in another extension — pg_wait_sampling. I have attached 0002, which implements TimestampDifferenceMilliseconds(), so 0003 just uses this new function to solve the initial issue. If it looks good to you, then we can switch all similar callers to it. Yeah, let's move forward with that --- in fact, I'm inclined to back-patch it. (Not till the current release cycle is done, though. I don't find this important enough to justify a last-moment patch.) BTW, I wonder if we shouldn't make TimestampDifferenceMilliseconds round any fractional millisecond up rather than down. Rounding down seems to create a hazard of uselessly waking just before the delay is completed. Better to wake just after. Yes, it makes sense. I have changed TimestampDifferenceMilliseconds() to round the result up if there is a remainder. After looking at the autoprewarm code more closely I have realised that this 'double dump' issue was not an issue at all. I have just misplaced a debug elog(), so its second output in the log was only indicating that we calculated delay_in_ms one more time. Actually, even with the wrong calculation of delay_in_ms the only problem was that we were busy looping with ~1 second interval instead of waiting on the latch. It is still a buggy behaviour, but much less harmful than I have originally thought.
Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres CompanyFrom ce09103d9d58b611728b66366cd24e8a4069f7ac Mon Sep 17 00:00:00 2001 From: Alexey Kondratov Date: Mon, 9 Nov 2020 19:04:10 +0300 Subject: [PATCH v3 3/3] pg_prewarm: add tap test for autoprewarm feature --- contrib/pg_prewarm/Makefile | 2 + contrib/pg_prewarm/t/001_autoprewarm.pl | 59 + 2 files changed, 61 insertions(+) create mode 100644 contrib/pg_prewarm/t/001_autoprewarm.pl diff --git a/contrib/pg_prewarm/Makefile b/contrib/pg_prewarm/Makefile index b13ac3c813..9cfde8c4e4 100644 --- a/contrib/pg_prewarm/Makefile +++ b/contrib/pg_prewarm/Makefile @@ -10,6 +10,8 @@ EXTENSION = pg_prewarm DATA = pg_prewarm--1.1--1.2.sql pg_prewarm--1.1.sql pg_prewarm--1.0--1.1.sql PGFILEDESC = "pg_prewarm - preload relation data into system buffer cache" +TAP_TESTS = 1 + ifdef USE_PGXS PG_CONFIG = pg_config PGXS := $(shell $(PG_CONFIG) --pgxs) diff --git a/contrib/pg_prewarm/t/001_autoprewarm.pl b/contrib/pg_prewarm/t/001_autoprewarm.pl new file mode 100644 index 00..f55b2a5352 --- /dev/null +++ b/contrib/pg_prewarm/t/001_autoprewarm.pl @@ -0,0 +1,59 @@ +# +# Check that pg_prewarm can dump blocks from shared buffers +# to PGDATA/autoprewarm.blocks. +# + +use strict; +use Test::More; +use TestLib; +use Time::HiRes qw(usleep); +use warnings; + +use PostgresNode; + +plan tests => 3; + +# Wait up to 180s for pg_prewarm to dump blocks. +sub wait_for_dump +{ + my $path = shift; + + foreach my $i (0 .. 1800) + { + last if -e $path; + usleep(100_000); + } +} + +my $node = get_new_node("node"); +$node->init; +$node->append_conf( + 'postgresql.conf', qq( +shared_preload_libraries = 'pg_prewarm' +pg_prewarm.autoprewarm = 'on' +pg_prewarm.autoprewarm_interval = 1 +)); +$node->start; + +my $blocks_path = $node->data_dir . '/autoprewarm.blocks'; + +# Check that we can dump blocks on timeout. 
+wait_for_dump($blocks_path); +ok(-e $blocks_path, 'file autoprewarm.blocks should be present in the PGDATA'); + +# Check that we can dump blocks on shutdown. +$node->stop; +$node->append_conf( + 'postgresql.conf', qq( +pg_prewarm.autoprewarm_interval = 0 +)); + +# Remove autoprewarm.blocks +unlink($blocks_path) || die "$blocks_path: $!"; +ok(!-e $blocks_path, 'sanity check, dump on timeout is turned off'); + +$node->start; +$node->stop; + +wait_for_dump($blocks_path); +ok(-e $blocks_path, 'file autoprewarm.blocks should be present in the PGDATA after clean shutdown'); -- 2.19.1 From fba212ed765c8c411db1ca19c2ac991662109d99 Mon Sep 17 00:00:00 2001 From: Alexey Kondratov Date: Mon, 9 Nov 2020 19:12:00 +0300 Subject: [PATCH v3 2/3] pg_prewarm: fix autoprewarm_interval behaviour Previously it misused seconds from TimestampDifference() as milliseconds, so it was busy looping with ~1 second interval instead of wai
Re: Misuse of TimestampDifference() in the autoprewarm feature of pg_prewarm
On 2020-11-11 06:59, Tom Lane wrote: Alexey Kondratov writes: After looking at the autoprewarm code more closely I have realised that this 'double dump' issue was not an issue at all. I have just misplaced a debug elog(), so its second output in the log was only indicating that we calculated delay_in_ms one more time. Ah --- that explains why I couldn't see a problem. I've pushed 0001+0002 plus some followup work to fix other places that could usefully use TimestampDifferenceMilliseconds(). I have not done anything with 0003 (the TAP test for pg_prewarm), and will leave that to the judgment of somebody who's worked with pg_prewarm before. To me it looks like it's not really testing things very carefully at all; on the other hand, we have exactly zero test coverage of that module today, so maybe something is better than nothing. Great, thank you for generalising the issue and working on it. Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company
Re: [PATCH] postgres_fdw connection caching - cause remote sessions linger till the local session exit
Hi, On 2020-11-06 18:56, Anastasia Lubennikova wrote: Status update for a commitfest entry. This thread was inactive for a while and from the latest messages, I see that the patch needs some further work. So I move it to "Waiting on Author". The new status of this patch is: Waiting on Author I had a look at the initial patch and discussed options [1] to proceed with this issue. I agree with Bruce about idle_session_timeout, it would be a nice-to-have in-core feature on its own. However, this should be a cluster-wide option and it would start dropping all idle connections, not only foreign ones. So it may not be an option for some cases, when the same foreign server is used for other workloads as well. Regarding the initial issue I prefer point #3, i.e. a foreign server option. It has a couple of benefits IMO: 1) it may be set separately on a per foreign server basis, 2) it will live only in the postgres_fdw contrib without any need to touch core. I would only supplement this postgres_fdw foreign server option with a GUC, e.g. postgres_fdw.keep_connections, so one could easily define such behavior for all foreign servers at once or override the server-level option by setting this GUC on a per session basis. Attached is a small POC patch, which implements this contrib-level postgres_fdw.keep_connections GUC. What do you think?
[1] https://www.postgresql.org/message-id/CALj2ACUFNydy0uo0JL9A1isHQ9pFe1Fgqa_HVanfG6F8g21nSQ%40mail.gmail.com Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Companydiff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c index ab3226287d..64f0e96635 100644 --- a/contrib/postgres_fdw/connection.c +++ b/contrib/postgres_fdw/connection.c @@ -28,6 +28,8 @@ #include "utils/memutils.h" #include "utils/syscache.h" +#include "postgres_fdw.h" + /* * Connection cache hash table entry * @@ -948,6 +950,7 @@ pgfdw_xact_callback(XactEvent event, void *arg) */ if (PQstatus(entry->conn) != CONNECTION_OK || PQtransactionStatus(entry->conn) != PQTRANS_IDLE || + !keep_connections || entry->changing_xact_state) { elog(DEBUG3, "discarding connection %p", entry->conn); diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c index 9c5aaacc51..4cd5f71223 100644 --- a/contrib/postgres_fdw/postgres_fdw.c +++ b/contrib/postgres_fdw/postgres_fdw.c @@ -45,6 +45,8 @@ #include "utils/sampling.h" #include "utils/selfuncs.h" +#include "postgres_fdw.h" + PG_MODULE_MAGIC; /* Default CPU cost to start up a foreign query. 
*/ @@ -301,6 +303,8 @@ typedef struct List *already_used; /* expressions already dealt with */ } ec_member_foreign_arg; +bool keep_connections = true; + /* * SQL functions */ @@ -505,6 +509,15 @@ static void merge_fdw_options(PgFdwRelationInfo *fpinfo, const PgFdwRelationInfo *fpinfo_o, const PgFdwRelationInfo *fpinfo_i); +void +_PG_init(void) +{ + DefineCustomBoolVariable("postgres_fdw.keep_connections", + "Enables postgres_fdw connection caching.", + "When off postgres_fdw will close connections at the end of transaction.", + &keep_connections, true, PGC_USERSET, 0, NULL, + NULL, NULL); +} /* * Foreign-data wrapper handler function: return a struct with pointers diff --git a/contrib/postgres_fdw/postgres_fdw.h b/contrib/postgres_fdw/postgres_fdw.h index eef410db39..7f1bdb96d6 100644 --- a/contrib/postgres_fdw/postgres_fdw.h +++ b/contrib/postgres_fdw/postgres_fdw.h @@ -124,9 +124,12 @@ typedef struct PgFdwRelationInfo int relation_index; } PgFdwRelationInfo; +extern bool keep_connections; + /* in postgres_fdw.c */ extern int set_transmission_modes(void); extern void reset_transmission_modes(int nestlevel); +extern void _PG_init(void); /* in connection.c */ extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt);
Re: [PATCH] postgres_fdw connection caching - cause remote sessions linger till the local session exit
On 2020-11-18 16:39, Bharath Rupireddy wrote: Thanks for the interest shown! On Wed, Nov 18, 2020 at 1:07 AM Alexey Kondratov wrote: Regarding the initial issue I prefer point #3, i.e. foreign server option. It has a couple of benefits IMO: 1) it may be set separately on per foreign server basis, 2) it will live only in the postgres_fdw contrib without any need to touch core. I would only supplement this postgres_fdw foreign server option with a GUC, e.g. postgres_fdw.keep_connections, so one could easily define such behavior for all foreign servers at once or override server-level option by setting this GUC on per session basis. Below is what I have in my mind, mostly in line with yours: a) Have a server level option (keep_connection true/false, with the default being true), when set to false the connection that's made with this foreign server is closed and the cached entry from the connection cache is deleted at the end of txn in pgfdw_xact_callback. b) Have postgres_fdw level GUC postgres_fdw.keep_connections default being true. When set to false by the user, the connections, that are used after this, are closed and removed from the cache at the end of respective txns. If we don't use a connection that was cached prior to the user setting the GUC as false, then we may not be able to clear it. We can avoid this problem by recommending users either to set the GUC to false right after the CREATE EXTENSION postgres_fdw; or else use the function specified in (c). c) Have a new function that gets defined as part of CREATE EXTENSION postgres_fdw;, say postgres_fdw_discard_connections(), similar to dblink's dblink_disconnect(), which discards all the remote connections and clears the connection cache. And we can also have the server name as input to postgres_fdw_discard_connections() to discard selectively. Thoughts? If okay with the approach, I will start working on the patch. This approach looks solid enough from my perspective to give it a try.
I would only split it into three separate patches for ease of further review. Attached is a small POC patch, which implements this contrib-level postgres_fdw.keep_connections GUC. What do you think? I see two problems with your patch: 1) It just disconnects the remote connection at the end of txn if the GUC is set to false, but it doesn't remove the connection cache entry from ConnectionHash. Yes, and this looks like a valid state for postgres_fdw and it can get into the same state even without my patch. Next time GetConnection() will find this cache entry, figure out that entry->conn is NULL and establish a fresh connection. It is not clear to me right now what benefits we would get from clearing this cache entry as well, except just doing it for sanity. 2) What happens if there are some cached connections, the user sets the GUC to false and does not run any foreign queries or use those connections thereafter, so only the new connections will not be cached? Will the existing unused connections still remain in the connection cache? See (b) above for a solution. Yes, they will. This could be solved with that additional disconnect function as you proposed in c). Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company
Re: [PATCH] postgres_fdw connection caching - cause remote sessions linger till the local session exit
On 2020-11-19 07:11, Bharath Rupireddy wrote: On Wed, Nov 18, 2020 at 10:32 PM Alexey Kondratov wrote: Thanks! I will make separate patches and post them soon. >> Attached is a small POC patch, which implements this contrib-level >> postgres_fdw.keep_connections GUC. What do you think? > > I see two problems with your patch: 1) It just disconnects the remote > connection at the end of txn if the GUC is set to false, but it > doesn't remove the connection cache entry from ConnectionHash. Yes, and this looks like a valid state for postgres_fdw and it can get into the same state even without my patch. Next time GetConnection() will find this cache entry, figure out that entry->conn is NULL and establish a fresh connection. It is not clear to me right now what benefits we would get from clearing this cache entry as well, except just doing it for sanity. By clearing the cache entry we will have 2 advantages: 1) we could save a (small) bit of memory 2) we could allow new connections to be cached, currently ConnectionHash can have only 8 entries. IMHO, along with disconnecting, we can also clear off the cache entry. Thoughts? IIUC, 8 is not a hard limit, it is just a starting size. ConnectionHash is not a shared-memory hash table, so dynahash can expand it on the fly; see, for example, the comment before hash_create(): * Note: for a shared-memory hashtable, nelem needs to be a pretty good * estimate, since we can't expand the table on the fly. But an unshared * hashtable can be expanded on-the-fly, so it's better for nelem to be * on the small side and let the table grow if it's exceeded. An overly * large nelem will penalize hash_seq_search speed without buying much. Also, I am not sure that by doing just a HASH_REMOVE you will free any memory, since the hash table is already allocated (or expanded) to some size. So HASH_REMOVE will only add the removed entry to the freeList, I guess.
Anyway, I can hardly imagine bloating of ConnectionHash being a problem, even in the case when one has thousands of foreign servers all being accessed during a single backend lifespan. Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company
Re: [PATCH] postgres_fdw connection caching - cause remote sessions linger till the local session exit
Hi, On 2020-11-23 09:48, Bharath Rupireddy wrote: Here is how I'm making 4 separate patches: 1. new function and its documentation. 2. GUC and its documentation. 3. server level option and its documentation. 4. test cases for all of the above patches. Hi, I'm attaching the patches here. Note that, though the code changes for this feature are small, I divided them up as separate patches to make review easy. v1-0001-postgres_fdw-function-to-discard-cached-connections.patch This patch looks pretty straightforward to me, but there are some things to be addressed IMO: + server = GetForeignServerByName(servername, true); + + if (server != NULL) + { Yes, you return false if no server was found, but in my opinion it is worth throwing an error in this case, as, for example, dblink does in dblink_disconnect(). + result = disconnect_cached_connections(FOREIGNSERVEROID, +hashvalue, +false); + if (all || (!all && cacheid == FOREIGNSERVEROID && + entry->server_hashvalue == hashvalue)) + { + if (entry->conn != NULL && + !all && cacheid == FOREIGNSERVEROID && + entry->server_hashvalue == hashvalue) These conditions look bulky to me. First, you pass FOREIGNSERVEROID to disconnect_cached_connections(), but actually it just duplicates the 'all' flag, since when it is 'FOREIGNSERVEROID', then 'all == false'; when it is '-1', then 'all == true'. That is all, there are only two calls of disconnect_cached_connections(). That way, it seems that we should keep only the 'all' flag, at least for now, shouldn't we? Second, I think that we should just rewrite this if statement in order to simplify it and make it more readable, e.g.: if ((all || entry->server_hashvalue == hashvalue) && entry->conn != NULL) { disconnect_pg_server(entry); result = true; } + if (all) + { + hash_destroy(ConnectionHash); + ConnectionHash = NULL; + result = true; + } Also, I am still not sure that it is a good idea to destroy the whole cache even in the 'all' case, but maybe others will have a different opinion.
v1-0002-postgres_fdw-add-keep_connections-GUC-to-not-cache-connections.patch + entry->changing_xact_state) || + (entry->used_in_current_xact && + !keep_connections)) I am not sure, but I think that instead of adding this additional flag into the ConnCacheEntry structure we can look at entry->xact_depth and use a local: bool used_in_current_xact = entry->xact_depth > 0; for exactly the same purpose. Since we set entry->xact_depth to zero at the end of a transaction, it was used if it is non-zero. It is set to 1 by begin_remote_xact() called from GetConnection(), so everything seems to be fine. Otherwise, both patches seem to be working as expected. I am going to have a look at the last two patches a bit later. Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company
Re: [PATCH] postgres_fdw connection caching - cause remote sessions linger till the local session exit
On 2020-11-24 06:52, Bharath Rupireddy wrote: Thanks for the review comments. On Mon, Nov 23, 2020 at 9:57 PM Alexey Kondratov wrote: > v1-0001-postgres_fdw-function-to-discard-cached-connections.patch This patch looks pretty straightforward to me, but there are some things to be addressed IMO: + server = GetForeignServerByName(servername, true); + + if (server != NULL) + { Yes, you return false if no server was found, but in my opinion it is worth throwing an error in this case, as, for example, dblink does in dblink_disconnect(). dblink_disconnect() "Returns status, which is always OK (since any error causes the function to throw an error instead of returning)." This behaviour doesn't seem okay to me. Since we return true/false, I would prefer emitting a warning (with a reason) while returning false over throwing an error. I thought about something a bit more sophisticated: 1) Return 'true' if there were open connections and we successfully closed them. 2) Return 'false' in the no-op case, i.e. there were no open connections. 3) Raise an error if something went wrong. And the non-existent server case belongs to this last category, IMO. That looks like a semantically correct behavior, but let us wait for any other opinion. + result = disconnect_cached_connections(FOREIGNSERVEROID, +hashvalue, +false); + if (all || (!all && cacheid == FOREIGNSERVEROID && + entry->server_hashvalue == hashvalue)) + { + if (entry->conn != NULL && + !all && cacheid == FOREIGNSERVEROID && + entry->server_hashvalue == hashvalue) These conditions look bulky to me. First, you pass FOREIGNSERVEROID to disconnect_cached_connections(), but actually it just duplicates the 'all' flag, since when it is 'FOREIGNSERVEROID', then 'all == false'; when it is '-1', then 'all == true'. That is all, there are only two calls of disconnect_cached_connections(). That way, it seems that we should keep only the 'all' flag, at least for now, shouldn't we? I added cacheid as an argument to disconnect_cached_connections() for reusability.
Say, someone wants to use it with a user mapping, then they can pass cacheid USERMAPPINGOID and the hash value of the user mapping. The cacheid == USERMAPPINGOID && entry->mapping_hashvalue == hashvalue can be added to disconnect_cached_connections(). Yeah, I have got your point and the motivation to add this argument, but how can we use it? To disconnect all connections belonging to some specific user mapping? But any user mapping is hard-bound to some foreign server, AFAIK, so we can pass a serverid-based hash in this case. In the case of pgfdw_inval_callback() this argument makes sense, since syscache callbacks work that way, but here I can hardly imagine a case where we can use it. Thus, it still looks like a premature complication to me, since we do not have plans to use it, do we? Anyway, everything seems to be working fine, so it is up to you whether to keep this additional argument. v1-0003-postgres_fdw-server-level-option-keep_connection.patch This patch adds a new server level option, keep_connection, default being on; when set to off, the local session doesn't cache the connections associated with the foreign server. This patch looks good to me, except one note: (entry->used_in_current_xact && - !keep_connections)) + (!keep_connections || !entry->keep_connection))) { Following this logic: 1) If keep_connections == true, then the per-server keep_connection has a *higher* priority, so one can disable caching of a single foreign server. 2) But if keep_connections == false, then it works like a global switch-off regardless of the per-server keep_connection settings, i.e. they have a *lower* priority. It looks fine to me, at least I cannot propose anything better, but maybe it should be documented in 0004? v1-0004-postgres_fdw-connection-cache-discard-tests-and-documentation.patch This patch adds the tests and documentation related to this feature.
I have not read all the text thoroughly, but what caught my eye: + A GUC, postgres_fdw.keep_connections, default being + on, when set to off, the local session I think that the GUC acronym is widely used only in the source code, while the Postgres docs tend not to use it at all, except in the acronyms list and a couple of uses of the collocation 'GUC parameters'. It is never used in singular form there, so I think that it should rather be: A configuration parameter, postgres_fdw.keep_connections, default being... + + Note that when postgres_fdw.keep_connections is set to +
Re: [PATCH] postgres_fdw connection caching - cause remote sessions linger till the local session exit
On 2020-11-25 06:17, Bharath Rupireddy wrote: On Wed, Nov 25, 2020 at 7:24 AM Craig Ringer wrote: A quick thought here. Would it make sense to add a hook in the DISCARD ALL implementation that postgres_fdw can register for? There's precedent here, since DISCARD ALL already has the same effect as SELECT pg_advisory_unlock_all(); amongst other things. IIUC, then it is like a core (server) function doing some work for the postgres_fdw module. Earlier in the discussion, one point raised was that it's better not to have core handling something related to postgres_fdw. This is the reason we have come up with a postgres_fdw-specific function and a GUC, which get defined when the extension is created. Similarly, dblink also has its own set of functions, one of which is dblink_disconnect(). If I have got Craig correctly, he proposed that we already have a DISCARD ALL statement, which is processed by DiscardAll(), and it releases internal resources known from the core perspective. That way, we can introduce a general-purpose hook DiscardAll_hook(), so postgres_fdw can make use of it to clean up its own resources (connections in our context) if needed. In other words, it is not like a core function doing some work for the postgres_fdw module, but rather like a callback/hook that postgres_fdw is able to register to do some additional work. It could be a good replacement for 0001, but wouldn't it already be overkill to drop all local caches along with remote connections? I mean, it would be a nice-to-have hook from the extensibility perspective, but postgres_fdw_disconnect() still makes sense, since it does a very narrow and specific job. Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company
Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly
On 2021-01-13 14:34, Michael Paquier wrote: On Wed, Jan 13, 2021 at 05:22:49PM +0900, Michael Paquier wrote: Yeah, that makes sense. I'll send an updated patch based on that. And here you go as per the attached. I don't think that there was anything remaining on my radar. This version still needs to be indented properly though. Thoughts? Thanks. + bits32 options;/* bitmask of CLUSTEROPT_* */ This should say '/* bitmask of CLUOPT_* */', I guess, since there are only CLUOPT's defined. Otherwise, everything looks as per discussed upthread. By the way, something went wrong with the last email subject, so I have changed it back to the original in this response. I also attached your patch (with only this CLUOPT_* correction) to keep it in the thread for sure. Although, postgresql.org's web archive is clever enough to link your email to the same thread even with different subject. Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Companydiff --git a/src/include/catalog/index.h b/src/include/catalog/index.h index 9904a76387..43cfdeaa6b 100644 --- a/src/include/catalog/index.h +++ b/src/include/catalog/index.h @@ -30,13 +30,16 @@ typedef enum } IndexStateFlagsAction; /* options for REINDEX */ -typedef enum ReindexOption +typedef struct ReindexParams { - REINDEXOPT_VERBOSE = 1 << 0, /* print progress info */ - REINDEXOPT_REPORT_PROGRESS = 1 << 1, /* report pgstat progress */ - REINDEXOPT_MISSING_OK = 1 << 2, /* skip missing relations */ - REINDEXOPT_CONCURRENTLY = 1 << 3 /* concurrent mode */ -} ReindexOption; + bits32 options; /* bitmask of REINDEXOPT_* */ +} ReindexParams; + +/* flag bits for ReindexParams->flags */ +#define REINDEXOPT_VERBOSE 0x01 /* print progress info */ +#define REINDEXOPT_REPORT_PROGRESS 0x02 /* report pgstat progress */ +#define REINDEXOPT_MISSING_OK 0x04 /* skip missing relations */ +#define REINDEXOPT_CONCURRENTLY 0x08 /* concurrent mode */ /* state info for validate_index bulkdelete callback */ typedef 
struct ValidateIndexState @@ -146,7 +149,7 @@ extern void index_set_state_flags(Oid indexId, IndexStateFlagsAction action); extern Oid IndexGetRelation(Oid indexId, bool missing_ok); extern void reindex_index(Oid indexId, bool skip_constraint_checks, - char relpersistence, int options); + char relpersistence, ReindexParams *params); /* Flag bits for reindex_relation(): */ #define REINDEX_REL_PROCESS_TOAST 0x01 @@ -155,7 +158,7 @@ extern void reindex_index(Oid indexId, bool skip_constraint_checks, #define REINDEX_REL_FORCE_INDEXES_UNLOGGED 0x08 #define REINDEX_REL_FORCE_INDEXES_PERMANENT 0x10 -extern bool reindex_relation(Oid relid, int flags, int options); +extern bool reindex_relation(Oid relid, int flags, ReindexParams *params); extern bool ReindexIsProcessingHeap(Oid heapOid); extern bool ReindexIsProcessingIndex(Oid indexOid); diff --git a/src/include/commands/cluster.h b/src/include/commands/cluster.h index 401a0827ae..1245d944dc 100644 --- a/src/include/commands/cluster.h +++ b/src/include/commands/cluster.h @@ -18,16 +18,17 @@ #include "storage/lock.h" #include "utils/relcache.h" - /* options for CLUSTER */ -typedef enum ClusterOption +#define CLUOPT_RECHECK 0x01 /* recheck relation state */ +#define CLUOPT_VERBOSE 0x02 /* print progress info */ + +typedef struct ClusterParams { - CLUOPT_RECHECK = 1 << 0, /* recheck relation state */ - CLUOPT_VERBOSE = 1 << 1 /* print progress info */ -} ClusterOption; + bits32 options; /* bitmask of CLUOPT_* */ +} ClusterParams; extern void cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel); -extern void cluster_rel(Oid tableOid, Oid indexOid, int options); +extern void cluster_rel(Oid tableOid, Oid indexOid, ClusterParams *params); extern void check_index_is_clusterable(Relation OldHeap, Oid indexOid, bool recheck, LOCKMODE lockmode); extern void mark_index_clustered(Relation rel, Oid indexOid, bool is_internal); diff --git a/src/include/commands/defrem.h b/src/include/commands/defrem.h index 
e2d2a77ca4..91281d6f8e 100644 --- a/src/include/commands/defrem.h +++ b/src/include/commands/defrem.h @@ -14,6 +14,7 @@ #ifndef DEFREM_H #define DEFREM_H +#include "catalog/index.h" #include "catalog/objectaddress.h" #include "nodes/params.h" #include "parser/parse_node.h" @@ -34,11 +35,7 @@ extern ObjectAddress DefineIndex(Oid relationId, bool check_not_in_use, bool skip_build, bool quiet); -extern int ReindexParseOptions(ParseState *pstate, ReindexStmt *stmt); -extern void ReindexIndex(RangeVar *indexRelation, int options, bool isTopLevel); -extern Oid ReindexTable(RangeVar *relation, int options, bool isTopLevel); -extern void ReindexMultipleTables(const char *objectName, ReindexObjectType ob
Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly
On 2021-01-20 18:54, Alvaro Herrera wrote: On 2021-Jan-20, Alvaro Herrera wrote: On 2021-Jan-20, Michael Paquier wrote: > +/* > + * This is mostly duplicating ATExecSetTableSpaceNoStorage, > + * which should maybe be factored out to a library function. > + */ > Wouldn't it be better to do first the refactoring of 0002 and then > 0001 so that REINDEX can use the new routine, instead of putting that > into a comment? I think merging 0001 and 0002 into a single commit is a reasonable approach. ... except it doesn't make a lot of sense to have set_rel_tablespace in either indexcmds.c or index.c. I think tablecmds.c is a better place for it. (I would have thought catalog/storage.c, but that one's not the right abstraction level, it seems.) I did a refactoring of ATExecSetTableSpaceNoStorage() in 0001. The new function SetRelTablespace() is placed into tablecmds.c. The following 0002 makes use of it. Is it close to what you and Michael suggested? But surely ATExecSetTableSpaceNoStorage should be using this new routine. (I first thought 0002 was doing that, since that commit is calling itself a "refactoring", but now that I look closer, it's not.) Yeah, this 'refactoring' was initially referring to the refactoring of what Justin added to one of the previous versions of 0001. It was meant to be merged with 0001 once agreed upon, but we got distracted by other stuff. I have not yet addressed Michael's concerns regarding reindex of partitions. I am going to look closer at it tomorrow. Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company
Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly
On 2021-01-20 21:08, Alexey Kondratov wrote: On 2021-01-20 18:54, Alvaro Herrera wrote: On 2021-Jan-20, Alvaro Herrera wrote: On 2021-Jan-20, Michael Paquier wrote: > +/* > + * This is mostly duplicating ATExecSetTableSpaceNoStorage, > + * which should maybe be factored out to a library function. > + */ > Wouldn't it be better to do first the refactoring of 0002 and then > 0001 so as REINDEX can use the new routine, instead of putting that > into a comment? I think merging 0001 and 0002 into a single commit is a reasonable approach. ... except it doesn't make a lot of sense to have set_rel_tablespace in either indexcmds.c or index.c. I think tablecmds.c is a better place for it. (I would have thought catalog/storage.c, but that one's not the right abstraction level it seems.) I did a refactoring of ATExecSetTableSpaceNoStorage() in the 0001. New function SetRelTablesapce() is placed into the tablecmds.c. Following 0002 gets use of it. Is it close to what you and Michael suggested? Ugh, forgot to attach the patches. Here they are. -- AlexeyFrom 2c3876f99bc8ebdd07c532619992e7ec3093e50a Mon Sep 17 00:00:00 2001 From: Alexey Kondratov Date: Mon, 23 Mar 2020 21:10:29 +0300 Subject: [PATCH v2 2/2] Allow REINDEX to change tablespace REINDEX already does full relation rewrite, this patch adds a possibility to specify a new tablespace where new relfilenode will be created. --- doc/src/sgml/ref/reindex.sgml | 22 + src/backend/catalog/index.c | 72 ++- src/backend/commands/indexcmds.c | 68 ++- src/bin/psql/tab-complete.c | 4 +- src/include/catalog/index.h | 2 + src/test/regress/input/tablespace.source | 53 +++ src/test/regress/output/tablespace.source | 102 ++ 7 files changed, 318 insertions(+), 5 deletions(-) diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml index 627b36300c..4f84060c4d 100644 --- a/doc/src/sgml/ref/reindex.sgml +++ b/doc/src/sgml/ref/reindex.sgml @@ -27,6 +27,7 @@ REINDEX [ ( option [, ...] 
) ] { IN CONCURRENTLY [ boolean ] VERBOSE [ boolean ] +TABLESPACE new_tablespace @@ -187,6 +188,19 @@ REINDEX [ ( option [, ...] ) ] { IN + +TABLESPACE + + + This specifies that indexes will be rebuilt on a new tablespace. + Cannot be used with "mapped" relations. If SCHEMA, + DATABASE or SYSTEM is specified, then + all unsuitable relations will be skipped and a single WARNING + will be generated. + + + + VERBOSE @@ -210,6 +224,14 @@ REINDEX [ ( option [, ...] ) ] { IN + +new_tablespace + + + The tablespace where indexes will be rebuilt. + + + diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c index b8cd35e995..ed98b17483 100644 --- a/src/backend/catalog/index.c +++ b/src/backend/catalog/index.c @@ -57,6 +57,7 @@ #include "commands/event_trigger.h" #include "commands/progress.h" #include "commands/tablecmds.h" +#include "commands/tablespace.h" #include "commands/trigger.h" #include "executor/executor.h" #include "miscadmin.h" @@ -1394,9 +1395,13 @@ index_update_collation_versions(Oid relid, Oid coll) * Create concurrently an index based on the definition of the one provided by * caller. The index is inserted into catalogs and needs to be built later * on. This is called during concurrent reindex processing. + * + * "tablespaceOid" is the new tablespace to use for this index. If + * InvalidOid, use the tablespace in-use instead. */ Oid -index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId, const char *newName) +index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId, + Oid tablespaceOid, const char *newName) { Relation indexRelation; IndexInfo *oldInfo, @@ -1526,7 +1531,8 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId, const char newInfo, indexColNames, indexRelation->rd_rel->relam, - indexRelation->rd_rel->reltablespace, + OidIsValid(tablespaceOid) ? 
+tablespaceOid : indexRelation->rd_rel->reltablespace, indexRelation->rd_indcollation, indclass->values, indcoloptions->values, @@ -3591,6 +3597,8 @@ IndexGetRelation(Oid indexId, bool missing_ok) /* * reindex_index - This routine is used to recreate a single index + * + * See comments of reindex_relation() for details about "tablespaceOid". */ void reindex_index(Oid indexId, bool skip_constraint_checks, char persistence, @@ -3603,6 +3611,7 @@ reindex_index(Oid indexId, bool skip_constraint_checks, char persistence, volatile bool skipped_constraint = false; PGRUsage
Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly
On 2021-01-21 04:41, Michael Paquier wrote: On Wed, Jan 20, 2021 at 03:34:39PM -0300, Alvaro Herrera wrote: On 2021-Jan-20, Alexey Kondratov wrote: Ugh, forgot to attach the patches. Here they are. Yeah, looks reasonable. + + if (changed) + /* Record dependency on tablespace */ + changeDependencyOnTablespace(RelationRelationId, +reloid, rd_rel->reltablespace); Why have a separate "if (changed)" block here instead of merging with the above? Yep. Sure, this is a refactoring artifact. + if (SetRelTablespace(reloid, newTableSpace)) + /* Make sure the reltablespace change is visible */ + CommandCounterIncrement(); At a quick glance, I am wondering why you just don't do a CCI within SetRelTablespace(). I did it that way for better readability at first, since it looks more natural when you make some change (SetRelTablespace) and then make it visible with CCI. The second argument was that in the case of reindex_index() we also have to call RelationAssumeNewRelfilenode() and RelationDropStorage() before doing CCI and making the new tablespace visible. And this part is critical, I guess. + This specifies that indexes will be rebuilt on a new tablespace. + Cannot be used with "mapped" relations. If SCHEMA, + DATABASE or SYSTEM is specified, then + all unsuitable relations will be skipped and a single WARNING + will be generated. What is an unsuitable relation? How can the end user know that? This was referring to the mapped relations mentioned in the previous sentence. I have tried to rewrite this part and make it more specific in my current version. I have also added Justin's changes to the docs and comments. This is missing ACL checks when moving the index into a new location, so this requires some pg_tablespace_aclcheck() calls, and the other patches share the same issue. I added proper pg_tablespace_aclcheck() calls into reindex_index() and ReindexPartitions().
+ else if (partkind == RELKIND_PARTITIONED_TABLE) + { + Relation rel = table_open(partoid, ShareLock); + List*indexIds = RelationGetIndexList(rel); + ListCell *lc; + + table_close(rel, NoLock); + foreach (lc, indexIds) + { + Oid indexid = lfirst_oid(lc); + (void) set_rel_tablespace(indexid, params->tablespaceOid); + } + } This is really a good question. ReindexPartitions() would trigger one transaction per leaf to work on. Changing the tablespace of the partitioned table(s) before doing any work has the advantage to tell any new partition to use the new tablespace. Now, I see a struggling point here: what should we do if the processing fails in the middle of the move, leaving a portion of the leaves in the previous tablespace? On a follow-up reindex with the same command, should the command force a reindex even on the partitions that have been moved? Or could there be a point in skipping the partitions that are already on the new tablespace and only process the ones on the previous tablespace? It seems to me that the first scenario makes the most sense as currently a REINDEX works on all the relations defined, though there could be use cases for the second case. This should be documented, I think. I agree that follow-up REINDEX should also reindex moved partitions, since REINDEX (TABLESPACE ...) is still reindex at first. I will try to put something about this part into the docs. Also I think that we cannot be sure that nothing happened with already reindexed partitions between two consequent REINDEX calls. There are no tests for partitioned tables, aka we'd want to make sure that the new partitioned index is on the correct tablespace, as well as all its leaves. It may be better to have at least two levels of partitioned tables, as well as a partitioned table with no leaves in the cases dealt with. Yes, sure, it makes sense. +* +* Even if a table's indexes were moved to a new tablespace, the index +* on its toast table is not normally moved. 
*/ Still, REINDEX (TABLESPACE) TABLE should move all of them to be consistent with ALTER TABLE SET TABLESPACE, but that's not the case with this code, no? This requires proper test coverage, but there is nothing of the kind in this patch. You are right, we do not move TOAST indexes now, since IsSystemRelation() is true for TOAST indexes, so I thought that we should not allow moving them without allow_system_table_mods=true. Now I wonder why ALTER TABLE does that. I am going to attach the new version of the patch set today or tomorrow. Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company
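The division of labor discussed earlier in this message — SetRelTablespace() reports whether it changed anything, and the caller runs CommandCounterIncrement() only when it did — can be sketched outside the server. The following is a hypothetical, simplified Python model of that control flow; FakeCatalog, move_index, and the counter are illustration only, not PostgreSQL APIs:

```python
# Simplified model of the caller-side CCI pattern from the patch:
# a catalog-update helper reports whether it changed anything, and the
# caller makes the change visible (CommandCounterIncrement in PostgreSQL)
# only when a change actually happened.

INVALID_OID = 0

class FakeCatalog:
    """Stand-in for a pg_class row plus a command counter."""
    def __init__(self, reltablespace):
        self.reltablespace = reltablespace
        self.command_counter = 0

def set_rel_tablespace(catalog, new_tablespace, my_database_tablespace):
    """Mirror of the SetRelTablespace() logic: the database default
    tablespace is stored as InvalidOid (0) in pg_class, and the update
    is skipped entirely when nothing would change."""
    if new_tablespace == my_database_tablespace:
        new_tablespace = INVALID_OID      # stored as 0 in pg_class
    if new_tablespace == catalog.reltablespace:
        return False                      # no-op: nothing to record
    catalog.reltablespace = new_tablespace
    return True

def move_index(catalog, new_tablespace, my_database_tablespace):
    # Caller-side pattern: bump the command counter only if changed.
    if set_rel_tablespace(catalog, new_tablespace, my_database_tablespace):
        catalog.command_counter += 1      # "CommandCounterIncrement()"
```

The design point debated above is visible here: because the helper returns a bool instead of doing the visibility bump itself, a caller such as reindex_index() can interleave other work (e.g. RelationAssumeNewRelfilenode()) before making the change visible.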
Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly
On 2021-01-21 17:06, Alexey Kondratov wrote: On 2021-01-21 04:41, Michael Paquier wrote: There are no tests for partitioned tables, aka we'd want to make sure that the new partitioned index is on the correct tablespace, as well as all its leaves. It may be better to have at least two levels of partitioned tables, as well as a partitioned table with no leaves in the cases dealt with. Yes, sure, it makes sense. +* +* Even if a table's indexes were moved to a new tablespace, the index +* on its toast table is not normally moved. */ Still, REINDEX (TABLESPACE) TABLE should move all of them to be consistent with ALTER TABLE SET TABLESPACE, but that's not the case with this code, no? This requires proper test coverage, but there is nothing of the kind in this patch. You are right, we do not move TOAST indexes now, since IsSystemRelation() is true for TOAST indexes, so I thought that we should not allow moving them without allow_system_table_mods=true. Now I wonder why ALTER TABLE does that. I am going to attach the new version of the patch set today or tomorrow. Attached is a new patch set of the first two patches, which should resolve all the issues raised before (ACL, docs, tests) except TOAST. Double thanks for the suggestion to add more tests with nested partitioning. Using the newly added tests, I have found and squashed a huge bug related to returning to the default tablespace. Regarding TOAST: now we skip moving TOAST indexes, or throw an error if someone wants to move a TOAST index directly. I had a look at ALTER TABLE SET TABLESPACE and it has somewhat complicated logic: 1) You cannot move a TOAST table directly. 2) But if you move the base relation that a TOAST table belongs to, they are moved together. 3) The same logic as 2) applies if one does ALTER TABLE ALL IN TABLESPACE ... That way, ALTER TABLE allows moving TOAST tables (with their indexes) implicitly, but does not allow doing so explicitly.
At the same time, I found the docs to be vague about such behavior; they only say: All tables in the current database in a tablespace can be moved by using the ALL IN TABLESPACE ... Note that system catalogs are not moved by this command Changing any part of a system catalog table is not permitted. So ALTER TABLE actually treats TOAST relations as system relations sometimes, but sometimes not. From the end-user perspective it makes sense to move TOAST along with the main table when doing ALTER TABLE SET TABLESPACE. But should we touch indexes on a TOAST table with REINDEX? We cannot move the TOAST relation itself, since we are doing only a reindex, so we would end up in a state where the TOAST table and its index are placed in different tablespaces. This state is not reachable with ALTER TABLE/INDEX, so it seems we should not allow it with REINDEX either, should we? Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres CompanyFrom bcd690da6bc3db16a96305b45546d3c9e400f769 Mon Sep 17 00:00:00 2001 From: Alexey Kondratov Date: Mon, 23 Mar 2020 21:10:29 +0300 Subject: [PATCH v3 2/2] Allow REINDEX to change tablespace REINDEX already does full relation rewrite, this patch adds a possibility to specify a new tablespace where new relfilenode will be created. --- doc/src/sgml/ref/reindex.sgml | 29 +++- src/backend/catalog/index.c | 82 +++- src/backend/commands/indexcmds.c | 81 +++- src/bin/psql/tab-complete.c | 4 +- src/include/catalog/index.h | 2 + src/test/regress/input/tablespace.source | 79 +++ src/test/regress/output/tablespace.source | 154 ++ 7 files changed, 425 insertions(+), 6 deletions(-) diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml index 627b36300c..90fdad0b4c 100644 --- a/doc/src/sgml/ref/reindex.sgml +++ b/doc/src/sgml/ref/reindex.sgml @@ -27,6 +27,7 @@ REINDEX [ ( option [, ...] ) ] { IN CONCURRENTLY [ boolean ] VERBOSE [ boolean ] +TABLESPACE new_tablespace @@ -187,6 +188,20 @@ REINDEX [ ( option [, ...]
) ] { IN + +TABLESPACE + + + Specifies that indexes will be rebuilt on a new tablespace. + Cannot be used with "mapped" and system (unless allow_system_table_mods + is set to TRUE) relations. If SCHEMA, + DATABASE or SYSTEM are specified, then + all "mapped" and system relations will be skipped and a single + WARNING will be generated. + + + + VERBOSE @@ -210,6 +225,14 @@ REINDEX [ ( option [, ...] ) ] { IN + +new_tablespace + + + The tablespace where indexes will be rebuilt. + + + @@ -292,7 +315,11 @@ REINDEX [ ( option [, ...] ) ] { IN with REINDEX INDEX or REINDEX TABLE, respectively. Each partition of the specified partitioned relation is reindex
Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly
On 2021-01-22 00:26, Justin Pryzby wrote: On Thu, Jan 21, 2021 at 11:48:08PM +0300, Alexey Kondratov wrote: Attached is a new patch set of first two patches, that should resolve all the issues raised before (ACL, docs, tests) excepting TOAST. Double thanks for suggestion to add more tests with nested partitioning. I have found and squashed a huge bug related to the returning back to the default tablespace using newly added tests. Regarding TOAST. Now we skip moving toast indexes or throw error if someone wants to move TOAST index directly. I had a look on ALTER TABLE SET TABLESPACE and it has a bit complicated logic: 1) You cannot move TOAST table directly. 2) But if you move basic relation that TOAST table belongs to, then they are moved altogether. 3) Same logic as 2) happens if one does ALTER TABLE ALL IN TABLESPACE ... That way, ALTER TABLE allows moving TOAST tables (with indexes) implicitly, but does not allow doing that explicitly. In the same time I found docs to be vague about such behavior it only says: All tables in the current database in a tablespace can be moved by using the ALL IN TABLESPACE ... Note that system catalogs are not moved by this command Changing any part of a system catalog table is not permitted. So actually ALTER TABLE treats TOAST relations as system sometimes, but sometimes not. From the end user perspective it makes sense to move TOAST with main table when doing ALTER TABLE SET TABLESPACE. But should we touch indexes on TOAST table with REINDEX? We cannot move TOAST relation itself, since we are doing only a reindex, so we end up in the state when TOAST table and its index are placed in the different tablespaces. This state is not reachable with ALTER TABLE/INDEX, so it seem we should not allow it with REINDEX as well, should we? + * Even if a table's indexes were moved to a new tablespace, the index +* on its toast table is not normally moved. 
*/ ReindexParams newparams = *params; newparams.options &= ~(REINDEXOPT_MISSING_OK); + if (!allowSystemTableMods) + newparams.tablespaceOid = InvalidOid; I think you're right. So actually TOAST should never move, even if allowSystemTableMods, right ? I think so. I would prefer not to move TOAST indexes implicitly at all during reindex. @@ -292,7 +315,11 @@ REINDEX [ ( class="parameter">option [, ...] ) ] { IN with REINDEX INDEX or REINDEX TABLE, respectively. Each partition of the specified partitioned relation is reindexed in a separate transaction. Those commands cannot be used inside - a transaction block when working on a partitioned table or index. + a transaction block when working on a partitioned table or index. If + REINDEX with TABLESPACE executed + on partitioned relation fails it may have moved some partitions to the new + tablespace. Repeated command will still reindex all partitions even if they + are already in the new tablespace. Minor corrections here: If a REINDEX command fails when run on a partitioned relation, and TABLESPACE was specified, then it may have moved indexes on some partitions to the new tablespace. Re-running the command will reindex all partitions and move previously-unprocessed indexes to the new tablespace. Sounds good to me. I have updated the patches accordingly and also simplified the tablespaceOid checks and assignment in the newly added SetRelTableSpace(). The result is attached as two separate patches for ease of review, but no objections to merging them and applying them at once if everything is fine. Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres CompanyFrom 87e47e9b5b3d6b49230045e5db8f844b14b34ba0 Mon Sep 17 00:00:00 2001 From: Alexey Kondratov Date: Mon, 23 Mar 2020 21:10:29 +0300 Subject: [PATCH v4 2/2] Allow REINDEX to change tablespace REINDEX already does full relation rewrite, this patch adds a possibility to specify a new tablespace where new relfilenode will be created.
--- doc/src/sgml/ref/reindex.sgml | 30 +++- src/backend/catalog/index.c | 81 ++- src/backend/commands/indexcmds.c | 81 ++- src/bin/psql/tab-complete.c | 4 +- src/include/catalog/index.h | 2 + src/test/regress/input/tablespace.source | 85 src/test/regress/output/tablespace.source | 159 ++ 7 files changed, 436 insertions(+), 6 deletions(-) diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml index 627b36300c..a1c7736aec 100644 --- a/doc/src/sgml/ref/reindex.sgml +++ b/doc/src/sgml/ref/reindex.sgml @@ -27,6 +27,7 @@ REINDEX [ ( option [, ...] ) ] { IN CONCURRENTLY [ boolean ] VERBOSE [ boolean ] +TABLESPACE new_tablespace @@ -187,6 +188,20 @@ REINDEX
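The failure and re-run semantics spelled out in the documentation wording above — each leaf partition is reindexed in its own transaction, so a mid-way failure leaves earlier partitions already moved, and a re-run still reindexes all partitions, moving the previously unprocessed ones — can be sketched as a small model. This is a hypothetical Python illustration of those semantics, not PostgreSQL code; the leaf dictionaries and `fail_after` knob are invented for the demonstration:

```python
# Model of REINDEX (TABLESPACE ...) on a partitioned table: one
# "transaction" per leaf, so a failure part-way through commits the
# work already done, and a re-run processes every leaf again.

def reindex_partitions(leaves, new_tablespace, fail_after=None):
    """Reindex each leaf in a separate 'transaction'; optionally raise
    after the first `fail_after` leaves to simulate an interruption."""
    done = 0
    for leaf in leaves:
        if fail_after is not None and done >= fail_after:
            raise RuntimeError("simulated failure mid-way")
        # One transaction per leaf: the move and the reindex of this
        # leaf are committed independently of the other leaves.
        leaf["tablespace"] = new_tablespace
        leaf["reindexed"] += 1
        done += 1
```

Running this with a simulated failure after two of four leaves, then re-running, shows why the doc text matters: the second run reindexes all four leaves (including the two already moved) and moves the remaining two.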
Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly
On 2021-01-25 11:07, Michael Paquier wrote: On Fri, Jan 22, 2021 at 05:07:02PM +0300, Alexey Kondratov wrote: I have updated the patches accordingly and also simplified the tablespaceOid checks and assignment in the newly added SetRelTableSpace(). The result is attached as two separate patches for ease of review, but no objections to merging them and applying them at once if everything is fine. extern void SetRelationHasSubclass(Oid relationId, bool relhassubclass); +extern bool SetRelTableSpace(Oid reloid, Oid tablespaceOid); Seeing SetRelationHasSubclass(), wouldn't it be more consistent to use SetRelationTableSpace() as the routine name? I think that we should document that the caller of this routine had better do a CCI once done to make the tablespace change visible. Except for those two nits, the patch needs an indentation run and some style tweaks, but its logic looks fine. So I'll apply that first piece. I updated the comment with the CCI info, did a pgindent run and renamed the new function to SetRelationTableSpace(). New patch is attached. +INSERT INTO regress_tblspace_test_tbl (num1, num2, t) + SELECT round(random()*100), random(), repeat('text', 100) + FROM generate_series(1, 10) s(i); Repeating a text value 1M times is too costly for such a test. And as even for empty tables there is one page created for toast indexes, there is no need for that? Yes, the TOAST relation is created anyway. I just wanted to put some data into a TOAST index, so that REINDEX did some meaningful work there, not only a new relfilenode creation. However, you are right, and this query increases tablespace test execution time by more than 2x on my machine. I think that it is not really required. This patch is introducing three new checks for system catalogs: - don't use tablespace for mapped relations. - don't use tablespace for system relations, except if allowSystemTableMods. - don't move non-shared relations to the global tablespace. For the non-concurrent case, all three checks are in reindex_index().
For the concurrent case, the first two checks are in ReindexMultipleTables() and the third one is in ReindexRelationConcurrently(). That's rather tricky to follow because CONCURRENTLY is not allowed on system relations. I am wondering if it would be worth an extra comment effort, or if there is a way to consolidate that better. Yeah, all these checks were complicated from the beginning. I will try to find a better place tomorrow, or put more info into the comments at least. I am also going to check/fix the remaining points regarding 002 tomorrow. Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres CompanyFrom 39880842d7af31dcbfcffe7219250b31102955d5 Mon Sep 17 00:00:00 2001 From: Alexey Kondratov Date: Wed, 20 Jan 2021 20:21:12 +0300 Subject: [PATCH v5 1/2] Extract common part from ATExecSetTableSpaceNoStorage for a future usage --- src/backend/commands/tablecmds.c | 95 +++- src/include/commands/tablecmds.h | 2 + 2 files changed, 58 insertions(+), 39 deletions(-) diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c index 8687e9a97c..ec9c440e4e 100644 --- a/src/backend/commands/tablecmds.c +++ b/src/backend/commands/tablecmds.c @@ -13291,6 +13291,59 @@ ATExecSetTableSpace(Oid tableOid, Oid newTableSpace, LOCKMODE lockmode) list_free(reltoastidxids); } +/* + * SetRelationTableSpace - modify relation tablespace in the pg_class entry. + * + * 'reloid' is the Oid of the relation to be modified. + * 'tablespaceOid' is the Oid of the new tablespace. + * + * Catalog modification is done only if tablespaceOid differs from + * the currently set value. The returned bool value indicates whether any + * changes were made. Note that the caller is responsible for doing + * CommandCounterIncrement() to make tablespace changes visible.
+ */ +bool +SetRelationTableSpace(Oid reloid, Oid tablespaceOid) +{ + Relation pg_class; + HeapTuple tuple; + Form_pg_class rd_rel; + bool changed = false; + + /* Get a modifiable copy of the relation's pg_class row. */ + pg_class = table_open(RelationRelationId, RowExclusiveLock); + + tuple = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(reloid)); + if (!HeapTupleIsValid(tuple)) + elog(ERROR, "cache lookup failed for relation %u", reloid); + rd_rel = (Form_pg_class) GETSTRUCT(tuple); + + /* MyDatabaseTableSpace is stored as InvalidOid. */ + if (tablespaceOid == MyDatabaseTableSpace) + tablespaceOid = InvalidOid; + + /* No work if no change in tablespace. */ + if (tablespaceOid != rd_rel->reltablespace) + { + /* Update the pg_class row. */ + rd_rel->reltablespace = tablespaceOid; + CatalogTupleUpdate(pg_class, &tuple->t_self, tuple); + + /* Record dependency on tablespace. */ + changeDependencyOnTablespace(RelationRelationId, + reloid, rd_rel->reltablespace); + + changed = true;
Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly
On 2021-01-26 09:58, Michael Paquier wrote: On Mon, Jan 25, 2021 at 11:11:38PM +0300, Alexey Kondratov wrote: I updated comment with CCI info, did pgindent run and renamed new function to SetRelationTableSpace(). New patch is attached. [...] Yeah, all these checks we complicated from the beginning. I will try to find a better place tomorrow or put more info into the comments at least. I was reviewing that, and I think that we can do a better consolidation on several points that will also help the features discussed on this thread for VACUUM, CLUSTER and REINDEX. If you look closely, ATExecSetTableSpace() uses the same logic as the code modified here to check if a relation can be moved to a new tablespace, with extra checks for mapped relations, GLOBALTABLESPACE_OID or if attempting to manipulate a temp relation from another session. There are two differences though: - Custom actions are taken between the phase where we check if a relation can be moved to a new tablespace, and the update of pg_class. - ATExecSetTableSpace() needs to be able to set a given relation relfilenode on top of reltablespace, the newly-created one. So I think that the heart of the problem is made of two things here: - We should have one common routine for the existing code paths and the new code paths able to check if a tablespace move can be done or not. The case of a cluster, reindex or vacuum on a list of relations extracted from pg_class would still require a different handling as incorrect relations have to be skipped, but the case of individual relations can reuse the refactoring pieces done here (see CheckRelationTableSpaceMove() in the attached). - We need to have a second routine able to update reltablespace and optionally relfilenode for a given relation's pg_class entry, once the caller has made sure that CheckRelationTableSpaceMove() validates a tablespace move. I think that I got your idea. 
One comment: +bool +CheckRelationTableSpaceMove(Relation rel, Oid newTableSpaceId) +{ + Oid oldTableSpaceId; + Oid reloid = RelationGetRelid(rel); + + /* +* No work if no change in tablespace. Note that MyDatabaseTableSpace +* is stored as 0. +*/ + oldTableSpaceId = rel->rd_rel->reltablespace; + if (newTableSpaceId == oldTableSpaceId || + (newTableSpaceId == MyDatabaseTableSpace && oldTableSpaceId == 0)) + { + InvokeObjectPostAlterHook(RelationRelationId, reloid, 0); + return false; + } CheckRelationTableSpaceMove() does not feel like the right place for invoking post-alter hooks. It is intended only to check whether a tablespace change is possible. Anyway, ATExecSetTableSpace() and ATExecSetTableSpaceNoStorage() already do that in the no-op case. Please note that was a bug in your previous patch 0002: shared dependencies need to be registered if reltablespace is updated of course, but also iff the relation has no physical storage. So changeDependencyOnTablespace() requires a check based on RELKIND_HAS_STORAGE(), or REINDEX would have registered shared dependencies even for relations with storage, something we don't want per the recent work done by Alvaro in ebfe2db. Yes, thanks. I have removed this InvokeObjectPostAlterHook() from your 0001 and made 0002 work on top of it. I think that now it should look closer to what you described above. In the new 0002 I moved the ACL check to the upper level, i.e. ExecReindex(), and removed the expensive text generation in the test. I have not yet touched some of your previously raised concerns. Also, you made SetRelationTableSpace() accept a Relation instead of an Oid, so now we have to open/close indexes in ReindexPartitions(); I am not sure that I use proper locking there, but it works.
Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres CompanyFrom 96a37399a9cf9ae08d62e28496e73b36087e5a19 Mon Sep 17 00:00:00 2001 From: Michael Paquier Date: Tue, 26 Jan 2021 15:53:06 +0900 Subject: [PATCH v7 1/2] Refactor code to detect and process tablespace moves --- src/backend/commands/tablecmds.c | 218 +-- src/include/commands/tablecmds.h | 4 + 2 files changed, 127 insertions(+), 95 deletions(-) diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c index 8687e9a97c..c08eedf995 100644 --- a/src/backend/commands/tablecmds.c +++ b/src/backend/commands/tablecmds.c @@ -3037,6 +3037,116 @@ SetRelationHasSubclass(Oid relationId, bool relhassubclass) table_close(relationRelation, RowExclusiveLock); } +/* + * CheckRelationTableSpaceMove + * Check if relation can be moved to new tablespace. + * + * NOTE: caller must be holding an appropriate lock on the relation. + * ShareUpdateExclusiveLock is sufficient to prevent concurrent schema + * changes. + * + * Returns true if the relation can be moved to the n
Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly
On 2021-01-27 06:14, Michael Paquier wrote: On Wed, Jan 27, 2021 at 01:00:50AM +0300, Alexey Kondratov wrote: In the new 0002 I moved ACL check to the upper level, i.e. ExecReindex(), and removed expensive text generation in test. Not touched yet some of your previously raised concerns. Also, you made SetRelationTableSpace() to accept Relation instead of Oid, so now we have to open/close indexes in the ReindexPartitions(), I am not sure that I use proper locking there, but it works. Passing down Relation to the new routines makes the most sense to me because we force the callers to think about the level of locking that's required when doing any tablespace moves. + Relation iRel = index_open(partoid, ShareLock); + + if (CheckRelationTableSpaceMove(iRel, params->tablespaceOid)) + SetRelationTableSpace(iRel, + params->tablespaceOid, + InvalidOid); Speaking of which, this breaks the locking assumptions of SetRelationTableSpace(). I feel that we should think harder about this part for partitioned indexes and tables because this looks rather unsafe in terms of locking assumptions with partition trees. If we cannot come up with a safe solution, I would be fine with disallowing TABLESPACE in this case, as a first step. Not all problems have to be solved at once, and even without this part the feature is still useful. I have read more about lock levels and ShareLock should prevent any kind of physical modification of indexes. We already hold ShareLock doing find_all_inheritors(), which is higher than ShareUpdateExclusiveLock, so using ShareLock seems to be safe here, but I will look on it closer. + /* It's not a shared catalog, so refuse to move it to shared tablespace */ + if (params->tablespaceOid == GLOBALTABLESPACE_OID) + ereport(ERROR, + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), +errmsg("cannot move non-shared relation to tablespace \"%s\"", +get_tablespace_name(params->tablespaceOid; Why is that needed if CheckRelationTableSpaceMove() is used? 
This is from ReindexRelationConcurrently(), where we do not use CheckRelationTableSpaceMove(). For me it makes sense to add only this GLOBALTABLESPACE_OID check there, since we already check for system catalogs before it and for temp relations after it, so adding CheckRelationTableSpaceMove() would be a double check. - indexRelation->rd_rel->reltablespace, + OidIsValid(tablespaceOid) ? + tablespaceOid : indexRelation->rd_rel->reltablespace, Let's remove this logic from index_concurrently_create_copy() and let the caller directly decide the tablespace to use, without a dependency on InvalidOid in the inner routine. A share update exclusive lock is already held on the old index when creating the concurrent copy, so there won't be concurrent schema changes. Changed. Also added tests for ACL checks and relfilenode changes. Added an ACL recheck for the multi-transaction case. Added info about TOAST index reindexing. Changed some comments. Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres CompanyFrom f176a6e5a81ab133fee849f72e4edb8b287d6062 Mon Sep 17 00:00:00 2001 From: Alexey Kondratov Date: Wed, 27 Jan 2021 00:46:17 +0300 Subject: [PATCH v8] Allow REINDEX to change tablespace REINDEX already does full relation rewrite, this patch adds a possibility to specify a new tablespace where new relfilenode will be created. --- doc/src/sgml/ref/reindex.sgml | 31 +++- src/backend/catalog/index.c | 50 +- src/backend/commands/indexcmds.c | 112 - src/bin/psql/tab-complete.c | 4 +- src/include/catalog/index.h | 9 +- src/test/regress/input/tablespace.source | 106 + src/test/regress/output/tablespace.source | 181 ++ 7 files changed, 481 insertions(+), 12 deletions(-) diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml index 627b36300c..e610a0f52c 100644 --- a/doc/src/sgml/ref/reindex.sgml +++ b/doc/src/sgml/ref/reindex.sgml @@ -27,6 +27,7 @@ REINDEX [ ( option [, ...]
) ] { IN CONCURRENTLY [ boolean ] VERBOSE [ boolean ] +TABLESPACE new_tablespace @@ -187,6 +188,21 @@ REINDEX [ ( option [, ...] ) ] { IN + +TABLESPACE + + + Specifies that indexes will be rebuilt on a new tablespace. + Cannot be used with "mapped" and system (unless allow_system_table_mods + is set to TRUE) relations. If SCHEMA, + DATABASE or SYSTEM are specified, + then all "mapped" and system relations will be skipped and a single + WARNING will be generated.
Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly
On 2021-01-28 00:36, Alvaro Herrera wrote: On 2021-Jan-28, Alexey Kondratov wrote: I have read more about lock levels and ShareLock should prevent any kind of physical modification of indexes. We already hold ShareLock doing find_all_inheritors(), which is higher than ShareUpdateExclusiveLock, so using ShareLock seems to be safe here, but I will look on it closer. You can look at lock.c where LockConflicts[] is; that would tell you that ShareLock indeed conflicts with ShareUpdateExclusiveLock ... but it does not conflict with itself! So it would be possible to have more than one process doing this thing at the same time, which surely makes no sense. Thanks for the explanation and pointing me to the LockConflicts[]. This is a good reference. I didn't look at the patch closely enough to understand why you're trying to do something like CLUSTER, VACUUM FULL or REINDEX without holding full AccessExclusiveLock on the relation. But do keep in mind that once you hold a lock on a relation, trying to grab a weaker lock afterwards is pretty pointless. No, you are right, we are doing REINDEX with AccessExclusiveLock as it was before. This part is a more specific one. It only applies to partitioned indexes, which do not hold any data, so we do not reindex them directly, only their leafs. However, if we are doing a TABLESPACE change, we have to record it in their pg_class entry, so all future leaf partitions were created in the proper tablespace. That way, we open partitioned index relation only for a reference, i.e. read-only, but modify pg_class entry under a proper lock (RowExclusiveLock). That's why I thought that ShareLock will be enough. IIUC, 'ALTER TABLE ... SET TABLESPACE' uses AccessExclusiveLock even for relations with no storage, since AlterTableGetLockLevel() chooses it if AT_SetTableSpace is met. This is very similar to our case, so probably we should do the same? 
Actually it is not completely clear to me why ShareUpdateExclusiveLock is sufficient for the newly added SetRelationTableSpace(), as Michael wrote in the comment. Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company
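To make the conflict-matrix argument concrete, here is a small illustrative Python sketch (not PostgreSQL code) modeling a few rows of LockConflicts[] from src/backend/storage/lmgr/lock.c. The point Alvaro raises: ShareLock conflicts with ShareUpdateExclusiveLock, but not with itself, so two backends could both take ShareLock on the same partitioned index at the same time.

```python
# Toy subset of PostgreSQL's table-level lock conflict matrix; the
# lock names are real, the dict layout is for illustration only.
CONFLICTS = {
    "AccessShareLock":          {"AccessExclusiveLock"},
    "ShareUpdateExclusiveLock": {"ShareUpdateExclusiveLock", "ShareLock",
                                 "ShareRowExclusiveLock", "ExclusiveLock",
                                 "AccessExclusiveLock"},
    "ShareLock":                {"RowExclusiveLock", "ShareUpdateExclusiveLock",
                                 "ShareRowExclusiveLock", "ExclusiveLock",
                                 "AccessExclusiveLock"},
}

def conflicts(a, b):
    """True if a holder of lock mode `a` blocks a requester of mode `b`."""
    return b in CONFLICTS[a]

# ShareLock does block ShareUpdateExclusiveLock holders...
print(conflicts("ShareLock", "ShareUpdateExclusiveLock"))  # True
# ...but does NOT conflict with itself, so two sessions could run the
# pg_class-update code path concurrently -- the objection above.
print(conflicts("ShareLock", "ShareLock"))  # False
```

This is why a self-conflicting mode (ShareUpdateExclusiveLock or stronger) is needed when a session intends to modify the catalog entry.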
Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly
On 2021-01-28 14:42, Alexey Kondratov wrote: On 2021-01-28 00:36, Alvaro Herrera wrote: I didn't look at the patch closely enough to understand why you're trying to do something like CLUSTER, VACUUM FULL or REINDEX without holding full AccessExclusiveLock on the relation. But do keep in mind that once you hold a lock on a relation, trying to grab a weaker lock afterwards is pretty pointless. No, you are right, we are doing REINDEX with AccessExclusiveLock as it was before. This part is a more specific one. It only applies to partitioned indexes, which do not hold any data, so we do not reindex them directly, only their leafs. However, if we are doing a TABLESPACE change, we have to record it in their pg_class entry, so all future leaf partitions were created in the proper tablespace. That way, we open partitioned index relation only for a reference, i.e. read-only, but modify pg_class entry under a proper lock (RowExclusiveLock). That's why I thought that ShareLock will be enough. IIUC, 'ALTER TABLE ... SET TABLESPACE' uses AccessExclusiveLock even for relations with no storage, since AlterTableGetLockLevel() chooses it if AT_SetTableSpace is met. This is very similar to our case, so probably we should do the same? Actually it is not completely clear for me why ShareUpdateExclusiveLock is sufficient for newly added SetRelationTableSpace() as Michael wrote in the comment. Changed patch to use AccessExclusiveLock in this part for now. This is what 'ALTER TABLE/INDEX ... SET TABLESPACE' and 'REINDEX' usually do. Anyway, all real leaf partitions are processed in the independent transactions later. Also changed some doc/comment parts Justin pointed me to. + then all "mapped" and system relations will be skipped and a single + WARNING will be generated. Indexes on TOAST tables + are reindexed, but not moved the new tablespace. moved *to* the new tablespace. Fixed. I don't know if that needs to be said at all. 
We talked about it a lot to arrive at the current behavior, but I think that's only due to the difficulty of correcting the initial mistake. I do not think that it will be a big deal to move indexes on TOAST tables as well. I just thought that since 'ALTER TABLE/INDEX ... SET TABLESPACE' only moves them together with the host table, we also should not do that. Yet, I am ready to change this logic if requested. Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company >From 6e9db8d362e794edf421733bc7cade38c917bff4 Mon Sep 17 00:00:00 2001 From: Alexey Kondratov Date: Wed, 27 Jan 2021 00:46:17 +0300 Subject: [PATCH v9] Allow REINDEX to change tablespace REINDEX already does a full relation rewrite; this patch adds a possibility to specify a new tablespace where the new relfilenode will be created. --- doc/src/sgml/ref/reindex.sgml | 31 +++- src/backend/catalog/index.c | 47 +- src/backend/commands/indexcmds.c | 112 - src/bin/psql/tab-complete.c | 4 +- src/include/catalog/index.h | 9 +- src/test/regress/input/tablespace.source | 106 + src/test/regress/output/tablespace.source | 181 ++ 7 files changed, 478 insertions(+), 12 deletions(-) diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml index 627b36300c..2b39699d42 100644 --- a/doc/src/sgml/ref/reindex.sgml +++ b/doc/src/sgml/ref/reindex.sgml @@ -27,6 +27,7 @@ REINDEX [ ( option [, ...] ) ] { IN CONCURRENTLY [ boolean ] VERBOSE [ boolean ] +TABLESPACE new_tablespace @@ -187,6 +188,21 @@ REINDEX [ ( option [, ...] ) ] { IN + +TABLESPACE + + + Specifies that indexes will be rebuilt on a new tablespace. + Cannot be used with "mapped" or (unless allow_system_table_mods) + system relations. If SCHEMA, + DATABASE or SYSTEM are specified, + then all "mapped" and system relations will be skipped and a single + WARNING will be generated. Indexes on TOAST tables + are reindexed, but not moved to the new tablespace.
+ + + + VERBOSE @@ -210,6 +226,14 @@ REINDEX [ ( option [, ...] ) ] { IN + +new_tablespace + + + The tablespace where indexes will be rebuilt. + + + @@ -292,7 +316,12 @@ REINDEX [ ( option [, ...] ) ] { IN with REINDEX INDEX or REINDEX TABLE, respectively. Each partition of the specified partitioned relation is reindexed in a separate transaction. Those commands cannot be used inside - a transaction block when working on a partitioned table or index. + a transaction block when working on a partitioned table or index. If + a REINDEX command fails when run on a partitioned + relation, and TABLESPACE was specified, then it may have + moved
Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly
On 2021-01-30 05:23, Michael Paquier wrote: On Fri, Jan 29, 2021 at 08:56:47PM +0300, Alexey Kondratov wrote: On 2021-01-28 14:42, Alexey Kondratov wrote: No, you are right, we are doing REINDEX with AccessExclusiveLock as it was before. This part is a more specific one. It only applies to partitioned indexes, which do not hold any data, so we do not reindex them directly, only their leafs. However, if we are doing a TABLESPACE change, we have to record it in their pg_class entry, so all future leaf partitions were created in the proper tablespace. That way, we open partitioned index relation only for a reference, i.e. read-only, but modify pg_class entry under a proper lock (RowExclusiveLock). That's why I thought that ShareLock will be enough. IIUC, 'ALTER TABLE ... SET TABLESPACE' uses AccessExclusiveLock even for relations with no storage, since AlterTableGetLockLevel() chooses it if AT_SetTableSpace is met. This is very similar to our case, so probably we should do the same? Actually it is not completely clear for me why ShareUpdateExclusiveLock is sufficient for newly added SetRelationTableSpace() as Michael wrote in the comment. Nay, it was not fine. That's something Alvaro has mentioned, leading to 2484329. This also means that the main patch of this thread should refresh the comments at the top of CheckRelationTableSpaceMove() and SetRelationTableSpace() to mention that this is used by REINDEX CONCURRENTLY with a lower lock. Hm, IIUC, REINDEX CONCURRENTLY doesn't use either of them. It directly uses index_create() with a proper tablespaceOid instead of SetRelationTableSpace(). And its checks structure is more restrictive even without tablespace change, so it doesn't use CheckRelationTableSpaceMove(). Changed patch to use AccessExclusiveLock in this part for now. This is what 'ALTER TABLE/INDEX ... SET TABLESPACE' and 'REINDEX' usually do. Anyway, all real leaf partitions are processed in the independent transactions later. 
+ if (partkind == RELKIND_PARTITIONED_INDEX) + { + Relation iRel = index_open(partoid, AccessExclusiveLock); + + if (CheckRelationTableSpaceMove(iRel, params->tablespaceOid)) + SetRelationTableSpace(iRel, + params->tablespaceOid, + InvalidOid); + index_close(iRel, NoLock); Are you sure that this does not represent a risk of deadlocks as EAL is not taken consistently across all the partitions? A second issue here is that this breaks the assumption of REINDEX CONCURRENTLY kicked on partitioned relations that should use ShareUpdateExclusiveLock for all its steps. This would make the first transaction invasive for the user, but we don't want that. This makes me really wonder if we would not be better to restrict this operation for partitioned relation as part of REINDEX as a first step. Another thing, mentioned upthread, is that we could do this part of the switch at the last transaction, or we could silently *not* do the switch for partitioned indexes in the flow of REINDEX, letting users handle that with an extra ALTER TABLE SET TABLESPACE once REINDEX has finished on all the partitions, cascading the command only on the partitioned relation of a tree. It may be interesting to look as well at if we could lower the lock used for partitioned relations with ALTER TABLE SET TABLESPACE from AEL to SUEL, choosing AEL only if at least one partition with storage is involved in the command, CheckRelationTableSpaceMove() discarding anything that has no need to change. I am not sure right now, so I split previous patch into two parts: 0001: Adds TABLESPACE into REINDEX with tests, doc and all the stuff we did before with the only exception that it doesn't move partitioned indexes into the new tablespace. Basically, it implements this option "we could silently *not* do the switch for partitioned indexes in the flow of REINDEX, letting users handle that with an extra ALTER TABLE SET TABLESPACE once REINDEX has finished". 
It probably makes sense, since we are doing the tablespace change together with the index relation rewrite and don't touch relations without storage. Doing ALTER INDEX ... SET TABLESPACE will be almost costless on them, since they do not hold any data. 0002: Implements the remaining part where the pg_class entry is also changed for partitioned indexes. I think that we should think more about it; maybe it is not so dangerous and a proper locking strategy could be achieved. Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company >From 6322032b472e6b1a76e0ca9326974e5774371fb9 Mon Sep 17 00:00:00 2001 From: Alexey Kondratov Date: Mon, 1 Feb 2021 15:20:29 +0300 Subject: [PATCH v10 2/2] Change tablespace of partitioned indexes during REINDEX. There are some doubts about proper locking
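The deadlock risk Michael raises is the classic lock-ordering problem: if AccessExclusiveLock is not taken in a consistent order across all members of a partition tree, two sessions can each end up holding a lock the other is waiting for. A hedged, PostgreSQL-agnostic Python sketch of why the usual "acquire in sorted OID order" discipline avoids the cycle:

```python
def has_circular_wait(sessions):
    """sessions: {name: (held_lock, wanted_lock)}. A session waits on
    whoever holds its wanted lock; a cycle in that graph is a deadlock.
    Purely illustrative -- not how PostgreSQL's deadlock detector works."""
    holders = {held: s for s, (held, _) in sessions.items()}
    waits_on = {s: holders.get(wanted) for s, (_, wanted) in sessions.items()}
    for start in waits_on:
        seen, cur = set(), start
        while cur is not None and cur not in seen:
            seen.add(cur)
            cur = waits_on.get(cur)
        if cur is not None:          # walked back onto a visited session
            return True
    return False

# Inconsistent ordering: s1 locked partition 1001 and wants 1002,
# s2 locked 1002 and wants 1001 -> classic deadlock.
print(has_circular_wait({"s1": (1001, 1002), "s2": (1002, 1001)}))  # True
# Sorted-OID discipline: both sessions try 1001 first, so the loser
# blocks before holding anything the winner still needs.
print(has_circular_wait({"s1": (1001, 1002), "s2": (None, 1001)}))  # False
```

Taking the strongest lock on every member of the tree in one consistent order (or restricting the feature, as suggested above) sidesteps this entirely.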
Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly
On 2021-02-03 09:37, Michael Paquier wrote: On Tue, Feb 02, 2021 at 10:32:19AM +0900, Michael Paquier wrote: On Mon, Feb 01, 2021 at 06:28:57PM +0300, Alexey Kondratov wrote: > Hm, IIUC, REINDEX CONCURRENTLY doesn't use either of them. It directly uses > index_create() with a proper tablespaceOid instead of > SetRelationTableSpace(). And its checks structure is more restrictive even > without tablespace change, so it doesn't use CheckRelationTableSpaceMove(). Sure. I have not checked the patch in detail, but even with that it would be much safer to me if we apply the same sanity checks everywhere. That's fewer potential holes to worry about. Thanks Alexey for the new patch. I have been looking at the main patch in detail. /* -* Don't allow reindex on temp tables of other backends ... their local -* buffer manager is not going to cope. +* We don't support moving system relations into different tablespaces +* unless allow_system_table_mods=1. */ If you remove the check on RELATION_IS_OTHER_TEMP() in reindex_index(), you would allow the reindex of a temp relation owned by a different session if its tablespace is not changed, so this cannot be removed. +!allowSystemTableMods && IsSystemRelation(iRel)) ereport(ERROR, -(errcode(ERRCODE_FEATURE_NOT_SUPPORTED), - errmsg("cannot reindex temporary tables of other sessions"))); +(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), + errmsg("permission denied: \"%s\" is a system catalog", +RelationGetRelationName(iRel)))); Indeed, a system relation with a relfilenode should be allowed to move under allow_system_table_mods. I think that we had better move this check into CheckRelationTableSpaceMove() instead of reindex_index() to centralize the logic. ALTER TABLE does this business in RangeVarCallbackForAlterRelation(), but our code path opening the relation is different for the non-concurrent case.
+ if (OidIsValid(params->tablespaceOid) && + IsSystemClass(relid, classtuple)) + { + if (!allowSystemTableMods) + { + /* Skip all system relations, if not allowSystemTableMods */ I don't see the need for having two warnings here to say the same thing if a relation is mapped or not mapped, so let's keep it simple. Yeah, I just wanted to separate mapped and system relations, but probably it is too complicated. I have found that the test suite was rather messy in its organization. Table creations were done first with a set of tests not really ordered, so that was really hard to follow. This has also led to a set of tests that were duplicated, while other tests have been missed, mainly some cross checks for the concurrent and non-concurrent behaviors. I have reordered the whole so that tests on catalogs, normal tables and partitions are done separately with relations created and dropped for each set. Partitions use a global check for tablespaces and relfilenodes after one concurrent reindex (didn't see the point in doubling with the non-concurrent case as the same code path to select the relations from the partition tree is taken). An ACL test has been added at the end. The case of partitioned indexes was kind of interesting and I thought about that for a couple of days, and I took the decision to ignore relations that have no storage as you did, documenting that ALTER TABLE can be used to update the references of the partitioned relations. The command is still useful with this behavior, and the tests I have added track that. Finally, I have reworked the docs, separating the limitations related to system catalogs and partitioned relations, to be more consistent with the notes at the end of the page. Thanks for working on this.
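The checks being centralized above (skip mapped relations, reject system relations unless allow_system_table_mods, reject pg_global for non-shared relations, no-op if the tablespace is unchanged) can be sketched as one decision function. This is illustrative Python, not the real CheckRelationTableSpaceMove() C API; the dict layout and names are assumptions for the example:

```python
GLOBALTABLESPACE = "pg_global"

def tablespace_move_allowed(rel, new_tablespace, allow_system_table_mods=False):
    """Rough decision sequence discussed above; returns True when the
    relfilenode should actually be created in a new tablespace."""
    if new_tablespace is None:            # no TABLESPACE clause given
        return False
    if rel["is_mapped"]:                  # mapped rels have no relfilenode
        return False                      # -> skipped (with a WARNING)
    if rel["is_system"] and not allow_system_table_mods:
        raise PermissionError('"%s" is a system catalog' % rel["name"])
    if new_tablespace == GLOBALTABLESPACE and not rel["is_shared"]:
        raise ValueError("cannot move non-shared relation to tablespace pg_global")
    return new_tablespace != rel["tablespace"]   # no-op if already there

idx = {"name": "foo_idx", "is_mapped": False, "is_system": False,
       "is_shared": False, "tablespace": "pg_default"}
print(tablespace_move_allowed(idx, "tst1"))        # True
print(tablespace_move_allowed(idx, "pg_default"))  # False: nothing to do
```

Centralizing the sequence in one place, as suggested, means the concurrent and non-concurrent paths cannot drift apart in which cases they reject.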
+ if (tablespacename != NULL) + { + params.tablespaceOid = get_tablespace_oid(tablespacename, false); + + /* Check permissions except when moving to database's default */ + if (OidIsValid(params.tablespaceOid) && This check for OidIsValid() seems to be excessive, since you moved the whole ACL check under 'if (tablespacename != NULL)' here. + params.tablespaceOid != MyDatabaseTableSpace) + { + AclResult aclresult; +CREATE INDEX regress_tblspace_test_tbl_idx ON regress_tblspace_test_tbl (num1); +-- move to global tablespace move fails Maybe 'move to global tablespace, fail', just to match the style of the previous comments. +REINDEX (TABLESPACE pg_global) INDEX regress_tblspace_test_tbl_idx; +SELECT relid, parentrelid, level FROM pg_partition_tree('tbspace_reindex_part_index') + ORDER BY relid, level; +SELECT relid, parentrelid, level FROM pg_partition_tree('tbspace_
Re: Free port choosing freezes when PostgresNode::use_tcp is used on BSD systems
On 2021-04-20 18:03, Tom Lane wrote: Andrew Dunstan writes: On 4/19/21 7:22 PM, Tom Lane wrote: I wonder whether we could get away with just replacing the $use_tcp test with $TestLib::windows_os. It's not really apparent to me why we should care about 127.0.0.not-1 on Unix-oid systems. Yeah The comment is a bit strange anyway - Cygwin is actually going to use Unix sockets, not TCP. I think I would just change the test to this: $use_tcp && $TestLib::windows_os. Works for me, but we need to revise the comment to match. Then it could be somewhat like that, I guess. Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Companydiff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm index db47a97d196..f7b488ed464 100644 --- a/src/test/perl/PostgresNode.pm +++ b/src/test/perl/PostgresNode.pm @@ -1191,19 +1191,19 @@ sub get_free_port # Check to see if anything else is listening on this TCP port. # Seek a port available for all possible listen_addresses values, # so callers can harness this port for the widest range of purposes. - # The 0.0.0.0 test achieves that for post-2006 Cygwin, which - # automatically sets SO_EXCLUSIVEADDRUSE. The same holds for MSYS (a - # Cygwin fork). Testing 0.0.0.0 is insufficient for Windows native - # Perl (https://stackoverflow.com/a/14388707), so we also test - # individual addresses. + # The 0.0.0.0 test achieves that for MSYS, which automatically sets + # SO_EXCLUSIVEADDRUSE. Testing 0.0.0.0 is insufficient for Windows + # native Perl (https://stackoverflow.com/a/14388707), so we also + # have to test individual addresses. Doing that for 127.0.0/24 + # addresses other than 127.0.0.1 might fail with EADDRNOTAVAIL on + # non-Linux, non-Windows kernels. # - # On non-Linux, non-Windows kernels, binding to 127.0.0/24 addresses - # other than 127.0.0.1 might fail with EADDRNOTAVAIL. Binding to - # 0.0.0.0 is unnecessary on non-Windows systems. 
+ # That way, 0.0.0.0 and individual 127.0.0/24 addresses are tested + # only on Windows when TCP usage is requested. if ($found == 1) { foreach my $addr (qw(127.0.0.1), -$use_tcp ? qw(127.0.0.2 127.0.0.3 0.0.0.0) : ()) +$use_tcp && $TestLib::windows_os ? qw(127.0.0.2 127.0.0.3 0.0.0.0) : ()) { if (!can_bind($addr, $port)) {
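The address-selection change in the Perl patch above is small but easy to misread, so here is the same logic as an illustrative Python sketch (a mirror of the patched condition, not the Perl code itself): 127.0.0.1 is always probed, while 0.0.0.0 and the extra 127.0.0/24 addresses are probed only when TCP is in use on Windows, because binding 127.0.0.2/3 can fail with EADDRNOTAVAIL on non-Linux, non-Windows kernels.

```python
def addresses_to_probe(use_tcp, windows_os):
    """Which addresses get_free_port should test a candidate port on,
    per the patched condition: $use_tcp && $TestLib::windows_os."""
    addrs = ["127.0.0.1"]                 # always checked
    if use_tcp and windows_os:
        # Extra loopback aliases plus the wildcard address, needed on
        # Windows where SO_EXCLUSIVEADDRUSE / native-Perl semantics make
        # a 0.0.0.0-only test insufficient.
        addrs += ["127.0.0.2", "127.0.0.3", "0.0.0.0"]
    return addrs

print(addresses_to_probe(use_tcp=True, windows_os=True))   # all four
print(addresses_to_probe(use_tcp=True, windows_os=False))  # just 127.0.0.1
```

On Unix-like systems with Unix-socket tests the extra probes are simply never attempted, which is the point of the fix.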
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On 04.11.2019 13:05, Kuntal Ghosh wrote: On Mon, Nov 4, 2019 at 3:32 PM Dilip Kumar wrote: So your result shows that with "streaming on", performance is degrading? By any chance did you try to see where is the bottleneck? Right. But, as we increase the logical_decoding_work_mem, the performance improves. I've not analyzed the bottleneck yet. I'm looking into the same. My guess is that 64 kB is just too small a value. In the table schema used for tests every row takes at least 24 bytes for storing column values. Thus, with this logical_decoding_work_mem value the limit should be hit after about 2500+ rows, or about 400 times during transaction of 100 rows size. It is just too frequent, while ReorderBufferStreamTXN includes a whole bunch of logic, e.g. it always starts an internal transaction: /* * Decoding needs access to syscaches et al., which in turn use * heavyweight locks and such. Thus we need to have enough state around to * keep track of those. The easiest way is to simply use a transaction * internally. That also allows us to easily enforce that nothing writes * to the database by checking for xid assignments. ... */ Also it issues separate stream_start/stop messages around each streamed transaction chunk. So if streaming starts and stops too frequently it adds additional overhead and may even interfere with the current in-progress transaction. If I get it correctly, then it is rather expected with too small values of logical_decoding_work_mem. Probably it may be optimized, but I am not sure that it is worth doing right now. Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company
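The back-of-the-envelope arithmetic behind the "2500+ rows" estimate can be checked directly (illustrative Python; the 24-bytes-per-row figure is the lower bound stated for the test schema, not a general constant):

```python
# With logical_decoding_work_mem = 64 kB and at least 24 bytes of decoded
# change data per row, the memory limit is reached after roughly 2500+
# rows, so a large transaction gets streamed in many small chunks.
work_mem_bytes = 64 * 1024
bytes_per_row = 24            # lower bound from the test table schema above

rows_per_chunk = work_mem_bytes // bytes_per_row
print(rows_per_chunk)         # 2730 -- i.e. "about 2500+ rows"
```

Each chunk pays fixed costs (the internal transaction start quoted above, plus the stream_start/stream_stop pair), which is consistent with throughput degrading at very small logical_decoding_work_mem settings and improving as the value grows.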
Re: Conflict handling for COPY FROM
On 11.11.2019 16:00, Surafel Temesgen wrote: Next, you use DestRemoteSimple for returning conflicting tuples back: + dest = CreateDestReceiver(DestRemoteSimple); + dest->rStartup(dest, (int) CMD_SELECT, tupDesc); However, printsimple supports a very limited subset of built-in types, so CREATE TABLE large_test (id integer primary key, num1 bigint, num2 double precision); COPY large_test FROM '/path/to/copy-test.tsv'; COPY large_test FROM '/path/to/copy-test.tsv' ERROR 3; fails with the following error 'ERROR: unsupported type OID: 701', which seems to be very confusing from the end user perspective. I've tried to switch to DestRemote, but couldn't figure it out quickly. fixed Thanks, now it works with my tests. 1) Maybe it is fine, but now I do not like this part: + portal = GetPortalByName(""); + dest = CreateDestReceiver(DestRemote); + SetRemoteDestReceiverParams(dest, portal); + dest->rStartup(dest, (int) CMD_SELECT, tupDesc); Here you implicitly use the fact that the portal with a blank name is always created in exec_simple_query before we get to this point. Next, you create a new DestReceiver and set it to this portal, but it is also already created and set in exec_simple_query. Would it be better if you just explicitly pass a ready DestReceiver to DoCopy (similarly to how it is done for T_ExecuteStmt / ExecuteQuery), as it may be required by COPY now? 2) My second concern is that you use three internal flags to track the error limit: + int error_limit; /* total number of error to ignore */ + bool ignore_error; /* is ignore error specified? */ + bool ignore_all_error; /* is error_limit -1 (ignore all error) + * specified? */ Though it seems that we can just leave error_limit as a user-defined constant and track errors with something like errors_count. In that case you do not need the auxiliary ignore_all_error flag. But probably it is a matter of personal choice. Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company
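The suggested simplification, one user-supplied constant plus a running counter instead of three flags, can be sketched as follows (illustrative Python, not the C patch; the class and method names are invented for the example):

```python
class CopyErrorTracker:
    """Track an ERROR_LIMIT budget for COPY FROM: error_limit stays the
    user-supplied constant (0 = none, -1 = unlimited) and errors_count
    counts errors seen so far, replacing the ignore_error /
    ignore_all_error boolean flags."""

    def __init__(self, error_limit=0):
        self.error_limit = error_limit
        self.errors_count = 0

    def ignore_next_error(self):
        if self.error_limit == -1:                 # unlimited: always ignore
            self.errors_count += 1
            return True
        if self.errors_count < self.error_limit:   # budget remaining
            self.errors_count += 1
            return True
        return False                               # budget exhausted: re-raise

t = CopyErrorTracker(error_limit=2)
print([t.ignore_next_error() for _ in range(3)])   # [True, True, False]
```

The comparison against `error_limit` answers both questions the flags encoded ("was a limit specified?" and "is it unlimited?") without extra state to keep consistent.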
Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly
Hi Steve, Thank you for review. On 17.11.2019 3:53, Steve Singer wrote: The following review has been posted through the commitfest application: make installcheck-world: tested, passed Implements feature: tested, failed Spec compliant: not tested Documentation:tested, failed * I had to replace heap_open/close with table_open/close to get the patch to compile against master In the documentation + + This specifies a tablespace, where all rebuilt indexes will be created. + Can be used only with REINDEX INDEX and + REINDEX TABLE, since the system indexes are not + movable, but SCHEMA, DATABASE or + SYSTEM very likely will has one. + I found the "SCHEMA,DATABASE or SYSTEM very likely will has one." portion confusing and would be inclined to remove it or somehow reword it. In the attached new version REINDEX with TABLESPACE and {SCHEMA, DATABASE, SYSTEM} now behaves more like with CONCURRENTLY, i.e. it skips unsuitable relations and shows warning. So this section in docs has been updated as well. Also the whole patch has been reworked. I noticed that my code in reindex_index was doing pretty much the same as inside RelationSetNewRelfilenode. So I just added a possibility to specify new tablespace for RelationSetNewRelfilenode instead. Thus, even with addition of new tests the patch becomes less complex. 
Consider the following - create index foo_bar_idx on foo(bar) tablespace pg_default; CREATE INDEX reindex=# \d foo Table "public.foo" Column | Type | Collation | Nullable | Default +-+---+--+- id | integer | | not null | bar| text| | | Indexes: "foo_pkey" PRIMARY KEY, btree (id) "foo_bar_idx" btree (bar) reindex=# reindex index foo_bar_idx tablespace tst1; REINDEX reindex=# reindex index foo_bar_idx tablespace pg_default; REINDEX reindex=# \d foo Table "public.foo" Column | Type | Collation | Nullable | Default +-+---+--+- id | integer | | not null | bar| text| | | Indexes: "foo_pkey" PRIMARY KEY, btree (id) "foo_bar_idx" btree (bar), tablespace "pg_default" It is a bit strange that it says "pg_default" as the tablespace. If I do this with an alter table to the table, moving the table back to pg_default makes it look as it did before. Otherwise the first patch seems fine. Yes, I missed the fact that the default tablespace of the database is stored implicitly as InvalidOid, but I was setting it explicitly as specified. I have changed this behavior to stay consistent with ALTER TABLE. With the second patch (for NOWAIT) I did the following T1: begin; T1: insert into foo select generate_series(1,1000); T2: reindex index foo_bar_idx set tablespace tst1 nowait; T2 is waiting for a lock. This isn't what I would expect. Indeed, I have added a nowait option for RangeVarGetRelidExtended, so it should not wait if the index is locked. However, for reindex we also have to put a share lock on the parent table relation, which is done by opening it via table_open(heapId, ShareLock). The only solution I can figure out right now is to wrap all such opens with ConditionalLockRelationOid(relId, ShareLock) and then do the actual open with NoLock. This is how something similar is implemented in VACUUM if VACOPT_SKIP_LOCKED is specified. However, there are multiple code paths with table_open, so it becomes a bit ugly. I will leave the second patch aside for now and experiment with it.
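The ConditionalLockRelationOid-then-open-with-NoLock idea described above is a try-lock pattern: attempt the lock without blocking and bail out immediately if someone else holds it, instead of queueing behind them. A hedged Python sketch of the flow (threading.Lock standing in for the parent table's ShareLock; names are illustrative, not PostgreSQL's):

```python
import threading

table_lock = threading.Lock()   # stand-in for the parent table's ShareLock

def reindex_nowait(do_work):
    """NOWAIT flow: try-lock, run the work only if the lock was free,
    otherwise return without waiting (as VACUUM does for SKIP_LOCKED)."""
    if not table_lock.acquire(blocking=False):   # ConditionalLockRelationOid
        return "skipped: lock not available"
    try:
        return do_work()                          # table_open(..., NoLock) + reindex
    finally:
        table_lock.release()

print(reindex_nowait(lambda: "reindexed"))   # lock is free -> work runs
table_lock.acquire()                         # simulate T1 holding a conflicting lock
print(reindex_nowait(lambda: "reindexed"))   # -> skipped, no blocking
table_lock.release()
```

The awkwardness noted in the email is that this conditional step must be replicated at every `table_open` call site on that path, whereas a plain blocking open needs no such wrapping.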
Actually, its main idea was to mimic the ALTER INDEX ... SET TABLESPACE [NOWAIT] syntax, but probably it is better to stick with the more brief, plain TABLESPACE like in CREATE INDEX. Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company P.S. I have also added all previous thread participants to CC in order not to split the thread. Sorry if it was a bad idea. >From 22990d58fb549536ca33a1b02c5a21a248deee5d Mon Sep 17 00:00:00 2001 From: Alexey Kondratov Date: Wed, 20 Nov 2019 20:09:50 +0300 Subject: [PATCH v4] Allow REINDEX and REINDEX CONCURRENTLY to change TABLESPACE --- doc/src/sgml/ref/reindex.sgml | 24 ++- src/backend/catalog/index.c | 26 +-- src/backend/commands/cluster.c| 2 +- src/backend/commands/indexcmds.c | 88 +++ src/backend/commands/sequence.c | 8 ++- src/backend/commands/tablecmds.c | 9 ++- src/backend/parser/gram.y | 21 -- src/backend/tcop/utility.c| 6 +- src/backend/utils/cache/relcache.c| 18 - src/include/catalog/index.h | 7 +- src/include/commands/defrem.h
Re: Conflict handling for COPY FROM
On 18.11.2019 9:42, Surafel Temesgen wrote: On Fri, Nov 15, 2019 at 6:24 PM Alexey Kondratov mailto:a.kondra...@postgrespro.ru>> wrote: 1) Maybe it is fine, but now I do not like this part: + portal = GetPortalByName(""); + dest = CreateDestReceiver(DestRemote); + SetRemoteDestReceiverParams(dest, portal); + dest->rStartup(dest, (int) CMD_SELECT, tupDesc); Here you implicitly use the fact that the portal with a blank name is always created in exec_simple_query before we get to this point. Next, you create a new DestReceiver and set it to this portal, but it is also already created and set in exec_simple_query. Would it be better if you just explicitly pass a ready DestReceiver to DoCopy (similarly to how it is done for T_ExecuteStmt / ExecuteQuery), Good idea. Thank you. Now the whole patch works exactly as expected for me and I cannot find any new technical flaws. However, the doc is rather vague, especially these places: + specifying it to -1 returns all error record. Actually, we return only rows with constraint violation, but malformed rows are ignored with a warning. I guess that we simply cannot return malformed rows back to the caller in the same way as with constraint violation, since we cannot figure out (in general) which column corresponds to which type if there are extra or missing columns. + and same record formatting error is ignored. I can get it, but it definitely should be reworded. What about something like this? + + ERROR_LIMIT + + + Enables ignoring of errored out rows up to limit_number. If limit_number is set + to -1, then all errors will be ignored. + + + + Currently, only unique or exclusion constraint violation + and rows formatting errors are ignored. Malformed + rows will raise warnings, while constraint violating rows + will be returned back to the caller. + + + + Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company
Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly
On 27.11.2019 6:54, Michael Paquier wrote: On Tue, Nov 26, 2019 at 11:09:55PM +0100, Masahiko Sawada wrote: I looked at v4 patch. Here are some comments: + /* Skip all mapped relations if TABLESPACE is specified */ + if (OidIsValid(tableSpaceOid) && + classtuple->relfilenode == 0) I think we can use OidIsValid(classtuple->relfilenode) instead. Yes, definitely. Yes, switched to !OidIsValid(classtuple->relfilenode). Also I added a comment that it is meant to be equivalent to RelationIsMapped() and extended tests. This change says that temporary relation is not supported but it actually seems to work. Which is correct? Yeah, I don't really see a reason why it would not work. My bad, I was keeping in mind RELATION_IS_OTHER_TEMP validation, but it is for temp tables of other backends only, so it definitely should not be in the doc. Removed. Your patch has forgotten to update copyfuncs.c and equalfuncs.c with the new tablespace string field. Fixed, thanks. It would be nice to add tab completion for this new clause in psql. Added. There is no need for opt_tablespace_name as a new node for the parsing grammar of gram.y as OptTableSpace is able to do the exact same job. Sure, it was an artifact from the times when I used an optional SET TABLESPACE clause. Removed. @@ -3455,6 +3461,8 @@ RelationSetNewRelfilenode(Relation relation, char persistence) */ newrnode = relation->rd_node; newrnode.relNode = newrelfilenode; + if (OidIsValid(tablespaceOid)) + newrnode.spcNode = newTablespaceOid; The core of the patch is actually here. It seems to me that this is a very bad idea because you actually hijack logic which happens at a much lower level, which is based on the state of the tablespace stored in the relation cache entry of the relation being reindexed; then the tablespace choice actually happens in RelationInitPhysicalAddr(), which runs for the new relfilenode once the follow-up CCI is done.
So this very likely needs more thoughts, and bringing to the point: shouldn't you actually be careful that the relation tablespace is correctly updated before reindexing it and before creating its new relfilenode? This way, RelationSetNewRelfilenode() does not need any additional work, and I think that this saves from potential bugs in the choice of the tablespace used with the new relfilenode. When I did the first version of the patch I was looking on ATExecSetTableSpace, which implements ALTER ... SET TABLESPACE. And there is very similar pipeline there: 1) Find pg_class entry with SearchSysCacheCopy1 2) Create new relfilenode with GetNewRelFileNode 3) Set new tablespace for this relfilenode 4) Do some work with new relfilenode 5) Update pg_class entry with new tablespace 6) Do CommandCounterIncrement The only difference is that point 3) and tablespace part of 5) were missing in RelationSetNewRelfilenode, so I added them, and I do 4) after 6) in REINDEX. Thus, it seems that in my implementation of tablespace change in REINDEX I am more sure that "the relation tablespace is correctly updated before reindexing", since I do reindex after CCI (point 6), doesn't it? So why it is fine for ATExecSetTableSpace to do pretty much the same, but not for REINDEX? Or the key point is in doing actual work before CCI, but for me it seems a bit against what you have wrote? Thus, I cannot get your point correctly here. Can you, please, elaborate a little bit more your concerns? ISTM the kind of above errors are the same: the given tablespace exists but moving tablespace to it is not allowed since it's not supported in PostgreSQL. So I think we can use ERRCODE_FEATURE_NOT_SUPPORTED instead of ERRCODE_INVALID_PARAMETER_VALUE (which is used at 3 places) . Yes, it is also not project style to use full sentences in error messages, so I would suggest instead (note the missing quotes in the original patch): cannot move non-shared relation to tablespace \"%s\" Same here. 
I have taken this validation directly from the tablecmds.c part for ALTER ... SET TABLESPACE. And there is exactly the same message "only shared relations can be placed in pg_global tablespace" with ERRCODE_INVALID_PARAMETER_VALUE there. However, I understand your point, but still: would it be better if I stick to the same ERRCODE/message, or should I introduce a new ERRCODE/message for the same case? And I somehow missed the timing of the review replies, as you did not have room to reply, so I have set the CF entry to "waiting on author" and bumped it to the next CF instead. Thank you! Attached is a patch that addresses all the issues above, except the last two points (the core part and the error messages for pg_global), which are not yet clear to me. -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company
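For readers following along, the six-step ATExecSetTableSpace pipeline described in this exchange can be sketched roughly as follows. This is C-like pseudocode using PostgreSQL internal function names; variable names are illustrative and it is not compilable as-is:

```
/* Sketch of the ALTER TABLE ... SET TABLESPACE pipeline (pseudocode) */
tuple = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(relid));          /* 1) */
newrelfilenode = GetNewRelFileNode(newTableSpace, NULL, persistence);  /* 2) */
newrnode.spcNode = newTableSpace;                                      /* 3) */
/* ... create storage and copy data for the new relfilenode ... */     /* 4) */
rd_rel->reltablespace = newTableSpace;                                 /* 5) */
CatalogTupleUpdate(pg_class_rel, &tuple->t_self, tuple);
CommandCounterIncrement();                                             /* 6) */
```

Per Alexey's description, steps 3) and the tablespace part of 5) are what his patch adds to RelationSetNewRelfilenode, with the actual rebuild (step 4) done after the CCI.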
Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly
On 02.12.2019 11:21, Michael Paquier wrote: On Wed, Nov 27, 2019 at 08:47:06PM +0300, Alexey Kondratov wrote: The only difference is that point 3) and the tablespace part of 5) were missing in RelationSetNewRelfilenode, so I added them, and I do 4) after 6) in REINDEX. Thus, it seems that in my implementation of the tablespace change in REINDEX I can be even more sure that "the relation tablespace is correctly updated before reindexing", since I do the reindex after the CCI (point 6), can't I? So why is it fine for ATExecSetTableSpace to do pretty much the same, but not for REINDEX? Or is the key point in doing the actual work before the CCI? That seems a bit against what you have written. Nope, the order is not the same as what you do here, and it causes a duplication in the tablespace selection within RelationSetNewRelfilenode() and when flushing the relation on the new tablespace for the first time after the CCI happens, please see below. And we should avoid that. Thus, I cannot get your point correctly here. Can you please elaborate a bit more on your concerns? The case of REINDEX CONCURRENTLY is pretty simple, because a new relation which is a copy of the old relation is created before doing the reindex, so you simply need to set the tablespace OID correctly in index_concurrently_create_copy(). And actually, I think that the computation is incorrect because we need to check against MyDatabaseTableSpace as well, no? No, the same logic already exists in heap_create: if (reltablespace == MyDatabaseTableSpace) reltablespace = InvalidOid; It is called by index_concurrently_create_copy -> index_create -> heap_create. The case of REINDEX is more tricky, because you are working on a relation that already exists, hence I think that you need to do a different thing before the actual REINDEX: 1) Update the existing relation's pg_class tuple to point to the new tablespace. 2) Do a CommandCounterIncrement. 
So I think that the order of the operations you are doing is incorrect, and that you have a risk of breaking the existing tablespace assignment logic done when first flushing a new relfilenode. This actually brings an extra thing: when doing a plain REINDEX you need to make sure that the past relfilenode of the relation goes away properly. The attached POC patch does that before doing the CCI, which is a bit ugly, but that's enough to show my point, and there is no need to touch RelationSetNewRelfilenode() this way. Thank you for the detailed answer and the PoC patch. I will recheck everything and dig deeper into this problem, and come up with something closer to the next commitfest (01.2020). Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company
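Michael's proposed order for plain REINDEX can be sketched in the same pseudocode form (PostgreSQL internal names, not compilable as-is):

```
/* 1) Point the relation's pg_class tuple at the new tablespace first;
 *    InvalidOid means "database default", mirroring heap_create(). */
rd_rel->reltablespace = (newTableSpace == MyDatabaseTableSpace) ?
                        InvalidOid : newTableSpace;
CatalogTupleUpdate(pg_class_rel, &tuple->t_self, tuple);

/* 2) Make the change visible. */
CommandCounterIncrement();

/* 3) Only now allocate the new relfilenode and rebuild the index;
 *    the relcache entry already points at the right tablespace, so
 *    RelationSetNewRelfilenode() needs no extra tablespace logic. */
RelationSetNewRelfilenode(iRel, persistence);
```

The point of this ordering is that the tablespace decision is made in exactly one place, avoiding the duplicated selection logic Michael objects to above.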
Re: [Patch] pg_rewind: options to use restore_command from recovery.conf or command line
On 01.12.2019 5:57, Michael Paquier wrote: On Thu, Sep 26, 2019 at 03:08:22PM +0300, Alexey Kondratov wrote: As Alvaro correctly pointed out in the nearby thread [1], we've got an interference regarding the -R command line argument. I agree that it's a good idea to reserve -R for the recovery configuration write to be consistent with pg_basebackup, so I've updated my patch to use other letters: The patch has rotten and does not apply anymore. Could you please send a rebased version? I have moved the patch to the next CF, waiting on author for now. A rebased and updated patch is attached. There was a problem with testing the new restore_command options together with the recent ensureCleanShutdown. My test simply moves all WAL from pg_wal and generates a restore_command for testing the new options, but this prevents the startup recovery required by ensureCleanShutdown. To test both options in the same test we would have to leave some recent WAL segments in pg_wal and make sure that they are enough for startup recovery, but not enough for a successful pg_rewind run. I have manually figured out the required amount of inserted records (and generated WAL) to achieve this. However, I think that this approach is not good for a test, since tests may be modified in the future (the amount of writes to the DB changed), or even the volume of WAL written by Postgres may change. That would lead to tests that always falsely fail or pass. Moreover, testing both ensureCleanShutdown and the new options at the same time doesn't hit the new code paths, so I decided to test the new options with --no-ensure-shutdown for simplicity and stability of the tests. 
Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company >From a05c3343e0bd6fe339c944f6b0cde64ceb46a0b3 Mon Sep 17 00:00:00 2001 From: Alexey Kondratov Date: Tue, 19 Feb 2019 19:14:53 +0300 Subject: [PATCH v11] pg_rewind: options to use restore_command from command line or cluster config Previously, when pg_rewind could not find required WAL files in the target data directory the rewind process would fail. One had to manually figure out which of required WAL files have already moved to the archival storage and copy them back. This patch adds possibility to specify restore_command via command line option or use one specified inside postgresql.conf. Specified restore_command will be used for automatic retrieval of missing WAL files from archival storage. --- doc/src/sgml/ref/pg_rewind.sgml | 49 +++- src/bin/pg_rewind/parsexlog.c | 164 +- src/bin/pg_rewind/pg_rewind.c | 118 +++--- src/bin/pg_rewind/pg_rewind.h | 6 +- src/bin/pg_rewind/t/001_basic.pl | 4 +- src/bin/pg_rewind/t/002_databases.pl | 4 +- src/bin/pg_rewind/t/003_extrafiles.pl | 4 +- src/bin/pg_rewind/t/RewindTest.pm | 105 - 8 files changed, 416 insertions(+), 38 deletions(-) diff --git a/doc/src/sgml/ref/pg_rewind.sgml b/doc/src/sgml/ref/pg_rewind.sgml index 42d29edd4e..b601a5c7e4 100644 --- a/doc/src/sgml/ref/pg_rewind.sgml +++ b/doc/src/sgml/ref/pg_rewind.sgml @@ -66,11 +66,12 @@ PostgreSQL documentation can be found either on the target timeline, the source timeline, or their common ancestor. In the typical failover scenario where the target cluster was shut down soon after the divergence, this is not a problem, but if the - target cluster ran for a long time after the divergence, the old WAL - files might no longer be present. In that case, they can be manually - copied from the WAL archive to the pg_wal directory, or - fetched on startup by configuring or - . 
The use of + target cluster ran for a long time after the divergence, its old WAL + files might no longer be present. In this case, you can manually copy them + from the WAL archive to the pg_wal directory, or run + pg_rewind with the -c or + -C option to automatically retrieve them from the WAL + archive. The use of pg_rewind is not limited to failover, e.g. a standby server can be promoted, run some write transactions, and then rewinded to become a standby again. @@ -232,6 +233,39 @@ PostgreSQL documentation + + -c + --restore-target-wal + + +Use the restore_command defined in +postgresql.conf to retrieve WAL files from +the WAL archive if these files are no longer available in the +pg_wal directory of the target cluster. + + +This option cannot be used together with --target-restore-command. + + + + + + -C restore_command + --target-restore-command=restore_command + + +Specifies the restore_command to use for retrieving +WAL files from the WAL archive if these files are no longer available +in the pg_wal directory of the target cluster. + + +If restore_co
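For illustration, a pg_rewind invocation using the two options documented in the patch above might look like this (data directory paths, connection strings, and the archive location are hypothetical):

```shell
# Use the restore_command from the target cluster's postgresql.conf
pg_rewind --target-pgdata=/var/lib/postgres/data \
          --source-server='host=new-primary port=5432 user=postgres' \
          --restore-target-wal

# Or pass the command explicitly on the command line
pg_rewind --target-pgdata=/var/lib/postgres/data \
          --source-server='host=new-primary port=5432 user=postgres' \
          --target-restore-command='cp /mnt/server/archivedir/%f "%p"'
```

The second form is useful when the server is normally started with -c config_file=..., so that postgresql.conf in the target data directory does not contain the effective restore_command.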
Re: [PATCH] Increase the maximum value track_activity_query_size
On 19.12.2019 20:52, Robert Haas wrote: On Thu, Dec 19, 2019 at 10:59 AM Tom Lane wrote: Bruce Momjian writes: Good question. I am in favor of allowing a larger value if no one objects. I don't think adding the min/max is helpful. The original poster, and probably anyone else who debugs stuck queries from yet another crazy ORM. Yes, one could use log_min_duration_statement, but being able to get the query directly from pg_stat_activity without eyeballing the logs is nice. Also, IIRC log_min_duration_statement applies only to completed statements. I think there are pretty obvious performance and memory-consumption penalties to very large track_activity_query_size values. Who exactly are we really helping if we let them set it to huge values? (wanders away wondering if we have suitable integer-overflow checks in relevant code paths...) The value of pgstat_track_activity_query_size is in bytes, so setting it to any value below INT_MAX seems to be safe from that perspective. However, since it is multiplied by NumBackendStatSlots, its reasonable value should be far below INT_MAX (~2 GB). Honestly, it does not look to me like something badly needed, but still: we already have hundreds of GUCs and it is easy for a user to build a sub-optimal configuration, so does this overprotection make sense? Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company
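To illustrate the use case being argued for, here is a hypothetical configuration and query (the value 102400 is an example only, and assumes the larger maximum under discussion is allowed):

```sql
-- postgresql.conf; changing this GUC requires a server restart:
--   track_activity_query_size = 102400   # e.g. 100kB for verbose ORM-generated queries

-- With a larger buffer, a stuck statement can be inspected in full
-- directly from pg_stat_activity instead of eyeballing the logs:
SELECT pid, now() - query_start AS runtime, query
FROM pg_stat_activity
WHERE state = 'active'
ORDER BY runtime DESC;
```

The memory cost mentioned above is the buffer size multiplied by the number of backend status slots, which is why very large values are questioned.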
Physical replication slot advance is not persistent
Hi Hackers, I have accidentally noticed that pg_replication_slot_advance only changes the in-memory state of the slot when its type is physical. Its new value does not survive a restart. Reproduction steps: 1) Create a new slot and remember its restart_lsn SELECT pg_create_physical_replication_slot('slot1', true); SELECT * from pg_replication_slots; 2) Generate some dummy WAL CHECKPOINT; SELECT pg_switch_wal(); CHECKPOINT; SELECT pg_switch_wal(); 3) Advance the slot to the value of pg_current_wal_insert_lsn() SELECT pg_replication_slot_advance('slot1', '0/160001A0'); 4) Check that restart_lsn has been updated SELECT * from pg_replication_slots; 5) Restart the server and check restart_lsn again. It will be the same as in step 1. I dug into the code, and it happens because of this if statement: /* Update the on disk state when lsn was updated. */ if (XLogRecPtrIsInvalid(endlsn)) { ReplicationSlotMarkDirty(); ReplicationSlotsComputeRequiredXmin(false); ReplicationSlotsComputeRequiredLSN(); ReplicationSlotSave(); } Actually, endlsn is always a valid LSN after the execution of the replication slot advance guts. It works for logical slots only by chance, since there is an implicit ReplicationSlotMarkDirty() call inside LogicalConfirmReceivedLocation. Attached is a small patch, which fixes this bug. I have tried to stick to the same logic in this 'if (XLogRecPtrIsInvalid(endlsn))', and now pg_logical_replication_slot_advance and pg_physical_replication_slot_advance return InvalidXLogRecPtr if it was a no-op. What do you think? Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company P.S. CCed Simon and Michael as they are the last who seriously touched the pg_replication_slot_advance code. 
>From 36d1fa2a89b3fb354a813354496df475ee11b62e Mon Sep 17 00:00:00 2001 From: Alexey Kondratov Date: Tue, 24 Dec 2019 18:21:50 +0300 Subject: [PATCH v1] Make phsycal replslot advance persistent --- src/backend/replication/slotfuncs.c | 17 + 1 file changed, 9 insertions(+), 8 deletions(-) diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c index 46e6dd4d12..826708d3f6 100644 --- a/src/backend/replication/slotfuncs.c +++ b/src/backend/replication/slotfuncs.c @@ -358,12 +358,14 @@ pg_get_replication_slots(PG_FUNCTION_ARGS) * The LSN position to move to is compared simply to the slot's restart_lsn, * knowing that any position older than that would be removed by successive * checkpoints. + * + * Returns InvalidXLogRecPtr if no-op. */ static XLogRecPtr pg_physical_replication_slot_advance(XLogRecPtr moveto) { XLogRecPtr startlsn = MyReplicationSlot->data.restart_lsn; - XLogRecPtr retlsn = startlsn; + XLogRecPtr retlsn = InvalidXLogRecPtr; if (startlsn < moveto) { @@ -386,6 +388,8 @@ pg_physical_replication_slot_advance(XLogRecPtr moveto) * because we need to digest WAL to advance restart_lsn allowing to recycle * WAL and removal of old catalog tuples. As decoding is done in fast_forward * mode, no changes are generated anyway. + * + * Returns InvalidXLogRecPtr if no-op. 
*/ static XLogRecPtr pg_logical_replication_slot_advance(XLogRecPtr moveto) @@ -393,7 +397,7 @@ pg_logical_replication_slot_advance(XLogRecPtr moveto) LogicalDecodingContext *ctx; ResourceOwner old_resowner = CurrentResourceOwner; XLogRecPtr startlsn; - XLogRecPtr retlsn; + XLogRecPtr retlsn = InvalidXLogRecPtr; PG_TRY(); { @@ -414,9 +418,6 @@ pg_logical_replication_slot_advance(XLogRecPtr moveto) */ startlsn = MyReplicationSlot->data.restart_lsn; - /* Initialize our return value in case we don't do anything */ - retlsn = MyReplicationSlot->data.confirmed_flush; - /* invalidate non-timetravel entries */ InvalidateSystemCaches(); @@ -480,9 +481,9 @@ pg_logical_replication_slot_advance(XLogRecPtr moveto) * better than always losing the position even on clean restart. */ ReplicationSlotMarkDirty(); - } - retlsn = MyReplicationSlot->data.confirmed_flush; + retlsn = MyReplicationSlot->data.confirmed_flush; + } /* free context, call shutdown callback */ FreeDecodingContext(ctx); @@ -575,7 +576,7 @@ pg_replication_slot_advance(PG_FUNCTION_ARGS) nulls[0] = false; /* Update the on disk state when lsn was updated. */ - if (XLogRecPtrIsInvalid(endlsn)) + if (!XLogRecPtrIsInvalid(endlsn)) { ReplicationSlotMarkDirty(); ReplicationSlotsComputeRequiredXmin(false); -- 2.17.1
Re: Physical replication slot advance is not persistent
On 25.12.2019 07:03, Kyotaro Horiguchi wrote: At Tue, 24 Dec 2019 20:12:32 +0300, Alexey Kondratov wrote in I dug into the code, and it happens because of this if statement: /* Update the on disk state when lsn was updated. */ if (XLogRecPtrIsInvalid(endlsn)) { ReplicationSlotMarkDirty(); ReplicationSlotsComputeRequiredXmin(false); ReplicationSlotsComputeRequiredLSN(); ReplicationSlotSave(); } Yes, it seems just broken. Attached is a small patch, which fixes this bug. I have tried to stick to the same logic in this 'if (XLogRecPtrIsInvalid(endlsn))', and now pg_logical_replication_slot_advance and pg_physical_replication_slot_advance return InvalidXLogRecPtr if it was a no-op. What do you think? I think we shouldn't change the definition of pg_*_replication_slot_advance since the result is user-facing. Yes, that was my main concern too. OK. The functions return an invalid value only when the slot had the invalid value and failed to move the position. I think that happens only for uninitialized slots. Anyway, what we should do there is dirty the slot when the operation can be assumed to have succeeded. As a result, I think what is needed there is just checking whether the returned lsn is equal to or larger than moveto. Doesn't the following change work? - if (XLogRecPtrIsInvalid(endlsn)) + if (moveto <= endlsn) Yep, it helps with physical replication slot persistence after advance, but the whole validation (moveto <= endlsn) does not make sense to me. The value of moveto should be >= minlsn == confirmed_flush / restart_lsn, while endlsn == retlsn is also always initialized with confirmed_flush / restart_lsn. Thus, your condition seems to be true in any case, even if it was a no-op, which is what we intended to catch. Actually, if we do not want to change pg_*_replication_slot_advance, we can just add a straightforward validation that either confirmed_flush or restart_lsn has changed after the slot advance guts have executed. 
It will be a little bulky, but much clearer, and it will never be affected by changes in the pg_*_replication_slot_advance logic. Another weird part I have found is this assignment inside pg_logical_replication_slot_advance: /* Initialize our return value in case we don't do anything */ retlsn = MyReplicationSlot->data.confirmed_flush; It looks redundant, since later we do the same assignment, which should be reachable in any case. I will recheck everything again and try to come up with something during this week. Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company
Re: Physical replication slot advance is not persistent
On 25.12.2019 16:51, Alexey Kondratov wrote: On 25.12.2019 07:03, Kyotaro Horiguchi wrote: As the result I think what is needed there is just checking if the returned lsn is equal or larger than moveto. Doen't the following change work? - if (XLogRecPtrIsInvalid(endlsn)) + if (moveto <= endlsn) Yep, it helps with physical replication slot persistence after advance, but the whole validation (moveto <= endlsn) does not make sense for me. The value of moveto should be >= than minlsn == confirmed_flush / restart_lsn, while endlsn == retlsn is also always initialized with confirmed_flush / restart_lsn. Thus, your condition seems to be true in any case, even if it was no-op one, which we were intended to catch. I will recheck everything again and try to come up with something during this week. If I get it correctly, then we already keep previous slot position in the minlsn, so we just have to compare endlsn with minlsn and treat endlsn <= minlsn as a no-op without slot state flushing. Attached is a patch that does this, so it fixes the bug without affecting any user-facing behavior. Detailed comment section and DEBUG output are also added. What do you think now? I have also forgotten to mention that all versions down to 11.0 should be affected with this bug. 
Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company >From e08299ddf92abc3fb4e802e8b475097fa746c458 Mon Sep 17 00:00:00 2001 From: Alexey Kondratov Date: Wed, 25 Dec 2019 20:12:42 +0300 Subject: [PATCH v2] Make physical replslot advance persistent --- src/backend/replication/slotfuncs.c | 12 ++-- 1 file changed, 10 insertions(+), 2 deletions(-) diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c index 6683fc3f9b..bc5c93b089 100644 --- a/src/backend/replication/slotfuncs.c +++ b/src/backend/replication/slotfuncs.c @@ -573,9 +573,17 @@ pg_replication_slot_advance(PG_FUNCTION_ARGS) values[0] = NameGetDatum(&MyReplicationSlot->data.name); nulls[0] = false; - /* Update the on disk state when lsn was updated. */ - if (XLogRecPtrIsInvalid(endlsn)) + /* + * Update the on disk state when LSN was updated. Here we rely on the facts + * that: 1) minlsn is initialized with restart_lsn and confirmed_flush LSN for + * physical and logical replication slot respectively, and 2) endlsn is set in + * the same way by pg_*_replication_slot_advance, but after advance. Thus, + * endlsn <= minlsn is treated as a no-op. + */ + if (endlsn > minlsn) { + elog(DEBUG1, "flushing replication slot '%s' state", + NameStr(MyReplicationSlot->data.name)); ReplicationSlotMarkDirty(); ReplicationSlotsComputeRequiredXmin(false); ReplicationSlotsComputeRequiredLSN(); base-commit: 8ce3aa9b5914d1ac45ed3f9bc484f66b3c4850c7 -- 2.17.1
Re: Physical replication slot advance is not persistent
On 26.12.2019 11:33, Kyotaro Horiguchi wrote: At Wed, 25 Dec 2019 20:28:04 +0300, Alexey Kondratov wrote in Yep, it helps with physical replication slot persistence after advance, but the whole validation (moveto <= endlsn) does not make sense to me. The value of moveto should be >= minlsn == confirmed_flush / restart_lsn, while endlsn == retlsn is also always initialized with confirmed_flush / restart_lsn. Thus, your condition seems to be true in any case, even if it was a no-op, which is what we intended to catch. ... If I understand correctly, then we already keep the previous slot position in minlsn, so we just have to compare endlsn with minlsn and treat endlsn <= minlsn as a no-op without flushing the slot state. I think you're right about the condition. (endlsn cannot be less than minlsn, though) But I came to think that we shouldn't use locations in that decision. Attached is a patch that does this, so it fixes the bug without affecting any user-facing behavior. A detailed comment section and DEBUG output are also added. What do you think now? I have also forgotten to mention that all versions down to 11.0 should be affected by this bug. pg_replication_slot_advance is the only caller of pg_logical/physical_replication_slot_advance, so there's no apparent determinant on who does what about dirtying and other housekeeping calculations like the *ComputeRequired*() functions, but the current shape seems kind of inconsistent between logical and physical. I think pg_logical/physical_replication_slot_advance should dirty the slot if they actually changed anything. And pg_replication_slot_advance should do the housekeeping if the slots are dirtied. (Otherwise the caller function should dirty the slot in lieu of the two.) The attached does that. Both approaches look fine to me: my last patch, with as minimal an intervention as possible, and your refactoring. I think that it is the right direction to let everyone who modifies slot->data also mark the slot as dirty. 
I found one comment section in your code rather misleading: + /* + * We don't need to dirty the slot only for the above change, but dirty + * this slot for the same reason with + * pg_logical_replication_slot_advance. + */ We just modified MyReplicationSlot->data, which is "On-Disk data of a replication slot, preserved across restarts.", so it definitely should be marked as dirty, not because pg_logical_replication_slot_advance does the same. Also, I think that using this transient variable in ReplicationSlotIsDirty is not necessary. MyReplicationSlot is already a pointer to the slot in shared memory. + ReplicationSlot *slot = MyReplicationSlot; + + Assert(MyReplicationSlot != NULL); + + SpinLockAcquire(&slot->mutex); Otherwise it looks fine to me, so attached is the same diff, but with these proposed corrections. Another concern is that ReplicationSlotIsDirty is added with only one user. It also cannot be used by SaveSlotToPath due to the simultaneous usage of both flags dirty and just_dirtied there. Given that, I think we should call ReplicationSlotSave unconditionally in pg_replication_slot_advance, so the slot will be saved or not automatically based on the slot->dirty flag. At the same time, ReplicationSlotsComputeRequiredXmin and ReplicationSlotsComputeRequiredLSN should be called by anyone who modifies the xmin and LSN fields in the slot. Otherwise we end up with some leaky abstractions. Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c index 21ae8531b3..edf661521a 100644 --- a/src/backend/replication/slot.c +++ b/src/backend/replication/slot.c @@ -672,6 +672,23 @@ ReplicationSlotMarkDirty(void) SpinLockRelease(&slot->mutex); } +/* + * Verify whether currently acquired slot is dirty. 
+ */ +bool +ReplicationSlotIsDirty(void) +{ + bool dirty; + + Assert(MyReplicationSlot != NULL); + + SpinLockAcquire(&MyReplicationSlot->mutex); + dirty = MyReplicationSlot->dirty; + SpinLockRelease(&MyReplicationSlot->mutex); + + return dirty; +} + /* * Convert a slot that's marked as RS_EPHEMERAL to a RS_PERSISTENT slot, * guaranteeing it will be there after an eventual crash. diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c index 6683fc3f9b..d7a16a9071 100644 --- a/src/backend/replication/slotfuncs.c +++ b/src/backend/replication/slotfuncs.c @@ -370,6 +370,12 @@ pg_physical_replication_slot_advance(XLogRecPtr moveto) MyReplicationSlot->data.restart_lsn = moveto; SpinLockRelease(&MyReplicationSlot->mutex); retlsn = moveto; + + /* + * Dirty the slot as we updated data that is meant to be + * persistent on disk. + */ + ReplicationSlotMarkDirty(); } retur
Re: Supply restore_command to pg_rewind via CLI argument
Hi, On Tue, Mar 22, 2022 at 3:32 AM Andres Freund wrote: > > Doesn't apply once more: http://cfbot.cputube.org/patch_37_3213.log > Thanks for the reminder, a rebased version is attached. Regards -- Alexey Kondratov From df56b5c7b882e781fdc0b92e7a83331f0baab094 Mon Sep 17 00:00:00 2001 From: Alexey Kondratov Date: Tue, 29 Jun 2021 17:17:47 +0300 Subject: [PATCH v4] Allow providing restore_command as a command line option to pg_rewind This could be useful when postgres is usually run with -c config_file=..., so the actual configuration and restore_command is not inside $PGDATA/postgresql.conf. --- doc/src/sgml/ref/pg_rewind.sgml | 19 + src/bin/pg_rewind/pg_rewind.c| 45 ++- src/bin/pg_rewind/t/001_basic.pl | 1 + src/bin/pg_rewind/t/RewindTest.pm| 95 ++-- src/test/perl/PostgreSQL/Test/Cluster.pm | 5 +- 5 files changed, 106 insertions(+), 59 deletions(-) diff --git a/doc/src/sgml/ref/pg_rewind.sgml b/doc/src/sgml/ref/pg_rewind.sgml index 33e6bb64ad..af75f35867 100644 --- a/doc/src/sgml/ref/pg_rewind.sgml +++ b/doc/src/sgml/ref/pg_rewind.sgml @@ -241,6 +241,25 @@ PostgreSQL documentation + + -C restore_command + --target-restore-command=restore_command + + +Specifies the restore_command to use for retrieving +WAL files from the WAL archive if these files are no longer available +in the pg_wal directory of the target cluster. + + +If restore_command is already set in +postgresql.conf, you can provide the +--restore-target-wal option instead. If both options +are provided, then --target-restore-command +will be used. 
+ + + + --debug diff --git a/src/bin/pg_rewind/pg_rewind.c b/src/bin/pg_rewind/pg_rewind.c index b39b5c1aac..9aca041425 100644 --- a/src/bin/pg_rewind/pg_rewind.c +++ b/src/bin/pg_rewind/pg_rewind.c @@ -85,21 +85,22 @@ usage(const char *progname) printf(_("%s resynchronizes a PostgreSQL cluster with another copy of the cluster.\n\n"), progname); printf(_("Usage:\n %s [OPTION]...\n\n"), progname); printf(_("Options:\n")); - printf(_(" -c, --restore-target-wal use restore_command in target configuration to\n" - " retrieve WAL files from archives\n")); - printf(_(" -D, --target-pgdata=DIRECTORY existing data directory to modify\n")); - printf(_(" --source-pgdata=DIRECTORY source data directory to synchronize with\n")); - printf(_(" --source-server=CONNSTRsource server to synchronize with\n")); - printf(_(" -n, --dry-run stop before modifying anything\n")); - printf(_(" -N, --no-sync do not wait for changes to be written\n" - " safely to disk\n")); - printf(_(" -P, --progress write progress messages\n")); - printf(_(" -R, --write-recovery-conf write configuration for replication\n" - " (requires --source-server)\n")); - printf(_(" --debugwrite a lot of debug messages\n")); - printf(_(" --no-ensure-shutdown do not automatically fix unclean shutdown\n")); - printf(_(" -V, --version output version information, then exit\n")); - printf(_(" -?, --help show this help, then exit\n")); + printf(_(" -c, --restore-target-wal use restore_command in target configuration to\n" + "retrieve WAL files from archives\n")); + printf(_(" -C, --target-restore-command=COMMAND target WAL restore_command\n")); + printf(_(" -D, --target-pgdata=DIRECTORY existing data directory to modify\n")); + printf(_(" --source-pgdata=DIRECTORY source data directory to synchronize with\n")); + printf(_(" --source-server=CONNSTR source server to synchronize with\n")); + printf(_(" -n, --dry-run stop before modifying anything\n")); + printf(_(" -N, --no-sync do not wait for changes to be written\n" + "safely 
to disk\n")); + printf(_(" -P, --progresswrite progress messages\n")); + printf(_(" -R, --write-recovery-conf write configuration for replication\n" + "(requires --source-server)\n")); + printf(_(" --debug write a lot of debug messages\n")); + printf(_(" --no-ensure-shutdown do not automatically
Re: Printing LSN made easy
Hi, On 2020-11-27 13:40, Ashutosh Bapat wrote: Off list Peter Eisentraut pointed out that we can not use these macros in elog/ereport since it creates problems for translations. He suggested adding functions which return strings and use %s when doing so. The patch has two functions: pg_lsn_out_internal(), which takes an LSN as input and returns a palloc'ed string containing the string representation of the LSN. This may not be suitable in performance critical paths and also may leak memory if not freed. So there's another function, pg_lsn_out_buffer(), which takes an LSN and a char array as input, fills the char array with the string representation, and returns the pointer to the char array. This allows the function to be used as an argument in printf/elog etc. The macro MAXPG_LSNLEN has been externalized for this purpose. If usage of macros in elog/ereport can cause problems for translation, then even with this patch life does not get significantly simpler. For example, instead of just doing something like: elog(WARNING, - "xlog min recovery request %X/%X is past current point %X/%X", - (uint32) (lsn >> 32), (uint32) lsn, - (uint32) (newMinRecoveryPoint >> 32), - (uint32) newMinRecoveryPoint); + "xlog min recovery request " LSN_FORMAT " is past current point " LSN_FORMAT, + LSN_FORMAT_ARG(lsn), + LSN_FORMAT_ARG(newMinRecoveryPoint)); we have to either declare two additional local buffers, which is verbose; or use pg_lsn_out_internal() and rely on memory contexts (or do pfree() manually, which is verbose again) to prevent memory leaks. Off list Craig Ringer suggested introducing a new format specifier similar to %m for LSN, but I did not get time to take a look at the relevant code. AFAIU it's available only to elog/ereport, so it may not be useful generally. But teaching printf variants about the new format would be the best solution. However, I didn't find any way to do that. It seems that this topic has been extensively discussed off-list, but still a strong +1 for the patch. 
I always wanted LSN printing to be more concise. I have just tried new printing utilities in a couple of new places and it looks good to me. +char * +pg_lsn_out_internal(XLogRecPtr lsn) +{ + charbuf[MAXPG_LSNLEN + 1]; + + snprintf(buf, sizeof(buf), LSN_FORMAT, LSN_FORMAT_ARG(lsn)); + + return pstrdup(buf); +} Would it be a bit more straightforward if we palloc buf initially and just return a pointer instead of doing pstrdup()? Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres CompanyFrom 698e481f5f55b967b5c60dba4bc577f8baa20ff4 Mon Sep 17 00:00:00 2001 From: Ashutosh Bapat Date: Fri, 16 Oct 2020 17:09:29 +0530 Subject: [PATCH] Make it easy to print LSN The commit introduces following macros and functions to make it easy to use LSNs in printf variants, elog, ereport and appendStringInfo variants. LSN_FORMAT - macro representing the format in which LSN is printed LSN_FORMAT_ARG - macro to pass LSN as an argument to the above format pg_lsn_out_internal - a function which returns palloc'ed char array containing string representation of given LSN. pg_lsn_out_buffer - similar to above but accepts and returns a char array of size (MAXPG_LSNLEN + 1) The commit also has some example usages of these. 
Ashutosh Bapat --- contrib/pageinspect/rawpage.c| 3 +- src/backend/access/rmgrdesc/replorigindesc.c | 5 +- src/backend/access/rmgrdesc/xlogdesc.c | 3 +- src/backend/access/transam/xlog.c| 8 ++-- src/backend/utils/adt/pg_lsn.c | 49 ++-- src/include/access/xlogdefs.h| 7 +++ src/include/utils/pg_lsn.h | 3 ++ 7 files changed, 55 insertions(+), 23 deletions(-) diff --git a/contrib/pageinspect/rawpage.c b/contrib/pageinspect/rawpage.c index c0181506a5..2cd055a5f0 100644 --- a/contrib/pageinspect/rawpage.c +++ b/contrib/pageinspect/rawpage.c @@ -261,8 +261,7 @@ page_header(PG_FUNCTION_ARGS) { char lsnchar[64]; - snprintf(lsnchar, sizeof(lsnchar), "%X/%X", - (uint32) (lsn >> 32), (uint32) lsn); + snprintf(lsnchar, sizeof(lsnchar), LSN_FORMAT, LSN_FORMAT_ARG(lsn)); values[0] = CStringGetTextDatum(lsnchar); } else diff --git a/src/backend/access/rmgrdesc/replorigindesc.c b/src/backend/access/rmgrdesc/replorigindesc.c index 19e14f910b..a3f49b5750 100644 --- a/src/backend/access/rmgrdesc/replorigindesc.c +++ b/src/backend/access/rmgrdesc/replorigindesc.c @@ -29,10 +29,9 @@ replorigin_desc(StringInfo buf, XLogReaderState *record) xlrec = (xl_replorigin_set *) rec; -appendStringInfo(buf, "set %u; lsn %X/%X; force: %d", +appendStringInfo(buf, "set %u; lsn " LSN_FORMAT "; force:
Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly
On 2020-11-30 14:33, Michael Paquier wrote:

On Tue, Nov 24, 2020 at 09:31:23AM -0600, Justin Pryzby wrote:
@cfbot: rebased

Catching up with the activity here, I can see four different things in the patch set attached:
1) Refactoring of the grammar of CLUSTER, VACUUM, ANALYZE and REINDEX to support values in parameters.
2) Tablespace change for REINDEX.
3) Tablespace change for VACUUM FULL/CLUSTER.
4) Tablespace change for indexes with VACUUM FULL/CLUSTER.
I am not sure yet about the last three points, so let's begin with 1) that is dealt with in 0001 and 0002. I have spent some time on 0001, renaming the rule names to be less generic than "common", and applied it. 0002 looks to be in rather good shape, still there are a few things that have caught my eyes. I'll look at that more closely tomorrow.

Thanks. I have rebased the remaining patches on top of 873ea9ee to use 'utility_option_list' instead of 'common_option_list'.

Regards
--
Alexey Kondratov
Postgres Professional
https://www.postgrespro.com
Russian Postgres Company

From ac3b77aec26a40016784ada9dab8b9059f424fa4 Mon Sep 17 00:00:00 2001
From: Justin Pryzby
Date: Tue, 31 Mar 2020 20:35:41 -0500
Subject: [PATCH v31 5/5] Implement vacuum full/cluster (INDEX_TABLESPACE )

---
 doc/src/sgml/ref/cluster.sgml             | 12 -
 doc/src/sgml/ref/vacuum.sgml              | 12 -
 src/backend/commands/cluster.c            | 64 ++-
 src/backend/commands/matview.c            | 3 +-
 src/backend/commands/tablecmds.c          | 2 +-
 src/backend/commands/vacuum.c             | 46 +++-
 src/backend/postmaster/autovacuum.c       | 1 +
 src/include/commands/cluster.h            | 6 ++-
 src/include/commands/vacuum.h             | 5 +-
 src/test/regress/input/tablespace.source  | 13 +
 src/test/regress/output/tablespace.source | 20 +++
 11 files changed, 123 insertions(+), 61 deletions(-)

diff --git a/doc/src/sgml/ref/cluster.sgml b/doc/src/sgml/ref/cluster.sgml
index cbfc0582be..6781e3a025 100644
--- a/doc/src/sgml/ref/cluster.sgml
+++ b/doc/src/sgml/ref/cluster.sgml
@@ -28,6 +28,7 @@ CLUSTER [VERBOSE] [ ( option [, ...
 VERBOSE [ boolean ]
 TABLESPACE new_tablespace
+INDEX_TABLESPACE new_tablespace
@@ -105,6 +106,15 @@ CLUSTER [VERBOSE] [ ( option [, ...
+
+   INDEX_TABLESPACE
+
+    
+     Specifies that the table's indexes will be rebuilt on a new tablespace.
+    
+
+
 table_name
@@ -141,7 +151,7 @@ CLUSTER [VERBOSE] [ ( option [, ...
 new_tablespace
-      The tablespace where the table will be rebuilt.
+      The tablespace where the table or its indexes will be rebuilt.

diff --git a/doc/src/sgml/ref/vacuum.sgml b/doc/src/sgml/ref/vacuum.sgml
index 5261a7c727..28cab119b6 100644
--- a/doc/src/sgml/ref/vacuum.sgml
+++ b/doc/src/sgml/ref/vacuum.sgml
@@ -36,6 +36,7 @@ VACUUM [ FULL ] [ FREEZE ] [ VERBOSE ] [ ANALYZE ] [ boolean ]
 PARALLEL integer
 TABLESPACE new_tablespace
+INDEX_TABLESPACE new_tablespace

 and table_and_columns is:
@@ -265,6 +266,15 @@ VACUUM [ FULL ] [ FREEZE ] [ VERBOSE ] [ ANALYZE ] [ boolean
@@ -314,7 +324,7 @@ VACUUM [ FULL ] [ FREEZE ] [ VERBOSE ] [ ANALYZE ] [ new_tablespace
-      The tablespace where the relation will be rebuilt.
+      The tablespace where the relation or its indexes will be rebuilt.

diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index b289a76d58..0f9f09a15a 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -71,7 +71,7 @@ typedef struct
 static void rebuild_relation(Relation OldHeap, Oid indexOid, bool verbose,
-							 Oid NewTableSpaceOid);
+							 Oid NewTableSpaceOid, Oid NewIdxTableSpaceOid);
 static void copy_table_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex,
							bool verbose, bool *pSwapToastByContent,
							TransactionId *pFreezeXid, MultiXactId *pCutoffMulti);
@@ -107,9 +107,11 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
 {
 	ListCell   *lc;
 	int			options = 0;
-	/* Name and Oid of tablespace to use for clustered relation. */
-	char	   *tablespaceName = NULL;
-	Oid			tablespaceOid = InvalidOid;
+	/* Name and Oid of tablespaces to use for clustered relations.
+	 */
+	char	   *tablespaceName = NULL,
+			   *idxtablespaceName = NULL;
+	Oid			tablespaceOid,
+				idxtablespaceOid;

 	/* Parse list of generic parameters not handled by the parser */
 	foreach(lc, stmt->params)
@@ -123,6 +125,8 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
 			options &= ~CLUOPT_VERBOSE;
 		else if (strcmp(opt->defname, "tablespace") == 0)
 			tablespaceName = defGetString(opt);
+		else if (strcmp(opt->defname, "index_tablespace") == 0)
+			idxtablespaceName = defGetString(opt);
 		else
 			ereport(ERROR,
					(e
Re: Notes on physical replica failover with logical publisher or subscriber
Hi Craig,

On 2020-11-30 06:59, Craig Ringer wrote:
https://wiki.postgresql.org/wiki/Logical_replication_and_physical_standby_failover

Thank you for sharing these notes. I have not dealt a lot with physical/logical replication interoperability, so most of these problems were new to me.

One point from the wiki page seems clear enough to me:

```
Logical slots can fill pg_wal and can't benefit from archiving. Teach the logical decoding page read callback how to use the restore_command to retrieve WAL segs temporarily if they're not found in pg_wal...
```

It does not look like a big deal to teach the logical decoding process to use restore_command, but I have some doubts about how everything will perform once we start getting WAL from the archive for decoding purposes. If we have started using restore_command, then the subscriber has already lagged long enough to exceed max_slot_wal_keep_size. Taking into account that getting WAL files from the archive has an additional overhead, and that the primary continues generating (and archiving) new segments, there is a possibility that the primary ends up doing this double duty forever --- archiving each WAL file first and then fetching it back for decoding when requested. Another problem is that there may be several active decoders, IIRC, so they had better communicate in order to avoid fetching the same segment twice.

I tried to address many of these issues with failover slots, but I am not trying to beat that dead horse now. I know that at least some people here are of the opinion that effort shouldn't go into logical/physical replication interoperation anyway - that we should instead address the remaining limitations in logical replication so that it can provide complete HA capabilities without use of physical replication. So for now I'm just trying to save others who go looking into these issues some time and warn them about some of the less obvious booby-traps.
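For reference, the restore_command under discussion is the ordinary recovery setting; a typical (illustrative) configuration copies segments back from an archive directory, with %f expanded by the server to the requested WAL file name and %p to the destination path:

```
# postgresql.conf -- archive location is illustrative
restore_command = 'cp /mnt/server/archivedir/%f "%p"'
```

The double-duty concern above is visible directly in this form: every segment the decoder falls behind on turns into one extra cp from the archive, on top of the archive_command copy that put it there.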
Another point worth adding regarding logical replication's ability to build a logical-only HA system: a logical equivalent of pg_rewind. At least I have not noticed anything like that after a brief reading of the wiki page. IIUC, currently there is no way to quickly return an ex-primary (ex-logical publisher) into the HA cluster without doing a pg_basebackup, is there? It seems that we have the same problem here as with physical replication --- the ex-primary may accept some transactions after the promotion of the new primary, so their histories diverge and the old primary has to be rewound before being returned as a standby (subscriber).

Regards
--
Alexey Kondratov
Postgres Professional
https://www.postgrespro.com
Russian Postgres Company
Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly
On 2020-12-04 04:25, Justin Pryzby wrote:

On Thu, Dec 03, 2020 at 04:12:53PM +0900, Michael Paquier wrote:
> +typedef struct ReindexParams {
> +	bool		concurrently;
> +	bool		verbose;
> +	bool		missingok;
> +
> +	int			options;	/* bitmask of lowlevel REINDEXOPT_* */
> +} ReindexParams;
> +
By moving everything into indexcmds.c, keeping ReindexParams within it makes sense to me. Now, there is no need for the three booleans because options stores the same information, no?

I liked the bools, but dropped them so the patch is smaller.

I had a look at 0001 and it looks mostly fine to me, except for a strange mixture of tabs/spaces in ExecReindex(). There are also a couple of meaningful comments:

-			options =
-				(verbose ? REINDEXOPT_VERBOSE : 0) |
-				(concurrently ? REINDEXOPT_CONCURRENTLY : 0);
+			if (verbose)
+				params.options |= REINDEXOPT_VERBOSE;

Why do we need this intermediate 'verbose' variable here? We only use it once to set a bitmask. Maybe we can do it like this:

params.options |= defGetBoolean(opt) ? REINDEXOPT_VERBOSE : 0;

See also the attached txt file with a diff (I wonder whether I can trick cfbot this way, so it does not apply the diff).

+	int			options;	/* bitmask of lowlevel REINDEXOPT_* */

I would prefer if the comment said '/* bitmask of ReindexOption */', as in VacuumOptions, since citing the exact enum type makes it easier to navigate the source code.

Regarding the REINDEX patch, I think this comment is misleading:

|+	 * Even if table was moved to new tablespace, normally toast cannot move.
|	 */
|+	Oid toasttablespaceOid = allowSystemTableMods ? tablespaceOid : InvalidOid;
|	result |= reindex_relation(toast_relid, flags,

I think it ought to say "Even if a table's indexes were moved to a new tablespace, its toast table's index is not normally moved". Right?

Yes, I think so, we are dealing only with index tablespace changes here. Thanks for noticing.
Also, I don't know whether we should check for GLOBALTABLESPACE_OID after calling get_tablespace_oid(), or in the lowlevel routines. Note that reindex_relation is called during cluster/vacuum, and in the later patches, I moved the test from cluster() and ExecVacuum() to rebuild_relation().

IIRC, I wanted to do the GLOBALTABLESPACE_OID check as early as possible (just after getting the Oid), since it does not make sense to proceed further if the tablespace is set to that value. Initially there were a lot of duplicative GLOBALTABLESPACE_OID checks, since there were a lot of reindex entry points (index, relation, concurrently, etc.). Now we are going to have ExecReindex(), so there are far fewer entry points, and in my opinion it is fine to keep this validation just after get_tablespace_oid(). However, this is mostly a sanity check. I can hardly imagine a lot of users trying to constantly move indexes to the global tablespace, so it is also OK to put this check deeper into the guts.

Regards
--
Alexey Kondratov
Postgres Professional
https://www.postgrespro.com
Russian Postgres Company

diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index a27f8f9d83..0b1884815c 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -2472,8 +2472,6 @@ void
 ExecReindex(ParseState *pstate, ReindexStmt *stmt, bool isTopLevel)
 {
 	ReindexParams params = {0};
-	bool		verbose = false,
-				concurrently = false;
 	ListCell   *lc;
 	char	   *tablespace = NULL;
@@ -2483,9 +2481,11 @@ ExecReindex(ParseState *pstate, ReindexStmt *stmt, bool isTopLevel)
 		DefElem    *opt = (DefElem *) lfirst(lc);

 		if (strcmp(opt->defname, "verbose") == 0)
-			verbose = defGetBoolean(opt);
+			params.options |= defGetBoolean(opt) ?
+				REINDEXOPT_VERBOSE : 0;
 		else if (strcmp(opt->defname, "concurrently") == 0)
-			concurrently = defGetBoolean(opt);
+			params.options |= defGetBoolean(opt) ?
+				REINDEXOPT_CONCURRENTLY : 0;
 		else if (strcmp(opt->defname, "tablespace") == 0)
 			tablespace = defGetString(opt);
 		else
@@ -2496,18 +2496,12 @@ ExecReindex(ParseState *pstate, ReindexStmt *stmt, bool isTopLevel)
					 parser_errposition(pstate, opt->location)));
 	}

-	if (verbose)
-		params.options |= REINDEXOPT_VERBOSE;
+	params.tablespaceOid = tablespace ?
+		get_tablespace_oid(tablespace, false) : InvalidOid;

-	if (concurrently)
-	{
-		params.options |= REINDEXOPT_CONCURRENTLY;
+	if (params.options & REINDEXOPT_CONCURRENTLY)
 		PreventInTransactionBlock(isTopLevel, "REINDEX CONCURRENTLY");
-	}
-
-	params.tablespaceOid = tablespace ?
-		get_tablespace_oid(tablespace, false) : InvalidOid;

 	switch (stmt->kind)
 	{