Re: [HACKERS] basebackups during ALTER DATABASE ... SET TABLESPACE ... not safe?
On 2015-04-28 10:26:54 +0530, Abhijit Menon-Sen wrote: Hi Andres. Do you intend to commit these patches? (I just verified that they apply without trouble to HEAD.) I intend to come back to them, yes. I'm not fully sure whether we should really apply them to the back branches. Greetings, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] basebackups during ALTER DATABASE ... SET TABLESPACE ... not safe?
At 2015-01-30 21:36:42 +0100, and...@2ndquadrant.com wrote: Here's an alternative approach. I think it generally is superior and going in the right direction, but I'm not sure it's backpatchable. It basically consists out of: 1) Make GetLockConflicts() actually work. already commited as being a independent problem. 2) Allow the startup process to actually acquire locks other than AccessExclusiveLocks. There already is code acquiring other locks, but it's currently broken because they're acquired in blocking mode which isn't actually supported in the startup mode. Using this infrastructure we can actually fix a couple small races around database creation/drop. 3) Allow session locks during recovery to be heavier than RowExclusiveLock - since relation/object session locks aren't ever taken out by the user (without a plain lock first) that's safe. merged and submitted independently. 5) Make walsender actually react to recovery conflict interrupts submitted here. (0003) 4) Perform streaming base backups from within a transaction command, to provide error handling. 6) Prevent access to the template database of a CREATE DATABASE during WAL replay. 7) Add an interlock to prevent base backups and movedb() to run concurrently. What we actually came here for. combined, submitted here. (0004) I think this actually doesn't look that bad. Hi Andres. Do you intend to commit these patches? (I just verified that they apply without trouble to HEAD.) -- Abhijit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] basebackups during ALTER DATABASE ... SET TABLESPACE ... not safe?
On 2015-01-27 20:16:43 +0100, Andres Freund wrote: Here's an alternative approach. I think it generally is superior and going in the right direction, but I'm not sure it's backpatchable. It basically consists out of: 1) Make GetLockConflicts() actually work. already commited as being a independent problem. 2) Allow the startup process to actually acquire locks other than AccessExclusiveLocks. There already is code acquiring other locks, but it's currently broken because they're acquired in blocking mode which isn't actually supported in the startup mode. Using this infrastructure we can actually fix a couple small races around database creation/drop. 3) Allow session locks during recovery to be heavier than RowExclusiveLock - since relation/object session locks aren't ever taken out by the user (without a plain lock first) that's safe. merged and submitted independently. 5) Make walsender actually react to recovery conflict interrupts submitted here. (0003) 4) Perform streaming base backups from within a transaction command, to provide error handling. 6) Prevent access to the template database of a CREATE DATABASE during WAL replay. 7) Add an interlock to prevent base backups and movedb() to run concurrently. What we actually came here for. combined, submitted here. (0004) I think this actually doesn't look that bad. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services From 8531ca4d200d24ee45265774f7ead613563adca4 Mon Sep 17 00:00:00 2001 From: Andres Freund and...@anarazel.de Date: Fri, 30 Jan 2015 09:28:17 +0100 Subject: [PATCH 1/4] Allow recovery lock infrastructure to not only hold relation locks. This is a preparatory commit to fix several bugs, separated out for easier review. The startup process could so far only properly acquire access exlusive locks on transactions. As it does not setup enough state to do lock queuing - primarily to be able to issue recovery conflict interrupts, and to release them when the primary releases locks - it has it's own infrastructure to manage locks. That infrastructure so far assumed that all locks are access exlusive locks on relations. Unfortunately some code in the startup process has to acquire other locks than what's supported by the aforementioned infrastructure in standby.c. Namely dbase_redo() has to acquire locks on the database objects. Also further such locks will be added soon, to fix a another bug. So this patch shanges the infrastructure to be able to acquire locks of different modes and locktags. Additionally allow acquiring more heavyweight relation logs on the standby than RowExclusive when acquired in session mode. Discussion: 20150120152819.gc24...@alap3.anarazel.de Backpatch all the way. --- src/backend/storage/ipc/standby.c | 120 ++ src/backend/storage/lmgr/lock.c | 7 +-- src/include/storage/standby.h | 2 +- 3 files changed, 61 insertions(+), 68 deletions(-) diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c index 292bed5..0502aab 100644 --- a/src/backend/storage/ipc/standby.c +++ b/src/backend/storage/ipc/standby.c @@ -38,10 +38,16 @@ int max_standby_archive_delay = 30 * 1000; int max_standby_streaming_delay = 30 * 1000; static List *RecoveryLockList; +typedef struct RecoveryLockListEntry +{ + TransactionId xid; + LOCKMODE lockmode; + LOCKTAG locktag; +} RecoveryLockListEntry; static void ResolveRecoveryConflictWithVirtualXIDs(VirtualTransactionId *waitlist, ProcSignalReason reason); -static void ResolveRecoveryConflictWithLock(Oid dbOid, Oid relOid); +static void ResolveRecoveryConflictWithLock(LOCKTAG *tag, LOCKMODE mode); static void SendRecoveryConflictWithBufferPin(ProcSignalReason reason); static XLogRecPtr LogCurrentRunningXacts(RunningTransactions CurrRunningXacts); static void LogAccessExclusiveLocks(int nlocks, xl_standby_lock *locks); @@ -320,10 +326,10 @@ ResolveRecoveryConflictWithDatabase(Oid dbid) * us. This is rare enough that we do this as simply as possible: no wait, * just force them off immediately. * - * No locking is required here because we already acquired - * AccessExclusiveLock. Anybody trying to connect while we do this will - * block during InitPostgres() and then disconnect when they see the - * database has been removed. + * No locking is required here because we already acquired a + * AccessExclusiveLock on the database in dbase_redo(). Anybody trying to + * connect while we do this will block during InitPostgres() and then + * disconnect when they see the database has been removed. */ while (CountDBBackends(dbid) 0) { @@ -338,14 +344,11 @@ ResolveRecoveryConflictWithDatabase(Oid dbid) } static void -ResolveRecoveryConflictWithLock(Oid dbOid, Oid relOid) +ResolveRecoveryConflictWithLock(LOCKTAG *locktag,
Re: [HACKERS] basebackups during ALTER DATABASE ... SET TABLESPACE ... not safe?
On 2015-01-27 07:16:27 -0500, Robert Haas wrote: On Mon, Jan 26, 2015 at 4:03 PM, Andres Freund and...@2ndquadrant.com wrote: I basically have two ideas to fix this. 1) Make do_pg_start_backup() acquire a SHARE lock on pg_database. That'll prevent it from starting while a movedb() is still in progress. Then additionally add pg_backup_in_progress() function to xlog.c that checks (XLogCtl-Insert.exclusiveBackup || XLogCtl-Insert.nonExclusiveBackups != 0). Use that in createdb() and movedb() to error out if a backup is in progress. Attached is a patch trying to this. Doesn't look too bad and lead me to discover missing recovery conflicts during a AD ST. But: It doesn't actually work on standbys, because lock.c prevents any stronger lock than RowExclusive from being acquired. And we need need a lock that can conflict with WAL replay of DBASE_CREATE, to handle base backups that are executed on the primary. Those obviously can't detect whether any standby is currently doing a base backup... I currently don't have a good idea how to mangle lock.c to allow this. I've played with doing it like in the second patch, but that doesn't actually work because of some asserts around ProcSleep - leading to locks on database objects not working in the startup process (despite already being used). The easiest thing would be to just use a lwlock instead of a heavyweight lock - but those aren't canceleable... How about just wrapping an lwlock around a flag variable? movedb() increments the variable when starting and decrements it when done (must use PG_ENSURE_ERROR_CLEANUP). Starting a backup errors out (or waits in 1-second increments) if it's non-zero. That'd end up essentially being a re-emulation of locks. Don't find that all that pretty. But maybe we have to go there. Here's an alternative approach. I think it generally is superior and going in the right direction, but I'm not sure it's backpatchable. It basically consists out of: 1) Make GetLockConflicts() actually work. 2) Allow the startup process to actually acquire locks other than AccessExclusiveLocks. There already is code acquiring other locks, but it's currently broken because they're acquired in blocking mode which isn't actually supported in the startup mode. Using this infrastructure we can actually fix a couple small races around database creation/drop. 3) Allow session locks during recovery to be heavier than RowExclusiveLock - since relation/object session locks aren't ever taken out by the user (without a plain lock first) that's safe. 4) Perform streaming base backups from within a transaction command, to provide error handling. 5) Make walsender actually react to recovery conflict interrupts 6) Prevent access to the template database of a CREATE DATABASE during WAL replay. 7) Add an interlock to prevent base backups and movedb() to run concurrently. What we actually came here for. Comments? At the very least it's missing some documentation updates about the locking changes in ALTER DATABASE, CREATE DATABASE and the base backup sections. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services From 6e196d17e3dc3ae923321c1b49eb46ccd5ac75b0 Mon Sep 17 00:00:00 2001 From: Andres Freund and...@anarazel.de Date: Tue, 27 Jan 2015 19:52:11 +0100 Subject: [PATCH] Fix various issues around WAL replay and ALTER DATABASE SET TABLESPACE. Danger of basebackups vs. AD ST Discussion: 20150120152819.gc24...@alap3.anarazel.de Fix GetLockConflicts() to properly terminate the list of conflicting backends. It's unclear why this hasn't caused more problems. Discussion: 20150127142713.gd29...@awork2.anarazel.de Don't acquire blocking locks on the database in dbase_redo(), not enough state setup. Discussion: 20150126212458.ga29...@awork2.anarazel.de Don't allow access to the template database during the replay of a CREATE DATABASE. --- src/backend/access/transam/xlog.c| 29 + src/backend/commands/dbcommands.c| 104 ++- src/backend/replication/basebackup.c | 15 + src/backend/replication/walsender.c | 14 + src/backend/storage/ipc/standby.c| 117 ++- src/backend/storage/lmgr/lmgr.c | 31 ++ src/backend/storage/lmgr/lock.c | 13 ++-- src/backend/utils/init/postinit.c| 2 +- src/include/access/xlog.h| 1 + src/include/storage/lmgr.h | 3 + src/include/storage/standby.h| 2 +- 11 files changed, 240 insertions(+), 91 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 629a457..38e7dff 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -53,6 +53,7 @@ #include storage/ipc.h #include storage/large_object.h #include storage/latch.h
Re: [HACKERS] basebackups during ALTER DATABASE ... SET TABLESPACE ... not safe?
On Tue, Jan 27, 2015 at 2:16 PM, Andres Freund and...@2ndquadrant.com wrote: That'd end up essentially being a re-emulation of locks. Don't find that all that pretty. But maybe we have to go there. The advantage of it is that it is simple. The only thing we're really giving up is the deadlock detector, which I think isn't needed in this case. Here's an alternative approach. I think it generally is superior and going in the right direction, but I'm not sure it's backpatchable. I tend think this is changing too many things to back-patch. It might all be fine, but it's pretty wide-reaching, so the chances of collateral damage seem non-trivial. Even in master, I'm not sure I see the point in rocking the apple cart to this degree. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] basebackups during ALTER DATABASE ... SET TABLESPACE ... not safe?
On 2015-01-27 19:24:24 -0500, Robert Haas wrote: On Tue, Jan 27, 2015 at 2:16 PM, Andres Freund and...@2ndquadrant.com wrote: That'd end up essentially being a re-emulation of locks. Don't find that all that pretty. But maybe we have to go there. The advantage of it is that it is simple. The only thing we're really giving up is the deadlock detector, which I think isn't needed in this case. I think it's more than just the deadlock detector. Consider pg_locks/pg_stat_activity.waiting, cancelling a acquisition, error cleanup and recursive acquisitions. Acquiring session locks + a surrounding transaction command deals with with cleanups without introducing PG_ENSURE blocks in a couple places. And we need recursive acquisition so a streaming base backup can acquire the lock over the whole runtime, while a manual pg_start_backup() does only for its own time. Especially the visibility in the system views is something I'd not like to give up in additional blocks we introduce in the backbranches. Here's an alternative approach. I think it generally is superior and going in the right direction, but I'm not sure it's backpatchable. I tend think this is changing too many things to back-patch. It might all be fine, but it's pretty wide-reaching, so the chances of collateral damage seem non-trivial. Even in master, I'm not sure I see the point in rocking the apple cart to this degree. It's big, true. But a fair amount of it stuff I think we have to do anyway. The current code acquiring session locks in dbase_redo() is borked - we either assert or segfault if there's a connection in the wrong moment beause there's no deadlock handler. Plus it has races that aren't that hard to hit :(. To fix the crashes (but not the race) we can have a separate ResolveRecoveryConflictWithObjectLock that doesn't record the existance of the lock, but doesn't ever do a ProcSleep(). Not pretty either :( Seems like a situation with no nice solutions so far :( Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] basebackups during ALTER DATABASE ... SET TABLESPACE ... not safe?
On Mon, Jan 26, 2015 at 4:03 PM, Andres Freund and...@2ndquadrant.com wrote: I basically have two ideas to fix this. 1) Make do_pg_start_backup() acquire a SHARE lock on pg_database. That'll prevent it from starting while a movedb() is still in progress. Then additionally add pg_backup_in_progress() function to xlog.c that checks (XLogCtl-Insert.exclusiveBackup || XLogCtl-Insert.nonExclusiveBackups != 0). Use that in createdb() and movedb() to error out if a backup is in progress. Attached is a patch trying to this. Doesn't look too bad and lead me to discover missing recovery conflicts during a AD ST. But: It doesn't actually work on standbys, because lock.c prevents any stronger lock than RowExclusive from being acquired. And we need need a lock that can conflict with WAL replay of DBASE_CREATE, to handle base backups that are executed on the primary. Those obviously can't detect whether any standby is currently doing a base backup... I currently don't have a good idea how to mangle lock.c to allow this. I've played with doing it like in the second patch, but that doesn't actually work because of some asserts around ProcSleep - leading to locks on database objects not working in the startup process (despite already being used). The easiest thing would be to just use a lwlock instead of a heavyweight lock - but those aren't canceleable... How about just wrapping an lwlock around a flag variable? movedb() increments the variable when starting and decrements it when done (must use PG_ENSURE_ERROR_CLEANUP). Starting a backup errors out (or waits in 1-second increments) if it's non-zero. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] basebackups during ALTER DATABASE ... SET TABLESPACE ... not safe?
Hi, On 2015-01-22 19:56:07 +0100, Andres Freund wrote: Hi, On 2015-01-20 16:28:19 +0100, Andres Freund wrote: I'm analyzing a problem in which a customer had a pg_basebackup (from standby) created 9.2 cluster that failed with WAL contains references to invalid pages. The failed record was a xlog redo visible i.e. XLOG_HEAP2_VISIBLE. First I thought there might be another bug along the line of 17fa4c321cc. Looking at the code and the WAL that didn't seem to be the case (man, I miss pg_xlogdump). Other, slightly older, standbys, didn't seem to have any problems. Logs show that a ALTER DATABASE ... SET TABLESPACE ... was running when the basebackup was started and finished *before* pg_basebackup finished. movedb() basically works in these steps: 1) lock out users of the database 2) RequestCheckpoint(IMMEDIATE|WAIT) 3) DropDatabaseBuffers() 4) copydir() 5) XLogInsert(XLOG_DBASE_CREATE) 6) RequestCheckpoint(CHECKPOINT_IMMEDIATE) 7) rmtree(src_dbpath) 8) XLogInsert(XLOG_DBASE_DROP) 9) unlock database If a basebackup starts while 4) is in progress and continues until 7) happens I think a pretty wide race opens: The basebackup can end up with a partial copy of the database in the old tablespace because the rmtree(old_path) concurrently was in progress. Normally such races are fixed during replay. But in this case, the replay of the XLOG_DBASE_CREATE will just try to do a rmtree(new); copydiar(old, new);. fixing nothing. Besides making AD .. ST use sane WAL logging, which doesn't seem backpatchable, I don't see what could be done against this except somehow making basebackups fail if a AD .. ST is in progress. Which doesn't look entirely trivial either. I basically have two ideas to fix this. 1) Make do_pg_start_backup() acquire a SHARE lock on pg_database. That'll prevent it from starting while a movedb() is still in progress. Then additionally add pg_backup_in_progress() function to xlog.c that checks (XLogCtl-Insert.exclusiveBackup || XLogCtl-Insert.nonExclusiveBackups != 0). Use that in createdb() and movedb() to error out if a backup is in progress. Attached is a patch trying to this. Doesn't look too bad and lead me to discover missing recovery conflicts during a AD ST. But: It doesn't actually work on standbys, because lock.c prevents any stronger lock than RowExclusive from being acquired. And we need need a lock that can conflict with WAL replay of DBASE_CREATE, to handle base backups that are executed on the primary. Those obviously can't detect whether any standby is currently doing a base backup... I currently don't have a good idea how to mangle lock.c to allow this. I've played with doing it like in the second patch, but that doesn't actually work because of some asserts around ProcSleep - leading to locks on database objects not working in the startup process (despite already being used). The easiest thing would be to just use a lwlock instead of a heavyweight lock - but those aren't canceleable... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] basebackups during ALTER DATABASE ... SET TABLESPACE ... not safe?
On 2015-01-26 22:03:03 +0100, Andres Freund wrote: Attached is a patch trying to this. Doesn't look too bad and lead me to discover missing recovery conflicts during a AD ST. But: It doesn't actually work on standbys, because lock.c prevents any stronger lock than RowExclusive from being acquired. And we need need a lock that can conflict with WAL replay of DBASE_CREATE, to handle base backups that are executed on the primary. Those obviously can't detect whether any standby is currently doing a base backup... I currently don't have a good idea how to mangle lock.c to allow this. I've played with doing it like in the second patch, but that doesn't actually work because of some asserts around ProcSleep - leading to locks on database objects not working in the startup process (despite already being used). The easiest thing would be to just use a lwlock instead of a heavyweight lock - but those aren't canceleable... As Stephen noticed on irc I forgot to attach those. Caused by the generic HS issue mentioned in 20150126212458.ga29...@awork2.anarazel.de. Attached now. As mentioned above, this isn't really isn't ready (e.g. it'll Assert() on a standby due to the HS issue). Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services From e5d46c73955e7647912c2625e31027a6fef7c880 Mon Sep 17 00:00:00 2001 From: Andres Freund and...@anarazel.de Date: Mon, 26 Jan 2015 21:03:35 +0100 Subject: [PATCH] Prevent ALTER DATABASE SET TABLESPACE from running concurrently with a base backup. The combination can cause a corrupt base backup. --- src/backend/access/transam/xlog.c| 33 + src/backend/commands/dbcommands.c| 41 src/backend/replication/basebackup.c | 11 ++ src/include/access/xlog.h| 1 + 4 files changed, 86 insertions(+) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 629a457..1daa10d 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -53,6 +53,7 @@ #include storage/ipc.h #include storage/large_object.h #include storage/latch.h +#include storage/lmgr.h #include storage/pmsignal.h #include storage/predicate.h #include storage/proc.h @@ -9329,6 +9330,14 @@ do_pg_start_backup(const char *backupidstr, bool fast, TimeLineID *starttli_p, else XLogCtl-Insert.nonExclusiveBackups++; XLogCtl-Insert.forcePageWrites = true; + + /* + * Acquire lock on pg datababase just before releasing the WAL insert lock + * and entering the ENSURE_ERROR_CLEANUP block. That way it's sufficient + * to release it in the error cleanup callback. + */ + LockRelationOid(DatabaseRelationId, ShareLock); + WALInsertLockRelease(); /* Ensure we release forcePageWrites if fail below */ @@ -9523,6 +9532,8 @@ do_pg_start_backup(const char *backupidstr, bool fast, TimeLineID *starttli_p, } PG_END_ENSURE_ERROR_CLEANUP(pg_start_backup_callback, (Datum) BoolGetDatum(exclusive)); + UnlockRelationOid(DatabaseRelationId, ShareLock); + /* * We're done. As a convenience, return the starting WAL location. */ @@ -9537,6 +9548,8 @@ pg_start_backup_callback(int code, Datum arg) { bool exclusive = DatumGetBool(arg); + UnlockRelationOid(DatabaseRelationId, ShareLock); + /* Update backup counters and forcePageWrites on failure */ WALInsertLockAcquireExclusive(); if (exclusive) @@ -9937,6 +9950,26 @@ do_pg_abort_backup(void) } /* + * Is a (exclusive or nonexclusive) base backup running? + * + * Note that this does not check whether any standby of this node has a + * basebackup running, or whether any upstream master (if this is a standby) + * has one in progress + */ +bool +LocalBaseBackupInProgress(void) +{ + bool ret; + + WALInsertLockAcquire(); + ret = XLogCtl-Insert.exclusiveBackup || + XLogCtl-Insert.nonExclusiveBackups 0; + WALInsertLockRelease(); + + return ret; +} + +/* * Get latest redo apply position. * * Exported to allow WALReceiver to read the pointer directly. diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c index 5e66961..6184c83 100644 --- a/src/backend/commands/dbcommands.c +++ b/src/backend/commands/dbcommands.c @@ -1087,6 +1087,30 @@ movedb(const char *dbname, const char *tblspcname) errmsg(cannot change the tablespace of the currently open database))); /* + * Prevent SET TABLESPACE from running concurrently with a base + * backup. Without that check a base backup would potentially copy a + * partially removed source database; which WAL replay then would copy + * over the new database... + * + * Starting a base backup takes a SHARE lock on pg_database. In addition a + * streaming basebackup takes the same lock for the entirety of the copy + * of the data directory. That, combined with this check, prevents base + * backups from being taken at the same
Re: [HACKERS] basebackups during ALTER DATABASE ... SET TABLESPACE ... not safe?
On 2015-01-22 22:58:17 +0100, Andres Freund wrote: Because the way it currently works is a major crock. It's more luck than anything that it actually somewhat works. We normally rely on WAL to bring us into a consistent state. But around CREATE/MOVE/DROP DATABASE we've ignored that. Hah: There's even a comment about some of the existing dangers: * * In PITR replay, the first of these isn't an issue, and the second * is only a risk if the CREATE DATABASE and subsequent template * database change both occur while a base backup is being taken. * There doesn't seem to be much we can do about that except document * it as a limitation. * * Perhaps if we ever implement CREATE DATABASE in a less cheesy way, * we can avoid this. only that it has never been documented as a limitation to my knowledge... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] basebackups during ALTER DATABASE ... SET TABLESPACE ... not safe?
On 2015-01-23 12:52:03 -0600, Jim Nasby wrote: On 1/22/15 3:18 PM, Andres Freund wrote: , but to then put it's entire contents into WAL? Blech. Besides actually having a chance of being correct, doing so will save having to do two checkpoints inside movedb(). I think it's pretty likely that that actually saves overall IO, even including the WAL writes. Especially if there's other databases in the cluster at the same time. If you're moving a small amount of data, maybe. If you're moving several hundred GB or more? You're going to flood WAL and probably cause all replication/archiving to lag. I have hard time believing many people do AD ST on a database a couple hundred GB large. That'd mean the database is out of business for a significant amount of time (remember, there can't be any connected users during that time). I think realistically it's only used on smaller databases. The reason createdb()/movedb() work the way they do is imo simply because they're old. So far I can't see how the current solution can actually be made safe to be executed on a not yet consistent database. And we obviously can't wait for it to be consistent (as we're replaying a linear stream of WAL). Is there some way we can add an option to control this? I'm thinking that by default ADAT will error if a backup is running, but allow the user to over-ride that. Why would we want to allow that? There's simply no way it's safe, so ...? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] basebackups during ALTER DATABASE ... SET TABLESPACE ... not safe?
On 1/23/15 9:10 AM, Andres Freund wrote: On 2015-01-22 22:58:17 +0100, Andres Freund wrote: On 2015-01-22 16:38:49 -0500, Stephen Frost wrote: I'm trying to figure out why you'd do '2' in master when in discussion of '1' you say I also don't think ALTER DATABASE is even intentionally run at the time of a base backup. I agree with that sentiment and am inclined to say that '1' is good enough throughout. Because the way it currently works is a major crock. It's more luck than anything that it actually somewhat works. We normally rely on WAL to bring us into a consistent state. But around CREATE/MOVE/DROP DATABASE we've ignored that. And. Hm. The difficulty of the current method is evidenced by the fact that so far nodoby recognized that 1) as described above isn't actually safe. It fails to protect against basebackups on a standby as its XLogCtl state will obviously not be visible on the master. Further evidenced by the fact that the current method isn't crash/shutdown safe at all. If a standby was shut down/crashed/was started on a base backup when a CREATE DATABASE from the primary is replayed the template database used can be in an nearly arbitrarily bad state. It'll later get fixed up by recovery - but those changes won't make it to the copied database. I think we all agree that ADAT can't run while a base backup is happening, which I believe is what you're describing above. We'd have to somehow cover that same scenario on replicas too. Perhaps there isn't really an issue here; I suspect ADAT is very rarely used. What I'm worried about though is that someone is using it regularly for some reason, with non-trivial databases, and this is going to completely hose them. -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] basebackups during ALTER DATABASE ... SET TABLESPACE ... not safe?
On 1/22/15 3:18 PM, Andres Freund wrote: , but to then put it's entire contents into WAL? Blech. Besides actually having a chance of being correct, doing so will save having to do two checkpoints inside movedb(). I think it's pretty likely that that actually saves overall IO, even including the WAL writes. Especially if there's other databases in the cluster at the same time. If you're moving a small amount of data, maybe. If you're moving several hundred GB or more? You're going to flood WAL and probably cause all replication/archiving to lag. Obviously, we need to do be reliable, but I think a lot of users would much rather that they can't muck with tablespaces while a backup is running than an ADAT suddenly consumes way more resources than before. Is there some way we can add an option to control this? I'm thinking that by default ADAT will error if a backup is running, but allow the user to over-ride that. -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] basebackups during ALTER DATABASE ... SET TABLESPACE ... not safe?
On 2015-01-23 13:01:34 -0600, Jim Nasby wrote: On 1/23/15 9:10 AM, Andres Freund wrote: On 2015-01-22 22:58:17 +0100, Andres Freund wrote: On 2015-01-22 16:38:49 -0500, Stephen Frost wrote: I'm trying to figure out why you'd do '2' in master when in discussion of '1' you say I also don't think ALTER DATABASE is even intentionally run at the time of a base backup. I agree with that sentiment and am inclined to say that '1' is good enough throughout. Because the way it currently works is a major crock. It's more luck than anything that it actually somewhat works. We normally rely on WAL to bring us into a consistent state. But around CREATE/MOVE/DROP DATABASE we've ignored that. And. Hm. The difficulty of the current method is evidenced by the fact that so far nodoby recognized that 1) as described above isn't actually safe. It fails to protect against basebackups on a standby as its XLogCtl state will obviously not be visible on the master. Further evidenced by the fact that the current method isn't crash/shutdown safe at all. If a standby was shut down/crashed/was started on a base backup when a CREATE DATABASE from the primary is replayed the template database used can be in an nearly arbitrarily bad state. It'll later get fixed up by recovery - but those changes won't make it to the copied database. I think we all agree that ADAT can't run while a base backup is happening, which I believe is what you're describing above. No, that's not what I'm descirbing in the last paragraph. The problem is present without any AD ST. When a cluster crashes it's likely not in a consistent state - we need to replay from the last checkpoint's REDO pointer to the minimum recovery location to make it so. The problem though is that if the copied database (both for CREATE DB/SET TABLESPACE) is copied that record can be replayed well before the database is in a consistent state. In that case the new copy will be in a corrupted state as it'll not be fixed up by recovery, in contrast to the original, which will. Perhaps there isn't really an issue here; I suspect ADAT is very rarely used. What I'm worried about though is that someone is using it regularly for some reason, with non-trivial databases, and this is going to completely hose them. I think if somebody does this regularly on nontrivialy sized databases, over replication, they'd have hit this bug a long time ago. It's really not that hard to reproduce. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] basebackups during ALTER DATABASE ... SET TABLESPACE ... not safe?
On 2015-01-22 22:58:17 +0100, Andres Freund wrote: On 2015-01-22 16:38:49 -0500, Stephen Frost wrote: I'm trying to figure out why you'd do '2' in master when in discussion of '1' you say I also don't think ALTER DATABASE is even intentionally run at the time of a base backup. I agree with that sentiment and am inclined to say that '1' is good enough throughout. Because the way it currently works is a major crock. It's more luck than anything that it actually somewhat works. We normally rely on WAL to bring us into a consistent state. But around CREATE/MOVE/DROP DATABASE we've ignored that. And. Hm. The difficulty of the current method is evidenced by the fact that so far nodoby recognized that 1) as described above isn't actually safe. It fails to protect against basebackups on a standby as its XLogCtl state will obviously not be visible on the master. Further evidenced by the fact that the current method isn't crash/shutdown safe at all. If a standby was shut down/crashed/was started on a base backup when a CREATE DATABASE from the primary is replayed the template database used can be in an nearly arbitrarily bad state. It'll later get fixed up by recovery - but those changes won't make it to the copied database. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] basebackups during ALTER DATABASE ... SET TABLESPACE ... not safe?
Hi, On 2015-01-20 16:28:19 +0100, Andres Freund wrote: I'm analyzing a problem in which a customer had a pg_basebackup (from standby) created 9.2 cluster that failed with WAL contains references to invalid pages. The failed record was a xlog redo visible i.e. XLOG_HEAP2_VISIBLE. First I thought there might be another bug along the line of 17fa4c321cc. Looking at the code and the WAL that didn't seem to be the case (man, I miss pg_xlogdump). Other, slightly older, standbys, didn't seem to have any problems. Logs show that a ALTER DATABASE ... SET TABLESPACE ... was running when the basebackup was started and finished *before* pg_basebackup finished. movedb() basically works in these steps: 1) lock out users of the database 2) RequestCheckpoint(IMMEDIATE|WAIT) 3) DropDatabaseBuffers() 4) copydir() 5) XLogInsert(XLOG_DBASE_CREATE) 6) RequestCheckpoint(CHECKPOINT_IMMEDIATE) 7) rmtree(src_dbpath) 8) XLogInsert(XLOG_DBASE_DROP) 9) unlock database If a basebackup starts while 4) is in progress and continues until 7) happens I think a pretty wide race opens: The basebackup can end up with a partial copy of the database in the old tablespace because the rmtree(old_path) concurrently was in progress. Normally such races are fixed during replay. But in this case, the replay of the XLOG_DBASE_CREATE will just try to do a rmtree(new); copydiar(old, new);. fixing nothing. Besides making AD .. ST use sane WAL logging, which doesn't seem backpatchable, I don't see what could be done against this except somehow making basebackups fail if a AD .. ST is in progress. Which doesn't look entirely trivial either. I basically have two ideas to fix this. 1) Make do_pg_start_backup() acquire a SHARE lock on pg_database. That'll prevent it from starting while a movedb() is still in progress. Then additionally add pg_backup_in_progress() function to xlog.c that checks (XLogCtl-Insert.exclusiveBackup || XLogCtl-Insert.nonExclusiveBackups != 0). Use that in createdb() and movedb() to error out if a backup is in progress. Not pretty, but sounds doable. We've discussed trying to sleep instead of erroring out in movedb(), while a base backup is in progress, but that's nontrivial. I also don't think ALTER DATABASE is ever intentionally run at the time of a base backup. 2) Make movedb() (and possibly created(), not sure yet) use proper WAL logging and log the whole copied data. I think this is the right long term fix and would end up being much more reliable. But it either requires some uglyness during redo (creating nonexistant database directories on the fly during redo) or new wal records. Doable, but probably too large/invasive to backpatch. Thanks for Alvaro and Petr for discussing the problem. I lean towards doing 1) in all branches and then doing 2) in master. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] basebackups during ALTER DATABASE ... SET TABLESPACE ... not safe?
Andres, * Andres Freund (and...@2ndquadrant.com) wrote: 1) Make do_pg_start_backup() acquire a SHARE lock on pg_database. That'll prevent it from starting while a movedb() is still in progress. Then additionally add pg_backup_in_progress() function to xlog.c that checks (XLogCtl-Insert.exclusiveBackup || XLogCtl-Insert.nonExclusiveBackups != 0). Use that in createdb() and movedb() to error out if a backup is in progress. Not pretty, but sounds doable. We've discussed trying to sleep instead of erroring out in movedb(), while a base backup is in progress, but that's nontrivial. I also don't think ALTER DATABASE is ever intentionally run at the time of a base backup. 2) Make movedb() (and possibly created(), not sure yet) use proper WAL logging and log the whole copied data. I think this is the right long term fix and would end up being much more reliable. But it either requires some uglyness during redo (creating nonexistant database directories on the fly during redo) or new wal records. Doable, but probably too large/invasive to backpatch. Thanks for Alvaro and Petr for discussing the problem. I lean towards doing 1) in all branches and then doing 2) in master. I'm trying to figure out why you'd do '2' in master when in discussion of '1' you say I also don't think ALTER DATABASE is even intentionally run at the time of a base backup. I agree with that sentiment and am inclined to say that '1' is good enough throughout. Another thought would be to offer both- perhaps an AD .. ST .. CONCURRENTLY option which would WAL. Or a way to say my backup is more important than some AD .. ST .., which I could see some admins wanting. The other question is- what about AT .. ST? That is, ALTER TABLE .. SET TABLESPACE. Do we do things differently there or is there a similar issue? Thanks, Stephen signature.asc Description: Digital signature
Re: [HACKERS] basebackups during ALTER DATABASE ... SET TABLESPACE ... not safe?
On 2015-01-22 16:38:49 -0500, Stephen Frost wrote: Andres, * Andres Freund (and...@2ndquadrant.com) wrote: 1) Make do_pg_start_backup() acquire a SHARE lock on pg_database. That'll prevent it from starting while a movedb() is still in progress. Then additionally add pg_backup_in_progress() function to xlog.c that checks (XLogCtl-Insert.exclusiveBackup || XLogCtl-Insert.nonExclusiveBackups != 0). Use that in createdb() and movedb() to error out if a backup is in progress. Not pretty, but sounds doable. We've discussed trying to sleep instead of erroring out in movedb(), while a base backup is in progress, but that's nontrivial. I also don't think ALTER DATABASE is ever intentionally run at the time of a base backup. 2) Make movedb() (and possibly created(), not sure yet) use proper WAL logging and log the whole copied data. I think this is the right long term fix and would end up being much more reliable. But it either requires some uglyness during redo (creating nonexistant database directories on the fly during redo) or new wal records. Doable, but probably too large/invasive to backpatch. Thanks for Alvaro and Petr for discussing the problem. I lean towards doing 1) in all branches and then doing 2) in master. I'm trying to figure out why you'd do '2' in master when in discussion of '1' you say I also don't think ALTER DATABASE is even intentionally run at the time of a base backup. I agree with that sentiment and am inclined to say that '1' is good enough throughout. Because the way it currently works is a major crock. It's more luck than anything that it actually somewhat works. We normally rely on WAL to bring us into a consistent state. But around CREATE/MOVE/DROP DATABASE we've ignored that. My point about not intentionally running it at the same isn't that it's not happening, it's that it's not intended to happen at the same time. But most sane sites these days actually do use automated basebackups - making it much harder to be safe against this. And. Hm. The difficulty of the current method is evidenced by the fact that so far nodoby recognized that 1) as described above isn't actually safe. It fails to protect against basebackups on a standby as its XLogCtl state will obviously not be visible on the master. For exclusive basebackups (i.e. SELECT pg_start/stop_backup()) we can't rely on locking because these commands can happen in independent sessions. But I think we can in the builtin nonexclusive basebackups, as it's run in one session. So I guess we could rely on recovery conflicts not allowing the XLOG_DBASE_CREATE/DROP to replicate. It's nasty that a basebackup can suddenly stop replication from progressing though :(. Additionally it'd not protect stuff like pgespresso (an extension for nonexclusive standby basebackups). The other question is- what about AT .. ST? That is, ALTER TABLE .. SET TABLESPACE. Do we do things differently there or is there a similar issue? No issue, because it actually is implemented halfway sanely sanely and uses WAL logging. I personally don't like at all that it uses FlushRelationBuffers() and reads the tables on the smgr level instead of using the buffer manager and a bulk state, but ... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] basebackups during ALTER DATABASE ... SET TABLESPACE ... not safe?
Andres Freund wrote: 2) Make movedb() (and possibly created(), not sure yet) use proper WAL logging and log the whole copied data. I think this is the right long term fix and would end up being much more reliable. But it either requires some uglyness during redo (creating nonexistant database directories on the fly during redo) or new wal records. Doable, but probably too large/invasive to backpatch. Not to mention the extra WAL traffic ... -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] basebackups during ALTER DATABASE ... SET TABLESPACE ... not safe?
On 1/22/15 1:43 PM, Alvaro Herrera wrote: Andres Freund wrote: 2) Make movedb() (and possibly created(), not sure yet) use proper WAL logging and log the whole copied data. I think this is the right long term fix and would end up being much more reliable. But it either requires some uglyness during redo (creating nonexistant database directories on the fly during redo) or new wal records. Doable, but probably too large/invasive to backpatch. Not to mention the extra WAL traffic ... Yeah, I don't know that we actually want #2. It's bad enough to copy an entire database locally, but to then put it's entire contents into WAL? Blech. -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] basebackups during ALTER DATABASE ... SET TABLESPACE ... not safe?
On 2015-01-22 14:42:18 -0600, Jim Nasby wrote: On 1/22/15 1:43 PM, Alvaro Herrera wrote: Andres Freund wrote: 2) Make movedb() (and possibly created(), not sure yet) use proper WAL logging and log the whole copied data. I think this is the right long term fix and would end up being much more reliable. But it either requires some uglyness during redo (creating nonexistant database directories on the fly during redo) or new wal records. Doable, but probably too large/invasive to backpatch. Not to mention the extra WAL traffic ... Yeah, I don't know that we actually want #2. It's bad enough to copy an entire database locally The local copy is pretty much fundamental. Given that tablespaces usually will be on different filesystems there's not much else we can do. If we want a optimization for moving databases across tablespaces if both are on the same filesystems and wal_level archive, we can do that. But that's pretty independent. And I doubt it's worthwile the developer time. , but to then put it's entire contents into WAL? Blech. Besides actually having a chance of being correct, doing so will save having to do two checkpoints inside movedb(). I think it's pretty likely that that actually saves overall IO, even including the WAL writes. Especially if there's other databases in the cluster at the same time. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers