Re: [HACKERS] Hot standby, dropping a tablespace
On Sun, 2009-01-25 at 19:56 +0200, Heikki Linnakangas wrote: > Simon Riggs wrote: > >> 2. Kill all connections by given user. Hmm, not used for anything, > >> actually. Should remove the roleId argument from GetConflictingVirtualXIDs. > > > > No, because we still need to add code to kill-connected-users if we drop > > role. > > Oh, I see, that's still a todo item. But we don't do that during normal > operation, so why should we during recovery? LOL. Not many systems allow you to continue working after the access has been removed. But not, as you say, a problem for Hot Standby unless that changes. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot standby, dropping a tablespace
Simon Riggs wrote: 2. Kill all connections by given user. Hmm, not used for anything, actually. Should remove the roleId argument from GetConflictingVirtualXIDs. No, because we still need to add code to kill-connected-users if we drop role. Oh, I see, that's still a todo item. But we don't do that during normal operation, so why should we during recovery? I'm thinking of an interface consisting of three functions, replacing the current GetConflictingVirtualXIDs and ResolveRecoveryConflictWithVirtualXIDs functions: I'm not sure I see any benefit in doing that. It makes the code more readable. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot standby, conflict resolution
On Fri, 2009-01-23 at 21:30 +0200, Heikki Linnakangas wrote: > Ok, then I think we have a little race condition. The startup process > doesn't get any reply indicating that the target backend has > processed > the SIGINT and set the cached conflict LSN. The target backend might > move ahead using the old LSN for a little while, even though the > startup > process has already gone ahead and replayed a vacuum record. > > Another tiny issue is that it looks like a new conflict LSN always > overwrites the old one. But you should always use the oldest > conflicted > LSN in the checks, not the newest. That makes it easier, because it is either not set, or it is set and does not need to be reset as new conflict LSNs appear. I can see a simple scheme emerging, which I will detail tomorrow morning. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot standby, dropping a tablespace
On Sun, 2009-01-25 at 11:28 +0200, Heikki Linnakangas wrote: > >> You then call > >> ResolveRecoveryConflictWithVirtualXIDs to kill such transactions, and > >> try removing the directory again. But > >> ResolveRecoveryConflictWithVirtualXIDs doesn't wait for the target > >> transaction to die anymore (or at least it shouldn't, as we discussed > >> earlier), so that doesn't work AFAICS. > > > > The FATAL errors inflicted should be fairly quick to take effect, so > > waiting should not be a problem. We can make waiting for FATAL errors > > the standard response. > > The call in tablespace replay uses ERROR, not FATAL. You don't need to > kill the whole session, just the current transaction. Yeh, yeh, it already does that, FATAL was just my typo. My bad. > It seems more and more to me that the FATAL and ERROR cases are really > quite different. Looking at the callers, there's four different needs > for GetConflictingVirtualXIDs+ResolveRecoveryConflictWithVirtualXIDs: I think the deferred case should be handled with a different function, since it needs further work on it and that is best done as a second function. > 1. Kill all connections to a given database. Used when replaying DROP > DATABASE. > > 2. Kill all connections by given user. Hmm, not used for anything, > actually. Should remove the roleId argument from GetConflictingVirtualXIDs. No, because we still need to add code to kill-connected-users if we drop role. > 3. Kill all transactions using given tablespace as temp tablespace, FROP > TABLESPACE. > > 4. Mark all transactions that still see a given XID as running for > termination if they try to access a buffer with conflicting LSN (VACUUM, > btree-deletes). > > All callers call GetConflictingVirtualXIDs first, and then > ResolveRecoveryConflictWithVirtualXIDs. That's a bit cumbersome; none of > the callers do anything else with the virtualxid array they get from > GetConflictingVirtualXIDs than pass it on to > ResolveRecoveryConflictWithVirtualXIDs. I'm thinking of an interface > consisting of three functions, replacing the current > GetConflictingVirtualXIDs and ResolveRecoveryConflictWithVirtualXIDs > functions: I'm not sure I see any benefit in doing that. The two big changes we did earlier were worthwhile but I'm concerned about how much refactoring you want to do. It introduces new bugs each time and currently they take time to isolate and fix - much of that work is currently not very visible, but its a big effort. If I was happy we had a perfect working solution and we were not under time pressure then I'd say go for it as much as you like. I'd prefer it if we could get everything correct before we put all the code in the right cupboards. I know tidy-up-as-you-go is a good policy but I'd encourage you to do a first pass looking for potential problems before we did that. Or maybe we're there already, not sure. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot standby, dropping a tablespace
On 2009-01-25, at 09:04, Simon Riggs wrote: On Sat, 2009-01-24 at 21:58 +0200, Heikki Linnakangas wrote: When replaying a DROP TABLE SPACE, you first try to remove the directory, and if that fails, you assume that it's because it's in use as a temp tablespace in a read-only transaction. That sounds like you think there another conclusion? What if subdirectory of that directory is owned by root ? Say I create /home/gj/tablespace1 . Surely emptying it is possible, but should it remove the dir altogether ? It is possible to create tablespace in that directory, even so postgres user doesn't own /home/gj/ directory. So why shouldn't it be possible to drop it ? -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot standby, dropping a tablespace
Simon Riggs wrote: On Sat, 2009-01-24 at 21:58 +0200, Heikki Linnakangas wrote: When replaying a DROP TABLE SPACE, you first try to remove the directory, and if that fails, you assume that it's because it's in use as a temp tablespace in a read-only transaction. That sounds like you think there another conclusion? No, I think that assumption is correct. You then call ResolveRecoveryConflictWithVirtualXIDs to kill such transactions, and try removing the directory again. But ResolveRecoveryConflictWithVirtualXIDs doesn't wait for the target transaction to die anymore (or at least it shouldn't, as we discussed earlier), so that doesn't work AFAICS. The FATAL errors inflicted should be fairly quick to take effect, so waiting should not be a problem. We can make waiting for FATAL errors the standard response. The call in tablespace replay uses ERROR, not FATAL. You don't need to kill the whole session, just the current transaction. It seems more and more to me that the FATAL and ERROR cases are really quite different. Looking at the callers, there's four different needs for GetConflictingVirtualXIDs+ResolveRecoveryConflictWithVirtualXIDs: 1. Kill all connections to a given database. Used when replaying DROP DATABASE. 2. Kill all connections by given user. Hmm, not used for anything, actually. Should remove the roleId argument from GetConflictingVirtualXIDs. 3. Kill all transactions using given tablespace as temp tablespace, FROP TABLESPACE. 4. Mark all transactions that still see a given XID as running for termination if they try to access a buffer with conflicting LSN (VACUUM, btree-deletes). All callers call GetConflictingVirtualXIDs first, and then ResolveRecoveryConflictWithVirtualXIDs. That's a bit cumbersome; none of the callers do anything else with the virtualxid array they get from GetConflictingVirtualXIDs than pass it on to ResolveRecoveryConflictWithVirtualXIDs. I'm thinking of an interface consisting of three functions, replacing the current GetConflictingVirtualXIDs and ResolveRecoveryConflictWithVirtualXIDs functions: /* Kill all connections to given database, wait for them to die */ void KillConnectionsToDatabase(Oid dbOid) /* Kill all transactions that use given tablespace, wait for them to die */ void KillTransactionsUsingTablespace(Oid tablespaceOid) /* Signal all backends that with xmin < limitXid that they need to abort if they access a page with LSN >= conflict_lsn. (Or wait for them to end, if max_standby_delay > 0). */ void ResolveRecoveryConflicts(Oid dbOid, TransactionId limitXid, XLogRecPtr conflict_lsn, char *reason) -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot standby, dropping a tablespace
On Sat, 2009-01-24 at 21:58 +0200, Heikki Linnakangas wrote: > When replaying a DROP TABLE SPACE, you first try to remove the > directory, and if that fails, you assume that it's because it's in use > as a temp tablespace in a read-only transaction. That sounds like you think there another conclusion? > You then call > ResolveRecoveryConflictWithVirtualXIDs to kill such transactions, and > try removing the directory again. But > ResolveRecoveryConflictWithVirtualXIDs doesn't wait for the target > transaction to die anymore (or at least it shouldn't, as we discussed > earlier), so that doesn't work AFAICS. The FATAL errors inflicted should be fairly quick to take effect, so waiting should not be a problem. We can make waiting for FATAL errors the standard response. > One quick work around would be to simply not respect temp_tablespace > during recovery... No, don't think that's acceptable. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby (v9d)
Simon Riggs wrote: On Sat, 2009-01-24 at 11:20 +, Simon Riggs wrote: On Sat, 2009-01-24 at 17:24 +1300, Mark Kirkwood wrote: version 9g - please use this for testing now I'm doing some test runs with this now. I notice an old flatfiles related bug has reappeared: I'm seeing an off-by-one error on xmax, in some cases. That then causes the flat file update to not pick up correct info, even though it executed in other ways as intended. If you run two create databases and then test only the first, it appears to have worked as intended. These bugs are result of recent refactoring and it will take a few days to shake some of them out. We've had more than 20 already so we're beating them back, but we're not done yet. I was at a loss to explain how this could have slipped through our tests. It appears that the error was corrected following each checkpoint as a result of ProcArrayUpdateRunningXacts(). Our tests were performed after a short delay, which typically would be greater than the deliberately short setting of checkpoint_timeout/archive_timeout and so by the time we looked the error was gone and masked the problem. We're setting checkpoint_timeout to 30 mins now to avoid the delay... That makes sense - I had archive_timeout set at 5 minutes, and I would have checked before 5 minutes were up. Cheers Mark -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby (v9d)
On Sat, 2009-01-24 at 11:20 +, Simon Riggs wrote: > On Sat, 2009-01-24 at 17:24 +1300, Mark Kirkwood wrote: > > > > version 9g - please use this for testing now > > > I'm doing some test runs with this now. I notice an old flatfiles > > related bug has reappeared: > > I'm seeing an off-by-one error on xmax, in some cases. That then causes > the flat file update to not pick up correct info, even though it > executed in other ways as intended. If you run two create databases and > then test only the first, it appears to have worked as intended. > > These bugs are result of recent refactoring and it will take a few days > to shake some of them out. We've had more than 20 already so we're > beating them back, but we're not done yet. I was at a loss to explain how this could have slipped through our tests. It appears that the error was corrected following each checkpoint as a result of ProcArrayUpdateRunningXacts(). Our tests were performed after a short delay, which typically would be greater than the deliberately short setting of checkpoint_timeout/archive_timeout and so by the time we looked the error was gone and masked the problem. We're setting checkpoint_timeout to 30 mins now to avoid the delay... -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby (v9d)
On Sat, 2009-01-24 at 17:24 +1300, Mark Kirkwood wrote: > > version 9g - please use this for testing now > I'm doing some test runs with this now. I notice an old flatfiles > related bug has reappeared: I'm seeing an off-by-one error on xmax, in some cases. That then causes the flat file update to not pick up correct info, even though it executed in other ways as intended. If you run two create databases and then test only the first, it appears to have worked as intended. These bugs are result of recent refactoring and it will take a few days to shake some of them out. We've had more than 20 already so we're beating them back, but we're not done yet. Thanks for the report, will publish this and other fixes on Monday. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby (v9d)
Simon Riggs wrote: On Thu, 2009-01-22 at 22:35 +, Simon Riggs wrote: * Bug fix v9 over next few days version 9g - please use this for testing now I'm doing some test runs with this now. I notice an old flatfiles related bug has reappeared: master: =# create database test; slave: =# select datname from pg_database where datname like 'test'; datname - test (1 row) postgres=# \c test FATAL: database "test" does not exist Previous connection kept -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby (v9d)
On Fri, 2009-01-23 at 20:13 +, Greg Stark wrote: > > If you have a serializable transaction with subtransactions that > > suffers > > a serializability error it only cancels the current subtransaction. > > This is a complete tangent to your work, but I wonder if this is > really right. I mean it's not as if we could have srrialized the > transaction as a whole with respect to whatever other transaction we > failed. Isn't this back to the discussion about PostgreSQL serializability versus true serializability? I agree that's not true serializability. Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby (v9d)
If you have a serializable transaction with subtransactions that suffers a serializability error it only cancels the current subtransaction. This is a complete tangent to your work, but I wonder if this is really right. I mean it's not as if we could have srrialized the transaction as a whole with respect to whatever other transaction we failed. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot standby, conflict resolution
On Fri, 2009-01-23 at 21:30 +0200, Heikki Linnakangas wrote: > >> Correct me if I'm wrong, but I thought the idea of this new conflict > >> resolution was that the startup process doesn't need to wait for the > >> target backend to die. Instead, the target backend knows to commit > >> suicide if it stumbles into a buffer that's been modified in a > >> conflicting way. Looking at ResolveRecoveryConflictWithVirtualXIDs, it > >> looks like we still wait. > > > > err, no, that's just an oversight, not intentional. > > Ok, then I think we have a little race condition. The startup process > doesn't get any reply indicating that the target backend has processed > the SIGINT and set the cached conflict LSN. The target backend might > move ahead using the old LSN for a little while, even though the startup > process has already gone ahead and replayed a vacuum record. Hah! You had me either way. Very neat :-) I'll think some more and reply, but not tonight. > Another tiny issue is that it looks like a new conflict LSN always > overwrites the old one. But you should always use the oldest conflicted > LSN in the checks, not the newest. OK -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby (v9d)
Heikki Linnakangas wrote: > Simon Riggs wrote: >> If you have a serializable transaction with subtransactions that suffers >> a serializability error it only cancels the current subtransaction. That >> means it's snapshot is still valid and can be used again. By analogy, as >> long as a transaction does not see any data that is inconsistent with >> its snapshot it seems OK for it to continue. So I think it is correct. > > Yeah, you're right. How bizarre. It was argued this way to me way back when subtransactions were written. Originally I had written it in such a way as to abort the whole transaction, on the rationale that if you're blindly restarting the subtransaction after a serialization error, it would result in a (maybe infinite) loop. I think the reason it only aborts the subxact is that that's what all other errors do, so why would this one act differently. Hmm, now that I think about it, I think it was deadlock errors not serialization errors ... -- Alvaro Herrerahttp://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot standby, conflict resolution
Simon Riggs wrote: On Fri, 2009-01-23 at 17:51 +0200, Heikki Linnakangas wrote: ISTM that if ReadBuffer read the value directly from the PGPROC entry, there would be no need for the signaling (in the ERROR mode). That is possible and I considered it. If we did it that way we would need to read the PGPROC each time we read a buffer. AFAICS we would need to use a spinlock to do that since reading an XLogRecPtr would not be atomic. Hmm, I think you could make it lock-free by ordering the instructions carefully. Not sure it's worth the code complexity, though. Correct me if I'm wrong, but I thought the idea of this new conflict resolution was that the startup process doesn't need to wait for the target backend to die. Instead, the target backend knows to commit suicide if it stumbles into a buffer that's been modified in a conflicting way. Looking at ResolveRecoveryConflictWithVirtualXIDs, it looks like we still wait. err, no, that's just an oversight, not intentional. Ok, then I think we have a little race condition. The startup process doesn't get any reply indicating that the target backend has processed the SIGINT and set the cached conflict LSN. The target backend might move ahead using the old LSN for a little while, even though the startup process has already gone ahead and replayed a vacuum record. Another tiny issue is that it looks like a new conflict LSN always overwrites the old one. But you should always use the oldest conflicted LSN in the checks, not the newest. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby (v9d)
Simon Riggs wrote: If you have a serializable transaction with subtransactions that suffers a serializability error it only cancels the current subtransaction. That means it's snapshot is still valid and can be used again. By analogy, as long as a transaction does not see any data that is inconsistent with its snapshot it seems OK for it to continue. So I think it is correct. Yeah, you're right. How bizarre. (I was sorely tempted to make it "snapshot too old", as a joke). Yeah, that is a very describing message, actually. If there wasn't any precedence to that, I believe we might have came up with exactly that message ourselves. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby (v9d)
On Fri, 2009-01-23 at 18:22 +0200, Heikki Linnakangas wrote: > > @@ -1601,6 +1602,24 @@ BufferProcessRecoveryConflictsIfAny(volatile > > BufferDesc *bufHdr) > > { > > XLogRecPtr bufLSN = BufferGetLSN(bufHdr); > > > > + /* > > +* If the buffer is recent we may need to cancel ourselves > > +* rather than risk returning a wrong answer. This test is > > +* too conservative, but it is correct. > > +* > >>> +* We only need to cancel the current subtransaction. > > +* Once we've handled the error then other subtransactions > > can > > +* continue processing. Note that we do *not* reset the > > +* BufferRecoveryConflictLSN at subcommit/abort, but we do > > +* reset it if we release our last remaining sbapshot. > > +* see SnapshotResetXmin() > > +* > > Is it really enough to cancel just the current subtransaction? What if > it's a serializable transaction? I did originally think that when I first looked at the problem. I'm sorry if I say that a lot. If you have a serializable transaction with subtransactions that suffers a serializability error it only cancels the current subtransaction. That means it's snapshot is still valid and can be used again. By analogy, as long as a transaction does not see any data that is inconsistent with its snapshot it seems OK for it to continue. So I think it is correct. (Bizarrely, this might mean that if we did this programatically in a loop we might keep the system busy for some time while it continually re-reads data and fails. But that's another story). You remind me that we can now do what Kevin has requested and throw a errcode(ERRCODE_T_R_SERIALIZATION_FAILURE) at this point, which I agree is the most easily understood way of describing this error. (I was sorely tempted to make it "snapshot too old", as a joke). -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot standby, conflict resolution
On Fri, 2009-01-23 at 17:51 +0200, Heikki Linnakangas wrote: > In ERROR mode, you don't really want to interrupt the target backend. In > ReadBuffer, you're checking a global variable, > BufferRecoveryConflictPending on each call, and if it's set, you check > the buffer's LSN against the LSN of the earliest LSN conflicting LSN, > and throw an error if it's greater than that. Why do we jump through so > many hoops to get the earliest conflicting LSN to where it's needed? At > the moment: > > 1. Startup process sets the LSN in the target backend's PGPROC entry, > and signals it with SIGINT. > 2. The target backend receives the signal; ProcessInterrupts is called > either immediately or at the next CHECK_FOR_INTERRUPTS() call. > 3. ProcessInterrupts reads the value from PGPROC, and passes it to bufmgr.c > > ISTM that if ReadBuffer read the value directly from the PGPROC entry, > there would be no need for the signaling (in the ERROR mode). That is possible and I considered it. If we did it that way we would need to read the PGPROC each time we read a buffer. AFAICS we would need to use a spinlock to do that since reading an XLogRecPtr would not be atomic. So doing it the way I've done it allows us to use a local variable which can be more easily cached and avoids the locking overhead. We do still need to signal anyway for the FATAL case, so we're not significantly affecting the patch footprint by changing that. > Correct me if I'm wrong, but I thought the idea of this new conflict > resolution was that the startup process doesn't need to wait for the > target backend to die. Instead, the target backend knows to commit > suicide if it stumbles into a buffer that's been modified in a > conflicting way. Looking at ResolveRecoveryConflictWithVirtualXIDs, it > looks like we still wait. err, no, that's just an oversight, not intentional. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby (v9d)
@@ -1601,6 +1602,24 @@ BufferProcessRecoveryConflictsIfAny(volatile BufferDesc *bufHdr) { XLogRecPtr bufLSN = BufferGetLSN(bufHdr); + /* +* If the buffer is recent we may need to cancel ourselves +* rather than risk returning a wrong answer. This test is +* too conservative, but it is correct. +* +* We only need to cancel the current subtransaction. +* Once we've handled the error then other subtransactions can +* continue processing. Note that we do *not* reset the +* BufferRecoveryConflictLSN at subcommit/abort, but we do +* reset it if we release our last remaining sbapshot. +* see SnapshotResetXmin() +* Is it really enough to cancel just the current subtransaction? What if it's a serializable transaction? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby (v9d)
Simon Riggs wrote: On Fri, 2009-01-23 at 16:14 +0200, Heikki Linnakangas wrote: I made a couple of minor changes after importing your patches. I've applied them both to v9g, attached here. If you wouldn't mind redoing the initial step, I will promise not to do anything else to the code, except via patch on GIT. No problem. I don't need to do it from scratch, I'll just apply the changes from that patch as an incremental commit. Done, you can see it in my git repo now too. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby (v9d)
On Fri, 2009-01-23 at 16:14 +0200, Heikki Linnakangas wrote: > Simon Riggs wrote: > > * Put corrected version into GIT > > * Produce outstanding items as patch-on-patch via GIT > > I've applied the hot standby patch and recovery infra v9 patch to > branches in my git repository, and pushed those to: > > git://git.postgresql.org/git/~hlinnaka/pgsql.git > > You can clone that to get started. > > I've set those branches up so that the hot standby branch is branched > off from the recovery infra branch. I'd like to keep the two separate, > as the recovery infra patch is useful on it's own, and can be reviewed > separately. > > As a teaser, I made a couple of minor changes after importing your > patches. For the sake of the archives, I've also included those changes > as patches here. OK, I'll look at those. I've fixed 6 bugs today (!), so I'd rather go for the new version coming out in an hour. That's too many to unpick unfortunately. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby (v9d)
Simon Riggs wrote: * Put corrected version into GIT * Produce outstanding items as patch-on-patch via GIT I've applied the hot standby patch and recovery infra v9 patch to branches in my git repository, and pushed those to: git://git.postgresql.org/git/~hlinnaka/pgsql.git You can clone that to get started. I've set those branches up so that the hot standby branch is branched off from the recovery infra branch. I'd like to keep the two separate, as the recovery infra patch is useful on it's own, and can be reviewed separately. As a teaser, I made a couple of minor changes after importing your patches. For the sake of the archives, I've also included those changes as patches here. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com >From 6d583356063c9d3c8d0e69233a40065bc5d7bde1 Mon Sep 17 00:00:00 2001 From: Heikki Linnakangas Date: Fri, 23 Jan 2009 14:37:05 +0200 Subject: [PATCH] Remove padding in XLogCtl; might be a good idea, but let's keep it simple for now. Also remove the XLogCtl->mode_lck spinlock, which is pretty pointless for a single boolean that's only written by one process. --- src/backend/access/transam/xlog.c | 24 +--- 1 files changed, 1 insertions(+), 23 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index fcf5657..42c4552 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -340,18 +340,11 @@ typedef struct XLogCtlWrite /* * Total shared-memory state for XLOG. - * - * This small structure is accessed by many backends, so we take care to - * pad out the parts of the structure so they can be accessed by separate - * CPUs without causing false sharing cache flushes. Padding is generous - * to allow for a wide variety of CPU architectures. */ -#define XLOGCTL_BUFFER_SPACING 128 typedef struct XLogCtlData { /* Protected by WALInsertLock: */ XLogCtlInsert Insert; - char InsertPadding[XLOGCTL_BUFFER_SPACING - sizeof(XLogCtlInsert)]; /* Protected by info_lck: */ XLogwrtRqst LogwrtRqst; @@ -359,16 +352,9 @@ typedef struct XLogCtlData uint32 ckptXidEpoch; /* nextXID & epoch of latest checkpoint */ TransactionId ckptXid; XLogRecPtr asyncCommitLSN; /* LSN of newest async commit */ - /* add data structure padding for above info_lck declarations */ - char InfoPadding[XLOGCTL_BUFFER_SPACING - sizeof(XLogwrtRqst) - - sizeof(XLogwrtResult) - - sizeof(uint32) - - sizeof(TransactionId) - - sizeof(XLogRecPtr)]; /* Protected by WALWriteLock: */ XLogCtlWrite Write; - char WritePadding[XLOGCTL_BUFFER_SPACING - sizeof(XLogCtlWrite)]; /* * These values do not change after startup, although the pointed-to pages @@ -390,11 +376,8 @@ typedef struct XLogCtlData * always during Recovery Processing Mode. This allows us to identify * code executed *during* Recovery Processing Mode but not necessarily * by Startup process itself. - * - * Protected by mode_lck */ bool SharedRecoveryProcessingMode; - slock_t mode_lck; /* * recovery target control information @@ -410,8 +393,6 @@ typedef struct XLogCtlData TransactionId recoveryLastXid; XLogRecPtr recoveryLastRecPtr; - char InfoLockPadding[XLOGCTL_BUFFER_SPACING]; - slock_t info_lck; /* locks shared variables shown above */ } XLogCtlData; @@ -4399,7 +4380,6 @@ XLOGShmemInit(void) XLogCtl->XLogCacheBlck = XLOGbuffers - 1; XLogCtl->Insert.currpage = (XLogPageHeader) (XLogCtl->pages); SpinLockInit(&XLogCtl->info_lck); - SpinLockInit(&XLogCtl->mode_lck); /* * If we are not in bootstrap mode, pg_control should already exist. Read @@ -6183,9 +6163,7 @@ IsRecoveryProcessingMode(void) if (xlogctl == NULL) return false; - SpinLockAcquire(&xlogctl->mode_lck); - LocalRecoveryProcessingMode = XLogCtl->SharedRecoveryProcessingMode; - SpinLockRelease(&xlogctl->mode_lck); + LocalRecoveryProcessingMode = xlogctl->SharedRecoveryProcessingMode; } knownProcessingMode = true; -- 1.5.6.5 >From 4061ac8f84cc699bf0f417689f853791089ed472 Mon Sep 17 00:00:00 2001 From: Heikki Linnakangas Date: Fri, 23 Jan 2009 15:55:33 +0200 Subject: [PATCH] Remove knownProcessingMode variable. --- src/backend/access/transam/xlog.c | 19 --- 1 files changed, 8 insertions(+), 11 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 7e480e2..e64fb48 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -126,9 +126,11 @@ bool InRecovery = false; /* Are we recovering using offline XLOG archives? */ static bool InArchiveRecovery = false; -/* Local copy of shared RecoveryProcessingMode state */ +/* + * Local copy of shared RecoveryProcessingMode state. True actually + * means "not known, need to check the shared state" + */ static bool LocalRecoveryProcessingMode = true; -static bool knownProcessingMode = false; /*
Re: [HACKERS] Hot standby, RestoreBkpBlocks and cleanup locks
Simon Riggs wrote: I did want you to commit those changes, but that was 8 days ago and much has changed since then. Now, I'm slightly worried that this may be a retrograde step. The last 3 days I've been refactoring the code to account for other requirements, as I have been discussing, and some of these things have now changed. Though the general pattern of your suggested refactoring remains the same. I figured there's possibly some further changes, but the general idea to move the responsibility of restoring backup blocks to the redo function should remain the same. Can we plan a move to GIT? We can both work on things at the same time and my changes can be more visible. There will be some initial pain... Sure, just go ahead and publish a repository. Or would you like me to apply the patches and publish a repository you can clone from? Perhaps that would be easier since I'm already familiar with GIT. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot standby, RestoreBkpBlocks and cleanup locks
On Tue, 2009-01-20 at 21:01 +0200, Heikki Linnakangas wrote: > Simon Riggs wrote: > > It would be useful to nibble away at the patch, committing all the > > necessary refactorings to make the patch work. That will reduce size of > > eventual commit. > > I committed this part now; one less thing to review. Please adjust the > patch accordingly. OK, thanks. I did want you to commit those changes, but that was 8 days ago and much has changed since then. Now, I'm slightly worried that this may be a retrograde step. The last 3 days I've been refactoring the code to account for other requirements, as I have been discussing, and some of these things have now changed. Though the general pattern of your suggested refactoring remains the same. I'll check in more detail, but please could we talk before you commit parts, otherwise we'll just trip over each other. I'll post the new version (v8f) (which is pre-commit) this evening, so we can discuss. Can we plan a move to GIT? We can both work on things at the same time and my changes can be more visible. There will be some initial pain... -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot standby, RestoreBkpBlocks and cleanup locks
Simon Riggs wrote: It would be useful to nibble away at the patch, committing all the necessary refactorings to make the patch work. That will reduce size of eventual commit. I committed this part now; one less thing to review. Please adjust the patch accordingly. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby dev build (v8)
On Mon, 2009-01-19 at 16:50 +0200, Heikki Linnakangas wrote: > Simon Riggs wrote: > > I prefer the last one, but if you think otherwise, please shout. > > We're now emitting WAL records for relcache invalidations, IIRC. I > wonder if those are useful for this? Tom already objected to putting strange inval messages into WAL. Hmm, DROP USER is transactional, so we can only do this at commit. So forget the other ideas I had. We already know about the auth file update at commit. So we should say, at commit, re-read the list of roleids in use and if any don't match a row in pg_user then remove them. If we do that after the flat file update and the actual commit that removes the user then we will be guaranteed no race condition exists to allow new users to logon as we try to disconnect them. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby dev build (v8)
Simon Riggs wrote: I prefer the last one, but if you think otherwise, please shout. We're now emitting WAL records for relcache invalidations, IIRC. I wonder if those are useful for this? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby dev build (v8)
On Mon, 2009-01-19 at 12:54 +, Simon Riggs wrote: > Some refactoring in this area is also required > because we need to handle two other types of conflict to correctly > support drop database and drop user, which is now underway. I've hung the drop database conflict code in dbase_redo(). For drop role, there isn't an rmgr at all, but I can add code in a few places. * add XLOG_DBASE_DROP_USER - i.e. add drop user to the Database rmgr * DropRole() takes an AccessExclusiveLock, so we do write a WAL record for it. I could add a special case to the Relation rmgr. * Add a new rmgr (unused slot 7) and have it handle DropRole. I prefer the last one, but if you think otherwise, please shout. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby dev build (v8)
On Mon, 2009-01-19 at 15:47 +0200, Heikki Linnakangas wrote: > Simon Riggs wrote: > > I suggest we handle this on the recovery side, not on the master, by > > deriving the xmin at the point the WAL record arrives. We would > > calculate it by looking at recovery procs only. That will likely give us > > a later value than we would get from the master, but that can't be > > helped. > > Hmm, that's an interesting idea. It presumes that we see an abort/commit > WAL record at the right moment for every transaction that we have a > recovery proc for. We just concluded in the other thread that we do > always emit abortion records when the database is running normally; I > think that's good enough for this purpose. But not perfect. > A few other random ideas I had: > > - in btree delete redo, follow the index pointers, and look at the xids > on the heap tuples. That requires some random I/O, but will give the > exact value we need. Since it's quite expensive, I think we'd only want > to do it after using some more conservative test but quicker test to > determine that there might be a conflict. Ouch. > - Add latestRemovedXid to b-tree page header, and update it as tuples > are killed. Need to tolerate the fact that tuple kills are not WAL-logged. Sounds easy-ish. If tuple kills aren't WAL logged then if we crash latestRemovedXid will remain as it was at time of last write. So if we do a delete scan it will only remove the index tuples with hint bits set at time of that write, so the value would always be correct, no? I'm somehow uncomfortable with this idea though. Care to persuade me further? > > Btree deletes were an important optimisation when it first went it, but > > now we have HOT it is much less important. > > If HOT is working well for your application, there won't be many btree > deletes anyway, and the whole issue is moot. That was my point. > > Another route might be to put > > an option to turn off btree delete on the master, default = on. We > > probably should consider turning it off entirely when it doesn't yield > > significant benefit. > > I'd rather put in a generic mechanism to prevent vacuuming of recent > tuples that might still be needed in the standby. Like always > subtracting a fixed amount of xids from OldestXmin/RecentGlobalXmin, or > having a feedback loop from the standby to the master, allowing the > master to say what it's oldest xmin is. But that's a fair amount of > work; I'd rather leave that as a future enhancement, and just figure out > something simple for this specific issue. We'll need to handle it > gracefully even if we try to avoid it by retaining dead tuples longer. Yeh, looked at both of those also. Definitely after sync rep goes in though. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby dev build (v8)
Simon Riggs wrote: I suggest we handle this on the recovery side, not on the master, by deriving the xmin at the point the WAL record arrives. We would calculate it by looking at recovery procs only. That will likely give us a later value than we would get from the master, but that can't be helped. Hmm, that's an interesting idea. It presumes that we see an abort/commit WAL record at the right moment for every transaction that we have a recovery proc for. We just concluded in the other thread that we do always emit abortion records when the database is running normally; I think that's good enough for this purpose. A few other random ideas I had: - in btree delete redo, follow the index pointers, and look at the xids on the heap tuples. That requires some random I/O, but will give the exact value we need. Since it's quite expensive, I think we'd only want to do it after using some more conservative test but quicker test to determine that there might be a conflict. - Add latestRemovedXid to b-tree page header, and update it as tuples are killed. Need to tolerate the fact that tuple kills are not WAL-logged. Btree deletes were an important optimisation when it first went it, but now we have HOT it is much less important. If HOT is working well for your application, there won't be many btree deletes anyway, and the whole issue is moot. Another route might be to put an option to turn off btree delete on the master, default = on. We probably should consider turning it off entirely when it doesn't yield significant benefit. I'd rather put in a generic mechanism to prevent vacuuming of recent tuples that might still be needed in the standby. Like always subtracting a fixed amount of xids from OldestXmin/RecentGlobalXmin, or having a feedback loop from the standby to the master, allowing the master to say what it's oldest xmin is. But that's a fair amount of work; I'd rather leave that as a future enhancement, and just figure out something simple for this specific issue. We'll need to handle it gracefully even if we try to avoid it by retaining dead tuples longer. Lots of scanning to remove the odd row is probably pretty wasteful and likely adds contention at the very point we don't want it - index splits. Remember that if you can remove enough dead tuples from the index page, you've just made room on the page and don't need to split. Splitting is pretty expensive compared to scanning a few line pointers. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby dev build (v8)
On Mon, 2009-01-19 at 14:00 +0200, Heikki Linnakangas wrote: > Simon Riggs wrote: > > GetSnapshotData() sets > > RecentGlobalXmin to the result of the snapshot's xmin. > > No. RecentGlobalXmin is set to the oldest *xmin* observed, across all > running transactions. TransactionXmin is the xid of the oldest running > transaction. IOW, RecentGlobalXmin is the xid of transaction that the > oldest running transaction sees as running. OK. That was fun. These WAL records are annoying, no matter what the exact value of latestRemovedXid is and they seem likely to conflict with many queries on the standby. If we don't use RecentGlobalXmin then I can't see any easily derived value that we can use in its place. It isn't worth the effort on the master to derive a more exact value, not when we don't even know if it matters yet. I suggest we handle this on the recovery side, not on the master, by deriving the xmin at the point the WAL record arrives. We would calculate it by looking at recovery procs only. That will likely give us a later value than we would get from the master, but that can't be helped. For me, this makes it essential now that I put in place the deferred cancellation mechanism. Some refactoring in this area is also required because we need to handle two other types of conflict to correctly support drop database and drop user, which is now underway. Btree deletes were an important optimisation when it first went it, but now we have HOT it is much less important. Another route might be to put an option to turn off btree delete on the master, default = on. We probably should consider turning it off entirely when it doesn't yield significant benefit. Lots of scanning to remove the odd row is probably pretty wasteful and likely adds contention at the very point we don't want it - index splits. Thoughts? -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby dev build (v8)
Simon Riggs wrote: GetSnapshotData() sets RecentGlobalXmin to the result of the snapshot's xmin. No. RecentGlobalXmin is set to the oldest *xmin* observed, across all running transactions. TransactionXmin is the xid of the oldest running transaction. IOW, RecentGlobalXmin is the xid of transaction that the oldest running transaction sees as running. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby dev build (v8)
On Mon, 2009-01-19 at 12:50 +0200, Heikki Linnakangas wrote: > Simon Riggs wrote: > > One of us needs a coffee. > > Clearly, I put the kettle on... I had one too, just in case. > > How does Transaction 4 have a RecentGlobalXmin of 2 in step (7), but at > > step (9) the value of RecentGlobalXmin has gone backwards? > > Looks like I mixed up the xids of the two transactions in steps 8 and 9. > Let's see if I got it right this time: > > 1. Transaction 1 begins in backend A > 2. Transaction 2 begins in backend B, xmin = 1 > 3. Transaction 1 ends > 4. Transaction 3 begins in backend C, xmin = 2 > 5. Backend C gets snapshot, TransactionXmin = 2, RecentGlobalXmin = 1 > 6. Transaction 2 ends. > 7. Transaction 4 begins in backend A, gets snapshot TransactionXmin = 2, > RecentGlobalXmin = 2 > 8. Transaction 4 kills tuple, using its RecentGlobalxmin of 2 > 9. Transaction 3 splits the page, emits a delete xlog record, setting > latestRemovedXid to its RecentGlobalXmin of 1 I don't see how step (5) is possible. GetSnapshotData() sets RecentGlobalXmin to the result of the snapshot's xmin. If step (5) is possible, then yes, step (9) can happen. You are correct to say that RecentGlobalXmin is not always correctly set. All I'm saying is that at the exact time, place and circumstance I use it, it is correct. In other cases, it may not be. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby dev build (v8)
Simon Riggs wrote: One of us needs a coffee. Clearly, I put the kettle on... How does Transaction 4 have a RecentGlobalXmin of 2 in step (7), but at step (9) the value of RecentGlobalXmin has gone backwards? Looks like I mixed up the xids of the two transactions in steps 8 and 9. Let's see if I got it right this time: 1. Transaction 1 begins in backend A 2. Transaction 2 begins in backend B, xmin = 1 3. Transaction 1 ends 4. Transaction 3 begins in backend C, xmin = 2 5. Backend C gets snapshot, TransactionXmin = 2, RecentGlobalXmin = 1 6. Transaction 2 ends. 7. Transaction 4 begins in backend A, gets snapshot TransactionXmin = 2, RecentGlobalXmin = 2 8. Transaction 4 kills tuple, using its RecentGlobalxmin of 2 9. Transaction 3 splits the page, emits a delete xlog record, setting latestRemovedXid to its RecentGlobalXmin of 1 -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby dev build (v8)
On Mon, 2009-01-19 at 12:22 +0200, Heikki Linnakangas wrote: > Simon Riggs wrote: > > Well, steps 7 and 8 don't make sense. > > > > Your earlier comment was that it was possible for a WAL record to be > > written with a RecentGlobalXmin that was lower than other backends > > values. In step 9 the RecentGlobalXmin is *not* lower than any other > > backend, it is the same. > > > > So if there is a proof, this isn't it. > > Yeah, you're right. I got steps 8 and 9 mixed. Let me try again: > > 1. Transaction 1 begins in backend A > 2. Transaction 2 begins in backend B, xmin = 1 > 3. Transaction 1 ends > 4. Transaction 3 begins in backend C, xmin = 2 > 5. Backend C gets snapshot, TransactionXmin = 2, RecentGlobalXmin = 1 > 6. Transaction 2 ends. > 7. Transaction 4 begins in backend A, gets snapshot TransactionXmin = 2, > RecentGlobalXmin = 2 > 8. Transaction 3 kills tuple, using its RecentGlobalxmin of 2 > 9. Transaction 4 splits the page, emits a delete xlog record, setting > latestRemovedXid to its RecentGlobalXmin of 1 One of us needs a coffee. How does Transaction 4 have a RecentGlobalXmin of 2 in step (7), but at step (9) the value of RecentGlobalXmin has gone backwards? -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby dev build (v8)
Simon Riggs wrote: Well, steps 7 and 8 don't make sense. Your earlier comment was that it was possible for a WAL record to be written with a RecentGlobalXmin that was lower than other backends values. In step 9 the RecentGlobalXmin is *not* lower than any other backend, it is the same. So if there is a proof, this isn't it. Yeah, you're right. I got steps 8 and 9 mixed. Let me try again: 1. Transaction 1 begins in backend A 2. Transaction 2 begins in backend B, xmin = 1 3. Transaction 1 ends 4. Transaction 3 begins in backend C, xmin = 2 5. Backend C gets snapshot, TransactionXmin = 2, RecentGlobalXmin = 1 6. Transaction 2 ends. 7. Transaction 4 begins in backend A, gets snapshot TransactionXmin = 2, RecentGlobalXmin = 2 8. Transaction 3 kills tuple, using its RecentGlobalxmin of 2 9. Transaction 4 splits the page, emits a delete xlog record, setting latestRemovedXid to its RecentGlobalXmin of 1 But I can't see how there can be one: Two concurrent vacuums can have different OldestXmin values, but two concurrent transactions cannot. Of course they can. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby dev build (v8)
On Mon, 2009-01-19 at 09:16 +0200, Heikki Linnakangas wrote: > Simon Riggs wrote: > > On Fri, 2009-01-16 at 22:09 +0200, Heikki Linnakangas wrote: > > > RecentGlobalXmin is just a hint, it lags behind the real oldest xmin > that GetOldestXmin() would return. If another backend has a more recent > RecentGlobalXmin value, and has killed more recent tuples on the page, > the latestRemovedXid written here is too old. > >>> What do you think we should do instead? > >> Dunno. Maybe call GetOldestXmin(). > > > > We are discussing btree deletes, not btree vacuums. > > Pardon my ignorance, but what's the difference? In terms of current HEAD, not much. In terms of Hot Standby, a significant difference - the two actions have been split, rather than continuing to share the same WAL record. XLOG_BTREE_VACUUM removes index tuples as a result of a vacuum. The initial scan of the heap already generated an XLOG_HEAP2_CLEANUP_INFO which gives the latestRemovedXid for that vacuum. So we don't need to worry about putting a latestRemovedXid on XLOG_BTREE_VACUUM. The WAL records also differ because the XLOG_BTREE_VACUUM contains details of blocks that need to be pinned but not otherwise touched. XLOG_BTREE_DELETE is different in 3 ways. It isn't part of a vacuum, so: * we don't need to take a cleanup lock * it doesn't contain info about other blocks we need to scan beforehand for correctness purposes * it wasn't preceded by an XLOG_HEAP2_CLEANUP_INFO record, so it must have a *correct* (even if too conservative) value for latestRemovedXid set. So the only time we need to set latestRemovedXid correctly is during a normal transaction, not during a vacuum. > > If we are doing > > btree delete then we have an unreleased snapshot therefore we also have > > a non-zero xmin. How can another backend have a later RecentGlobalXmin > > or result from GetOldestXmin() than we do? > > Sure it can, for example: > > 1. Transaction 1 begins in backend A > 2. Transaction 2 begins in backend B, xmin = 1 > 3. Transaction 1 ends > 4. Transaction 3 begins in backend C, xmin = 2 > 5. Backend C gets snapshot, TransactionXmin = 2, RecentGlobalXmin = 1 > 6. Transaction 2 ends. > 7. Transaction 4 begins in backend A, gets snapshot TransactionXmin = 2, > RecentGlobalXmin = 2 > 8. Transaction 4 kills tuple, using its RecentGlobalxmin of 1 > 9. Transaciont 3 splits the page, emits a delete xlog record, setting > latestRemovedXid to its RecentGlobalXmin of 2 Well, steps 7 and 8 don't make sense. Your earlier comment was that it was possible for a WAL record to be written with a RecentGlobalXmin that was lower than other backends values. In step 9 the RecentGlobalXmin is *not* lower than any other backend, it is the same. So if there is a proof, this isn't it. But I can't see how there can be one: Two concurrent vacuums can have different OldestXmin values, but two concurrent transactions cannot. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby dev build (v8)
Simon Riggs wrote: On Fri, 2009-01-16 at 22:09 +0200, Heikki Linnakangas wrote: RecentGlobalXmin is just a hint, it lags behind the real oldest xmin that GetOldestXmin() would return. If another backend has a more recent RecentGlobalXmin value, and has killed more recent tuples on the page, the latestRemovedXid written here is too old. What do you think we should do instead? Dunno. Maybe call GetOldestXmin(). We are discussing btree deletes, not btree vacuums. Pardon my ignorance, but what's the difference? If we are doing btree delete then we have an unreleased snapshot therefore we also have a non-zero xmin. How can another backend have a later RecentGlobalXmin or result from GetOldestXmin() than we do? Sure it can, for example: 1. Transaction 1 begins in backend A 2. Transaction 2 begins in backend B, xmin = 1 3. Transaction 1 ends 4. Transaction 3 begins in backend C, xmin = 2 5. Backend C gets snapshot, TransactionXmin = 2, RecentGlobalXmin = 1 6. Transaction 2 ends. 7. Transaction 4 begins in backend A, gets snapshot TransactionXmin = 2, RecentGlobalXmin = 2 8. Transaction 4 kills tuple, using its RecentGlobalxmin of 1 9. Transaciont 3 splits the page, emits a delete xlog record, setting latestRemovedXid to its RecentGlobalXmin of 2 -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby dev build (v8)
On Fri, 2009-01-16 at 22:09 +0200, Heikki Linnakangas wrote: > >> RecentGlobalXmin is just a hint, it lags behind the real oldest xmin > >> that GetOldestXmin() would return. If another backend has a more recent > >> RecentGlobalXmin value, and has killed more recent tuples on the page, > >> the latestRemovedXid written here is too old. > > > > What do you think we should do instead? > > Dunno. Maybe call GetOldestXmin(). We are discussing btree deletes, not btree vacuums. If we are doing btree delete then we have an unreleased snapshot therefore we also have a non-zero xmin. How can another backend have a later RecentGlobalXmin or result from GetOldestXmin() than we do? -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby dev build (v8)
Simon Riggs wrote: On Fri, 2009-01-16 at 16:33 +0200, Heikki Linnakangas wrote: I don't think RecentGlobalXmin is good enough here: ! /* !* We would like to set an accurate latestRemovedXid, but there !* is no easy way of obtaining a useful value. So we use the !* probably far too conservative value of RecentGlobalXmin instead. !*/ ! xlrec_delete.latestRemovedXid = RecentGlobalXmin; ! rdata[0].data = (char *) &xlrec_delete; ! rdata[0].len = SizeOfBtreeDelete; RecentGlobalXmin is just a hint, it lags behind the real oldest xmin that GetOldestXmin() would return. If another backend has a more recent RecentGlobalXmin value, and has killed more recent tuples on the page, the latestRemovedXid written here is too old. What do you think we should do instead? Dunno. Maybe call GetOldestXmin(). -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby dev build (v8)
On Fri, 2009-01-16 at 16:33 +0200, Heikki Linnakangas wrote: > I don't think RecentGlobalXmin is good enough here: > > > ! /* > > !* We would like to set an accurate latestRemovedXid, > > but there > > !* is no easy way of obtaining a useful value. So we > > use the > > !* probably far too conservative value of > > RecentGlobalXmin instead. > > !*/ > > ! xlrec_delete.latestRemovedXid = RecentGlobalXmin; > > ! rdata[0].data = (char *) &xlrec_delete; > > ! rdata[0].len = SizeOfBtreeDelete; > > RecentGlobalXmin is just a hint, it lags behind the real oldest xmin > that GetOldestXmin() would return. If another backend has a more recent > RecentGlobalXmin value, and has killed more recent tuples on the page, > the latestRemovedXid written here is too old. What do you think we should do instead? -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby dev build (v8)
I don't think RecentGlobalXmin is good enough here: ! /* !* We would like to set an accurate latestRemovedXid, but there !* is no easy way of obtaining a useful value. So we use the !* probably far too conservative value of RecentGlobalXmin instead. !*/ ! xlrec_delete.latestRemovedXid = RecentGlobalXmin; ! rdata[0].data = (char *) &xlrec_delete; ! rdata[0].len = SizeOfBtreeDelete; RecentGlobalXmin is just a hint, it lags behind the real oldest xmin that GetOldestXmin() would return. If another backend has a more recent RecentGlobalXmin value, and has killed more recent tuples on the page, the latestRemovedXid written here is too old. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot standby, slot ids and stuff
On Thu, 2009-01-15 at 22:10 +, Simon Riggs wrote: > I notice that we are no longer able to record the databaseId for a > connection, so query conflict resolution cannot be isolated to > individual databases any longer. Wrong way round. Incoming WAL records from dbOid on them, so we can still check for conflicts against the db of the standby backends. Phew! > I'll leave all of the databaseId stuff in there in case we think of > anything good. Ripping out now as final part of earlier refactoring. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot standby, slot ids and stuff
On Thu, 2009-01-08 at 22:31 +0200, Heikki Linnakangas wrote: > Simon Riggs wrote: > > If you want to do things a different way you need to say what you want > > to do and what effects those changes will have. > > I want to reduce the coupling between the primary and the master. The > less they need to communicate, the better. I want to get rid of slotid, > and as many of the other extra information carried in WAL records that I > can. I believe that will make the patch both simpler and more robust. > > > Are there tradeoffs? If so what are they? > > I don't think there's any big difference in user-visible behavior. I notice that we are no longer able to record the databaseId for a connection, so query conflict resolution cannot be isolated to individual databases any longer. We might sometimes infer a transaction's databaseId from certain WAL records but that is only going to be possible within each rmgr, which is fairly strange. I'll leave all of the databaseId stuff in there in case we think of anything good. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot standby, RestoreBkpBlocks and cleanup locks
On Thu, 2009-01-15 at 18:16 +0200, Heikki Linnakangas wrote: > Simon Riggs wrote: > Is the first really useful? I would understand "advance until next > cleanup record *that would block/kill queries*", but why would you > want to advance until the next cleanup record? Minor difference there, but noted. > Anyway, if it is useful, you could do the pausing in > XactResolveRecoveryConflicts, too. Well that spreads code around that was previously fairly tight, which I'm not that happy with but I'll do as you suggest. We need to crack on now. I want to keep the feature at least until we have some serious feedback in the beta phase that it isn't necessary. Usability is close behind correctness and robustness. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot standby, RestoreBkpBlocks and cleanup locks
On Thu, 2009-01-15 at 18:05 +0200, Heikki Linnakangas wrote: > Simon Riggs wrote: > > The idea outlined before didn't deal with all call points for > > RecordIsCleanupRecord(), so doesn't actually work. > > Are we talking about the same thing? I guess not. Look at the other call points for that function, cos your suggestion didn't include them AFAICS. > If we put the control of locking to > the hands of the redo-function, I don't see why it couldn't use a lock > of the right strength. Surely the redo-function can be taught what lock > it needs to take. Yes, it can, but there were other uses. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot standby, RestoreBkpBlocks and cleanup locks
Simon Riggs wrote: The idea outlined before didn't deal with all call points for RecordIsCleanupRecord(), so doesn't actually work. Hmm, grep finds two call points for that: ! case RECOVERY_TARGET_PAUSE_CLEANUP: ! /* !* Advance until we see a cleanup record, then pause. !*/ ! if (RecordIsCleanupRecord(record)) ! pauseHere = true; ! break; ! and + /* +* Wait, kill or otherwise resolve any conflicts between +* incoming cleanup records and user queries. This is the + * main barrier that allows MVCC to work correctly when + * running standby servers. Only need to do this if there +* is a possibility that users may be active. +*/ + if (reachedSafeStartPoint && RecordIsCleanupRecord(record)) + ResolveRedoVisibilityConflicts(EndRecPtr, record); The second we can easily handle by getting rid of ResolveRedoVisibilityConflicts functions and making the redo-functions call XactResolveRecoveryConflicts when necessary. Is the first really useful? I would understand "advance until next cleanup record *that would block/kill queries*", but why would you want to advance until the next cleanup record? Anyway, if it is useful, you could do the pausing in XactResolveRecoveryConflicts, too. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot standby, RestoreBkpBlocks and cleanup locks
Simon Riggs wrote: The idea outlined before didn't deal with all call points for RecordIsCleanupRecord(), so doesn't actually work. Are we talking about the same thing? If we put the control of locking to the hands of the redo-function, I don't see why it couldn't use a lock of the right strength. Surely the redo-function can be taught what lock it needs to take. ISTM easier to do things within the rmgr at the time WAL records are written, rather than in the rmgr while handling redo. I don't like that idea. I'd like to keep the coupling between the primary and standby to the minimum. This avoids another rmgr call and is much more straightforward since we define how to redo the record at the time it is written, rather than via a separate mechanism that could mismatch. The code that generates a WAL record and the redo-functions need to match in general anyway. The strength of the lock is not any more error-prone than other things that a redo-function must do. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot standby, RestoreBkpBlocks and cleanup locks
On Fri, 2009-01-09 at 19:34 +0200, Heikki Linnakangas wrote: > Simon Riggs wrote: > > It would be useful to nibble away at the patch, committing all the > > necessary refactorings to make the patch work. That will reduce size of > > eventual commit. > > Agreed. This in particular is a change I feel pretty confident to commit > beforehand. I'm looking to implement refactoring of this now. The idea outlined before didn't deal with all call points for RecordIsCleanupRecord(), so doesn't actually work. ISTM easier to do things within the rmgr at the time WAL records are written, rather than in the rmgr while handling redo. We currently have 2 bytes spare on the WAL record, so I propose to add an uint16 xl_info2 field (again). This can then be marked with 2 bits: * 1 bit to show that it is a cleanup record and may conflict * 1 bit to show that backup blocks must be applied with cleanup lock Just to say again that adding these is free: we use no more space because of alignment. This avoids another rmgr call and is much more straightforward since we define how to redo the record at the time it is written, rather than via a separate mechanism that could mismatch. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot standby, slot ids and stuff
On Sat, 2009-01-10 at 09:40 +, Simon Riggs wrote: > But looking at that, I > think the 6a patch had a bug there: a subtransaction abort record > would release locks for the whole top level xact. Fixed. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot standby and b-tree killed items
On Tue, 2008-12-30 at 18:31 +0200, Heikki Linnakangas wrote: > You have to be careful to ignore the flags in read-only transactions > that started in hot standby mode, even if recovery has since ended and > we're in normal operation now. My initial implementation in v6 worked, but had a corner case if a transaction spanned the change from recovery into normal processing. I'm now done on a more complete fix, will be in v8. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot standby, RestoreBkpBlocks and cleanup locks
Please commit soon On Fri, 2009-01-09 at 18:30 +0200, Heikki Linnakangas wrote: > The hot standby patch has some hacks to decide which full-page-images > can be restored holding an exclusive lock and which ones need a > vacuum-strength lock. It's not very pretty as is, as mentioned in > comments too. > > How about we refactor things so that redo-functions are responsible for > calling RestoreBkpBlocks? The redo function can then pass an argument > indicating what kind of lock is required. We should also change > XLogReadBufferExtended so that it doesn't lock the page; the caller > knows better what kind of a lock it needs. That makes it more analogous > with ReadBufferExtended too, although I think we should keep > XLogReadBuffer() unchanged for now. > > See attached patch. One shortfall of this patch is that you can pass > only one argument to RestoreBkpBlocks, but there can multiple backup > blocks in one WAL record. That's OK in current usage, though. > -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot standby, slot ids and stuff
Simon Riggs wrote: Rather than store the parent xid itself we store the difference between the current xid and the parent xid. Typically this will be less than 65535; when it is not we set it to zero but issue an xid assignment xlog record. That sounds pretty hacky. However, I think XactLockTableWait() doesn't need to know the parent either. (This feels more like wishful thinking, but here goes anyway). We release locks *after* TransactionIdAbortTree() has fully executed, so the test for TransactionIdIsInProgress(xid) will always see the abort status, if set. Notice that if we are awake at all it is because the top-level transaction is complete or our subxid is aborted. So there is never any need to look at the status of the parent, nor in fact any need to look at the procarray at all, which is always a waste of effort. Right, we don't currently write a WAL record at subtransaction commit, only at subtransaction abort or top-level commit. So the problem described in the comment at XactLockTableWait() can't arise in the standby. Actually, I wonder if we should write a WAL record at subtransaction commit too, to save on shared memory in the standby as well. If you believe that, you'll want to commit the attached patch (or something similar with comments refactored etc). Umm, we still need the SubTransGetParent() call in normal operation. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot standby, slot ids and stuff
On Mon, 2009-01-12 at 14:10 +0200, Heikki Linnakangas wrote: > > However, I think XactLockTableWait() doesn't need to know the parent > > either. (This feels more like wishful thinking, but here goes anyway). > > We release locks *after* TransactionIdAbortTree() has fully executed, so > > the test for TransactionIdIsInProgress(xid) will always see the abort > > status, if set. Notice that if we are awake at all it is because the > > top-level transaction is complete or our subxid is aborted. So there is > > never any need to look at the status of the parent, nor in fact any need > > to look at the procarray at all, which is always a waste of effort. > > Right, we don't currently write a WAL record at subtransaction commit, > only at subtransaction abort or top-level commit. So the problem > described in the comment at XactLockTableWait() can't arise in the standby. Good. So we can exclude parent_xid and just include top level xid. So we are now officially back inside the current xlog record padding: we don't need to increase WAL record length and therefore we can lose slotid (yay!). > Actually, I wonder if we should write a WAL record at subtransaction > commit too, to save on shared memory in the standby as well. You understand that the current design never performs XactLockTableInsert() for individual top/sub transactions? There would be no memory overhead to save. The only locks allowed are AccessShareLocks which never conflict with anything except for AccessExclusiveLocks. If an AEL conflicts with an ASL we blow away the ASL holders. If an ASL requestor is blocked by a an AEL the wait is performed on the single lock table entry for the Startup process, which acts as a proxy for all emulated AEL lock holders. > > If > > you believe that, you'll want to commit the attached patch (or something > > similar with comments refactored etc). > > Umm, we still need the SubTransGetParent() call in normal operation. OK, that was really just a thought experiment anyway to prove the point one way or the other. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot standby, slot ids and stuff
Heikki, can I get your feedback on this urgently please? I want to respond positively to your review comments and complete something you will find acceptable. But I need to know what that is, please. On Sun, 2009-01-11 at 11:55 +, Simon Riggs wrote: > On Sun, 2009-01-11 at 10:41 +0200, Heikki Linnakangas wrote: > > Simon Riggs wrote: > > > There's a confusion in the patch between top level xid and parent xid. > > > The xl field is named parentxid but actually contains top level. That > > > distinction is important because the code now uses the top level xid to > > > locate the recovery proc, formerly the role of the slotid. > > > > True, I changed the meaning halfway through. My thinking was that we can > > get away by only tracking the top-level xact of each subtransaction, not > > the whole tree of subtransactions. > > > > > This leads to an error when we SubTransSetParent(child_xid, top_xid); > > > since this assumes that the top_xid is the parent, which it is not. > > > Mostly you wouldn't notice unless you were looking up the subtrans > > > status for an xid that had committed but was the child of an aborted > > > subtransaction, with the top level xid having > 64 subtransactions. > > > > Hmm. When a subtransaction aborts, RecordTransactionAbort is called and > > clog is updated to show the subtransaction and all it's subcommitted > > children as aborted. I think we're safe, though it wouldn't hurt to > > double-check. > > That part of your argument is correct but we are not safe. But please > look at XactLockTableWait() and see what you think. According to > comments this would lead to different lock release behaviour. > > Beauty, thy name is not subtransaction. > > If you agree that we need the parent xid then we have further problems. > Adding another xid onto the header of each WAL record will add 8 bytes > onto each record because of alignment. This was the other reason for > slotid, since I was able to fit that into just 2 bytes and avoid the 8 > byte loss. Moving swiftly on, I have this morning though of an > alternative, so at least we now have some choice. Rather than store the > parent xid itself we store the difference between the current xid and > the parent xid. Typically this will be less than 65535; when it is not > we set it to zero but issue an xid assignment xlog record. > > However, I think XactLockTableWait() doesn't need to know the parent > either. (This feels more like wishful thinking, but here goes anyway). > We release locks *after* TransactionIdAbortTree() has fully executed, so > the test for TransactionIdIsInProgress(xid) will always see the abort > status, if set. Notice that if we are awake at all it is because the > top-level transaction is complete or our subxid is aborted. So there is > never any need to look at the status of the parent, nor in fact any need > to look at the procarray at all, which is always a waste of effort. If > you believe that, you'll want to commit the attached patch (or something > similar with comments refactored etc). > > > For xl_rel_lock you add a field called xid and then ask > > > /* xid of the *parent* transaction. XXX why parent? */. > > > You've done this because it replaced slotid. But looking at that, I > > > think the 6a patch had a bug there: a subtransaction abort record would > > > release locks for the whole top level xact. So we need to pass both top > > > level xid (or slotid) and xid for each lock, then release using the > > > actual xid only. > > > > Hmm, would just the subtransaction xid be enough? If we use that to > > release, what do we need the top-level xid for? > > So we can release all locks taken by subxids at once, when we learn that > a top level xid has disappeared. A minor point. > -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot standby, slot ids and stuff
On Sun, 2009-01-11 at 10:41 +0200, Heikki Linnakangas wrote: > Simon Riggs wrote: > > There's a confusion in the patch between top level xid and parent xid. > > The xl field is named parentxid but actually contains top level. That > > distinction is important because the code now uses the top level xid to > > locate the recovery proc, formerly the role of the slotid. > > True, I changed the meaning halfway through. My thinking was that we can > get away by only tracking the top-level xact of each subtransaction, not > the whole tree of subtransactions. > > > This leads to an error when we SubTransSetParent(child_xid, top_xid); > > since this assumes that the top_xid is the parent, which it is not. > > Mostly you wouldn't notice unless you were looking up the subtrans > > status for an xid that had committed but was the child of an aborted > > subtransaction, with the top level xid having > 64 subtransactions. > > Hmm. When a subtransaction aborts, RecordTransactionAbort is called and > clog is updated to show the subtransaction and all it's subcommitted > children as aborted. I think we're safe, though it wouldn't hurt to > double-check. That part of your argument is correct but we are not safe. But please look at XactLockTableWait() and see what you think. According to comments this would lead to different lock release behaviour. Beauty, thy name is not subtransaction. If you agree that we need the parent xid then we have further problems. Adding another xid onto the header of each WAL record will add 8 bytes onto each record because of alignment. This was the other reason for slotid, since I was able to fit that into just 2 bytes and avoid the 8 byte loss. Moving swiftly on, I have this morning though of an alternative, so at least we now have some choice. Rather than store the parent xid itself we store the difference between the current xid and the parent xid. Typically this will be less than 65535; when it is not we set it to zero but issue an xid assignment xlog record. However, I think XactLockTableWait() doesn't need to know the parent either. (This feels more like wishful thinking, but here goes anyway). We release locks *after* TransactionIdAbortTree() has fully executed, so the test for TransactionIdIsInProgress(xid) will always see the abort status, if set. Notice that if we are awake at all it is because the top-level transaction is complete or our subxid is aborted. So there is never any need to look at the status of the parent, nor in fact any need to look at the procarray at all, which is always a waste of effort. If you believe that, you'll want to commit the attached patch (or something similar with comments refactored etc). > For xl_rel_lock you add a field called xid and then ask > > /* xid of the *parent* transaction. XXX why parent? */. > > You've done this because it replaced slotid. But looking at that, I > > think the 6a patch had a bug there: a subtransaction abort record would > > release locks for the whole top level xact. So we need to pass both top > > level xid (or slotid) and xid for each lock, then release using the > > actual xid only. > > Hmm, would just the subtransaction xid be enough? If we use that to > release, what do we need the top-level xid for? So we can release all locks taken by subxids at once, when we learn that a top level xid has disappeared. A minor point. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support Index: src/backend/access/transam/xact.c === RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/access/transam/xact.c,v retrieving revision 1.271 diff -c -r1.271 xact.c *** src/backend/access/transam/xact.c 1 Jan 2009 17:23:36 - 1.271 --- src/backend/access/transam/xact.c 11 Jan 2009 11:32:24 - *** *** 416,422 s->transactionId = GetNewTransactionId(isSubXact); if (isSubXact) ! SubTransSetParent(s->transactionId, s->parent->transactionId); /* * Acquire lock on the transaction XID. (We assume this cannot block.) We --- 416,422 s->transactionId = GetNewTransactionId(isSubXact); if (isSubXact) ! SubTransSetParent(s->transactionId, GetTopTransactionIdIfAny()); /* * Acquire lock on the transaction XID. (We assume this cannot block.) We Index: src/backend/storage/lmgr/lmgr.c === RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/storage/lmgr/lmgr.c,v retrieving revision 1.99 diff -c -r1.99 lmgr.c *** src/backend/storage/lmgr/lmgr.c 1 Jan 2009 17:23:47 - 1.99 --- src/backend/storage/lmgr/lmgr.c 11 Jan 2009 11:50:51 - *** *** 477,485 LockRelease(&tag, ShareLock, false); ! if (!TransactionIdIsInProgress(xid)) break; - xid = SubTransGetParent(xid); } } --- 477,484 LockRelease(&tag, ShareLock, false); ! if (TransactionIdDidAbort(xid)
Re: [HACKERS] Hot standby, slot ids and stuff
Simon Riggs wrote: There's a confusion in the patch between top level xid and parent xid. The xl field is named parentxid but actually contains top level. That distinction is important because the code now uses the top level xid to locate the recovery proc, formerly the role of the slotid. True, I changed the meaning halfway through. My thinking was that we can get away by only tracking the top-level xact of each subtransaction, not the whole tree of subtransactions. This leads to an error when we SubTransSetParent(child_xid, top_xid); since this assumes that the top_xid is the parent, which it is not. Mostly you wouldn't notice unless you were looking up the subtrans status for an xid that had committed but was the child of an aborted subtransaction, with the top level xid having > 64 subtransactions. Hmm. When a subtransaction aborts, RecordTransactionAbort is called and clog is updated to show the subtransaction and all it's subcommitted children as aborted. I think we're safe, though it wouldn't hurt to double-check. It's an important point that needs documenting, though. I completely neglected that. I'm wasn't looking for ways to reintroduce slotid, but it seems more logical to keep slotid in light of the above. However, you will probably view this as intransigence, so I will await your input. Yeah, it sure does seem like intransigence :-) For xl_rel_lock you add a field called xid and then ask /* xid of the *parent* transaction. XXX why parent? */. You've done this because it replaced slotid. But looking at that, I think the 6a patch had a bug there: a subtransaction abort record would release locks for the whole top level xact. So we need to pass both top level xid (or slotid) and xid for each lock, then release using the actual xid only. Hmm, would just the subtransaction xid be enough? If we use that to release, what do we need the top-level xid for? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot standby, slot ids and stuff
On Sat, 2009-01-10 at 11:45 +, Simon Riggs wrote: > On Sat, 2009-01-10 at 09:40 +, Simon Riggs wrote: > > > This leads to an error when we SubTransSetParent(child_xid, top_xid); > > since this assumes that the top_xid is the parent, which it is not. > > Mostly you wouldn't notice unless you were looking up the subtrans > > status for an xid that had committed but was the child of an aborted > > subtransaction, with the top level xid having > 64 subtransactions. > > It's possible the confusion leads to other bugs in UnobservedXid > > processing, but I didn't look too hard at that. > > > > AFAICS we need both parent and top xids. > > I wonder if its possible to derive the parent by looking at the > highest/most newly assigned xid? Abort records would remove aborted > subtransactions and AFAIK we currently assign a new subtransaction only > ever from the latest current subtransaction. (This wouldn't be > necessarily true if supported true branch-anywhere subtransactions, but > we don't). Sounds correct, but not really sure. Starting to sound like a do-me-later-if-ever optimisation and certainly nothing I want to rely on in court. I'm progressing with parent_xid added to the xlog record header, for now. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot standby, slot ids and stuff
On Sat, 2009-01-10 at 09:40 +, Simon Riggs wrote: > This leads to an error when we SubTransSetParent(child_xid, top_xid); > since this assumes that the top_xid is the parent, which it is not. > Mostly you wouldn't notice unless you were looking up the subtrans > status for an xid that had committed but was the child of an aborted > subtransaction, with the top level xid having > 64 subtransactions. > It's possible the confusion leads to other bugs in UnobservedXid > processing, but I didn't look too hard at that. > > AFAICS we need both parent and top xids. I wonder if its possible to derive the parent by looking at the highest/most newly assigned xid? Abort records would remove aborted subtransactions and AFAIK we currently assign a new subtransaction only ever from the latest current subtransaction. (This wouldn't be necessarily true if supported true branch-anywhere subtransactions, but we don't). Sounds correct, but not really sure. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot standby, slot ids and stuff
On Thu, 2009-01-08 at 20:30 +0200, Heikki Linnakangas wrote: > Since I've been whining about that for some time, I figured I have to > put my money where my mouth is, so here's a patch based on v6a that > eliminates the concept of slotids, as well as the new xl_info2 field in > XLogRecord. This seems much simpler to me. I haven't given it much > testing, but seems to work. There's a whole bunch of comments marked > with XXX that need resolving, though. There's a confusion in the patch between top level xid and parent xid. The xl field is named parentxid but actually contains top level. That distinction is important because the code now uses the top level xid to locate the recovery proc, formerly the role of the slotid. This leads to an error when we SubTransSetParent(child_xid, top_xid); since this assumes that the top_xid is the parent, which it is not. Mostly you wouldn't notice unless you were looking up the subtrans status for an xid that had committed but was the child of an aborted subtransaction, with the top level xid having > 64 subtransactions. It's possible the confusion leads to other bugs in UnobservedXid processing, but I didn't look too hard at that. AFAICS we need both parent and top xids. Or put another way, we need the parent xid and other info that allows us to uniquely determine the proc we need to update. Now the "other info..." could be top xid or it could also be slotid, which then avoids later zig-zagging to look up the proc. I'm wasn't looking for ways to reintroduce slotid, but it seems more logical to keep slotid in light of the above. However, you will probably view this as intransigence, so I will await your input. I'm very happy that GetStandbyInfoForTransaction() and all the XLR2 flags have bitten the dust and will sleep for eternity. For xl_rel_lock you add a field called xid and then ask /* xid of the *parent* transaction. XXX why parent? */. You've done this because it replaced slotid. But looking at that, I think the 6a patch had a bug there: a subtransaction abort record would release locks for the whole top level xact. So we need to pass both top level xid (or slotid) and xid for each lock, then release using the actual xid only. You also ask: Shouldn't we call StartupSUBTRANS() and the other startup functions like we do below, before letting anyone in? My answer is that the current role of StartupSUBTRANS and friends is not appropriate at that point, since they zero out those structures. I left those routines in place thinking "startup" meant "moving to normal running". If we did have a "startupsubtrans" at the point you note, it would currently be empty: we don't keep track of the latest page during recovery. Perhaps we should, but then we'd need to do the equivalent of ExtendSubtrans etc, which it seemed easier to avoid. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot standby, slot ids and stuff
On Fri, 2009-01-09 at 11:20 +, Simon Riggs wrote: > On Fri, 2009-01-09 at 12:33 +0200, Heikki Linnakangas wrote: > > Simon Riggs wrote: > > > On Thu, 2009-01-08 at 15:50 -0500, Tom Lane wrote: > > >> Simon Riggs writes: > > >>> On Thu, 2009-01-08 at 22:31 +0200, Heikki Linnakangas wrote: > > When a backend dies with FATAL, it writes an abort record before > > exiting. > > > > (I was under the impression it doesn't until few minutes ago myself, > > when I actually read the shutdown code :-)) > > >>> Not in all cases; keep reading :-) > > >> If it doesn't, that's a bug. A FATAL exit is not supposed to leave the > > >> shared state corrupted, it's only supposed to be a forcible session > > >> termination. Any open transaction should be rolled back. > > > > > > Please look back at the earlier discussion we had on this exact point: > > > http://archives.postgresql.org/pgsql-hackers/2008-09/msg01809.php > > > > I think the running-xacts list we dump to WAL at every checkpoint is > > enough to handle that. Just treat the dead transaction as in-progress > > until the next running-xacts record. It's presumably extremely rare to > > have a process die with FATAL, and not write an abort record. > > I agree, but I'll wait for Tom to speak further. OK, will proceed without this. Time is pressing. Heikki and I both agree that FATAL errors that fail to write abort records are rare and an acceptable problem in real usage. If they do occur, their nuisance factor is short-lived because of measures taken within the patch. Hot Standby does not *rely* upon there always an abort record for FATAL errors, so we cannot reasonably say the current design would be "unacceptably fragile" as I had once thought. So based upon that, out comes the slotid concept and luckily much of the cruftier aspects of the patch. Less code, probably fewer bugs. Good thing. I will produce a new v7 with those changes and merge the changes for v6b also, so we can begin review again from there. Hi ho, hi ho... -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot standby, RestoreBkpBlocks and cleanup locks
Simon Riggs wrote: On Fri, 2009-01-09 at 18:30 +0200, Heikki Linnakangas wrote: How about we refactor things? Was that a patch against HEAD or a patch on patch? Against HEAD. It would be useful to nibble away at the patch, committing all the necessary refactorings to make the patch work. That will reduce size of eventual commit. Agreed. This in particular is a change I feel pretty confident to commit beforehand. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot standby, RestoreBkpBlocks and cleanup locks
On Fri, 2009-01-09 at 18:30 +0200, Heikki Linnakangas wrote: > How about we refactor things? Was that a patch against HEAD or a patch on patch? It would be useful to nibble away at the patch, committing all the necessary refactorings to make the patch work. That will reduce size of eventual commit. Do you want me to start using the GIT repo to make it easier to extract parts? You'd need to show me the setup you use first. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot standby, RestoreBkpBlocks and cleanup locks
On Fri, 2009-01-09 at 18:30 +0200, Heikki Linnakangas wrote: > The hot standby patch has some hacks to decide which full-page-images > can be restored holding an exclusive lock and which ones need a > vacuum-strength lock. It's not very pretty as is, as mentioned in > comments too. Agreed! > How about we refactor things so that redo-functions are responsible for > calling RestoreBkpBlocks? The redo function can then pass an argument > indicating what kind of lock is required. We should also change > XLogReadBufferExtended so that it doesn't lock the page; the caller > knows better what kind of a lock it needs. That makes it more analogous > with ReadBufferExtended too, although I think we should keep > XLogReadBuffer() unchanged for now. Much better idea, thanks. I felt a new rmgr function was overkill but couldn't think of how to do this. > See attached patch. One shortfall of this patch is that you can pass > only one argument to RestoreBkpBlocks, but there can multiple backup > blocks in one WAL record. That's OK in current usage, though. If we're doing this because of cleanup locks, then I'd say we don't currently need a cleanup lock with any WAL record that uses multiple backup blocks. So we can just document that so anybody adding such a record in the future will be careful. So all seems good. Would you want to push ResolveRedoVisibilityConflicts() down into the rmgrs as well and make reachedSafeStartPoint a global? That is only called for cleanup records. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot standby, slot ids and stuff
On Fri, 2009-01-09 at 14:38 +0200, Heikki Linnakangas wrote: > Simon Riggs wrote: > > On Fri, 2009-01-09 at 13:23 +0200, Heikki Linnakangas wrote: > >> I mean the standby should stop trying to track the in progress > >> transactions in recovery procs, and apply the WAL records like it does > >> before the consistent state is reached. > > > > ... > > > > So, if we don't PANIC, how should we behave? > > > > Without full information on running-xacts we would be unable to take a > > snapshot, so should: > > * backends be forcibly disconnected? > > * backends hang waiting for snapshot info to be re-available again in X > > minutes worth of WAL time? > > * backends throw an ERROR: unable to provide snapshot at this time, > > DETAIL: retry your statement later. > > ...other alternatives > > > > and possibly prevent new connections. > > All of those seem reasonable to me. The 2nd option seems nicest, "X > minutes" should probably be controlled by max_standby_delay, after which > you can throw an error. Hmm, we use the recovery procs to track transactions that have TransactionIds assigned. That means we will overflow only if we have approach 100% write transactions at any time, or if we have more write transactions in progress than we have max_connections on standby. So it sounds like the overflow situation would probably be both rare and, if it did occur, may not occur for long periods. > If we care enough, we could also keep tracking the transactions in > backend-private memory of the startup process, until there's enough room > in proc array. That would make the outage shorter, because you wouldn't > have to wait until the next running-xacts record, but only until enough > transactions have finished that they all fit in proc array again. > > But whatever is the simplest, really. The above does sound best since it would allow us to have the snapshot hang for a short period. But at this stage of the game, more complex. For now though, since it looks like it would happen fairly rarely, I'd opt for the simplest: throw an ERROR. > > If max_connections is higher on primary then the standby will *never* be > > available for querying. Should we have multiple ERRORs depending upon > > whether the situation is hopefully-temporary or looks-permanent? > > > > Don't assume I want the PANIC. That clearly needs to be revisited if we > > change slotids. > > It needs to be revisited whether we change slotids or not, IMHO. > > Note that with slotids, you have a problem as soon as any of the slots > that don't exist on standby are used, regardless of how many concurrent > transactions there actually is. Without slots you only have a problem if > you really have more than standby's max_connections concurrent > transactions. That makes a big difference in practice. Sometimes, but mostly people set max_connections higher because they intend to use those extra connections. So no real advantage there against the slotid approach :-) -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot standby, slot ids and stuff
Simon Riggs wrote: On Fri, 2009-01-09 at 13:23 +0200, Heikki Linnakangas wrote: I mean the standby should stop trying to track the in progress transactions in recovery procs, and apply the WAL records like it does before the consistent state is reached. ... So, if we don't PANIC, how should we behave? Without full information on running-xacts we would be unable to take a snapshot, so should: * backends be forcibly disconnected? * backends hang waiting for snapshot info to be re-available again in X minutes worth of WAL time? * backends throw an ERROR: unable to provide snapshot at this time, DETAIL: retry your statement later. ...other alternatives and possibly prevent new connections. All of those seem reasonable to me. The 2nd option seems nicest, "X minutes" should probably be controlled by max_standby_delay, after which you can throw an error. If we care enough, we could also keep tracking the transactions in backend-private memory of the startup process, until there's enough room in proc array. That would make the outage shorter, because you wouldn't have to wait until the next running-xacts record, but only until enough transactions have finished that they all fit in proc array again. But whatever is the simplest, really. If max_connections is higher on primary then the standby will *never* be available for querying. Should we have multiple ERRORs depending upon whether the situation is hopefully-temporary or looks-permanent? Don't assume I want the PANIC. That clearly needs to be revisited if we change slotids. It needs to be revisited whether we change slotids or not, IMHO. Note that with slotids, you have a problem as soon as any of the slots that don't exist on standby are used, regardless of how many concurrent transactions there actually is. Without slots you only have a problem if you really have more than standby's max_connections concurrent transactions. That makes a big difference in practice. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot standby, slot ids and stuff
On Fri, 2009-01-09 at 13:23 +0200, Heikki Linnakangas wrote: > Simon Riggs wrote: > > On Fri, 2009-01-09 at 12:33 +0200, Heikki Linnakangas wrote: > >> A related issue is that currently the recovery PANICs if it runs out of > >> recovery procs. I think that's not acceptable, regardless of whether we > >> use slotids or some other method to avoid it in normal operation, > >> because it means you can't recover at all if you set max_connections too > >> low in the standby (or in the primary, and you have to recover from > >> crash), or you run out of recovery procs because of an abort failed in > >> the primary like discussed on that thread. > > > >> The standby should just > >> fast-forward to the next running-xacts record in that case. > > > > What do you mean by "fast forward"? > > I mean the standby should stop trying to track the in progress > transactions in recovery procs, and apply the WAL records like it does > before the consistent state is reached. If you say something is not acceptable you need to say what behaviour you do find acceptable in its place. It's good to come up with new ideas, but it's not good to ignore the problems the new ideas have. This is a general point that applies both here and to your proposals with synch rep. It's much harder to say how it should work in a way that covers all the requirements and points others have made, otherwise its just handwaving. So, if we don't PANIC, how should we behave? Without full information on running-xacts we would be unable to take a snapshot, so should: * backends be forcibly disconnected? * backends hang waiting for snapshot info to be re-available again in X minutes worth of WAL time? * backends throw an ERROR: unable to provide snapshot at this time, DETAIL: retry your statement later. ...other alternatives and possibly prevent new connections. If max_connections is higher on primary then the standby will *never* be available for querying. Should we have multiple ERRORs depending upon whether the situation is hopefully-temporary or looks-permanent? Don't assume I want the PANIC. That clearly needs to be revisited if we change slotids. I just want to make a balanced judgement based upon a full consideration of the options. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot standby, slot ids and stuff
Simon Riggs wrote: On Fri, 2009-01-09 at 12:33 +0200, Heikki Linnakangas wrote: A related issue is that currently the recovery PANICs if it runs out of recovery procs. I think that's not acceptable, regardless of whether we use slotids or some other method to avoid it in normal operation, because it means you can't recover at all if you set max_connections too low in the standby (or in the primary, and you have to recover from crash), or you run out of recovery procs because of an abort failed in the primary like discussed on that thread. The standby should just fast-forward to the next running-xacts record in that case. What do you mean by "fast forward"? I mean the standby should stop trying to track the in progress transactions in recovery procs, and apply the WAL records like it does before the consistent state is reached. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot standby, slot ids and stuff
On Fri, 2009-01-09 at 12:33 +0200, Heikki Linnakangas wrote: > Simon Riggs wrote: > > On Thu, 2009-01-08 at 15:50 -0500, Tom Lane wrote: > >> Simon Riggs writes: > >>> On Thu, 2009-01-08 at 22:31 +0200, Heikki Linnakangas wrote: > When a backend dies with FATAL, it writes an abort record before exiting. > > (I was under the impression it doesn't until few minutes ago myself, > when I actually read the shutdown code :-)) > >>> Not in all cases; keep reading :-) > >> If it doesn't, that's a bug. A FATAL exit is not supposed to leave the > >> shared state corrupted, it's only supposed to be a forcible session > >> termination. Any open transaction should be rolled back. > > > > Please look back at the earlier discussion we had on this exact point: > > http://archives.postgresql.org/pgsql-hackers/2008-09/msg01809.php > > I think the running-xacts list we dump to WAL at every checkpoint is > enough to handle that. Just treat the dead transaction as in-progress > until the next running-xacts record. It's presumably extremely rare to > have a process die with FATAL, and not write an abort record. I agree, but I'll wait for Tom to speak further. > A related issue is that currently the recovery PANICs if it runs out of > recovery procs. I think that's not acceptable, regardless of whether we > use slotids or some other method to avoid it in normal operation, > because it means you can't recover at all if you set max_connections too > low in the standby (or in the primary, and you have to recover from > crash), or you run out of recovery procs because of an abort failed in > the primary like discussed on that thread. > The standby should just > fast-forward to the next running-xacts record in that case. What do you mean by "fast forward"? -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot standby, slot ids and stuff
Simon Riggs wrote: On Thu, 2009-01-08 at 15:50 -0500, Tom Lane wrote: Simon Riggs writes: On Thu, 2009-01-08 at 22:31 +0200, Heikki Linnakangas wrote: When a backend dies with FATAL, it writes an abort record before exiting. (I was under the impression it doesn't until few minutes ago myself, when I actually read the shutdown code :-)) Not in all cases; keep reading :-) If it doesn't, that's a bug. A FATAL exit is not supposed to leave the shared state corrupted, it's only supposed to be a forcible session termination. Any open transaction should be rolled back. Please look back at the earlier discussion we had on this exact point: http://archives.postgresql.org/pgsql-hackers/2008-09/msg01809.php I think the running-xacts list we dump to WAL at every checkpoint is enough to handle that. Just treat the dead transaction as in-progress until the next running-xacts record. It's presumably extremely rare to have a process die with FATAL, and not write an abort record. A related issue is that currently the recovery PANICs if it runs out of recovery procs. I think that's not acceptable, regardless of whether we use slotids or some other method to avoid it in normal operation, because it means you can't recover at all if you set max_connections too low in the standby (or in the primary, and you have to recover from crash), or you run out of recovery procs because of an abort failed in the primary like discussed on that thread. The standby should just fast-forward to the next running-xacts record in that case. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot standby, slot ids and stuff
On Thu, 2009-01-08 at 15:50 -0500, Tom Lane wrote: > Simon Riggs writes: > > On Thu, 2009-01-08 at 22:31 +0200, Heikki Linnakangas wrote: > >> When a backend dies with FATAL, it writes an abort record before exiting. > >> > >> (I was under the impression it doesn't until few minutes ago myself, > >> when I actually read the shutdown code :-)) > > > Not in all cases; keep reading :-) > > If it doesn't, that's a bug. A FATAL exit is not supposed to leave the > shared state corrupted, it's only supposed to be a forcible session > termination. Any open transaction should be rolled back. Please look back at the earlier discussion we had on this exact point: http://archives.postgresql.org/pgsql-hackers/2008-09/msg01809.php Heikki, perhaps now you understand my continued opposition to your ideas for simplification: I had already thought of them and was forced to rule them out, not through my own choice. Tom, note that I listen to what you say and try to write code that meets those requirements. >From here, I don't mind which way we go. I want code that is as simple as possible to "do the job", whatever we agree that to be. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot standby, slot ids and stuff
Simon Riggs writes: > On Thu, 2009-01-08 at 22:31 +0200, Heikki Linnakangas wrote: >> When a backend dies with FATAL, it writes an abort record before exiting. >> >> (I was under the impression it doesn't until few minutes ago myself, >> when I actually read the shutdown code :-)) > Not in all cases; keep reading :-) If it doesn't, that's a bug. A FATAL exit is not supposed to leave the shared state corrupted, it's only supposed to be a forcible session termination. Any open transaction should be rolled back. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot standby, slot ids and stuff
Simon Riggs wrote: On Thu, 2009-01-08 at 22:31 +0200, Heikki Linnakangas wrote: When a backend dies with FATAL, it writes an abort record before exiting. (I was under the impression it doesn't until few minutes ago myself, when I actually read the shutdown code :-)) Not in all cases; keep reading :-) Want to give a clue? ShutdownPostgres is registered as an on_shmem_exit hook in InitPostgres, and ShutdownPostgres calls AbortOutOfAnyTransaction. PANIC is another story, but in that case the primary will go down, and will write a new checkpoint soon after it starts up again. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot standby, slot ids and stuff
On Thu, 2009-01-08 at 22:31 +0200, Heikki Linnakangas wrote: > Simon Riggs wrote: > > * if FATAL errors occur, yet we have long running transactions then we > > have no way of removing those entries from the recovery procs. Since we > > have a fixed pool of recovery transactions we won't have anywhere to > > store that data. Snapshot sizes are fixed maximum with max_connections, > > so eventually we would not be able to take a snapshot at all, and we'd > > need to have a "ERROR: unable to take valid snapshot". > > When a backend dies with FATAL, it writes an abort record before exiting. > > (I was under the impression it doesn't until few minutes ago myself, > when I actually read the shutdown code :-)) Not in all cases; keep reading :-) The good thing is we have a choice now as to whether we care about that, or not. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot standby, slot ids and stuff
Simon Riggs wrote: If you want to do things a different way you need to say what you want to do and what effects those changes will have. I want to reduce the coupling between the primary and the master. The less they need to communicate, the better. I want to get rid of slotid, and as many of the other extra information carried in WAL records that I can. I believe that will make the patch both simpler and more robust. Are there tradeoffs? If so what are they? I don't think there's any big difference in user-visible behavior. RecordKnownAssignedTransactionId now needs to be called for every xlog record as opposed to just the first record where an xid appears, because I eliminated the hint flag in WAL to mark those records. And it needs to look up the recover proc by xid, instead of using the slot id. But I don't think that will have a significant impact on performance. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot standby, slot ids and stuff
Simon Riggs wrote: * if FATAL errors occur, yet we have long running transactions then we have no way of removing those entries from the recovery procs. Since we have a fixed pool of recovery transactions we won't have anywhere to store that data. Snapshot sizes are fixed maximum with max_connections, so eventually we would not be able to take a snapshot at all, and we'd need to have a "ERROR: unable to take valid snapshot". When a backend dies with FATAL, it writes an abort record before exiting. (I was under the impression it doesn't until few minutes ago myself, when I actually read the shutdown code :-)) * if FATAL errors occur while holding AccessExclusiveLock then we have no way of releasing those locks until the recovery proc is stale, which might be some time. Not sure if your patch releases those? I don't think so, that needs to be fixed. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot standby, slot ids and stuff
On Thu, 2009-01-08 at 20:30 +0200, Heikki Linnakangas wrote: > I've mentioned before that I don't like the slotid stuff. From an > architectural point of view, we should try to keep the extra > communication between primary and standby to a minimum, for the sake of > robustness. The primary shouldn't have such a big role in keeping track > of which xids it has already emitted in WAL records, which subxids have > been marked, and so on. Removing slotid does improve the code, I agree. It was never there for reasons other than the functions and features it provides. Effects of removing slotid are at least these * if FATAL errors occur, yet we have long running transactions then we have no way of removing those entries from the recovery procs. Since we have a fixed pool of recovery transactions we won't have anywhere to store that data. Snapshot sizes are fixed maximum with max_connections, so eventually we would not be able to take a snapshot at all, and we'd need to have a "ERROR: unable to take valid snapshot". * if FATAL errors occur while holding AccessExclusiveLock then we have no way of releasing those locks until the recovery proc is stale, which might be some time. Not sure if your patch releases those? * xid->proc lookup is now very slow and needs to be called more frequently; will still be slower even with the further optimisations you mention. Possibly a minor point with right tuning. * slotid removed from xlrec header; makes no difference with 64-bit systems because of alignment issues. ...perhaps more, not sure yet. All we need to do is decide if we want these things or not. If not, then we can remove the slotid stuff. If it wasn't clear before, this was, for me, never a discussion about a particular way of coding things. I care little for that and would probably go with your suggestions more often than not. It's just simply about what we want it to do. If you want to argue in favour of restrictions to make things simpler, that's OK and I could probably live with them quite happily myself. I'm trying to implement the project philosophy of It Just Works, no restrictions or caveats - and I know if I had not then the patch would have been rejected on that basis, as has happened many times before. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot standby, slot ids and stuff
On Thu, 2009-01-08 at 20:30 +0200, Heikki Linnakangas wrote: > I've mentioned before that I don't like the slotid stuff. From an > architectural point of view, we should try to keep the extra > communication between primary and standby to a minimum, for the sake of > robustness. The primary shouldn't have such a big role in keeping track > of which xids it has already emitted in WAL records, which subxids have > been marked, and so on. > > Since I've been whining about that for some time, I figured I have to > put my money where my mouth is Yes, you do, but I think not like this. Turning up with a patch and saying "my way is better" but not actually saying what that way *is* just confuses things. If you want to do things a different way you need to say what you want to do and what effects those changes will have. Are there tradeoffs? If so what are they? ISTM easier to discuss them on list first than to write a load of code and ask us all to read, understand and comment. > , so here's a patch based on v6a that > eliminates the concept of slotids, as well as the new xl_info2 field in > XLogRecord. This seems much simpler to me. I haven't given it much > testing, but seems to work. There's a whole bunch of comments marked > with XXX that need resolving, though. > > Attached is a patch agains CVS HEAD (hs-noslotid-1.patch) as well as an > incremental patch against your v6a (hs-noslotid-against-v6a.patch). > Sorry if you were just working on some big changes and I joggled your > elbow. I also didn't check what changes you had in v7 yet. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot standby and b-tree killed items
On Tue, 2008-12-30 at 18:31 +0200, Heikki Linnakangas wrote: > Simon Riggs wrote: > > (a) always ignore LP_DEAD flags we see when reading index during > > recovery. > > This sounds simplest, and it's nice to not clear the flags for the > benefit of transactions running after the recovery is done. Agreed. (Also: Transaction hint bits are always set correctly, because we would only ever see a full page write with hints set after the commit/abort record was processed. So I continue to honour transaction hint bit reading and setting during recovery). > You have to be careful to ignore the flags in read-only transactions > that started in hot standby mode, even if recovery has since ended and > we're in normal operation now. Got that. I'm setting ignore_killed_tuples = false at the start of any index scan during recovery. And kill_prior_tuples is never set true when in recovery. Both measures are AM-agnostic. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot standby and b-tree killed items
Simon Riggs wrote: (a) always ignore LP_DEAD flags we see when reading index during recovery. This sounds simplest, and it's nice to not clear the flags for the benefit of transactions running after the recovery is done. You have to be careful to ignore the flags in read-only transactions that started in hot standby mode, even if recovery has since ended and we're in normal operation now. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot standby and b-tree killed items
Simon Riggs wrote: Issues (2) and (3) would go away entirely if both standby and primary always had the same xmin value as a system-wide setting. i.e. the standby and primary are locked together at their xmins. Perhaps that was Heikki's intention in recent suggestions? No, I only suggested that as an optional optimization. We can't rely on it, because the queries on standby should still work correctly if the connection to primary is lost for some reason, or if the primary decides not to honor standby's xmin, perhaps to avoid the usual issues with long-running-transactions. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot standby and b-tree killed items
On Fri, 2008-12-19 at 09:22 -0500, Greg Stark wrote: > I'm confused shouldn't read-only transactions on the slave just be > hacked to not set any hint bits including lp_delete? It seems there are multiple issues involved and I saw only the first of these initially. I want to explicitly separate these issues so we can discuss them more easily. 1. When we replay an XLOG_BTREE_DELETE record, we may have to wait-then-cancel-etc other sessions. Possibly a pain, but these records are not very common now that we have HOT, except on certain kinds of queue table. 2. Should we ignore the LP_DEAD flag on btree rows when we are using the index during recovery? As Heikki points out, this hint bit is not WAL logged, but can appear in the standby as a result of full page writes. The LP_DEAD flags will have been set using a different xmin to the one on the standby and would cause index rows to be ignored that should have been included in a correct MVCC answer. So we need to either (a) always ignore LP_DEAD flags we see when reading index during recovery. (b) include an additional step to clean the full page writes to remove LP_DEAD hints from the incoming pages. (b) is feasible, but would need to be repeated each time a new full page arrived, so a page may need to be re-cleaned many times. Sounds like a bad plan, so we should choose (a). 3. Should we set LP_DELETE flag on btree rows when we are using the index during recovery? Not much point if we are ignoring them. There is no space for an additional flag, to distinguish between primary and standby hint bits. Issues (2) and (3) would go away entirely if both standby and primary always had the same xmin value as a system-wide setting. i.e. the standby and primary are locked together at their xmins. Perhaps that was Heikki's intention in recent suggestions? -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot standby and b-tree killed items
marcin mank wrote: Perhaps we should listen to the people that have said they don't want queries cancelled, even if the alternative is inconsistent answers. I don't like that much. PostgreSQL has traditionally avoided that very hard. It's hard to tell what kind of inconsistencies you'd get, as it'd depend on what plan is created, when a vacuum happens to run on master etc. I think an alternative to that would be "if the wal backlog is too big, let current queries finish and let incoming queries wait till the backlog gets smaller". Yeah, that makes sense too. Many approaches have been proposed, and they all have different tradeoffs and therefore fit different use cases. I'm not sure which ones are/will be included in the patch. We don't need all in 8.4, one or two simplest ones will do just fine, and we can extend later. Let me summarize. Whenever a WAL record conflicts with a query-in-progress, we can: 1. kill the query, or 2. wait for the query to finish 3. let the query proceed, producing invalid results. There's some combinations of those as well. You're proposal is a variation of 2, to avoid the problem of WAL application falling behind indefinitely. There's also the max_standby_delay option in the patch, to wait a while, and then kill the query. There's some additional optimizations that can be made to make those options less painful. Instead of killing all queries that might be affected by a vacuum record, only kill them when they actually hit a block that was vacuumed (Simon's idea of latestRemovedLSN field in page header). Another line of attack is to avoid getting into the situation in the first place, by affecting behavior on the master. If the standby has an online connection to the master (per the synch rep patch), it can tell master what the slave's OldestXmin is, and master can take that into account and not remove tuples still needed by the slave. That's not good from high availability point of view, you don't want a hung query in the slave to cause a long-running-transaction situation in the master, but for other use cases it would be fine. Or we can just add a constant # of transactions to OldestXmin in master, to get some breathing room in the server. The bottom line is that we have enough options to make everyone happy. Some understanding of the issue is required to tune it properly, however, so documentation is important. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot standby and b-tree killed items
> Perhaps we should listen to the people that have said they don't want > queries cancelled, even if the alternative is inconsistent answers. I think an alternative to that would be "if the wal backlog is too big, let current queries finish and let incoming queries wait till the backlog gets smaller". fell free to ignore me, as a non-hacker I`m not even supposed to be reading this list :-] Greetings Marcin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot standby and b-tree killed items
On Wed, Dec 24, 2008 at 7:18 PM, Simon Riggs wrote: > > > > With respect, I was hoping you might look in the patch and see if you > agree with the way it is handled. No need to remember. The whole > latestRemovedXid concept is designed to do help. > Well, that's common for all cleanup record including vacuum. But reading your comment, it seemed as there is something special to handle HOT prune case which I did not see. Anyways, the trouble with HOT prune is that uples may be cleaned up very frequently and that can lead to query cancellation at the standby. That's what I wanted to emphasize. > > Queries get cancelled if data they need to see if removed and the > max_standby_delay expires. So lag will be max_standby_delay, by > definition. That's per cleanup record, isn't it ? > We've also discussed storing lastCleanedLSN for each buffer, so queries > can cancel themselves if they need to read a buffer that has had data > removed from it that they would have needed to see. I'll write that up > also. > What if we do that at table level ? So if a query touches a table which had cleanup activity since the query was started, it cancels itself automatically, Happy X'mas to all of you! Thanks, Pavan -- Pavan Deolasee EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot standby and b-tree killed items
On Wed, 2008-12-24 at 09:59 -0500, Robert Treat wrote: > I think the uncertainty comes from peoples experience with typical > replication > use cases vs a lack of experience with this current implementation. Quite possibly. Publishing user feedback on this will be very important in making this a usable feature. I'd be very happy if you were to direct the search for optimal usability. > One such example is that it is pretty common to use read-only slaves to do > horizontal scaling of read queries across a bunch of slaves. This is not the > scenario of running reporting queries on a second machine to lower load; you > would be running a large number of read-only, relativly short, oltp-ish > queries (think pg_benchs select only test i suppose), but you also have a > fairly regular stream of inserts/updates going on with these same tables, its > just you have 95/5 split of read/write (or similar). One thing to consider also is latency of information. Sending queries to master or slave may return different answers if querying very recent data. > This is standard practice in things like mysql or using slony or what have > you. I suspect it's one of the first things people are going to want to do > with hot standby. But it's unclear how well this will work because we don't > have any experience with it yet, coupled with the two downsides being > mentioned as canceled queries and replay lag, which happen to be probably the > two worst downsides you would have in the above scenario. :-) > > Hmm I'm not sure why I didn't think of running this test before, but > read/write pg_bench on a master with pg_bench select test on slave isn't that > bad of a scenario to match the above; it might be a little too much activity > on the master, but has anyone else run such a test? > -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot standby and b-tree killed items
On Wednesday 24 December 2008 08:48:04 Simon Riggs wrote: > On Wed, 2008-12-24 at 17:56 +0530, Pavan Deolasee wrote: > > Again, I haven't seen how frequently queries may get canceled. Or if > > the delay is set to a large value, how far behind standby may get > > during replication, so I can't really comment. Have you done any tests > > on a reasonable hardware and checked if moderately long read queries > > can be run on standby without standby lagging behind a lot. > > Queries get cancelled if data they need to see if removed and the > max_standby_delay expires. So lag will be max_standby_delay, by > definition. > > Not sure what further tests would show. Queries that run for longer than > max_standby delay plus mean time between cleanup records will currently > end up being cancelled. > I think the uncertainty comes from peoples experience with typical replication use cases vs a lack of experience with this current implementation. One such example is that it is pretty common to use read-only slaves to do horizontal scaling of read queries across a bunch of slaves. This is not the scenario of running reporting queries on a second machine to lower load; you would be running a large number of read-only, relativly short, oltp-ish queries (think pg_benchs select only test i suppose), but you also have a fairly regular stream of inserts/updates going on with these same tables, its just you have 95/5 split of read/write (or similar). This is standard practice in things like mysql or using slony or what have you. I suspect it's one of the first things people are going to want to do with hot standby. But it's unclear how well this will work because we don't have any experience with it yet, coupled with the two downsides being mentioned as canceled queries and replay lag, which happen to be probably the two worst downsides you would have in the above scenario. :-) Hmm I'm not sure why I didn't think of running this test before, but read/write pg_bench on a master with pg_bench select test on slave isn't that bad of a scenario to match the above; it might be a little too much activity on the master, but has anyone else run such a test? -- Robert Treat Conjecture: http://www.xzilla.net Consulting: http://www.omniti.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot standby and b-tree killed items
On Wed, 2008-12-24 at 17:56 +0530, Pavan Deolasee wrote: > On Wed, Dec 24, 2008 at 5:26 PM, Simon Riggs wrote: > > > > > > > The patch does go to some trouble to handle that case, as I'm sure > > you've seen. Are you saying that part of the patch is ineffective and > > should be removed, or? > > > > Umm.. are you talking about the "wait" mechanism ? That's the only > thing I remember. Otherwise, prune record is pretty much same as any > vacuum cleanup record. With respect, I was hoping you might look in the patch and see if you agree with the way it is handled. No need to remember. The whole latestRemovedXid concept is designed to do help. > > Should/could there be a way to control frequency of prune operations? We > > could maintain cleanupxmin as a constant minimum distance from xmax, for > > example. > > > > Well, there can be. But tuning any such thing might be difficult and > would have implications on the primary. I am not saying we can do > that, but we will need additional tests to see its impact. > > > Are we saying we should take further measures, as I asked upthread? If > > it is a consensus that I take some action, then I will. > > > > Again, I haven't seen how frequently queries may get canceled. Or if > the delay is set to a large value, how far behind standby may get > during replication, so I can't really comment. Have you done any tests > on a reasonable hardware and checked if moderately long read queries > can be run on standby without standby lagging behind a lot. Queries get cancelled if data they need to see if removed and the max_standby_delay expires. So lag will be max_standby_delay, by definition. Not sure what further tests would show. Queries that run for longer than max_standby delay plus mean time between cleanup records will currently end up being cancelled. > I would prefer to have a solution which can be smarter than canceling > all queries as soon as a cleanup record comes and timeout occurs. Currently, it was the consensus view that queries should be cancelled, though there are other options still on the table. It's discussed in Design Notes on the Wiki. "Simply ignoring WAL removal has been discussed and rejected (so far). http://archives.postgresql.org/pgsql-hackers/2008-05/msg00753.php Explicitly defining the tables a transaction wishes to see has also been discussed and rejected (so far). http://archives.postgresql.org/pgsql-hackers/2008-08/msg00268.php"; > For > example, if the queries are being run on a completely different set of > tables where as the updates/deletes are happening on another set of > tables, there is no reason why those queries should be canceled. I > think it would be very common to have large history tables which may > receive long read-only queries, but no updates/deletes. Whereas other > frequently updated tables which receive very few queries on the > standby. There is currently no way to tell which tables a query will touch during the course of its execution. Nor is there likely to be because of user-defined volatile functions. I attempted to find ways to explicitly limit the set of tables over which a query might venture, but that cam to nothing also. We've also discussed storing lastCleanedLSN for each buffer, so queries can cancel themselves if they need to read a buffer that has had data removed from it that they would have needed to see. I'll write that up also. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot standby and b-tree killed items
On Wed, Dec 24, 2008 at 5:26 PM, Simon Riggs wrote: > > > The patch does go to some trouble to handle that case, as I'm sure > you've seen. Are you saying that part of the patch is ineffective and > should be removed, or? > Umm.. are you talking about the "wait" mechanism ? That's the only thing I remember. Otherwise, prune record is pretty much same as any vacuum cleanup record. > Should/could there be a way to control frequency of prune operations? We > could maintain cleanupxmin as a constant minimum distance from xmax, for > example. > Well, there can be. But tuning any such thing might be difficult and would have implications on the primary. I am not saying we can do that, but we will need additional tests to see its impact. > Are we saying we should take further measures, as I asked upthread? If > it is a consensus that I take some action, then I will. > Again, I haven't seen how frequently queries may get canceled. Or if the delay is set to a large value, how far behind standby may get during replication, so I can't really comment. Have you done any tests on a reasonable hardware and checked if moderately long read queries can be run on standby without standby lagging behind a lot. I would prefer to have a solution which can be smarter than canceling all queries as soon as a cleanup record comes and timeout occurs. For example, if the queries are being run on a completely different set of tables where as the updates/deletes are happening on another set of tables, there is no reason why those queries should be canceled. I think it would be very common to have large history tables which may receive long read-only queries, but no updates/deletes. Whereas other frequently updated tables which receive very few queries on the standby. Thanks, Pavan -- Pavan Deolasee EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot standby and b-tree killed items
On Wed, 2008-12-24 at 16:48 +0530, Pavan Deolasee wrote: > On Wed, Dec 24, 2008 at 4:41 PM, Simon Riggs wrote: > > > > > > Greg and Heikki have highlighted in this thread some aspects of btree > > garbage collection that will increase the chance of queries being > > cancelled in various circumstances > > Even HOT-prune may lead to frequent query cancellations and unlike > VACUUM there is no way user can control the frequency of prune > operations. The patch does go to some trouble to handle that case, as I'm sure you've seen. Are you saying that part of the patch is ineffective and should be removed, or? Should/could there be a way to control frequency of prune operations? We could maintain cleanupxmin as a constant minimum distance from xmax, for example. Are we saying we should take further measures, as I asked upthread? If it is a consensus that I take some action, then I will. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot standby and b-tree killed items
On Wed, Dec 24, 2008 at 4:41 PM, Simon Riggs wrote: > > > Greg and Heikki have highlighted in this thread some aspects of btree > garbage collection that will increase the chance of queries being > cancelled in various circumstances Even HOT-prune may lead to frequent query cancellations and unlike VACUUM there is no way user can control the frequency of prune operations. Thanks, Pavan -- Pavan Deolasee EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot standby and b-tree killed items
On Tue, 2008-12-23 at 23:59 -0500, Robert Treat wrote: > On Friday 19 December 2008 19:36:42 Simon Riggs wrote: > > Perhaps we should listen to the people that have said they don't want > > queries cancelled, even if the alternative is inconsistent answers. That > > is easily possible yet is not currently an option. Plus we have the > > option I referred to up thread, which is to defer query cancel until the > > query reads a modified data block. I'm OK with implementing either of > > those, as non-default options. Do we need those options or are we ok? > > > > Haven't seen any feed back on this, but I think the two options of cancel > query for replay, and pause replay for queries, are probably enough for a > first go around (especially if you can get the query canceling to work only > when changes are made to the specific database in question) Thanks for picking up on this. This question is the #1 usability issue for Hot Standby, since at least May 2008. There are many potential additions and we need to track this carefully over the next few months to see if we have it just right. I'll take viewpoints at any time on that; this door is never closed, though tempus fugit. Greg and Heikki have highlighted in this thread some aspects of btree garbage collection that will increase the chance of queries being cancelled in various circumstances. If this is important enough to trigger additional actions then we need to highlight that now so we have time to take those corrective actions. I've listened to many different viewpoints on and off list. Everybody takes a slightly different angle on it and I'm in favour of giving everybody what they want with the right set of options. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot standby and b-tree killed items
On Friday 19 December 2008 19:36:42 Simon Riggs wrote: > Perhaps we should listen to the people that have said they don't want > queries cancelled, even if the alternative is inconsistent answers. That > is easily possible yet is not currently an option. Plus we have the > option I referred to up thread, which is to defer query cancel until the > query reads a modified data block. I'm OK with implementing either of > those, as non-default options. Do we need those options or are we ok? > Haven't seen any feed back on this, but I think the two options of cancel query for replay, and pause replay for queries, are probably enough for a first go around (especially if you can get the query canceling to work only when changes are made to the specific database in question) -- Robert Treat Conjecture: http://www.xzilla.net Consulting: http://www.omniti.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot standby and b-tree killed items
On Saturday 20 December 2008 04:10:21 Heikki Linnakangas wrote: > Simon Riggs wrote: > > On Sat, 2008-12-20 at 09:21 +0200, Heikki Linnakangas wrote: > >> Gregory Stark wrote: > >>> Simon Riggs writes: > Increasing the waiting time increases the failover time and thus > decreases the value of the standby as an HA system. Others value high > availability higher than you and so we had agreed to provide an option > to allow the max waiting time to be set. > >>> > >>> Sure, it's a nice option to have. But I think the default should be to > >>> pause WAL replay. > >> > >> I think I agree that pausing should be the default. If for no other > >> reason, because I can't think of a good default for max_standby_delay. > > > > I would rather err on the side of caution. If we do as you suggest, > > somebody will lose their database and start shouting "stupid default". > > Even if we stop applying the WAL, it should still be archived safely, > right? So no data should be lost, although the standby can fall very > much behind, and it can take a while to catch up. > I was thinking the condition Simon was concerned about was that on a very busy slave with wal delay, you could theoretically fill up the disks and destroy the slave. With query cancel, you might be annoyed to see the queries canceled, but theres no way that you would destroy the slave. (That might not have been what he meant though) -- Robert Treat Conjecture: http://www.xzilla.net Consulting: http://www.omniti.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot standby and b-tree killed items
On Fri, 2008-12-19 at 14:23 -0600, Kevin Grittner wrote: > > I guess making it that SQLSTATE would make it simpler to understand > why > > the error occurs and also how to handle it (i.e. resubmit). > > Precisely. Just confirming I will implement the SQLSTATE as requested. I recognize my own and Greg's arguments that the match is not perfect, but the Standard isn't clear on this and Kevin's interpretation is the more useful behaviour for developers. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot standby and b-tree killed items
On Sat, 2008-12-20 at 20:09 -0300, Alvaro Herrera wrote: > Heikki Linnakangas wrote: > > Gregory Stark wrote: > >> A vacuum being replayed -- even in a different database -- could trigger > >> the > >> error. Or with the btree split issue, a data load -- again even in a > >> different > >> database -- would be quite likely cause your SELECT to be killed. > > > > Hmm, I wonder if we should/could track the "latestRemovedXid" separately > > for each database. There's no reason why we need to kill a read-only > > query in database X when a table in database Y is vacuumed. > > What about shared catalogs? Hot Standby already covers this special case. Patch uses GetCurrentVirtualXIDs(), which treats a dbOid of 0 to match against all databases. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot standby and b-tree killed items
On Sat, 2008-12-20 at 22:07 +0200, Heikki Linnakangas wrote: > Gregory Stark wrote: > > A vacuum being replayed -- even in a different database -- could trigger the > > error. Or with the btree split issue, a data load -- again even in a > > different > > database -- would be quite likely cause your SELECT to be killed. > > Hmm, I wonder if we should/could track the "latestRemovedXid" separately > for each database. There's no reason why we need to kill a read-only > query in database X when a table in database Y is vacuumed. Already implemented in code. see ResolveRedoVisibilityConflicts() -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot standby and b-tree killed items
Heikki Linnakangas wrote: > Alvaro Herrera wrote: >> Heikki Linnakangas wrote: >>> Gregory Stark wrote: A vacuum being replayed -- even in a different database -- could trigger the error. Or with the btree split issue, a data load -- again even in a different database -- would be quite likely cause your SELECT to be killed. >>> Hmm, I wonder if we should/could track the "latestRemovedXid" >>> separately for each database. There's no reason why we need to kill >>> a read-only query in database X when a table in database Y is >>> vacuumed. >> >> What about shared catalogs? > > True, vacuums on shared catalogs would affect read-only queries on all > databases. Maybe it's possible to track latestRemovedXid for each database, and additionally another counter that tracks vacuums on shared catalogs. -- Alvaro Herrerahttp://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot standby and b-tree killed items
Alvaro Herrera wrote: Heikki Linnakangas wrote: Gregory Stark wrote: A vacuum being replayed -- even in a different database -- could trigger the error. Or with the btree split issue, a data load -- again even in a different database -- would be quite likely cause your SELECT to be killed. Hmm, I wonder if we should/could track the "latestRemovedXid" separately for each database. There's no reason why we need to kill a read-only query in database X when a table in database Y is vacuumed. What about shared catalogs? True, vacuums on shared catalogs would affect read-only queries on all databases. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers