Re: [HACKERS] Hot Standy introduced problem with query cancel behavior
On Wednesday 13 January 2010 00:07:53 Simon Riggs wrote: On Tue, 2010-01-12 at 19:43 +0100, Andres Freund wrote: On Tuesday 12 January 2010 09:40:03 Simon Riggs wrote: On Tue, 2010-01-12 at 06:30 +0100, Andres Freund wrote: Currently the patch does not yet do anything to avoid letting the protocol out of sync. What do you think about adding a flag for error codes not to communicate with the client (Similarly to COMERROR)? So that one could do an elog(ERROR ERROR_NO_SEND_CLIENT, .. or such? Seems fairly important piece. Do you aggree on the approach then? Do you want to do it? If you would like to prototype something on this issue it would be gratefully received. I will review when submitted, though I may need other review also. Will do - likely not before Saturday though. I'm still reworking other code, so things might change under you, though not deliberately so. I will post as soon as I can, which isn't yet. No problem. Readapting a relatively minor amount of code isnt that hard. Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standy introduced problem with query cancel behavior
On Tue, 2010-01-12 at 06:30 +0100, Andres Freund wrote: On 01/07/10 22:37, Andres Freund wrote: On Thursday 07 January 2010 22:28:46 Tom Lane wrote: Andres Freundand...@anarazel.de writes: I did not want to suggest using Simons code there. Sorry for the brevity. should have read as revert to old code and add two step killing (thats likely around 10 lines or so). two step killing meaning that we signal ERROR for a few times and if nothing happens that we like, we signal FATAL. As the code already loops around signaling anyway that should be easy to implement. Ah. This loop happens in the process that's trying to send the cancel signal, correct, not the one that needs to respond to it? That sounds fairly sane to me. Yes. There are some things we could do to make it more likely that a cancel of this type is accepted --- for instance, give it a distinct SQLSTATE code that *can not* be trapped by plpgsql EXCEPTION blocks --- but there is no practical way to guarantee it except elog(FATAL). I'm not entirely convinced that an untrappable error would be a good thing anyway; it's hard to argue that that's much better than a FATAL. Well a session which is usable after a transaction abort is quite sensible - quite some software I know handles a failing transaction much more gracefully than a session abort (e.g. because it has to deal with serialization failures and such anyway). So making it cought by fewer places and degrading to FATAL sound sensible and relatively easy to me. Unless somebody disagrees I will give it a shot. Ok, here is a stab at that: 1. Allow the signal multiplexing facility to transfer one sig_atomic_t worth of data. This is usefull e.g. for making operations idempotent or more precise. In this the LocalBackendId is transported - that should make it impossible to cancel the wrong transaction This doesn't remove the possibility of cancelling the wrong transaction, it just reduces the chances. You can still cancel a session with LocalBackendId == 1 and then have it accidentally cancel the next session with the same backendid. That's why I didn't chase this idea myself earlier. Closing that loophole isn't a priority and is best left until we have the other things have been cleaned up. 2. AbortTransactionAndAnySubtransactions is only used in the mainloops error handler as it should be unproblematic there. In the current CVS code ConditionalVirtualXactLockTableWait() in ResolveRecoveryConflictWithVirtualXIDs does the wait for every try of cancelling the other transaction. I moved the retry logic into CancelVirtualTransaction(). If 50 times a ERROR does nothing it degrades to FATAL It's a good idea but not the right one, IMHO. I'd rather that the backend decided whether the instruction results in an ERROR or a FATAL based upon its own state, rather than have Startup process bang its head against the wall 50 times without knowing why. XXX: I temporarily do not use the skipExistingConflicts argument of GetConflictingVirtualXIDs - I dont understand its purpose and a bit of infrastructure is missing right now as the recoveryConflictMode is not stored anymore anywhere. Can easily be added back though. 3. Add a new error code ERRCODE_QUERY_CANCELED_HS for use with HS indicating a failure that is more than a plain ERRCODE_QUERY_CANCELED - namely it should not be caught from various places like savepoints and in PLs. Exemplarily I checked for that error code in plpgsql and make it uncatcheable. I am not sure how new errorcode codes get chosen though - and the name is not that nice. Opinions on that? I don't trust it yet, as a mechanism. Changing only PL/pgSQL seems fairly broken also. Again, this seems like something that be handled separately. I copied quite a bit from Simons earlier patch... It changes a few things around and adds the ideas you've mentioned, though seeing them as code doesn't in itself move the discussion forwards. There are far too many new ideas in this patch for it to be even close to acceptable, to me. Yes, lets discuss these additional ideas, but after a basic patch has been applied. I do value your interest and input, though racing me to rework my own patch, then asking me to review it seems like wasted time for both of us. I will post a new patch on ~Friday. (Also, please make your braces { } follow the normal coding conventions for Postgres.) Currently the patch does not yet do anything to avoid letting the protocol out of sync. What do you think about adding a flag for error codes not to communicate with the client (Similarly to COMERROR)? So that one could do an elog(ERROR ERROR_NO_SEND_CLIENT, .. or such? Seems fairly important piece. -- Simon Riggs www.2ndQuadrant.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription:
Re: [HACKERS] Hot Standy introduced problem with query cancel behavior
On 01/12/10 09:40, Simon Riggs wrote: On Tue, 2010-01-12 at 06:30 +0100, Andres Freund wrote: On 01/07/10 22:37, Andres Freund wrote: On Thursday 07 January 2010 22:28:46 Tom Lane wrote: Andres Freundand...@anarazel.de writes: I did not want to suggest using Simons code there. Sorry for the brevity. should have read as revert to old code and add two step killing (thats likely around 10 lines or so). two step killing meaning that we signal ERROR for a few times and if nothing happens that we like, we signal FATAL. As the code already loops around signaling anyway that should be easy to implement. Ah. This loop happens in the process that's trying to send the cancel signal, correct, not the one that needs to respond to it? That sounds fairly sane to me. Yes. There are some things we could do to make it more likely that a cancel of this type is accepted --- for instance, give it a distinct SQLSTATE code that *can not* be trapped by plpgsql EXCEPTION blocks --- but there is no practical way to guarantee it except elog(FATAL). I'm not entirely convinced that an untrappable error would be a good thing anyway; it's hard to argue that that's much better than a FATAL. Well a session which is usable after a transaction abort is quite sensible - quite some software I know handles a failing transaction much more gracefully than a session abort (e.g. because it has to deal with serialization failures and such anyway). So making it cought by fewer places and degrading to FATAL sound sensible and relatively easy to me. Unless somebody disagrees I will give it a shot. Ok, here is a stab at that: 1. Allow the signal multiplexing facility to transfer one sig_atomic_t worth of data. This is usefull e.g. for making operations idempotent or more precise. In this the LocalBackendId is transported - that should make it impossible to cancel the wrong transaction This doesn't remove the possibility of cancelling the wrong transaction, it just reduces the chances. You can still cancel a session with LocalBackendId == 1 and then have it accidentally cancel the next session with the same backendid. That's why I didn't chase this idea myself earlier. Hm. I don't think so. Backend Initialization clears those flags. And the signal is sent to a specific pid so you cant send it to a new backend when targeting the old one. So how should that occur? Closing that loophole isn't a priority and is best left until we have the other things have been cleaned up. Yes, maybe. It was just rather easy to fix and fixed the problem sufficiently so that I could not reproduce it in contrast to seeing the problem during a regular testrun. 2. AbortTransactionAndAnySubtransactions is only used in the mainloops error handler as it should be unproblematic there. In the current CVS code ConditionalVirtualXactLockTableWait() in ResolveRecoveryConflictWithVirtualXIDs does the wait for every try of cancelling the other transaction. I moved the retry logic into CancelVirtualTransaction(). If 50 times a ERROR does nothing it degrades to FATAL It's a good idea but not the right one, IMHO. I'd rather that the backend decided whether the instruction results in an ERROR or a FATAL based upon its own state, rather than have Startup process bang its head against the wall 50 times without knowing why. I don't think its that easy to keep enough state there in a safe manner. Also the concept of retrying is not new - I just made it degrade and not rewait for max_standby_delay (which imho is a bug). In many cases the retries and repeated ERRORs will be enough to free the backend out of all PG_TRY/CATCH blocks that the next ERROR reaches the mainloop. So the retrying is quite sensible - and you cant do that part in the signalled backend itself. XXX: I temporarily do not use the skipExistingConflicts argument of GetConflictingVirtualXIDs - I dont understand its purpose and a bit of infrastructure is missing right now as the recoveryConflictMode is not stored anymore anywhere. Can easily be added back though. Is that a leftover from additional capabilities (deferred conflict handling?)? Because currently there will be never a case with two different cancellations at the same time. Also the current logic throws away a FATAL error if a ERROR is already there. 3. Add a new error code ERRCODE_QUERY_CANCELED_HS for use with HS indicating a failure that is more than a plain ERRCODE_QUERY_CANCELED - namely it should not be caught from various places like savepoints and in PLs. Exemplarily I checked for that error code in plpgsql and make it uncatcheable. I am not sure how new errorcode codes get chosen though - and the name is not that nice. Opinions on that? I don't trust it yet, as a mechanism. Changing only PL/pgSQL seems fairly broken also. Again, this seems like something that be handled separately. Well, that was an exemplary change - its easy to fix that at other places as well. Without any identifier of such an
Re: [HACKERS] Hot Standy introduced problem with query cancel behavior
On Tuesday 12 January 2010 09:40:03 Simon Riggs wrote: On Tue, 2010-01-12 at 06:30 +0100, Andres Freund wrote: Currently the patch does not yet do anything to avoid letting the protocol out of sync. What do you think about adding a flag for error codes not to communicate with the client (Similarly to COMERROR)? So that one could do an elog(ERROR ERROR_NO_SEND_CLIENT, .. or such? Seems fairly important piece. Do you aggree on the approach then? Do you want to do it? Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standy introduced problem with query cancel behavior
Andres Freund and...@anarazel.de writes: On Tuesday 12 January 2010 09:40:03 Simon Riggs wrote: So that one could do an elog(ERROR ERROR_NO_SEND_CLIENT, .. or such? Do you aggree on the approach then? Do you want to do it? Why not use the already-existing COMMERROR thing? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standy introduced problem with query cancel behavior
Hi, The thought was that it might be helpful to be able to raise NOTICE or DEBUG* as well - but admitedly there is not a big need for it... Andres -- Sent from a mobile phone - please excuse brevity and formatting. - Ursprüngliche Mitteilung - Andres Freund and...@anarazel.de writes: On Tuesday 12 January 2010 09:40:03 Simon Riggs wrote: So that one could do an elog(ERROR ERROR_NO_SEND_CLIENT, .. or such? Do you aggree on the approach then? Do you want to do it? Why not use the already-existing COMMERROR thing? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standy introduced problem with query cancel behavior
On Tue, 2010-01-12 at 19:43 +0100, Andres Freund wrote: On Tuesday 12 January 2010 09:40:03 Simon Riggs wrote: On Tue, 2010-01-12 at 06:30 +0100, Andres Freund wrote: Currently the patch does not yet do anything to avoid letting the protocol out of sync. What do you think about adding a flag for error codes not to communicate with the client (Similarly to COMERROR)? So that one could do an elog(ERROR ERROR_NO_SEND_CLIENT, .. or such? Seems fairly important piece. Do you aggree on the approach then? Do you want to do it? If you would like to prototype something on this issue it would be gratefully received. I will review when submitted, though I may need other review also. I'm still reworking other code, so things might change under you, though not deliberately so. I will post as soon as I can, which isn't yet. -- Simon Riggs www.2ndQuadrant.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standy introduced problem with query cancel behavior
On 01/07/10 22:37, Andres Freund wrote: On Thursday 07 January 2010 22:28:46 Tom Lane wrote: Andres Freundand...@anarazel.de writes: I did not want to suggest using Simons code there. Sorry for the brevity. should have read as revert to old code and add two step killing (thats likely around 10 lines or so). two step killing meaning that we signal ERROR for a few times and if nothing happens that we like, we signal FATAL. As the code already loops around signaling anyway that should be easy to implement. Ah. This loop happens in the process that's trying to send the cancel signal, correct, not the one that needs to respond to it? That sounds fairly sane to me. Yes. There are some things we could do to make it more likely that a cancel of this type is accepted --- for instance, give it a distinct SQLSTATE code that *can not* be trapped by plpgsql EXCEPTION blocks --- but there is no practical way to guarantee it except elog(FATAL). I'm not entirely convinced that an untrappable error would be a good thing anyway; it's hard to argue that that's much better than a FATAL. Well a session which is usable after a transaction abort is quite sensible - quite some software I know handles a failing transaction much more gracefully than a session abort (e.g. because it has to deal with serialization failures and such anyway). So making it cought by fewer places and degrading to FATAL sound sensible and relatively easy to me. Unless somebody disagrees I will give it a shot. Ok, here is a stab at that: 1. Allow the signal multiplexing facility to transfer one sig_atomic_t worth of data. This is usefull e.g. for making operations idempotent or more precise. In this the LocalBackendId is transported - that should make it impossible to cancel the wrong transaction Use the signal multiplexing facility. 2. AbortTransactionAndAnySubtransactions is only used in the mainloops error handler as it should be unproblematic there. In the current CVS code ConditionalVirtualXactLockTableWait() in ResolveRecoveryConflictWithVirtualXIDs does the wait for every try of cancelling the other transaction. I moved the retry logic into CancelVirtualTransaction(). If 50 times a ERROR does nothing it degrades to FATAL XXX: I temporarily do not use the skipExistingConflicts argument of GetConflictingVirtualXIDs - I dont understand its purpose and a bit of infrastructure is missing right now as the recoveryConflictMode is not stored anymore anywhere. Can easily be added back though. 3. Add a new error code ERRCODE_QUERY_CANCELED_HS for use with HS indicating a failure that is more than a plain ERRCODE_QUERY_CANCELED - namely it should not be caught from various places like savepoints and in PLs. Exemplarily I checked for that error code in plpgsql and make it uncatcheable. I am not sure how new errorcode codes get chosen though - and the name is not that nice. Opinions on that? I copied quite a bit from Simons earlier patch... --- Currently the patch does not yet do anything to avoid letting the protocol out of sync. What do you think about adding a flag for error codes not to communicate with the client (Similarly to COMERROR)? So that one could do an elog(ERROR ERROR_NO_SEND_CLIENT, .. or such? Andres PS: My current MUA suffers from a wronggone upgrade currently, so no idea how that message will appear. diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c index 90e7b4a..a6e2405 100644 *** a/src/backend/access/transam/xact.c --- b/src/backend/access/transam/xact.c *** AbortOutOfAnyTransaction(void) *** 3712,3717 --- 3712,3761 } /* + * AbortTransactionAndAnySubtransactions + * + * Similar to AbortCurrentTransaction but if any subtransactions + * in progress we abort them *and* all of their parents. So this is + * used when the caller wishes to make the abort untrappable by the user. + * After this has run IsAbortedTransactionBlockState() will be true. + */ + void + AbortTransactionAndAnySubtransactions(void) + { + while(true){ + TransactionState s = CurrentTransactionState; + + switch (s-blockState) + { + case TBLOCK_DEFAULT: + case TBLOCK_STARTED: + case TBLOCK_BEGIN: + case TBLOCK_INPROGRESS: + case TBLOCK_END: + case TBLOCK_ABORT: + case TBLOCK_SUBABORT: + case TBLOCK_ABORT_END: + case TBLOCK_ABORT_PENDING: + case TBLOCK_PREPARE: + case TBLOCK_SUBABORT_END: + case TBLOCK_SUBABORT_RESTART: + AbortCurrentTransaction(); + return; + break; + + case TBLOCK_SUBINPROGRESS: + case TBLOCK_SUBBEGIN: + case TBLOCK_SUBEND: + case TBLOCK_SUBABORT_PENDING: + case TBLOCK_SUBRESTART: + AbortSubTransaction(); + CleanupSubTransaction(); + break; + } + } + } + + /* * IsTransactionBlock --- are we within a transaction block? */ bool diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c index 9cb28f7..d8ea470 100644
Re: [HACKERS] Hot Standy introduced problem with query cancel behavior
On 01/07/10 22:37, Andres Freund wrote: On Thursday 07 January 2010 22:28:46 Tom Lane wrote: Andres Freundand...@anarazel.de writes: I did not want to suggest using Simons code there. Sorry for the brevity. should have read as revert to old code and add two step killing (thats likely around 10 lines or so). two step killing meaning that we signal ERROR for a few times and if nothing happens that we like, we signal FATAL. As the code already loops around signaling anyway that should be easy to implement. Ah. This loop happens in the process that's trying to send the cancel signal, correct, not the one that needs to respond to it? That sounds fairly sane to me. Yes. There are some things we could do to make it more likely that a cancel of this type is accepted --- for instance, give it a distinct SQLSTATE code that *can not* be trapped by plpgsql EXCEPTION blocks --- but there is no practical way to guarantee it except elog(FATAL). I'm not entirely convinced that an untrappable error would be a good thing anyway; it's hard to argue that that's much better than a FATAL. Well a session which is usable after a transaction abort is quite sensible - quite some software I know handles a failing transaction much more gracefully than a session abort (e.g. because it has to deal with serialization failures and such anyway). So making it cought by fewer places and degrading to FATAL sound sensible and relatively easy to me. Unless somebody disagrees I will give it a shot. Alternatively the current state is available at: http://git.postgresql.org/gitweb?p=users/andresfreund/postgres.git;a=shortlog;h=refs/heads/hs-query-cancel Its a bit raw around the edges, but I wanted to get some feedback... Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standy introduced problem with query cancel behavior
On Thu, Dec 31, 2009 at 6:40 PM, Simon Riggs si...@2ndquadrant.com wrote: Building racy infrastructure when it can be avoided with a little care still seems not to be the best path to me. Doing that will add more complexity in an area that is hard to test effectively. I think the risk of introducing further bugs while trying to fix this rare condition is high. Right now the conflict processing needs more work and is often much less precise than this, so improving this aspect of it would not be a priority. I've added it to the TODO though. Thank you for your research. Patch implements recovery conflict signalling using SIGUSR1 multiplexing, then uses a SessionCancelPending mode similar to Joachim's TransactionCancelPending. I have reworked Simon's patch a bit and attach the result. Quick facts: - Hot Standby only uses SIGUSR1 - SIGINT behaves as it did before: it only cancels running statements - pg_cancel_backend() continues to use SIGINT - I added pg_cancel_idle_transaction() to cancel an idle transaction via SIGUSR1 - One central function HandleCancelAction() sets the flags before calling ProcessInterrupts(), it is called from the different signal handlers and receives parameters about what it should do - If a SIGUSR1 reason is used that will cancel something, ProcArrayLock is acquired until the signal has been sent to make sure that we won't signal the wrong backend. Does this sufficiently cover the concerns of Andres Freund discussed upthread? As there were so many boolean SomethingCancelPending variables I changed them to be bitmasks and merged all of them into a single variable. However I am not sure if we can change the name and type of especially InterruptPending that easily as it was marked with PGDLLIMPORT... @Simon: Is there a reason why you have not yet removed recoveryConflictMode from PGPROC? Joachim diff -cr cvs/src/backend/access/transam/xact.c cvs.build/src/backend/access/transam/xact.c *** cvs/src/backend/access/transam/xact.c 2009-12-24 13:55:12.0 +0100 --- cvs.build/src/backend/access/transam/xact.c 2010-01-05 11:25:17.0 +0100 *** *** 313,320 /* * IsAbortedTransactionBlockState * ! * This returns true if we are currently running a query ! * within an aborted transaction block. */ bool IsAbortedTransactionBlockState(void) --- 313,319 /* * IsAbortedTransactionBlockState * ! * This returns true if we are within an aborted transaction block. */ bool IsAbortedTransactionBlockState(void) *** *** 2692,2697 --- 2691,2738 } /* + * AbortTransactionAndAnySubtransactions + * + * Similar to AbortCurrentTransaction but if any subtransactions + * in progress we abort them and all of their parents. So this is + * used when the caller wishes to make the abort untrappable by the user. + */ + void + AbortTransactionAndAnySubtransactions(void) + { + TransactionState s = CurrentTransactionState; + + switch (s-blockState) + { + case TBLOCK_DEFAULT: + case TBLOCK_STARTED: + case TBLOCK_BEGIN: + case TBLOCK_INPROGRESS: + case TBLOCK_END: + case TBLOCK_ABORT: + case TBLOCK_SUBABORT: + case TBLOCK_ABORT_END: + case TBLOCK_ABORT_PENDING: + case TBLOCK_PREPARE: + case TBLOCK_SUBABORT_END: + case TBLOCK_SUBABORT_RESTART: + AbortCurrentTransaction(); + break; + + case TBLOCK_SUBINPROGRESS: + case TBLOCK_SUBBEGIN: + case TBLOCK_SUBEND: + case TBLOCK_SUBABORT_PENDING: + case TBLOCK_SUBRESTART: + AbortSubTransaction(); + CleanupSubTransaction(); + AbortTransactionAndAnySubtransactions(); + break; + } + } + + + /* * PreventTransactionChain * * This routine is to be called by statements that must not run inside diff -cr cvs/src/backend/commands/vacuum.c cvs.build/src/backend/commands/vacuum.c *** cvs/src/backend/commands/vacuum.c 2009-12-24 13:55:13.0 +0100 --- cvs.build/src/backend/commands/vacuum.c 2010-01-03 20:24:26.0 +0100 *** *** 3936,3942 CHECK_FOR_INTERRUPTS(); /* Nap if appropriate */ ! if (VacuumCostActive !InterruptPending VacuumCostBalance = VacuumCostLimit) { int msec; --- 3936,3942 CHECK_FOR_INTERRUPTS(); /* Nap if appropriate */ ! if (VacuumCostActive !IsInterruptEventPending VacuumCostBalance = VacuumCostLimit) { int msec; diff -cr cvs/src/backend/postmaster/autovacuum.c cvs.build/src/backend/postmaster/autovacuum.c *** cvs/src/backend/postmaster/autovacuum.c 2009-12-09 11:24:41.0 +0100 --- cvs.build/src/backend/postmaster/autovacuum.c 2010-01-07 00:22:21.0 +0100 *** *** 479,488 /* Prevents interrupts while cleaning up */ HOLD_INTERRUPTS(); ! /* Forget any pending QueryCancel request */ ! QueryCancelPending = false; disable_sig_alarm(true); ! QueryCancelPending = false; /* again in case timeout occurred */ /* Report the error to the server log */
Re: [HACKERS] Hot Standy introduced problem with query cancel behavior
On Thu, 2010-01-07 at 14:45 +0100, Joachim Wieland wrote: On Thu, Dec 31, 2009 at 6:40 PM, Simon Riggs si...@2ndquadrant.com wrote: Building racy infrastructure when it can be avoided with a little care still seems not to be the best path to me. Doing that will add more complexity in an area that is hard to test effectively. I think the risk of introducing further bugs while trying to fix this rare condition is high. Right now the conflict processing needs more work and is often much less precise than this, so improving this aspect of it would not be a priority. I've added it to the TODO though. Thank you for your research. Patch implements recovery conflict signalling using SIGUSR1 multiplexing, then uses a SessionCancelPending mode similar to Joachim's TransactionCancelPending. I have reworked Simon's patch a bit and attach the result. Oh dear, this is exactly what I've been working on... Quick facts: - Hot Standby only uses SIGUSR1 - SIGINT behaves as it did before: it only cancels running statements - pg_cancel_backend() continues to use SIGINT - I added pg_cancel_idle_transaction() to cancel an idle transaction via SIGUSR1 - One central function HandleCancelAction() sets the flags before calling ProcessInterrupts(), it is called from the different signal handlers and receives parameters about what it should do - If a SIGUSR1 reason is used that will cancel something, ProcArrayLock is acquired until the signal has been sent to make sure that we won't signal the wrong backend. Does this sufficiently cover the concerns of Andres Freund discussed upthread? As there were so many boolean SomethingCancelPending variables I changed them to be bitmasks and merged all of them into a single variable. However I am not sure if we can change the name and type of especially InterruptPending that easily as it was marked with PGDLLIMPORT... @Simon: Is there a reason why you have not yet removed recoveryConflictMode from PGPROC? -- Simon Riggs www.2ndQuadrant.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standy introduced problem with query cancel behavior
On Thursday 07 January 2010 14:45:55 Joachim Wieland wrote: On Thu, Dec 31, 2009 at 6:40 PM, Simon Riggs si...@2ndquadrant.com wrote: Building racy infrastructure when it can be avoided with a little care still seems not to be the best path to me. Doing that will add more complexity in an area that is hard to test effectively. I think the risk of introducing further bugs while trying to fix this rare condition is high. Right now the conflict processing needs more work and is often much less precise than this, so improving this aspect of it would not be a priority. I've added it to the TODO though. Thank you for your research. Patch implements recovery conflict signalling using SIGUSR1 multiplexing, then uses a SessionCancelPending mode similar to Joachim's TransactionCancelPending. I have reworked Simon's patch a bit and attach the result. Quick facts: - Hot Standby only uses SIGUSR1 - SIGINT behaves as it did before: it only cancels running statements - pg_cancel_backend() continues to use SIGINT - I added pg_cancel_idle_transaction() to cancel an idle transaction via SIGUSR1 - One central function HandleCancelAction() sets the flags before calling ProcessInterrupts(), it is called from the different signal handlers and receives parameters about what it should do - If a SIGUSR1 reason is used that will cancel something, ProcArrayLock is acquired until the signal has been sent to make sure that we won't signal the wrong backend. Does this sufficiently cover the concerns of Andres Freund discussed upthread? I think it solves the major concern (which I btw could easily reproduce using software that is in production) but unfortunately not completely. The avoided situation is: C(Client): BEGIN; SELECT WHATEVER; S(Standby): conflict with C S: Starting to cancel C C: COMMIT S: Sending Signal to C C: Wrong transaction is aborted The situation not avoided is: C: BEGIN; SELECT ... S: conflict with C, lock procarray, sending signal(thats asynchronous), unlock procarray C: COMMIT; BEGIN C: Signal arrives C: Wrong txn is killled It should be easy to fix this by having a cancel_localTransactionId field in the procarray which gets cleaned uppon transaction/backend start and gets checked in the signal handler (should be casted to sig_atomic_t) Will cookup a patch if nobody speaks against something like that. Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standy introduced problem with query cancel behavior
Joachim Wieland j...@mcknight.de writes: As there were so many boolean SomethingCancelPending variables I changed them to be bitmasks and merged all of them into a single variable. This seems like a truly horrid idea, because those variables are set by signal handlers. A bitmask cannot be manipulated atomically, so you have almost certainly introduced race-condition bugs. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standy introduced problem with query cancel behavior
On Thu, Jan 7, 2010 at 4:23 PM, Tom Lane t...@sss.pgh.pa.us wrote: Joachim Wieland j...@mcknight.de writes: As there were so many boolean SomethingCancelPending variables I changed them to be bitmasks and merged all of them into a single variable. This seems like a truly horrid idea, because those variables are set by signal handlers. A bitmask cannot be manipulated atomically, so you have almost certainly introduced race-condition bugs. True... However there are just a few places where the patch uses bitmasks for modification of the variable. As Simon seems to be working on this currently anyway, I'll leave it to him to either keep the 5 boolean variables or do some mutual exclusion as in HandleNotifyInterrupt() (or do something completely different). Joachim -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standy introduced problem with query cancel behavior
On Thu, Jan 7, 2010 at 3:03 PM, Simon Riggs si...@2ndquadrant.com wrote: On Thu, 2010-01-07 at 14:45 +0100, Joachim Wieland wrote: I have reworked Simon's patch a bit and attach the result. Oh dear, this is exactly what I've been working on... Sorry, as you have posted a first patch some days ago I thought you were waiting for feedback... Just tried to help ;-) Joachim -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standy introduced problem with query cancel behavior
On Thu, 2010-01-07 at 17:50 +0100, Joachim Wieland wrote: On Thu, Jan 7, 2010 at 3:03 PM, Simon Riggs si...@2ndquadrant.com wrote: On Thu, 2010-01-07 at 14:45 +0100, Joachim Wieland wrote: I have reworked Simon's patch a bit and attach the result. Oh dear, this is exactly what I've been working on... Sorry, as you have posted a first patch some days ago I thought you were waiting for feedback... Just tried to help ;-) Well, you have been very helpful, its just this last little part that is duplicated. -- Simon Riggs www.2ndQuadrant.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standy introduced problem with query cancel behavior
On Thu, 2010-01-07 at 14:45 +0100, Joachim Wieland wrote: @Simon: Is there a reason why you have not yet removed recoveryConflictMode from PGPROC? Unfortunately we still need a mechanism to mark which backends have been cancelled already. Transaction state for virtual transactions isn't visible on the procarray, so we need something there to indicate that a backend has been sent a conflict. Otherwise we'd end up waiting for it endlessly. The name will be changing though. -- Simon Riggs www.2ndQuadrant.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standy introduced problem with query cancel behavior
Simon Riggs si...@2ndquadrant.com writes: On Thu, 2010-01-07 at 14:45 +0100, Joachim Wieland wrote: @Simon: Is there a reason why you have not yet removed recoveryConflictMode from PGPROC? Unfortunately we still need a mechanism to mark which backends have been cancelled already. Transaction state for virtual transactions isn't visible on the procarray, so we need something there to indicate that a backend has been sent a conflict. Otherwise we'd end up waiting for it endlessly. The name will be changing though. While we're discussing this: the current coding with AbortOutOfAnyTransaction within ProcessInterrupts is *utterly* unsafe. I realize that's just a toy placeholder, but getting rid of it has to be on the list of stop-ship items. Right at the moment I'd prefer to see CONFLICT_MODE_ERROR always turned into CONFLICT_MODE_FATAL than to imagine this is going to work. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standy introduced problem with query cancel behavior
On Thu, 2010-01-07 at 12:14 -0500, Tom Lane wrote: Simon Riggs si...@2ndquadrant.com writes: On Thu, 2010-01-07 at 14:45 +0100, Joachim Wieland wrote: @Simon: Is there a reason why you have not yet removed recoveryConflictMode from PGPROC? Unfortunately we still need a mechanism to mark which backends have been cancelled already. Transaction state for virtual transactions isn't visible on the procarray, so we need something there to indicate that a backend has been sent a conflict. Otherwise we'd end up waiting for it endlessly. The name will be changing though. While we're discussing this: the current coding with AbortOutOfAnyTransaction within ProcessInterrupts is *utterly* unsafe. I realize that's just a toy placeholder, but getting rid of it has to be on the list of stop-ship items. Right at the moment I'd prefer to see CONFLICT_MODE_ERROR always turned into CONFLICT_MODE_FATAL than to imagine this is going to work. Hmmm. Can you check my current attempt? This may suffer this problem. If, so can you explain a little more for me? Thanks. -- Simon Riggs www.2ndQuadrant.com diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c index ea40420..e05792e 100644 --- a/src/backend/access/transam/xact.c +++ b/src/backend/access/transam/xact.c @@ -313,8 +313,7 @@ IsTransactionState(void) /* * IsAbortedTransactionBlockState * - * This returns true if we are currently running a query - * within an aborted transaction block. + * This returns true if we are within an aborted transaction block. */ bool IsAbortedTransactionBlockState(void) @@ -2692,6 +2691,48 @@ AbortCurrentTransaction(void) } /* + * AbortTransactionAndAnySubtransactions + * + * Similar to AbortCurrentTransaction but if any subtransactions + * in progress we abort them *and* all of their parents. So this is + * used when the caller wishes to make the abort untrappable by the user. + * After this has run IsAbortedTransactionBlockState() will be true. + */ +void +AbortTransactionAndAnySubtransactions(void) +{ + TransactionState s = CurrentTransactionState; + + switch (s-blockState) + { + case TBLOCK_DEFAULT: + case TBLOCK_STARTED: + case TBLOCK_BEGIN: + case TBLOCK_INPROGRESS: + case TBLOCK_END: + case TBLOCK_ABORT: + case TBLOCK_SUBABORT: + case TBLOCK_ABORT_END: + case TBLOCK_ABORT_PENDING: + case TBLOCK_PREPARE: + case TBLOCK_SUBABORT_END: + case TBLOCK_SUBABORT_RESTART: + AbortCurrentTransaction(); + break; + + case TBLOCK_SUBINPROGRESS: + case TBLOCK_SUBBEGIN: + case TBLOCK_SUBEND: + case TBLOCK_SUBABORT_PENDING: + case TBLOCK_SUBRESTART: + AbortSubTransaction(); + CleanupSubTransaction(); + AbortTransactionAndAnySubtransactions(); + break; + } +} + +/* * PreventTransactionChain * * This routine is to be called by statements that must not run inside diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c index 85f14f6..e2e64dd 100644 --- a/src/backend/storage/ipc/procarray.c +++ b/src/backend/storage/ipc/procarray.c @@ -324,6 +324,7 @@ ProcArrayEndTransaction(PGPROC *proc, TransactionId latestXid) /* must be cleared with xid/xmin: */ proc-vacuumFlags = ~PROC_VACUUM_STATE_MASK; proc-inCommit = false; /* be sure this is cleared in abort */ + proc-recoveryConflictPending = false; /* Clear the subtransaction-XID cache too while holding the lock */ proc-subxids.nxids = 0; @@ -350,6 +351,7 @@ ProcArrayEndTransaction(PGPROC *proc, TransactionId latestXid) /* must be cleared with xid/xmin: */ proc-vacuumFlags = ~PROC_VACUUM_STATE_MASK; proc-inCommit = false; /* be sure this is cleared in abort */ + proc-recoveryConflictPending = false; Assert(proc-subxids.nxids == 0); Assert(proc-subxids.overflowed == false); @@ -377,7 +379,7 @@ ProcArrayClearTransaction(PGPROC *proc) proc-xid = InvalidTransactionId; proc-lxid = InvalidLocalTransactionId; proc-xmin = InvalidTransactionId; - proc-recoveryConflictMode = 0; + proc-recoveryConflictPending = false; /* redundant, but just in case */ proc-vacuumFlags = ~PROC_VACUUM_STATE_MASK; @@ -1665,7 +1667,7 @@ GetConflictingVirtualXIDs(TransactionId limitXmin, Oid dbOid, if (proc-pid == 0) continue; - if (skipExistingConflicts proc-recoveryConflictMode 0) + if (skipExistingConflicts proc-recoveryConflictPending) continue; if (!OidIsValid(dbOid) || @@ -1704,7 +1706,7 @@ GetConflictingVirtualXIDs(TransactionId limitXmin, Oid dbOid, * Returns pid of the process signaled, or 0 if not found. */ pid_t -CancelVirtualTransaction(VirtualTransactionId vxid, int cancel_mode) +CancelVirtualTransaction(VirtualTransactionId vxid, ProcSignalReason sigmode) { ProcArrayStruct *arrayP = procArray; int index; @@ -1722,28 +1724,22 @@ CancelVirtualTransaction(VirtualTransactionId vxid, int cancel_mode) if (procvxid.backendId == vxid.backendId procvxid.localTransactionId == vxid.localTransactionId) { - /*
Re: [HACKERS] Hot Standy introduced problem with query cancel behavior
Simon Riggs si...@2ndquadrant.com writes: On Thu, 2010-01-07 at 12:14 -0500, Tom Lane wrote: While we're discussing this: the current coding with AbortOutOfAnyTransaction within ProcessInterrupts is *utterly* unsafe. I realize that's just a toy placeholder, but getting rid of it has to be on the list of stop-ship items. Right at the moment I'd prefer to see CONFLICT_MODE_ERROR always turned into CONFLICT_MODE_FATAL than to imagine this is going to work. Hmmm. Can you check my current attempt? This may suffer this problem. This looks like a mess. You've duplicated a whole lot of code and not fixed the fundamental problem. If, so can you explain a little more for me? Thanks. You can not do this from inside an interrupt service routine. Period. No amount of deck-chair-rearrangement will fix that. As far as I can think at the moment, the best you can do is throw the elog(ERROR), and if control gets out to the error recovery block in PostgresMain, you can force a transaction abort there. In other words, pretty much the same logic that was there before; the only addition that I think is safe is to allow this to happen while DoingCommandRead, so that you can cancel an idle transaction. Now of course the problem with this approach, if you choose to see it as a problem, is that somebody could trap the error and try to continue processing. The only way you can positively guarantee that the backend will give up whatever locks it's holding is if you elog(FATAL) instead of trying to do normal error processing. So maybe the right thing is to forget about CONFLICT_MODE_ERROR altogether. How critical is it that an HS-requested query cancel be any more likely to do anything than a regular query cancel is? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standy introduced problem with query cancel behavior
On Thursday 07 January 2010 19:12:31 Tom Lane wrote: Simon Riggs si...@2ndquadrant.com writes: On Thu, 2010-01-07 at 12:14 -0500, Tom Lane wrote: While we're discussing this: the current coding with AbortOutOfAnyTransaction within ProcessInterrupts is *utterly* unsafe. I realize that's just a toy placeholder, but getting rid of it has to be on the list of stop-ship items. Right at the moment I'd prefer to see CONFLICT_MODE_ERROR always turned into CONFLICT_MODE_FATAL than to imagine this is going to work. Hmmm. Can you check my current attempt? This may suffer this problem. This looks like a mess. You've duplicated a whole lot of code and not fixed the fundamental problem. If, so can you explain a little more for me? Thanks. You can not do this from inside an interrupt service routine. Period. No amount of deck-chair-rearrangement will fix that. As far as I can think at the moment, the best you can do is throw the elog(ERROR), and if control gets out to the error recovery block in PostgresMain, you can force a transaction abort there. In other words, pretty much the same logic that was there before; the only addition that I think is safe is to allow this to happen while DoingCommandRead, so that you can cancel an idle transaction. Stupid question: Why dont we add CHECK_FOR_INTERRUPTS (or something similar) to everything calling recv (especially in the EINTR) case? That way we can simply abort in the normal context without the constraint of an interrupt handler because recv() will finish after having serviced a signal handler. Sure there will be some parts calling CHECK_FOR_INTERRUPTS not often enough but thats surely not that critical - and after some time using a bit more force is ok I guess. Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standy introduced problem with query cancel behavior
On Thu, 2010-01-07 at 13:12 -0500, Tom Lane wrote: not fixed the fundamental problem. (I wasn't saying that I had, only wanted to confirm the problem still existed in the version I was currently working on.) How critical is it that an HS-requested query cancel be any more likely to do anything than a regular query cancel is? Recovery needs to cancel backends in two main cases, which currently throw CONFLICT_MODE_ERROR, though could be separated: (1) Cancel because a snapshot might fail to see data that has now been removed and so get an inconsistent result. We need to abort any subtransactions that have open snapshots, which is really the whole top level xact, unless anyone has a good plan of how to identify (2) Cancel because a backend holds a lock on a table that has been AccessExclusiveLocked on primary. We need to abort as far up the transaction stack as it takes to remove the conflicting lock. We currently abort the whole transaction. Both of those have cases where we might need to abort further than just the top-level subtransaction. We might be able to identify the cases where aborting only the current subtrans is OK, and then just throw normal ERROR for those cases. I think the cases are * when no subxacts exist or * for (1): if no open portals in still active parent sub-transactions * for (2): if any conflicting locks are held by current subxact only -- Simon Riggs www.2ndQuadrant.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standy introduced problem with query cancel behavior
Andres Freund and...@anarazel.de writes: Stupid question: Why dont we add CHECK_FOR_INTERRUPTS (or something similar) to everything calling recv (especially in the EINTR) case? We pretty much have CHECK_FOR_INTERRUPTS everywhere that it's safe already. The problem here is not a lack of execution of CHECK_FOR_INTERRUPTS, but what to do inside it. Although I pointed to an interrupt service routine as being the worst case, the fact is that Simon's code will crash the system in a large number of scenarios even when ProcessInterrupts is called from CHECK_FOR_INTERRUPTS rather than directly from the signal handler. You can't just abort transactions and then return to a nest of functions that think they still have a live transaction they can resume. This is why the standard error code path has the AbortTransaction call out in the sigjmp catcher, not inside elog() itself. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standy introduced problem with query cancel behavior
On Thu, 2010-01-07 at 19:23 +0100, Andres Freund wrote: On Thursday 07 January 2010 19:12:31 Tom Lane wrote: As far as I can think at the moment, the best you can do is throw the elog(ERROR), and if control gets out to the error recovery block in PostgresMain, you can force a transaction abort there. In other words, pretty much the same logic that was there before; the only addition that I think is safe is to allow this to happen while DoingCommandRead, so that you can cancel an idle transaction. Stupid question: Why dont we add CHECK_FOR_INTERRUPTS (or something similar) to everything calling recv (especially in the EINTR) case? That way we can simply abort in the normal context without the constraint of an interrupt handler because recv() will finish after having serviced a signal handler. Sure there will be some parts calling CHECK_FOR_INTERRUPTS not often enough but thats surely not that critical - and after some time using a bit more force is ok I guess. There's only a need for an immediate death when we were waiting for a lock. In other cases a deferred death is possible. We could do that in various places, such as each time we access a new data block. -- Simon Riggs www.2ndQuadrant.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standy introduced problem with query cancel behavior
On Thu, Jan 7, 2010 at 7:12 PM, Tom Lane t...@sss.pgh.pa.us wrote: As far as I can think at the moment, the best you can do is throw the elog(ERROR), and if control gets out to the error recovery block in PostgresMain, you can force a transaction abort there. In other words, pretty much the same logic that was there before; the only addition that I think is safe is to allow this to happen while DoingCommandRead, so that you can cancel an idle transaction. Sorry, but to be clear about this, what do you mean with allow this to happen? You mean that while DoingCommandRead it should be safe to abort the transaction even from the signal handler or only from the sigjmp catcher? Now of course the problem with this approach, if you choose to see it as a problem, is that somebody could trap the error and try to continue processing. The only way you can positively guarantee that the backend will give up whatever locks it's holding is if you elog(FATAL) instead of trying to do normal error processing. So maybe the right thing is to forget about CONFLICT_MODE_ERROR altogether. How critical is it that an HS-requested query cancel be any more likely to do anything than a regular query cancel is? Simon, couldn't you just translate the conflict modes to the other cancel modes depending on DoingCommandRead (which is to be determined from within ProcessInterrupts(), not before). CONFLICT_MODE_ERROR !DoingCommandRead = cancel running query (QueryCancelPending) CONFLICT_MODE_ERROR DoingCommandRead = cancel idle transaction CONFLICT_MODE_FATAL = cancel session (ProcDiePending) Joachim -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standy introduced problem with query cancel behavior
Joachim Wieland j...@mcknight.de writes: On Thu, Jan 7, 2010 at 7:12 PM, Tom Lane t...@sss.pgh.pa.us wrote: As far as I can think at the moment, the best you can do is throw the elog(ERROR), and if control gets out to the error recovery block in PostgresMain, you can force a transaction abort there. In other words, pretty much the same logic that was there before; the only addition that I think is safe is to allow this to happen while DoingCommandRead, so that you can cancel an idle transaction. Sorry, but to be clear about this, what do you mean with allow this to happen? You mean that while DoingCommandRead it should be safe to abort the transaction even from the signal handler or only from the sigjmp catcher? In the previous coding, nothing at all happened if DoingCommandRead. What I am suggesting (and what should be actually safe after my fixes a couple hours ago) is that it is okay to allow a query-cancel error to be thrown while in DoingCommandRead state. (Of course there are still nasty implications for the FE/BE protocol, but at least you won't crash the backend by doing so.) What is not, and never will be, safe is to do any sort of transaction-aborting work inside ProcessInterrupts. You can either throw a regular elog(ERROR) and hope that subsequent transaction abort will do what you want, or throw elog(FATAL) and let the cleanup happen during proc_exit. The reason the second form is safe is that control won't ever return to whatever it was you interrupted, and so any expectations that such code might have about being able to recover from the error and continue its processing won't matter. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standy introduced problem with query cancel behavior
On Thursday 07 January 2010 19:49:59 Tom Lane wrote: Andres Freund and...@anarazel.de writes: Stupid question: Why dont we add CHECK_FOR_INTERRUPTS (or something similar) to everything calling recv (especially in the EINTR) case? We pretty much have CHECK_FOR_INTERRUPTS everywhere that it's safe already. The problem here is not a lack of execution of CHECK_FOR_INTERRUPTS, but what to do inside it. Although I pointed to an interrupt service routine as being the worst case, the fact is that Simon's code will crash the system in a large number of scenarios even when ProcessInterrupts is called from CHECK_FOR_INTERRUPTS rather than directly from the signal handler. I did not want to suggest using Simons code there. Sorry for the brevity. The reason I suggested adding CHECK_FOR_INTERRUPTS into the recv code path was that it should allow a relatively natural handling of canceling IDLE IN TRANSACTION queries without doing anything in the interrupt handler. I think it shouldn't be to hard to make that code path safe for CHECK_FOR_INTERRUPTS(). I personally don't think its important to be that nice to a non-cooperating backend (i.e. one catching our ERRORs) so I dont see any problems in just FATAL'ing it after a couple of tries (the code retries already so...). Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standy introduced problem with query cancel behavior
Andres Freund and...@anarazel.de writes: The reason I suggested adding CHECK_FOR_INTERRUPTS into the recv code path was that it should allow a relatively natural handling of canceling IDLE IN TRANSACTION queries without doing anything in the interrupt handler. I think it shouldn't be to hard to make that code path safe for CHECK_FOR_INTERRUPTS(). Idle in transaction isn't the problem (except for what it does to the FE/BE protocol state). The problem is what happens inside a non-idle transaction. Since apparently I'm still not being clear enough about this, let me spell it out: 1. Outer transaction calls, say, a plperl function. 2. plperl function executes some query via SPI, thereby starting a subtransaction. 3. We receive an HS query-cancel interrupt. Since !ImmediateInterruptOK, this just sets QueryCancelPending. 4. At the next occurrence of CHECK_FOR_INTERRUPTS, ProcessInterrupts is entered. 5. According to both Simon's committed patch and his recent variant, ProcessInterrupts executes AbortOutOfAnyTransaction and then throws elog(ERROR). 6. plperl.c catches the elog longjmp and tries to abort its subtransaction (loss #1), then return to the Perl interpreter which is under no obligation to abort processing its perl script (loss #2), and whenever it does exit, or else call SPI to try to process another query, we're screwed because the outer transaction is already dead (loss #3). The situation with Perl or Python or some other PL is pretty much the worst case, since we have no control whatever over that code layer --- but in reality this type of scenario can play out even without any third-party code involved. Anyplace that catches an elog longjmp will be broken by AbortOutOfAnyTransaction inside ProcessInterrupts, because things aren't supposed to happen in that order. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standy introduced problem with query cancel behavior
On Thursday 07 January 2010 21:47:47 Tom Lane wrote: Andres Freund and...@anarazel.de writes: The reason I suggested adding CHECK_FOR_INTERRUPTS into the recv code path was that it should allow a relatively natural handling of canceling IDLE IN TRANSACTION queries without doing anything in the interrupt handler. I think it shouldn't be to hard to make that code path safe for CHECK_FOR_INTERRUPTS(). Idle in transaction isn't the problem (except for what it does to the FE/BE protocol state). The problem is what happens inside a non-idle transaction. Since apparently I'm still not being clear enough about this, let me spell it out: 5. According to both Simon's committed patch and his recent variant, ProcessInterrupts executes AbortOutOfAnyTransaction and then throws elog(ERROR). Andres Freund and...@anarazel.de writes: I did not want to suggest using Simons code there. Sorry for the brevity. should have read as revert to old code and add two step killing (thats likely around 10 lines or so). two step killing meaning that we signal ERROR for a few times and if nothing happens that we like, we signal FATAL. As the code already loops around signaling anyway that should be easy to implement. I have no doubt about you being right that its not practical (even more so at this point in the release cycle) to make that AbortOutOfAnyTransaction call. Andres PS: Thats procarray.c: ResolveRecoveryConflictWithVirtualXIDs -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standy introduced problem with query cancel behavior
Andres Freund and...@anarazel.de writes: I did not want to suggest using Simons code there. Sorry for the brevity. should have read as revert to old code and add two step killing (thats likely around 10 lines or so). two step killing meaning that we signal ERROR for a few times and if nothing happens that we like, we signal FATAL. As the code already loops around signaling anyway that should be easy to implement. Ah. This loop happens in the process that's trying to send the cancel signal, correct, not the one that needs to respond to it? That sounds fairly sane to me. There are some things we could do to make it more likely that a cancel of this type is accepted --- for instance, give it a distinct SQLSTATE code that *can not* be trapped by plpgsql EXCEPTION blocks --- but there is no practical way to guarantee it except elog(FATAL). I'm not entirely convinced that an untrappable error would be a good thing anyway; it's hard to argue that that's much better than a FATAL. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standy introduced problem with query cancel behavior
On Thursday 07 January 2010 22:28:46 Tom Lane wrote: Andres Freund and...@anarazel.de writes: I did not want to suggest using Simons code there. Sorry for the brevity. should have read as revert to old code and add two step killing (thats likely around 10 lines or so). two step killing meaning that we signal ERROR for a few times and if nothing happens that we like, we signal FATAL. As the code already loops around signaling anyway that should be easy to implement. Ah. This loop happens in the process that's trying to send the cancel signal, correct, not the one that needs to respond to it? That sounds fairly sane to me. Yes. There are some things we could do to make it more likely that a cancel of this type is accepted --- for instance, give it a distinct SQLSTATE code that *can not* be trapped by plpgsql EXCEPTION blocks --- but there is no practical way to guarantee it except elog(FATAL). I'm not entirely convinced that an untrappable error would be a good thing anyway; it's hard to argue that that's much better than a FATAL. Well a session which is usable after a transaction abort is quite sensible - quite some software I know handles a failing transaction much more gracefully than a session abort (e.g. because it has to deal with serialization failures and such anyway). So making it cought by fewer places and degrading to FATAL sound sensible and relatively easy to me. Unless somebody disagrees I will give it a shot. @Simon: Could you possibly push your current HS state to the git repo? That would make it easier to see whats already applied and such. That would be really nice. Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standy introduced problem with query cancel behavior
On Wed, 2009-12-30 at 20:06 +0100, Andres Freund wrote: Hm. I just read a bit of that multiplexing facility (out of a different reason) and I have some doubt about it being used unmodified for canceling backends: procsignal.c: /* * Note: Since there's no locking, it's possible that the target * process detaches from shared memory and exits right after this * test, before we set the flag and send signal. And the signal slot * might even be recycled by a new process, so it's remotely possible * that we set a flag for a wrong process. That's OK, all the signals * are such that no harm is done if they're mistakenly fired. */ procsignal.h: ... * Also, because of race conditions, it's important that all the signals be * defined so that no harm is done if a process mistakenly receives one. */ When cancelling a backend that behaviour could be a bit annoying ;-) Reading comments alone doesn't show the full situation here. Before we send signal we check pid also, so the chances of this happening are fairly small. i.e. we would need to have a backend slot reused by a new backend with exactly same pid as the last slot holder. I'm proposing to use this for killing transactions and connections, so I don't think there's too much harm done there. If the backend is leaving anyway, that's what we wanted. If its a new guy that is wearing the same boots then a little collateral damage doesn't leave the server in a bad place. HS cancellations aren't yet so precise that this matters. -- Simon Riggs www.2ndQuadrant.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standy introduced problem with query cancel behavior
On Thursday 31 December 2009 12:25:19 Simon Riggs wrote: On Wed, 2009-12-30 at 20:06 +0100, Andres Freund wrote: Hm. I just read a bit of that multiplexing facility (out of a different reason) and I have some doubt about it being used unmodified for canceling backends: procsignal.c: /* * Note: Since there's no locking, it's possible that the target * process detaches from shared memory and exits right after this * test, before we set the flag and send signal. And the signal slot * might even be recycled by a new process, so it's remotely possible * that we set a flag for a wrong process. That's OK, all the signals * are such that no harm is done if they're mistakenly fired. */ procsignal.h: ... * Also, because of race conditions, it's important that all the signals be * defined so that no harm is done if a process mistakenly receives one. */ When cancelling a backend that behaviour could be a bit annoying ;-) Reading comments alone doesn't show the full situation here. Before we send signal we check pid also, so the chances of this happening are fairly small. i.e. we would need to have a backend slot reused by a new backend with exactly same pid as the last slot holder. Well. The problem does not occur for critical errors (i.e. session death) only. As signal delivery is asynchronous it can very well happen for transaction cancellation as well. I'm proposing to use this for killing transactions and connections, so I don't think there's too much harm done there. If the backend is leaving anyway, that's what we wanted. If its a new guy that is wearing the same boots then a little collateral damage doesn't leave the server in a bad place. HS cancellations aren't yet so precise that this matters. Building racy infrastructure when it can be avoided with a little care still seems not to be the best path to me. Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standy introduced problem with query cancel behavior
On Thu, 2009-12-31 at 13:14 +0100, Andres Freund wrote: When cancelling a backend that behaviour could be a bit annoying ;-) Reading comments alone doesn't show the full situation here. Before we send signal we check pid also, so the chances of this happening are fairly small. i.e. we would need to have a backend slot reused by a new backend with exactly same pid as the last slot holder. Well. The problem does not occur for critical errors (i.e. session death) only. As signal delivery is asynchronous it can very well happen for transaction cancellation as well. I'm proposing to use this for killing transactions and connections, so I don't think there's too much harm done there. If the backend is leaving anyway, that's what we wanted. If its a new guy that is wearing the same boots then a little collateral damage doesn't leave the server in a bad place. HS cancellations aren't yet so precise that this matters. Building racy infrastructure when it can be avoided with a little care still seems not to be the best path to me. Doing that will add more complexity in an area that is hard to test effectively. I think the risk of introducing further bugs while trying to fix this rare condition is high. Right now the conflict processing needs more work and is often much less precise than this, so improving this aspect of it would not be a priority. I've added it to the TODO though. Thank you for your research. I enclose the patch I am currently testing, as a patch-on-patch on top of Joachim's changes, recently published on this thread. POC only as yet. Patch implements recovery conflict signalling using SIGUSR1 multiplexing, then uses a SessionCancelPending mode similar to Joachim's TransactionCancelPending. -- Simon Riggs www.2ndQuadrant.com *** a/src/backend/storage/ipc/procarray.c --- b/src/backend/storage/ipc/procarray.c *** *** 1704,1710 GetConflictingVirtualXIDs(TransactionId limitXmin, Oid dbOid, * Returns pid of the process signaled, or 0 if not found. */ pid_t ! CancelVirtualTransaction(VirtualTransactionId vxid, int cancel_mode) { ProcArrayStruct *arrayP = procArray; int index; --- 1704,1710 * Returns pid of the process signaled, or 0 if not found. */ pid_t ! CancelVirtualTransaction(VirtualTransactionId vxid, ProcSignalReason sigmode) { ProcArrayStruct *arrayP = procArray; int index; *** *** 1722,1733 CancelVirtualTransaction(VirtualTransactionId vxid, int cancel_mode) if (procvxid.backendId == vxid.backendId procvxid.localTransactionId == vxid.localTransactionId) { - /* - * Issue orders for the proc to read next time it receives SIGINT - */ - if (proc-recoveryConflictMode cancel_mode) - proc-recoveryConflictMode = cancel_mode; - pid = proc-pid; break; } --- 1722,1727 *** *** 1741,1747 CancelVirtualTransaction(VirtualTransactionId vxid, int cancel_mode) * Kill the pid if it's still here. If not, that's what we wanted * so ignore any errors. */ ! kill(pid, SIGINT); } return pid; --- 1735,1741 * Kill the pid if it's still here. If not, that's what we wanted * so ignore any errors. */ ! (void) SendProcSignal(pid, sigmode, vxid.backendId); } return pid; *** a/src/backend/storage/ipc/procsignal.c --- b/src/backend/storage/ipc/procsignal.c *** *** 24,29 --- 24,30 #include storage/procsignal.h #include storage/shmem.h #include storage/sinval.h + #include storage/standby.h /* *** *** 258,262 procsignal_sigusr1_handler(SIGNAL_ARGS) --- 259,269 if (CheckProcSignal(PROCSIG_NOTIFY_INTERRUPT)) HandleNotifyInterrupt(); + if (CheckProcSignal(PROCSIG_CONFLICT_ERROR_INTERRUPT)) + HandleConflictInterrupt(ERROR); + + if (CheckProcSignal(PROCSIG_CONFLICT_FATAL_INTERRUPT)) + HandleConflictInterrupt(FATAL); + errno = save_errno; } *** a/src/backend/storage/ipc/standby.c --- b/src/backend/storage/ipc/standby.c *** *** 218,229 ResolveRecoveryConflictWithVirtualXIDs(VirtualTransactionId *waitlist, if (WaitExceedsMaxStandbyDelay()) { pid_t pid; /* * Now find out who to throw out of the balloon. */ Assert(VirtualTransactionIdIsValid(*waitlist)); ! pid = CancelVirtualTransaction(*waitlist, cancel_mode); if (pid != 0) { --- 218,235 if (WaitExceedsMaxStandbyDelay()) { pid_t pid; + ProcSignalReason sigmode; + + if (cancel_mode == CONFLICT_MODE_ERROR) + sigmode = PROCSIG_CONFLICT_ERROR_INTERRUPT; + else + sigmode = PROCSIG_CONFLICT_FATAL_INTERRUPT; /* * Now find out who to throw out of the balloon. */ Assert(VirtualTransactionIdIsValid(*waitlist)); ! pid = CancelVirtualTransaction(*waitlist, sigmode); if (pid != 0) {
Re: [HACKERS] Hot Standy introduced problem with query cancel behavior
On Tue, Dec 29, 2009 at 4:22 PM, Tom Lane t...@sss.pgh.pa.us wrote: This seems like a fairly bad idea. One of the intended use-cases is to be able to manually kill -INT a misbehaving backend. Assuming that there will be valid info about the signal in shared memory will break that. I never intended to change the current behavior. Actually I wanted to even enhance it by allowing to also cancel an idle transaction via SIGINT. We are free to define what should happen if there is no internal reason available because the signal has been sent manually. We can also use SIGUSR1 of course but then you cannot cancel an idle transaction just from the commandline. Not sure if this is necessary though but I would have liked it in the past already (I once used a version of slony that left transactions idle from time to time...) Joachim -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standy introduced problem with query cancel behavior
On Wed, 2009-12-30 at 12:05 +0100, Joachim Wieland wrote: On Tue, Dec 29, 2009 at 4:22 PM, Tom Lane t...@sss.pgh.pa.us wrote: This seems like a fairly bad idea. One of the intended use-cases is to be able to manually kill -INT a misbehaving backend. Assuming that there will be valid info about the signal in shared memory will break that. I never intended to change the current behavior. Actually I wanted to even enhance it by allowing to also cancel an idle transaction via SIGINT. We are free to define what should happen if there is no internal reason available because the signal has been sent manually. We can also use SIGUSR1 of course but then you cannot cancel an idle transaction just from the commandline. Not sure if this is necessary though but I would have liked it in the past already (I once used a version of slony that left transactions idle from time to time...) Andres mentioned this in relation to Startup process sending signals to backends. Startup needs to in-some cases issue a FATAL error also, which is a separate issue from the discusion around SIGINT. I will rework the FATAL case and continue to support you in finding a solution to the cancel-while-idle case. -- Simon Riggs www.2ndQuadrant.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standy introduced problem with query cancel behavior
On Wednesday 30 December 2009 01:13:01 Simon Riggs wrote: On Tue, 2009-12-29 at 11:13 -0500, Tom Lane wrote: Andres Freund and...@anarazel.de writes: On Tuesday 29 December 2009 16:22:54 Tom Lane wrote: This seems like a fairly bad idea. One of the intended use-cases is to be able to manually kill -INT a misbehaving backend. Assuming that there will be valid info about the signal in shared memory will break that. Well. That already is the case now. MyProc-recoveryConflictMode is checked to recognize what kind of conflict is being resolved... In that case, HS has already broken it, and we need to fix it not make it worse. My humble opinion is that SIGINT should not be overloaded with multiple meanings. We already have a multiplexed signal mechanism, which is what should be used for any additional signal reasons HS may need to introduce. It's a revelation to me, but yes, I see it now and agree. I'm looking at Fujii-san's multiplexing patch from Jul 31 to rewrite this code using that mechanism. It sounds like it's a neat fit and it should get around the bug report from Kris also if it all works. Hm. I just read a bit of that multiplexing facility (out of a different reason) and I have some doubt about it being used unmodified for canceling backends: procsignal.c: /* * Note: Since there's no locking, it's possible that the target * process detaches from shared memory and exits right after this * test, before we set the flag and send signal. And the signal slot * might even be recycled by a new process, so it's remotely possible * that we set a flag for a wrong process. That's OK, all the signals * are such that no harm is done if they're mistakenly fired. */ procsignal.h: ... * Also, because of race conditions, it's important that all the signals be * defined so that no harm is done if a process mistakenly receives one. */ When cancelling a backend that behaviour could be a bit annoying ;-) I guess locking procarray during sending the signal should be enough? Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standy introduced problem with query cancel behavior
On Wed, Dec 30, 2009 at 2:06 PM, Andres Freund and...@anarazel.de wrote: On Wednesday 30 December 2009 01:13:01 Simon Riggs wrote: On Tue, 2009-12-29 at 11:13 -0500, Tom Lane wrote: Andres Freund and...@anarazel.de writes: On Tuesday 29 December 2009 16:22:54 Tom Lane wrote: This seems like a fairly bad idea. One of the intended use-cases is to be able to manually kill -INT a misbehaving backend. Assuming that there will be valid info about the signal in shared memory will break that. Well. That already is the case now. MyProc-recoveryConflictMode is checked to recognize what kind of conflict is being resolved... In that case, HS has already broken it, and we need to fix it not make it worse. My humble opinion is that SIGINT should not be overloaded with multiple meanings. We already have a multiplexed signal mechanism, which is what should be used for any additional signal reasons HS may need to introduce. It's a revelation to me, but yes, I see it now and agree. I'm looking at Fujii-san's multiplexing patch from Jul 31 to rewrite this code using that mechanism. It sounds like it's a neat fit and it should get around the bug report from Kris also if it all works. Hm. I just read a bit of that multiplexing facility (out of a different reason) and I have some doubt about it being used unmodified for canceling backends: procsignal.c: /* * Note: Since there's no locking, it's possible that the target * process detaches from shared memory and exits right after this * test, before we set the flag and send signal. And the signal slot * might even be recycled by a new process, so it's remotely possible * that we set a flag for a wrong process. That's OK, all the signals * are such that no harm is done if they're mistakenly fired. */ procsignal.h: ... * Also, because of race conditions, it's important that all the signals be * defined so that no harm is done if a process mistakenly receives one. */ When cancelling a backend that behaviour could be a bit annoying ;-) I guess locking procarray during sending the signal should be enough? I think the idea is that you define the behavior of the signal to be look at this other piece of state to see whether you should cancel yourself rather than just cancel yourself. Then if a signal is delivered by mistake, it's no big deal - you just look at the other piece of state and decide that you don't need to do anything. ...Robert -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standy introduced problem with query cancel behavior
- Ursprüngliche Mitteilung - On Wed, Dec 30, 2009 at 2:06 PM, Andres Freund and...@anarazel.de wrote: On Wednesday 30 December 2009 01:13:01 Simon Riggs wrote: On Tue, 2009-12-29 at 11:13 -0500, Tom Lane wrote: Andres Freund and...@anarazel.de writes: On Tuesday 29 December 2009 16:22:54 Tom Lane wrote: This seems like a fairly bad idea. One of the intended use-cases is to be able to manually kill -INT a misbehaving backend. Assuming that there will be valid info about the signal in shared memory will break that. Well. That already is the case now. MyProc-recoveryConflictMode is checked to recognize what kind of conflict is being resolved... In that case, HS has already broken it, and we need to fix it not make it worse. My humble opinion is that SIGINT should not be overloaded with multiple meanings. We already have a multiplexed signal mechanism, which is what should be used for any additional signal reasons HS may need to introduce. It's a revelation to me, but yes, I see it now and agree. I'm looking at Fujii-san's multiplexing patch from Jul 31 to rewrite this code using that mechanism. It sounds like it's a neat fit and it should get around the bug report from Kris also if it all works. Hm. I just read a bit of that multiplexing facility (out of a different reason) and I have some doubt about it being used unmodified for canceling backends: procsignal.c: /* * Note: Since there's no locking, it's possible that the target * process detaches from shared memory and exits right after this * test, before we set the flag and send signal. And the signal slot * might even be recycled by a new process, so it's remotely possible * that we set a flag for a wrong process. That's OK, all the signals * are such that no harm is done if they're mistakenly fired. */ procsignal.h: ... * Also, because of race conditions, it's important that all the signals be * defined so that no harm is done if a process mistakenly receives one. */ When cancelling a backend that behaviour could be a bit annoying ;-) I guess locking procarray during sending the signal should be enough? I think the idea is that you define the behavior of the signal to be look at this other piece of state to see whether you should cancel yourself rather than just cancel yourself. Then if a signal is delivered by mistake, it's no big deal - you just look at the other piece of state and decide that you don't need to do anything. I dont see an easy way to pass enough information right now. You cant regenerate enough of it in the to be killed backend as most of the relevant information is only available in the startup process. Inventing yet another segment in shm just for this seems overcomplicated to me. Thats why I suggested locking the procarray for this - without having looked at the code that should prevent a backend slot from beimg reused. Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standy introduced problem with query cancel behavior
On Wed, Dec 30, 2009 at 6:43 PM, Andres Freund and...@anarazel.de wrote: - Ursprüngliche Mitteilung - On Wed, Dec 30, 2009 at 2:06 PM, Andres Freund and...@anarazel.de wrote: On Wednesday 30 December 2009 01:13:01 Simon Riggs wrote: On Tue, 2009-12-29 at 11:13 -0500, Tom Lane wrote: Andres Freund and...@anarazel.de writes: On Tuesday 29 December 2009 16:22:54 Tom Lane wrote: This seems like a fairly bad idea. One of the intended use-cases is to be able to manually kill -INT a misbehaving backend. Assuming that there will be valid info about the signal in shared memory will break that. Well. That already is the case now. MyProc-recoveryConflictMode is checked to recognize what kind of conflict is being resolved... In that case, HS has already broken it, and we need to fix it not make it worse. My humble opinion is that SIGINT should not be overloaded with multiple meanings. We already have a multiplexed signal mechanism, which is what should be used for any additional signal reasons HS may need to introduce. It's a revelation to me, but yes, I see it now and agree. I'm looking at Fujii-san's multiplexing patch from Jul 31 to rewrite this code using that mechanism. It sounds like it's a neat fit and it should get around the bug report from Kris also if it all works. Hm. I just read a bit of that multiplexing facility (out of a different reason) and I have some doubt about it being used unmodified for canceling backends: procsignal.c: /* * Note: Since there's no locking, it's possible that the target * process detaches from shared memory and exits right after this * test, before we set the flag and send signal. And the signal slot * might even be recycled by a new process, so it's remotely possible * that we set a flag for a wrong process. That's OK, all the signals * are such that no harm is done if they're mistakenly fired. */ procsignal.h: ... * Also, because of race conditions, it's important that all the signals be * defined so that no harm is done if a process mistakenly receives one. */ When cancelling a backend that behaviour could be a bit annoying ;-) I guess locking procarray during sending the signal should be enough? I think the idea is that you define the behavior of the signal to be look at this other piece of state to see whether you should cancel yourself rather than just cancel yourself. Then if a signal is delivered by mistake, it's no big deal - you just look at the other piece of state and decide that you don't need to do anything. I dont see an easy way to pass enough information right now. You cant regenerate enough of it in the to be killed backend as most of the relevant information is only available in the startup process. Inventing yet another segment in shm just for this seems overcomplicated to me. Thats why I suggested locking the procarray for this - without having looked at the code that should prevent a backend slot from beimg reused. Yeah, I understand, but I have a feeling that the code doesn't do it that way right now for a reason. Someone who understands this better than I should comment, but I'm thinking you would likely need to lock the ProcArray in CheckProcSignal as well, and I'm thinking that can't be safely done from within a signal handler. ...Robert -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standy introduced problem with query cancel behavior
On Thursday 31 December 2009 01:09:57 Robert Haas wrote: On Wed, Dec 30, 2009 at 6:43 PM, Andres Freund and...@anarazel.de wrote: On Wed, Dec 30, 2009 at 2:06 PM, Andres Freund and...@anarazel.de wrote: On Wednesday 30 December 2009 01:13:01 Simon Riggs wrote: On Tue, 2009-12-29 at 11:13 -0500, Tom Lane wrote: Andres Freund and...@anarazel.de writes: On Tuesday 29 December 2009 16:22:54 Tom Lane wrote: This seems like a fairly bad idea. One of the intended use-cases is to be able to manually kill -INT a misbehaving backend. Assuming that there will be valid info about the signal in shared memory will break that. Well. That already is the case now. MyProc-recoveryConflictMode is checked to recognize what kind of conflict is being resolved... In that case, HS has already broken it, and we need to fix it not make it worse. My humble opinion is that SIGINT should not be overloaded with multiple meanings. We already have a multiplexed signal mechanism, which is what should be used for any additional signal reasons HS may need to introduce. It's a revelation to me, but yes, I see it now and agree. I'm looking at Fujii-san's multiplexing patch from Jul 31 to rewrite this code using that mechanism. It sounds like it's a neat fit and it should get around the bug report from Kris also if it all works. Hm. I just read a bit of that multiplexing facility (out of a different reason) and I have some doubt about it being used unmodified for canceling backends: procsignal.c: /* * Note: Since there's no locking, it's possible that the target * process detaches from shared memory and exits right after this * test, before we set the flag and send signal. And the signal slot * might even be recycled by a new process, so it's remotely possible * that we set a flag for a wrong process. That's OK, all the signals * are such that no harm is done if they're mistakenly fired. */ procsignal.h: ... * Also, because of race conditions, it's important that all the signals be * defined so that no harm is done if a process mistakenly receives one. */ When cancelling a backend that behaviour could be a bit annoying ;-) I guess locking procarray during sending the signal should be enough? I think the idea is that you define the behavior of the signal to be look at this other piece of state to see whether you should cancel yourself rather than just cancel yourself. Then if a signal is delivered by mistake, it's no big deal - you just look at the other piece of state and decide that you don't need to do anything. I dont see an easy way to pass enough information right now. You cant regenerate enough of it in the to be killed backend as most of the relevant information is only available in the startup process. Inventing yet another segment in shm just for this seems overcomplicated to me. Thats why I suggested locking the procarray for this - without having looked at the code that should prevent a backend slot from beimg reused. Yeah, I understand, but I have a feeling that the code doesn't do it that way right now for a reason. Someone who understands this better than I should comment, but I'm thinking you would likely need to lock the ProcArray in CheckProcSignal as well, and I'm thinking that can't be safely done from within a signal handler. I don't think we would need to lock in the signal handler. The situation is that two different flags (at this point at least) are needed. FATAL which aborts the session and ERROR which aborts the transaction. Consider the following scenario: - both flag are set while holding a lock on the procarray - starting a new backend requires a lock on the procarray - backend startup cleans up both flags in its ProcSignalSlot (only that specific one as its the only one manipulated under a lock, mind you) - transaction startup clears the ERROR flag under locking - after the new backend started no signal handler targeted for the old backend can get triggered (new pid, if we consider direct reuse of the same pid we have much bigger problems anyway), thus no flag targeted for the old backend can get set That would require a nice comment explaining that but it should work. Another possibility would be to make the whole signal delivery mechanism safe - that should be possible if we had a atomically settable backend id... Unfortunately that would limit the max lifetime for a backend a bit - as sig_atomic_t is 4byte on most platforms. So no backend would be allowed to live after creation of the 2**32-1th backend after it. I don't immediately see a way circumvent that 32bit barrier without using assembler or locks. Andres PS: Hm. For my former message beeing written on a mobile phone it looked surprisingly clean... -- Sent via pgsql-hackers mailing list
Re: [HACKERS] Hot Standy introduced problem with query cancel behavior
On Sun, Dec 27, 2009 at 10:42 PM, Simon Riggs si...@2ndquadrant.com wrote: Thanks for the report. I'll see about a fix. In the end we are about to use SIGINT for two use cases: - cancel an idle transaction - cancel a running query Previously a backend that was DoingCommandRead == true didn't do anything upon reception of SIGINT, now it aborts either the running query or the idle transaction, which is why Kris's example behaves differently now. If we use the same signal for both cases, the receiving backend cannot tell what the intention of the sending backend was. That's why I proposed to make SIGINT similar to SIGUSR1 where we write a reason to a shared memory structure first and then send the signal (see http://archives.postgresql.org/pgsql-hackers/2009-12/msg02067.php from a few days ago). There was also some dicussion about how to communicate the cancellation back to the client when its idle transaction got aborted. I implemented what I thought was the conclusion of the discussion but haven't received a reply on it yet. Joachim -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standy introduced problem with query cancel behavior
Joachim Wieland j...@mcknight.de writes: If we use the same signal for both cases, the receiving backend cannot tell what the intention of the sending backend was. That's why I proposed to make SIGINT similar to SIGUSR1 where we write a reason to a shared memory structure first and then send the signal (see http://archives.postgresql.org/pgsql-hackers/2009-12/msg02067.php from a few days ago). This seems like a fairly bad idea. One of the intended use-cases is to be able to manually kill -INT a misbehaving backend. Assuming that there will be valid info about the signal in shared memory will break that. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standy introduced problem with query cancel behavior
On Tuesday 29 December 2009 16:22:54 Tom Lane wrote: Joachim Wieland j...@mcknight.de writes: If we use the same signal for both cases, the receiving backend cannot tell what the intention of the sending backend was. That's why I proposed to make SIGINT similar to SIGUSR1 where we write a reason to a shared memory structure first and then send the signal (see http://archives.postgresql.org/pgsql-hackers/2009-12/msg02067.php from a few days ago). This seems like a fairly bad idea. One of the intended use-cases is to be able to manually kill -INT a misbehaving backend. Assuming that there will be valid info about the signal in shared memory will break that. Well. That already is the case now. MyProc-recoveryConflictMode is checked to recognize what kind of conflict is being resolved... Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standy introduced problem with query cancel behavior
Andres Freund and...@anarazel.de writes: On Tuesday 29 December 2009 16:22:54 Tom Lane wrote: This seems like a fairly bad idea. One of the intended use-cases is to be able to manually kill -INT a misbehaving backend. Assuming that there will be valid info about the signal in shared memory will break that. Well. That already is the case now. MyProc-recoveryConflictMode is checked to recognize what kind of conflict is being resolved... In that case, HS has already broken it, and we need to fix it not make it worse. My humble opinion is that SIGINT should not be overloaded with multiple meanings. We already have a multiplexed signal mechanism, which is what should be used for any additional signal reasons HS may need to introduce. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standy introduced problem with query cancel behavior
On Tue, 2009-12-29 at 14:14 +0100, Joachim Wieland wrote: haven't received a reply on it yet. Apologies for that. I replied today, just forgot to hit send until now. Thanks again for the patch. -- Simon Riggs www.2ndQuadrant.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standy introduced problem with query cancel behavior
On Tue, 2009-12-29 at 11:13 -0500, Tom Lane wrote: Andres Freund and...@anarazel.de writes: On Tuesday 29 December 2009 16:22:54 Tom Lane wrote: This seems like a fairly bad idea. One of the intended use-cases is to be able to manually kill -INT a misbehaving backend. Assuming that there will be valid info about the signal in shared memory will break that. Well. That already is the case now. MyProc-recoveryConflictMode is checked to recognize what kind of conflict is being resolved... In that case, HS has already broken it, and we need to fix it not make it worse. My humble opinion is that SIGINT should not be overloaded with multiple meanings. We already have a multiplexed signal mechanism, which is what should be used for any additional signal reasons HS may need to introduce. It's a revelation to me, but yes, I see it now and agree. I'm looking at Fujii-san's multiplexing patch from Jul 31 to rewrite this code using that mechanism. It sounds like it's a neat fit and it should get around the bug report from Kris also if it all works. -- Simon Riggs www.2ndQuadrant.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standy introduced problem with query cancel behavior
On Sat, 2009-12-26 at 20:15 -0500, Kris Jurka wrote: The JDBC driver's regression test suite has revealed a change in behavior introduced by the hot standy patch. Previously when a client sent a cancel request on an idle connection, nothing happened. Now it sends an error message ERROR: canceling statement due to user request. This confuses the driver's protocol handling and it thinks that the error message is for the next statement issued. Attached is a test case. Thanks for the report. I'll see about a fix. -- Simon Riggs www.2ndQuadrant.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Hot Standy introduced problem with query cancel behavior
The JDBC driver's regression test suite has revealed a change in behavior introduced by the hot standy patch. Previously when a client sent a cancel request on an idle connection, nothing happened. Now it sends an error message ERROR: canceling statement due to user request. This confuses the driver's protocol handling and it thinks that the error message is for the next statement issued. Attached is a test case. Kris Jurka import java.sql.*; public class Cancel { public static void main(String args[]) throws Exception { Class.forName(org.postgresql.Driver); Connection conn = DriverManager.getConnection(jdbc:postgresql://localhost:5851/jurka, jurka, ); Statement stmt = conn.createStatement(); stmt.cancel(); ResultSet rs = stmt.executeQuery(SELECT 1); } } -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers