On 01/12/10 09:40, Simon Riggs wrote:
On Tue, 2010-01-12 at 06:30 +0100, Andres Freund wrote:
On 01/07/10 22:37, Andres Freund wrote:
On Thursday 07 January 2010 22:28:46 Tom Lane wrote:
Andres Freund <and...@anarazel.de> writes:
I did not want to suggest using Simon's code there. Sorry for the brevity. It
should have read as "revert to old code and add two-step killing (that's
likely around 10 lines or so)".

"two step killing" meaning that we signal ERROR for a few times and if
nothing happens that we like, we signal FATAL.
As the code already loops around signaling anyway that should be easy to
implement.
Ah.  This loop happens in the process that's trying to send the cancel
signal, correct, not the one that needs to respond to it?  That sounds
fairly sane to me.
Yes.


There are some things we could do to make it more likely that a cancel
of this type is accepted --- for instance, give it a distinct SQLSTATE
code that *can not* be trapped by plpgsql EXCEPTION blocks --- but there
is no practical way to guarantee it except elog(FATAL).  I'm not
entirely convinced that an untrappable error would be a good thing
anyway; it's hard to argue that that's much better than a FATAL.
Well, a session which is usable after a transaction abort is quite sensible -
quite a lot of software I know handles a failing transaction much more gracefully
than a session abort (e.g. because it has to deal with serialization failures
and such anyway).

So making it caught in fewer places and degrading to FATAL sounds sensible and
relatively easy to me.
Unless somebody disagrees, I will give it a shot.
Ok, here is a stab at that:

1. Allow the signal multiplexing facility to transfer one sig_atomic_t
worth of data. This is useful, e.g., for making operations idempotent
or more precise.

Here the LocalBackendId is transported - that should make it
impossible to cancel the wrong transaction.
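
In sketch form the mechanism looks roughly like this (the slot layout
follows procsignal.c; pss_signalData and SendProcSignalData() are
illustrative names, not the literal patch):

typedef struct ProcSignalSlot
{
    pid_t           pss_pid;
    sig_atomic_t    pss_signalFlags[NUM_PROCSIGNALS];
    sig_atomic_t    pss_signalData[NUM_PROCSIGNALS];    /* new: one payload per reason */
} ProcSignalSlot;

int
SendProcSignalData(pid_t pid, ProcSignalReason reason, BackendId backendId,
                   sig_atomic_t data)
{
    volatile ProcSignalSlot *slot = &ProcSignalSlots[backendId - 1];

    if (slot->pss_pid != pid)
        return -1;              /* backend is gone; don't signal a stranger */

    /* fill in the payload (e.g. the target's LocalBackendId) before the flag */
    slot->pss_signalData[reason] = data;
    slot->pss_signalFlags[reason] = true;
    return kill(pid, SIGUSR1);
}

The receiving backend compares the payload against its own state before
acting, so a cancel aimed at an earlier transaction in the same slot is
simply ignored.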

This doesn't remove the possibility of cancelling the wrong transaction;
it just reduces the chances. You can still cancel a session with
LocalBackendId == 1 and then have it accidentally cancel the next
session with the same backendid. That's why I didn't chase this idea
myself earlier.
Hm, I don't think so. Backend initialization clears those flags. And the signal is sent to a specific pid, so you can't send it to a new backend when targeting the old one.

So how should that occur?

Closing that loophole isn't a priority and is best left until the
other things have been cleaned up.
Yes, maybe. It was just rather easy to fix, and it fixed the problem sufficiently that I could no longer reproduce it, in contrast to before, when it showed up during a regular test run.

2.

AbortTransactionAndAnySubtransactions is only used in the mainloop's
error handler, as it should be unproblematic there.

In the current CVS code, ConditionalVirtualXactLockTableWait() in
ResolveRecoveryConflictWithVirtualXIDs does the wait on every attempt
to cancel the other transaction.

I moved the retry logic into CancelVirtualTransaction(). If signaling
ERROR 50 times achieves nothing, it degrades to FATAL.
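
Roughly, the loop now looks like this (a sketch with simplified helper
names, not the literal patch code):

#define CANCEL_MAX_ERROR_RETRIES 50

void
CancelVirtualTransaction(VirtualTransactionId vxid)
{
    int     retries = 0;

    while (VirtualTransactionStillRunning(vxid))    /* assumed helper */
    {
        if (retries++ < CANCEL_MAX_ERROR_RETRIES)
            SignalConflictError(vxid);  /* step one: a trappable ERROR */
        else
            SignalConflictFatal(vxid);  /* step two: degrade to FATAL */

        pg_usleep(50000L);      /* give the backend a moment to react */
    }
}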

It's a good idea but not the right one, IMHO. I'd rather that the
backend decided whether the instruction results in an ERROR or a FATAL
based upon its own state, rather than have Startup process bang its head
against the wall 50 times without knowing why.
I don't think it's that easy to keep enough state there in a safe manner.
Also, the concept of retrying is not new - I just made it degrade instead of re-waiting for max_standby_delay each time (which imho is a bug).

In many cases the retries and repeated ERRORs will be enough to free the backend out of all PG_TRY/CATCH blocks so that the next ERROR reaches the mainloop. So the retrying is quite sensible - and you can't do that part in the signalled backend itself.
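
To illustrate: an ERROR raised by the conflict signal while the backend
sits inside a PG_TRY() block longjmps to the nearest PG_CATCH(), not to
the mainloop (the macros are real, the body is made up):

PG_TRY();
{
    /* the conflict signal fires elog(ERROR, ...) somewhere in here */
    do_some_work();             /* stand-in for arbitrary backend code */
}
PG_CATCH();
{
    /*
     * The cancel lands here and is "handled" - the transaction survives.
     * Only a retry arriving while no PG_TRY() is active propagates to
     * the sigsetjmp() in the mainloop and actually aborts things.
     */
    FlushErrorState();
}
PG_END_TRY();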


XXX: I temporarily do not use the skipExistingConflicts argument of
GetConflictingVirtualXIDs - I don't understand its purpose, and a bit of
infrastructure is missing right now, as the recoveryConflictMode is no
longer stored anywhere. It can easily be added back, though.
Is that a leftover from additional capabilities (deferred conflict handling?)? Because currently there will never be a case with two different cancellations at the same time. Also, the current logic throws away a FATAL error if an ERROR is already there.

3.
Add a new error code ERRCODE_QUERY_CANCELED_HS for use with HS,
indicating a failure that is more than a plain
ERRCODE_QUERY_CANCELED - namely, it should not be caught in
various places like savepoints and PLs.

As an example, I check for that error code in plpgsql and make it
uncatchable.
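
The change boils down to something like this (simplified from the
exception handling in plpgsql's exec_stmt_block(); oldcontext is the
surrounding memory context there):

PG_CATCH();
{
    ErrorData  *edata;

    MemoryContextSwitchTo(oldcontext);
    edata = CopyErrorData();
    FlushErrorState();

    /* HS conflict aborts must not be trappable by EXCEPTION blocks */
    if (edata->sqlerrcode == ERRCODE_QUERY_CANCELED_HS)
        ReThrowError(edata);

    /* ... normal matching against the block's EXCEPTION list follows ... */
}
PG_END_TRY();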

I am not sure how new error codes get chosen, though - and the name
is not that nice.

Opinions on that?

I don't trust it yet, as a mechanism. Changing only PL/pgSQL seems
fairly broken also. Again, this seems like something that should be
handled separately.
Well, that was an exemplary change - it's easy to fix that in other places as well. Without any identifier for such an abort, I don't see how that could work.

We simply can't break out of nested transactions if we're outside the mainloop.

I copied quite a bit from Simon's earlier patch...
It changes a few things around and adds the ideas you've mentioned,
though seeing them as code doesn't in itself move the discussion
forwards. There are far too many new ideas in this patch for it to be
even close to acceptable, to me. Yes, let's discuss these additional
ideas, but after a basic patch has been applied.
Sure, the "targeted" signalling part can be left out, yes. But the rest is mostly necessary, I think.

I do value your interest and input, though racing me to rework my own
patch, then asking me to review it seems like wasted time for both of
us. I will post a new patch on ~Friday.
Well, I explicitly asked whether you or somebody else was going after this, waited two days, and only then started working on it.
And you seem to have enough on your plate...

(Also, please make your braces { } follow the normal coding conventions
for Postgres.)
Sure.

Currently the patch does not yet do anything to avoid letting the
protocol get out of sync. What do you think about adding a flag for
errors that suppresses communication with the client (similar to COMERROR)?

So that one could do elog(ERROR | ERROR_NO_SEND_CLIENT, ...) or such?
Seems like a fairly important piece.
It's quite a bit of manual work though, so I wanted to be sure that's an agreeable approach.
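
Sketched out, that could look like this (ERROR_NO_SEND_CLIENT is a
hypothetical flag bit and strip_no_send_client() a made-up helper, not
an existing API):

#define ERROR_NO_SEND_CLIENT    0x100   /* must not collide with real levels */

static int
strip_no_send_client(int elevel, bool *send_to_client)
{
    if (elevel & ERROR_NO_SEND_CLIENT)
    {
        *send_to_client = false;    /* keep the fe/be protocol in sync */
        return elevel & ~ERROR_NO_SEND_CLIENT;  /* the real severity */
    }
    *send_to_client = true;
    return elevel;
}

/* call site, e.g. when cancelling for a recovery conflict: */
/* elog(ERROR | ERROR_NO_SEND_CLIENT, "canceling statement due to recovery conflict"); */

The error machinery would strip the bit early (errstart() already
decides whether output goes to the client) and otherwise treat the
error normally, the way COMERROR logs to the server log without
touching the client connection.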


Andres
