Re: Always show correct error message for statement timeouts, fixes random buildfarm failures

Jelte Fennema-Nio Fri, 09 Jan 2026 14:08:51 -0800

On Thu, 8 Jan 2026 at 18:50, Tom Lane <[email protected]> wrote:

I took another look and I still don't like it.  The intent of the code
is that once we're out of a query, a query cancel that arrived
meanwhile should be a no-op.  I completely disagree with the idea that
a late-arriving cancel should cancel the next query.  (From the user's
standpoint: if you're a bit slow to hit control-C and so your query
finishes anyway, and then you go for coffee, would you be surprised
to come back awhile later and enter a new query and have it fail?)
And I think the same should apply to statement timeouts: if one
happens slightly too late to cancel the current query, that does not
mean it should cancel the next one.


The fact that you're thinking that my intent of the patch was this, that
means I did a terrible job at explaining the bug that this patch is
fixing. So I'll try doing a better job now.

The bug this is fixing has nothing to do with multiple queries. It
occurs within a single query being processed by exec_simple_query:
1. At line 1051 start_xact_command is called and then calls
  enable_statement_timeout to start the timeout for the query.
2. The code continues to line 1148 where start_xact_command is called AGAIN.
3. Now the timeout occurs (could also happen a bit earlier, as long as
  no CHECK_FOR_INTERRUPTS is called). This sets the statement timeout
  indicator and calls remove_timeout_index to mark the timeout as
  inactive.
4. Now the second start_xact_command calls enable_statement_timeout AGAIN.
5. Because get_timeout_active returns false now due to the timeout,
  enable_timeout_after is called again which resets the indicator flag
6. New CHECK_FOR_INTERRUPTS is called wherever in the query and
  QueryCancelPending is still set (correctly), but the matching
  indicator flag was lost.

The problem is that while the comment above 4 shows that the intent of
calling enable_statement_timeout again without an interceding
finish_xact_command is a no-op:

   /*
    * Start statement timeout if necessary.  Note that this'll intentionally
    * not reset the clock on an already started timeout, to avoid the timing
    * overhead when start_xact_command() is invoked repeatedly, without an
    * interceding finish_xact_command() (e.g. parse/bind/execute).  If that's
    * not desired, the timeout has to be disabled explicitly.
    */

But that's not actually the case if you run into this race where the
timeout has already fired, because then It resets the just-before-set
indicator flag. Note that the comment also explains that this is not the
only place where this can happen.

        /*
         * (5) disable async signal conditions again.
         *
         * Query cancel is supposed to be a no-op when there is no query in
         * progress, so if a query cancel arrived while we were idle, just
         * reset QueryCancelPending. ProcessInterrupts() has that effect when
         * it's called when DoingCommandRead is set, so check for interrupts
         * before resetting DoingCommandRead.
         */
        CHECK_FOR_INTERRUPTS();
        DoingCommandRead = false;

We are taking care to discard a stale QueryCancelPending flag, but
this code is ignorant of the fact that timeouts now also feed into
QueryCancelPending.  Perhaps we should clear the timeout indicators
here?  Or maybe using CHECK_FOR_INTERRUPTS here is too cute and
we should just clear QueryCancelPending directly (and clear the
indicators too, likely).


Thank you for pointing this out. I think (but I'm not exactly sure what
I was thinking over a year ago), that I had trusted the comment there,
and I had missed that my changes had inadvertently changed the behaviour
here because now the indicators could still be set at this
CHECK_FOR_INTERRUPTS call.

Notably CHECK_FOR_INTERRUPTS is already clearing the indicators (your
2nd proposed option), because it's calling get_timeout_indicator with
the true flag. Though that's not enough, because it still throws an
error if they are set even if DoingCommandRead=true. So the newly
attached patch changes that behaviour instead by checking if
DoingCommandRead is false before throwing any error. Does that resolve
your concerns about the possibility of cancels leaking to the next
query? (which to be extremely clear, was unintended)

Also, I could buy making the proposed change in
disable_statement_timeout without the one in enable_statement_timeout.
I think that would have the desired effect that if ProcessInterrupts
runs shortly after disable_statement_timeout it would not lie about
the cause of the cancel, while not sacrificing the principle that
a statement timeout can't kill the next statement.


As explained above the one in enable_statement_timeout is exactly the
one that is required to fix the bug in question.

As a more thorough redesign, maybe we should get rid of the
idea that QueryCancelPending represents all three of these
conditions.  That would mean a bit more duplication of logic
in ProcessInterrupts, but I don't think that code is especially
performance-critical.


I would like that very much, as I think it would make the code much
easier to follow (I think it's pretty hard to understand code
currently). The problem I see with actually doing that though, is that
StatementTimeoutHandler is sending SIGINT to the whole process group. So
unless we want to stop doing that, I don't see a way of separating the
indicators from QueryCancelPending.

From 1020cd6daf6e513420984d172026218502f99ec2 Mon Sep 17 00:00:00 2001
From: Jelte Fennema-Nio <[email protected]>
Date: Mon, 24 Jun 2024 00:29:39 +0200
Subject: [PATCH v6] Do not reset statement_timeout indicator outside of
 ProcessInterrupts

The only way that ProcessInterrupts can know why QueryCancelPending is
set is by looking at the indicator bits of the various timeouts. The
STATEMENT_TIMEOUT indicator was being reset in two places, causing
ProcessInterrupts to report "user request" instead of "statement
timeout".

The primary reason for this patch is that this different error message
caused a test to be flaky. The reason this was happening is because
enable_statement_timeout would be called twice for a single query. The
intent of calling enable_statement_timeout twice is that the second call
was a no-op, but that was not the case in practice. If the timer expired
right before the second call, then the timer would restart and unset the
current indicator. The QueryCancelPending flag would still be true, so
on the next CHECK_FOR_INTERRUPTS it would throw the "canceled statement
due to user request" error.

The disable_statement_timeout function has a similar theoretical race
condition, where it would reset the indicator, but leave the cancel flag
set. So that one was changed too, even though none of our tests actually
trigger that specific bug.

The downside of keeping these indicators set, is that they can leak
outside of the query that's being run. To make sure that this causes no
problems in practice, ProcessInterrupts now also checks that
DoingCommandRead is false before canceling because of statement_timeout
or lock_timeout, in addition to resetting of those indicators (which it
already did). Given that CHECK_FOR_INTERRUPTS is called right before
changing the DoingCommandRead back to false, this will reset the
indicators before entering the next query.
---
 src/backend/tcop/postgres.c | 44 ++++++++++++++++++++++++-------------
 1 file changed, 29 insertions(+), 15 deletions(-)

diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index e54bf1e760f..b3b4afdac24 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -3436,34 +3436,37 @@ ProcessInterrupts(void)
 			get_timeout_finish_time(STATEMENT_TIMEOUT) < get_timeout_finish_time(LOCK_TIMEOUT))
 			lock_timeout_occurred = false;	/* report stmt timeout */
 
-		if (lock_timeout_occurred)
+		if (DoingCommandRead)
+		{
+			/*
+			 * If we are reading a command from the client, just ignore the
+			 * cancel request --- sending an extra error message won't
+			 * accomplish anything.  Otherwise, go ahead and throw an error
+			 * with the correct reason.
+			 */
+		}
+		else if (lock_timeout_occurred)
 		{
 			LockErrorCleanup();
 			ereport(ERROR,
 					(errcode(ERRCODE_LOCK_NOT_AVAILABLE),
 					 errmsg("canceling statement due to lock timeout")));
 		}
-		if (stmt_timeout_occurred)
+		else if (stmt_timeout_occurred)
 		{
 			LockErrorCleanup();
 			ereport(ERROR,
 					(errcode(ERRCODE_QUERY_CANCELED),
 					 errmsg("canceling statement due to statement timeout")));
 		}
-		if (AmAutoVacuumWorkerProcess())
+		else if (AmAutoVacuumWorkerProcess())
 		{
 			LockErrorCleanup();
 			ereport(ERROR,
 					(errcode(ERRCODE_QUERY_CANCELED),
 					 errmsg("canceling autovacuum task")));
 		}
-
-		/*
-		 * If we are reading a command from the client, just ignore the cancel
-		 * request --- sending an extra error message won't accomplish
-		 * anything.  Otherwise, go ahead and throw the error.
-		 */
-		if (!DoingCommandRead)
+		else
 		{
 			LockErrorCleanup();
 			ereport(ERROR,
@@ -5218,22 +5221,33 @@ enable_statement_timeout(void)
 	if (StatementTimeout > 0
 		&& (StatementTimeout < TransactionTimeout || TransactionTimeout == 0))
 	{
-		if (!get_timeout_active(STATEMENT_TIMEOUT))
+		/*
+		 * We check both if it's active or if it's already triggered. If it's
+		 * already triggered we don't want to restart it because that clears
+		 * the indicator flag, which in turn would cause the wrong error
+		 * message to be used by ProcessInterrupts() on the next
+		 * CHECK_FOR_INTERRUPTS() call. Restarting the timer in that case
+		 * would be pointless anyway, because the statement timeout error is
+		 * going to trigger on the next CHECK_FOR_INTERRUPTS() call.
+		 */
+		if (!get_timeout_active(STATEMENT_TIMEOUT)
+			&& !get_timeout_indicator(STATEMENT_TIMEOUT, false))
 			enable_timeout_after(STATEMENT_TIMEOUT, StatementTimeout);
 	}
 	else
 	{
-		if (get_timeout_active(STATEMENT_TIMEOUT))
-			disable_timeout(STATEMENT_TIMEOUT, false);
+		disable_statement_timeout();
 	}
 }
 
 /*
- * Disable statement timeout, if active.
+ * Disable statement timeout, if active. We preserve the indicator flag
+ * though, otherwise we'd lose the knowledge in ProcessInterupts that the
+ * SIGINT came from a statement timeout.
  */
 static void
 disable_statement_timeout(void)
 {
 	if (get_timeout_active(STATEMENT_TIMEOUT))
-		disable_timeout(STATEMENT_TIMEOUT, false);
+		disable_timeout(STATEMENT_TIMEOUT, true);
 }

base-commit: e5a5e0a90750d665cab417322b9f85c806430d85
-- 
2.52.0

Re: Always show correct error message for statement timeouts, fixes random buildfarm failures

Reply via email to