On Tue, 18 Feb 2025 at 10:29, Jelte Fennema-Nio <[email protected]> wrote: > > So I think the solution in 0002, while debatable if strictly correct, > > is the only fix that we can easily do. Also I personally believe the > > behaviour resulting from 0002 is totally correct: The new behaviour > > would be that if a timeout occurred, right before it was disabled or > > reset, but the interrupt was not processed yet, then we process that > > timeout as normal. That seems totally reasonable behaviour to me from > > the perspective of an end user: You get a timeout error when the > > timeout occurred before the timeout was disabled/reset.
I still think there's no reason for this not to be committed, as it fixes both a bug and a flaky test. On Tue, 18 Feb 2025 at 10:29, Jelte Fennema-Nio <[email protected]> wrote: > > (was part of "Add non-blocking version of PQcancel" thread[1]) > > The commitfest app notified me that this patch needed a rebase. While > doing that I'm also moving this patch out of the original thread it > was part of. It fixes a bug that causes this buildfarm failure[2]. In > rare cases the current code would report "canceling statement due to > user request", even though it was actually canceled due to a statement > timeout, so it should have this error message: "canceling statement > due to statement timeout". A way to reproduce this problem locally was > provided by Alexander[3]. > > I took it out of the original thread because that has become > impossible to follow (at least for me, so I expect also for others) > due to discussing 3 different random failures involving cancel > requests. And this specific one doesn't even have anything to do with > the original non-blocking PQCancel work. > > I have inlined the parts of emails that are relevant to this patch > below. That way others don't have to dissect that thread. (for context > this patch is often, but not always, called patch 0002 in the original > thread). > > My initial email containing the patch[4]: > > On Mon, 24 Jun 2024 at 00:59, Jelte Fennema-Nio <[email protected]> wrote: > > On Sat, 22 Jun 2024 at 17:00, Alexander Lakhin <[email protected]> wrote: > > > I also came across another failure of the test: > > > @@ -2774,7 +2774,7 @@ > > > BEGIN; > > > SET LOCAL statement_timeout = '10ms'; > > > select count(*) from ft1 CROSS JOIN ft2 CROSS JOIN ft4 CROSS JOIN ft5; > > > -- this takes very long > > > -ERROR: canceling statement due to statement timeout > > > +ERROR: canceling statement due to user request > > > COMMIT; > > > > > > which is reproduced with a sleep added here: > > > @@ -1065,6 +1065,7 @@ exec_simple_query(const char *query_string) > > > */ > > > parsetree_list = pg_parse_query(query_string); > > > +pg_usleep(11000); > > > > After investigating, I realized this actually exposes a bug in our > > statement timeout logic. It has nothing to do with posgres_fdw and > > reproduces with any regular postgres query too. Attached is a patch > > that fixes this issue. This one should probably be backported. > > > Tom's review[5]: > > On Fri, 30 Aug 2024 at 22:12, Tom Lane <[email protected]> wrote: > > I think 0002 is probably outright wrong, or at least the change to > > disable_statement_timeout is. Once we get to that, we don't want > > to throw a timeout error any more, even if an interrupt was received > > just before it. > > My initial response to that review[6]: > > On Fri, 30 Aug 2024 at 23:24, Jelte Fennema-Nio <[email protected]> wrote: > > The disable_statement_timeout change was not the part of that patch > > that was necessary for stable output, only the change in the first > > branch of enable_statement_timeout was necessary. The reason being > > that enable_statement_timeout is called multiple times for a query, > > because start_xact_command is called multiple times in > > exec_simple_query. The change to disable_statement_timeout just seemed > > like the logical extension of that change, especially since there was > > basically a verbatim copy of disable_statement_timeout in the second > > branch of enable_statement_timeout. > > > > To make sure I understand your suggestion correctly: Are you saying > > you would want to completely remove the outstanding interrupt if it > > was caused by de statement_timout when disable_statement_timeout is > > called? Because I agree that would probably make sense, but that > > sounds like a more impactful change. But the current behaviour seems > > strictly worse than the behaviour proposed in the patch to me, because > > currently the backend would still be interrupted, but the error would > > indicate a reason for the interrupt that is simply incorrect i.e. it > > will say it was cancelled due to a user request, which never happened. > > My second response to that review[7] > > On Sat, 31 Aug 2024 at 09:08, Jelte Fennema-Nio <[email protected]> wrote: > > I also took a quick look at the code again, and completely removing > > the outstanding interrupt seems hard to do. Because there's no way to > > know if there were multiple causes for the interupt, i.e. someone > > could have pressed ctrl+c as well and we wouldn't want to undo that. > > > > So I think the solution in 0002, while debatable if strictly correct, > > is the only fix that we can easily do. Also I personally believe the > > behaviour resulting from 0002 is totally correct: The new behaviour > > would be that if a timeout occurred, right before it was disabled or > > reset, but the interrupt was not processed yet, then we process that > > timeout as normal. That seems totally reasonable behaviour to me from > > the perspective of an end user: You get a timeout error when the > > timeout occurred before the timeout was disabled/reset. > > > [1]: > https://www.postgresql.org/message-id/flat/am5pr83mb0178d3b31ca1b6ec4a8ecc42f7...@am5pr83mb0178.eurprd83.prod.outlook.com > [2]: > https://wiki.postgresql.org/wiki/Known_Buildfarm_Test_Failures#timeouts.spec_failed_because_of_statement_cancelled_due_to_unexpected_reason > [3]: > https://www.postgresql.org/message-id/4db099c8-4a52-3cc4-e970-14539a319466%40gmail.com#a81203796f7257531d037e2ca6a0ac43 > [4]: > https://www.postgresql.org/message-id/CAGECzQSBp%3DsCcMCy1txig7cV0VX9hDDb1UdZt3zpDpZe2X2K-Q%40mail.gmail.com#a8d453bddc52f0db0b0fa80676d352c5 > [5]: > https://www.postgresql.org/message-id/584811.1725048717%40sss.pgh.pa.us#030de114b9a6a3eafd082a92fb95fa6e > [6]: > https://www.postgresql.org/message-id/CAGECzQR2HqQTFUddz-Q6y4sfMtTA%2BLRE7WqJtpPSrJKKUXtH7w%40mail.gmail.com#702f7a3fe3936b30bdaad0c5264f5a08 > [7]: > https://www.postgresql.org/message-id/CAGECzQQM1N0H3e9RbSqi4%3DmmFgfLaFNr6R7gNpFoePb9exRryg%40mail.gmail.com#f11b82da794d5996eca82339c2a17caa
