On Fri, Sep 13, 2019 at 2:13 AM Robert Haas <robertmh...@gmail.com> wrote: > > I've been studying At{Sub,}{Abort,Cleanup}_Portals() for the last few > days and have come to the conclusion that the code is not entirely up > to our usual standards. I believe that a good deal of the reason for > this is attributable to the poor quality of the code comments in this > area, although there are perhaps some other contributing factors as > well, such as bullheadedness on my part and perhaps others. > > The trouble starts with the header comment for AtAbort_Portals, which > states that "At this point we run the cleanup hook if present, but we > can't release the portal's memory until the cleanup call." At the time > this logic was introduced in commit > de28dc9a04c4df5d711815b7a518501b43535a26 (2003-05-02), > AtAbort_Portals() affected all non-held portals without caring whether > they were active or not, and, UserAbortTransactionBlock() called > AbortTransaction() directly, so typing "ROLLBACK;" would cause > AtAbort_Portals() to be reached from within PortalRun(). Even if > PortalRun() managed to return without crashing, the caller would next > try to call PortalDrop() on what was now an invalid pointer. However, > commit 8f9f1986034a2273e09ad10671e10d1adda21d1f (2004-09-16) changed > things so that UserAbortEndTransaction() just sets things up so that > the subsequent call to CommitTransactionCommand() would call > AbortTransaction() instead of trying to do it right away, and that > doesn't happen until after we're done with the portal. As far as I > can see, that change made this comment mostly false, but the comment > has nevertheless managed to survive for another ~15 years. I think we > can, and in fact should, just drop the portal right here. > > As far as actually making that work, there are a few wrinkles. The > first is that we might be in the middle of FATAL. In that case, unlike > the ROLLBACK case, a call to PortalRun() is still on the stack, but > we'll exit the process rather than returning, so the fact that we've > created a dangling pointer for the caller won't matter. However, as > shown by commit ad9a274778d2d88c46b90309212b92ee7fdf9afe (2018-02-01) > and the report that led up to it at > https://www.postgresql.org/message-id/20180128034531.h6o4w3727ifof3jy%40alap3.anarazel.de > it's not a good idea to try to clean up the portal in that case, > because we might've already started shutting down critical systems. > It seems not only risky but also unnecessary: our process-local state > is about to go away, and the executor shouldn't need to clean up any > shared memory state that won't also get cleaned up by some other > mechanism. So, it seems to me that if we reach AtAbort_Portals() > during FATAL processing, we should either (1) do nothing at all and > just return or (2) forget about all the existing portals without > cleaning them up and then return. The second option seems a little > safer to me, because it guarantees that if we somehow reach code that > might try to look up a portal later, it won't find anything. But I > think it's arguable. >
I agree with your position on this. > > Attached is a patch that (1) removes At(Sub)Cleanup_Portals() entirely > in favor of dropping portals on the spot in At(Sub)Abort_Portals(), > (2) overhauls the comments in this area, and (3) makes > AtSubAbort_Portals() consistent with AtAbort_Portals(). The overall idea seems good to me, but I have a few comments on the changes. 1. @@ -2756,7 +2756,6 @@ CleanupTransaction(void) /* * do abort cleanup processing */ - AtCleanup_Portals(); /* now safe to release portal memory */ AtEOXact_Snapshot(false, true); /* and release the transaction's snapshots */ CurrentResourceOwner = NULL; /* and resource owner */ @@ -5032,8 +5031,6 @@ CleanupSubTransaction(void) elog(WARNING, "CleanupSubTransaction while in %s state", TransStateAsString(s->state)); - AtSubCleanup_Portals(s->subTransactionId); - After this cleanup, I think we don't need At(Sub)Abort_Portals in AbortOutOfAnyTransaction() for the states TBLOCK_(SUB)ABORT and friends. This is because AbortTransaction itself would have zapped the portal. 2. You seem to forgot removing AtCleanup_Portals() from portal.h 3. /* - * If it was created in the current transaction, we can't do normal - * shutdown on a READY portal either; it might refer to objects - * created in the failed transaction. See comments in - * AtSubAbort_Portals. - */ - if (portal->status == PORTAL_READY) - MarkPortalFailed(portal); - Why it is safe to remove this check? It has been explained in commit 7981c342 why we need that check. I don't see any explanation in email or patch which justifies this code removal. Is it because you removed PortalCleanup? If so, that is still called from PortalDrop? 4. -AtCleanup_Portals(void) -{ - HASH_SEQ_STATUS status; - PortalHashEnt *hentry; - - hash_seq_init(&status, PortalHashTable); - - while ((hentry = (PortalHashEnt *) hash_seq_search(&status)) != NULL) - { - Portal portal = hentry->portal; - - /* - * Do not touch active portals --- this can only happen in the case of - * a multi-transaction command. + * If the status is PORTAL_ACTIVE, then we must be executing a command + * that uses multiple transactions internally. In that case, the + * command in question must be one that does not internally rely on + * any transaction-lifetime resources, because they would disappear + * in the upcoming transaction-wide cleanup. */ if (portal->status == PORTAL_ACTIVE) I am not able to understand how we can reach with the portal state as 'active' for a multi-transaction command. It seems wherever we mark portal as active, we don't relinquish the control unless its state is changed. Can you share some example where this can happen? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com