Re: abort-time portal cleanup

Amit Kapila Tue, 24 Sep 2019 02:31:39 -0700

On Fri, Sep 13, 2019 at 2:13 AM Robert Haas <[email protected]> wrote:
>
> I've been studying At{Sub,}{Abort,Cleanup}_Portals() for the last few
> days and have come to the conclusion that the code is not entirely up
> to our usual standards. I believe that a good deal of the reason for
> this is attributable to the poor quality of the code comments in this
> area, although there are perhaps some other contributing factors as
> well, such as bullheadedness on my part and perhaps others.
>
> The trouble starts with the header comment for AtAbort_Portals, which
> states that "At this point we run the cleanup hook if present, but we
> can't release the portal's memory until the cleanup call." At the time
> this logic was introduced in commit
> de28dc9a04c4df5d711815b7a518501b43535a26 (2003-05-02),
> AtAbort_Portals() affected all non-held portals without caring whether
> they were active or not, and, UserAbortTransactionBlock() called
> AbortTransaction() directly, so typing "ROLLBACK;" would cause
> AtAbort_Portals() to be reached from within PortalRun().  Even if
> PortalRun() managed to return without crashing, the caller would next
> try to call PortalDrop() on what was now an invalid pointer. However,
> commit 8f9f1986034a2273e09ad10671e10d1adda21d1f (2004-09-16) changed
> things so that UserAbortEndTransaction() just sets things up so that
> the subsequent call to CommitTransactionCommand() would call
> AbortTransaction() instead of trying to do it right away, and that
> doesn't happen until after we're done with the portal.  As far as I
> can see, that change made this comment mostly false, but the comment
> has nevertheless managed to survive for another ~15 years. I think we
> can, and in fact should, just drop the portal right here.
>
> As far as actually making that work, there are a few wrinkles. The
> first is that we might be in the middle of FATAL. In that case, unlike
> the ROLLBACK case, a call to PortalRun() is still on the stack, but
> we'll exit the process rather than returning, so the fact that we've
> created a dangling pointer for the caller won't matter. However, as
> shown by commit ad9a274778d2d88c46b90309212b92ee7fdf9afe (2018-02-01)
> and the report that led up to it at
> https://www.postgresql.org/message-id/20180128034531.h6o4w3727ifof3jy%40alap3.anarazel.de
> it's not a good idea to try to clean up the portal in that case,
> because we might've already started shutting down critical systems.
> It seems not only risky but also unnecessary: our process-local state
> is about to go away, and the executor shouldn't need to clean up any
> shared memory state that won't also get cleaned up by some other
> mechanism. So, it seems to me that if we reach AtAbort_Portals()
> during FATAL processing, we should either (1) do nothing at all and
> just return or (2) forget about all the existing portals without
> cleaning them up and then return.  The second option seems a little
> safer to me, because it guarantees that if we somehow reach code that
> might try to look up a portal later, it won't find anything. But I
> think it's arguable.
>


I agree with your position on this.

>
> Attached is a patch that (1) removes At(Sub)Cleanup_Portals() entirely
> in favor of dropping portals on the spot in At(Sub)Abort_Portals(),
> (2) overhauls the comments in this area, and (3) makes
> AtSubAbort_Portals() consistent with AtAbort_Portals().

The overall idea seems good to me, but I have a few comments on the changes.

1.
@@ -2756,7 +2756,6 @@ CleanupTransaction(void)
  /*
  * do abort cleanup processing
  */
- AtCleanup_Portals(); /* now safe to release portal
memory */
  AtEOXact_Snapshot(false, true); /* and release the transaction's
snapshots */

  CurrentResourceOwner = NULL; /* and resource owner */
@@ -5032,8 +5031,6 @@ CleanupSubTransaction(void)
  elog(WARNING, "CleanupSubTransaction while in %s state",
  TransStateAsString(s->state));

- AtSubCleanup_Portals(s->subTransactionId);
-

After this cleanup, I think we don't need At(Sub)Abort_Portals in
AbortOutOfAnyTransaction() for the states TBLOCK_(SUB)ABORT and
friends. This is because AbortTransaction itself would have zapped the
portal.

2. You seem to forgot removing AtCleanup_Portals() from portal.h

3.
  /*
- * If it was created in the current transaction, we
can't do normal
- * shutdown on a READY portal either; it might refer to
objects
- * created in the failed transaction.  See comments in
- * AtSubAbort_Portals.
- */
- if (portal->status == PORTAL_READY)
- MarkPortalFailed(portal);
-

Why it is safe to remove this check?  It has been explained in commit
7981c342 why we need that check.  I don't see any explanation in email
or patch which justifies this code removal.  Is it because you removed
PortalCleanup?  If so, that is still called from PortalDrop?

4.
-AtCleanup_Portals(void)
-{
- HASH_SEQ_STATUS status;
- PortalHashEnt *hentry;
-
- hash_seq_init(&status, PortalHashTable);
-
- while ((hentry = (PortalHashEnt *) hash_seq_search(&status)) !=
NULL)
- {
- Portal portal = hentry->portal;
-
- /*
- * Do not touch active portals --- this can only happen
in the case of
- * a multi-transaction command.
+ * If the status is PORTAL_ACTIVE, then we must be
executing a command
+ * that uses multiple transactions internally. In that
case, the
+ * command in question must be one that does not
internally rely on
+ * any transaction-lifetime resources, because they
would disappear
+ * in the upcoming transaction-wide cleanup.
  */
  if (portal->status == PORTAL_ACTIVE)

I am not able to understand how we can reach with the portal state as
'active' for a multi-transaction command.  It seems wherever we mark
portal as active, we don't relinquish the control unless its state is
changed.  Can you share some example where this can happen?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: abort-time portal cleanup

Reply via email to