Re: [HACKERS] ERROR during end-of-xact/FATAL

2013-12-23 Thread Noah Misch
On Fri, Nov 15, 2013 at 09:51:32AM -0500, Robert Haas wrote:
 On Wed, Nov 13, 2013 at 11:04 AM, Noah Misch n...@leadboat.com wrote:
  So, in short, ERROR + ERROR*10 = PANIC, but FATAL + ERROR*10 = FATAL.
  That's bizarre.
 
  Quite so.
 
  Given that that's where we are, promoting an ERROR during FATAL
  processing to PANIC doesn't seem like it's losing much; we're
  essentially already doing that in the (probably more likely) case of a
  persistent ERROR during ERROR processing.  But since PANIC sucks, I'd
  rather go the other direction: let's make an ERROR during ERROR
  processing promote to FATAL.  And then let's do what you write above:
  make sure that there's a separate on-shmem-exit callback for each
  critical shared memory resource and that we call all of those during FATAL
  processing.
 
  Many of the factors that can cause AbortTransaction() to fail can also cause
  CommitTransaction() to fail, and those would still PANIC if the transaction
  had an xid.  How practical might it be to also escape from an error during
  CommitTransaction() with a FATAL instead of PANIC?  There's more to fix up in
  that case (sinval, NOTIFY), but it may be within reach.  If such a technique
  can only reasonably fix abort, though, I have doubts it buys us enough.
 
 The critical stuff that's got to happen after
 RecordTransactionCommit() appears to be ProcArrayEndTransaction() and
 AtEOXact_Inval(). Unfortunately, the latter is well after the point
 when we're supposed to only be doing non-critical resource cleanup,
 notwithstanding which it appears to be critical.
 
 So here's a sketch.  Hoist the preparatory logic in
 RecordTransactionCommit() - smgrGetPendingDeletes,
 xactGetCommittedChildren, and xactGetCommittedInvalidationMessages up
 into the caller and do it before setting TRANS_COMMIT.  If any of that
 stuff fails, we'll land in AbortTransaction() which must cope.  As
 soon as we exit the commit critical section, set a flag somewhere
 (where?) indicating that we have written our commit record; when that
 flag is set, (a) promote any ERROR after that point through the end of
 commit cleanup to FATAL and (b) if we enter AbortTransaction(), don't
 try to RecordTransactionAbort().
 
 I can't see that the notification stuff requires fixup in this case;
 AFAICS, it is just adjusting backend-local state, and it's OK to
 disregard any problems there during a FATAL exit.  Do you see
 something to the contrary?

The part not to miss is SignalBackends() in ProcessCompletedNotifies().  It
happens outside CommitTransaction(), just before the backend goes idle.
Without that, listeners could miss our notification until some other NOTIFY
transaction signals them.  That said, since ProcessCompletedNotifies() already
accepts the possibility of failure, it would not be crazy to overlook this.

 But invalidation messages are a problem:
 if we commit and exit without sending our queued-up invalidation
 messages, Bad Things Will Happen.  Perhaps we could arrange things so
 that in that case only, we just PANIC.   That would allow most write
 transactions to get by with FATAL, promoting to PANIC only in the case
 of transactions that have modified system catalogs and only until the
 invalidations have actually been sent.  Avoiding the PANIC in that
 case seems to require some additional wizardry which is not entirely
 clear to me at this time.

Agreed.

 I think we'll have to approach the various problems in this area
 stepwise, or we'll never make any progress.

That's one option.  The larger point is that allowing ERROR-late-in-COMMIT and
FATAL + ERROR scenarios to get away with FATAL requires this sort of analysis
of every exit callback and all the later stages of CommitTransaction().  The
current bug level shows such analysis hasn't been happening, and the dearth of
bug reports suggests the scenarios are rare.  I don't think the project stands
to gain enough from the changes contemplated here and, perhaps more notably,
from performing such analysis during review of future patches that touch
CommitTransaction() and the exit sequence.  It remains my recommendation to
run proc_exit_prepare() and the post-CLOG actions of CommitTransaction() and
AbortTransaction() in critical sections, giving the scenarios blunt but
reliable treatment.  If reports of excess PANIC arise, the thing to fix is the
code causing the late error.
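
To make that blunt treatment concrete, here is a minimal standalone C model of
the intended effect; this is not backend code, and every name in it
(crit_section_depth, model_elog, the MODEL_* levels) is invented for
illustration.  In the real server the same effect comes from the
CritSectionCount check in elog.c; the sketch only shows what holding a
critical section across the end-of-xact work buys us.

/* Standalone sketch: while a critical-section counter is held, a would-be
 * ERROR escalates to PANIC, which forces a restart and reinitialization of
 * shared state.  All names are invented for illustration. */
#include <stdio.h>
#include <stdlib.h>

enum level { MODEL_ERROR, MODEL_FATAL, MODEL_PANIC };

static int crit_section_depth = 0;	/* models CritSectionCount */

static void model_elog(enum level lvl, const char *msg)
{
	/* Inside a critical section an ERROR cannot be recovered from safely,
	 * because shared state may be half-updated; escalate to PANIC. */
	if (lvl == MODEL_ERROR && crit_section_depth > 0)
		lvl = MODEL_PANIC;

	switch (lvl)
	{
		case MODEL_ERROR:
			printf("ERROR: %s (longjmp back to the main loop)\n", msg);
			break;
		case MODEL_FATAL:
			printf("FATAL: %s (this backend exits; others keep running)\n", msg);
			exit(1);
			break;
		case MODEL_PANIC:
			printf("PANIC: %s (everything gets reinitialized)\n", msg);
			abort();
			break;
	}
}

int main(void)
{
	/* Outside a critical section a late ERROR is survivable. */
	model_elog(MODEL_ERROR, "late failure outside any critical section");

	/* The recommendation above: run proc_exit_prepare() and the post-CLOG
	 * tails of CommitTransaction()/AbortTransaction() with the counter held,
	 * so any late error gets the blunt-but-reliable treatment. */
	crit_section_depth++;
	model_elog(MODEL_ERROR, "late failure during end-of-xact cleanup");
	crit_section_depth--;		/* not reached; the PANIC above aborts */
	return 0;
}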

Thanks,
nm

-- 
Noah Misch
EnterpriseDB http://www.enterprisedb.com




Re: [HACKERS] ERROR during end-of-xact/FATAL

2013-12-20 Thread Alvaro Herrera
Robert Haas wrote:
 On Thu, Nov 28, 2013 at 10:10 AM, Alvaro Herrera
 alvhe...@2ndquadrant.com wrote:
  Robert Haas wrote:
  On Wed, Nov 6, 2013 at 9:40 AM, Alvaro Herrera alvhe...@2ndquadrant.com 
  wrote:
   Noah Misch wrote:
   Incomplete list:
  
   - If smgrDoPendingDeletes() finds files to delete, mdunlink() and its callee
     relpathbackend() call palloc(); this is true in all supported branches.  In
     9.3, due to commit 279628a0, smgrDoPendingDeletes() itself calls palloc().
     (In fact, it does so even when the pending list is empty -- this is the only
     palloc() during a trivial transaction commit.)  palloc() failure there
     yields a PANIC during commit.
  
   I think we should fix this routine to avoid the palloc when not necessary.
   That initial palloc is pointless.
 
  Here's a trivial patch we could apply to 9.3 immediately.  Anything else
  such as the ideas proposed below would require more effort than anyone
  can probably spend here soon.
 
 Yeah, this seems like a good thing to do for now.

Pushed, thanks.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] ERROR during end-of-xact/FATAL

2013-11-28 Thread Alvaro Herrera
Robert Haas wrote:
 On Wed, Nov 6, 2013 at 9:40 AM, Alvaro Herrera alvhe...@2ndquadrant.com 
 wrote:
  Noah Misch wrote:
  Incomplete list:
 
  - If smgrDoPendingDeletes() finds files to delete, mdunlink() and its callee
    relpathbackend() call palloc(); this is true in all supported branches.  In
    9.3, due to commit 279628a0, smgrDoPendingDeletes() itself calls palloc().
    (In fact, it does so even when the pending list is empty -- this is the only
    palloc() during a trivial transaction commit.)  palloc() failure there
    yields a PANIC during commit.
 
  I think we should fix this routine to avoid the palloc when not necessary.
  That initial palloc is pointless.

Here's a trivial patch we could apply to 9.3 immediately.  Anything else
such as the ideas proposed below would require more effort than anyone
can probably spend here soon.


  Also, there have been previous discussions about having relpathbackend
  not use palloc at all.  That was only because we wanted to use it in
  pg_xlogdump which didn't have palloc support at the time, so it's no
  longer as pressing; but perhaps it's still worthy of consideration.
 
 +1, but I'd like it better if we could find a way of avoiding the
 palloc in all cases.  Panicking because we run out of memory at the
 wrong time isn't really very nice.  Maybe the information needs to be
 maintained in the format in which it ultimately needs to be returned,
 so that we needn't rejigger it in the middle of a critical section.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
*** a/src/backend/catalog/storage.c
--- b/src/backend/catalog/storage.c
***
*** 314,321  smgrDoPendingDeletes(bool isCommit)
  	PendingRelDelete *next;
  	int			nrels = 0,
  i = 0,
! maxrels = 8;
! 	SMgrRelation *srels = palloc(maxrels * sizeof(SMgrRelation));
  
  	prev = NULL;
  	for (pending = pendingDeletes; pending != NULL; pending = next)
--- 314,321 
  	PendingRelDelete *next;
  	int			nrels = 0,
  i = 0,
! maxrels = 0;
! 	SMgrRelation *srels = NULL;
  
  	prev = NULL;
  	for (pending = pendingDeletes; pending != NULL; pending = next)
***
*** 340,347  smgrDoPendingDeletes(bool isCommit)
  
  srel = smgropen(pending->relnode, pending->backend);
  
! /* extend the array if needed (double the size) */
! if (maxrels <= nrels)
  {
  	maxrels *= 2;
  	srels = repalloc(srels, sizeof(SMgrRelation) * maxrels);
--- 340,352 
  
  srel = smgropen(pending->relnode, pending->backend);
  
! /* allocate the initial array, or extend it, if needed */
! if (maxrels == 0)
! {
! 	maxrels = 8;
! 	srels = palloc(sizeof(SMgrRelation) * maxrels );
! }
! else if (maxrels <= nrels)
  {
  	maxrels *= 2;
  	srels = repalloc(srels, sizeof(SMgrRelation) * maxrels);
***
*** 361,370  smgrDoPendingDeletes(bool isCommit)
  
  		for (i = 0; i < nrels; i++)
  			smgrclose(srels[i]);
- 	}
- 
- 	pfree(srels);
  
  }
  
  /*
--- 366,374 
  
  		for (i = 0; i < nrels; i++)
  			smgrclose(srels[i]);
  
+ 		pfree(srels);
+ 	}
  }
  
  /*



Re: [HACKERS] ERROR during end-of-xact/FATAL

2013-11-28 Thread Robert Haas
On Thu, Nov 28, 2013 at 10:10 AM, Alvaro Herrera
alvhe...@2ndquadrant.com wrote:
 Robert Haas wrote:
 On Wed, Nov 6, 2013 at 9:40 AM, Alvaro Herrera alvhe...@2ndquadrant.com 
 wrote:
  Noah Misch wrote:
  Incomplete list:
 
  - If smgrDoPendingDeletes() finds files to delete, mdunlink() and its callee
    relpathbackend() call palloc(); this is true in all supported branches.  In
    9.3, due to commit 279628a0, smgrDoPendingDeletes() itself calls palloc().
    (In fact, it does so even when the pending list is empty -- this is the only
    palloc() during a trivial transaction commit.)  palloc() failure there
    yields a PANIC during commit.
 
  I think we should fix this routine to avoid the palloc when not necessary.
  That initial palloc is pointless.

 Here's a trivial patch we could apply to 9.3 immediately.  Anything else
 such as the ideas proposed below would require more effort than anyone
 can probably spend here soon.

Yeah, this seems like a good thing to do for now.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] ERROR during end-of-xact/FATAL

2013-11-15 Thread Robert Haas
On Wed, Nov 13, 2013 at 11:04 AM, Noah Misch n...@leadboat.com wrote:
 So, in short, ERROR + ERROR*10 = PANIC, but FATAL + ERROR*10 = FATAL.
 That's bizarre.

 Quite so.

 Given that that's where we are, promoting an ERROR during FATAL
 processing to PANIC doesn't seem like it's losing much; we're
 essentially already doing that in the (probably more likely) case of a
 persistent ERROR during ERROR processing.  But since PANIC sucks, I'd
 rather go the other direction: let's make an ERROR during ERROR
 processing promote to FATAL.  And then let's do what you write above:
 make sure that there's a separate on-shmem-exit callback for each
 critical shared memory resource and that we call all of those during FATAL
 processing.

 Many of the factors that can cause AbortTransaction() to fail can also cause
 CommitTransaction() to fail, and those would still PANIC if the transaction
 had an xid.  How practical might it be to also escape from an error during
 CommitTransaction() with a FATAL instead of PANIC?  There's more to fix up in
 that case (sinval, NOTIFY), but it may be within reach.  If such a technique
 can only reasonably fix abort, though, I have doubts it buys us enough.

The critical stuff that's got to happen after
RecordTransactionCommit() appears to be ProcArrayEndTransaction() and
AtEOXact_Inval(). Unfortunately, the latter is well after the point
when we're supposed to only be doing non-critical resource cleanup,
notwithstanding which it appears to be critical.

So here's a sketch.  Hoist the preparatory logic in
RecordTransactionCommit() - smgrGetPendingDeletes,
xactGetCommittedChildren, and xactGetCommittedInvalidationMessages up
into the caller and do it before setting TRANS_COMMIT.  If any of that
stuff fails, we'll land in AbortTransaction() which must cope.  As
soon as we exit the commit critical section, set a flag somewhere
(where?) indicating that we have written our commit record; when that
flag is set, (a) promote any ERROR after that point through the end of
commit cleanup to FATAL and (b) if we enter AbortTransaction(), don't
try to RecordTransactionAbort().
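
As a rough standalone illustration of that sketch (this is not xact.c code; the
flag and helper names below are invented), the proposed control flow is: write
the commit record inside the critical section, set the flag, and from then on
let any error exit with FATAL while the abort path skips writing an abort
record.

/* Standalone sketch of the proposed flag.  All names are invented. */
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

static bool commit_record_written = false;	/* the proposed flag */

static void report_error(const char *msg)
{
	if (commit_record_written)
	{
		/* (a) promote ERROR to FATAL through the end of commit cleanup */
		printf("FATAL: %s\n", msg);
		exit(1);
	}
	printf("ERROR: %s (control would return to AbortTransaction())\n", msg);
}

static void abort_transaction(void)
{
	/* (b) once the commit record exists, do not also record an abort */
	if (!commit_record_written)
		printf("RecordTransactionAbort()\n");
	printf("release locks, pins, and other shared resources\n");
}

static void commit_transaction(bool fail_late)
{
	/* Failure-prone preparation hoisted ahead of TRANS_COMMIT; an error
	 * here still lands in abort_transaction(), which must cope. */
	printf("gather pending deletes / committed children / inval messages\n");

	printf("critical section: RecordTransactionCommit()\n");
	commit_record_written = true;

	if (fail_late)
		report_error("failure during post-commit cleanup");	/* exits FATAL */

	printf("ProcArrayEndTransaction(), AtEOXact_Inval(), etc.\n");
	commit_record_written = false;
}

int main(void)
{
	/* An error before the commit record is written aborts normally ... */
	report_error("failure before RecordTransactionCommit()");
	abort_transaction();

	commit_transaction(false);	/* ... a clean commit is unaffected ... */
	commit_transaction(true);	/* ... and a late failure exits with FATAL,
								 * writing no abort record. */
	return 0;
}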

I can't see that the notification stuff requires fixup in this case;
AFAICS, it is just adjusting backend-local state, and it's OK to
disregard any problems there during a FATAL exit.  Do you see
something to the contrary?  But invalidation messages are a problem:
if we commit and exit without sending our queued-up invalidation
messages, Bad Things Will Happen.  Perhaps we could arrange things so
that in that case only, we just PANIC.   That would allow most write
transactions to get by with FATAL, promoting to PANIC only in the case
of transactions that have modified system catalogs and only until the
invalidations have actually been sent.  Avoiding the PANIC in that
case seems to require some additional wizardry which is not entirely
clear to me at this time.

I think we'll have to approach the various problems in this area
stepwise, or we'll never make any progress.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] ERROR during end-of-xact/FATAL

2013-11-13 Thread Noah Misch
On Tue, Nov 12, 2013 at 09:55:34AM -0500, Robert Haas wrote:
 On Fri, Nov 8, 2013 at 4:13 PM, Noah Misch n...@leadboat.com wrote:
  A PANIC will reinitialize everything relevant, largely resolving the problems
  around ERROR during FATAL.  It's a heavy-handed solution, but it may well be
  the best solution.  Efforts to harden CommitTransaction() and
  AbortTransaction() seem well-spent, but the additional effort to make FATAL
  exit cope where AbortTransaction() or another exit action could not cope seems
  to be slicing ever-smaller portions of additional robustness.
 
  I pondered a variant of that conclusion that distinguished critical cleanup
  needs from the rest.  Each shared resource (heavyweight locks, buffer pins,
  LWLocks) would have an on_shmem_exit() callback that cleans up the resource
  under a critical section.  (AtProcExit_Buffers() used to fill such a role, but
  resowner.c's work during AbortTransaction() has mostly supplanted it.)  The
  ShutdownPostgres callback would not use a critical section, so lesser failures
  in AbortTransaction() would not upgrade to a PANIC.  But I'm leaning against
  such a complication on the grounds that it would add seldom-tested code paths
  posing as much a chance of eroding robustness as bolstering it.
 
 The current situation is just plain weird: in the ERROR-then-ERROR
 case, we emit a WARNING and bounce right back into AbortTransaction(),
 and if it doesn't work any better the second time than the first time,
 we recurse again, and eventually if it fails enough times in a row, we
 just give up and PANIC.  But in the ERROR-then-FATAL case, we *don't*
 retry AbortTransaction(); instead, we just continue running the rest
 of the on_shmem_exit callbacks and then exit.
 
 So, in short, ERROR + ERROR*10 = PANIC, but FATAL + ERROR*10 = FATAL.
 That's bizarre.

Quite so.

 Given that that's where we are, promoting an ERROR during FATAL
 processing to PANIC doesn't seem like it's losing much; we're
 essentially already doing that in the (probably more likely) case of a
 persistent ERROR during ERROR processing.  But since PANIC sucks, I'd
 rather go the other direction: let's make an ERROR during ERROR
 processing promote to FATAL.  And then let's do what you write above:
 make sure that there's a separate on-shmem-exit callback for each
 critical shared memory resource and that we call all of those during FATAL
 processing.

Many of the factors that can cause AbortTransaction() to fail can also cause
CommitTransaction() to fail, and those would still PANIC if the transaction
had an xid.  How practical might it be to also escape from an error during
CommitTransaction() with a FATAL instead of PANIC?  There's more to fix up in
that case (sinval, NOTIFY), but it may be within reach.  If such a technique
can only reasonably fix abort, though, I have doubts it buys us enough.

 It seems to me that that's how things were originally designed to
 work, but that we've drifted away from it basically because the
 low-level callbacks to release heavyweight locks and buffer pins
 turned out to be kinda, uh, slow, and we thought those code paths
 couldn't be taken anyway (turns out they can).  I think we could
 either make those routines very fast, or arrange only to run that code
 at all in the case where AbortTransaction() didn't complete
 successfully.

Agreed; the performance hazards look tractable.

 It's true that such code will be rarely run, but the
 logic is simple enough that I think we can verify it by hand, and it's
 sure nice to avoid PANICs.

True.

-- 
Noah Misch
EnterpriseDB http://www.enterprisedb.com




Re: [HACKERS] ERROR during end-of-xact/FATAL

2013-11-12 Thread Robert Haas
On Fri, Nov 8, 2013 at 4:13 PM, Noah Misch n...@leadboat.com wrote:
 A PANIC will reinitialize everything relevant, largely resolving the problems
 around ERROR during FATAL.  It's a heavy-handed solution, but it may well be
 the best solution.  Efforts to harden CommitTransaction() and
 AbortTransaction() seem well-spent, but the additional effort to make FATAL
 exit cope where AbortTransaction() or another exit action could not cope seems
 to be slicing ever-smaller portions of additional robustness.

 I pondered a variant of that conclusion that distinguished critical cleanup
 needs from the rest.  Each shared resource (heavyweight locks, buffer pins,
 LWLocks) would have an on_shmem_exit() callback that cleans up the resource
 under a critical section.  (AtProcExit_Buffers() used to fill such a role, but
 resowner.c's work during AbortTransaction() has mostly supplanted it.)  The
 ShutdownPostgres callback would not use a critical section, so lesser failures
 in AbortTransaction() would not upgrade to a PANIC.  But I'm leaning against
 such a complication on the grounds that it would add seldom-tested code paths
 posing as much a chance of eroding robustness as bolstering it.

The current situation is just plain weird: in the ERROR-then-ERROR
case, we emit a WARNING and bounce right back into AbortTransaction(),
and if it doesn't work any better the second time than the first time,
we recurse again, and eventually if it fails enough times in a row, we
just give up and PANIC.  But in the ERROR-then-FATAL case, we *don't*
retry AbortTransaction(); instead, we just continue running the rest
of the on_shmem_exit callbacks and then exit.

So, in short, ERROR + ERROR*10 = PANIC, but FATAL + ERROR*10 = FATAL.
That's bizarre.
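
For concreteness, a toy standalone C model of that asymmetry might look like
the following.  All names and the retry limit are invented (in the real server
the give-up point comes out of elog.c's error handling), and plain recursion
stands in for the longjmp back into AbortTransaction():

/* Toy model: ERROR-then-ERROR retries the abort until a limit, then PANICs;
 * FATAL-then-ERROR never retries and just finishes the exit callbacks. */
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

static int abort_attempts = 0;

static void abort_transaction(bool keeps_failing)
{
	abort_attempts++;
	if (abort_attempts > 10)
	{
		printf("PANIC: AbortTransaction keeps failing; giving up\n");
		abort();
	}
	if (keeps_failing)
	{
		printf("WARNING: AbortTransaction failed (attempt %d); retrying\n",
			   abort_attempts);
		abort_transaction(keeps_failing);	/* bounce right back in */
	}
}

static void fatal_exit(bool abort_fails)
{
	/* FATAL path: the failing AbortTransaction() is simply abandoned and the
	 * remaining on_shmem_exit callbacks run once; then the backend exits. */
	if (abort_fails)
		printf("WARNING: AbortTransaction failed; continuing to exit\n");
	printf("FATAL: running remaining exit callbacks, no retry\n");
	exit(1);
}

int main(int argc, char **argv)
{
	if (argc > 1)				/* any argument: take the FATAL path */
		fatal_exit(true);
	abort_transaction(true);	/* ERROR path: retry, then PANIC */
	return 0;
}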

Given that that's where we are, promoting an ERROR during FATAL
processing to PANIC doesn't seem like it's losing much; we're
essentially already doing that in the (probably more likely) case of a
persistent ERROR during ERROR processing.  But since PANIC sucks, I'd
rather go the other direction: let's make an ERROR during ERROR
processing promote to FATAL.  And then let's do what you write above:
make sure that there's a separate on-shmem-exit callback for each
critical shared memory resource and that we call all of those during FATAL
processing.

It seems to me that that's how things were originally designed to
work, but that we've drifted away from it basically because the
low-level callbacks to release heavyweight locks and buffer pins
turned out to be kinda, uh, slow, and we thought those code paths
couldn't be taken anyway (turns out they can).  I think we could
either make those routines very fast, or arrange only to run that code
at all in the case where AbortTransaction() didn't complete
successfully.  It's true that such code will be rarely run, but the
logic is simple enough that I think we can verify it by hand, and it's
sure nice to avoid PANICs.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] ERROR during end-of-xact/FATAL

2013-11-09 Thread Amit Kapila
On Sat, Nov 9, 2013 at 2:43 AM, Noah Misch n...@leadboat.com wrote:
 On Wed, Nov 06, 2013 at 10:14:53AM +0530, Amit Kapila wrote:
 On Thu, Oct 31, 2013 at 8:22 PM, Noah Misch n...@leadboat.com wrote:
 About the unclean FATAL-then-ERROR scenario, one way to deal with it at a high
 level could be to treat such a case as a backend crash, in which case the
 postmaster reinitialises shared memory and other stuff.

  If we can't manage to
  free a shared memory resource like a lock or buffer pin, we really must PANIC.

 Can't we try to initialise the shared memory and other resources?  Wouldn't
 that resolve the problems that can occur due to the scenario you explained?

 A PANIC will reinitialize everything relevant, largely resolving the problems
 around ERROR during FATAL.  It's a heavy-handed solution, but it may well be
 the best solution.  Efforts to harden CommitTransaction() and
 AbortTransaction() seem well-spent, but the additional effort to make FATAL
 exit cope where AbortTransaction() or another exit action could not cope seems
 to be slicing ever-smaller portions of additional robustness.

 I pondered a variant of that conclusion that distinguished critical cleanup
 needs from the rest.  Each shared resource (heavyweight locks, buffer pins,
 LWLocks) would have an on_shmem_exit() callback that cleans up the resource
 under a critical section.  (AtProcExit_Buffers() used to fill such a role, but
 resowner.c's work during AbortTransaction() has mostly supplanted it.)  The
 ShutdownPostgres callback would not use a critical section, so lesser failures
 in AbortTransaction() would not upgrade to a PANIC.  But I'm leaning against
 such a complication on the grounds that it would add seldom-tested code paths
 posing as much a chance of eroding robustness as bolstering it.

I think here PANIC is a safe and less complicated solution for this
situation.  Apart from this, we can try to avoid palloc or other such
errors in the AbortTransaction/CommitTransaction path.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] ERROR during end-of-xact/FATAL

2013-11-08 Thread Robert Haas
On Wed, Nov 6, 2013 at 9:40 AM, Alvaro Herrera alvhe...@2ndquadrant.com wrote:
 Noah Misch wrote:
 Incomplete list:

 - If smgrDoPendingDeletes() finds files to delete, mdunlink() and its callee
   relpathbackend() call palloc(); this is true in all supported branches.  In
   9.3, due to commit 279628a0, smgrDoPendingDeletes() itself calls palloc().
   (In fact, it does so even when the pending list is empty -- this is the only
   palloc() during a trivial transaction commit.)  palloc() failure there
   yields a PANIC during commit.

 I think we should fix this routine to avoid the palloc when not necessary.
 That initial palloc is pointless.

 Also, there have been previous discussions about having relpathbackend
 not use palloc at all.  That was only because we wanted to use it in
 pg_xlogdump which didn't have palloc support at the time, so it's no
 longer as pressing; but perhaps it's still worthy of consideration.

+1, but I'd like it better if we could find a way of avoiding the
palloc in all cases.  Panicking because we run out of memory at the
wrong time isn't really very nice.  Maybe the information needs to be
maintained in the format in which it ultimately needs to be returned,
so that we needn't rejigger it in the middle of a critical section.
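
To illustrate the shape of that idea with a standalone sketch (invented names,
with ordinary malloc/realloc standing in for palloc; this is not the smgr
code): grow the array at the moment a deletion is scheduled, where running out
of memory is still an ordinary, recoverable error, so the commit path itself
performs no allocation at all.

#include <stdio.h>
#include <stdlib.h>

typedef struct PendingDeletes
{
	int			nrels;
	int			maxrels;
	int		   *rels;			/* stand-in for SMgrRelation entries */
} PendingDeletes;

/* Called when a relation is scheduled for deletion, i.e. at a point where
 * an out-of-memory error is still perfectly recoverable. */
static void
schedule_unlink(PendingDeletes *pd, int rel)
{
	if (pd->nrels >= pd->maxrels)
	{
		pd->maxrels = (pd->maxrels == 0) ? 8 : pd->maxrels * 2;
		pd->rels = realloc(pd->rels, pd->maxrels * sizeof(int));
		if (pd->rels == NULL)
		{
			fprintf(stderr, "ERROR: out of memory (harmless here)\n");
			exit(1);
		}
	}
	pd->rels[pd->nrels++] = rel;
}

/* Called from the commit path: it only reads the array, so it cannot fail
 * for lack of memory. */
static void
do_pending_deletes(const PendingDeletes *pd)
{
	for (int i = 0; i < pd->nrels; i++)
		printf("unlink relation %d\n", pd->rels[i]);
}

int
main(void)
{
	PendingDeletes pd = {0, 0, NULL};

	schedule_unlink(&pd, 16384);
	schedule_unlink(&pd, 16385);
	do_pending_deletes(&pd);
	free(pd.rels);
	return 0;
}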

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] ERROR during end-of-xact/FATAL

2013-11-08 Thread Noah Misch
On Wed, Nov 06, 2013 at 10:14:53AM +0530, Amit Kapila wrote:
 On Thu, Oct 31, 2013 at 8:22 PM, Noah Misch n...@leadboat.com wrote:
  If the original AbortTransaction() pertained to a FATAL, the situation is
  worse.  errfinish() promotes the ERROR thrown from AbortTransaction() to
  another FATAL,
 
 Isn't it errstart that promotes ERROR to FATAL?

Right.

 When I tried the above scenario, I hit an Assert at a different place
...
 This means that when an ERROR occurs in AbortTransaction, called as a
 result of a FATAL error, there are many more possibilities of hitting an
 Assert.

Agreed.

 About the unclean FATAL-then-ERROR scenario, one way to deal with it at a high
 level could be to treat such a case as a backend crash, in which case the
 postmaster reinitialises shared memory and other stuff.
 
  If we can't manage to
  free a shared memory resource like a lock or buffer pin, we really must PANIC.
 
 Can't we try to initialise the shared memory and other resources?  Wouldn't
 that resolve the problems that can occur due to the scenario you explained?

A PANIC will reinitialize everything relevant, largely resolving the problems
around ERROR during FATAL.  It's a heavy-handed solution, but it may well be
the best solution.  Efforts to harden CommitTransaction() and
AbortTransaction() seem well-spent, but the additional effort to make FATAL
exit cope where AbortTransaction() or another exit action could not cope seems
to be slicing ever-smaller portions of additional robustness.

I pondered a variant of that conclusion that distinguished critical cleanup
needs from the rest.  Each shared resource (heavyweight locks, buffer pins,
LWLocks) would have an on_shmem_exit() callback that cleans up the resource
under a critical section.  (AtProcExit_Buffers() used to fill such a role, but
resowner.c's work during AbortTransaction() has mostly supplanted it.)  The
ShutdownPostgres callback would not use a critical section, so lesser failures
in AbortTransaction() would not upgrade to a PANIC.  But I'm leaning against
such a complication on the grounds that it would add seldom-tested code paths
posing as much a chance of eroding robustness as bolstering it.
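
A standalone sketch of that variant (all names invented; the real registration
API is on_shmem_exit(), but the toy registry below only models the ordering and
the critical/non-critical split) might look like this:

#include <stdbool.h>
#include <stdio.h>

typedef void (*exit_callback) (void);

static exit_callback callbacks[8];
static bool	critical[8];
static int	ncallbacks = 0;

static void
register_exit_callback(exit_callback fn, bool is_critical)
{
	callbacks[ncallbacks] = fn;
	critical[ncallbacks] = is_critical;
	ncallbacks++;
}

static void release_locks(void)		{ printf("release heavyweight locks\n"); }
static void release_pins(void)		{ printf("release buffer pins\n"); }
static void shutdown_postgres(void)	{ printf("AbortTransaction() and friends\n"); }

static void
run_exit_callbacks(void)
{
	/* Run in reverse registration order, as shmem_exit() does. */
	for (int i = ncallbacks - 1; i >= 0; i--)
	{
		if (critical[i])
			printf("[critical section] ");	/* an ERROR here would be PANIC */
		callbacks[i]();
	}
}

int
main(void)
{
	/* Per-resource cleanups registered early, each run under a critical
	 * section; the ShutdownPostgres-style callback is registered last (so it
	 * runs first) and is deliberately not critical, so lesser failures in
	 * AbortTransaction() stay below PANIC. */
	register_exit_callback(release_locks, true);
	register_exit_callback(release_pins, true);
	register_exit_callback(shutdown_postgres, false);
	run_exit_callbacks();
	return 0;
}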

Thanks,
nm

-- 
Noah Misch
EnterpriseDB http://www.enterprisedb.com




Re: [HACKERS] ERROR during end-of-xact/FATAL

2013-11-06 Thread Alvaro Herrera
Noah Misch wrote:

 Incomplete list:
 
 - If smgrDoPendingDeletes() finds files to delete, mdunlink() and its callee
   relpathbackend() call palloc(); this is true in all supported branches.  In
   9.3, due to commit 279628a0, smgrDoPendingDeletes() itself calls palloc().
   (In fact, it does so even when the pending list is empty -- this is the only
   palloc() during a trivial transaction commit.)  palloc() failure there
   yields a PANIC during commit.

I think we should fix this routine to avoid the palloc when not necessary.
That initial palloc is pointless.

Also, there have been previous discussions about having relpathbackend
not use palloc at all.  That was only because we wanted to use it in
pg_xlogdump which didn't have palloc support at the time, so it's no
longer as pressing; but perhaps it's still worthy of consideration.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] ERROR during end-of-xact/FATAL

2013-11-05 Thread Amit Kapila
On Thu, Oct 31, 2013 at 8:22 PM, Noah Misch n...@leadboat.com wrote:
 CommitTransaction() and AbortTransaction() both do much work, and large
 portions of that work either should not or must not throw errors.  An error
 during either function will, as usual, siglongjmp() out.  Ordinarily,
 PostgresMain() will regain control and fire off a fresh AbortTransaction().
 The consequences thereof depend on the original function's progress:

 - Before the function updates CurrentTransactionState->state, an ERROR is
   fully acceptable.  CommitTransaction() specifically places failure-prone
   tasks accordingly; AbortTransaction() has no analogous tasks.

 - After the function updates CurrentTransactionState->state, an ERROR yields a
   user-unfriendly e.g. WARNING:  AbortTransaction while in COMMIT state.
   This is not itself harmful, but we've largely kept the things that can fail
   for pedestrian reasons ahead of that point.

 - After CommitTransaction() calls RecordTransactionCommit() for an xid-bearing
   transaction, an ERROR upgrades to e.g. PANIC:  cannot abort transaction
   805, it was already committed.

 - After AbortTransaction() calls ProcArrayEndTransaction() for an xid-bearing
   transaction, an ERROR will lead to this assertion failure:

   TRAP: FailedAssertion(!(((allPgXact[proc->pgprocno].xid) != ((TransactionId) 0))), File: procarray.c, Line: 396)

 If the original AbortTransaction() pertained to a FATAL, the situation is
 worse.  errfinish() promotes the ERROR thrown from AbortTransaction() to
 another FATAL,

Isn't it errstart that promotes ERROR to FATAL?

 so we reenter proc_exit().  Thanks to the following logic in
 shmem_exit(),

The following comment is in proc_exit_prepare(), but the effect will be the
same as you described due to the logic in shmem_exit().

 we will never return to AbortTransaction():

 /*
  * call all the registered callbacks.
  *
  * Note that since we decrement on_proc_exit_index each time, if a
  * callback calls ereport(ERROR) or ereport(FATAL) then it won't be
  * invoked again when control comes back here (nor will the
  * previously-completed callbacks).  So, an infinite loop should not be
  * possible.
  */

 As a result, we miss any cleanups that had not yet happened in the original
 AbortTransaction().  In particular, this can leak heavyweight locks.  An
 asserts build subsequently fails this way:

 TRAP: FailedAssertion(!(SHMQueueEmpty(&(MyProc->myProcLocks[i]))), File: proc.c, Line: 788)

 In a production build, the affected PGPROC slot just continues to hold the
 lock until the next backend claiming that slot calls LockReleaseAll().  Oops.
 Bottom line: most bets are off given an ERROR after RecordTransactionCommit()
 in CommitTransaction() or anywhere in AbortTransaction().

When I tried the above scenario, I hit an Assert at a different place

shmem_exit()->pgstat_beshutdown_hook()->pgstat_report_stat()
pgstat_report_stat(bool force)
{
..
Assert(entry->trans == NULL);
}

The reason is that in AbortTransaction, an ERROR is thrown before the call
to AtEOXact_PgStat(false).

This means that when an ERROR occurs in AbortTransaction, called as a
result of a FATAL error, there are many more possibilities of hitting an
Assert.


 Now, while those assertion failures are worth preventing on general principle,
 the actual field importance depends on whether things actually do fail in the
 vulnerable end-of-xact work.  We've prevented the errors that would otherwise
 be relatively common, but there are several rare ways to get a late ERROR.
 Incomplete list:

 - If smgrDoPendingDeletes() finds files to delete, mdunlink() and its callee
   relpathbackend() call palloc(); this is true in all supported branches.  In
   9.3, due to commit 279628a0, smgrDoPendingDeletes() itself calls palloc().
   (In fact, it does so even when the pending list is empty -- this is the only
   palloc() during a trivial transaction commit.)  palloc() failure there
   yields a PANIC during commit.

 - ResourceOwnerRelease() calls FileClose() during abort, and FileClose()
   raises an ERROR when close() returns EIO.

 - AtEOXact_Inval() can lead to calls like RelationReloadIndexInfo(), which has
   many ways to throw errors.  This precedes releasing heavyweight locks, so an
   error here during an abort pertaining to a FATAL exit orphans locks as
   described above.  This relates to another recent thread:
 http://www.postgresql.org/message-id/20130805170931.ga369...@tornado.leadboat.com


 What should we do to mitigate these problems?  Certainly we can harden
 individual end-of-xact tasks to not throw errors, as we have in the past.
 What higher-level strategies should we consider?  What about for the unclean
 result of the FATAL-then-ERROR scenario in particular?

About the unclean FATAL-then-ERROR scenario, one way to deal with it at a high
level could be to treat such a case as a backend crash, in which case the
postmaster reinitialises shared memory and 

[HACKERS] ERROR during end-of-xact/FATAL

2013-10-31 Thread Noah Misch
CommitTransaction() and AbortTransaction() both do much work, and large
portions of that work either should not or must not throw errors.  An error
during either function will, as usual, siglongjmp() out.  Ordinarily,
PostgresMain() will regain control and fire off a fresh AbortTransaction().
The consequences thereof depend on the original function's progress:

- Before the function updates CurrentTransactionState->state, an ERROR is
  fully acceptable.  CommitTransaction() specifically places failure-prone
  tasks accordingly; AbortTransaction() has no analogous tasks.

- After the function updates CurrentTransactionState->state, an ERROR yields a
  user-unfriendly e.g. WARNING:  AbortTransaction while in COMMIT state.
  This is not itself harmful, but we've largely kept the things that can fail
  for pedestrian reasons ahead of that point.

- After CommitTransaction() calls RecordTransactionCommit() for an xid-bearing
  transaction, an ERROR upgrades to e.g. PANIC:  cannot abort transaction
  805, it was already committed.

- After AbortTransaction() calls ProcArrayEndTransaction() for an xid-bearing
  transaction, an ERROR will lead to this assertion failure:

  TRAP: FailedAssertion(!(((allPgXact[proc->pgprocno].xid) != ((TransactionId) 0))), File: procarray.c, Line: 396)

If the original AbortTransaction() pertained to a FATAL, the situation is
worse.  errfinish() promotes the ERROR thrown from AbortTransaction() to
another FATAL, so we reenter proc_exit().  Thanks to the following logic in
shmem_exit(), we will never return to AbortTransaction():

/*
 * call all the registered callbacks.
 *
 * Note that since we decrement on_proc_exit_index each time, if a
 * callback calls ereport(ERROR) or ereport(FATAL) then it won't be
 * invoked again when control comes back here (nor will the
 * previously-completed callbacks).  So, an infinite loop should not be
 * possible.
 */

As a result, we miss any cleanups that had not yet happened in the original
AbortTransaction().  In particular, this can leak heavyweight locks.  An
asserts build subsequently fails this way:

TRAP: FailedAssertion(!(SHMQueueEmpty(&(MyProc->myProcLocks[i]))), File: proc.c, Line: 788)

In a production build, the affected PGPROC slot just continues to hold the
lock until the next backend claiming that slot calls LockReleaseAll().  Oops.
Bottom line: most bets are off given an ERROR after RecordTransactionCommit()
in CommitTransaction() or anywhere in AbortTransaction().
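
To make that failure mode concrete, here is a standalone C model (all names
invented; setjmp/longjmp stands in for the FATAL path re-entering proc_exit())
of how the decrement-before-call loop abandons whatever cleanup the failing
callback had not yet reached:

#include <setjmp.h>
#include <stdio.h>

static jmp_buf fatal_handler;

static void
buffer_pin_cleanup(void)
{
	printf("release buffer pins\n");
}

static void
shutdown_postgres(void)
{
	printf("AbortTransaction(): pgstat, sinval, ...\n");
	printf("ERROR during abort, promoted to FATAL\n");
	longjmp(fatal_handler, 1);	/* re-enters the exit-callback loop */
	printf("AbortTransaction(): release heavyweight locks\n");	/* never reached */
}

/* Registered callbacks; shmem_exit() runs them last-registered first. */
static void (*callbacks[]) (void) = {buffer_pin_cleanup, shutdown_postgres};
static int	callback_index = 2;	/* number of registered callbacks */

static void
run_callbacks(void)
{
	while (--callback_index >= 0)	/* decrement BEFORE each call */
		callbacks[callback_index] ();
}

int
main(void)
{
	if (setjmp(fatal_handler) == 0)
		run_callbacks();		/* first exit attempt: abort fails mid-way */
	else
		run_callbacks();		/* re-entry: the failed callback is skipped */
	printf("process exits; the heavyweight locks above were never released\n");
	return 0;
}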


Now, while those assertion failures are worth preventing on general principle,
the actual field importance depends on whether things actually do fail in the
vulnerable end-of-xact work.  We've prevented the errors that would otherwise
be relatively common, but there are several rare ways to get a late ERROR.
Incomplete list:

- If smgrDoPendingDeletes() finds files to delete, mdunlink() and its callee
  relpathbackend() call palloc(); this is true in all supported branches.  In
  9.3, due to commit 279628a0, smgrDoPendingDeletes() itself calls palloc().
  (In fact, it does so even when the pending list is empty -- this is the only
  palloc() during a trivial transaction commit.)  palloc() failure there
  yields a PANIC during commit.

- ResourceOwnerRelease() calls FileClose() during abort, and FileClose()
  raises an ERROR when close() returns EIO.

- AtEOXact_Inval() can lead to calls like RelationReloadIndexInfo(), which has
  many ways to throw errors.  This precedes releasing heavyweight locks, so an
  error here during an abort pertaining to a FATAL exit orphans locks as
  described above.  This relates to another recent thread:
http://www.postgresql.org/message-id/20130805170931.ga369...@tornado.leadboat.com


What should we do to mitigate these problems?  Certainly we can harden
individual end-of-xact tasks to not throw errors, as we have in the past.
What higher-level strategies should we consider?  What about for the unclean
result of the FATAL-then-ERROR scenario in particular?  If we can't manage to
free a shared memory resource like a lock or buffer pin, we really must PANIC.
Releasing those things is quite reliable, though.  The tasks that have the
highest chance of capsizing the AbortTransaction() are of backend-local
interest, or they're tasks for which we tolerate failure as a rule
(e.g. unlinking files).

Robert Haas provided a large slice of the research for this report.

Thanks,
nm

-- 
Noah Misch
EnterpriseDB http://www.enterprisedb.com

