Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE
On Sun, Jan 31, 2016 at 11:50 AM, Tom Lane wrote:
> Noah Misch writes:
>> On Sat, Jan 30, 2016 at 04:28:47PM -0500, Robert Haas wrote:
>>> I think I've pretty much said what I have to say about this; if nothing
>>> I wrote up until now swayed you, it's unlikely that anything else I say
>>> after this point will either.
>
>> Say I drop the parts that change the binary.  Does the attached v2 manage
>> to improve PostgreSQL, or is it neutral-or-harmful like v1?
>
> There can surely be no objection to improving these comments.  However,
> I'm not convinced that we should word the comments to insist that the
> hypothetical cases are bugs.  As I said before, I do not think there is an
> API contract that would promise that portals don't reach here in ACTIVE
> state.  So IMO it's fair to note that no such case can arise currently,
> but not to state that it's a bug if it does.  So for example I'd reword
> your last comment addition along the lines of "Currently, every
> MarkPortalActive() caller ensures it updates the portal status again
> before relinquishing control, so that ACTIVE can't happen here.  If it
> does happen, dispose of the portal like existing MarkPortalActive()
> callers would."

+1.  I think Noah's comment additions constitute useful and helpful
information, but I too am doubtful about the characterization of the
situations in question as bugs.  I don't see what the evidence for that
is, especially given that it's quite hard to predict how the code might
change in the future.  However, to reiterate, a more neutral presentation
of the same facts would get my vote.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE
Noah Misch writes:
> On Sat, Jan 30, 2016 at 04:28:47PM -0500, Robert Haas wrote:
>> I think I've pretty much said what I have to say about this; if nothing
>> I wrote up until now swayed you, it's unlikely that anything else I say
>> after this point will either.

> Say I drop the parts that change the binary.  Does the attached v2 manage to
> improve PostgreSQL, or is it neutral-or-harmful like v1?

There can surely be no objection to improving these comments.  However,
I'm not convinced that we should word the comments to insist that the
hypothetical cases are bugs.  As I said before, I do not think there is an
API contract that would promise that portals don't reach here in ACTIVE
state.  So IMO it's fair to note that no such case can arise currently,
but not to state that it's a bug if it does.  So for example I'd reword
your last comment addition along the lines of "Currently, every
MarkPortalActive() caller ensures it updates the portal status again
before relinquishing control, so that ACTIVE can't happen here.  If it
does happen, dispose of the portal like existing MarkPortalActive()
callers would."

			regards, tom lane
Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE
On Sat, Jan 30, 2016 at 04:28:47PM -0500, Robert Haas wrote:
> I think I've pretty much said what I have to say about this; if nothing
> I wrote up until now swayed you, it's unlikely that anything else I say
> after this point will either.

Say I drop the parts that change the binary.  Does the attached v2 manage to
improve PostgreSQL, or is it neutral-or-harmful like v1?

On Sat, Jan 30, 2016 at 04:28:47PM -0500, Robert Haas wrote:
> On Sat, Jan 30, 2016 at 2:13 PM, Noah Misch wrote:
> > On Sat, Jan 30, 2016 at 07:37:45AM -0500, Robert Haas wrote:
> > > On Fri, Jan 29, 2016 at 11:19 PM, Noah Misch wrote:
> > > > On Thu, Jan 28, 2016 at 11:12:15AM -0500, Robert Haas wrote:
> > > > > The code you quote emits a warning about a reasonably foreseeable
> > > > > situation that can never be right, but there's no particular
> > > > > reason to think that MarkPortalFailed is the wrong thing to do
> > > > > here if that situation came up.
> > > >
> > > > I have two reasons to expect these MarkPortalFailed() calls will be
> > > > the wrong thing for hypothetical code reaching them.  First,
> > > > PortalRun() and peers reset ActivePortal and PortalContext on error
> > > > in addition to fixing portal status with MarkPortalFailed().  If
> > > > code forgets to update the status, there's a substantial chance it
> > > > forgot the other two.  My patch added a comment explaining the
> > > > second reason:
> > > >
> > > > +	/*
> > > > +	 * See similar code in AtSubAbort_Portals().  This would fire if code
> > > > +	 * orchestrating multiple top-level transactions within a portal, such
> > > > +	 * as VACUUM, caught errors and continued under the same portal with a
> > > > +	 * fresh transaction.  No part of core PostgreSQL functions that way,
> > > > +	 * though it's a fair thing to want.  Such code would wish the portal
> > > > +	 * to remain ACTIVE, as in PreCommit_Portals(); we don't cater to
> > > > +	 * that.  Instead, like AtSubAbort_Portals(), interpret this as a bug.
> > > > +	 */
> > >
> > > You may be right, but then again Tom had a different opinion, even
> > > after seeing your patch, and he's no dummy.
> >
> > Eh?  Tom last posted to this thread before I first posted a patch.
>
> http://www.postgresql.org/message-id/29758.1451780...@sss.pgh.pa.us
> seems to me to be a vote against the concept embodied by the patch.

That much is true.  The order of postings is the opposite of what you
stated, and there's no mailing list evidence that anyone other than you
has read my explanation of why MarkPortalFailed() is wrong here.
Alternately, I demand the schematics for Tom's time machine.

diff --git a/src/backend/utils/mmgr/portalmem.c b/src/backend/utils/mmgr/portalmem.c
index a53673c..c840aa6 100644
--- a/src/backend/utils/mmgr/portalmem.c
+++ b/src/backend/utils/mmgr/portalmem.c
@@ -765,7 +765,14 @@ AtAbort_Portals(void)
 	{
 		Portal		portal = hentry->portal;
 
-		/* Any portal that was actually running has to be considered broken */
+		/*
+		 * See similar code in AtSubAbort_Portals().  This would fire if code
+		 * orchestrating multiple top-level transactions within a portal, such
+		 * as VACUUM, caught errors and continued under the same portal with a
+		 * fresh transaction.  No part of core PostgreSQL functions that way.
+		 * XXX Such code would wish the portal to remain ACTIVE, as in
+		 * PreCommit_Portals().
+		 */
 		if (portal->status == PORTAL_ACTIVE)
 			MarkPortalFailed(portal);
 
@@ -919,9 +926,11 @@ AtSubAbort_Portals(SubTransactionId mySubid,
 				portal->activeSubid = parentSubid;
 
 				/*
-				 * Upper-level portals that failed while running in this
-				 * subtransaction must be forced into FAILED state, for the
-				 * same reasons discussed below.
+				 * If a bug in a MarkPortalActive() caller has it miss cleanup
+				 * after having failed while running an upper-level portal in
+				 * this subtransaction, we don't know what else in the portal
+				 * is wrong.  Force it into FAILED state, for the same reasons
+				 * discussed below.
 				 *
 				 * We assume we can get away without forcing upper-level READY
 				 * portals to fail, even if they were run and then suspended.
@@ -960,7 +969,11 @@ AtSubAbort_Portals(SubTransactionId mySubid,
Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE
On Sat, Jan 30, 2016 at 2:13 PM, Noah Misch wrote:
> You could offer that paragraph as an objection to almost all Assert(),
> elog(), and automated tests.  Why levy it against this patch?  The
> valuable ways assertions and tests supplement review are well-established.

Sure, that's true, but I don't view all situations in the same way, so I
don't write the same thing in answer to each one.  I think I've pretty
much said what I have to say about this; if nothing I wrote up until now
swayed you, it's unlikely that anything else I say after this point will
either.

>> You may be right, but then again Tom had a different opinion, even
>> after seeing your patch, and he's no dummy.
>
> Eh?  Tom last posted to this thread before I first posted a patch.

http://www.postgresql.org/message-id/29758.1451780...@sss.pgh.pa.us
seems to me to be a vote against the concept embodied by the patch.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE
On Sat, Jan 30, 2016 at 07:37:45AM -0500, Robert Haas wrote:
> On Fri, Jan 29, 2016 at 11:19 PM, Noah Misch wrote:
>> As you say, forbidding things makes friction in the event that someone
>> comes along wanting to do the forbidden thing.  Forbidding things also
>> simplifies the system, making it easier to verify.  This decision should
>> depend on the API's audience.  We have prominently-public APIs like
>> lsyscache.h, stringinfo.h and funcapi.h.  They should cater to
>> reasonably-foreseeable use cases, not just what yesterday's users have
>> actually used.  We then have narrow-interest APIs like subselect.h and
>> view.h.  For those, the value of a simpler system exceeds the
>> risk-adjusted cost of friction.  They should impose strict constraints
>> on their callers.
>>
>> Portal status belongs to the second category.  PortalRun(),
>> PortalRunFetch() and PersistHoldablePortal() are the three core
>> functions that place portals into PORTAL_ACTIVE status.  No PGXN module
>> does so; none so much as references a PortalStatus constant or
>> MarkPortal{Active,Done,Failed}() function.  If someone adds a fourth
>> core portal runner, the system will be simpler and better if that
>> function replicates the structure of the existing three.
>
> I won't argue with that, but it strikes me that if I were reviewing a
> patch to do such a thing, I'd surely compare the new caller against
> the existing ones, so any failure to adhere to the same design pattern
> would be caught that way.  I expect other reviewers and/or committers
> would almost certainly do something similar.  If we presuppose
> incompetence on the part of future reviewers and committers, no action
> taken now can secure us.

You could offer that paragraph as an objection to almost all Assert(),
elog(), and automated tests.  Why levy it against this patch?  The
valuable ways assertions and tests supplement review are well-established.

> You may be right, but then again Tom had a different opinion, even
> after seeing your patch, and he's no dummy.

Eh?  Tom last posted to this thread before I first posted a patch.
Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE
On Fri, Jan 29, 2016 at 11:19 PM, Noah Misch wrote:
> As you say, forbidding things makes friction in the event that someone
> comes along wanting to do the forbidden thing.  Forbidding things also
> simplifies the system, making it easier to verify.  This decision should
> depend on the API's audience.  We have prominently-public APIs like
> lsyscache.h, stringinfo.h and funcapi.h.  They should cater to
> reasonably-foreseeable use cases, not just what yesterday's users have
> actually used.  We then have narrow-interest APIs like subselect.h and
> view.h.  For those, the value of a simpler system exceeds the
> risk-adjusted cost of friction.  They should impose strict constraints
> on their callers.
>
> Portal status belongs to the second category.  PortalRun(),
> PortalRunFetch() and PersistHoldablePortal() are the three core functions
> that place portals into PORTAL_ACTIVE status.  No PGXN module does so;
> none so much as references a PortalStatus constant or
> MarkPortal{Active,Done,Failed}() function.  If someone adds a fourth core
> portal runner, the system will be simpler and better if that function
> replicates the structure of the existing three.

I won't argue with that, but it strikes me that if I were reviewing a
patch to do such a thing, I'd surely compare the new caller against
the existing ones, so any failure to adhere to the same design pattern
would be caught that way.  I expect other reviewers and/or committers
would almost certainly do something similar.  If we presuppose
incompetence on the part of future reviewers and committers, no action
taken now can secure us.

>> The code you quote emits a warning
>> about a reasonably foreseeable situation that can never be right, but
>> there's no particular reason to think that MarkPortalFailed is the
>> wrong thing to do here if that situation came up.
>
> I have two reasons to expect these MarkPortalFailed() calls will be the
> wrong thing for hypothetical code reaching them.  First, PortalRun() and
> peers reset ActivePortal and PortalContext on error in addition to fixing
> portal status with MarkPortalFailed().  If code forgets to update the
> status, there's a substantial chance it forgot the other two.  My patch
> added a comment explaining the second reason:
>
> +	/*
> +	 * See similar code in AtSubAbort_Portals().  This would fire if code
> +	 * orchestrating multiple top-level transactions within a portal, such
> +	 * as VACUUM, caught errors and continued under the same portal with a
> +	 * fresh transaction.  No part of core PostgreSQL functions that way,
> +	 * though it's a fair thing to want.  Such code would wish the portal
> +	 * to remain ACTIVE, as in PreCommit_Portals(); we don't cater to
> +	 * that.  Instead, like AtSubAbort_Portals(), interpret this as a bug.
> +	 */

You may be right, but then again Tom had a different opinion, even
after seeing your patch, and he's no dummy.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE
On Thu, Jan 28, 2016 at 11:12:15AM -0500, Robert Haas wrote:
> I'm honestly failing to understand why we should change anything at
> all.  I don't believe that doing something more severe than marking
> the portal failed actually improves anything.  I suppose if I had to
> pick between what you proposed before and elog(FATAL) I'd pick the
> latter, but I see no real reason to cut off future code (or
> third-party code) at the knees like that.  I don't see it as either
> necessary or desirable to forbid something just because there's no
> clear present use case for it.

As you say, forbidding things makes friction in the event that someone
comes along wanting to do the forbidden thing.  Forbidding things also
simplifies the system, making it easier to verify.  This decision should
depend on the API's audience.  We have prominently-public APIs like
lsyscache.h, stringinfo.h and funcapi.h.  They should cater to
reasonably-foreseeable use cases, not just what yesterday's users have
actually used.  We then have narrow-interest APIs like subselect.h and
view.h.  For those, the value of a simpler system exceeds the
risk-adjusted cost of friction.  They should impose strict constraints on
their callers.

Portal status belongs to the second category.  PortalRun(),
PortalRunFetch() and PersistHoldablePortal() are the three core functions
that place portals into PORTAL_ACTIVE status.  No PGXN module does so;
none so much as references a PortalStatus constant or
MarkPortal{Active,Done,Failed}() function.  If someone adds a fourth core
portal runner, the system will be simpler and better if that function
replicates the structure of the existing three.

> The code you quote emits a warning
> about a reasonably foreseeable situation that can never be right, but
> there's no particular reason to think that MarkPortalFailed is the
> wrong thing to do here if that situation came up.

I have two reasons to expect these MarkPortalFailed() calls will be the
wrong thing for hypothetical code reaching them.  First, PortalRun() and
peers reset ActivePortal and PortalContext on error in addition to fixing
portal status with MarkPortalFailed().  If code forgets to update the
status, there's a substantial chance it forgot the other two.  My patch
added a comment explaining the second reason:

+	/*
+	 * See similar code in AtSubAbort_Portals().  This would fire if code
+	 * orchestrating multiple top-level transactions within a portal, such
+	 * as VACUUM, caught errors and continued under the same portal with a
+	 * fresh transaction.  No part of core PostgreSQL functions that way,
+	 * though it's a fair thing to want.  Such code would wish the portal
+	 * to remain ACTIVE, as in PreCommit_Portals(); we don't cater to
+	 * that.  Instead, like AtSubAbort_Portals(), interpret this as a bug.
+	 */
Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE
On Wed, Jan 27, 2016 at 11:47 PM, Noah Misch wrote:
> On Wed, Jan 27, 2016 at 11:04:33PM -0500, Robert Haas wrote:
>> +		Assert(portal->status != PORTAL_ACTIVE);
>> 		if (portal->status == PORTAL_ACTIVE)
>> 			MarkPortalFailed(portal);
>>
>> Now that just looks kooky to me.  We assert that the portal isn't
>> active, but then cater to the possibility that it might be anyway?
>
> Right.
>
>> That seems totally contrary to our usual programming practices, and a
>> bad idea for that reason.
>
> It is contrary to our usual programming practices, I agree.  I borrowed
> the idea from untenured code (da3751c8, 2015-11-11) in
> load_relcache_init_file():
>
> 	if (nailed_rels != NUM_CRITICAL_SHARED_RELS ||
> 		nailed_indexes != NUM_CRITICAL_SHARED_INDEXES)
> 	{
> 		elog(WARNING, "found %d nailed shared rels and %d nailed shared indexes in init file, but expected %d and %d respectively",
> 			 nailed_rels, nailed_indexes,
> 			 NUM_CRITICAL_SHARED_RELS, NUM_CRITICAL_SHARED_INDEXES);
> 		/* Make sure we get developers' attention about this */
> 		Assert(false);
>
> I liked this pattern.  It's a good fit for cases that we design to be
> impossible yet for which we have a workaround if they do happen.  That
> being said, if you feel it's bad, I would be fine using elog(FATAL).  I
> envision this as a master-only change in any case.  No PGXN module
> references PORTAL_ACTIVE or MarkPortalActive(), so it's unlikely that
> extension code will notice the change whether in Assert() form or in
> elog() form.  What is best?

I'm honestly failing to understand why we should change anything at
all.  I don't believe that doing something more severe than marking
the portal failed actually improves anything.  I suppose if I had to
pick between what you proposed before and elog(FATAL) I'd pick the
latter, but I see no real reason to cut off future code (or
third-party code) at the knees like that.  I don't see it as either
necessary or desirable to forbid something just because there's no
clear present use case for it.  The code you quote emits a warning
about a reasonably foreseeable situation that can never be right, but
there's no particular reason to think that MarkPortalFailed is the
wrong thing to do here if that situation came up.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE
On Wed, Jan 27, 2016 at 11:04:33PM -0500, Robert Haas wrote:
> +		Assert(portal->status != PORTAL_ACTIVE);
> 		if (portal->status == PORTAL_ACTIVE)
> 			MarkPortalFailed(portal);
>
> Now that just looks kooky to me.  We assert that the portal isn't
> active, but then cater to the possibility that it might be anyway?

Right.

> That seems totally contrary to our usual programming practices, and a
> bad idea for that reason.

It is contrary to our usual programming practices, I agree.  I borrowed
the idea from untenured code (da3751c8, 2015-11-11) in
load_relcache_init_file():

	if (nailed_rels != NUM_CRITICAL_SHARED_RELS ||
		nailed_indexes != NUM_CRITICAL_SHARED_INDEXES)
	{
		elog(WARNING, "found %d nailed shared rels and %d nailed shared indexes in init file, but expected %d and %d respectively",
			 nailed_rels, nailed_indexes,
			 NUM_CRITICAL_SHARED_RELS, NUM_CRITICAL_SHARED_INDEXES);
		/* Make sure we get developers' attention about this */
		Assert(false);

I liked this pattern.  It's a good fit for cases that we design to be
impossible yet for which we have a workaround if they do happen.  That
being said, if you feel it's bad, I would be fine using elog(FATAL).  I
envision this as a master-only change in any case.  No PGXN module
references PORTAL_ACTIVE or MarkPortalActive(), so it's unlikely that
extension code will notice the change whether in Assert() form or in
elog() form.  What is best?
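[Editor's sketch] The warn-then-assert pattern borrowed above can be
illustrated outside PostgreSQL in a few lines.  The names below are
invented for illustration (this is not the real elog()/Assert()
machinery): warn at runtime so production builds can fall back, and trip
an assertion so assert-enabled development builds stop on the spot.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-in for a cached-count sanity check; the constant and
 * function names are invented, not taken from the PostgreSQL source. */
#define EXPECTED_NAILED_RELS 3

/* Returns 1 when the cached count matches expectations.  On mismatch it
 * warns (so production builds, compiled with NDEBUG, can recover by
 * discarding the cache) and trips an assertion so development builds
 * abort and get a developer's attention. */
static int
cache_count_ok(int nailed_rels)
{
	if (nailed_rels != EXPECTED_NAILED_RELS)
	{
		fprintf(stderr, "WARNING: found %d nailed rels, expected %d\n",
				nailed_rels, EXPECTED_NAILED_RELS);
		/* Make sure we get developers' attention about this */
		assert(!"unexpected nailed-rel count");
		return 0;			/* production fallback: treat cache as unusable */
	}
	return 1;
}
```

With NDEBUG defined, the assert() compiles away and only the warning plus
fallback remain, which is exactly the "backup only" behavior being
debated in this thread.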
Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE
On Thu, Jan 28, 2016 at 12:57:36PM +0900, Michael Paquier wrote:
> On Thu, Jan 28, 2016 at 12:40 PM, Noah Misch wrote:
>> + * fresh transaction.  No part of core PostgreSQL functions that way,
>> + * though it's a fair thing to want.  Such code would wish the portal
>
> From the point of view of core code, this stands true, but, for my 2c,
> honestly, I think that is just going to annoy more people working on
> plugins and forks of Postgres.  When working on Postgres-XC and
> developing stuff for the core code, I recall having been annoyed a
> couple of times by similar assert limitations

At first, I left out that assertion in case some extension code did the
thing I described, perhaps in a background worker.  I then realized that
MarkPortalFailed() is the wrong thing for such code, which would want
treatment similar to this bit of PreCommit_Portals():

		/*
		 * Do not touch active portals --- this can only happen in the case of
		 * a multi-transaction utility command, such as VACUUM.
		 *
		 * Note however that any resource owner attached to such a portal is
		 * still going to go away, so don't leave a dangling pointer.
		 */
		if (portal->status == PORTAL_ACTIVE)
		{
			portal->resowner = NULL;
			continue;
		}

If you can think of a case where the code would work okay despite its
active portal being marked as failed, that would be a good reason to omit
the one assertion.  Otherwise, an assertion seems better than silently
doing the known-wrong thing.
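[Editor's sketch] As a toy model of the treatment quoted above — the
types and function name below are invented stand-ins; the real Portal in
portalmem.c is far richer — the commit-time rule is: leave an ACTIVE
portal's status alone, but null out its about-to-be-destroyed resource
owner so no dangling pointer survives.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for PostgreSQL's portal bookkeeping, invented for
 * illustration only. */
typedef enum { PORTAL_READY, PORTAL_ACTIVE, PORTAL_DONE, PORTAL_FAILED } PortalStatus;

typedef struct
{
	PortalStatus status;
	void	   *resowner;		/* resource owner, destroyed at commit */
} Portal;

/* Commit-time handling for one portal: an ACTIVE portal (e.g. a
 * multi-transaction utility command such as VACUUM) keeps its status, but
 * its resource owner is about to go away, so clear the pointer instead of
 * leaving it dangling.  Non-active portals would be cleaned up fully. */
static void
precommit_handle_portal(Portal *portal)
{
	if (portal->status == PORTAL_ACTIVE)
	{
		portal->resowner = NULL;
		return;					/* do not otherwise touch active portals */
	}
	/* ... drop or persist non-active portals here ... */
}
```

This is the treatment Noah argues hypothetical multi-transaction portal
code would want, as opposed to having the portal forced into FAILED.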
Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE
On Wed, Jan 27, 2016 at 10:40 PM, Noah Misch wrote:
> On Sun, Jan 03, 2016 at 12:35:56AM -0500, Noah Misch wrote:
>> On Sat, Jan 02, 2016 at 07:22:13PM -0500, Tom Lane wrote:
>>> Noah Misch writes:
>>>> I am inclined to add an Assert(portal->status != PORTAL_ACTIVE) to
>>>> emphasize that this is backup only.  MarkPortalActive() callers remain
>>>> responsible for updating the status to something else before
>>>> relinquishing control.
>>>
>>> No, I do not think that would be an improvement.  There is no contract
>>> saying that this must be done earlier, IMO.
>>
>> Indeed, nobody wrote a contract.  The assertion would record what has
>> been the sole standing practice for eleven years (since commit
>> a393fbf9).  It would prompt discussion if a proposed patch would depart
>> from that practice, and that is a good thing.  Also, every addition of
>> dead code should label that code to aid future readers.
>
> Here's the patch I have in mind.  AtAbort_Portals() has an older
> MarkPortalFailed() call, and the story is somewhat different there per
> its new comment.  That call is plausible to reach with no bug involved,
> but MarkPortalFailed() would then be the wrong thing.

+		Assert(portal->status != PORTAL_ACTIVE);
 		if (portal->status == PORTAL_ACTIVE)
 			MarkPortalFailed(portal);

Now that just looks kooky to me.  We assert that the portal isn't
active, but then cater to the possibility that it might be anyway?
That seems totally contrary to our usual programming practices, and a
bad idea for that reason.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE
On Thu, Jan 28, 2016 at 12:40 PM, Noah Misch wrote:
> On Sun, Jan 03, 2016 at 12:35:56AM -0500, Noah Misch wrote:
>> On Sat, Jan 02, 2016 at 07:22:13PM -0500, Tom Lane wrote:
>>> Noah Misch writes:
>>>> I am inclined to add an Assert(portal->status != PORTAL_ACTIVE) to
>>>> emphasize that this is backup only.  MarkPortalActive() callers remain
>>>> responsible for updating the status to something else before
>>>> relinquishing control.
>>>
>>> No, I do not think that would be an improvement.  There is no contract
>>> saying that this must be done earlier, IMO.
>>
>> Indeed, nobody wrote a contract.  The assertion would record what has
>> been the sole standing practice for eleven years (since commit
>> a393fbf9).  It would prompt discussion if a proposed patch would depart
>> from that practice, and that is a good thing.  Also, every addition of
>> dead code should label that code to aid future readers.
>
> Here's the patch I have in mind.  AtAbort_Portals() has an older
> MarkPortalFailed() call, and the story is somewhat different there per
> its new comment.  That call is plausible to reach with no bug involved,
> but MarkPortalFailed() would then be the wrong thing.

+ * fresh transaction.  No part of core PostgreSQL functions that way,
+ * though it's a fair thing to want.  Such code would wish the portal

From the point of view of core code, this stands true, but, for my 2c,
honestly, I think that is just going to annoy more people working on
plugins and forks of Postgres.  When working on Postgres-XC and
developing stuff for the core code, I recall having been annoyed a
couple of times by similar assert limitations, because sometimes they did
not actually make sense in the context of what I was doing, and other
times things suddenly broke after bumping the fork's code base to a major
upgrade because a routine used up to now broke its contract.  Perhaps I
was doing it wrong at this point though, and should have used something
else.
-- 
Michael
Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE
On Sun, Jan 03, 2016 at 12:35:56AM -0500, Noah Misch wrote:
> On Sat, Jan 02, 2016 at 07:22:13PM -0500, Tom Lane wrote:
>> Noah Misch writes:
>>> I am inclined to add an Assert(portal->status != PORTAL_ACTIVE) to
>>> emphasize that this is backup only.  MarkPortalActive() callers remain
>>> responsible for updating the status to something else before
>>> relinquishing control.
>>
>> No, I do not think that would be an improvement.  There is no contract
>> saying that this must be done earlier, IMO.
>
> Indeed, nobody wrote a contract.  The assertion would record what has
> been the sole standing practice for eleven years (since commit a393fbf9).
> It would prompt discussion if a proposed patch would depart from that
> practice, and that is a good thing.  Also, every addition of dead code
> should label that code to aid future readers.

Here's the patch I have in mind.  AtAbort_Portals() has an older
MarkPortalFailed() call, and the story is somewhat different there per its
new comment.  That call is plausible to reach with no bug involved, but
MarkPortalFailed() would then be the wrong thing.

nm

diff --git a/src/backend/utils/mmgr/portalmem.c b/src/backend/utils/mmgr/portalmem.c
index a53673c..632b202 100644
--- a/src/backend/utils/mmgr/portalmem.c
+++ b/src/backend/utils/mmgr/portalmem.c
@@ -762,13 +762,22 @@ AtAbort_Portals(void)
 	hash_seq_init(&status, PortalHashTable);
 
 	while ((hentry = (PortalHashEnt *) hash_seq_search(&status)) != NULL)
 	{
 		Portal		portal = hentry->portal;
 
-		/* Any portal that was actually running has to be considered broken */
+		/*
+		 * See similar code in AtSubAbort_Portals().  This would fire if code
+		 * orchestrating multiple top-level transactions within a portal, such
+		 * as VACUUM, caught errors and continued under the same portal with a
+		 * fresh transaction.  No part of core PostgreSQL functions that way,
+		 * though it's a fair thing to want.  Such code would wish the portal
+		 * to remain ACTIVE, as in PreCommit_Portals(); we don't cater to
+		 * that.  Instead, like AtSubAbort_Portals(), interpret this as a bug.
+		 */
+		Assert(portal->status != PORTAL_ACTIVE);
 		if (portal->status == PORTAL_ACTIVE)
 			MarkPortalFailed(portal);
 
 		/*
 		 * Do nothing else to cursors held over from a previous transaction.
 		 */
@@ -916,26 +925,29 @@ AtSubAbort_Portals(SubTransactionId mySubid,
 			if (portal->activeSubid == mySubid)
 			{
 				/* Maintain activeSubid until the portal is removed */
 				portal->activeSubid = parentSubid;
 
 				/*
-				 * Upper-level portals that failed while running in this
-				 * subtransaction must be forced into FAILED state, for the
-				 * same reasons discussed below.
+				 * If a bug in a MarkPortalActive() caller has it miss cleanup
+				 * after having failed while running an upper-level portal in
+				 * this subtransaction, we don't know what else in the portal
+				 * is wrong.  Force it into FAILED state, for the same reasons
+				 * discussed below.
 				 *
 				 * We assume we can get away without forcing upper-level READY
 				 * portals to fail, even if they were run and then suspended.
 				 * In theory a suspended upper-level portal could have
 				 * acquired some references to objects that are about to be
 				 * destroyed, but there should be sufficient defenses against
 				 * such cases: the portal's original query cannot contain such
 				 * references, and any references within, say, cached plans of
 				 * PL/pgSQL functions are not from active queries and should
 				 * be protected by revalidation logic.
 				 */
+				Assert(portal->status != PORTAL_ACTIVE);
 				if (portal->status == PORTAL_ACTIVE)
 					MarkPortalFailed(portal);
 
 				/*
 				 * Also, if we failed it during the current subtransaction
 				 * (either just above, or earlier), reattach its resource
@@ -957,14 +969,19 @@ AtSubAbort_Portals(SubTrans
Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE
On Sat, Jan 02, 2016 at 07:22:13PM -0500, Tom Lane wrote:
> Noah Misch writes:
>> I am inclined to add an Assert(portal->status != PORTAL_ACTIVE) to
>> emphasize that this is backup only.  MarkPortalActive() callers remain
>> responsible for updating the status to something else before
>> relinquishing control.
>
> No, I do not think that would be an improvement.  There is no contract
> saying that this must be done earlier, IMO.

Indeed, nobody wrote a contract.  The assertion would record what has been
the sole standing practice for eleven years (since commit a393fbf9).  It
would prompt discussion if a proposed patch would depart from that
practice, and that is a good thing.  Also, every addition of dead code
should label that code to aid future readers.
Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE
Noah Misch writes:
> I am inclined to add an Assert(portal->status != PORTAL_ACTIVE) to
> emphasize that this is backup only.  MarkPortalActive() callers remain
> responsible for updating the status to something else before
> relinquishing control.

No, I do not think that would be an improvement.  There is no contract
saying that this must be done earlier, IMO.

			regards, tom lane
Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE
On Sat, Jan 02, 2016 at 01:46:07PM -0500, Tom Lane wrote: > Noah Misch writes: > > On Thu, Sep 03, 2015 at 11:04:11PM -0400, Tom Lane wrote: > >> *** AtSubAbort_Portals(SubTransactionId mySu > > >> --- 909,966 > >> { > >> Portal portal = hentry->portal; > >> > >> + /* Was it created in this subtransaction? */ > >> if (portal->createSubid != mySubid) > >> + { > >> + /* No, but maybe it was used in this subtransaction? */ > >> + if (portal->activeSubid == mySubid) > >> + { > > ... > >> + if (portal->status == PORTAL_ACTIVE) > >> + MarkPortalFailed(portal); > > > Do you have a test case that reaches this particular MarkPortalFailed() > > call? > > My attempts stumbled over the fact that, before we reach here, each of the > > three MarkPortalActive() callers will have already called MarkPortalFailed() > > in its PG_CATCH block. ("make check" does not reach this call.) > > Offhand I think that's just belt-and-suspenders-too coding. As you say, > we'd typically have failed active portals already before getting here. > But the responsibility of this routine is to *guarantee* that no broken > portals remain active, so I'd not want to remove this check. > > Do you have a particular reason for asking? After reading the log message for and comments added in commit c5454f9, I understood that the commit changed PostgreSQL to fail portals that did not fail before. That is to say, queries that formerly completed without error would now elicit errors. I was looking into what sorts of queries it affected in this way. If the new MarkPortalFailed() is indeed dead code, then the commit affects no query that way. The meat of the commit is then the ResourceOwnerNewParent() call, which lets queries ERROR cleanly instead of seeing assertion failures or undefined behavior. I am inclined to add an Assert(portal->status != PORTAL_ACTIVE) to emphasize that this is backup only. MarkPortalActive() callers remain responsible for updating the status to something else before relinquishing control. 
-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE
Noah Misch writes: > On Thu, Sep 03, 2015 at 11:04:11PM -0400, Tom Lane wrote: >> *** AtSubAbort_Portals(SubTransactionId mySu >> --- 909,966 >> { >> Portal portal = hentry->portal; >> >> +/* Was it created in this subtransaction? */ >> if (portal->createSubid != mySubid) >> +{ >> +/* No, but maybe it was used in this subtransaction? */ >> +if (portal->activeSubid == mySubid) >> +{ > ... >> +if (portal->status == PORTAL_ACTIVE) >> +MarkPortalFailed(portal); > Do you have a test case that reaches this particular MarkPortalFailed() call? > My attempts stumbled over the fact that, before we reach here, each of the > three MarkPortalActive() callers will have already called MarkPortalFailed() > in its PG_CATCH block. ("make check" does not reach this call.) Offhand I think that's just belt-and-suspenders-too coding. As you say, we'd typically have failed active portals already before getting here. But the responsibility of this routine is to *guarantee* that no broken portals remain active, so I'd not want to remove this check. Do you have a particular reason for asking? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE
On Thu, Sep 03, 2015 at 11:04:11PM -0400, Tom Lane wrote: > *** AtSubAbort_Portals(SubTransactionId mySu > --- 909,966 > { > Portal portal = hentry->portal; > > + /* Was it created in this subtransaction? */ > if (portal->createSubid != mySubid) > + { > + /* No, but maybe it was used in this subtransaction? */ > + if (portal->activeSubid == mySubid) > + { ... > + if (portal->status == PORTAL_ACTIVE) > + MarkPortalFailed(portal); Do you have a test case that reaches this particular MarkPortalFailed() call? My attempts stumbled over the fact that, before we reach here, each of the three MarkPortalActive() callers will have already called MarkPortalFailed() in its PG_CATCH block. ("make check" does not reach this call.) -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE
Michael Paquier writes:
> On Sat, Sep 5, 2015 at 3:41 PM, Jim Nasby wrote:
>> I was about to test this and was verifying that the crash still worked
>> when I noticed this in the logs (9.4.1, not HEAD). Not sure if there's any
>> relation here or not...
>>
>> WARNING: relation "pg_proc" page 121 is uninitialized --- fixing
>> WARNING: relation "pg_proc" page 122 is uninitialized --- fixing

> [reading vacuumlazy.c...] This seems unrelated and I would not worry about
> it. Those system catalogs have been extended with new pages by a couple of
> backends, but a crash happened before they could actually insert tuples on
> it and commit. Perhaps you were creating a bunch of objects when a crash
> happened, no?

If this is the system you were doing the previous crash testing on, I'd
say it's a direct artifact of those crashes.  You were repeatedly crashing
the system during transactions that created functions and temp tables,
which would surely make entries in pg_proc, pg_depend, etc.  The crash
came before commit and so neither the relation pages nor the WAL entries
would necessarily get to disk.  But extension of the relations to make
room would still happen.  Moreover, there's probably nothing to cause
those new pages to get added to the relations' FSM, so each new test
attempt would add another set of pages.  Eventually autovacuum would
notice, which is what you've got logged here, but it might take awhile.

			regards, tom lane

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE
On Sat, Sep 5, 2015 at 3:41 PM, Jim Nasby wrote:
> I was about to test this and was verifying that the crash still worked
> when I noticed this in the logs (9.4.1, not HEAD). Not sure if there's any
> relation here or not...
>
> WARNING: relation "pg_proc" page 121 is uninitialized --- fixing
> WARNING: relation "pg_proc" page 122 is uninitialized --- fixing

[reading vacuumlazy.c...] This seems unrelated and I would not worry about
it. Those system catalogs have been extended with new pages by a couple of
backends, but a crash happened before they could actually insert tuples on
it and commit. Perhaps you were creating a bunch of objects when a crash
happened, no?

Coming to the point, did you see a new crash with test_factory? Is that
some remnant of a previous test?
--
Michael
Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE
On 9/4/15 12:41 PM, Tom Lane wrote:
> I wrote:
>>> After review of all the callers of RelationClearRelation, it seems like
>>> most of them already have sufficient guards to prevent triggering of the
>>> Assert. The ones that lack such tests are AtEOXact_cleanup and
>>> AtEOSubXact_cleanup. So what I'm now thinking is that those should do
>>> something along the lines of
>> Specifically, this, which can be shown to mitigate the results of the
>> problem cases in an otherwise-unpatched build.
> And pushed.  I noticed that while the relcache.c fix mitigates the error
> pretty well in 9.3 and up, in the older branches you still end up with a
> PANIC due to error stack overflow.  This may indicate that there's some
> patch we'd have been better off back-porting.  However, with the Portal
> changes in place the test cases work in all branches, so I'm not excited
> enough to pursue the point further myself.

I was about to test this and was verifying that the crash still worked
when I noticed this in the logs (9.4.1, not HEAD). Not sure if there's any
relation here or not...
WARNING: relation "pg_proc" page 121 is uninitialized --- fixing
WARNING: relation "pg_proc" page 122 is uninitialized --- fixing
WARNING: relation "pg_proc" page 123 is uninitialized --- fixing
WARNING: relation "pg_proc" page 124 is uninitialized --- fixing
WARNING: relation "pg_proc" page 125 is uninitialized --- fixing
WARNING: relation "pg_proc" page 126 is uninitialized --- fixing
WARNING: relation "pg_proc" page 127 is uninitialized --- fixing
WARNING: relation "pg_proc" page 128 is uninitialized --- fixing
WARNING: relation "pg_proc" page 129 is uninitialized --- fixing
WARNING: relation "pg_proc" page 130 is uninitialized --- fixing
WARNING: relation "pg_proc" page 131 is uninitialized --- fixing
WARNING: relation "pg_proc" page 132 is uninitialized --- fixing
WARNING: relation "pg_proc" page 133 is uninitialized --- fixing
WARNING: relation "pg_proc" page 134 is uninitialized --- fixing
WARNING: relation "pg_proc" page 135 is uninitialized --- fixing
WARNING: relation "pg_proc" page 136 is uninitialized --- fixing
WARNING: relation "pg_proc" page 137 is uninitialized --- fixing
WARNING: relation "pg_proc" page 138 is uninitialized --- fixing
WARNING: relation "pg_proc" page 139 is uninitialized --- fixing
WARNING: relation "pg_proc" page 140 is uninitialized --- fixing
WARNING: relation "pg_proc" page 141 is uninitialized --- fixing
WARNING: relation "pg_proc" page 142 is uninitialized --- fixing
WARNING: relation "pg_proc" page 143 is uninitialized --- fixing
WARNING: relation "pg_proc" page 144 is uninitialized --- fixing
WARNING: relation "pg_proc" page 145 is uninitialized --- fixing
WARNING: relation "pg_proc" page 146 is uninitialized --- fixing
WARNING: relation "pg_proc" page 147 is uninitialized --- fixing
WARNING: relation "pg_proc" page 148 is uninitialized --- fixing
WARNING: relation "pg_proc" page 149 is uninitialized --- fixing
WARNING: relation "pg_proc" page 150 is uninitialized --- fixing
WARNING: relation "pg_proc" page 151 is uninitialized --- fixing
WARNING: relation "pg_proc" page 152 is uninitialized --- fixing
WARNING: relation "pg_proc" page 153 is uninitialized --- fixing
WARNING: relation "pg_proc" page 154 is uninitialized --- fixing
WARNING: relation "pg_proc" page 155 is uninitialized --- fixing
WARNING: relation "pg_proc" page 156 is uninitialized --- fixing
WARNING: relation "pg_proc" page 157 is uninitialized --- fixing
WARNING: relation "pg_proc" page 158 is uninitialized --- fixing
WARNING: relation "pg_proc" page 159 is uninitialized --- fixing
WARNING: relation "pg_proc" page 160 is uninitialized --- fixing
WARNING: relation "pg_proc" page 161 is uninitialized --- fixing
WARNING: relation "pg_proc" page 162 is uninitialized --- fixing
WARNING: relation "pg_proc" page 246 is uninitialized --- fixing
WARNING: relation "pg_depend" page 82 is uninitialized --- fixing
WARNING: relation "pg_depend" page 83 is uninitialized --- fixing
WARNING: relation "pg_depend" page 84 is uninitialized --- fixing
WARNING: relation "pg_depend" page 85 is uninitialized --- fixing
WARNING: relation "pg_depend" page 86 is uninitialized --- fixing
WARNING: relation "pg_depend" page 87 is uninitialized --- fixing
WARNING: relation "pg_depend" page 88 is uninitialized --- fixing
WARNING: relation "pg_depend" page 89 is uninitialized --- fixing
WARNING: relation "pg_depend" page 90 is uninitialized --- fixing
WARNING: relation "pg_depend" page 91 is uninitialized --- fixing
WARNING: relation "pg_depend" page 92 is uninitialized --- fixing
WARNING: relation "pg_depend" page 93 is uninitialized --- fixing
WARNING: relation "pg_depend" page 94 is uninitialized --- fixing
WARNING: relation "pg_depend" page 95 is uninitialized --- fixing
WARNING: relation "pg_depend" page 96 is uninitialized --- fixing
WARNING: relation "pg_depend" page 112 is uninitialized --- fixing
WARNING: relation "pg_depen
Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE
I wrote: >> After review of all the callers of RelationClearRelation, it seems like >> most of them already have sufficient guards to prevent triggering of the >> Assert. The ones that lack such tests are AtEOXact_cleanup and >> AtEOSubXact_cleanup. So what I'm now thinking is that those should do >> something along the lines of > Specifically, this, which can be shown to mitigate the results of the > problem cases in an otherwise-unpatched build. And pushed. I noticed that while the relcache.c fix mitigates the error pretty well in 9.3 and up, in the older branches you still end up with a PANIC due to error stack overflow. This may indicate that there's some patch we'd have been better off back-porting. However, with the Portal changes in place the test cases work in all branches, so I'm not excited enough to pursue the point further myself. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE
I wrote: > After review of all the callers of RelationClearRelation, it seems like > most of them already have sufficient guards to prevent triggering of the > Assert. The ones that lack such tests are AtEOXact_cleanup and > AtEOSubXact_cleanup. So what I'm now thinking is that those should do > something along the lines of Specifically, this, which can be shown to mitigate the results of the problem cases in an otherwise-unpatched build. regards, tom lane diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c index 44e9509..420ef3d 100644 *** a/src/backend/utils/cache/relcache.c --- b/src/backend/utils/cache/relcache.c *** RelationClearRelation(Relation relation, *** 2048,2054 { /* * As per notes above, a rel to be rebuilt MUST have refcnt > 0; while of ! * course it would be a bad idea to blow away one with nonzero refcnt. */ Assert(rebuild ? !RelationHasReferenceCountZero(relation) : --- 2048,2056 { /* * As per notes above, a rel to be rebuilt MUST have refcnt > 0; while of ! * course it would be an equally bad idea to blow away one with nonzero ! * refcnt, since that would leave someone somewhere with a dangling ! * pointer. All callers are expected to have verified that this holds. */ Assert(rebuild ? !RelationHasReferenceCountZero(relation) : *** AtEOXact_cleanup(Relation relation, bool *** 2654,2664 { if (isCommit) relation->rd_createSubid = InvalidSubTransactionId; ! else { RelationClearRelation(relation, false); return; } } /* --- 2656,2680 { if (isCommit) relation->rd_createSubid = InvalidSubTransactionId; ! else if (RelationHasReferenceCountZero(relation)) { RelationClearRelation(relation, false); return; } + else + { + /* + * Hmm, somewhere there's a (leaked?) reference to the relation. + * We daren't remove the entry for fear of dereferencing a + * dangling pointer later. Bleat, and mark it as not belonging to + * the current transaction. Hopefully it'll get cleaned up + * eventually. 
This must be just a WARNING to avoid + * error-during-error-recovery loops. + */ + relation->rd_createSubid = InvalidSubTransactionId; + elog(WARNING, "cannot remove relcache entry for \"%s\" because it has nonzero refcount", + RelationGetRelationName(relation)); + } } /* *** AtEOSubXact_cleanup(Relation relation, b *** 2747,2757 { if (isCommit) relation->rd_createSubid = parentSubid; ! else { RelationClearRelation(relation, false); return; } } /* --- 2763,2786 { if (isCommit) relation->rd_createSubid = parentSubid; ! else if (RelationHasReferenceCountZero(relation)) { RelationClearRelation(relation, false); return; } + else + { + /* + * Hmm, somewhere there's a (leaked?) reference to the relation. + * We daren't remove the entry for fear of dereferencing a + * dangling pointer later. Bleat, and transfer it to the parent + * subtransaction so we can try again later. This must be just a + * WARNING to avoid error-during-error-recovery loops. + */ + relation->rd_createSubid = parentSubid; + elog(WARNING, "cannot remove relcache entry for \"%s\" because it has nonzero refcount", + RelationGetRelationName(relation)); + } } /* -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE
Michael Paquier writes: > On Fri, Sep 4, 2015 at 10:52 PM, Tom Lane wrote: >> Also, I am very strongly tempted to convert the original problem-causing >> Assert in relcache.c into a regular runtime test-and-elog. If we're wrong >> about having sealed off this hole completely, I'd much rather see the >> result be an elog than silent corruption of the relcache. > +1. For a HEAD-only change I suppose. No, the point is to have less dangerous behavior *in the field* if the situation occurs. I have every intention of back-patching this bit as well. Although after some experimentation with a build lacking the Portal-level fixes, an elog(ERROR) in RelationClearRelation isn't much fun either, because the problem case causes errors during error cleanup, soon leading to "PANIC: ERRORDATA_STACK_SIZE exceeded". After review of all the callers of RelationClearRelation, it seems like most of them already have sufficient guards to prevent triggering of the Assert. The ones that lack such tests are AtEOXact_cleanup and AtEOSubXact_cleanup. So what I'm now thinking is that those should do something along the lines of if (isCommit) relation->rd_createSubid = parentSubid; else if (RelationHasReferenceCountZero(relation)) { RelationClearRelation(relation, false); return; } else { elog(WARNING, "rel so-and-so is still referenced"); relation->rd_createSubid = parentSubid; } so as to mitigate the damage if there's a dangling reference. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE
On Fri, Sep 4, 2015 at 10:52 PM, Tom Lane wrote: > Also, I am very strongly tempted to convert the original problem-causing > Assert in relcache.c into a regular runtime test-and-elog. If we're wrong > about having sealed off this hole completely, I'd much rather see the > result be an elog than silent corruption of the relcache. > +1. For a HEAD-only change I suppose. -- Michael
Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE
Michael Paquier writes: > On Fri, Sep 4, 2015 at 12:04 PM, Tom Lane wrote: >> Attached is an updated patch with comments to this effect and some >> minor other code cleanup (mainly, not assuming that CurrentResourceOwner >> is the right thing to use in AtSubAbort_Portals). > Thanks! This looks good to me. If no further comments, I'll see about pushing and back-patching this. In the stable branches, the new field in Portal needs to go at the end, to avoid an ABI break for extensions that look at Portals. Otherwise I'm inclined to push it as-is. In particular, that would break any extensions that call AtSubAbort_Portals() directly, but I'm having a really hard time believing that there are any. (If anyone can make a case that there are some, the fallback would be to remove the new argument in the back branches and make AtSubAbort_Portals depend on CurrentResourceOwner again; but I'd rather not do it that way.) Also, I am very strongly tempted to convert the original problem-causing Assert in relcache.c into a regular runtime test-and-elog. If we're wrong about having sealed off this hole completely, I'd much rather see the result be an elog than silent corruption of the relcache. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE
On Fri, Sep 4, 2015 at 12:04 PM, Tom Lane wrote: > I wrote: > > Hmm. I am not feeling especially comfortable about this: it's not clear > > that there's anything preventing a suspended portal from containing a > > dangerous reference. However, a quick trial did not show that it was > > possible to break it --- in the cases I tried, we detected that a cached > > plan was no longer valid, tried to rebuild it, and noticed the missing > > object at that point. So maybe it's OK. > > After further thought I concluded that this probably is safe. The > portal's original query was created and planned when it was opened, > so it cannot contain any references to objects of the failed > subtransaction. We have a hazard from queries within functions, > but if the portal is suspended then no such functions are in progress. > > Attached is an updated patch with comments to this effect and some > minor other code cleanup (mainly, not assuming that CurrentResourceOwner > is the right thing to use in AtSubAbort_Portals). > Thanks! This looks good to me. -- Michael
Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE
I wrote: > Hmm. I am not feeling especially comfortable about this: it's not clear > that there's anything preventing a suspended portal from containing a > dangerous reference. However, a quick trial did not show that it was > possible to break it --- in the cases I tried, we detected that a cached > plan was no longer valid, tried to rebuild it, and noticed the missing > object at that point. So maybe it's OK. After further thought I concluded that this probably is safe. The portal's original query was created and planned when it was opened, so it cannot contain any references to objects of the failed subtransaction. We have a hazard from queries within functions, but if the portal is suspended then no such functions are in progress. Attached is an updated patch with comments to this effect and some minor other code cleanup (mainly, not assuming that CurrentResourceOwner is the right thing to use in AtSubAbort_Portals). regards, tom lane diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c index b53d95f..ae9e850 100644 *** a/src/backend/access/transam/xact.c --- b/src/backend/access/transam/xact.c *** AbortSubTransaction(void) *** 4585,4590 --- 4585,4591 AfterTriggerEndSubXact(false); AtSubAbort_Portals(s->subTransactionId, s->parent->subTransactionId, + s->curTransactionOwner, s->parent->curTransactionOwner); AtEOSubXact_LargeObject(false, s->subTransactionId, s->parent->subTransactionId); diff --git a/src/backend/commands/portalcmds.c b/src/backend/commands/portalcmds.c index 2794537..327b2a5 100644 *** a/src/backend/commands/portalcmds.c --- b/src/backend/commands/portalcmds.c *** PersistHoldablePortal(Portal portal) *** 335,345 /* * Check for improper portal use, and mark portal active. */ ! if (portal->status != PORTAL_READY) ! ereport(ERROR, ! (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), ! errmsg("portal \"%s\" cannot be run", portal->name))); ! portal->status = PORTAL_ACTIVE; /* * Set up global portal context pointers. 
--- 335,341 /* * Check for improper portal use, and mark portal active. */ ! MarkPortalActive(portal); /* * Set up global portal context pointers. diff --git a/src/backend/tcop/pquery.c b/src/backend/tcop/pquery.c index 9c14e8a..0df86a2 100644 *** a/src/backend/tcop/pquery.c --- b/src/backend/tcop/pquery.c *** PortalRun(Portal portal, long count, boo *** 734,744 /* * Check for improper portal use, and mark portal active. */ ! if (portal->status != PORTAL_READY) ! ereport(ERROR, ! (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), ! errmsg("portal \"%s\" cannot be run", portal->name))); ! portal->status = PORTAL_ACTIVE; /* * Set up global portal context pointers. --- 734,740 /* * Check for improper portal use, and mark portal active. */ ! MarkPortalActive(portal); /* * Set up global portal context pointers. *** PortalRunFetch(Portal portal, *** 1398,1408 /* * Check for improper portal use, and mark portal active. */ ! if (portal->status != PORTAL_READY) ! ereport(ERROR, ! (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), ! errmsg("portal \"%s\" cannot be run", portal->name))); ! portal->status = PORTAL_ACTIVE; /* * Set up global portal context pointers. --- 1394,1400 /* * Check for improper portal use, and mark portal active. */ ! MarkPortalActive(portal); /* * Set up global portal context pointers. diff --git a/src/backend/utils/mmgr/portalmem.c b/src/backend/utils/mmgr/portalmem.c index 7530498..44a87f7 100644 *** a/src/backend/utils/mmgr/portalmem.c --- b/src/backend/utils/mmgr/portalmem.c *** CreatePortal(const char *name, bool allo *** 232,237 --- 232,238 portal->status = PORTAL_NEW; portal->cleanup = PortalCleanup; portal->createSubid = GetCurrentSubTransactionId(); + portal->activeSubid = portal->createSubid; portal->strategy = PORTAL_MULTI_QUERY; portal->cursorOptions = CURSOR_OPT_NO_SCROLL; portal->atStart = true; *** UnpinPortal(Portal portal) *** 403,408 --- 404,428 } /* + * MarkPortalActive + * Transition a portal from READY to ACTIVE state. 
+ *
+ * NOTE: never set portal->status = PORTAL_ACTIVE directly; call this instead.
+ */
+ void
+ MarkPortalActive(Portal portal)
+ {
+ 	/* For safety, this is a runtime test not just an Assert */
+ 	if (portal->status != PORTAL_READY)
+ 		ereport(ERROR,
+ 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ 				 errmsg("portal \"%s\" cannot be run", portal->name)));
+ 	/* Perform the state transition */
+ 	portal->status = PORTAL_ACTIVE;
+ 	portal->activeSubid = GetCurrentSubTransactionId();
+ }
+
+ /*
  * MarkPortalDone
  *		Transition a portal from ACTIVE to DONE state.
  *
Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE
Michael Paquier writes: > On Fri, Sep 4, 2015 at 9:57 AM, Tom Lane wrote: >> On reflection I think that the tracking of activeSubid in my patch is >> probably overkill if we're attacking it this way. We can just have >> AtSubAbort_Portals fail any ACTIVE portal regardless of subxact level, >> which is pretty analogous to what AtAbort_Portals has done for a long >> time. > Tracking the activated subxact looked neat from my side, that's more > consistent with what is done when the portal is marked as ready, > particularly with the new routine introduced. Actually, that idea did not work at all: it caused errors inside plpgsql EXCEPT blocks to try to kill the portal running the outer function call. Ooops. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE
On Fri, Sep 4, 2015 at 9:57 AM, Tom Lane wrote: > On reflection I think that the tracking of activeSubid in my patch is > probably overkill if we're attacking it this way. We can just have > AtSubAbort_Portals fail any ACTIVE portal regardless of subxact level, > which is pretty analogous to what AtAbort_Portals has done for a long > time. > Tracking the activated subxact looked neat from my side, that's more consistent with what is done when the portal is marked as ready, particularly with the new routine introduced. > Let me work on this and see if I can get to a simpler patch. > Oh, OK. Thanks! -- Michael
Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE
Michael Paquier writes:
> On Fri, Sep 4, 2015 at 6:51 AM, Tom Lane wrote:
>> Ideas?

> Yes. This diff on top of your patch:
> @@ -922,8 +922,7 @@ AtSubAbort_Portals(SubTransactionId mySubid,
>                  * must be forced into FAILED state, for the same reasons
>                  * discussed below.
>                  */
> -               if (portal->status == PORTAL_READY ||
> -                   portal->status == PORTAL_ACTIVE)
> +               if (portal->status == PORTAL_ACTIVE)
>                        MarkPortalFailed(portal);
> This way only the active portals are marked as failed.

Hmm.  I am not feeling especially comfortable about this: it's not clear
that there's anything preventing a suspended portal from containing a
dangerous reference.  However, a quick trial did not show that it was
possible to break it --- in the cases I tried, we detected that a cached
plan was no longer valid, tried to rebuild it, and noticed the missing
object at that point.  So maybe it's OK.

On reflection I think that the tracking of activeSubid in my patch is
probably overkill if we're attacking it this way.  We can just have
AtSubAbort_Portals fail any ACTIVE portal regardless of subxact level,
which is pretty analogous to what AtAbort_Portals has done for a long
time.

Let me work on this and see if I can get to a simpler patch.

			regards, tom lane

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE
On Fri, Sep 4, 2015 at 6:51 AM, Tom Lane wrote:
> I experimented with the attached patch.  It seems to work to stop the
> crash Michael exhibited (I've not tried to duplicate Jim's original
> example, though).  However, it also causes a regression test failure,
> because transactions.sql does this:

Neat patch, it would have taken me longer to figure out that... I have
tried with Jim's test case and the patch is working.

> which is exactly the case we're trying to reject now.  So that fills
> me with fear that this approach would break existing applications.
> On the other hand, I do not really see a good alternative :-(.

This behavior is present since 2004 with fe548629, so that's a real
concern to me, especially if there are drivers around relying on this
behavior. There are for example some code patterns around Postgres ODBC
that could be impacted, not sure which ones but I guess that some
internal error handling is not going to like that.

> I thought about trying to detect whether the Portal actually had any
> references to "new in subtransaction" objects to decide whether to
> kill it or not, but that seems awfully fragile.
>
> Ideas?

Yes. This diff on top of your patch:

@@ -922,8 +922,7 @@ AtSubAbort_Portals(SubTransactionId mySubid,
                 * must be forced into FAILED state, for the same reasons
                 * discussed below.
                 */
-               if (portal->status == PORTAL_READY ||
-                   portal->status == PORTAL_ACTIVE)
+               if (portal->status == PORTAL_ACTIVE)
                        MarkPortalFailed(portal);

This way only the active portals are marked as failed. The regression
tests that are failing with your patch applied visibly do not activate
the portal they use, just marking it as ready to be used. This seems to
be the safest approach to me on stable branches, as well as on master,
this way we are sure that resources on the failed subxacts are cleaned
up correctly, and that existing applications are not impacted. I am
attaching a new patch with what I changed and a test case based on my
previous one.
Regards, -- Michael diff --git a/src/backend/commands/portalcmds.c b/src/backend/commands/portalcmds.c index 2794537..327b2a5 100644 --- a/src/backend/commands/portalcmds.c +++ b/src/backend/commands/portalcmds.c @@ -335,11 +335,7 @@ PersistHoldablePortal(Portal portal) /* * Check for improper portal use, and mark portal active. */ - if (portal->status != PORTAL_READY) - ereport(ERROR, -(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), - errmsg("portal \"%s\" cannot be run", portal->name))); - portal->status = PORTAL_ACTIVE; + MarkPortalActive(portal); /* * Set up global portal context pointers. diff --git a/src/backend/tcop/pquery.c b/src/backend/tcop/pquery.c index 9c14e8a..0df86a2 100644 --- a/src/backend/tcop/pquery.c +++ b/src/backend/tcop/pquery.c @@ -734,11 +734,7 @@ PortalRun(Portal portal, long count, bool isTopLevel, /* * Check for improper portal use, and mark portal active. */ - if (portal->status != PORTAL_READY) - ereport(ERROR, -(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), - errmsg("portal \"%s\" cannot be run", portal->name))); - portal->status = PORTAL_ACTIVE; + MarkPortalActive(portal); /* * Set up global portal context pointers. @@ -1398,11 +1394,7 @@ PortalRunFetch(Portal portal, /* * Check for improper portal use, and mark portal active. */ - if (portal->status != PORTAL_READY) - ereport(ERROR, -(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), - errmsg("portal \"%s\" cannot be run", portal->name))); - portal->status = PORTAL_ACTIVE; + MarkPortalActive(portal); /* * Set up global portal context pointers. 
diff --git a/src/backend/utils/mmgr/portalmem.c b/src/backend/utils/mmgr/portalmem.c
index 7530498..639c2b3 100644
--- a/src/backend/utils/mmgr/portalmem.c
+++ b/src/backend/utils/mmgr/portalmem.c
@@ -232,6 +232,7 @@ CreatePortal(const char *name, bool allowDup, bool dupSilent)
 	portal->status = PORTAL_NEW;
 	portal->cleanup = PortalCleanup;
 	portal->createSubid = GetCurrentSubTransactionId();
+	portal->activeSubid = portal->createSubid;
 	portal->strategy = PORTAL_MULTI_QUERY;
 	portal->cursorOptions = CURSOR_OPT_NO_SCROLL;
 	portal->atStart = true;
@@ -403,6 +404,25 @@ UnpinPortal(Portal portal)
 }

 /*
+ * MarkPortalActive
+ *		Transition a portal from READY to ACTIVE state.
+ *
+ * NOTE: never set portal->status = PORTAL_ACTIVE directly; call this instead.
+ */
+void
+MarkPortalActive(Portal portal)
+{
+	/* For safety, this is a runtime test not just an Assert */
+	if (portal->status != PORTAL_READY)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("portal \"%s\" cannot be run", portal->name)));
+	/* Perform the state transition */
+	portal->status = PORTAL_ACTIVE;
+
Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE
I wrote:
> I'm inclined to think that the only real fix for this type of problem
> is that at subtransaction abort, we have to prevent further execution
> of any Portals that have executed at all within the subxact (ie force
> them into PORTAL_FAILED state and clean out their resources, see
> MarkPortalFailed). It's not good enough to kill the one(s) that are
> actively executing at the instant of failure, because anything that's
> run at all since the subxact started might be holding a reference to an
> object we're about to delete.

> I'm a little worried that that will break usage patterns that work today,
> though.

I experimented with the attached patch. It seems to work to stop the crash Michael exhibited (I've not tried to duplicate Jim's original example, though). However, it also causes a regression test failure, because transactions.sql does this:

BEGIN;
DECLARE c CURSOR FOR SELECT unique2 FROM tenk1 ORDER BY unique2;
SAVEPOINT one;
FETCH 10 FROM c;
ROLLBACK TO SAVEPOINT one;
FETCH 10 FROM c;

which is exactly the case we're trying to reject now. So that fills me with fear that this approach would break existing applications. On the other hand, I do not really see a good alternative :-(.

I thought about trying to detect whether the Portal actually had any references to "new in subtransaction" objects to decide whether to kill it or not, but that seems awfully fragile.

Ideas?

			regards, tom lane

diff --git a/src/backend/commands/portalcmds.c b/src/backend/commands/portalcmds.c
index 2794537..327b2a5 100644
*** a/src/backend/commands/portalcmds.c
--- b/src/backend/commands/portalcmds.c
*** PersistHoldablePortal(Portal portal)
*** 335,345 ****
  	/*
  	 * Check for improper portal use, and mark portal active.
  	 */
! 	if (portal->status != PORTAL_READY)
! 		ereport(ERROR,
! 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
! 				 errmsg("portal \"%s\" cannot be run", portal->name)));
! 	portal->status = PORTAL_ACTIVE;

  	/*
  	 * Set up global portal context pointers.
--- 335,341 ----
  	/*
  	 * Check for improper portal use, and mark portal active.
  	 */
! 	MarkPortalActive(portal);

  	/*
  	 * Set up global portal context pointers.
diff --git a/src/backend/tcop/pquery.c b/src/backend/tcop/pquery.c
index 9c14e8a..0df86a2 100644
*** a/src/backend/tcop/pquery.c
--- b/src/backend/tcop/pquery.c
*** PortalRun(Portal portal, long count, boo
*** 734,744 ****
  	/*
  	 * Check for improper portal use, and mark portal active.
  	 */
! 	if (portal->status != PORTAL_READY)
! 		ereport(ERROR,
! 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
! 				 errmsg("portal \"%s\" cannot be run", portal->name)));
! 	portal->status = PORTAL_ACTIVE;

  	/*
  	 * Set up global portal context pointers.
--- 734,740 ----
  	/*
  	 * Check for improper portal use, and mark portal active.
  	 */
! 	MarkPortalActive(portal);

  	/*
  	 * Set up global portal context pointers.
*** PortalRunFetch(Portal portal,
*** 1398,1408 ****
  	/*
  	 * Check for improper portal use, and mark portal active.
  	 */
! 	if (portal->status != PORTAL_READY)
! 		ereport(ERROR,
! 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
! 				 errmsg("portal \"%s\" cannot be run", portal->name)));
! 	portal->status = PORTAL_ACTIVE;

  	/*
  	 * Set up global portal context pointers.
--- 1394,1400 ----
  	/*
  	 * Check for improper portal use, and mark portal active.
  	 */
! 	MarkPortalActive(portal);

  	/*
  	 * Set up global portal context pointers.
diff --git a/src/backend/utils/mmgr/portalmem.c b/src/backend/utils/mmgr/portalmem.c
index 7530498..74f5089 100644
*** a/src/backend/utils/mmgr/portalmem.c
--- b/src/backend/utils/mmgr/portalmem.c
*** CreatePortal(const char *name, bool allo
*** 232,237 ****
--- 232,238 ----
  	portal->status = PORTAL_NEW;
  	portal->cleanup = PortalCleanup;
  	portal->createSubid = GetCurrentSubTransactionId();
+ 	portal->activeSubid = portal->createSubid;
  	portal->strategy = PORTAL_MULTI_QUERY;
  	portal->cursorOptions = CURSOR_OPT_NO_SCROLL;
  	portal->atStart = true;
*** UnpinPortal(Portal portal)
*** 403,408 ****
--- 404,428 ----
  }

  /*
+  * MarkPortalActive
+  *		Transition a portal from READY to ACTIVE state.
+  *
+  * NOTE: never set portal->status = PORTAL_ACTIVE directly; call this instead.
+  */
+ void
+ MarkPortalActive(Portal portal)
+ {
+ 	/* For safety, this is a runtime test not just an Assert */
+ 	if (portal->status != PORTAL_READY)
+ 		ereport(ERROR,
+ 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ 				 errmsg("portal \"%s\" cannot be run", portal->name)));
+ 	/* Perform the state transition */
+ 	portal->status = PORTAL_ACTIVE;
+ 	portal->activeSubid = GetCurrentSubTransactionId();
+ }
+
+ /*
  * MarkPortalDone
  *		Transition a portal from ACTIVE to DONE state.
  *
*** PreCommit_Portals(bool
Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE
Michael Paquier writes:
> Finally I have been able to crack down the problem, and it can be
> reproduced with the following test case for example:

Hm. It looks like the problem is that we're executing the function within the Portal created for cursor "h", and that Portal is older than the subtransaction created by the savepoint, so when we abort the subtransaction we do not clean up the Portal. That means that resources held by that Portal haven't been released --- in particular it still has a relcache pin on the "new_table" --- when subtransaction abort gets to the point of wanting to drop relcache entries created during the subtransaction. So the Assert has caught an actual problem: we'd have flushed the relcache entry while there was still an open reference to it.

I'm inclined to think that the only real fix for this type of problem is that at subtransaction abort, we have to prevent further execution of any Portals that have executed at all within the subxact (ie force them into PORTAL_FAILED state and clean out their resources, see MarkPortalFailed). It's not good enough to kill the one(s) that are actively executing at the instant of failure, because anything that's run at all since the subxact started might be holding a reference to an object we're about to delete.

I'm a little worried that that will break usage patterns that work today, though.

			regards, tom lane

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE
On Thu, Sep 3, 2015 at 1:46 AM, Jim Nasby wrote:
> The double set of exceptions seems to be critical; if instead of calling
> tf.get(invoice) (which recursively does tf.get(customer)) I define the
> cursor to call tf.get(customer) directly there's no assert.

Finally I have been able to crack down the problem, and it can be reproduced with the following test case for example:

BEGIN;
CREATE OR REPLACE FUNCTION create_self_tab() RETURNS text
LANGUAGE plpgsql AS $$
BEGIN
  CREATE TEMP TABLE new_table ON COMMIT DROP AS SELECT create_self_tab();
  RETURN 'foo';
END $$;
DECLARE h CURSOR FOR SELECT create_self_tab();
SAVEPOINT self_tab_point;
FETCH h; -- error
COMMIT;

When fetching the first tuple, the transaction state is rolled back to the savepoint because the call fails in the second loop (the temporary relation already exists); the temporary table that has been created is then cleaned up while it is still referenced, causing the assertion failure. So that's not related to the use of the exception blocks. What directed me to the SAVEPOINT causing the issue is the use of ON_ERROR_ROLLBACK in the test case Jim proposed. I don't have a patch at hand yet, but things are now easier to test.
--
Michael
Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE
On 9/2/15 2:56 PM, Jim Nasby wrote:
> On 9/2/15 2:17 PM, Alvaro Herrera wrote:
>> Michael Paquier wrote:
>>> I haven't written yet a test case but I think that we could reproduce
>>> that simply by having a relation referenced in the exception block of
>>> a first function, calling a second function that itself raises an
>>> exception, causing the referencing error.
>>
>> Hm, so function 2 is called in the exception block of function 1?
>> Are you going to produce the test case for this? Jim?
>
> I had attempted one but the issue is that it's more than just an exception
> block thing. If it was that simple then we'd still get the crash without
> involving pgTap. So I suspect you need to have a named cursor in the mix as
> well. Let me make another attempt at something simpler.

After some time messing around with this I've been able to remove pgTap from the equation using the attached crash.sql (relevant snippet below). This only works if you pass a refcursor into the function. It doesn't work if you do

OPEN have FOR EXECUTE $$SELECT * FROM tf.get( NULL::invoice, 'base' )$$;

inside the function instead. This code also produces different results; it outputs the error message before crashing, and the stack trace shows the assert is now failing while trying to abort the top-level transaction.

So it looks like the scenario is:

BEGIN;
DECLARE (open?) cursor that calls
  function with exception handler that calls
    function with exception handler that calls
      something that fails

The double set of exceptions seems to be critical; if instead of calling tf.get(invoice) (which recursively does tf.get(customer)) I define the cursor to call tf.get(customer) directly there's no assert.

I can poke at this more tomorrow, but it'd be very helpful if someone could explain this failure mode, as I'm basically stumbling in the dark on a test case right now.
CREATE OR REPLACE FUNCTION a( have refcursor ) RETURNS void LANGUAGE plpgsql AS $body$
DECLARE
  have_rec record;
BEGIN
  FETCH have INTO have_rec;
END
$body$;

DECLARE h CURSOR FOR SELECT * FROM tf.get( NULL::invoice, 'base' );
SELECT a( 'h'::refcursor );

(lldb) bt
* thread #1: tid = 0x722ed8, 0x7fff92a5a866 libsystem_kernel.dylib`__pthread_kill + 10, queue = 'com.apple.main-thread', stop reason = signal SIGABRT
  * frame #0: 0x7fff92a5a866 libsystem_kernel.dylib`__pthread_kill + 10
    frame #1: 0x7fff9001b35c libsystem_pthread.dylib`pthread_kill + 92
    frame #2: 0x7fff8cf89b1a libsystem_c.dylib`abort + 125
    frame #3: 0x00010cdb4039 postgres`ExceptionalCondition(conditionName=0x00010cf59cfd, errorType=0x00010cec392d, fileName=0x00010cf59045, lineNumber=1972) + 137 at assert.c:54
    frame #4: 0x00010cd9b332 postgres`RelationClearRelation(relation=0x00011658f110, rebuild='\0') + 162 at relcache.c:1970
    frame #5: 0x00010cd9cc0f postgres`AtEOSubXact_cleanup(relation=0x00011658f110, isCommit='\0', mySubid=15, parentSubid=1) + 79 at relcache.c:2665
    frame #6: 0x00010cd9cb92 postgres`AtEOSubXact_RelationCache(isCommit='\0', mySubid=15, parentSubid=1) + 242 at relcache.c:2633
    frame #7: 0x00010c9b6e88 postgres`AbortSubTransaction + 440 at xact.c:4373
    frame #8: 0x00010c9b7208 postgres`AbortCurrentTransaction + 312 at xact.c:2947
    frame #9: 0x00010cc249f3 postgres`PostgresMain(argc=1, argv=0x7fa3f4800378, dbname=0x7fa3f4800200, username=0x7fa3f30031f8) + 1875 at postgres.c:3902
    frame #10: 0x00010cb9da48 postgres`PostmasterMain [inlined] BackendRun + 8024 at postmaster.c:4155
    frame #11: 0x00010cb9da22 postgres`PostmasterMain [inlined] BackendStartup at postmaster.c:3829
    frame #12: 0x00010cb9da22 postgres`PostmasterMain [inlined] ServerLoop at postmaster.c:1597
    frame #13: 0x00010cb9da22 postgres`PostmasterMain(argc=, argv=) + 7986 at postmaster.c:1244
    frame #14: 0x00010cb218cd postgres`main(argc=, argv=) + 1325 at main.c:228
    frame #15: 0x7fff8e9a35fd libdyld.dylib`start + 1

-- Jim Nasby, Data
Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com

BEGIN;
\i test/helpers/tap_setup.sql
CREATE EXTENSION test_factory;
SET search_path=tap;
\i test/helpers/create.sql

SELECT tf.register(
  'customer'
  , array[
      row( 'insert'
        , $$INSERT INTO customer VALUES (DEFAULT, 'first', 'last' ) RETURNING *$$
      )::tf.test_set
      , row( 'function'
        , $$SELECT * FROM customer__add( 'func first', 'func last' )$$
      )::tf.test_set
    ]
);

SELECT tf.register(
  'invoice'
  , array[
      row( 'insert'
        , $$INSERT INTO invoice VALUES(
            DEFAULT
            , (tf.get( NULL::customer, 'insert' )).customer_id
            , current_date
            , current_date + 30
          ) RETURNING *$$
      )::tf.test_set
      , row( 'function'
        , $$INSERT INTO invoice VALUES(
Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE
On 9/2/15 2:17 PM, Alvaro Herrera wrote:
> Michael Paquier wrote:
>> I haven't written yet a test case but I think that we could reproduce
>> that simply by having a relation referenced in the exception block of
>> a first function, calling a second function that itself raises an
>> exception, causing the referencing error.
>
> Hm, so function 2 is called in the exception block of function 1?
> Are you going to produce the test case for this? Jim?

I had attempted one but the issue is that it's more than just an exception block thing. If it was that simple then we'd still get the crash without involving pgTap. So I suspect you need to have a named cursor in the mix as well. Let me make another attempt at something simpler.

-- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE
Michael Paquier wrote:
> I haven't written yet a test case but I think that we could reproduce
> that simply by having a relation referenced in the exception block of
> a first function, calling a second function that itself raises an
> exception, causing the referencing error.

Hm, so function 2 is called in the exception block of function 1?

Are you going to produce the test case for this? Jim?

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE
On Wed, Sep 2, 2015 at 6:21 AM, Michael Paquier wrote:
> Yes, that's what I have been looking at actually by having some markers in
> relcache.c. The reference count of this relation gets incremented here:

So, I have been playing more with this code, and as mentioned by Andres this test case goes twice through the PG_CATCH block in pl_exec.c, causing the crash as the temporary relation does not get dereferenced. I may be missing something, but isn't it a matter of moving the switch to the old memory context to the end of the block, so that we get a correct cleanup of the estate in the first block? See for example the attached, which fixes the problem for me and passes make check-world, though I expect Tom and Andres to jump in and say that this is not correct within the next 12 hours :)

I haven't written yet a test case but I think that we could reproduce that simply by having a relation referenced in the exception block of a first function, calling a second function that itself raises an exception, causing the referencing error.

Regards,
--
Michael

diff --git a/src/pl/plpgsql/src/pl_exec.c b/src/pl/plpgsql/src/pl_exec.c
index 935fa62..56084c3 100644
--- a/src/pl/plpgsql/src/pl_exec.c
+++ b/src/pl/plpgsql/src/pl_exec.c
@@ -1253,8 +1253,6 @@ exec_stmt_block(PLpgSQL_execstate *estate, PLpgSQL_stmt_block *block)
 			/* Abort the inner transaction */
 			RollbackAndReleaseCurrentSubTransaction();

-			MemoryContextSwitchTo(oldcontext);
-			CurrentResourceOwner = oldowner;

 			/* Revert to outer eval_econtext */
 			estate->eval_econtext = old_eval_econtext;
@@ -1326,6 +1324,9 @@ exec_stmt_block(PLpgSQL_execstate *estate, PLpgSQL_stmt_block *block)
 				ReThrowError(edata);
 			else
 				FreeErrorData(edata);
+
+			MemoryContextSwitchTo(oldcontext);
+			CurrentResourceOwner = oldowner;
 		}
 		PG_END_TRY();

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE
On Wed, Sep 2, 2015 at 6:13 AM, Jim Nasby wrote:
> On 9/1/15 11:59 PM, Michael Paquier wrote:
>> On Wed, Sep 2, 2015 at 12:59 PM, Jim Nasby wrote:
>>> On 9/1/15 8:42 PM, Michael Paquier wrote:
>>> The crash is triggered by having an exception raised in this particular
>>> call stack. Since there's no syntax error in master/0.2.1, there's no
>>> assert failure either. Were you running asserts?
>>
>> I thought I was, but visibly it was mistaken. So, after re-rolling
>> configure, I have been able to reproduce the crash, and as far as I
>> can see all supported versions are impacted. I tested down to 9.1
>> where extensions were introduced, and the stack trace, as well as the
>> assertion failing are the same, similar to what Jim has reported. I am
>> just digging more into this thing now.
>
> I just had a theory that reference counts were messed up because there was
> both a temp table and a real table with the same name. Turns out that's not
> the case, but I do now know that the assert is failing for the reference
> count on the temp table.

Yes, that's what I have been looking at actually by having some markers in relcache.c.
The reference count of this relation gets incremented here:

LOG: increment ref count relation: invoice_03, rd_refcnt: 1
CONTEXT: SQL statement "
  CREATE TEMP TABLE invoice_03 ON COMMIT DROP AS
  WITH i AS (
    INSERT INTO invoice VALUES(
      DEFAULT
      , (tf.get( NULL::customer, 'insert' )).customer_id
      , current_date
      , current_date + 30
    ) RETURNING *
  ) SELECT * FROM i "
PL/pgSQL function tf.get(anyelement,text) line 29 at EXECUTE
PL/pgSQL function results_eq(refcursor,refcursor,text) line 11 at FETCH
PL/pgSQL function results_eq(text,text,text) line 9 at assignment
STATEMENT: SELECT results_eq(
  $$SELECT * FROM tf.get( NULL::invoice, 'base' )$$
  , $$VALUES( 1, 1, current_date, current_date + 30 )$$
  , 'invoice factory output' );

And it gets cleared here without being decremented when cleaning up the second exception:

LOG: clear relation: invoice_03, rd_refcnt: 1
CONTEXT: PL/pgSQL function results_eq(refcursor,refcursor,text) line 11 during exception cleanup
PL/pgSQL function results_eq(text,text,text) line 9 at assignment
STATEMENT: SELECT results_eq(
  $$SELECT * FROM tf.get( NULL::invoice, 'base' )$$
  , $$VALUES( 1, 1, current_date, current_date + 30 )$$
  , 'invoice factory output' );

--
Michael
Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE
On 9/1/15 11:59 PM, Michael Paquier wrote:
> On Wed, Sep 2, 2015 at 12:59 PM, Jim Nasby wrote:
>> On 9/1/15 8:42 PM, Michael Paquier wrote:
>> The crash is triggered by having an exception raised in this particular call
>> stack. Since there's no syntax error in master/0.2.1, there's no assert
>> failure either. Were you running asserts?
>
> I thought I was, but visibly it was mistaken. So, after re-rolling
> configure, I have been able to reproduce the crash, and as far as I
> can see all supported versions are impacted. I tested down to 9.1
> where extensions were introduced, and the stack trace, as well as the
> assertion failing are the same, similar to what Jim has reported. I am
> just digging more into this thing now.

I just had a theory that reference counts were messed up because there was both a temp table and a real table with the same name. Turns out that's not the case, but I do now know that the assert is failing for the reference count on the temp table.

-- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE
On Wed, Sep 2, 2015 at 12:59 PM, Jim Nasby wrote:
> On 9/1/15 8:42 PM, Michael Paquier wrote:
> The crash is triggered by having an exception raised in this particular call
> stack. Since there's no syntax error in master/0.2.1, there's no assert
> failure either. Were you running asserts?

I thought I was, but visibly it was mistaken. So, after re-rolling configure, I have been able to reproduce the crash, and as far as I can see all supported versions are impacted. I tested down to 9.1, where extensions were introduced, and the stack trace, as well as the assertion failing, are the same, similar to what Jim has reported. I am just digging more into this thing now.
--
Michael

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE
On 9/1/15 8:42 PM, Michael Paquier wrote:
> test_factory is a jungle to me. Perhaps you could just extract a
> self-contained test case? It does not matter if the file is long as long
> as the problem can be easily reproduced.

Sorry, more info on what's happening here.

The idea behind test_factory is to allow you to register a command that creates and returns test data (IE: INSERT INTO table VALUES( DEFAULT, 'my test data' ) RETURNING *). That insert statement is stored in a table (_tf._test_factory). When this dynamic statement is executed, the results will be stored in a specially named table (plpgsql variable c_data_table_name).

When you call tf.get(), it first attempts to grab all the rows from c_data_table_name. The first time you do this in a database, that table won't exist. tf.get's exception block will create a temp table holding the results of the stored statement (IE: INSERT INTO table ...).

Something else important here is that in crash.sql there is a nested tf.get() call:

-- Invoice
, $$INSERT INTO invoice VALUES(
    DEFAULT
    , (tf.get( NULL::customer, 'insert' )).customer_id
    , current_date
    , current_date + 30
  ) RETURNING *$$

Note that calls tf.get for customer (which is a simple INSERT). This is where stuff gets weird. If you run tf.get( NULL::customer, 'insert' ) you get a regular plpgsql error. If you simply run tf.get() for invoices, *outside* of tap.results_eq(), you also only get a plpgsql error. To trigger the assert, you must use tf.get( NULL::invoice, 'base' ) from within tap.results_eq(). That's the only way I've found to trigger this.

AFAICT, that call stack looks like this:

results_eq opens a cursor to run $$SELECT * FROM tf.get( NULL::invoice, 'base' )$$
plpgsql does its thing and eventually that statement begins execution
tf.get() does a few things then attempts to read from a non-existent table.
tf.get's outer block catches that exception and runs dynamic SQL to create a temp table.
That dynamic SQL contains (in part) this:

, (tf.get( NULL::customer, 'insert' )).customer_id

*That* tf.get also attempts to read from a non-existent table and *successfully* creates its temp table. It then does

PERFORM _tf.table_create( c_data_table_name );

which fails due to a bug in _tf.table_create(). Now we have a second exception bubbling back up to the exception handler of the second tf.get call, which goes up to the exception handler for the first tf.get call. That call was in the process of creating a temp table (invoice_003). The error continues up to the FETCH command in results_eq(). The assert happens somewhere after here, and it's because the refcount on that temp table (invoice_003) is unexpected.

I'm tracing through this scenario by hand right now to try and figure out exactly when that assert blows up, but I know it's happening in results_eq(refcursor, refcursor, text).

BTW, I just updated the crash branch to ensure that test_factory 0.1.1 is what gets installed.

-- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE
On 9/1/15 8:42 PM, Michael Paquier wrote:
> On Wed, Sep 2, 2015 at 9:39 AM, Michael Paquier wrote:
>> On Wed, Sep 2, 2015 at 7:23 AM, Jim Nasby wrote:
>>> Well nuts, pretty sure that means the error isn't reproducing for you. :/
>>> Do you maybe have unusual config options or postgresql.conf options?
>>
>> Yep. And I have found one reason why it was not working, I have been
>> using --extra-version with a custom string and the Makefile of
>> test_factory failed to detect that (you may want to use VERSION_NUM in
>> Makefile.global instead), leading to test_factory--0.1.1.sql to not be
>> installed as well.
>>
>> Even after fixing that, I am facing the same problem when running the test,
>> aka:
>> psql:crash.sql:42: ERROR: 42703: column "c_data_table_name" does not exist
>> LINE 4: , c_data_table_name
>> And the same error show up, should I use the branch crash in the
>> github repo or your zip file, make test or make install/psql -f
>> crash.sql.
>>
>> test_factory is a jungle to me. Perhaps you could just extract a
>> self-contained test case? It does not matter if the file is long as
>> long as the problem can be easily reproduced.
>
> Mea culpa. It is possible to run crash.sql, with test_factory instead
> 0.2.1 installed in a cluster. So I picked it up from github on master
> branch, deployed it, and then crash.sql is able to run. However I was
> not able to reproduce a failure on master, REL9_4_STABLE and 9.4.1.
> Thoughts?

The crash is triggered by having an exception raised in this particular call stack. Since there's no syntax error in master/0.2.1, there's no assert failure either. Were you running asserts?

What configure line are you using? I might be able to track this down to something particular to my configure.

-- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble!
http://BlueTreble.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE
On Wed, Sep 2, 2015 at 9:39 AM, Michael Paquier wrote:
> On Wed, Sep 2, 2015 at 7:23 AM, Jim Nasby wrote:
>> Well nuts, pretty sure that means the error isn't reproducing for you. :/ Do
>> you maybe have unusual config options or postgresql.conf options?
>
> Yep. And I have found one reason why it was not working, I have been
> using --extra-version with a custom string and the Makefile of
> test_factory failed to detect that (you may want to use VERSION_NUM in
> Makefile.global instead), leading to test_factory--0.1.1.sql to not be
> installed as well.
>
> Even after fixing that, I am facing the same problem when running the test,
> aka:
> psql:crash.sql:42: ERROR: 42703: column "c_data_table_name" does not exist
> LINE 4: , c_data_table_name
> And the same error show up, should I use the branch crash in the
> github repo or your zip file, make test or make install/psql -f
> crash.sql.
>
> test_factory is a jungle to me. Perhaps you could just extract a
> self-contained test case? It does not matter if the file is long as
> long as the problem can be easily reproduced.

Mea culpa. It is possible to run crash.sql with test_factory 0.2.1 installed in a cluster. So I picked it up from github on the master branch, deployed it, and then crash.sql is able to run. However I was not able to reproduce a failure on master, REL9_4_STABLE and 9.4.1. Thoughts?
--
Michael

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE
On Wed, Sep 2, 2015 at 7:23 AM, Jim Nasby wrote:
> Well nuts, pretty sure that means the error isn't reproducing for you. :/ Do
> you maybe have unusual config options or postgresql.conf options?

Yep. And I have found one reason why it was not working: I have been using --extra-version with a custom string, and the Makefile of test_factory failed to detect that (you may want to use VERSION_NUM in Makefile.global instead), leading to test_factory--0.1.1.sql not being installed as well.

Even after fixing that, I am facing the same problem when running the test, aka:

psql:crash.sql:42: ERROR: 42703: column "c_data_table_name" does not exist
LINE 4: , c_data_table_name

And the same error shows up whether I use the branch crash in the github repo or your zip file, make test or make install/psql -f crash.sql.

test_factory is a jungle to me. Perhaps you could just extract a self-contained test case? It does not matter if the file is long as long as the problem can be easily reproduced.

Regards,
--
Michael

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE
On 9/1/15 1:08 AM, Michael Paquier wrote:
> On Sun, Aug 30, 2015 at 1:06 AM, Jim Nasby wrote:
>> Steps to reproduce:
>> Download https://github.com/BlueTreble/test_factory/archive/crash.zip
>> Unzip, cd into directory
>> pgxn install pgtap (or just make test)
>
> FWIW, make test fails:
> ! ERROR: 42703: column "c_data_table_name" does not exist
> ! LINE 4: , c_data_table_name
>
>> make install if you didn't do make test
>> psql -f crash.sql
>
> As does this one with the same error. I am not exactly sure what I am
> missing.
>
> Regards,

Well nuts, pretty sure that means the error isn't reproducing for you. :/ Do you maybe have unusual config options or postgresql.conf options?

Here's a log of how I can reproduce from a just-pulled copy of HEAD:

decibel@decina:[17:22]~/pgsql/HEAD (master $=)$ head config.log | grep ./conf
$ ./configure --with-includes=/opt/local/include --with-libraries=/opt/local/lib --with-perl --with-python --enable-depend -C --enable-tap-tests --prefix=/Users/decibel/pgsql/HEAD/i --with-pgport= --enable-debug CC=ccache clang -Qunused-arguments -fcolor-diagnostics --enable-cassert CFLAGS=-O0 --no-create --no-recursion
decibel@decina:[17:22]~/pgsql/HEAD (master $=)$ make; make install; pg_ctl init; pg_ctl start; createdb
decibel@decina:[17:20]~/tmp$ mv ~/Downloads/test_factory-crash.zip .
decibel@decina:[17:20]~/tmp$ open test_factory-crash.zip
decibel@decina:[17:20]~/tmp$ cd test_factory-crash/
decibel@decina:[17:20]~/tmp/test_factory-crash$ pgHEAD
decibel@decina:[17:20]~/tmp/test_factory-crash$ make test
rm -rf sql/test_factory--0.1.1.sql
rm -rf results/ regression.diffs regression.out tmp_check/ log/
cp sql/test_factory.sql sql/test_factory--0.1.1.sql
/bin/sh /Users/decibel/pgsql/HEAD/i/lib/pgxs/src/makefiles/../../config/install-sh -c -d '/Users/decibel/pgsql/HEAD/i/share/extension'
/bin/sh /Users/decibel/pgsql/HEAD/i/lib/pgxs/src/makefiles/../../config/install-sh -c -d '/Users/decibel/pgsql/HEAD/i/share/doc/extension'
/usr/bin/install -c -m 644 .//test_factory.control '/Users/decibel/pgsql/HEAD/i/share/extension/'
/usr/bin/install -c -m 644 .//doc/test_factory.asc '/Users/decibel/pgsql/HEAD/i/share/doc/extension/'
/Users/decibel/pgsql/HEAD/i/lib/pgxs/src/makefiles/../../src/test/regress/pg_regress --inputdir=./ --bindir='/Users/decibel/pgsql/HEAD/i/bin' --inputdir=test --load-language=plpgsql --dbname=contrib_regression base
(using postmaster on Unix socket, default port)
== dropping database "contrib_regression" ==
NOTICE: database "contrib_regression" does not exist, skipping
DROP DATABASE
== creating database "contrib_regression" ==
CREATE DATABASE
ALTER DATABASE
== installing plpgsql ==
CREATE LANGUAGE
== running regression test queries ==
test base ... FAILED (test process exited with exit code 2)
== 1 of 1 tests failed. ==
The differences that caused some tests to fail can be viewed in the file "/Users/decibel/tmp/test_factory-crash/regression.diffs". A copy of the test summary that you see above is saved in the file "/Users/decibel/tmp/test_factory-crash/regression.out".
make: [installcheck] Error 1 (ignored)

*** /Users/decibel/tmp/test_factory-crash/test/expected/base.out Sat Aug 29 08:50:08 2015
--- /Users/decibel/tmp/test_factory-crash/results/base.out Tue Sep 1 17:21:04 2015
***************
*** 1,18 ****
  \set ECHO none
! ok 1 - Create customer table
ok 2 - Register test customers ! ok 3 - Create function customer__add ! ok 4 - Create invoice table ! ok 5 - Register test invoices ! ok 6 - customer table is empty ! ok 7 - invoice table is empty ! ok 8 - invoice factory output ! ok 9 - invoice table content ! ok 10 - customer table content ! ok 11 - invoice factory second call ! ok 12 - invoice table content stayed constant ! ok 13 - customer table content stayed constant ! ok 14 - Test function factory ! ok 15 - customer table has new row ! ok 16 - truncate invoice ! ok 17 - invoice factory get remains the same after truncate --- 1,10 \set ECHO none ! ok 1 - Register test customers ! ok 2 - Create function customer__add ! ok 3 - Register test invoices ! ok 4 - customer table is empty ! ok 5 - invoice table is empty ! server closed the connection unexpectedly ! This probably means the server terminated abnormally ! before or while processing the request. ! connection to server was lost == decibel@decina:[17:21]~/tmp/test_factory-crash$ -- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX Experts in Analytics, Data Architecture and PostgreSQL Data in Trouble? Get it in Treble! http://BlueTreble.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE
On Sun, Aug 30, 2015 at 1:06 AM, Jim Nasby wrote:
> Steps to reproduce:
> Download https://github.com/BlueTreble/test_factory/archive/crash.zip
> Unzip, cd into directory
> pgxn install pgtap (or just make test)

FWIW, make test fails:
! ERROR:  42703: column "c_data_table_name" does not exist
! LINE 4:         , c_data_table_name

> make install if you didn't do make test
> psql -f crash.sql

As does this one with the same error. I am not exactly sure what I am
missing.

Regards,
--
Michael
Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE
On 8/29/15 11:31 AM, Andres Freund wrote:
> Hi,
>
> On 2015-08-29 11:06:05 -0500, Jim Nasby wrote:
>> Stack trace below. Relevant assert:
>
> What exact revision is this on? Either there's some aggressive inlining
> going on, or these lines don't entirely match up with HEAD.

Oops, that was 9.4.1. Trace from just-pulled HEAD below, compiled with -O0.

> That's compiled with optimization, isn't it? Could you compile with -O0?
>
> Can you go up to
>> frame #19: 0x000116295f1f plpgsql.so`exec_stmt_block(estate=0x7fff532c0090, block=0x7fa3f53a21d0) + 2095 at pl_exec.c:1300
> and do 'disass/m' to see where we actually are?

That's now 20 instead of 19. disass/m gives an error (note my gdb is tweaked
for Apple and everything else I've posted here is from lldb). Output from
'disassemble' in frame 20 at bottom.

> The area around #19 likely is the PG_CATCH inside exec_stmt_block. So
> what's happening is that there's an exception raised, while handling the
> previous exception. And apparently we're not doing that entirely right.
> Besides the line numbers, "during exception cleanup" hints at that.

If the line numbers and the 'gui' mode of lldb are to be believed, frame 20
is NOT in the try block, but #9 is:

   1242		PG_CATCH();
   1243		{
   1244			ErrorData  *edata;
   1245			ListCell   *e;
   1246
   1247			estate->err_text = gettext_noop("during exception cleanup");
   1248
   1249			/* Save error info */
   1250			MemoryContextSwitchTo(oldcontext);
   1251			edata = CopyErrorData();
   1252			FlushErrorState();
   1253
   1254			/* Abort the inner transaction */
-> 1255			RollbackAndReleaseCurrentSubTransaction();

Looking at frame 10:

   432		 */
   433		estate.err_text = NULL;
   434		estate.err_stmt = (PLpgSQL_stmt *) (func->action);
-> 435		rc = exec_stmt_block(&estate, func->action);
   436		if (rc != PLPGSQL_RC_RETURN)
   437		{
   438			estate.err_stmt = NULL;

func->fn_signature is results_eq(refcursor,refcursor,text), which is a pgTap
function that has an exception handler, and I know the code that it's calling
also has an exception handler.

(lldb) bt
* thread #1: tid = 0x3c8fb4, 0x7fff92a5a866 libsystem_kernel.dylib`__pthread_kill + 10, queue = 'com.apple.main-thread', stop reason = signal SIGABRT
  * frame #0: 0x7fff92a5a866 libsystem_kernel.dylib`__pthread_kill + 10
    frame #1: 0x7fff9001b35c libsystem_pthread.dylib`pthread_kill + 92
    frame #2: 0x7fff8cf89b1a libsystem_c.dylib`abort + 125
    frame #3: 0x00010fcf3b99 postgres`ExceptionalCondition(conditionName=0x00010fe157e9, errorType=0x00010fd60053, fileName=0x00010fe14a7f, lineNumber=2055) + 137 at assert.c:54
    frame #4: 0x00010fcda0ac postgres`RelationClearRelation(relation=0x00011954f388, rebuild='\0') + 140 at relcache.c:2053
    frame #5: 0x00010
Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE
Hi,

On 2015-08-29 11:06:05 -0500, Jim Nasby wrote:
> Stack trace below. Relevant assert:

What exact revision is this on? Either there's some aggressive inlining
going on, or these lines don't entirely match up with HEAD.

That's compiled with optimization, isn't it? Could you compile with -O0?

Can you go up to
>> frame #19: 0x000116295f1f plpgsql.so`exec_stmt_block(estate=0x7fff532c0090, block=0x7fa3f53a21d0) + 2095 at pl_exec.c:1300
and do 'disass/m' to see where we actually are?

The area around #19 likely is the PG_CATCH inside exec_stmt_block. So
what's happening is that there's an exception raised, while handling the
previous exception. And apparently we're not doing that entirely right.
Besides the line numbers, "during exception cleanup" hints at that.

Greetings,

Andres Freund
Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE
On 8/29/15 8:04 AM, Tom Lane wrote:
> Michael Paquier writes:
>> Ah, OK, you meant this file... Yes I was able to receive it as well in
>> your original email. I'll try to investigate further later, but Tom may
>> beat me first. He usually does.
>
> Jim had indicated the bug wasn't reproducible on the strength of that
> info, so I was waiting for him to provide a more reproducible case.

It was reproducible, just not very self contained. [1] is a bit better, but
it still involves pgTap as well as test_factory.

Steps to reproduce:
Download https://github.com/BlueTreble/test_factory/archive/crash.zip
Unzip, cd into directory
pgxn install pgtap (or just make test)
make install if you didn't do make test
psql -f crash.sql

Stack trace below. Relevant assert:

   1967	 * As per notes above, a rel to be rebuilt MUST have refcnt > 0; while of
   1968	 * course it would be a bad idea to blow away one with nonzero refcnt.
   1969	 */
-> 1970	Assert(rebuild ?
   1971		   !RelationHasReferenceCountZero(relation) :
   1972		   RelationHasReferenceCountZero(relation));

Relevant bits of relation:

(Relation) relation = 0x0001165832a8
  (RelFileNode) rd_node
  (SMgrRelationData *) rd_smgr = 0x
  (int) rd_refcnt = 1
  (BackendId) rd_backend = 2
  (bool) rd_islocaltemp = '\x01'
  (bool) rd_isnailed = '\0'
  (bool) rd_isvalid = '\x01'
  (char) rd_indexvalid = '\0'
  (SubTransactionId) rd_createSubid = 16
  (SubTransactionId) rd_newRelfilenodeSubid = 0
  (Form_pg_class) rd_rel = 0x0001165838d8
    (NameData) relname
      (char [64]) data = "invoice_03"

rebuild is 0.

[1] https://github.com/BlueTreble/test_factory/blob/crash/crash.sql

(lldb) bt
* thread #1: tid = 0x3b03a0, 0x7fff92a5a866 libsystem_kernel.dylib`__pthread_kill + 10, queue = 'com.apple.main-thread', stop reason = signal SIGABRT
  * frame #0: 0x7fff92a5a866 libsystem_kernel.dylib`__pthread_kill + 10
    frame #1: 0x7fff9001b35c libsystem_pthread.dylib`pthread_kill + 92
    frame #2: 0x7fff8cf89b1a libsystem_c.dylib`abort + 125
    frame #3: 0x00010cdb4039 postgres`ExceptionalCondition(conditionName=0x00010cf59cfd, errorType=0x00010cec392d, fileName=0x00010cf59045, lineNumber=1972) + 137 at assert.c:54
    frame #4: 0x00010cd9b332 postgres`RelationClearRelation(relation=0x000116594cd8, rebuild='\0') + 162 at relcache.c:1970
    frame #5: 0x00010cd9cc0f postgres`AtEOSubXact_cleanup(relation=0x000116594cd8, isCommit='\0', mySubid=15, parentSubid=14) + 79 at relcache.c:2665
    frame #6: 0x00010cd9cb92 postgres`AtEOSubXact_RelationCache(isCommit='\0', mySubid=15, parentSubid=14) + 242 at relcache.c:2633
    frame #7: 0x00010c9b6e88 postgres`AbortSubTransaction + 440 at xact.c:4373
    frame #8: 0x00010c9b8ef2 postgres`RollbackAndReleaseCurrentSubTransaction + 178 at xact.c:3948
    frame #9: 0x000116295c93 plpgsql.so`exec_stmt_block(estate=0x000
Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE
Michael Paquier writes:
> Ah, OK, you meant this file... Yes I was able to receive it as well in
> your original email. I'll try to investigate further later, but Tom may
> beat me first. He usually does.

Jim had indicated the bug wasn't reproducible on the strength of that info,
so I was waiting for him to provide a more reproducible case.

			regards, tom lane
Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE
On Sat, Aug 29, 2015 at 11:18 AM, Jim Nasby wrote:
> On 8/28/15 8:39 PM, Tom Lane wrote:
>> Michael Paquier writes:
>>> On Sat, Aug 29, 2015 at 5:02 AM, Jim Nasby wrote:
>>>> Looks like a 98k file won't get through the list...
>>>
>>> Is it compressed? Note that we have sometimes larger patches than that,
>>> but perhaps those had special permissions by the admins of this list.
>>
>> Messages significantly larger than that go through all the time. Maybe
>> you had it marked with some weird MIME type?
>
> Apparently the original email did go through and my MUA search just failed
> to find it. Sorry for the noise.
>
> Original email:
> http://www.postgresql.org/message-id/55dfaf18.4060...@bluetreble.com

Ah, OK, you meant this file... Yes I was able to receive it as well in your
original email. I'll try to investigate further later, but Tom may beat me
first. He usually does.
--
Michael
Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE
On 8/28/15 8:39 PM, Tom Lane wrote:
> Michael Paquier writes:
>> On Sat, Aug 29, 2015 at 5:02 AM, Jim Nasby wrote:
>>> Looks like a 98k file won't get through the list...
>>
>> Is it compressed? Note that we have sometimes larger patches than that,
>> but perhaps those had special permissions by the admins of this list.
>
> Messages significantly larger than that go through all the time. Maybe
> you had it marked with some weird MIME type?

Apparently the original email did go through and my MUA search just failed
to find it. Sorry for the noise.

Original email:
http://www.postgresql.org/message-id/55dfaf18.4060...@bluetreble.com
Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE
Michael Paquier writes:
> On Sat, Aug 29, 2015 at 5:02 AM, Jim Nasby wrote:
>> Looks like a 98k file won't get through the list...
>
> Is it compressed? Note that we have sometimes larger patches than that,
> but perhaps those had special permissions by the admins of this list.

Messages significantly larger than that go through all the time. Maybe you
had it marked with some weird MIME type?

			regards, tom lane
Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE
On Sat, Aug 29, 2015 at 5:02 AM, Jim Nasby wrote:
> Looks like a 98k file won't get through the list...

Is it compressed? Note that we sometimes have larger patches than that, but
perhaps those had special permissions by the admins of this list.
--
Michael