Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE
On Sun, Jan 31, 2016 at 11:50 AM, Tom Lane wrote:
> Noah Misch writes:
>> On Sat, Jan 30, 2016 at 04:28:47PM -0500, Robert Haas wrote:
>>> I think I've pretty much said what I have to say about this; if nothing
>>> I wrote up until now swayed you, it's unlikely that anything else I say
>>> after this point will either.
>
>> Say I drop the parts that change the binary.  Does the attached v2 manage
>> to improve PostgreSQL, or is it neutral-or-harmful like v1?
>
> There can surely be no objection to improving these comments.  However,
> I'm not convinced that we should word the comments to insist that the
> hypothetical cases are bugs.  As I said before, I do not think there is an
> API contract that would promise that portals don't reach here in ACTIVE
> state.  So IMO it's fair to note that no such case can arise currently,
> but not to state that it's a bug if it does.  So for example I'd reword
> your last comment addition along the lines of "Currently, every
> MarkPortalActive() caller ensures it updates the portal status again
> before relinquishing control, so that ACTIVE can't happen here.  If it
> does happen, dispose of the portal like existing MarkPortalActive()
> callers would."

+1.  I think Noah's comment additions constitute useful and helpful
information, but I too am doubtful about the characterization of the
situations in question as bugs.  I don't see what the evidence for that
is, especially given that it's quite hard to predict how the code might
change in the future.  However, to reiterate, a more neutral presentation
of the same facts would get my vote.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE
Noah Misch writes:
> On Sat, Jan 30, 2016 at 04:28:47PM -0500, Robert Haas wrote:
>> I think I've pretty much said what I have to say about this; if nothing
>> I wrote up until now swayed you, it's unlikely that anything else I say
>> after this point will either.

> Say I drop the parts that change the binary.  Does the attached v2 manage to
> improve PostgreSQL, or is it neutral-or-harmful like v1?

There can surely be no objection to improving these comments.  However,
I'm not convinced that we should word the comments to insist that the
hypothetical cases are bugs.  As I said before, I do not think there is an
API contract that would promise that portals don't reach here in ACTIVE
state.  So IMO it's fair to note that no such case can arise currently,
but not to state that it's a bug if it does.  So for example I'd reword
your last comment addition along the lines of "Currently, every
MarkPortalActive() caller ensures it updates the portal status again
before relinquishing control, so that ACTIVE can't happen here.  If it
does happen, dispose of the portal like existing MarkPortalActive()
callers would."

			regards, tom lane
Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE
On Sat, Jan 30, 2016 at 04:28:47PM -0500, Robert Haas wrote:
> I think I've pretty much said what I have to say about this; if nothing
> I wrote up until now swayed you, it's unlikely that anything else I say
> after this point will either.

Say I drop the parts that change the binary.  Does the attached v2 manage to
improve PostgreSQL, or is it neutral-or-harmful like v1?

On Sat, Jan 30, 2016 at 04:28:47PM -0500, Robert Haas wrote:
> On Sat, Jan 30, 2016 at 2:13 PM, Noah Misch wrote:
> > On Sat, Jan 30, 2016 at 07:37:45AM -0500, Robert Haas wrote:
> > > On Fri, Jan 29, 2016 at 11:19 PM, Noah Misch wrote:
> > > > On Thu, Jan 28, 2016 at 11:12:15AM -0500, Robert Haas wrote:
> > > > > The code you quote emits a warning about a reasonably foreseeable
> > > > > situation that can never be right, but there's no particular
> > > > > reason to think that MarkPortalFailed is the wrong thing to do
> > > > > here if that situation came up.
> > > >
> > > > I have two reasons to expect these MarkPortalFailed() calls will be
> > > > the wrong thing for hypothetical code reaching them.  First,
> > > > PortalRun() and peers reset ActivePortal and PortalContext on error
> > > > in addition to fixing portal status with MarkPortalFailed().  If
> > > > code forgets to update the status, there's a substantial chance it
> > > > forgot the other two.  My patch added a comment explaining the
> > > > second reason:
> > > >
> > > > +	/*
> > > > +	 * See similar code in AtSubAbort_Portals().  This would fire if code
> > > > +	 * orchestrating multiple top-level transactions within a portal, such
> > > > +	 * as VACUUM, caught errors and continued under the same portal with a
> > > > +	 * fresh transaction.  No part of core PostgreSQL functions that way,
> > > > +	 * though it's a fair thing to want.  Such code would wish the portal
> > > > +	 * to remain ACTIVE, as in PreCommit_Portals(); we don't cater to
> > > > +	 * that.  Instead, like AtSubAbort_Portals(), interpret this as a bug.
> > > > +	 */
> > >
> > > You may be right, but then again Tom had a different opinion, even
> > > after seeing your patch, and he's no dummy.
> >
> > Eh?  Tom last posted to this thread before I first posted a patch.
>
> http://www.postgresql.org/message-id/29758.1451780...@sss.pgh.pa.us
> seems to me to be a vote against the concept embodied by the patch.

That much is true.  The order of postings is the opposite of what you
stated, and there's no mailing list evidence that anyone other than you
has read my explanation of why MarkPortalFailed() is wrong here.
Alternately, I demand the schematics for Tom's time machine.

diff --git a/src/backend/utils/mmgr/portalmem.c b/src/backend/utils/mmgr/portalmem.c
index a53673c..c840aa6 100644
--- a/src/backend/utils/mmgr/portalmem.c
+++ b/src/backend/utils/mmgr/portalmem.c
@@ -765,7 +765,14 @@ AtAbort_Portals(void)
 	{
 		Portal		portal = hentry->portal;
 
-		/* Any portal that was actually running has to be considered broken */
+		/*
+		 * See similar code in AtSubAbort_Portals().  This would fire if code
+		 * orchestrating multiple top-level transactions within a portal, such
+		 * as VACUUM, caught errors and continued under the same portal with a
+		 * fresh transaction.  No part of core PostgreSQL functions that way.
+		 * XXX Such code would wish the portal to remain ACTIVE, as in
+		 * PreCommit_Portals().
+		 */
 		if (portal->status == PORTAL_ACTIVE)
 			MarkPortalFailed(portal);
 
@@ -919,9 +926,11 @@ AtSubAbort_Portals(SubTransactionId mySubid,
 				portal->activeSubid = parentSubid;
 
 				/*
-				 * Upper-level portals that failed while running in this
-				 * subtransaction must be forced into FAILED state, for the
-				 * same reasons discussed below.
+				 * If a bug in a MarkPortalActive() caller has it miss cleanup
+				 * after having failed while running an upper-level portal in
+				 * this subtransaction, we don't know what else in the portal
+				 * is wrong.  Force it into FAILED state, for the same reasons
+				 * discussed below.
 				 *
 				 * We assume we can get away without forcing upper-level READY
 				 * portals to fail, even if they were run and then suspended.
@@ -960,7 +969,11 @@ AtSubAbort_Portals(SubTransactionId mySubid,
Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE
On Sat, Jan 30, 2016 at 2:13 PM, Noah Misch wrote:
> You could offer that paragraph as an objection to almost all Assert(),
> elog(), and automated tests.  Why levy it against this patch?  The
> valuable ways assertions and tests supplement review are well-established.

Sure, that's true, but I don't view all situations in the same way, so I
don't write the same thing in answer to each one.  I think I've pretty
much said what I have to say about this; if nothing I wrote up until now
swayed you, it's unlikely that anything else I say after this point will
either.

>> You may be right, but then again Tom had a different opinion, even
>> after seeing your patch, and he's no dummy.
>
> Eh?  Tom last posted to this thread before I first posted a patch.

http://www.postgresql.org/message-id/29758.1451780...@sss.pgh.pa.us
seems to me to be a vote against the concept embodied by the patch.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE
On Sat, Jan 30, 2016 at 07:37:45AM -0500, Robert Haas wrote:
> On Fri, Jan 29, 2016 at 11:19 PM, Noah Misch wrote:
>> As you say, forbidding things makes friction in the event that someone
>> comes along wanting to do the forbidden thing.  Forbidding things also
>> simplifies the system, making it easier to verify.  This decision should
>> depend on the API's audience.  We have prominently-public APIs like
>> lsyscache.h, stringinfo.h and funcapi.h.  They should cater to
>> reasonably-foreseeable use cases, not just what yesterday's users have
>> actually used.  We then have narrow-interest APIs like subselect.h and
>> view.h.  For those, the value of a simpler system exceeds the
>> risk-adjusted cost of friction.  They should impose strict constraints
>> on their callers.
>>
>> Portal status belongs to the second category.  PortalRun(),
>> PortalRunFetch() and PersistHoldablePortal() are the three core
>> functions that place portals into PORTAL_ACTIVE status.  No PGXN module
>> does so; none so much as references a PortalStatus constant or
>> MarkPortal{Active,Done,Failed}() function.  If someone adds a fourth
>> core portal runner, the system will be simpler and better if that
>> function replicates the structure of the existing three.
>
> I won't argue with that, but it strikes me that if I were reviewing a
> patch to do such a thing, I'd surely compare the new caller against
> the existing ones, so any failure to adhere to the same design pattern
> would be caught that way.  I expect other reviewers and/or committers
> would almost certainly do something similar.  If we presuppose
> incompetence on the part of future reviewers and committers, no action
> taken now can secure us.

You could offer that paragraph as an objection to almost all Assert(),
elog(), and automated tests.  Why levy it against this patch?  The
valuable ways assertions and tests supplement review are well-established.

> You may be right, but then again Tom had a different opinion, even
> after seeing your patch, and he's no dummy.

Eh?  Tom last posted to this thread before I first posted a patch.
Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE
On Fri, Jan 29, 2016 at 11:19 PM, Noah Misch wrote:
> As you say, forbidding things makes friction in the event that someone
> comes along wanting to do the forbidden thing.  Forbidding things also
> simplifies the system, making it easier to verify.  This decision should
> depend on the API's audience.  We have prominently-public APIs like
> lsyscache.h, stringinfo.h and funcapi.h.  They should cater to
> reasonably-foreseeable use cases, not just what yesterday's users have
> actually used.  We then have narrow-interest APIs like subselect.h and
> view.h.  For those, the value of a simpler system exceeds the
> risk-adjusted cost of friction.  They should impose strict constraints
> on their callers.
>
> Portal status belongs to the second category.  PortalRun(),
> PortalRunFetch() and PersistHoldablePortal() are the three core functions
> that place portals into PORTAL_ACTIVE status.  No PGXN module does so;
> none so much as references a PortalStatus constant or
> MarkPortal{Active,Done,Failed}() function.  If someone adds a fourth core
> portal runner, the system will be simpler and better if that function
> replicates the structure of the existing three.

I won't argue with that, but it strikes me that if I were reviewing a
patch to do such a thing, I'd surely compare the new caller against
the existing ones, so any failure to adhere to the same design pattern
would be caught that way.  I expect other reviewers and/or committers
would almost certainly do something similar.  If we presuppose
incompetence on the part of future reviewers and committers, no action
taken now can secure us.

>> The code you quote emits a warning
>> about a reasonably foreseeable situation that can never be right, but
>> there's no particular reason to think that MarkPortalFailed is the
>> wrong thing to do here if that situation came up.
>
> I have two reasons to expect these MarkPortalFailed() calls will be the
> wrong thing for hypothetical code reaching them.  First, PortalRun() and
> peers reset ActivePortal and PortalContext on error in addition to fixing
> portal status with MarkPortalFailed().  If code forgets to update the
> status, there's a substantial chance it forgot the other two.  My patch
> added a comment explaining the second reason:
>
> +	/*
> +	 * See similar code in AtSubAbort_Portals().  This would fire if code
> +	 * orchestrating multiple top-level transactions within a portal, such
> +	 * as VACUUM, caught errors and continued under the same portal with a
> +	 * fresh transaction.  No part of core PostgreSQL functions that way,
> +	 * though it's a fair thing to want.  Such code would wish the portal
> +	 * to remain ACTIVE, as in PreCommit_Portals(); we don't cater to
> +	 * that.  Instead, like AtSubAbort_Portals(), interpret this as a bug.
> +	 */

You may be right, but then again Tom had a different opinion, even
after seeing your patch, and he's no dummy.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE
On Thu, Jan 28, 2016 at 11:12:15AM -0500, Robert Haas wrote:
> I'm honestly failing to understand why we should change anything at
> all.  I don't believe that doing something more severe than marking
> the portal failed actually improves anything.  I suppose if I had to
> pick between what you proposed before and elog(FATAL) I'd pick the
> latter, but I see no real reason to cut off future code (or
> third-party code) at the knees like that.  I don't see it as either
> necessary or desirable to forbid something just because there's no
> clear present use case for it.

As you say, forbidding things makes friction in the event that someone
comes along wanting to do the forbidden thing.  Forbidding things also
simplifies the system, making it easier to verify.  This decision should
depend on the API's audience.  We have prominently-public APIs like
lsyscache.h, stringinfo.h and funcapi.h.  They should cater to
reasonably-foreseeable use cases, not just what yesterday's users have
actually used.  We then have narrow-interest APIs like subselect.h and
view.h.  For those, the value of a simpler system exceeds the
risk-adjusted cost of friction.  They should impose strict constraints on
their callers.

Portal status belongs to the second category.  PortalRun(),
PortalRunFetch() and PersistHoldablePortal() are the three core functions
that place portals into PORTAL_ACTIVE status.  No PGXN module does so;
none so much as references a PortalStatus constant or
MarkPortal{Active,Done,Failed}() function.  If someone adds a fourth core
portal runner, the system will be simpler and better if that function
replicates the structure of the existing three.

> The code you quote emits a warning
> about a reasonably foreseeable situation that can never be right, but
> there's no particular reason to think that MarkPortalFailed is the
> wrong thing to do here if that situation came up.

I have two reasons to expect these MarkPortalFailed() calls will be the
wrong thing for hypothetical code reaching them.  First, PortalRun() and
peers reset ActivePortal and PortalContext on error in addition to fixing
portal status with MarkPortalFailed().  If code forgets to update the
status, there's a substantial chance it forgot the other two.  My patch
added a comment explaining the second reason:

+	/*
+	 * See similar code in AtSubAbort_Portals().  This would fire if code
+	 * orchestrating multiple top-level transactions within a portal, such
+	 * as VACUUM, caught errors and continued under the same portal with a
+	 * fresh transaction.  No part of core PostgreSQL functions that way,
+	 * though it's a fair thing to want.  Such code would wish the portal
+	 * to remain ACTIVE, as in PreCommit_Portals(); we don't cater to
+	 * that.  Instead, like AtSubAbort_Portals(), interpret this as a bug.
+	 */
Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE
On Wed, Jan 27, 2016 at 11:47 PM, Noah Misch wrote:
> On Wed, Jan 27, 2016 at 11:04:33PM -0500, Robert Haas wrote:
>> +		Assert(portal->status != PORTAL_ACTIVE);
>> 		if (portal->status == PORTAL_ACTIVE)
>> 			MarkPortalFailed(portal);
>>
>> Now that just looks kooky to me.  We assert that the portal isn't
>> active, but then cater to the possibility that it might be anyway?
>
> Right.
>
>> That seems totally contrary to our usual programming practices, and a
>> bad idea for that reason.
>
> It is contrary to our usual programming practices, I agree.  I borrowed
> the idea from untenured code (da3751c8, 2015-11-11) in
> load_relcache_init_file():
>
> 	if (nailed_rels != NUM_CRITICAL_SHARED_RELS ||
> 		nailed_indexes != NUM_CRITICAL_SHARED_INDEXES)
> 	{
> 		elog(WARNING, "found %d nailed shared rels and %d nailed shared indexes in init file, but expected %d and %d respectively",
> 			 nailed_rels, nailed_indexes,
> 			 NUM_CRITICAL_SHARED_RELS, NUM_CRITICAL_SHARED_INDEXES);
> 		/* Make sure we get developers' attention about this */
> 		Assert(false);
>
> I liked this pattern.  It's a good fit for cases that we design to be
> impossible yet for which we have a workaround if they do happen.  That
> being said, if you feel it's bad, I would be fine using elog(FATAL).  I
> envision this as a master-only change in any case.  No PGXN module
> references PORTAL_ACTIVE or MarkPortalActive(), so it's unlikely that
> extension code will notice the change whether in Assert() form or in
> elog() form.  What is best?

I'm honestly failing to understand why we should change anything at
all.  I don't believe that doing something more severe than marking
the portal failed actually improves anything.  I suppose if I had to
pick between what you proposed before and elog(FATAL) I'd pick the
latter, but I see no real reason to cut off future code (or
third-party code) at the knees like that.  I don't see it as either
necessary or desirable to forbid something just because there's no
clear present use case for it.  The code you quote emits a warning
about a reasonably foreseeable situation that can never be right, but
there's no particular reason to think that MarkPortalFailed is the
wrong thing to do here if that situation came up.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE
On Wed, Jan 27, 2016 at 11:04:33PM -0500, Robert Haas wrote:
> +		Assert(portal->status != PORTAL_ACTIVE);
> 		if (portal->status == PORTAL_ACTIVE)
> 			MarkPortalFailed(portal);
>
> Now that just looks kooky to me.  We assert that the portal isn't
> active, but then cater to the possibility that it might be anyway?

Right.

> That seems totally contrary to our usual programming practices, and a
> bad idea for that reason.

It is contrary to our usual programming practices, I agree.  I borrowed
the idea from untenured code (da3751c8, 2015-11-11) in
load_relcache_init_file():

	if (nailed_rels != NUM_CRITICAL_SHARED_RELS ||
		nailed_indexes != NUM_CRITICAL_SHARED_INDEXES)
	{
		elog(WARNING, "found %d nailed shared rels and %d nailed shared indexes in init file, but expected %d and %d respectively",
			 nailed_rels, nailed_indexes,
			 NUM_CRITICAL_SHARED_RELS, NUM_CRITICAL_SHARED_INDEXES);
		/* Make sure we get developers' attention about this */
		Assert(false);

I liked this pattern.  It's a good fit for cases that we design to be
impossible yet for which we have a workaround if they do happen.  That
being said, if you feel it's bad, I would be fine using elog(FATAL).  I
envision this as a master-only change in any case.  No PGXN module
references PORTAL_ACTIVE or MarkPortalActive(), so it's unlikely that
extension code will notice the change whether in Assert() form or in
elog() form.  What is best?
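[Editor's sketch] The warn-then-assert pattern borrowed above can be
illustrated outside PostgreSQL in a few lines.  The names below are
invented for illustration (this is not the real elog()/Assert()
machinery): warn at runtime so production builds can fall back, and trip
an assertion so assert-enabled development builds stop on the spot.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-in for a cached-count sanity check; the constant and
 * function names are invented, not taken from the PostgreSQL source. */
#define EXPECTED_NAILED_RELS 3

/* Returns 1 when the cached count matches expectations.  On mismatch it
 * warns (so production builds, compiled with NDEBUG, can recover by
 * discarding the cache) and trips an assertion so development builds
 * abort and get a developer's attention. */
static int
cache_count_ok(int nailed_rels)
{
	if (nailed_rels != EXPECTED_NAILED_RELS)
	{
		fprintf(stderr, "WARNING: found %d nailed rels, expected %d\n",
				nailed_rels, EXPECTED_NAILED_RELS);
		/* Make sure we get developers' attention about this */
		assert(!"unexpected nailed-rel count");
		return 0;			/* production fallback: treat cache as unusable */
	}
	return 1;
}
```

With NDEBUG defined, the assert() compiles away and only the warning plus
fallback remain, which is exactly the "backup only" behavior being
debated in this thread.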
Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE
On Thu, Jan 28, 2016 at 12:57:36PM +0900, Michael Paquier wrote:
> On Thu, Jan 28, 2016 at 12:40 PM, Noah Misch wrote:
>> + * fresh transaction.  No part of core PostgreSQL functions that way,
>> + * though it's a fair thing to want.  Such code would wish the portal
>
> From the point of view of core code, this stands true, but, for my 2c,
> honestly, I think that is just going to annoy more people working on
> plugins and forks of Postgres.  When working on Postgres-XC and
> developing stuff for the core code, I recall having been annoyed a
> couple of times by similar assert limitations

At first, I left out that assertion in case some extension code did the
thing I described, perhaps in a background worker.  I then realized that
MarkPortalFailed() is the wrong thing for such code, which would want
treatment similar to this bit of PreCommit_Portals():

		/*
		 * Do not touch active portals --- this can only happen in the case of
		 * a multi-transaction utility command, such as VACUUM.
		 *
		 * Note however that any resource owner attached to such a portal is
		 * still going to go away, so don't leave a dangling pointer.
		 */
		if (portal->status == PORTAL_ACTIVE)
		{
			portal->resowner = NULL;
			continue;
		}

If you can think of a case where the code would work okay despite its
active portal being marked as failed, that would be a good reason to omit
the one assertion.  Otherwise, an assertion seems better than silently
doing the known-wrong thing.
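[Editor's sketch] As a toy model of the treatment quoted above — the
types and function name below are invented stand-ins; the real Portal in
portalmem.c is far richer — the commit-time rule is: leave an ACTIVE
portal's status alone, but null out its about-to-be-destroyed resource
owner so no dangling pointer survives.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for PostgreSQL's portal bookkeeping, invented for
 * illustration only. */
typedef enum { PORTAL_READY, PORTAL_ACTIVE, PORTAL_DONE, PORTAL_FAILED } PortalStatus;

typedef struct
{
	PortalStatus status;
	void	   *resowner;		/* resource owner, destroyed at commit */
} Portal;

/* Commit-time handling for one portal: an ACTIVE portal (e.g. a
 * multi-transaction utility command such as VACUUM) keeps its status, but
 * its resource owner is about to go away, so clear the pointer instead of
 * leaving it dangling.  Non-active portals would be cleaned up fully. */
static void
precommit_handle_portal(Portal *portal)
{
	if (portal->status == PORTAL_ACTIVE)
	{
		portal->resowner = NULL;
		return;					/* do not otherwise touch active portals */
	}
	/* ... drop or persist non-active portals here ... */
}
```

This is the treatment Noah argues hypothetical multi-transaction portal
code would want, as opposed to having the portal forced into FAILED.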
Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE
On Wed, Jan 27, 2016 at 10:40 PM, Noah Misch wrote:
> On Sun, Jan 03, 2016 at 12:35:56AM -0500, Noah Misch wrote:
>> On Sat, Jan 02, 2016 at 07:22:13PM -0500, Tom Lane wrote:
>>> Noah Misch writes:
>>>> I am inclined to add an Assert(portal->status != PORTAL_ACTIVE) to
>>>> emphasize that this is backup only.  MarkPortalActive() callers remain
>>>> responsible for updating the status to something else before
>>>> relinquishing control.
>>>
>>> No, I do not think that would be an improvement.  There is no contract
>>> saying that this must be done earlier, IMO.
>>
>> Indeed, nobody wrote a contract.  The assertion would record what has
>> been the sole standing practice for eleven years (since commit
>> a393fbf9).  It would prompt discussion if a proposed patch would depart
>> from that practice, and that is a good thing.  Also, every addition of
>> dead code should label that code to aid future readers.
>
> Here's the patch I have in mind.  AtAbort_Portals() has an older
> MarkPortalFailed() call, and the story is somewhat different there per
> its new comment.  That call is plausible to reach with no bug involved,
> but MarkPortalFailed() would then be the wrong thing.

+		Assert(portal->status != PORTAL_ACTIVE);
 		if (portal->status == PORTAL_ACTIVE)
 			MarkPortalFailed(portal);

Now that just looks kooky to me.  We assert that the portal isn't
active, but then cater to the possibility that it might be anyway?
That seems totally contrary to our usual programming practices, and a
bad idea for that reason.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE
On Thu, Jan 28, 2016 at 12:40 PM, Noah Misch wrote:
> On Sun, Jan 03, 2016 at 12:35:56AM -0500, Noah Misch wrote:
>> On Sat, Jan 02, 2016 at 07:22:13PM -0500, Tom Lane wrote:
>>> Noah Misch writes:
>>>> I am inclined to add an Assert(portal->status != PORTAL_ACTIVE) to
>>>> emphasize that this is backup only.  MarkPortalActive() callers remain
>>>> responsible for updating the status to something else before
>>>> relinquishing control.
>>>
>>> No, I do not think that would be an improvement.  There is no contract
>>> saying that this must be done earlier, IMO.
>>
>> Indeed, nobody wrote a contract.  The assertion would record what has
>> been the sole standing practice for eleven years (since commit
>> a393fbf9).  It would prompt discussion if a proposed patch would depart
>> from that practice, and that is a good thing.  Also, every addition of
>> dead code should label that code to aid future readers.
>
> Here's the patch I have in mind.  AtAbort_Portals() has an older
> MarkPortalFailed() call, and the story is somewhat different there per
> its new comment.  That call is plausible to reach with no bug involved,
> but MarkPortalFailed() would then be the wrong thing.

+ * fresh transaction.  No part of core PostgreSQL functions that way,
+ * though it's a fair thing to want.  Such code would wish the portal

From the point of view of core code, this stands true, but, for my 2c,
honestly, I think that is just going to annoy more people working on
plugins and forks of Postgres.  When working on Postgres-XC and
developing stuff for the core code, I recall having been annoyed a
couple of times by similar assert limitations, because sometimes they did
not actually make sense in the context of what I was doing, and other
times things suddenly broke after bumping the fork's code base to a major
upgrade because a routine used up to now broke its contract.  Perhaps I
was doing it wrong at this point though, and should have used something
else.
-- 
Michael
Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE
On Sun, Jan 03, 2016 at 12:35:56AM -0500, Noah Misch wrote:
> On Sat, Jan 02, 2016 at 07:22:13PM -0500, Tom Lane wrote:
>> Noah Misch writes:
>>> I am inclined to add an Assert(portal->status != PORTAL_ACTIVE) to
>>> emphasize that this is backup only.  MarkPortalActive() callers remain
>>> responsible for updating the status to something else before
>>> relinquishing control.
>>
>> No, I do not think that would be an improvement.  There is no contract
>> saying that this must be done earlier, IMO.
>
> Indeed, nobody wrote a contract.  The assertion would record what has
> been the sole standing practice for eleven years (since commit a393fbf9).
> It would prompt discussion if a proposed patch would depart from that
> practice, and that is a good thing.  Also, every addition of dead code
> should label that code to aid future readers.

Here's the patch I have in mind.  AtAbort_Portals() has an older
MarkPortalFailed() call, and the story is somewhat different there per its
new comment.  That call is plausible to reach with no bug involved, but
MarkPortalFailed() would then be the wrong thing.

nm

diff --git a/src/backend/utils/mmgr/portalmem.c b/src/backend/utils/mmgr/portalmem.c
index a53673c..632b202 100644
--- a/src/backend/utils/mmgr/portalmem.c
+++ b/src/backend/utils/mmgr/portalmem.c
@@ -762,13 +762,22 @@ AtAbort_Portals(void)
 	hash_seq_init(&status, PortalHashTable);
 
 	while ((hentry = (PortalHashEnt *) hash_seq_search(&status)) != NULL)
 	{
 		Portal		portal = hentry->portal;
 
-		/* Any portal that was actually running has to be considered broken */
+		/*
+		 * See similar code in AtSubAbort_Portals().  This would fire if code
+		 * orchestrating multiple top-level transactions within a portal, such
+		 * as VACUUM, caught errors and continued under the same portal with a
+		 * fresh transaction.  No part of core PostgreSQL functions that way,
+		 * though it's a fair thing to want.  Such code would wish the portal
+		 * to remain ACTIVE, as in PreCommit_Portals(); we don't cater to
+		 * that.  Instead, like AtSubAbort_Portals(), interpret this as a bug.
+		 */
+		Assert(portal->status != PORTAL_ACTIVE);
 		if (portal->status == PORTAL_ACTIVE)
 			MarkPortalFailed(portal);
 
 		/*
 		 * Do nothing else to cursors held over from a previous transaction.
 		 */
@@ -916,26 +925,29 @@ AtSubAbort_Portals(SubTransactionId mySubid,
 			if (portal->activeSubid == mySubid)
 			{
 				/* Maintain activeSubid until the portal is removed */
 				portal->activeSubid = parentSubid;
 
 				/*
-				 * Upper-level portals that failed while running in this
-				 * subtransaction must be forced into FAILED state, for the
-				 * same reasons discussed below.
+				 * If a bug in a MarkPortalActive() caller has it miss cleanup
+				 * after having failed while running an upper-level portal in
+				 * this subtransaction, we don't know what else in the portal
+				 * is wrong.  Force it into FAILED state, for the same reasons
+				 * discussed below.
 				 *
 				 * We assume we can get away without forcing upper-level READY
 				 * portals to fail, even if they were run and then suspended.
 				 * In theory a suspended upper-level portal could have
 				 * acquired some references to objects that are about to be
 				 * destroyed, but there should be sufficient defenses against
 				 * such cases: the portal's original query cannot contain such
 				 * references, and any references within, say, cached plans of
 				 * PL/pgSQL functions are not from active queries and should
 				 * be protected by revalidation logic.
 				 */
+				Assert(portal->status != PORTAL_ACTIVE);
 				if (portal->status == PORTAL_ACTIVE)
 					MarkPortalFailed(portal);
 
 				/*
 				 * Also, if we failed it during the current subtransaction
 				 * (either just above, or earlier), reattach its resource
@@ -957,14 +969,19 @@ AtSubAbort_Portals(SubTrans
Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE
On Sat, Jan 02, 2016 at 07:22:13PM -0500, Tom Lane wrote:
> Noah Misch writes:
>> I am inclined to add an Assert(portal->status != PORTAL_ACTIVE) to
>> emphasize that this is backup only.  MarkPortalActive() callers remain
>> responsible for updating the status to something else before
>> relinquishing control.
>
> No, I do not think that would be an improvement.  There is no contract
> saying that this must be done earlier, IMO.

Indeed, nobody wrote a contract.  The assertion would record what has been
the sole standing practice for eleven years (since commit a393fbf9).  It
would prompt discussion if a proposed patch would depart from that
practice, and that is a good thing.  Also, every addition of dead code
should label that code to aid future readers.
Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE
Noah Misch writes:
> I am inclined to add an Assert(portal->status != PORTAL_ACTIVE) to
> emphasize that this is backup only.  MarkPortalActive() callers remain
> responsible for updating the status to something else before
> relinquishing control.

No, I do not think that would be an improvement.  There is no contract
saying that this must be done earlier, IMO.

			regards, tom lane
Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE
On Sat, Jan 02, 2016 at 01:46:07PM -0500, Tom Lane wrote: > Noah Misch writes: > > On Thu, Sep 03, 2015 at 11:04:11PM -0400, Tom Lane wrote: > >> *** AtSubAbort_Portals(SubTransactionId mySu > > >> --- 909,966 > >> { > >> Portal portal = hentry->portal; > >> > >> + /* Was it created in this subtransaction? */ > >> if (portal->createSubid != mySubid) > >> + { > >> + /* No, but maybe it was used in this subtransaction? */ > >> + if (portal->activeSubid == mySubid) > >> + { > > ... > >> + if (portal->status == PORTAL_ACTIVE) > >> + MarkPortalFailed(portal); > > > Do you have a test case that reaches this particular MarkPortalFailed() > > call? > > My attempts stumbled over the fact that, before we reach here, each of the > > three MarkPortalActive() callers will have already called MarkPortalFailed() > > in its PG_CATCH block. ("make check" does not reach this call.) > > Offhand I think that's just belt-and-suspenders-too coding. As you say, > we'd typically have failed active portals already before getting here. > But the responsibility of this routine is to *guarantee* that no broken > portals remain active, so I'd not want to remove this check. > > Do you have a particular reason for asking? After reading the log message for and comments added in commit c5454f9, I understood that the commit changed PostgreSQL to fail portals that did not fail before. That is to say, queries that formerly completed without error would now elicit errors. I was looking into what sorts of queries it affected in this way. If the new MarkPortalFailed() is indeed dead code, then the commit affects no query that way. The meat of the commit is then the ResourceOwnerNewParent() call, which lets queries ERROR cleanly instead of seeing assertion failures or undefined behavior. I am inclined to add an Assert(portal->status != PORTAL_ACTIVE) to emphasize that this is backup only. MarkPortalActive() callers remain responsible for updating the status to something else before relinquishing control. 
-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE
Noah Misch writes: > On Thu, Sep 03, 2015 at 11:04:11PM -0400, Tom Lane wrote: >> *** AtSubAbort_Portals(SubTransactionId mySu >> --- 909,966 >> { >> Portal portal = hentry->portal; >> >> +/* Was it created in this subtransaction? */ >> if (portal->createSubid != mySubid) >> +{ >> +/* No, but maybe it was used in this subtransaction? */ >> +if (portal->activeSubid == mySubid) >> +{ > ... >> +if (portal->status == PORTAL_ACTIVE) >> +MarkPortalFailed(portal); > Do you have a test case that reaches this particular MarkPortalFailed() call? > My attempts stumbled over the fact that, before we reach here, each of the > three MarkPortalActive() callers will have already called MarkPortalFailed() > in its PG_CATCH block. ("make check" does not reach this call.) Offhand I think that's just belt-and-suspenders-too coding. As you say, we'd typically have failed active portals already before getting here. But the responsibility of this routine is to *guarantee* that no broken portals remain active, so I'd not want to remove this check. Do you have a particular reason for asking? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE
On Thu, Sep 03, 2015 at 11:04:11PM -0400, Tom Lane wrote: > *** AtSubAbort_Portals(SubTransactionId mySu > --- 909,966 > { > Portal portal = hentry->portal; > > + /* Was it created in this subtransaction? */ > if (portal->createSubid != mySubid) > + { > + /* No, but maybe it was used in this subtransaction? */ > + if (portal->activeSubid == mySubid) > + { ... > + if (portal->status == PORTAL_ACTIVE) > + MarkPortalFailed(portal); Do you have a test case that reaches this particular MarkPortalFailed() call? My attempts stumbled over the fact that, before we reach here, each of the three MarkPortalActive() callers will have already called MarkPortalFailed() in its PG_CATCH block. ("make check" does not reach this call.) -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE
Michael Paquier writes:
> On Sat, Sep 5, 2015 at 3:41 PM, Jim Nasby wrote:
>> I was about to test this and was verifying that the crash still worked
>> when I noticed this in the logs (9.4.1, not HEAD). Not sure if there's any
>> relation here or not...
>>
>> WARNING: relation "pg_proc" page 121 is uninitialized --- fixing
>> WARNING: relation "pg_proc" page 122 is uninitialized --- fixing

> [reading vacuumlazy.c...] This seems unrelated and I would not worry about
> it. Those system catalogs have been extended with new pages by a couple of
> backends, but a crash happened before they could actually insert tuples on
> it and commit. Perhaps you were creating a bunch of objects when a crash
> happened, no?

If this is the system you were doing the previous crash testing on, I'd
say it's a direct artifact of those crashes.  You were repeatedly crashing
the system during transactions that created functions and temp tables,
which would surely make entries in pg_proc, pg_depend, etc.  The crash
came before commit and so neither the relation pages nor the WAL entries
would necessarily get to disk.  But extension of the relations to make
room would still happen.  Moreover, there's probably nothing to cause
those new pages to get added to the relations' FSM, so each new test
attempt would add another set of pages.  Eventually autovacuum would
notice, which is what you've got logged here, but it might take awhile.

			regards, tom lane

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE
On Sat, Sep 5, 2015 at 3:41 PM, Jim Nasby wrote:
> I was about to test this and was verifying that the crash still worked
> when I noticed this in the logs (9.4.1, not HEAD). Not sure if there's any
> relation here or not...
>
> WARNING: relation "pg_proc" page 121 is uninitialized --- fixing
> WARNING: relation "pg_proc" page 122 is uninitialized --- fixing

[reading vacuumlazy.c...] This seems unrelated and I would not worry about
it. Those system catalogs have been extended with new pages by a couple of
backends, but a crash happened before they could actually insert tuples on
it and commit. Perhaps you were creating a bunch of objects when a crash
happened, no?

Coming to the point, did you see a new crash with test_factory? Is that
some remnant of a previous test?
--
Michael
Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE
On 9/4/15 12:41 PM, Tom Lane wrote:
> I wrote:
>>> After review of all the callers of RelationClearRelation, it seems like
>>> most of them already have sufficient guards to prevent triggering of the
>>> Assert. The ones that lack such tests are AtEOXact_cleanup and
>>> AtEOSubXact_cleanup. So what I'm now thinking is that those should do
>>> something along the lines of
>> Specifically, this, which can be shown to mitigate the results of the
>> problem cases in an otherwise-unpatched build.
> And pushed.  I noticed that while the relcache.c fix mitigates the error
> pretty well in 9.3 and up, in the older branches you still end up with a
> PANIC due to error stack overflow.  This may indicate that there's some
> patch we'd have been better off back-porting.  However, with the Portal
> changes in place the test cases work in all branches, so I'm not excited
> enough to pursue the point further myself.

I was about to test this and was verifying that the crash still worked
when I noticed this in the logs (9.4.1, not HEAD). Not sure if there's any
relation here or not...
WARNING: relation "pg_proc" page 121 is uninitialized --- fixing
WARNING: relation "pg_proc" page 122 is uninitialized --- fixing
WARNING: relation "pg_proc" page 123 is uninitialized --- fixing
WARNING: relation "pg_proc" page 124 is uninitialized --- fixing
WARNING: relation "pg_proc" page 125 is uninitialized --- fixing
WARNING: relation "pg_proc" page 126 is uninitialized --- fixing
WARNING: relation "pg_proc" page 127 is uninitialized --- fixing
WARNING: relation "pg_proc" page 128 is uninitialized --- fixing
WARNING: relation "pg_proc" page 129 is uninitialized --- fixing
WARNING: relation "pg_proc" page 130 is uninitialized --- fixing
WARNING: relation "pg_proc" page 131 is uninitialized --- fixing
WARNING: relation "pg_proc" page 132 is uninitialized --- fixing
WARNING: relation "pg_proc" page 133 is uninitialized --- fixing
WARNING: relation "pg_proc" page 134 is uninitialized --- fixing
WARNING: relation "pg_proc" page 135 is uninitialized --- fixing
WARNING: relation "pg_proc" page 136 is uninitialized --- fixing
WARNING: relation "pg_proc" page 137 is uninitialized --- fixing
WARNING: relation "pg_proc" page 138 is uninitialized --- fixing
WARNING: relation "pg_proc" page 139 is uninitialized --- fixing
WARNING: relation "pg_proc" page 140 is uninitialized --- fixing
WARNING: relation "pg_proc" page 141 is uninitialized --- fixing
WARNING: relation "pg_proc" page 142 is uninitialized --- fixing
WARNING: relation "pg_proc" page 143 is uninitialized --- fixing
WARNING: relation "pg_proc" page 144 is uninitialized --- fixing
WARNING: relation "pg_proc" page 145 is uninitialized --- fixing
WARNING: relation "pg_proc" page 146 is uninitialized --- fixing
WARNING: relation "pg_proc" page 147 is uninitialized --- fixing
WARNING: relation "pg_proc" page 148 is uninitialized --- fixing
WARNING: relation "pg_proc" page 149 is uninitialized --- fixing
WARNING: relation "pg_proc" page 150 is uninitialized --- fixing
WARNING: relation "pg_proc" page 151 is uninitialized --- fixing
WARNING: relation "pg_proc" page 152 is uninitialized --- fixing
WARNING: relation "pg_proc" page 153 is uninitialized --- fixing
WARNING: relation "pg_proc" page 154 is uninitialized --- fixing
WARNING: relation "pg_proc" page 155 is uninitialized --- fixing
WARNING: relation "pg_proc" page 156 is uninitialized --- fixing
WARNING: relation "pg_proc" page 157 is uninitialized --- fixing
WARNING: relation "pg_proc" page 158 is uninitialized --- fixing
WARNING: relation "pg_proc" page 159 is uninitialized --- fixing
WARNING: relation "pg_proc" page 160 is uninitialized --- fixing
WARNING: relation "pg_proc" page 161 is uninitialized --- fixing
WARNING: relation "pg_proc" page 162 is uninitialized --- fixing
WARNING: relation "pg_proc" page 246 is uninitialized --- fixing
WARNING: relation "pg_depend" page 82 is uninitialized --- fixing
WARNING: relation "pg_depend" page 83 is uninitialized --- fixing
WARNING: relation "pg_depend" page 84 is uninitialized --- fixing
WARNING: relation "pg_depend" page 85 is uninitialized --- fixing
WARNING: relation "pg_depend" page 86 is uninitialized --- fixing
WARNING: relation "pg_depend" page 87 is uninitialized --- fixing
WARNING: relation "pg_depend" page 88 is uninitialized --- fixing
WARNING: relation "pg_depend" page 89 is uninitialized --- fixing
WARNING: relation "pg_depend" page 90 is uninitialized --- fixing
WARNING: relation "pg_depend" page 91 is uninitialized --- fixing
WARNING: relation "pg_depend" page 92 is uninitialized --- fixing
WARNING: relation "pg_depend" page 93 is uninitialized --- fixing
WARNING: relation "pg_depend" page 94 is uninitialized --- fixing
WARNING: relation "pg_depend" page 95 is uninitialized --- fixing
WARNING: relation "pg_depend" page 96 is uninitialized --- fixing
WARNING: relation "pg_depend" page 112 is uninitialized --- fixing
WARNING: relation "pg_depen
Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE
I wrote: >> After review of all the callers of RelationClearRelation, it seems like >> most of them already have sufficient guards to prevent triggering of the >> Assert. The ones that lack such tests are AtEOXact_cleanup and >> AtEOSubXact_cleanup. So what I'm now thinking is that those should do >> something along the lines of > Specifically, this, which can be shown to mitigate the results of the > problem cases in an otherwise-unpatched build. And pushed. I noticed that while the relcache.c fix mitigates the error pretty well in 9.3 and up, in the older branches you still end up with a PANIC due to error stack overflow. This may indicate that there's some patch we'd have been better off back-porting. However, with the Portal changes in place the test cases work in all branches, so I'm not excited enough to pursue the point further myself. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE
I wrote: > After review of all the callers of RelationClearRelation, it seems like > most of them already have sufficient guards to prevent triggering of the > Assert. The ones that lack such tests are AtEOXact_cleanup and > AtEOSubXact_cleanup. So what I'm now thinking is that those should do > something along the lines of Specifically, this, which can be shown to mitigate the results of the problem cases in an otherwise-unpatched build. regards, tom lane diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c index 44e9509..420ef3d 100644 *** a/src/backend/utils/cache/relcache.c --- b/src/backend/utils/cache/relcache.c *** RelationClearRelation(Relation relation, *** 2048,2054 { /* * As per notes above, a rel to be rebuilt MUST have refcnt > 0; while of ! * course it would be a bad idea to blow away one with nonzero refcnt. */ Assert(rebuild ? !RelationHasReferenceCountZero(relation) : --- 2048,2056 { /* * As per notes above, a rel to be rebuilt MUST have refcnt > 0; while of ! * course it would be an equally bad idea to blow away one with nonzero ! * refcnt, since that would leave someone somewhere with a dangling ! * pointer. All callers are expected to have verified that this holds. */ Assert(rebuild ? !RelationHasReferenceCountZero(relation) : *** AtEOXact_cleanup(Relation relation, bool *** 2654,2664 { if (isCommit) relation->rd_createSubid = InvalidSubTransactionId; ! else { RelationClearRelation(relation, false); return; } } /* --- 2656,2680 { if (isCommit) relation->rd_createSubid = InvalidSubTransactionId; ! else if (RelationHasReferenceCountZero(relation)) { RelationClearRelation(relation, false); return; } + else + { + /* + * Hmm, somewhere there's a (leaked?) reference to the relation. + * We daren't remove the entry for fear of dereferencing a + * dangling pointer later. Bleat, and mark it as not belonging to + * the current transaction. Hopefully it'll get cleaned up + * eventually. 
This must be just a WARNING to avoid + * error-during-error-recovery loops. + */ + relation->rd_createSubid = InvalidSubTransactionId; + elog(WARNING, "cannot remove relcache entry for \"%s\" because it has nonzero refcount", + RelationGetRelationName(relation)); + } } /* *** AtEOSubXact_cleanup(Relation relation, b *** 2747,2757 { if (isCommit) relation->rd_createSubid = parentSubid; ! else { RelationClearRelation(relation, false); return; } } /* --- 2763,2786 { if (isCommit) relation->rd_createSubid = parentSubid; ! else if (RelationHasReferenceCountZero(relation)) { RelationClearRelation(relation, false); return; } + else + { + /* + * Hmm, somewhere there's a (leaked?) reference to the relation. + * We daren't remove the entry for fear of dereferencing a + * dangling pointer later. Bleat, and transfer it to the parent + * subtransaction so we can try again later. This must be just a + * WARNING to avoid error-during-error-recovery loops. + */ + relation->rd_createSubid = parentSubid; + elog(WARNING, "cannot remove relcache entry for \"%s\" because it has nonzero refcount", + RelationGetRelationName(relation)); + } } /* -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE
Michael Paquier writes: > On Fri, Sep 4, 2015 at 10:52 PM, Tom Lane wrote: >> Also, I am very strongly tempted to convert the original problem-causing >> Assert in relcache.c into a regular runtime test-and-elog. If we're wrong >> about having sealed off this hole completely, I'd much rather see the >> result be an elog than silent corruption of the relcache. > +1. For a HEAD-only change I suppose. No, the point is to have less dangerous behavior *in the field* if the situation occurs. I have every intention of back-patching this bit as well. Although after some experimentation with a build lacking the Portal-level fixes, an elog(ERROR) in RelationClearRelation isn't much fun either, because the problem case causes errors during error cleanup, soon leading to "PANIC: ERRORDATA_STACK_SIZE exceeded". After review of all the callers of RelationClearRelation, it seems like most of them already have sufficient guards to prevent triggering of the Assert. The ones that lack such tests are AtEOXact_cleanup and AtEOSubXact_cleanup. So what I'm now thinking is that those should do something along the lines of if (isCommit) relation->rd_createSubid = parentSubid; else if (RelationHasReferenceCountZero(relation)) { RelationClearRelation(relation, false); return; } else { elog(WARNING, "rel so-and-so is still referenced"); relation->rd_createSubid = parentSubid; } so as to mitigate the damage if there's a dangling reference. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE
On Fri, Sep 4, 2015 at 10:52 PM, Tom Lane wrote: > Also, I am very strongly tempted to convert the original problem-causing > Assert in relcache.c into a regular runtime test-and-elog. If we're wrong > about having sealed off this hole completely, I'd much rather see the > result be an elog than silent corruption of the relcache. > +1. For a HEAD-only change I suppose. -- Michael
Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE
Michael Paquier writes: > On Fri, Sep 4, 2015 at 12:04 PM, Tom Lane wrote: >> Attached is an updated patch with comments to this effect and some >> minor other code cleanup (mainly, not assuming that CurrentResourceOwner >> is the right thing to use in AtSubAbort_Portals). > Thanks! This looks good to me. If no further comments, I'll see about pushing and back-patching this. In the stable branches, the new field in Portal needs to go at the end, to avoid an ABI break for extensions that look at Portals. Otherwise I'm inclined to push it as-is. In particular, that would break any extensions that call AtSubAbort_Portals() directly, but I'm having a really hard time believing that there are any. (If anyone can make a case that there are some, the fallback would be to remove the new argument in the back branches and make AtSubAbort_Portals depend on CurrentResourceOwner again; but I'd rather not do it that way.) Also, I am very strongly tempted to convert the original problem-causing Assert in relcache.c into a regular runtime test-and-elog. If we're wrong about having sealed off this hole completely, I'd much rather see the result be an elog than silent corruption of the relcache. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE
On Fri, Sep 4, 2015 at 12:04 PM, Tom Lane wrote: > I wrote: > > Hmm. I am not feeling especially comfortable about this: it's not clear > > that there's anything preventing a suspended portal from containing a > > dangerous reference. However, a quick trial did not show that it was > > possible to break it --- in the cases I tried, we detected that a cached > > plan was no longer valid, tried to rebuild it, and noticed the missing > > object at that point. So maybe it's OK. > > After further thought I concluded that this probably is safe. The > portal's original query was created and planned when it was opened, > so it cannot contain any references to objects of the failed > subtransaction. We have a hazard from queries within functions, > but if the portal is suspended then no such functions are in progress. > > Attached is an updated patch with comments to this effect and some > minor other code cleanup (mainly, not assuming that CurrentResourceOwner > is the right thing to use in AtSubAbort_Portals). > Thanks! This looks good to me. -- Michael
Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE
I wrote: > Hmm. I am not feeling especially comfortable about this: it's not clear > that there's anything preventing a suspended portal from containing a > dangerous reference. However, a quick trial did not show that it was > possible to break it --- in the cases I tried, we detected that a cached > plan was no longer valid, tried to rebuild it, and noticed the missing > object at that point. So maybe it's OK. After further thought I concluded that this probably is safe. The portal's original query was created and planned when it was opened, so it cannot contain any references to objects of the failed subtransaction. We have a hazard from queries within functions, but if the portal is suspended then no such functions are in progress. Attached is an updated patch with comments to this effect and some minor other code cleanup (mainly, not assuming that CurrentResourceOwner is the right thing to use in AtSubAbort_Portals). regards, tom lane diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c index b53d95f..ae9e850 100644 *** a/src/backend/access/transam/xact.c --- b/src/backend/access/transam/xact.c *** AbortSubTransaction(void) *** 4585,4590 --- 4585,4591 AfterTriggerEndSubXact(false); AtSubAbort_Portals(s->subTransactionId, s->parent->subTransactionId, + s->curTransactionOwner, s->parent->curTransactionOwner); AtEOSubXact_LargeObject(false, s->subTransactionId, s->parent->subTransactionId); diff --git a/src/backend/commands/portalcmds.c b/src/backend/commands/portalcmds.c index 2794537..327b2a5 100644 *** a/src/backend/commands/portalcmds.c --- b/src/backend/commands/portalcmds.c *** PersistHoldablePortal(Portal portal) *** 335,345 /* * Check for improper portal use, and mark portal active. */ ! if (portal->status != PORTAL_READY) ! ereport(ERROR, ! (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), ! errmsg("portal \"%s\" cannot be run", portal->name))); ! portal->status = PORTAL_ACTIVE; /* * Set up global portal context pointers. 
--- 335,341 /* * Check for improper portal use, and mark portal active. */ ! MarkPortalActive(portal); /* * Set up global portal context pointers. diff --git a/src/backend/tcop/pquery.c b/src/backend/tcop/pquery.c index 9c14e8a..0df86a2 100644 *** a/src/backend/tcop/pquery.c --- b/src/backend/tcop/pquery.c *** PortalRun(Portal portal, long count, boo *** 734,744 /* * Check for improper portal use, and mark portal active. */ ! if (portal->status != PORTAL_READY) ! ereport(ERROR, ! (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), ! errmsg("portal \"%s\" cannot be run", portal->name))); ! portal->status = PORTAL_ACTIVE; /* * Set up global portal context pointers. --- 734,740 /* * Check for improper portal use, and mark portal active. */ ! MarkPortalActive(portal); /* * Set up global portal context pointers. *** PortalRunFetch(Portal portal, *** 1398,1408 /* * Check for improper portal use, and mark portal active. */ ! if (portal->status != PORTAL_READY) ! ereport(ERROR, ! (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), ! errmsg("portal \"%s\" cannot be run", portal->name))); ! portal->status = PORTAL_ACTIVE; /* * Set up global portal context pointers. --- 1394,1400 /* * Check for improper portal use, and mark portal active. */ ! MarkPortalActive(portal); /* * Set up global portal context pointers. diff --git a/src/backend/utils/mmgr/portalmem.c b/src/backend/utils/mmgr/portalmem.c index 7530498..44a87f7 100644 *** a/src/backend/utils/mmgr/portalmem.c --- b/src/backend/utils/mmgr/portalmem.c *** CreatePortal(const char *name, bool allo *** 232,237 --- 232,238 portal->status = PORTAL_NEW; portal->cleanup = PortalCleanup; portal->createSubid = GetCurrentSubTransactionId(); + portal->activeSubid = portal->createSubid; portal->strategy = PORTAL_MULTI_QUERY; portal->cursorOptions = CURSOR_OPT_NO_SCROLL; portal->atStart = true; *** UnpinPortal(Portal portal) *** 403,408 --- 404,428 } /* + * MarkPortalActive + * Transition a portal from READY to ACTIVE state. 
+ *
+ * NOTE: never set portal->status = PORTAL_ACTIVE directly; call this instead.
+ */
+ void
+ MarkPortalActive(Portal portal)
+ {
+ 	/* For safety, this is a runtime test not just an Assert */
+ 	if (portal->status != PORTAL_READY)
+ 		ereport(ERROR,
+ 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ 				 errmsg("portal \"%s\" cannot be run", portal->name)));
+ 	/* Perform the state transition */
+ 	portal->status = PORTAL_ACTIVE;
+ 	portal->activeSubid = GetCurrentSubTransactionId();
+ }
+
+ /*
  * MarkPortalDone
  *		Transition a portal from ACTIVE to DONE state.
  *
Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE
Michael Paquier writes: > On Fri, Sep 4, 2015 at 9:57 AM, Tom Lane wrote: >> On reflection I think that the tracking of activeSubid in my patch is >> probably overkill if we're attacking it this way. We can just have >> AtSubAbort_Portals fail any ACTIVE portal regardless of subxact level, >> which is pretty analogous to what AtAbort_Portals has done for a long >> time. > Tracking the activated subxact looked neat from my side, that's more > consistent with what is done when the portal is marked as ready, > particularly with the new routine introduced. Actually, that idea did not work at all: it caused errors inside plpgsql EXCEPT blocks to try to kill the portal running the outer function call. Ooops. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE
On Fri, Sep 4, 2015 at 9:57 AM, Tom Lane wrote: > On reflection I think that the tracking of activeSubid in my patch is > probably overkill if we're attacking it this way. We can just have > AtSubAbort_Portals fail any ACTIVE portal regardless of subxact level, > which is pretty analogous to what AtAbort_Portals has done for a long > time. > Tracking the activated subxact looked neat from my side, that's more consistent with what is done when the portal is marked as ready, particularly with the new routine introduced. > Let me work on this and see if I can get to a simpler patch. > Oh, OK. Thanks! -- Michael
Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE
Michael Paquier writes:
> On Fri, Sep 4, 2015 at 6:51 AM, Tom Lane wrote:
>> Ideas?

> Yes. This diff on top of your patch:
> @@ -922,8 +922,7 @@ AtSubAbort_Portals(SubTransactionId mySubid,
>                  * must be forced into FAILED state, for the same reasons
>                  * discussed below.
>                  */
> -               if (portal->status == PORTAL_READY ||
> -                   portal->status == PORTAL_ACTIVE)
> +               if (portal->status == PORTAL_ACTIVE)
>                        MarkPortalFailed(portal);
> This way only the active portals are marked as failed.

Hmm.  I am not feeling especially comfortable about this: it's not clear
that there's anything preventing a suspended portal from containing a
dangerous reference.  However, a quick trial did not show that it was
possible to break it --- in the cases I tried, we detected that a cached
plan was no longer valid, tried to rebuild it, and noticed the missing
object at that point.  So maybe it's OK.

On reflection I think that the tracking of activeSubid in my patch is
probably overkill if we're attacking it this way.  We can just have
AtSubAbort_Portals fail any ACTIVE portal regardless of subxact level,
which is pretty analogous to what AtAbort_Portals has done for a long
time.

Let me work on this and see if I can get to a simpler patch.

			regards, tom lane

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE
On Fri, Sep 4, 2015 at 6:51 AM, Tom Lane wrote:
> I experimented with the attached patch.  It seems to work to stop the
> crash Michael exhibited (I've not tried to duplicate Jim's original
> example, though).  However, it also causes a regression test failure,
> because transactions.sql does this:

Neat patch, it would have taken me longer to figure out that... I have
tried with Jim's test case and the patch is working.

> which is exactly the case we're trying to reject now.  So that fills
> me with fear that this approach would break existing applications.
> On the other hand, I do not really see a good alternative :-(.

This behavior is present since 2004 with fe548629, so that's a real
concern to me, especially if there are drivers around relying on this
behavior. There are for example some code patterns around Postgres ODBC
that could be impacted, not sure which ones but I guess that some
internal error handling is not going to like that.

> I thought about trying to detect whether the Portal actually had any
> references to "new in subtransaction" objects to decide whether to
> kill it or not, but that seems awfully fragile.
>
> Ideas?

Yes. This diff on top of your patch:

@@ -922,8 +922,7 @@ AtSubAbort_Portals(SubTransactionId mySubid,
                 * must be forced into FAILED state, for the same reasons
                 * discussed below.
                 */
-               if (portal->status == PORTAL_READY ||
-                   portal->status == PORTAL_ACTIVE)
+               if (portal->status == PORTAL_ACTIVE)
                        MarkPortalFailed(portal);

This way only the active portals are marked as failed. The regression
tests that are failing with your patch applied visibly do not activate
the portal they use, just marking it as ready to be used. This seems to
be the safest approach to me on stable branches, as well as on master,
this way we are sure that resources on the failed subxacts are cleaned
up correctly, and that existing applications are not impacted. I am
attaching a new patch with what I changed and a test case based on my
previous one.
Regards, -- Michael diff --git a/src/backend/commands/portalcmds.c b/src/backend/commands/portalcmds.c index 2794537..327b2a5 100644 --- a/src/backend/commands/portalcmds.c +++ b/src/backend/commands/portalcmds.c @@ -335,11 +335,7 @@ PersistHoldablePortal(Portal portal) /* * Check for improper portal use, and mark portal active. */ - if (portal->status != PORTAL_READY) - ereport(ERROR, -(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), - errmsg("portal \"%s\" cannot be run", portal->name))); - portal->status = PORTAL_ACTIVE; + MarkPortalActive(portal); /* * Set up global portal context pointers. diff --git a/src/backend/tcop/pquery.c b/src/backend/tcop/pquery.c index 9c14e8a..0df86a2 100644 --- a/src/backend/tcop/pquery.c +++ b/src/backend/tcop/pquery.c @@ -734,11 +734,7 @@ PortalRun(Portal portal, long count, bool isTopLevel, /* * Check for improper portal use, and mark portal active. */ - if (portal->status != PORTAL_READY) - ereport(ERROR, -(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), - errmsg("portal \"%s\" cannot be run", portal->name))); - portal->status = PORTAL_ACTIVE; + MarkPortalActive(portal); /* * Set up global portal context pointers. @@ -1398,11 +1394,7 @@ PortalRunFetch(Portal portal, /* * Check for improper portal use, and mark portal active. */ - if (portal->status != PORTAL_READY) - ereport(ERROR, -(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), - errmsg("portal \"%s\" cannot be run", portal->name))); - portal->status = PORTAL_ACTIVE; + MarkPortalActive(portal); /* * Set up global portal context pointers. 
diff --git a/src/backend/utils/mmgr/portalmem.c b/src/backend/utils/mmgr/portalmem.c
index 7530498..639c2b3 100644
--- a/src/backend/utils/mmgr/portalmem.c
+++ b/src/backend/utils/mmgr/portalmem.c
@@ -232,6 +232,7 @@ CreatePortal(const char *name, bool allowDup, bool dupSilent)
 	portal->status = PORTAL_NEW;
 	portal->cleanup = PortalCleanup;
 	portal->createSubid = GetCurrentSubTransactionId();
+	portal->activeSubid = portal->createSubid;
 	portal->strategy = PORTAL_MULTI_QUERY;
 	portal->cursorOptions = CURSOR_OPT_NO_SCROLL;
 	portal->atStart = true;
@@ -403,6 +404,25 @@ UnpinPortal(Portal portal)
 }

 /*
+ * MarkPortalActive
+ *		Transition a portal from READY to ACTIVE state.
+ *
+ * NOTE: never set portal->status = PORTAL_ACTIVE directly; call this instead.
+ */
+void
+MarkPortalActive(Portal portal)
+{
+	/* For safety, this is a runtime test not just an Assert */
+	if (portal->status != PORTAL_READY)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("portal \"%s\" cannot be run", portal->name)));
+	/* Perform the state transition */
+	portal->status = PORTAL_ACTIVE;
+
Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE
I wrote:
> I'm inclined to think that the only real fix for this type of problem
> is that at subtransaction abort, we have to prevent further execution
> of any Portals that have executed at all within the subxact (ie force
> them into PORTAL_FAILED state and clean out their resources, see
> MarkPortalFailed). It's not good enough to kill the one(s) that are
> actively executing at the instant of failure, because anything that's
> run at all since the subxact started might be holding a reference to an
> object we're about to delete.

> I'm a little worried that that will break usage patterns that work today,
> though.

I experimented with the attached patch. It seems to work to stop the crash Michael exhibited (I've not tried to duplicate Jim's original example, though). However, it also causes a regression test failure, because transactions.sql does this:

BEGIN;
DECLARE c CURSOR FOR SELECT unique2 FROM tenk1 ORDER BY unique2;
SAVEPOINT one;
FETCH 10 FROM c;
ROLLBACK TO SAVEPOINT one;
FETCH 10 FROM c;

which is exactly the case we're trying to reject now. So that fills me with fear that this approach would break existing applications. On the other hand, I do not really see a good alternative :-(.

I thought about trying to detect whether the Portal actually had any references to "new in subtransaction" objects to decide whether to kill it or not, but that seems awfully fragile.

Ideas?

			regards, tom lane

diff --git a/src/backend/commands/portalcmds.c b/src/backend/commands/portalcmds.c
index 2794537..327b2a5 100644
*** a/src/backend/commands/portalcmds.c
--- b/src/backend/commands/portalcmds.c
*** PersistHoldablePortal(Portal portal)
*** 335,345 ****
  	/*
  	 * Check for improper portal use, and mark portal active.
  	 */
! 	if (portal->status != PORTAL_READY)
! 		ereport(ERROR,
! 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
! 				 errmsg("portal \"%s\" cannot be run", portal->name)));
! 	portal->status = PORTAL_ACTIVE;

  	/*
  	 * Set up global portal context pointers.
--- 335,341 ----
  	/*
  	 * Check for improper portal use, and mark portal active.
  	 */
! 	MarkPortalActive(portal);

  	/*
  	 * Set up global portal context pointers.
diff --git a/src/backend/tcop/pquery.c b/src/backend/tcop/pquery.c
index 9c14e8a..0df86a2 100644
*** a/src/backend/tcop/pquery.c
--- b/src/backend/tcop/pquery.c
*** PortalRun(Portal portal, long count, boo
*** 734,744 ****
  	/*
  	 * Check for improper portal use, and mark portal active.
  	 */
! 	if (portal->status != PORTAL_READY)
! 		ereport(ERROR,
! 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
! 				 errmsg("portal \"%s\" cannot be run", portal->name)));
! 	portal->status = PORTAL_ACTIVE;

  	/*
  	 * Set up global portal context pointers.
--- 734,740 ----
  	/*
  	 * Check for improper portal use, and mark portal active.
  	 */
! 	MarkPortalActive(portal);

  	/*
  	 * Set up global portal context pointers.
*** PortalRunFetch(Portal portal,
*** 1398,1408 ****
  	/*
  	 * Check for improper portal use, and mark portal active.
  	 */
! 	if (portal->status != PORTAL_READY)
! 		ereport(ERROR,
! 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
! 				 errmsg("portal \"%s\" cannot be run", portal->name)));
! 	portal->status = PORTAL_ACTIVE;

  	/*
  	 * Set up global portal context pointers.
--- 1394,1400 ----
  	/*
  	 * Check for improper portal use, and mark portal active.
  	 */
! 	MarkPortalActive(portal);

  	/*
  	 * Set up global portal context pointers.
diff --git a/src/backend/utils/mmgr/portalmem.c b/src/backend/utils/mmgr/portalmem.c
index 7530498..74f5089 100644
*** a/src/backend/utils/mmgr/portalmem.c
--- b/src/backend/utils/mmgr/portalmem.c
*** CreatePortal(const char *name, bool allo
*** 232,237 ****
--- 232,238 ----
  	portal->status = PORTAL_NEW;
  	portal->cleanup = PortalCleanup;
  	portal->createSubid = GetCurrentSubTransactionId();
+ 	portal->activeSubid = portal->createSubid;
  	portal->strategy = PORTAL_MULTI_QUERY;
  	portal->cursorOptions = CURSOR_OPT_NO_SCROLL;
  	portal->atStart = true;
*** UnpinPortal(Portal portal)
*** 403,408 ****
--- 404,428 ----
  }

  /*
+  * MarkPortalActive
+  *		Transition a portal from READY to ACTIVE state.
+  *
+  * NOTE: never set portal->status = PORTAL_ACTIVE directly; call this instead.
+  */
+ void
+ MarkPortalActive(Portal portal)
+ {
+ 	/* For safety, this is a runtime test not just an Assert */
+ 	if (portal->status != PORTAL_READY)
+ 		ereport(ERROR,
+ 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ 				 errmsg("portal \"%s\" cannot be run", portal->name)));
+ 	/* Perform the state transition */
+ 	portal->status = PORTAL_ACTIVE;
+ 	portal->activeSubid = GetCurrentSubTransactionId();
+ }
+
+ /*
  * MarkPortalDone
  *		Transition a portal from ACTIVE to DONE state.
  *
*** PreCommit_Portals(bool
Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE
Michael Paquier writes:
> Finally I have been able to crack down the problem, and it can be
> reproduced with the following test case for example:

Hm. It looks like the problem is that we're executing the function within the Portal created for cursor "h", and that Portal is older than the subtransaction created by the savepoint, so when we abort the subtransaction we do not clean up the Portal. That means that resources held by that Portal haven't been released --- in particular it still has a relcache pin on the "new_table" --- when subtransaction abort gets to the point of wanting to drop relcache entries created during the subtransaction. So the Assert has caught an actual problem: we'd have flushed the relcache entry while there was still an open reference to it.

I'm inclined to think that the only real fix for this type of problem is that at subtransaction abort, we have to prevent further execution of any Portals that have executed at all within the subxact (ie force them into PORTAL_FAILED state and clean out their resources, see MarkPortalFailed). It's not good enough to kill the one(s) that are actively executing at the instant of failure, because anything that's run at all since the subxact started might be holding a reference to an object we're about to delete.

I'm a little worried that that will break usage patterns that work today, though.

			regards, tom lane

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE
On Thu, Sep 3, 2015 at 1:46 AM, Jim Nasby wrote:
> The double set of exceptions seems to be critical; if instead of calling
> tf.get(invoice) (which recursively does tf.get(customer)) I define the
> cursor to call tf.get(customer) directly there's no assert.

Finally I have been able to crack down the problem, and it can be reproduced with the following test case for example:

BEGIN;
CREATE OR REPLACE FUNCTION create_self_tab() RETURNS text
LANGUAGE plpgsql AS $$
BEGIN
  CREATE TEMP TABLE new_table ON COMMIT DROP AS SELECT create_self_tab();
  RETURN 'foo';
END $$;
DECLARE h CURSOR FOR SELECT create_self_tab();
SAVEPOINT self_tab_point;
FETCH h; -- error
COMMIT;

When fetching the first tuple, the transaction state is rolled back to the savepoint because the call fails in the second loop (the temporary relation already exists); the temporary table that has been created is then cleaned up while it is still referenced, causing the assertion failure. So that's not related to the use of the exception blocks. What directed me to the SAVEPOINT causing the issue is the use of ON_ERROR_ROLLBACK in the test case Jim proposed. I don't have a patch at hand yet, but things are now easier to test.
--
Michael
Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE
On 9/2/15 2:56 PM, Jim Nasby wrote:
> On 9/2/15 2:17 PM, Alvaro Herrera wrote:
>> Michael Paquier wrote:
>>> I haven't written yet a test case but I think that we could reproduce
>>> that simply by having a relation referenced in the exception block of
>>> a first function, calling a second function that itself raises an
>>> exception, causing the referencing error.
>>
>> Hm, so function 2 is called in the exception block of function 1?
>> Are you going to produce the test case for this? Jim?
>
> I had attempted one but the issue is that it's more than just an exception
> block thing. If it was that simple then we'd still get the crash without
> involving pgTap. So I suspect you need to have a named cursor in the mix as
> well. Let me make another attempt at something simpler.

After some time messing around with this I've been able to remove pgTap from the equation using the attached crash.sql (relevant snippet below). This only works if you pass a refcursor into the function. It doesn't work if you do

OPEN have FOR EXECUTE $$SELECT * FROM tf.get( NULL::invoice, 'base' )$$;

inside the function instead. This code also produces different results; it outputs the error message before crashing, and the stack trace shows the assert is now failing while trying to abort the top-level transaction.

So it looks like the scenario is:

BEGIN;
DECLARE (open?) cursor that calls
  function with exception handler that calls
    function with exception handler that calls
      something that fails

The double set of exceptions seems to be critical; if instead of calling tf.get(invoice) (which recursively does tf.get(customer)) I define the cursor to call tf.get(customer) directly there's no assert.

I can poke at this more tomorrow, but it'd be very helpful if someone could explain this failure mode, as I'm basically stumbling in the dark on a test case right now.
CREATE OR REPLACE FUNCTION a( have refcursor ) RETURNS void LANGUAGE plpgsql AS $body$
DECLARE
  have_rec record;
BEGIN
  FETCH have INTO have_rec;
END
$body$;

DECLARE h CURSOR FOR SELECT * FROM tf.get( NULL::invoice, 'base' );
SELECT a( 'h'::refcursor );

(lldb) bt
* thread #1: tid = 0x722ed8, 0x7fff92a5a866 libsystem_kernel.dylib`__pthread_kill + 10, queue = 'com.apple.main-thread', stop reason = signal SIGABRT
  * frame #0: 0x7fff92a5a866 libsystem_kernel.dylib`__pthread_kill + 10
    frame #1: 0x7fff9001b35c libsystem_pthread.dylib`pthread_kill + 92
    frame #2: 0x7fff8cf89b1a libsystem_c.dylib`abort + 125
    frame #3: 0x00010cdb4039 postgres`ExceptionalCondition(conditionName=0x00010cf59cfd, errorType=0x00010cec392d, fileName=0x00010cf59045, lineNumber=1972) + 137 at assert.c:54
    frame #4: 0x00010cd9b332 postgres`RelationClearRelation(relation=0x00011658f110, rebuild='\0') + 162 at relcache.c:1970
    frame #5: 0x00010cd9cc0f postgres`AtEOSubXact_cleanup(relation=0x00011658f110, isCommit='\0', mySubid=15, parentSubid=1) + 79 at relcache.c:2665
    frame #6: 0x00010cd9cb92 postgres`AtEOSubXact_RelationCache(isCommit='\0', mySubid=15, parentSubid=1) + 242 at relcache.c:2633
    frame #7: 0x00010c9b6e88 postgres`AbortSubTransaction + 440 at xact.c:4373
    frame #8: 0x00010c9b7208 postgres`AbortCurrentTransaction + 312 at xact.c:2947
    frame #9: 0x00010cc249f3 postgres`PostgresMain(argc=1, argv=0x7fa3f4800378, dbname=0x7fa3f4800200, username=0x7fa3f30031f8) + 1875 at postgres.c:3902
    frame #10: 0x00010cb9da48 postgres`PostmasterMain [inlined] BackendRun + 8024 at postmaster.c:4155
    frame #11: 0x00010cb9da22 postgres`PostmasterMain [inlined] BackendStartup at postmaster.c:3829
    frame #12: 0x00010cb9da22 postgres`PostmasterMain [inlined] ServerLoop at postmaster.c:1597
    frame #13: 0x00010cb9da22 postgres`PostmasterMain(argc=, argv=) + 7986 at postmaster.c:1244
    frame #14: 0x00010cb218cd postgres`main(argc=, argv=) + 1325 at main.c:228
    frame #15: 0x7fff8e9a35fd libdyld.dylib`start + 1

-- Jim Nasby, Data
Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com

BEGIN;
\i test/helpers/tap_setup.sql
CREATE EXTENSION test_factory;
SET search_path=tap;
\i test/helpers/create.sql

SELECT tf.register(
  'customer'
  , array[
      row( 'insert'
        , $$INSERT INTO customer VALUES (DEFAULT, 'first', 'last' ) RETURNING *$$
      )::tf.test_set
      , row( 'function'
        , $$SELECT * FROM customer__add( 'func first', 'func last' )$$
      )::tf.test_set
    ]
);

SELECT tf.register(
  'invoice'
  , array[
      row( 'insert'
        , $$INSERT INTO invoice VALUES(
            DEFAULT
            , (tf.get( NULL::customer, 'insert' )).customer_id
            , current_date
            , current_date + 30
          ) RETURNING *$$
      )::tf.test_set
      , row( 'function'
        , $$INSERT INTO invoice VALUES(
Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE
On 9/2/15 2:17 PM, Alvaro Herrera wrote:
> Michael Paquier wrote:
>> I haven't written yet a test case but I think that we could reproduce
>> that simply by having a relation referenced in the exception block of
>> a first function, calling a second function that itself raises an
>> exception, causing the referencing error.
>
> Hm, so function 2 is called in the exception block of function 1?
> Are you going to produce the test case for this? Jim?

I had attempted one but the issue is that it's more than just an exception block thing. If it was that simple then we'd still get the crash without involving pgTap. So I suspect you need to have a named cursor in the mix as well. Let me make another attempt at something simpler.

-- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE
Michael Paquier wrote:
> I haven't written yet a test case but I think that we could reproduce
> that simply by having a relation referenced in the exception block of
> a first function, calling a second function that itself raises an
> exception, causing the referencing error.

Hm, so function 2 is called in the exception block of function 1?

Are you going to produce the test case for this? Jim?

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE
On Wed, Sep 2, 2015 at 6:21 AM, Michael Paquier wrote:
> Yes, that's what I have been looking at actually by having some markers in
> relcache.c. The reference count of this relation gets incremented here:

So, I have been playing more with this code, and as mentioned by Andres this test case goes twice through the PG_CATCH block in pl_exec.c, causing the crash as the temporary relation does not get dereferenced. I may be missing something, but isn't it a matter of moving the switch to the old memory context to the end of the block, so that we get a correct cleanup of the estate in the first block? See for example the attached, which fixes the problem for me and passes make check-world, though I expect Tom and Andres to jump in and say that this is not correct within the next 12 hours :)

I haven't written yet a test case but I think that we could reproduce that simply by having a relation referenced in the exception block of a first function, calling a second function that itself raises an exception, causing the referencing error.

Regards,
--
Michael

diff --git a/src/pl/plpgsql/src/pl_exec.c b/src/pl/plpgsql/src/pl_exec.c
index 935fa62..56084c3 100644
--- a/src/pl/plpgsql/src/pl_exec.c
+++ b/src/pl/plpgsql/src/pl_exec.c
@@ -1253,8 +1253,6 @@ exec_stmt_block(PLpgSQL_execstate *estate, PLpgSQL_stmt_block *block)
 			/* Abort the inner transaction */
 			RollbackAndReleaseCurrentSubTransaction();

-			MemoryContextSwitchTo(oldcontext);
-			CurrentResourceOwner = oldowner;

 			/* Revert to outer eval_econtext */
 			estate->eval_econtext = old_eval_econtext;
@@ -1326,6 +1324,9 @@ exec_stmt_block(PLpgSQL_execstate *estate, PLpgSQL_stmt_block *block)
 				ReThrowError(edata);
 			else
 				FreeErrorData(edata);
+
+			MemoryContextSwitchTo(oldcontext);
+			CurrentResourceOwner = oldowner;
 		}
 		PG_END_TRY();

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE
On Wed, Sep 2, 2015 at 6:13 AM, Jim Nasby wrote:
> On 9/1/15 11:59 PM, Michael Paquier wrote:
>> On Wed, Sep 2, 2015 at 12:59 PM, Jim Nasby wrote:
>>> On 9/1/15 8:42 PM, Michael Paquier wrote:
>>> The crash is triggered by having an exception raised in this particular
>>> call stack. Since there's no syntax error in master/0.2.1, there's no
>>> assert failure either. Were you running asserts?
>>
>> I thought I was, but visibly it was mistaken. So, after re-rolling
>> configure, I have been able to reproduce the crash, and as far as I
>> can see all supported versions are impacted. I tested down to 9.1
>> where extensions were introduced, and the stack trace, as well as the
>> assertion failing are the same, similar to what Jim has reported. I am
>> just digging more into this thing now.
>
> I just had a theory that reference counts were messed up because there was
> both a temp table and a real table with the same name. Turns out that's not
> the case, but I do now know that the assert is failing for the reference
> count on the temp table.

Yes, that's what I have been looking at actually by having some markers in relcache.c.
The reference count of this relation gets incremented here:

LOG: increment ref count relation: invoice_03, rd_refcnt: 1
CONTEXT: SQL statement "
  CREATE TEMP TABLE invoice_03 ON COMMIT DROP AS
  WITH i AS (
    INSERT INTO invoice VALUES(
      DEFAULT
      , (tf.get( NULL::customer, 'insert' )).customer_id
      , current_date
      , current_date + 30
    ) RETURNING *
  ) SELECT * FROM i "
PL/pgSQL function tf.get(anyelement,text) line 29 at EXECUTE
PL/pgSQL function results_eq(refcursor,refcursor,text) line 11 at FETCH
PL/pgSQL function results_eq(text,text,text) line 9 at assignment
STATEMENT: SELECT results_eq(
  $$SELECT * FROM tf.get( NULL::invoice, 'base' )$$
  , $$VALUES( 1, 1, current_date, current_date + 30 )$$
  , 'invoice factory output' );

And it gets cleared here without being decremented when cleaning up the second exception:

LOG: clear relation: invoice_03, rd_refcnt: 1
CONTEXT: PL/pgSQL function results_eq(refcursor,refcursor,text) line 11 during exception cleanup
PL/pgSQL function results_eq(text,text,text) line 9 at assignment
STATEMENT: SELECT results_eq(
  $$SELECT * FROM tf.get( NULL::invoice, 'base' )$$
  , $$VALUES( 1, 1, current_date, current_date + 30 )$$
  , 'invoice factory output' );

--
Michael
Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE
On 9/1/15 11:59 PM, Michael Paquier wrote:
> On Wed, Sep 2, 2015 at 12:59 PM, Jim Nasby wrote:
>> On 9/1/15 8:42 PM, Michael Paquier wrote:
>> The crash is triggered by having an exception raised in this particular call
>> stack. Since there's no syntax error in master/0.2.1, there's no assert
>> failure either. Were you running asserts?
>
> I thought I was, but visibly it was mistaken. So, after re-rolling
> configure, I have been able to reproduce the crash, and as far as I
> can see all supported versions are impacted. I tested down to 9.1
> where extensions were introduced, and the stack trace, as well as the
> assertion failing are the same, similar to what Jim has reported. I am
> just digging more into this thing now.

I just had a theory that reference counts were messed up because there was both a temp table and a real table with the same name. Turns out that's not the case, but I do now know that the assert is failing for the reference count on the temp table.

-- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE
On Wed, Sep 2, 2015 at 12:59 PM, Jim Nasby wrote:
> On 9/1/15 8:42 PM, Michael Paquier wrote:
> The crash is triggered by having an exception raised in this particular call
> stack. Since there's no syntax error in master/0.2.1, there's no assert
> failure either. Were you running asserts?

I thought I was, but visibly it was mistaken. So, after re-rolling configure, I have been able to reproduce the crash, and as far as I can see all supported versions are impacted. I tested down to 9.1, where extensions were introduced, and the stack trace, as well as the assertion failing, are the same, similar to what Jim has reported. I am just digging more into this thing now.
--
Michael

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE
On 9/1/15 8:42 PM, Michael Paquier wrote:
> test_factory is a jungle to me. Perhaps you could just extract a
> self-contained test case? It does not matter if the file is long as long
> as the problem can be easily reproduced.

Sorry, more info on what's happening here.

The idea behind test_factory is to allow you to register a command that creates and returns test data (IE: INSERT INTO table VALUES( DEFAULT, 'my test data' ) RETURNING *). That insert statement is stored in a table (_tf._test_factory). When this dynamic statement is executed, the results will be stored in a specially named table (plpgsql variable c_data_table_name).

When you call tf.get(), it first attempts to grab all the rows from c_data_table_name. The first time you do this in a database, that table won't exist. tf.get's exception block will create a temp table holding the results of the stored statement (IE: INSERT INTO table ...).

Something else important here is that in crash.sql there is a nested tf.get() call:

-- Invoice
, $$INSERT INTO invoice VALUES(
    DEFAULT
    , (tf.get( NULL::customer, 'insert' )).customer_id
    , current_date
    , current_date + 30
  ) RETURNING *$$

Note that calls tf.get for customer (which is a simple INSERT). This is where stuff gets weird. If you run tf.get( NULL::customer, 'insert' ) you get a regular plpgsql error. If you simply run tf.get() for invoices, *outside* of tap.results_eq(), you also only get a plpgsql error. To trigger the assert, you must use tf.get( NULL::invoice, 'base' ) from within tap.results_eq(). That's the only way I've found to trigger this.

AFAICT, that call stack looks like this:

results_eq opens a cursor to run $$SELECT * FROM tf.get( NULL::invoice, 'base' )$$
plpgsql does its thing and eventually that statement begins execution
tf.get() does a few things then attempts to read from a non-existent table.
tf.get's outer block catches that exception and runs dynamic SQL to create a temp table.
That dynamic SQL contains (in part) this:

, (tf.get( NULL::customer, 'insert' )).customer_id

*That* tf.get also attempts to read from a non-existent table and *successfully* creates its temp table. It then does

PERFORM _tf.table_create( c_data_table_name );

which fails due to a bug in _tf.table_create(). Now we have a second exception bubbling back up to the exception handler of the second tf.get call, which goes up to the exception handler for the first tf.get call. That call was in the process of creating a temp table (invoice_003). The error continues up to the FETCH command in results_eq(). The assert happens somewhere after here, and it's because the refcount on that temp table (invoice_003) is unexpected.

I'm tracing through this scenario by hand right now to try and figure out exactly when that assert blows up, but I know it's happening in results_eq(refcursor, refcursor, text).

BTW, I just updated the crash branch to ensure that test_factory 0.1.1 is what gets installed.

-- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE
On 9/1/15 8:42 PM, Michael Paquier wrote:
> On Wed, Sep 2, 2015 at 9:39 AM, Michael Paquier wrote:
>> On Wed, Sep 2, 2015 at 7:23 AM, Jim Nasby wrote:
>>> Well nuts, pretty sure that means the error isn't reproducing for you. :/
>>> Do you maybe have unusual config options or postgresql.conf options?
>>
>> Yep. And I have found one reason why it was not working, I have been
>> using --extra-version with a custom string and the Makefile of
>> test_factory failed to detect that (you may want to use VERSION_NUM in
>> Makefile.global instead), leading to test_factory--0.1.1.sql to not be
>> installed as well.
>>
>> Even after fixing that, I am facing the same problem when running the test,
>> aka:
>> psql:crash.sql:42: ERROR: 42703: column "c_data_table_name" does not exist
>> LINE 4: , c_data_table_name
>> And the same error show up, should I use the branch crash in the
>> github repo or your zip file, make test or make install/psql -f
>> crash.sql.
>>
>> test_factory is a jungle to me. Perhaps you could just extract a
>> self-contained test case? It does not matter if the file is long as
>> long as the problem can be easily reproduced.
>
> Mea culpa. It is possible to run crash.sql, with test_factory instead
> 0.2.1 installed in a cluster. So I picked it up from github on master
> branch, deployed it, and then crash.sql is able to run. However I was
> not able to reproduce a failure on master, REL9_4_STABLE and 9.4.1.
> Thoughts?

The crash is triggered by having an exception raised in this particular call stack. Since there's no syntax error in master/0.2.1, there's no assert failure either. Were you running asserts?

What configure line are you using? I might be able to track this down to something particular to my configure.

-- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble!
http://BlueTreble.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE
On Wed, Sep 2, 2015 at 9:39 AM, Michael Paquier wrote:
> On Wed, Sep 2, 2015 at 7:23 AM, Jim Nasby wrote:
>> Well nuts, pretty sure that means the error isn't reproducing for you. :/ Do
>> you maybe have unusual config options or postgresql.conf options?
>
> Yep. And I have found one reason why it was not working, I have been
> using --extra-version with a custom string and the Makefile of
> test_factory failed to detect that (you may want to use VERSION_NUM in
> Makefile.global instead), leading to test_factory--0.1.1.sql to not be
> installed as well.
>
> Even after fixing that, I am facing the same problem when running the test,
> aka:
> psql:crash.sql:42: ERROR: 42703: column "c_data_table_name" does not exist
> LINE 4: , c_data_table_name
> And the same error show up, should I use the branch crash in the
> github repo or your zip file, make test or make install/psql -f
> crash.sql.
>
> test_factory is a jungle to me. Perhaps you could just extract a
> self-contained test case? It does not matter if the file is long as
> long as the problem can be easily reproduced.

Mea culpa. It is possible to run crash.sql with test_factory 0.2.1 installed in a cluster. So I picked it up from github on the master branch, deployed it, and then crash.sql is able to run. However I was not able to reproduce a failure on master, REL9_4_STABLE and 9.4.1. Thoughts?
--
Michael

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE
On Wed, Sep 2, 2015 at 7:23 AM, Jim Nasby wrote:
> Well nuts, pretty sure that means the error isn't reproducing for you. :/ Do
> you maybe have unusual config options or postgresql.conf options?

Yep. And I have found one reason why it was not working: I have been using --extra-version with a custom string, and the Makefile of test_factory failed to detect that (you may want to use VERSION_NUM in Makefile.global instead), leading to test_factory--0.1.1.sql not being installed as well.

Even after fixing that, I am facing the same problem when running the test, aka:

psql:crash.sql:42: ERROR: 42703: column "c_data_table_name" does not exist
LINE 4: , c_data_table_name

And the same error shows up whether I use the branch crash in the github repo or your zip file, make test or make install/psql -f crash.sql.

test_factory is a jungle to me. Perhaps you could just extract a self-contained test case? It does not matter if the file is long as long as the problem can be easily reproduced.

Regards,
--
Michael

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE
On 9/1/15 1:08 AM, Michael Paquier wrote:
> On Sun, Aug 30, 2015 at 1:06 AM, Jim Nasby wrote:
>> Steps to reproduce:
>> Download https://github.com/BlueTreble/test_factory/archive/crash.zip
>> Unzip, cd into directory
>> pgxn install pgtap (or just make test)
>
> FWIW, make test fails:
> ! ERROR: 42703: column "c_data_table_name" does not exist
> ! LINE 4: , c_data_table_name
>
>> make install if you didn't do make test
>> psql -f crash.sql
>
> As does this one with the same error. I am not exactly sure what I am
> missing.
>
> Regards,

Well nuts, pretty sure that means the error isn't reproducing for you. :/ Do you maybe have unusual config options or postgresql.conf options?

Here's a log of how I can reproduce from a just-pulled copy of HEAD:

decibel@decina:[17:22]~/pgsql/HEAD (master $=)$ head config.log | grep ./conf
$ ./configure --with-includes=/opt/local/include --with-libraries=/opt/local/lib --with-perl --with-python --enable-depend -C --enable-tap-tests --prefix=/Users/decibel/pgsql/HEAD/i --with-pgport= --enable-debug CC=ccache clang -Qunused-arguments -fcolor-diagnostics --enable-cassert CFLAGS=-O0 --no-create --no-recursion
decibel@decina:[17:22]~/pgsql/HEAD (master $=)$ make; make install; pg_ctl init; pg_ctl start; createdb
decibel@decina:[17:20]~/tmp$ mv ~/Downloads/test_factory-crash.zip .
decibel@decina:[17:20]~/tmp$ open test_factory-crash.zip
decibel@decina:[17:20]~/tmp$ cd test_factory-crash/
decibel@decina:[17:20]~/tmp/test_factory-crash$ pgHEAD
decibel@decina:[17:20]~/tmp/test_factory-crash$ make test
rm -rf sql/test_factory--0.1.1.sql
rm -rf results/ regression.diffs regression.out tmp_check/ log/
cp sql/test_factory.sql sql/test_factory--0.1.1.sql
/bin/sh /Users/decibel/pgsql/HEAD/i/lib/pgxs/src/makefiles/../../config/install-sh -c -d '/Users/decibel/pgsql/HEAD/i/share/extension'
/bin/sh /Users/decibel/pgsql/HEAD/i/lib/pgxs/src/makefiles/../../config/install-sh -c -d '/Users/decibel/pgsql/HEAD/i/share/doc/extension'
/usr/bin/install -c -m 644 .//test_factory.control '/Users/decibel/pgsql/HEAD/i/share/extension/'
/usr/bin/install -c -m 644 .//doc/test_factory.asc '/Users/decibel/pgsql/HEAD/i/share/doc/extension/'
/Users/decibel/pgsql/HEAD/i/lib/pgxs/src/makefiles/../../src/test/regress/pg_regress --inputdir=./ --bindir='/Users/decibel/pgsql/HEAD/i/bin' --inputdir=test --load-language=plpgsql --dbname=contrib_regression base
(using postmaster on Unix socket, default port)
== dropping database "contrib_regression" ==
NOTICE: database "contrib_regression" does not exist, skipping
DROP DATABASE
== creating database "contrib_regression" ==
CREATE DATABASE
ALTER DATABASE
== installing plpgsql ==
CREATE LANGUAGE
== running regression test queries ==
test base ... FAILED (test process exited with exit code 2)
== 1 of 1 tests failed. ==
The differences that caused some tests to fail can be viewed in the file "/Users/decibel/tmp/test_factory-crash/regression.diffs". A copy of the test summary that you see above is saved in the file "/Users/decibel/tmp/test_factory-crash/regression.out".
make: [installcheck] Error 1 (ignored)

*** /Users/decibel/tmp/test_factory-crash/test/expected/base.out Sat Aug 29 08:50:08 2015
--- /Users/decibel/tmp/test_factory-crash/results/base.out Tue Sep 1 17:21:04 2015
***************
*** 1,18 ****
  \set ECHO none
! ok 1 - Create customer table
ok 2 - Register test customers ! ok 3 - Create function customer__add ! ok 4 - Create invoice table ! ok 5 - Register test invoices ! ok 6 - customer table is empty ! ok 7 - invoice table is empty ! ok 8 - invoice factory output ! ok 9 - invoice table content ! ok 10 - customer table content ! ok 11 - invoice factory second call ! ok 12 - invoice table content stayed constant ! ok 13 - customer table content stayed constant ! ok 14 - Test function factory ! ok 15 - customer table has new row ! ok 16 - truncate invoice ! ok 17 - invoice factory get remains the same after truncate --- 1,10 \set ECHO none ! ok 1 - Register test customers ! ok 2 - Create function customer__add ! ok 3 - Register test invoices ! ok 4 - customer table is empty ! ok 5 - invoice table is empty ! server closed the connection unexpectedly ! This probably means the server terminated abnormally ! before or while processing the request. ! connection to server was lost == decibel@decina:[17:21]~/tmp/test_factory-crash$ -- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX Experts in Analytics, Data Architecture and PostgreSQL Data in Trouble? Get it in Treble! http://BlueTreble.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE
On Sun, Aug 30, 2015 at 1:06 AM, Jim Nasby wrote:
> Steps to reproduce:
> Download https://github.com/BlueTreble/test_factory/archive/crash.zip
> Unzip, cd into directory
> pgxn install pgtap (or just make test)

FWIW, make test fails:
! ERROR:  42703: column "c_data_table_name" does not exist
! LINE 4:         , c_data_table_name

> make install if you didn't do make test
> psql -f crash.sql

As does this one with the same error. I am not exactly sure what I am
missing.

Regards,
--
Michael
Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE
On 8/29/15 11:31 AM, Andres Freund wrote:
> Hi,
>
> On 2015-08-29 11:06:05 -0500, Jim Nasby wrote:
>> Stack trace below. Relevant assert:
>
> What exact revision is this on? Either there's some aggressive inlining
> going on, or these lines don't entirely match up with HEAD.

Oops, that was 9.4.1. Trace from just-pulled HEAD below, compiled with -O0.

> That's compiled with optimization, isn't it? Could you compile with -O0?
>
> Can you go up to
>> frame #19: 0x000116295f1f plpgsql.so`exec_stmt_block(estate=0x7fff532c0090, block=0x7fa3f53a21d0) + 2095 at pl_exec.c:1300
> and do 'disass/m' to see where we actually are?

That's now 20 instead of 19. disass/m gives an error (note my gdb is tweaked
for Apple and everything else I've posted here is from lldb). Output from
'disassemble' in frame 20 at bottom.

> The area around #19 likely is the PG_CATCH inside exec_stmt_block. So
> what's happening is that there's an exception raised, while handling the
> previous exception. And apparently we're not doing that entirely right.
> Besides the line numbers, "during exception cleanup" hints at that.

If the line numbers and the 'gui' mode of lldb are to be believed, frame 20
is NOT in the try block, but #9 is:

   1242		PG_CATCH();
   1243		{
   1244			ErrorData  *edata;
   1245			ListCell   *e;
   1246
   1247			estate->err_text = gettext_noop("during exception cleanup");
   1248
   1249			/* Save error info */
   1250			MemoryContextSwitchTo(oldcontext);
   1251			edata = CopyErrorData();
   1252			FlushErrorState();
   1253
   1254			/* Abort the inner transaction */
-> 1255			RollbackAndReleaseCurrentSubTransaction();

Looking at frame 10:

   432		 */
   433		estate.err_text = NULL;
   434		estate.err_stmt = (PLpgSQL_stmt *) (func->action);
-> 435		rc = exec_stmt_block(&estate, func->action);
   436		if (rc != PLPGSQL_RC_RETURN)
   437		{
   438			estate.err_stmt = NULL;

func->fn_signature is results_eq(refcursor,refcursor,text), which is a pgTap
function that has an exception handler, and I know the code that it's calling
also has an exception handler.

(lldb) bt
* thread #1: tid = 0x3c8fb4, 0x7fff92a5a866 libsystem_kernel.dylib`__pthread_kill + 10, queue = 'com.apple.main-thread', stop reason = signal SIGABRT
  * frame #0: 0x7fff92a5a866 libsystem_kernel.dylib`__pthread_kill + 10
    frame #1: 0x7fff9001b35c libsystem_pthread.dylib`pthread_kill + 92
    frame #2: 0x7fff8cf89b1a libsystem_c.dylib`abort + 125
    frame #3: 0x00010fcf3b99 postgres`ExceptionalCondition(conditionName=0x00010fe157e9, errorType=0x00010fd60053, fileName=0x00010fe14a7f, lineNumber=2055) + 137 at assert.c:54
    frame #4: 0x00010fcda0ac postgres`RelationClearRelation(relation=0x00011954f388, rebuild='\0') + 140 at relcache.c:2053
    frame #5: 0x00010
Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE
Hi,

On 2015-08-29 11:06:05 -0500, Jim Nasby wrote:
> Stack trace below. Relevant assert:

What exact revision is this on? Either there's some aggressive inlining
going on, or these lines don't entirely match up with HEAD.

That's compiled with optimization, isn't it? Could you compile with -O0?

Can you go up to
>> frame #19: 0x000116295f1f plpgsql.so`exec_stmt_block(estate=0x7fff532c0090, block=0x7fa3f53a21d0) + 2095 at pl_exec.c:1300
and do 'disass/m' to see where we actually are?

The area around #19 likely is the PG_CATCH inside exec_stmt_block. So
what's happening is that there's an exception raised, while handling the
previous exception. And apparently we're not doing that entirely right.
Besides the line numbers, "during exception cleanup" hints at that.

Greetings,

Andres Freund
Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE
On 8/29/15 8:04 AM, Tom Lane wrote:
> Michael Paquier writes:
>> Ah, OK, you meant this file... Yes I was able to receive it as well in
>> your original email. I'll try to investigate further later, but Tom may
>> beat me first. He usually does.
>
> Jim had indicated the bug wasn't reproducible on the strength of that
> info, so I was waiting for him to provide a more reproducible case.

It was reproducible, just not very self contained. [1] is a bit better, but
it still involves pgTap as well as test_factory.

Steps to reproduce:
Download https://github.com/BlueTreble/test_factory/archive/crash.zip
Unzip, cd into directory
pgxn install pgtap (or just make test)
make install if you didn't do make test
psql -f crash.sql

Stack trace below. Relevant assert:

   1967	 * As per notes above, a rel to be rebuilt MUST have refcnt > 0; while of
   1968	 * course it would be a bad idea to blow away one with nonzero refcnt.
   1969	 */
-> 1970	Assert(rebuild ?
   1971		   !RelationHasReferenceCountZero(relation) :
   1972		   RelationHasReferenceCountZero(relation));

Relevant bits of relation:

(Relation) relation = 0x0001165832a8
  (RelFileNode) rd_node
  (SMgrRelationData *) rd_smgr = 0x
  (int) rd_refcnt = 1
  (BackendId) rd_backend = 2
  (bool) rd_islocaltemp = '\x01'
  (bool) rd_isnailed = '\0'
  (bool) rd_isvalid = '\x01'
  (char) rd_indexvalid = '\0'
  (SubTransactionId) rd_createSubid = 16
  (SubTransactionId) rd_newRelfilenodeSubid = 0
  (Form_pg_class) rd_rel = 0x0001165838d8
    (NameData) relname
      (char [64]) data = "invoice_03"

rebuild is 0.

[1] https://github.com/BlueTreble/test_factory/blob/crash/crash.sql

(lldb) bt
* thread #1: tid = 0x3b03a0, 0x7fff92a5a866 libsystem_kernel.dylib`__pthread_kill + 10, queue = 'com.apple.main-thread', stop reason = signal SIGABRT
  * frame #0: 0x7fff92a5a866 libsystem_kernel.dylib`__pthread_kill + 10
    frame #1: 0x7fff9001b35c libsystem_pthread.dylib`pthread_kill + 92
    frame #2: 0x7fff8cf89b1a libsystem_c.dylib`abort + 125
    frame #3: 0x00010cdb4039 postgres`ExceptionalCondition(conditionName=0x00010cf59cfd, errorType=0x00010cec392d, fileName=0x00010cf59045, lineNumber=1972) + 137 at assert.c:54
    frame #4: 0x00010cd9b332 postgres`RelationClearRelation(relation=0x000116594cd8, rebuild='\0') + 162 at relcache.c:1970
    frame #5: 0x00010cd9cc0f postgres`AtEOSubXact_cleanup(relation=0x000116594cd8, isCommit='\0', mySubid=15, parentSubid=14) + 79 at relcache.c:2665
    frame #6: 0x00010cd9cb92 postgres`AtEOSubXact_RelationCache(isCommit='\0', mySubid=15, parentSubid=14) + 242 at relcache.c:2633
    frame #7: 0x00010c9b6e88 postgres`AbortSubTransaction + 440 at xact.c:4373
    frame #8: 0x00010c9b8ef2 postgres`RollbackAndReleaseCurrentSubTransaction + 178 at xact.c:3948
    frame #9: 0x000116295c93 plpgsql.so`exec_stmt_block(estate=0x000
Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE
Michael Paquier writes:
> Ah, OK, you meant this file... Yes I was able to receive it as well in
> your original email. I'll try to investigate further later, but Tom may
> beat me first. He usually does.

Jim had indicated the bug wasn't reproducible on the strength of that info,
so I was waiting for him to provide a more reproducible case.

			regards, tom lane
Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE
On Sat, Aug 29, 2015 at 11:18 AM, Jim Nasby wrote:
> On 8/28/15 8:39 PM, Tom Lane wrote:
>> Michael Paquier writes:
>>> On Sat, Aug 29, 2015 at 5:02 AM, Jim Nasby wrote:
>>>> Looks like a 98k file won't get through the list...
>>>
>>> Is it compressed? Note that we have sometimes larger patches than that,
>>> but perhaps those had special permissions by the admins of this list.
>>
>> Messages significantly larger than that go through all the time. Maybe
>> you had it marked with some weird MIME type?
>
> Apparently the original email did go through and my MUA search just failed
> to find it. Sorry for the noise.
>
> Original email:
> http://www.postgresql.org/message-id/55dfaf18.4060...@bluetreble.com

Ah, OK, you meant this file... Yes I was able to receive it as well in your
original email. I'll try to investigate further later, but Tom may beat me
first. He usually does.
--
Michael
Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE
On 8/28/15 8:39 PM, Tom Lane wrote:
> Michael Paquier writes:
>> On Sat, Aug 29, 2015 at 5:02 AM, Jim Nasby wrote:
>>> Looks like a 98k file won't get through the list...
>>
>> Is it compressed? Note that we have sometimes larger patches than that,
>> but perhaps those had special permissions by the admins of this list.
>
> Messages significantly larger than that go through all the time. Maybe
> you had it marked with some weird MIME type?

Apparently the original email did go through and my MUA search just failed
to find it. Sorry for the noise.

Original email:
http://www.postgresql.org/message-id/55dfaf18.4060...@bluetreble.com
Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE
Michael Paquier writes:
> On Sat, Aug 29, 2015 at 5:02 AM, Jim Nasby wrote:
>> Looks like a 98k file won't get through the list...
>
> Is it compressed? Note that we have sometimes larger patches than that,
> but perhaps those had special permissions by the admins of this list.

Messages significantly larger than that go through all the time. Maybe you
had it marked with some weird MIME type?

			regards, tom lane
Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE
On Sat, Aug 29, 2015 at 5:02 AM, Jim Nasby wrote:
> Looks like a 98k file won't get through the list...

Is it compressed? Note that we sometimes have larger patches than that, but
perhaps those had special permissions by the admins of this list.
--
Michael