Re: FailedAssertion("pd_idx == pinfo->nparts", File: "execPartition.c", Line: 1689)

2020-08-06 Thread Amit Langote
On Fri, Aug 7, 2020 at 1:21 PM Justin Pryzby  wrote:
> On Fri, Aug 07, 2020 at 01:13:51PM +0900, Amit Langote wrote:
> > On Fri, Aug 7, 2020 at 1:05 PM Justin Pryzby  wrote:
> > > On Fri, Aug 07, 2020 at 12:16:11PM +0900, Amit Langote wrote:
> > > > Curiously, Justin mentioned upthread that the crash occurred during
> > > > BIND of a prepared query, so it had better have been that a custom plan was
> > > > being executed, because a generic one based on fewer partitions would
> > > > be thrown away due to A.I.M. invoked during AcquireExecutorLocks().
> > >
> > > Well this statement should only be executed once, and should be using
> > > PQexecParams and not PQexecPrepared (pygresql: pg.DB().query_prepared()).
> > >
> > > (gdb) p portal->name
> > > $30 = 0xf03238 ""
> > >
> > > (gdb) p portal->prepStmtName
> > > $31 = 0x0
> > >
> > > (gdb) p *portal->cplan
> > > $24 = {magic = 953717834, stmt_list = 0x682ec38, is_oneshot = false, 
> > > is_saved = true, is_valid = true, planRoleId = 16554, dependsOnRole = 
> > > false, saved_xmin = 0, generation = 1, refcount = 1, context = 0x682dfd0}
> > >
> > > I'm not sure why is_oneshot=false, though...
> >
> > Perhaps printing *unnamed_stmt_psrc (CachedPlanSource for an unnamed
> > statement) would put out more information.
>
> (gdb) p *unnamed_stmt_psrc
> $49 = {... gplan = 0x0, is_oneshot = false, is_complete = true, is_saved = 
> true, is_valid = false, generation = 1, node = {prev = 0x12fcf28,
> next = 0xdf2c80 }, generic_cost = -1, total_custom_cost 
> = 12187136.69680, num_custom_plans = 1}

From this part, I think it's clear that a custom plan was used and
that's the only one that this portal seems to know about.  Also, I can
see that only SPI ever builds "oneshot" plans, so is_oneshot would
always be false in your use case.
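
For reference, here is a heavily simplified sketch of how the plan cache
decides between a generic and a custom plan (not the actual plancache.c
source; it ignores plan_cache_mode, cursor options, and a few special
cases, and the "Sketch" names are invented).  The fields mirror the
CachedPlanSource members visible in the gdb dump above:

#include <stdbool.h>

typedef struct
{
	bool	is_oneshot;			/* only SPI ever sets this */
	double	generic_cost;		/* -1 until a generic plan has been built */
	double	total_custom_cost;
	int		num_custom_plans;
} PlanSourceSketch;

static bool
choose_custom_plan_sketch(const PlanSourceSketch *src, bool has_params)
{
	/* One-shot (SPI) plans are always planned as custom plans */
	if (src->is_oneshot)
		return true;

	/* Without parameters a custom plan buys nothing */
	if (!has_params)
		return false;

	/* Build a few custom plans first to establish a cost baseline */
	if (src->num_custom_plans < 5)
		return true;

	/* Prefer the generic plan once it is no costlier than the average custom plan */
	return src->generic_cost >= src->total_custom_cost / src->num_custom_plans;
}

With gplan = 0x0, generic_cost = -1, and num_custom_plans = 1 as in the
dump, this decision ends up at a custom plan, consistent with the
conclusion above.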

-- 
Amit Langote
EnterpriseDB: http://www.enterprisedb.com




Re: FailedAssertion("pd_idx == pinfo->nparts", File: "execPartition.c", Line: 1689)

2020-08-06 Thread Justin Pryzby
On Fri, Aug 07, 2020 at 01:13:51PM +0900, Amit Langote wrote:
> On Fri, Aug 7, 2020 at 1:05 PM Justin Pryzby  wrote:
> > On Fri, Aug 07, 2020 at 12:16:11PM +0900, Amit Langote wrote:
> > > Curiously, Justin mentioned upthread that the crash occurred during
> > > BIND of a prepared query, so it had better have been that a custom plan was
> > > being executed, because a generic one based on fewer partitions would
> > > be thrown away due to A.I.M. invoked during AcquireExecutorLocks().
> >
> > Well this statement should only be executed once, and should be using
> > PQexecParams and not PQexecPrepared (pygresql: pg.DB().query_prepared()).
> >
> > (gdb) p portal->name
> > $30 = 0xf03238 ""
> >
> > (gdb) p portal->prepStmtName
> > $31 = 0x0
> >
> > (gdb) p *portal->cplan
> > $24 = {magic = 953717834, stmt_list = 0x682ec38, is_oneshot = false, 
> > is_saved = true, is_valid = true, planRoleId = 16554, dependsOnRole = 
> > false, saved_xmin = 0, generation = 1, refcount = 1, context = 0x682dfd0}
> >
> > I'm not sure why is_oneshot=false, though...
> 
> Perhaps printing *unnamed_stmt_psrc (CachedPlanSource for an unnamed
> statement) would put out more information.

(gdb) p *unnamed_stmt_psrc
$49 = {magic = 195726186, raw_parse_tree = 0xfae788, 
  query_string = 0xfaddc0 "\n", ' ' , "SELECT $3::timestamp 
as start_time, $2::int as interval_seconds,\n", ' ' , 
"first_cgi as cgi, gsm_carr_mcc||gsm_carr_mnc as home_plmn,\n", ' ' , "SUM(chargeable_"..., commandTag = CMDTAG_SELECT, param_types = 
0x1254400, num_params = 3, parserSetup = 0x0, parserSetupArg = 0x0, 
cursor_options = 256, fixed_result = true, 
  resultDesc = 0x1376670, context = 0xfae550, query_list = 0x103c9a8, 
relationOids = 0x11aa580, invalItems = 0x11aa600, search_path = 0x11aa878, 
query_context = 0xf85790, rewriteRoleId = 16554, 
  rewriteRowSecurity = true, dependsOnRLS = false, gplan = 0x0, is_oneshot = 
false, is_complete = true, is_saved = true, is_valid = false, generation = 1, 
node = {prev = 0x12fcf28, 
next = 0xdf2c80 }, generic_cost = -1, total_custom_cost = 
12187136.69680, num_custom_plans = 1}

-- 
Justin




Re: FailedAssertion("pd_idx == pinfo->nparts", File: "execPartition.c", Line: 1689)

2020-08-06 Thread Amit Langote
On Fri, Aug 7, 2020 at 1:05 PM Justin Pryzby  wrote:
>
> On Fri, Aug 07, 2020 at 12:16:11PM +0900, Amit Langote wrote:
> > Curiously, Justin mentioned upthread that the crash occurred during
> > BIND of a prepared query, so it had better have been that a custom plan was
> > being executed, because a generic one based on fewer partitions would
> > be thrown away due to A.I.M. invoked during AcquireExecutorLocks().
>
> Well this statement should only be executed once, and should be using
> PQexecParams and not PQexecPrepared (pygresql: pg.DB().query_prepared()).
>
> (gdb) p portal->name
> $30 = 0xf03238 ""
>
> (gdb) p portal->prepStmtName
> $31 = 0x0
>
> (gdb) p *portal->cplan
> $24 = {magic = 953717834, stmt_list = 0x682ec38, is_oneshot = false, is_saved 
> = true, is_valid = true, planRoleId = 16554, dependsOnRole = false, 
> saved_xmin = 0, generation = 1, refcount = 1, context = 0x682dfd0}
>
> I'm not sure why is_oneshot=false, though...

Perhaps printing *unnamed_stmt_psrc (CachedPlanSource for an unnamed
statement) would put out more information.

-- 
Amit Langote
EnterpriseDB: http://www.enterprisedb.com




Re: FailedAssertion("pd_idx == pinfo->nparts", File: "execPartition.c", Line: 1689)

2020-08-06 Thread Justin Pryzby
On Fri, Aug 07, 2020 at 12:16:11PM +0900, Amit Langote wrote:
> Curiously, Justin mentioned upthread that the crash occurred during
> BIND of a prepared query, so it had better have been that a custom plan was
> being executed, because a generic one based on fewer partitions would
> be thrown away due to A.I.M. invoked during AcquireExecutorLocks().

Well this statement should only be executed once, and should be using
PQexecParams and not PQexecPrepared (pygresql: pg.DB().query_prepared()).

(gdb) p portal->name
$30 = 0xf03238 ""

(gdb) p portal->prepStmtName 
$31 = 0x0

(gdb) p *portal->cplan
$24 = {magic = 953717834, stmt_list = 0x682ec38, is_oneshot = false, is_saved = 
true, is_valid = true, planRoleId = 16554, dependsOnRole = false, saved_xmin = 
0, generation = 1, refcount = 1, context = 0x682dfd0}

I'm not sure why is_oneshot=false, though...

-- 
Justin




Re: FailedAssertion("pd_idx == pinfo->nparts", File: "execPartition.c", Line: 1689)

2020-08-06 Thread Justin Pryzby
On Fri, Aug 07, 2020 at 12:16:11PM +0900, Amit Langote wrote:
> Curiously, Justin mentioned upthread that the crash occurred during
> BIND of a prepared query, so it had better have been that a custom plan was
> being executed,

I'm looking at how to check that ... can you give a hint ?

-- 
Justin




Re: FailedAssertion("pd_idx == pinfo->nparts", File: "execPartition.c", Line: 1689)

2020-08-06 Thread Tom Lane
Amit Langote  writes:
> On Fri, Aug 7, 2020 at 9:32 AM Tom Lane  wrote:
>> ...  Amit's recipe for reproducing the bug works because there are
>> other relation lock acquisitions (and hence AcceptInvalidationMessages
>> calls) later in planning than where he asked us to wait.  So this
>> effectively tests a scenario where a very late A.I.M. call within the
>> planner detects an inval event for some already-planned relation, and
>> that seems like a valid-enough scenario.

> Agreed.

> Curiously, Justin mentioned upthread that the crash occurred during
> BIND of a prepared query, so it had better have been that a custom plan was
> being executed, because a generic one based on fewer partitions would
> be thrown away due to A.I.M. invoked during AcquireExecutorLocks().

Based on the above, it seems plausible that the plancache did throw away
an old plan and try to replan, but the inval message announcing partition
addition arrived too late during that planning cycle.  Just like the
normal execution path, the plancache code path won't do more than one
iteration of planning on the way to a demanded query execution.

>> (BTW, I checked and found that this test does *not* expose the problems
>> with Amit's original patch.  Not sure if it's worth trying to finagle
>> it so it does.)

> I tried to figure out a scenario where my patch would fail but
> couldn't come up with one either, though that's no proof that it isn't
> wrong.  For example, I can see that pinfo->relid_map[pinfo->nparts]
> can be accessed with my patch which is not correct.

Yeah, touching array entries off the end of the relid_map array definitely
seems possible with that coding.  But the scenario I was worried about
was that the loop actually attaches the wrong subplan (one for a different
partition) to a partdesc entry.  In an assert-enabled build, that would
have led to assertion failure just below, because then we could not match
up all the remaining relid_map entries; but in a non-assert build, we'd
plow through and bad things would likely happen during execution.
You might need further conditions, like the partitions not being all
identical, for that to actually cause any problem.  I'd poked at this
for a little bit without causing an obvious crash, but I can't claim
to have tried hard.

regards, tom lane




Re: FailedAssertion("pd_idx == pinfo->nparts", File: "execPartition.c", Line: 1689)

2020-08-06 Thread Amit Langote
On Fri, Aug 7, 2020 at 9:32 AM Tom Lane  wrote:
> Andy Fan  writes:
> > Attached is the v2 patch.

Thanks Andy and Tom for this.

> Forgot to mention that I'd envisioned adding this as a src/test/modules/
> module; contrib/ is for things that we intend to expose to users, which
> I think this isn't.
>
> I played around with this and got the isolation test I'd experimented
> with yesterday to work with it.  If you revert 7a980dfc6 then the
> attached patch will expose that bug.  Interestingly, I had to add an
> explicit AcceptInvalidationMessages() call to reproduce the bug; so
> apparently we do none of those between planning and execution in the
> ordinary query code path.  Arguably, that means we're testing a scenario
> somewhat different from what can happen in live databases, but I think
> it's OK.  Amit's recipe for reproducing the bug works because there are
> other relation lock acquisitions (and hence AcceptInvalidationMessages
> calls) later in planning than where he asked us to wait.  So this
> effectively tests a scenario where a very late A.I.M. call within the
> planner detects an inval event for some already-planned relation, and
> that seems like a valid-enough scenario.

Agreed.

Curiously, Justin mentioned upthread that the crash occurred during
BIND of a prepared query, so it had better have been that a custom plan was
being executed, because a generic one based on fewer partitions would
be thrown away due to A.I.M. invoked during AcquireExecutorLocks().

> Anyway, attached find a reviewed version of your patch plus a test
> scenario contributed by me (I was too lazy to split it into two
> patches).  Barring objections, I'll push this tomorrow or so.
>
> (BTW, I checked and found that this test does *not* expose the problems
> with Amit's original patch.  Not sure if it's worth trying to finagle
> it so it does.)

I tried to figure out a scenario where my patch would fail but
couldn't come up with one either, though that's no proof that it isn't
wrong.  For example, I can see that pinfo->relid_map[pinfo->nparts]
can be accessed with my patch which is not correct.

-- 
Amit Langote
EnterpriseDB: http://www.enterprisedb.com




Re: FailedAssertion("pd_idx == pinfo->nparts", File: "execPartition.c", Line: 1689)

2020-08-06 Thread Tom Lane
Andy Fan  writes:
> I am not sure if we need to limit the scope to "between parsing and
> execution",

Yeah, there might be reason to add similar functionality in other
places later.  I'm not sure where yet --- but that idea does make
me slightly unhappy with the "delay_execution" moniker.  I don't
have a better name though ...

regards, tom lane




Re: FailedAssertion("pd_idx == pinfo->nparts", File: "execPartition.c", Line: 1689)

2020-08-06 Thread Andy Fan
On Fri, Aug 7, 2020 at 8:32 AM Tom Lane  wrote:

> Andy Fan  writes:
> > Attached is the v2 patch.
>
> Forgot to mention that I'd envisioned adding this as a src/test/modules/
> module; contrib/ is for things that we intend to expose to users, which
> I think this isn't.
>
> I played around with this and got the isolation test I'd experimented
> with yesterday to work with it.  If you revert 7a980dfc6 then the
> attached patch will expose that bug.  Interestingly, I had to add an
> explicit AcceptInvalidationMessages() call to reproduce the bug; so
> apparently we do none of those between planning and execution in the
> ordinary query code path.  Arguably, that means we're testing a scenario
> somewhat different from what can happen in live databases, but I think
> it's OK.  Amit's recipe for reproducing the bug works because there are
> other relation lock acquisitions (and hence AcceptInvalidationMessages
> calls) later in planning than where he asked us to wait.  So this
> effectively tests a scenario where a very late A.I.M. call within the
> planner detects an inval event for some already-planned relation, and
> that seems like a valid-enough scenario.
>
> Anyway, attached find a reviewed version of your patch plus a test
> scenario contributed by me (I was too lazy to split it into two
> patches).  Barring objections, I'll push this tomorrow or so.
>
> (BTW, I checked and found that this test does *not* expose the problems
> with Amit's original patch.  Not sure if it's worth trying to finagle
> it so it does.)
>
> regards, tom lane
>
>
+ * delay_execution.c
+ * Test module to allow delay between parsing and execution of a query.

I am not sure if we need to limit the scope to "between parsing and
execution".  IMO, it can be used at any place where we have a hook, so that
delay_execution can inject the lock/unlock logic with a predefined lock ID.
Probably the test writer only wants one place blocked, in which case
delay_execution.xxx_lock_id could be set so that only the given lock ID is
considered.  Just my 0.01 cents.

-- 
Best Regards
Andy Fan


Re: FailedAssertion("pd_idx == pinfo->nparts", File: "execPartition.c", Line: 1689)

2020-08-06 Thread Tom Lane
Andy Fan  writes:
> Attached is the v2 patch.

Forgot to mention that I'd envisioned adding this as a src/test/modules/
module; contrib/ is for things that we intend to expose to users, which
I think this isn't.

I played around with this and got the isolation test I'd experimented
with yesterday to work with it.  If you revert 7a980dfc6 then the
attached patch will expose that bug.  Interestingly, I had to add an
explicit AcceptInvalidationMessages() call to reproduce the bug; so
apparently we do none of those between planning and execution in the
ordinary query code path.  Arguably, that means we're testing a scenario
somewhat different from what can happen in live databases, but I think
it's OK.  Amit's recipe for reproducing the bug works because there are
other relation lock acquisitions (and hence AcceptInvalidationMessages
calls) later in planning than where he asked us to wait.  So this
effectively tests a scenario where a very late A.I.M. call within the
planner detects an inval event for some already-planned relation, and
that seems like a valid-enough scenario.

Anyway, attached find a reviewed version of your patch plus a test
scenario contributed by me (I was too lazy to split it into two
patches).  Barring objections, I'll push this tomorrow or so.

(BTW, I checked and found that this test does *not* expose the problems
with Amit's original patch.  Not sure if it's worth trying to finagle
it so it does.)

regards, tom lane

diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 29de73c060..1428529b04 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -7,6 +7,7 @@ include $(top_builddir)/src/Makefile.global
 SUBDIRS = \
 		  brin \
 		  commit_ts \
+		  delay_execution \
 		  dummy_index_am \
 		  dummy_seclabel \
 		  snapshot_too_old \
diff --git a/src/test/modules/delay_execution/.gitignore b/src/test/modules/delay_execution/.gitignore
new file mode 100644
index 00..ba2160b66c
--- /dev/null
+++ b/src/test/modules/delay_execution/.gitignore
@@ -0,0 +1,3 @@
+# Generated subdirectories
+/output_iso/
+/tmp_check_iso/
diff --git a/src/test/modules/delay_execution/Makefile b/src/test/modules/delay_execution/Makefile
new file mode 100644
index 00..f270aebf3a
--- /dev/null
+++ b/src/test/modules/delay_execution/Makefile
@@ -0,0 +1,21 @@
+# src/test/modules/delay_execution/Makefile
+
+PGFILEDESC = "delay_execution - allow delay between parsing and execution"
+
+MODULE_big = delay_execution
+OBJS = \
+	$(WIN32RES) \
+	delay_execution.o
+
+ISOLATION = partition-addition
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/delay_execution
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/delay_execution/delay_execution.c b/src/test/modules/delay_execution/delay_execution.c
new file mode 100644
index 00..03ea23d0f2
--- /dev/null
+++ b/src/test/modules/delay_execution/delay_execution.c
@@ -0,0 +1,104 @@
+/*-
+ *
+ * delay_execution.c
+ *		Test module to allow delay between parsing and execution of a query.
+ *
+ * The delay is implemented by taking and immediately releasing a specified
+ * advisory lock.  If another process has previously taken that lock, the
+ * current process will be blocked until the lock is released; otherwise,
+ * there's no effect.  This allows an isolationtester script to reliably
+ * test behaviors where some specified action happens in another backend
+ * between parsing and execution of any desired query.
+ *
+ * Copyright (c) 2020, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/test/modules/delay_execution/delay_execution.c
+ *
+ *-
+ */
+
+#include "postgres.h"
+
+#include <limits.h>
+
+#include "optimizer/planner.h"
+#include "utils/builtins.h"
+#include "utils/guc.h"
+#include "utils/inval.h"
+
+
+PG_MODULE_MAGIC;
+
+/* GUC: advisory lock ID to use.  Zero disables the feature. */
+static int	post_planning_lock_id = 0;
+
+/* Save previous planner hook user to be a good citizen */
+static planner_hook_type prev_planner_hook = NULL;
+
+/* Module load/unload functions */
+void		_PG_init(void);
+void		_PG_fini(void);
+
+
+/* planner_hook function to provide the desired delay */
+static PlannedStmt *
+delay_execution_planner(Query *parse, const char *query_string,
+		int cursorOptions, ParamListInfo boundParams)
+{
+	PlannedStmt *result;
+
+	/* Invoke the planner, possibly via a previous hook user */
+	if (prev_planner_hook)
+		result = prev_planner_hook(parse, query_string, cursorOptions,
+   boundParams);
+	else
+		result = standard_planner(parse, query_string, cursorOptions,
+  boundParams);
+
+	/* If enabled, delay by taking and rele
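
(The archive truncates the attachment at this point.  Based on the header
comment and the GUC declared above, the remainder of the hook and the
module init/fini presumably look roughly like the sketch below; this is a
reconstruction, not necessarily the exact code that was committed.)

	/* If enabled, delay by taking and releasing the specified advisory lock */
	if (post_planning_lock_id != 0)
	{
		DirectFunctionCall1(pg_advisory_lock_int8,
							Int64GetDatum((int64) post_planning_lock_id));
		DirectFunctionCall1(pg_advisory_unlock_int8,
							Int64GetDatum((int64) post_planning_lock_id));

		/* Pick up any invalidations that arrived while we were blocked */
		AcceptInvalidationMessages();
	}

	return result;
}

/* Module load function */
void
_PG_init(void)
{
	/* GUC selecting the advisory lock ID to use; zero disables the delay */
	DefineCustomIntVariable("delay_execution.post_planning_lock_id",
							"Sets the advisory lock ID to be locked/unlocked after planning.",
							"Zero disables the delay.",
							&post_planning_lock_id,
							0,
							0, INT_MAX,
							PGC_USERSET,
							0,
							NULL, NULL, NULL);

	/* Install our hook, saving any previous planner hook */
	prev_planner_hook = planner_hook;
	planner_hook = delay_execution_planner;
}

/* Module unload function */
void
_PG_fini(void)
{
	/* Uninstall the hook */
	planner_hook = prev_planner_hook;
}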

Re: FailedAssertion("pd_idx == pinfo->nparts", File: "execPartition.c", Line: 1689)

2020-08-06 Thread Andy Fan
On Thu, Aug 6, 2020 at 10:42 PM Tom Lane  wrote:

> Andy Fan  writes:
> > On Thu, Aug 6, 2020 at 12:02 PM Tom Lane  wrote:
> >> See my straw-man proposal downthread.
>
> > Thanks for your explanation; I checked it again and it looks like a very
> > clean method.  The attached is a draft patch based on my understanding.
> > Hope I didn't misunderstand you.
>
> Ah, I was going to play around with that today, but you beat me to it ;-)
>
>
Very glad to be helpful.


> A few thoughts after a quick look at the patch:
>
> * I had envisioned that there's a custom GUC controlling the lock ID
> used; this would allow blocking different sessions at different points,
> if we ever need that.  Also, I'd make the GUC start out as zero which
> means "do nothing", so that merely loading the module has no immediate
> effect.
>
>
I forgot to say that I didn't get the point of the GUC variable in the
last thread; now I think it is a smart idea, so I added it.  This way, one
session can only be blocked at one place, which may not be an issue in
practice.

> * Don't really see the point of the before-planning lock.
>
>
Yes, it was removed now.

> * Rather than exposing internal declarations from lockfuncs.c, you
> could just write calls to pg_advisory_lock_int8 etc. using
> DirectFunctionCall1.
>
>
Thanks for sharing it; this method looks pretty good.


> * We need some better name than "test_module".  I had vaguely thought
> about "delay_execution", but am surely open to better ideas.
>
>
Both names look good to me; delay_execution looks better, so it is used in
v2.

Attached is the v2 patch.

-- 
Best Regards
Andy Fan


v2-0001-delay_execution-used-for-concurrency-case-simulat.patch
Description: Binary data


Re: FailedAssertion("pd_idx == pinfo->nparts", File: "execPartition.c", Line: 1689)

2020-08-06 Thread Tom Lane
Andy Fan  writes:
> On Thu, Aug 6, 2020 at 12:02 PM Tom Lane  wrote:
>> See my straw-man proposal downthread.

> Thanks for your explanation; I checked it again and it looks like a very
> clean method.  The attached is a draft patch based on my understanding.
> Hope I didn't misunderstand you.

Ah, I was going to play around with that today, but you beat me to it ;-)

A few thoughts after a quick look at the patch:

* I had envisioned that there's a custom GUC controlling the lock ID
used; this would allow blocking different sessions at different points,
if we ever need that.  Also, I'd make the GUC start out as zero which
means "do nothing", so that merely loading the module has no immediate
effect.

* Don't really see the point of the before-planning lock.

* Rather than exposing internal declarations from lockfuncs.c, you
could just write calls to pg_advisory_lock_int8 etc. using
DirectFunctionCall1.

* We need some better name than "test_module".  I had vaguely thought
about "delay_execution", but am surely open to better ideas.

regards, tom lane




Re: FailedAssertion("pd_idx == pinfo->nparts", File: "execPartition.c", Line: 1689)

2020-08-06 Thread Andy Fan
On Thu, Aug 6, 2020 at 12:02 PM Tom Lane  wrote:

> Andy Fan  writes:
> > On Thu, Aug 6, 2020 at 2:22 AM Tom Lane  wrote:
> >> In the longer term, it's annoying that we have no test methodology
> >> for this other than "manually set a breakpoint here".
>
> > One of the methods I see is that we can just add some GUC variable for
> > action injection; basically, it adds some code based on the GUC like this:
>
> See my straw-man proposal downthread.  I'm not very excited about putting
> things like this into the standard build, because it's really hard to be
> sure that there are no security-hazard-ish downsides of putting in ways to
> get at testing behaviors from standard SQL.  And then there's the question
> of whether you're adding noticeable overhead to production builds.  So a
> loadable module that can use some existing hook to provide the needed
> behavior seems like a better plan to me, whenever we can do it that way.
>
> In general, though, it seems like we've seldom regretted investments in
> test tooling.
>
> regards, tom lane
>


Thanks for your explanation; I checked it again and it looks like a very
clean method.  The attached is a draft patch based on my understanding.
Hope I didn't misunderstand you.

-- 
Best Regards
Andy Fan


v1-0001-test_module-used-for-concurrency-case-simulation.patch
Description: Binary data


Re: FailedAssertion("pd_idx == pinfo->nparts", File: "execPartition.c", Line: 1689)

2020-08-05 Thread Tom Lane
Andy Fan  writes:
> On Thu, Aug 6, 2020 at 2:22 AM Tom Lane  wrote:
>> In the longer term, it's annoying that we have no test methodology
>> for this other than "manually set a breakpoint here".

> One of the methods I see is that we can just add some GUC variable for
> action injection; basically, it adds some code based on the GUC like this:

See my straw-man proposal downthread.  I'm not very excited about putting
things like this into the standard build, because it's really hard to be
sure that there are no security-hazard-ish downsides of putting in ways to
get at testing behaviors from standard SQL.  And then there's the question
of whether you're adding noticeable overhead to production builds.  So a
loadable module that can use some existing hook to provide the needed
behavior seems like a better plan to me, whenever we can do it that way.

In general, though, it seems like we've seldom regretted investments in
test tooling.

regards, tom lane




Re: FailedAssertion("pd_idx == pinfo->nparts", File: "execPartition.c", Line: 1689)

2020-08-05 Thread Andy Fan
On Thu, Aug 6, 2020 at 2:22 AM Tom Lane  wrote:

> Robert Haas  writes:
> > On Wed, Aug 5, 2020 at 1:30 PM Tom Lane  wrote:
> >> I'm strongly tempted to convert the trailing Assert to an actual
> >> test-and-elog, too, but didn't do so here.
>
> > I was thinking about that, too. +1 for taking that step.
>
> Will do.
>
> In the longer term, it's annoying that we have no test methodology
> for this other than "manually set a breakpoint here".


One of the methods I see is that we can just add some GUC variable for
action injection; basically, it adds some code based on the GUC like this:

if (shall_delay_planning)
{
	sleep(10);
}

AFAIK, MongoDB uses much the same technique in its test framework.  First
it defines the fail point [1], then it injects code that runs if the fail
point is set [2], and finally, during a test, it can enable a fail point
much like setting a GUC, but with more attributes [3].  A rough sketch of
that idea in PG terms follows the links below.  If that would be useful in
PG as well and it is not an urgent task, I would like to help in this
direction.

[1]
https://github.com/mongodb/mongo/search?q=MONGO_FAIL_POINT_DEFINE&unscoped_q=MONGO_FAIL_POINT_DEFINE

[2]
https://github.com/mongodb/mongo/blob/d4e7ea57599b44353b5393afedee8ae5670837b3/src/mongo/db/repl/repl_set_config.cpp#L475
[3]
https://github.com/mongodb/mongo/blob/e07c2d29aded5a30ff08b5ce6a436b6ef6f44014/src/mongo/shell/replsettest.js#L1427
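
A minimal sketch of that fail-point idea in PostgreSQL terms might look
like the following.  The module, the "failpoints.active" GUC, and the
FAIL_POINT() macro are all invented here for illustration; they are not
existing PostgreSQL APIs:

#include "postgres.h"

#include "fmgr.h"
#include "miscadmin.h"
#include "utils/guc.h"

PG_MODULE_MAGIC;

void		_PG_init(void);

/* Name of the currently enabled fail point; NULL/empty means none */
static char *active_failpoint = NULL;

/* Sprinkle this at interesting spots, e.g. FAIL_POINT("post-planning") */
#define FAIL_POINT(point) \
	do { \
		if (active_failpoint && strcmp(active_failpoint, (point)) == 0) \
			pg_usleep(10 * 1000000L);	/* or block on a lock, etc. */ \
	} while (0)

void
_PG_init(void)
{
	/* GUC naming the fail point to trigger, settable from the test */
	DefineCustomStringVariable("failpoints.active",
							   "Name of the fail point to trigger.",
							   NULL,
							   &active_failpoint,
							   "",
							   PGC_USERSET,
							   0,
							   NULL, NULL, NULL);
}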



> If we're going
> to allow plan-relevant DDL changes to happen with less than full table
> lock, I think we need to improve that.  I spent a little bit of time
> just now trying to build an isolationtester case for this, and failed
> completely.  So I wonder if we can create some sort of test module that
> allows capture of a plan tree and then execution of that plan tree later
> (even after relcache inval would normally have forced replanning).
> Obviously that could not be a normal SQL-accessible feature, because
> some types of invals would make the plan completely wrong, but for
> testing purposes it'd be mighty helpful to check that a stale plan
> still works.
>
> regards, tom lane
>
>
>

-- 
Best Regards
Andy Fan


Re: FailedAssertion("pd_idx == pinfo->nparts", File: "execPartition.c", Line: 1689)

2020-08-05 Thread Amit Langote
On Thu, Aug 6, 2020 at 2:30 AM Tom Lane  wrote:
> Amit Langote  writes:
> > The attached patch should fix that.
>
> I don't like this patch at all though; I do not think it is being nearly
> careful enough to ensure that it's matched the surviving relation OIDs
> correctly.  In particular it blithely assumes that a zero in relid_map
> *must* match the immediately next entry in partdesc->oids, which is easy
> to break if the new partition is adjacent to the one the planner managed
> to prune.

Indeed, you're right.

>  So I think we should do it more like the attached.

Thanks for pushing that.

-- 
Amit Langote
EnterpriseDB: http://www.enterprisedb.com




Re: FailedAssertion("pd_idx == pinfo->nparts", File: "execPartition.c", Line: 1689)

2020-08-05 Thread Robert Haas
On Wed, Aug 5, 2020 at 4:19 PM Tom Lane  wrote:
> After thinking about it for a little bit, I'm envisioning a test module
> that can be loaded into a session, and then it gets into the planner_hook,
> and what it does is after each planner execution, take and release an
> advisory lock with some selectable ID.  Then we can construct
> isolationtester specs that do something like
>
> session 1					session 2
>
> LOAD test-module;
> SET custom_guc_for_lock_id = n;
> prepare test tables;
>
>					SELECT pg_advisory_lock(n);
>
> SELECT victim-query-here;
> ... after planning, query blocks on lock
>
>					perform DDL changes;
>					SELECT pg_advisory_unlock(n);
>
> ... query executes with now-stale plan

Very sneaky!

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: FailedAssertion("pd_idx == pinfo->nparts", File: "execPartition.c", Line: 1689)

2020-08-05 Thread Tom Lane
Robert Haas  writes:
> On Wed, Aug 5, 2020 at 2:22 PM Tom Lane  wrote:
>> In the longer term, it's annoying that we have no test methodology
>> for this other than "manually set a breakpoint here".  If we're going
>> to allow plan-relevant DDL changes to happen with less than full table
>> lock, I think we need to improve that.  I spent a little bit of time
>> just now trying to build an isolationtester case for this, and failed
>> completely.  So I wonder if we can create some sort of test module that
>> allows capture of a plan tree and then execution of that plan tree later
>> (even after relcache inval would normally have forced replanning).
>> Obviously that could not be a normal SQL-accessible feature, because
>> some types of invals would make the plan completely wrong, but for
>> testing purposes it'd be mighty helpful to check that a stale plan
>> still works.

> That's an interesting idea. I don't know exactly how it would work,
> but I agree that it would allow useful testing that we can't do today.

After thinking about it for a little bit, I'm envisioning a test module
that can be loaded into a session, and then it gets into the planner_hook,
and what it does is after each planner execution, take and release an
advisory lock with some selectable ID.  Then we can construct
isolationtester specs that do something like

session 1					session 2

LOAD test-module;
SET custom_guc_for_lock_id = n;
prepare test tables;

					SELECT pg_advisory_lock(n);

SELECT victim-query-here;
... after planning, query blocks on lock

					perform DDL changes;
					SELECT pg_advisory_unlock(n);

... query executes with now-stale plan

regards, tom lane




Re: FailedAssertion("pd_idx == pinfo->nparts", File: "execPartition.c", Line: 1689)

2020-08-05 Thread Robert Haas
On Wed, Aug 5, 2020 at 2:22 PM Tom Lane  wrote:
> In the longer term, it's annoying that we have no test methodology
> for this other than "manually set a breakpoint here".  If we're going
> to allow plan-relevant DDL changes to happen with less than full table
> lock, I think we need to improve that.  I spent a little bit of time
> just now trying to build an isolationtester case for this, and failed
> completely.  So I wonder if we can create some sort of test module that
> allows capture of a plan tree and then execution of that plan tree later
> (even after relcache inval would normally have forced replanning).
> Obviously that could not be a normal SQL-accessible feature, because
> some types of invals would make the plan completely wrong, but for
> testing purposes it'd be mighty helpful to check that a stale plan
> still works.

That's an interesting idea. I don't know exactly how it would work,
but I agree that it would allow useful testing that we can't do today.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: FailedAssertion("pd_idx == pinfo->nparts", File: "execPartition.c", Line: 1689)

2020-08-05 Thread Tom Lane
Robert Haas  writes:
> On Wed, Aug 5, 2020 at 1:30 PM Tom Lane  wrote:
>> I'm strongly tempted to convert the trailing Assert to an actual
>> test-and-elog, too, but didn't do so here.

> I was thinking about that, too. +1 for taking that step.

Will do.

In the longer term, it's annoying that we have no test methodology
for this other than "manually set a breakpoint here".  If we're going
to allow plan-relevant DDL changes to happen with less than full table
lock, I think we need to improve that.  I spent a little bit of time
just now trying to build an isolationtester case for this, and failed
completely.  So I wonder if we can create some sort of test module that
allows capture of a plan tree and then execution of that plan tree later
(even after relcache inval would normally have forced replanning).
Obviously that could not be a normal SQL-accessible feature, because
some types of invals would make the plan completely wrong, but for
testing purposes it'd be mighty helpful to check that a stale plan
still works.

regards, tom lane




Re: FailedAssertion("pd_idx == pinfo->nparts", File: "execPartition.c", Line: 1689)

2020-08-05 Thread Robert Haas
On Wed, Aug 5, 2020 at 1:30 PM Tom Lane  wrote:
> I don't like this patch at all though; I do not think it is being nearly
> careful enough to ensure that it's matched the surviving relation OIDs
> correctly.  In particular it blithely assumes that a zero in relid_map
> *must* match the immediately next entry in partdesc->oids, which is easy
> to break if the new partition is adjacent to the one the planner managed
> to prune.  So I think we should do it more like the attached.

Ooh, nice catch.

> I'm strongly tempted to convert the trailing Assert to an actual
> test-and-elog, too, but didn't do so here.

I was thinking about that, too. +1 for taking that step.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: FailedAssertion("pd_idx == pinfo->nparts", File: "execPartition.c", Line: 1689)

2020-08-05 Thread Tom Lane
Amit Langote  writes:
> The crash reported here is in the other case where the concurrently
> added partitions cause the execution-time PartitionDesc to have more
> partitions than the one that PartitionedRelPruneInfo is based on.
> I was able to reproduce such a crash as follows:

Yeah, I can repeat the case per these directions.  I concur that the
issue is that ExecCreatePartitionPruneState is failing to cope with
zeroes in the relid_map.

> The attached patch should fix that.

I don't like this patch at all though; I do not think it is being nearly
careful enough to ensure that it's matched the surviving relation OIDs
correctly.  In particular it blithely assumes that a zero in relid_map
*must* match the immediately next entry in partdesc->oids, which is easy
to break if the new partition is adjacent to the one the planner managed
to prune.  So I think we should do it more like the attached.

I'm strongly tempted to convert the trailing Assert to an actual
test-and-elog, too, but didn't do so here.

regards, tom lane

diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index fb6ce49056..221a34e738 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -1673,7 +1673,19 @@ ExecCreatePartitionPruneState(PlanState *planstate,
 pprune->subpart_map = palloc(sizeof(int) * partdesc->nparts);
 for (pp_idx = 0; pp_idx < partdesc->nparts; ++pp_idx)
 {
-	if (pinfo->relid_map[pd_idx] != partdesc->oids[pp_idx])
+	/*
+	 * pinfo->relid_map can contain InvalidOid entries for
+	 * partitions pruned by the planner.  We cannot tell
+	 * exactly which of the partdesc entries these correspond
+	 * to, but we don't have to; just skip over them.  The
+	 * non-pruned relid_map entries, however, had better be a
+	 * subset of the partdesc entries and in the same order.
+	 */
+	while (pd_idx < pinfo->nparts &&
+		   !OidIsValid(pinfo->relid_map[pd_idx]))
+		pd_idx++;
+	if (pd_idx >= pinfo->nparts ||
+		pinfo->relid_map[pd_idx] != partdesc->oids[pp_idx])
 	{
 		pprune->subplan_map[pp_idx] = -1;
 		pprune->subpart_map[pp_idx] = -1;
@@ -1686,6 +1698,14 @@ ExecCreatePartitionPruneState(PlanState *planstate,
 			pinfo->subpart_map[pd_idx++];
 	}
 }
+
+/*
+ * It might seem that we need to skip any trailing InvalidOid
+ * entries in pinfo->relid_map before asserting that we
+ * scanned all of the relid_map.  But we will have skipped
+ * them above, because they must correspond to some
+ * partdesc->oids entries; we just couldn't tell which.
+ */
 Assert(pd_idx == pinfo->nparts);
 			}
 

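To see what the new matching rule does on a concrete case, here is a small
standalone sketch of the same skip-over-InvalidOid loop, using hypothetical
OIDs (not the ones from the crash report): the planner saw partitions 101,
102, 103 and pruned 102, then partition 104 was attached before execution.

#include <stdio.h>

typedef unsigned int Oid;
#define InvalidOid	((Oid) 0)
#define OidIsValid(oid) ((oid) != InvalidOid)

int
main(void)
{
	/* Plan-time state: relid_map has InvalidOid for the pruned partition 102 */
	Oid		relid_map[] = {101, InvalidOid, 103};
	int		subplan_map[] = {0, -1, 1};
	int		nparts = 3;

	/* Execution-time PartitionDesc: partition 104 was attached concurrently */
	Oid		partdesc_oids[] = {101, 102, 103, 104};
	int		partdesc_nparts = 4;

	int		new_subplan_map[4];
	int		pd_idx = 0;

	for (int pp_idx = 0; pp_idx < partdesc_nparts; pp_idx++)
	{
		/* Skip InvalidOid entries for planner-pruned partitions */
		while (pd_idx < nparts && !OidIsValid(relid_map[pd_idx]))
			pd_idx++;

		if (pd_idx >= nparts || relid_map[pd_idx] != partdesc_oids[pp_idx])
			new_subplan_map[pp_idx] = -1;	/* pruned, or added since planning */
		else
			new_subplan_map[pp_idx] = subplan_map[pd_idx++];
	}

	/* Prints: 0 -1 1 -1  (and pd_idx == nparts, satisfying the trailing check) */
	for (int pp_idx = 0; pp_idx < partdesc_nparts; pp_idx++)
		printf("%d ", new_subplan_map[pp_idx]);
	printf("\n");
	return 0;
}

The surviving subplans keep their original partitions, while both the
planner-pruned partition and the newly attached one map to -1, as if they
had been pruned.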

Re: FailedAssertion("pd_idx == pinfo->nparts", File: "execPartition.c", Line: 1689)

2020-08-04 Thread Amit Langote
On Wed, Aug 5, 2020 at 10:04 AM Justin Pryzby  wrote:
> On Wed, Aug 05, 2020 at 09:53:44AM +0900, Amit Langote wrote:
> > On Wed, Aug 5, 2020 at 9:52 AM Amit Langote  wrote:
> > > On Wed, Aug 5, 2020 at 9:32 AM Justin Pryzby  wrote:
> > > > On Wed, Aug 05, 2020 at 09:26:20AM +0900, Amit Langote wrote:
> > > > > On Wed, Aug 5, 2020 at 12:11 AM Justin Pryzby  
> > > > > wrote:
> > > > > >
> > > > > > On Tue, Aug 04, 2020 at 08:12:10PM +0900, Amit Langote wrote:
> > > > > > > It may be this commit that went into PG 12 that is causing the 
> > > > > > > problem:
> > > > > >
> > > > > > Thanks for digging into this.
> > > > > >
> > > > > > > to account for partitions that were pruned by the planner for 
> > > > > > > which we
> > > > > > > decided to put 0 into relid_map, but it only considered the case 
> > > > > > > where
> > > > > > > the number of partitions doesn't change since the plan was 
> > > > > > > created.
> > > > > > > The crash reported here is in the other case where the 
> > > > > > > concurrently
> > > > > > > added partitions cause the execution-time PartitionDesc to have 
> > > > > > > more
> > > > > > > partitions than the one that PartitionedRelPruneInfo is based on.
> > > > > >
> > > > > > Is there anything else needed to check that my crash matches your 
> > > > > > analysis ?
> > > > >
> > > > > If you can spot a 0 in the output of the following, then yes.
> > > > >
> > > > > (gdb) p *pinfo->relid_map@pinfo->nparts
> > > >
> > > > I guess you knew that an earlier message has just that.  Thanks.
> > > > https://www.postgresql.org/message-id/20200803161133.ga21...@telsasoft.com
> > >
> > > Yeah, you showed:
> > >
> > > (gdb) p *pinfo->relid_map@414
> > >
> > > And there is indeed a 0 in there, but I wasn't sure if it was actually
> > > in the array or a stray zero due to forcing gdb to show beyond the
> > > array bound.  Does pinfo->nparts match 414?
>
> Yes.  I typed 414 manually since the array lengths were suspect.
>
> (gdb) p pinfo->nparts
> $1 = 414
> (gdb) set print elements 0
> (gdb) p *pinfo->relid_map@pinfo->nparts
> $3 = {
>   21151836, 21151726, 21151608, 21151498, 21151388, 21151278, 21151168, 
> 21151055, 2576248, 2576255, 2576262, 2576269, 2576276, 21456497, 22064128, 0}

Thanks.  There is a 0 in there, which can only be there if the planner was
able to prune that last partition.  So, the planner saw a table with
414 partitions, was able to prune the last one and constructed an
Append plan with 413 subplans for unpruned partitions as you showed
upthread:

> (gdb) p *node->appendplans
> $17 = {type = T_List, length = 413, max_length = 509, elements = 0x7037400, 
> initial_elements = 0x7037400}

This suggests that the crash I was able to produce is similar to what you saw.

-- 
Amit Langote
EnterpriseDB: http://www.enterprisedb.com




Re: FailedAssertion("pd_idx == pinfo->nparts", File: "execPartition.c", Line: 1689)

2020-08-04 Thread Justin Pryzby
On Wed, Aug 05, 2020 at 09:53:44AM +0900, Amit Langote wrote:
> On Wed, Aug 5, 2020 at 9:52 AM Amit Langote  wrote:
> > On Wed, Aug 5, 2020 at 9:32 AM Justin Pryzby  wrote:
> > > On Wed, Aug 05, 2020 at 09:26:20AM +0900, Amit Langote wrote:
> > > > On Wed, Aug 5, 2020 at 12:11 AM Justin Pryzby  
> > > > wrote:
> > > > >
> > > > > On Tue, Aug 04, 2020 at 08:12:10PM +0900, Amit Langote wrote:
> > > > > > It may be this commit that went into PG 12 that is causing the 
> > > > > > problem:
> > > > >
> > > > > Thanks for digging into this.
> > > > >
> > > > > > to account for partitions that were pruned by the planner for which 
> > > > > > we
> > > > > > decided to put 0 into relid_map, but it only considered the case 
> > > > > > where
> > > > > > the number of partitions doesn't change since the plan was created.
> > > > > > The crash reported here is in the other case where the concurrently
> > > > > > added partitions cause the execution-time PartitionDesc to have more
> > > > > > partitions than the one that PartitionedRelPruneInfo is based on.
> > > > >
> > > > > Is there anything else needed to check that my crash matches your 
> > > > > analysis ?
> > > >
> > > > If you can spot a 0 in the output of the following, then yes.
> > > >
> > > > (gdb) p *pinfo->relid_map@pinfo->nparts
> > >
> > > I guess you knew that an earlier message has just that.  Thanks.
> > > https://www.postgresql.org/message-id/20200803161133.ga21...@telsasoft.com
> >
> > Yeah, you showed:
> >
> > (gdb) p *pinfo->relid_map@414
> >
> > And there is indeed a 0 in there, but I wasn't sure if it was actually
> > in the array or a stray zero due to forcing gdb to show beyond the
> > array bound.  Does pinfo->nparts match 414?

Yes.  I typed 414 manually since the array lengths were suspect.

(gdb) p pinfo->nparts
$1 = 414
(gdb) set print elements 0
(gdb) p *pinfo->relid_map@pinfo->nparts
$3 = {22652203, 22652104, 22651920, 22651654, 22647359, 22645269, 22644012, 
22639600, 22638852, 22621975, 22615355, 22615256, 22615069, 22610573, 22606503, 
22606404, 22600237, 22600131, 22596610, 22595013, 
  22594914, 22594725, 22594464, 22589317, 22589216, 22587504, 22582570, 
22580264, 22577047, 22576948, 22576763, 22576656, 22574077, 22570911, 22570812, 
22564524, 22564113, 22558519, 22557080, 22556981, 22556793, 
  22555205, 22550680, 22550579, 22548884, 22543899, 22540822, 22536665, 
22536566, 22536377, 22535133, 22528876, 22527780, 22526065, 22521131, 22517878, 
22513674, 22513575, 22513405, 22513288, 22507520, 22504728, 
  22504629, 22493699, 22466016, 22458641, 22457551, 22457421, 22457264, 
22452879, 22449864, 22449765, 22443560, 22442952, 22436193, 22434644, 22434469, 
22434352, 22430792, 22426903, 22426804, 22420732, 22420025, 
  22413050, 22411963, 22411864, 22411675, 22407652, 22404156, 22404049, 
22397550, 22394622, 22390035, 22389936, 22389752, 22388386, 22383211, 22382115, 
22381934, 22375210, 22370297, 22367878, 22367779, 22367586, 
  22362556, 22359928, 22358236, 22353374, 22348704, 22345692, 22345593, 
22345399, 22341347, 22336809, 22336709, 22325812, 22292836, 22287756, 22287657, 
22287466, 22283194, 22278659, 22278560, 22272041, 22269121, 
  22264424, 22264325, 22264135, 22260102, 22255418, 22254818, 22248841, 
22245824, 22241490, 22241391, 22241210, 22240354, 22236224, 22235123, 22234060, 
8744, 8345, 8033, 2528, 2429, 2330, 
  2144, 2045, 22218408, 22215986, 22215887, 22209311, 22209212, 
22207919, 22205203, 22203385, 22203298, 22203211, 22203124, 22202954, 22202859, 
22202772, 22201869, 22200438, 22197706, 22195027, 22194932, 
  22194834, 22191208, 22188412, 22187029, 22182238, 22182134, 22182030, 
22181849, 22181737, 22181107, 22175811, 22175710, 22169859, 22169604, 22159266, 
22158131, 22158021, 22157824, 22153348, 22153236, 22147308, 
  22146736, 22143778, 22143599, 22143471, 22138702, 22138590, 22132612, 
22132513, 22132271, 22132172, 22131987, 21935599, 21932664, 21927997, 21925823, 
21885889, 21862973, 21859854, 21859671, 21858869, 21853440, 
  21851884, 21845405, 21842901, 21837523, 21837413, 21837209, 21832347, 
21829359, 21827652, 21822602, 21816150, 21805995, 21805812, 21805235, 21798914, 
21798026, 21791895, 21791124, 21783854, 21783744, 21783540, 
  21780568, 21774797, 21774687, 21768326, 21764063, 21759627, 21759517, 
21759311, 21755697, 21751690, 21751156, 21744906, 21738543, 21736176, 21735992, 
21735769, 21727603, 21725956, 21716432, 21678682, 21670968, 
  21670858, 21670665, 21669342, 21661932, 21661822, 21655311, 21650838, 
21646721, 21646611, 21646409, 21640984, 21637816, 21637706, 21631061, 21622723, 
21621459, 21621320, 21621148, 21612902, 21612790, 21606170, 
  21602265, 21597910, 21597800, 21597605, 21592489, 21589415, 21589305, 
21582910, 21578017, 21576758, 21576648, 21572692, 21566633, 21566521, 21560127, 
21560017, 21553910, 21553800, 21553613, 21553495, 21549102, 
  21548992, 21542759, 21540922, 21532093, 21531983, 21531786, 21531676, 
21531264, 21531154, 21525290,

Re: FailedAssertion("pd_idx == pinfo->nparts", File: "execPartition.c", Line: 1689)

2020-08-04 Thread Amit Langote
On Wed, Aug 5, 2020 at 9:52 AM Amit Langote  wrote:
> On Wed, Aug 5, 2020 at 9:32 AM Justin Pryzby  wrote:
> > On Wed, Aug 05, 2020 at 09:26:20AM +0900, Amit Langote wrote:
> > > On Wed, Aug 5, 2020 at 12:11 AM Justin Pryzby  
> > > wrote:
> > > >
> > > > On Tue, Aug 04, 2020 at 08:12:10PM +0900, Amit Langote wrote:
> > > > > It may be this commit that went into PG 12 that is causing the 
> > > > > problem:
> > > >
> > > > Thanks for digging into this.
> > > >
> > > > > to account for partitions that were pruned by the planner for which we
> > > > > decided to put 0 into relid_map, but it only considered the case where
> > > > > the number of partitions doesn't change since the plan was created.
> > > > > The crash reported here is in the other case where the concurrently
> > > > > added partitions cause the execution-time PartitionDesc to have more
> > > > > partitions than the one that PartitionedRelPruneInfo is based on.
> > > >
> > > > Is there anything else needed to check that my crash matches your 
> > > > analysis ?
> > >
> > > If you can spot a 0 in the output of the following, then yes.
> > >
> > > (gdb) p *pinfo->relid_map@pinfo->nparts
> >
> > I guess you knew that an earlier message has just that.  Thanks.
> > https://www.postgresql.org/message-id/20200803161133.ga21...@telsasoft.com
>
> Yeah, you showed:
>
> (gdb) p *pinfo->relid_map@414
>
> And there is indeed a 0 in there, but I wasn't sure if it was actually
> in the array or a stray zero due to forcing gdb to show beyond the
> array bound.  Does pinfo->nparts match 414?

(sorry, I forgot to hit reply-all in the last two emails.)

-- 
Amit Langote
EnterpriseDB: http://www.enterprisedb.com




Re: FailedAssertion("pd_idx == pinfo->nparts", File: "execPartition.c", Line: 1689)

2020-08-04 Thread Robert Haas
On Mon, Aug 3, 2020 at 12:11 PM Justin Pryzby  wrote:
> (gdb) p *pinfo->relid_map@414
> (gdb) p *partdesc->oids@415

Whoa, I didn't know about using @ in gdb to print multiple elements. Wild!

Anyway, these two arrays differ in that the latter array has 22653702
inserted at the beginning and 22628862 at the end, and also in that a
0 has been removed. This code can't cope with things getting removed,
so kaboom. I think Amit probably has the right idea about what's going
on here and how to fix it, but I haven't yet had time to study it in
detail.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: FailedAssertion("pd_idx == pinfo->nparts", File: "execPartition.c", Line: 1689)

2020-08-04 Thread Justin Pryzby
On Tue, Aug 04, 2020 at 08:12:10PM +0900, Amit Langote wrote:
> It may be this commit that went into PG 12 that is causing the problem:

Thanks for digging into this.

> to account for partitions that were pruned by the planner for which we
> decided to put 0 into relid_map, but it only considered the case where
> the number of partitions doesn't change since the plan was created.
> The crash reported here is in the other case where the concurrently
> added partitions cause the execution-time PartitionDesc to have more
> partitions than the one that PartitionedRelPruneInfo is based on.

Is there anything else needed to check that my crash matches your analysis ?

(gdb) up
#4  0x0068444a in ExecInitAppend (node=node@entry=0x7036b90, 
estate=estate@entry=0x11563f0, eflags=eflags@entry=16) at nodeAppend.c:132
132 nodeAppend.c: No such file or directory.
(gdb) p *node->appendplans   
$17 = {type = T_List, length = 413, max_length = 509, elements = 0x7037400, 
initial_elements = 0x7037400}

(gdb) down
#3  0x00672806 in ExecCreatePartitionPruneState 
(planstate=planstate@entry=0x908f6d8, partitionpruneinfo=) at 
execPartition.c:1689
1689execPartition.c: No such file or directory.

$27 = {ps = {type = T_AppendState, plan = 0x7036b90, state = 0x11563f0, 
ExecProcNode = 0x6842c0 , ExecProcNodeReal = 0x0, instrument = 0x0, 
worker_instrument = 0x0, worker_jit_instrument = 0x0, 
qual = 0x0, lefttree = 0x0, righttree = 0x0, initPlan = 0x0, subPlan = 0x0, 
chgParam = 0x0, ps_ResultTupleDesc = 0x0, ps_ResultTupleSlot = 0x0, 
ps_ExprContext = 0x908f7f0, ps_ProjInfo = 0x0, scandesc = 0x0, 
scanops = 0x0, outerops = 0x0, innerops = 0x0, resultops = 0x0, 
scanopsfixed = false, outeropsfixed = false, inneropsfixed = false, 
resultopsfixed = false, scanopsset = false, outeropsset = false, 
inneropsset = false, resultopsset = false}, appendplans = 0x0, as_nplans = 
0, as_whichplan = -1, as_first_partial_plan = 0, as_pstate = 0x0, pstate_len = 
0, as_prune_state = 0x0, as_valid_subplans = 0x0, 
  choose_next_subplan = 0x0}

-- 
Justin




Re: FailedAssertion("pd_idx == pinfo->nparts", File: "execPartition.c", Line: 1689)

2020-08-04 Thread Amit Langote
On Tue, Aug 4, 2020 at 1:11 AM Justin Pryzby  wrote:
>
> On Mon, Aug 03, 2020 at 11:41:37AM -0400, Robert Haas wrote:
> > On Sun, Aug 2, 2020 at 2:11 PM Justin Pryzby  wrote:
> > > Based on commit logs, I suspect this may be an "older bug", specifically 
> > > maybe
> > > with:
> > >
> > > |commit 898e5e3290a72d288923260143930fb32036c00c
> > > |Author: Robert Haas 
> > > |Date:   Thu Mar 7 11:13:12 2019 -0500
> > > |
> > > |Allow ATTACH PARTITION with only ShareUpdateExclusiveLock.
> > >
> > > I don't think it matters, but the process surrounding the table being 
> > > INSERTed
> > > INTO is more than a little special, involving renames, detaches, creation,
> > > re-attaching within a transaction.  I think that doesn't matter though, 
> > > and the
> > > issue is surrounding the table being SELECTed *from*, which is actually 
> > > behind
> > > a view.
> >
> > That's an entirely reasonable guess, but it doesn't seem easy to
> > understand exactly what happened here based on the provided
> > information. The assertion failure probably indicates that
> > pinfo->relid_map[] and partdesc->oids[] differ in some way other than
> > additional elements having been inserted into the latter. For example,
> > some elements might have disappeared, or the order might have changed.
> > This isn't supposed to happen, because DETACH PARTITION requires
> > heavier locking, and the order changing without anything getting
> > detached should be impossible. But evidently it did. If we could dump
> > out the two arrays in question it might shed more light on exactly how
> > things went wrong.

It may be this commit that went into PG 12 that is causing the problem:

commit 428b260f87e8861ba8e58807b69d433db491c4f4
Author: Tom Lane 
Date:   Sat Mar 30 18:58:55 2019 -0400

Speed up planning when partitions can be pruned at plan time.

which had this:

-   /* Double-check that list of relations has not changed. */
-   Assert(memcmp(partdesc->oids, pinfo->relid_map,
-  pinfo->nparts * sizeof(Oid)) == 0);
+   /*
+* Double-check that the list of unpruned relations has not
+* changed.  (Pruned partitions are not in relid_map[].)
+*/
+#ifdef USE_ASSERT_CHECKING
+   for (int k = 0; k < pinfo->nparts; k++)
+   {
+   Assert(partdesc->oids[k] == pinfo->relid_map[k] ||
+  pinfo->subplan_map[k] == -1);
+   }
+#endif

to account for partitions that were pruned by the planner for which we
decided to put 0 into relid_map, but it only considered the case where
the number of partitions doesn't change since the plan was created.
The crash reported here is in the other case where the concurrently
added partitions cause the execution-time PartitionDesc to have more
partitions than the one that PartitionedRelPruneInfo is based on.

I was able to reproduce such a crash as follows:

Start with these tables in session 1.

create table foo (a int, b int) partition by list (a);
create table foo1 partition of foo for values in (1);
create table foo2 partition of foo for values in (2);
create table foo3 partition of foo for values in (3);

Attach gdb with a breakpoint set in PartitionDirectoryLookup() and run this:

explain analyze select * from foo where a <> 1 and a = (select 2);


In another session:

create table foo4 (like foo);
alter table foo attach partition foo4 for values in (4);

That should finish without waiting for any lock and send an
invalidation message to session 1.  Go back to gdb attached to session
1 and hit continue, resulting in the plan containing runtime pruning
info being executed.  ExecCreatePartitionPruneState() opens foo which
will now have 4 partitions instead of the 3 that the planner would
have seen, of which foo1 is pruned (a <> 1), so the following block is
executed:

if (partdesc->nparts == pinfo->nparts)
...
else
{
int pd_idx = 0;
int pp_idx;

/*
 * Some new partitions have appeared since plan time, and
 * those are reflected in our PartitionDesc but were not
 * present in the one used to construct subplan_map and
 * subpart_map.  So we must construct new and longer arrays
 * where the partitions that were originally present map to
 * the same place, and any added indexes map to -1, as if the
 * new partitions had been pruned.
 */
pprune->subpart_map = palloc(sizeof(int) * partdesc->nparts);
for (pp_idx = 0; pp_idx < partdesc->nparts; ++pp_idx)
{
if (pinfo->relid_map[pd_idx] != partdesc->oids[pp_idx])
{
pprune->subplan_map[pp_idx] = -1;
pprune->subpart_map[pp_idx] = -1;

Re: FailedAssertion("pd_idx == pinfo->nparts", File: "execPartition.c", Line: 1689)

2020-08-03 Thread Justin Pryzby
On Mon, Aug 03, 2020 at 11:41:37AM -0400, Robert Haas wrote:
> On Sun, Aug 2, 2020 at 2:11 PM Justin Pryzby  wrote:
> > Based on commit logs, I suspect this may be an "older bug", specifically 
> > maybe
> > with:
> >
> > |commit 898e5e3290a72d288923260143930fb32036c00c
> > |Author: Robert Haas 
> > |Date:   Thu Mar 7 11:13:12 2019 -0500
> > |
> > |Allow ATTACH PARTITION with only ShareUpdateExclusiveLock.
> >
> > I don't think it matters, but the process surrounding the table being 
> > INSERTed
> > INTO is more than a little special, involving renames, detaches, creation,
> > re-attaching within a transaction.  I think that doesn't matter though, and 
> > the
> > issue is surrounding the table being SELECTed *from*, which is actually 
> > behind
> > a view.
> 
> That's an entirely reasonable guess, but it doesn't seem easy to
> understand exactly what happened here based on the provided
> information. The assertion failure probably indicates that
> pinfo->relid_map[] and partdesc->oids[] differ in some way other than
> additional elements having been inserted into the latter. For example,
> some elements might have disappeared, or the order might have changed.
> This isn't supposed to happen, because DETACH PARTITION requires
> heavier locking, and the order changing without anything getting
> detached should be impossible. But evidently it did. If we could dump
> out the two arrays in question it might shed more light on exactly how
> things went wrong.

(gdb) p *pinfo->relid_map@414
$8 = {22652203, 22652104, 22651920, 22651654, 22647359, 22645269, 22644012, 
22639600, 22638852, 22621975, 22615355, 22615256, 22615069, 22610573, 22606503, 
22606404, 22600237, 22600131, 22596610, 22595013, 
  22594914, 22594725, 22594464, 22589317, 22589216, 22587504, 22582570, 
22580264, 22577047, 22576948, 22576763, 22576656, 22574077, 22570911, 22570812, 
22564524, 22564113, 22558519, 22557080, 22556981, 22556793, 
  22555205, 22550680, 22550579, 22548884, 22543899, 22540822, 22536665, 
22536566, 22536377, 22535133, 22528876, 22527780, 22526065, 22521131, 22517878, 
22513674, 22513575, 22513405, 22513288, 22507520, 22504728, 
  22504629, 22493699, 22466016, 22458641, 22457551, 22457421, 22457264, 
22452879, 22449864, 22449765, 22443560, 22442952, 22436193, 22434644, 22434469, 
22434352, 22430792, 22426903, 22426804, 22420732, 22420025, 
  22413050, 22411963, 22411864, 22411675, 22407652, 22404156, 22404049, 
22397550, 22394622, 22390035, 22389936, 22389752, 22388386, 22383211, 22382115, 
22381934, 22375210, 22370297, 22367878, 22367779, 22367586, 
  22362556, 22359928, 22358236, 22353374, 22348704, 22345692, 22345593, 
22345399, 22341347, 22336809, 22336709, 22325812, 22292836, 22287756, 22287657, 
22287466, 22283194, 22278659, 22278560, 22272041, 22269121, 
  22264424, 22264325, 22264135, 22260102, 22255418, 22254818, 22248841, 
22245824, 22241490, 22241391, 22241210, 22240354, 22236224, 22235123, 22234060, 
8744, 8345, 8033, 2528, 2429, 2330, 
  2144, 2045, 22218408, 22215986, 22215887, 22209311, 22209212, 
22207919, 22205203, 22203385, 22203298, 22203211, 22203124, 22202954, 22202859, 
22202772, 22201869, 22200438, 22197706, 22195027, 22194932, 
  22194834, 22191208, 22188412, 22187029, 22182238, 22182134, 22182030, 
22181849, 22181737, 22181107, 22175811, 22175710, 22169859, 22169604, 22159266, 
22158131, 22158021, 22157824, 22153348, 22153236, 22147308, 
  22146736, 22143778, 22143599, 22143471, 22138702, 22138590, 22132612, 
22132513, 22132271, 22132172, 22131987, 21935599, 21932664, 21927997, 21925823, 
21885889, 21862973, 21859854, 21859671, 21858869, 21853440, 
  21851884, 21845405, 21842901, 21837523, 21837413, 21837209, 21832347, 
21829359, 21827652, 21822602, 21816150, 21805995, 21805812, 21805235, 21798914, 
21798026, 21791895, 21791124, 21783854, 21783744, 21783540, 
  21780568, 21774797, 21774687, 21768326, 21764063, 21759627, 21759517, 
21759311, 21755697, 21751690, 21751156, 21744906, 21738543, 21736176, 21735992, 
21735769, 21727603, 21725956, 21716432, 21678682, 21670968, 
  21670858, 21670665, 21669342, 21661932, 21661822, 21655311, 21650838, 
21646721, 21646611, 21646409, 21640984, 21637816, 21637706, 21631061, 21622723, 
21621459, 21621320, 21621148, 21612902, 21612790, 21606170, 
  21602265, 21597910, 21597800, 21597605, 21592489, 21589415, 21589305, 
21582910, 21578017, 21576758, 21576648, 21572692, 21566633, 21566521, 21560127, 
21560017, 21553910, 21553800, 21553613, 21553495, 21549102, 
  21548992, 21542759, 21540922, 21532093, 21531983, 21531786, 21531676, 
21531264, 21531154, 21525290, 21524817, 21519470, 21519360, 21519165, 21516571, 
21514269, 21514159, 21508389, 21508138, 21508028, 21507830, 
  21503457, 21502484, 21496897, 21494287, 21493722, 21493527, 21491807, 
21488530, 21486122, 21485766, 21485603, 21485383, 21481969, 21481672, 21476245, 
21472576, 21468851, 21468741, 21468546, 21467832, 21460086, 
  21425406, 21420632, 21420506, 21419974, 2141783

Re: FailedAssertion("pd_idx == pinfo->nparts", File: "execPartition.c", Line: 1689)

2020-08-03 Thread Robert Haas
On Sun, Aug 2, 2020 at 2:11 PM Justin Pryzby  wrote:
> Based on commit logs, I suspect this may be an "older bug", specifically maybe
> with:
>
> |commit 898e5e3290a72d288923260143930fb32036c00c
> |Author: Robert Haas 
> |Date:   Thu Mar 7 11:13:12 2019 -0500
> |
> |Allow ATTACH PARTITION with only ShareUpdateExclusiveLock.
>
> I don't think it matters, but the process surrounding the table being INSERTed
> INTO is more than a little special, involving renames, detaches, creation,
> re-attaching within a transaction.  I think that doesn't matter though, and 
> the
> issue is surrounding the table being SELECTed *from*, which is actually behind
> a view.

That's an entirely reasonable guess, but it doesn't seem easy to
understand exactly what happened here based on the provided
information. The assertion failure probably indicates that
pinfo->relid_map[] and partdesc->oids[] differ in some way other than
additional elements having been inserted into the latter. For example,
some elements might have disappeared, or the order might have changed.
This isn't supposed to happen, because DETACH PARTITION requires
heavier locking, and the order changing without anything getting
detached should be impossible. But evidently it did. If we could dump
out the two arrays in question it might shed more light on exactly how
things went wrong.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company