Re: FailedAssertion("pd_idx == pinfo->nparts", File: "execPartition.c", Line: 1689)
On Fri, Aug 7, 2020 at 1:21 PM Justin Pryzby wrote:
> On Fri, Aug 07, 2020 at 01:13:51PM +0900, Amit Langote wrote:
> > On Fri, Aug 7, 2020 at 1:05 PM Justin Pryzby wrote:
> > > On Fri, Aug 07, 2020 at 12:16:11PM +0900, Amit Langote wrote:
> > > > Curiously, Justin mentioned upthread that the crash occurred during
> > > > BIND of a prepared query, so it better had been that a custom plan was
> > > > being executed, because a generic one based on fewer partitions would
> > > > be thrown away due to A.I.M. invoked during AcquireExecutorLocks().
> > >
> > > Well this statement should only be executed once, and should be using
> > > PQexecParams and not PQexecPrepared (pygresql: pg.DB().query_prepared()).
> > >
> > > (gdb) p portal->name
> > > $30 = 0xf03238 ""
> > >
> > > (gdb) p portal->prepStmtName
> > > $31 = 0x0
> > >
> > > (gdb) p *portal->cplan
> > > $24 = {magic = 953717834, stmt_list = 0x682ec38, is_oneshot = false,
> > >   is_saved = true, is_valid = true, planRoleId = 16554,
> > >   dependsOnRole = false, saved_xmin = 0, generation = 1, refcount = 1,
> > >   context = 0x682dfd0}
> > >
> > > I'm not sure why is_oneshot=false, though...
> >
> > Perhaps printing *unnamed_stmt_psrc (CachedPlanSource for an unnamed
> > statement) would put out more information.
>
> (gdb) p *unnamed_stmt_psrc
> $49 = {... gplan = 0x0, is_oneshot = false, is_complete = true,
>   is_saved = true, is_valid = false, generation = 1,
>   node = {prev = 0x12fcf28, next = 0xdf2c80}, generic_cost = -1,
>   total_custom_cost = 12187136.69680, num_custom_plans = 1}

From this part, I think it's clear that a custom plan was used, and that's
the only one that this portal seems to know about.

Also, I can see that only SPI ever builds "oneshot" plans, so is_oneshot
would always be false in your use case.

--
Amit Langote
EnterpriseDB: http://www.enterprisedb.com
Re: FailedAssertion("pd_idx == pinfo->nparts", File: "execPartition.c", Line: 1689)
On Fri, Aug 07, 2020 at 01:13:51PM +0900, Amit Langote wrote:
> On Fri, Aug 7, 2020 at 1:05 PM Justin Pryzby wrote:
> > On Fri, Aug 07, 2020 at 12:16:11PM +0900, Amit Langote wrote:
> > > Curiously, Justin mentioned upthread that the crash occurred during
> > > BIND of a prepared query, so it better had been that a custom plan was
> > > being executed, because a generic one based on fewer partitions would
> > > be thrown away due to A.I.M. invoked during AcquireExecutorLocks().
> >
> > Well this statement should only be executed once, and should be using
> > PQexecParams and not PQexecPrepared (pygresql: pg.DB().query_prepared()).
> >
> > (gdb) p portal->name
> > $30 = 0xf03238 ""
> >
> > (gdb) p portal->prepStmtName
> > $31 = 0x0
> >
> > (gdb) p *portal->cplan
> > $24 = {magic = 953717834, stmt_list = 0x682ec38, is_oneshot = false,
> >   is_saved = true, is_valid = true, planRoleId = 16554,
> >   dependsOnRole = false, saved_xmin = 0, generation = 1, refcount = 1,
> >   context = 0x682dfd0}
> >
> > I'm not sure why is_oneshot=false, though...
>
> Perhaps printing *unnamed_stmt_psrc (CachedPlanSource for an unnamed
> statement) would put out more information.

(gdb) p *unnamed_stmt_psrc
$49 = {magic = 195726186, raw_parse_tree = 0xfae788,
  query_string = 0xfaddc0 "\n", ' ' ,
    "SELECT $3::timestamp as start_time, $2::int as interval_seconds,\n", ' ' ,
    "first_cgi as cgi, gsm_carr_mcc||gsm_carr_mnc as home_plmn,\n", ' ' ,
    "SUM(chargeable_"...,
  commandTag = CMDTAG_SELECT, param_types = 0x1254400, num_params = 3,
  parserSetup = 0x0, parserSetupArg = 0x0, cursor_options = 256,
  fixed_result = true, resultDesc = 0x1376670, context = 0xfae550,
  query_list = 0x103c9a8, relationOids = 0x11aa580, invalItems = 0x11aa600,
  search_path = 0x11aa878, query_context = 0xf85790, rewriteRoleId = 16554,
  rewriteRowSecurity = true, dependsOnRLS = false, gplan = 0x0,
  is_oneshot = false, is_complete = true, is_saved = true, is_valid = false,
  generation = 1, node = {prev = 0x12fcf28, next = 0xdf2c80},
  generic_cost = -1, total_custom_cost = 12187136.69680,
  num_custom_plans = 1}

--
Justin
Re: FailedAssertion("pd_idx == pinfo->nparts", File: "execPartition.c", Line: 1689)
On Fri, Aug 7, 2020 at 1:05 PM Justin Pryzby wrote:
> On Fri, Aug 07, 2020 at 12:16:11PM +0900, Amit Langote wrote:
> > Curiously, Justin mentioned upthread that the crash occurred during
> > BIND of a prepared query, so it better had been that a custom plan was
> > being executed, because a generic one based on fewer partitions would
> > be thrown away due to A.I.M. invoked during AcquireExecutorLocks().
>
> Well this statement should only be executed once, and should be using
> PQexecParams and not PQexecPrepared (pygresql: pg.DB().query_prepared()).
>
> (gdb) p portal->name
> $30 = 0xf03238 ""
>
> (gdb) p portal->prepStmtName
> $31 = 0x0
>
> (gdb) p *portal->cplan
> $24 = {magic = 953717834, stmt_list = 0x682ec38, is_oneshot = false,
>   is_saved = true, is_valid = true, planRoleId = 16554,
>   dependsOnRole = false, saved_xmin = 0, generation = 1, refcount = 1,
>   context = 0x682dfd0}
>
> I'm not sure why is_oneshot=false, though...

Perhaps printing *unnamed_stmt_psrc (CachedPlanSource for an unnamed
statement) would put out more information.

--
Amit Langote
EnterpriseDB: http://www.enterprisedb.com
Re: FailedAssertion("pd_idx == pinfo->nparts", File: "execPartition.c", Line: 1689)
On Fri, Aug 07, 2020 at 12:16:11PM +0900, Amit Langote wrote:
> Curiously, Justin mentioned upthread that the crash occurred during
> BIND of a prepared query, so it better had been that a custom plan was
> being executed, because a generic one based on fewer partitions would
> be thrown away due to A.I.M. invoked during AcquireExecutorLocks().

Well this statement should only be executed once, and should be using
PQexecParams and not PQexecPrepared (pygresql: pg.DB().query_prepared()).

(gdb) p portal->name
$30 = 0xf03238 ""

(gdb) p portal->prepStmtName
$31 = 0x0

(gdb) p *portal->cplan
$24 = {magic = 953717834, stmt_list = 0x682ec38, is_oneshot = false,
  is_saved = true, is_valid = true, planRoleId = 16554,
  dependsOnRole = false, saved_xmin = 0, generation = 1, refcount = 1,
  context = 0x682dfd0}

I'm not sure why is_oneshot=false, though...

--
Justin
Re: FailedAssertion("pd_idx == pinfo->nparts", File: "execPartition.c", Line: 1689)
On Fri, Aug 07, 2020 at 12:16:11PM +0900, Amit Langote wrote:
> Curiously, Justin mentioned upthread that the crash occurred during
> BIND of a prepared query, so it better had been that a custom plan was
> being executed,

I'm looking at how to check that ... can you give a hint?

--
Justin
Re: FailedAssertion("pd_idx == pinfo->nparts", File: "execPartition.c", Line: 1689)
Amit Langote writes:
> On Fri, Aug 7, 2020 at 9:32 AM Tom Lane wrote:
>> ... Amit's recipe for reproducing the bug works because there are
>> other relation lock acquisitions (and hence AcceptInvalidationMessages
>> calls) later in planning than where he asked us to wait.  So this
>> effectively tests a scenario where a very late A.I.M. call within the
>> planner detects an inval event for some already-planned relation, and
>> that seems like a valid-enough scenario.

> Agreed.

> Curiously, Justin mentioned upthread that the crash occurred during
> BIND of a prepared query, so it better had been that a custom plan was
> being executed, because a generic one based on fewer partitions would
> be thrown away due to A.I.M. invoked during AcquireExecutorLocks().

Based on the above, it seems plausible that the plancache did throw away
an old plan and try to replan, but the inval message announcing partition
addition arrived too late during that planning cycle.  Just like the
normal execution path, the plancache code path won't do more than one
iteration of planning on the way to a demanded query execution.

>> (BTW, I checked and found that this test does *not* expose the problems
>> with Amit's original patch.  Not sure if it's worth trying to finagle
>> it so it does.)

> I tried to figure out a scenario where my patch would fail but
> couldn't come up with one either, but it's no proof that it isn't
> wrong.  For example, I can see that pinfo->relid_map[pinfo->nparts]
> can be accessed with my patch which is not correct.

Yeah, touching array entries off the end of the relid_map array
definitely seems possible with that coding.  But the scenario I was
worried about was that the loop actually attaches the wrong subplan
(one for a different partition) to a partdesc entry.  In an
assert-enabled build, that would have led to assertion failure just
below, because then we could not match up all the remaining relid_map
entries; but in a non-assert build, we'd plow through and bad things
would likely happen during execution.  You might need further
conditions, like the partitions not being all identical, for that to
actually cause any problem.  I'd poked at this for a little bit without
causing an obvious crash, but I can't claim to have tried hard.

			regards, tom lane
Re: FailedAssertion("pd_idx == pinfo->nparts", File: "execPartition.c", Line: 1689)
On Fri, Aug 7, 2020 at 9:32 AM Tom Lane wrote:
> Andy Fan writes:
> > Attached is the v2 patch.

Thanks Andy and Tom for this.

> Forgot to mention that I'd envisioned adding this as a src/test/modules/
> module; contrib/ is for things that we intend to expose to users, which
> I think this isn't.
>
> I played around with this and got the isolation test I'd experimented
> with yesterday to work with it.  If you revert 7a980dfc6 then the
> attached patch will expose that bug.  Interestingly, I had to add an
> explicit AcceptInvalidationMessages() call to reproduce the bug; so
> apparently we do none of those between planning and execution in the
> ordinary query code path.  Arguably, that means we're testing a scenario
> somewhat different from what can happen in live databases, but I think
> it's OK.  Amit's recipe for reproducing the bug works because there are
> other relation lock acquisitions (and hence AcceptInvalidationMessages
> calls) later in planning than where he asked us to wait.  So this
> effectively tests a scenario where a very late A.I.M. call within the
> planner detects an inval event for some already-planned relation, and
> that seems like a valid-enough scenario.

Agreed.

Curiously, Justin mentioned upthread that the crash occurred during
BIND of a prepared query, so it better had been that a custom plan was
being executed, because a generic one based on fewer partitions would
be thrown away due to A.I.M. invoked during AcquireExecutorLocks().

> Anyway, attached find a reviewed version of your patch plus a test
> scenario contributed by me (I was too lazy to split it into two
> patches).  Barring objections, I'll push this tomorrow or so.
>
> (BTW, I checked and found that this test does *not* expose the problems
> with Amit's original patch.  Not sure if it's worth trying to finagle
> it so it does.)

I tried to figure out a scenario where my patch would fail but couldn't
come up with one either, but it's no proof that it isn't wrong.  For
example, I can see that pinfo->relid_map[pinfo->nparts] can be accessed
with my patch which is not correct.

--
Amit Langote
EnterpriseDB: http://www.enterprisedb.com
Re: FailedAssertion("pd_idx == pinfo->nparts", File: "execPartition.c", Line: 1689)
Andy Fan writes:
> I am not sure if we need to limit the scope to "between parsing and
> execution",

Yeah, there might be reason to add similar functionality in other places
later.  I'm not sure where yet --- but that idea does make me slightly
unhappy with the "delay_execution" moniker.  I don't have a better name
though ...

			regards, tom lane
Re: FailedAssertion("pd_idx == pinfo->nparts", File: "execPartition.c", Line: 1689)
On Fri, Aug 7, 2020 at 8:32 AM Tom Lane wrote:
> Andy Fan writes:
> > Attached is the v2 patch.
>
> Forgot to mention that I'd envisioned adding this as a src/test/modules/
> module; contrib/ is for things that we intend to expose to users, which
> I think this isn't.
>
> I played around with this and got the isolation test I'd experimented
> with yesterday to work with it.  If you revert 7a980dfc6 then the
> attached patch will expose that bug.  Interestingly, I had to add an
> explicit AcceptInvalidationMessages() call to reproduce the bug; so
> apparently we do none of those between planning and execution in the
> ordinary query code path.  Arguably, that means we're testing a scenario
> somewhat different from what can happen in live databases, but I think
> it's OK.  Amit's recipe for reproducing the bug works because there are
> other relation lock acquisitions (and hence AcceptInvalidationMessages
> calls) later in planning than where he asked us to wait.  So this
> effectively tests a scenario where a very late A.I.M. call within the
> planner detects an inval event for some already-planned relation, and
> that seems like a valid-enough scenario.
>
> Anyway, attached find a reviewed version of your patch plus a test
> scenario contributed by me (I was too lazy to split it into two
> patches).  Barring objections, I'll push this tomorrow or so.
>
> (BTW, I checked and found that this test does *not* expose the problems
> with Amit's original patch.  Not sure if it's worth trying to finagle
> it so it does.)
>
>			regards, tom lane

> + * delay_execution.c
> + *		Test module to allow delay between parsing and execution of a query.

I am not sure if we need to limit the scope to "between parsing and
execution"; IMO, it can be used at any place where we have a hook, so
that delay_execution can inject the lock_unlock logic with a predefined
lock id.  Probably the test writer only wants one place blocked; then
delay_execution.xxx_lock_id can be set so that only the given lock ID
is considered.

Just my 0.01 cents.

--
Best Regards
Andy Fan
Re: FailedAssertion("pd_idx == pinfo->nparts", File: "execPartition.c", Line: 1689)
Andy Fan writes:
> Attached is the v2 patch.

Forgot to mention that I'd envisioned adding this as a src/test/modules/
module; contrib/ is for things that we intend to expose to users, which
I think this isn't.

I played around with this and got the isolation test I'd experimented
with yesterday to work with it.  If you revert 7a980dfc6 then the
attached patch will expose that bug.  Interestingly, I had to add an
explicit AcceptInvalidationMessages() call to reproduce the bug; so
apparently we do none of those between planning and execution in the
ordinary query code path.  Arguably, that means we're testing a scenario
somewhat different from what can happen in live databases, but I think
it's OK.  Amit's recipe for reproducing the bug works because there are
other relation lock acquisitions (and hence AcceptInvalidationMessages
calls) later in planning than where he asked us to wait.  So this
effectively tests a scenario where a very late A.I.M. call within the
planner detects an inval event for some already-planned relation, and
that seems like a valid-enough scenario.

Anyway, attached find a reviewed version of your patch plus a test
scenario contributed by me (I was too lazy to split it into two
patches).  Barring objections, I'll push this tomorrow or so.

(BTW, I checked and found that this test does *not* expose the problems
with Amit's original patch.  Not sure if it's worth trying to finagle
it so it does.)
			regards, tom lane

diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 29de73c060..1428529b04 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -7,6 +7,7 @@ include $(top_builddir)/src/Makefile.global
 SUBDIRS = \
 	brin \
 	commit_ts \
+	delay_execution \
 	dummy_index_am \
 	dummy_seclabel \
 	snapshot_too_old \
diff --git a/src/test/modules/delay_execution/.gitignore b/src/test/modules/delay_execution/.gitignore
new file mode 100644
index 00..ba2160b66c
--- /dev/null
+++ b/src/test/modules/delay_execution/.gitignore
@@ -0,0 +1,3 @@
+# Generated subdirectories
+/output_iso/
+/tmp_check_iso/
diff --git a/src/test/modules/delay_execution/Makefile b/src/test/modules/delay_execution/Makefile
new file mode 100644
index 00..f270aebf3a
--- /dev/null
+++ b/src/test/modules/delay_execution/Makefile
@@ -0,0 +1,21 @@
+# src/test/modules/delay_execution/Makefile
+
+PGFILEDESC = "delay_execution - allow delay between parsing and execution"
+
+MODULE_big = delay_execution
+OBJS = \
+	$(WIN32RES) \
+	delay_execution.o
+
+ISOLATION = partition-addition
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/delay_execution
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/delay_execution/delay_execution.c b/src/test/modules/delay_execution/delay_execution.c
new file mode 100644
index 00..03ea23d0f2
--- /dev/null
+++ b/src/test/modules/delay_execution/delay_execution.c
@@ -0,0 +1,104 @@
+/*-
+ *
+ * delay_execution.c
+ *		Test module to allow delay between parsing and execution of a query.
+ *
+ * The delay is implemented by taking and immediately releasing a specified
+ * advisory lock.  If another process has previously taken that lock, the
+ * current process will be blocked until the lock is released; otherwise,
+ * there's no effect.  This allows an isolationtester script to reliably
+ * test behaviors where some specified action happens in another backend
+ * between parsing and execution of any desired query.
+ *
+ * Copyright (c) 2020, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *		src/test/modules/delay_execution/delay_execution.c
+ *
+ *-
+ */
+
+#include "postgres.h"
+
+#include
+
+#include "optimizer/planner.h"
+#include "utils/builtins.h"
+#include "utils/guc.h"
+#include "utils/inval.h"
+
+
+PG_MODULE_MAGIC;
+
+/* GUC: advisory lock ID to use.  Zero disables the feature. */
+static int	post_planning_lock_id = 0;
+
+/* Save previous planner hook user to be a good citizen */
+static planner_hook_type prev_planner_hook = NULL;
+
+/* Module load/unload functions */
+void		_PG_init(void);
+void		_PG_fini(void);
+
+
+/* planner_hook function to provide the desired delay */
+static PlannedStmt *
+delay_execution_planner(Query *parse, const char *query_string,
+						int cursorOptions, ParamListInfo boundParams)
+{
+	PlannedStmt *result;
+
+	/* Invoke the planner, possibly via a previous hook user */
+	if (prev_planner_hook)
+		result = prev_planner_hook(parse, query_string, cursorOptions,
+								   boundParams);
+	else
+		result = standard_planner(parse, query_string, cursorOptions,
+								  boundParams);
+
+	/* If enabled, delay by taking and rele
Re: FailedAssertion("pd_idx == pinfo->nparts", File: "execPartition.c", Line: 1689)
On Thu, Aug 6, 2020 at 10:42 PM Tom Lane wrote:
> Andy Fan writes:
> > On Thu, Aug 6, 2020 at 12:02 PM Tom Lane wrote:
> >> See my straw-man proposal downthread.
>
> > Thanks for your explanation, I checked it again and it looks a very clean
> > method.  The attached is a draft patch based on my understanding.  Hope
> > I didn't misunderstand you..
>
> Ah, I was going to play around with that today, but you beat me to it ;-)

Very glad to be helpful.

> A few thoughts after a quick look at the patch:
>
> * I had envisioned that there's a custom GUC controlling the lock ID
> used; this would allow blocking different sessions at different points,
> if we ever need that.  Also, I'd make the GUC start out as zero which
> means "do nothing", so that merely loading the module has no immediate
> effect.

I forgot to say that I didn't get the point of the GUC variable in the
last thread; now I think it is a smart idea, so I added it.  In this way,
one session can only be blocked at one place, which may not be an issue
in practice.

> * Don't really see the point of the before-planning lock.

Yes, it was removed now.

> * Rather than exposing internal declarations from lockfuncs.c, you
> could just write calls to pg_advisory_lock_int8 etc. using
> DirectFunctionCall1.

Thanks for sharing it, this method looks pretty good.

> * We need some better name than "test_module".  I had vaguely thought
> about "delay_execution", but am surely open to better ideas.

Both names look good to me; delay_execution looks better, so it is used
in v2.

Attached is the v2 patch.

--
Best Regards
Andy Fan

v2-0001-delay_execution-used-for-concurrency-case-simulat.patch
Description: Binary data
Re: FailedAssertion("pd_idx == pinfo->nparts", File: "execPartition.c", Line: 1689)
Andy Fan writes:
> On Thu, Aug 6, 2020 at 12:02 PM Tom Lane wrote:
>> See my straw-man proposal downthread.

> Thanks for your explanation, I checked it again and it looks a very clean
> method.  The attached is a draft patch based on my understanding.  Hope
> I didn't misunderstand you..

Ah, I was going to play around with that today, but you beat me to it ;-)

A few thoughts after a quick look at the patch:

* I had envisioned that there's a custom GUC controlling the lock ID
used; this would allow blocking different sessions at different points,
if we ever need that.  Also, I'd make the GUC start out as zero which
means "do nothing", so that merely loading the module has no immediate
effect.

* Don't really see the point of the before-planning lock.

* Rather than exposing internal declarations from lockfuncs.c, you
could just write calls to pg_advisory_lock_int8 etc. using
DirectFunctionCall1.

* We need some better name than "test_module".  I had vaguely thought
about "delay_execution", but am surely open to better ideas.

			regards, tom lane
Re: FailedAssertion("pd_idx == pinfo->nparts", File: "execPartition.c", Line: 1689)
On Thu, Aug 6, 2020 at 12:02 PM Tom Lane wrote:
> Andy Fan writes:
> > On Thu, Aug 6, 2020 at 2:22 AM Tom Lane wrote:
> >> In the longer term, it's annoying that we have no test methodology
> >> for this other than "manually set a breakpoint here".
>
> > One of the methods I see is we can just add some GUC variable for some
> > action injection.  basically it adds some code based on the GUC like this;
>
> See my straw-man proposal downthread.  I'm not very excited about putting
> things like this into the standard build, because it's really hard to be
> sure that there are no security-hazard-ish downsides of putting in ways to
> get at testing behaviors from standard SQL.  And then there's the question
> of whether you're adding noticeable overhead to production builds.  So a
> loadable module that can use some existing hook to provide the needed
> behavior seems like a better plan to me, whenever we can do it that way.
>
> In general, though, it seems like we've seldom regretted investments in
> test tooling.
>
>			regards, tom lane

Thanks for your explanation, I checked it again and it looks a very clean
method.  The attached is a draft patch based on my understanding.  Hope
I didn't misunderstand you..

--
Best Regards
Andy Fan

v1-0001-test_module-used-for-concurrency-case-simulation.patch
Description: Binary data
Re: FailedAssertion("pd_idx == pinfo->nparts", File: "execPartition.c", Line: 1689)
Andy Fan writes:
> On Thu, Aug 6, 2020 at 2:22 AM Tom Lane wrote:
>> In the longer term, it's annoying that we have no test methodology
>> for this other than "manually set a breakpoint here".

> One of the methods I see is we can just add some GUC variable for some
> action injection.  basically it adds some code based on the GUC like this;

See my straw-man proposal downthread.  I'm not very excited about putting
things like this into the standard build, because it's really hard to be
sure that there are no security-hazard-ish downsides of putting in ways to
get at testing behaviors from standard SQL.  And then there's the question
of whether you're adding noticeable overhead to production builds.  So a
loadable module that can use some existing hook to provide the needed
behavior seems like a better plan to me, whenever we can do it that way.

In general, though, it seems like we've seldom regretted investments in
test tooling.

			regards, tom lane
Re: FailedAssertion("pd_idx == pinfo->nparts", File: "execPartition.c", Line: 1689)
On Thu, Aug 6, 2020 at 2:22 AM Tom Lane wrote:
> Robert Haas writes:
> > On Wed, Aug 5, 2020 at 1:30 PM Tom Lane wrote:
> >> I'm strongly tempted to convert the trailing Assert to an actual
> >> test-and-elog, too, but didn't do so here.
>
> > I was thinking about that, too.  +1 for taking that step.
>
> Will do.
>
> In the longer term, it's annoying that we have no test methodology
> for this other than "manually set a breakpoint here".

One of the methods I see is we can just add some GUC variable for some
action injection.  Basically it adds some code based on the GUC, like
this:

    if (shall_delay_planning)
        sleep(10);

AFAIK, MongoDB uses such technology heavily in their test framework.
First it defines the fail point [1], and then does code injection if the
fail point is set [2].  At last, during the test it can set a fail point
like a GUC, but with more attributes [3].  If that is useful in PG as
well and it is not an urgent task, I would like to help in this
direction.

[1] https://github.com/mongodb/mongo/search?q=MONGO_FAIL_POINT_DEFINE&unscoped_q=MONGO_FAIL_POINT_DEFINE
[2] https://github.com/mongodb/mongo/blob/d4e7ea57599b44353b5393afedee8ae5670837b3/src/mongo/db/repl/repl_set_config.cpp#L475
[3] https://github.com/mongodb/mongo/blob/e07c2d29aded5a30ff08b5ce6a436b6ef6f44014/src/mongo/shell/replsettest.js#L1427

> If we're going
> to allow plan-relevant DDL changes to happen with less than full table
> lock, I think we need to improve that.  I spent a little bit of time
> just now trying to build an isolationtester case for this, and failed
> completely.  So I wonder if we can create some sort of test module that
> allows capture of a plan tree and then execution of that plan tree later
> (even after relcache inval would normally have forced replanning).
> Obviously that could not be a normal SQL-accessible feature, because
> some types of invals would make the plan completely wrong, but for
> testing purposes it'd be mighty helpful to check that a stale plan
> still works.
>
>			regards, tom lane

--
Best Regards
Andy Fan
Re: FailedAssertion("pd_idx == pinfo->nparts", File: "execPartition.c", Line: 1689)
On Thu, Aug 6, 2020 at 2:30 AM Tom Lane wrote:
> Amit Langote writes:
> > The attached patch should fix that.
>
> I don't like this patch at all though; I do not think it is being nearly
> careful enough to ensure that it's matched the surviving relation OIDs
> correctly.  In particular it blithely assumes that a zero in relid_map
> *must* match the immediately next entry in partdesc->oids, which is easy
> to break if the new partition is adjacent to the one the planner managed
> to prune.

Indeed, you're right.

> So I think we should do it more like the attached.

Thanks for pushing that.

--
Amit Langote
EnterpriseDB: http://www.enterprisedb.com
Re: FailedAssertion("pd_idx == pinfo->nparts", File: "execPartition.c", Line: 1689)
On Wed, Aug 5, 2020 at 4:19 PM Tom Lane wrote:
> After thinking about it for a little bit, I'm envisioning a test module
> that can be loaded into a session, and then it gets into the planner_hook,
> and what it does is after each planner execution, take and release an
> advisory lock with some selectable ID.  Then we can construct
> isolationtester specs that do something like
>
>	session 1			session 2
>
>	LOAD test-module;
>	SET custom_guc_for_lock_id = n;
>	prepare test tables;
>					SELECT pg_advisory_lock(n);
>	SELECT victim-query-here;
>	... after planning, query blocks on lock
>					perform DDL changes;
>					SELECT pg_advisory_unlock(n);
>	... query executes with now-stale plan

Very sneaky!

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: FailedAssertion("pd_idx == pinfo->nparts", File: "execPartition.c", Line: 1689)
Robert Haas writes:
> On Wed, Aug 5, 2020 at 2:22 PM Tom Lane wrote:
>> In the longer term, it's annoying that we have no test methodology
>> for this other than "manually set a breakpoint here".  If we're going
>> to allow plan-relevant DDL changes to happen with less than full table
>> lock, I think we need to improve that.  I spent a little bit of time
>> just now trying to build an isolationtester case for this, and failed
>> completely.  So I wonder if we can create some sort of test module that
>> allows capture of a plan tree and then execution of that plan tree later
>> (even after relcache inval would normally have forced replanning).
>> Obviously that could not be a normal SQL-accessible feature, because
>> some types of invals would make the plan completely wrong, but for
>> testing purposes it'd be mighty helpful to check that a stale plan
>> still works.

> That's an interesting idea.  I don't know exactly how it would work,
> but I agree that it would allow useful testing that we can't do today.

After thinking about it for a little bit, I'm envisioning a test module
that can be loaded into a session, and then it gets into the planner_hook,
and what it does is after each planner execution, take and release an
advisory lock with some selectable ID.  Then we can construct
isolationtester specs that do something like

	session 1			session 2

	LOAD test-module;
	SET custom_guc_for_lock_id = n;
	prepare test tables;
					SELECT pg_advisory_lock(n);
	SELECT victim-query-here;
	... after planning, query blocks on lock
					perform DDL changes;
					SELECT pg_advisory_unlock(n);
	... query executes with now-stale plan

			regards, tom lane
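The two-session flow above maps naturally onto an isolationtester spec file. The following is only an illustrative sketch of that shape, not the spec that was ultimately committed: the step names, the test table, the lock ID, and the GUC name delay_execution.post_planning_lock_id are assumptions for the example.

```
# Sketch of an isolationtester spec following the flow described above.
# All names here (steps, table, GUC) are illustrative assumptions.
setup
{
  CREATE TABLE foo (a int) PARTITION BY LIST (a);
  CREATE TABLE foo1 PARTITION OF foo FOR VALUES IN (1);
}

teardown
{
  DROP TABLE foo;
}

session "s1"
setup		{ LOAD 'delay_execution';
		  SET delay_execution.post_planning_lock_id = 12345; }
step "s1exec"	{ SELECT * FROM foo; }

session "s2"
step "s2lock"	{ SELECT pg_advisory_lock(12345); }
step "s2addp"	{ CREATE TABLE foo2 PARTITION OF foo FOR VALUES IN (2); }
step "s2unlk"	{ SELECT pg_advisory_unlock(12345); }

# s1's query plans, then blocks on the advisory lock; s2 attaches a new
# partition and releases the lock, so s1 executes a now-stale plan.
permutation "s2lock" "s1exec" "s2addp" "s2unlk"
```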
Re: FailedAssertion("pd_idx == pinfo->nparts", File: "execPartition.c", Line: 1689)
On Wed, Aug 5, 2020 at 2:22 PM Tom Lane wrote:
> In the longer term, it's annoying that we have no test methodology
> for this other than "manually set a breakpoint here".  If we're going
> to allow plan-relevant DDL changes to happen with less than full table
> lock, I think we need to improve that.  I spent a little bit of time
> just now trying to build an isolationtester case for this, and failed
> completely.  So I wonder if we can create some sort of test module that
> allows capture of a plan tree and then execution of that plan tree later
> (even after relcache inval would normally have forced replanning).
> Obviously that could not be a normal SQL-accessible feature, because
> some types of invals would make the plan completely wrong, but for
> testing purposes it'd be mighty helpful to check that a stale plan
> still works.

That's an interesting idea.  I don't know exactly how it would work,
but I agree that it would allow useful testing that we can't do today.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: FailedAssertion("pd_idx == pinfo->nparts", File: "execPartition.c", Line: 1689)
Robert Haas writes:
> On Wed, Aug 5, 2020 at 1:30 PM Tom Lane wrote:
>> I'm strongly tempted to convert the trailing Assert to an actual
>> test-and-elog, too, but didn't do so here.

> I was thinking about that, too.  +1 for taking that step.

Will do.

In the longer term, it's annoying that we have no test methodology
for this other than "manually set a breakpoint here".  If we're going
to allow plan-relevant DDL changes to happen with less than full table
lock, I think we need to improve that.  I spent a little bit of time
just now trying to build an isolationtester case for this, and failed
completely.  So I wonder if we can create some sort of test module that
allows capture of a plan tree and then execution of that plan tree later
(even after relcache inval would normally have forced replanning).
Obviously that could not be a normal SQL-accessible feature, because
some types of invals would make the plan completely wrong, but for
testing purposes it'd be mighty helpful to check that a stale plan
still works.

			regards, tom lane
Re: FailedAssertion("pd_idx == pinfo->nparts", File: "execPartition.c", Line: 1689)
On Wed, Aug 5, 2020 at 1:30 PM Tom Lane wrote:
> I don't like this patch at all though; I do not think it is being nearly
> careful enough to ensure that it's matched the surviving relation OIDs
> correctly.  In particular it blithely assumes that a zero in relid_map
> *must* match the immediately next entry in partdesc->oids, which is easy
> to break if the new partition is adjacent to the one the planner managed
> to prune.  So I think we should do it more like the attached.

Ooh, nice catch.

> I'm strongly tempted to convert the trailing Assert to an actual
> test-and-elog, too, but didn't do so here.

I was thinking about that, too.  +1 for taking that step.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: FailedAssertion("pd_idx == pinfo->nparts", File: "execPartition.c", Line: 1689)
Amit Langote writes:
> The crash reported here is in the other case where the concurrently
> added partitions cause the execution-time PartitionDesc to have more
> partitions than the one that PartitionedRelPruneInfo is based on.
> I was able to reproduce such a crash as follows:

Yeah, I can repeat the case per these directions.  I concur that the
issue is that ExecCreatePartitionPruneState is failing to cope with
zeroes in the relid_map.

> The attached patch should fix that.

I don't like this patch at all though; I do not think it is being nearly
careful enough to ensure that it's matched the surviving relation OIDs
correctly.  In particular it blithely assumes that a zero in relid_map
*must* match the immediately next entry in partdesc->oids, which is easy
to break if the new partition is adjacent to the one the planner managed
to prune.  So I think we should do it more like the attached.

I'm strongly tempted to convert the trailing Assert to an actual
test-and-elog, too, but didn't do so here.

			regards, tom lane

diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index fb6ce49056..221a34e738 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -1673,7 +1673,19 @@ ExecCreatePartitionPruneState(PlanState *planstate,
 			pprune->subpart_map = palloc(sizeof(int) * partdesc->nparts);
 			for (pp_idx = 0; pp_idx < partdesc->nparts; ++pp_idx)
 			{
-				if (pinfo->relid_map[pd_idx] != partdesc->oids[pp_idx])
+				/*
+				 * pinfo->relid_map can contain InvalidOid entries for
+				 * partitions pruned by the planner.  We cannot tell
+				 * exactly which of the partdesc entries these correspond
+				 * to, but we don't have to; just skip over them.  The
+				 * non-pruned relid_map entries, however, had better be a
+				 * subset of the partdesc entries and in the same order.
+				 */
+				while (pd_idx < pinfo->nparts &&
+					   !OidIsValid(pinfo->relid_map[pd_idx]))
+					pd_idx++;
+				if (pd_idx >= pinfo->nparts ||
+					pinfo->relid_map[pd_idx] != partdesc->oids[pp_idx])
 				{
 					pprune->subplan_map[pp_idx] = -1;
 					pprune->subpart_map[pp_idx] = -1;
@@ -1686,6 +1698,14 @@ ExecCreatePartitionPruneState(PlanState *planstate,
 						pinfo->subpart_map[pd_idx++];
 				}
 			}
+
+			/*
+			 * It might seem that we need to skip any trailing InvalidOid
+			 * entries in pinfo->relid_map before asserting that we
+			 * scanned all of the relid_map.  But we will have skipped
+			 * them above, because they must correspond to some
+			 * partdesc->oids entries; we just couldn't tell which.
+			 */
 			Assert(pd_idx == pinfo->nparts);
 		}
Re: FailedAssertion("pd_idx == pinfo->nparts", File: "execPartition.c", Line: 1689)
On Wed, Aug 5, 2020 at 10:04 AM Justin Pryzby wrote: > On Wed, Aug 05, 2020 at 09:53:44AM +0900, Amit Langote wrote: > > On Wed, Aug 5, 2020 at 9:52 AM Amit Langote wrote: > > > On Wed, Aug 5, 2020 at 9:32 AM Justin Pryzby wrote: > > > > On Wed, Aug 05, 2020 at 09:26:20AM +0900, Amit Langote wrote: > > > > > On Wed, Aug 5, 2020 at 12:11 AM Justin Pryzby > > > > > wrote: > > > > > > > > > > > > On Tue, Aug 04, 2020 at 08:12:10PM +0900, Amit Langote wrote: > > > > > > > It may be this commit that went into PG 12 that is causing the > > > > > > > problem: > > > > > > > > > > > > Thanks for digging into this. > > > > > > > > > > > > > to account for partitions that were pruned by the planner for > > > > > > > which we > > > > > > > decided to put 0 into relid_map, but it only considered the case > > > > > > > where > > > > > > > the number of partitions doesn't change since the plan was > > > > > > > created. > > > > > > > The crash reported here is in the other case where the > > > > > > > concurrently > > > > > > > added partitions cause the execution-time PartitionDesc to have > > > > > > > more > > > > > > > partitions than the one that PartitionedRelPruneInfo is based on. > > > > > > > > > > > > Is there anything else needed to check that my crash matches your > > > > > > analysis ? > > > > > > > > > > If you can spot a 0 in the output of the following, then yes. > > > > > > > > > > (gdb) p *pinfo->relid_map@pinfo->nparts > > > > > > > > I guess you knew that an earlier message has just that. Thanks. > > > > https://www.postgresql.org/message-id/20200803161133.ga21...@telsasoft.com > > > > > > Yeah, you showed: > > > > > > (gdb) p *pinfo->relid_map@414 > > > > > > And there is indeed a 0 in there, but I wasn't sure if it was actually > > > in the array or a stray zero due to forcing gdb to show beyond the > > > array bound. Does pinfo->nparts match 414? > > Yes. I typed 414 manually since the array lengths were suspect. 
> > (gdb) p pinfo->nparts > $1 = 414 > (gdb) set print elements 0 > (gdb) p *pinfo->relid_map@pinfo->nparts > $3 = { > 21151836, 21151726, 21151608, 21151498, 21151388, 21151278, 21151168, > 21151055, 2576248, 2576255, 2576262, 2576269, 2576276, 21456497, 22064128, 0} Thanks. There is a 0 in there, which can only be there if the planner was able to prune that last partition. So, the planner saw a table with 414 partitions, was able to prune the last one, and constructed an Append plan with 413 subplans for unpruned partitions as you showed upthread: > (gdb) p *node->appendplans > $17 = {type = T_List, length = 413, max_length = 509, elements = 0x7037400, > initial_elements = 0x7037400} This suggests that the crash I was able to produce is similar to what you saw. -- Amit Langote EnterpriseDB: http://www.enterprisedb.com
Re: FailedAssertion("pd_idx == pinfo->nparts", File: "execPartition.c", Line: 1689)
On Wed, Aug 05, 2020 at 09:53:44AM +0900, Amit Langote wrote: > On Wed, Aug 5, 2020 at 9:52 AM Amit Langote wrote: > > On Wed, Aug 5, 2020 at 9:32 AM Justin Pryzby wrote: > > > On Wed, Aug 05, 2020 at 09:26:20AM +0900, Amit Langote wrote: > > > > On Wed, Aug 5, 2020 at 12:11 AM Justin Pryzby > > > > wrote: > > > > > > > > > > On Tue, Aug 04, 2020 at 08:12:10PM +0900, Amit Langote wrote: > > > > > > It may be this commit that went into PG 12 that is causing the > > > > > > problem: > > > > > > > > > > Thanks for digging into this. > > > > > > > > > > > to account for partitions that were pruned by the planner for which > > > > > > we > > > > > > decided to put 0 into relid_map, but it only considered the case > > > > > > where > > > > > > the number of partitions doesn't change since the plan was created. > > > > > > The crash reported here is in the other case where the concurrently > > > > > > added partitions cause the execution-time PartitionDesc to have more > > > > > > partitions than the one that PartitionedRelPruneInfo is based on. > > > > > > > > > > Is there anything else needed to check that my crash matches your > > > > > analysis ? > > > > > > > > If you can spot a 0 in the output of the following, then yes. > > > > > > > > (gdb) p *pinfo->relid_map@pinfo->nparts > > > I guess you knew that an earlier message has just that. Thanks. > > > https://www.postgresql.org/message-id/20200803161133.ga21...@telsasoft.com > > > > Yeah, you showed: > > > > (gdb) p *pinfo->relid_map@414 > > > > And there is indeed a 0 in there, but I wasn't sure if it was actually > > in the array or a stray zero due to forcing gdb to show beyond the > > array bound. Does pinfo->nparts match 414? Yes. I typed 414 manually since the array lengths were suspect. 
(gdb) p pinfo->nparts $1 = 414 (gdb) set print elements 0 (gdb) p *pinfo->relid_map@pinfo->nparts $3 = {22652203, 22652104, 22651920, 22651654, 22647359, 22645269, 22644012, 22639600, 22638852, 22621975, 22615355, 22615256, 22615069, 22610573, 22606503, 22606404, 22600237, 22600131, 22596610, 22595013, 22594914, 22594725, 22594464, 22589317, 22589216, 22587504, 22582570, 22580264, 22577047, 22576948, 22576763, 22576656, 22574077, 22570911, 22570812, 22564524, 22564113, 22558519, 22557080, 22556981, 22556793, 22555205, 22550680, 22550579, 22548884, 22543899, 22540822, 22536665, 22536566, 22536377, 22535133, 22528876, 22527780, 22526065, 22521131, 22517878, 22513674, 22513575, 22513405, 22513288, 22507520, 22504728, 22504629, 22493699, 22466016, 22458641, 22457551, 22457421, 22457264, 22452879, 22449864, 22449765, 22443560, 22442952, 22436193, 22434644, 22434469, 22434352, 22430792, 22426903, 22426804, 22420732, 22420025, 22413050, 22411963, 22411864, 22411675, 22407652, 22404156, 22404049, 22397550, 22394622, 22390035, 22389936, 22389752, 22388386, 22383211, 22382115, 22381934, 22375210, 22370297, 22367878, 22367779, 22367586, 22362556, 22359928, 22358236, 22353374, 22348704, 22345692, 22345593, 22345399, 22341347, 22336809, 22336709, 22325812, 22292836, 22287756, 22287657, 22287466, 22283194, 22278659, 22278560, 22272041, 22269121, 22264424, 22264325, 22264135, 22260102, 22255418, 22254818, 22248841, 22245824, 22241490, 22241391, 22241210, 22240354, 22236224, 22235123, 22234060, 8744, 8345, 8033, 2528, 2429, 2330, 2144, 2045, 22218408, 22215986, 22215887, 22209311, 22209212, 22207919, 22205203, 22203385, 22203298, 22203211, 22203124, 22202954, 22202859, 22202772, 22201869, 22200438, 22197706, 22195027, 22194932, 22194834, 22191208, 22188412, 22187029, 22182238, 22182134, 22182030, 22181849, 22181737, 22181107, 22175811, 22175710, 22169859, 22169604, 22159266, 22158131, 22158021, 22157824, 22153348, 22153236, 22147308, 22146736, 22143778, 22143599, 22143471, 
22138702, 22138590, 22132612, 22132513, 22132271, 22132172, 22131987, 21935599, 21932664, 21927997, 21925823, 21885889, 21862973, 21859854, 21859671, 21858869, 21853440, 21851884, 21845405, 21842901, 21837523, 21837413, 21837209, 21832347, 21829359, 21827652, 21822602, 21816150, 21805995, 21805812, 21805235, 21798914, 21798026, 21791895, 21791124, 21783854, 21783744, 21783540, 21780568, 21774797, 21774687, 21768326, 21764063, 21759627, 21759517, 21759311, 21755697, 21751690, 21751156, 21744906, 21738543, 21736176, 21735992, 21735769, 21727603, 21725956, 21716432, 21678682, 21670968, 21670858, 21670665, 21669342, 21661932, 21661822, 21655311, 21650838, 21646721, 21646611, 21646409, 21640984, 21637816, 21637706, 21631061, 21622723, 21621459, 21621320, 21621148, 21612902, 21612790, 21606170, 21602265, 21597910, 21597800, 21597605, 21592489, 21589415, 21589305, 21582910, 21578017, 21576758, 21576648, 21572692, 21566633, 21566521, 21560127, 21560017, 21553910, 21553800, 21553613, 21553495, 21549102, 21548992, 21542759, 21540922, 21532093, 21531983, 21531786, 21531676, 21531264, 21531154, 21525290,
Re: FailedAssertion("pd_idx == pinfo->nparts", File: "execPartition.c", Line: 1689)
On Wed, Aug 5, 2020 at 9:52 AM Amit Langote wrote: > On Wed, Aug 5, 2020 at 9:32 AM Justin Pryzby wrote: > > On Wed, Aug 05, 2020 at 09:26:20AM +0900, Amit Langote wrote: > > > On Wed, Aug 5, 2020 at 12:11 AM Justin Pryzby > > > wrote: > > > > > > > > On Tue, Aug 04, 2020 at 08:12:10PM +0900, Amit Langote wrote: > > > > > It may be this commit that went into PG 12 that is causing the > > > > > problem: > > > > > > > > Thanks for digging into this. > > > > > > > > > to account for partitions that were pruned by the planner for which we > > > > > decided to put 0 into relid_map, but it only considered the case where > > > > > the number of partitions doesn't change since the plan was created. > > > > > The crash reported here is in the other case where the concurrently > > > > > added partitions cause the execution-time PartitionDesc to have more > > > > > partitions than the one that PartitionedRelPruneInfo is based on. > > > > > > > > Is there anything else needed to check that my crash matches your > > > > analysis ? > > > > > > If you can spot a 0 in the output of the following, then yes. > > > > > > (gdb) p *pinfo->relid_map@pinfo->nparts > > > > I guess you knew that an earlier message has just that. Thanks. > > https://www.postgresql.org/message-id/20200803161133.ga21...@telsasoft.com > > Yeah, you showed: > > (gdb) p *pinfo->relid_map@414 > > And there is indeed a 0 in there, but I wasn't sure if it was actually > in the array or a stray zero due to forcing gdb to show beyond the > array bound. Does pinfo->nparts match 414? (sorry, I forgot to hit reply all in last two emails.) -- Amit Langote EnterpriseDB: http://www.enterprisedb.com
Re: FailedAssertion("pd_idx == pinfo->nparts", File: "execPartition.c", Line: 1689)
On Mon, Aug 3, 2020 at 12:11 PM Justin Pryzby wrote: > (gdb) p *pinfo->relid_map@414 > (gdb) p *partdesc->oids@415 Whoa, I didn't know about using @ in gdb to print multiple elements. Wild! Anyway, these two arrays differ in that the latter array has 22653702 inserted at the beginning and 22628862 at the end, and also in that a 0 has been removed. This code can't cope with things getting removed, so kaboom. I think Amit probably has the right idea about what's going on here and how to fix it, but I haven't yet had time to study it in detail. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: FailedAssertion("pd_idx == pinfo->nparts", File: "execPartition.c", Line: 1689)
On Tue, Aug 04, 2020 at 08:12:10PM +0900, Amit Langote wrote:
> It may be this commit that went into PG 12 that is causing the problem:

Thanks for digging into this.

> to account for partitions that were pruned by the planner for which we
> decided to put 0 into relid_map, but it only considered the case where
> the number of partitions doesn't change since the plan was created.
> The crash reported here is in the other case where the concurrently
> added partitions cause the execution-time PartitionDesc to have more
> partitions than the one that PartitionedRelPruneInfo is based on.

Is there anything else needed to check that my crash matches your analysis ?

(gdb) up
#4  0x0068444a in ExecInitAppend (node=node@entry=0x7036b90, estate=estate@entry=0x11563f0, eflags=eflags@entry=16) at nodeAppend.c:132
132     nodeAppend.c: No such file or directory.
(gdb) p *node->appendplans
$17 = {type = T_List, length = 413, max_length = 509, elements = 0x7037400, initial_elements = 0x7037400}
(gdb) down
#3  0x00672806 in ExecCreatePartitionPruneState (planstate=planstate@entry=0x908f6d8, partitionpruneinfo=) at execPartition.c:1689
1689    execPartition.c: No such file or directory.
$27 = {ps = {type = T_AppendState, plan = 0x7036b90, state = 0x11563f0, ExecProcNode = 0x6842c0 , ExecProcNodeReal = 0x0, instrument = 0x0, worker_instrument = 0x0, worker_jit_instrument = 0x0, qual = 0x0, lefttree = 0x0, righttree = 0x0, initPlan = 0x0, subPlan = 0x0, chgParam = 0x0, ps_ResultTupleDesc = 0x0, ps_ResultTupleSlot = 0x0, ps_ExprContext = 0x908f7f0, ps_ProjInfo = 0x0, scandesc = 0x0, scanops = 0x0, outerops = 0x0, innerops = 0x0, resultops = 0x0, scanopsfixed = false, outeropsfixed = false, inneropsfixed = false, resultopsfixed = false, scanopsset = false, outeropsset = false, inneropsset = false, resultopsset = false}, appendplans = 0x0, as_nplans = 0, as_whichplan = -1, as_first_partial_plan = 0, as_pstate = 0x0, pstate_len = 0, as_prune_state = 0x0, as_valid_subplans = 0x0, choose_next_subplan = 0x0} -- Justin
Re: FailedAssertion("pd_idx == pinfo->nparts", File: "execPartition.c", Line: 1689)
On Tue, Aug 4, 2020 at 1:11 AM Justin Pryzby wrote:
>
> On Mon, Aug 03, 2020 at 11:41:37AM -0400, Robert Haas wrote:
> > On Sun, Aug 2, 2020 at 2:11 PM Justin Pryzby wrote:
> > > Based on commit logs, I suspect this may be an "older bug", specifically
> > > maybe with:
> > >
> > > |commit 898e5e3290a72d288923260143930fb32036c00c
> > > |Author: Robert Haas
> > > |Date: Thu Mar 7 11:13:12 2019 -0500
> > > |
> > > |Allow ATTACH PARTITION with only ShareUpdateExclusiveLock.
> > >
> > > I don't think it matters, but the process surrounding the table being
> > > INSERTed INTO is more than a little special, involving renames, detaches,
> > > creation, re-attaching within a transaction.  I think that doesn't matter
> > > though, and the issue is surrounding the table being SELECTed *from*,
> > > which is actually behind a view.
> >
> > That's an entirely reasonable guess, but it doesn't seem easy to
> > understand exactly what happened here based on the provided
> > information. The assertion failure probably indicates that
> > pinfo->relid_map[] and partdesc->oids[] differ in some way other than
> > additional elements having been inserted into the latter. For example,
> > some elements might have disappeared, or the order might have changed.
> > This isn't supposed to happen, because DETACH PARTITION requires
> > heavier locking, and the order changing without anything getting
> > detached should be impossible. But evidently it did. If we could dump
> > out the two arrays in question it might shed more light on exactly how
> > things went wrong.

It may be this commit that went into PG 12 that is causing the problem:

commit 428b260f87e8861ba8e58807b69d433db491c4f4
Author: Tom Lane
Date:   Sat Mar 30 18:58:55 2019 -0400

    Speed up planning when partitions can be pruned at plan time.

which had this:

-		/* Double-check that list of relations has not changed. */
-		Assert(memcmp(partdesc->oids, pinfo->relid_map,
-					  pinfo->nparts * sizeof(Oid)) == 0);
+		/*
+		 * Double-check that the list of unpruned relations has not
+		 * changed.  (Pruned partitions are not in relid_map[].)
+		 */
+#ifdef USE_ASSERT_CHECKING
+		for (int k = 0; k < pinfo->nparts; k++)
+		{
+			Assert(partdesc->oids[k] == pinfo->relid_map[k] ||
+				   pinfo->subplan_map[k] == -1);
+		}
+#endif

to account for partitions that were pruned by the planner for which we
decided to put 0 into relid_map, but it only considered the case where
the number of partitions doesn't change since the plan was created.
The crash reported here is in the other case where the concurrently
added partitions cause the execution-time PartitionDesc to have more
partitions than the one that PartitionedRelPruneInfo is based on.

I was able to reproduce such a crash as follows.

Start with these tables in session 1:

create table foo (a int, b int) partition by list (a);
create table foo1 partition of foo for values in (1);
create table foo2 partition of foo for values in (2);
create table foo3 partition of foo for values in (3);

Attach gdb with a breakpoint set in PartitionDirectoryLookup() and run this:

explain analyze select * from foo where a <> 1 and a = (select 2);

In another session:

create table foo4 (like foo);
alter table foo attach partition foo4 for values in (4);

That should finish without waiting for any lock and send an
invalidation message to session 1.  Go back to gdb attached to session
1 and hit continue, resulting in the plan containing runtime pruning
info being executed.  ExecCreatePartitionPruneState() opens foo, which
will now have 4 partitions instead of the 3 that the planner would
have seen, of which foo1 is pruned (a <> 1), so the following block is
executed:

		if (partdesc->nparts == pinfo->nparts)
		...
		else
		{
			int			pd_idx = 0;
			int			pp_idx;

			/*
			 * Some new partitions have appeared since plan time, and
			 * those are reflected in our PartitionDesc but were not
			 * present in the one used to construct subplan_map and
			 * subpart_map.  So we must construct new and longer arrays
			 * where the partitions that were originally present map to
			 * the same place, and any added indexes map to -1, as if the
			 * new partitions had been pruned.
			 */
			pprune->subpart_map = palloc(sizeof(int) * partdesc->nparts);
			for (pp_idx = 0; pp_idx < partdesc->nparts; ++pp_idx)
			{
				if (pinfo->relid_map[pd_idx] != partdesc->oids[pp_idx])
				{
					pprune->subplan_map[pp_idx] = -1;
					pprune->subpart_map[pp_idx] = -1;
Re: FailedAssertion("pd_idx == pinfo->nparts", File: "execPartition.c", Line: 1689)
On Mon, Aug 03, 2020 at 11:41:37AM -0400, Robert Haas wrote: > On Sun, Aug 2, 2020 at 2:11 PM Justin Pryzby wrote: > > Based on commit logs, I suspect this may be an "older bug", specifically > > maybe > > with: > > > > |commit 898e5e3290a72d288923260143930fb32036c00c > > |Author: Robert Haas > > |Date: Thu Mar 7 11:13:12 2019 -0500 > > | > > |Allow ATTACH PARTITION with only ShareUpdateExclusiveLock. > > > > I don't think it matters, but the process surrounding the table being > > INSERTed > > INTO is more than a little special, involving renames, detaches, creation, > > re-attaching within a transaction. I think that doesn't matter though, and > > the > > issue is surrounding the table being SELECTed *from*, which is actually > > behind > > a view. > > That's an entirely reasonable guess, but it doesn't seem easy to > understand exactly what happened here based on the provided > information. The assertion failure probably indicates that > pinfo->relid_map[] and partdesc->oids[] differ in some way other than > additional elements having been inserted into the latter. For example, > some elements might have disappeared, or the order might have changed. > This isn't supposed to happen, because DETACH PARTITION requires > heavier locking, and the order changing without anything getting > detached should be impossible. But evidently it did. If we could dump > out the two arrays in question it might shed more light on exactly how > things went wrong. 
(gdb) p *pinfo->relid_map@414 $8 = {22652203, 22652104, 22651920, 22651654, 22647359, 22645269, 22644012, 22639600, 22638852, 22621975, 22615355, 22615256, 22615069, 22610573, 22606503, 22606404, 22600237, 22600131, 22596610, 22595013, 22594914, 22594725, 22594464, 22589317, 22589216, 22587504, 22582570, 22580264, 22577047, 22576948, 22576763, 22576656, 22574077, 22570911, 22570812, 22564524, 22564113, 22558519, 22557080, 22556981, 22556793, 22555205, 22550680, 22550579, 22548884, 22543899, 22540822, 22536665, 22536566, 22536377, 22535133, 22528876, 22527780, 22526065, 22521131, 22517878, 22513674, 22513575, 22513405, 22513288, 22507520, 22504728, 22504629, 22493699, 22466016, 22458641, 22457551, 22457421, 22457264, 22452879, 22449864, 22449765, 22443560, 22442952, 22436193, 22434644, 22434469, 22434352, 22430792, 22426903, 22426804, 22420732, 22420025, 22413050, 22411963, 22411864, 22411675, 22407652, 22404156, 22404049, 22397550, 22394622, 22390035, 22389936, 22389752, 22388386, 22383211, 22382115, 22381934, 22375210, 22370297, 22367878, 22367779, 22367586, 22362556, 22359928, 22358236, 22353374, 22348704, 22345692, 22345593, 22345399, 22341347, 22336809, 22336709, 22325812, 22292836, 22287756, 22287657, 22287466, 22283194, 22278659, 22278560, 22272041, 22269121, 22264424, 22264325, 22264135, 22260102, 22255418, 22254818, 22248841, 22245824, 22241490, 22241391, 22241210, 22240354, 22236224, 22235123, 22234060, 8744, 8345, 8033, 2528, 2429, 2330, 2144, 2045, 22218408, 22215986, 22215887, 22209311, 22209212, 22207919, 22205203, 22203385, 22203298, 22203211, 22203124, 22202954, 22202859, 22202772, 22201869, 22200438, 22197706, 22195027, 22194932, 22194834, 22191208, 22188412, 22187029, 22182238, 22182134, 22182030, 22181849, 22181737, 22181107, 22175811, 22175710, 22169859, 22169604, 22159266, 22158131, 22158021, 22157824, 22153348, 22153236, 22147308, 22146736, 22143778, 22143599, 22143471, 22138702, 22138590, 22132612, 22132513, 22132271, 22132172, 22131987, 
21935599, 21932664, 21927997, 21925823, 21885889, 21862973, 21859854, 21859671, 21858869, 21853440, 21851884, 21845405, 21842901, 21837523, 21837413, 21837209, 21832347, 21829359, 21827652, 21822602, 21816150, 21805995, 21805812, 21805235, 21798914, 21798026, 21791895, 21791124, 21783854, 21783744, 21783540, 21780568, 21774797, 21774687, 21768326, 21764063, 21759627, 21759517, 21759311, 21755697, 21751690, 21751156, 21744906, 21738543, 21736176, 21735992, 21735769, 21727603, 21725956, 21716432, 21678682, 21670968, 21670858, 21670665, 21669342, 21661932, 21661822, 21655311, 21650838, 21646721, 21646611, 21646409, 21640984, 21637816, 21637706, 21631061, 21622723, 21621459, 21621320, 21621148, 21612902, 21612790, 21606170, 21602265, 21597910, 21597800, 21597605, 21592489, 21589415, 21589305, 21582910, 21578017, 21576758, 21576648, 21572692, 21566633, 21566521, 21560127, 21560017, 21553910, 21553800, 21553613, 21553495, 21549102, 21548992, 21542759, 21540922, 21532093, 21531983, 21531786, 21531676, 21531264, 21531154, 21525290, 21524817, 21519470, 21519360, 21519165, 21516571, 21514269, 21514159, 21508389, 21508138, 21508028, 21507830, 21503457, 21502484, 21496897, 21494287, 21493722, 21493527, 21491807, 21488530, 21486122, 21485766, 21485603, 21485383, 21481969, 21481672, 21476245, 21472576, 21468851, 21468741, 21468546, 21467832, 21460086, 21425406, 21420632, 21420506, 21419974, 2141783
Re: FailedAssertion("pd_idx == pinfo->nparts", File: "execPartition.c", Line: 1689)
On Sun, Aug 2, 2020 at 2:11 PM Justin Pryzby wrote: > Based on commit logs, I suspect this may be an "older bug", specifically maybe > with: > > |commit 898e5e3290a72d288923260143930fb32036c00c > |Author: Robert Haas > |Date: Thu Mar 7 11:13:12 2019 -0500 > | > |Allow ATTACH PARTITION with only ShareUpdateExclusiveLock. > > I don't think it matters, but the process surrounding the table being INSERTed > INTO is more than a little special, involving renames, detaches, creation, > re-attaching within a transaction. I think that doesn't matter though, and > the > issue is surrounding the table being SELECTed *from*, which is actually behind > a view. That's an entirely reasonable guess, but it doesn't seem easy to understand exactly what happened here based on the provided information. The assertion failure probably indicates that pinfo->relid_map[] and partdesc->oids[] differ in some way other than additional elements having been inserted into the latter. For example, some elements might have disappeared, or the order might have changed. This isn't supposed to happen, because DETACH PARTITION requires heavier locking, and the order changing without anything getting detached should be impossible. But evidently it did. If we could dump out the two arrays in question it might shed more light on exactly how things went wrong. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company