Re: A problem in ExecModifyTable
Hi, David

Your answer explains that we still need a ModifyTable node even without leaf partitions. You are right about this. But if you review the source code again:

```
	/*
	 * Fetch rows from subplan, and execute the required table modification
	 * for each row.
	 */
	for (;;)
	{
		...
		/* No more tuples to process? */
		if (TupIsNull(planSlot))
			break;

		/*
		 * For UPDATE/DELETE, fetch the row identity info for the tuple to be
		 * updated/deleted.  For a heap relation, that's a TID; otherwise we
		 * may have a wholerow junk attr that carries the old tuple in toto.
		 * Keep this in step with the part of ExecInitModifyTable that sets up
		 * ri_RowIdAttNo.
		 */
		if (operation == CMD_UPDATE || operation == CMD_DELETE)
		{
			char		relkind;
			Datum		datum;
			bool		isNull;

			relkind = resultRelInfo->ri_RelationDesc->rd_rel->relkind;
			if (relkind == RELKIND_RELATION ||
				relkind == RELKIND_MATVIEW ||
				relkind == RELKIND_PARTITIONED_TABLE)
				...
		}
```

we can see that if the scanned tuple is NULL, the loop breaks. Therefore, the step that extracts the ctid can never be reached.

>We'll need some sort of ResultRelInfo in the case that all partitions
>are pruned. In this case, the tuple returned should be null.
>Using the one for the partitioned table is convenient. I
>believe that's required if there are any statement-level triggers on
>the partitioned table or if there's a RETURNING clause.

If the tuple to be modified is null, there is no need to process the RETURNING clause. Statement-level triggers cannot use the tuple, and triggers on partitioned tables cannot have transition tables. I set Assert(relkind != RELKIND_PARTITIONED_TABLE); in the code, but it never fires. Did I misunderstand your explanation? Could you provide me with a case to verify it?

Thank you very much for your answer.

Regards & Thanks
Adger
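For what it's worth, here is a hypothetical way to drive execution into the all-partitions-pruned case (table names are my own, not from the thread). Because no partition accepts the key value, every leaf is pruned, the subplan returns no tuples, and the ctid-fetching step above is indeed never reached, yet ModifyTable still runs with the partitioned table's ResultRelInfo:

```
postgres=# create table prt (a int, b int) partition by list (a);
CREATE TABLE
postgres=# create table prt_1 partition of prt for values in (1);
CREATE TABLE
postgres=# -- no partition accepts a = 2, so every leaf is pruned away
postgres=# update prt set b = b + 1 where a = 2;
UPDATE 0
```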
A problem in ExecModifyTable
Hi hackers,

Recently, I noticed a great patch in pg 14: "Rework planning and execution of UPDATE and DELETE." (86dc90056dfdbd9d1b891718d2e5614e3e432f35) This changes DML execution on partitioned tables and makes it more friendly. But I am very confused about the following changes:

```
+			relkind = resultRelInfo->ri_RelationDesc->rd_rel->relkind;
+			if (relkind == RELKIND_RELATION ||
+				relkind == RELKIND_MATVIEW ||
+				relkind == RELKIND_PARTITIONED_TABLE)
 			{
-				char		relkind;
-				Datum		datum;
-				bool		isNull;
-
-				relkind = resultRelInfo->ri_RelationDesc->rd_rel->relkind;
-				if (relkind == RELKIND_RELATION || relkind == RELKIND_MATVIEW)
-				{
-					datum = ExecGetJunkAttribute(slot,
-												 junkfilter->jf_junkAttNo,
-												 &isNull);
-					/* shouldn't ever get a null result... */
```

According to my understanding, the parent of a partitioned table does not store any tuples. Then why is "relkind == RELKIND_PARTITIONED_TABLE" suddenly added here? There is no comment on this point in the code. Can you answer my confusion? Be deeply grateful.

Regards & Thanks
Adger
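As a side note, one way to observe what the reworked code fetches is to look at the ModifyTable subplan's targetlist: in the pg 14 design the plan targets the partitioned table itself and carries a row-identity junk column. A sketch with illustrative names (the exact plan text varies by version):

```
postgres=# create table pt (a int, b int) partition by list (a);
CREATE TABLE
postgres=# create table pt_1 partition of pt for values in (1);
CREATE TABLE
postgres=# explain (verbose, costs off) update pt set b = b + 1;
```

The scan node's Output list should include ctid, the row identity that the quoted code extracts for each returned tuple.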
Re: BEFORE ROW triggers for partitioned tables
Hi committer,

Many of our customers expect to use BR triggers on partitioned tables. After I followed your discussion, I also checked your patch. Here are two questions confusing me:

1. Your modification removes the check for BR triggers against partitioned tables, and a more friendly error message is added to ExecInsert and ExecUpdate. You are correct that ExecInsert does not reroute partitions. However, when ExecUpdate modifies partition keys, it is almost equivalent to ExecDelete plus ExecInsert, and it is re-routed (ExecPrepareTupleRouting) once before ExecInsert. Therefore, why should an error be thrown in ExecUpdate? Let's look at a case:

```
postgres=# create table parted (a int, b int, c text) partition by list (a);
CREATE TABLE
postgres=# create table parted_1 partition of parted for values in (1);
CREATE TABLE
postgres=# create table parted_2 partition of parted for values in (2);
CREATE TABLE
postgres=# create function parted_trigfunc() returns trigger language plpgsql as $$
  begin
    new.a = new.a + 1;
    return new;
  end;
$$;
CREATE FUNCTION
postgres=# insert into parted values (1, 1, 'uno uno v1');
INSERT 0 1
postgres=# create trigger t before update on parted for each row execute function parted_trigfunc();
CREATE TRIGGER
postgres=# update parted set c = c || 'v3';
```

If there were no check in ExecUpdate, the above UPDATE would execute successfully. However, with your code, it fails. So, what is the reason for your consideration?

2. In this patch, you only removed the restriction on BR triggers against partitioned tables, but did not fundamentally solve the problem caused by modifying partition keys in a BR trigger. What are the difficulties in solving this problem fundamentally? We plan to implement it. Can you give us some suggestions?

We look forward to your reply.
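For reference, on a build with the check in place the last statement above fails; as far as I can tell the error raised by the check looks like this (quoting from memory of the patch, so the exact wording may differ):

```
postgres=# update parted set c = c || 'v3';
ERROR:  moving row to another partition during a BEFORE FOR EACH ROW trigger is not supported
```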
Thank you very much,
Regards,
Adger

--
From: Alvaro Herrera
Sent: Monday, January 18, 2021 20:36
To: Ashutosh Bapat
Cc: Ashutosh Bapat; Pg Hackers
Subject: Re: BEFORE ROW triggers for partitioned tables

Thanks for the reviews; I have pushed it now. (I made the doc edits you mentioned too.)

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Re: Re: Re: how to create index concurrently on partitioned table
> Not sure I am following. In the case of REINDEX, it seems to me that
> the calls to validate_index() and index_concurrently_build() can
> happen in a separate transaction for each index, as long as all the
> calls to index_concurrently_swap() are grouped together in the same
> transaction to make sure that index partition trees are switched
> consistently when all entries are swapped from an invalid state to a
> valid state, because the swapping phase is also when we attach a fresh
> index to a partition tree. See also index_concurrently_create_copy()
> where we don't set parentIndexRelid for the lower call to
> index_create(). It would be good of course to check that when
> swapping we have the code to handle that for a lot of indexes at
> once.

Let's look at this example: a partitioned table has five partitions.

parttable: part1, part2, part3, part4, part5

We simply abstract the following definitions:

phase 1: index_create(), the index is only registered in the catalogs.
phase 2: index_concurrently_build(), build the index.
phase 3: validate_index(), insert any missing index entries, mark the index as valid.

(schema 1)
```
StartTransaction one
  parttable phase 1
  part 1    phase 1
  part 2    phase 1
  part 3    phase 1
  part 4    phase 1
  part 5    phase 1
CommitTransaction

StartTransaction two
  parttable phase 2
  part 1    phase 2
  part 2    phase 2
  part 3    phase 2  (error occurred)
  part 4    phase 2
  part 5    phase 2
CommitTransaction

StartTransaction three
  parttable phase 3
  part 1    phase 3
  part 2    phase 3
  part 3    phase 3
  part 4    phase 3
  part 5    phase 3
CommitTransaction
...
```

Now, the subsequent steps cannot continue due to the error in Transaction two. So Transaction two rolls back and Transaction three never starts. All of our indexes are invalid. In this way, we ensure the strong consistency of the indexes in the partition tree. However, we need to rebuild all the indexes when we reindex.
(schema 2)
```
StartTransaction one
  parttable phase 1
  part 1    phase 1
  part 2    phase 1
  part 3    phase 1
  part 4    phase 1
  part 5    phase 1
CommitTransaction

StartTransaction two
  part 1 phase 2
  part 1 phase 3
CommitTransaction

StartTransaction three
  part 2 phase 2
  part 2 phase 3
CommitTransaction

StartTransaction four
  part 3 phase 2  (error occurred)
  part 3 phase 3
CommitTransaction

StartTransaction five
  part 4 phase 2
  part 4 phase 3
CommitTransaction

StartTransaction six
  part 5 phase 2
  part 5 phase 3
CommitTransaction

StartTransaction seven
  parttable phase 2
  parttable phase 3
CommitTransaction
```

Now, the subsequent steps cannot continue due to the error in Transaction four. So Transaction four rolls back, and the transactions after it never start. The indexes of partitions part1 and part2 are available; the other indexes are invalid. In a later REINDEX, we can skip the rebuild of part1 and part2. This seems better, although it seems to be inconsistent.

Do you think that scheme is more suitable for CIC?

Thank you very much,
Regards,
Adger

--
From: Michael Paquier
Sent: Thursday, June 18, 2020 10:41
To: 李杰(慎追)
Cc: Justin Pryzby; pgsql-hackers; 曾文旌(义从); Alvaro Herrera
Subject: Re: Re: Re: Re: how to create index concurrently on partitioned table

On Wed, Jun 17, 2020 at 10:22:28PM +0800, 李杰(慎追) wrote:
> However, I found a problem. If there are many partitions,
> we may need to handle too many missing index entries when
> validate_index(). Especially for the first partition, the time may
> have been long and many entries are missing. In this case, why
> don't we put the second and third phase together into a transaction
> for each partition?

Not sure I am following.
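As an aside, after a failed run under schema 2 the partitions that still need rebuilding can be found with a plain catalog query (this is standard PostgreSQL, not something from the patch):

```
postgres=# select c.relname as index_name
           from pg_index i
           join pg_class c on c.oid = i.indexrelid
           where not i.indisvalid;
```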
In the case of REINDEX, it seems to me that the calls to validate_index() and index_concurrently_build() can happen in a separate transaction for each index, as long as all the calls to index_concurrently_swap() are grouped together in the same transaction to make sure that index partition trees are switched consistently when all entries are swapped from an invalid state to a valid state, because the swapping phase is also when we attach a fresh index to a partition tree. See also index_concurrently_create_copy() where we don't set parentIndexRelid for the lower call to index_create(). It would be good of course to check that when swapping we have the code to handle that for a lot of indexes at once. -- Michael
Re: Re: Re: how to create index concurrently on partitioned table
> We can refer to the implementation in the ReindexRelationConcurrently,
> in the three phases of the REINDEX CONCURRENTLY,
> all indexes of the partitions are operated one by one in each phase.
> In this way, we can maintain the consistency of the entire partitioned
> table index.
> After we implement CIC in this way, we can also complete reindex
> partitioned table index concurrently (this is not done now.)

After careful analysis, I found there are two choices, and I am torn between them. We can handle the entire partition tree with one transaction in each of the three phases of CIC, just like for ordinary tables. However, I found a problem: if there are many partitions, we may need to handle too many missing index entries in validate_index(). Especially for the first partition, a long time may have passed by then and many entries will be missing. In this case, why don't we put the second and third phases together into one transaction per partition?

So, which scheme do you think is better? Maintain consistency in all three phases, or just maintain consistency in the first phase?

Thank you very much,
Regards,
Adger
Re: Re: Re: how to create index concurrently on partitioned table
> I am afraid that this makes the error handling more complicated, with
> risks of having inconsistent partition trees. That's the point you
> raised. This one is going to need more thoughts.

> CIC is an operation that exists while allowing read and writes to
> still happen in parallel, so that's fine IMO if it takes time. Now it
> is true that it may not scale well so we should be careful with the
> approach taken. Also, I think that the case of REINDEX would require
> less work overall because we already have some code in place to gather
> multiple indexes from one or more relations and work on these
> consistently, all at once.

I'm with you on that.

(Scheme 1) We can refer to the implementation in ReindexRelationConcurrently: in the three phases of REINDEX CONCURRENTLY, all indexes of the partitions are operated on one by one in each phase. In this way, we can maintain the consistency of the entire partitioned table's index. After we implement CIC this way, we can also implement REINDEX CONCURRENTLY on a partitioned table's index (this is not done now).

Looking back, let's talk about our original scheme. (Scheme 2) If CIC is performed one by one on each partition, how can we let it continue on the subsequent partitions when an error occurs on the second partition? If that problem can be solved, I can accept this scheme, because a CIC error on one partition is acceptable: you can reindex it later. The pseudo index on the partitioned table is useless for real queries, but the indexes on the partitions are really useful.

Scheme 1 is more elegant and can also implement REINDEX CONCURRENTLY on a partitioned table, but the implementation is more complex. Scheme 2 is simple to implement, but not so friendly.

Hi Justin, which scheme do you think is more helpful to realize CIC?
----------
From: Michael Paquier
Sent: Tuesday, June 16, 2020 09:02
To: 李杰(慎追)
Cc: Justin Pryzby; pgsql-hackers; 曾文旌(义从); Alvaro Herrera
Subject: Re: Re: Re: how to create index concurrently on partitioned table

On Mon, Jun 15, 2020 at 09:33:05PM +0800, 李杰(慎追) wrote:
> This is a good idea.
> We should maintain the consistency of the entire partitioned table.
> However, this is not a small change in the code.
> Maybe we need to make a new design for the DefineIndex function.

Indeed. I have looked at the patch set a bit and here is the related bit for CIC in 0001, meaning that you handle the basic index creation for a partition tree within one transaction for each:

```
+	/*
+	 * CIC needs to mark a partitioned table as VALID, which itself
+	 * requires setting READY, which is unset for CIC (even though
+	 * it's meaningless for an index without storage).
+	 */
+	if (concurrent)
+	{
+		PopActiveSnapshot();
+		CommitTransactionCommand();
+		StartTransactionCommand();
+		index_set_state_flags(indexRelationId, INDEX_CREATE_SET_READY);
+
+		CommandCounterIncrement();
+		index_set_state_flags(indexRelationId, INDEX_CREATE_SET_VALID);
+	}
```

I am afraid that this makes the error handling more complicated, with risks of having inconsistent partition trees. That's the point you raised. This one is going to need more thoughts.

> But most importantly, if we use three steps to implement CIC,
> we will spend more time than on ordinary tables, especially in high
> concurrency cases. While waiting on one partition that users access
> frequently, the creation of the other partitions' indexes is delayed.
> Is it worth doing this?

CIC is an operation that exists while allowing read and writes to still happen in parallel, so that's fine IMO if it takes time. Now it is true that it may not scale well so we should be careful with the approach taken.
Also, I think that the case of REINDEX would require less work overall because we already have some code in place to gather multiple indexes from one or more relations and work on these consistently, all at once. -- Michael
Re: Re: how to create index concurrently on partitioned table
> That's a problem. I really think that we should make the steps of the
> concurrent operation consistent across all relations, meaning that all
> the indexes should be created as invalid for all the parts of the
> partition tree, including partitioned tables as well as their
> partitions, in the same transaction. Then a second new transaction
> gets used for the index build, followed by a third one for the
> validation that switches the indexes to become valid.

This is a good idea. We should maintain the consistency of the entire partitioned table. However, this is not a small change in the code. Maybe we need to make a new design for the DefineIndex function.

But most importantly, if we use three steps to implement CIC, we will spend more time than on ordinary tables, especially in high concurrency cases. While waiting on one partition that users access frequently, the creation of the other partitions' indexes is delayed. Is it worth doing this?

Regards,
Adger

--
From: Michael Paquier
Sent: Monday, June 15, 2020 20:37
To: 李杰(慎追)
Cc: Justin Pryzby; pgsql-hackers; 曾文旌(义从); Alvaro Herrera
Subject: Re: Re: how to create index concurrently on partitioned table

On Mon, Jun 15, 2020 at 08:15:05PM +0800, 李杰(慎追) wrote:
> As shown above, an error occurred while creating an index in the second
> partition.
> It can be clearly seen that the index of the partitioned table is invalid
> and the index of the first partition is normal, the second partition is
> invalid,
> and the third partition index does not exist at all.

That's a problem. I really think that we should make the steps of the concurrent operation consistent across all relations, meaning that all the indexes should be created as invalid for all the parts of the partition tree, including partitioned tables as well as their partitions, in the same transaction. Then a second new transaction gets used for the index build, followed by a third one for the validation that switches the indexes to become valid.

--
Michael
Re: how to create index concurrently on partitioned table
```
postgres=# \d+ prt1_p1
                                 Table "public.prt1_p1"
 Column |       Type        | Collation | Nullable | Default | Storage  | Stats target | Description
--------+-------------------+-----------+----------+---------+----------+--------------+-------------
 a      | integer           |           |          |         | plain    |              |
 b      | integer           |           |          |         | plain    |              |
 c      | character varying |           |          |         | extended |              |
Partition of: prt1 FOR VALUES FROM (0) TO (25)
Partition constraint: ((a IS NOT NULL) AND (a >= 0) AND (a < 25))
Indexes:
    "prt1_p1_a_idx" UNIQUE, btree (a)
Access method: heap

postgres=# \d+ prt1_p2
                                 Table "public.prt1_p2"
 Column |       Type        | Collation | Nullable | Default | Storage  | Stats target | Description
--------+-------------------+-----------+----------+---------+----------+--------------+-------------
 a      | integer           |           |          |         | plain    |              |
 b      | integer           |           |          |         | plain    |              |
 c      | character varying |           |          |         | extended |              |
Partition of: prt1 FOR VALUES FROM (25) TO (50)
Partition constraint: ((a IS NOT NULL) AND (a >= 25) AND (a < 50))
Indexes:
    "prt1_p2_a_idx" UNIQUE, btree (a)
Access method: heap

postgres=# \d+ prt1_p3
                                 Table "public.prt1_p3"
 Column |       Type        | Collation | Nullable | Default | Storage  | Stats target | Description
--------+-------------------+-----------+----------+---------+----------+--------------+-------------
 a      | integer           |           |          |         | plain    |              |
 b      | integer           |           |          |         | plain    |              |
 c      | character varying |           |          |         | extended |              |
Partition of: prt1 FOR VALUES FROM (50) TO (75)
Partition constraint: ((a IS NOT NULL) AND (a >= 50) AND (a < 75))
Access method: heap
```

Now we can see that the first two partitions have indexes, but the third partition has no index due to the error. Therefore, the outcome of our first case, where the third partition has no index, is not what we expected. That is to say, when our CIC goes wrong, it should either roll back everything or keep going, instead of stopping in the middle. This is my shallow opinion, please take it as a reference.

Thank you very much,
Regards,
Adger

----------
From: Justin Pryzby
Sent: Saturday, June 13, 2020 02:15
To: Michael Paquier; 李杰(慎追)
Cc: pgsql-hackers; 曾文旌(义从); Alvaro Herrera
Subject: Re: how to create index concurrently on partitioned table

On Fri, Jun 12, 2020 at 04:17:34PM +0800, 李杰(慎追) wrote:
> As we all know, CIC has three transactions. If we recurse into n
> partitions, it will become 3n transactions.
> If an error occurs in these transactions, we have too many things to
> deal with...
>
> If an error occurs when an index is created in one of the partitions,
> what should we do with our new index?

My (tentative) understanding is that these types of things should use a "subtransaction" internally, so if the toplevel transaction rolls back, its changes are lost. In some cases, it might be desirable to not roll back, in which case the user (client) should first create indexes (concurrently if needed) on every child, and then later create the index on the parent (that has the advantage of working on older servers, too).

```
postgres=# SET client_min_messages=debug;
postgres=# CREATE INDEX ON t(i);
DEBUG:  building index "t1_i_idx" on table "t1" with request for 1 parallel worker
DEBUG:  index "t1_i_idx" can safely use deduplication
DEBUG:  creating and filling new WAL file
DEBUG:  done creating and filling new WAL file
DEBUG:  creating and filling new WAL file
DEBUG:  done creating and filling new WAL file
DEBUG:  building index "t2_i_idx" on table "t2" with request for 1 parallel worker
^C2020-06-12 13:08:17.001 CDT [19291] ERROR:  canceling statement due to user request
2020-06-12 13:08:17.001 CDT [19291] STATEMENT:  CREATE INDEX ON t(i);
2020-06-12 13:08:17.001 CDT [27410] FATAL:  terminating connection due to administrator command
2020-06-12 13:08:17.001 CDT [27410] STATEMENT:  CREATE INDEX ON t(i);
Cancel request sent
```

If the index creation is interrupted at this point, no indexes will exist.

On Fri, Jun 12, 2020 at 04:06:28PM +0800, 李杰(慎追) wrote:
> >On Sat, Jun 06, 2020 at 09:23:32AM -0500, Justin Pryzby wrote:
> > I looked at CIC now and came up with the attached. All that's needed
> > to allow this case is to close the relation before recursing to
> > partitions - it needs to be closed before calling
> > CommitTransactionCommand(). There's probably a better way to write
> > this, but I can't see that there's anything complicated about
> > handling partitioned tables.
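The manual per-child workaround described here can be sketched like this (table and index names are illustrative; attaching the last child index marks the parent index valid automatically):

```
-- 1. Build each child's index concurrently.
CREATE INDEX CONCURRENTLY t1_i_idx ON t1 (i);
CREATE INDEX CONCURRENTLY t2_i_idx ON t2 (i);
-- 2. Create the parent index shell without recursing to the children.
CREATE INDEX t_i_idx ON ONLY t (i);
-- 3. Attach each child index; the parent index becomes valid
--    once every partition has an attached index.
ALTER INDEX t_i_idx ATTACH PARTITION t1_i_idx;
ALTER INDEX t_i_idx ATTACH PARTITION t2_i_idx;
```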
> I'm so sorry about getting back late.
> Thank you very much for helping me consider this issue.
> I compiled the patch v1 you provided, and applied patch v2-001 to
> postgresql again.
> I got a coredump that was easy to reproduce. As follows:
> I have been trying to get familiar with the source code of create index.
> Can you solve this bug first? I will try my best to implement CIC with you.
> Next, I will read your patch
Re: how to create index concurrently on partitioned table
Hi Justin,

> Maybe I'm wrong, but I don't think there's any known difficulty - just
> that nobody did it yet. You should pay attention to what happens on
> error, but hopefully you wouldn't need to add much code and can rely on
> existing code paths to handle that right.

Yes, I am trying to learn the code of index definition.

> I think you'd look at the commits and code implementing indexes on
> partitioned tables and CREATE INDEX CONCURRENTLY. And maybe any
> following commits with fixes.
> You'd first loop around all children (recursively if there are
> partitions which are themselves partitioned) and create indexes
> concurrently.

As we all know, CIC has three transactions. If we recurse into n partitions, it will become 3n transactions. If an error occurs in these transactions, we have too many things to deal with... If an error occurs when an index is created in one of the partitions, what should we do with our new index?

Thank you very much,
Regards,
Adger
Re: how to create index concurrently on partitioned table
>On Sat, Jun 06, 2020 at 09:23:32AM -0500, Justin Pryzby wrote:
>> On Wed, Jun 03, 2020 at 08:22:29PM +0800, 李杰(慎追) wrote:
>>> Partitioning is necessary for very large tables.
>>> However, I found that postgresql does not support create index
>>> concurrently on partitioned tables.
>>> The documentation shows that we need to create an index on each
>>> partition individually and then finally create the partitioned index
>>> non-concurrently.
>>> This is undoubtedly a complex operation for a DBA, especially when
>>> there are many partitions.
>>>
>>> Therefore, I wonder why pg does not support concurrent index creation
>>> on partitioned tables?
>>> What are the difficulties of this feature?
>>> If I want to implement it, what should I pay attention to?
>>
>> Maybe I'm wrong, but I don't think there's any known difficulty - just
>> that nobody did it yet.
>
> I said that but I was actually thinking about the code for "REINDEX
> CONCURRENTLY" (which should also handle partitioned tables).
> I looked at CIC now and came up with the attached. All that's needed to
> allow this case is to close the relation before recursing to partitions
> - it needs to be closed before calling CommitTransactionCommand().
> There's probably a better way to write this, but I can't see that
> there's anything complicated about handling partitioned tables.

Hi, Justin Pryzby

I'm so sorry about getting back late. Thank you very much for helping me consider this issue. I compiled the patch v1 you provided, and applied patch v2-001 to postgresql again. I got a coredump that is easy to reproduce.
As follows:

```
#0  PopActiveSnapshot () at snapmgr.c:822
#1  0x005ca687 in DefineIndex (relationId=relationId@entry=16400, stmt=stmt@entry=0x1aa5e28, indexRelationId=16408, indexRelationId@entry=0, parentIndexId=parentIndexId@entry=16406, parentConstraintId=0, is_alter_table=is_alter_table@entry=false, check_rights=true, check_not_in_use=true, skip_build=false, quiet=false) at indexcmds.c:1426
#2  0x005ca5ab in DefineIndex (relationId=relationId@entry=16384, stmt=stmt@entry=0x1b35278, indexRelationId=16406, indexRelationId@entry=0, parentIndexId=parentIndexId@entry=0, parentConstraintId=parentConstraintId@entry=0, is_alter_table=is_alter_table@entry=false, check_rights=true, check_not_in_use=true, skip_build=false, quiet=false) at indexcmds.c:1329
#3  0x0076bf80 in ProcessUtilitySlow (pstate=pstate@entry=0x1b350c8, pstmt=pstmt@entry=0x1a2bf40, queryString=queryString@entry=0x1a2b2c8 "create index CONCURRENTLY idxpart_a_idx on idxpart (a);", context=context@entry=PROCESS_UTILITY_TOPLEVEL, params=params@entry=0x0, queryEnv=queryEnv@entry=0x0, qc=0x7ffc86cc7630, dest=0x1a2c200) at utility.c:1474
#4  0x0076afeb in standard_ProcessUtility (pstmt=0x1a2bf40, queryString=0x1a2b2c8 "create index CONCURRENTLY idxpart_a_idx on idxpart (a);", context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0x1a2c200, qc=0x7ffc86cc7630) at utility.c:1069
#5  0x00768992 in PortalRunUtility (portal=0x1a8d1f8, pstmt=0x1a2bf40, isTopLevel=, setHoldSnapshot=, dest=, qc=0x7ffc86cc7630) at pquery.c:1157
#6  0x007693f3 in PortalRunMulti (portal=portal@entry=0x1a8d1f8, isTopLevel=isTopLevel@entry=true, setHoldSnapshot=setHoldSnapshot@entry=false, dest=dest@entry=0x1a2c200, altdest=altdest@entry=0x1a2c200, qc=qc@entry=0x7ffc86cc7630) at pquery.c:1310
#7  0x00769ed3 in PortalRun (portal=portal@entry=0x1a8d1f8, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true, run_once=run_once@entry=true, dest=dest@entry=0x1a2c200, altdest=altdest@entry=0x1a2c200, qc=0x7ffc86cc7630) at pquery.c:779
#8  0x00765b06 in exec_simple_query (query_string=0x1a2b2c8 "create index CONCURRENTLY idxpart_a_idx on idxpart (a);") at postgres.c:1239
#9  0x00767de5 in PostgresMain (argc=, argv=argv@entry=0x1a552c8, dbname=, username=) at postgres.c:4315
#10 0x006f2b23 in BackendRun (port=0x1a4d1e0, port=0x1a4d1e0) at postmaster.c:4523
#11 BackendStartup (port=0x1a4d1e0) at postmaster.c:4215
#12 ServerLoop () at postmaster.c:1727
#13 0x006f3a1f in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0x1a25ea0) at postmaster.c:1400
#14 0x004857f9 in main (argc=3, argv=0x1a25ea0) at main.c:210
```

You can reproduce it like this:

```
create table idxpart (a int, b int, c text) partition by range (a);
create table idxpart1 partition of idxpart for values from (0) to (10);
create table idxpart2 partition of idxpart for values from (10) to (20);
create index CONCURRENTLY idxpart_a_idx on idxpart (a);
```

I have been trying to get familiar with the source code of create index. Can you solve this bug first? I will try my best to implement CIC with you. Next, I will read your patches v2-002 and v2-003.

Thank you very much,
Regards,
Adger
how to create index concurrently on partitioned table
Hi hackers,

Partitioning is necessary for very large tables. However, I found that postgresql does not support create index concurrently on partitioned tables. The documentation shows that we need to create an index on each partition individually and then finally create the partitioned index non-concurrently. This is undoubtedly a complex operation for a DBA, especially when there are many partitions.

Therefore, I wonder why pg does not support concurrent index creation on partitioned tables? What are the difficulties of this feature? If I want to implement it, what should I pay attention to?

Sincerely look forward to your reply.

Regards & Thanks
Adger
Re: Re: Re: Bug about drop index concurrently
e, and the most unbearable is that it will bring delay to the log apply on the standby, resulting in inconsistent data between the master and the standby. All in all, I think this bug is a flaw in the postgres design. We need to think carefully about how to handle it better. We can even learn from other database products. I hope I can help you. Thank you very much for your attention.

Regards,
adger.

--
From: Tomas Vondra
Sent: Thursday, October 24, 2019 06:04
To: 李杰(慎追)
Cc: pgsql-hackers
Subject: Re: Re: Re: Bug about drop index concurrently

On Wed, Oct 23, 2019 at 02:38:45PM +0800, 李杰(慎追) wrote:
>>
>>I'm a bit confused. You shouldn't see any crashes and/or core files in
>>this scenario, for two reasons. Firstly, I assume you're running a
>>regular build without asserts. Secondly, I had to add an extra assert
>>to trigger the failure. So what core are you talking about?
>>
>Sorry, I should explain it more clearly. I saw the core file because I
>modified the postgres source code and added an Assert to it.
>
OK

>>Also, it's not clear to me what do you mean by "bug in the standby" or
>>no lock in the drop index concurrently. Can you explain?
>>
>"bug in the standby" means that we built a master-standby setup; when
>we executed a large number of queries on the standby while executing
>'drop index concurrently' on the master, we got an error on the standby.
>Although it does not happen 100% of the time, it does appear. "no lock
>in the drop index concurrently" means: I think this happens because the
>locks taken when executing 'drop index concurrently' are not strong
>enough.
>
OK, thanks for the clarification. Yes, it won't appear every time, it's likely timing-sensitive (I'll explain in a minute).

>>Hmm, so you observe the issue with regular queries, not just EXPLAIN
>>ANALYZE?
>
>yeah, we have seen this error frequently.
>
That suggests you're doing a lot of 'drop index concurrently', right?
>>>Of course, we considered applying the method of waiting to detect the
>>>query lock on the master to the standby, but worried about affecting
>>>the standby application log delay, so we gave up that.
>>>
>>I don't understand? What method?
>>
>To find out the cause of this problem, I also executed 'drop index
>concurrently' and 'explain select * from xxx' on the master, but the
>bug did not appear as expected. So I went to analyze the source code. I
>found that there is a mechanism on the master such that when 'drop
>index concurrently' is executed, it waits until every transaction that
>saw the old index state has finished. The source code is as follows:
>
>WaitForLockers(heaplocktag, AccessExclusiveLock);
>
>Therefore, I think that if this mechanism were also available on the
>standby, the error would not appear. But I worried about affecting the
>standby's log application delay, so we gave up on that.
>

Yes, but we can't really do that, I'm afraid. We certainly can't do that on the master, because we simply don't have the necessary information about locks from the standby, and we really don't want to have it, because with a busy standby that might be quite a bit of data (plus the standby would have to wait for the master to confirm each lock acquisition, which, I think, seems pretty terrible).

On the standby, we don't really have any idea that there's a drop index running - we only get information about AE locks, and a bunch of catalog updates. I don't think we have a way to determine this is a drop index in concurrent mode :-(

More precisely, the master sends information about each AccessExclusiveLock in an XLOG_STANDBY_LOCK wal record (in the xl_standby_locks struct). And when the standby replays that, it should acquire those locks. For regular DROP INDEX we send this:

rmgr: Standby ... desc: LOCK xid db 16384 rel 16385
rmgr: Standby ... desc: LOCK xid db 16384 rel 20573
... catalog changes ...
rmgr: Transaction ... desc: COMMIT 2019-10-23 22:42:27.950995 CEST; rels: base/16384/20573; inval msgs: catcache 32 catcache 7 catcache 6 catcache 50 catcache 49 relcache 20573 relcache 16385 snapshot 2608

while for DROP INDEX CONCURRENTLY we send this:

rmgr: Heap ... desc: INPLACE ... catalog update
rmgr: Standby ... desc: INVALIDATIONS ; inval msgs: catcache 32 relcache 21288 relcache 16385
rmgr: Heap ... desc: INPLACE ... catalog update
rmgr: Standby ... desc: INVALIDATIONS ; inval msgs: catcache 32 relcache 21288 relcache 16385
rmgr: Standby ... desc: LOCK xid 10326 db 16384 rel 21288
... catalog updates ...
rmgr: Transaction ... desc: COMMIT 2019-10-23 23:47:10.04256
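The difference between the two record sequences above can be modeled with a toy sketch (my own simplification in Python, not PostgreSQL code). A reader on the standby holds AccessShareLock, which conflicts only with the AccessExclusiveLock carried by a "LOCK" record, so replay is only safe for readers when such a record is replayed before the invalidations that make the index unusable:

```python
# Toy model of the two WAL sequences quoted above (a simplification,
# not PostgreSQL code). Replay is "safe" for a concurrent reader only
# if an AccessExclusiveLock record arrives before any invalidation,
# because only that lock blocks a running SELECT on the standby.

def safe_for_readers(wal_records):
    """True if readers are locked out before the index is invalidated."""
    for rec in wal_records:
        if rec == "LOCK":
            return True   # readers blocked from here on
        if rec == "INVALIDATION":
            return False  # a reader may still be planning with the index
    return False

# Regular DROP INDEX: the LOCK record comes first.
regular_drop = ["LOCK", "CATALOG_UPDATE", "COMMIT_WITH_INVAL"]

# DROP INDEX CONCURRENTLY: invalidations arrive before the LOCK record.
concurrent_drop = ["INPLACE_UPDATE", "INVALIDATION",
                   "INPLACE_UPDATE", "INVALIDATION",
                   "LOCK", "CATALOG_UPDATE", "COMMIT"]

print(safe_for_readers(regular_drop))     # True
print(safe_for_readers(concurrent_drop))  # False: the window for the error
```

This is only a schematic of the ordering problem; the real replay logic involves catalog snapshots and relcache invalidation, but the ordering of the records is what opens the window on the standby.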
Re: Bug about drop index concurrently
Ha, this is great. I am ashamed of my English; my earlier mail did not express this clearly. Now I am very glad that you understand it, and I sincerely hope that I can help. I am also a Postgres fan, a freshly graduated student.

We have both confirmed that this bug appears only on the standby and not on the master, but it does affect the use of the standby. For this bug, I proposed two options: one is to disable this feature (drop index concurrently), the other is to make the standby wait for running queries the way the master does, though that may delay the application of the log. Since this bug appears on the standby, I think both methods have advantages and disadvantages, so I hope the discussion here helps.

Sincerely
adger

--
From: Tomas Vondra
Sent: Wednesday, October 23, 2019, 06:49
To: 李杰(慎追)
Cc: pgsql-hackers
Subject: Re: Bug about drop index concurrently

On Fri, Oct 18, 2019 at 05:00:54PM +0200, Tomas Vondra wrote:
>Hi,
>
>I can trivially reproduce this - it's enough to create a master-standby
>setup, and then do this on the master
>
> CREATE TABLE t (a int, b int);
> INSERT INTO t SELECT i, i FROM generate_series(1,1) s(i);
>
>and run pgbench with this script
>
> DROP INDEX CONCURRENTLY IF EXISTS t_a_idx;
> CREATE INDEX t_a_idx ON t(a);
>
>while on the standby there's another pgbench running this script
>
> EXPLAIN ANALYZE SELECT * FROM t WHERE a = 1;
>
>and it fails pretty fast for me. With an extra assert(false) added to
>src/backend/access/common/relation.c I get a backtrace like this:
>
> Program terminated with signal SIGABRT, Aborted.
> #0  0x7c32e458fe35 in raise () from /lib64/libc.so.6
> Missing separate debuginfos, use: dnf debuginfo-install glibc-2.29-22.fc30.x86_64
> (gdb) bt
> #0  0x7c32e458fe35 in raise () from /lib64/libc.so.6
> #1  0x7c32e457a895 in abort () from /lib64/libc.so.6
> #2  0x00a58579 in ExceptionalCondition (conditionName=0xacd9bc "!(0)", errorType=0xacd95b "FailedAssertion", fileName=0xacd950 "relation.c", lineNumber=64) at assert.c:54
> #3  0x0048d1bd in relation_open (relationId=38216, lockmode=1) at relation.c:64
> #4  0x005082e4 in index_open (relationId=38216, lockmode=1) at indexam.c:130
> #5  0x0080ac3f in get_relation_info (root=0x21698b8, relationObjectId=16385, inhparent=false, rel=0x220ce60) at plancat.c:196
> #6  0x008118c6 in build_simple_rel (root=0x21698b8, relid=1, parent=0x0) at relnode.c:292
> #7  0x007d485d in add_base_rels_to_query (root=0x21698b8, jtnode=0x2169478) at initsplan.c:114
> #8  0x007d48a3 in add_base_rels_to_query (root=0x21698b8, jtnode=0x21693e0) at initsplan.c:122
> #9  0x007d8fad in query_planner (root=0x21698b8, qp_callback=0x7ded97, qp_extra=0x7fffa4834f10) at planmain.c:168
> #10 0x007dc316 in grouping_planner (root=0x21698b8, inheritance_update=false, tuple_fraction=0) at planner.c:2048
> #11 0x007da7ca in subquery_planner (glob=0x220d078, parse=0x2168f78, parent_root=0x0, hasRecursion=false, tuple_fraction=0) at planner.c:1012
> #12 0x007d942c in standard_planner (parse=0x2168f78, cursorOptions=256, boundParams=0x0) at planner.c:406
> #13 0x007d91e8 in planner (parse=0x2168f78, cursorOptions=256, boundParams=0x0) at planner.c:275
> #14 0x008e1b0d in pg_plan_query (querytree=0x2168f78, cursorOptions=256, boundParams=0x0) at postgres.c:878
> #15 0x00658683 in ExplainOneQuery (query=0x2168f78, cursorOptions=256, into=0x0, es=0x220cd90, queryString=0x21407b8 "explain analyze select * from t where a = 1;", params=0x0, queryEnv=0x0) at explain.c:367
> #16 0x00658386 in ExplainQuery (pstate=0x220cc28, stmt=0x2141728, queryString=0x21407b8 "explain analyze select * from t where a = 1;", params=0x0, queryEnv=0x0, dest=0x220cb90) at explain.c:255
> #17 0x008ea218 in standard_ProcessUtility (pstmt=0x21425c0, queryString=0x21407b8 "explain analyze select * from t where a = 1;", context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0x220cb90, completionTag=0x7fffa48355c0 "") at utility.c:675
> #18 0x008e9a45 in ProcessUtility (pstmt=0x21425c0, queryString=0x21407b8 "explain analyze select * from t where a = 1;", context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0x220cb90, completionTag=0x7fffa48355c0 "") at utility.c:360
> #19 0x008e8a0c in PortalRunUtility
Re: Re: Bug about drop index concurrently
>
>I'm a bit confused. You shouldn't see any crashes and/or core files in
>this scenario, for two reasons. Firstly, I assume you're running a
>regular build without asserts. Secondly, I had to add an extra assert to
>trigger the failure. So what core are you talking about?
>

Sorry, I should explain it more clearly. I saw the core file because I modified the postgres source code and added Assert to it.

>
>Also, it's not clear to me what you mean by "bug in the standby" or
>no lock in the drop index concurrently. Can you explain?
>

"Bug in the standby" means that we built a master-standby setup; while we executed a large number of queries on the standby, we executed 'drop index concurrently' on the master, and got an error on the standby. Although it does not happen 100% of the time, it does appear. "No lock in the drop index concurrently" means: I think this happens because 'drop index concurrently' does not take strong enough locks.

>Hmm, so you observe the issue with regular queries, not just EXPLAIN
>ANALYZE?

yeah, we have seen this error frequently.

>>Of course, we considered applying the method of waiting to detect the
>>query lock on the master to the standby, but worried about affecting
>>the standby application log delay, so we gave up that.
>>
>I don't understand? What method?
>

To find out the cause of this problem, I also executed 'drop index concurrently' and 'explain select * from xxx' on the master, but the bug did not appear as expected. So I went to analyze the source code. I found that there is a mechanism on the master such that when 'drop index concurrently' is executed, it waits until every transaction that saw the old index state has finished. The source code is as follows:

WaitForLockers(heaplocktag, AccessExclusiveLock);

Therefore, I think that if this mechanism were also available on the standby, the error would not appear. But I worried about affecting the standby's log application delay, so we gave up on that.
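The master-side waiting mechanism described above can be sketched in a few lines of Python (my own simplification with a hypothetical Transaction class, not the real WaitForLockers implementation): since every table-level lock mode conflicts with AccessExclusiveLock, the drop simply collects every transaction currently holding any lock on the table and waits for all of them to finish.

```python
# Toy sketch of the master-side WaitForLockers idea (a simplification;
# Transaction and its fields are hypothetical, not PostgreSQL types).

class Transaction:
    def __init__(self, xid, locks):
        self.xid = xid
        self.locks = locks     # set of (relation, lockmode) pairs
        self.finished = False

def lockers_to_wait_for(transactions, relation):
    """Every lock mode conflicts with AccessExclusiveLock, so any live
    transaction holding any lock on the relation must be waited for."""
    return [t for t in transactions
            if not t.finished and any(rel == relation for rel, _ in t.locks)]

# A SELECT on table "t" and an unrelated writer on another table.
txns = [Transaction(100, {("t", "AccessShareLock")}),
        Transaction(101, {("other", "RowExclusiveLock")})]

waiters = lockers_to_wait_for(txns, "t")
print([t.xid for t in waiters])   # [100]: only the reader of "t"
```

The point of the thread is that no equivalent of this scan exists during WAL replay on the standby: the startup process has no list of standby transactions that saw the old index state.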
--
From: Tomas Vondra
Sent: Wednesday, October 23, 2019, 01:47
To: 李杰(慎追)
Cc: pgsql-hackers
Subject: Re: Re: Bug about drop index concurrently

On Mon, Oct 21, 2019 at 10:36:04AM +0800, 李杰(慎追) wrote:
>Thanks for the quick reply. And sorry I haven't got back to you sooner.
>
>I have seen this backtrace in the core file, and I also looked at the
>bug in the standby because there is no lock in the drop index
>concurrently.
>

I'm a bit confused. You shouldn't see any crashes and/or core files in this scenario, for two reasons. Firstly, I assume you're running a regular build without asserts. Secondly, I had to add an extra assert to trigger the failure. So what core are you talking about?

Also, it's not clear to me what you mean by "bug in the standby" or no lock in the drop index concurrently. Can you explain?

>However, when our business will perform a large number of queries in
>the standby, this problem will occur more frequently. So we are trying
>to solve this problem, and the solution we are currently dealing with
>is to ban it.
>

Hmm, so you observe the issue with regular queries, not just EXPLAIN ANALYZE?

>Of course, we considered applying the method of waiting to detect the
>query lock on the master to the standby, but worried about affecting
>the standby application log delay, so we gave up that.
>

I don't understand? What method?

>If you have a better solution in the future, please push it to the new
>version, or email it, thank you very much.
>

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Bug about drop index concurrently
Hi all,

I am sorry to bother you again. I would like to ask about the drop index bug I raised earlier: do you plan to deal with it in the future? Will the feature be disabled, or will a fix be proposed? What is your next plan? I sincerely look forward to your reply, and thanks.

adger.

--
From: Tomas Vondra
Sent: Saturday, October 19, 2019, 02:00
To: 李杰(慎追)
Cc: pgsql-hackers
Subject: Re: Bug about drop index concurrently

Hi,

I can trivially reproduce this - it's enough to create a master-standby setup, and then do this on the master

  CREATE TABLE t (a int, b int);
  INSERT INTO t SELECT i, i FROM generate_series(1,1) s(i);

and run pgbench with this script

  DROP INDEX CONCURRENTLY IF EXISTS t_a_idx;
  CREATE INDEX t_a_idx ON t(a);

while on the standby there's another pgbench running this script

  EXPLAIN ANALYZE SELECT * FROM t WHERE a = 1;

and it fails pretty fast for me. With an extra assert(false) added to src/backend/access/common/relation.c I get a backtrace like this:

Program terminated with signal SIGABRT, Aborted.
#0  0x7c32e458fe35 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: dnf debuginfo-install glibc-2.29-22.fc30.x86_64
(gdb) bt
#0  0x7c32e458fe35 in raise () from /lib64/libc.so.6
#1  0x7c32e457a895 in abort () from /lib64/libc.so.6
#2  0x00a58579 in ExceptionalCondition (conditionName=0xacd9bc "!(0)", errorType=0xacd95b "FailedAssertion", fileName=0xacd950 "relation.c", lineNumber=64) at assert.c:54
#3  0x0048d1bd in relation_open (relationId=38216, lockmode=1) at relation.c:64
#4  0x005082e4 in index_open (relationId=38216, lockmode=1) at indexam.c:130
#5  0x0080ac3f in get_relation_info (root=0x21698b8, relationObjectId=16385, inhparent=false, rel=0x220ce60) at plancat.c:196
#6  0x008118c6 in build_simple_rel (root=0x21698b8, relid=1, parent=0x0) at relnode.c:292
#7  0x007d485d in add_base_rels_to_query (root=0x21698b8, jtnode=0x2169478) at initsplan.c:114
#8  0x007d48a3 in add_base_rels_to_query (root=0x21698b8, jtnode=0x21693e0) at initsplan.c:122
#9  0x007d8fad in query_planner (root=0x21698b8, qp_callback=0x7ded97, qp_extra=0x7fffa4834f10) at planmain.c:168
#10 0x007dc316 in grouping_planner (root=0x21698b8, inheritance_update=false, tuple_fraction=0) at planner.c:2048
#11 0x007da7ca in subquery_planner (glob=0x220d078, parse=0x2168f78, parent_root=0x0, hasRecursion=false, tuple_fraction=0) at planner.c:1012
#12 0x007d942c in standard_planner (parse=0x2168f78, cursorOptions=256, boundParams=0x0) at planner.c:406
#13 0x007d91e8 in planner (parse=0x2168f78, cursorOptions=256, boundParams=0x0) at planner.c:275
#14 0x008e1b0d in pg_plan_query (querytree=0x2168f78, cursorOptions=256, boundParams=0x0) at postgres.c:878
#15 0x00658683 in ExplainOneQuery (query=0x2168f78, cursorOptions=256, into=0x0, es=0x220cd90, queryString=0x21407b8 "explain analyze select * from t where a = 1;", params=0x0, queryEnv=0x0) at explain.c:367
#16 0x00658386 in ExplainQuery (pstate=0x220cc28, stmt=0x2141728, queryString=0x21407b8 "explain analyze select * from t where a = 1;", params=0x0, queryEnv=0x0, dest=0x220cb90) at explain.c:255
#17 0x008ea218 in standard_ProcessUtility (pstmt=0x21425c0, queryString=0x21407b8 "explain analyze select * from t where a = 1;", context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0x220cb90, completionTag=0x7fffa48355c0 "") at utility.c:675
#18 0x008e9a45 in ProcessUtility (pstmt=0x21425c0, queryString=0x21407b8 "explain analyze select * from t where a = 1;", context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0x220cb90, completionTag=0x7fffa48355c0 "") at utility.c:360
#19 0x008e8a0c in PortalRunUtility (portal=0x219c278, pstmt=0x21425c0, isTopLevel=true, setHoldSnapshot=true, dest=0x220cb90, completionTag=0x7fffa48355c0 "") at pquery.c:1175
#20 0x008e871a in FillPortalStore (portal=0x219c278, isTopLevel=true) at pquery.c:1035
#21 0x008e8075 in PortalRun (portal=0x219c278, count=9223372036854775807, isTopLevel=true, run_once=true, dest=0x21efb90, altdest=0x21efb90, completionTag=0x7fffa48357b0 "") at pquery.c:765
#22 0x008e207c in exec_simple_query (query_string=0x21407b8 "explain analyze select * from t where a = 1;") at postgres.c:1215
#23 0x008e636e in PostgresMain (argc=1, argv=0x216c600, dbname=0x216c4e0 "test", username=0x213c3f8 "user") at postgres.c:4236
#24 0x0083c71e in BackendRun (port=0x2165850) at postmaster.c:4437
#2
Re: Bug about drop index concurrently
Thanks for the quick reply. And sorry I haven't got back to you sooner.

I have seen this backtrace in the core file, and I also looked at the bug on the standby, because there is no lock in the drop index concurrently. However, when our workload performs a large number of queries on the standby, this problem occurs more frequently. So we are trying to solve this problem, and the solution we are currently considering is to disable the feature. Of course, we considered applying the master's method of waiting to detect query locks to the standby, but worried about affecting the standby's log application delay, so we gave up on that. If you have a better solution in the future, please push it to a new version, or email it; thank you very much.

regards.
adger.

--
From: Tomas Vondra
Sent: Saturday, October 19, 2019, 02:00
To: 李杰(慎追)
Cc: pgsql-hackers
Subject: Re: Bug about drop index concurrently

Hi,

I can trivially reproduce this - it's enough to create a master-standby setup, and then do this on the master

  CREATE TABLE t (a int, b int);
  INSERT INTO t SELECT i, i FROM generate_series(1,1) s(i);

and run pgbench with this script

  DROP INDEX CONCURRENTLY IF EXISTS t_a_idx;
  CREATE INDEX t_a_idx ON t(a);

while on the standby there's another pgbench running this script

  EXPLAIN ANALYZE SELECT * FROM t WHERE a = 1;

and it fails pretty fast for me. With an extra assert(false) added to src/backend/access/common/relation.c I get a backtrace like this:

Program terminated with signal SIGABRT, Aborted.
#0  0x7c32e458fe35 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: dnf debuginfo-install glibc-2.29-22.fc30.x86_64
(gdb) bt
#0  0x7c32e458fe35 in raise () from /lib64/libc.so.6
#1  0x7c32e457a895 in abort () from /lib64/libc.so.6
#2  0x00a58579 in ExceptionalCondition (conditionName=0xacd9bc "!(0)", errorType=0xacd95b "FailedAssertion", fileName=0xacd950 "relation.c", lineNumber=64) at assert.c:54
#3  0x0048d1bd in relation_open (relationId=38216, lockmode=1) at relation.c:64
#4  0x005082e4 in index_open (relationId=38216, lockmode=1) at indexam.c:130
#5  0x0080ac3f in get_relation_info (root=0x21698b8, relationObjectId=16385, inhparent=false, rel=0x220ce60) at plancat.c:196
#6  0x008118c6 in build_simple_rel (root=0x21698b8, relid=1, parent=0x0) at relnode.c:292
#7  0x007d485d in add_base_rels_to_query (root=0x21698b8, jtnode=0x2169478) at initsplan.c:114
#8  0x007d48a3 in add_base_rels_to_query (root=0x21698b8, jtnode=0x21693e0) at initsplan.c:122
#9  0x007d8fad in query_planner (root=0x21698b8, qp_callback=0x7ded97, qp_extra=0x7fffa4834f10) at planmain.c:168
#10 0x007dc316 in grouping_planner (root=0x21698b8, inheritance_update=false, tuple_fraction=0) at planner.c:2048
#11 0x007da7ca in subquery_planner (glob=0x220d078, parse=0x2168f78, parent_root=0x0, hasRecursion=false, tuple_fraction=0) at planner.c:1012
#12 0x007d942c in standard_planner (parse=0x2168f78, cursorOptions=256, boundParams=0x0) at planner.c:406
#13 0x007d91e8 in planner (parse=0x2168f78, cursorOptions=256, boundParams=0x0) at planner.c:275
#14 0x008e1b0d in pg_plan_query (querytree=0x2168f78, cursorOptions=256, boundParams=0x0) at postgres.c:878
#15 0x00658683 in ExplainOneQuery (query=0x2168f78, cursorOptions=256, into=0x0, es=0x220cd90, queryString=0x21407b8 "explain analyze select * from t where a = 1;", params=0x0, queryEnv=0x0) at explain.c:367
#16 0x00658386 in ExplainQuery (pstate=0x220cc28, stmt=0x2141728, queryString=0x21407b8 "explain analyze select * from t where a = 1;", params=0x0, queryEnv=0x0, dest=0x220cb90) at explain.c:255
#17 0x008ea218 in standard_ProcessUtility (pstmt=0x21425c0, queryString=0x21407b8 "explain analyze select * from t where a = 1;", context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0x220cb90, completionTag=0x7fffa48355c0 "") at utility.c:675
#18 0x008e9a45 in ProcessUtility (pstmt=0x21425c0, queryString=0x21407b8 "explain analyze select * from t where a = 1;", context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0x220cb90, completionTag=0x7fffa48355c0 "") at utility.c:360
#19 0x008e8a0c in PortalRunUtility (portal=0x219c278, pstmt=0x21425c0, isTopLevel=true, setHoldSnapshot=true, dest=0x220cb90, completionTag=0x7fffa48355c0 "") at pquery.c:1175
#20 0x008e871a in FillPortalStore (portal=0x219c278, isTopLevel=true) at pquery.c:1035
#21 0x008e8075 in PortalRun (portal=0x219c278, count=9223372036854775807, isTopLevel=true, run_once=true, dest
Bug about drop index concurrently
Hi hackers,

Recently, I discovered a Postgres bug, and I hope I can ask you for the best solution. The problem is as follows:

postgres=# explain analyze select * from xxx where a=500;
ERROR: could not open relation with OID 25989

The structure of my table is as follows:

postgres=# \d xxx
              Table "public.xxx"
 Column |  Type   | Collation | Nullable | Default
--------+---------+-----------+----------+---------
 a      | integer |           |          |
 b      | text    |           |          |

postgres=# select count(*) from xxx;
 count
-------
    80
(1 row)

postgres=# select * from xxx limit 3;
 a |                b
---+----------------------------------
 1 | 203c51477570aa517cfa317155dcc52c
 2 | b58da31baa5c78fee4642fd398bd5909
 3 | c7c475bf0a3ca2fc2afc4812a4d44c58

I opened the log file and saw that the index of table xxx had been deleted:

postgres=# drop index CONCURRENTLY idx_xxx ;
DROP INDEX

In order to reproduce this bug, I created and deleted the index again and again on the master. What is hard to understand is that the bug cannot be reproduced 100% of the time. I wrote a script that loops on the master, running the following two statements:

postgres=# create index idx_xxx on xxx (a);
postgres=# drop index CONCURRENTLY idx_xxx ;
postgres=# create index idx_xxx on xxx (a);
postgres=# drop index CONCURRENTLY idx_xxx ;
... ... ...

At the same time, I started two clients on the standby, each executing the following SQL on table xxx:

postgres=# explain analyze select * from xxx where a=500;
postgres=# \watch 0.1

After a few minutes, the bug appeared. This confirmed my guess: the query on the standby was using an index scan while the index was simultaneously being deleted on the master. Curious, I went to read the Postgres source code. I found that a regular DROP INDEX takes an AccessExclusiveLock on the table, while DROP INDEX CONCURRENTLY takes only a ShareUpdateExclusiveLock. As we know, AccessShareLock (the lock a SELECT takes) conflicts only with AccessExclusiveLock; it does not block ShareUpdateExclusiveLock. This is very weird and not desirable.
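The lock behavior described above can be checked against the slice of PostgreSQL's table-level lock conflict table that matters here (transcribed from the documentation into a small Python sketch; only the modes discussed in this thread are included):

```python
# Relevant slice of PostgreSQL's table-level lock conflict matrix,
# transcribed from the documentation (simplified to the modes under
# discussion). CONFLICTS[a] is the set of modes that mode `a` blocks.

CONFLICTS = {
    "AccessShareLock":          {"AccessExclusiveLock"},
    "ShareUpdateExclusiveLock": {"ShareUpdateExclusiveLock", "ShareLock",
                                 "ShareRowExclusiveLock", "ExclusiveLock",
                                 "AccessExclusiveLock"},
    "AccessExclusiveLock":      {"AccessShareLock", "RowShareLock",
                                 "RowExclusiveLock",
                                 "ShareUpdateExclusiveLock", "ShareLock",
                                 "ShareRowExclusiveLock", "ExclusiveLock",
                                 "AccessExclusiveLock"},
}

def conflicts(a, b):
    """True if lock modes a and b cannot be held concurrently."""
    return b in CONFLICTS[a]

# A SELECT (AccessShareLock) blocks a regular DROP INDEX ...
print(conflicts("AccessShareLock", "AccessExclusiveLock"))       # True
# ... but not DROP INDEX CONCURRENTLY (ShareUpdateExclusiveLock).
print(conflicts("AccessShareLock", "ShareUpdateExclusiveLock"))  # False
```

This is exactly why the concurrent variant needs the extra WaitForLockers step instead of relying on lock conflicts alone.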
Of course, the developers thought of this: in the source code we can see that, before the concurrent drop completes, it waits for all transactions using the index to finish. But this wait exists only on the master, and my query is executed on the standby. I used the pg_waldump tool to parse the wal files and analyzed the startup process, and I found that there is no similar wait on the standby. So it can happen that while I execute a query on the standby, the index is deleted out from under it.

I think this is a bug that will affect the user's experience, and we need to fix it. I have considered implementing on the standby the logic that detects running query transactions and waits for them to end, but this may increase the log application delay, and the exacerbated delay would make the master and standby diverge further. This is not desirable if the query concurrency is large. All in all, I hope you can provide a solution that allows drop index concurrently to be used without affecting the master-standby delay.

Sincerely looking forward to your reply, and thanks.
adger