On Tuesday, June 20, 2023 5:44 PM Amit Kapila <amit.kapil...@gmail.com> wrote: > > On Mon, Jun 19, 2023 at 5:13 PM Amit Kapila <amit.kapil...@gmail.com> > wrote: > > > > On Sun, Jun 18, 2023 at 12:18 AM Peter Geoghegan <p...@bowt.ie> wrote: > > > > > > On Sat, Jun 17, 2023 at 11:29 AM Jaime Casanova > > > <jcasa...@systemguards.com.ec> wrote: > > > > I have been testing 16beta1, last commit > > > > a14e75eb0b6a73821e0d66c0d407372ec8376105 > > > > I just let sqlsmith do its magic before trying something else, and > > > > today I found a core with the attached backtrace. > > > > > > The assertion that fails is the IsPageLockHeld assertion from commit > 72e78d831a. > > > > > > > I'll look into this and share my analysis. > > > > This failure mode appears to be introduced in commit 7d71d3dd08 (in > PG16) where we started to process the config file after acquiring page lock > during autovacuum. The problem here is that after acquiring page lock (a > heavy-weight lock), while processing the config file, we tried to access the > catalog cache which in turn attempts to acquire a lock on the catalog > relation, > and that leads to the assertion failure. This is because of an existing rule > that we > don't acquire any other heavyweight lock while holding the page lock except > for relation extension. I think normally we should be careful about the lock > ordering for heavy-weight locks to avoid deadlocks but here there may not be > any existing hazard in acquiring a lock on the catalog table after acquiring > page > lock on the gin index's metapage as I am not aware of a scenario where we can > acquire them in reverse order. One naive idea is to have a parameter like > vacuum_config_reload_safe to allow config reload during autovacuum and > make it false for the gin index cleanup code path.
I also think it would be better to skip reloading config in page lock cases to be consistent with the rule. And here is the patch which does the same. I tried to reproduce the assert failure(before applying the patch) using the following steps: 1. I added a sleep before vacuum_delay_point in ginInsertCleanup and LOG("attach this process") before sleeping. 2. And changed few GUC to make autovacuum happen more frequently and then start the server. - autovacuum_naptime = 5s autovacuum_vacuum_threshold = 1 autovacuum_vacuum_insert_threshold = 100 - 3. Then I execute the following sqls: - create table gin_test_tbl(i int4[]); create index gin_test_idx on gin_test_tbl using gin (i) with (fastupdate = on, gin_pending_list_limit = 4096); insert into gin_test_tbl select array[1, 2, g] from generate_series(1, 20000) g; insert into gin_test_tbl select array[1, 3, g] from generate_series(1, 1000) g; - 4. After a while, I can see the LOG from autovacuum worker and then use gdb to attach to the autovacuum worker. 5. When the autovacuum worker is blocked, I changed the "default_text_search_config = 'pg_catalog.public'" in configure file and reload it. 6. Release the autovacuum worker and then I can see the assert failure. And I can see the assert failure doesn't happen after applying the patch. > > The reason for the existing rule for page lock and relation extension locks > was > to not allow them to participate in group locking which will allow other > parallel > operations like a parallel vacuum where multiple workers can work on the same > index, or parallel inserts, parallel copy, etc. The related commits are > 15ef6ff4b9, > 72e78d831ab, 85f6b49c2c, and 3ba59ccc89. See 3ba59ccc89 for more details > (To allow parallel inserts and parallel copy, we have ensured that relation > extension and page locks don't participate in group locking which means such > locks can conflict among the same group members. This is required as it is no > safer for two related processes to extend the same relation or perform clean > up in gin indexes at a time than for unrelated processes to do the same....). Best Regards, Hou zj
0001-disallow-reloading-config-when-holding-the-page-lock.patch
Description: 0001-disallow-reloading-config-when-holding-the-page-lock.patch