On Tuesday, June 20, 2023 5:44 PM Amit Kapila <[email protected]> wrote: > > On Mon, Jun 19, 2023 at 5:13 PM Amit Kapila <[email protected]> > wrote: > > > > On Sun, Jun 18, 2023 at 12:18 AM Peter Geoghegan <[email protected]> wrote: > > > > > > On Sat, Jun 17, 2023 at 11:29 AM Jaime Casanova > > > <[email protected]> wrote: > > > > I have been testing 16beta1, last commit > > > > a14e75eb0b6a73821e0d66c0d407372ec8376105 > > > > I just let sqlsmith do its magic before trying something else, and > > > > today I found a core with the attached backtrace. > > > > > > The assertion that fails is the IsPageLockHeld assertion from commit > 72e78d831a. > > > > > > > I'll look into this and share my analysis. > > > > This failure mode appears to be introduced in commit 7d71d3dd08 (in > PG16) where we started to process the config file after acquiring page lock > during autovacuum. The problem here is that after acquiring page lock (a > heavy-weight lock), while processing the config file, we tried to access the > catalog cache which in turn attempts to acquire a lock on the catalog > relation, > and that leads to the assertion failure. This is because of an existing rule > that we > don't acquire any other heavyweight lock while holding the page lock except > for relation extension. I think normally we should be careful about the lock > ordering for heavy-weight locks to avoid deadlocks but here there may not be > any existing hazard in acquiring a lock on the catalog table after acquiring > page > lock on the gin index's metapage as I am not aware of a scenario where we can > acquire them in reverse order. One naive idea is to have a parameter like > vacuum_config_reload_safe to allow config reload during autovacuum and > make it false for the gin index cleanup code path.
I also think it would be better to skip reloading config in page lock cases to
be consistent
with the rule. And here is the patch which does the same.
I tried to reproduce the assert failure(before applying the patch) using the
following steps:
1. I added a sleep before vacuum_delay_point in ginInsertCleanup and
LOG("attach this process") before sleeping.
2. And changed few GUC to make autovacuum happen more frequently and then start
the server.
-
autovacuum_naptime = 5s
autovacuum_vacuum_threshold = 1
autovacuum_vacuum_insert_threshold = 100
-
3. Then I execute the following sqls:
-
create table gin_test_tbl(i int4[]);
create index gin_test_idx on gin_test_tbl using gin (i)
with (fastupdate = on, gin_pending_list_limit = 4096);
insert into gin_test_tbl select array[1, 2, g] from generate_series(1, 20000) g;
insert into gin_test_tbl select array[1, 3, g] from generate_series(1, 1000) g;
-
4. After a while, I can see the LOG from autovacuum worker and then use gdb to
attach to the autovacuum worker.
5. When the autovacuum worker is blocked, I changed the
"default_text_search_config = 'pg_catalog.public'" in configure file and reload
it.
6. Release the autovacuum worker and then I can see the assert failure.
And I can see the assert failure doesn't happen after applying the patch.
>
> The reason for the existing rule for page lock and relation extension locks
> was
> to not allow them to participate in group locking which will allow other
> parallel
> operations like a parallel vacuum where multiple workers can work on the same
> index, or parallel inserts, parallel copy, etc. The related commits are
> 15ef6ff4b9,
> 72e78d831ab, 85f6b49c2c, and 3ba59ccc89. See 3ba59ccc89 for more details
> (To allow parallel inserts and parallel copy, we have ensured that relation
> extension and page locks don't participate in group locking which means such
> locks can conflict among the same group members. This is required as it is no
> safer for two related processes to extend the same relation or perform clean
> up in gin indexes at a time than for unrelated processes to do the same....).
Best Regards,
Hou zj
0001-disallow-reloading-config-when-holding-the-page-lock.patch
Description: 0001-disallow-reloading-config-when-holding-the-page-lock.patch
