On Monday, May 29, 2023 5:22 PM Drouvot, Bertrand <bertranddrouvot...@gmail.com> wrote: > > Hi, > > On 5/26/23 9:27 AM, Yu Shi (Fujitsu) wrote: > > Hi hackers, > > > > I saw a buildfarm failure on "dikkop"[1]. It failed in > > 035_standby_logical_decoding.pl, because the slots row_removal_inactiveslot > and > > row_removal_activeslot are not invalidated after vacuum. > > Thanks for sharing! > > > > > regress_log_035_standby_logical_decoding: > > ``` > > [12:15:05.943](4.442s) not ok 22 - inactiveslot slot invalidation is logged > > with > vacuum on pg_class > > [12:15:05.945](0.003s) > > [12:15:05.946](0.000s) # Failed test 'inactiveslot slot invalidation is > > logged > with vacuum on pg_class' > > # at t/035_standby_logical_decoding.pl line 238. > > [12:15:05.948](0.002s) not ok 23 - activeslot slot invalidation is logged > > with > vacuum on pg_class > > [12:15:05.949](0.001s) > > [12:15:05.950](0.000s) # Failed test 'activeslot slot invalidation is > > logged with > vacuum on pg_class' > > # at t/035_standby_logical_decoding.pl line 244. > > [13:38:26.977](5001.028s) # poll_query_until timed out executing this query: > > # select (confl_active_logicalslot = 1) from pg_stat_database_conflicts > > where > datname = 'testdb' > > # expecting this output: > > # t > > # last actual query output: > > # f > > # with stderr: > > [13:38:26.980](0.003s) not ok 24 - confl_active_logicalslot updated > > [13:38:26.982](0.002s) > > [13:38:26.982](0.000s) # Failed test 'confl_active_logicalslot updated' > > # at t/035_standby_logical_decoding.pl line 251. > > Timed out waiting confl_active_logicalslot to be updated at > t/035_standby_logical_decoding.pl line 251. > > ``` > > > > 035_standby_logical_decoding.pl: > > ``` > > # This should trigger the conflict > > $node_primary->safe_psql( > > 'testdb', qq[ > > CREATE TABLE conflict_test(x integer, y text); > > DROP TABLE conflict_test; > > VACUUM pg_class; > > INSERT INTO flush_wal DEFAULT VALUES; -- see create table flush_wal > > ]); > > > > $node_primary->wait_for_replay_catchup($node_standby); > > > > # Check invalidation in the logfile and in pg_stat_database_conflicts > > check_for_invalidation('row_removal_', $logstart, 'with vacuum on > > pg_class'); > > ``` > > > > Is it possible that the vacuum command didn't remove tuples and then the > > conflict was not triggered? > > The flush_wal table added by Andres should guarantee that the WAL is flushed, > so > the only reason I can think about is indeed that the vacuum did not remove > tuples ( > but I don't get why/how that could be the case). > > > It seems we can't confirm this because there is not > > enough information. > > Right, and looking at its status history most of the time the test is green > (making > it > even more difficult to diagnose). > > > Maybe "vacuum verbose" can be used to provide more > > information. > > I can see that dikkop "belongs" to Tomas (adding Tomas to this thread). > Tomas, do you think it would be possible to run some > 035_standby_logical_decoding.pl > manually with "vacuum verbose" in the test mentioned above? (or any other > way you can think > about that would help diagnose this random failure?). >
Thanks for your reply. I saw another failure on "drongo" [1], which looks like a similar problem. Maybe a temporary patch can be committed to dump the result of "vacuum verbose". And we can check this when the test fails. [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2023-05-26%2018%3A05%3A57 Regards, Shi Yu