On Wed, Jan 24, 2024 at 12:08 PM Nathan Bossart <nathandboss...@gmail.com> wrote:
> I'm seeing some recent buildfarm failures for pg_walsummary:
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2024-01-14%2006%3A21%3A58
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=idiacanthus&dt=2024-01-17%2021%3A10%3A36
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=serinus&dt=2024-01-20%2018%3A58%3A49
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=taipan&dt=2024-01-23%2002%3A46%3A57
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=serinus&dt=2024-01-23%2020%3A23%3A36
>
> The signature looks nearly identical in each:
>
> # Failed test 'WAL summary file exists'
> # at t/002_blocks.pl line 79.
>
> # Failed test 'stdout shows block 0 modified'
> # at t/002_blocks.pl line 85.
> # ''
> # doesn't match '(?^m:FORK main: block 0$)'
>
> I haven't been able to reproduce the issue on my machine, and I haven't
> figured out precisely what is happening yet, but I wanted to make sure
> there is awareness.

This is weird. There's a little more detail in the log file,
regress_log_002_blocks, e.g. from the first failure you linked:

[11:18:20.683](96.787s) # before insert, summarized TLI 1 through 0/14E09D0
[11:18:21.188](0.505s) # after insert, summarized TLI 1 through 0/14E0D08
[11:18:21.326](0.138s) # examining summary for TLI 1 from 0/14E0D08 to 0/155BAF0
# 1
...
[11:18:21.349](0.000s) # got: 'pg_walsummary: error: could not open file "/home/nm/farm/gcc64/HEAD/pgsql.build/src/bin/pg_walsummary/tmp_check/t_002_blocks_node1_data/pgdata/pg_wal/summaries/0000000100000000014E0D0800000000155BAF0
# 1.summary": No such file or directory'

The "examining summary" line is generated based on the output of
pg_available_wal_summaries(). The way that works is that the server
calls readdir(), disassembles the filename into a TLI and two LSNs,
and returns the result. Then, a fraction of a second later, the test
script reassembles those components into a filename and finds the file
missing. If the logic to translate between filenames and TLIs & LSNs
were incorrect, the test would fail consistently. So the only
explanation that seems to fit the facts is the file disappearing out
from under us. But that really shouldn't happen. We do have code to
remove such files in MaybeRemoveOldWalSummaries(), but it's only
supposed to be nuking files more than 10 days old.

So I don't really have a theory here as to what could be happening. :-(

-- 
Robert Haas
EDB: http://www.enterprisedb.com
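
For reference, the filename round trip described above (take a name apart into a TLI and two LSNs, then put it back together) can be sketched roughly as follows. This is only an illustration, not the server's or the test script's actual code. It assumes a summary file name consists of five 8-digit uppercase hex fields (TLI, then the high and low halves of the start LSN, then the high and low halves of the end LSN) followed by ".summary". The function names are hypothetical, and the LSN values in the usage note are illustrative.

```python
import re

# Assumed layout: TLI, start LSN (high, low), end LSN (high, low),
# each written as exactly 8 uppercase hex digits, then ".summary".
_SUMMARY_RE = re.compile(r"([0-9A-F]{8})" * 5 + r"\.summary")


def parse_lsn(text):
    """Convert an LSN in 'hi/lo' hex notation to a 64-bit integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def format_lsn(lsn):
    """Convert a 64-bit LSN back to 'hi/lo' hex notation."""
    return "%X/%X" % (lsn >> 32, lsn & 0xFFFFFFFF)


def parse_summary_name(name):
    """Split a summary file name into (tli, start_lsn, end_lsn), or None."""
    m = _SUMMARY_RE.fullmatch(name)
    if m is None:
        return None
    tli, s_hi, s_lo, e_hi, e_lo = (int(g, 16) for g in m.groups())
    return tli, (s_hi << 32) | s_lo, (e_hi << 32) | e_lo


def build_summary_name(tli, start_lsn, end_lsn):
    """Reassemble the file name from its three components."""
    return "%08X%08X%08X%08X%08X.summary" % (
        tli,
        start_lsn >> 32, start_lsn & 0xFFFFFFFF,
        end_lsn >> 32, end_lsn & 0xFFFFFFFF,
    )
```

Under those assumptions, build_summary_name(1, parse_lsn("0/14E09D0"), parse_lsn("0/14E0D08")) yields a name that parse_summary_name() maps back to the same three values. If the two directions ever disagreed, the mismatch would show up on every run rather than intermittently.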