On Wed, Jan 24, 2024 at 12:08 PM Nathan Bossart
<nathandboss...@gmail.com> wrote:
> I'm seeing some recent buildfarm failures for pg_walsummary:
>
>         
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2024-01-14%2006%3A21%3A58
>         
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=idiacanthus&dt=2024-01-17%2021%3A10%3A36
>         
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=serinus&dt=2024-01-20%2018%3A58%3A49
>         
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=taipan&dt=2024-01-23%2002%3A46%3A57
>         
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=serinus&dt=2024-01-23%2020%3A23%3A36
>
> The signature looks nearly identical in each:
>
>         #   Failed test 'WAL summary file exists'
>         #   at t/002_blocks.pl line 79.
>
>         #   Failed test 'stdout shows block 0 modified'
>         #   at t/002_blocks.pl line 85.
>         #                   ''
>         #     doesn't match '(?^m:FORK main: block 0$)'
>
> I haven't been able to reproduce the issue on my machine, and I haven't
> figured out precisely what is happening yet, but I wanted to make sure
> there is awareness.

This is weird. There's a little more detail in the log file,
regress_log_002_blocks, e.g. from the first failure you linked:

[11:18:20.683](96.787s) # before insert, summarized TLI 1 through 0/14E09D0
[11:18:21.188](0.505s) # after insert, summarized TLI 1 through 0/14E0D08
[11:18:21.326](0.138s) # examining summary for TLI 1 from 0/14E0D08 to 0/155BAF0
# 1
...
[11:18:21.349](0.000s) #          got: 'pg_walsummary: error: could
not open file 
"/home/nm/farm/gcc64/HEAD/pgsql.build/src/bin/pg_walsummary/tmp_check/t_002_blocks_node1_data/pgdata/pg_wal/summaries/0000000100000000014E0D0800000000155BAF0
# 1.summary": No such file or directory'

The "examining summary" line is generated based on the output of
pg_available_wal_summaries(). The way that works is that the server
calls readdir(), disassembles the filename into a TLI and two LSNs,
and returns the result. Then, a fraction of a second later, the test
script reassembles those components into a filename and finds the file
missing. If the logic to translate between filenames and TLIs & LSNs
were incorrect, the test would fail consistently. So the only
explanation that seems to fit the facts is the file disappearing out
from under us. But that really shouldn't happen. We do have code to
remove such files in MaybeRemoveOldWalSummaries(), but it's only
supposed to be nuking files more than 10 days old.

So I don't really have a theory here as to what could be happening. :-(

-- 
Robert Haas
EDB: http://www.enterprisedb.com


Reply via email to