Thomas Munro <thomas.mu...@gmail.com> writes:
> On Thu, Apr 29, 2021 at 4:45 AM Tom Lane <t...@sss.pgh.pa.us> wrote:
>> Andres Freund <and...@anarazel.de> writes:
>>> Tom, any chance you could check if your machine repros the issue before
>>> these commits?
>> Wilco, but it'll likely take a little while to get results ...

> FWIW I also chewed through many megawatts trying to reproduce this on
> a PowerPC system in 64 bit big endian mode, with an emulator.  No
> cigar.  However, it's so slow that I didn't make it to 10 runs...

So I've expended a lot of kilowatt-hours over the past several days, and
I've got results that are interesting but don't really get us any closer
to a resolution.

To recap, the test lashup is:

* 2003 PowerMac G4 (1.25GHz PPC 7455, 7200 rpm spinning-rust drive)
* Standard debug build (--enable-debug --enable-cassert)
* Out-of-the-box configuration, except add wal_consistency_checking = all
  and configure a wal-streaming standby on the same machine
* Repeatedly run "make installcheck-parallel", but skip the tablespace test
  to avoid issues with the standby trying to use the same directory
* Delay long enough after each installcheck-parallel to let the standby
  catch up (the run proper is ~24 min, plus 2 min for catchup)

The failures I'm seeing generally look like

2021-05-01 15:33:10.968 EDT [8281] FATAL:  inconsistent page found, rel 1663/58186/66338, forknum 0, blkno 19
2021-05-01 15:33:10.968 EDT [8281] CONTEXT:  WAL redo at 3/4CE905B8 for Gist/PAGE_UPDATE: ; blkref #0: rel 1663/58186/66338, blk 19 FPW

with a variety of WAL record types being named, so it doesn't seem to be
specific to any particular record type.

I've twice gotten the bogus-checksum-and-then-assertion-failure I
reported before:

2021-05-01 17:07:52.992 EDT [17464] LOG:  incorrect resource manager data checksum in record at 3/E0073EA4
TRAP: FailedAssertion("state->recordRemainLen > 0", File: "xlogreader.c", Line: 567, PID: 17464)

In both of those cases, the WAL on disk was perfectly fine, and the same
is true of most of the "inconsistent page" complaints.  So the issue
definitely seems to be about the startup process mis-reading data that
was correctly shipped over.
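To support the "not specific to any particular record type" observation, a small sketch of how one might tally the consistency-check failures by record type from the standby's log.  The helper name and the tallying approach are my own invention; the only assumption is the CONTEXT-line format shown in the log excerpts above.

```python
import re

# Match the "WAL redo at LSN for Rmgr/RECORD_TYPE:" portion of the
# CONTEXT line that accompanies each "inconsistent page found" FATAL.
CONTEXT_RE = re.compile(r"WAL redo at [0-9A-F]+/[0-9A-F]+ for (\w+)/(\w+):")

def tally_record_types(log_lines):
    """Count consistency failures per rmgr/record type (hypothetical helper)."""
    counts = {}
    for line in log_lines:
        m = CONTEXT_RE.search(line)
        if m:
            key = f"{m.group(1)}/{m.group(2)}"
            counts[key] = counts.get(key, 0) + 1
    return counts

# Sample line taken verbatim from the failure shown above:
sample = [
    "2021-05-01 15:33:10.968 EDT [8281] CONTEXT:  WAL redo at 3/4CE905B8 "
    "for Gist/PAGE_UPDATE: ; blkref #0: rel 1663/58186/66338, blk 19 FPW",
]
print(tally_record_types(sample))  # {'Gist/PAGE_UPDATE': 1}
```

Running this over the logs of all failing cycles would show whether the counts spread across many rmgrs, as claimed, or cluster on one.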
Anyway, the new and interesting data concerns the relative failure rates
of different builds:

* Recent HEAD (from 4-28 and 5-1): 4 failures in 8 test cycles
* Reverting 1d257577e: 1 failure in 8 test cycles
* Reverting 1d257577e and f003d9f87: 3 failures in 28 cycles
* Reverting 1d257577e, f003d9f87, and 323cbe7c7: 2 failures in 93 cycles

That last point means that there was some hard-to-hit problem even
before any of the recent WAL-related changes.  However, 323cbe7c7
(Remove read_page callback from XLogReader) increased the failure rate
by at least a factor of 5, and 1d257577e (Optionally prefetch referenced
data) seems to have increased it by another factor of 4.  But it looks
like f003d9f87 (Add circular WAL decoding buffer) didn't materially
change the failure rate.

Considering that 323cbe7c7 was supposed to be just refactoring, and
1d257577e is allegedly disabled-by-default, these are surely not the
results I was expecting to get.

It seems like it's still an open question whether all this is a real
bug, or flaky hardware.  I have seen occasional kernel freezeups (or so
I think -- machine stops responding to keyboard or network input) over
the past year or two, so I cannot in good conscience rule out the
flaky-hardware theory.  But it doesn't smell like that kind of problem
to me.  I think what we're looking at is a timing-sensitive bug that was
there before (maybe long before?) and these commits happened to make it
occur more often on this particular hardware.  This hardware is enough
unlike anything made in the past decade that it's not hard to credit
that it'd show a timing problem that nobody else can reproduce.

(I did try the time-honored ritual of reseating all the machine's RAM,
partway through this.  Doesn't seem to have changed anything.)

Anyway, I'm not sure where to go from here.
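For what it's worth, the factor-of-5 and factor-of-4 claims check out arithmetically.  A quick sketch (the build labels are just my shorthand for the four configurations listed above):

```python
# Per-cycle failure rates for the four builds, from the counts above.
rates = {
    "HEAD": 4 / 8,
    "revert 1d257577e": 1 / 8,
    "also revert f003d9f87": 3 / 28,
    "also revert 323cbe7c7": 2 / 93,
}
for build, rate in rates.items():
    print(f"{build}: {rate:.3f} failures/cycle")

# 323cbe7c7: compare the two fully-reverted builds with and without it.
refactor_factor = rates["also revert f003d9f87"] / rates["also revert 323cbe7c7"]
# 1d257577e: compare HEAD against HEAD-with-it-reverted.
prefetch_factor = rates["HEAD"] / rates["revert 1d257577e"]
print(f"323cbe7c7: x{refactor_factor:.1f}, 1d257577e: x{prefetch_factor:.1f}")
# -> 323cbe7c7: x5.0, 1d257577e: x4.0
```

And f003d9f87 indeed looks like a wash: 1/8 = 0.125 vs. 3/28 ≈ 0.107 failures per cycle, well within noise at these sample sizes.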
I'm for sure nowhere near being able to identify the bug --- and if
there really is a bug that formerly had a one-in-fifty reproduction
rate, I have zero interest in trying to identify where it started by
bisecting.  It'd take at least a day per bisection step, and even that
might not be accurate enough.  (But, if anyone has ideas of specific
commits to test, I'd be willing to try a few.)

			regards, tom lane
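[Editor's note: the "even a day per step might not be accurate enough" point can be made concrete.  A rough sketch, using the one-in-fifty rate and the ~26-minute cycle time quoted above as assumed inputs:]

```python
# With a one-in-fifty per-cycle reproduction rate and ~26 min per cycle
# (~24 min run + ~2 min standby catchup), how trustworthy is a full day
# of clean runs at one bisection step?
p = 1 / 50                                 # assumed per-cycle failure rate
cycle_minutes = 26
cycles_per_day = 24 * 60 // cycle_minutes  # ~55 cycles in a day
p_all_pass = (1 - p) ** cycles_per_day     # P(buggy commit shows no failure)
print(f"{cycles_per_day} cycles/day; "
      f"P(false 'good' verdict) = {p_all_pass:.2f}")
```

About a one-in-three chance that a buggy commit sails through a whole day of testing, so any bisection at the old failure rate would mislabel steps often enough to wander off course.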