On Tue, Mar 19, 2019 at 4:49 PM Andres Freund <and...@anarazel.de> wrote:
> To demonstrate that I ran a loop that verified that a) a normal backend
> query using the table detects the corruption b) pg_basebackup doesn't.
>
> i=0;
> while true; do
>     i=$(($i+1));
>     echo attempt $i;
>     dd if=/dev/urandom of=/srv/dev/pgdev-dev/base/13390/16384 bs=8192 count=1 conv=notrunc 2>/dev/null;
>     psql -X -c 'SELECT * FROM corruptme;' 2>/dev/null && break;
>     ~/build/postgres/dev-assert/vpath/src/bin/pg_basebackup/pg_basebackup -X fetch -F t -D - -c fast > /dev/null || break;
> done
>
> (excuse the crappy one-off sh)
>
> had, during ~12k iterations, always detected the corruption in the
> backend, and never via pg_basebackup. Given the likely LSNs in a
> cluster, that's not too surprising.

Wow. So we shipped a checksum-verification feature (in pg_basebackup)
that reliably fails to detect blatantly corrupt pages. That's pretty
awful. Your chances get better the more WAL you've ever generated, but
you have to generate 163 petabytes of WAL to have a 1% chance of
detecting a page of random garbage, so realistically they never get
very good.

It's probably fair to point out that flipping a couple of random bytes
on the page is a more likely error than replacing the entire page with
garbage, and the check as designed will detect that fairly reliably --
unless those bytes are very near the beginning of the page. Still,
that leaves a lot of kinds of corruption that this will not catch.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
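
For anyone who wants to check the 163 petabyte figure, here is a rough sketch of the arithmetic. It rests on one assumption that is implied by the discussion above but not spelled out in it: pg_basebackup only verifies a page's checksum when the LSN stored in the first 8 bytes of the page is below the backup's start LSN, and otherwise skips the page. With a page of uniformly random bytes, that LSN is a uniformly random 64-bit value, so the chance of the page being checked at all is roughly (WAL bytes generated so far) / 2^64. The function names are made up for illustration, and "petabytes" is taken to mean units of 2^50 bytes.

```python
# Assumption (not stated verbatim in the thread): a page is only
# checksum-verified when the LSN in its first 8 bytes is below the
# backup's start LSN; otherwise the page is skipped unverified.
LSN_SPACE = 2 ** 64   # number of possible 64-bit LSN values
PIB = 2 ** 50         # bytes per pebibyte


def detection_probability(wal_bytes_generated: int) -> float:
    """Chance that a page of uniformly random bytes gets checked at all."""
    return min(wal_bytes_generated, LSN_SPACE) / LSN_SPACE


def wal_needed_for(probability: float) -> int:
    """WAL volume (in bytes) needed to reach the given chance of a check."""
    return int(probability * LSN_SPACE)


needed = wal_needed_for(0.01)
print(f"{needed / PIB:.1f} PiB of WAL for a 1% chance")

# A cluster that has generated 1 TiB of WAL so far:
print(f"{detection_probability(2 ** 40):.2e} chance after 1 TiB of WAL")
```

Under the same assumption, a cluster with 1 TiB of WAL behind it has about a 1 in 16.7 million chance per backup of noticing such a page, which is consistent with ~12k iterations never detecting it.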