Re: Online verification of checksums

Robert Haas Tue, 19 Mar 2019 14:35:10 -0700

On Tue, Mar 19, 2019 at 4:49 PM Andres Freund <[email protected]> wrote:
> To demonstrate that I ran a loop that verified that a) a normal backend
> query using the tale detects the corruption b) pg_basebackup doesn't.
>
> i=0;
> while true; do
>     i=$(($i+1));
>     echo attempt $i;
>     dd if=/dev/urandom of=/srv/dev/pgdev-dev/base/13390/16384 bs=8192 count=1 
> conv=notrunc 2>/dev/null;
>     psql -X -c 'SELECT * FROM corruptme;' 2>/dev/null && break;
>     ~/build/postgres/dev-assert/vpath/src/bin/pg_basebackup/pg_basebackup -X 
> fetch -F t -D - -c fast > /dev/null || break;
> done
>
> (excuse the crappy one-off sh)
>
> had, during ~12k iterations, always detected the corruption in the
> backend, and never via pg_basebackup. Given the likely LSNs in a
> cluster, that's not too surprising.


Wow.  So we shipped a checksum-verification feature (in pg_basebackup)
that reliably fails to detect blatantly corrupt pages.  That's pretty
awful.  Your chances get better the more WAL you've ever generated,
but you have to generate 163 petabytes of WAL to have a 1% chance of
detecting a page of random garbage, so realistically they never get
very good.

It's probably fair to point out that flipping a couple of random bytes
on the page is a more likely error than replacing the entire page with
garbage, and the check as designed will detect that fairly reliably --
unless those bytes are very near the beginning of the page.  Still,
that leaves a lot of kinds of corruption that this will not catch.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Online verification of checksums

Reply via email to