Hi,
On Monday, 2019-03-18 at 02:38 -0400, Stephen Frost wrote:
> * Michael Paquier (mich...@paquier.xyz) wrote:
> > On Mon, Mar 18, 2019 at 01:43:08AM -0400, Stephen Frost wrote:
> > > To be clear, I agree completely that we don't want to be reporting false
> > > positives or "this might mean corruption!" to users running the tool,
> > > but I haven't seen a good explanation of why this needs to involve the
> > > server to avoid that happening. If someone would like to point that out
> > > to me, I'd be happy to go read about it and try to understand.
> >
> > The mentions on this thread that the server has all the facility in
> > place to properly lock a buffer and make sure that a partial read
> > *never* happens and that we *never* have any kind of false positives,
>
> Uh, we are, of course, going to have partial reads- we just need to
> handle them appropriately, and that's not hard to do in a way that we
> never have false positives.
I think the current patch (V13 from https://www.postgresql.org/message-id/1552045881.4947.43.ca...@credativ.de) does that, modulo possible bugs.
> I do not understand, at all, the whole sub-thread argument that we have
> to avoid partial reads. We certainly don't worry about that when doing
> backups, and I don't see why we need to avoid it here. We are going to
> have partial reads- and that's ok, as long as it's because we're at the
> end of the file, and that's easy enough to check by just doing another
> read to see if we get back zero bytes, which indicates we're at the end
> of the file, and then we move on, no need to coordinate anything with
> the backend for this.
Well, I agree with you, but we don't seem to have consensus on that.
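To illustrate the "another read to see if we get back zero bytes" rule Stephen describes above, here is a minimal sketch (hypothetical helper, not code from the patch): a short read is classified as end-of-file only when the follow-up read returns zero bytes, so it never turns into a false positive.

```c
/* Minimal sketch of the partial-read-at-EOF rule: a short read is not
 * an error by itself; only a subsequent zero-byte read proves we are
 * at the end of the file. */
#include <fcntl.h>
#include <unistd.h>

#define BLCKSZ 8192

/* Returns 1 for a full block, 0 for clean EOF, -1 for a partial block
 * (the file ends mid-block, e.g. because it is being extended). */
static int
read_block(int fd, char *buf)
{
	ssize_t		r = read(fd, buf, BLCKSZ);

	if (r == BLCKSZ)
		return 1;
	if (r == 0)
		return 0;				/* zero bytes read: genuine end of file */
	return -1;					/* partial read: skip, not a checksum failure */
}
```

The caller can then count a trailing partial block as skipped and move on, with no coordination with the backend.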
> > directly preventing the set of issues we are trying to implement
> > workarounds for in a frontend tool are rather good arguments in my
> > opinion (you can grep for BufferDescriptorGetIOLock() on this thread
> > for example).
>
> Sure the backend has those facilities since it needs to, but these
> frontend tools *don't* need that to *never* have any false positives, so
> why are we complicating things by saying that this frontend tool and the
> backend have to coordinate?
>
> If there's an explanation of why we can't avoid having false positives
> in the frontend tool, I've yet to see it. I definitely understand that
> we can get partial reads, but a partial read isn't a failure, and
> shouldn't be reported as such.
It is not in the current patch; it just gets reported as a skipped
block at the end. That is, if the cluster is online; if it is offline,
we do consider it a failure.
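The online semantics the attached patch implements can be summed up in a small decision helper (hypothetical names, mirroring the logic in scan_file()): retry a mismatching block once, and if it still fails but its LSN is newer than the last checkpoint, skip it rather than flag it.

```c
/* Hypothetical helper summarizing the patch's online verification
 * logic: a mismatching checksum is retried once (the read may have
 * been torn); if it still fails but the page LSN is newer than the
 * last checkpoint, the page was legitimately modified and is skipped;
 * otherwise it is a real checksum failure. */
typedef unsigned long long XLogRecPtr;

enum verdict
{
	BLOCK_OK, BLOCK_RETRY, BLOCK_SKIP, BLOCK_BAD
};

static enum verdict
classify_block(int checksum_ok, int already_retried,
			   XLogRecPtr page_lsn, XLogRecPtr checkpoint_lsn)
{
	if (checksum_ok)
		return BLOCK_OK;
	if (!already_retried)
		return BLOCK_RETRY;		/* reread once before judging */
	if (page_lsn > checkpoint_lsn)
		return BLOCK_SKIP;		/* modified since the last checkpoint */
	return BLOCK_BAD;			/* genuine checksum failure */
}
```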
I have now rebased that patch on top of the pg_verify_checksums ->
pg_checksums renaming, see attached.
Michael
--
Michael Banck
Project Lead / Senior Consultant
Tel.: +49 2166 9901-171
Fax: +49 2166 9901-100
Email: michael.ba...@credativ.de
credativ GmbH, HRB Mönchengladbach 12080
VAT ID: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Management: Dr. Michael Meskes, Jörg Folz, Sascha Heuer
Our handling of personal data is subject to
the following policy: https://www.credativ.de/datenschutz
diff --git a/doc/src/sgml/ref/pg_checksums.sgml b/doc/src/sgml/ref/pg_checksums.sgml
index 6a47dda683..124475f057 100644
--- a/doc/src/sgml/ref/pg_checksums.sgml
+++ b/doc/src/sgml/ref/pg_checksums.sgml
@@ -37,9 +37,8 @@ PostgreSQL documentation
<title>Description</title>
<para>
<application>pg_checksums</application> verifies data checksums in a
- <productname>PostgreSQL</productname> cluster. The server must be shut
- down cleanly before running <application>pg_checksums</application>.
- The exit status is zero if there are no checksum errors, otherwise nonzero.
+ <productname>PostgreSQL</productname> cluster. The exit status is zero if
+ there are no checksum errors, otherwise nonzero.
</para>
</refsect1>
diff --git a/src/bin/pg_checksums/pg_checksums.c b/src/bin/pg_checksums/pg_checksums.c
index b7ebc11017..0ed065f7e9 100644
--- a/src/bin/pg_checksums/pg_checksums.c
+++ b/src/bin/pg_checksums/pg_checksums.c
@@ -1,7 +1,7 @@
/*-------------------------------------------------------------------------
*
* pg_checksums.c
- * Verifies page level checksums in an offline cluster.
+ * Verifies page level checksums in a cluster.
*
* Copyright (c) 2010-2019, PostgreSQL Global Development Group
*
@@ -28,12 +28,16 @@
static int64 files = 0;
+static int64 skippedfiles = 0;
static int64 blocks = 0;
static int64 badblocks = 0;
+static int64 skippedblocks = 0;
static ControlFileData *ControlFile;
+static XLogRecPtr checkpointLSN;
static char *only_relfilenode = NULL;
static bool verbose = false;
+static bool online = false;
static const char *progname;
@@ -90,10 +94,17 @@ scan_file(const char *fn, BlockNumber segmentno)
PageHeader header = (PageHeader) buf.data;
int f;
BlockNumber blockno;
+ bool block_retry = false;
f = open(fn, O_RDONLY | PG_BINARY, 0);
if (f < 0)
{
+ if (online && errno == ENOENT)
+ {
+ /* File was removed in the meantime */
+ return;
+ }
+
fprintf(stderr, _("%s: could not open file \"%s\": %s\n"),
progname, fn, strerror(errno));
exit(1);
@@ -108,26 +119,130 @@ scan_file(const char *fn, BlockNumber segmentno)
if (r == 0)
break;
+ if (r < 0)
+ {
+ skippedfiles++;
+ fprintf(stderr, _("%s: could not read block %u in file \"%s\": %s\n"),
+ progname, blockno, fn, strerror(errno));
+ return;
+ }
if (r != BLCKSZ)
{
- fprintf(stderr, _("%s: could not read block %u in file \"%s\": read %d of %d\n"),
- progname, blockno, fn, r, BLCKSZ);
- exit(1);
+ if (online)
+ {
+ if (block_retry)
+ {
+ /* We already tried once to reread the block, skip to the next block */
+ skippedblocks++;
+ if (lseek(f, BLCKSZ-r, SEEK_CUR) == -1)
+ {
+ skippedfiles++;
+ fprintf(stderr, _("%s: could not lseek to next block in file \"%s\": %m\n"),
+ progname, fn);
+ return;
+ }
+ continue;
+ }
+
+ /*
+ * Retry the block. It's possible that we read the block while it
+ * was being extended or shrunk, so it ends up looking torn to us.
+ */
+
+ /*
+ * Seek back by the amount of bytes we read to the beginning of
+ * the failed block.
+ */
+ if (lseek(f, -r, SEEK_CUR) == -1)
+ {
+ skippedfiles++;
+ fprintf(stderr, _("%s: could not lseek in file \"%s\": %m\n"),
+ progname, fn);
+ return;
+ }
+
+ /* Set flag so we know a retry was attempted */
+ block_retry = true;
+
+ /* Reset loop to validate the block again */
+ blockno--;
+
+ continue;
+ }
+ else
+ {
+ skippedfiles++;
+ fprintf(stderr, _("%s: could not read block %u in file \"%s\": read %d of %d\n"),
+ progname, blockno, fn, r, BLCKSZ);
+ return;
+ }
}
- blocks++;
/* New pages have no checksum yet */
if (PageIsNew(header))
+ {
+ skippedblocks++;
continue;
+ }
+
+ blocks++;
csum = pg_checksum_page(buf.data, blockno + segmentno * RELSEG_SIZE);
if (csum != header->pd_checksum)
{
+ if (online)
+ {
+ /*
+ * Retry the block on the first failure if online. If the
+ * verification is done while the instance is online, it is
+ * possible that we read the first 4kB of the block just
+ * before postgres updated the entire block, so it ends up
+ * looking torn to us. We only need to retry once because the
+ * LSN should be updated to something we can ignore on the next
+ * pass. If the error happens again then it is a true
+ * validation failure.
+ */
+ if (!block_retry)
+ {
+ /* Seek to the beginning of the failed block */
+ if (lseek(f, -BLCKSZ, SEEK_CUR) == -1)
+ {
+ skippedfiles++;
+ fprintf(stderr, _("%s: could not lseek in file \"%s\": %m\n"),
+ progname, fn);
+ return;
+ }
+
+ /* Set flag so we know a retry was attempted */
+ block_retry = true;
+
+ /* Reset loop to validate the block again */
+ blockno--;
+
+ continue;
+ }
+
+ /*
+ * The checksum verification failed on retry as well. Check if
+ * the page has been modified since the checkpoint and skip it
+ * in this case.
+ */
+ if (PageGetLSN(buf.data) > checkpointLSN)
+ {
+ block_retry = false;
+ blocks--;
+ skippedblocks++;
+ continue;
+ }
+ }
+
if (ControlFile->data_checksum_version == PG_DATA_CHECKSUM_VERSION)
fprintf(stderr, _("%s: checksum verification failed in file \"%s\", block %u: calculated checksum %X but block contains %X\n"),
progname, fn, blockno, csum, header->pd_checksum);
badblocks++;
}
+
+ block_retry = false;
}
if (verbose)
@@ -176,6 +291,12 @@ scan_directory(const char *basedir, const char *subdir)
snprintf(fn, sizeof(fn), "%s/%s", path, de->d_name);
if (lstat(fn, &st) < 0)
{
+ if (online && errno == ENOENT)
+ {
+ /* File was removed in the meantime */
+ continue;
+ }
+
fprintf(stderr, _("%s: could not stat file \"%s\": %s\n"),
progname, fn, strerror(errno));
exit(1);
@@ -312,7 +433,7 @@ main(int argc, char *argv[])
exit(1);
}
- /* Check if cluster is running */
+ /* Check if checksums are enabled */
ControlFile = get_controlfile(DataDir, progname, &crc_ok);
if (!crc_ok)
{
@@ -336,12 +457,10 @@ main(int argc, char *argv[])
exit(1);
}
+ /* Check if cluster is running */
if (ControlFile->state != DB_SHUTDOWNED &&
ControlFile->state != DB_SHUTDOWNED_IN_RECOVERY)
- {
- fprintf(stderr, _("%s: cluster must be shut down to verify checksums\n"), progname);
- exit(1);
- }
+ online = true;
if (ControlFile->data_checksum_version == 0)
{
@@ -349,6 +468,9 @@ main(int argc, char *argv[])
exit(1);
}
+ /* Get checkpoint LSN */
+ checkpointLSN = ControlFile->checkPoint;
+
/* Scan all files */
scan_directory(DataDir, "global");
scan_directory(DataDir, "base");
@@ -357,11 +479,20 @@ main(int argc, char *argv[])
printf(_("Checksum scan completed\n"));
printf(_("Data checksum version: %d\n"), ControlFile->data_checksum_version);
printf(_("Files scanned: %s\n"), psprintf(INT64_FORMAT, files));
+ if (skippedfiles > 0)
+ printf(_("Files skipped: %s\n"), psprintf(INT64_FORMAT, skippedfiles));
printf(_("Blocks scanned: %s\n"), psprintf(INT64_FORMAT, blocks));
+ if (skippedblocks > 0)
+ printf(_("Blocks skipped: %s\n"), psprintf(INT64_FORMAT, skippedblocks));
printf(_("Bad checksums: %s\n"), psprintf(INT64_FORMAT, badblocks));
if (badblocks > 0)
return 1;
+ /* Skipped blocks or files are considered an error if offline */
+ if (!online && (skippedblocks > 0 || skippedfiles > 0))
+ return 1;
+
return 0;
}
diff --git a/src/bin/pg_checksums/t/002_actions.pl b/src/bin/pg_checksums/t/002_actions.pl
index 97284e8930..fcf113a88c 100644
--- a/src/bin/pg_checksums/t/002_actions.pl
+++ b/src/bin/pg_checksums/t/002_actions.pl
@@ -5,7 +5,7 @@ use strict;
use warnings;
use PostgresNode;
use TestLib;
-use Test::More tests => 45;
+use Test::More tests => 69;
# Utility routine to create and check a table with corrupted checksums
@@ -104,10 +104,10 @@ append_to_file "$pgdata/global/pgsql_tmp/1.1", "foo";
command_ok(['pg_checksums', '-D', $pgdata],
"succeeds with offline cluster");
-# Checks cannot happen with an online cluster
+# Checksums pass on an online cluster
$node->start;
-command_fails(['pg_checksums', '-D', $pgdata],
- "fails with online cluster");
+command_ok(['pg_checksums', '-D', $pgdata],
+ "succeeds with online cluster");
# Check corruption of table on default tablespace.
check_relation_corruption($node, 'corrupt1', 'pg_default');
@@ -133,23 +133,32 @@ sub fail_corrupt
my $file_name = "$pgdata/global/$file";
append_to_file $file_name, "foo";
- $node->command_checks_all([ 'pg_checksums', '-D', $pgdata],
+ $node->stop;
+ # If the instance is offline, the whole file is skipped and this is
+ # considered to be an error.
+ $node->command_checks_all([ 'pg_checksums', '-D', $pgdata],
1,
- [qr/^$/],
+ [qr/Files skipped:.*1/],
[qr/could not read block 0 in file.*$file\":/],
- "fails for corrupted data in $file");
+ "skips corrupted data in $file");
+
+ $node->start;
+ # If the instance is online, the block is skipped and this is not
+ # considered to be an error
+ $node->command_checks_all([ 'pg_checksums', '-D', $pgdata],
+ 0,
+ [qr/Blocks skipped:.*1/],
+ [qr/^$/],
+ "skips corrupted data in $file");
# Remove file to prevent future lookup errors on conflicts.
unlink $file_name;
return;
}
-# Stop instance for the follow-up checks.
-$node->stop;
-
-# Authorized relation files filled with corrupted data cause the
-# checksum checks to fail. Make sure to use file names different
-# than the previous ones.
+# Authorized relation files filled with corrupted data cause the files to be
+# skipped and, if the instance is offline, a non-zero exit status. Make sure
+# to use file names different than the previous ones.
fail_corrupt($node, "99990");
fail_corrupt($node, "99990.123");
fail_corrupt($node, "99990_fsm");
@@ -158,3 +167,6 @@ fail_corrupt($node, "99990_vm");
fail_corrupt($node, "99990_init.123");
fail_corrupt($node, "99990_fsm.123");
fail_corrupt($node, "99990_vm.123");
+
+# Stop node again at the end of tests
+$node->stop;