Hi,

Am Montag, den 18.03.2019, 16:11 +0800 schrieb Stephen Frost:
> On Mon, Mar 18, 2019 at 15:52 Michael Banck <michael.ba...@credativ.de> wrote:
> > Am Montag, den 18.03.2019, 03:34 -0400 schrieb Stephen Frost:
> > > Thanks for that.  Reading through the code though, I don't entirely
> > > understand why we're making things complicated for ourselves by trying
> > > to seek and re-read the entire block, specifically this:
> > 
> > [...]
> > 
> > > I would think that we could just do:
> > > 
> > >   insert_location = 0;
> > >   r = read(BLCKSIZE - insert_location);
> > >   if (r < 0) error();
> > >   if (r == 0) EOF detected, move to next
> > >   if (r < (BLCKSIZE - insert_location)) {
> > >     insert_location += r;
> > >     continue;
> > >   }
> > > 
> > > At this point, we should have a full block, do our checks...
> > 
> > Well, we need to read() into some buffer which you have ommitted.
> 
> Surely there’s a buffer the read in the existing code is passing in,
> you just need to offset by the current pointer, sorry for not being
> clear.
> 
> In other words the read would look more like:
> 
> read(fd,buf + insert_ptr, BUFSZ - insert_ptr)
> 
> And then you have to reset insert_ptr once you have a full block.

Ok, thanks for clearing that up.

I've tried to do that now in the attached, does that suit you?

> Yes, the question was more like this: have you ever seen a read return
> a partial result when you know you’re in the middle somewhere of an
> existing file and the length of the file hasn’t been changed by
> something else..?

I don't think I've seen that, but that wouldn't turn up in regular
testing anyway I guess but only in pathological cases?  I guess we are
probably dealing with this in the current version of the patch, but I
can't say for certain as it sounds pretty difficult to test.

I have also added a paragraph to the documentation about possilby
skipping new or recently updated pages:

+   If the cluster is online, pages that have been (re-)written since the last
+   checkpoint will not count as checksum failures if they cannot be read or
+   verified correctly.

Wording improvements welcome.


Michael


-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.ba...@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz
diff --git a/doc/src/sgml/ref/pg_checksums.sgml b/doc/src/sgml/ref/pg_checksums.sgml
index 6a47dda683..c1fa167508 100644
--- a/doc/src/sgml/ref/pg_checksums.sgml
+++ b/doc/src/sgml/ref/pg_checksums.sgml
@@ -37,10 +37,15 @@ PostgreSQL documentation
   <title>Description</title>
   <para>
    <application>pg_checksums</application> verifies data checksums in a
-   <productname>PostgreSQL</productname> cluster.  The server must be shut
-   down cleanly before running <application>pg_checksums</application>.
-   The exit status is zero if there are no checksum errors, otherwise nonzero.
+   <productname>PostgreSQL</productname> cluster.  The exit status is zero if
+   there are no checksum errors, otherwise nonzero.  
   </para>
+
+  <para>
+   If the cluster is online, pages that have been (re-)written since the last
+   checkpoint will not count as checksum failures if they cannot be read or
+   verified correctly.
+  </page>
  </refsect1>
 
  <refsect1>
diff --git a/src/bin/pg_checksums/pg_checksums.c b/src/bin/pg_checksums/pg_checksums.c
index b7ebc11017..a82ce7d7b4 100644
--- a/src/bin/pg_checksums/pg_checksums.c
+++ b/src/bin/pg_checksums/pg_checksums.c
@@ -1,7 +1,7 @@
 /*-------------------------------------------------------------------------
  *
  * pg_checksums.c
- *	  Verifies page level checksums in an offline cluster.
+ *	  Verifies page level checksums in an cluster.
  *
  * Copyright (c) 2010-2019, PostgreSQL Global Development Group
  *
@@ -28,12 +28,16 @@
 
 
 static int64 files = 0;
+static int64 skippedfiles = 0;
 static int64 blocks = 0;
 static int64 badblocks = 0;
+static int64 skippedblocks = 0;
 static ControlFileData *ControlFile;
+static XLogRecPtr checkpointLSN;
 
 static char *only_relfilenode = NULL;
 static bool verbose = false;
+static bool online = false;
 
 static const char *progname;
 
@@ -90,10 +94,18 @@ scan_file(const char *fn, BlockNumber segmentno)
 	PageHeader	header = (PageHeader) buf.data;
 	int			f;
 	BlockNumber blockno;
+	int			block_loc = 0;
+	bool		block_retry = false;
 
 	f = open(fn, O_RDONLY | PG_BINARY, 0);
 	if (f < 0)
 	{
+		if (online && errno == ENOENT)
+		{
+			/* File was removed in the meantime */
+			return;
+		}
+
 		fprintf(stderr, _("%s: could not open file \"%s\": %s\n"),
 				progname, fn, strerror(errno));
 		exit(1);
@@ -104,30 +116,112 @@ scan_file(const char *fn, BlockNumber segmentno)
 	for (blockno = 0;; blockno++)
 	{
 		uint16		csum;
-		int			r = read(f, buf.data, BLCKSZ);
+		int			r = read(f, buf.data + block_loc, BLCKSZ - block_loc);
 
 		if (r == 0)
+		{
+			/*
+			 * We had a short read and got an EOF before we could read the
+			 * whole block, so count this as a skipped block.
+			 */
+			if (online && block_loc != 0)
+				skippedblocks++;
 			break;
-		if (r != BLCKSZ)
+		}
+		if (r < 0)
 		{
-			fprintf(stderr, _("%s: could not read block %u in file \"%s\": read %d of %d\n"),
-					progname, blockno, fn, r, BLCKSZ);
-			exit(1);
+			skippedfiles++;
+			fprintf(stderr, _("%s: could not read block %u in file \"%s\": %s\n"),
+					progname, blockno, fn, strerror(errno));
+			return;
+		}
+		if (r < (BLCKSZ - block_loc))
+		{
+			if (online)
+			{
+				/*
+				 * We read only parts of the block, possibly due to a
+				 * concurrent extension of the relation file.  Increment
+				 * block_loc by the amount that we read and try again.
+				 */
+				block_loc += r;
+				continue;
+			}
+			else
+			{
+				skippedfiles++;
+				fprintf(stderr, _("%s: could not read block %u in file \"%s\": read %d of %d\n"),
+						progname, blockno, fn, r, BLCKSZ);
+				return;
+			}
 		}
-		blocks++;
 
 		/* New pages have no checksum yet */
 		if (PageIsNew(header))
+		{
+			skippedblocks++;
 			continue;
+		}
+
+		blocks++;
 
 		csum = pg_checksum_page(buf.data, blockno + segmentno * RELSEG_SIZE);
 		if (csum != header->pd_checksum)
 		{
+			if (online)
+			{
+				/*
+				 * Retry the block on the first failure if online.  If the
+				 * verification is done while the instance is online, it is
+				 * possible that we read the first 4K page of the block just
+				 * before postgres updated the entire block so it ends up
+				 * looking torn to us.  We only need to retry once because the
+				 * LSN should be updated to something we can ignore on the next
+				 * pass.  If the error happens again then it is a true
+				 * validation failure.
+				 */
+				if (!block_retry)
+				{
+					/* Seek to the beginning of the failed block */
+					if (lseek(f, -BLCKSZ, SEEK_CUR) == -1)
+					{
+						skippedfiles++;
+						fprintf(stderr, _("%s: could not lseek in file \"%s\": %m\n"),
+								progname, fn);
+						return;
+					}
+
+					/* Set flag so we know a retry was attempted */
+					block_retry = true;
+
+					/* Reset loop to validate the block again */
+					blockno--;
+
+					continue;
+				}
+
+				/*
+				 * The checksum verification failed on retry as well.  Check if
+				 * the page has been modified since the checkpoint and skip it
+				 * in this case.
+				 */
+				if (PageGetLSN(buf.data) > checkpointLSN)
+				{
+					block_retry = false;
+					blocks--;
+					skippedblocks++;
+					continue;
+				}
+			}
+
 			if (ControlFile->data_checksum_version == PG_DATA_CHECKSUM_VERSION)
 				fprintf(stderr, _("%s: checksum verification failed in file \"%s\", block %u: calculated checksum %X but block contains %X\n"),
 						progname, fn, blockno, csum, header->pd_checksum);
 			badblocks++;
 		}
+
+		block_retry = false;
+		block_loc = 0;
 	}
 
 	if (verbose)
@@ -176,6 +270,12 @@ scan_directory(const char *basedir, const char *subdir)
 		snprintf(fn, sizeof(fn), "%s/%s", path, de->d_name);
 		if (lstat(fn, &st) < 0)
 		{
+			if (online && errno == ENOENT)
+			{
+				/* File was removed in the meantime */
+				continue;
+			}
+
 			fprintf(stderr, _("%s: could not stat file \"%s\": %s\n"),
 					progname, fn, strerror(errno));
 			exit(1);
@@ -312,7 +412,7 @@ main(int argc, char *argv[])
 		exit(1);
 	}
 
-	/* Check if cluster is running */
+	/* Check if checksums are enabled */
 	ControlFile = get_controlfile(DataDir, progname, &crc_ok);
 	if (!crc_ok)
 	{
@@ -336,12 +436,10 @@ main(int argc, char *argv[])
 		exit(1);
 	}
 
+	/* Check if cluster is running */
 	if (ControlFile->state != DB_SHUTDOWNED &&
 		ControlFile->state != DB_SHUTDOWNED_IN_RECOVERY)
-	{
-		fprintf(stderr, _("%s: cluster must be shut down to verify checksums\n"), progname);
-		exit(1);
-	}
+		online = true;
 
 	if (ControlFile->data_checksum_version == 0)
 	{
@@ -349,6 +447,9 @@ main(int argc, char *argv[])
 		exit(1);
 	}
 
+	/* Get checkpoint LSN */
+	checkpointLSN = ControlFile->checkPoint;
+
 	/* Scan all files */
 	scan_directory(DataDir, "global");
 	scan_directory(DataDir, "base");
@@ -357,11 +458,20 @@ main(int argc, char *argv[])
 	printf(_("Checksum scan completed\n"));
 	printf(_("Data checksum version: %d\n"), ControlFile->data_checksum_version);
 	printf(_("Files scanned:  %s\n"), psprintf(INT64_FORMAT, files));
+	if (skippedfiles > 0)
+		printf(_("Files skipped: %s\n"), psprintf(INT64_FORMAT, skippedfiles));
 	printf(_("Blocks scanned: %s\n"), psprintf(INT64_FORMAT, blocks));
+	if (skippedblocks > 0)
+		printf(_("Blocks skipped: %s\n"), psprintf(INT64_FORMAT, skippedblocks));
 	printf(_("Bad checksums:  %s\n"), psprintf(INT64_FORMAT, badblocks));
 
 	if (badblocks > 0)
 		return 1;
 
+	/* skipped blocks or files are considered an error if offline */
+	if (!online)
+		if (skippedblocks > 0 || skippedfiles > 0)
+			return 1;
+
 	return 0;
 }
diff --git a/src/bin/pg_checksums/t/002_actions.pl b/src/bin/pg_checksums/t/002_actions.pl
index 97284e8930..35d3b3dd36 100644
--- a/src/bin/pg_checksums/t/002_actions.pl
+++ b/src/bin/pg_checksums/t/002_actions.pl
@@ -5,7 +5,7 @@ use strict;
 use warnings;
 use PostgresNode;
 use TestLib;
-use Test::More tests => 45;
+use Test::More tests => 69;
 
 
 # Utility routine to create and check a table with corrupted checksums
@@ -104,10 +104,10 @@ append_to_file "$pgdata/global/pgsql_tmp/1.1", "foo";
 command_ok(['pg_checksums',  '-D', $pgdata],
 		   "succeeds with offline cluster");
 
-# Checks cannot happen with an online cluster
+# Checksums pass on an online cluster
 $node->start;
-command_fails(['pg_checksums',  '-D', $pgdata],
-			  "fails with online cluster");
+command_ok(['pg_checksums',  '-D', $pgdata],
+		   "succeeds with online cluster");
 
 # Check corruption of table on default tablespace.
 check_relation_corruption($node, 'corrupt1', 'pg_default');
@@ -133,23 +133,32 @@ sub fail_corrupt
 	my $file_name = "$pgdata/global/$file";
 	append_to_file $file_name, "foo";
 
+	$node->stop;
+	# If the instance is offline, the whole file is skipped and this is
+	# considered to be an error.
 	$node->command_checks_all([ 'pg_checksums', '-D', $pgdata],
 						  1,
-						  [qr/^$/],
+						  [qr/Files skipped:.*1/],
 						  [qr/could not read block 0 in file.*$file\":/],
-						  "fails for corrupted data in $file");
+						  "skips corrupted data in $file");
+
+	$node->start;
+	# If the instance is online, the block is skipped and this is not
+	# considered to be an error
+	$node->command_checks_all([ 'pg_checksums', '-D', $pgdata],
+						  0,
+						  [qr/Blocks skipped:.*1/],
+						  [qr/^$/],
+						  "skips corrupted data in $file");
 
 	# Remove file to prevent future lookup errors on conflicts.
 	unlink $file_name;
 	return;
 }
 
-# Stop instance for the follow-up checks.
-$node->stop;
-
-# Authorized relation files filled with corrupted data cause the
-# checksum checks to fail.  Make sure to use file names different
-# than the previous ones.
+# Authorized relation files filled with corrupted data cause the files to be
+# skipped and, if the instance is offline, a non-zero exit status.  Make sure
+# to use file names different than the previous ones.
 fail_corrupt($node, "99990");
 fail_corrupt($node, "99990.123");
 fail_corrupt($node, "99990_fsm");
@@ -158,3 +167,6 @@ fail_corrupt($node, "99990_vm");
 fail_corrupt($node, "99990_init.123");
 fail_corrupt($node, "99990_fsm.123");
 fail_corrupt($node, "99990_vm.123");
+
+# Stop node again at the end of tests
+$node->stop;

Reply via email to