On Tue, Apr 2, 2019 at 6:56 AM Michael Paquier <mich...@paquier.xyz> wrote: > > On Sat, Mar 30, 2019 at 06:15:11PM +0100, Julien Rouhaud wrote: > > I'd also have to get more feedback on this. For now, I'll add this > > thread to the pg12 open items, as a follow up of the initial code > > drop. > > Catching up here... I think that having a completely separate view > with one row for each database and one row for shared objects makes > the most sense based on what has been proposed on this thread. Being > able to track checksum failures for shared catalogs is really > something I'd like to be able to see easily, and I have seen > corruption involving such objects from time to time. I think that we > should have a design which is extensible.
Ok! > One thing which is not > proposed on this patch, and I am fine with it as a first draft, is > that we don't have any information about the broken block number and > the file involved. My gut tells me that we'd want a separate view, > like pg_stat_checksums_details with one tuple per (dboid, rel, fork, > blck) to be complete. But that's just for future work. That could indeed be nice. > For the progress part, we would most likely have a separate view for > that as well, as the view should show no rows if there is no operation > in progress. Ok. > The patch looks rather clean to me, I have some comments. > > - <application>pg_checksums</application>. The exit status is zero if there > - are no checksum errors when checking them, and nonzero if at least one > - checksum failure is detected. If enabling or disabling checksums, the > - exit status is nonzero if the operation failed. > + <application>pg_checksums</application>. As a consequence, the > + <structname>pg_stat_checksums</structnameview won't reflect this activity. > + The exit status is zero if there are no checksum errors when checking > them, > + and nonzero if at least one checksum failure is detected. If enabling or > + disabling checksums, the exit status is nonzero if the operation failed. > > The docs of pg_checksums already clearly state that the cluster needs > to be offline, so I am not sure that this addition is necessary. Agreed, removed. > @@ -1539,6 +1539,8 @@ pgstat_report_checksum_failures_in_db(Oid dboid, > int failurecount) > > Please note that there is no need to have the list of arguments in the > comment block at the top of pgstat_report_checksum_failures_in_db(). Indeed, fixed. > + if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL) > + result = 0; > + else > + result = dbentry->last_checksum_failure; > + > + if (result == 0) > + PG_RETURN_NULL(); > + else > + PG_RETURN_TIMESTAMPTZ(result); > +} > > No need for two ifs here. What about just that? > if (NULL) > PG_RETURN_NULL(); > else > PG_RETURN_TIMESTAMPTZ(last_checksum_failure); I do agree, but this is done like this everywhere in pgstatfuncs.c, so I think it's better to keep it as-is for consistency.
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml index f1df14bdea..30674f61ce 100644 --- a/doc/src/sgml/monitoring.sgml +++ b/doc/src/sgml/monitoring.sgml @@ -384,6 +384,14 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser </entry> </row> + <row> + <entry><structname>pg_stat_checksums</structname><indexterm><primary>pg_stat_checksums</primary></indexterm></entry> + <entry>One row per database, plus one for the shared objects, showing + database-wide checksums statistics. See + <xref linkend="pg-stat-checksums-view"/> for details. + </entry> + </row> + <row> <entry><structname>pg_stat_database</structname><indexterm><primary>pg_stat_database</primary></indexterm></entry> <entry>One row per database, showing database-wide statistics. See @@ -2418,6 +2426,54 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i single row, containing global data for the cluster. </para> + <table id="pg-stat-checksums-view" xreflabel="pg_stat_checksums"> + <title><structname>pg_stat_checksums</structname> View</title> + <tgroup cols="3"> + <thead> + <row> + <entry>Column</entry> + <entry>Type</entry> + <entry>Description</entry> + </row> + </thead> + + <tbody> + <row> + <entry><structfield>datid</structfield></entry> + <entry><type>oid</type></entry> + <entry>OID of a database, or 0 for objects belonging to a shared relation</entry> + </row> + <row> + <entry><structfield>datname</structfield></entry> + <entry><type>name</type></entry> + <entry>Name of this database, or <literal><shared_objects></literal></entry> + </row> + <row> + <entry><structfield>checksum_failures</structfield></entry> + <entry><type>bigint</type></entry> + <entry>Number of data page checksum failures detected in this + database</entry> + </row> + <row> + <entry><structfield>checksum_last_failure</structfield></entry> + <entry><type>timestamp with time zone</type></entry> + <entry>Time at which the last data page checksum failures was detected in + this database</entry> + </row> + <row> + <entry><structfield>stats_reset</structfield></entry> + <entry><type>timestamp with time zone</type></entry> + <entry>Time at which these statistics were last reset</entry> + </row> + </tbody> + </tgroup> + </table> + + <para> + The <structname>pg_stat_database</structname> view will contain one row + for each database in the cluster, showing database-wide statistics. + </para> + <table id="pg-stat-database-view" xreflabel="pg_stat_database"> <title><structname>pg_stat_database</structname> View</title> <tgroup cols="3"> @@ -2529,11 +2585,6 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i <entry><type>bigint</type></entry> <entry>Number of deadlocks detected in this database</entry> </row> - <row> - <entry><structfield>checksum_failures</structfield></entry> - <entry><type>bigint</type></entry> - <entry>Number of data page checksum failures detected in this database</entry> - </row> <row> <entry><structfield>blk_read_time</structfield></entry> <entry><type>double precision</type></entry> diff --git a/doc/src/sgml/ref/initdb.sgml b/doc/src/sgml/ref/initdb.sgml index 84fb37c293..e063dd972e 100644 --- a/doc/src/sgml/ref/initdb.sgml +++ b/doc/src/sgml/ref/initdb.sgml @@ -218,7 +218,9 @@ PostgreSQL documentation I/O system that would otherwise be silent. Enabling checksums may incur a noticeable performance penalty. This option can only be set during initialization, and cannot be changed later. If - set, checksums are calculated for all objects, in all databases. + set, checksums are calculated for all objects, in all databases. All + checksum failures will be reported in the <xref + linkend="pg-stat-checksums-view"/> view. </para> </listitem> </varlistentry> diff --git a/doc/src/sgml/ref/pg_basebackup.sgml b/doc/src/sgml/ref/pg_basebackup.sgml index c4f3950e5b..8179db5f2a 100644 --- a/doc/src/sgml/ref/pg_basebackup.sgml +++ b/doc/src/sgml/ref/pg_basebackup.sgml @@ -531,7 +531,8 @@ PostgreSQL documentation By default, checksums are verified and checksum failures will result in a non-zero exit status. However, the base backup will not be removed in such a case, as if the <option>--no-clean</option> option - had been used. + had been used. Checksum verifications failures will also be reported + in the <xref linkend="pg-stat-checksums-view"/> view. </para> </listitem> </varlistentry> diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql index b89df70653..ffe53bf888 100644 --- a/src/backend/catalog/system_views.sql +++ b/src/backend/catalog/system_views.sql @@ -823,12 +823,24 @@ CREATE VIEW pg_stat_database AS pg_stat_get_db_temp_files(D.oid) AS temp_files, pg_stat_get_db_temp_bytes(D.oid) AS temp_bytes, pg_stat_get_db_deadlocks(D.oid) AS deadlocks, - pg_stat_get_db_checksum_failures(D.oid) AS checksum_failures, pg_stat_get_db_blk_read_time(D.oid) AS blk_read_time, pg_stat_get_db_blk_write_time(D.oid) AS blk_write_time, pg_stat_get_db_stat_reset_time(D.oid) AS stats_reset FROM pg_database D; +CREATE VIEW pg_stat_checksums AS + SELECT + D.oid AS datid, + D.datname AS datname, + pg_stat_get_db_checksum_failures(D.oid) AS checksum_failures, + pg_stat_get_db_checksum_last_failure(D.oid) AS checksum_last_failure, + pg_stat_get_db_stat_reset_time(D.oid) AS stats_reset + FROM ( + SELECT oid, datname FROM pg_database + UNION ALL + SELECT 0, '<shared objects>' + ) D; + CREATE VIEW pg_stat_database_conflicts AS SELECT D.oid AS datid, diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c index 2a8472b91a..ed877d6d95 100644 --- a/src/backend/postmaster/pgstat.c +++ b/src/backend/postmaster/pgstat.c @@ -1523,7 +1523,7 @@ pgstat_report_deadlock(void) /* -------- - * pgstat_report_checksum_failures_in_db(dboid, failure_count) - + * pgstat_report_checksum_failures_in_db() - * * Tell the collector about one or more checksum failures. * -------- @@ -1539,6 +1539,8 @@ pgstat_report_checksum_failures_in_db(Oid dboid, int failurecount) pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_CHECKSUMFAILURE); msg.m_databaseid = dboid; msg.m_failurecount = failurecount; + msg.m_failure_time = GetCurrentTimestamp(); + pgstat_send(&msg, sizeof(msg)); } @@ -4601,6 +4603,7 @@ reset_dbentry_counters(PgStat_StatDBEntry *dbentry) dbentry->n_temp_bytes = 0; dbentry->n_deadlocks = 0; dbentry->n_checksum_failures = 0; + dbentry->last_checksum_failure = 0; dbentry->n_block_read_time = 0; dbentry->n_block_write_time = 0; @@ -6257,6 +6260,7 @@ pgstat_recv_checksum_failure(PgStat_MsgChecksumFailure *msg, int len) dbentry = pgstat_get_db_entry(msg->m_databaseid, true); dbentry->n_checksum_failures += msg->m_failurecount; + dbentry->last_checksum_failure = msg->m_failure_time; } /* ---------- diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c index 537f09e342..36dcb28754 100644 --- a/src/backend/replication/basebackup.c +++ b/src/backend/replication/basebackup.c @@ -1584,9 +1584,9 @@ sendFile(const char *readfilename, const char *tarfilename, struct stat *statbuf (errmsg("file \"%s\" has a total of %d checksum verification " "failures", readfilename, checksum_failures))); - if (dboid != InvalidOid) - pgstat_report_checksum_failures_in_db(dboid, checksum_failures); + pgstat_report_checksum_failures_in_db(dboid, checksum_failures); } + total_checksum_failures += checksum_failures; return true; diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c index 90a817a25c..b28d9ab216 100644 --- a/src/backend/utils/adt/pgstatfuncs.c +++ b/src/backend/utils/adt/pgstatfuncs.c @@ -1514,6 +1514,24 @@ pg_stat_get_db_checksum_failures(PG_FUNCTION_ARGS) PG_RETURN_INT64(result); } +Datum +pg_stat_get_db_checksum_last_failure(PG_FUNCTION_ARGS) +{ + Oid dbid = PG_GETARG_OID(0); + TimestampTz result; + PgStat_StatDBEntry *dbentry; + + if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL) + result = 0; + else + result = dbentry->last_checksum_failure; + + if (result == 0) + PG_RETURN_NULL(); + else + PG_RETURN_TIMESTAMPTZ(result); +} + Datum pg_stat_get_db_blk_read_time(PG_FUNCTION_ARGS) { diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat index eac909109c..c4aeeee810 100644 --- a/src/include/catalog/pg_proc.dat +++ b/src/include/catalog/pg_proc.dat @@ -5281,6 +5281,11 @@ proname => 'pg_stat_get_db_checksum_failures', provolatile => 's', proparallel => 'r', prorettype => 'int8', proargtypes => 'oid', prosrc => 'pg_stat_get_db_checksum_failures' }, +{ oid => '8394', + descr => 'statistics: when last checksum failure was detected in database', + proname => 'pg_stat_get_db_checksum_last_failure', provolatile => 's', + proparallel => 'r', prorettype => 'timestamptz', proargtypes => 'oid', + prosrc => 'pg_stat_get_db_checksum_last_failure' }, { oid => '3074', descr => 'statistics: last reset for a database', proname => 'pg_stat_get_db_stat_reset_time', provolatile => 's', proparallel => 'r', prorettype => 'timestamptz', proargtypes => 'oid', diff --git a/src/include/pgstat.h b/src/include/pgstat.h index c080fa6388..d6d1772706 100644 --- a/src/include/pgstat.h +++ b/src/include/pgstat.h @@ -541,6 +541,7 @@ typedef struct PgStat_MsgChecksumFailure PgStat_MsgHdr m_hdr; Oid m_databaseid; int m_failurecount; + TimestampTz m_failure_time; } PgStat_MsgChecksumFailure; @@ -607,6 +608,7 @@ typedef struct PgStat_StatDBEntry PgStat_Counter n_temp_bytes; PgStat_Counter n_deadlocks; PgStat_Counter n_checksum_failures; + TimestampTz last_checksum_failure; PgStat_Counter n_block_read_time; /* times in microseconds */ PgStat_Counter n_block_write_time; diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out index d5f309fbfb..193de82cc3 100644 --- a/src/test/regress/expected/rules.out +++ b/src/test/regress/expected/rules.out @@ -1801,6 +1801,17 @@ pg_stat_bgwriter| SELECT pg_stat_get_bgwriter_timed_checkpoints() AS checkpoints pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync, pg_stat_get_buf_alloc() AS buffers_alloc, pg_stat_get_bgwriter_stat_reset_time() AS stats_reset; +pg_stat_checksums| SELECT d.oid AS datid, + d.datname, + pg_stat_get_db_checksum_failures(d.oid) AS checksum_failures, + pg_stat_get_db_checksum_last_failure(d.oid) AS checksum_last_failure, + pg_stat_get_db_stat_reset_time(d.oid) AS stats_reset + FROM ( SELECT pg_database.oid, + pg_database.datname + FROM pg_database + UNION ALL + SELECT 0, + '<shared objects>'::name) d; pg_stat_database| SELECT d.oid AS datid, d.datname, pg_stat_get_db_numbackends(d.oid) AS numbackends, @@ -1817,7 +1828,6 @@ pg_stat_database| SELECT d.oid AS datid, pg_stat_get_db_temp_files(d.oid) AS temp_files, pg_stat_get_db_temp_bytes(d.oid) AS temp_bytes, pg_stat_get_db_deadlocks(d.oid) AS deadlocks, - pg_stat_get_db_checksum_failures(d.oid) AS checksum_failures, pg_stat_get_db_blk_read_time(d.oid) AS blk_read_time, pg_stat_get_db_blk_write_time(d.oid) AS blk_write_time, pg_stat_get_db_stat_reset_time(d.oid) AS stats_reset