Re: [HACKERS] Recovery conflict monitoring

2011-01-03 Thread Magnus Hagander
On Mon, Jan 3, 2011 at 00:23, Simon Riggs si...@2ndquadrant.com wrote:
 On Mon, 2010-12-27 at 14:39 +0100, Magnus Hagander wrote:
 On Thu, Dec 23, 2010 at 13:09, Magnus Hagander mag...@hagander.net wrote:
  This patch adds counters and views to monitor hot standby generated
  recovery conflicts. It extends the pg_stat_database view with one
  column with the total number of conflicts, and also creates a new view
  pg_stat_database_conflicts that contains a breakdown of exactly what
  caused the conflicts.
 
  Documentation still pending, but comments meanwhile is of course 
  appreciated ;)

 Heikki pointed out over IM that it's pointless to count stats caused
 by recovery conflict with drop database - since we drop the stats
 record as soon as it arrives anyway. Here's an updated patch that
 removes that, and also adds some documentation.

 I like the patch, well inspired, code in the right places AFAICS. No
 code comments at all.

Thanks for reviewing!



 Couple of thoughts:

 * are we safe to issue stats immediately before issuing FATAL? Won't
 some of them get lost?

They shouldn't - not more than other stats messages. Those are often
flushed from on_shmem_exit() which I think runs even later.


 * Not clear what I'd do with database level information, except worry a
 lot. Maybe an option to count conflicts per user would be better, since
 at least we'd know exactly who was affected by those. Just an idea.

Depends on the usage scenario. In a lot of dedicated environments you
really only have one database - but there are many environments where
you do have multiple and it's quite useful to see them separately. And
you can of course very easily sum() them up for a total count, since
it's a view... Better keep the detail than throw it away, even if that
part isn't useful in *all* cases...

Grouping by user would potentially be helpful - I agree. However, that
goes for most pgstat counters (number of seqscans, tuples read etc
are interesting per user as well in some cases). So doing that right
would mean adding per-user tracking all across pgstats in some smart
way - something we don't do now at all. So I see that as a separate
issue.


 * Would it better to have a log_standby_conflicts that allowed the
 opportunity to log the conflicting SQL, duration until cancelation etc?

Logging is useful to figure out why you have a certain scenario, yes.
But absolutely not as a *replacement* for the statistics counters, but
as an addition. Just like we have (the now incorrectly named)
pg_stat_bgwriter view *and* log_checkpoints... Different usecases for
the same basic information.


 I'd rather have what you have than nothing at all though... the new
 hot_standby_feedback mode should be acting to reduce these, so it would
 be useful to have this patch enabled for testing that feature.

It will help reduce it, but not take it away, right? Plus, it's an
optional feature...

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Recovery conflict monitoring

2011-01-03 Thread Simon Riggs

On Mon, 2011-01-03 at 10:03 +0100, Magnus Hagander wrote:

  I like the patch, well inspired, code in the right places AFAICS. No
  code comments at all.
 
 Thanks for reviewing!

All good here. Test and commit please.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/books/
 PostgreSQL Development, 24x7 Support, Training and Services
 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Recovery conflict monitoring

2011-01-03 Thread Greg Smith

Couple of doc suggestions:

--- doc/src/sgml/high-availability.sgml
+ The number of query cancels and the reason for them can be viewed 
using

+ the structnamepg_stat_database_conflicts/ system view on the slave
+ server.

For compleness sake, this should also mention the per-database summary, 
even though I'm not sure how valuable that view is.  Also, on a standby 
server instead of on the slave server here.  slave is mentioned 
once as a synonym in high-availability.sgml once, but that's it, and 
there can be more than one standby you want to pull these stats from.


*** doc/src/sgml/monitoring.sgml
!   number of rows returned, fetched, inserted, updated and deleted, and
!   total number of queries cancelled due to conflict with recovery.

This would be clearer if it said you're talking about standby recovery 
here, and possibly that this info is only available on the standby.  I 
could see someone reading this and thinking it's possible for general 
database crash recovery to produce cancelled queries, instead of the way 
connections are actually blocked until that's done.


!   entrystructnamepg_stat_database_conflicts/
!   entryOne row per database, showing database OID, database name and
!   the number of queries that have been cancelled in this database 
due to
!   dropped tablespaces, lock timeouts, old snapshots, pinned 
buffers and

!   deadlocks.

A clarification that you're talking about standby query cancellation 
here might be helpful too.  I don't think that's necessary for all of 
the detailed pg_stat_get_* functions that regular users are less likely 
to care about, just these higher level ones.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services and Supportwww.2ndQuadrant.us
PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Recovery conflict monitoring

2011-01-03 Thread Magnus Hagander
On Mon, Jan 3, 2011 at 11:35, Greg Smith g...@2ndquadrant.com wrote:
 Couple of doc suggestions:

 --- doc/src/sgml/high-availability.sgml
 +     The number of query cancels and the reason for them can be viewed
 using
 +     the structnamepg_stat_database_conflicts/ system view on the slave
 +     server.

 For compleness sake, this should also mention the per-database summary, even
 though I'm not sure how valuable that view is.  Also, on a standby server
 instead of on the slave server here.  slave is mentioned once as a
 synonym in high-availability.sgml once, but that's it, and there can be more
 than one standby you want to pull these stats from.

Good point, changed and added.


 *** doc/src/sgml/monitoring.sgml
 !       number of rows returned, fetched, inserted, updated and deleted, and
 !       total number of queries cancelled due to conflict with recovery.

 This would be clearer if it said you're talking about standby recovery here,
 and possibly that this info is only available on the standby.  I could see
 someone reading this and thinking it's possible for general database crash
 recovery to produce cancelled queries, instead of the way connections are
 actually blocked until that's done.

 !       entrystructnamepg_stat_database_conflicts/
 !       entryOne row per database, showing database OID, database name and
 !       the number of queries that have been cancelled in this database due
 to
 !       dropped tablespaces, lock timeouts, old snapshots, pinned buffers
 and
 !       deadlocks.

 A clarification that you're talking about standby query cancellation here
 might be helpful too.  I don't think that's necessary for all of the
 detailed pg_stat_get_* functions that regular users are less likely to care
 about, just these higher level ones.

Yeah, those both make sense - I've updated the docs and am running tests ATM.

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Recovery conflict monitoring

2011-01-02 Thread Simon Riggs
On Mon, 2010-12-27 at 14:39 +0100, Magnus Hagander wrote:
 On Thu, Dec 23, 2010 at 13:09, Magnus Hagander mag...@hagander.net wrote:
  This patch adds counters and views to monitor hot standby generated
  recovery conflicts. It extends the pg_stat_database view with one
  column with the total number of conflicts, and also creates a new view
  pg_stat_database_conflicts that contains a breakdown of exactly what
  caused the conflicts.
 
  Documentation still pending, but comments meanwhile is of course 
  appreciated ;)
 
 Heikki pointed out over IM that it's pointless to count stats caused
 by recovery conflict with drop database - since we drop the stats
 record as soon as it arrives anyway. Here's an updated patch that
 removes that, and also adds some documentation.

I like the patch, well inspired, code in the right places AFAICS. No
code comments at all.

Couple of thoughts: 

* are we safe to issue stats immediately before issuing FATAL? Won't
some of them get lost?

* Not clear what I'd do with database level information, except worry a
lot. Maybe an option to count conflicts per user would be better, since
at least we'd know exactly who was affected by those. Just an idea.

* Would it better to have a log_standby_conflicts that allowed the
opportunity to log the conflicting SQL, duration until cancelation etc?

I'd rather have what you have than nothing at all though... the new
hot_standby_feedback mode should be acting to reduce these, so it would
be useful to have this patch enabled for testing that feature.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/books/
 PostgreSQL Development, 24x7 Support, Training and Services
 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Recovery conflict monitoring

2010-12-27 Thread Magnus Hagander
On Thu, Dec 23, 2010 at 13:09, Magnus Hagander mag...@hagander.net wrote:
 This patch adds counters and views to monitor hot standby generated
 recovery conflicts. It extends the pg_stat_database view with one
 column with the total number of conflicts, and also creates a new view
 pg_stat_database_conflicts that contains a breakdown of exactly what
 caused the conflicts.

 Documentation still pending, but comments meanwhile is of course appreciated 
 ;)

Heikki pointed out over IM that it's pointless to count stats caused
by recovery conflict with drop database - since we drop the stats
record as soon as it arrives anyway. Here's an updated patch that
removes that, and also adds some documentation.

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/
*** a/doc/src/sgml/high-availability.sgml
--- b/doc/src/sgml/high-availability.sgml
***
*** 1541,1546  if (!triggered)
--- 1541,1552 
  approach, since varnamevacuum_defer_cleanup_age/ is measured in
  transactions executed on the primary server.
 /para
+ 
+para
+ The number of query cancels and the reason for them can be viewed using
+ the structnamepg_stat_database_conflicts/ system view on the slave
+ server.
+/para
/sect2
  
sect2 id=hot-standby-admin
*** a/doc/src/sgml/monitoring.sgml
--- b/doc/src/sgml/monitoring.sgml
***
*** 278,284  postgres: replaceableuser/ replaceabledatabase/ replaceablehost/ re
number of transactions committed and rolled back in that database,
total disk blocks read, total buffer hits (i.e., block
read requests avoided by finding the block already in buffer cache),
!   number of rows returned, fetched, inserted, updated and deleted.
   /entry
   /row
  
--- 278,294 
number of transactions committed and rolled back in that database,
total disk blocks read, total buffer hits (i.e., block
read requests avoided by finding the block already in buffer cache),
!   number of rows returned, fetched, inserted, updated and deleted, and
!   total number of queries cancelled due to conflict with recovery.
!  /entry
!  /row
! 
!  row
!   entrystructnamepg_stat_database_conflicts/indextermprimarypg_stat_database_conflicts/primary/indexterm/entry
!   entryOne row per database, showing database OID, database name and
!   the number of queries that have been cancelled in this database due to
!   dropped tablespaces, lock timeouts, old snapshots, pinned buffers and
!   deadlocks.
   /entry
   /row
  
***
*** 600,605  postgres: replaceableuser/ replaceabledatabase/ replaceablehost/ re
--- 610,655 
   /row
  
   row
+   entryliteralfunctionpg_stat_get_db_conflict_tablespace/function(typeoid/type)/literal/entry
+   entrytypebigint/type/entry
+   entry
+Number of queries cancelled because of recovery conflict with dropped tablespaces in database
+   /entry
+  /row
+ 
+  row
+   entryliteralfunctionpg_stat_get_db_conflict_lock/function(typeoid/type)/literal/entry
+   entrytypebigint/type/entry
+   entry
+Number of queries cancelled because of recovery conflict with locks in database
+   /entry
+  /row
+ 
+  row
+   entryliteralfunctionpg_stat_get_db_conflict_snapshot/function(typeoid/type)/literal/entry
+   entrytypebigint/type/entry
+   entry
+Number of queries cancelled because of recovery conflict with old snapshots in database
+   /entry
+  /row
+ 
+  row
+   entryliteralfunctionpg_stat_get_db_conflict_bufferpin/function(typeoid/type)/literal/entry
+   entrytypebigint/type/entry
+   entry
+Number of queries cancelled because of recovery conflict with pinned buffers in database
+   /entry
+  /row
+ 
+  row
+   entryliteralfunctionpg_stat_get_db_conflict_startup_deadlock/function(typeoid/type)/literal/entry
+   entrytypebigint/type/entry
+   entry
+Number of queries cancelled because of recovery conflict with deadlocks in database
+   /entry
+  /row
+ 
+  row
entryliteralfunctionpg_stat_get_numscans/function(typeoid/type)/literal/entry
entrytypebigint/type/entry
entry
*** a/src/backend/catalog/system_views.sql
--- b/src/backend/catalog/system_views.sql
***
*** 502,508  CREATE VIEW pg_stat_database AS
  pg_stat_get_db_tuples_fetched(D.oid) AS tup_fetched,
  pg_stat_get_db_tuples_inserted(D.oid) AS tup_inserted,
  pg_stat_get_db_tuples_updated(D.oid) AS tup_updated,
! pg_stat_get_db_tuples_deleted(D.oid) AS tup_deleted
  FROM pg_database D;
  
  CREATE VIEW pg_stat_user_functions AS
--- 502,520 
  pg_stat_get_db_tuples_fetched(D.oid) AS tup_fetched,
  pg_stat_get_db_tuples_inserted(D.oid) AS tup_inserted,
  

[HACKERS] Recovery conflict monitoring

2010-12-23 Thread Magnus Hagander
This patch adds counters and views to monitor hot standby generated
recovery conflicts. It extends the pg_stat_database view with one
column with the total number of conflicts, and also creates a new view
pg_stat_database_conflicts that contains a breakdown of exactly what
caused the conflicts.

Documentation still pending, but comments meanwhile is of course appreciated ;)

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/
*** a/src/backend/catalog/system_views.sql
--- b/src/backend/catalog/system_views.sql
***
*** 502,508  CREATE VIEW pg_stat_database AS
  pg_stat_get_db_tuples_fetched(D.oid) AS tup_fetched,
  pg_stat_get_db_tuples_inserted(D.oid) AS tup_inserted,
  pg_stat_get_db_tuples_updated(D.oid) AS tup_updated,
! pg_stat_get_db_tuples_deleted(D.oid) AS tup_deleted
  FROM pg_database D;
  
  CREATE VIEW pg_stat_user_functions AS
--- 502,521 
  pg_stat_get_db_tuples_fetched(D.oid) AS tup_fetched,
  pg_stat_get_db_tuples_inserted(D.oid) AS tup_inserted,
  pg_stat_get_db_tuples_updated(D.oid) AS tup_updated,
! pg_stat_get_db_tuples_deleted(D.oid) AS tup_deleted,
! 	pg_stat_get_db_conflict_all(D.oid) AS conflicts
! FROM pg_database D;
! 
! CREATE VIEW pg_stat_database_conflicts AS
! SELECT
! D.oid AS datid,
! D.datname AS datname,
! pg_stat_get_db_conflict_database(D.oid) AS confl_database,
! pg_stat_get_db_conflict_tablespace(D.oid) AS confl_tablespace,
! pg_stat_get_db_conflict_lock(D.oid) AS confl_lock,
! pg_stat_get_db_conflict_snapshot(D.oid) AS confl_snapshot,
! pg_stat_get_db_conflict_bufferpin(D.oid) AS confl_bufferpin,
! pg_stat_get_db_conflict_startup_deadlock(D.oid) AS confl_deadlock
  FROM pg_database D;
  
  CREATE VIEW pg_stat_user_functions AS
*** a/src/backend/postmaster/pgstat.c
--- b/src/backend/postmaster/pgstat.c
***
*** 57,62 
--- 57,63 
  #include storage/ipc.h
  #include storage/pg_shmem.h
  #include storage/pmsignal.h
+ #include storage/procsignal.h
  #include utils/guc.h
  #include utils/memutils.h
  #include utils/ps_status.h
***
*** 278,283  static void pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len);
--- 279,285 
  static void pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len);
  static void pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len);
  static void pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len);
+ static void pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len);
  
  
  /* 
***
*** 1314,1319  pgstat_report_analyze(Relation rel, bool adopt_counts,
--- 1316,1340 
  	pgstat_send(msg, sizeof(msg));
  }
  
+ /* 
+  * pgstat_report_recovery_conflict() -
+  *
+  *  Tell the collector about a Hot Standby recovery conflict.
+  * 
+  */
+ void
+ pgstat_report_recovery_conflict(int reason)
+ {
+ 	PgStat_MsgRecoveryConflict msg;
+ 
+ 	if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+ 		return;
+ 
+ 	pgstat_setheader(msg.m_hdr, PGSTAT_MTYPE_RECOVERYCONFLICT);
+ 	msg.m_databaseid = MyDatabaseId;
+ 	msg.m_reason = reason;
+ 	pgstat_send(msg, sizeof(msg));
+ }
  
  /* --
   * pgstat_ping() -
***
*** 3053,3058  PgstatCollectorMain(int argc, char *argv[])
--- 3074,3083 
  	pgstat_recv_funcpurge((PgStat_MsgFuncpurge *) msg, len);
  	break;
  
+ case PGSTAT_MTYPE_RECOVERYCONFLICT:
+ 	pgstat_recv_recoveryconflict((PgStat_MsgRecoveryConflict *) msg, len);
+ 	break;
+ 
  default:
  	break;
  			}
***
*** 3129,3134  pgstat_get_db_entry(Oid databaseid, bool create)
--- 3154,3165 
  		result-n_tuples_updated = 0;
  		result-n_tuples_deleted = 0;
  		result-last_autovac_time = 0;
+ 		result-n_conflict_database = 0;
+ 		result-n_conflict_tablespace = 0;
+ 		result-n_conflict_lock = 0;
+ 		result-n_conflict_snapshot = 0;
+ 		result-n_conflict_bufferpin = 0;
+ 		result-n_conflict_startup_deadlock = 0;
  
  		memset(hash_ctl, 0, sizeof(hash_ctl));
  		hash_ctl.keysize = sizeof(Oid);
***
*** 4204,4209  pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len)
--- 4235,4275 
  }
  
  /* --
+  * pgstat_recv_recoveryconflict() -
+  *
+  *  Process as RECOVERYCONFLICT message.
+  * --
+  */
+ static void
+ pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len)
+ {
+ 	PgStat_StatDBEntry *dbentry;
+ 	dbentry = pgstat_get_db_entry(msg-m_databaseid, true);
+ 
+ 	switch (msg-m_reason)
+ 	{
+ 		case PROCSIG_RECOVERY_CONFLICT_DATABASE:
+ 			dbentry-n_conflict_database++;
+ 			break;
+ 		case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
+ 			dbentry-n_conflict_tablespace++;
+ 			break;
+ 		case PROCSIG_RECOVERY_CONFLICT_LOCK:
+