Re: shared-memory based stats collector - v66

2022-04-05 Thread Andres Freund
Hi,

On 2022-04-02 01:16:48 -0700, Andres Freund wrote:
> I just noticed that the code doesn't appear to actually work like that right
> now. Whenever the timeout is reached, pgstat_report_stat() is called with
> force = true.
> 
> And even if the backend is busy running queries, once there's contention, the
> next invocation of pgstat_report_stat() will return the timeout relative to
> pending_since, which then will trigger a force report via a very short timeout
> soon.
> 
> It might actually make sense to only ever return PGSTAT_RETRY_MIN_INTERVAL
> (with a slightly different name) from pgstat_report_stat() when blocked
> (limiting the max reporting delay for an idle connection) and to continue
> calling pgstat_report_stat(force = true).  But to only trigger force
> "internally" in pgstat_report_stat() when PGSTAT_MAX_INTERVAL is reached.
> 
> I think that'd mean we'd report after max PGSTAT_RETRY_MIN_INTERVAL in an idle
> connection, and try reporting every PGSTAT_RETRY_MIN_INTERVAL (increasing up
> to PGSTAT_MAX_INTERVAL when blocked) on busy connections.
> 
> Makes sense?

I tried to come up with a workload producing a *lot* of stats (multiple
function calls within a transaction, multiple transactions pipelined) and ran
it with 1000 clients (on a machine with 2 x (10 cores / 20 threads)). To
reduce overhead I set
  default_transaction_isolation=repeatable read
  track_activities=false
MVCC Snapshot acquisition is the clear bottleneck otherwise, followed by
pgstat_report_activity() (which, as confusing as it may sound, is independent
of this patch).

I do see a *small* amount of contention if I lower PGSTAT_MIN_INTERVAL to
1ms. Too small to ever be captured in pg_stat_activity.wait_event, but just
about visible in a profiler.


Which leads me to conclude we can simplify the logic significantly. Here's my
current comment explaining the logic:

 * Unless called with 'force', pending stats updates are flushed at most once
 * per PGSTAT_MIN_INTERVAL (1000ms). When not forced, stats flushes do not
 * block on lock acquisition, except if stats updates have been pending for
 * longer than PGSTAT_MAX_INTERVAL (60000ms).
 *
 * Whenever pending stats updates remain at the end of pgstat_report_stat() a
 * suggested idle timeout is returned. Currently this is always
 * PGSTAT_IDLE_INTERVAL (10000ms). Callers can use the returned time to set up
 * a timeout after which to call pgstat_report_stat(true), but are not
 * required to do so.
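
To make the contract concrete, a caller could consume the return value
roughly like this (minimal sketch in the style of postgres.c; the
IDLE_STATS_UPDATE_TIMEOUT timeout id is a name I'm inventing here, not
necessarily what we'll end up with):

/* at the end of a command cycle, before waiting for client input */
long		idle_timeout_ms = pgstat_report_stat(false);

if (idle_timeout_ms > 0)
{
	/* stats still pending; arrange a forced report while idle */
	enable_timeout_after(IDLE_STATS_UPDATE_TIMEOUT,
						 (int) idle_timeout_ms);
}

/* ... and once that timeout fires: */
pgstat_report_stat(true);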

Comments?

Greetings,

Andres Freund




Re: shared-memory based stats collector - v66

2022-04-02 Thread Andres Freund
Hi,

On 2022-03-25 17:24:18 +0900, Kyotaro Horiguchi wrote:
> > * AFIXME: Should all the stats drop code be moved into pgstat_drop.c?
> 
> Or pgstat_xact.c?

I wasn't initially happy with that suggestion, but after running with it, it
looks pretty good.

I also moved a fair bit of code into pgstat_shm.c, which to me improved code
navigation a lot. I'm even wondering about splitting it further, into
pgstat_shm.c and pgstat_entry.c.

What do you think?

Greetings,

Andres Freund




Re: shared-memory based stats collector - v66

2022-04-02 Thread Andres Freund
Hi,

On 2022-03-25 17:24:18 +0900, Kyotaro Horiguchi wrote:
> > AFIXME: Isn't PGSTAT_MIN_INTERVAL way too long? What is the justification
> > for increasing it?
> 
> It is 1000ms in the comment just above but actually 10000ms. The
> number came from a discussion that if we have 1000 clients and each
> backend writes stats once per 0.5 seconds, we flush pending data to
> the shared area 2000 times per second in total, which is too frequent.
> I raised it to 5000ms, then 10000ms.  So the expected maximum flush
> frequency is reduced to 100 times per second.  Of course that assumes
> the worst case, and 10000ms is apparently too long for the average
> case.
> 
> The current implementation of pgstat postpones flushing if a lock
> collision happens, by at most 60s.  This is a kind of auto-averaging
> mechanism.  It might be enough, and we could reduce
> PGSTAT_MIN_INTERVAL to 500ms or so.

I just noticed that the code doesn't appear to actually work like that right
now. Whenever the timeout is reached, pgstat_report_stat() is called with
force = true.

And even if the backend is busy running queries, once there's contention, the
next invocation of pgstat_report_stat() will return the timeout relative to
pending_since, which then will trigger a force report via a very short timeout
soon.

It might actually make sense to only ever return PGSTAT_RETRY_MIN_INTERVAL
(with a slightly different name) from pgstat_report_stat() when blocked
(limiting the max reporting delay for an idle connection) and to continue
calling pgstat_report_stat(force = true).  But to only trigger force
"internally" in pgstat_report_stat() when PGSTAT_MAX_INTERVAL is reached.

I think that'd mean we'd report after max PGSTAT_RETRY_MIN_INTERVAL in an idle
connection, and try reporting every PGSTAT_RETRY_MIN_INTERVAL (increasing up
to PGSTAT_MAX_INTERVAL when blocked) on busy connections.

Makes sense?
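
Concretely, something along these lines (hand-wavy sketch only, with
last_flush / pending_since as statics and the constants as discussed
above):

long
pgstat_report_stat(bool force)
{
	static TimestampTz last_flush = 0;
	static TimestampTz pending_since = 0;
	TimestampTz now = GetCurrentTimestamp();
	bool		nowait;

	/* without force, flush at most once per PGSTAT_MIN_INTERVAL */
	if (!force &&
		!TimestampDifferenceExceeds(last_flush, now, PGSTAT_MIN_INTERVAL))
		return PGSTAT_RETRY_MIN_INTERVAL;

	/*
	 * Trigger "force" internally: only block on contended locks once
	 * updates have been pending for at least PGSTAT_MAX_INTERVAL.
	 */
	nowait = !(pending_since != 0 &&
			   TimestampDifferenceExceeds(pending_since, now,
										  PGSTAT_MAX_INTERVAL));

	if (pgstat_pending_flush_stats(nowait))
	{
		/* all pending stats flushed */
		last_flush = now;
		pending_since = 0;
		return 0;
	}

	/* couldn't flush everything due to contention; retry later */
	if (pending_since == 0)
		pending_since = now;
	return PGSTAT_RETRY_MIN_INTERVAL;
}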


I think we need to do something with the pgstat_report_stat() calls outside of
postgres.c. Otherwise there's nothing limiting their reporting delay, because
they don't have the timeout logic postgres.c has.  None of them is ever hot
enough to be problematic, so I think we should just make them pass force=true?

Greetings,

Andres Freund




Re: shared-memory based stats collector - v66

2022-03-29 Thread Andres Freund
Hi,

On 2022-03-25 17:24:18 +0900, Kyotaro Horiguchi wrote:
> I'd like to dump out my humble thoughts about other AFIXMEs..

Thanks!  Please have another look at the code in
https://github.com/anarazel/postgres/tree/shmstat, where I just pushed a
revised version with a lot of [a]fixmes removed.


Most importantly I did move replication slot stats into the hash table, and
just generally revised the replication slot stats code substantially. I
think it does look better now.

But also there's a new commit allowing dsm use in single user mode. To be able
to rely on stats drops we need to perform them even in single user mode. The
only reason this didn't previously fail was that we allocated enough "static"
shared memory for single user mode to never need DSMs.

Thanks to Melanie's tests, and a few more additions by myself, the code is now
reasonably well covered. The big exception to that is recovery conflict stats,
and as Melanie noticed, that was broken (somehow pgstat_database_flush_cb()
didn't sum them up). I think she has some WIP tests...

Re the added tests: I did fix a few timing issues there. There's probably a
few more hiding somewhere.


I also found that unfortunately dshash_seq_next() as-is isn't correct. I
included a workaround commit, but it's not the proper fix. What we need to do
is to always lock partition 0 in the initialization branch: before we call
ensure_valid_bucket_pointers(), status->hash_table->size_log2 isn't valid, and
ensure_valid_bucket_pointers() can only be called while holding a lock...
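
IOW, the initialization branch would need to do roughly this (sketch of
the direction, not the final fix):

/* in dshash_seq_next(), when the scan is just starting */
if (status->curitem == NULL)
{
	Assert(status->curbucket == 0);

	/*
	 * Lock partition 0 before touching the bucket array:
	 * status->hash_table->size_log2 only becomes valid once
	 * ensure_valid_bucket_pointers() has run, and that may only be
	 * called while holding a partition lock.
	 */
	LWLockAcquire(PARTITION_LOCK(status->hash_table, 0),
				  status->exclusive ? LW_EXCLUSIVE : LW_SHARED);
	ensure_valid_bucket_pointers(status->hash_table);

	/* ... continue the scan from bucket 0 ... */
}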



Horiguchi-san, if you have time to look at the "XXX: The following could now be
generalized" in pgstat_read_statsfile(), pgstat_write_statsfile()... I think
that'd be nice to clean up.



> > AFIXME: Isn't PGSTAT_MIN_INTERVAL way too long? What is the justification
> > for increasing it?
> 
> It is 1000ms in the comment just above but actually 10000ms. The
> number came from a discussion that if we have 1000 clients and each
> backend writes stats once per 0.5 seconds, we flush pending data to
> the shared area 2000 times per second in total, which is too frequent.

Have you measured this (recently)? I tried to cause contention with a workload
targeted towards that, but couldn't see a problem with 1000ms. Of course
there's a problem with 1ms...

I think it's confusing to not report stats for 10s without a need.


> The current implementation of pgstat postpones flushing if a lock
> collision happens, by at most 60s.  This is a kind of auto-averaging
> mechanism.  It might be enough, and we could reduce
> PGSTAT_MIN_INTERVAL to 500ms or so.

Yea, I think the 60s part under contention is fine. I'd expect that to be
rarely reached.


> 
> > AFIXME: architecture explanation.
> 
> Mmm. next, please:p

Working on it. There's one more AFIXME that I want to resolve beforehand, so I
don't end up with old type names strewn around (the one in pgstat_internal.h).


> 
> ( [PGSTAT_KIND_REPLSLOT] = {)
> > * AFIXME: With a bit of extra work this could now be a !fixed_amount
> > * stats kind.
> 
> Yeah.  The most bothersome point is that the slot index is not
> persistent at all, and the relationship between the index and name (or
> identity) is not stable even within a process's lifetime.  It can be
> resolved by allocating an object id to every replication slot.  I
> faintly remember a discussion like that, but I don't have a clear
> memory of it.

I think it's resolved now. pgstat_report_replslot* all get the ReplicationSlot
as a parameter. They use the new ReplicationSlotIndex() to get an index from
that. pgstat_report_replslot_(create|acquire) ensure that the relevant index
doesn't somehow contain old stats.

To deal with indexes changing / slots getting removed during restart, there's
now a new callback made during pgstat_read_statsfile() to build the key from
the serialized NameStr. That can return false if a slot of that name is not
known, or use ReplicationSlotIndex() to get the index to store in-memory stats.
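
For illustration, that callback amounts to roughly the following (sketch;
it assumes a helper that maps a slot name back to its current index, and
the exact signature in the tree may differ):

static bool
pgstat_replslot_from_serialized_name_cb(const NameData *name,
										PgStatHashKey *key)
{
	int			idx = get_replslot_index(NameStr(*name));

	/* the slot is gone (or renamed); tell the caller to skip the entry */
	if (idx == -1)
		return false;

	key->kind = PGSTAT_KIND_REPLSLOT;
	key->dboid = InvalidOid;
	key->objoid = idx;
	return true;
}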


> > static Size
> > pgstat_dsa_init_size(void)
> > {
> > /*
> >  * AFIXME: What should we choose as an initial size? Should we make this
> >  * configurable? Maybe tune based on NBuffers?
> 
> > StatsShmemInit(void)
> >  * AFIXME: we need to guarantee this can be allocated in plain 
> > shared
> >  * memory, rather than allocating dsm segments.
> 
> I'm not sure that NBuffers is the ideal base for deciding the required
> size, since it doesn't seem to be generally in proportion to the
> number of database objects.  If we made it manually tunable, we would
> be able to emit a log message when DSM segment allocation happens for
> this use, as a tuning aid:
> 
>WARNING: dsa allocation happened for activity statistics
>HINT: You might want to increase stat_dsa_initial_size if you see slow
>  down blah..

FWIW, I couldn't find any performance impact from using DSM. Because of the
"PgStatSharedRef" layer, there's not actually that much interaction with the
dsm 

Re: shared-memory based stats collector - v66

2022-03-25 Thread Kyotaro Horiguchi
At Fri, 25 Mar 2022 14:22:56 +0900 (JST), Kyotaro Horiguchi wrote in 
> At Thu, 24 Mar 2022 13:21:33 -0400, Melanie Plageman wrote in 
> > On Thu, Mar 17, 2022 at 3:36 AM Andres Freund wrote:
> > >
> > > The biggest todos are:
> > > - Address all the remaining AFIXMEs and XXXs
> > 
> > Attached is a patch that addresses three of the existing AFIXMEs.

I'd like to dump out my humble thoughts about other AFIXMEs..

> AFIXME: Isn't PGSTAT_MIN_INTERVAL way too long? What is the justification
> for increasing it?

It is 1000ms in the comment just above but actually 10000ms. The
number came from a discussion that if we have 1000 clients and each
backend writes stats once per 0.5 seconds, we flush pending data to
the shared area 2000 times per second in total, which is too frequent.
I raised it to 5000ms, then 10000ms.  So the expected maximum flush
frequency is reduced to 100 times per second.  Of course that assumes
the worst case, and 10000ms is apparently too long for the average
case.

The current implementation of pgstat postpones flushing if a lock
collision happens, by at most 60s.  This is a kind of auto-averaging
mechanism.  It might be enough, and we could reduce
PGSTAT_MIN_INTERVAL to 500ms or so.


> AFIXME: architecture explanation.

Mmm. next, please:p


(   [PGSTAT_KIND_REPLSLOT] = {)
> * AFIXME: With a bit of extra work this could now be a !fixed_amount
> * stats kind.

Yeah.  The most bothersome point is that the slot index is not
persistent at all, and the relationship between the index and name (or
identity) is not stable even within a process's lifetime.  It can be
resolved by allocating an object id to every replication slot.  I
faintly remember a discussion like that, but I don't have a clear
memory of it.

> static Size
> pgstat_dsa_init_size(void)
> {
>   /*
>* AFIXME: What should we choose as an initial size? Should we make this
>* configurable? Maybe tune based on NBuffers?

> StatsShmemInit(void)
>* AFIXME: we need to guarantee this can be allocated in plain 
> shared
>* memory, rather than allocating dsm segments.

I'm not sure that NBuffers is the ideal base for deciding the required
size, since it doesn't seem to be generally in proportion to the
number of database objects.  If we made it manually tunable, we would
be able to emit a log message when DSM segment allocation happens for
this use, as a tuning aid:

   WARNING: dsa allocation happened for activity statistics
   HINT: You might want to increase stat_dsa_initial_size if you see slow
 down blah..
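
Such a knob could look about like this (sketch; stat_dsa_initial_size is
the hypothetical GUC from above, in kilobytes):

int			stat_dsa_initial_size = 256;	/* kB, set via GUC */

static Size
pgstat_dsa_init_size(void)
{
	Size		sz = (Size) stat_dsa_initial_size * 1024;

	/*
	 * The initial segment has to fit into the in-place DSA carved out
	 * of plain shared memory, so that single user mode and small
	 * clusters never need an actual DSM segment.
	 */
	return Max(sz, dsa_minimum_size());
}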


> * AFIXME: Should all the stats drop code be moved into pgstat_drop.c?

Or pgstat_xact.c?


>  * AFIXME: comment
>  * AFIXME: see notes about race conditions for functions in
>  * pgstat_drop_function().
>  */
> void
> pgstat_schedule_stat_drop(PgStatKind kind, Oid dboid, Oid objoid)


pgstat_drop_function() doesn't seem to have such a note.

I suppose the "race condition" means the case where a stats entry for
an object is created just after the same object is dropped in another
backend.  It seems to me such a race condition is eliminated by the
transactional drop mechanism.  Are you intending to write an
explanation of that?


>   /*
>* pgStatSharedRefAge increments far more slowly than the time the
>* following loop takes, so this is expected to iterate no more than
>* twice.
>*
>* AFIXME: Why is this a good place to do this?
>*/
>   while (pgstat_shared_refs_need_gc())
>   pgstat_shared_refs_gc();

Is the reason for the AFIXME that you think the GC check happens too
frequently?


> pgstat_shared_ref_release(PgStatHashKey key, PgStatSharedRef *shared_ref)
> {
...
> * AFIXME: this probably is racy. Another backend could look up the
> * stat, bump the refcount, as we free it.
>if (pg_atomic_fetch_sub_u32(&shared_ref->shared_entry->refcount, 1) == 1)
>{
>{
...
>/* only dropped entries can reach a 0 refcount */
>Assert(shared_ref->shared_entry->dropped);

I didn't examine it deeply, but is that race condition avoidable by
preventing pgstat_shared_ref_get() from incrementing the refcount of
dropped entries?
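
For example, with a compare-exchange loop that refuses to resurrect such
entries (a sketch only, not code from the patch):

/* in pgstat_shared_ref_get(), instead of a plain increment */
uint32		refcount = pg_atomic_read_u32(&shared_entry->refcount);

for (;;)
{
	/* never revive a dropped entry whose refcount already reached 0 */
	if (refcount == 0 && shared_entry->dropped)
		return NULL;

	if (pg_atomic_compare_exchange_u32(&shared_entry->refcount,
									   &refcount, refcount + 1))
		break;					/* successfully pinned */
	/* refcount changed concurrently; retry with the updated value */
}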



>  * AFIXME: This needs to be deduplicated with pgstat_shared_ref_release(). But
>  * it's not entirely trivial, because we can't use plain dshash_delete_entry()
>  * (but have to use dshash_delete_current()).
>  */
> static bool
> pgstat_drop_stats_entry(dshash_seq_status *hstat)
...
>* AFIXME: don't do this while holding the dshash lock.

Do the AFIXMEs mean that we should move the call to
pgstat_shared_ref_release() out of the dshash loop (in
pgstat_drop_database_and_contents()) that calls this function?  Would it
be sensible to store the (key, ref) pairs for the to-be-released
shared_refs, then clean them up after exiting the loop?
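
I.e. roughly like this (sketch; match(), lookup_shared_ref(), the entry
type and the hash table name are placeholders):

List	   *to_release = NIL;
ListCell   *lc;
dshash_seq_status hstat;
PgStatHashEntry *ent;			/* placeholder name for the entry type */

dshash_seq_init(&hstat, pgStatSharedHash, true);
while ((ent = dshash_seq_next(&hstat)) != NULL)
{
	/* remember the ref, but don't release it under the dshash lock */
	if (match(ent))
		to_release = lappend(to_release, lookup_shared_ref(ent));
}
dshash_seq_term(&hstat);

foreach(lc, to_release)
{
	PgStatSharedRef *ref = lfirst(lc);

	pgstat_shared_ref_release(ref->shared_entry->key, ref);
}
list_free(to_release);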


>* Database stats contain other stats. Drop those as well when
>* dropping the 

Re: shared-memory based stats collector - v66

2022-03-24 Thread Kyotaro Horiguchi
At Thu, 24 Mar 2022 13:21:33 -0400, Melanie Plageman wrote in 
> On Thu, Mar 17, 2022 at 3:36 AM Andres Freund wrote:
> >
> > The biggest todos are:
> > - Address all the remaining AFIXMEs and XXXs
> 
> Attached is a patch that addresses three of the existing AFIXMEs.

Thanks!

+   .reset_timestamp_cb = pgstat_shared_reset_timestamp_noop,

(I once misunderstood "shared" here to mean the shared memory area..)

The reset function is type-specific and must be set.  So shouldn't we
provide all of the reset functions that will be required?


+   if (pgstat_shared_ref_get(kind, dboid, objoid, false, NULL))
+   {
+   Oid msg_oid = (kind == PGSTAT_KIND_DB) ? dboid : objoid;

Explicitly using PGSTAT_KIND_DB here is a bit of an annoyance.  Since we
always pass InvalidOid correctly as the parameter, and objoid alone is
not specific enough, couldn't we warn using both dboid and objoid,
without any special treatment?

Concretely, I propose to do the following instead.

+   if (pgstat_shared_ref_get(kind, dboid, objoid, false, NULL))
+   {
+   ereport(WARNING,
+   errmsg("resetting existing stats for type %s, db=%d, oid=%d",
+  pgstat_kind_info_for(kind)->name, dboid, objoid);




+pgstat_pending_delete(PgStatSharedRef *shared_ref)
+{
+   void   *pending_data = shared_ref->pending;
+   PgStatKind kind = shared_ref->shared_entry->key.kind;
+
+   Assert(pending_data != NULL);
+   Assert(!pgstat_kind_info_for(kind)->fixed_amount);
+
+   /* PGSTAT_KIND_TABLE has its own callback */
+   Assert(kind != PGSTAT_KIND_TABLE);
+

"kind" is used only in assertion, which requires PG_USED_FOR_ASSERTS_ONLY.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: shared-memory based stats collector - v66

2022-03-24 Thread Melanie Plageman
On Thu, Mar 17, 2022 at 3:36 AM Andres Freund wrote:
>
> The biggest todos are:
> - Address all the remaining AFIXMEs and XXXs

Attached is a patch that addresses three of the existing AFIXMEs.
From 2a975cdb5d10ec365ca2ced39b9f99a9385b6268 Mon Sep 17 00:00:00 2001
From: Melanie Plageman 
Date: Wed, 23 Mar 2022 19:43:12 -0400
Subject: [PATCH] Address 3 AFIXMEs

- reset timestamp callback
- schedule stat internal warn and reset if exists
- pending delete callback
---
 src/backend/utils/activity/pgstat.c   | 117 +-
 src/backend/utils/activity/pgstat_database.c  |   7 ++
 src/backend/utils/activity/pgstat_relation.c  |  16 +++
 .../utils/activity/pgstat_subscription.c  |   7 ++
 src/include/utils/pgstat_internal.h   |  16 +++
 5 files changed, 105 insertions(+), 58 deletions(-)

diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index 7f86bc29ee..9841447a8b 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -154,10 +154,8 @@ static void pgstat_shared_refs_release_all(void);
 static void pgstat_perform_drop(xl_xact_stats_item *drop);
 static bool pgstat_drop_stats_entry(dshash_seq_status *hstat);
 
-static void pgstat_shared_reset_timestamp(PgStatKind kind, PgStatShm_StatEntryHeader *header);
 static void pgstat_reset_all_stats(TimestampTz ts);
 
-static void pgstat_pending_delete(PgStatSharedRef *shared_ref);
 static bool pgstat_pending_flush_stats(bool nowait);
 
 static PgStatKind pgstat_kind_for(char *kind_str);
@@ -281,6 +279,8 @@ static const PgStatKindInfo pgstat_kind_infos[PGSTAT_KIND_LAST + 1] = {
 		.pending_size = sizeof(PgStat_StatDBEntry),
 
 		.flush_pending_cb = pgstat_flush_database,
+		.pending_delete_cb = pgstat_pending_delete,
+		.reset_timestamp_cb = pgstat_database_reset_timestamp,
 	},
 
 	[PGSTAT_KIND_TABLE] = {
@@ -294,6 +294,8 @@ static const PgStatKindInfo pgstat_kind_infos[PGSTAT_KIND_LAST + 1] = {
 		.pending_size = sizeof(PgStat_TableStatus),
 
 		.flush_pending_cb = pgstat_flush_relation,
+		.pending_delete_cb = pgstat_relation_pending_delete,
+		.reset_timestamp_cb = pgstat_shared_reset_timestamp_noop,
 	},
 
 	[PGSTAT_KIND_FUNCTION] = {
@@ -307,6 +309,8 @@ static const PgStatKindInfo pgstat_kind_infos[PGSTAT_KIND_LAST + 1] = {
 		.pending_size = sizeof(PgStat_BackendFunctionEntry),
 
 		.flush_pending_cb = pgstat_flush_function,
+		.pending_delete_cb = pgstat_pending_delete,
+		.reset_timestamp_cb = pgstat_shared_reset_timestamp_noop,
 	},
 
 	[PGSTAT_KIND_SUBSCRIPTION] = {
@@ -322,6 +326,8 @@ static const PgStatKindInfo pgstat_kind_infos[PGSTAT_KIND_LAST + 1] = {
 		.pending_size = sizeof(PgStat_BackendSubEntry),
 
 		.flush_pending_cb = pgstat_flush_subscription,
+		.pending_delete_cb = pgstat_pending_delete,
+		.reset_timestamp_cb = pgstat_subscription_reset_timestamp,
 	},
 
 
@@ -1040,12 +1046,18 @@ pgstat_schedule_stat_internal(PgStatKind kind, Oid dboid, Oid objoid, bool is_cr
 void
 pgstat_schedule_stat_create(PgStatKind kind, Oid dboid, Oid objoid)
 {
-	pgstat_schedule_stat_internal(kind, dboid, objoid, /* create */ true);
+	if (pgstat_shared_ref_get(kind, dboid, objoid, false, NULL))
+	{
+		Oid msg_oid = (kind == PGSTAT_KIND_DB) ? dboid : objoid;
 
-	/*
-	 * AFIXME: It would be a good idea to check if an object with that key
-	 * already exists. WARN if so, and reset the stats to 0.
-	 */
+		ereport(WARNING,
+				errmsg("Resetting existing stats for %s with OID %d.",
+					   (pgstat_kind_info_for(kind))->name, msg_oid));
+
+		pgstat_reset_one(kind, dboid, objoid);
+	}
+
+	pgstat_schedule_stat_internal(kind, dboid, objoid, /* create */ true);
 }
 
 /*
@@ -2178,6 +2190,28 @@ pgstat_shared_refs_release_all(void)
 * ------------------------------------------------------------
  */
 
+/*
+ * Delete pending stats for variable number stats in the general case. Some
+ * types of stats require additional steps and have dedicated callbacks.
+ */
+void
+pgstat_pending_delete(PgStatSharedRef *shared_ref)
+{
+	void	   *pending_data = shared_ref->pending;
+	PgStatKind kind = shared_ref->shared_entry->key.kind;
+
+	Assert(pending_data != NULL);
+	Assert(!pgstat_kind_info_for(kind)->fixed_amount);
+
+	/* PGSTAT_KIND_TABLE has its own callback */
+	Assert(kind != PGSTAT_KIND_TABLE);
+
+	pfree(pending_data);
+	shared_ref->pending = NULL;
+
+	dlist_delete(&shared_ref->pending_node);
+}
+
 /*
  * Returns the appropriate PgStatSharedRef, preparing it to receive pending
  * stats if not already done.
@@ -2222,38 +2256,6 @@ pgstat_pending_fetch(PgStatKind kind, Oid dboid, Oid objoid)
 	return shared_ref;
 }
 
-static void
-pgstat_pending_delete(PgStatSharedRef *shared_ref)
-{
-	void	   *pending_data = shared_ref->pending;
-	PgStatKind	kind;
-
-	Assert(pending_data != NULL);
-
-	/* AFIXME: Move into a PgStatKindInfo callback */
-	kind = shared_ref->shared_entry->key.kind;
-	switch (kind)
-	{
-		case PGSTAT_KIND_TABLE:
-			pgstat_relation_unlink(((PgStat_TableStatus *) 

Re: shared-memory based stats collector - v66

2022-03-22 Thread Melanie Plageman
On Thu, Mar 17, 2022 at 3:36 AM Andres Freund wrote:
> I've attached a substantially improved version of the shared memory stats
> patch.
...
>   - lot of the pg_stat_ views like bgwriter, pg_stat_database have zero 
> coverage today

Attached are some tests including tests that reset of stats works for
all views having a reset timestamp as well as a basic test for at least
one column in all of the following stats views:
pg_stat_archiver, pg_stat_bgwriter, pg_stat_wal, pg_stat_slru,
pg_stat_replication_slots, pg_stat_database

It might be nice to have a test for one of the columns fetched from the
PgStatBgwriter data structure, since those and the Checkpointer stats are
stored separately despite currently being displayed in the same view.
But, alas...

- Melanie
From 37e5ba3b7743309b00c81dbfe65cfd481d4859a6 Mon Sep 17 00:00:00 2001
From: Melanie Plageman 
Date: Tue, 22 Mar 2022 16:53:32 -0400
Subject: [PATCH] test

---
 src/test/isolation/expected/stats.out   |  35 
 src/test/isolation/specs/stats.spec |  31 +++
 src/test/recovery/t/006_logical_decoding.pl |  63 ++
 src/test/regress/expected/stats.out | 208 
 src/test/regress/sql/stats.sql  | 107 ++
 5 files changed, 444 insertions(+)

diff --git a/src/test/isolation/expected/stats.out b/src/test/isolation/expected/stats.out
index ca553be3bf..7a156e1da2 100644
--- a/src/test/isolation/expected/stats.out
+++ b/src/test/isolation/expected/stats.out
@@ -1895,3 +1895,38 @@ name  |pg_stat_get_function_calls|total_above_zero|self_above_zero
 test_stat_func| 5|t   |t  
 (1 row)
 
+
+starting permutation: s1_slru_save_stats s1_listen s1_big_notify s1_ff s1_slru_check_stats
+step s1_slru_save_stats: 
+	INSERT INTO test_slru_stats VALUES('Notify', 'blks_zeroed',
+(SELECT blks_zeroed FROM pg_stat_slru WHERE name = 'Notify'));
+
+step s1_listen: LISTEN big_notify;
+step s1_big_notify: SELECT pg_notify('big_notify',
+repeat('0', current_setting('block_size')::int / 2)) FROM generate_series(1, 2);
+
+pg_notify
+-
+ 
+ 
+(2 rows)
+
+s1: NOTIFY "big_notify" with payload 

Re: shared-memory based stats collector - v66

2022-03-21 Thread Melanie Plageman
On Sun, Mar 20, 2022 at 4:56 PM Melanie Plageman wrote:
>
> Addressed all of these points in
> v2-0001-add-replica-cleanup-tests.patch
>
> also added a new test file in
> v2-0002-Add-TAP-test-for-discarding-stats-after-crash.patch
> testing correct behavior after a crash and when stats file is invalid
>

Attached is the last of the tests confirming clean up for stats in the
shared stats hashtable (these are for the subscription stats).

I thought that maybe these tests could now use
pg_stat_force_next_flush() instead of poll_query_until() but I wasn't
sure how to ensure that the error has happened and the pending entry has
been added before setting force_next_flush.

I also added in tests that resetting subscription stats works as
expected.

- Melanie
From ffb83cc6ad2941f1d01b42b55dd0615a011d59cf Mon Sep 17 00:00:00 2001
From: Melanie Plageman 
Date: Mon, 21 Mar 2022 14:52:55 -0400
Subject: [PATCH] add subscriber stats reset and drop tests

---
 src/test/subscription/t/026_stats.pl | 303 ---
 1 file changed, 230 insertions(+), 73 deletions(-)

diff --git a/src/test/subscription/t/026_stats.pl b/src/test/subscription/t/026_stats.pl
index a42ea3170e..e86bfb4fea 100644
--- a/src/test/subscription/t/026_stats.pl
+++ b/src/test/subscription/t/026_stats.pl
@@ -18,83 +18,240 @@ my $node_subscriber = PostgreSQL::Test::Cluster->new('subscriber');
 $node_subscriber->init(allows_streaming => 'logical');
 $node_subscriber->start;
 
-# Initial table setup on both publisher and subscriber. On subscriber we
-# create the same tables but with primary keys. Also, insert some data that
-# will conflict with the data replicated from publisher later.
-$node_publisher->safe_psql(
-	'postgres',
-	qq[
-BEGIN;
-CREATE TABLE test_tab1 (a int);
-INSERT INTO test_tab1 VALUES (1);
-COMMIT;
-]);
-$node_subscriber->safe_psql(
-	'postgres',
-	qq[
-BEGIN;
-CREATE TABLE test_tab1 (a int primary key);
-INSERT INTO test_tab1 VALUES (1);
-COMMIT;
-]);
-
-# Setup publication.
-my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
-$node_publisher->safe_psql('postgres',
-	"CREATE PUBLICATION tap_pub FOR TABLE test_tab1;");
+
+sub create_sub_pub_w_errors
+{
+	my ($publisher, $subscriber, $db, $table_name) = @_;
+	# Initial table setup on both publisher and subscriber. On subscriber we
+	# create the same tables but with primary keys. Also, insert some data that
+	# will conflict with the data replicated from publisher later.
+	$publisher->safe_psql(
+		$db,
+		qq[
+	BEGIN;
+	CREATE TABLE $table_name(a int);
+	INSERT INTO $table_name VALUES (1);
+	COMMIT;
+	]);
+	$subscriber->safe_psql(
+		$db,
+		qq[
+	BEGIN;
+	CREATE TABLE $table_name(a int primary key);
+	INSERT INTO $table_name VALUES (1);
+	COMMIT;
+	]);
+
+	# Set up publication.
+	my $pub_name = $table_name . '_pub';
+	my $publisher_connstr = $publisher->connstr . qq( dbname=$db);
+
+	$publisher->safe_psql($db,
+		qq(CREATE PUBLICATION $pub_name FOR TABLE $table_name));
+
+	# Create subscription. The tablesync for table on subscription will enter into
+	# infinite error loop due to violating the unique constraint.
+	my $sub_name = $table_name . '_sub';
+	$subscriber->safe_psql($db,
+		qq(CREATE SUBSCRIPTION $sub_name CONNECTION '$publisher_connstr' PUBLICATION $pub_name)
+	);
+
+	$publisher->wait_for_catchup($sub_name);
+
+	# TODO: can this be replaced with pg_stat_force_next_flush() and a test
+	# that sync error > 0?
+	# Wait for the tablesync error to be reported.
+	$subscriber->poll_query_until(
+		$db,
+		qq[
+	SELECT sync_error_count > 0
+	FROM pg_stat_subscription_stats
+	WHERE subname = '$sub_name'
+	]) or die qq(Timed out while waiting for tablesync errors for subscription '$sub_name');
+
+	# Truncate test_tab1 so that tablesync worker can continue.
+	$subscriber->safe_psql($db, qq(TRUNCATE $table_name));
+
+	# Wait for initial tablesync to finish.
+	$subscriber->poll_query_until(
+		$db,
+		qq[
+	SELECT count(1) = 1 FROM pg_subscription_rel
+	WHERE srrelid = '$table_name'::regclass AND srsubstate in ('r', 's')
+	]) or die qq(Timed out while waiting for subscriber to synchronize data for table '$table_name'.);
+
+	# Check test table on the subscriber has one row.
+	my $result = $subscriber->safe_psql($db, qq(SELECT a FROM $table_name));
+	is($result, qq(1), qq(Check that table '$table_name' now has 1 row.));
+
+	# Insert data to test table on the publisher, raising an error on the
+	# subscriber due to violation of the unique constraint on test table.
+	$publisher->safe_psql($db, qq(INSERT INTO $table_name VALUES (1)));
+
+	# TODO: can this be replaced with a pg_stat_force_next_flush() and a test
+	# that apply error > 0?
+	# Wait for the apply error to be reported.
+	$subscriber->poll_query_until(
+		$db,
+		qq[
+	SELECT apply_error_count > 0
+	FROM pg_stat_subscription_stats
+	WHERE subname = '$sub_name'
+	]) or die qq(Timed out while waiting for apply error for subscription '$sub_name');
+
+	# Truncate test table so that apply 

Re: shared-memory based stats collector - v66

2022-03-20 Thread Melanie Plageman
On Sun, Mar 20, 2022 at 12:58 PM Andres Freund wrote:
>
> Hi,
>
> On 2022-03-20 12:32:39 -0400, Melanie Plageman wrote:
> > Attached is a TAP test to check that stats are cleaned up on a physical
> > replica after the objects they concern are dropped on the primary.
>
> Thanks!
>
>
> > I'm not sure that the extra force next flush on standby is needed after
> > drop on primary since drop should report stats and I wait for catchup.
>
> A drop doesn't force stats in other sessions to be flushed immediately, so
> unless I misunderstand, yes, it's needed.
>
>
> > Also, I don't think the tests with DROP SCHEMA actually exercise another
> > code path, so it might be worth cutting those.
>
> > +/*
> > + * Checks for presence of stats for object with provided object oid of kind
> > + * specified in the type string in database of provided database oid.
> > + *
> > + * For subscription stats, only the objoid will be used. For database
> > + * stats, only the dboid will be used. The value passed in for the
> > + * unused parameter is discarded.
> > + * TODO: should it be 'pg_stat_stats_present' instead of 'pg_stat_stats_exist'?
> > + */
> > +Datum
> > +pg_stat_stats_exist(PG_FUNCTION_ARGS)
>
> Should we revoke execute for this one from PUBLIC (similar to the reset
> functions)?
>
>
> > +# Set track_functions to all on standby
> > +$node_standby->append_conf('postgresql.conf', "track_functions = 'all'");
>
> That should already be set, cloning from the primary includes the
> configuration from that point in time.
>
> > +$node_standby->restart;
>
> FWIW, it'd also only require a reload
>

Addressed all of these points in
v2-0001-add-replica-cleanup-tests.patch

also added a new test file in
v2-0002-Add-TAP-test-for-discarding-stats-after-crash.patch
testing correct behavior after a crash and when stats file is invalid

- Melanie
From c521ba1dcb13ba236181adce06893c42ec439877 Mon Sep 17 00:00:00 2001
From: Melanie Plageman 
Date: Thu, 17 Mar 2022 21:54:16 -0400
Subject: [PATCH v2 1/2] add replica cleanup tests

---
 src/backend/catalog/system_functions.sql  |   2 +
 src/backend/utils/activity/pgstat.c   |  32 +++
 src/backend/utils/adt/pgstatfuncs.c   |  22 ++
 src/include/catalog/pg_proc.dat   |   6 +
 src/include/pgstat.h  |   3 +
 .../recovery/t/029_stats_cleanup_replica.pl   | 228 ++
 6 files changed, 293 insertions(+)
 create mode 100644 src/test/recovery/t/029_stats_cleanup_replica.pl

diff --git a/src/backend/catalog/system_functions.sql b/src/backend/catalog/system_functions.sql
index 81bac6f581..4f2c80b72e 100644
--- a/src/backend/catalog/system_functions.sql
+++ b/src/backend/catalog/system_functions.sql
@@ -639,6 +639,8 @@ REVOKE EXECUTE ON FUNCTION pg_stat_reset_single_function_counters(oid) FROM publ
 
 REVOKE EXECUTE ON FUNCTION pg_stat_reset_replication_slot(text) FROM public;
 
+REVOKE EXECUTE ON FUNCTION pg_stat_stats_exist(oid, oid, text) FROM public;
+
 REVOKE EXECUTE ON FUNCTION pg_stat_reset_subscription_stats(oid) FROM public;
 
 REVOKE EXECUTE ON FUNCTION lo_import(text) FROM public;
diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index e9fab35a46..4c0fea62b4 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -159,6 +159,7 @@ static void pgstat_shared_reset_timestamp(PgStatKind kind, PgStatShm_StatEntryHe
 static void pgstat_pending_delete(PgStatSharedRef *shared_ref);
 static bool pgstat_pending_flush_stats(bool nowait);
 
+static PgStatKind pgstat_kind_for(char *kind_str);
 static inline const PgStatKindInfo *pgstat_kind_info_for(PgStatKind kind);
 
 
@@ -1315,6 +1316,23 @@ pgstat_get_stat_snapshot_timestamp(bool *have_snapshot)
 	return 0;
 }
 
+bool
+pgstat_shared_stat_exists(char *kind_str, Oid dboid, Oid objoid)
+{
+	PgStatKind kind = pgstat_kind_for(kind_str);
+
+	/*
+	 * Ignore dboid or objoid for subscription and db stats respectively.
+	 */
+	if (kind == PGSTAT_KIND_SUBSCRIPTION)
+		dboid = InvalidOid;
+
+	if (kind == PGSTAT_KIND_DB)
+		objoid = InvalidOid;
+
+	return (bool) pgstat_fetch_entry(kind, dboid, objoid);
+}
+
 
 /* ------------------------------------------------------------
  * Helper functions
@@ -1343,6 +1361,20 @@ pgstat_setup_memcxt(void)
   ALLOCSET_SMALL_SIZES);
 }
 
+static PgStatKind
+pgstat_kind_for(char *kind_str)
+{
+	for (int kind = 0; kind <= PGSTAT_KIND_LAST; kind++)
+	{
+		if (pg_strcasecmp(kind_str, pgstat_kind_infos[kind].name) == 0)
+			return kind;
+	}
+
+	ereport(ERROR,
+			(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+			 errmsg("invalid statistic kind: \"%s\"", kind_str)));
+}
+
 static inline const PgStatKindInfo *
 pgstat_kind_info_for(PgStatKind kind)
 {
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 5d843cde24..62b2df856f 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -2420,3 +2420,25 

Re: shared-memory based stats collector - v66

2022-03-20 Thread Andres Freund
Hi,

On 2022-03-20 12:32:39 -0400, Melanie Plageman wrote:
> Attached is a TAP test to check that stats are cleaned up on a physical
> replica after the objects they concern are dropped on the primary.

Thanks!


> I'm not sure that the extra force next flush on standby is needed after
> drop on primary since drop should report stats and I wait for catchup.

A drop doesn't force stats in other sessions to be flushed immediately, so
unless I misunderstand, yes, it's needed.


> Also, I don't think the tests with DROP SCHEMA actually exercise another
> code path, so it might be worth cutting those.

> +/*
> + * Checks for presence of stats for object with provided object oid of kind
> + * specified in the type string in database of provided database oid.
> + *
> + * For subscription stats, only the objoid will be used. For database stats,
> + * only the dboid will be used. The value passed in for the unused
> + * parameter is discarded.
> + * TODO: should it be 'pg_stat_stats_present' instead of 'pg_stat_stats_exist'?
> + */
> +Datum
> +pg_stat_stats_exist(PG_FUNCTION_ARGS)

Should we revoke execute for this one from PUBLIC (similar to the reset
functions)?


> +# Set track_functions to all on standby
> +$node_standby->append_conf('postgresql.conf', "track_functions = 'all'");

That should already be set, cloning from the primary includes the
configuration from that point in time.

> +$node_standby->restart;

FWIW, it'd also only require a reload

Greetings,

Andres Freund




Re: shared-memory based stats collector - v66

2022-03-20 Thread Melanie Plageman
On Thu, Mar 17, 2022 at 3:36 AM Andres Freund wrote:
>
> Starting to feel more optimistic about this! There's loads more to do, but now
> the TODOs just seem to require elbow grease, rather than deep thinking.
>
> The biggest todos are:
> - Address all the remaining AFIXMEs and XXXs
>
> - add longer explanation of architecture to pgstat.c (or a README)
>
> - Further improve our stats test coverage - there's a crapton not covered,
>   despite 0017:
>   - test WAL replay with stats (stats for dropped tables are removed etc)

Attached is a TAP test to check that stats are cleaned up on a physical
replica after the objects they concern are dropped on the primary.

I'm not sure that the extra force next flush on standby is needed after
drop on primary since drop should report stats and I wait for catchup.
Also, I don't think the tests with DROP SCHEMA actually exercise another
code path, so it might be worth cutting those.

- Melanie
From 2cb108fadf656de9cc998c0b2123a184881ef4e0 Mon Sep 17 00:00:00 2001
From: Melanie Plageman 
Date: Thu, 17 Mar 2022 21:54:16 -0400
Subject: [PATCH] add replica cleanup tests

---
 src/backend/utils/activity/pgstat.c   |  32 +++
 src/backend/utils/adt/pgstatfuncs.c   |  22 ++
 src/include/catalog/pg_proc.dat   |   6 +
 src/include/pgstat.h  |   3 +
 .../recovery/t/029_stats_cleanup_replica.pl   | 238 ++
 5 files changed, 301 insertions(+)
 create mode 100644 src/test/recovery/t/029_stats_cleanup_replica.pl

diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index e9fab35a46..4c0fea62b4 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -159,6 +159,7 @@ static void pgstat_shared_reset_timestamp(PgStatKind kind, PgStatShm_StatEntryHe
 static void pgstat_pending_delete(PgStatSharedRef *shared_ref);
 static bool pgstat_pending_flush_stats(bool nowait);
 
+static PgStatKind pgstat_kind_for(char *kind_str);
 static inline const PgStatKindInfo *pgstat_kind_info_for(PgStatKind kind);
 
 
@@ -1315,6 +1316,23 @@ pgstat_get_stat_snapshot_timestamp(bool *have_snapshot)
 	return 0;
 }
 
+bool
+pgstat_shared_stat_exists(char *kind_str, Oid dboid, Oid objoid)
+{
+	PgStatKind kind = pgstat_kind_for(kind_str);
+
+	/*
+	 * Ignore dboid or objoid for subscription and db stats respectively.
+	 */
+	if (kind == PGSTAT_KIND_SUBSCRIPTION)
+		dboid = InvalidOid;
+
+	if (kind == PGSTAT_KIND_DB)
+		objoid = InvalidOid;
+
+	return (bool) pgstat_fetch_entry(kind, dboid, objoid);
+}
+
 
 /* ------------------------------------------------------------
  * Helper functions
@@ -1343,6 +1361,20 @@ pgstat_setup_memcxt(void)
   ALLOCSET_SMALL_SIZES);
 }
 
+static PgStatKind
+pgstat_kind_for(char *kind_str)
+{
+	for (int kind = 0; kind <= PGSTAT_KIND_LAST; kind++)
+	{
+		if (pg_strcasecmp(kind_str, pgstat_kind_infos[kind].name) == 0)
+			return kind;
+	}
+
+	ereport(ERROR,
+			(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+			 errmsg("invalid statistic kind: \"%s\"", kind_str)));
+}
+
 static inline const PgStatKindInfo *
 pgstat_kind_info_for(PgStatKind kind)
 {
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 5d843cde24..62b2df856f 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -2420,3 +2420,25 @@ pg_stat_get_subscription_stats(PG_FUNCTION_ARGS)
 	/* Returns the record as Datum */
 	PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
 }
+
+
+/*
+ * Checks for presence of stats for object with provided object oid of kind
+ * specified in the type string in database of provided database oid.
+ *
+ * For subscription stats, only the objoid will be used. For database stats,
+ * only the dboid will be used. The value passed in for the unused parameter is
+ * discarded.
+ * TODO: should it be 'pg_stat_stats_present' instead of 'pg_stat_stats_exist'?
+ */
+Datum
+pg_stat_stats_exist(PG_FUNCTION_ARGS)
+{
+	Oid dboid = PG_GETARG_OID(0);
+	Oid	objoid = PG_GETARG_OID(1);
+	char *stats_type = text_to_cstring(PG_GETARG_TEXT_P(2));
+
+	PG_RETURN_BOOL((bool) pgstat_shared_stat_exists(stats_type, dboid,
+objoid));
+
+}
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 75dae94c49..3f3c4e0427 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5376,6 +5376,12 @@
   proargmodes => '{i,o,o,o,o,o,o,o,o,o,o}',
   proargnames => '{slot_name,slot_name,spill_txns,spill_count,spill_bytes,stream_txns,stream_count,stream_bytes,total_txns,total_bytes,stats_reset}',
   prosrc => 'pg_stat_get_replication_slot' },
+
+{ oid => '8384', descr => 'statistics: checks if stats exist for provided object of provided type in provided database',
+  proname => 'pg_stat_stats_exist', provolatile => 's', proparallel => 'r',
+  prorettype => 'bool', proargtypes => 'oid oid text',
+  prosrc => 'pg_stat_stats_exist' },
+
 {