Re: shared-memory based stats collector - v70
Hi,

On 2022-08-22 11:32:14 +0900, Kyotaro Horiguchi wrote:
> At Wed, 10 Aug 2022 14:02:34 +0200, "Drouvot, Bertrand" wrote in
> > what about?
> >
> > +	/*
> > +	 * Acquire the LWLock directly instead of using
> > +	 * pg_stat_lock_entry_shared() which requires a reference.
> > +	 */
> >
> > I think that's more consistent with other comments mentioning LWLock
> > acquisition.
>
> Sure. Thanks! I did that in the attached.

Pushed, thanks!

Greetings,

Andres Freund
Re: shared-memory based stats collector - v70
At Wed, 10 Aug 2022 14:02:34 +0200, "Drouvot, Bertrand" wrote in
> what about?
>
> +	/*
> +	 * Acquire the LWLock directly instead of using
> +	 * pg_stat_lock_entry_shared() which requires a reference.
> +	 */
>
> I think that's more consistent with other comments mentioning LWLock
> acquisition.

Sure. Thanks! I did that in the attached.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

>From 202ef49a8885f46e339a6d81c723ac3b0a7d55ca Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi
Date: Mon, 22 Aug 2022 11:29:07 +0900
Subject: [PATCH v2] Acquire lock properly when building stats snapshot

It got lost somewhere while developing the shared memory stats collector,
but it is undoubtedly required.
---
 src/backend/utils/activity/pgstat.c       |  8 ++++++++
 src/backend/utils/activity/pgstat_shmem.c | 12 ++++++++++++
 src/include/utils/pgstat_internal.h       |  1 +
 3 files changed, 21 insertions(+)

diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index 88e5dd1b2b..95f09cfcc7 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -844,9 +844,11 @@ pgstat_fetch_entry(PgStat_Kind kind, Oid dboid, Oid objoid)
 	else
 		stats_data = MemoryContextAlloc(pgStatLocal.snapshot.context,
 										kind_info->shared_data_len);
+	pgstat_lock_entry_shared(entry_ref, false);
 	memcpy(stats_data,
 		   pgstat_get_entry_data(kind, entry_ref->shared_stats),
 		   kind_info->shared_data_len);
+	pgstat_unlock_entry(entry_ref);
 
 	if (pgstat_fetch_consistency > PGSTAT_FETCH_CONSISTENCY_NONE)
 	{
@@ -983,9 +985,15 @@ pgstat_build_snapshot(void)
 		entry->data = MemoryContextAlloc(pgStatLocal.snapshot.context,
 										 kind_info->shared_size);
+		/*
+		 * Acquire the LWLock directly instead of using
+		 * pg_stat_lock_entry_shared() which requires a reference.
+		 */
+		LWLockAcquire(&stats_data->lock, LW_SHARED);
 		memcpy(entry->data,
 			   pgstat_get_entry_data(kind, stats_data),
 			   kind_info->shared_size);
+		LWLockRelease(&stats_data->lock);
 	}
 
 	dshash_seq_term(&hstat);
diff --git a/src/backend/utils/activity/pgstat_shmem.c b/src/backend/utils/activity/pgstat_shmem.c
index 89060ef29a..0bfa460af1 100644
--- a/src/backend/utils/activity/pgstat_shmem.c
+++ b/src/backend/utils/activity/pgstat_shmem.c
@@ -579,6 +579,18 @@ pgstat_lock_entry(PgStat_EntryRef *entry_ref, bool nowait)
 	return true;
 }
 
+bool
+pgstat_lock_entry_shared(PgStat_EntryRef *entry_ref, bool nowait)
+{
+	LWLock	   *lock = &entry_ref->shared_stats->lock;
+
+	if (nowait)
+		return LWLockConditionalAcquire(lock, LW_SHARED);
+
+	LWLockAcquire(lock, LW_SHARED);
+	return true;
+}
+
 void
 pgstat_unlock_entry(PgStat_EntryRef *entry_ref)
 {
diff --git a/src/include/utils/pgstat_internal.h b/src/include/utils/pgstat_internal.h
index 9303d05427..901d2041d6 100644
--- a/src/include/utils/pgstat_internal.h
+++ b/src/include/utils/pgstat_internal.h
@@ -581,6 +581,7 @@ extern void pgstat_detach_shmem(void);
 extern PgStat_EntryRef *pgstat_get_entry_ref(PgStat_Kind kind, Oid dboid, Oid objoid,
 											 bool create, bool *found);
 extern bool pgstat_lock_entry(PgStat_EntryRef *entry_ref, bool nowait);
+extern bool pgstat_lock_entry_shared(PgStat_EntryRef *entry_ref, bool nowait);
 extern void pgstat_unlock_entry(PgStat_EntryRef *entry_ref);
 extern bool pgstat_drop_entry(PgStat_Kind kind, Oid dboid, Oid objoid);
 extern void pgstat_drop_all_entries(void);
-- 
2.31.1
Re: shared-memory based stats collector - v70
Hi,

On 8/18/22 9:51 PM, Andres Freund wrote:
> Hi,
>
> On 2022-08-18 15:26:31 -0400, Greg Stark wrote:
> > And indexes of course. It's a bit frustrating since without the
> > catalog you won't know what table the index actually is for... But
> > they're pretty important stats.
>
> FWIW, I think we should split relation stats into table and index stats.
> Historically it'd have added a lot of complexity to separate the two, but
> I don't think that's the case anymore. And we waste space for index stats
> by having lots of table specific fields.

It seems to me that we should work on that first then, what do you think?
(If so I can try to have a look at it.)

And once done, then resume the work to provide the APIs to get all
tables/indexes from all the databases. That way we'll be able to provide
one API for the tables and one for the indexes (instead of one API for
both like my current POC is doing).

> > On that note though... What do you think about having the capability
> > to add other stats kinds to the stats infrastructure?

I think that's a good idea and that would be great to have.

> Getting closer to that was one of my goals working on the shared memory
> stats stuff.
>
> > It would make a lot of sense for pg_stat_statements to add its entries
> > here instead of having to reimplement a lot of the same magic.
>
> Yes, we should move pg_stat_statements over.
>
> It's pretty easy to get massive contention on stats entries with
> pg_stat_statements, because it doesn't have support for "batching"
> updates to shared stats. And reimplementing the same logic in
> pg_stat_statements.c doesn't make sense.
>
> And the set of normalized queries could probably be stored in DSA as
> well - the file based thing we have right now is problematic.
>
> > To do that I guess more of the code needs to be moved to be table
> > driven from the kind structs either with callbacks or with other meta
> > data.
>
> Pretty much all of it already is. The only substantial missing bit is
> reading/writing of stats files, but that should be pretty easy. And of
> course making the callback array extensible.
>
> > So the kind record could contain tupledesc and the code to construct
> > the returned tuple so that these functions could return any custom
> > entry as well as the standard entries.
>
> I don't see how this would work well - we don't have functions returning
> variable kinds of tuples. And what would convert a struct to a tuple?
>
> Nor do I think it's needed - if you have an extension providing a new
> stats kind it can also provide accessors.

I think the same (the extension should be able to do that).

I really like the idea of being able to provide new stats kinds.

Regards,

--
Bertrand Drouvot
Amazon Web Services: https://aws.amazon.com
Re: shared-memory based stats collector - v70
Hi,

On 2022-08-18 15:26:31 -0400, Greg Stark wrote:
> And indexes of course. It's a bit frustrating since without the
> catalog you won't know what table the index actually is for... But
> they're pretty important stats.

FWIW, I think we should split relation stats into table and index stats.
Historically it'd have added a lot of complexity to separate the two, but I
don't think that's the case anymore. And we waste space for index stats by
having lots of table specific fields.

> On that note though... What do you think about having the capability
> to add other stats kinds to the stats infrastructure?

Getting closer to that was one of my goals working on the shared memory
stats stuff.

> It would make a lot of sense for pg_stat_statements to add its entries
> here instead of having to reimplement a lot of the same magic.

Yes, we should move pg_stat_statements over.

It's pretty easy to get massive contention on stats entries with
pg_stat_statements, because it doesn't have support for "batching" updates
to shared stats. And reimplementing the same logic in pg_stat_statements.c
doesn't make sense.

And the set of normalized queries could probably be stored in DSA as well -
the file based thing we have right now is problematic.

> To do that I guess more of the code needs to be moved to be table
> driven from the kind structs either with callbacks or with other meta
> data.

Pretty much all of it already is. The only substantial missing bit is
reading/writing of stats files, but that should be pretty easy. And of
course making the callback array extensible.

> So the kind record could contain tupledesc and the code to construct the
> returned tuple so that these functions could return any custom entry as
> well as the standard entries.

I don't see how this would work well - we don't have functions returning
variable kinds of tuples. And what would convert a struct to a tuple?

Nor do I think it's needed - if you have an extension providing a new stats
kind it can also provide accessors.

Greetings,

Andres Freund
Re: shared-memory based stats collector - v70
On Thu, 18 Aug 2022 at 02:27, Drouvot, Bertrand wrote:
>
> What I had in mind is to provide an API to retrieve stats for those that
> would need to connect to each database individually otherwise.
>
> That's why I focused on PGSTAT_KIND_RELATION that has
> PgStat_KindInfo.accessed_across_databases set to false.
>
> I think that another candidate could also be PGSTAT_KIND_FUNCTION.

And indexes of course. It's a bit frustrating since without the catalog
you won't know what table the index actually is for... But they're pretty
important stats.

On that note though... What do you think about having the capability to
add other stats kinds to the stats infrastructure? It would make a lot of
sense for pg_stat_statements to add its entries here instead of having to
reimplement a lot of the same magic. And I have in mind an extension that
allows adding other stats and it would be nice to avoid having to
reimplement any of this.

To do that I guess more of the code needs to be moved to be table driven
from the kind structs either with callbacks or with other meta data. So
the kind record could contain tupledesc and the code to construct the
returned tuple so that these functions could return any custom entry as
well as the standard entries.

--
greg
Re: shared-memory based stats collector - v70
Hi,

On 2022-08-17 15:46:42 -0400, Greg Stark wrote:
> Isn't there also a local hash table used to find the entries to reduce
> traffic on the shared hash table? Even if you don't take a snapshot
> does it get entered there? There are definitely still parts of this
> I'm working on a pretty vague understanding of :/

Yes, there is. But it's more about code that generates stats, rather than
reporting functions. While there's backend local pending stats we need to
have a refcount on the shared stats item so that the stats item can't be
dropped and then revived when those local stats are flushed.

Relevant comments from pgstat.c:

 * To avoid contention on the shared hashtable, each backend has a
 * backend-local hashtable (pgStatEntryRefHash) in front of the shared
 * hashtable, containing references (PgStat_EntryRef) to shared hashtable
 * entries. The shared hashtable only needs to be accessed when no prior
 * reference is found in the local hashtable. Besides pointing to the
 * shared hashtable entry (PgStatShared_HashEntry) PgStat_EntryRef also
 * contains a pointer to the shared statistics data, as a process-local
 * address, to reduce access costs.
 *
 * The names for structs stored in shared memory are prefixed with
 * PgStatShared instead of PgStat. Each stats entry in shared memory is
 * protected by a dedicated lwlock.
 *
 * Most stats updates are first accumulated locally in each process as
 * pending entries, then later flushed to shared memory (just after commit,
 * or by idle-timeout). This practically eliminates contention on
 * individual stats entries. For most kinds of variable-numbered pending
 * stats data is stored in PgStat_EntryRef->pending. All entries with
 * pending data are in the pgStatPending list. Pending statistics updates
 * are flushed out by pgstat_report_stat().

pgstat_internal.h has more details about the refcount aspect:

 * Per-object statistics are stored in the "shared stats" hashtable. That
 * table's entries (PgStatShared_HashEntry) contain a pointer to the actual
 * stats data for the object (the size of the stats data varies depending
 * on the kind of stats). The table is keyed by PgStat_HashKey.
 *
 * Once a backend has a reference to a shared stats entry, it increments
 * the entry's refcount. Even after stats data is dropped (e.g., due to a
 * DROP TABLE), the entry itself can only be deleted once all references
 * have been released.

Greetings,

Andres Freund
Re: shared-memory based stats collector - v70
On Tue, 16 Aug 2022 at 08:49, Drouvot, Bertrand wrote:
>
> +		if (p->key.kind != PGSTAT_KIND_RELATION)
> +			continue;

Hm. So presumably this needs to be extended. Either to let the caller
decide which types of stats to return or to somehow return all the stats
intermixed. In my monitoring code I did the latter because I didn't think
going through the hash table repeatedly would be very efficient. But it's
definitely a pretty awkward API since I need a switch statement that
explicitly lists each case and casts the result.

> > 2) When I did the function attached above I tried to avoid returning
> > the whole set and make it possible to process them as they arrive.
>
> Is it the way it has been done? (did not look at your function yet)

I did it with callbacks. It was quick and easy and convenient for my use
case. But in general I don't really like callbacks and would think some
kind of iterator style api would be nicer.

I am handling the stats entries as they turn up. I'm constructing the text
output for each in a callback and buffering up the whole http response in
a string buffer. I think that's ok, but if I wanted to avoid buffering it
up and do network i/o then I would think the thing to do would be to build
the list of entry keys and then loop over that list doing a hash lookup
for each one, generating the response for each and writing it to the
network. That way there wouldn't be anything locked, not even the hash
table, while doing network i/o. It would mean a lot of traffic on the hash
table though.

> > -- on that note I wonder if I've done something
> > wrong because I noted a few records with InvalidOid where I didn't
> > expect it.
>
> It looks like that InvalidOid for the dbid means that the entry is for a
> shared relation.

Ah yes. I had actually found that but forgotten it. There's also a
database entry with dboid=InvalidOid which is apparently where background
workers with no database attached report stats.

> I've in mind to add some filtering on the dbid (I think it could be
> useful for monitoring tools with a persistent connection to one database
> but that want to pull the stats database per database).
>
> I don't think a look up through the local cache will work if the
> entry/key is related to another database the API is launched from.

Isn't there also a local hash table used to find the entries to reduce
traffic on the shared hash table? Even if you don't take a snapshot does
it get entered there? There are definitely still parts of this I'm working
on a pretty vague understanding of :/

--
greg
Re: shared-memory based stats collector - v70
On Thu, 11 Aug 2022 at 02:11, Drouvot, Bertrand wrote:
>
> As Andres was not -1 about that idea (as it should be low cost to add
> and maintain) as long as somebody cares enough to write something: then
> I'll give it a try and submit a patch for it.

I agree it would be a useful feature. I think there may be things to talk
about here though.

1) Are you planning to go through the local hash table and LocalSnapshot
and obey the consistency mode? I was thinking a flag passed to
build_snapshot to request global mode might be sufficient instead of a
completely separate function.

2) When I did the function attached above I tried to avoid returning the
whole set and make it possible to process them as they arrive. I actually
was hoping to get to the point where I could start shipping out network
data as they arrive and not even buffer up the response, but I think I
need to be careful about hash table locking then.

3) The key difference here is that we're returning whatever stats are in
the hash table rather than using the catalog to drive a list of id
numbers to look up. I guess the API should make it clear this is what is
being returned -- on that note I wonder if I've done something wrong
because I noted a few records with InvalidOid where I didn't expect it.

4) I'm currently looping over the hash table returning the records all
intermixed. Some users will probably want to do things like "return all
Relation records for all databases" or "return all Index records for
database id xxx". So some form of filtering may be best, or perhaps a way
to retrieve just the keys so they can then be looked up one by one
(through the local cache?).

5) On that note I'm not clear how the local cache will interact with
these cross-database lookups. That should probably be documented...

--
greg
Re: shared-memory based stats collector - v70
Hi,

On 8/10/22 11:25 PM, Greg Stark wrote:
> On Wed, 10 Aug 2022 at 04:05, Drouvot, Bertrand wrote:
> > On 8/9/22 6:00 PM, Greg Stark wrote:
> > > On Tue, 9 Aug 2022 at 06:19, Drouvot, Bertrand wrote:
> > > > What do you think about adding a function in core PG to provide
> > > > such functionality? (means being able to retrieve all the stats (+
> > > > eventually add some filtering) without the need to connect to each
> > > > database).
> > >
> > > I'm working on it myself too. I'll post a patch for discussion in a
> > > bit.
> >
> > Great! Thank you!
>
> So I was adding the code to pgstat.c because I had thought there were
> some data types I needed and/or static functions I needed. However you
> and Andres encouraged me to check again now. And indeed I was able,
> after fixing a couple things, to make the code work entirely externally.

Nice!

Though I still think an SQL API in core could be useful too. As Andres was
not -1 about that idea (as it should be low cost to add and maintain) as
long as somebody cares enough to write something: then I'll give it a try
and submit a patch for it.

> This is definitely not polished and there's a couple obvious things
> missing. But at the risk of embarrassment I've attached my WIP. Please
> be gentle :) I'll post the github link in a bit when I've fixed up some
> meta info.

Thanks! I will have a look at it on github (once you share the link).

Regards,

--
Bertrand Drouvot
Amazon Web Services: https://aws.amazon.com
Re: shared-memory based stats collector - v70
Hi,

On 2022-08-10 15:48:15 -0400, Greg Stark wrote:
> One thing that's bugging me is that the names we use for these stats
> are *all* over the place.

Yes. I had a huge issue with this when polishing the patch. And
Horiguchi-san did as well. I had to limit the amount of cleanup done to
make it feasible to get anything committed. I think it's a bit less bad
than before, but by no means good.

> * The pg_stat_bgwriter view returns data from two different fixed
> entries, the checkpointer and the bgwriter, is there a reason those
> are kept separately but then reported as if they're one thing?

Historical raisins. Checkpointer and bgwriter used to be one thing, but
aren't anymore.

Greetings,

Andres Freund
Re: shared-memory based stats collector - v70
Hi,

On 2022-08-10 14:18:25 -0400, Greg Stark wrote:
> > I don't think that's a large enough issue to worry about unless you're
> > polling at a very high rate, which'd be a bad idea in itself. If a
> > backend can't get the lock for some stats change it'll defer flushing
> > the stats a bit, so it'll not cause a lot of other problems.
>
> Hm. I wonder if we're on the same page about what constitutes a "high
> rate".
>
> I've seen people try to push prometheus or other similar systems to 5s
> poll intervals. That would be challenging for Postgres due to the
> volume of statistics. The default is 30s and people often struggle to
> even have that function for large fleets. But if you had a small
> fleet, perhaps an iot style system with a "one large table" type of
> schema, you might well want stats every 5s or even every 1s.

That's probably fine. Although I think you might run into trouble not from
the stats subsystem side, but from the "amount of data" side. On a system
with a lot of objects that can be a fair amount.

If you really want to do very low latency stats reporting, I suspect you'd
have to build an incremental system.

> > I'm *dead* set against including catalog names in shared memory stats.
> > That'll add a good amount of memory usage and complexity, without any
> > sort of commensurate gain.
>
> Well it's pushing the complexity there from elsewhere. If the labels
> aren't in the stats structures then the exporter needs to connect to
> each database, gather all the names into some local cache and then it
> needs to worry about keeping it up to date. And if there are any
> database problems such as disk errors or catalog objects being locked
> then your monitoring breaks, though perhaps it can be limited to just
> missing some object names or having out of date names.

Shrug. If the stats system state desynchronizes from an alter table rename
you'll also have a problem in monitoring.

And even if you can benefit from having all that information, it'd still
be an overhead borne by everybody for a very small share of users.

> > > I also think it would be nice to have a change counter for every
> > > stat object, or perhaps a change time. Prometheus wouldn't be able
> > > to make use of it but other monitoring software might be able to
> > > receive only metrics that have changed since the last update which
> > > would really help on databases with large numbers of mostly static
> > > objects.
> >
> > I think you're proposing adding overhead that doesn't even have a real
> > user.
>
> I guess I'm just brainstorming here. I don't need it currently, no. It
> doesn't seem like significant overhead though compared to the locking
> and copying?

Yes, timestamps aren't cheap to determine (nor free to store, but that's a
lesser issue).

Greetings,

Andres Freund
Re: shared-memory based stats collector - v70
On Wed, 10 Aug 2022 at 04:05, Drouvot, Bertrand wrote:
>
> Hi,
>
> On 8/9/22 6:00 PM, Greg Stark wrote:
> > On Tue, 9 Aug 2022 at 06:19, Drouvot, Bertrand wrote:
> > >
> > > What do you think about adding a function in core PG to provide such
> > > functionality? (means being able to retrieve all the stats (+
> > > eventually add some filtering) without the need to connect to each
> > > database).
> >
> > I'm working on it myself too. I'll post a patch for discussion in a
> > bit.
>
> Great! Thank you!

So I was adding the code to pgstat.c because I had thought there were some
data types I needed and/or static functions I needed. However you and
Andres encouraged me to check again now. And indeed I was able, after
fixing a couple things, to make the code work entirely externally.

This is definitely not polished and there's a couple obvious things
missing. But at the risk of embarrassment I've attached my WIP. Please be
gentle :) I'll post the github link in a bit when I've fixed up some meta
info.

I'm definitely not wedded to the idea of using callbacks, it was just the
most convenient way to get started, especially when I was putting the main
loop in pgstat.c. Ideally I do want to keep open the possibility of
streaming the results out without buffering the whole set in memory.

> Out of curiosity, would you be also interested by such a feature for
> previous versions (that will not get the patch in)?

I always had trouble understanding the existing stats code so I was hoping
the new code would make it easier. It seems to have worked but it's
possible I'm wrong and it was always possible and the problem was always
just me :)

--
greg

/*-------------------------------------------------------------------------
 *
 * telemetry.c
 *
 * Most of this code was copied from pg_prewarm.c as a template.
 *
 *-------------------------------------------------------------------------
 */

#include "postgres.h"

#include <sys/types.h>
#include <sys/select.h>
#include <sys/socket.h>

#include "access/relation.h"
#include "access/xact.h"
#include "catalog/pg_class.h"
#include "catalog/pg_type.h"
#include "miscadmin.h"
#include "pgstat.h"
#include "postmaster/bgworker.h"
#include "postmaster/interrupt.h"
#include "storage/buf_internals.h"
#include "storage/dsm.h"
#include "storage/ipc.h"
#include "storage/latch.h"
#include "storage/lwlock.h"
#include "storage/proc.h"
#include "storage/procsignal.h"
#include "storage/shmem.h"
#include "storage/smgr.h"
#include "tcop/tcopprot.h"
#include "utils/acl.h"
#include "utils/datetime.h"
#include "utils/guc.h"
#include "utils/memutils.h"
#include "utils/rel.h"
#include "utils/resowner.h"

#include "telemetry.h"
#include "telemetry_pgstat.h"

PG_MODULE_MAGIC;

/*
 * We should already have included what we need to get uint64_t size_t
 * fd_set socklen_t and struct sockaddr so don't let microhttpd include
 * extra stuff for them.
 */
#define MHD_PLATFORM_H
#include <microhttpd.h>

/* MHD_USE_EPOLL */
/* MHD_USE_AUTO */
/* MHD_OPTION_CONNECTION_LIMIT */
/* MHD_OPTION_CONNECTION_TIMEOUT */
/* MHD_OPTION_EXTERNAL_LOGGER */

/* Background worker harness */
void		_PG_init(void);

/* Actual internal background worker main entry point */
static void telemetry_start_worker(unsigned short port);

/* GUC variables. */
static int	telemetry_port = 9187;	/* TCP port to listen on for metrics */
static char *telemetry_listen_addresses;	/* TCP listen addresses */
static bool telemetry = true;	/* start worker automatically on default port */

static enum MHD_Result telemetry_handler(void *cls,
										 struct MHD_Connection *connection,
										 const char *url,
										 const char *method,
										 const char *version,
										 const char *upload_data,
										 size_t *upload_data_size,
										 void **con_cls);

/*
 * Module load callback.
 */
void
_PG_init(void)
{
	DefineCustomIntVariable("telemetry.port",
							"TCP Port to serve metrics on by default",
							NULL,
							&telemetry_port,
							9187, 1, 65535,
							PGC_SIGHUP,
							0,	/* flags */
							NULL, NULL, NULL	/* hooks */
		);
	DefineCustomStringVariable("telemetry.listen_addresses",
							   "TCP Listen Addresses to serve metrics on by default",
							   NULL,
							   &telemetry_listen_addresses,
							   "*",
							   PGC_SIGHUP,
							   GUC_LIST_INPUT,	/* flags */
							   NULL, NULL, NULL /* hooks */
		);

	if (!process_shared_preload_libraries_in_progress)
		return;

	/* can't define PGC_POSTMASTER variable after startup */
	DefineCustomBoolVariable("telemetry.start_server",
							 "Starts the telemetry worker on startup.",
							 NULL,
							 &telemetry,
							 true,
							 PGC_POSTMASTER,
							 0,
							 NULL, NULL, NULL);

	EmitWarningsOnPlaceholders("telemetry");

	/* Register telemetry worker, if enabled. */
	if (telemetry)
		telemetry_start_worker(telemetry_port);
}

static void telemetry_logger(void *arg, const char *fmt, va_list ap);

struct MHD_OptionItem mhd_ops[] = {
	{MHD_OPTION_EXTERNAL_LOGGER, (intptr_t) telemetry_logger, NULL},
	{
Re: shared-memory based stats collector - v70
One thing that's bugging me is that the names we use for these stats are
*all* over the place.

The names go through three different stages:

pgstat structs -> pgstatfunc tupledescs -> pg_stat_* views

(Followed by a fourth stage where pg_exporter or whatever renames them for
the monitoring software.)

And for some reason both transitions (plus the exporter) felt the need to
fiddle with the names or values. And not in any sort of even vaguely
consistent way. So there are three (or four) different sets of names for
the same metrics :(

e.g.

* Some of the struct elements have abbreviated words which are expanded in
the tupledesc names or the view columns -- some have long names which get
abbreviated.

* Some struct members have n_ prefixes (presumably to avoid C keywords or
other namespace issues?) and then lose them at one of the other stages.
But then the relation stats do not have n_ prefixes and then the pg_stat
view *adds* n_ prefixes in the SQL view!

* Some columns are added together in the SQL view which seems like
gratuitously hiding information from the user. The pg_stat_*_tables view
actually looks up information from the indexes stats and combines them to
get idx_scan and idx_tup_fetch.

* The pg_stat_bgwriter view returns data from two different fixed entries,
the checkpointer and the bgwriter, is there a reason those are kept
separately but then reported as if they're one thing?

Some of the simpler renaming could be transparently fixed by making the
internal stats match the public facing names. But for many of them I think
the internal names are better. And the cases where the views aggregate
data in a way that loses information are not something I want to
reproduce.

I had intended to use the internal names directly, reasoning that
transparency and consistency are the direction to be headed. But in some
cases I think the current public names are the better choice -- I
certainly don't want to remove n_* prefixes from some names but then add
them to different names! And some of the cases where the data is combined
or modified do seem like they would be missed.

--
greg
Re: shared-memory based stats collector - v70
On Tue, 9 Aug 2022 at 12:48, Andres Freund wrote:
> > The reason I want a C function is I'm trying to get as far as I can
> > without a connection to a database, without a transaction, without
> > accessing the catalog, and as much as possible without taking locks.
>
> I assume you don't include lwlocks under locks?

I guess it depends on which lwlock :) I would be leery of a monitoring
system taking an lwlock that could interfere with regular transactions
doing work. And taking a lock that is itself the cause of the problem
elsewhere that you really need stats to debug would be a deal breaker.

> I don't think that's a large enough issue to worry about unless you're
> polling at a very high rate, which'd be a bad idea in itself. If a
> backend can't get the lock for some stats change it'll defer flushing
> the stats a bit, so it'll not cause a lot of other problems.

Hm. I wonder if we're on the same page about what constitutes a "high
rate".

I've seen people try to push prometheus or other similar systems to 5s
poll intervals. That would be challenging for Postgres due to the volume
of statistics. The default is 30s and people often struggle to even have
that function for large fleets. But if you had a small fleet, perhaps an
iot style system with a "one large table" type of schema, you might well
want stats every 5s or even every 1s.

> I'm *dead* set against including catalog names in shared memory stats.
> That'll add a good amount of memory usage and complexity, without any
> sort of commensurate gain.

Well it's pushing the complexity there from elsewhere. If the labels
aren't in the stats structures then the exporter needs to connect to each
database, gather all the names into some local cache and then it needs to
worry about keeping it up to date. And if there are any database problems
such as disk errors or catalog objects being locked then your monitoring
breaks, though perhaps it can be limited to just missing some object names
or having out of date names.

> > I also think it would be nice to have a change counter for every stat
> > object, or perhaps a change time. Prometheus wouldn't be able to make
> > use of it but other monitoring software might be able to receive only
> > metrics that have changed since the last update which would really
> > help on databases with large numbers of mostly static objects.
>
> I think you're proposing adding overhead that doesn't even have a real
> user.

I guess I'm just brainstorming here. I don't need it currently, no. It
doesn't seem like significant overhead though compared to the locking and
copying?

--
greg
Re: shared-memory based stats collector - v70
At Tue, 9 Aug 2022 09:53:19 -0700, Andres Freund wrote in > Hi, > > On 2022-08-09 17:24:35 +0900, Kyotaro Horiguchi wrote: > > If I'm not missing something, it's strange that pgstat_lock_entry() > > only takes LW_EXCLUSIVE. > > I think it makes some sense, given that there's a larger number of callers for > that in various stats-emitting code. Perhaps we could just add a separate > function with a _shared() suffix? Sure. That was an alternative I had in my mind. > > The atached changes the interface of > > pgstat_lock_entry() but there's only one user since another read of > > shared stats entry is not using reference. Thus the interface change > > might be too much. If I just add bare LWLockAcquire/Release() to > > pgstat_fetch_entry,the amount of the patch will be reduced. > > Could you try the pgstat_lock_entry_shared() approach? Of course. Please find the attached. regards. -- Kyotaro Horiguchi NTT Open Source Software Center >From f406fd384d7ca145e37810a2f6a7a410f74d5795 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi Date: Wed, 10 Aug 2022 10:53:19 +0900 Subject: [PATCH] Take lock while building stats snapshot It got lost somewhere while developing shared memory stats collector but it is undoubtedly required. 
---
 src/backend/utils/activity/pgstat.c       |  8 ++++++++
 src/backend/utils/activity/pgstat_shmem.c | 12 ++++++++++++
 src/include/utils/pgstat_internal.h       |  1 +
 3 files changed, 21 insertions(+)

diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index 88e5dd1b2b..c23142a670 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -844,9 +844,11 @@ pgstat_fetch_entry(PgStat_Kind kind, Oid dboid, Oid objoid)
 	else
 		stats_data = MemoryContextAlloc(pgStatLocal.snapshot.context,
 										kind_info->shared_data_len);
+	pgstat_lock_entry_shared(entry_ref, false);
 	memcpy(stats_data,
 		   pgstat_get_entry_data(kind, entry_ref->shared_stats),
 		   kind_info->shared_data_len);
+	pgstat_unlock_entry(entry_ref);
 
 	if (pgstat_fetch_consistency > PGSTAT_FETCH_CONSISTENCY_NONE)
 	{
@@ -983,9 +985,15 @@ pgstat_build_snapshot(void)
 			entry->data = MemoryContextAlloc(pgStatLocal.snapshot.context,
 											 kind_info->shared_size);
+			/*
+			 * Take lwlock directly instead of using pg_stat_lock_entry_shared()
+			 * which requires a reference.
+			 */
+			LWLockAcquire(&stats_data->lock, LW_SHARED);
 			memcpy(entry->data,
 				   pgstat_get_entry_data(kind, stats_data),
 				   kind_info->shared_size);
+			LWLockRelease(&stats_data->lock);
 		}
 
 		dshash_seq_term(&hstat);
diff --git a/src/backend/utils/activity/pgstat_shmem.c b/src/backend/utils/activity/pgstat_shmem.c
index 89060ef29a..0bfa460af1 100644
--- a/src/backend/utils/activity/pgstat_shmem.c
+++ b/src/backend/utils/activity/pgstat_shmem.c
@@ -579,6 +579,18 @@ pgstat_lock_entry(PgStat_EntryRef *entry_ref, bool nowait)
 	return true;
 }
 
+bool
+pgstat_lock_entry_shared(PgStat_EntryRef *entry_ref, bool nowait)
+{
+	LWLock	   *lock = &entry_ref->shared_stats->lock;
+
+	if (nowait)
+		return LWLockConditionalAcquire(lock, LW_SHARED);
+
+	LWLockAcquire(lock, LW_SHARED);
+	return true;
+}
+
 void
 pgstat_unlock_entry(PgStat_EntryRef *entry_ref)
 {
diff --git a/src/include/utils/pgstat_internal.h b/src/include/utils/pgstat_internal.h
index 9303d05427..901d2041d6 100644
--- a/src/include/utils/pgstat_internal.h
+++ b/src/include/utils/pgstat_internal.h
@@ -581,6 +581,7 @@ extern void pgstat_detach_shmem(void);
 extern PgStat_EntryRef *pgstat_get_entry_ref(PgStat_Kind kind, Oid dboid, Oid objoid,
 											 bool create, bool *found);
 extern bool pgstat_lock_entry(PgStat_EntryRef *entry_ref, bool nowait);
+extern bool pgstat_lock_entry_shared(PgStat_EntryRef *entry_ref, bool nowait);
 extern void pgstat_unlock_entry(PgStat_EntryRef *entry_ref);
 extern bool pgstat_drop_entry(PgStat_Kind kind, Oid dboid, Oid objoid);
 extern void pgstat_drop_all_entries(void);
-- 
2.31.1
Re: shared-memory based stats collector - v70
Hi,

On 2022-08-09 17:24:35 +0900, Kyotaro Horiguchi wrote:
> At Mon, 8 Aug 2022 11:53:19 -0400, Greg Stark wrote in
> > I'm trying to wrap my head around the shared memory stats collector
> > infrastructure from
> > <20220406030008.2qxipjxo776dw...@alap3.anarazel.de> committed in
> > 5891c7a8ed8f2d3d577e7eea34dacff12d7b6bbd.
> >
> > I have one question about locking -- afaics there's nothing protecting
> > reading the shared memory stats. There is an lwlock protecting
> > concurrent updates of the shared memory stats, but that lock isn't
> > taken when we read the stats. Are we ok relying on atomic 64-bit reads
> > or is there something else going on that I'm missing?

Yes, that's not right. Not sure how it ended up that way. There was a lot of refactoring and pushing down the locking into different places, I guess it got lost somewhere on the way :(. It's unlikely to be a large problem, but we should fix it.

> If I'm not missing something, it's strange that pgstat_lock_entry()
> only takes LW_EXCLUSIVE.

I think it makes some sense, given that there's a larger number of callers for that in various stats-emitting code. Perhaps we could just add a separate function with a _shared() suffix?

> The attached changes the interface of
> pgstat_lock_entry() but there's only one user since another read of
> shared stats entry is not using a reference. Thus the interface change
> might be too much. If I just add bare LWLockAcquire/Release() to
> pgstat_fetch_entry, the amount of the patch will be reduced.

Could you try the pgstat_lock_entry_shared() approach?

Greetings,

Andres Freund
Re: shared-memory based stats collector - v70
Hi, On 2022-08-09 12:00:46 -0400, Greg Stark wrote: > I was more aiming at a C function that extensions could use directly > rather than an SQL function -- though I suppose having the former it > would be simple enough to implement the latter using it. (though it > would have to be one for each stat type I guess) I think such a C extension could exist today, without patching core code? It'd be a bit ugly to include pgstat_internal.h, I guess, but other than that... > The reason I want a C function is I'm trying to get as far as I can > without a connection to a database, without a transaction, without > accessing the catalog, and as much as possible without taking locks. I assume you don't include lwlocks under locks? > I think this is important for making monitoring highly reliable and low > impact on production. I'm doubtful about that, but whatever. > The main problem with my current code is that I'm accessing the shared > memory hash table directly. This means that I'm possibly introducing > locking contention on the shared memory hash table. I don't think that's a large enough issue to worry about unless you're polling at a very high rate, which'd be a bad idea in itself. If a backend can't get the lock for some stats change it'll defer flushing the stats a bit, so it'll not cause a lot of other problems. > I'm thinking of separating the shared memory hash scan from the metric scan > so the list can be quickly built minimizing the time the lock is held. I'd really really want to see some evidence that any sort of complexity here is worth it. > I have a few things I would like to suggest for future improvements to > this infrastructure. I haven't polished the details of it yet but the > main thing I think I'm missing is the catalog name for the object. I > don't want to have to fetch it from the catalog and in any case I > think it would generally be useful and might regularize the > replication slot handling too. 
I'm *dead* set against including catalog names in shared memory stats. That'll add a good amount of memory usage and complexity, without any sort of commensurate gain. > I also think it would be nice to have a change counter for every stat > object, or perhaps a change time. Prometheus wouldn't be able to make > use of it but other monitoring software might be able to receive only > metrics that have changed since the last update which would really > help on databases with large numbers of mostly static objects. I think you're proposing adding overhead that doesn't even have a real user. Greetings, Andres Freund
Re: shared-memory based stats collector - v70
Hi, On 2022-08-09 12:18:47 +0200, Drouvot, Bertrand wrote: > What do you think about adding a function in core PG to provide such > functionality? (means being able to retrieve all the stats (+ eventually add > some filtering) without the need to connect to each database). I'm not that convinced by the use case, but I think it's also low cost to add and maintain, so if somebody cares enough to write something... The only thing I would "request" is that such a function requires more permissions than the default accessors do. I think it's a minor problem that we allow so much access within a database right now, regardless of object permissions, but it'd not be a great idea to expand that to other databases, in bulk? Greetings, Andres Freund
Re: shared-memory based stats collector - v70
On Tue, 9 Aug 2022 at 06:19, Drouvot, Bertrand wrote: > > > What do you think about adding a function in core PG to provide such > functionality? (means being able to retrieve all the stats (+ eventually > add some filtering) without the need to connect to each database). I'm working on it myself too. I'll post a patch for discussion in a bit. I was more aiming at a C function that extensions could use directly rather than an SQL function -- though I suppose having the former it would be simple enough to implement the latter using it. (though it would have to be one for each stat type I guess) The reason I want a C function is I'm trying to get as far as I can without a connection to a database, without a transaction, without accessing the catalog, and as much as possible without taking locks. I think this is important for making monitoring highly reliable and low impact on production. It's also kind of fundamental to accessing stats for objects from other databases since we won't have easy access to the catalogs for the other databases. The main problem with my current code is that I'm accessing the shared memory hash table directly. This means that I'm possibly introducing locking contention on the shared memory hash table. I'm thinking of separating the shared memory hash scan from the metric scan so the list can be quickly built minimizing the time the lock is held. We could possibly also only rebuild that list at a lower frequency than the metrics gathering so new objects might not show up instantly. I have a few things I would like to suggest for future improvements to this infrastructure. I haven't polished the details of it yet but the main thing I think I'm missing is the catalog name for the object. I don't want to have to fetch it from the catalog and in any case I think it would generally be useful and might regularize the replication slot handling too. 
I also think it would be nice to have a change counter for every stat object, or perhaps a change time. Prometheus wouldn't be able to make use of it but other monitoring software might be able to receive only metrics that have changed since the last update which would really help on databases with large numbers of mostly static objects. Even on typical databases there are tons of builtin objects (especially functions) that are probably never getting updates. -- greg
Re: shared-memory based stats collector - v70
At Mon, 8 Aug 2022 11:53:19 -0400, Greg Stark wrote in
> I'm trying to wrap my head around the shared memory stats collector
> infrastructure from
> <20220406030008.2qxipjxo776dw...@alap3.anarazel.de> committed in
> 5891c7a8ed8f2d3d577e7eea34dacff12d7b6bbd.
>
> I have one question about locking -- afaics there's nothing protecting
> reading the shared memory stats. There is an lwlock protecting
> concurrent updates of the shared memory stats, but that lock isn't
> taken when we read the stats. Are we ok relying on atomic 64-bit reads
> or is there something else going on that I'm missing?
>
> In particular I'm looking at pgstat.c:847 in pgstat_fetch_entry()
> which does this:
>
> memcpy(stats_data,
>        pgstat_get_entry_data(kind, entry_ref->shared_stats),
>        kind_info->shared_data_len);
>
> stats_data is the returned copy of the stats entry with all the
> statistics in it. But it's copied from the shared memory location
> directly using memcpy and there's no locking or change counter or
> anything protecting this memcpy that I can see.

We take LW_SHARED while creating a snapshot of fixed-numbered stats. On the other hand we don't for variable-numbered stats. I agree with you that we need that also for variable-numbered stats.

If I'm not missing something, it's strange that pgstat_lock_entry() only takes LW_EXCLUSIVE. The attached changes the interface of pgstat_lock_entry() but there's only one user since another read of a shared stats entry is not using a reference. Thus the interface change might be too much. If I just add bare LWLockAcquire/Release() to pgstat_fetch_entry, the amount of the patch will be reduced.

regards. 
--
Kyotaro Horiguchi
NTT Open Source Software Center

diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index 88e5dd1b2b..7c4e5f0238 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -844,9 +844,11 @@ pgstat_fetch_entry(PgStat_Kind kind, Oid dboid, Oid objoid)
 	else
 		stats_data = MemoryContextAlloc(pgStatLocal.snapshot.context,
 										kind_info->shared_data_len);
+	pgstat_lock_entry(entry_ref, LW_SHARED, false);
 	memcpy(stats_data,
 		   pgstat_get_entry_data(kind, entry_ref->shared_stats),
 		   kind_info->shared_data_len);
+	pgstat_unlock_entry(entry_ref);
 
 	if (pgstat_fetch_consistency > PGSTAT_FETCH_CONSISTENCY_NONE)
 	{
@@ -983,9 +985,15 @@ pgstat_build_snapshot(void)
 			entry->data = MemoryContextAlloc(pgStatLocal.snapshot.context,
 											 kind_info->shared_size);
+			/*
+			 * We're directly accessing the shared stats entry not using a
+			 * reference. pg_stat_lock_entry() cannot be used here.
+			 */
+			LWLockAcquire(&stats_data->lock, LW_SHARED);
 			memcpy(entry->data,
 				   pgstat_get_entry_data(kind, stats_data),
 				   kind_info->shared_size);
+			LWLockRelease(&stats_data->lock);
 		}
 
 		dshash_seq_term(&hstat);
diff --git a/src/backend/utils/activity/pgstat_database.c b/src/backend/utils/activity/pgstat_database.c
index d9275611f0..fdf4d022c1 100644
--- a/src/backend/utils/activity/pgstat_database.c
+++ b/src/backend/utils/activity/pgstat_database.c
@@ -364,7 +364,7 @@ pgstat_database_flush_cb(PgStat_EntryRef *entry_ref, bool nowait)
 	pendingent = (PgStat_StatDBEntry *) entry_ref->pending;
 	sharedent = (PgStatShared_Database *) entry_ref->shared_stats;
 
-	if (!pgstat_lock_entry(entry_ref, nowait))
+	if (!pgstat_lock_entry(entry_ref, LW_EXCLUSIVE, nowait))
 		return false;
 
 #define PGSTAT_ACCUM_DBCOUNT(item) \
diff --git a/src/backend/utils/activity/pgstat_function.c b/src/backend/utils/activity/pgstat_function.c
index 427d8c47fc..318db0b3c8 100644
--- a/src/backend/utils/activity/pgstat_function.c
+++ b/src/backend/utils/activity/pgstat_function.c
@@ -200,7 +200,7 @@
pgstat_function_flush_cb(PgStat_EntryRef *entry_ref, bool nowait)
 
 	/* localent always has non-zero content */
 
-	if (!pgstat_lock_entry(entry_ref, nowait))
+	if (!pgstat_lock_entry(entry_ref, LW_EXCLUSIVE, nowait))
 		return false;
 
 	shfuncent->stats.f_numcalls += localent->f_counts.f_numcalls;
diff --git a/src/backend/utils/activity/pgstat_relation.c b/src/backend/utils/activity/pgstat_relation.c
index a846d9ffb6..98dda726db 100644
--- a/src/backend/utils/activity/pgstat_relation.c
+++ b/src/backend/utils/activity/pgstat_relation.c
@@ -782,7 +782,7 @@ pgstat_relation_flush_cb(PgStat_EntryRef *entry_ref, bool nowait)
 		return true;
 	}
 
-	if (!pgstat_lock_entry(entry_ref, nowait))
+	if (!pgstat_lock_entry(entry_ref, LW_EXCLUSIVE, nowait))
 		return false;
 
 	/* add the values to the shared entry. */
diff --git a/src/backend/utils/activity/pgstat_shmem.c b/src/backend/utils/activity/pgstat_shmem.c
index 89060ef29a..fdd20d80a1 100644
--- a/src/backend/utils/activity/pgstat_shmem.c
+++ b/src/backend/utils/activity/pgstat_shmem.c
@@ -568,14 +568,14 @@ pgstat_release_entry_ref(PgStat_HashKey key, PgStat_EntryRef *entry_ref,
 }
 
 bool
-pgstat_lock_entry(PgStat_EntryRef *entry_ref, bool nowait)
Re: shared-memory based stats collector - v70
I'm trying to wrap my head around the shared memory stats collector infrastructure from <20220406030008.2qxipjxo776dw...@alap3.anarazel.de> committed in 5891c7a8ed8f2d3d577e7eea34dacff12d7b6bbd. I have one question about locking -- afaics there's nothing protecting reading the shared memory stats. There is an lwlock protecting concurrent updates of the shared memory stats, but that lock isn't taken when we read the stats. Are we ok relying on atomic 64-bit reads or is there something else going on that I'm missing? In particular I'm looking at pgstat.c:847 in pgstat_fetch_entry() which does this: memcpy(stats_data, pgstat_get_entry_data(kind, entry_ref->shared_stats), kind_info->shared_data_len); stats_data is the returned copy of the stats entry with all the statistics in it. But it's copied from the shared memory location directly using memcpy and there's no locking or change counter or anything protecting this memcpy that I can see. -- greg
Re: shared-memory based stats collector - v70
On Wed, 20 Jul 2022 at 15:09, Andres Freund wrote: > > Each backend only had stats for things it touched. But the stats collector > read all files at startup into hash tables and absorbed all generated stats > into those as well. Fascinating. I'm surprised this didn't raise issues previously for people with millions of tables. I wonder if it wasn't causing issues and we just didn't hear about them because there were other bigger issues :) > >All that said -- having all objects loaded in shared memory makes my > >work way easier. > > What are you trying to do? I'm trying to implement an exporter for prometheus/openmetrics/etc that dumps directly from shared memory without going through the SQL backend layer. I believe this will be much more reliable, lower overhead, safer, and consistent than writing SQL queries. Ideally I would want to dump out the stats without connecting to each database. I suspect that would run into problems where the schema really adds a lot of information (such as which table each index is on or which table a toast relation is for). There are also some things people think of as stats that are maintained in the catalog such as reltuples and relpages. So I'm imagining this won't strictly stay true in the end. It seems like just having an interface to iterate over the shared hash table and return entries one by one without filtering by database would be fairly straightforward and I would be able to do most of what I want just with that. There's actually enough meta information in the stats entries to be able to handle them as they come instead of trying to look up specific stats one by one. -- greg
Re: shared-memory based stats collector - v70
Hi, On July 20, 2022 8:41:53 PM GMT+02:00, Greg Stark wrote: >On Wed, 20 Jul 2022 at 12:08, Tom Lane wrote: >> >> AFAIR, the previous stats collector implementation had no such provision >> either: it'd just keep adding hashtable entries as it received info about >> new objects. The only thing that's changed is that now those entries are >> in shared memory instead of process-local memory. We'd be well advised to >> be sure that memory can be swapped out under pressure, but otherwise I'm >> not seeing that things have gotten worse. > >Just to be clear I'm not looking for ways things have gotten worse. >Just trying to understand what I'm reading and I guess I came in with >assumptions that led me astray. > >But... adding entries as it received info about new objects isn't the >same as having info on everything. I didn't really understand how the >old system worked but if you had a very large schema but each session >only worked with a small subset did the local stats data ever absorb >info on the objects it never touched? Each backend only had stats for things it touched. But the stats collector read all files at startup into hash tables and absorbed all generated stats into those as well. >All that said -- having all objects loaded in shared memory makes my >work way easier. What are your trying to do? >It actually seems feasible to dump out all the >objects from shared memory and including objects from other databases >and if I don't need a consistent snapshot it even seems like it would >be possible to do that without having a copy of more than one stats >entry at a time in local memory. I hope that doesn't cause huge >contention on the shared hash table to be doing that regularly. The stats accessors now default to not creating a full snapshot of stats data at first access (but that's configurable). So yes, that behavior is possible. E.g. autovac now uses a single object access like you describe. Andres -- Sent from my Android device with K-9 Mail. 
Please excuse my brevity.
Re: shared-memory based stats collector - v70
On Wed, 20 Jul 2022 at 12:08, Tom Lane wrote: > > AFAIR, the previous stats collector implementation had no such provision > either: it'd just keep adding hashtable entries as it received info about > new objects. The only thing that's changed is that now those entries are > in shared memory instead of process-local memory. We'd be well advised to > be sure that memory can be swapped out under pressure, but otherwise I'm > not seeing that things have gotten worse. Just to be clear I'm not looking for ways things have gotten worse. Just trying to understand what I'm reading and I guess I came in with assumptions that led me astray. But... adding entries as it received info about new objects isn't the same as having info on everything. I didn't really understand how the old system worked but if you had a very large schema but each session only worked with a small subset did the local stats data ever absorb info on the objects it never touched? All that said -- having all objects loaded in shared memory makes my work way easier. It actually seems feasible to dump out all the objects from shared memory and including objects from other databases and if I don't need a consistent snapshot it even seems like it would be possible to do that without having a copy of more than one stats entry at a time in local memory. I hope that doesn't cause huge contention on the shared hash table to be doing that regularly. -- greg
Re: shared-memory based stats collector - v70
Hi, On 2022-07-20 12:08:35 -0400, Tom Lane wrote: > AFAIR, the previous stats collector implementation had no such provision > either: it'd just keep adding hashtable entries as it received info about > new objects. Yep. > The only thing that's changed is that now those entries are in shared > memory instead of process-local memory. We'd be well advised to be > sure that memory can be swapped out under pressure, but otherwise I'm > not seeing that things have gotten worse. FWIW, I ran a few memory usage benchmarks. Without stats accesses the memory usage with shared memory stats was sometimes below, sometimes above the "old" memory usage, depending on the number of objects. As soon as there's stats access, it's well below (that includes things like autovac workers). I think there's quite a bit of memory usage reduction potential around dsa.c - we occasionally end up with [nearly] unused dsm segments. Greetings, Andres Freund
Re: shared-memory based stats collector - v70
Melanie Plageman writes: > On Wed, Jul 20, 2022 at 11:35 AM Greg Stark wrote: >> It seems like if we really think the total number of database objects >> is reasonably limited to scales that fit in RAM there would be a much >> simpler database design that would just store the catalog tables in >> simple in-memory data structures and map them all on startup without >> doing all the work Postgres does to make relational storage scale. > I think efforts to do such a thing have gotten caught up in solving > issues around visibility and managing the relationship between local and > global caches [1]. It doesn't seem like the primary technical concern > was memory usage. AFAIR, the previous stats collector implementation had no such provision either: it'd just keep adding hashtable entries as it received info about new objects. The only thing that's changed is that now those entries are in shared memory instead of process-local memory. We'd be well advised to be sure that memory can be swapped out under pressure, but otherwise I'm not seeing that things have gotten worse. regards, tom lane
Re: shared-memory based stats collector - v70
Hi, On 2022-07-20 11:35:13 -0400, Greg Stark wrote: > Is it true that the shared memory allocation contains the hash table > entry and body of every object in every database? Yes. However, note that that was already the case with the old stats collector - it also kept everything in memory. In addition every read access to stats loaded a copy of the stats (well of the global stats and the relevant per-database stats). It might be worth doing something fancier at some point - the shared memory stats was already a huge effort, cramming yet another change in there would pretty much have guaranteed that it'd fail. Greetings, Andres Freund
Re: shared-memory based stats collector - v70
On Wed, Jul 20, 2022 at 11:35 AM Greg Stark wrote: > On the one hand the rest of Postgres seems to be designed on the > assumption that the number of tables and database objects is limited > only by disk space. The catalogs are stored in relational storage > which is read through the buffer cache. On the other hand it's true > that syscaches don't do expire entries (though I think the assumption > is that no one backend touches very much). > > It seems like if we really think the total number of database objects > is reasonably limited to scales that fit in RAM there would be a much > simpler database design that would just store the catalog tables in > simple in-memory data structures and map them all on startup without > doing all the work Postgres does to make relational storage scale. > I think efforts to do such a thing have gotten caught up in solving issues around visibility and managing the relationship between local and global caches [1]. It doesn't seem like the primary technical concern was memory usage. [1] https://www.postgresql.org/message-id/flat/4E72940DA2BF16479384A86D54D0988A567B9245%40G01JPEXMBKW04
Re: shared-memory based stats collector - v70
So I'm finally wrapping my head around this new code. There is something I'm surprised by that perhaps I'm misreading or perhaps I shouldn't be surprised by, not sure. Is it true that the shared memory allocation contains the hash table entry and body of every object in every database? I guess I was assuming I would find some kind of LRU cache which loaded data from disk on demand. But afaict it loads everything on startup and then never loads from disk later. The disk is purely for recovering state after a restart. On the one hand the rest of Postgres seems to be designed on the assumption that the number of tables and database objects is limited only by disk space. The catalogs are stored in relational storage which is read through the buffer cache. On the other hand it's true that syscaches don't expire entries (though I think the assumption is that no one backend touches very much). It seems like if we really think the total number of database objects is reasonably limited to scales that fit in RAM there would be a much simpler database design that would just store the catalog tables in simple in-memory data structures and map them all on startup without doing all the work Postgres does to make relational storage scale.
Re: shared-memory based stats collector - v70
Hi, On 2022-04-13 17:55:18 -0700, Andres Freund wrote: > On 2022-04-13 16:56:45 -0700, David G. Johnston wrote: > > On Wed, Apr 13, 2022 at 4:34 PM David G. Johnston < > > david.g.johns...@gmail.com> wrote: > > Sorry, apparently this "2000-01-01" behavior only manifests after crash > > recovery on v15 (didn't check v14); after a clean initdb on v15 I got the > > same initdb timestamp. > > > Feels like we should still report the "end of crash recovery timestamp" for > > these instead of 2000-01-01 (which I guess is derived from 0) if we are not > > willing to produce null (and it seems other parts of the system using these > > stats assumes non-null). > > Yes, that's definitely not correct. I see the bug (need to call > pgstat_reset_after_failure(); in pgstat_discard_stats()). Stupid, but > easy to fix - too fried to write a test tonight, but will commit the fix > tomorrow. Pushed the fix (including a test that previously failed). Thanks again! Greetings, Andres Freund
Re: shared-memory based stats collector - v70
On Wed, Apr 13, 2022 at 04:56:45PM -0700, David G. Johnston wrote: > Sorry, apparently this "2000-01-01" behavior only manifests after crash > recovery on v15 (didn't check v14); after a clean initdb on v15 I got the > same initdb timestamp. > > Feels like we should still report the "end of crash recovery timestamp" for > these instead of 2000-01-01 (which I guess is derived from 0) if we are not > willing to produce null (and it seems other parts of the system using these > stats assumes non-null). I can see this timestamp as well after crash recovery. This seems rather misleading to me. I have added an open item. -- Michael
Re: shared-memory based stats collector - v70
Hi, On 2022-04-13 16:56:45 -0700, David G. Johnston wrote: > On Wed, Apr 13, 2022 at 4:34 PM David G. Johnston < > david.g.johns...@gmail.com> wrote: > Sorry, apparently this "2000-01-01" behavior only manifests after crash > recovery on v15 (didn't check v14); after a clean initdb on v15 I got the > same initdb timestamp. > Feels like we should still report the "end of crash recovery timestamp" for > these instead of 2000-01-01 (which I guess is derived from 0) if we are not > willing to produce null (and it seems other parts of the system using these > stats assumes non-null). Yes, that's definitely not correct. I see the bug (need to call pgstat_reset_after_failure(); in pgstat_discard_stats()). Stupid, but easy to fix - too fried to write a test tonight, but will commit the fix tomorrow. Thanks for catching! Greetings, Andres Freund
Re: shared-memory based stats collector - v70
On Wed, Apr 13, 2022 at 4:34 PM David G. Johnston < david.g.johns...@gmail.com> wrote: > On Tue, Apr 5, 2022 at 8:00 PM Andres Freund wrote: > >> Here comes v70: >> >> > One thing I just noticed while peeking at pg_stat_slru: > > The stats_reset column for my newly initdb'd cluster is showing me > "2000-01-01 00:00:00" (v15). I was expecting null, though a non-null value > restriction does make sense. Neither choice is documented though. > > Based upon my expectation I checked to see if v14 reported null, and thus > this was a behavior change. v14 reports the initdb timestamp (e.g., > 2022-04-13 23:26:48.349115+00) > > Can we document the non-null aspect of this value (pg_stat_database is > happy being null, this seems to be a "fixed" type behavior) but have it > continue to report initdb as its initial value? > > Sorry, apparently this "2000-01-01" behavior only manifests after crash recovery on v15 (didn't check v14); after a clean initdb on v15 I got the same initdb timestamp. Feels like we should still report the "end of crash recovery timestamp" for these instead of 2000-01-01 (which I guess is derived from 0) if we are not willing to produce null (and it seems other parts of the system using these stats assumes non-null). David J.
Re: shared-memory based stats collector - v70
On Tue, Apr 5, 2022 at 8:00 PM Andres Freund wrote: > Here comes v70: > > One thing I just noticed while peeking at pg_stat_slru: The stats_reset column for my newly initdb'd cluster is showing me "2000-01-01 00:00:00" (v15). I was expecting null, though a non-null value restriction does make sense. Neither choice is documented though. Based upon my expectation I checked to see if v14 reported null, and thus this was a behavior change. v14 reports the initdb timestamp (e.g., 2022-04-13 23:26:48.349115+00) Can we document the non-null aspect of this value (pg_stat_database is happy being null, this seems to be a "fixed" type behavior) but have it continue to report initdb as its initial value? David J.
Re: shared-memory based stats collector - v70
On Sat, Apr 9, 2022 at 12:07 PM Andres Freund wrote: > > > ... specific counters. In particular, replay will not increment > > pg_stat_database or pg_stat_all_tables columns, and the startup process > > will not report reads and writes for the pg_statio views. > > > > It would be helpful to give at least one specific example of what is being > > recorded normally, especially since we give three of what is not. > > The second sentence is a set of examples - or do you mean examples for what > actions by the startup process are counted? > > Specific views that these statistics will be updating; like pg_stat_database being the example of a view that is not updated. David J.
Re: shared-memory based stats collector - v70
Hi, On 2022-04-07 21:39:55 -0700, David G. Johnston wrote: > On Thu, Apr 7, 2022 at 8:59 PM Andres Freund wrote: > > > > > > >Cumulative statistics are collected in shared memory. Every >PostgreSQL process collects statistics > locally >then updates the shared data at appropriate intervals. When a server, >including a physical replica, shuts down cleanly, a permanent copy of > the >statistics data is stored in the pg_stat > subdirectory, >so that statistics can be retained across server restarts. In contrast, >when starting from an unclean shutdown (e.g., after an immediate > shutdown, >a server crash, starting from a base backup, and point-in-time > recovery), >all statistics counters are reset. > > > > I like this. My comment regarding using "i.e.," here stands though. Argh. I'd done it for e.g., but not i.e.. > > > > > > The cumulative statistics system is active during recovery. All scans, > > reads, blocks, index usage, etc., will be recorded normally on the > > standby. However, WAL replay will not increment relation and database > > specific counters. I.e. replay will not increment pg_stat_all_tables > > columns (like n_tup_ins), nor will reads or writes performed by the > > startup process be tracked in the pg_statio views, nor will associated > > pg_stat_database columns be incremented. > > > > > I like this too. The second part with three nors is a bit rough. Maybe: Agreed. I tried to come up with a smoother formulation, but didn't (perhaps because I was a tad tired). > ... specific counters. In particular, replay will not increment > pg_stat_database or pg_stat_all_tables columns, and the startup process > will not report reads and writes for the pg_statio views. > > It would be helpful to give at least one specific example of what is being > recorded normally, especially since we give three of what is not. The second sentence is a set of examples - or do you mean examples for what actions by the startup process are counted? 
Greetings, Andres Freund
Re: shared-memory based stats collector - v70
At Thu, 7 Apr 2022 20:59:21 -0700, Andres Freund wrote in > Hi, > > On 2022-04-08 11:10:14 +0900, Kyotaro Horiguchi wrote: > > I can read it. But I'm not sure that the difference is obvious for > > average users between "starting a standby from a basebackup" and > > "starting a standby after a normal shutdown".. > > Yea, that's what I was concerned about. How about: > > >Cumulative statistics are collected in shared memory. Every >PostgreSQL process collects statistics locally >then updates the shared data at appropriate intervals. When a server, >including a physical replica, shuts down cleanly, a permanent copy of the >statistics data is stored in the pg_stat subdirectory, >so that statistics can be retained across server restarts. In contrast, >when starting from an unclean shutdown (e.g., after an immediate shutdown, >a server crash, starting from a base backup, and point-in-time recovery), >all statistics counters are reset. > Looks perfect generally, and especially in regard to the concern. > I think I like my version above a bit better? Quite a bit. It didn't answer for the concern. > > > 2) > > > The edit is not a problem, but it's hard to understand what the existing > > > paragraph actually means? > > > > > > diff --git a/doc/src/sgml/high-availability.sgml > > > b/doc/src/sgml/high-availability.sgml > > > index 3247e056663..8bfb584b752 100644 > > > --- a/doc/src/sgml/high-availability.sgml > > > +++ b/doc/src/sgml/high-availability.sgml > > > @@ -,17 +,17 @@ HINT: You can then restart the server after > > > making the necessary configuration > > > ... > > > > > > -The statistics collector is active during recovery. All scans, > > > reads, blocks, > > > +The cumulative statistics system is active during recovery. All > > > scans, reads, blocks, > > > index usage, etc., will be recorded normally on the standby. 
Replayed > > > actions will not duplicate their effects on primary, so replaying an > > > insert will not increment the Inserts column of pg_stat_user_tables. > > > The stats file is deleted at the start of recovery, so stats from > > > primary > > > and standby will differ; this is considered a feature, not a bug. > > > > > > > > > > > > > Agreed partially. It's too detailed. It might not need to mention WAL > > replay. > > My concern is more that it seems halfway nonsensical. "Replayed actions will > not duplicate their effects on primary" - I can guess what that means, but not > more. There's no "Inserts" column of pg_stat_user_tables. > > > > The cumulative statistics system is active during recovery. All scans, > reads, blocks, index usage, etc., will be recorded normally on the > standby. However, WAL replay will not increment relation and database > specific counters. I.e. replay will not increment pg_stat_all_tables > columns (like n_tup_ins), nor will reads or writes performed by the > startup process be tracked in the pg_statio views, nor will associated > pg_stat_database columns be incremented. > Looks clearer since it mention user-facing interfaces with concrete example columns. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: shared-memory based stats collector - v70
On Thu, Apr 7, 2022 at 8:59 PM Andres Freund wrote: > > >Cumulative statistics are collected in shared memory. Every >PostgreSQL process collects statistics > locally >then updates the shared data at appropriate intervals. When a server, >including a physical replica, shuts down cleanly, a permanent copy of > the >statistics data is stored in the pg_stat > subdirectory, >so that statistics can be retained across server restarts. In contrast, >when starting from an unclean shutdown (e.g., after an immediate > shutdown, >a server crash, starting from a base backup, and point-in-time > recovery), >all statistics counters are reset. > > I like this. My comment regarding using "i.e.," here stands though. > > > The cumulative statistics system is active during recovery. All scans, > reads, blocks, index usage, etc., will be recorded normally on the > standby. However, WAL replay will not increment relation and database > specific counters. I.e. replay will not increment pg_stat_all_tables > columns (like n_tup_ins), nor will reads or writes performed by the > startup process be tracked in the pg_statio views, nor will associated > pg_stat_database columns be incremented. > > > I like this too. The second part with three nors is a bit rough. Maybe: ... specific counters. In particular, replay will not increment pg_stat_database or pg_stat_all_tables columns, and the startup process will not report reads and writes for the pg_statio views. It would be helpful to give at least one specific example of what is being recorded normally, especially since we give three of what is not. David J.
Re: shared-memory based stats collector - v70
On 2022-04-07 16:37:51 -0700, Andres Freund wrote: > On 2022-04-07 00:28:45 -0700, Andres Freund wrote: > > I've gotten through the main commits (and then a fix for the apparently > > inevitable bug that's immediately highlighted by the buildfarm), and the > > first > > test. I'll call it a night now, and work on the other tests & docs tomorrow. > > I've gotten through the tests now. There's one known, not yet addressed, issue > with the stats isolation test, see [1]. That has since been fixed, in d6c0db14836cd843d589372d909c73aab68c7a24
Re: shared-memory based stats collector - v70
Hi, On 2022-04-07 20:59:21 -0700, Andres Freund wrote: > > >Cumulative statistics are collected in shared memory. Every >PostgreSQL process collects statistics locally >then updates the shared data at appropriate intervals. When a server, >including a physical replica, shuts down cleanly, a permanent copy of the >statistics data is stored in the pg_stat subdirectory, >so that statistics can be retained across server restarts. In contrast, >when starting from an unclean shutdown (e.g., after an immediate shutdown, >a server crash, starting from a base backup, and point-in-time recovery), >all statistics counters are reset. > > ... > > The cumulative statistics system is active during recovery. All scans, > reads, blocks, index usage, etc., will be recorded normally on the > standby. However, WAL replay will not increment relation and database > specific counters. I.e. replay will not increment pg_stat_all_tables > columns (like n_tup_ins), nor will reads or writes performed by the > startup process be tracked in the pg_statio views, nor will associated > pg_stat_database columns be incremented. > I went with these for now. My guess is that there's further improvements in them, and in surrounding areas... With that, I'll close this CF entry. It's been a while. Greetings, Andres Freund
Re: shared-memory based stats collector - v70
Hi, On 2022-04-07 20:51:10 -0700, David G. Johnston wrote: > On Thu, Apr 7, 2022 at 7:10 PM Kyotaro Horiguchi > wrote: > > I can read it. But I'm not sure that the difference is obvious for > > average users between "starting a standby from a basebackup" and > > "starting a standby after a normal shutdown".. > > > > Other than that, it might be easier to read if the additional part > > were moved out to the end of the paragraph, prefixing with "Note: > > ". For example, > > > > ... > > statistics can be retained across server restarts. When crash recovery is > > performed at server start (e.g., after immediate shutdown, server crash, > > and point-in-time recovery), all statistics counters are reset. Note that > > crash recovery is not performed when starting a standby that was shut > > down normally then all counters are retained. > > > > > Maybe: > When the server shuts down cleanly a permanent copy of the statistics data > is stored in the pg_stat subdirectory so that > statistics can be retained across server restarts. However, if crash > recovery is performed (i.e., after immediate shutdown, server crash, or > point-in-time recovery), all statistics counters are reset. For any > standby server, the initial startup to get the cluster initialized is a > point-in-time crash recovery startup. For all subsequent startups it > behaves like any other server. For a hot standby server, statistics are > retained during a failover promotion. > > I'm pretty sure i.e., is correct since those three situations are not > examples but rather the complete set. I don't think the "initial startup ..." bit is quite correct. A standby can be created for a shut down server, and IIRC there's some differences in how PITR is handled too. > Is crash recovery ever performed other than at server start? If so I > choose to remove the redundancy. No. 
> I feel like some of this detail about standby servers is/should be covered > elsewhere and we are at least missing a cross-reference chance even if we > leave the material coverage as-is. I didn't find anything good to reference... > > 2) > > > The edit is not a problem, but it's hard to understand what the existing > > > paragraph actually means? > > > > > > diff --git a/doc/src/sgml/high-availability.sgml > > b/doc/src/sgml/high-availability.sgml > > > index 3247e056663..8bfb584b752 100644 > > > --- a/doc/src/sgml/high-availability.sgml > > > +++ b/doc/src/sgml/high-availability.sgml > > > @@ -,17 +,17 @@ HINT: You can then restart the server after > > making the necessary configuration > > > ... > > > > > > -The statistics collector is active during recovery. All scans, > > reads, blocks, > > > +The cumulative statistics system is active during recovery. All > > scans, reads, blocks, > > > index usage, etc., will be recorded normally on the standby. > > Replayed > > > actions will not duplicate their effects on primary, so replaying an > > > insert will not increment the Inserts column of pg_stat_user_tables. > > > The stats file is deleted at the start of recovery, so stats from > > primary > > > and standby will differ; this is considered a feature, not a bug. > > > > > > > > > > > > > Agreed partially. It's too detailed. It might not need to mention WAL > > replay. > > > > > The insert example seems like a poor one...IIUC cumulative statistics are > not WAL logged and while in recovery INSERT is prohibited, so how would > replaying the insert in the WAL result in a duplicated effect on > pg_stat_user_tables.inserts? I agree, the sentence doesn't make much sense. It doesn't really matter that stats aren't WAL logged, one could infer them from the WAL at a decent level of accuracy. However, we can't actually associate those actions to relations, we just know the relfilenode... And the startup process can't read the catalog to figure the mapping out. 
> I also have no idea what, in the fragment, "Replayed actions will not > duplicate their effects on primary...", what "on primary" is supposed to > mean. I think it's trying to say "will not duplicate the effect on pg_stat_user_tables they had on the primary". > I would like to write the following but I don't believe it is sufficiently > true: > > "The cumulative statistics system records only the locally generated > activity of the cluster, including while in recovery. Activity happening > only due to the replay of WAL is not considered local." > > But to apply the WAL we have to fetch blocks from the local filesystem and > write them back out. That is local activity happening due to the replay of > WAL which sounds like it is and should be reported ("All...blocks...", and > the example given being logical DDL oriented). That's not true today - the startup processes reads / writes aren't reflected in pg_statio, pg_stat_database or whatnot. > I cannot think of a better paragraph at the moment, the minimal change is > good, and the detail it
Re: shared-memory based stats collector - v70
Hi, On 2022-04-08 11:10:14 +0900, Kyotaro Horiguchi wrote: > At Thu, 7 Apr 2022 16:37:51 -0700, Andres Freund wrote > in > > Hi, > > > > On 2022-04-07 00:28:45 -0700, Andres Freund wrote: > > > I've gotten through the main commits (and then a fix for the apparently > > > inevitable bug that's immediately highlighted by the buildfarm), and the > > > first > > > test. I'll call it a night now, and work on the other tests & docs > > > tomorrow. > > > > I've gotten through the tests now. There's one known, not yet addressed, > > issue > > with the stats isolation test, see [1]. > > > > > > Working on the docs. Found a few things worth raising: > > > > 1) > > Existing text: > >When the server shuts down cleanly, a permanent copy of the statistics > >data is stored in the pg_stat subdirectory, so that > >statistics can be retained across server restarts. When recovery is > >performed at server start (e.g., after immediate shutdown, server crash, > >and point-in-time recovery), all statistics counters are reset. > > > > The existing docs patch hadn't updated yet. My current edit is > > > >When the server shuts down cleanly, a permanent copy of the statistics > >data is stored in the pg_stat subdirectory, so that > >statistics can be retained across server restarts. When crash recovery > > is > >performed at server start (e.g., after immediate shutdown, server crash, > >and point-in-time recovery, but not when starting a standby that was shut > >down normally), all statistics counters are reset. > > > > but I'm not sure the parenthetical is easy enough to understand? > > I can read it. But I'm not sure that the difference is obvious for > average users between "starting a standby from a basebackup" and > "starting a standby after a normal shutdown".. Yea, that's what I was concerned about. How about: Cumulative statistics are collected in shared memory. Every PostgreSQL process collects statistics locally then updates the shared data at appropriate intervals. 
When a server, including a physical replica, shuts down cleanly, a permanent copy of the statistics data is stored in the pg_stat subdirectory, so that statistics can be retained across server restarts. In contrast, when starting from an unclean shutdown (e.g., after an immediate shutdown, a server crash, starting from a base backup, and point-in-time recovery), all statistics counters are reset. > Other than that, it might be easier to read if the additional part > were moved out to the end of the paragraph, prefixing with "Note: > ". For example, > > ... > statistics can be retained across server restarts. When crash recovery is > performed at server start (e.g., after immediate shutdown, server crash, > and point-in-time recovery), all statistics counters are reset. Note that > crash recovery is not performed when starting a standby that was shut > down normally then all counters are retained. I think I like my version above a bit better? > > 2) > > The edit is not a problem, but it's hard to understand what the existing > > paragraph actually means? > > > > diff --git a/doc/src/sgml/high-availability.sgml > > b/doc/src/sgml/high-availability.sgml > > index 3247e056663..8bfb584b752 100644 > > --- a/doc/src/sgml/high-availability.sgml > > +++ b/doc/src/sgml/high-availability.sgml > > @@ -,17 +,17 @@ HINT: You can then restart the server after making > > the necessary configuration > > ... > > > > -The statistics collector is active during recovery. All scans, reads, > > blocks, > > +The cumulative statistics system is active during recovery. All scans, > > reads, blocks, > > index usage, etc., will be recorded normally on the standby. Replayed > > actions will not duplicate their effects on primary, so replaying an > > insert will not increment the Inserts column of pg_stat_user_tables. > > The stats file is deleted at the start of recovery, so stats from > > primary > > and standby will differ; this is considered a feature, not a bug. 
> > > > > > > > Agreed partially. It's too detailed. It might not need to mention WAL > replay. My concern is more that it seems halfway nonsensical. "Replayed actions will not duplicate their effects on primary" - I can guess what that means, but not more. There's no "Inserts" column of pg_stat_user_tables. The cumulative statistics system is active during recovery. All scans, reads, blocks, index usage, etc., will be recorded normally on the standby. However, WAL replay will not increment relation and database specific counters. I.e. replay will not increment pg_stat_all_tables columns (like n_tup_ins), nor will reads or writes performed by the startup process be tracked in the pg_statio views, nor will associated pg_stat_database columns be incremented. Greetings, Andres Freund
Re: shared-memory based stats collector - v70
On Thu, Apr 7, 2022 at 7:10 PM Kyotaro Horiguchi wrote: > At Thu, 7 Apr 2022 16:37:51 -0700, Andres Freund > wrote in > > Hi, > > > > On 2022-04-07 00:28:45 -0700, Andres Freund wrote: > > > I've gotten through the main commits (and then a fix for the apparently > > > inevitable bug that's immediately highlighted by the buildfarm), and > the first > > > test. I'll call it a night now, and work on the other tests & docs > tomorrow. > > > > I've gotten through the tests now. There's one known, not yet addressed, > issue > > with the stats isolation test, see [1]. > > > > > > Working on the docs. Found a few things worth raising: > > > > 1) > > Existing text: > >When the server shuts down cleanly, a permanent copy of the statistics > >data is stored in the pg_stat subdirectory, so > that > >statistics can be retained across server restarts. When recovery is > >performed at server start (e.g., after immediate shutdown, server > crash, > >and point-in-time recovery), all statistics counters are reset. > > > > The existing docs patch hadn't updated yet. My current edit is > > > >When the server shuts down cleanly, a permanent copy of the statistics > >data is stored in the pg_stat subdirectory, so > that > >statistics can be retained across server restarts. When crash > recovery is > >performed at server start (e.g., after immediate shutdown, server > crash, > >and point-in-time recovery, but not when starting a standby that was > shut > >down normally), all statistics counters are reset. > > > > but I'm not sure the parenthetical is easy enough to understand? > > I can read it. But I'm not sure that the difference is obvious for > average users between "starting a standby from a basebackup" and > "starting a standby after a normal shutdown".. > > Other than that, it might be easier to read if the additional part > were moved out to the end of the paragraph, prefixing with "Note: > ". For example, > > ... > statistics can be retained across server restarts. 
When crash recovery is > performed at server start (e.g., after immediate shutdown, server crash, > and point-in-time recovery), all statistics counters are reset. Note that > crash recovery is not performed when starting a standby that was shut > down normally then all counters are retained. > > Maybe: When the server shuts down cleanly a permanent copy of the statistics data is stored in the pg_stat subdirectory so that statistics can be retained across server restarts. However, if crash recovery is performed (i.e., after immediate shutdown, server crash, or point-in-time recovery), all statistics counters are reset. For any standby server, the initial startup to get the cluster initialized is a point-in-time crash recovery startup. For all subsequent startups it behaves like any other server. For a hot standby server, statistics are retained during a failover promotion. I'm pretty sure i.e., is correct since those three situations are not examples but rather the complete set. Is crash recovery ever performed other than at server start? If so I choose to remove the redundancy. I feel like some of this detail about standby servers is/should be covered elsewhere and we are at least missing a cross-reference chance even if we leave the material coverage as-is. > 2) > > The edit is not a problem, but it's hard to understand what the existing > > paragraph actually means? > > > > diff --git a/doc/src/sgml/high-availability.sgml > b/doc/src/sgml/high-availability.sgml > > index 3247e056663..8bfb584b752 100644 > > --- a/doc/src/sgml/high-availability.sgml > > +++ b/doc/src/sgml/high-availability.sgml > > @@ -,17 +,17 @@ HINT: You can then restart the server after > making the necessary configuration > > ... > > > > -The statistics collector is active during recovery. All scans, > reads, blocks, > > +The cumulative statistics system is active during recovery. All > scans, reads, blocks, > > index usage, etc., will be recorded normally on the standby. 
> Replayed > > actions will not duplicate their effects on primary, so replaying an > > insert will not increment the Inserts column of pg_stat_user_tables. > > The stats file is deleted at the start of recovery, so stats from > primary > > and standby will differ; this is considered a feature, not a bug. > > > > > > > > Agreed partially. It's too detailed. It might not need to mention WAL > replay. > > The insert example seems like a poor one...IIUC cumulative statistics are not WAL logged and while in recovery INSERT is prohibited, so how would replaying the insert in the WAL result in a duplicated effect on pg_stat_user_tables.inserts? I also have no idea what, in the fragment, "Replayed actions will not duplicate their effects on primary...", what "on primary" is supposed to mean. I would like to write the following but I don't believe it is sufficiently true: "The cumulative statistics system
Re: shared-memory based stats collector - v70
At Thu, 7 Apr 2022 16:37:51 -0700, Andres Freund wrote in > Hi, > > On 2022-04-07 00:28:45 -0700, Andres Freund wrote: > > I've gotten through the main commits (and then a fix for the apparently > > inevitable bug that's immediately highlighted by the buildfarm), and the > > first > > test. I'll call it a night now, and work on the other tests & docs tomorrow. > > I've gotten through the tests now. There's one known, not yet addressed, issue > with the stats isolation test, see [1]. > > > Working on the docs. Found a few things worth raising: > > 1) > Existing text: >When the server shuts down cleanly, a permanent copy of the statistics >data is stored in the pg_stat subdirectory, so that >statistics can be retained across server restarts. When recovery is >performed at server start (e.g., after immediate shutdown, server crash, >and point-in-time recovery), all statistics counters are reset. > > The existing docs patch hadn't updated yet. My current edit is > >When the server shuts down cleanly, a permanent copy of the statistics >data is stored in the pg_stat subdirectory, so that >statistics can be retained across server restarts. When crash recovery is >performed at server start (e.g., after immediate shutdown, server crash, >and point-in-time recovery, but not when starting a standby that was shut >down normally), all statistics counters are reset. > > but I'm not sure the parenthetical is easy enough to understand? I can read it. But I'm not sure that the difference is obvious for average users between "starting a standby from a basebackup" and "starting a standby after a normal shutdown".. Other than that, it might be easier to read if the additional part were moved out to the end of the paragraph, prefixing with "Note: ". For example, ... statistics can be retained across server restarts. When crash recovery is performed at server start (e.g., after immediate shutdown, server crash, and point-in-time recovery), all statistics counters are reset. 
Note that crash recovery is not performed when starting a standby that was shut down normally then all counters are retained. > 2) > The edit is not a problem, but it's hard to understand what the existing > paragraph actually means? > > diff --git a/doc/src/sgml/high-availability.sgml > b/doc/src/sgml/high-availability.sgml > index 3247e056663..8bfb584b752 100644 > --- a/doc/src/sgml/high-availability.sgml > +++ b/doc/src/sgml/high-availability.sgml > @@ -,17 +,17 @@ HINT: You can then restart the server after making > the necessary configuration > ... > > -The statistics collector is active during recovery. All scans, reads, > blocks, > +The cumulative statistics system is active during recovery. All scans, > reads, blocks, > index usage, etc., will be recorded normally on the standby. Replayed > actions will not duplicate their effects on primary, so replaying an > insert will not increment the Inserts column of pg_stat_user_tables. > The stats file is deleted at the start of recovery, so stats from primary > and standby will differ; this is considered a feature, not a bug. > > > Agreed partially. It's too detailed. It might not need to mention WAL replay. > I'll just commit the necessary bit, but we really ought to rephrase this. > > > > > Greetings, > > Andres Freund > > [1] > https://www.postgresql.org/message-id/20220407165709.jgdkrzqlkcwue6ko%40alap3.anarazel.de -- Kyotaro Horiguchi NTT Open Source Software Center
Re: shared-memory based stats collector - v70
Hi, On 2022-04-07 00:28:45 -0700, Andres Freund wrote: > I've gotten through the main commits (and then a fix for the apparently > inevitable bug that's immediately highlighted by the buildfarm), and the first > test. I'll call it a night now, and work on the other tests & docs tomorrow. I've gotten through the tests now. There's one known, not yet addressed, issue with the stats isolation test, see [1]. Working on the docs. Found a few things worth raising: 1) Existing text: When the server shuts down cleanly, a permanent copy of the statistics data is stored in the pg_stat subdirectory, so that statistics can be retained across server restarts. When recovery is performed at server start (e.g., after immediate shutdown, server crash, and point-in-time recovery), all statistics counters are reset. The existing docs patch hadn't updated yet. My current edit is When the server shuts down cleanly, a permanent copy of the statistics data is stored in the pg_stat subdirectory, so that statistics can be retained across server restarts. When crash recovery is performed at server start (e.g., after immediate shutdown, server crash, and point-in-time recovery, but not when starting a standby that was shut down normally), all statistics counters are reset. but I'm not sure the parenthetical is easy enough to understand? 2) The edit is not a problem, but it's hard to understand what the existing paragraph actually means? diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml index 3247e056663..8bfb584b752 100644 --- a/doc/src/sgml/high-availability.sgml +++ b/doc/src/sgml/high-availability.sgml @@ -,17 +,17 @@ HINT: You can then restart the server after making the necessary configuration ... -The statistics collector is active during recovery. All scans, reads, blocks, +The cumulative statistics system is active during recovery. All scans, reads, blocks, index usage, etc., will be recorded normally on the standby. 
Replayed actions will not duplicate their effects on primary, so replaying an insert will not increment the Inserts column of pg_stat_user_tables. The stats file is deleted at the start of recovery, so stats from primary and standby will differ; this is considered a feature, not a bug. I'll just commit the necessary bit, but we really ought to rephrase this. Greetings, Andres Freund [1] https://www.postgresql.org/message-id/20220407165709.jgdkrzqlkcwue6ko%40alap3.anarazel.de
Re: shared-memory based stats collector - v70
At Thu, 7 Apr 2022 00:28:45 -0700, Andres Freund wrote in > Hi, > > On 2022-04-05 20:00:08 -0700, Andres Freund wrote: > > It'll be a few hours to get to the main commit - but except for 0001 it > > doesn't make sense to push without intending to push later changes too. I > > might squash a few commits togther. > > I've gotten through the main commits (and then a fix for the apparently > inevitable bug that's immediately highlighted by the buildfarm), and the first > test. I'll call it a night now, and work on the other tests & docs tomorrow. Thank you very much for the great effort on this to make it get in! regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: shared-memory based stats collector - v70
At Wed, 6 Apr 2022 18:58:52 -0700, Andres Freund wrote in > Hi, > > On 2022-04-07 10:36:30 +0900, Kyotaro Horiguchi wrote: > > At Wed, 6 Apr 2022 09:04:09 -0700, Andres Freund wrote > > in > > > I think there's no switches left now, so it's not actually providing too > > > much. > > > > (Ouch!) > I think it's great that there's no switches left - means we're pretty close to > pgstat being runtime extensible... Yes. I agree. > It's now pgstat_get_kind_from_str(). I'm fine with it. > It was harder to see earlier (I certainly didn't really see it) - because > there were so many "violations" - but most of pgstat is > pgstat__() or just _. I'd already moved most of > the patch series over to that (maybe in v68 or so). Now I also did that with > the internal functions. > > There's a few functions breaking that pattern, partially because I added them > :(, but since they're not touched in these patches I've not renamed them. But > it's probably worth doing so tomorrow. That being said, it gets far cleaner. Thanks! regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: shared-memory based stats collector - v70
Hi, On 2022-04-05 20:00:08 -0700, Andres Freund wrote: > It'll be a few hours to get to the main commit - but except for 0001 it > doesn't make sense to push without intending to push later changes too. I > might squash a few commits togther. I've gotten through the main commits (and then a fix for the apparently inevitable bug that's immediately highlighted by the buildfarm), and the first test. I'll call it a night now, and work on the other tests & docs tomorrow. Greetings, Andres Freund
Re: shared-memory based stats collector - v70
Hi, On 2022-04-07 10:36:30 +0900, Kyotaro Horiguchi wrote: > At Wed, 6 Apr 2022 09:04:09 -0700, Andres Freund wrote > in > > > + * Don't define an INVALID value so switch() statements can warn if some > > > + * cases aren't covered. But define the first member to 1 so that > > > + * uninitialized values can be detected more easily. > > > > > > FWIW, I like this. > > > > I think there's no switches left now, so it's not actually providing too > > much. > > (Ouch!) I think it's great that there's no switches left - means we're pretty close to pgstat being runtime extensible... > > > 0010: > > > (I didn't look this closer. The comments arised while looking other > > > patches.) > > > > > > +pgstat_kind_from_str(char *kind_str) > > > > > > I don't think I like "str" so much. Don't we spell it as > > > "pgstat_kind_from_name"? > > > > name makes me think of NameData. What do you dislike about str? We seem to > > use > > str in plenty places? > > For clarity, I don't dislike it so much. So, I'm fine with the > current name. > > I found that you meant a type by the "str". I thought it as an > instance (I'm not sure I can express my feeling correctly here..) and > the following functions were in my mind. > > char *get_namespace/rel/collation/func_name(Oid someoid) > char *pgstat_slru_name(int slru_idx) > > Another instance of the same direction is > > ForkNumber forkname_to_number(const char *forkName) It's now pgstat_get_kind_from_str(). It was harder to see earlier (I certainly didn't really see it) - because there were so many "violations" - but most of pgstat is pgstat__() or just _. I'd already moved most of the patch series over to that (maybe in v68 or so). Now I also did that with the internal functions. There's a few functions breaking that pattern, partially because I added them :(, but since they're not touched in these patches I've not renamed them. But it's probably worth doing so tomorrow. Greetings, Andres Freund
Re: shared-memory based stats collector - v70
At Wed, 6 Apr 2022 09:04:09 -0700, Andres Freund wrote in > > +* Don't define an INVALID value so switch() statements can warn if some > > +* cases aren't covered. But define the first member to 1 so that > > +* uninitialized values can be detected more easily. > > > > FWIW, I like this. > > I think there's no switches left now, so it's not actually providing too much. (Ouch!) > > 0008: > > > > + xact_desc_stats(buf, "", parsed.nstats, parsed.stats); > > + xact_desc_stats(buf, "commit ", parsed.nstats, parsed.stats); > > + xact_desc_stats(buf, "abort ", parsed.nabortstats, parsed.abortstats); > > > > I'm not sure I like this, but I don't object to this.. > > The string prefixes? Or the entire patch? The string prefixes, since they are a limited set of fixed strings. That being said, I don't think it's better to use an enum instead, too. So I don't object to pass the strings here. > > 0010: > > (I didn't look this closer. The comments arised while looking other > > patches.) > > > > +pgstat_kind_from_str(char *kind_str) > > > > I don't think I like "str" so much. Don't we spell it as > > "pgstat_kind_from_name"? > > name makes me think of NameData. What do you dislike about str? We seem to > use > str in plenty places? For clarity, I don't dislike it so much. So, I'm fine with the current name. I found that you meant a type by the "str". I thought it as an instance (I'm not sure I can express my feeling correctly here..) and the following functions were in my mind. char *get_namespace/rel/collation/func_name(Oid someoid) char *pgstat_slru_name(int slru_idx) Another instance of the same direction is ForkNumber forkname_to_number(const char *forkName) regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: shared-memory based stats collector - v70
Hi, On 2022-04-06 17:01:17 -0700, David G. Johnston wrote: > On Wed, Apr 6, 2022 at 4:12 PM Andres Freund wrote: > > The fact there is just the one outlier here suggests that this is indeed the > better option. FWIW, the outlier also uses pgstat_reset(), just with a small wrapper doing the translation from slot name to slot index. > > What does "private" mean for you? They're exposed via pgstat.h not > > pgstat_internal.h. But not to SQL. > I was thinking specifically of the freedom to rename and not break > extensions. Namely, are these truly implementation details or something > that, while unlikely to be used by extensions, still constitute an exposed > API? It was mainly a passing thought, I'm not looking for a crash-course > in how all that works right now. I doubt there are extension using these functions - and they'd have been broken the way things were in v70, because the signature already had changed. Generally, between major releases, we don't worry too much about changing C APIs. Of course we try to avoid unnecessarily breaking things, particularly when it's going to cause widespread breakage. Greetings, Andres Freund
Re: shared-memory based stats collector - v70
On Wed, Apr 6, 2022 at 4:12 PM Andres Freund wrote: > > On 2022-04-06 15:32:39 -0700, David G. Johnston wrote: > > On Wednesday, April 6, 2022, Andres Freund wrote: > > > > > > I like having the SQL function paired with a matching implementation in > > this scheme. > > It would have gotten things closer than it was before imo. We can't just > rename the SQL functions, they're obviously exposed API. > Right, I meant the naming scheme proposed was acceptable. Not that I wanted to change the SQL functions too. I've hacked up the above, but after doing so I think I found a cleaner > approach: > > I've introduced: > > pgstat_reset_of_kind(PgStat_Kind kind) which works for both fixed and > variable > numbered stats. That allows to remove pgstat_reset_subscription_counters(), > pgstat_reset_replslot_counters(), pgstat_reset_shared_counters(). > > pgstat_reset(PgStat_Kind kind, Oid dboid, Oid objoid), which removes the > need > for pgstat_reset_subscription_counter(), > pgstat_reset_single_counter(). > pgstat_reset_replslot() is still needed, to do > the name -> index lookup. > > That imo makes a lot more sense than requiring each variable-amount kind to > have wrapper functions. > > I can see benefits of both, or even possibly combining them. Absent being able to point to some other part of the system and saying "it is done this way there, let's do the same here" I think the details will inform the decision. The fact there is just the one outlier here suggests that this is indeed the better option. > What does "private" mean for you? They're exposed via pgstat.h not > pgstat_internal.h. But not to SQL. > > I was thinking specifically of the freedom to rename and not break extensions. Namely, are these truly implementation details or something that, while unlikely to be used by extensions, still constitute an exposed API? It was mainly a passing thought, I'm not looking for a crash-course in how all that works right now. David J.
Re: shared-memory based stats collector - v70
Hi, On 2022-04-06 15:32:39 -0700, David G. Johnston wrote: > On Wednesday, April 6, 2022, Andres Freund wrote: > > > > > > > I'd go for > > pgstat_reset_slru_counter() -> pgstat_reset_slru() > > pgstat_reset_subscription_counter() -> pgstat_reset_subscription() > > pgstat_reset_subscription_counters() -> pgstat_reset_all_subscriptions() > > pgstat_reset_replslot_counter() -> pgstat_reset_replslot() > > pgstat_reset_replslot_counters() -> pgstat_reset_all_replslots() > > > I like having the SQL function paired with a matching implementation in > this scheme. It would have gotten things closer than it was before imo. We can't just rename the SQL functions, they're obviously exposed API. I'd like to remove the NULL -> all behaviour, but that should be discussed separately. I've hacked up the above, but after doing so I think I found a cleaner approach: I've introduced: pgstat_reset_of_kind(PgStat_Kind kind) which works for both fixed and variable numbered stats. That allows to remove pgstat_reset_subscription_counters(), pgstat_reset_replslot_counters(), pgstat_reset_shared_counters(). pgstat_reset(PgStat_Kind kind, Oid dboid, Oid objoid), which removes the need for pgstat_reset_subscription_counter(), pgstat_reset_single_counter(). pgstat_reset_replslot() is still needed, to do the name -> index lookup. That imo makes a lot more sense than requiring each variable-amount kind to have wrapper functions. > > Not quite sure what to do with pgstat_reset_single_counter(). I'd either go > > for the minimal pgstat_reset_single_counters() or pgstat_reset_one()? > > > > Why not add both pgstat_resert_function() and pgstat_reset_table() (to keep > the pairing) and they can call the renamed pgstat_reset_function_or_table() > internally (since the function indeed handle both paths and we’ve yet to > come up with a label to use instead of “function and table stats”)? > > These are private functions right? What does "private" mean for you? 
They're exposed via pgstat.h not pgstat_internal.h. But not to SQL. Greetings, Andres Freund
Re: shared-memory based stats collector - v70
On Wednesday, April 6, 2022, Andres Freund wrote:

> I'd go for
> pgstat_reset_slru_counter() -> pgstat_reset_slru()
> pgstat_reset_subscription_counter() -> pgstat_reset_subscription()
> pgstat_reset_subscription_counters() -> pgstat_reset_all_subscriptions()
> pgstat_reset_replslot_counter() -> pgstat_reset_replslot()
> pgstat_reset_replslot_counters() -> pgstat_reset_all_replslots()

I like having the SQL function paired with a matching implementation in
this scheme.

> We could leave out the _all_ and just use plural too, but I think it's a bit
> nicer with _all_ in there.

+1 to _all_

> Not quite sure what to do with pgstat_reset_single_counter(). I'd either go
> for the minimal pgstat_reset_single_counters() or pgstat_reset_one()?

Why not add both pgstat_reset_function() and pgstat_reset_table() (to keep
the pairing) and they can call the renamed pgstat_reset_function_or_table()
internally (since the function indeed handles both paths and we've yet to
come up with a label to use instead of "function and table stats")?

These are private functions, right?

David J.
Re: shared-memory based stats collector - v70
Hi,

On 2022-04-05 20:00:08 -0700, Andres Freund wrote:
> It'll be a few hours to get to the main commit - but except for 0001 it
> doesn't make sense to push without intending to push later changes too. I
> might squash a few commits togther.

I just noticed an existing incoherency that I'm wondering about fixing as
part of 0007 "pgstat: prepare APIs used by pgstatfuncs for shared memory
stats."

The SQL functions to reset function and relation stats,
pg_stat_reset_single_table_counters() and
pg_stat_reset_single_function_counters() respectively, both make use of
pgstat_reset_single_counter(). Note that the SQL function uses plural
"counters" (which makes sense, it resets all counters for that object),
whereas the C function they call to perform the reset uses singular.

Similarly the pg_stat_reset_slru(), pg_stat_reset_replication_slot() and
pg_stat_reset_subscription_stats() SQL functions use
pgstat_reset_slru_counter(), pgstat_reset_replslot_counter() and
pgstat_reset_subscription_counter() to reset either the stats for one or
all SLRUs/slots.

This is relevant for the commit mentioned above because it separates the
functions to reset the stats for one slru / slot / sub from the functions
to reset all slrus / slots / subs. Going with the existing naming I'd just
have named them pgstat_reset_*_counters(). But that doesn't really make
sense.

If it were just existing code I'd just not touch this for now. But because
the patch introduces further functions, I'd rather not introduce more weird
function names.

I'd go for
pgstat_reset_slru_counter() -> pgstat_reset_slru()
pgstat_reset_subscription_counter() -> pgstat_reset_subscription()
pgstat_reset_subscription_counters() -> pgstat_reset_all_subscriptions()
pgstat_reset_replslot_counter() -> pgstat_reset_replslot()
pgstat_reset_replslot_counters() -> pgstat_reset_all_replslots()

We could leave out the _all_ and just use plural too, but I think it's a bit
nicer with _all_ in there.

Not quite sure what to do with pgstat_reset_single_counter(). I'd either go
for the minimal pgstat_reset_single_counters() or pgstat_reset_one()?

Greetings,

Andres Freund
Re: shared-memory based stats collector - v70
On Wed, Apr 6, 2022 at 12:34 PM Justin Pryzby wrote: > On Wed, Apr 06, 2022 at 12:27:34PM -0700, Andres Freund wrote: > > > > + next use of statistical information will cause a new snapshot to > be built > > > > + or accessed statistics to be cached. > > > > > > I believe this should be an "and", not an "or". (next access builds > both a > > > new snapshot and caches accessed statistics) > > > > I *think* or is correct? The new snapshot is when > stats_fetch_consistency = > > snapshot, the cached is when stats_fetch_consistency = cache. Not sure > how to > > make that clearer without making it a lot longer. Suggestions? > > I think it's correct. Maybe it's clearer to say: > > + next use of statistical information will (when in snapshot mode) cause > a new snapshot to be built > + or (when in cache mode) accessed statistics to be cached. > Ah, yes, that does clarify what was meant. +1 to Justin's edit, or something like it. Thanks, Lukas -- Lukas Fittl
Re: shared-memory based stats collector - v70
On Wed, Apr 06, 2022 at 12:27:34PM -0700, Andres Freund wrote:
> > > +     next use of statistical information will cause a new snapshot to be built
> > > +     or accessed statistics to be cached.
> >
> > I believe this should be an "and", not an "or". (next access builds both a
> > new snapshot and caches accessed statistics)
>
> I *think* or is correct? The new snapshot is when stats_fetch_consistency =
> snapshot, the cached is when stats_fetch_consistency = cache. Not sure how to
> make that clearer without making it a lot longer. Suggestions?

I think it's correct. Maybe it's clearer to say:

+     next use of statistical information will (when in snapshot mode) cause a new snapshot to be built
+     or (when in cache mode) accessed statistics to be cached.
Re: shared-memory based stats collector - v70
Hi, On 2022-04-06 12:14:35 -0700, Lukas Fittl wrote: > On Tue, Apr 5, 2022 at 8:00 PM Andres Freund wrote: > > > Here comes v70: > > > > Some small nitpicks on the docs: Thanks! > > From 13090823fc4c7fb94512110fb4d1b3e86fb312db Mon Sep 17 00:00:00 2001 > > From: Andres Freund > > Date: Sat, 2 Apr 2022 19:38:01 -0700 > > Subject: [PATCH v70 14/27] pgstat: update docs. > > ... > > diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml > > - These parameters control server-wide statistics collection > features. > > - When statistics collection is enabled, the data that is produced > can be > > + These parameters control server-wide cumulative statistics system. > > + When enabled, the data that is collected can be > > Missing "the" ("These parameters control the server-wide cumulative > statistics system"). > > diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml > > + any of the accumulated statistics, acessed values are cached until > the end > > "acessed" => "accessed" > > + stats_fetch_consistency can be set > > + snapshot, at the price of increased memory usage > for > > Missing "to" ("can be set to snapshot") Fixed. > > + caching not-needed statistics data. Conversely, if it's known that > statistics > > Double space between "data." and "Conversely" (not sure if that matters) > > + current transaction's statistics snapshot or cached values (if any). > The > > Double space between "(if any)." and "The" (not sure if that matters) That's done pretty widely in the docs and comments. > > + next use of statistical information will cause a new snapshot to be > built > > + or accessed statistics to be cached. > > I believe this should be an "and", not an "or". (next access builds both a > new snapshot and caches accessed statistics) I *think* or is correct? The new snapshot is when stats_fetch_consistency = snapshot, the cached is when stats_fetch_consistency = cache. Not sure how to make that clearer without making it a lot longer. 
Suggestions? Greetings, Andres Freund
Re: shared-memory based stats collector - v70
On Tue, Apr 5, 2022 at 8:00 PM Andres Freund wrote: > Here comes v70: > Some small nitpicks on the docs: > From 13090823fc4c7fb94512110fb4d1b3e86fb312db Mon Sep 17 00:00:00 2001 > From: Andres Freund > Date: Sat, 2 Apr 2022 19:38:01 -0700 > Subject: [PATCH v70 14/27] pgstat: update docs. > ... > diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml > - These parameters control server-wide statistics collection features. > - When statistics collection is enabled, the data that is produced can be > + These parameters control server-wide cumulative statistics system. > + When enabled, the data that is collected can be Missing "the" ("These parameters control the server-wide cumulative statistics system"). > diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml > + any of the accumulated statistics, acessed values are cached until the end "acessed" => "accessed" > + stats_fetch_consistency can be set > + snapshot, at the price of increased memory usage for Missing "to" ("can be set to snapshot") > + caching not-needed statistics data. Conversely, if it's known that statistics Double space between "data." and "Conversely" (not sure if that matters) > + current transaction's statistics snapshot or cached values (if any). The Double space between "(if any)." and "The" (not sure if that matters) > + next use of statistical information will cause a new snapshot to be built > + or accessed statistics to be cached. I believe this should be an "and", not an "or". (next access builds both a new snapshot and caches accessed statistics) Thanks, Lukas -- Lukas Fittl
Re: shared-memory based stats collector - v70
Hi,

On 2022-04-06 16:24:28 +0700, John Naylor wrote:
> On Wed, Apr 6, 2022 at 10:00 AM Andres Freund wrote:
> > - while working on the above point, I noticed that hash_bytes() showed up
> >   noticeably in profiles, so I replaced it with a fixed-width function
>
> I'm curious about this -- could you direct me to which patch introduces this?

Commit 0010, search for pgstat_hash_key_hash. For simplicity I'm including it
here inline:

/* helpers for dshash / simplehash hashtables */
static inline int
pgstat_hash_key_cmp(const void *a, const void *b, size_t size, void *arg)
{
	AssertArg(size == sizeof(PgStat_HashKey) && arg == NULL);
	return memcmp(a, b, sizeof(PgStat_HashKey));
}

static inline uint32
pgstat_hash_key_hash(const void *d, size_t size, void *arg)
{
	const PgStat_HashKey *key = (PgStat_HashKey *) d;
	uint32		hash;

	AssertArg(size == sizeof(PgStat_HashKey) && arg == NULL);

	hash = murmurhash32(key->kind);
	hash = hash_combine(hash, murmurhash32(key->dboid));
	hash = hash_combine(hash, murmurhash32(key->objoid));

	return hash;
}

Greetings,

Andres Freund
Re: shared-memory based stats collector - v70
Hi, On 2022-04-06 13:31:31 +0200, Alvaro Herrera wrote: > Just skimming a bit here ... Thanks! > On 2022-Apr-05, Andres Freund wrote: > > > From 0532b869033595202d5797b148f22c61e4eb4969 Mon Sep 17 00:00:00 2001 > > From: Andres Freund > > Date: Mon, 4 Apr 2022 16:53:16 -0700 > > Subject: [PATCH v70 10/27] pgstat: store statistics in shared memory. > > > + PgStatsData > > + Waiting fo shared memory stats data access > > + > > Typo "fo" -> "for" Oh, oops. I had fixed that in the wrong patch. > > @@ -5302,7 +5317,9 @@ StartupXLOG(void) > > performedWalRecovery = true; > > } > > else > > + { > > performedWalRecovery = false; > > + } > > Why? :-) Damage from merging two commits yesterday. I'd left open where exactly we'd reset stats, with the "main commit" implementing the current behaviour more closely, and then a followup commit implementing something a bit better. Nobody seemed to argue for keeping the behaviour 1:1, so I merged them. Without removing the parens again :) > Why give pgstat_get_entry_ref the responsibility of initializing > created_entry to false? The vast majority of callers don't care about > that flag; it seems easier/cleaner to set it to false in > pgstat_init_function_usage (the only caller that cares that I could > find) before calling pgstat_prep_pending_entry. It's annoying to have to initialize it, I agree. But I think it's bugprone for the caller to know that it has to be pre-initialized to false. > (I suggest pgstat_prep_pending_entry should have a comment line stating > "*created_entry, if not NULL, is set true if the entry required to be > created.", same as pgstat_get_entry_ref.) Added something along those lines. Greetings, Andres Freund
Re: shared-memory based stats collector - v70
Hi, On 2022-04-06 18:11:04 +0900, Kyotaro Horiguchi wrote: > 0004: > > I can see padding_pgstat_send and fun:pgstat_send in valgrind.supp Those shouldn't be affected by the patch, I think? But I did indeed forget to remove those in 0010. > 0006: > > I'm fine with the categorize for now. > > +#define PGSTAT_KIND_LAST PGSTAT_KIND_WAL > +#define PGSTAT_NUM_KINDS (PGSTAT_KIND_LAST + 1) > > The number of kinds is 10. And PGSTAT_NUM_KINDS is 11? Yea, it's not great. I think I'll introduce INVALID and rename PGSTAT_KIND_FIRST to FIRST_VALID. > + * Don't define an INVALID value so switch() statements can warn if some > + * cases aren't covered. But define the first member to 1 so that > + * uninitialized values can be detected more easily. > > FWIW, I like this. I think there's no switches left now, so it's not actually providing too much. > 0008: > > + xact_desc_stats(buf, "", parsed.nstats, parsed.stats); > + xact_desc_stats(buf, "commit ", parsed.nstats, parsed.stats); > + xact_desc_stats(buf, "abort ", parsed.nabortstats, parsed.abortstats); > > I'm not sure I like this, but I don't object to this.. The string prefixes? Or the entire patch? > 0010: > (I didn't look this closer. The comments arised while looking other > patches.) > > +pgstat_kind_from_str(char *kind_str) > > I don't think I like "str" so much. Don't we spell it as > "pgstat_kind_from_name"? name makes me think of NameData. What do you dislike about str? We seem to use str in plenty places? > 0019: > > +my $og_stats = $datadir . '/' . "pg_stat" . '/' . "pgstat.stat"; > > It can be "$datadir/pg_stat/pgstat.stat" or $datadir . '/pgstat/pgstat.stat'. > Isn't it simpler? Yes, will change. Greetings, Andres Freund
Re: shared-memory based stats collector - v70
Just skimming a bit here ...

On 2022-Apr-05, Andres Freund wrote:

> From 0532b869033595202d5797b148f22c61e4eb4969 Mon Sep 17 00:00:00 2001
> From: Andres Freund
> Date: Mon, 4 Apr 2022 16:53:16 -0700
> Subject: [PATCH v70 10/27] pgstat: store statistics in shared memory.

> +     PgStatsData
> +     Waiting fo shared memory stats data access
> +

Typo "fo" -> "for"

> @@ -5302,7 +5317,9 @@ StartupXLOG(void)
> 		performedWalRecovery = true;
> 	}
> 	else
> +	{
> 		performedWalRecovery = false;
> +	}

Why? :-)

Why give pgstat_get_entry_ref the responsibility of initializing
created_entry to false? The vast majority of callers don't care about that
flag; it seems easier/cleaner to set it to false in
pgstat_init_function_usage (the only caller that cares that I could find)
before calling pgstat_prep_pending_entry.

(I suggest pgstat_prep_pending_entry should have a comment line stating
"*created_entry, if not NULL, is set true if the entry required to be
created.", same as pgstat_get_entry_ref.)

-- 
Álvaro Herrera        PostgreSQL Developer — https://www.EnterpriseDB.com/
"How amazing is that? I call it a night and come back to find that a bug has
been identified and patched while I sleep." (Robert Davidson)
http://archives.postgresql.org/pgsql-sql/2006-03/msg00378.php
Re: shared-memory based stats collector - v70
On Wed, Apr 6, 2022 at 10:00 AM Andres Freund wrote: > - while working on the above point, I noticed that hash_bytes() showed up > noticeably in profiles, so I replaced it with a fixed-width function I'm curious about this -- could you direct me to which patch introduces this? -- John Naylor EDB: http://www.enterprisedb.com
Re: shared-memory based stats collector - v70
At Tue, 5 Apr 2022 20:00:08 -0700, Andres Freund wrote in
> Hi,
>
> Here comes v70:
> - extended / polished the architecture comment based on feedback from Melanie
>   and David
> - other polishing as suggested by David
> - addressed the open issue around pgstat_report_stat(), as described in
>   https://www.postgresql.org/message-id/20220405204019.6yj7ocmpw352c2u5%40alap3.anarazel.de
> - while working on the above point, I noticed that hash_bytes() showed up
>   noticeably in profiles, so I replaced it with a fixed-width function
> - found a few potential regression test instabilities by either *always*
>   flushing in pgstat_report_stat(), or only flushing when force = true.
> - random minor improvements
> - reordered commits some
>
> I still haven't renamed pg_stat_exists_stat() yet - I'm leaning towards
> pg_stat_have_stats() or pg_stat_exists() right now. But it's an SQL function
> for testing, so it doesn't really matter.
>
> I think this is basically ready, minus a a few comment adjustments here and
> there. Unless somebody protests I'm planning to start pushing things tomorrow
> morning.
>
> It'll be a few hours to get to the main commit - but except for 0001 it
> doesn't make sense to push without intending to push later changes too. I
> might squash a few commits togther.
>
> There's lots that can be done once all this is in place, both simplifying
> pre-existing code and easy new features, but that's for a later release.

I'm not sure it's in time but.. (Sorry in advance for possible duplicate or
pointless comments.)

0001:
Looks fine.

0002:
All references to "stats collector" or alike look like they are eliminated
after all of the 24 patches are applied. So this seems fine.

0003:
This is just moving around functions and variables. Looks fine.

0004:
I can see padding_pgstat_send and fun:pgstat_send in valgrind.supp

0005:
The function is changed in a later patch, and it looks fine.

0006:
I'm fine with the categorize for now.

+#define PGSTAT_KIND_LAST PGSTAT_KIND_WAL
+#define PGSTAT_NUM_KINDS (PGSTAT_KIND_LAST + 1)

The number of kinds is 10. And PGSTAT_NUM_KINDS is 11?

+ * Don't define an INVALID value so switch() statements can warn if some
+ * cases aren't covered. But define the first member to 1 so that
+ * uninitialized values can be detected more easily.

FWIW, I like this.

0007:
(mmm no comments)

0008:
+	xact_desc_stats(buf, "", parsed.nstats, parsed.stats);
+	xact_desc_stats(buf, "commit ", parsed.nstats, parsed.stats);
+	xact_desc_stats(buf, "abort ", parsed.nabortstats, parsed.abortstats);

I'm not sure I like this, but I don't object to this..

0009: (skipped)

0010:
(I didn't look this closer. The comments arised while looking other
patches.)

+pgstat_kind_from_str(char *kind_str)

I don't think I like "str" so much. Don't we spell it as
"pgstat_kind_from_name"?

0011: Looks fine.
0012: Looks like covering all related parts.
0013: Just fine.
0014: I believe it:p
0015: Function attributes seem correct. Looks fine.
0016: (skipped, but looks fine by a quick look.)
0017: I don't find a problem with this.
0018: (skipped)

0019:
+my $og_stats = $datadir . '/' . "pg_stat" . '/' . "pgstat.stat";

It can be "$datadir/pg_stat/pgstat.stat" or $datadir . '/pgstat/pgstat.stat'.
Isn't it simpler?

0020-24: (I believe them:p)

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
Re: shared-memory based stats collector - v70
Hi, On 2022-04-05 23:11:07 -0400, Greg Stark wrote: > I've never tried to review a 24-patch series before. It's kind of > intimidating Is there a good place to start to get a good idea of > the most important changes? It was more at some point :). And believe me, I find this whole project intimidating and exhausting. The stats collector is entangled in a lot of places, and there was a lot of preparatory work to get this point. Most of the commits aren't really interesting, I broke them out to make the "main commit" a bit smaller, because it's exhausting to look at a *huge* single commit. I wish I could have broken it down more, but I didn't find a good way. The interesting commit is v70-0010-pgstat-store-statistics-in-shared-memory.patch which actually replaces the stats collector by storing stats in shared memory. It contains a, now hopefully decent, overview of how things work at the top of pgstat.c. Greetings, Andres Freund
Re: shared-memory based stats collector - v70
On Tue, Apr 5, 2022 at 8:11 PM Greg Stark wrote: > I've never tried to review a 24-patch series before. It's kind of > intimidating Is there a good place to start to get a good idea of > the most important changes? > It isn't as bad as the number makes it sound - I just used "git am" to apply the patches to a branch and skimmed each commit separately. Most of them are tests or other minor pieces. The remaining few cover different aspects of the major commit and you can choose them based upon your experience and time. David J.
Re: shared-memory based stats collector - v70
I've never tried to review a 24-patch series before. It's kind of intimidating. Is there a good place to start to get a good idea of the most important changes?
Re: shared-memory based stats collector - v70
On Tue, Apr 5, 2022 at 8:00 PM Andres Freund wrote: > > Here comes v70: > > I think this is basically ready, minus a a few comment adjustments here and > there. Unless somebody protests I'm planning to start pushing things > tomorrow > morning. > > Nothing I've come across, given my area of experience, gives me pause. I'm mostly going to focus on docs and comments at this point - to try and help the next person in my position (and end-users) have an easier go at on-boarding. Toward that end, I did just add a "Purpose" section writeup to the v69 thread. David J.