Re: shared-memory based stats collector - v70
Hi,

On 2022-08-22 11:32:14 +0900, Kyotaro Horiguchi wrote:
> At Wed, 10 Aug 2022 14:02:34 +0200, "Drouvot, Bertrand" wrote in
> > what about?
> >
> > +	/*
> > +	 * Acquire the LWLock directly instead of using
> > +	 * pg_stat_lock_entry_shared() which requires a reference.
> > +	 */
> >
> > I think that's more consistent with other comments mentioning LWLock
> > acquisition.
>
> Sure. Thanks! I did that in the attached.

Pushed, thanks!

Greetings,

Andres Freund
Re: shared-memory based stats collector - v70
At Wed, 10 Aug 2022 14:02:34 +0200, "Drouvot, Bertrand" wrote in
> what about?
>
> +	/*
> +	 * Acquire the LWLock directly instead of using
> +	 * pg_stat_lock_entry_shared() which requires a reference.
> +	 */
>
> I think that's more consistent with other comments mentioning LWLock
> acquisition.

Sure. Thanks! I did that in the attached.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

>From 202ef49a8885f46e339a6d81c723ac3b0a7d55ca Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi
Date: Mon, 22 Aug 2022 11:29:07 +0900
Subject: [PATCH v2] Acquire lock properly when building stats snapshot

It got lost somewhere while developing the shared memory stats collector,
but it is undoubtedly required.
---
 src/backend/utils/activity/pgstat.c       |  8 ++++++++
 src/backend/utils/activity/pgstat_shmem.c | 12 ++++++++++++
 src/include/utils/pgstat_internal.h       |  1 +
 3 files changed, 21 insertions(+)

diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index 88e5dd1b2b..95f09cfcc7 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -844,9 +844,11 @@ pgstat_fetch_entry(PgStat_Kind kind, Oid dboid, Oid objoid)
 	else
 		stats_data = MemoryContextAlloc(pgStatLocal.snapshot.context,
 										kind_info->shared_data_len);
+	pgstat_lock_entry_shared(entry_ref, false);
 	memcpy(stats_data,
 		   pgstat_get_entry_data(kind, entry_ref->shared_stats),
 		   kind_info->shared_data_len);
+	pgstat_unlock_entry(entry_ref);
 
 	if (pgstat_fetch_consistency > PGSTAT_FETCH_CONSISTENCY_NONE)
 	{
@@ -983,9 +985,15 @@ pgstat_build_snapshot(void)
 		entry->data = MemoryContextAlloc(pgStatLocal.snapshot.context,
 										 kind_info->shared_size);
+		/*
+		 * Acquire the LWLock directly instead of using
+		 * pg_stat_lock_entry_shared() which requires a reference.
+		 */
+		LWLockAcquire(&stats_data->lock, LW_SHARED);
 		memcpy(entry->data,
 			   pgstat_get_entry_data(kind, stats_data),
 			   kind_info->shared_size);
+		LWLockRelease(&stats_data->lock);
 	}
 
 	dshash_seq_term(&hstat);
diff --git a/src/backend/utils/activity/pgstat_shmem.c b/src/backend/utils/activity/pgstat_shmem.c
index 89060ef29a..0bfa460af1 100644
--- a/src/backend/utils/activity/pgstat_shmem.c
+++ b/src/backend/utils/activity/pgstat_shmem.c
@@ -579,6 +579,18 @@ pgstat_lock_entry(PgStat_EntryRef *entry_ref, bool nowait)
 	return true;
 }
 
+bool
+pgstat_lock_entry_shared(PgStat_EntryRef *entry_ref, bool nowait)
+{
+	LWLock	   *lock = &entry_ref->shared_stats->lock;
+
+	if (nowait)
+		return LWLockConditionalAcquire(lock, LW_SHARED);
+
+	LWLockAcquire(lock, LW_SHARED);
+	return true;
+}
+
 void
 pgstat_unlock_entry(PgStat_EntryRef *entry_ref)
 {
diff --git a/src/include/utils/pgstat_internal.h b/src/include/utils/pgstat_internal.h
index 9303d05427..901d2041d6 100644
--- a/src/include/utils/pgstat_internal.h
+++ b/src/include/utils/pgstat_internal.h
@@ -581,6 +581,7 @@ extern void pgstat_detach_shmem(void);
 extern PgStat_EntryRef *pgstat_get_entry_ref(PgStat_Kind kind, Oid dboid, Oid objoid,
 											 bool create, bool *found);
 extern bool pgstat_lock_entry(PgStat_EntryRef *entry_ref, bool nowait);
+extern bool pgstat_lock_entry_shared(PgStat_EntryRef *entry_ref, bool nowait);
 extern void pgstat_unlock_entry(PgStat_EntryRef *entry_ref);
 extern bool pgstat_drop_entry(PgStat_Kind kind, Oid dboid, Oid objoid);
 extern void pgstat_drop_all_entries(void);
-- 
2.31.1
Re: shared-memory based stats collector - v70
Hi,

On 8/18/22 9:51 PM, Andres Freund wrote:
> Hi,
>
> On 2022-08-18 15:26:31 -0400, Greg Stark wrote:
> > And indexes of course. It's a bit frustrating since without the
> > catalog you won't know what table the index actually is for... But
> > they're pretty important stats.
>
> FWIW, I think we should split relation stats into table and index stats.
> Historically it'd have added a lot of complexity to separate the two, but
> I don't think that's the case anymore. And we waste space for index stats
> by having lots of table specific fields.

It seems to me that we should work on that first then, what do you think?
(If so I can try to have a look at it.)

And once done, then resume the work to provide the APIs to get all
tables/indexes from all the databases. That way we'll be able to provide
one API for the tables and one for the indexes (instead of one API for
both like my current POC is doing).

> > On that note though... What do you think about having the capability
> > to add other stats kinds to the stats infrastructure?

I think that's a good idea and that would be great to have.

> Getting closer to that was one of my goals working on the shared memory
> stats stuff.
>
> > It would make a lot of sense for pg_stat_statements to add its entries
> > here instead of having to reimplement a lot of the same magic.
>
> Yes, we should move pg_stat_statements over.
>
> It's pretty easy to get massive contention on stats entries with
> pg_stat_statements, because it doesn't have support for "batching"
> updates to shared stats. And reimplementing the same logic in
> pg_stat_statements.c doesn't make sense.
>
> And the set of normalized queries could probably be stored in DSA as
> well - the file based thing we have right now is problematic.
>
> > To do that I guess more of the code needs to be moved to be table
> > driven from the kind structs either with callbacks or with other meta
> > data.
>
> Pretty much all of it already is. The only substantial missing bit is
> reading/writing of stats files, but that should be pretty easy. And of
> course making the callback array extensible.
>
> > So the kind record could contain tupledesc and the code to construct
> > the returned tuple so that these functions could return any custom
> > entry as well as the standard entries.
>
> I don't see how this would work well - we don't have functions returning
> variable kinds of tuples. And what would convert a struct to a tuple?
>
> Nor do I think it's needed - if you have an extension providing a new
> stats kind it can also provide accessors.

I think the same (the extension should be able to do that).

I really like the idea of being able to provide new stats kinds.

Regards,

--
Bertrand Drouvot
Amazon Web Services: https://aws.amazon.com
Re: shared-memory based stats collector - v70
Hi,

On 2022-08-18 15:26:31 -0400, Greg Stark wrote:
> And indexes of course. It's a bit frustrating since without the
> catalog you won't know what table the index actually is for... But
> they're pretty important stats.

FWIW, I think we should split relation stats into table and index stats.
Historically it'd have added a lot of complexity to separate the two, but I
don't think that's the case anymore. And we waste space for index stats by
having lots of table specific fields.

> On that note though... What do you think about having the capability
> to add other stats kinds to the stats infrastructure?

Getting closer to that was one of my goals working on the shared memory
stats stuff.

> It would make a lot of sense for pg_stat_statements to add its entries
> here instead of having to reimplement a lot of the same magic.

Yes, we should move pg_stat_statements over.

It's pretty easy to get massive contention on stats entries with
pg_stat_statements, because it doesn't have support for "batching" updates
to shared stats. And reimplementing the same logic in pg_stat_statements.c
doesn't make sense.

And the set of normalized queries could probably be stored in DSA as well -
the file based thing we have right now is problematic.

> To do that I guess more of the code needs to be moved to be table
> driven from the kind structs either with callbacks or with other meta
> data.

Pretty much all of it already is. The only substantial missing bit is
reading/writing of stats files, but that should be pretty easy. And of
course making the callback array extensible.

> So the kind record could contain tupledesc and the code to construct the
> returned tuple so that these functions could return any custom entry as
> well as the standard entries.

I don't see how this would work well - we don't have functions returning
variable kinds of tuples. And what would convert a struct to a tuple?

Nor do I think it's needed - if you have an extension providing a new stats
kind it can also provide accessors.

Greetings,

Andres Freund
Re: shared-memory based stats collector - v70
On Thu, 18 Aug 2022 at 02:27, Drouvot, Bertrand wrote:
>
> What I had in mind is to provide an API to retrieve stats for those that
> would need to connect to each database individually otherwise.
>
> That's why I focused on PGSTAT_KIND_RELATION that has
> PgStat_KindInfo.accessed_across_databases set to false.
>
> I think that another candidate could also be PGSTAT_KIND_FUNCTION.

And indexes of course. It's a bit frustrating since without the catalog
you won't know what table the index actually is for... But they're pretty
important stats.

On that note though... What do you think about having the capability to
add other stats kinds to the stats infrastructure? It would make a lot of
sense for pg_stat_statements to add its entries here instead of having to
reimplement a lot of the same magic. And I have in mind an extension that
allows adding other stats and it would be nice to avoid having to
reimplement any of this.

To do that I guess more of the code needs to be moved to be table driven
from the kind structs either with callbacks or with other meta data. So
the kind record could contain tupledesc and the code to construct the
returned tuple so that these functions could return any custom entry as
well as the standard entries.

--
greg
Re: shared-memory based stats collector - v70
Hi,

On 2022-08-17 15:46:42 -0400, Greg Stark wrote:
> Isn't there also a local hash table used to find the entries to reduce
> traffic on the shared hash table? Even if you don't take a snapshot
> does it get entered there? There are definitely still parts of this
> I'm working on a pretty vague understanding of :/

Yes, there is. But it's more about code that generates stats, rather than
reporting functions. While there's backend local pending stats we need to
have a refcount on the shared stats item so that the stats item can't be
dropped and then revived when those local stats are flushed.

Relevant comments from pgstat.c:

 * To avoid contention on the shared hashtable, each backend has a
 * backend-local hashtable (pgStatEntryRefHash) in front of the shared
 * hashtable, containing references (PgStat_EntryRef) to shared hashtable
 * entries. The shared hashtable only needs to be accessed when no prior
 * reference is found in the local hashtable. Besides pointing to the
 * shared hashtable entry (PgStatShared_HashEntry) PgStat_EntryRef also
 * contains a pointer to the shared statistics data, as a process-local
 * address, to reduce access costs.
 *
 * The names for structs stored in shared memory are prefixed with
 * PgStatShared instead of PgStat. Each stats entry in shared memory is
 * protected by a dedicated lwlock.
 *
 * Most stats updates are first accumulated locally in each process as
 * pending entries, then later flushed to shared memory (just after commit,
 * or by idle-timeout). This practically eliminates contention on
 * individual stats entries. For most kinds of variable-numbered pending
 * stats data is stored in PgStat_EntryRef->pending. All entries with
 * pending data are in the pgStatPending list. Pending statistics updates
 * are flushed out by pgstat_report_stat().

pgstat_internal.h has more details about the refcount aspect:

 * Per-object statistics are stored in the "shared stats" hashtable. That
 * table's entries (PgStatShared_HashEntry) contain a pointer to the actual
 * stats data for the object (the size of the stats data varies depending
 * on the kind of stats). The table is keyed by PgStat_HashKey.
 *
 * Once a backend has a reference to a shared stats entry, it increments
 * the entry's refcount. Even after stats data is dropped (e.g., due to a
 * DROP TABLE), the entry itself can only be deleted once all references
 * have been released.

Greetings,

Andres Freund
Re: shared-memory based stats collector - v70
On Tue, 16 Aug 2022 at 08:49, Drouvot, Bertrand wrote:
>
> +		if (p->key.kind != PGSTAT_KIND_RELATION)
> +			continue;

Hm. So presumably this needs to be extended. Either to let the caller
decide which types of stats to return or to somehow return all the stats
intermixed. In my monitoring code I did the latter because I didn't think
going through the hash table repeatedly would be very efficient. But it's
definitely a pretty awkward API since I need a switch statement that
explicitly lists each case and casts the result.

> > 2) When I did the function attached above I tried to avoid returning
> > the whole set and make it possible to process them as they arrive.
>
> Is it the way it has been done? (did not look at your function yet)

I did it with callbacks. It was quick and easy and convenient for my use
case. But in general I don't really like callbacks and would think some
kind of iterator style api would be nicer.

I am handling the stats entries as they turn up. I'm constructing the text
output for each in a callback and buffering up the whole http response in
a string buffer. I think that's ok, but if I wanted to avoid buffering it
up and do network i/o then I would think the thing to do would be to build
the list of entry keys and then loop over that list doing a hash lookup
for each one, generating the response for each and writing it to the
network. That way there wouldn't be anything locked, not even the hash
table, while doing network i/o. It would mean a lot of traffic on the hash
table though.

> > -- on that note I wonder if I've done something
> > wrong because I noted a few records with InvalidOid where I didn't
> > expect it.
>
> It looks like that InvalidOid for the dbid means that the entry is for a
> shared relation.

Ah yes. I had actually found that but forgotten it. There's also a
database entry with dboid=InvalidOid which is apparently where background
workers with no database attached report stats.

> I've in mind to add some filtering on the dbid (I think it could be
> useful for monitoring tools with a persistent connection to one database
> but that want to pull the stats database per database).
>
> I don't think a look up through the local cache will work if the
> entry/key is related to another database the API is launched from.

Isn't there also a local hash table used to find the entries to reduce
traffic on the shared hash table? Even if you don't take a snapshot does
it get entered there? There are definitely still parts of this I'm working
on a pretty vague understanding of :/

--
greg
Re: shared-memory based stats collector - v70
On Thu, 11 Aug 2022 at 02:11, Drouvot, Bertrand wrote:
>
> As Andres was not -1 about that idea (as it should be low cost to add
> and maintain) as long as somebody cares enough to write something: then
> I'll give it a try and submit a patch for it.

I agree it would be a useful feature. I think there may be things to talk
about here though.

1) Are you planning to go through the local hash table and LocalSnapshot
and obey the consistency mode? I was thinking a flag passed to
build_snapshot to request global mode might be sufficient instead of a
completely separate function.

2) When I did the function attached above I tried to avoid returning the
whole set and make it possible to process them as they arrive. I actually
was hoping to get to the point where I could start shipping out network
data as they arrive and not even buffer up the response, but I think I
need to be careful about hash table locking then.

3) The key difference here is that we're returning whatever stats are in
the hash table rather than using the catalog to drive a list of id
numbers to look up. I guess the API should make it clear this is what is
being returned -- on that note I wonder if I've done something wrong
because I noted a few records with InvalidOid where I didn't expect it.

4) I'm currently looping over the hash table returning the records all
intermixed. Some users will probably want to do things like "return all
Relation records for all databases" or "return all Index records for
database id xxx". So some form of filtering may be best, or perhaps a way
to retrieve just the keys so they can then be looked up one by one
(through the local cache?).

5) On that note I'm not clear how the local cache will interact with
these cross-database lookups. That should probably be documented...

--
greg
Re: shared-memory based stats collector - v70
Hi,

On 8/10/22 11:25 PM, Greg Stark wrote:
> On Wed, 10 Aug 2022 at 04:05, Drouvot, Bertrand wrote:
> > On 8/9/22 6:00 PM, Greg Stark wrote:
> > > On Tue, 9 Aug 2022 at 06:19, Drouvot, Bertrand wrote:
> > > > What do you think about adding a function in core PG to provide
> > > > such functionality? (means being able to retrieve all the stats (+
> > > > eventually add some filtering) without the need to connect to each
> > > > database).
> > >
> > > I'm working on it myself too. I'll post a patch for discussion in a
> > > bit.
> >
> > Great! Thank you!
>
> So I was adding the code to pgstat.c because I had thought there were
> some data types I needed and/or static functions I needed. However you
> and Andres encouraged me to check again now. And indeed I was able,
> after fixing a couple things, to make the code work entirely externally.

Nice!

Though I still think an SQL API in core could be useful too. As Andres was
not -1 about that idea (as it should be low cost to add and maintain) as
long as somebody cares enough to write something: then I'll give it a try
and submit a patch for it.

> This is definitely not polished and there's a couple obvious things
> missing. But at the risk of embarrassment I've attached my WIP. Please
> be gentle :) I'll post the github link in a bit when I've fixed up some
> meta info.

Thanks! I will have a look at it on github (once you share the link).

Regards,

--
Bertrand Drouvot
Amazon Web Services: https://aws.amazon.com
Re: shared-memory based stats collector - v70
Hi,

On 2022-08-10 15:48:15 -0400, Greg Stark wrote:
> One thing that's bugging me is that the names we use for these stats
> are *all* over the place.

Yes. I had a huge issue with this when polishing the patch. And
Horiguchi-san did as well. I had to limit the amount of cleanup done to
make it feasible to get anything committed. I think it's a bit less bad
than before, but by no means good.

> * The pg_stat_bgwriter view returns data from two different fixed
> entries, the checkpointer and the bgwriter, is there a reason those
> are kept separately but then reported as if they're one thing?

Historical raisins. Checkpointer and bgwriter used to be one thing, but
aren't anymore.

Greetings,

Andres Freund
Re: shared-memory based stats collector - v70
Hi,

On 2022-08-10 14:18:25 -0400, Greg Stark wrote:
> > I don't think that's a large enough issue to worry about unless you're
> > polling at a very high rate, which'd be a bad idea in itself. If a
> > backend can't get the lock for some stats change it'll defer flushing
> > the stats a bit, so it'll not cause a lot of other problems.
>
> Hm. I wonder if we're on the same page about what constitutes a "high
> rate".
>
> I've seen people try to push prometheus or other similar systems to 5s
> poll intervals. That would be challenging for Postgres due to the
> volume of statistics. The default is 30s and people often struggle to
> even have that function for large fleets. But if you had a small
> fleet, perhaps an iot style system with a "one large table" type of
> schema, you might well want stats every 5s or even every 1s.

That's probably fine. Although I think you might run into trouble not from
the stats subsystem side, but from the "amount of data" side. On a system
with a lot of objects that can be a fair amount.

If you really want to do very low latency stats reporting, I suspect you'd
have to build an incremental system.

> > I'm *dead* set against including catalog names in shared memory stats.
> > That'll add a good amount of memory usage and complexity, without any
> > sort of commensurate gain.
>
> Well it's pushing the complexity there from elsewhere. If the labels
> aren't in the stats structures then the exporter needs to connect to
> each database, gather all the names into some local cache and then it
> needs to worry about keeping it up to date. And if there are any
> database problems such as disk errors or catalog objects being locked
> then your monitoring breaks, though perhaps it can be limited to just
> missing some object names or having out of date names.

Shrug. If the stats system state desynchronizes from an alter table rename
you'll also have a problem in monitoring.

And even if you can benefit from having all that information, it'd still
be an overhead borne by everybody for a very small share of users.

> > > I also think it would be nice to have a change counter for every
> > > stat object, or perhaps a change time. Prometheus wouldn't be able
> > > to make use of it but other monitoring software might be able to
> > > receive only metrics that have changed since the last update which
> > > would really help on databases with large numbers of mostly static
> > > objects.
> >
> > I think you're proposing adding overhead that doesn't even have a real
> > user.
>
> I guess I'm just brainstorming here. I don't need it currently, no. It
> doesn't seem like significant overhead though compared to the locking
> and copying?

Yes, timestamps aren't cheap to determine (nor free to store, but that's a
lesser issue).

Greetings,

Andres Freund
Re: shared-memory based stats collector - v70
On Wed, 10 Aug 2022 at 04:05, Drouvot, Bertrand wrote:
>
> Hi,
>
> On 8/9/22 6:00 PM, Greg Stark wrote:
> > On Tue, 9 Aug 2022 at 06:19, Drouvot, Bertrand wrote:
> > >
> > > What do you think about adding a function in core PG to provide such
> > > functionality? (means being able to retrieve all the stats (+
> > > eventually add some filtering) without the need to connect to each
> > > database).
> >
> > I'm working on it myself too. I'll post a patch for discussion in a
> > bit.
>
> Great! Thank you!

So I was adding the code to pgstat.c because I had thought there were some
data types I needed and/or static functions I needed. However you and
Andres encouraged me to check again now. And indeed I was able, after
fixing a couple things, to make the code work entirely externally.

This is definitely not polished and there's a couple obvious things
missing. But at the risk of embarrassment I've attached my WIP. Please be
gentle :) I'll post the github link in a bit when I've fixed up some meta
info.

I'm definitely not wedded to the idea of using callbacks, it was just the
most convenient way to get started, especially when I was putting the main
loop in pgstat.c. Ideally I do want to keep open the possibility of
streaming the results out without buffering the whole set in memory.

> Out of curiosity, would you be also interested by such a feature for
> previous versions (that will not get the patch in)?

I always had trouble understanding the existing stats code so I was hoping
the new code would make it easier. It seems to have worked but it's
possible I'm wrong and it was always possible and the problem was always
just me :)

--
greg

/*-------------------------------------------------------------------------
 *
 * telemetry.c
 *
 * Most of this code was copied from pg_prewarm.c as a template.
 *
 *-------------------------------------------------------------------------
 */

#include "postgres.h"

#include <sys/types.h>
#include <sys/select.h>
#include <sys/socket.h>

#include "access/relation.h"
#include "access/xact.h"
#include "catalog/pg_class.h"
#include "catalog/pg_type.h"
#include "miscadmin.h"
#include "pgstat.h"
#include "postmaster/bgworker.h"
#include "postmaster/interrupt.h"
#include "storage/buf_internals.h"
#include "storage/dsm.h"
#include "storage/ipc.h"
#include "storage/latch.h"
#include "storage/lwlock.h"
#include "storage/proc.h"
#include "storage/procsignal.h"
#include "storage/shmem.h"
#include "storage/smgr.h"
#include "tcop/tcopprot.h"
#include "utils/acl.h"
#include "utils/datetime.h"
#include "utils/guc.h"
#include "utils/memutils.h"
#include "utils/rel.h"
#include "utils/resowner.h"

#include "telemetry.h"
#include "telemetry_pgstat.h"

PG_MODULE_MAGIC;

/*
 * We should already have included what we need to get uint64_t size_t
 * fd_set socklen_t and struct sockaddr so don't let microhttpd include
 * extra stuff for them.
 */
#define MHD_PLATFORM_H
#include <microhttpd.h>

/* MHD_USE_EPOLL */
/* MHD_USE_AUTO */
/* MHD_OPTION_CONNECTION_LIMIT */
/* MHD_OPTION_CONNECTION_TIMEOUT */
/* MHD_OPTION_EXTERNAL_LOGGER */

/* Background worker harness */
void		_PG_init(void);

/* Actual internal background worker main entry point */
static void telemetry_start_worker(unsigned short port);

/* GUC variables. */
static int	telemetry_port = 9187;	/* TCP port to listen on for metrics */
static char *telemetry_listen_addresses;	/* TCP listen addresses */
static bool telemetry = true;	/* start worker automatically on default port */

static enum MHD_Result telemetry_handler(void *cls,
										 struct MHD_Connection *connection,
										 const char *url,
										 const char *method,
										 const char *version,
										 const char *upload_data,
										 size_t *upload_data_size,
										 void **con_cls);

/*
 * Module load callback.
 */
void
_PG_init(void)
{
	DefineCustomIntVariable("telemetry.port",
							"TCP Port to serve metrics on by default",
							NULL,
							&telemetry_port,
							9187, 1, 65535,
							PGC_SIGHUP,
							0,	/* flags */
							NULL, NULL, NULL	/* hooks */
		);
	DefineCustomStringVariable("telemetry.listen_addresses",
							   "TCP Listen Addresses to serve metrics on by default",
							   NULL,
							   &telemetry_listen_addresses,
							   "*",
							   PGC_SIGHUP,
							   GUC_LIST_INPUT,	/* flags */
							   NULL, NULL, NULL /* hooks */
		);

	if (!process_shared_preload_libraries_in_progress)
		return;

	/* can't define PGC_POSTMASTER variable after startup */
	DefineCustomBoolVariable("telemetry.start_server",
							 "Starts the telemetry worker on startup.",
							 NULL,
							 &telemetry,
							 true,
							 PGC_POSTMASTER,
							 0,
							 NULL, NULL, NULL);

	EmitWarningsOnPlaceholders("telemetry");

	/* Register telemetry worker, if enabled. */
	if (telemetry)
		telemetry_start_worker(telemetry_port);
}

static void telemetry_logger(void *arg, const char *fmt, va_list ap);

struct MHD_OptionItem mhd_ops[] = {
	{MHD_OPTION_EXTERNAL_LOGGER, (intptr_t) telemetry_logger, NULL},
	{
Re: shared-memory based stats collector - v70
One thing that's bugging me is that the names we use for these stats are
*all* over the place.

The names go through three different stages:

pgstat structs -> pgstatfunc tupledescs -> pg_stat_* views

(Followed by a fourth stage where pg_exporter or whatever renames them for
the monitoring software.)

And for some reason both transitions (plus the exporter) felt the need to
fiddle with the names or values. And not in any sort of even vaguely
consistent way. So there are three (or four) different sets of names for
the same metrics :(

e.g.

* Some of the struct elements have abbreviated words which are expanded in
the tupledesc names or the view columns -- some have long names which get
abbreviated.

* Some struct members have n_ prefixes (presumably to avoid C keywords or
other namespace issues?) and then lose them at one of the other stages.
But then the relation stats do not have n_ prefixes and then the pg_stat
view *adds* n_ prefixes in the SQL view!

* Some columns are added together in the SQL view which seems like
gratuitously hiding information from the user. The pg_stat_*_tables view
actually looks up information from the indexes stats and combines them to
get idx_scan and idx_tup_fetch.

* The pg_stat_bgwriter view returns data from two different fixed entries,
the checkpointer and the bgwriter, is there a reason those are kept
separately but then reported as if they're one thing?

Some of the simpler renaming could be transparently fixed by making the
internal stats match the public facing names. But for many of them I think
the internal names are better. And the cases where the views aggregate
data in a way that loses information are not something I want to
reproduce.

I had intended to use the internal names directly, reasoning that
transparency and consistency are the direction to be headed. But in some
cases I think the current public names are the better choice -- I
certainly don't want to remove n_* prefixes from some names but then add
them to different names! And some of the cases where the data is combined
or modified do seem like they would be missed.

--
greg
Re: shared-memory based stats collector - v70
On Tue, 9 Aug 2022 at 12:48, Andres Freund wrote:
> > The reason I want a C function is I'm trying to get as far as I can
> > without a connection to a database, without a transaction, without
> > accessing the catalog, and as much as possible without taking locks.
>
> I assume you don't include lwlocks under locks?

I guess it depends on which lwlock :) I would be leery of a monitoring
system taking an lwlock that could interfere with regular transactions
doing work. And taking a lock that is itself the cause of the problem
elsewhere that you really need stats to debug would be a deal breaker.

> I don't think that's a large enough issue to worry about unless you're
> polling at a very high rate, which'd be a bad idea in itself. If a
> backend can't get the lock for some stats change it'll defer flushing
> the stats a bit, so it'll not cause a lot of other problems.

Hm. I wonder if we're on the same page about what constitutes a "high
rate".

I've seen people try to push prometheus or other similar systems to 5s
poll intervals. That would be challenging for Postgres due to the volume
of statistics. The default is 30s and people often struggle to even have
that function for large fleets. But if you had a small fleet, perhaps an
iot style system with a "one large table" type of schema, you might well
want stats every 5s or even every 1s.

> I'm *dead* set against including catalog names in shared memory stats.
> That'll add a good amount of memory usage and complexity, without any
> sort of commensurate gain.

Well it's pushing the complexity there from elsewhere. If the labels
aren't in the stats structures then the exporter needs to connect to each
database, gather all the names into some local cache and then it needs to
worry about keeping it up to date. And if there are any database problems
such as disk errors or catalog objects being locked then your monitoring
breaks, though perhaps it can be limited to just missing some object names
or having out of date names.

> > I also think it would be nice to have a change counter for every stat
> > object, or perhaps a change time. Prometheus wouldn't be able to make
> > use of it but other monitoring software might be able to receive only
> > metrics that have changed since the last update which would really
> > help on databases with large numbers of mostly static objects.
>
> I think you're proposing adding overhead that doesn't even have a real
> user.

I guess I'm just brainstorming here. I don't need it currently, no. It
doesn't seem like significant overhead though compared to the locking and
copying?

--
greg
Re: shared-memory based stats collector - v70
At Tue, 9 Aug 2022 09:53:19 -0700, Andres Freund wrote in > Hi, > > On 2022-08-09 17:24:35 +0900, Kyotaro Horiguchi wrote: > > If I'm not missing something, it's strange that pgstat_lock_entry() > > only takes LW_EXCLUSIVE. > > I think it makes some sense, given that there's a larger number of callers for > that in various stats-emitting code. Perhaps we could just add a separate > function with a _shared() suffix? Sure. That was an alternative I had in my mind. > > The atached changes the interface of > > pgstat_lock_entry() but there's only one user since another read of > > shared stats entry is not using reference. Thus the interface change > > might be too much. If I just add bare LWLockAcquire/Release() to > > pgstat_fetch_entry,the amount of the patch will be reduced. > > Could you try the pgstat_lock_entry_shared() approach? Of course. Please find the attached. regards. -- Kyotaro Horiguchi NTT Open Source Software Center >From f406fd384d7ca145e37810a2f6a7a410f74d5795 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi Date: Wed, 10 Aug 2022 10:53:19 +0900 Subject: [PATCH] Take lock while building stats snapshot It got lost somewhere while developing shared memory stats collector but it is undoubtedly required. 
---
 src/backend/utils/activity/pgstat.c       |  8 ++++++++
 src/backend/utils/activity/pgstat_shmem.c | 12 ++++++++++++
 src/include/utils/pgstat_internal.h       |  1 +
 3 files changed, 21 insertions(+)

diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index 88e5dd1b2b..c23142a670 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -844,9 +844,11 @@ pgstat_fetch_entry(PgStat_Kind kind, Oid dboid, Oid objoid)
 	else
 		stats_data = MemoryContextAlloc(pgStatLocal.snapshot.context,
 										kind_info->shared_data_len);
+	pgstat_lock_entry_shared(entry_ref, false);
 	memcpy(stats_data,
 		   pgstat_get_entry_data(kind, entry_ref->shared_stats),
 		   kind_info->shared_data_len);
+	pgstat_unlock_entry(entry_ref);
 
 	if (pgstat_fetch_consistency > PGSTAT_FETCH_CONSISTENCY_NONE)
 	{
@@ -983,9 +985,15 @@ pgstat_build_snapshot(void)
 			entry->data = MemoryContextAlloc(pgStatLocal.snapshot.context,
 											 kind_info->shared_size);
+			/*
+			 * Take lwlock directly instead of using pg_stat_lock_entry_shared()
+			 * which requires a reference.
+			 */
+			LWLockAcquire(&stats_data->lock, LW_SHARED);
 			memcpy(entry->data,
 				   pgstat_get_entry_data(kind, stats_data),
 				   kind_info->shared_size);
+			LWLockRelease(&stats_data->lock);
 		}
 
 		dshash_seq_term(&hstat);
diff --git a/src/backend/utils/activity/pgstat_shmem.c b/src/backend/utils/activity/pgstat_shmem.c
index 89060ef29a..0bfa460af1 100644
--- a/src/backend/utils/activity/pgstat_shmem.c
+++ b/src/backend/utils/activity/pgstat_shmem.c
@@ -579,6 +579,18 @@ pgstat_lock_entry(PgStat_EntryRef *entry_ref, bool nowait)
 	return true;
 }
 
+bool
+pgstat_lock_entry_shared(PgStat_EntryRef *entry_ref, bool nowait)
+{
+	LWLock	   *lock = &entry_ref->shared_stats->lock;
+
+	if (nowait)
+		return LWLockConditionalAcquire(lock, LW_SHARED);
+
+	LWLockAcquire(lock, LW_SHARED);
+	return true;
+}
+
 void
 pgstat_unlock_entry(PgStat_EntryRef *entry_ref)
 {
diff --git a/src/include/utils/pgstat_internal.h b/src/include/utils/pgstat_internal.h
index 9303d05427..901d2041d6 100644
--- a/src/include/utils/pgstat_internal.h
+++ b/src/include/utils/pgstat_internal.h
@@ -581,6 +581,7 @@ extern void pgstat_detach_shmem(void);
 extern PgStat_EntryRef *pgstat_get_entry_ref(PgStat_Kind kind, Oid dboid, Oid objoid,
 											 bool create, bool *found);
 extern bool pgstat_lock_entry(PgStat_EntryRef *entry_ref, bool nowait);
+extern bool pgstat_lock_entry_shared(PgStat_EntryRef *entry_ref, bool nowait);
 extern void pgstat_unlock_entry(PgStat_EntryRef *entry_ref);
 extern bool pgstat_drop_entry(PgStat_Kind kind, Oid dboid, Oid objoid);
 extern void pgstat_drop_all_entries(void);
-- 
2.31.1
Re: shared-memory based stats collector - v70
Hi,

On 2022-08-09 17:24:35 +0900, Kyotaro Horiguchi wrote:
> At Mon, 8 Aug 2022 11:53:19 -0400, Greg Stark wrote in
> > I'm trying to wrap my head around the shared memory stats collector
> > infrastructure from
> > <20220406030008.2qxipjxo776dw...@alap3.anarazel.de> committed in
> > 5891c7a8ed8f2d3d577e7eea34dacff12d7b6bbd.
> >
> > I have one question about locking -- afaics there's nothing protecting
> > reading the shared memory stats. There is an lwlock protecting
> > concurrent updates of the shared memory stats, but that lock isn't
> > taken when we read the stats. Are we ok relying on atomic 64-bit reads
> > or is there something else going on that I'm missing?

Yes, that's not right. Not sure how it ended up that way. There was a lot of refactoring and pushing down the locking into different places, I guess it got lost somewhere on the way :(. It's unlikely to be a large problem, but we should fix it.

> If I'm not missing something, it's strange that pgstat_lock_entry()
> only takes LW_EXCLUSIVE.

I think it makes some sense, given that there's a larger number of callers for that in various stats-emitting code. Perhaps we could just add a separate function with a _shared() suffix?

> The attached changes the interface of
> pgstat_lock_entry() but there's only one user since another read of
> shared stats entry is not using a reference. Thus the interface change
> might be too much. If I just add bare LWLockAcquire/Release() to
> pgstat_fetch_entry, the amount of the patch will be reduced.

Could you try the pgstat_lock_entry_shared() approach?

Greetings,

Andres Freund
Re: shared-memory based stats collector - v70
Hi, On 2022-08-09 12:00:46 -0400, Greg Stark wrote: > I was more aiming at a C function that extensions could use directly > rather than an SQL function -- though I suppose having the former it > would be simple enough to implement the latter using it. (though it > would have to be one for each stat type I guess) I think such a C extension could exist today, without patching core code? It'd be a bit ugly to include pgstat_internal.h, I guess, but other than that... > The reason I want a C function is I'm trying to get as far as I can > without a connection to a database, without a transaction, without > accessing the catalog, and as much as possible without taking locks. I assume you don't include lwlocks under locks? > I think this is important for making monitoring highly reliable and low > impact on production. I'm doubtful about that, but whatever. > The main problem with my current code is that I'm accessing the shared > memory hash table directly. This means that I'm possibly introducing > locking contention on the shared memory hash table. I don't think that's a large enough issue to worry about unless you're polling at a very high rate, which'd be a bad idea in itself. If a backend can't get the lock for some stats change it'll defer flushing the stats a bit, so it'll not cause a lot of other problems. > I'm thinking of separating the shared memory hash scan from the metric scan > so the list can be quickly built minimizing the time the lock is held. I'd really really want to see some evidence that any sort of complexity here is worth it. > I have a few things I would like to suggest for future improvements to > this infrastructure. I haven't polished the details of it yet but the > main thing I think I'm missing is the catalog name for the object. I > don't want to have to fetch it from the catalog and in any case I > think it would generally be useful and might regularize the > replication slot handling too. 
I'm *dead* set against including catalog names in shared memory stats. That'll add a good amount of memory usage and complexity, without any sort of commensurate gain. > I also think it would be nice to have a change counter for every stat > object, or perhaps a change time. Prometheus wouldn't be able to make > use of it but other monitoring software might be able to receive only > metrics that have changed since the last update which would really > help on databases with large numbers of mostly static objects. I think you're proposing adding overhead that doesn't even have a real user. Greetings, Andres Freund
Re: shared-memory based stats collector - v70
Hi, On 2022-08-09 12:18:47 +0200, Drouvot, Bertrand wrote: > What do you think about adding a function in core PG to provide such > functionality? (means being able to retrieve all the stats (+ eventually add > some filtering) without the need to connect to each database). I'm not that convinced by the use case, but I think it's also low cost to add and maintain, so if somebody cares enough to write something... The only thing I would "request" is that such a function requires more permissions than the default accessors do. I think it's a minor problem that we allow so much access within a database right now, regardless of object permissions, but it'd not be a great idea to expand that to other databases, in bulk? Greetings, Andres Freund
Re: shared-memory based stats collector - v70
On Tue, 9 Aug 2022 at 06:19, Drouvot, Bertrand wrote: > > > What do you think about adding a function in core PG to provide such > functionality? (means being able to retrieve all the stats (+ eventually > add some filtering) without the need to connect to each database). I'm working on it myself too. I'll post a patch for discussion in a bit. I was more aiming at a C function that extensions could use directly rather than an SQL function -- though I suppose having the former it would be simple enough to implement the latter using it. (though it would have to be one for each stat type I guess) The reason I want a C function is I'm trying to get as far as I can without a connection to a database, without a transaction, without accessing the catalog, and as much as possible without taking locks. I think this is important for making monitoring highly reliable and low impact on production. It's also kind of fundamental to accessing stats for objects from other databases since we won't have easy access to the catalogs for the other databases. The main problem with my current code is that I'm accessing the shared memory hash table directly. This means that I'm possibly introducing locking contention on the shared memory hash table. I'm thinking of separating the shared memory hash scan from the metric scan so the list can be quickly built minimizing the time the lock is held. We could possibly also only rebuild that list at a lower frequency than the metrics gathering so new objects might not show up instantly. I have a few things I would like to suggest for future improvements to this infrastructure. I haven't polished the details of it yet but the main thing I think I'm missing is the catalog name for the object. I don't want to have to fetch it from the catalog and in any case I think it would generally be useful and might regularize the replication slot handling too. 
I also think it would be nice to have a change counter for every stat object, or perhaps a change time. Prometheus wouldn't be able to make use of it but other monitoring software might be able to receive only metrics that have changed since the last update which would really help on databases with large numbers of mostly static objects. Even on typical databases there are tons of builtin objects (especially functions) that are probably never getting updates. -- greg
Re: shared-memory based stats collector - v70
At Mon, 8 Aug 2022 11:53:19 -0400, Greg Stark wrote in
> I'm trying to wrap my head around the shared memory stats collector
> infrastructure from
> <20220406030008.2qxipjxo776dw...@alap3.anarazel.de> committed in
> 5891c7a8ed8f2d3d577e7eea34dacff12d7b6bbd.
>
> I have one question about locking -- afaics there's nothing protecting
> reading the shared memory stats. There is an lwlock protecting
> concurrent updates of the shared memory stats, but that lock isn't
> taken when we read the stats. Are we ok relying on atomic 64-bit reads
> or is there something else going on that I'm missing?
>
> In particular I'm looking at pgstat.c:847 in pgstat_fetch_entry()
> which does this:
>
> memcpy(stats_data,
>        pgstat_get_entry_data(kind, entry_ref->shared_stats),
>        kind_info->shared_data_len);
>
> stats_data is the returned copy of the stats entry with all the
> statistics in it. But it's copied from the shared memory location
> directly using memcpy and there's no locking or change counter or
> anything protecting this memcpy that I can see.

We take LW_SHARED while creating a snapshot of fixed-numbered stats. On the other hand we don't for variable-numbered stats. I agree with you that we need that also for variable-numbered stats.

If I'm not missing something, it's strange that pgstat_lock_entry() only takes LW_EXCLUSIVE. The attached changes the interface of pgstat_lock_entry() but there's only one user since another read of a shared stats entry is not using a reference. Thus the interface change might be too much. If I just add bare LWLockAcquire/Release() to pgstat_fetch_entry, the amount of the patch will be reduced.

regards. 
--
Kyotaro Horiguchi
NTT Open Source Software Center

diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index 88e5dd1b2b..7c4e5f0238 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -844,9 +844,11 @@ pgstat_fetch_entry(PgStat_Kind kind, Oid dboid, Oid objoid)
 	else
 		stats_data = MemoryContextAlloc(pgStatLocal.snapshot.context,
 										kind_info->shared_data_len);
+	pgstat_lock_entry(entry_ref, LW_SHARED, false);
 	memcpy(stats_data,
 		   pgstat_get_entry_data(kind, entry_ref->shared_stats),
 		   kind_info->shared_data_len);
+	pgstat_unlock_entry(entry_ref);
 
 	if (pgstat_fetch_consistency > PGSTAT_FETCH_CONSISTENCY_NONE)
 	{
@@ -983,9 +985,15 @@ pgstat_build_snapshot(void)
 			entry->data = MemoryContextAlloc(pgStatLocal.snapshot.context,
 											 kind_info->shared_size);
+			/*
+			 * We're directly accessing the shared stats entry not using a
+			 * reference. pg_stat_lock_entry() cannot be used here.
+			 */
+			LWLockAcquire(&stats_data->lock, LW_SHARED);
 			memcpy(entry->data,
 				   pgstat_get_entry_data(kind, stats_data),
 				   kind_info->shared_size);
+			LWLockRelease(&stats_data->lock);
 		}
 
 		dshash_seq_term(&hstat);
diff --git a/src/backend/utils/activity/pgstat_database.c b/src/backend/utils/activity/pgstat_database.c
index d9275611f0..fdf4d022c1 100644
--- a/src/backend/utils/activity/pgstat_database.c
+++ b/src/backend/utils/activity/pgstat_database.c
@@ -364,7 +364,7 @@ pgstat_database_flush_cb(PgStat_EntryRef *entry_ref, bool nowait)
 	pendingent = (PgStat_StatDBEntry *) entry_ref->pending;
 	sharedent = (PgStatShared_Database *) entry_ref->shared_stats;
 
-	if (!pgstat_lock_entry(entry_ref, nowait))
+	if (!pgstat_lock_entry(entry_ref, LW_EXCLUSIVE, nowait))
 		return false;
 
 #define PGSTAT_ACCUM_DBCOUNT(item) \
diff --git a/src/backend/utils/activity/pgstat_function.c b/src/backend/utils/activity/pgstat_function.c
index 427d8c47fc..318db0b3c8 100644
--- a/src/backend/utils/activity/pgstat_function.c
+++ b/src/backend/utils/activity/pgstat_function.c
@@ -200,7 +200,7 @@
pgstat_function_flush_cb(PgStat_EntryRef *entry_ref, bool nowait)
 
 	/* localent always has non-zero content */
 
-	if (!pgstat_lock_entry(entry_ref, nowait))
+	if (!pgstat_lock_entry(entry_ref, LW_EXCLUSIVE, nowait))
 		return false;
 
 	shfuncent->stats.f_numcalls += localent->f_counts.f_numcalls;
diff --git a/src/backend/utils/activity/pgstat_relation.c b/src/backend/utils/activity/pgstat_relation.c
index a846d9ffb6..98dda726db 100644
--- a/src/backend/utils/activity/pgstat_relation.c
+++ b/src/backend/utils/activity/pgstat_relation.c
@@ -782,7 +782,7 @@ pgstat_relation_flush_cb(PgStat_EntryRef *entry_ref, bool nowait)
 		return true;
 	}
 
-	if (!pgstat_lock_entry(entry_ref, nowait))
+	if (!pgstat_lock_entry(entry_ref, LW_EXCLUSIVE, nowait))
 		return false;
 
 	/* add the values to the shared entry. */
diff --git a/src/backend/utils/activity/pgstat_shmem.c b/src/backend/utils/activity/pgstat_shmem.c
index 89060ef29a..fdd20d80a1 100644
--- a/src/backend/utils/activity/pgstat_shmem.c
+++ b/src/backend/utils/activity/pgstat_shmem.c
@@ -568,14 +568,14 @@ pgstat_release_entry_ref(PgStat_HashKey key, PgStat_EntryRef *entry_ref,
 }
 
 bool
-pgstat_lock_entry(PgStat_EntryRef *entry_ref, bool nowait)
Re: shared-memory based stats collector - v70
I'm trying to wrap my head around the shared memory stats collector infrastructure from <20220406030008.2qxipjxo776dw...@alap3.anarazel.de> committed in 5891c7a8ed8f2d3d577e7eea34dacff12d7b6bbd. I have one question about locking -- afaics there's nothing protecting reading the shared memory stats. There is an lwlock protecting concurrent updates of the shared memory stats, but that lock isn't taken when we read the stats. Are we ok relying on atomic 64-bit reads or is there something else going on that I'm missing? In particular I'm looking at pgstat.c:847 in pgstat_fetch_entry() which does this: memcpy(stats_data, pgstat_get_entry_data(kind, entry_ref->shared_stats), kind_info->shared_data_len); stats_data is the returned copy of the stats entry with all the statistics in it. But it's copied from the shared memory location directly using memcpy and there's no locking or change counter or anything protecting this memcpy that I can see. -- greg
Re: shared-memory based stats collector - v70
On Wed, 20 Jul 2022 at 15:09, Andres Freund wrote: > > Each backend only had stats for things it touched. But the stats collector > read all files at startup into hash tables and absorbed all generated stats > into those as well. Fascinating. I'm surprised this didn't raise issues previously for people with millions of tables. I wonder if it wasn't causing issues and we just didn't hear about them because there were other bigger issues :) > >All that said -- having all objects loaded in shared memory makes my > >work way easier. > > What are you trying to do? I'm trying to implement an exporter for prometheus/openmetrics/etc that dumps directly from shared memory without going through the SQL backend layer. I believe this will be much more reliable, lower overhead, safer, and consistent than writing SQL queries. Ideally I would want to dump out the stats without connecting to each database. I suspect that would run into problems where the schema really adds a lot of information (such as which table each index is on or which table a toast relation is for). There are also some things people think of as stats that are maintained in the catalog such as reltuples and relpages. So I'm imagining this won't strictly stay true in the end. It seems like just having an interface to iterate over the shared hash table and return entries one by one without filtering by database would be fairly straightforward and I would be able to do most of what I want just with that. There's actually enough meta information in the stats entries to be able to handle them as they come instead of trying to look up specific stats one by one. -- greg
Re: shared-memory based stats collector - v70
Hi, On July 20, 2022 8:41:53 PM GMT+02:00, Greg Stark wrote: >On Wed, 20 Jul 2022 at 12:08, Tom Lane wrote: >> >> AFAIR, the previous stats collector implementation had no such provision >> either: it'd just keep adding hashtable entries as it received info about >> new objects. The only thing that's changed is that now those entries are >> in shared memory instead of process-local memory. We'd be well advised to >> be sure that memory can be swapped out under pressure, but otherwise I'm >> not seeing that things have gotten worse. > >Just to be clear I'm not looking for ways things have gotten worse. >Just trying to understand what I'm reading and I guess I came in with >assumptions that led me astray. > >But... adding entries as it received info about new objects isn't the >same as having info on everything. I didn't really understand how the >old system worked but if you had a very large schema but each session >only worked with a small subset did the local stats data ever absorb >info on the objects it never touched? Each backend only had stats for things it touched. But the stats collector read all files at startup into hash tables and absorbed all generated stats into those as well. >All that said -- having all objects loaded in shared memory makes my >work way easier. What are your trying to do? >It actually seems feasible to dump out all the >objects from shared memory and including objects from other databases >and if I don't need a consistent snapshot it even seems like it would >be possible to do that without having a copy of more than one stats >entry at a time in local memory. I hope that doesn't cause huge >contention on the shared hash table to be doing that regularly. The stats accessors now default to not creating a full snapshot of stats data at first access (but that's configurable). So yes, that behavior is possible. E.g. autovac now uses a single object access like you describe. Andres -- Sent from my Android device with K-9 Mail. 
Please excuse my brevity.
Re: shared-memory based stats collector - v70
On Wed, 20 Jul 2022 at 12:08, Tom Lane wrote: > > AFAIR, the previous stats collector implementation had no such provision > either: it'd just keep adding hashtable entries as it received info about > new objects. The only thing that's changed is that now those entries are > in shared memory instead of process-local memory. We'd be well advised to > be sure that memory can be swapped out under pressure, but otherwise I'm > not seeing that things have gotten worse. Just to be clear I'm not looking for ways things have gotten worse. Just trying to understand what I'm reading and I guess I came in with assumptions that led me astray. But... adding entries as it received info about new objects isn't the same as having info on everything. I didn't really understand how the old system worked but if you had a very large schema but each session only worked with a small subset did the local stats data ever absorb info on the objects it never touched? All that said -- having all objects loaded in shared memory makes my work way easier. It actually seems feasible to dump out all the objects from shared memory and including objects from other databases and if I don't need a consistent snapshot it even seems like it would be possible to do that without having a copy of more than one stats entry at a time in local memory. I hope that doesn't cause huge contention on the shared hash table to be doing that regularly. -- greg
Re: shared-memory based stats collector - v70
Hi, On 2022-07-20 12:08:35 -0400, Tom Lane wrote: > AFAIR, the previous stats collector implementation had no such provision > either: it'd just keep adding hashtable entries as it received info about > new objects. Yep. > The only thing that's changed is that now those entries are in shared > memory instead of process-local memory. We'd be well advised to be > sure that memory can be swapped out under pressure, but otherwise I'm > not seeing that things have gotten worse. FWIW, I ran a few memory usage benchmarks. Without stats accesses the memory usage with shared memory stats was sometimes below, sometimes above the "old" memory usage, depending on the number of objects. As soon as there's stats access, it's well below (that includes things like autovac workers). I think there's quite a bit of memory usage reduction potential around dsa.c - we occasionally end up with [nearly] unused dsm segments. Greetings, Andres Freund
Re: shared-memory based stats collector - v70
Melanie Plageman writes: > On Wed, Jul 20, 2022 at 11:35 AM Greg Stark wrote: >> It seems like if we really think the total number of database objects >> is reasonably limited to scales that fit in RAM there would be a much >> simpler database design that would just store the catalog tables in >> simple in-memory data structures and map them all on startup without >> doing all the work Postgres does to make relational storage scale. > I think efforts to do such a thing have gotten caught up in solving > issues around visibility and managing the relationship between local and > global caches [1]. It doesn't seem like the primary technical concern > was memory usage. AFAIR, the previous stats collector implementation had no such provision either: it'd just keep adding hashtable entries as it received info about new objects. The only thing that's changed is that now those entries are in shared memory instead of process-local memory. We'd be well advised to be sure that memory can be swapped out under pressure, but otherwise I'm not seeing that things have gotten worse. regards, tom lane
Re: shared-memory based stats collector - v70
Hi, On 2022-07-20 11:35:13 -0400, Greg Stark wrote: > Is it true that the shared memory allocation contains the hash table > entry and body of every object in every database? Yes. However, note that that was already the case with the old stats collector - it also kept everything in memory. In addition every read access to stats loaded a copy of the stats (well of the global stats and the relevant per-database stats). It might be worth doing something fancier at some point - the shared memory stats was already a huge effort, cramming yet another change in there would pretty much have guaranteed that it'd fail. Greetings, Andres Freund
Re: shared-memory based stats collector - v70
On Wed, Jul 20, 2022 at 11:35 AM Greg Stark wrote: > On the one hand the rest of Postgres seems to be designed on the > assumption that the number of tables and database objects is limited > only by disk space. The catalogs are stored in relational storage > which is read through the buffer cache. On the other hand it's true > that syscaches don't do expire entries (though I think the assumption > is that no one backend touches very much). > > It seems like if we really think the total number of database objects > is reasonably limited to scales that fit in RAM there would be a much > simpler database design that would just store the catalog tables in > simple in-memory data structures and map them all on startup without > doing all the work Postgres does to make relational storage scale. > I think efforts to do such a thing have gotten caught up in solving issues around visibility and managing the relationship between local and global caches [1]. It doesn't seem like the primary technical concern was memory usage. [1] https://www.postgresql.org/message-id/flat/4E72940DA2BF16479384A86D54D0988A567B9245%40G01JPEXMBKW04
Re: shared-memory based stats collector - v70
So I'm finally wrapping my head around this new code. There is something I'm surprised by that perhaps I'm misreading or perhaps I shouldn't be surprised by, not sure. Is it true that the shared memory allocation contains the hash table entry and body of every object in every database? I guess I was assuming I would find some kind of LRU cache which loaded data from disk on demand. But afaict it loads everything on startup and then never loads from disk later. The disk is purely for recovering state after a restart. On the one hand the rest of Postgres seems to be designed on the assumption that the number of tables and database objects is limited only by disk space. The catalogs are stored in relational storage which is read through the buffer cache. On the other hand it's true that syscaches don't expire entries (though I think the assumption is that no one backend touches very much). It seems like if we really think the total number of database objects is reasonably limited to scales that fit in RAM there would be a much simpler database design that would just store the catalog tables in simple in-memory data structures and map them all on startup without doing all the work Postgres does to make relational storage scale.
Re: shared-memory based stats collector - v70
Hi, On 2022-04-13 17:55:18 -0700, Andres Freund wrote: > On 2022-04-13 16:56:45 -0700, David G. Johnston wrote: > > On Wed, Apr 13, 2022 at 4:34 PM David G. Johnston < > > david.g.johns...@gmail.com> wrote: > > Sorry, apparently this "2000-01-01" behavior only manifests after crash > > recovery on v15 (didn't check v14); after a clean initdb on v15 I got the > > same initdb timestamp. > > > Feels like we should still report the "end of crash recovery timestamp" for > > these instead of 2000-01-01 (which I guess is derived from 0) if we are not > > willing to produce null (and it seems other parts of the system using these > > stats assumes non-null). > > Yes, that's definitely not correct. I see the bug (need to call > pgstat_reset_after_failure(); in pgstat_discard_stats()). Stupid, but > easy to fix - too fried to write a test tonight, but will commit the fix > tomorrow. Pushed the fix (including a test that previously failed). Thanks again! Greetings, Andres Freund
Re: shared-memory based stats collector - v70
On Wed, Apr 13, 2022 at 04:56:45PM -0700, David G. Johnston wrote: > Sorry, apparently this "2000-01-01" behavior only manifests after crash > recovery on v15 (didn't check v14); after a clean initdb on v15 I got the > same initdb timestamp. > > Feels like we should still report the "end of crash recovery timestamp" for > these instead of 2000-01-01 (which I guess is derived from 0) if we are not > willing to produce null (and it seems other parts of the system using these > stats assumes non-null). I can see this timestamp as well after crash recovery. This seems rather misleading to me. I have added an open item. -- Michael
Re: shared-memory based stats collector - v70
Hi, On 2022-04-13 16:56:45 -0700, David G. Johnston wrote: > On Wed, Apr 13, 2022 at 4:34 PM David G. Johnston < > david.g.johns...@gmail.com> wrote: > Sorry, apparently this "2000-01-01" behavior only manifests after crash > recovery on v15 (didn't check v14); after a clean initdb on v15 I got the > same initdb timestamp. > Feels like we should still report the "end of crash recovery timestamp" for > these instead of 2000-01-01 (which I guess is derived from 0) if we are not > willing to produce null (and it seems other parts of the system using these > stats assumes non-null). Yes, that's definitely not correct. I see the bug (need to call pgstat_reset_after_failure(); in pgstat_discard_stats()). Stupid, but easy to fix - too fried to write a test tonight, but will commit the fix tomorrow. Thanks for catching! Greetings, Andres Freund
Re: shared-memory based stats collector - v70
On Wed, Apr 13, 2022 at 4:34 PM David G. Johnston < david.g.johns...@gmail.com> wrote: > On Tue, Apr 5, 2022 at 8:00 PM Andres Freund wrote: > >> Here comes v70: >> >> > One thing I just noticed while peeking at pg_stat_slru: > > The stats_reset column for my newly initdb'd cluster is showing me > "2000-01-01 00:00:00" (v15). I was expecting null, though a non-null value > restriction does make sense. Neither choice is documented though. > > Based upon my expectation I checked to see if v14 reported null, and thus > this was a behavior change. v14 reports the initdb timestamp (e.g., > 2022-04-13 23:26:48.349115+00) > > Can we document the non-null aspect of this value (pg_stat_database is > happy being null, this seems to be a "fixed" type behavior) but have it > continue to report initdb as its initial value? > > Sorry, apparently this "2000-01-01" behavior only manifests after crash recovery on v15 (didn't check v14); after a clean initdb on v15 I got the same initdb timestamp. Feels like we should still report the "end of crash recovery timestamp" for these instead of 2000-01-01 (which I guess is derived from 0) if we are not willing to produce null (and it seems other parts of the system using these stats assumes non-null). David J.
Re: shared-memory based stats collector - v70
On Tue, Apr 5, 2022 at 8:00 PM Andres Freund wrote: > Here comes v70: > > One thing I just noticed while peeking at pg_stat_slru: The stats_reset column for my newly initdb'd cluster is showing me "2000-01-01 00:00:00" (v15). I was expecting null, though a non-null value restriction does make sense. Neither choice is documented though. Based upon my expectation I checked to see if v14 reported null, and thus this was a behavior change. v14 reports the initdb timestamp (e.g., 2022-04-13 23:26:48.349115+00) Can we document the non-null aspect of this value (pg_stat_database is happy being null, this seems to be a "fixed" type behavior) but have it continue to report initdb as its initial value? David J.
Re: shared-memory based stats collector - v70
On Sat, Apr 9, 2022 at 12:07 PM Andres Freund wrote: > > > ... specific counters. In particular, replay will not increment > > pg_stat_database or pg_stat_all_tables columns, and the startup process > > will not report reads and writes for the pg_statio views. > > > > It would be helpful to give at least one specific example of what is being > > recorded normally, especially since we give three of what is not. > > The second sentence is a set of examples - or do you mean examples for what > actions by the startup process are counted? > > Specific views that these statistics will be updating; like pg_stat_database being the example of a view that is not updated. David J.
Re: shared-memory based stats collector - v70
Hi, On 2022-04-07 21:39:55 -0700, David G. Johnston wrote: > On Thu, Apr 7, 2022 at 8:59 PM Andres Freund wrote: > > > > > > >Cumulative statistics are collected in shared memory. Every >PostgreSQL process collects statistics > locally >then updates the shared data at appropriate intervals. When a server, >including a physical replica, shuts down cleanly, a permanent copy of > the >statistics data is stored in the pg_stat > subdirectory, >so that statistics can be retained across server restarts. In contrast, >when starting from an unclean shutdown (e.g., after an immediate > shutdown, >a server crash, starting from a base backup, and point-in-time > recovery), >all statistics counters are reset. > > > > I like this. My comment regarding using "i.e.," here stands though. Argh. I'd done it for e.g., but not i.e.. > > > > > > The cumulative statistics system is active during recovery. All scans, > > reads, blocks, index usage, etc., will be recorded normally on the > > standby. However, WAL replay will not increment relation and database > > specific counters. I.e. replay will not increment pg_stat_all_tables > > columns (like n_tup_ins), nor will reads or writes performed by the > > startup process be tracked in the pg_statio views, nor will associated > > pg_stat_database columns be incremented. > > > > > I like this too. The second part with three nors is a bit rough. Maybe: Agreed. I tried to come up with a smoother formulation, but didn't (perhaps because I was a tad tired). > ... specific counters. In particular, replay will not increment > pg_stat_database or pg_stat_all_tables columns, and the startup process > will not report reads and writes for the pg_statio views. > > It would be helpful to give at least one specific example of what is being > recorded normally, especially since we give three of what is not. The second sentence is a set of examples - or do you mean examples for what actions by the startup process are counted? 
Greetings, Andres Freund
Re: shared-memory based stats collector - v70
At Thu, 7 Apr 2022 20:59:21 -0700, Andres Freund wrote in > Hi, > > On 2022-04-08 11:10:14 +0900, Kyotaro Horiguchi wrote: > > I can read it. But I'm not sure that the difference is obvious for > > average users between "starting a standby from a basebackup" and > > "starting a standby after a normal shutdown".. > > Yea, that's what I was concerned about. How about: > > >Cumulative statistics are collected in shared memory. Every >PostgreSQL process collects statistics locally >then updates the shared data at appropriate intervals. When a server, >including a physical replica, shuts down cleanly, a permanent copy of the >statistics data is stored in the pg_stat subdirectory, >so that statistics can be retained across server restarts. In contrast, >when starting from an unclean shutdown (e.g., after an immediate shutdown, >a server crash, starting from a base backup, and point-in-time recovery), >all statistics counters are reset. > Looks perfect generally, and especially in regard to the concern. > I think I like my version above a bit better? Quite a bit. It didn't answer for the concern. > > > 2) > > > The edit is not a problem, but it's hard to understand what the existing > > > paragraph actually means? > > > > > > diff --git a/doc/src/sgml/high-availability.sgml > > > b/doc/src/sgml/high-availability.sgml > > > index 3247e056663..8bfb584b752 100644 > > > --- a/doc/src/sgml/high-availability.sgml > > > +++ b/doc/src/sgml/high-availability.sgml > > > @@ -,17 +,17 @@ HINT: You can then restart the server after > > > making the necessary configuration > > > ... > > > > > > -The statistics collector is active during recovery. All scans, > > > reads, blocks, > > > +The cumulative statistics system is active during recovery. All > > > scans, reads, blocks, > > > index usage, etc., will be recorded normally on the standby. 
Replayed > > > actions will not duplicate their effects on primary, so replaying an > > > insert will not increment the Inserts column of pg_stat_user_tables. > > > The stats file is deleted at the start of recovery, so stats from > > > primary > > > and standby will differ; this is considered a feature, not a bug. > > > > > > > > > > > > > Agreed partially. It's too detailed. It might not need to mention WAL > > replay. > > My concern is more that it seems halfway nonsensical. "Replayed actions will > not duplicate their effects on primary" - I can guess what that means, but not > more. There's no "Inserts" column of pg_stat_user_tables. > > > > The cumulative statistics system is active during recovery. All scans, > reads, blocks, index usage, etc., will be recorded normally on the > standby. However, WAL replay will not increment relation and database > specific counters. I.e. replay will not increment pg_stat_all_tables > columns (like n_tup_ins), nor will reads or writes performed by the > startup process be tracked in the pg_statio views, nor will associated > pg_stat_database columns be incremented. > Looks clearer since it mention user-facing interfaces with concrete example columns. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: shared-memory based stats collector - v70
On Thu, Apr 7, 2022 at 8:59 PM Andres Freund wrote: > > >Cumulative statistics are collected in shared memory. Every >PostgreSQL process collects statistics > locally >then updates the shared data at appropriate intervals. When a server, >including a physical replica, shuts down cleanly, a permanent copy of > the >statistics data is stored in the pg_stat > subdirectory, >so that statistics can be retained across server restarts. In contrast, >when starting from an unclean shutdown (e.g., after an immediate > shutdown, >a server crash, starting from a base backup, and point-in-time > recovery), >all statistics counters are reset. > > I like this. My comment regarding using "i.e.," here stands though. > > > The cumulative statistics system is active during recovery. All scans, > reads, blocks, index usage, etc., will be recorded normally on the > standby. However, WAL replay will not increment relation and database > specific counters. I.e. replay will not increment pg_stat_all_tables > columns (like n_tup_ins), nor will reads or writes performed by the > startup process be tracked in the pg_statio views, nor will associated > pg_stat_database columns be incremented. > > > I like this too. The second part with three nors is a bit rough. Maybe: ... specific counters. In particular, replay will not increment pg_stat_database or pg_stat_all_tables columns, and the startup process will not report reads and writes for the pg_statio views. It would be helpful to give at least one specific example of what is being recorded normally, especially since we give three of what is not. David J.
Re: shared-memory based stats collector - v70
On 2022-04-07 16:37:51 -0700, Andres Freund wrote: > On 2022-04-07 00:28:45 -0700, Andres Freund wrote: > > I've gotten through the main commits (and then a fix for the apparently > > inevitable bug that's immediately highlighted by the buildfarm), and the > > first > > test. I'll call it a night now, and work on the other tests & docs tomorrow. > > I've gotten through the tests now. There's one known, not yet addressed, issue > with the stats isolation test, see [1]. That has since been fixed, in d6c0db14836cd843d589372d909c73aab68c7a24
Re: shared-memory based stats collector - v70
Hi, On 2022-04-07 20:59:21 -0700, Andres Freund wrote: > > >Cumulative statistics are collected in shared memory. Every >PostgreSQL process collects statistics locally >then updates the shared data at appropriate intervals. When a server, >including a physical replica, shuts down cleanly, a permanent copy of the >statistics data is stored in the pg_stat subdirectory, >so that statistics can be retained across server restarts. In contrast, >when starting from an unclean shutdown (e.g., after an immediate shutdown, >a server crash, starting from a base backup, and point-in-time recovery), >all statistics counters are reset. > > ... > > The cumulative statistics system is active during recovery. All scans, > reads, blocks, index usage, etc., will be recorded normally on the > standby. However, WAL replay will not increment relation and database > specific counters. I.e. replay will not increment pg_stat_all_tables > columns (like n_tup_ins), nor will reads or writes performed by the > startup process be tracked in the pg_statio views, nor will associated > pg_stat_database columns be incremented. > I went with these for now. My guess is that there's further improvements in them, and in surrounding areas... With that, I'll close this CF entry. It's been a while. Greetings, Andres Freund
Re: shared-memory based stats collector - v70
Hi, On 2022-04-07 20:51:10 -0700, David G. Johnston wrote: > On Thu, Apr 7, 2022 at 7:10 PM Kyotaro Horiguchi > wrote: > > I can read it. But I'm not sure that the difference is obvious for > > average users between "starting a standby from a basebackup" and > > "starting a standby after a normal shutdown".. > > > > Other than that, it might be easier to read if the additional part > > were moved out to the end of the paragraph, prefixing with "Note: > > ". For example, > > > > ... > > statistics can be retained across server restarts. When crash recovery is > > performed at server start (e.g., after immediate shutdown, server crash, > > and point-in-time recovery), all statistics counters are reset. Note that > > crash recovery is not performed when starting a standby that was shut > > down normally then all counters are retained. > > > > > Maybe: > When the server shuts down cleanly a permanent copy of the statistics data > is stored in the pg_stat subdirectory so that > statistics can be retained across server restarts. However, if crash > recovery is performed (i.e., after immediate shutdown, server crash, or > point-in-time recovery), all statistics counters are reset. For any > standby server, the initial startup to get the cluster initialized is a > point-in-time crash recovery startup. For all subsequent startups it > behaves like any other server. For a hot standby server, statistics are > retained during a failover promotion. > > I'm pretty sure i.e., is correct since those three situations are not > examples but rather the complete set. I don't think the "initial startup ..." bit is quite correct. A standby can be created for a shut down server, and IIRC there's some differences in how PITR is handled too. > Is crash recovery ever performed other than at server start? If so I > choose to remove the redundancy. No. 
> I feel like some of this detail about standby servers is/should be covered > elsewhere and we are at least missing a cross-reference chance even if we > leave the material coverage as-is. I didn't find anything good to reference... > > 2) > > > The edit is not a problem, but it's hard to understand what the existing > > > paragraph actually means? > > > > > > diff --git a/doc/src/sgml/high-availability.sgml > > b/doc/src/sgml/high-availability.sgml > > > index 3247e056663..8bfb584b752 100644 > > > --- a/doc/src/sgml/high-availability.sgml > > > +++ b/doc/src/sgml/high-availability.sgml > > > @@ -,17 +,17 @@ HINT: You can then restart the server after > > making the necessary configuration > > > ... > > > > > > -The statistics collector is active during recovery. All scans, > > reads, blocks, > > > +The cumulative statistics system is active during recovery. All > > scans, reads, blocks, > > > index usage, etc., will be recorded normally on the standby. > > Replayed > > > actions will not duplicate their effects on primary, so replaying an > > > insert will not increment the Inserts column of pg_stat_user_tables. > > > The stats file is deleted at the start of recovery, so stats from > > primary > > > and standby will differ; this is considered a feature, not a bug. > > > > > > > > > > > > > Agreed partially. It's too detailed. It might not need to mention WAL > > replay. > > > > > The insert example seems like a poor one...IIUC cumulative statistics are > not WAL logged and while in recovery INSERT is prohibited, so how would > replaying the insert in the WAL result in a duplicated effect on > pg_stat_user_tables.inserts? I agree, the sentence doesn't make much sense. It doesn't really matter that stats aren't WAL logged, one could infer them from the WAL at a decent level of accuracy. However, we can't actually associate those actions to relations, we just know the relfilenode... And the startup process can't read the catalog to figure the mapping out. 
> I also have no idea what, in the fragment, "Replayed actions will not > duplicate their effects on primary...", what "on primary" is supposed to > mean. I think it's trying to say "will not duplicate the effect on pg_stat_user_tables they had on the primary". > I would like to write the following but I don't believe it is sufficiently > true: > > "The cumulative statistics system records only the locally generated > activity of the cluster, including while in recovery. Activity happening > only due to the replay of WAL is not considered local." > > But to apply the WAL we have to fetch blocks from the local filesystem and > write them back out. That is local activity happening due to the replay of > WAL which sounds like it is and should be reported ("All...blocks...", and > the example given being logical DDL oriented). That's not true today - the startup processes reads / writes aren't reflected in pg_statio, pg_stat_database or whatnot. > I cannot think of a better paragraph at the moment, the minimal change is > good, and the detail it
Re: shared-memory based stats collector - v70
Hi, On 2022-04-08 11:10:14 +0900, Kyotaro Horiguchi wrote: > At Thu, 7 Apr 2022 16:37:51 -0700, Andres Freund wrote > in > > Hi, > > > > On 2022-04-07 00:28:45 -0700, Andres Freund wrote: > > > I've gotten through the main commits (and then a fix for the apparently > > > inevitable bug that's immediately highlighted by the buildfarm), and the > > > first > > > test. I'll call it a night now, and work on the other tests & docs > > > tomorrow. > > > > I've gotten through the tests now. There's one known, not yet addressed, > > issue > > with the stats isolation test, see [1]. > > > > > > Working on the docs. Found a few things worth raising: > > > > 1) > > Existing text: > >When the server shuts down cleanly, a permanent copy of the statistics > >data is stored in the pg_stat subdirectory, so that > >statistics can be retained across server restarts. When recovery is > >performed at server start (e.g., after immediate shutdown, server crash, > >and point-in-time recovery), all statistics counters are reset. > > > > The existing docs patch hadn't updated yet. My current edit is > > > >When the server shuts down cleanly, a permanent copy of the statistics > >data is stored in the pg_stat subdirectory, so that > >statistics can be retained across server restarts. When crash recovery > > is > >performed at server start (e.g., after immediate shutdown, server crash, > >and point-in-time recovery, but not when starting a standby that was shut > >down normally), all statistics counters are reset. > > > > but I'm not sure the parenthetical is easy enough to understand? > > I can read it. But I'm not sure that the difference is obvious for > average users between "starting a standby from a basebackup" and > "starting a standby after a normal shutdown".. Yea, that's what I was concerned about. How about: Cumulative statistics are collected in shared memory. Every PostgreSQL process collects statistics locally then updates the shared data at appropriate intervals. 
When a server, including a physical replica, shuts down cleanly, a permanent copy of the statistics data is stored in the pg_stat subdirectory, so that statistics can be retained across server restarts. In contrast, when starting from an unclean shutdown (e.g., after an immediate shutdown, a server crash, starting from a base backup, and point-in-time recovery), all statistics counters are reset. > Other than that, it might be easier to read if the additional part > were moved out to the end of the paragraph, prefixing with "Note: > ". For example, > > ... > statistics can be retained across server restarts. When crash recovery is > performed at server start (e.g., after immediate shutdown, server crash, > and point-in-time recovery), all statistics counters are reset. Note that > crash recovery is not performed when starting a standby that was shut > down normally then all counters are retained. I think I like my version above a bit better? > > 2) > > The edit is not a problem, but it's hard to understand what the existing > > paragraph actually means? > > > > diff --git a/doc/src/sgml/high-availability.sgml > > b/doc/src/sgml/high-availability.sgml > > index 3247e056663..8bfb584b752 100644 > > --- a/doc/src/sgml/high-availability.sgml > > +++ b/doc/src/sgml/high-availability.sgml > > @@ -,17 +,17 @@ HINT: You can then restart the server after making > > the necessary configuration > > ... > > > > -The statistics collector is active during recovery. All scans, reads, > > blocks, > > +The cumulative statistics system is active during recovery. All scans, > > reads, blocks, > > index usage, etc., will be recorded normally on the standby. Replayed > > actions will not duplicate their effects on primary, so replaying an > > insert will not increment the Inserts column of pg_stat_user_tables. > > The stats file is deleted at the start of recovery, so stats from > > primary > > and standby will differ; this is considered a feature, not a bug. 
> > > > > > > > Agreed partially. It's too detailed. It might not need to mention WAL > replay. My concern is more that it seems halfway nonsensical. "Replayed actions will not duplicate their effects on primary" - I can guess what that means, but not more. There's no "Inserts" column of pg_stat_user_tables. The cumulative statistics system is active during recovery. All scans, reads, blocks, index usage, etc., will be recorded normally on the standby. However, WAL replay will not increment relation and database specific counters. I.e. replay will not increment pg_stat_all_tables columns (like n_tup_ins), nor will reads or writes performed by the startup process be tracked in the pg_statio views, nor will associated pg_stat_database columns be incremented. Greetings, Andres Freund
Re: shared-memory based stats collector - v70
On Thu, Apr 7, 2022 at 7:10 PM Kyotaro Horiguchi wrote: > At Thu, 7 Apr 2022 16:37:51 -0700, Andres Freund > wrote in > > Hi, > > > > On 2022-04-07 00:28:45 -0700, Andres Freund wrote: > > > I've gotten through the main commits (and then a fix for the apparently > > > inevitable bug that's immediately highlighted by the buildfarm), and > the first > > > test. I'll call it a night now, and work on the other tests & docs > tomorrow. > > > > I've gotten through the tests now. There's one known, not yet addressed, > issue > > with the stats isolation test, see [1]. > > > > > > Working on the docs. Found a few things worth raising: > > > > 1) > > Existing text: > >When the server shuts down cleanly, a permanent copy of the statistics > >data is stored in the pg_stat subdirectory, so > that > >statistics can be retained across server restarts. When recovery is > >performed at server start (e.g., after immediate shutdown, server > crash, > >and point-in-time recovery), all statistics counters are reset. > > > > The existing docs patch hadn't updated yet. My current edit is > > > >When the server shuts down cleanly, a permanent copy of the statistics > >data is stored in the pg_stat subdirectory, so > that > >statistics can be retained across server restarts. When crash > recovery is > >performed at server start (e.g., after immediate shutdown, server > crash, > >and point-in-time recovery, but not when starting a standby that was > shut > >down normally), all statistics counters are reset. > > > > but I'm not sure the parenthetical is easy enough to understand? > > I can read it. But I'm not sure that the difference is obvious for > average users between "starting a standby from a basebackup" and > "starting a standby after a normal shutdown".. > > Other than that, it might be easier to read if the additional part > were moved out to the end of the paragraph, prefixing with "Note: > ". For example, > > ... > statistics can be retained across server restarts. 
When crash recovery is > performed at server start (e.g., after immediate shutdown, server crash, > and point-in-time recovery), all statistics counters are reset. Note that > crash recovery is not performed when starting a standby that was shut > down normally then all counters are retained. > > Maybe: When the server shuts down cleanly a permanent copy of the statistics data is stored in the pg_stat subdirectory so that statistics can be retained across server restarts. However, if crash recovery is performed (i.e., after immediate shutdown, server crash, or point-in-time recovery), all statistics counters are reset. For any standby server, the initial startup to get the cluster initialized is a point-in-time crash recovery startup. For all subsequent startups it behaves like any other server. For a hot standby server, statistics are retained during a failover promotion. I'm pretty sure i.e., is correct since those three situations are not examples but rather the complete set. Is crash recovery ever performed other than at server start? If so I choose to remove the redundancy. I feel like some of this detail about standby servers is/should be covered elsewhere and we are at least missing a cross-reference chance even if we leave the material coverage as-is. > 2) > > The edit is not a problem, but it's hard to understand what the existing > > paragraph actually means? > > > > diff --git a/doc/src/sgml/high-availability.sgml > b/doc/src/sgml/high-availability.sgml > > index 3247e056663..8bfb584b752 100644 > > --- a/doc/src/sgml/high-availability.sgml > > +++ b/doc/src/sgml/high-availability.sgml > > @@ -,17 +,17 @@ HINT: You can then restart the server after > making the necessary configuration > > ... > > > > -The statistics collector is active during recovery. All scans, > reads, blocks, > > +The cumulative statistics system is active during recovery. All > scans, reads, blocks, > > index usage, etc., will be recorded normally on the standby. 
> Replayed > > actions will not duplicate their effects on primary, so replaying an > > insert will not increment the Inserts column of pg_stat_user_tables. > > The stats file is deleted at the start of recovery, so stats from > primary > > and standby will differ; this is considered a feature, not a bug. > > > > > > > > Agreed partially. It's too detailed. It might not need to mention WAL > replay. > > The insert example seems like a poor one...IIUC cumulative statistics are not WAL logged and while in recovery INSERT is prohibited, so how would replaying the insert in the WAL result in a duplicated effect on pg_stat_user_tables.inserts? I also have no idea what, in the fragment, "Replayed actions will not duplicate their effects on primary...", what "on primary" is supposed to mean. I would like to write the following but I don't believe it is sufficiently true: "The cumulative statistics system
Re: shared-memory based stats collector - v70
At Thu, 7 Apr 2022 16:37:51 -0700, Andres Freund wrote in > Hi, > > On 2022-04-07 00:28:45 -0700, Andres Freund wrote: > > I've gotten through the main commits (and then a fix for the apparently > > inevitable bug that's immediately highlighted by the buildfarm), and the > > first > > test. I'll call it a night now, and work on the other tests & docs tomorrow. > > I've gotten through the tests now. There's one known, not yet addressed, issue > with the stats isolation test, see [1]. > > > Working on the docs. Found a few things worth raising: > > 1) > Existing text: >When the server shuts down cleanly, a permanent copy of the statistics >data is stored in the pg_stat subdirectory, so that >statistics can be retained across server restarts. When recovery is >performed at server start (e.g., after immediate shutdown, server crash, >and point-in-time recovery), all statistics counters are reset. > > The existing docs patch hadn't updated yet. My current edit is > >When the server shuts down cleanly, a permanent copy of the statistics >data is stored in the pg_stat subdirectory, so that >statistics can be retained across server restarts. When crash recovery is >performed at server start (e.g., after immediate shutdown, server crash, >and point-in-time recovery, but not when starting a standby that was shut >down normally), all statistics counters are reset. > > but I'm not sure the parenthetical is easy enough to understand? I can read it. But I'm not sure that the difference is obvious for average users between "starting a standby from a basebackup" and "starting a standby after a normal shutdown".. Other than that, it might be easier to read if the additional part were moved out to the end of the paragraph, prefixing with "Note: ". For example, ... statistics can be retained across server restarts. When crash recovery is performed at server start (e.g., after immediate shutdown, server crash, and point-in-time recovery), all statistics counters are reset. 
Note that crash recovery is not performed when starting a standby that was shut down normally then all counters are retained. > 2) > The edit is not a problem, but it's hard to understand what the existing > paragraph actually means? > > diff --git a/doc/src/sgml/high-availability.sgml > b/doc/src/sgml/high-availability.sgml > index 3247e056663..8bfb584b752 100644 > --- a/doc/src/sgml/high-availability.sgml > +++ b/doc/src/sgml/high-availability.sgml > @@ -,17 +,17 @@ HINT: You can then restart the server after making > the necessary configuration > ... > > -The statistics collector is active during recovery. All scans, reads, > blocks, > +The cumulative statistics system is active during recovery. All scans, > reads, blocks, > index usage, etc., will be recorded normally on the standby. Replayed > actions will not duplicate their effects on primary, so replaying an > insert will not increment the Inserts column of pg_stat_user_tables. > The stats file is deleted at the start of recovery, so stats from primary > and standby will differ; this is considered a feature, not a bug. > > > Agreed partially. It's too detailed. It might not need to mention WAL replay. > I'll just commit the necessary bit, but we really ought to rephrase this. > > > > > Greetings, > > Andres Freund > > [1] > https://www.postgresql.org/message-id/20220407165709.jgdkrzqlkcwue6ko%40alap3.anarazel.de -- Kyotaro Horiguchi NTT Open Source Software Center
Re: shared-memory based stats collector - v70
Hi, On 2022-04-07 00:28:45 -0700, Andres Freund wrote: > I've gotten through the main commits (and then a fix for the apparently > inevitable bug that's immediately highlighted by the buildfarm), and the first > test. I'll call it a night now, and work on the other tests & docs tomorrow. I've gotten through the tests now. There's one known, not yet addressed, issue with the stats isolation test, see [1]. Working on the docs. Found a few things worth raising: 1) Existing text: When the server shuts down cleanly, a permanent copy of the statistics data is stored in the pg_stat subdirectory, so that statistics can be retained across server restarts. When recovery is performed at server start (e.g., after immediate shutdown, server crash, and point-in-time recovery), all statistics counters are reset. The existing docs patch hadn't updated yet. My current edit is When the server shuts down cleanly, a permanent copy of the statistics data is stored in the pg_stat subdirectory, so that statistics can be retained across server restarts. When crash recovery is performed at server start (e.g., after immediate shutdown, server crash, and point-in-time recovery, but not when starting a standby that was shut down normally), all statistics counters are reset. but I'm not sure the parenthetical is easy enough to understand? 2) The edit is not a problem, but it's hard to understand what the existing paragraph actually means? diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml index 3247e056663..8bfb584b752 100644 --- a/doc/src/sgml/high-availability.sgml +++ b/doc/src/sgml/high-availability.sgml @@ -,17 +,17 @@ HINT: You can then restart the server after making the necessary configuration ... -The statistics collector is active during recovery. All scans, reads, blocks, +The cumulative statistics system is active during recovery. All scans, reads, blocks, index usage, etc., will be recorded normally on the standby. 
Replayed actions will not duplicate their effects on primary, so replaying an insert will not increment the Inserts column of pg_stat_user_tables. The stats file is deleted at the start of recovery, so stats from primary and standby will differ; this is considered a feature, not a bug. I'll just commit the necessary bit, but we really ought to rephrase this. Greetings, Andres Freund [1] https://www.postgresql.org/message-id/20220407165709.jgdkrzqlkcwue6ko%40alap3.anarazel.de
Re: shared-memory based stats collector - v70
At Thu, 7 Apr 2022 00:28:45 -0700, Andres Freund wrote in > Hi, > > On 2022-04-05 20:00:08 -0700, Andres Freund wrote: > > It'll be a few hours to get to the main commit - but except for 0001 it > > doesn't make sense to push without intending to push later changes too. I > > might squash a few commits togther. > > I've gotten through the main commits (and then a fix for the apparently > inevitable bug that's immediately highlighted by the buildfarm), and the first > test. I'll call it a night now, and work on the other tests & docs tomorrow. Thank you very much for the great effort on this to make it get in! regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: shared-memory based stats collector - v70
At Wed, 6 Apr 2022 18:58:52 -0700, Andres Freund wrote in > Hi, > > On 2022-04-07 10:36:30 +0900, Kyotaro Horiguchi wrote: > > At Wed, 6 Apr 2022 09:04:09 -0700, Andres Freund wrote > > in > > > I think there's no switches left now, so it's not actually providing too > > > much. > > > > (Ouch!) > I think it's great that there's no switches left - means we're pretty close to > pgstat being runtime extensible... Yes. I agree. > It's now pgstat_get_kind_from_str(). I'm fine with it. > It was harder to see earlier (I certainly didn't really see it) - because > there were so many "violations" - but most of pgstat is > pgstat__() or just _. I'd already moved most of > the patch series over to that (maybe in v68 or so). Now I also did that with > the internal functions. > > There's a few functions breaking that pattern, partially because I added them > :(, but since they're not touched in these patches I've not renamed them. But > it's probably worth doing so tomorrow. That being said, it gets far cleaner. Thanks! regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: shared-memory based stats collector - v70
Hi, On 2022-04-05 20:00:08 -0700, Andres Freund wrote: > It'll be a few hours to get to the main commit - but except for 0001 it > doesn't make sense to push without intending to push later changes too. I > might squash a few commits togther. I've gotten through the main commits (and then a fix for the apparently inevitable bug that's immediately highlighted by the buildfarm), and the first test. I'll call it a night now, and work on the other tests & docs tomorrow. Greetings, Andres Freund
Re: shared-memory based stats collector - v70
Hi, On 2022-04-07 10:36:30 +0900, Kyotaro Horiguchi wrote: > At Wed, 6 Apr 2022 09:04:09 -0700, Andres Freund wrote > in > > > + * Don't define an INVALID value so switch() statements can warn if some > > > + * cases aren't covered. But define the first member to 1 so that > > > + * uninitialized values can be detected more easily. > > > > > > FWIW, I like this. > > > > I think there's no switches left now, so it's not actually providing too > > much. > > (Ouch!) I think it's great that there's no switches left - means we're pretty close to pgstat being runtime extensible... > > > 0010: > > > (I didn't look this closer. The comments arised while looking other > > > patches.) > > > > > > +pgstat_kind_from_str(char *kind_str) > > > > > > I don't think I like "str" so much. Don't we spell it as > > > "pgstat_kind_from_name"? > > > > name makes me think of NameData. What do you dislike about str? We seem to > > use > > str in plenty places? > > For clarity, I don't dislike it so much. So, I'm fine with the > current name. > > I found that you meant a type by the "str". I thought it as an > instance (I'm not sure I can express my feeling correctly here..) and > the following functions were in my mind. > > char *get_namespace/rel/collation/func_name(Oid someoid) > char *pgstat_slru_name(int slru_idx) > > Another instance of the same direction is > > ForkNumber forkname_to_number(const char *forkName) It's now pgstat_get_kind_from_str(). It was harder to see earlier (I certainly didn't really see it) - because there were so many "violations" - but most of pgstat is pgstat__() or just _. I'd already moved most of the patch series over to that (maybe in v68 or so). Now I also did that with the internal functions. There's a few functions breaking that pattern, partially because I added them :(, but since they're not touched in these patches I've not renamed them. But it's probably worth doing so tomorrow. Greetings, Andres Freund
Re: shared-memory based stats collector - v70
At Wed, 6 Apr 2022 09:04:09 -0700, Andres Freund wrote in > > +* Don't define an INVALID value so switch() statements can warn if some > > +* cases aren't covered. But define the first member to 1 so that > > +* uninitialized values can be detected more easily. > > > > FWIW, I like this. > > I think there's no switches left now, so it's not actually providing too much. (Ouch!) > > 0008: > > > > + xact_desc_stats(buf, "", parsed.nstats, parsed.stats); > > + xact_desc_stats(buf, "commit ", parsed.nstats, parsed.stats); > > + xact_desc_stats(buf, "abort ", parsed.nabortstats, parsed.abortstats); > > > > I'm not sure I like this, but I don't object to this.. > > The string prefixes? Or the entire patch? The string prefixes, since they are a limited set of fixed strings. That being said, I don't think it's better to use an enum instead, too. So I don't object to pass the strings here. > > 0010: > > (I didn't look this closer. The comments arised while looking other > > patches.) > > > > +pgstat_kind_from_str(char *kind_str) > > > > I don't think I like "str" so much. Don't we spell it as > > "pgstat_kind_from_name"? > > name makes me think of NameData. What do you dislike about str? We seem to > use > str in plenty places? For clarity, I don't dislike it so much. So, I'm fine with the current name. I found that you meant a type by the "str". I thought it as an instance (I'm not sure I can express my feeling correctly here..) and the following functions were in my mind. char *get_namespace/rel/collation/func_name(Oid someoid) char *pgstat_slru_name(int slru_idx) Another instance of the same direction is ForkNumber forkname_to_number(const char *forkName) regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: shared-memory based stats collector - v70
Hi, On 2022-04-06 17:01:17 -0700, David G. Johnston wrote: > On Wed, Apr 6, 2022 at 4:12 PM Andres Freund wrote: > > The fact there is just the one outlier here suggests that this is indeed the > better option. FWIW, the outlier also uses pgstat_reset(), just with a small wrapper doing the translation from slot name to slot index. > > What does "private" mean for you? They're exposed via pgstat.h not > > pgstat_internal.h. But not to SQL. > I was thinking specifically of the freedom to rename and not break > extensions. Namely, are these truly implementation details or something > that, while unlikely to be used by extensions, still constitute an exposed > API? It was mainly a passing thought, I'm not looking for a crash-course > in how all that works right now. I doubt there are extension using these functions - and they'd have been broken the way things were in v70, because the signature already had changed. Generally, between major releases, we don't worry too much about changing C APIs. Of course we try to avoid unnecessarily breaking things, particularly when it's going to cause widespread breakage. Greetings, Andres Freund
Re: shared-memory based stats collector - v70
On Wed, Apr 6, 2022 at 4:12 PM Andres Freund wrote: > > On 2022-04-06 15:32:39 -0700, David G. Johnston wrote: > > On Wednesday, April 6, 2022, Andres Freund wrote: > > > > > > I like having the SQL function paired with a matching implementation in > > this scheme. > > It would have gotten things closer than it was before imo. We can't just > rename the SQL functions, they're obviously exposed API. > Right, I meant the naming scheme proposed was acceptable. Not that I wanted to change the SQL functions too. I've hacked up the above, but after doing so I think I found a cleaner > approach: > > I've introduced: > > pgstat_reset_of_kind(PgStat_Kind kind) which works for both fixed and > variable > numbered stats. That allows to remove pgstat_reset_subscription_counters(), > pgstat_reset_replslot_counters(), pgstat_reset_shared_counters(). > > pgstat_reset(PgStat_Kind kind, Oid dboid, Oid objoid), which removes the > need > for pgstat_reset_subscription_counter(), > pgstat_reset_single_counter(). > pgstat_reset_replslot() is still needed, to do > the name -> index lookup. > > That imo makes a lot more sense than requiring each variable-amount kind to > have wrapper functions. > > I can see benefits of both, or even possibly combining them. Absent being able to point to some other part of the system and saying "it is done this way there, let's do the same here" I think the details will inform the decision. The fact there is just the one outlier here suggests that this is indeed the better option. > What does "private" mean for you? They're exposed via pgstat.h not > pgstat_internal.h. But not to SQL. > > I was thinking specifically of the freedom to rename and not break extensions. Namely, are these truly implementation details or something that, while unlikely to be used by extensions, still constitute an exposed API? It was mainly a passing thought, I'm not looking for a crash-course in how all that works right now. David J.
Re: shared-memory based stats collector - v70
Hi, On 2022-04-06 15:32:39 -0700, David G. Johnston wrote: > On Wednesday, April 6, 2022, Andres Freund wrote: > > > > > > > I'd go for > > pgstat_reset_slru_counter() -> pgstat_reset_slru() > > pgstat_reset_subscription_counter() -> pgstat_reset_subscription() > > pgstat_reset_subscription_counters() -> pgstat_reset_all_subscriptions() > > pgstat_reset_replslot_counter() -> pgstat_reset_replslot() > > pgstat_reset_replslot_counters() -> pgstat_reset_all_replslots() > > > I like having the SQL function paired with a matching implementation in > this scheme. It would have gotten things closer than it was before imo. We can't just rename the SQL functions, they're obviously exposed API. I'd like to remove the NULL -> all behaviour, but that should be discussed separately. I've hacked up the above, but after doing so I think I found a cleaner approach: I've introduced: pgstat_reset_of_kind(PgStat_Kind kind) which works for both fixed and variable numbered stats. That allows to remove pgstat_reset_subscription_counters(), pgstat_reset_replslot_counters(), pgstat_reset_shared_counters(). pgstat_reset(PgStat_Kind kind, Oid dboid, Oid objoid), which removes the need for pgstat_reset_subscription_counter(), pgstat_reset_single_counter(). pgstat_reset_replslot() is still needed, to do the name -> index lookup. That imo makes a lot more sense than requiring each variable-amount kind to have wrapper functions. > > Not quite sure what to do with pgstat_reset_single_counter(). I'd either go > > for the minimal pgstat_reset_single_counters() or pgstat_reset_one()? > > > > Why not add both pgstat_resert_function() and pgstat_reset_table() (to keep > the pairing) and they can call the renamed pgstat_reset_function_or_table() > internally (since the function indeed handle both paths and we’ve yet to > come up with a label to use instead of “function and table stats”)? > > These are private functions right? What does "private" mean for you? 
They're exposed via pgstat.h not pgstat_internal.h. But not to SQL. Greetings, Andres Freund
Re: shared-memory based stats collector - v70
On Wednesday, April 6, 2022, Andres Freund wrote:

> I'd go for
> pgstat_reset_slru_counter() -> pgstat_reset_slru()
> pgstat_reset_subscription_counter() -> pgstat_reset_subscription()
> pgstat_reset_subscription_counters() -> pgstat_reset_all_subscriptions()
> pgstat_reset_replslot_counter() -> pgstat_reset_replslot()
> pgstat_reset_replslot_counters() -> pgstat_reset_all_replslots()

I like having the SQL function paired with a matching implementation in
this scheme.

> We could leave out the _all_ and just use plural too, but I think it's a bit
> nicer with _all_ in there.

+1 to _all_

> Not quite sure what to do with pgstat_reset_single_counter(). I'd either go
> for the minimal pgstat_reset_single_counters() or pgstat_reset_one()?

Why not add both pgstat_reset_function() and pgstat_reset_table() (to keep
the pairing) and they can call the renamed pgstat_reset_function_or_table()
internally (since the function indeed handles both paths and we've yet to
come up with a label to use instead of "function and table stats")?

These are private functions, right?

David J.
Re: shared-memory based stats collector - v70
Hi,

On 2022-04-05 20:00:08 -0700, Andres Freund wrote:
> It'll be a few hours to get to the main commit - but except for 0001 it
> doesn't make sense to push without intending to push later changes too. I
> might squash a few commits togther.

I just noticed an existing incoherency that I'm wondering about fixing as
part of 0007 "pgstat: prepare APIs used by pgstatfuncs for shared memory
stats."

The SQL functions to reset function and relation stats,
pg_stat_reset_single_table_counters() and
pg_stat_reset_single_function_counters() respectively, both make use of
pgstat_reset_single_counter(). Note that the SQL function uses plural
"counters" (which makes sense, it resets all counters for that object),
whereas the C function they call to perform the reset uses singular.

Similarly the pg_stat_reset_slru(), pg_stat_reset_replication_slot() and
pg_stat_reset_subscription_stats() SQL functions use
pgstat_reset_slru_counter(), pgstat_reset_replslot_counter() and
pgstat_reset_subscription_counter() to reset either the stats for one or
all SLRUs/slots.

This is relevant for the commit mentioned above because it separates the
functions to reset the stats for one slru / slot / sub from the functions
to reset all slrus / slots / subs. Going with the existing naming I'd just
have named them pgstat_reset_*_counters(). But that doesn't really make
sense.

If it were just existing code I'd just not touch this for now. But because
the patch introduces further functions, I'd rather not introduce more weird
function names.

I'd go for
pgstat_reset_slru_counter() -> pgstat_reset_slru()
pgstat_reset_subscription_counter() -> pgstat_reset_subscription()
pgstat_reset_subscription_counters() -> pgstat_reset_all_subscriptions()
pgstat_reset_replslot_counter() -> pgstat_reset_replslot()
pgstat_reset_replslot_counters() -> pgstat_reset_all_replslots()

We could leave out the _all_ and just use plural too, but I think it's a bit
nicer with _all_ in there.

Not quite sure what to do with pgstat_reset_single_counter(). I'd either go
for the minimal pgstat_reset_single_counters() or pgstat_reset_one()?

Greetings,

Andres Freund
Re: shared-memory based stats collector - v70
On Wed, Apr 6, 2022 at 12:34 PM Justin Pryzby wrote: > On Wed, Apr 06, 2022 at 12:27:34PM -0700, Andres Freund wrote: > > > > + next use of statistical information will cause a new snapshot to > be built > > > > + or accessed statistics to be cached. > > > > > > I believe this should be an "and", not an "or". (next access builds > both a > > > new snapshot and caches accessed statistics) > > > > I *think* or is correct? The new snapshot is when > stats_fetch_consistency = > > snapshot, the cached is when stats_fetch_consistency = cache. Not sure > how to > > make that clearer without making it a lot longer. Suggestions? > > I think it's correct. Maybe it's clearer to say: > > + next use of statistical information will (when in snapshot mode) cause > a new snapshot to be built > + or (when in cache mode) accessed statistics to be cached. > Ah, yes, that does clarify what was meant. +1 to Justin's edit, or something like it. Thanks, Lukas -- Lukas Fittl
Re: shared-memory based stats collector - v70
On Wed, Apr 06, 2022 at 12:27:34PM -0700, Andres Freund wrote:
> > > +     next use of statistical information will cause a new snapshot to be built
> > > +     or accessed statistics to be cached.
> >
> > I believe this should be an "and", not an "or". (next access builds both a
> > new snapshot and caches accessed statistics)
>
> I *think* or is correct? The new snapshot is when stats_fetch_consistency =
> snapshot, the cached is when stats_fetch_consistency = cache. Not sure how to
> make that clearer without making it a lot longer. Suggestions?

I think it's correct. Maybe it's clearer to say:

+     next use of statistical information will (when in snapshot mode) cause a new snapshot to be built
+     or (when in cache mode) accessed statistics to be cached.
Re: shared-memory based stats collector - v70
Hi, On 2022-04-06 12:14:35 -0700, Lukas Fittl wrote: > On Tue, Apr 5, 2022 at 8:00 PM Andres Freund wrote: > > > Here comes v70: > > > > Some small nitpicks on the docs: Thanks! > > From 13090823fc4c7fb94512110fb4d1b3e86fb312db Mon Sep 17 00:00:00 2001 > > From: Andres Freund > > Date: Sat, 2 Apr 2022 19:38:01 -0700 > > Subject: [PATCH v70 14/27] pgstat: update docs. > > ... > > diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml > > - These parameters control server-wide statistics collection > features. > > - When statistics collection is enabled, the data that is produced > can be > > + These parameters control server-wide cumulative statistics system. > > + When enabled, the data that is collected can be > > Missing "the" ("These parameters control the server-wide cumulative > statistics system"). > > diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml > > + any of the accumulated statistics, acessed values are cached until > the end > > "acessed" => "accessed" > > + stats_fetch_consistency can be set > > + snapshot, at the price of increased memory usage > for > > Missing "to" ("can be set to snapshot") Fixed. > > + caching not-needed statistics data. Conversely, if it's known that > statistics > > Double space between "data." and "Conversely" (not sure if that matters) > > + current transaction's statistics snapshot or cached values (if any). > The > > Double space between "(if any)." and "The" (not sure if that matters) That's done pretty widely in the docs and comments. > > + next use of statistical information will cause a new snapshot to be > built > > + or accessed statistics to be cached. > > I believe this should be an "and", not an "or". (next access builds both a > new snapshot and caches accessed statistics) I *think* or is correct? The new snapshot is when stats_fetch_consistency = snapshot, the cached is when stats_fetch_consistency = cache. Not sure how to make that clearer without making it a lot longer. 
Suggestions? Greetings, Andres Freund
Re: shared-memory based stats collector - v70
On Tue, Apr 5, 2022 at 8:00 PM Andres Freund wrote: > Here comes v70: > Some small nitpicks on the docs: > From 13090823fc4c7fb94512110fb4d1b3e86fb312db Mon Sep 17 00:00:00 2001 > From: Andres Freund > Date: Sat, 2 Apr 2022 19:38:01 -0700 > Subject: [PATCH v70 14/27] pgstat: update docs. > ... > diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml > - These parameters control server-wide statistics collection features. > - When statistics collection is enabled, the data that is produced can be > + These parameters control server-wide cumulative statistics system. > + When enabled, the data that is collected can be Missing "the" ("These parameters control the server-wide cumulative statistics system"). > diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml > + any of the accumulated statistics, acessed values are cached until the end "acessed" => "accessed" > + stats_fetch_consistency can be set > + snapshot, at the price of increased memory usage for Missing "to" ("can be set to snapshot") > + caching not-needed statistics data. Conversely, if it's known that statistics Double space between "data." and "Conversely" (not sure if that matters) > + current transaction's statistics snapshot or cached values (if any). The Double space between "(if any)." and "The" (not sure if that matters) > + next use of statistical information will cause a new snapshot to be built > + or accessed statistics to be cached. I believe this should be an "and", not an "or". (next access builds both a new snapshot and caches accessed statistics) Thanks, Lukas -- Lukas Fittl
Re: shared-memory based stats collector - v70
Hi,

On 2022-04-06 16:24:28 +0700, John Naylor wrote:
> On Wed, Apr 6, 2022 at 10:00 AM Andres Freund wrote:
> > - while working on the above point, I noticed that hash_bytes() showed up
> >   noticeably in profiles, so I replaced it with a fixed-width function
>
> I'm curious about this -- could you direct me to which patch introduces this?

Commit 0010, search for pgstat_hash_key_hash. For simplicity I'm including it
here inline:

/* helpers for dshash / simplehash hashtables */
static inline int
pgstat_hash_key_cmp(const void *a, const void *b, size_t size, void *arg)
{
	AssertArg(size == sizeof(PgStat_HashKey) && arg == NULL);
	return memcmp(a, b, sizeof(PgStat_HashKey));
}

static inline uint32
pgstat_hash_key_hash(const void *d, size_t size, void *arg)
{
	const PgStat_HashKey *key = (PgStat_HashKey *) d;
	uint32		hash;

	AssertArg(size == sizeof(PgStat_HashKey) && arg == NULL);

	hash = murmurhash32(key->kind);
	hash = hash_combine(hash, murmurhash32(key->dboid));
	hash = hash_combine(hash, murmurhash32(key->objoid));

	return hash;
}

Greetings,

Andres Freund
Re: shared-memory based stats collector - v70
Hi, On 2022-04-06 13:31:31 +0200, Alvaro Herrera wrote: > Just skimming a bit here ... Thanks! > On 2022-Apr-05, Andres Freund wrote: > > > From 0532b869033595202d5797b148f22c61e4eb4969 Mon Sep 17 00:00:00 2001 > > From: Andres Freund > > Date: Mon, 4 Apr 2022 16:53:16 -0700 > > Subject: [PATCH v70 10/27] pgstat: store statistics in shared memory. > > > + PgStatsData > > + Waiting fo shared memory stats data access > > + > > Typo "fo" -> "for" Oh, oops. I had fixed that in the wrong patch. > > @@ -5302,7 +5317,9 @@ StartupXLOG(void) > > performedWalRecovery = true; > > } > > else > > + { > > performedWalRecovery = false; > > + } > > Why? :-) Damage from merging two commits yesterday. I'd left open where exactly we'd reset stats, with the "main commit" implementing the current behaviour more closely, and then a followup commit implementing something a bit better. Nobody seemed to argue for keeping the behaviour 1:1, so I merged them. Without removing the parens again :) > Why give pgstat_get_entry_ref the responsibility of initializing > created_entry to false? The vast majority of callers don't care about > that flag; it seems easier/cleaner to set it to false in > pgstat_init_function_usage (the only caller that cares that I could > find) before calling pgstat_prep_pending_entry. It's annoying to have to initialize it, I agree. But I think it's bugprone for the caller to know that it has to be pre-initialized to false. > (I suggest pgstat_prep_pending_entry should have a comment line stating > "*created_entry, if not NULL, is set true if the entry required to be > created.", same as pgstat_get_entry_ref.) Added something along those lines. Greetings, Andres Freund
Re: shared-memory based stats collector - v70
Hi, On 2022-04-06 18:11:04 +0900, Kyotaro Horiguchi wrote: > 0004: > > I can see padding_pgstat_send and fun:pgstat_send in valgrind.supp Those shouldn't be affected by the patch, I think? But I did indeed forget to remove those in 0010. > 0006: > > I'm fine with the categorize for now. > > +#define PGSTAT_KIND_LAST PGSTAT_KIND_WAL > +#define PGSTAT_NUM_KINDS (PGSTAT_KIND_LAST + 1) > > The number of kinds is 10. And PGSTAT_NUM_KINDS is 11? Yea, it's not great. I think I'll introduce INVALID and rename PGSTAT_KIND_FIRST to FIRST_VALID. > + * Don't define an INVALID value so switch() statements can warn if some > + * cases aren't covered. But define the first member to 1 so that > + * uninitialized values can be detected more easily. > > FWIW, I like this. I think there's no switches left now, so it's not actually providing too much. > 0008: > > + xact_desc_stats(buf, "", parsed.nstats, parsed.stats); > + xact_desc_stats(buf, "commit ", parsed.nstats, parsed.stats); > + xact_desc_stats(buf, "abort ", parsed.nabortstats, parsed.abortstats); > > I'm not sure I like this, but I don't object to this.. The string prefixes? Or the entire patch? > 0010: > (I didn't look this closer. The comments arised while looking other > patches.) > > +pgstat_kind_from_str(char *kind_str) > > I don't think I like "str" so much. Don't we spell it as > "pgstat_kind_from_name"? name makes me think of NameData. What do you dislike about str? We seem to use str in plenty places? > 0019: > > +my $og_stats = $datadir . '/' . "pg_stat" . '/' . "pgstat.stat"; > > It can be "$datadir/pg_stat/pgstat.stat" or $datadir . '/pgstat/pgstat.stat'. > Isn't it simpler? Yes, will change. Greetings, Andres Freund
Re: shared-memory based stats collector - v70
Just skimming a bit here ...

On 2022-Apr-05, Andres Freund wrote:

> From 0532b869033595202d5797b148f22c61e4eb4969 Mon Sep 17 00:00:00 2001
> From: Andres Freund
> Date: Mon, 4 Apr 2022 16:53:16 -0700
> Subject: [PATCH v70 10/27] pgstat: store statistics in shared memory.

> +     PgStatsData
> +     Waiting fo shared memory stats data access
> +

Typo "fo" -> "for"

> @@ -5302,7 +5317,9 @@ StartupXLOG(void)
> 		performedWalRecovery = true;
> 	}
> 	else
> +	{
> 		performedWalRecovery = false;
> +	}

Why? :-)

Why give pgstat_get_entry_ref the responsibility of initializing
created_entry to false? The vast majority of callers don't care about that
flag; it seems easier/cleaner to set it to false in
pgstat_init_function_usage (the only caller that cares that I could find)
before calling pgstat_prep_pending_entry.

(I suggest pgstat_prep_pending_entry should have a comment line stating
"*created_entry, if not NULL, is set true if the entry required to be
created.", same as pgstat_get_entry_ref.)

-- 
Álvaro Herrera        PostgreSQL Developer — https://www.EnterpriseDB.com/
"How amazing is that? I call it a night and come back to find that a bug has
been identified and patched while I sleep." (Robert Davidson)
http://archives.postgresql.org/pgsql-sql/2006-03/msg00378.php
Re: shared-memory based stats collector - v70
On Wed, Apr 6, 2022 at 10:00 AM Andres Freund wrote: > - while working on the above point, I noticed that hash_bytes() showed up > noticeably in profiles, so I replaced it with a fixed-width function I'm curious about this -- could you direct me to which patch introduces this? -- John Naylor EDB: http://www.enterprisedb.com
Re: shared-memory based stats collector - v70
At Tue, 5 Apr 2022 20:00:08 -0700, Andres Freund wrote in
> Hi,
>
> Here comes v70:
> - extended / polished the architecture comment based on feedback from Melanie
>   and David
> - other polishing as suggested by David
> - addressed the open issue around pgstat_report_stat(), as described in
>   https://www.postgresql.org/message-id/20220405204019.6yj7ocmpw352c2u5%40alap3.anarazel.de
> - while working on the above point, I noticed that hash_bytes() showed up
>   noticeably in profiles, so I replaced it with a fixed-width function
> - found a few potential regression test instabilities by either *always*
>   flushing in pgstat_report_stat(), or only flushing when force = true.
> - random minor improvements
> - reordered commits some
>
> I still haven't renamed pg_stat_exists_stat() yet - I'm leaning towards
> pg_stat_have_stats() or pg_stat_exists() right now. But it's an SQL function
> for testing, so it doesn't really matter.
>
> I think this is basically ready, minus a a few comment adjustments here and
> there. Unless somebody protests I'm planning to start pushing things tomorrow
> morning.
>
> It'll be a few hours to get to the main commit - but except for 0001 it
> doesn't make sense to push without intending to push later changes too. I
> might squash a few commits togther.
>
> There's lots that can be done once all this is in place, both simplifying
> pre-existing code and easy new features, but that's for a later release.

I'm not sure it's in time but.. (Sorry in advance for possible duplicate or
pointless comments.)

0001:
Looks fine.

0002:
All references to "stats collector" or alike look like they are eliminated
after all of the 24 patches are applied. So this seems fine.

0003:
This is just moving around functions and variables. Looks fine.

0004:
I can see padding_pgstat_send and fun:pgstat_send in valgrind.supp

0005:
The function is changed in a later patch, and it looks fine.

0006:
I'm fine with the categorize for now.

+#define PGSTAT_KIND_LAST PGSTAT_KIND_WAL
+#define PGSTAT_NUM_KINDS (PGSTAT_KIND_LAST + 1)

The number of kinds is 10. And PGSTAT_NUM_KINDS is 11?

+ * Don't define an INVALID value so switch() statements can warn if some
+ * cases aren't covered. But define the first member to 1 so that
+ * uninitialized values can be detected more easily.

FWIW, I like this.

0007:
(mmm no comments)

0008:
+	xact_desc_stats(buf, "", parsed.nstats, parsed.stats);
+	xact_desc_stats(buf, "commit ", parsed.nstats, parsed.stats);
+	xact_desc_stats(buf, "abort ", parsed.nabortstats, parsed.abortstats);

I'm not sure I like this, but I don't object to this..

0009: (skipped)

0010:
(I didn't look this closer. The comments arised while looking other
patches.)

+pgstat_kind_from_str(char *kind_str)

I don't think I like "str" so much. Don't we spell it as
"pgstat_kind_from_name"?

0011: Looks fine.
0012: Looks like covering all related parts.
0013: Just fine.
0014: I believe it:p
0015: Function attributes seem correct. Looks fine.
0016: (skipped, but looks fine by a quick look.)
0017: I don't find a problem with this.
0018: (skipped)

0019:
+my $og_stats = $datadir . '/' . "pg_stat" . '/' . "pgstat.stat";

It can be "$datadir/pg_stat/pgstat.stat" or $datadir . '/pgstat/pgstat.stat'.
Isn't it simpler?

0020-24: (I believe them:p)

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
Re: shared-memory based stats collector - v70
Hi, On 2022-04-05 23:11:07 -0400, Greg Stark wrote: > I've never tried to review a 24-patch series before. It's kind of > intimidating Is there a good place to start to get a good idea of > the most important changes? It was more at some point :). And believe me, I find this whole project intimidating and exhausting. The stats collector is entangled in a lot of places, and there was a lot of preparatory work to get this point. Most of the commits aren't really interesting, I broke them out to make the "main commit" a bit smaller, because it's exhausting to look at a *huge* single commit. I wish I could have broken it down more, but I didn't find a good way. The interesting commit is v70-0010-pgstat-store-statistics-in-shared-memory.patch which actually replaces the stats collector by storing stats in shared memory. It contains a, now hopefully decent, overview of how things work at the top of pgstat.c. Greetings, Andres Freund
Re: shared-memory based stats collector - v70
On Tue, Apr 5, 2022 at 8:11 PM Greg Stark wrote: > I've never tried to review a 24-patch series before. It's kind of > intimidating Is there a good place to start to get a good idea of > the most important changes? > It isn't as bad as the number makes it sound - I just used "git am" to apply the patches to a branch and skimmed each commit separately. Most of them are tests or other minor pieces. The remaining few cover different aspects of the major commit and you can choose them based upon your experience and time. David J.
Re: shared-memory based stats collector - v70
I've never tried to review a 24-patch series before. It's kind of intimidating. Is there a good place to start to get a good idea of the most important changes?
Re: shared-memory based stats collector - v70
On Tue, Apr 5, 2022 at 8:00 PM Andres Freund wrote: > > Here comes v70: > > I think this is basically ready, minus a a few comment adjustments here and > there. Unless somebody protests I'm planning to start pushing things > tomorrow > morning. > > Nothing I've come across, given my area of experience, gives me pause. I'm mostly going to focus on docs and comments at this point - to try and help the next person in my position (and end-users) have an easier go at on-boarding. Toward that end, I did just add a "Purpose" section writeup to the v69 thread. David J.