Re: [PoC] Improve dead tuple storage for lazy vacuum
On Wed, Sep 28, 2022 at 1:18 PM I wrote:

> Along those lines, one thing I've been thinking about is the number of size
> classes. There is a tradeoff between memory efficiency and number of branches
> when searching/inserting. My current thinking is there is too much coupling
> between size class and data type. Each size class currently uses a different
> data type and a different algorithm to search and set it, which in turn
> requires another branch. We've found that a larger number of size classes
> leads to poor branch prediction [1] and (I imagine) code density.
>
> I'm thinking we can use "flexible array members" for the values/pointers, and
> keep the rest of the control data in the struct the same. That way, we never
> have more than 4 actual "kinds" to code and branch on. As a bonus, when
> migrating a node to a larger size class of the same kind, we can simply
> repalloc() to the next size.

While the most important challenge right now is how to best represent and
organize the shared memory case, I wanted to get the above idea working and
out of the way, to be saved for a future time.

I've attached a rough implementation (applies on top of v9 0003) that splits
node32 into 2 size classes. They both share the exact same base data type and
hence the same search/set code, so the number of "kind"s is still four, but
here there are five "size classes", so a new case in the "unlikely"
node-growing path. The smaller instance of node32 is a "node15", because
that's currently 160 bytes, corresponding to one of the DSA size classes. This
idea can be applied to any other node except the max size, as we see fit.
(Adding a singleton size class would bring it back in line with the prototype,
at least as far as memory consumption.)

One issue with this patch: The "fanout" member is a uint8, so it can't hold
256 for the largest node kind. That's not an issue in practice, since we never
need to grow it, and we only compare that value with the count in an Assert(),
so I just set it to zero. That does break an invariant, so it's not great. We
could use 2 bytes to be strictly correct in all cases, but that limits what we
can do with the smallest node kind.

In the course of working on this, I encountered a pain point. Since it's
impossible to repalloc in slab, we have to do alloc/copy/free ourselves.
That's fine, but the current coding makes too many assumptions about the use
cases: rt_alloc_node and rt_copy_node are too entangled with each other and do
too much work unrelated to what the names imply. I seem to remember an earlier
version had something like rt_node_copy_common that did only...copying. That
was much easier to reason about. In 0002 I resorted to doing my own allocation
to show what I really want to do, because the new use case doesn't need
zeroing and setting values. It only needs to...allocate (and increase the
stats counter if built that way).

Future optimization work while I'm thinking of it: rt_alloc_node should be
always-inlined and the memset done separately (i.e. not *AllocZero). That way
the compiler should be able to generate more efficient zeroing code for
smaller nodes. I'll test the numbers on this sometime in the future.

--
John Naylor
EDB: http://www.enterprisedb.com

From 6fcc970ae7e31f44fa6b6aface983cadb023cc50 Mon Sep 17 00:00:00 2001
From: John Naylor
Date: Thu, 17 Nov 2022 16:10:44 +0700
Subject: [PATCH v901 2/2] Make node32 variable sized

Add a size class for 15 elements, which corresponds to 160 bytes, an
allocation size used by DSA. When a 16th element is to be inserted, allocate
a larger area and memcpy the entire old node to it.

NB: Zeroing the new area is only necessary if it's for an inner node128,
since insert logic must check for null child pointers.

This technique allows us to limit the node kinds to 4, which
1. limits the number of cases in switch statements
2. allows a possible future optimization to encode the node kind in a
   pointer tag
---
 src/backend/lib/radixtree.c | 141 +++-
 1 file changed, 108 insertions(+), 33 deletions(-)

diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
index bef1a438ab..f368e750d5 100644
--- a/src/backend/lib/radixtree.c
+++ b/src/backend/lib/radixtree.c
@@ -130,6 +130,7 @@ typedef enum
 typedef enum rt_size_class
 {
 	RT_CLASS_4_FULL = 0,
+	RT_CLASS_32_PARTIAL,
 	RT_CLASS_32_FULL,
 	RT_CLASS_128_FULL,
 	RT_CLASS_256
@@ -147,6 +148,8 @@ typedef struct rt_node
 	uint16		count;

 	/* Max number of children. We can use uint8 because we never need to store 256 */
+	/* WIP: if we don't have a variable sized node4, this should instead be in the base
+	   types as needed, since saving every byte is crucial for the smallest node kind */
 	uint8		fanout;

 	/*
@@ -166,6 +169,8 @@ typedef struct rt_node
 	((node)->base.n.count < (node)->base.n.fanout)

 /* Base type of each node kinds for leaf and inner nodes */
+/* The base
Re: contrib: auth_delay module
On Thu, Nov 17, 2022 at 05:37:51PM -0500, Tom Lane wrote:
> =?UTF-8?B?5oiQ5LmL54SV?= writes:
> > The attached patch is a contrib module to set login restrictions on users
> > with too many authentication failures. The administrator could manage
> > several GUC parameters to control the login restrictions, which are listed
> > below.
> > - set the wait time when password authentication fails.
> > - allow the wait time to grow when users of the same IP consecutively fail
> >   to log on.
> > - set the maximum authentication failure number from the same user. The
> >   system will prevent a user who gets too many authentication failures
> >   from entering the database.
>
> I'm not yet forming an opinion on whether this is useful enough
> to accept.

I'm not sure that doing that on the backend side is really a great idea: an
attacker will still be able to exhaust the available connection slots. If your
instance is reachable from some untrusted network (which already sounds
scary), it's much easier to simply configure something like fail2ban to
provide the same feature in a more efficient way. You can even block access to
other services too while at it.

Note that there's also an extension to log failed connection attempts to an
alternate file with a fixed simple format, if you're worried that your regular
logs are too verbose.
Re: Perform streaming logical transactions by background workers and parallel apply
Here are some review comments for v47-0001

(This review is a WIP - I will post more comments for this patch next week)

==
.../replication/logical/applyparallelworker.c

1.
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION src/backend/replication/logical/applyparallelworker.c
+ *

This IDENTIFICATION should be on 2 lines like it previously was instead of
wrapped into one line, for consistency with all other file headers.

~~~

2. File header comment

+ * Since the database structure (schema of subscription tables, etc.) of
+ * publisher and subscriber may be different.

Incomplete sentence?

~~~

3.
+ * When the following two scenarios occur, a deadlock occurs.

Actually, you described three scenarios in this comment, not two.

SUGGESTION
The following scenarios can cause a deadlock.

~~~

4.
+ * LA (waiting to acquire the local transaction lock) -> PA1 (waiting to
+ * acquire the lock on the unique index) -> PA2 (waiting to acquire the lock on
+ * the remote transaction) -> LA

"PA1" -> "PA-1"
"PA2" -> "PA-2"

~~~

5.
+ * To resolve this issue, we use non-blocking write and wait with a timeout. If
+ * timeout is exceeded, the LA report an error and restart logical replication.

"report" -> "reports"
"restart" -> "restarts"
OR
"LA report" -> "LA will report"

~~~

6. pa_wait_for_xact_state

+/*
+ * Wait until the parallel apply worker's transaction state reach or exceed the
+ * given xact_state.
+ */
+static void
+pa_wait_for_xact_state(ParallelApplyWorkerShared *wshared,
+					   ParallelTransState xact_state)

"reach or exceed" -> "reaches or exceeds"

~~~

7. pa_stream_abort

+	/*
+	 * Although the lock can be automatically released during transaction
+	 * rollback, but we still release the lock here as we may not in a
+	 * transaction.
+	 */
+	pa_unlock_transaction(xid, AccessShareLock);

"but we still" -> "we still"
"we may not in a" -> "we may not be in a"

~~~

8.
+	pa_savepoint_name(MySubscription->oid, subxid, spname,
+					  sizeof(spname));

Unnecessary wrapping

~~~

9.
+	for (i = list_length(subxactlist) - 1; i >= 0; i--)
+	{
+		TransactionId xid_tmp = lfirst_xid(list_nth_cell(subxactlist, i));
+
+		if (xid_tmp == subxid)
+		{
+			found = true;
+			break;
+		}
+	}
+
+	if (found)
+	{
+		RollbackToSavepoint(spname);
+		CommitTransactionCommand();
+		subxactlist = list_truncate(subxactlist, i + 1);
+	}

This code logic does not seem to require the 'found' flag. You can do the
RollbackToSavepoint/CommitTransactionCommand/list_truncate before the break.

~~~

10. pa_lock/unlock_stream/_transaction

+/*
+ * Helper functions to acquire and release a lock for each stream block.
+ *
+ * Set locktag_field4 to 0 to indicate that it's a stream lock.
+ */

+/*
+ * Helper functions to acquire and release a lock for each local transaction.
+ *
+ * Set locktag_field4 to 1 to indicate that it's a transaction lock.

Should constants/defines/enums replace those magic numbers 0 and 1?

~~~

11. pa_lock_transaction

+ * Note that all the callers are passing remote transaction ID instead of local
+ * transaction ID as xid. This is because the local transaction ID will only be
+ * assigned while applying the first change in the parallel apply, but it's
+ * possible that the first change in parallel apply worker is blocked by a
+ * concurrently executing transaction in another parallel apply worker causing
+ * the leader cannot get local transaction ID.

"causing the leader cannot" -> "which means the leader cannot" (??)

==
src/backend/replication/logical/worker.c

12. TransApplyAction

+/*
+ * What action to take for the transaction.
+ *
+ * TRANS_LEADER_APPLY:
+ * The action means that we are in the leader apply worker and changes of the
+ * transaction are applied directly in the worker.
+ *
+ * TRANS_LEADER_SERIALIZE:
+ * It means that we are in the leader apply worker or table sync worker.
+ * Changes are written to temporary files and then applied when the final
+ * commit arrives.
+ *
+ * TRANS_LEADER_SEND_TO_PARALLEL:
+ * The action means that we are in the leader apply worker and need to send the
+ * changes to the parallel apply worker.
+ *
+ * TRANS_PARALLEL_APPLY:
+ * The action that we are in the parallel apply worker and changes of the
+ * transaction are applied directly in the worker.
+ */
+typedef enum

12a.
Too many various ways of saying the same thing:
"The action means that we..."
"It means that we..."
"The action that we..." (typo?)

Please word all these comments consistently.

~

12b.
"directly in the worker" -> "directly by the worker" (??) 2x

~~~

13. get_worker_name

+/*
+ * Return the name of the logical replication worker.
+ */
+static const char *
+get_worker_name(void)
+{
+	if (am_tablesync_worker())
+		return _("logical replication table synchronization worker");
+	else if (am_parallel_apply_worker())
+		return _("logical replication parallel apply worker");
+	else
+		return _("logical replication apply worker");
+}

This function belongs nearer the top
Re: Avoid double lookup in pgstat_fetch_stat_tabentry()
On Fri, Nov 18, 2022 at 10:32 AM Drouvot, Bertrand wrote:
>
> Hi hackers,
>
> Please find attached a patch proposal to avoid 2 calls to
> pgstat_fetch_stat_tabentry_ext() in pgstat_fetch_stat_tabentry() in case
> the relation is not a shared one and no statistics are found.
>
> Thanks Andres for the suggestion done in [1].
>
> [1]: https://www.postgresql.org/message-id/20221116201202.3k74ajawyom2c3eq%40awork3.anarazel.de

+1. The patch LGTM. However, I have a suggestion to simplify it further by
getting rid of the local variable tabentry and just returning
pgstat_fetch_stat_tabentry_ext(IsSharedRelation(relid), relid);.
Furthermore, pgstat_fetch_stat_tabentry() can just be a static inline
function.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Re: Assertion failure in SnapBuildInitialSnapshot()
On Thu, Nov 17, 2022 at 11:15 PM Andres Freund wrote:
>
> On 2022-11-17 10:44:18 +0530, Amit Kapila wrote:
> > On Wed, Nov 16, 2022 at 11:56 PM Andres Freund wrote:
> > > On 2022-11-16 14:22:01 +0530, Amit Kapila wrote:
> > > > On Wed, Nov 16, 2022 at 7:30 AM Andres Freund wrote:
> > > > > On 2022-11-15 16:20:00 +0530, Amit Kapila wrote:
> > > > > > On Tue, Nov 15, 2022 at 8:08 AM Andres Freund wrote:
> > >
> > > I don't think that'd catch a catalog snapshot. But perhaps the better answer
> > > for the catalog snapshot is to just invalidate it explicitly. The user doesn't
> > > have control over the catalog snapshot being taken, and it's not too hard to
> > > imagine the walsender code triggering one somewhere.
> > >
> > > So maybe we should add something like:
> > >
> > > InvalidateCatalogSnapshot(); /* about to overwrite MyProc->xmin */
> > >
> > The comment "/* about to overwrite MyProc->xmin */" is unclear to me.
> > We already have a check (/* so we don't overwrite the existing value */
> > if (TransactionIdIsValid(MyProc->xmin))) in SnapBuildInitialSnapshot()
> > which ensures that we don't overwrite MyProc->xmin, so the above
> > comment seems contradictory to me.
>
> The point is that catalog snapshots could easily end up setting MyProc->xmin,
> even though the caller hasn't done anything wrong. So the
> InvalidateCatalogSnapshot() would avoid erroring out in a number of scenarios.
>

Okay, updated the patch accordingly.

--
With Regards,
Amit Kapila.

v3-0001-Add-additional-checks-while-creating-the-initial-.patch
Description: Binary data
Re: Optimize join selectivity estimation by not reading MCV stats for unique join attributes
Richard Guo writes:
> Yes, it is. Using zero flag would short-cut get_attstatsslot() to just
> return whether the slot type exists without loading it. Do you think we
> need to emphasize this use case in the comments for 'flags'?

Perhaps, it's not really obvious now.

> I wonder whether we need to also check statistic_proc_security_check()
> when determining if MCVs exists in both sides.

Yeah, I thought about hoisting the statistic_proc_security_check tests up
into get_mcv_stats. I don't think it's a great idea though. Again, it'd
complicate untangling this if we ever generalize the use of MCVs in this
function. Also, I don't think we should be micro-optimizing the case where
the security check doesn't pass --- if it doesn't, you're going to be hurting
from bad plans a lot more than you are from some wasted cycles here.

			regards, tom lane
Re: Perform streaming logical transactions by background workers and parallel apply
On Fri, Nov 18, 2022 at 10:31 AM Masahiko Sawada wrote:
>
> On Fri, Nov 18, 2022 at 1:47 PM Amit Kapila wrote:
> >
> > On Fri, Nov 18, 2022 at 8:01 AM Peter Smith wrote:
> > >
> > > On Fri, Nov 18, 2022 at 11:36 AM Masahiko Sawada wrote:
> > > >
> > > > ...
> > > > ---
> > > > The streaming parameter has the new value "parallel" for "streaming"
> > > > option to enable the parallel apply. It fits so far but I think the
> > > > parallel apply feature doesn't necessarily need to be tied up with
> > > > streaming replication. For example, we might want to support parallel
> > > > apply also for non-streaming transactions in the future. It might be
> > > > better to have another option, say "parallel", to control parallel
> > > > apply behavior. The "parallel" option can be a boolean option and
> > > > setting parallel = on requires streaming = on.
> > > >
> > If we do that then how will the user be able to use streaming
> > serialize mode (write to file for streaming transactions) as we have
> > now? Because after we introduce parallelism for non-streaming
> > transactions, the user would want parallel = on irrespective of the
> > streaming mode. Also, users may wish to only parallelize large
> > transactions because of additional overhead for non-streaming
> > transactions for transaction dependency tracking, etc. So, the user
> > may wish to have a separate knob for large transactions as the patch
> > has now.
>
> One idea for that would be to make it enum. For example, setting
> parallel = "streaming" works for that.
>

Yeah, but then we will have two different parameters (parallel and streaming)
to control streaming behavior. This will be confusing, say, when the user
sets parallel = 'streaming' and streaming = off: we would probably need to
disallow such settings, but I'm not sure that would be any better than
allowing parallelism for large xacts via the streaming parameter.

--
With Regards,
Amit Kapila.
Avoid double lookup in pgstat_fetch_stat_tabentry()
Hi hackers,

Please find attached a patch proposal to avoid 2 calls to
pgstat_fetch_stat_tabentry_ext() in pgstat_fetch_stat_tabentry() in case the
relation is not a shared one and no statistics are found.

Thanks Andres for the suggestion done in [1].

[1]: https://www.postgresql.org/message-id/20221116201202.3k74ajawyom2c3eq%40awork3.anarazel.de

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

diff --git a/src/backend/utils/activity/pgstat_relation.c b/src/backend/utils/activity/pgstat_relation.c
index 55a355f583..a2a4fc6e2c 100644
--- a/src/backend/utils/activity/pgstat_relation.c
+++ b/src/backend/utils/activity/pgstat_relation.c
@@ -25,6 +25,7 @@
 #include "utils/pgstat_internal.h"
 #include "utils/rel.h"
 #include "utils/timestamp.h"
+#include "catalog/catalog.h"

 /* Record that's written to 2PC state file when pgstat state is persisted */
@@ -439,14 +440,8 @@ pgstat_fetch_stat_tabentry(Oid relid)
 {
 	PgStat_StatTabEntry *tabentry;

-	tabentry = pgstat_fetch_stat_tabentry_ext(false, relid);
-	if (tabentry != NULL)
-		return tabentry;
+	tabentry = pgstat_fetch_stat_tabentry_ext(IsSharedRelation(relid), relid);

-	/*
-	 * If we didn't find it, maybe it's a shared table.
-	 */
-	tabentry = pgstat_fetch_stat_tabentry_ext(true, relid);
 	return tabentry;
 }
Re: Perform streaming logical transactions by background workers and parallel apply
On Fri, Nov 18, 2022 at 1:47 PM Amit Kapila wrote:
>
> On Fri, Nov 18, 2022 at 8:01 AM Peter Smith wrote:
> >
> > On Fri, Nov 18, 2022 at 11:36 AM Masahiko Sawada wrote:
> > >
> > > ...
> > > ---
> > > The streaming parameter has the new value "parallel" for "streaming"
> > > option to enable the parallel apply. It fits so far but I think the
> > > parallel apply feature doesn't necessarily need to be tied up with
> > > streaming replication. For example, we might want to support parallel
> > > apply also for non-streaming transactions in the future. It might be
> > > better to have another option, say "parallel", to control parallel
> > > apply behavior. The "parallel" option can be a boolean option and
> > > setting parallel = on requires streaming = on.
> > >
> If we do that then how will the user be able to use streaming
> serialize mode (write to file for streaming transactions) as we have
> now? Because after we introduce parallelism for non-streaming
> transactions, the user would want parallel = on irrespective of the
> streaming mode. Also, users may wish to only parallelize large
> transactions because of additional overhead for non-streaming
> transactions for transaction dependency tracking, etc. So, the user
> may wish to have a separate knob for large transactions as the patch
> has now.

One idea for that would be to make it an enum. For example, setting
parallel = "streaming" works for that.

> >
> > FWIW, I tend to agree with this idea but for a different reason. In
> > this patch, the 'streaming' parameter had become a kind of hybrid
> > boolean/enum. AFAIK there are no other parameters anywhere that use a
> > hybrid pattern like this so I was thinking it may be better not to be
> > different.
> >
> I think we have a similar pattern for GUC parameters like
> constraint_exclusion (see constraint_exclusion_options),
> backslash_quote (see backslash_quote_options), etc.

Right. vacuum_index_cleanup and buffering (storage parameters that accept
'on', 'off', or 'auto') are other examples.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Re: Perform streaming logical transactions by background workers and parallel apply
On Fri, Nov 18, 2022 at 8:01 AM Peter Smith wrote:
>
> On Fri, Nov 18, 2022 at 11:36 AM Masahiko Sawada wrote:
> >
> > ...
> > ---
> > The streaming parameter has the new value "parallel" for "streaming"
> > option to enable the parallel apply. It fits so far but I think the
> > parallel apply feature doesn't necessarily need to be tied up with
> > streaming replication. For example, we might want to support parallel
> > apply also for non-streaming transactions in the future. It might be
> > better to have another option, say "parallel", to control parallel
> > apply behavior. The "parallel" option can be a boolean option and
> > setting parallel = on requires streaming = on.
> >

If we do that then how will the user be able to use streaming serialize mode
(write to file for streaming transactions) as we have now? Because after we
introduce parallelism for non-streaming transactions, the user would want
parallel = on irrespective of the streaming mode. Also, users may wish to only
parallelize large transactions because of additional overhead for
non-streaming transactions for transaction dependency tracking, etc. So, the
user may wish to have a separate knob for large transactions as the patch has
now.

> FWIW, I tend to agree with this idea but for a different reason. In
> this patch, the 'streaming' parameter had become a kind of hybrid
> boolean/enum. AFAIK there are no other parameters anywhere that use a
> hybrid pattern like this so I was thinking it may be better not to be
> different.
>

I think we have a similar pattern for GUC parameters like
constraint_exclusion (see constraint_exclusion_options),
backslash_quote (see backslash_quote_options), etc.

--
With Regards,
Amit Kapila.
Re: Optimize join selectivity estimation by not reading MCV stats for unique join attributes
On Fri, Nov 18, 2022 at 9:36 AM Tom Lane wrote:
> Actually, looking at get_attstatslot, I realize it was already designed
> to do that -- just pass zero for flags. So we could do it as attached.

Yes, it is. Using a zero flag would short-cut get_attstatsslot() to just
return whether the slot type exists, without loading it. Do you think we need
to emphasize this use case in the comments for 'flags'? It seems there is
currently no such use case in the code on HEAD.

I wonder whether we need to also check statistic_proc_security_check() when
determining if MCVs exist on both sides.

Thanks
Richard
Re: Strange failure on mamba
Hi,

On 2022-11-17 17:47:50 -0500, Tom Lane wrote:
> Yeah, that or some other NetBSD bug could be the explanation, too.
> Without a stack trace it's hard to have any confidence about it,
> but I've been unable to reproduce the problem outside the buildfarm.
> (Which is a familiar refrain. I wonder what it is about the buildfarm
> environment that makes it act different from the exact same code running
> on the exact same machine.)
>
> So I'd like to have some way to make the postmaster send SIGABRT instead
> of SIGKILL in the buildfarm environment. The lowest-tech way would be
> to drive that off some #define or other. We could scale it up to a GUC
> perhaps. Adjacent to that, I also wonder whether SIGABRT wouldn't be
> more useful than SIGSTOP for the existing SendStop half-a-feature ---
> the idea that people should collect cores manually seems mighty
> last-century.

I suspect that having a GUC would be a good idea. I needed something similar
recently, debugging an occasional hang in the AIO patchset. I first tried
something like your #define approach, and it did cause a problematic flood of
core files. I ended up using libbacktrace to generate useful backtraces (vs
what backtrace_symbols() generates) when receiving SIGQUIT. I didn't do the
legwork to make it properly signal safe, but it'd be doable afaiu.

Greetings,

Andres Freund
Re: logical decoding and replication of sequences, take 2
Hi,

On 2022-11-17 22:13:23 +0100, Tomas Vondra wrote:
> On 11/17/22 18:07, Andres Freund wrote:
> > On 2022-11-17 12:39:49 +0100, Tomas Vondra wrote:
> >> On 11/17/22 03:43, Andres Freund wrote:
> >>> I assume that part of the initial sync would have to be a new sequence
> >>> synchronization step that reads all the sequence states on the publisher and
> >>> ensures that the subscriber sequences are at the same point. There's a bit of
> >>> trickiness there, but it seems entirely doable. The logical replication replay
> >>> support for sequences will have to be a bit careful about not decreasing the
> >>> subscriber's sequence values - the standby initially will be ahead of the
> >>> increments we'll see in the WAL. But that seems inevitable given the
> >>> non-transactional nature of sequences.
> >>>
> >>
> >> See fetch_sequence_data / copy_sequence in the patch. The bit about
> >> ensuring the sequence does not go away (say, using page LSN and/or LSN
> >> of the increment) is not there, however isn't that pretty much what I
> >> proposed doing for "reconciling" the sequence state logged at COMMIT?
> >
> > Well, I think the approach of logging all sequence increments at commit is the
> > wrong idea...
>
> But we're not logging all sequence increments, no?

I was imprecise - I meant streaming them out at commit.

> Yeah, I think you're right. I looked at this again, with fresh mind, and
> I came to the same conclusion. Roughly what the attached patch does.

To me it seems a bit nicer to keep the SnapBuildGetOrBuildSnapshot() call in
decode.c instead of moving it to reorderbuffer.c.

Perhaps we should add a snapbuild.c helper similar to SnapBuildProcessChange()
for non-transactional changes that also gets a snapshot? Could look something
like:

    Snapshot snapshot = NULL;

    if (message->transactional &&
        !SnapBuildProcessChange(builder, xid, buf->origptr))
        return;
    else if (!SnapBuildProcessStateNonTx(builder, &snapshot))
        return;
    ...

Or perhaps we should just bite the bullet and add an argument to
SnapBuildProcessChange to deal with that?

Greetings,

Andres Freund
Re: Perform streaming logical transactions by background workers and parallel apply
On Fri, Nov 18, 2022 at 11:36 AM Masahiko Sawada wrote:
> ...
> ---
> The streaming parameter has the new value "parallel" for "streaming"
> option to enable the parallel apply. It fits so far but I think the
> parallel apply feature doesn't necessarily need to be tied up with
> streaming replication. For example, we might want to support parallel
> apply also for non-streaming transactions in the future. It might be
> better to have another option, say "parallel", to control parallel
> apply behavior. The "parallel" option can be a boolean option and
> setting parallel = on requires streaming = on.
>

FWIW, I tend to agree with this idea but for a different reason. In this
patch, the 'streaming' parameter had become a kind of hybrid boolean/enum.
AFAIK there are no other parameters anywhere that use a hybrid pattern like
this, so I was thinking it may be better not to be different.

But I didn't think that parallel_apply=on should *require* streaming=on. It
might be better for parallel_apply=on to just be the *default*, where it
simply achieves nothing unless streaming=on too. That way users would not
need to change anything at all to get the benefits of parallel streaming.

> Another variant is to have a new subscription parameter for example
> "parallel_workers" parameter that specifies the number of parallel
> workers. That way, users can specify the number of parallel workers
> per subscription.
>

--
Kind Regards,
Peter Smith.
Fujitsu Australia.
Re: Fix some newly modified tab-complete changes
On Wed, Nov 16, 2022 at 08:29:24AM +, shiy.f...@fujitsu.com wrote:
> I have fixed the problems you saw, and improved the patch as you suggested.
>
> Besides, I noticed that the tab completion for "ALTER DEFAULT PRIVILEGES ...
> GRANT/REVOKE ..." missed "CREATE". Fix it in 0001 patch.
>
> And commit e3ce2de09 supported GRANT ... WITH INHERIT ..., but there's no tab
> completion for it. Add this in 0002 patch.

Thanks. I have been looking at the patch, and pondered all the bloat added by
the handling of PRIVILEGES, to note at the end that ALL PRIVILEGES is parsed
the same way as ALL. So we don't actually need any of the complications
related to it, and the result would be the same.

I have merged 0001 and 0002 together, and applied the rest, which looked
rather fine. I have also simplified a bit the parts where "REVOKE GRANT" are
specified in a row, to avoid fancy results in some branches when we apply
Privilege_options_of_grant_and_revoke.

--
Michael

signature.asc
Description: PGP signature
Re: Perform streaming logical transactions by background workers and parallel apply
On Wed, Nov 16, 2022 at 1:50 PM houzj.f...@fujitsu.com wrote:
>
> On Tuesday, November 15, 2022 7:58 PM houzj.f...@fujitsu.com wrote:
>
> I noticed that I didn't add CHECK_FOR_INTERRUPTS while retrying send message.
> So, attach the new version which adds that. Also attach the 0004 patch that
> restarts logical replication with temporarily disabling the parallel apply if
> failed to apply a transaction in parallel apply worker.
>

Few comments on v48-0001
==

1. The variable name pending_message_count seems to indicate a number of
pending messages, but normally it is pending start/stop streams except for
probably the rollback to savepoint case. Shall we name it pending_stream_count
and change the comments accordingly?

2. The variable name abort_toplevel_transaction seems unnecessarily long.
Shall we rename it to toplevel_xact or something like that?

3.
+	/*
+	 * Increment the number of messages waiting to be processed by
+	 * parallel apply worker.
+	 */
+	if (!abort_toplevel_transaction)
+		pg_atomic_add_fetch_u32(&(winfo->shared->pending_message_count), 1);
+	else
+		pa_unlock_stream(xid, AccessExclusiveLock);

It is better to explain here why different actions are required for
subtransaction and transaction rather than the current comment.

4.
+
+	if (abort_toplevel_transaction)
+	{
+		(void) pa_free_worker(winfo, xid);
+	}

{} is not required here.

5.
+	/*
+	 * Although the lock can be automatically released during transaction
+	 * rollback, but we still release the lock here as we may not in a
+	 * transaction.
+	 */
+	pa_unlock_transaction(xid, AccessShareLock);
+

It is better to explain for which case (I think it is for empty xacts) it
will be useful to release it explicitly.

6.
+ *
+ * XXX We can avoid sending pairs of the START/STOP messages to the parallel
+ * worker because unlike apply worker it will process only one transaction at a
+ * time. However, it is not clear whether any optimization is worthwhile
+ * because these messages are sent only when the logical_decoding_work_mem
+ * threshold is exceeded.
+ */
 static void
 apply_handle_stream_start(StringInfo s)

I think this comment is no longer valid, as now we need to wait for the next
stream at the stream_stop message and also need to acquire the lock in the
stream_start message. So, I think it is better to remove it unless I am
missing something.

7. I am able to compile applyparallelworker.c after commenting out a few of
the header includes. Please check if those are really required.

#include "libpq/pqformat.h"
#include "libpq/pqmq.h"
//#include "mb/pg_wchar.h"
#include "pgstat.h"
#include "postmaster/interrupt.h"
#include "replication/logicallauncher.h"
//#include "replication/logicalworker.h"
#include "replication/origin.h"
//#include "replication/walreceiver.h"
#include "replication/worker_internal.h"
#include "storage/ipc.h"
#include "storage/lmgr.h"
//#include "storage/procarray.h"
#include "tcop/tcopprot.h"
#include "utils/inval.h"
#include "utils/memutils.h"
//#include "utils/resowner.h"
#include "utils/syscache.h"

8.
+/*
+ * Is there a message sent by parallel apply worker which we need to receive?
+ */
+volatile sig_atomic_t ParallelApplyMessagePending = false;

This comment and variable are placed in applyparallelworker.c, so 'we' in the
above sentence is not clear. I think you need to use "leader apply worker"
instead.

9.
+static ParallelApplyWorkerInfo *pa_get_free_worker(void);

Will it be better if we name this function pa_get_available_worker()?

--
With Regards,
Amit Kapila.
Re: Optimize join selectivity estimation by not reading MCV stats for unique join attributes
I wrote: > Tomas Vondra writes: >> Or perhaps what if we have a function that quickly determines if the >> attribute has MCV, without loading it? I'd bet the expensive part of >> get_attstatsslot() is the deconstruct_array(). >> We could have a function that only does the first small loop over slots, >> and returns true/false if we have a slot of the requested stakind. > Yeah, I like this idea. Actually, looking at get_attstatsslot, I realize it was already designed to do that -- just pass zero for flags. So we could do it as attached. We could make some consequent simplifications by only retaining one "have_mcvs" flag, but I'm inclined to leave the rest of the code as-is. We would not get much gain from that, and it would make this harder to undo if there ever is a reason to consider just one set of MCVs. regards, tom lane

diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index d597b7e81f..e0aeaa6909 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -2261,6 +2261,7 @@ eqjoinsel(PG_FUNCTION_ARGS)
 	Form_pg_statistic stats2 = NULL;
 	bool		have_mcvs1 = false;
 	bool		have_mcvs2 = false;
+	bool		get_mcv_stats;
 	bool		join_is_reversed;
 	RelOptInfo *inner_rel;
@@ -2275,11 +2276,25 @@ eqjoinsel(PG_FUNCTION_ARGS)
 	memset(&sslot1, 0, sizeof(sslot1));
 	memset(&sslot2, 0, sizeof(sslot2));
+	/*
+	 * There is no use in fetching one side's MCVs if we lack MCVs for the
+	 * other side, so do a quick check to verify that both stats exist.
+	 */
+	get_mcv_stats = (HeapTupleIsValid(vardata1.statsTuple) &&
+					 HeapTupleIsValid(vardata2.statsTuple) &&
+					 get_attstatsslot(&sslot1, vardata1.statsTuple,
+									  STATISTIC_KIND_MCV, InvalidOid,
+									  0) &&
+					 get_attstatsslot(&sslot2, vardata2.statsTuple,
+									  STATISTIC_KIND_MCV, InvalidOid,
+									  0));
+
 	if (HeapTupleIsValid(vardata1.statsTuple))
 	{
 		/* note we allow use of nullfrac regardless of security check */
 		stats1 = (Form_pg_statistic) GETSTRUCT(vardata1.statsTuple);
-		if (statistic_proc_security_check(&vardata1, opfuncoid))
+		if (get_mcv_stats &&
+			statistic_proc_security_check(&vardata1, opfuncoid))
 			have_mcvs1 = get_attstatsslot(&sslot1, vardata1.statsTuple,
 										  STATISTIC_KIND_MCV, InvalidOid,
 										  ATTSTATSSLOT_VALUES | ATTSTATSSLOT_NUMBERS);
@@ -2289,7 +2304,8 @@ eqjoinsel(PG_FUNCTION_ARGS)
 	{
 		/* note we allow use of nullfrac regardless of security check */
 		stats2 = (Form_pg_statistic) GETSTRUCT(vardata2.statsTuple);
-		if (statistic_proc_security_check(&vardata2, opfuncoid))
+		if (get_mcv_stats &&
+			statistic_proc_security_check(&vardata2, opfuncoid))
 			have_mcvs2 = get_attstatsslot(&sslot2, vardata2.statsTuple,
 										  STATISTIC_KIND_MCV, InvalidOid,
 										  ATTSTATSSLOT_VALUES | ATTSTATSSLOT_NUMBERS);
Re: Getting rid of SQLValueFunction
On Tue, Oct 25, 2022 at 02:20:12PM +0900, Michael Paquier wrote: > Attached is a rebased patch set, as of the conflicts from 2e0d80c. So, this patch set has been sitting in the CF app for a few weeks now, and I would like to apply them to remove a bit of code from the executor. Please note that in order to avoid tweaks when choosing the attribute name of a function call, this needs a total of 8 new catalog functions mapping to the SQL keywords, which is what the test added by 2e0d80c is about:
- current_role
- user
- current_catalog
- current_date
- current_time
- current_timestamp
- localtime
- localtimestamp
Any objections? -- Michael
Re: libpq compression (part 2)
On Thu, Nov 17, 2022 at 7:09 AM Peter Eisentraut wrote: > > Note that the above code was just changed in dce92e59b1. Thanks! > I don't know > how that affects this patch set. With dce92e59b1 it would be much easier to find a bug in the compression patch. Some more notes about the patch. (sorry for posting review notes in so many different messages)
1. In
zs_is_valid_impl_id(unsigned int id)
{
    return id >= 0 && id < lengthof(zs_algorithms);
}
id is unsigned, no need to check it's non-negative.
2. This literal {no_compression_name} should be replaced by explicit form {no_compression_name, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL}
3. Comments like "Return true if should, false if should not." are useless.
Thanks! Best regards, Andrey Borodin.
Re: Typo in SH_LOOKUP and SH_LOOKUP_HASH comments
On Thu, 17 Nov 2022 at 17:48, Daniel Gustafsson wrote: > > > On 17 Nov 2022, at 10:56, vignesh C wrote: > > > When I was reviewing one of the patches, I found a typo in the > > comments of SH_LOOKUP and SH_LOOKUP_HASH. I felt "lookup up" should > > have been "lookup". > > Attached a patch to modify it. > > I agree with that, applied to HEAD. Thanks! Thanks for pushing this change. Regards, Vignesh
Re: Reducing power consumption on idle servers
Hi, On 2022-11-17 13:06:23 +0530, Bharath Rupireddy wrote: > I understand. I know it's a bit hard to measure the power savings, I'm > wondering if there's any info, maybe not necessarily related to > postgres, but in general how much power gets saved if a certain number > of waits/polls/system calls are reduced. It's heavily hardware and hardware settings dependent. On some systems you can get an *approximate* idea of the power usage even while plugged in. On others you can only see it while on battery power. On systems with RAPL support (most Intel and I think newer AMD CPUs) you can query power usage with: powerstat -R 1 (adding -D provides a bit more detail) But note that RAPL typically severely undercounts power usage because it doesn't cover a lot of sources of power usage (display, sometimes memory, all other peripherals). Sometimes powerstat -R -D can split power usage up more granularly, e.g. between the different CPU sockets and memory. On laptops you can often measure the discharge rate when not plugged in, with powerstat -d 0 1. But IME the latency till the values update makes it harder to interpret. On some workstations / servers you can read the power usage via ACPI. E.g. on my workstation 'sensors' shows a power_meter-acpi-0 As an example of the difference it can make, here's the output of powerstat -D 1 on my laptop. 
Time     User Nice  Sys  Idle   IO Run Ctxt/s IRQ/s Fork Exec Exit  Watts
16:41:55  0.1  0.0  0.1  99.8  0.0   1    403   246    1    0    1   2.62
16:41:56  0.1  0.0  0.1  99.8  0.0   1    357   196    1    0    1   2.72
16:41:57  0.1  0.0  0.1  99.6  0.2   1    510   231    4    0    4   2.64
16:41:58  0.1  0.0  0.1  99.9  0.0   2   1350   758   64   62   63   4.06
16:41:59  0.3  0.0  1.0  98.7  0.0   2   4166  2406  244  243  244   7.20
16:42:00  0.2  0.0  0.7  99.1  0.0   2   4203  2353  247  246  247   7.21
16:42:01  0.5  0.0  1.6  98.0  0.0   2   4079  2395  240  239  240   7.08
16:42:02  0.5  0.0  0.9  98.7  0.0   2   4097  2405  245  243  245   7.20
16:42:03  0.4  0.0  1.3  98.3  0.0   2   4117  2311  243  242  243   7.14
16:42:04  0.1  0.0  0.4  99.4  0.1   1   1721  1152   70   70   71   4.48
16:42:05  0.1  0.0  0.2  99.8  0.0   1    433   250    1    0    1   2.92
16:42:06  0.0  0.0  0.3  99.7  0.0   1    400   231    1    0    1   2.66
In the period of higher power etc usage I ran while true;do sleep 0.001;done and then interrupted that after a bit with ctrl-c. Same thing on my workstation:
Time     User Nice  Sys  Idle   IO Run Ctxt/s IRQ/s Fork Exec Exit  Watts
16:43:48  1.0  0.0  0.2  98.7  0.1   1   8218  2354    0    0    0  46.43
16:43:49  1.1  0.0  0.3  98.7  0.0   1   7866  2477    0    0    0  45.99
16:43:50  1.1  0.0  0.4  98.5  0.0   2   7753  2996    0    0    0  48.93
16:43:51  0.8  0.0  1.7  97.5  0.0   1   9395  5285    0    0    0  75.48
16:43:52  0.5  0.0  1.7  97.8  0.0   1   9141  4806    0    0    0  75.30
16:43:53  1.1  0.0  1.8  97.1  0.0   2  10065  5504    0    0    0  76.27
16:43:54  1.3  0.0  1.5  97.2  0.0   1  10962  5165    0    0    0  76.33
16:43:55  0.9  0.0  0.8  98.3  0.0   1   8452  3939    0    0    0  61.99
16:43:56  0.6  0.0  0.1  99.3  0.0   2   6541  1999    0    0    0  40.92
16:43:57  0.9  0.0  0.2  98.9  0.0   2   8199  2477    0    0    0  42.91
And if I query the power supply via ACPI instead: while true;do sensors power_meter-acpi-0|grep power1|awk '{print $2, $3}';sleep 1;done
...
163.00 W
173.00 W
173.00 W
172.00 W
203.00 W
206.00 W
206.00 W
206.00 W
209.00 W
205.00 W
211.00 W
213.00 W
203.00 W
175.00 W
166.00 W
...
As you can see the difference is quite substantial. This is solely due to a 1ms sleep loop (albeit one where we fork a process after each sleep, which likely is a good chunk of the CPU usage).
However, the difference in power usage will be far smaller if the system is already busy, because the CPU will already run at a high frequency. Greetings, Andres Freund
Re: Optimize join selectivity estimation by not reading MCV stats for unique join attributes
Tomas Vondra writes: > Or perhaps what if we have a function that quickly determines if the > attribute has MCV, without loading it? I'd bet the expensive part of > get_attstatslot() is the deconstruct_array(). > We could have a function that only does the first small loop over slots, > and returns true/false if we have a slot of the requested stakind. Yeah, I like this idea. > It might even check the isunique flag first, to make it more convenient. That would tie it to this one use-case, which doesn't seem advisable. I think we should forget the known-unique angle and just do quick checks to see if both sides have MCVs. regards, tom lane
Re: Perform streaming logical transactions by background workers and parallel apply
On Tue, Nov 15, 2022 at 8:57 PM houzj.f...@fujitsu.com wrote: > > On Saturday, November 12, 2022 7:06 PM Amit Kapila > > > > On Fri, Nov 11, 2022 at 2:12 PM houzj.f...@fujitsu.com > > wrote: > > > > > > > Few comments on v46-0001: > > == > > > > Thanks for the comments. > > > 1. > > +static void > > +apply_handle_stream_abort(StringInfo s) > > { > > ... > > + /* Send STREAM ABORT message to the parallel apply worker. */ > > + parallel_apply_send_data(winfo, s->len, s->data); > > + > > + if (abort_toplevel_transaction) > > + { > > + parallel_apply_unlock_stream(xid, AccessExclusiveLock); > > > > Shouldn't we need to release this lock before sending the message as > > we are doing for streap_prepare and stream_commit? If there is a > > reason for doing it differently here then let's add some comments for > > the same. > > Changed. > > > 2. It seems once the patch makes the file state as busy > > (LEADER_FILESET_BUSY), it will only be accessible after the leader > > apply worker receives a transaction end message like stream_commit. Is > > my understanding correct? If yes, then why can't we make it accessible > > after the stream_stop message? Are you worried about the concurrency > > handling for reading and writing the file? If so, we can probably deal > > with it via some lock for reading and writing to file for each change. > > I think after this we may not need additional stream level lock/unlock > > in parallel_apply_spooled_messages. I understand that you probably > > want to keep the code simple so I am not suggesting changing it > > immediately but just wanted to know whether you have considered > > alternatives here. > > I thought about this, but it seems the current buffile design doesn't allow > two > processes to open the same buffile at the same time(refer to the comment atop > of BufFileOpenFileSet()). This means the LA needs to make sure the PA has > closed the buffile before writing more changes into it. 
Although we could let > the LA wait for that, but it could cause another kind of deadlock. Suppose the > PA opened the file and is blocked when applying the just read change. And the > LA starts to wait when trying to write the next set of streaming changes into > file because the file is still opened by PA. Then the lock edge is like: > > LA (wait for file to be closed) -> PA1 (wait for unique lock in PA2) -> PA2 > (wait for stream lock held in LA) > > We could introduce another lock for this, but that seems not very great as we > already had two kinds of locks here. > > Another solution which doesn't need a new lock could be that we create > different filename for each streaming block so that the leader doesn't need to > reopen the same file after writing changes into it, but that seems largely > increase the number of temp files and looks a bit hacky. Or we could let PA > open the file, then read and close the file for each change, but it seems bring > some overhead of opening and closing file. > > Based on above, how about keep the current approach ?(i.e. PA > will open the file only after the leader apply worker receives a transaction > end message like stream_commit). Ideally, it will enter partial serialize mode > only when PA is blocked by a backend or another PA which seems not that > common. +1. We can improve this area later in a separate patch. Here are review comments on v47-0001 and v47-0002 patches: When the parallel apply worker exited, I got the following server log. 
I think this log is not appropriate since the worker was not terminated by administrator command but exited by itself. Also, probably it should exit with exit code 0? FATAL: terminating logical replication worker due to administrator command LOG: background worker "logical replication parallel worker" (PID 3594918) exited with exit code 1 --- /* * Stop the worker if there are enough workers in the pool or the leader * apply worker serialized part of the transaction data to a file due to * send timeout. */ if (winfo->serialize_changes || napplyworkers > (max_parallel_apply_workers_per_subscription / 2)) Why do we need to stop the worker if the leader serializes changes? --- +/* + * Release all session level locks that could be held in parallel apply + * mode. + */ +LockReleaseAll(DEFAULT_LOCKMETHOD, true); + I think we call LockReleaseAll() at the process exit (in ProcKill()), but do we really need to do LockReleaseAll() here too? --- +elog(ERROR, "could not find replication state slot for replication" +
Re: Optimize join selectivity estimation by not reading MCV stats for unique join attributes
On 11/14/22 10:19, David Geier wrote: > Hi Tom, >> There won't *be* any MCV stats for a column that ANALYZE perceives to >> be unique, so I'm not quite sure where the claimed savings comes from. > > We save if one join attribute is unique while the other isn't. In that > case stored MCV stats are read for the non-unique attribute but then > never used. This is because MCV stats in join selectivity estimation are > only used if they're present on both columns > Right - if we only have MCV on one side of the join, we currently end up loading the MCV we have only to not use it anyway. The uniqueness is a simple way to detect some of those cases. I'd bet the savings can be quite significant for small joins and/or cases with large MCV. I wonder if we might be yet a bit smarter, though. For example, assume the first attribute is not defined as "unique" but we still don't have a MCV (it may be unique - or close to unique - in practice, or maybe it's just uniform distribution). We end up with have_mcvs1 = false Can't we just skip trying to load the second MCV? So we could do if (have_mcvs1 && HeapTupleIsValid(vardata2.statsTuple)) { ... try loading mcv2 ... } Or perhaps what if we have a function that quickly determines if the attribute has MCV, without loading it? I'd bet the expensive part of get_attstatslot() is the deconstruct_array(). We could have a function that only does the first small loop over slots, and returns true/false if we have a slot of the requested stakind. It might even check the isunique flag first, to make it more convenient. And only if both sides return "true" we'd load the MCV, deconstruct the array and all that. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: allowing for control over SET ROLE
On Thu, 2022-11-17 at 16:52 -0500, Robert Haas wrote: > But I think the bigger > reason is that, in my opinion, this proposal is more generally > useful, > because it takes no position on why you wish to disallow SET ROLE. > You > can just disallow it in some cases and allow it in others, and that's > fine. I agree that the it's more flexible in the sense that it does what it does, and administrators can use it if it's useful for them. That means we don't need to understand the actual goals as well; but it also means that it's harder to determine the consequences if we tweak the behavior (or any related behavior) later. I'll admit that I don't have an example of a likely problem here, though. > That proposal targets a specific use case, which may make it a > better solution to that particular problem, but it makes it > unworkable > as a solution to any other problem, I believe. Yeah, that's the flip side: "virtual" roles (for lack of a better name) are a more narrow fix for the problem as I understand it; but it might leave related problems unfixed. You and Stephen[2] both seemed to consider this approach, and I happened to like it, so I wanted to make sure that it wasn't dismissed too quickly. But I'm fine if you'd like to move on with the SET ROLE privilege instead, as long as we believe it grants a stable set of capabilities (and conversely, that if the SET ROLE privilege is revoked, that it revokes a stable set of capabilities). [2] https://www.postgresql.org/message-id/YzIAGCrxoXibAKOD%40tamriel.snowman.net -- Jeff Davis PostgreSQL Contributor Team - AWS
Re: Mingw task for Cirrus CI
Hi, On 2022-10-24 14:13:06 +0300, Melih Mutlu wrote: > If you say so, then I think it's ready. I pushed it with a few changes:
- I added an only_if, otherwise the task shows up (without running) even if you use ci-os-only: linux
- I added -Dnls=disabled - It seemed to add about 1.5min to the cached build
- Inlined the -l into BASH_EXE and renamed it to BASH
- The indentation of WINDOWS_ENVIRONMENT_BASE was irregular, leading to a larger diff. Some other similar cleanups.
- I added 'mingw' as a ci-os-only choice. Perhaps not the be-all-end-all. Perhaps we should use windows-{msvc,mingw,cygwin}? Or add ci-task-only?
I suspect we can make at least -Dssl=openssl work. Perhaps one of the uuid implementations is available in msys as well? Currently we end up with:
External libraries
  bonjour  : NO
  bsd_auth : NO
  gss      : NO
  icu      : YES 72.1
  ldap     : YES
  libxml   : YES 2.10.3
  libxslt  : YES 1.1.37
  llvm     : NO
  lz4      : YES 1.9.4
  nls      : NO
  pam      : NO
  plperl   : YES
  plpython : YES 3.10
  pltcl    : YES
  readline : YES
  selinux  : NO
  ssl      : NO
  systemd  : NO
  uuid     : NO
  zlib     : YES 1.2.13
  zstd     : YES 1.5.2
gss should also work if it is available in the msys repo. Bonjour is apple only, bsd_auth bsd specific, selinux and systemd linux specific and pam unixoid specific. It could also be interesting to see if llvm works, but that might be more work. Greetings, Andres Freund
Re: redundant check of msg in does_not_exist_skipping
> On Nov 18, 2022, at 4:04 AM, Robert Haas wrote: > > On Thu, Nov 17, 2022 at 10:56 AM Tom Lane wrote: >> This is a completely bad idea. If it takes that level of analysis >> to see that msg can't be null, we should leave the test in place. >> Any future modification of either this code or what it calls could >> break the conclusion. > > +1. Also, even if the check were quite obviously useless, it's cheap > insurance. It's difficult to believe that it hurts performance in any > measurable way. If anything, we would benefit from having more sanity > checks in our code, rather than fewer. > Thanks for the explanation! Got it.
Re: Don't treate IndexStmt like AlterTable when DefineIndex is called from ProcessUtilitySlow.
Hi, thanks a lot for your reply! > Why do you think this is an improvement? I hit the issue in Greenplum Database (Massively Parallel Postgres); in the MPP architecture the coordinator dispatches statements to segments. The dispatch logic is quite different for AlterTable and CreateTableLike: * alter table: each sub command is not dispatched on its own; later the alter table statement is dispatched as a whole. * for a create table like statement, like `create table t (like t1 including indexes);`, this statement's 2nd stmt has to be dispatched to segments, but since it is now treated as alter table, the dispatch logic is broken for this case in Greenplum. I looked into the issue, and Greenplum Database wants to stay aligned with upstream as much as possible. That is why I ask if we can force it to false. Best, Zhenghua Lyu Tom Lane wrote on Fri, Nov 18, 2022 at 06:26: > Zhenghua Lyu writes: > > Recently read the code, I find that when calling DefineIndex > > from ProcessUtilitySlow, is_alter_table will be set true if > > this statement is came from expandTableLikeClause. > > Yeah. > > > Based on the above, I think we can always pass a false value > > for is_alter_table when DefineIndex is called from > > ProcessUtilitySlow. > > Why do you think this is an improvement? Even if it's correct, > the code savings is so negligible that I'm not sure I want to > expend brain cells on figuring out whether it's correct. The > comment you want to remove does not suggest that it's optional > which value we should pass, so I think the burden of proof > is to show that this patch is okay not that somebody else > has to demonstrate that it isn't. > > regards, tom lane >
Re: Standardizing how pg_waldump presents recovery conflict XID cutoffs
On Thu, Nov 17, 2022 at 9:02 AM Peter Geoghegan wrote: > Plan is to commit this later on today, barring objections. Pushed, thanks. -- Peter Geoghegan
Re: Strange failure on mamba
On Fri, Nov 18, 2022 at 11:35 AM Thomas Munro wrote: > I wonder if it's a runtime variant of the other problem. We do > load_file("libpqwalreceiver", false) before unblocking signals but > maybe don't resolve the symbols until calling them, or something like > that... Hmm, no, I take that back. A key ingredient was that a symbol was being resolved inside the signal handler, which is a postmaster-only thing.
Re: Strange failure on mamba
Thomas Munro writes: > On Fri, Nov 18, 2022 at 11:08 AM Tom Lane wrote: >> mamba has been showing intermittent failures in various replication >> tests since day one. > I wonder if it's a runtime variant of the other problem. We do > load_file("libpqwalreceiver", false) before unblocking signals but > maybe don't resolve the symbols until calling them, or something like > that... Yeah, that or some other NetBSD bug could be the explanation, too. Without a stack trace it's hard to have any confidence about it, but I've been unable to reproduce the problem outside the buildfarm. (Which is a familiar refrain. I wonder what it is about the buildfarm environment that makes it act different from the exact same code running on the exact same machine.) So I'd like to have some way to make the postmaster send SIGABRT instead of SIGKILL in the buildfarm environment. The lowest-tech way would be to drive that off some #define or other. We could scale it up to a GUC perhaps. Adjacent to that, I also wonder whether SIGABRT wouldn't be more useful than SIGSTOP for the existing SendStop half-a-feature --- the idea that people should collect cores manually seems mighty last-century. regards, tom lane
Re: contrib: auth_delay module
成之焕 writes: > The attached patch is a contrib module to set login restrictions on users with > too many authentication failures. The administrator could manage several GUC > parameters to control the login restrictions which are listed below. > - set the wait time when password authentication fails. > - allow the wait time to grow when users from the same IP consecutively fail to log in. > - set the maximum number of authentication failures allowed for the same user. The > system will prevent a user who gets too many authentication failures from entering the > database. I'm not yet forming an opinion on whether this is useful enough to accept. However, I wonder why you chose to add this functionality to auth_delay instead of making a new, independent module. It seems fairly unrelated to what auth_delay does, and the newly-created requirement that the module be preloaded might possibly break some existing use-case for auth_delay. Also, a patch that lacks user documentation and has no code comments to speak of seems unlikely to draw serious review. regards, tom lane
Re: Strange failure on mamba
On Fri, Nov 18, 2022 at 11:08 AM Tom Lane wrote: > Thomas Munro writes: > > I wonder why the walreceiver didn't start in > > 008_min_recovery_point_node_3.log here: > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mamba=2022-11-16%2023%3A13%3A38 > > mamba has been showing intermittent failures in various replication > tests since day one. My guess is that it's slow enough to be > particularly subject to the signal-handler race conditions that we > know exist in walreceivers and elsewhere. (Now, it wasn't any faster > in its previous incarnation as a macOS critter. But maybe modern > NetBSD has different scheduler behavior than ancient macOS and that > contributes somehow. Or maybe there's some other NetBSD weirdness > in here.) > > I've tried to reproduce manually, without much success :-( > > Like many of its other failures, there's a suggestive postmaster > log entry at the very end: > > 2022-11-16 19:45:53.851 EST [2036:4] LOG: received immediate shutdown request > 2022-11-16 19:45:58.873 EST [2036:5] LOG: issuing SIGKILL to recalcitrant > children > 2022-11-16 19:45:58.881 EST [2036:6] LOG: database system is shut down > > So some postmaster child is stuck somewhere where it's not responding > to SIGQUIT. While it's not unreasonable to guess that that's a > walreceiver, there's no hard evidence of it here. I've been wondering > if it'd be worth patching the postmaster so that it's a bit more verbose > about which children it had to SIGKILL. I've also wondered about > changing the SIGKILL to SIGABRT in hopes of reaping a core file that > could be investigated. I wonder if it's a runtime variant of the other problem. We do load_file("libpqwalreceiver", false) before unblocking signals but maybe don't resolve the symbols until calling them, or something like that...
Re: Don't treate IndexStmt like AlterTable when DefineIndex is called from ProcessUtilitySlow.
Zhenghua Lyu writes: > Recently read the code, I find that when calling DefineIndex > from ProcessUtilitySlow, is_alter_table will be set true if > this statement is came from expandTableLikeClause. Yeah. > Based on the above, I think we can always pass a false value > for is_alter_table when DefineIndex is called from > ProcessUtilitySlow. Why do you think this is an improvement? Even if it's correct, the code savings is so negligible that I'm not sure I want to expend brain cells on figuring out whether it's correct. The comment you want to remove does not suggest that it's optional which value we should pass, so I think the burden of proof is to show that this patch is okay not that somebody else has to demonstrate that it isn't. regards, tom lane
Re: allow segment size to be set to < 1GiB
On 2022-11-17 Th 10:48, Tom Lane wrote: > Andres Freund writes: >> On 2022-11-17 09:58:48 -0500, Andrew Dunstan wrote: >>> Are we going to impose some sane minimum, or leave it up to developers >>> to discover that for themselves? >> I don't think we should. It's actually useful to e.g. use 1 page sized >> segments for testing, and with one exceptions the tests pass with it too. Do >> you see a reason to impose one? > Yeah, I think we should allow setting it to 1 block. This switch is > only for testing purposes (I hope the docs make that clear). > > Yeah clearly if 1 is useful there's no point in limiting it. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Re: Fix proposal for comparaison bugs in PostgreSQL::Version
On 2022-11-04 Fr 10:06, Jehan-Guillaume de Rorthais wrote: > On Thu, 3 Nov 2022 13:11:18 -0500 > Justin Pryzby wrote: > >> On Tue, Jun 28, 2022 at 06:17:40PM -0400, Andrew Dunstan wrote: >>> Nice catch, but this looks like massive overkill. I think we can very >>> simply fix the test in just a few lines of code, instead of a 190 line >>> fix and a 130 line TAP test. >>> >>> It was never intended to be able to compare markers like rc1 vs rc2, and >>> I don't see any need for it. If you can show me a sane use case I'll >>> have another look, but right now it seems quite unnecessary. >>> >>> Here's my proposed fix. >>> >>> diff --git a/src/test/perl/PostgreSQL/Version.pm >>> b/src/test/perl/PostgreSQL/Version.pm index 8f70491189..8d4dbbf694 100644 >>> --- a/src/test/perl/PostgreSQL/Version.pm >> Is this still an outstanding issue ? > The issue still exists on current HEAD: > > $ perl -Isrc/test/perl/ -MPostgreSQL::Version -le \ > 'print "bug" if PostgreSQL::Version->new("9.6") <= 9.0' > bug > > Regards, > Oops, this slipped off my radar. I'll apply a fix shortly, thanks for the reminder. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Re: Strange failure on mamba
Thomas Munro writes: > I wonder why the walreceiver didn't start in > 008_min_recovery_point_node_3.log here: > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mamba=2022-11-16%2023%3A13%3A38 mamba has been showing intermittent failures in various replication tests since day one. My guess is that it's slow enough to be particularly subject to the signal-handler race conditions that we know exist in walreceivers and elsewhere. (Now, it wasn't any faster in its previous incarnation as a macOS critter. But maybe modern NetBSD has different scheduler behavior than ancient macOS and that contributes somehow. Or maybe there's some other NetBSD weirdness in here.) I've tried to reproduce manually, without much success :-( Like many of its other failures, there's a suggestive postmaster log entry at the very end: 2022-11-16 19:45:53.851 EST [2036:4] LOG: received immediate shutdown request 2022-11-16 19:45:58.873 EST [2036:5] LOG: issuing SIGKILL to recalcitrant children 2022-11-16 19:45:58.881 EST [2036:6] LOG: database system is shut down So some postmaster child is stuck somewhere where it's not responding to SIGQUIT. While it's not unreasonable to guess that that's a walreceiver, there's no hard evidence of it here. I've been wondering if it'd be worth patching the postmaster so that it's a bit more verbose about which children it had to SIGKILL. I've also wondered about changing the SIGKILL to SIGABRT in hopes of reaping a core file that could be investigated. regards, tom lane
Patch: Global Unique Index
Patch: Global Unique Index
“Global unique index” in our definition is a unique index on a partitioned table that can ensure cross-partition uniqueness using a non-partition key. This work is inspired by this email thread, “Proposal: Global Index”, started back in 2019 (link below). My colleague David and I took a different approach to implement the feature that ensures a uniqueness constraint spanning multiple partitions. We achieved this mainly by using application logic, without heavy modification to current Postgres’s partitioned table/index structure. In other words, a global unique index and a regular partitioned index are essentially the same in terms of their storage structure, except that one can do a cross-partition uniqueness check and the other cannot. https://www.postgresql.org/message-id/CALtqXTcurqy1PKXzP9XO%3DofLLA5wBSo77BnUnYVEZpmcA3V0ag%40mail.gmail.com
- Patch -
The attached patches were generated based on commit `85d8b30724c0fd117a683cc72706f71b28463a05` on the master branch.
- Benefits of global unique index -
1. Ensure uniqueness spanning all partitions using a non-partition key column
2. Allow a user to create a unique index on a non-partition key column without the need to include the partition key (current Postgres enforces this)
3. Query performance increase when using a single unique non-partition key column
- Supported Features -
1. Global unique index is supported only on the btree index type
2. Global unique index is useful only when created on a partitioned table.
3. Cross-partition uniqueness check with CREATE UNIQUE INDEX in serial and parallel mode
4. Cross-partition uniqueness check with ATTACH in serial and parallel mode
5. Cross-partition uniqueness check when INSERT and UPDATE
- Not-supported Features -
1.
Global uniqueness check with sub-partitioned tables is not yet supported, as we do not have an immediate use case and it may involve major changes to the current implementation - Global unique index syntax - A global unique index can be created with "GLOBAL" and "UNIQUE" clauses in a "CREATE INDEX" statement run against a partitioned table. For example, CREATE UNIQUE INDEX global_index ON idxpart(bid) GLOBAL; - New Relkind: RELKIND_GLOBAL_INDEX - When a global unique index is created on a partitioned table, its relkind is RELKIND_PARTITIONED_INDEX (I). This is the same as creating a regular index. Then Postgres will recursively create an index on each child partition, except now the relkind will be set to RELKIND_GLOBAL_INDEX (g) instead of RELKIND_INDEX (i). This new relkind, along with the uniqueness flag, is needed for the cross-partition uniqueness check later. - Create a global unique index - To create a regular unique index on a partitioned table, Postgres has to perform a heap scan and sort on every child partition. The uniqueness check happens during the sorting phase and will raise an error if multiple tuples with the same index key are sorted together. To achieve a global uniqueness check, we make Postgres perform the sorting after all of the child partitions have been scanned, instead of in the "sort per partition" fashion. In other words, the sorting only happens once at the very end, and it sorts the tuples coming from all the partitions and therefore can ensure global uniqueness. In the parallel index build case, the idea is the same, except that the tuples will be put into a shared file set (temp files) on disk instead of in memory, to ensure other workers can share the sort results. At the end of the very last partition build, we make Postgres take over all the temp files and perform a final merge sort to ensure global uniqueness. 
Example: > CREATE TABLE gidx_part(a int, b int, c text) PARTITION BY RANGE (a); > CREATE TABLE gidx_part1 PARTITION OF gidx_part FOR VALUES FROM (0) to (10); > CREATE TABLE gidx_part2 PARTITION OF gidx_part FOR VALUES FROM (10) to (20); > INSERT INTO gidx_part values(5, 5, 'test'); > INSERT INTO gidx_part values(15, 5, 'test'); > CREATE UNIQUE INDEX global_unique_idx ON gidx_part(b) GLOBAL; ERROR: could not create unique index "gidx_part1_b_idx" DETAIL: Key (b)=(5) is duplicated. - INSERT and UPDATE - For every new tuple inserted or updated, Postgres attempts to fetch the same tuple from current partition to determine if a duplicate already exists. In the global unique index case, we make Postgres attempt to fetch the same tuple from other partitions as well as the current partition. If a duplicate is found, global uniqueness is violated and an error is raised. Example: > CREATE TABLE gidx_part (a int, b int, c text) PARTITION BY RANGE (a); > CREATE TABLE gidx_part1 partition of gidx_part FOR VALUES FROM (0) TO (10); > CREATE TABLE gidx_part2 partition of gidx_part FOR VALUES FROM (10) TO (20); > CREATE UNIQUE INDEX global_unique_idx ON gidx_part USING BTREE(b) GLOBAL; > INSERT INTO gidx_part values(5, 5, 'test'); > INSERT INTO gidx_part values(15, 5, 'test');
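The build-time check described above — scan all partitions, sort once at the very end — boils down to detecting duplicates during a k-way merge of the per-partition sorted key streams. A minimal sketch of that idea (illustrative names only, not the patch's actual code):

```c
#include <assert.h>
#include <stddef.h>

/*
 * Given the already-sorted key streams of each partition, one merge pass
 * can detect any key emitted twice -- exactly the cross-partition
 * uniqueness violation a global unique index must reject.
 */
static int
merged_has_duplicate(const int *keys[], const size_t nkeys[], int nparts)
{
    size_t  pos[16] = {0};
    int     have_prev = 0;
    int     prev = 0;

    assert(nparts <= 16);
    for (;;)
    {
        int     best = -1;

        /* pick the smallest head among all partitions (k-way merge) */
        for (int p = 0; p < nparts; p++)
            if (pos[p] < nkeys[p] &&
                (best < 0 || keys[p][pos[p]] < keys[best][pos[best]]))
                best = p;
        if (best < 0)
            return 0;           /* all streams exhausted, no duplicate */
        if (have_prev && keys[best][pos[best]] == prev)
            return 1;           /* same key emitted twice: not unique */
        prev = keys[best][pos[best]];
        have_prev = 1;
        pos[best]++;
    }
}
```

With keys {1,5,9} in one partition and {2,5,10} in another, the merge emits 1, 2, 5, 5 and stops — matching the duplicate-key error in the example above.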
Strange failure on mamba
Hi, I wonder why the walreceiver didn't start in 008_min_recovery_point_node_3.log here: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mamba&dt=2022-11-16%2023%3A13%3A38 There was the case of commit 8acd8f86, but that involved a deadlocked postmaster whereas this one still handled a shutdown request.
Re: allowing for control over SET ROLE
On Tue, Nov 15, 2022 at 7:23 PM Jeff Davis wrote: > On Tue, 2022-11-15 at 12:07 -0500, Robert Haas wrote: > > If anyone else wants to comment, or if either of those people want to > > comment further, please speak up soon. > > Did you have some thoughts on: > > https://postgr.es/m/a41d606d03b629c2ef0ed274ae3b04a2c266.ca...@j-davis.com I mean, I think what we were discussing there could be done, but it's not the approach I like best. That's partly because that was just a back-of-the-envelope sketch of an idea, not a real proposal for something with a clear implementation path. But I think the bigger reason is that, in my opinion, this proposal is more generally useful, because it takes no position on why you wish to disallow SET ROLE. You can just disallow it in some cases and allow it in others, and that's fine. That proposal targets a specific use case, which may make it a better solution to that particular problem, but it makes it unworkable as a solution to any other problem, I believe. -- Robert Haas EDB: http://www.enterprisedb.com
Outdated url to Hunspell
I happened to notice that the link to Hunspell in our documentation goes to the Hunspell SourceForge page, last updated in 2015. The project has since moved to GitHub [0], with hunspell.sourceforge.net redirecting there. I'm not sure exactly when, but Wikipedia updated their link entry in 2016 [1], so it seems about time we do too. Unless objected to, I'll apply the attached to master, with the doc/ hunk to all supported branches. -- Daniel Gustafsson https://vmware.com/ [0] https://hunspell.github.io/ [1] https://en.wikipedia.org/w/index.php?title=Hunspell&diff=704306462&oldid=697230645 hunspell_url.diff
Re: ubsan fails on 32bit builds
On Fri, Nov 18, 2022 at 9:13 AM Andres Freund wrote: > Agreed. I had started on a set of patches for some of the SHM_QUEUE uses, but > somehow we ended up a bit stuck on the naming of dlist_delete variant that > afterwards zeroes next/prev so we can replace SHMQueueIsDetached() uses. > > Should probably find and rebase those patches... https://www.postgresql.org/message-id/flat/20200211042229.msv23badgqljrdg2%40alap3.anarazel.de
Re: logical decoding and replication of sequences, take 2
On 11/17/22 18:07, Andres Freund wrote: > Hi, > > On 2022-11-17 12:39:49 +0100, Tomas Vondra wrote: >> On 11/17/22 03:43, Andres Freund wrote: >>> On 2022-11-17 02:41:14 +0100, Tomas Vondra wrote: Well, yeah - we can either try to perform the stuff independently of the transactions that triggered it, or we can try making it part of some of the transactions. Each of those options has problems, though :-( The first version of the patch tried the first approach, i.e. decode the increments and apply that independently. But: (a) What would you do with increments of sequences created/reset in a transaction? Can't apply those outside the transaction, because it might be rolled back (and that state is not visible on primary). >>> >>> I think a reasonable approach could be to actually perform different WAL >>> logging for that case. It'll require a bit of machinery, but could actually >>> result in *less* WAL logging overall, because we don't need to emit a WAL >>> record for each SEQ_LOG_VALS sequence values. >>> >> >> Could you elaborate? Hard to comment without knowing more ... >> >> My point was that stuff like this (creating a new sequence or at least a >> new relfilenode) means we can't apply that independently of the >> transaction (unlike regular increments). I'm not sure how a change to >> WAL logging would make that go away. > > Different WAL logging would make it easy to handle that on the logical > decoding level. We don't need to emit WAL records each time a > created-in-this-toplevel-xact sequences gets incremented as they're not > persisting anyway if the surrounding xact aborts. We already need to remember > the filenode so it can be dropped at the end of the transaction, so we could > emit a single record for each sequence at that point. > > (b) What about increments created before we have a proper snapshot? There may be transactions dependent on the increment. This is what ultimately led to revert of the patch. 
>>> >>> I don't understand this - why would we ever need to process those increments >>> from before we have a snapshot? Wouldn't they, by definition, be before the >>> slot was active? >>> >>> To me this is the rough equivalent of logical decoding not giving the >>> initial >>> state of all tables. You need some process outside of logical decoding to >>> get >>> that (obviously we have some support for that via the exported data snapshot >>> during slot creation). >>> >> >> Which is what already happens during tablesync, no? We more or less copy >> sequences as if they were tables. > > I think you might have to copy sequences after tables, but I'm not sure. But > otherwise, yea. > > >>> I assume that part of the initial sync would have to be a new sequence >>> synchronization step that reads all the sequence states on the publisher and >>> ensures that the subscriber sequences are at the same point. There's a bit >>> of >>> trickiness there, but it seems entirely doable. The logical replication >>> replay >>> support for sequences will have to be a bit careful about not decreasing the >>> subscriber's sequence values - the standby initially will be ahead of the >>> increments we'll see in the WAL. But that seems inevitable given the >>> non-transactional nature of sequences. >>> >> >> See fetch_sequence_data / copy_sequence in the patch. The bit about >> ensuring the sequence does not go away (say, using page LSN and/or LSN >> of the increment) is not there, however isn't that pretty much what I >> proposed doing for "reconciling" the sequence state logged at COMMIT? > > Well, I think the approach of logging all sequence increments at commit is the > wrong idea... > But we're not logging all sequence increments, no? We're logging the state for each sequence touched by the transaction, but only once - if the transaction incremented the sequence 100x times, we'll still log it just once (at least for this particular purpose). 
Yes, if transactions touch each sequence just once, then we're logging individual increments. The only more efficient solution would be to decode the existing WAL (every ~32 increments), and perhaps also track which sequences were accessed by a transaction. And then simply stash the increments in a global reorderbuffer hash table, and apply only the last one at commit time. This would require the transactional / non-transactional behavior (I think), but perhaps we can make that work. Or are you thinking about some other scheme? > Creating a new relfilenode whenever a sequence is incremented seems like a > complete no-go to me. That increases sequence overhead by several orders of > magnitude and will lead to *awful* catalog bloat on the subscriber. > You mean on the apply side? Yes, I agree this needs a better approach, I've focused on the decoding side so far. > >>> This version of the patch tries to do the opposite thing -
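The "stash the increments, apply only the last one at commit" scheme floated above could be modeled like this — a flat array stands in for the per-transaction reorderbuffer hash table, and all names are hypothetical:

```c
#include <assert.h>
#include <stddef.h>

#define MAX_SEQS 8

/* toy per-transaction table: last logged state for each touched sequence */
typedef struct SeqState
{
    unsigned    seq_oid;
    long        last_value;
    int         used;
} SeqState;

static void
stash_increment(SeqState *tab, unsigned seq_oid, long value)
{
    for (int i = 0; i < MAX_SEQS; i++)
    {
        if (tab[i].used && tab[i].seq_oid == seq_oid)
        {
            tab[i].last_value = value;  /* overwrite: only the last matters */
            return;
        }
        if (!tab[i].used)
        {
            tab[i].used = 1;
            tab[i].seq_oid = seq_oid;
            tab[i].last_value = value;
            return;
        }
    }
}

/* at commit, exactly one entry per touched sequence is replayed */
static long
value_at_commit(const SeqState *tab, unsigned seq_oid)
{
    for (int i = 0; i < MAX_SEQS; i++)
        if (tab[i].used && tab[i].seq_oid == seq_oid)
            return tab[i].last_value;
    return -1;                  /* sequence not touched by this xact */
}
```

However many times a transaction bumps a sequence, the table retains a single entry per sequence — the property that makes commit-time replay cheap.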
Re: Reducing power consumption on idle servers
On Thu, Nov 17, 2022 at 2:55 AM Simon Riggs wrote: > No, it will have a direct effect only on people using promote_trigger_file > who do not read and act upon the deprecation notice before upgrading > by making a one line change to their failover scripts. TBH, I wonder if we shouldn't just nuke promote_trigger_file. pg_promote was added in 2018, and pg_ctl promote was added in 2011. So even if we haven't said promote_trigger_file was deprecated, it hasn't been the state-of-the-art way of doing this in a really long time. If we think that there are still a lot of people using it, or if popular tools are relying on it, then perhaps a deprecation period is warranted anyway. But I think we should at least consider the possibility that it's OK to just get rid of it right away. -- Robert Haas EDB: http://www.enterprisedb.com
Re: SUBTRANS: Minimizing calls to SubTransSetParent()
On 11/17/22 18:29, Simon Riggs wrote: > On Thu, 17 Nov 2022 at 17:04, Simon Riggs > wrote: >> >> New version with greatly improved comments coming very soon. > >>> Perhaps it would be a good idea to split up the patch. The business >>> about making pg_subtrans flat rather than a tree seems like a good >>> idea in any event, although as I said it doesn't seem like we've got >>> a fleshed-out version of that here. We could push forward on getting >>> that done and then separately consider the rest of it. >> >> Yes, I thought you might ask that so, after some thought, have found a >> clean way to do that and have split this into two parts. > > Attached. > > 002 includes many comment revisions, as well as flattening the loops > in SubTransGetTopmostTransaction and TransactionIdDidCommit/Abort > 003 includes the idea to not-always do SubTransSetParent() > I'm a bit confused by the TransactionIdsAreOnSameXactPage naming. Isn't this really checking clog pages? regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
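For reference, whether two XIDs land on the same clog page comes down to integer division by the page capacity; a sketch using the constants from src/backend/access/transam/clog.c, assuming the default 8 kB block size:

```c
#include <assert.h>
#include <stdint.h>

/*
 * A clog page stores 2 status bits per transaction, i.e. 4 xacts per
 * byte, so BLCKSZ * 4 transactions fit on one page (32768 at the
 * default 8 kB BLCKSZ).
 */
#define BLCKSZ 8192
#define CLOG_XACTS_PER_BYTE 4
#define CLOG_XACTS_PER_PAGE (BLCKSZ * CLOG_XACTS_PER_BYTE)

static int
xids_on_same_clog_page(uint32_t xid1, uint32_t xid2)
{
    return xid1 / CLOG_XACTS_PER_PAGE == xid2 / CLOG_XACTS_PER_PAGE;
}
```

Which is why a name mentioning "xact page" is ambiguous: the arithmetic is specifically about clog (pg_xact) pages, not subtrans pages, whose per-entry size differs.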
Re: ubsan fails on 32bit builds
Andres Freund writes: > On 2022-11-17 14:20:47 -0500, Robert Haas wrote: >> Not that I object to a targeted fix > Should we backpatch this fix? Likely this doesn't cause active breakage > outside of 32bit builds under ubsan, but that's not an unreasonable thing to > want to do in the backbranches. +1 for backpatching what you showed. >> but it's been 10 years since >> slist and dlist were committed, and we really ought to eliminate >> SHM_QUEUE entirely in favor of using those. > Agreed. I had started on a set of patches for some of the SHM_QUEUE uses, but > somehow we ended up a bit stuck on the naming of dlist_delete variant that > afterwards zeroes next/prev so we can replace SHMQueueIsDetached() uses. Also +1, but of course for HEAD only. regards, tom lane
Re: ubsan fails on 32bit builds
Hi, On 2022-11-17 14:20:47 -0500, Robert Haas wrote: > On Wed, Nov 16, 2022 at 8:42 PM Andres Freund wrote: > > Afaict the problem is that > > proc = (PGPROC *) &(waitQueue->links); > > > > is a gross gross hack - this isn't actually a PGPROC, it's pointing to an > > SHM_QUEUE, but *not* one embedded in PGPROC. It kinda works because ->links > > is at offset 0 in PGPROC, which means that > > SHMQueueInsertBefore(&(proc->links), &(MyProc->links)); > > will turn >links back into waitQueue->links. Which we then can enqueue > > again. > > Not that I object to a targeted fix Should we backpatch this fix? Likely this doesn't cause active breakage outside of 32bit builds under ubsan, but that's not an unreasonable thing to want to do in the backbranches. > but it's been 10 years since > slist and dlist were committed, and we really ought to eliminate > SHM_QUEUE entirely in favor of using those. It's basically an > open-coded implementation of something for which we now have a > toolkit. Not that it's impossible to make this kind of mistake with a > toolkit, but in general open-coding the same logic in multiple places > increases the risk of bugs. Agreed. I had started on a set of patches for some of the SHM_QUEUE uses, but somehow we ended up a bit stuck on the naming of dlist_delete variant that afterwards zeroes next/prev so we can replace SHMQueueIsDetached() uses. Should probably find and rebase those patches... Greetings, Andres Freund
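The toolkit in question is ilist.h's intrusive lists, where the list head is a distinct type and the containing struct is recovered via offsetof — so the "cast the queue head to PGPROC" trick above is never needed, and the embedded node need not sit at offset 0. A simplified sketch (not the actual ilist.h API):

```c
#include <assert.h>
#include <stddef.h>

/* intrusive doubly-linked list node, embedded in the containing struct */
typedef struct dnode
{
    struct dnode *prev;
    struct dnode *next;
} dnode;

/* the head is its own type: it is never mistaken for an element */
typedef struct dhead
{
    dnode       head;           /* circular: head.next is the first element */
} dhead;

static void
dlist_init(dhead *h)
{
    h->head.prev = h->head.next = &h->head;
}

static void
dlist_push_tail(dhead *h, dnode *n)
{
    n->prev = h->head.prev;
    n->next = &h->head;
    h->head.prev->next = n;
    h->head.prev = n;
}

/* recover the containing struct from the embedded node */
#define dlist_container(type, field, ptr) \
    ((type *) ((char *) (ptr) - offsetof(type, field)))

/* stand-in for PGPROC; note the links member is NOT first */
typedef struct Proc
{
    int         pid;
    dnode       links;
} Proc;
```

Because dlist_container subtracts the real member offset, the code stays correct even when `links` moves within the struct — the property the open-coded SHM_QUEUE hack silently depends on.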
Re: Allow single table VACUUM in transaction block
On Wed, Nov 16, 2022 at 5:14 PM Greg Stark wrote: > However I'm not a fan of commands that sometimes do one thing and > sometimes magically do something very different. I don't like the idea > that the same vacuum command would sometimes run in-process and > sometimes do this out of process request. And the rules for when it > does which are fairly complex to explain -- it runs in process unless > you're in a transaction when it runs out of process unless it's a > temporary table ... 100% agree. -- Robert Haas EDB: http://www.enterprisedb.com
Re: redundant check of msg in does_not_exist_skipping
On Thu, Nov 17, 2022 at 10:56 AM Tom Lane wrote: > This is a completely bad idea. If it takes that level of analysis > to see that msg can't be null, we should leave the test in place. > Any future modification of either this code or what it calls could > break the conclusion. +1. Also, even if the check were quite obviously useless, it's cheap insurance. It's difficult to believe that it hurts performance in any measurable way. If anything, we would benefit from having more sanity checks in our code, rather than fewer. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Allow single table VACUUM in transaction block
On Wed, Nov 16, 2022 at 05:14:07PM -0500, Greg Stark wrote: > I think this requesting autovacuum worker should be a distinct > command. Or at least an explicit option to vacuum. +1. I was going to suggest VACUUM (NOWAIT) .. -- Justin
Re: [PoC] configurable out of disk space elog level
On Wed, Nov 16, 2022 at 7:59 AM Maxim Orlov wrote: > The customer, mentioned above, in this particular case, would be glad to be > able to have a mechanism to stop the cluster. > Again, in this concrete case. > > My proposal is to add a tablespace option in order to be able to configure > which behaviour is appropriate for a > particular user. I've decided to call this option “on_no_space” for now. If > anyone has a better naming for this feature, > please, report. I don't think this is a good feature to add to PostgreSQL. First, it's unclear why stopping the cluster is a desirable behavior. It doesn't stop the user transactions from failing; it just makes them fail in some other way. Now it is of course perfectly legitimate for a particular user to want that particular behavior anyway, but there are a bunch of other things that a user could equally legitimately want to do, like page the DBA or trigger a failover or kill off sessions that are using large temporary relations or whatever. And, equally, there are many other operating system errors to which a user could want the database system to respond in similar ways. For example, someone might want any given one of those treatments when an I/O error occurs writing to the data directory, or a read-only filesystem error, or a permission denied error. Having a switch for one particular kind of error (out of many that could possibly occur) that triggers one particular coping strategy (out of many that could possibly be used) seems far too specific a thing to add as a core feature. And even if we had something much more general, I'm not sure why that should go into the database rather than being implemented outside it. After all, nothing at all prevents the user from scanning the database logs for "out of space" errors and shutting down the database if any are found. 
While you're at it, you could make your monitoring script also check the free space on the relevant partition using statfs() and page somebody if the utilization goes above 95% or whatever threshold you like, which would probably avoid service outages much more effectively than $SUBJECT. I just can't see much real benefit in putting this logic inside the database. -- Robert Haas EDB: http://www.enterprisedb.com
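The external-monitoring alternative suggested above needs nothing more than POSIX statvfs(); a minimal sketch of the utilization check (the 95% figure is just the example threshold from the text):

```c
#include <assert.h>
#include <sys/statvfs.h>

/*
 * Percentage of the filesystem containing `path` that is already used,
 * as a script watching the data directory's partition might compute it.
 * Returns -1 on error.
 */
static int
percent_used(const char *path)
{
    struct statvfs sv;

    if (statvfs(path, &sv) != 0 || sv.f_blocks == 0)
        return -1;
    /* f_bavail: blocks available to unprivileged users */
    return (int) (100 - (100 * sv.f_bavail) / sv.f_blocks);
}
```

A cron job could call this on the tablespace's mount point and page somebody past the chosen threshold — no database-side switch required.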
Re: ubsan fails on 32bit builds
On Wed, Nov 16, 2022 at 8:42 PM Andres Freund wrote: > Afaict the problem is that > proc = (PGPROC *) &(waitQueue->links); > > is a gross gross hack - this isn't actually a PGPROC, it's pointing to an > SHM_QUEUE, but *not* one embedded in PGPROC. It kinda works because ->links > is at offset 0 in PGPROC, which means that > SHMQueueInsertBefore(&(proc->links), &(MyProc->links)); > will turn >links back into waitQueue->links. Which we then can enqueue > again. Not that I object to a targeted fix, but it's been 10 years since slist and dlist were committed, and we really ought to eliminate SHM_QUEUE entirely in favor of using those. It's basically an open-coded implementation of something for which we now have a toolkit. Not that it's impossible to make this kind of mistake with a toolkit, but in general open-coding the same logic in multiple places increases the risk of bugs. -- Robert Haas EDB: http://www.enterprisedb.com
Re: when the startup process doesn't (logging startup delays)
On Thu, Nov 17, 2022 at 2:22 AM Bharath Rupireddy wrote: > Duplication is a problem that I agree with and I have an idea here - > how about introducing a new function, say EnableStandbyMode() that > sets StandbyMode to true and disables the startup progress timeout, > something like the attached? That works for me, more or less. But I think that enable_startup_progress_timeout should be amended to either say if (log_startup_progress_interval == 0 || StandbyMode) return; or else it should at least Assert(!StandbyMode), so that we can't accidentally re-enable the timer after we shut it off. -- Robert Haas EDB: http://www.enterprisedb.com
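A toy model of the guard being proposed — hypothetical names mirroring EnableStandbyMode() and the timeout-arming function, not the actual PostgreSQL code:

```c
#include <assert.h>

static int  log_startup_progress_interval = 10; /* seconds; 0 disables */
static int  StandbyMode = 0;
static int  timeout_armed = 0;

/* arming is a no-op once we're in standby mode, so a stray call is safe */
static void
enable_startup_progress_timeout(void)
{
    if (log_startup_progress_interval == 0 || StandbyMode)
        return;
    timeout_armed = 1;
}

/* single place that both sets StandbyMode and disables the timeout */
static void
EnableStandbyMode(void)
{
    StandbyMode = 1;
    timeout_armed = 0;
}
```

The point of folding the StandbyMode test into the enable function (rather than only asserting) is visible here: an accidental re-arm after entering standby mode simply does nothing.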
Re: TAP output format in pg_regress
Hi, On 2022-11-17 11:36:13 +0100, Daniel Gustafsson wrote: > > On 10 Nov 2022, at 11:44, Nikolay Shaplov wrote: > > Did not found quick way to include prove TAP harness right into Makefile, > > so I > > check dumped output, but it is not really important for now, I guess. > > I think we'll start by adding TAP to the meson testrunner and await to see > where the make buildsystem ends up before doing anything there. +1 prove IME is of, uh, very questionable quality IME. I do not want to run pg_regress/isolation tests through that unnecessarily. It'd not gain us much anyway, afaict. And we don't want it as a hard dependency either. > I'm not sure if that's the right way to go about configuring regress tests as > TAP emitting, but it at least shows that meson is properly parsing the output > AFAICT. Looks correct to me. > @@ -742,13 +823,13 @@ initialize_environment(void) > } > > if (pghost && pgport) > - printf(_("(using postmaster on %s, port %s)\n"), > pghost, pgport); > + status(_("# (using postmaster on %s, port %s)\n"), > pghost, pgport); > if (pghost && !pgport) > - printf(_("(using postmaster on %s, default port)\n"), > pghost); > + status(_("# (using postmaster on %s, default port)\n"), > pghost); > if (!pghost && pgport) > - printf(_("(using postmaster on Unix socket, port > %s)\n"), pgport); > + status(_("# (using postmaster on Unix socket, port > %s)\n"), pgport); > if (!pghost && !pgport) > - printf(_("(using postmaster on Unix socket, default > port)\n")); > + status(_("# (using postmaster on Unix socket, default > port)\n")); > } It doesn't seem quite right to include the '# ' bit in the translated string. If a translator were to not include it we'd suddenly have incorrect tap output. That's perhaps not likely, but ... It's also not really necessary to have it in as many translated strings? Not that I understand why we even make these strings translatable. It seems like it'd be a bad idea to translate this kind of output. 
But either way, it seems nicer to output the # inside a helper function? > + /* > + * In order for this information (or any error information) to be shown > + * when running pg_regress test suites under prove it needs to be > emitted > + * stderr instead of stdout. > + */ > if (file_size(difffilename) > 0) > { > - printf(_("The differences that caused some tests to fail can be > viewed in the\n" > - "file \"%s\". A copy of the test summary that > you see\n" > - "above is saved in the file \"%s\".\n\n"), > + status(_("\n# The differences that caused some tests to fail > can be viewed in the\n" > + "# file \"%s\". A copy of the test summary > that you see\n" > + "# above is saved in the file \"%s\".\n\n"), > difffilename, logfilename); The comment about needing to print to stderr is correct - but the patch doesn't do so (anymore)? The count of failed tests also should go to stderr (but not the report of all tests having passed). bail() probably also ought to print the error to stderr, so the most important detail is immediately visible when looking at the failed test result. Greetings, Andres Freund
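One shape the suggested helper could take — prefixing each line of an already-translated message with the TAP comment marker, so "# " never appears inside translatable strings (illustrative only; pg_regress's real status() differs):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy `msg` into `out`, inserting "# " at the start of every line.
 * TAP consumers treat such lines as diagnostics, so translators never
 * need to know about the marker.
 */
static void
tap_note(char *out, size_t outsz, const char *msg)
{
    size_t  j = 0;
    int     at_bol = 1;

    for (size_t i = 0; msg[i] != '\0' && j + 3 < outsz; i++)
    {
        if (at_bol)
        {
            out[j++] = '#';
            out[j++] = ' ';
            at_bol = 0;
        }
        out[j++] = msg[i];
        if (msg[i] == '\n')
            at_bol = 1;
    }
    out[j] = '\0';
}
```

With this, the `_("(using postmaster on %s, port %s)\n")` strings above could stay exactly as they are today, and a mistranslation could no longer corrupt the TAP stream.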
Re: HOT chain validation in verify_heapam()
Hi, On 2022-11-17 21:33:17 +0530, Himanshu Upadhyaya wrote: > On Tue, Nov 15, 2022 at 3:32 AM Andres Freund wrote: > > > Furthermore, it is > > > possible that successor[x] = successor[x'] since the page might be > > corrupted > > > and we haven't checked otherwise. > > > > > > predecessor[y] = x means that successor[x] = y but in addition we've > > > checked that y is sane, and that x.xmax=y.xmin. If there are multiple > > > tuples for which these conditions hold, we've issued complaints about > > > all but one and entered the last into the predecessor array. > > > > As shown by the isolationtester test I just posted, this doesn't quite work > > right now. Probably fixable. > > > > I don't think we can follow non-HOT ctid chains if they're older than the > > xmin > > horizon, including all cases of xmin being frozen. There's just nothing > > guaranteeing that the tuples are actually "related". > > > I understand the problem with frozen tuples but don't understand the > concern with non-HOT chains, > could you please help with some explanation around it? I think there might be cases where following non-HOT ctid-chains across tuples within a page will trigger spurious errors, if the tuple versions are older than the xmin horizon. But it's a bit hard to say without seeing the code with a bunch of the other bugs fixed. Greetings, Andres Freund
Re: CREATE UNLOGGED TABLE seq faults when debug_discard_caches=1
I wrote: > I wonder whether we ought to back-patch f10f0ae42. We could > leave the RelationOpenSmgr macro in existence to avoid unnecessary > breakage of extension code, but stop using it within our own code. Concretely, about like this for v14 (didn't look at the older branches yet). I'm not sure whether to recommend that outside extensions switch to using RelationGetSmgr in pre-v15 branches. If they do, they run a risk of compile failure should they be built against old back-branch headers. Once compiled, though, they'd work against any minor release (since RelationGetSmgr is static inline, not something in the core backend). So maybe that'd be good enough, and keeping their code in sync with what they need for v15 would be worth something. regards, tom lane diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c index 7f8231a681..f37ca1313d 100644 --- a/contrib/amcheck/verify_nbtree.c +++ b/contrib/amcheck/verify_nbtree.c @@ -322,8 +322,7 @@ bt_index_check_internal(Oid indrelid, bool parentcheck, bool heapallindexed, bool heapkeyspace, allequalimage; - RelationOpenSmgr(indrel); - if (!smgrexists(indrel->rd_smgr, MAIN_FORKNUM)) + if (!smgrexists(RelationGetSmgr(indrel), MAIN_FORKNUM)) ereport(ERROR, (errcode(ERRCODE_INDEX_CORRUPTED), errmsg("index \"%s\" lacks a main relation fork", diff --git a/contrib/bloom/blinsert.c b/contrib/bloom/blinsert.c index c34a640d1c..23661d1105 100644 --- a/contrib/bloom/blinsert.c +++ b/contrib/bloom/blinsert.c @@ -177,9 +177,9 @@ blbuildempty(Relation index) * this even when wal_level=minimal. 
*/ PageSetChecksumInplace(metapage, BLOOM_METAPAGE_BLKNO); - smgrwrite(index->rd_smgr, INIT_FORKNUM, BLOOM_METAPAGE_BLKNO, + smgrwrite(RelationGetSmgr(index), INIT_FORKNUM, BLOOM_METAPAGE_BLKNO, (char *) metapage, true); - log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM, + log_newpage(&(RelationGetSmgr(index))->smgr_rnode.node, INIT_FORKNUM, BLOOM_METAPAGE_BLKNO, metapage, true); /* @@ -187,7 +187,7 @@ blbuildempty(Relation index) * write did not go through shared_buffers and therefore a concurrent * checkpoint may have moved the redo pointer past our xlog record. */ - smgrimmedsync(index->rd_smgr, INIT_FORKNUM); + smgrimmedsync(RelationGetSmgr(index), INIT_FORKNUM); } /* diff --git a/contrib/pg_prewarm/autoprewarm.c b/contrib/pg_prewarm/autoprewarm.c index b3f73ea92d..0289ea657c 100644 --- a/contrib/pg_prewarm/autoprewarm.c +++ b/contrib/pg_prewarm/autoprewarm.c @@ -514,15 +514,13 @@ autoprewarm_database_main(Datum main_arg) old_blk->filenode != blk->filenode || old_blk->forknum != blk->forknum) { - RelationOpenSmgr(rel); - /* * smgrexists is not safe for illegal forknum, hence check whether * the passed forknum is valid before using it in smgrexists. */ if (blk->forknum > InvalidForkNumber && blk->forknum <= MAX_FORKNUM && -smgrexists(rel->rd_smgr, blk->forknum)) +smgrexists(RelationGetSmgr(rel), blk->forknum)) nblocks = RelationGetNumberOfBlocksInFork(rel, blk->forknum); else nblocks = 0; diff --git a/contrib/pg_prewarm/pg_prewarm.c b/contrib/pg_prewarm/pg_prewarm.c index 48d0132a0d..0043823974 100644 --- a/contrib/pg_prewarm/pg_prewarm.c +++ b/contrib/pg_prewarm/pg_prewarm.c @@ -109,8 +109,7 @@ pg_prewarm(PG_FUNCTION_ARGS) aclcheck_error(aclresult, get_relkind_objtype(rel->rd_rel->relkind), get_rel_name(relOid)); /* Check that the fork exists. 
*/ - RelationOpenSmgr(rel); - if (!smgrexists(rel->rd_smgr, forkNumber)) + if (!smgrexists(RelationGetSmgr(rel), forkNumber)) ereport(ERROR, (errcode(ERRCODE_INVALID_PARAMETER_VALUE), errmsg("fork \"%s\" does not exist for this relation", @@ -178,7 +177,7 @@ pg_prewarm(PG_FUNCTION_ARGS) for (block = first_block; block <= last_block; ++block) { CHECK_FOR_INTERRUPTS(); - smgrread(rel->rd_smgr, forkNumber, block, blockbuffer.data); + smgrread(RelationGetSmgr(rel), forkNumber, block, blockbuffer.data); ++blocks_done; } } diff --git a/contrib/pg_visibility/pg_visibility.c b/contrib/pg_visibility/pg_visibility.c index dd0c124e62..7ccefdb566 100644 --- a/contrib/pg_visibility/pg_visibility.c +++ b/contrib/pg_visibility/pg_visibility.c @@ -391,14 +391,14 @@ pg_truncate_visibility_map(PG_FUNCTION_ARGS) /* Only some relkinds have a visibility map */ check_relation_relkind(rel); - RelationOpenSmgr(rel); - rel->rd_smgr->smgr_cached_nblocks[VISIBILITYMAP_FORKNUM] = InvalidBlockNumber; + /* Forcibly reset cached file size */ + RelationGetSmgr(rel)->smgr_cached_nblocks[VISIBILITYMAP_FORKNUM] = InvalidBlockNumber; block = visibilitymap_prepare_truncate(rel, 0); if (BlockNumberIsValid(block)) { fork = VISIBILITYMAP_FORKNUM; - smgrtruncate(rel->rd_smgr, &fork, 1, &block); + smgrtruncate(RelationGetSmgr(rel), &fork, 1, &block); } if (RelationNeedsWAL(rel)) diff --git a/src/backend/access/gist/gistbuild.c
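The pattern behind RelationGetSmgr() — a static inline accessor that revalidates the cached handle on every access, so callers never see a stale or NULL pointer — can be sketched with simplified stand-in types (not the real Relation/SMgrRelation definitions):

```c
#include <assert.h>
#include <stddef.h>

typedef struct SMgrRelation
{
    int         opened;
} SMgrRelation;

typedef struct Relation
{
    SMgrRelation *rd_smgr;      /* cache; reset to NULL on invalidation */
} Relation;

static SMgrRelation smgr_storage;

/* stand-in for smgropen(): (re)establishes the handle */
static SMgrRelation *
smgropen_stub(void)
{
    smgr_storage.opened = 1;
    return &smgr_storage;
}

/*
 * Always safe to call, in contrast to the RelationOpenSmgr-then-deref
 * pattern being replaced in the diff above.
 */
static inline SMgrRelation *
RelationGetSmgrSketch(Relation *rel)
{
    if (rel->rd_smgr == NULL)
        rel->rd_smgr = smgropen_stub();
    return rel->rd_smgr;
}
```

Because the accessor is static inline, backporting it changes no backend ABI — which is why the suggestion above of extensions adopting it against pre-v15 headers can work at all.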
Re: Assertion failure in SnapBuildInitialSnapshot()
Hi, On 2022-11-17 10:44:18 +0530, Amit Kapila wrote: > On Wed, Nov 16, 2022 at 11:56 PM Andres Freund wrote: > > On 2022-11-16 14:22:01 +0530, Amit Kapila wrote: > > > On Wed, Nov 16, 2022 at 7:30 AM Andres Freund wrote: > > > > On 2022-11-15 16:20:00 +0530, Amit Kapila wrote: > > > > > On Tue, Nov 15, 2022 at 8:08 AM Andres Freund > > > > > wrote: > > I don't think that'd catch a catalog snapshot. But perhaps the better answer > > for the catalog snapshot is to just invalidate it explicitly. The user > > doesn't > > have control over the catalog snapshot being taken, and it's not too hard to > > imagine the walsender code triggering one somewhere. > > > > So maybe we should add something like: > > > > InvalidateCatalogSnapshot(); /* about to overwrite MyProc->xmin */ > > > > The comment "/* about to overwrite MyProc->xmin */" is unclear to me. > We already have a check (/* so we don't overwrite the existing value > */ > if (TransactionIdIsValid(MyProc->xmin))) in SnapBuildInitialSnapshot() > which ensures that we don't overwrite MyProc->xmin, so the above > comment seems contradictory to me. The point is that catalog snapshots could easily end up setting MyProc->xmin, even though the caller hasn't done anything wrong. So the InvalidateCatalogSnapshot() would avoid erroring out in a number of scenarios. Greetings, Andres Freund
Re: Report checkpoint progress with pg_stat_progress_checkpoint (was: Report checkpoint progress in server logs)
On Thu, Nov 17, 2022 at 11:24 AM Andres Freund wrote: > As an experiment, I added a progress report to BufferSync()'s first loop > (i.e. where it checks all buffers). On a 128GB shared_buffers cluster that > increases the time for a do-nothing checkpoint from ~235ms to ~280ms. If I > remove the changecount stuff and use a single write + write barrier, it ends > up as 250ms. Inlining brings it down a bit further, to 247ms. OK, I'd say that's pretty good evidence that we can't totally disregard the issue. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Report checkpoint progress with pg_stat_progress_checkpoint (was: Report checkpoint progress in server logs)
Andres Freund writes: > I think pgstat_progress_start_command() needs the changecount stuff, as does > pgstat_progress_update_multi_param(). But for anything updating a single > parameter at a time it really doesn't do anything useful on a platform that > doesn't tear 64bit writes (so it could be #ifdef > PG_HAVE_8BYTE_SINGLE_COPY_ATOMICITY). Seems safe to restrict it to that case. regards, tom lane
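The changecount protocol under discussion is seqlock-style: the writer makes the counter odd while the fields are in flux, and a reader only trusts a copy taken while the counter was even and unchanged. A single-threaded illustration (real code additionally needs the memory barriers noted in comments):

```c
#include <assert.h>
#include <stdint.h>

typedef struct Progress
{
    uint32_t    changecount;
    int64_t     param[2];
} Progress;

/* multi-param update: counter odd while params are inconsistent */
static void
progress_update_multi(Progress *p, int64_t a, int64_t b)
{
    p->changecount++;           /* now odd: readers must retry */
    /* write barrier would go here */
    p->param[0] = a;
    p->param[1] = b;
    /* write barrier would go here */
    p->changecount++;           /* even again: consistent snapshot */
}

/* returns 1 if `out` holds a consistent copy, 0 if caller should retry */
static int
progress_read(const Progress *p, int64_t out[2])
{
    uint32_t    before = p->changecount;

    if (before & 1)
        return 0;               /* writer active */
    out[0] = p->param[0];
    out[1] = p->param[1];
    /* read barrier would go here */
    return p->changecount == before;
}
```

This also shows why a single-parameter update on a platform with non-tearing 64-bit stores can skip the counter dance entirely: with one field there is no multi-field snapshot to keep consistent.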
Re: logical decoding and replication of sequences, take 2
Hi, On 2022-11-17 12:39:49 +0100, Tomas Vondra wrote: > On 11/17/22 03:43, Andres Freund wrote: > > On 2022-11-17 02:41:14 +0100, Tomas Vondra wrote: > >> Well, yeah - we can either try to perform the stuff independently of the > >> transactions that triggered it, or we can try making it part of some of > >> the transactions. Each of those options has problems, though :-( > >> > >> The first version of the patch tried the first approach, i.e. decode the > >> increments and apply that independently. But: > >> > >> (a) What would you do with increments of sequences created/reset in a > >> transaction? Can't apply those outside the transaction, because it > >> might be rolled back (and that state is not visible on primary). > > > > I think a reasonable approach could be to actually perform different WAL > > logging for that case. It'll require a bit of machinery, but could actually > > result in *less* WAL logging overall, because we don't need to emit a WAL > > record for each SEQ_LOG_VALS sequence values. > > > > Could you elaborate? Hard to comment without knowing more ... > > My point was that stuff like this (creating a new sequence or at least a > new relfilenode) means we can't apply that independently of the > transaction (unlike regular increments). I'm not sure how a change to > WAL logging would make that go away. Different WAL logging would make it easy to handle that on the logical decoding level. We don't need to emit WAL records each time a created-in-this-toplevel-xact sequences gets incremented as they're not persisting anyway if the surrounding xact aborts. We already need to remember the filenode so it can be dropped at the end of the transaction, so we could emit a single record for each sequence at that point. > >> (b) What about increments created before we have a proper snapshot? > >> There may be transactions dependent on the increment. This is what > >> ultimately led to revert of the patch. 
> > > > I don't understand this - why would we ever need to process those increments > from before we have a snapshot? Wouldn't they, by definition, be before the > slot was active? > > > > To me this is the rough equivalent of logical decoding not giving the > > initial > > state of all tables. You need some process outside of logical decoding to > > get > > that (obviously we have some support for that via the exported data snapshot > > during slot creation). > > > > Which is what already happens during tablesync, no? We more or less copy > sequences as if they were tables. I think you might have to copy sequences after tables, but I'm not sure. But otherwise, yea. > > I assume that part of the initial sync would have to be a new sequence > > synchronization step that reads all the sequence states on the publisher and > > ensures that the subscriber sequences are at the same point. There's a bit > > of > > trickiness there, but it seems entirely doable. The logical replication > > replay > > support for sequences will have to be a bit careful about not decreasing the > > subscriber's sequence values - the standby initially will be ahead of the > > increments we'll see in the WAL. But that seems inevitable given the > > non-transactional nature of sequences. > > > > See fetch_sequence_data / copy_sequence in the patch. The bit about > ensuring the sequence does not go backwards (say, using page LSN and/or LSN > of the increment) is not there, however isn't that pretty much what I > proposed doing for "reconciling" the sequence state logged at COMMIT? Well, I think the approach of logging all sequence increments at commit is the wrong idea... Creating a new relfilenode whenever a sequence is incremented seems like a complete no-go to me. That increases sequence overhead by several orders of magnitude and will lead to *awful* catalog bloat on the subscriber. 
> > > >> This version of the patch tries to do the opposite thing - make sure > >> that the state after each commit matches what the transaction might have > >> seen (for sequences it accessed). It's imperfect, because it might log a > >> state generated "after" the sequence got accessed - it focuses on the > >> guarantee not to generate duplicate values. > > > > That approach seems quite wrong to me. > > > > Why? Because it might log a state for sequence as of COMMIT, when the > transaction accessed the sequence much earlier? Mainly because sequences aren't transactional and trying to make them so will require awful contortions. While there are cases where we don't flush the WAL / wait for syncrep for sequences, we do replicate their state correctly on physical replication. If an LSN has been acknowledged as having been replicated, we won't just lose a prior sequence increment after promotion, even if the transaction didn't [yet] commit. It's completely valid for an application to call nextval() in one transaction, potentially even abort it, and then only use that sequence value in another
Re: SUBTRANS: Minimizing calls to SubTransSetParent()
On Wed, 16 Nov 2022 at 15:44, Tom Lane wrote: > > Simon Riggs writes: > > On Wed, 16 Nov 2022 at 00:09, Tom Lane wrote: > >> No, that's not what XidInMVCCSnapshot does. If snapshot->suboverflowed > >> is set (ie, somebody somewhere/somewhen overflowed), then it does > >> SubTransGetTopmostTransaction and searches only the xips with the result. > >> This behavior requires that all live subxids be correctly mapped by > >> SubTransGetTopmostTransaction, or we'll draw false conclusions. > > > Your comments are correct wrt to the existing coding, but not to the > > patch, which is coded as described and does not suffer those issues. > > Ah, OK. > > Still ... I really do not like this patch. It introduces a number of > extremely fragile assumptions, and I think those would come back to > bite us someday, even if they hold now which I'm still unsure about. Completely understand. It took me months to think this through. > It doesn't help that you've chosen to document them only at the place > making them and not at the place(s) likely to break them. Yes, apologies for that, I focused on the holistic explanation in the slides. > Also, to be blunt, this is not Ready For Committer. It's more WIP, > because even if the code is okay there are comments all over the system > that you've invalidated. (At the very least, the file header comments > in subtrans.c and the comments in struct SnapshotData need work; I've > not looked hard but I'm sure there are more places with comments > bearing on these data structures.) New version with greatly improved comments coming very soon. > Perhaps it would be a good idea to split up the patch. The business > about making pg_subtrans flat rather than a tree seems like a good > idea in any event, although as I said it doesn't seem like we've got > a fleshed-out version of that here. We could push forward on getting > that done and then separately consider the rest of it. 
Yes, I thought you might ask that, so after some thought I have found a clean way to do that and have split this into two parts. Julien has agreed to do further perf tests and is working on that now. I will post new versions soon, earliest tomorrow. -- Simon Riggs http://www.EnterpriseDB.com/
Re: [BUG] pg_dump blocked
On 17/11/2022 at 17:59, Tom Lane wrote: Gilles Darold writes: I have an incorrect behavior with pg_dump prior PG version 15. With PostgreSQL 15, thanks to commit e3fcbbd623b9ccc16cdbda374654d91a4727d173 the problem is gone but for older versions it persists with locks on partitioned tables. I didn't want to back-patch e3fcbbd62 at the time, but it's probably aged long enough now to be safe to back-patch. If we do anything here, it should be to back-patch the whole thing, else we've only partially fixed the issue. I can handle this work. -- Gilles Darold
Re: Standardizing how pg_waldump presents recovery conflict XID cutoffs
On Wed, Nov 16, 2022 at 4:34 PM Peter Geoghegan wrote: > WFM. Attached is v3. Plan is to commit this later on today, barring objections. Thanks -- Peter Geoghegan v3-0001-Standardize-rmgrdesc-recovery-conflict-XID-output.patch Description: Binary data
Re: [BUG] pg_dump blocked
Gilles Darold writes: > I have an incorrect behavior with pg_dump prior PG version 15. With > PostgreSQL 15, thanks to commit e3fcbbd623b9ccc16cdbda374654d91a4727d173 > the problem is gone but for older versions it persists with locks on > partitioned tables. I didn't want to back-patch e3fcbbd62 at the time, but it's probably aged long enough now to be safe to back-patch. If we do anything here, it should be to back-patch the whole thing, else we've only partially fixed the issue. regards, tom lane
[BUG] pg_dump blocked
Hi, I have an incorrect behavior with pg_dump prior PG version 15. With PostgreSQL 15, thanks to commit e3fcbbd623b9ccc16cdbda374654d91a4727d173 the problem is gone but for older versions it persists with locks on partitioned tables. When we try to dump a database where a table is locked, pg_dump waits until the lock is released; this is expected. Now if the table is excluded from the dump using the -T option, obviously pg_dump is not concerned by the lock. Unfortunately this is not the case when the table is partitioned, because of the calls to pg_get_partkeydef() and pg_get_expr() in the query generated in getTables(). Here is the use case to reproduce. In a psql session execute: BEGIN; LOCK TABLE measurement; then run a pg_dump command excluding the measurement partitions: pg_dump -d test -T "measurement*" > /dev/null it will not end until the lock on the partition is released. I think the problem is the same if we use a schema exclusion where the partitioned table is locked. Is it possible to consider a backport fix? If yes, is adding the table/schema filters to the query generated in getTables() enough, or are you thinking of a kind of backport of commit e3fcbbd623b9ccc16cdbda374654d91a4727d173? Best regards, -- Gilles Darold
Re: Moving forward with TDE
Hi Dilip, Thanks for the feedback here. I will review the docs changes and add to my tree. Best, David
Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL
On Wed, Nov 16, 2022 at 3:30 AM Bharath Rupireddy wrote: > 1. > -if (config.filter_by_fpw && !XLogRecordHasFPW(xlogreader_state)) > +if (config.filter_by_fpw && !XLogRecordHasFPI(xlogreader_state)) > These changes are not related to this feature, hence renaming those > variables/function names must be dealt with separately. If required, > proposing another patch can be submitted to change filter_by_fpw to > filter_by_fpi and XLogRecordHasFPW() to XLogRecordHasFPI(). Not required; can revert the changes unrelated to this specific patch. (I'd written the original ones for it, so didn't really think anything of it... :-)) > 2. > +/* We fsync our output directory only; since these files are not part > + * of the production database we do not require the performance hit > + * that fsyncing every FPI would entail, so are doing this as a > + * compromise. */ > The commenting style doesn't match the standard that we follow > elsewhere in postgres, please refer to other multi-line comments. Will fix. > 3. > +fsync_fname(config.save_fpi_path, true); > +} > It looks like fsync_fname()/fsync() in general isn't recursive, in the > sense that it doesn't fsync the files under the directory, but the > directory only. So, the idea of directory fsync doesn't seem worth it. > We either 1) get rid of fsync entirely or 2) fsync all the files after > they are created and the directory at the end or 3) do option (2) with > --no-sync option similar to its friends. Since option (2) is a no go, > we can either choose option (1) or option (3). My vote at this point > is for option (1). Agree to remove. > 4. > +($walfile_name, $blocksize) = split '\|' => > $node->safe_psql('postgres',"SELECT pg_walfile_name(pg_switch_wal()), > current_setting('block_size')"); > +my $walfile = $node->basedir . '/pgdata/pg_wal/' . $walfile_name; > I think there's something wrong with this, no? 
pg_switch_wal() can, at > times, return end+1 of the prior segment (see below snippet) and I'm > not sure if such a case can happen here. > > * The return value is either the end+1 address of the switch record, > * or the end+1 address of the prior segment if we did not need to > * write a switch record because we are already at segment start. > */ > XLogRecPtr > RequestXLogSwitch(bool mark_unimportant) I think this approach is pretty common to get the walfile name, no? While there might be an edge case here, since the rest of the test is a controlled environment I'm inclined to just not worry about it; this would require the changes prior to this to exactly fill a WAL segment which strikes me as extremely unlikely to the point of impossible in this specific scenario. > 5. > +my $walfile = $node->basedir . '/pgdata/pg_wal/' . $walfile_name; > +ok(-f $walfile, "Got a WAL file"); > Is this checking if the WAL file is present or not in PGDATA/pg_wal? > If yes, I think this isn't required as pg_switch_wal() ensures that > the WAL is written and flushed to disk. You are correct, probably another artifact of the earlier version. That said, not sure I see the harm in keeping it as a sanity-check. > 6. > +my $walfile = $node->basedir . '/pgdata/pg_wal/' . $walfile_name; > Isn't "pgdata" hardcoded here? I think you might need to do the following: > $node->data_dir . '/pg_wal/' . $walfile_name;; Can fix. > 7. > +# save filename for later verification > +$files{$file}++; > > +# validate that we ended up with some FPIs saved > +ok(keys %files > 0, 'verify we processed some files'); > Why do we need to store filenames in an array when we later just check > the size of the array? Can't we use a boolean (file_found) or an int > variable (file_count) to verify that we found the file. Another artifact; we were comparing the files output between two separate lists of arbitrary numbers of pages being written out and verifying the raw/fixup versions had the same lists. > 8. 
> +$node->safe_psql('postgres', <<EOF > +SELECT 'init' FROM > pg_create_physical_replication_slot('regress_pg_waldump_slot', true, > false); > +CREATE TABLE test_table AS SELECT generate_series(1,100) a; > +CHECKPOINT; -- required to force FPI for next writes > +UPDATE test_table SET a = a + 1; > +EOF > The EOF with append_conf() is being used in 4 files and elsewhere in > the TAP test files (more than 100?) qq[] or quotes is being used. I > have no strong opinion here, I'll leave it to the other reviewers or > committer. I'm inclined to leave it just for (personal) readability, but can change if there's a strong consensus against. > > > 11. > > > +# verify filename formats matches w/--save-fpi > > > +for my $fullpath (glob "$tmp_folder/raw/*") > > > Do we need to look for the exact match of the file that gets created > > > in the save-fpi path? While checking for this is great, it makes the > > > test code non-portable (may not work on Windows or other platforms, > > > no?) and complex? This way, you can get rid of
Re: allow segment size to be set to < 1GiB
On 2022-11-17 10:48:52 -0500, Tom Lane wrote: > Yeah, I think we should allow setting it to 1 block. This switch is > only for testing purposes (I hope the docs make that clear). "This option is only for developers, to test segment related code."
Re: Report checkpoint progress with pg_stat_progress_checkpoint (was: Report checkpoint progress in server logs)
Hi, On 2022-11-17 09:03:32 -0500, Robert Haas wrote: > On Wed, Nov 16, 2022 at 2:52 PM Andres Freund wrote: > > I don't think we'd want every buffer write or whatnot go through the > > changecount mechanism, on some non-x86 platforms that could be noticeable. > > But > > if we didn't stage the stats updates locally I think we could make most of > > the > > stats changes without that overhead. For updates that just increment a > > single > > counter there's simply no benefit in the changecount mechanism afaict. > > You might be right, but I'm not sure whether it's worth stressing > about. The progress reporting mechanism uses the st_changecount > mechanism, too, and as far as I know nobody's complained about that > having too much overhead. Maybe they have, though, and I've just > missed it. I've seen it in profiles, although not as the major contributor. Most places do a reasonable amount of work between calls though. As an experiment, I added a progress report to BufferSync()'s first loop (i.e. where it checks all buffers). On a 128GB shared_buffers cluster that increases the time for a do-nothing checkpoint from ~235ms to ~280ms. If I remove the changecount stuff and use a single write + write barrier, it ends up as 250ms. Inlining brings it down a bit further, to 247ms. Obviously this is a very extreme case - we only do very little work between the progress report calls. But it does seem to show that the overhead is not entirely negligible. I think pgstat_progress_start_command() needs the changecount stuff, as does pgstat_progress_update_multi_param(). But for anything updating a single parameter at a time it really doesn't do anything useful on a platform that doesn't tear 64bit writes (so it could be #ifdef PG_HAVE_8BYTE_SINGLE_COPY_ATOMICITY). Out of further curiosity I wanted to test the impact when the loop doesn't even do a LockBufHdr() and added an unlocked pre-check. 109ms without progress. 138ms with. 
114ms with the simplified pgstat_progress_update_param(). 108ms after inlining the simplified pgstat_progress_update_param(). Greetings, Andres Freund
Re: CREATE UNLOGGED TABLE seq faults when debug_discard_caches=1
Spyridon Dimitrios Agathos writes: > while testing the developer settings of PSQL (14.5) I came across this > issue: > postgres=# CREATE UNLOGGED TABLE stats ( > postgres(# pg_hash BIGINT NOT NULL, > postgres(# category TEXT NOT NULL, > postgres(# PRIMARY KEY (pg_hash, category) > postgres(# ); > server closed the connection unexpectedly Hmm ... confirmed in the v14 branch, but v15 and HEAD are fine, evidently as a result of commit f10f0ae42 having replaced this unprotected use of index->rd_smgr. I wonder whether we ought to back-patch f10f0ae42. We could leave the RelationOpenSmgr macro in existence to avoid unnecessary breakage of extension code, but stop using it within our own code. regards, tom lane
Re: logical decoding and replication of sequences, take 2
On Wed, Nov 16, 2022 at 8:41 PM Tomas Vondra wrote: > There's a couple of caveats, though: > > 1) Maybe we should focus more on "actually observed" state instead of > "observable". Who cares if the sequence moved forward in a transaction > that was ultimately rolled back? No committed transaction should have > observer those values - in a way, the last "valid" state of the sequence > is the last value generated in a transaction that ultimately committed. When I say "observable" I mean from a separate transaction, not one that is making changes to things. I said "observable" rather than "actually observed" because we neither know nor care whether someone actually ran a SELECT statement at any given moment in time, just what they would have seen if they did. > 2) I think what matters more is that we never generate duplicate value. > That is, if you generate a value from a sequence, commit a transaction > and replicate it, then the logical standby should not generate the same > value from the sequence. This guarantee seems necessary for "failover" > to logical standby. I think that matters, but I don't think it's sufficient. We need to preserve the order in which things appear to happen, and which changes are and are not atomic, not just the final result. > Well, yeah - we can either try to perform the stuff independently of the > transactions that triggered it, or we can try making it part of some of > the transactions. Each of those options has problems, though :-( > > The first version of the patch tried the first approach, i.e. decode the > increments and apply that independently. But: > > (a) What would you do with increments of sequences created/reset in a > transaction? Can't apply those outside the transaction, because it > might be rolled back (and that state is not visible on primary). If the state isn't going to be visible until the transaction commits, it has to be replicated as part of the transaction. 
If I create a sequence and then nextval it a bunch of times, I can't replicate that by first creating the sequence, and then later, as a separate operation, replicating the nextvals. If I do that, then there's an intermediate state visible on the replica that was never visible on the origin server. That's broken. > (b) What about increments created before we have a proper snapshot? > There may be transactions dependent on the increment. This is what > ultimately led to revert of the patch. Whatever problem exists here is with the implementation, not the concept. If you copy the initial state as it exists at some moment in time to a replica, and then replicate all the changes that happen afterward to that replica without messing up the order, the replica WILL be in sync with the origin server. The things that happen before you copy the initial state do not and cannot matter. But what you're describing sounds like the changes aren't really replicated in visibility order, and then it is easy to see how a problem like this can happen. Because now, an operation that actually became visible just before or just after the initial copy was taken might be thought to belong on the other side of that boundary, and then everything will break. And it sounds like that is what you are describing. > This version of the patch tries to do the opposite thing - make sure > that the state after each commit matches what the transaction might have > seen (for sequences it accessed). It's imperfect, because it might log a > state generated "after" the sequence got accessed - it focuses on the > guarantee not to generate duplicate values. Like Andres, I just can't imagine this being correct. It feels like it's trying to paper over the failure to do the replication properly during the transaction by overwriting state at the end. 
> Yes, this would mean we accept we may end up with something like this: > > 1: T1 logs sequence state S1 > 2: someone increments sequence > 3: T2 logs sequence stats S2 > 4: T2 commits > 5: T1 commits > > which "inverts" the apply order of S1 vs. S2, because we first apply S2 > and then the "old" S1. But as long as we're smart enough to "discard" > applying S1, I think that's acceptable - because it guarantees we'll not > generate duplicate values (with values in the committed transaction). > > I'd also argue it does not actually generate invalid state, because once > we commit either transaction, S2 is what's visible. I agree that it's OK if the sequence increment gets merged into the commit that immediately follows. However, I disagree with the idea of discarding the second update on the grounds that it would make the sequence go backward and we know that can't be right. That algorithm works in the really specific case where the only operations are increments. As soon as anyone does anything else to the sequence, such an algorithm can no longer work. Nor can it work for objects that are not sequences. The alternative strategy of replicating each change
Re: HOT chain validation in verify_heapam()
On Tue, Nov 15, 2022 at 3:32 AM Andres Freund wrote: > > > Furthermore, it is > > possible that successor[x] = successor[x'] since the page might be > corrupted > > and we haven't checked otherwise. > > > > predecessor[y] = x means that successor[x] = y but in addition we've > > checked that y is sane, and that x.xmax=y.xmin. If there are multiple > > tuples for which these conditions hold, we've issued complaints about > > all but one and entered the last into the predecessor array. > > As shown by the isolationtester test I just posted, this doesn't quite work > right now. Probably fixable. > > I don't think we can follow non-HOT ctid chains if they're older than the > xmin > horizon, including all cases of xmin being frozen. There's just nothing > guaranteeing that the tuples are actually "related". > > I understand the problem with frozen tuples but don't understand the concern with non-HOT chains, could you please help with some explanation around it? -- Regards, Himanshu Upadhyaya EnterpriseDB: http://www.enterprisedb.com
Re: Moving forward with TDE
Hi Jacob, Thanks, I've added this patch in my tree [1]. (For now, just adding fixes and the like atop the original separate patches, but will eventually get things winnowed down into probably the same 12 parts the originals were reviewed in.) Best, David [1] https://github.com/pgguru/postgres/tree/tde
Re: HOT chain validation in verify_heapam()
On Wed, Nov 16, 2022 at 12:41 PM Himanshu Upadhyaya < upadhyaya.himan...@gmail.com> wrote: > > >> > + } >> > + >> > + /* Loop over offsets and validate the data in the >> predecessor array. */ >> > + for (OffsetNumber currentoffnum = FirstOffsetNumber; >> currentoffnum <= maxoff; >> > + currentoffnum = OffsetNumberNext(currentoffnum)) >> > + { >> > + HeapTupleHeader pred_htup; >> > + HeapTupleHeader curr_htup; >> > + TransactionId pred_xmin; >> > + TransactionId curr_xmin; >> > + ItemId pred_lp; >> > + ItemId curr_lp; >> > + >> > + ctx.offnum = predecessor[currentoffnum]; >> > + ctx.attnum = -1; >> > + >> > + if (ctx.offnum == 0) >> > + { >> > + /* >> > + * Either the root of the chain or an >> xmin-aborted tuple from >> > + * an abandoned portion of the HOT chain. >> > + */ >> >> Hm - couldn't we check that the tuple could conceivably be at the root of >> a >> chain? I.e. isn't HEAP_HOT_UPDATED? Or alternatively has an aborted xmin? >> >> > I don't see a way to check if tuple is at the root of HOT chain because > predecessor array will always be having either xmin from non-abandoned > transaction or it will be zero. We can't differentiate root or tuple > inserted via abandoned transaction. > > I was wrong here. I think this can be done and will be doing these changes in my next patch. -- Regards, Himanshu Upadhyaya EnterpriseDB: http://www.enterprisedb.com
Re: redundant check of msg in does_not_exist_skipping
Japin Li writes: > On Thu, 17 Nov 2022 at 23:06, Japin Li wrote: >> I think we cannot remove the check, for example, if objtype is >> OBJECT_OPFAMILY, >> and schema_does_not_exist_skipping() returns true, the so the msg keeps NULL, >> if we remove this check, a sigfault might be occurd in ereport(). > Sorry, I didn't look into schema_does_not_exist_skipping(), and after look > into schema_does_not_exist_skipping function and others, all paths that go > out switch branch has non-NULL for msg, so we can remove this check safely. This is a completely bad idea. If it takes that level of analysis to see that msg can't be null, we should leave the test in place. Any future modification of either this code or what it calls could break the conclusion. regards, tom lane
Re: allow segment size to be set to < 1GiB
Andres Freund writes: > On 2022-11-17 09:58:48 -0500, Andrew Dunstan wrote: >> Are we going to impose some sane minimum, or leave it up to developers >> to discover that for themselves? > I don't think we should. It's actually useful to e.g. use 1 page sized > segments for testing, and with one exceptions the tests pass with it too. Do > you see a reason to impose one? Yeah, I think we should allow setting it to 1 block. This switch is only for testing purposes (I hope the docs make that clear). regards, tom lane
Re: [PoC] configurable out of disk space elog level
On 2022-11-17 14:40:02 +0300, Maxim Orlov wrote: > On Wed, 16 Nov 2022 at 20:41, Andres Freund wrote: > > that aside, we can't do catalog accesses in all kinds of environments that > > this currently is active in - most importantly it's affecting the startup > > process. We don't do catalog accesses in the startup process, and even if > > we > > were to do so, we couldn't unconditionally because the catalog might not > > even > > be consistent at this point (nor is it guaranteed that the wal_level even > > allows to access catalogs during recovery). > > > Yep, that is why I do use in get_tablespace_elevel: > + /* > +* Use GUC level only in normal mode. > +*/ > + if (!IsNormalProcessingMode()) > + return ERROR; > > Anyway, I appreciate the opinion, thank you! The startup process is in normal processing mode.
Re: allow segment size to be set to < 1GiB
Hi, On 2022-11-17 09:58:48 -0500, Andrew Dunstan wrote: > On 2022-11-09 We 15:25, Andres Freund wrote: > > Hi, > > > > On 2022-11-09 14:44:42 -0500, Tom Lane wrote: > >> Andres Freund writes: > >>> A second question: Both autoconf and meson print the segment size as GB > >>> right > >>> now. Obviously that'll print out a size of 0 for a segsize < 1GB. > >>> The easiest way to would be to just display the number of blocks, but > >>> that's > >>> not particularly nice. > >> Well, it would be fine if you'd written --with-segsize-blocks, wouldn't > >> it? Can we make the printout format depend on which switch was used? > > Not sure why I didn't think of that... > > > > Updated patch attached. > > > > I made one autoconf and one meson CI task use a small block size, but just > > to > > ensure it work on both. I'd probably leave it set on one, so we keep the > > coverage for cfbot? > > > > Are we going to impose some sane minimum, or leave it up to developers > to discover that for themselves? I don't think we should. It's actually useful to e.g. use 1 page sized segments for testing, and with one exceptions the tests pass with it too. Do you see a reason to impose one? Greetings, Andres Freund
Re: Introduce "log_connection_stages" setting.
On Tue, Nov 08, 2022 at 07:30:38PM +0100, Sergey Dudoladov wrote: > +Logs reception of a connection. At this point a connection > has been received, but no further work has been done: receipt > +Logs the original identity that an authentication method > employs to identify a user. In most cases, the identity string equals the > PostgreSQL username, s/equals/matches > +/* check_hook: validate new log_connection_messages value */ > +bool > +check_log_connection_messages(char **newval, void **extra, GucSource source) > +{ > + char*rawname; > + List*namelist; > + ListCell*l; > + char*log_connection_messages = *newval; > + bool*myextra; > + > + /* > + * Set up the "extra" struct actually used by > assign_log_connection_messages. > + */ > + myextra = (bool *) guc_malloc(LOG, 4 * sizeof(bool)); This function hardcodes each of the 4 connections: > + if (pg_strcasecmp(stage, "received") == 0) > + myextra[0] = true; It'd be better to use #defines or enums for these. > --- a/src/backend/tcop/postgres.c > +++ b/src/backend/tcop/postgres.c > @@ -84,8 +84,11 @@ const char *debug_query_string; /* client-supplied query > string */ > /* Note: whereToSendOutput is initialized for the bootstrap/standalone case > */ > CommandDest whereToSendOutput = DestDebug; > > -/* flag for logging end of session */ > -bool Log_disconnections = false; > +/* flags for logging information about session state */ > +bool Log_disconnected = false; > +bool Log_authenticated = false; > +bool Log_authorized = false; > +bool Log_received = false; I think this ought to be an integer with flag bits, rather than 4 booleans (I don't know, but there might be more later?). Then, the implementation follows the user-facing GUC and also follows log_destination. -- Justin
Re: redundant check of msg in does_not_exist_skipping
On Thu, 17 Nov 2022 at 23:06, Japin Li wrote: > On Thu, 17 Nov 2022 at 20:12, Ted Yu wrote: >> Hi, >> I was looking at commit aca992040951c7665f1701cd25d48808eda7a809 >> >> I think the check of msg after the switch statement is not necessary. The >> variable msg is used afterward. >> If there is (potential) missing case in switch statement, the compiler >> would warn. >> >> How about removing the check ? >> > > I think we cannot remove the check, for example, if objtype is > OBJECT_OPFAMILY, > and schema_does_not_exist_skipping() returns true, so msg stays NULL, > if we remove this check, a segfault might occur in ereport(). > > case OBJECT_OPFAMILY: > { > List *opfname = list_copy_tail(castNode(List, object), > 1); > > if (!schema_does_not_exist_skipping(opfname, , )) > { > msg = gettext_noop("operator family \"%s\" does not exist > for access method \"%s\", skipping"); > name = NameListToString(opfname); > args = strVal(linitial(castNode(List, object))); > } > } > break; Sorry, I didn't look into schema_does_not_exist_skipping(), and after looking into the schema_does_not_exist_skipping() function and the others, all paths that leave the switch branch have a non-NULL msg, so we can remove this check safely. -- Regards, Japin Li. ChengDu WenWu Information Technology Co.,Ltd.
Re: libpq compression (part 2)
On 17.11.22 01:27, Andrey Borodin wrote:

> Also I've found one more TODO item for the patch. Currently in fe-connect.c the patch does a buffer reset:
>
>     conn->inStart = conn->inCursor = conn->inEnd = 0;
>
> This effectively consumes bytes up to the current cursor. However, packet inspection continues. The patch works because in most cases the following code will issue a re-read of the message. Coincidentally.
>
>     /* Get the type of request. */
>     if (pqGetInt((int *) , 4, conn))
>     {
>         /* We'll come back when there are more data */
>         return PGRES_POLLING_READING;
>     }

Note that the above code was just changed in dce92e59b1. I don't know how that affects this patch set.
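The concern about the reset can be illustrated with a toy version of libpq's input-buffer bookkeeping (MiniConn is a simplified stand-in, not libpq's actual PGconn): consuming only the parsed prefix keeps bytes that arrived after the cursor, whereas zeroing inEnd as well discards them.

```c
#include <assert.h>
#include <string.h>

/* Toy stand-in for libpq's input buffer: bytes in [inCursor, inEnd) are
 * received but not yet parsed. The real PGconn differs; this only
 * illustrates the bookkeeping. */
typedef struct
{
	char	buf[64];
	int		inStart;
	int		inCursor;
	int		inEnd;
} MiniConn;

/* Consume the parsed prefix without losing unparsed bytes, unlike
 * "inStart = inCursor = inEnd = 0", which throws the tail away too. */
static void
consume_parsed(MiniConn *c)
{
	int		remaining = c->inEnd - c->inCursor;

	memmove(c->buf, c->buf + c->inCursor, remaining);
	c->inStart = c->inCursor = 0;
	c->inEnd = remaining;
}
```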
Re: redundant check of msg in does_not_exist_skipping
On Thu, 17 Nov 2022 at 20:12, Ted Yu wrote:
> Hi,
> I was looking at commit aca992040951c7665f1701cd25d48808eda7a809
>
> I think the check of msg after the switch statement is not necessary. The
> variable msg is used afterward.
> If there is a (potential) missing case in the switch statement, the compiler
> would warn.
>
> How about removing the check ?

I think we cannot remove the check. For example, if objtype is OBJECT_OPFAMILY and schema_does_not_exist_skipping() returns true, then msg stays NULL; if we remove this check, a segfault might occur in ereport().

    case OBJECT_OPFAMILY:
        {
            List *opfname = list_copy_tail(castNode(List, object), 1);

            if (!schema_does_not_exist_skipping(opfname, , ))
            {
                msg = gettext_noop("operator family \"%s\" does not exist for access method \"%s\", skipping");
                name = NameListToString(opfname);
                args = strVal(linitial(castNode(List, object)));
            }
        }
        break;

-- 
Regards,
Japin Li.
ChengDu WenWu Information Technology Co., Ltd.
contrib: auth_delay module
The attached patch is a contrib module that sets login restrictions on users with too many authentication failures. The administrator can manage several GUC parameters to control the login restrictions, which are listed below.

- Set the wait time imposed when password authentication fails.
- Allow the wait time to grow when users from the same IP fail to log in consecutively.
- Set the maximum number of authentication failures allowed for the same user. The system will prevent a user who accumulates too many authentication failures from entering the database.

I hope this will be useful to future development. Thanks.

- zhch...@ceresdata.com

pgsql-v12.8-auth-delay.1.patch
Description: Binary data
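To make the proposal concrete, a configuration along the lines described might look as follows; every parameter name below is invented for illustration, and the actual GUC names are defined by the attached patch:

```
# Hypothetical settings for the proposed auth_delay contrib module.
# All names below are illustrative, not taken from the patch.
shared_preload_libraries = 'auth_delay'

auth_delay.milliseconds = 500         # wait after a failed password authentication
auth_delay.exponential_backoff = on   # grow the wait for repeated failures from one IP
auth_delay.max_failures = 10          # block the user after this many failures
```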
Re: libpq support for NegotiateProtocolVersion
On 16.11.22 19:35, Jacob Champion wrote:
> On Tue, Nov 15, 2022 at 2:19 AM Peter Eisentraut wrote:
>> I think for the current code, the following would be an appropriate adjustment:
>>
>> diff --git a/src/interfaces/libpq/fe-connect.c b/src/interfaces/libpq/fe-connect.c
>> index 746e9b4f1efc..d15fb96572d9 100644
>> --- a/src/interfaces/libpq/fe-connect.c
>> +++ b/src/interfaces/libpq/fe-connect.c
>> @@ -3412,8 +3412,7 @@ PQconnectPoll(PGconn *conn)
>>      /* Get the type of request. */
>>      if (pqGetInt((int *) , 4, conn))
>>      {
>> -        /* We'll come back when there are more data */
>> -        return PGRES_POLLING_READING;
>> +        goto error_return;
>>      }
>>      msgLength -= 4;
>>
>> And then the handling of the 'v' message in my patch would also be adjusted like that.
>
> Yes -- though that particular example may be dead code, since we should have already checked that there are at least four more bytes in the buffer.

I have committed this change and the adjusted original patch. Thanks.
Re: allow segment size to be set to < 1GiB
On 2022-11-09 We 15:25, Andres Freund wrote:
> Hi,
>
> On 2022-11-09 14:44:42 -0500, Tom Lane wrote:
>> Andres Freund writes:
>>> A second question: Both autoconf and meson print the segment size as GB right
>>> now. Obviously that'll print out a size of 0 for a segsize < 1GB.
>>> The easiest way would be to just display the number of blocks, but that's
>>> not particularly nice.
>> Well, it would be fine if you'd written --with-segsize-blocks, wouldn't
>> it? Can we make the printout format depend on which switch was used?
> Not sure why I didn't think of that...
>
> Updated patch attached.
>
> I made one autoconf and one meson CI task use a small block size, but just to
> ensure it works on both. I'd probably leave it set on one, so we keep the
> coverage for cfbot?

Are we going to impose some sane minimum, or leave it up to developers to discover that for themselves?

cheers

andrew

-- 
Andrew Dunstan
EDB: https://www.enterprisedb.com
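Tom's suggestion, choosing the display format based on which switch the user gave, could be sketched roughly as follows; the function and parameter names here are invented for illustration, and the real change lives in configure.ac/meson.build rather than C:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: report the segment size in blocks when the user
 * configured --with-segsize-blocks, otherwise in GB as before. */
static void
format_segsize(char *out, size_t outlen, long blocks, long block_size,
			   bool user_gave_blocks)
{
	if (user_gave_blocks)
		snprintf(out, outlen, "%ld blocks", blocks);
	else
		snprintf(out, outlen, "%ld GB",
				 (blocks * block_size) / (1024L * 1024L * 1024L));
}
```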
Re: Hash index build performance tweak from sorting
Hi,

I did some simple benchmark with v2 and v3, using the attached script, which essentially just builds a hash index on random data, with different data types and maintenance_work_mem values. And what I see is this (median of 10 runs):

  machine  data type  m_w_m    master     v2      v3
  ----------------------------------------------------
  i5       bigint     128MB      9652    9402    9669
                       32MB      9545    9291    9535
                        4MB      9599    9371    9741
           int        128MB      9666    9475    9676
                       32MB      9530    9347    9528
                        4MB      9595    9394    9624
           text       128MB      9755    9596    9897
                       32MB      9711    9547    9846
                        4MB      9808    9744   10024
  xeon     bigint     128MB     10790   10555   10812
                       32MB     10690   10373   10579
                        4MB     10682   10351   10650
           int        128MB     11258   10550   10712
                       32MB     10963   10272   10410
                        4MB     11152   10366   10589
           text       128MB     10935   10694   10930
                       32MB     10822   10672   10861
                        4MB     10835   10684   10895

Or, relative to master:

  machine  data type  memory      v2        v3
  ----------------------------------------------
  i5       bigint     128MB    97.40%   100.17%
                       32MB    97.34%    99.90%
                        4MB    97.62%   101.48%
           int        128MB    98.03%   100.11%
                       32MB    98.08%    99.98%
                        4MB    97.91%   100.31%
           text       128MB    98.37%   101.46%
                       32MB    98.32%   101.40%
                        4MB    99.35%   102.20%
  xeon     bigint     128MB    97.82%   100.20%
                       32MB    97.03%    98.95%
                        4MB    96.89%    99.70%
           int        128MB    93.71%    95.15%
                       32MB    93.70%    94.95%
                        4MB    92.95%    94.95%
           text       128MB    97.80%    99.96%
                       32MB    98.62%   100.36%
                        4MB    98.61%   100.55%

So to me it seems v2 performs demonstrably better; v3 is consistently slower, not only compared to v2 but often also to master.

Attached is the script I used and the raw results. This includes also results for logged tables; the improvement is smaller, but the conclusions are otherwise similar.
regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: locale -a missing on Alpine Linux?
Andrew Dunstan writes: > malleefowl is a docker instance (mostly docker images is what Alpine is > used for). It would be extremely easy to recreate the image and add in > musl-locales, but maybe we should just leave it as it is to test the > case where the command isn't available. Agreed, leave it as-is. regards, tom lane
Re: Report checkpoint progress with pg_stat_progress_checkpoint (was: Report checkpoint progress in server logs)
On Wed, Nov 16, 2022 at 2:52 PM Andres Freund wrote:
> I don't think we'd want every buffer write or whatnot go through the
> changecount mechanism; on some non-x86 platforms that could be noticeable. But
> if we didn't stage the stats updates locally I think we could make most of the
> stats changes without that overhead. For updates that just increment a single
> counter there's simply no benefit in the changecount mechanism afaict.

You might be right, but I'm not sure whether it's worth stressing about. The progress reporting mechanism uses the st_changecount mechanism, too, and as far as I know nobody's complained about that having too much overhead. Maybe they have, though, and I've just missed it.

-- 
Robert Haas
EDB: http://www.enterprisedb.com
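For readers unfamiliar with it, the st_changecount mechanism under discussion is a seqlock-style protocol: the writer bumps a counter before and after each update, and a reader retries whenever it sees an odd count or a count that changed during its read. A single-threaded sketch follows (real code additionally needs memory barriers, which this omits):

```c
#include <assert.h>

/* Seqlock-style slot: changecount is odd while an update is in flight. */
typedef struct
{
	int		changecount;
	long	value;
} Slot;

static void
slot_write(Slot *s, long v)
{
	s->changecount++;			/* now odd: update in progress */
	s->value = v;
	s->changecount++;			/* even again: consistent */
}

static long
slot_read(const Slot *s)
{
	long	v;
	int		before;

	do
	{
		before = s->changecount;
		v = s->value;
	} while ((before & 1) || before != s->changecount);
	return v;
}
```

The overhead Andres mentions comes from the two counter bumps (and the barriers they require) on every write, which is why a bare atomic increment is cheaper for single-counter updates.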
Re: Supporting TAP tests with MSVC and Windows
On 2022-11-17 Th 01:18, Noah Misch wrote:
> On Mon, Aug 22, 2022 at 09:44:42AM -0400, Andrew Dunstan wrote:
>> On 2022-08-21 Su 20:40, Noah Misch wrote:
>>> This (commit 13d856e of 2015-07-29) added the following:
>>>
>>> --- a/src/test/perl/TestLib.pm
>>> +++ b/src/test/perl/TestLib.pm
>>> @@ -242,7 +288,17 @@ sub command_exit_is
>>>  	print("# Running: " . join(" ", @{$cmd}) ."\n");
>>>  	my $h = start $cmd;
>>>  	$h->finish();
>>> -	is($h->result(0), $expected, $test_name);
>>> +
>>> +	# On Windows, the exit status of the process is returned directly as the
>>> +	# process's exit code, while on Unix, it's returned in the high bits
>>> +	# of the exit code (see WEXITSTATUS macro in the standard
>>> +	# header file). IPC::Run's result function always returns exit code >> 8,
>>> +	# assuming the Unix convention, which will always return 0 on Windows as
>>> +	# long as the process was not terminated by an exception. To work around
>>> +	# that, use $h->full_result on Windows instead.
>>> +	my $result = ($Config{osname} eq "MSWin32") ?
>>> +	  ($h->full_results)[0] : $h->result(0);
>>> +	is($result, $expected, $test_name);
>>>  }
>>>
>>> That behavior came up again in the context of a newer IPC::Run test case. I'm
>>> considering changing the IPC::Run behavior such that the above would have been
>>> unnecessary. However, if I do, the above code would want to adapt to handle
>>> pre-change and post-change IPC::Run versions. If you have an opinion on
>>> whether or how IPC::Run should change, I welcome comments on
>>> https://github.com/toddr/IPC-Run/issues/161.
>> Assuming it changes, we'll have to have a version test here. I don't
>> think we can have a flag day where we suddenly require IPC::Run's
>> bleeding edge on Windows. So changing it is a good thing, but it won't
>> help us much.
> IPC::Run git now has the change, and we may release soon. Here's what I plan
> to push to make PostgreSQL cope.
LGTM

cheers

andrew

-- 
Andrew Dunstan
EDB: https://www.enterprisedb.com
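As background to the quoted comment, the Unix convention it describes can be sketched in C; unix_style_result below is a stand-in for what IPC::Run's result() does with the raw status word, and the name is invented for illustration:

```c
#include <assert.h>

/* On POSIX, a normally-exited child's code occupies bits 8..15 of the
 * status word reported by wait(); WEXITSTATUS extracts it by shifting
 * right 8. On Windows the exit code is the raw value itself, so an
 * unconditional >> 8 yields 0 for any code below 256. */
static int
unix_style_result(int raw_status)
{
	return (raw_status >> 8) & 0xff;
}
```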
Re: locale -a missing on Alpine Linux?
On 2022-11-16 We 14:25, Tom Lane wrote: > Christoph Moench-Tegeder writes: >> ## Peter Eisentraut (peter.eisentr...@enterprisedb.com): >>> First of all, is this a standard installation of this OS, or is perhaps >>> something incomplete, broken, or unusual about the current OS installation? >> Alpine uses musl libc, on which you need package musl-locales to get >> a /usr/bin/locale. >> https://pkgs.alpinelinux.org/package/edge/community/x86/musl-locales > Ah. And that also shows that if you didn't install that package, > you don't have any locales either, except presumably C/POSIX. > > So probably we should treat failure of the locale command as okay > and just press on with no non-built-in locales. malleefowl is a docker instance (mostly docker images is what Alpine is used for). It would be extremely easy to recreate the image and add in musl-locales, but maybe we should just leave it as it is to test the case where the command isn't available. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Re: HOT chain validation in verify_heapam()
On Wed, Nov 16, 2022 at 10:57 PM Himanshu Upadhyaya wrote: > I think if pred_xmin is aborted and curr_xmin is in progress we should > consider it as a corruption case but vice versa is not true. Yeah, you're right. I'm being stupid about subtransactions again. -- Robert Haas EDB: http://www.enterprisedb.com
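The rule being agreed on here can be captured as a tiny predicate; the enum and function are illustrative only, not verify_heapam()'s actual code. A predecessor whose xmin aborted cannot have an in-progress successor, while the reverse can legitimately happen with subtransactions.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative transaction states; not PostgreSQL's actual types. */
typedef enum
{
	XID_IN_PROGRESS,
	XID_COMMITTED,
	XID_ABORTED
} XidStatus;

/* Corruption iff the predecessor's xmin aborted but the current tuple's
 * xmin is still in progress; the opposite combination is legal. */
static bool
hot_chain_corrupt(XidStatus pred_xmin, XidStatus curr_xmin)
{
	return pred_xmin == XID_ABORTED && curr_xmin == XID_IN_PROGRESS;
}
```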
Re: Report checkpoint progress with pg_stat_progress_checkpoint (was: Report checkpoint progress in server logs)
On Thu, Nov 17, 2022 at 12:49 AM Robert Haas wrote: > > > I also think that a new pg_stat_checkpoint view is needed > > because, right now, the PgStat_CheckpointerStats stats are exposed via > > the pg_stat_bgwriter view, having a separate view for checkpoint stats > > is good here. > > Yep. On Wed, Nov 16, 2022 at 11:44 PM Andres Freund wrote: > > > I also think that a new pg_stat_checkpoint view is needed > > because, right now, the PgStat_CheckpointerStats stats are exposed via > > the pg_stat_bgwriter view, having a separate view for checkpoint stats > > is good here. > > I agree that we should do that, but largely independent of the architectural > question at hand. Thanks. I quickly prepared a patch introducing pg_stat_checkpointer view and posted it here - https://www.postgresql.org/message-id/CALj2ACVxX2ii%3D66RypXRweZe2EsBRiPMj0aHfRfHUeXJcC7kHg%40mail.gmail.com. -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Don't treat IndexStmt like AlterTable when DefineIndex is called from ProcessUtilitySlow.
Hi,

Reading the code recently, I found that when DefineIndex is called from ProcessUtilitySlow, is_alter_table is set to true if the statement came from expandTableLikeClause. Checking the code of DefineIndex, there are only two places that use is_alter_table:

1. the function index_check_primary_key
2. printing a debug log on what the statement is

For 1, since we are doing create table xxx (like yyy including indexes), we are sure that the relationHasPrimaryKey check in index_check_primary_key will be satisfied, because we are just creating the new table.

For 2, I do not think it will mislead the user if we print it as CreateStmt.

Based on the above, I think we can always pass a false value for is_alter_table when DefineIndex is called from ProcessUtilitySlow. Here I attach a patch. Any ideas? Thanks a lot.

Best,
Zhenghua Lyu

0001-Don-t-treate-IndexStmt-like-AlterTable-when-DefineIn.patch
Description: Binary data
Introduce a new view for checkpointer related stats
Hi,

pg_stat_bgwriter view currently reports checkpointer stats as well. It is that way because historically checkpointer was part of bgwriter until the commits 806a2ae and bf405ba, which went into PG 9.2, separated them out. I think it is time for us to separate checkpointer stats into their own view. I discussed it in another thread [1] and it seems like there's an unequivocal agreement for the proposal.

I'm attaching a patch introducing a new pg_stat_checkpointer view; with this change the pg_stat_bgwriter view only focuses on bgwriter related stats. The patch does mostly mechanical changes. I'll add it to CF in a bit.

Thoughts?

[1] https://www.postgresql.org/message-id/flat/20221116181433.y2hq2pirtbqmmndt%40awork3.anarazel.de#b873a4bd7d8d7ec70750a7047db33f56
https://www.postgresql.org/message-id/CA%2BTgmoYCu6RpuJ3cZz0e7cZSfaVb%3Dcr6iVcgGMGd5dxX0MYNRA%40mail.gmail.com

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

From 707b3f0e0e7cf55b48a09709b070341d23ad03b8 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy
Date: Thu, 17 Nov 2022 12:20:13 +
Subject: [PATCH v1] Introduce a new view for checkpointer related stats

pg_stat_bgwriter view currently reports checkpointer stats as well. It is that way because historically checkpointer was part of bgwriter until the commits 806a2ae and bf405ba, that went into PG 9.2, separated them out. It is time for us to separate checkpointer stats to its own view called pg_stat_checkpointer.

Bump catalog version.
--- doc/src/sgml/monitoring.sgml | 108 +- src/backend/catalog/system_views.sql | 18 +-- .../utils/activity/pgstat_checkpointer.c | 1 + src/backend/utils/adt/pgstatfuncs.c | 19 +-- src/include/catalog/pg_proc.dat | 22 ++-- src/include/pgstat.h | 1 + src/test/recovery/t/029_stats_restart.pl | 6 +- src/test/regress/expected/rules.out | 15 +-- src/test/regress/expected/stats.out | 21 +++- src/test/regress/sql/stats.sql| 12 +- 10 files changed, 154 insertions(+), 69 deletions(-) diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml index e5d622d514..52c3118727 100644 --- a/doc/src/sgml/monitoring.sgml +++ b/doc/src/sgml/monitoring.sgml @@ -448,6 +448,15 @@ postgres 27093 0.0 0.0 30096 2752 ?Ss 11:34 0:00 postgres: ser + + pg_stat_checkpointerpg_stat_checkpointer + One row only, showing statistics about the + checkpointer process's activity. See + + pg_stat_checkpointer for details. + + + pg_stat_walpg_stat_wal One row only, showing statistics about WAL activity. See @@ -3647,97 +3656,138 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i - checkpoints_timed bigint + buffers_clean bigint - Number of scheduled checkpoints that have been performed + Number of buffers written by the background writer - checkpoints_req bigint + maxwritten_clean bigint - Number of requested checkpoints that have been performed + Number of times the background writer stopped a cleaning + scan because it had written too many buffers - checkpoint_write_time double precision + buffers_alloc bigint - Total amount of time that has been spent in the portion of - checkpoint processing where files are written to disk, in milliseconds + Number of buffers allocated - checkpoint_sync_time double precision + stats_reset timestamp with time zone - Total amount of time that has been spent in the portion of - checkpoint processing where files are synchronized to disk, in - milliseconds + Time at which these statistics were last reset + + + + + + + + 
pg_stat_checkpointer + + + pg_stat_checkpointer + + + The pg_stat_checkpointer view will always have a + single row, containing global data for the cluster. + + + + pg_stat_checkpointer View + + - buffers_checkpoint bigint + Column Type - Number of buffers written during checkpoints + Description + + + + + + + + checkpoints_timed bigint + + + Number of scheduled checkpoints that have been performed - buffers_clean bigint + checkpoints_req bigint - Number of buffers written by the background writer + Number of requested checkpoints that have been performed - maxwritten_clean bigint + checkpoint_write_time double
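For context, querying the proposed view would presumably look like this; the column names are taken from the checkpointer stats shown in the diff above, and this sketch has not been verified against the final patch:

```sql
-- Hypothetical query against the proposed pg_stat_checkpointer view.
SELECT checkpoints_timed,
       checkpoints_req,
       checkpoint_write_time,
       checkpoint_sync_time
FROM pg_stat_checkpointer;
```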
Re: old_snapshot: add test for coverage
> On 8 Aug 2022, at 07:37, Tom Lane wrote:
> Dong Wook Lee writes:
>> I wrote a test of the old_snapshot extension for coverage.
>
> Hmm, does this really provide any meaningful coverage? The test
> sure looks like it's not doing much.

Looking at this I agree, this test doesn't provide enough to be of value, and the LIMIT 0 might even hide bugs under a positive test result. I think we should mark this entry RwF.

-- 
Daniel Gustafsson
https://vmware.com/
CREATE UNLOGGED TABLE seq faults when debug_discard_caches=1
Hi Hackers,

while testing the developer settings of PostgreSQL (14.5) I came across this issue:

postgres=# CREATE UNLOGGED TABLE stats (
postgres(#     pg_hash BIGINT NOT NULL,
postgres(#     category TEXT NOT NULL,
postgres(#     PRIMARY KEY (pg_hash, category)
postgres(# );
server closed the connection unexpectedly
This probably means the server terminated abnormally before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.

Checking the stack trace I found this:

Program received signal SIGSEGV, Segmentation fault.
0x00ab6662 in smgrwrite (reln=0x0, forknum=INIT_FORKNUM, blocknum=0, buffer=0x2b5eec0 "", skipFsync=true) at /opt/postgresql-src/debug-build/../src/backend/storage/smgr/smgr.c:526
526             smgrsw[reln->smgr_which].smgr_write(reln, forknum, blocknum,
(gdb) bt
#0  0x00ab6662 in smgrwrite (reln=0x0, forknum=INIT_FORKNUM, blocknum=0, buffer=0x2b5eec0 "", skipFsync=true) at /opt/postgresql-src/debug-build/../src/backend/storage/smgr/smgr.c:526
#1  0x0056991b in btbuildempty (index=0x7fe60ac9be60) at /opt/postgresql-src/debug-build/../src/backend/access/nbtree/nbtree.c:166
#2  0x00623ad9 in index_build (heapRelation=0x7fe60ac9c078, indexRelation=0x7fe60ac9be60, indexInfo=0x2b4c330, isreindex=false, parallel=true) at /opt/postgresql-src/debug-build/../src/backend/catalog/index.c:3028
#3  0x00621886 in index_create (heapRelation=0x7fe60ac9c078, indexRelationName=0x2b4c448 "stats_pkey", indexRelationId=16954, parentIndexRelid=0, parentConstraintId=0, relFileNode=0, indexInfo=0x2b4c330, indexColNames=0x2b4bee8, accessMethodObjectId=403, tableSpaceId=0, collationObjectId=0x2b4c560, classObjectId=0x2b4c580, coloptions=0x2b4c5a0, reloptions=0, flags=3, constr_flags=0, allow_system_table_mods=false, is_internal=false, constraintId=0x7ffef5cc4a7c) at /opt/postgresql-src/debug-build/../src/backend/catalog/index.c:1232
#4  0x0074af6e in DefineIndex (relationId=16949, stmt=0x2b527a0, indexRelationId=0, parentIndexId=0, parentConstraintId=0, is_alter_table=false, check_rights=true, check_not_in_use=true, skip_build=false, quiet=false) at /opt/postgresql-src/debug-build/../src/backend/commands/indexcmds.c:1164
#5  0x00ac8d78 in ProcessUtilitySlow (pstate=0x2b49230, pstmt=0x2b48fe8, queryString=0x2a71650 "CREATE UNLOGGED TABLE stats (\n pg_hash BIGINT NOT NULL,\ncategory TEXT NOT NULL,\nPRIMARY KEY (pg_hash, category)\n);", context=PROCESS_UTILITY_SUBCOMMAND, params=0x0, queryEnv=0x0, dest=0xe9ceb0 , qc=0x0) at /opt/postgresql-src/debug-build/../src/backend/tcop/utility.c:1535
#6  0x00ac6637 in standard_ProcessUtility (pstmt=0x2b48fe8, queryString=0x2a71650 "CREATE UNLOGGED TABLE stats (\npg_hash BIGINT NOT NULL,\ncategory TEXT NOT NULL,\nPRIMARY KEY (pg_hash, category)\n);", readOnlyTree=false, context=PROCESS_UTILITY_SUBCOMMAND, params=0x0, queryEnv=0x0, dest=0xe9ceb0 , qc=0x0) at /opt/postgresql-src/debug-build/../src/backend/tcop/utility.c:1066
#7  0x00ac548b in ProcessUtility (pstmt=0x2b48fe8, queryString=0x2a71650 "CREATE UNLOGGED TABLE stats (\npg_hash BIGINT NOT NULL,\ncategory TEXT NOT NULL,\nPRIMARY KEY (pg_hash, category)\n);", readOnlyTree=false, context=PROCESS_UTILITY_SUBCOMMAND, params=0x0, queryEnv=0x0, dest=0xe9ceb0 , qc=0x0) at /opt/postgresql-src/debug-build/../src/backend/tcop/utility.c:527
#8  0x00ac7e5e in ProcessUtilitySlow (pstate=0x2b52b10, pstmt=0x2a72d28, queryString=0x2a71650 "CREATE UNLOGGED TABLE stats (\n pg_hash BIGINT NOT NULL,\ncategory TEXT NOT NULL,\nPRIMARY KEY (pg_hash, category)\n);", context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0x2a72df8, qc=0x7ffef5cc6c10) at /opt/postgresql-src/debug-build/../src/backend/tcop/utility.c:1244
#9  0x00ac6637 in standard_ProcessUtility (pstmt=0x2a72d28, queryString=0x2a71650 "CREATE UNLOGGED TABLE stats (\npg_hash BIGINT NOT NULL,\ncategory TEXT NOT NULL,\nPRIMARY KEY (pg_hash, category)\n);", readOnlyTree=false, context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0x2a72df8, qc=0x7ffef5cc6c10) at /opt/postgresql-src/debug-build/../src/backend/tcop/utility.c:1066
#10 0x00ac548b in ProcessUtility (pstmt=0x2a72d28, queryString=0x2a71650 "CREATE UNLOGGED TABLE stats (\npg_hash BIGINT NOT NULL,\ncategory TEXT NOT NULL,\nPRIMARY KEY (pg_hash, category)\n);", readOnlyTree=false, context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0x2a72df8, qc=0x7ffef5cc6c10) at /opt/postgresql-src/debug-build/../src/backend/tcop/utility.c:527
#11 0x00ac4aad in PortalRunUtility (portal=0x2b06bf0, pstmt=0x2a72d28, isTopLevel=true, setHoldSnapshot=false, dest=0x2a72df8, qc=0x7ffef5cc6c10) at /opt/postgresql-src/debug-build/../src/backend/tcop/pquery.c:1155
#12 0x00ac3b57 in PortalRunMulti (portal=0x2b06bf0, isTopLevel=true, setHoldSnapshot=false,
Re: Typo in SH_LOOKUP and SH_LOOKUP_HASH comments
> On 17 Nov 2022, at 10:56, vignesh C wrote: > When I was reviewing one of the patches, I found a typo in the > comments of SH_LOOKUP and SH_LOOKUP_HASH. I felt "lookup up" should > have been "lookup". > Attached a patch to modify it. I agree with that, applied to HEAD. Thanks! -- Daniel Gustafsson https://vmware.com/
redundant check of msg in does_not_exist_skipping
Hi,

I was looking at commit aca992040951c7665f1701cd25d48808eda7a809

I think the check of msg after the switch statement is not necessary. The variable msg is used afterward. If there is a (potential) missing case in the switch statement, the compiler would warn.

How about removing the check?

Thanks

diff --git a/src/backend/commands/dropcmds.c b/src/backend/commands/dropcmds.c
index db906f530e..55996940eb 100644
--- a/src/backend/commands/dropcmds.c
+++ b/src/backend/commands/dropcmds.c
@@ -518,9 +518,6 @@ does_not_exist_skipping(ObjectType objtype, Node *object)
 		/* no default, to let compiler warn about missing case */
 	}
 
-	if (!msg)
-		elog(ERROR, "unrecognized object type: %d", (int) objtype);
-
 	if (!args)
 		ereport(NOTICE,
 				(errmsg(msg, name)));
 	else
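The compiler-warning argument here relies on the switch having no default clause: with -Wswitch (included in -Wall), GCC and Clang flag any enum value that lacks a case. A hypothetical minimal illustration, with an invented enum and messages rather than PostgreSQL's actual ones:

```c
#include <assert.h>
#include <string.h>

/* Invented object types for illustration; not PostgreSQL's ObjectType. */
typedef enum
{
	OBJ_TABLE,
	OBJ_INDEX,
	OBJ_VIEW
} ObjType;

static const char *
object_skip_message(ObjType t)
{
	const char *msg = NULL;

	switch (t)
	{
		case OBJ_TABLE:
			msg = "table \"%s\" does not exist, skipping";
			break;
		case OBJ_INDEX:
			msg = "index \"%s\" does not exist, skipping";
			break;
		case OBJ_VIEW:
			msg = "view \"%s\" does not exist, skipping";
			break;
			/* no default, so -Wswitch flags any newly added ObjType */
	}
	return msg;
}
```

As long as every reachable case assigns msg, the compile-time check makes a runtime NULL test redundant, which is the point both participants converge on.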
Re: Lockless queue of waiters in LWLock
Hi, hackers! I've noticed that alignment requirements for using pg_atomic_init_u64() for LWlock state (there's an assertion that it's aligned at 8 bytes) do not correspond to the code in SimpleLruInit() on 32-bit arch when MAXIMUM_ALIGNOF = 4. Fixed this in v4 patch (PFA). Regards, Pavel. v4-0001-Lockless-queue-of-LWLock-waiters.patch Description: Binary data
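The mismatch described, where placement padded to MAXIMUM_ALIGNOF does not guarantee the 8-byte alignment that pg_atomic_init_u64 asserts when MAXIMUM_ALIGNOF is 4, can be seen with a simplified version of PostgreSQL's TYPEALIGN-style rounding; the macro below is a generic sketch, not the actual definition:

```c
#include <assert.h>
#include <stdint.h>

/* Round "val" up to the next multiple of alignment "a" (a power of two),
 * in the spirit of PostgreSQL's TYPEALIGN macro. */
#define ALIGN_UP(a, val) \
	(((uintptr_t) (val) + ((a) - 1)) & ~((uintptr_t) ((a) - 1)))
```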
Re: [PoC] Reducing planning time when tables have many partitions
On Thu, 17 Nov 2022 at 11:20, Thom Brown wrote: > > On Thu, 17 Nov 2022 at 09:31, Alvaro Herrera wrote: > > > > On 2022-Nov-16, Thom Brown wrote: > > > > > Once the issue Tom identified has been resolved, I'd like to test > > > drive newer patches. > > > > What issue? If you mean the one from the thread "Reducing > > duplicativeness of EquivalenceClass-derived clauses", that patch is > > already applied (commit a5fc46414deb), and Yuya Watari's v8 series > > applies fine to current master. > > Ah, I see.. I'll test the v8 patches. No issues with applying. Created 1024 partitions, each of which is partitioned into 64 partitions. I'm getting a generic planning time of 1415ms. Is that considered reasonable in this situation? Bear in mind that the planning time prior to this patch was 282311ms, so pretty much a 200x speedup. Thom