Re: pgsql: Clean up role created in new subscription test.
On 06.07.23 00:00, Daniel Gustafsson wrote: On 16 May 2023, at 11:17, Daniel Gustafsson wrote: Parked in the July CF for now. Rebased to fix a trivial conflict highlighted by the CFBot. I think the problem with this approach is that one would need to reapply it to each regression test suite separately. For example, there are several tests under contrib/ that create roles. These would not be covered by this automatically. I think the earlier idea of just counting roles, tablespaces, etc. before and after would be sufficient.
Question about the Implementation of vector32_is_highbit_set on ARM
Hi all,

I have some questions about the implementation of vector32_is_highbit_set on ARM. Below are the comment and the implementation for this function:

/*
 * Exactly like vector8_is_highbit_set except for the input type, so it
 * looks at each byte separately.
 *
 * XXX x86 uses the same underlying type for 8-bit, 16-bit, and 32-bit
 * integer elements, but Arm does not, hence the need for a separate
 * function. We could instead adopt the behavior of Arm's vmaxvq_u32(), i.e.
 * check each 32-bit element, but that would require an additional mask
 * operation on x86.
 */
#ifndef USE_NO_SIMD
static inline bool
vector32_is_highbit_set(const Vector32 v)
{
#if defined(USE_NEON)
	return vector8_is_highbit_set((Vector8) v);
#else
	return vector8_is_highbit_set(v);
#endif
}
#endif							/* ! USE_NO_SIMD */

But I still don't understand why the vmaxvq_u32 intrinsic is not used on the ARM platform, given that we already use the USE_NEON macro to distinguish the platforms. In addition, according to the "Arm Neoverse N1 Software Optimization Guide", the vmaxvq_u32 intrinsic has half the latency of vmaxvq_u8 and twice the bandwidth, so I think we could just use vmaxvq_u32 directly.

Any comments or feedback are welcome.
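For readers comparing the two semantics, here is a scalar model (illustrative code, not PostgreSQL's implementation) of what the current per-byte check does versus what a vmaxvq_u32()-style per-element check would do. Note the two can disagree when only a low byte of a 32-bit lane has its high bit set, which is why the XXX comment frames this as a behavior choice and not merely a latency question:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Per-byte check: true if any byte has its high bit (0x80) set.  This is
 * the semantics of vector8_is_highbit_set(), which the current
 * vector32_is_highbit_set() inherits on all platforms.
 */
static bool
highbit_set_per_byte(const uint8_t *v, size_t len)
{
	for (size_t i = 0; i < len; i++)
		if (v[i] & 0x80)
			return true;
	return false;
}

/*
 * Per-32-bit-lane check: true if any 32-bit element has bit 31 set.
 * This is the semantics a vmaxvq_u32()-based variant would give.
 */
static bool
highbit_set_per_u32(const uint32_t *v, size_t nlanes)
{
	for (size_t i = 0; i < nlanes; i++)
		if (v[i] & 0x80000000u)
			return true;
	return false;
}
```

For a lane holding 0x00008000, the per-byte check fires (byte 0x80 present) while the per-u32 check does not, so switching intrinsics would change behavior unless callers only ever see values where the two coincide.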
Re: Improve WALRead() to suck data directly from WAL buffers when possible
On Sat, Nov 4, 2023 at 6:13 AM Andres Freund wrote:
>
> > +		cur_lsn = GetFlushRecPtr(NULL);
> > +		if (unlikely(startptr > cur_lsn))
> > +			elog(ERROR, "WAL start LSN %X/%X specified for reading from WAL buffers must be less than current database system WAL LSN %X/%X",
> > +				 LSN_FORMAT_ARGS(startptr), LSN_FORMAT_ARGS(cur_lsn));
>
> Hm, why does this check belong here? For some tools it might be legitimate to
> read the WAL before it was fully flushed.

Agreed and removed the check.

> > +	/*
> > +	 * Holding WALBufMappingLock ensures inserters don't overwrite this value
> > +	 * while we are reading it. We try to acquire it in shared mode so that
> > +	 * the concurrent WAL readers are also allowed. We try to do as less work
> > +	 * as possible while holding the lock to not impact concurrent WAL writers
> > +	 * much. We quickly exit to not cause any contention, if the lock isn't
> > +	 * immediately available.
> > +	 */
> > +	if (!LWLockConditionalAcquire(WALBufMappingLock, LW_SHARED))
> > +		return 0;
>
> That seems problematic - that lock is often heavily contended. We could
> instead check IsWALRecordAvailableInXLogBuffers() once before reading the
> page, then read the page contents *without* holding a lock, and then check
> IsWALRecordAvailableInXLogBuffers() again - if the page was replaced in the
> interim we read bogus data, but that's a bit of a wasted effort.

In the new approach described upthread at https://www.postgresql.org/message-id/c3455ab9da42e09ca9d059879b5c512b2d1f9681.camel%40j-davis.com, there's no lock required for reading from WAL buffers. PSA patches for more details.

> > +		/*
> > +		 * The fact that we acquire WALBufMappingLock while reading the WAL
> > +		 * buffer page itself guarantees that no one else initializes it or
> > +		 * makes it ready for next use in AdvanceXLInsertBuffer(). However, we
> > +		 * need to ensure that we are not reading a page that just got
> > +		 * initialized. For this, we look at the needed page header.
> > +		 */
> > +		phdr = (XLogPageHeader) page;
> > +
> > +		/* Return, if WAL buffer page doesn't look valid. */
> > +		if (!(phdr->xlp_magic == XLOG_PAGE_MAGIC &&
> > +			  phdr->xlp_pageaddr == (ptr - (ptr % XLOG_BLCKSZ)) &&
> > +			  phdr->xlp_tli == tli))
> > +			break;
>
> I don't think this code should ever encounter a page where this is not the
> case? We particularly shouldn't do so silently, seems that could hide all
> kinds of problems.

I think it's possible to read a "just got initialized" page with the new approach (reading WAL buffer pages without WALBufMappingLock), if the page is read right after it is initialized and xlblocks is filled in AdvanceXLInsertBuffer(), but before the actual WAL is written.

> > +		/*
> > +		 * Note that we don't perform all page header checks here to avoid
> > +		 * extra work in production builds; callers will anyway do those
> > +		 * checks extensively. However, in an assert-enabled build, we perform
> > +		 * all the checks here and raise an error if failed.
> > +		 */
>
> Why?

Minimal page header checks are performed here to ensure we don't read a page that just got initialized, unlike the full checks that XLogReaderValidatePageHeader() does. Are you suggesting removing the page header checks with XLogReaderValidatePageHeader() for assert-enabled builds? Or are you suggesting doing the page header checks with XLogReaderValidatePageHeader() for production builds too?

PSA v16 patch set. Note that the 0004 patch adds support for WAL read stats (both from WAL file and WAL buffers) to walsenders, and may not necessarily be the best approach to capture WAL read stats in light of https://www.postgresql.org/message-id/CALj2ACU_f5_c8F%2BxyNR4HURjG%3DJziiz07wCpQc%3DAqAJUFh7%2B8w%40mail.gmail.com, which adds WAL read/write/fsync stats to pg_stat_io.
--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

From b678edfccf7ecf490cb792391249cbf85ba0db29 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy
Date: Tue, 7 Nov 2023 19:20:00 +
Subject: [PATCH v16] Use 64-bit atomics for xlblocks array elements

In AdvanceXLInsertBuffer(), the xlblocks value of a WAL buffer page is updated only at the end, after the page is initialized with all zeros. A problem with this approach is that anyone reading xlblocks and the WAL buffer page without holding WALBufMappingLock will see the wrong page contents if the read happens before xlblocks is marked with a new entry in AdvanceXLInsertBuffer() at the end. To fix this issue, xlblocks is made to use 64-bit atomics instead of XLogRecPtr and the xlblocks value is marked with
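The lock-free scheme under discussion can be sketched in miniature (hypothetical names and a simplified memory model, not the actual patch): the writer publishes a page's end address in an atomic only after initializing the page, and a reader copies the page without any lock and then re-checks the atomic, discarding the copy if the mapping changed underneath it.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define BLKSZ 64				/* stand-in for XLOG_BLCKSZ */

/* One WAL buffer slot: the page, and the end address it currently maps. */
typedef struct
{
	_Atomic uint64_t endptr;	/* stand-in for an atomic xlblocks[] entry */
	char		page[BLKSZ];
} WalBufferSlot;

/* Writer: invalidate, initialize the page, then publish its address. */
static void
slot_install_page(WalBufferSlot *slot, uint64_t endptr, const char *contents)
{
	atomic_store_explicit(&slot->endptr, 0, memory_order_release);
	memcpy(slot->page, contents, BLKSZ);
	atomic_store_explicit(&slot->endptr, endptr, memory_order_release);
}

/*
 * Reader: copy without any lock, then re-check that the slot still maps
 * the page we wanted.  Returns false if the page was (or was being)
 * replaced, in which case the caller falls back to reading the WAL file.
 */
static bool
slot_read_page(WalBufferSlot *slot, uint64_t expected_endptr, char *dst)
{
	if (atomic_load_explicit(&slot->endptr, memory_order_acquire) != expected_endptr)
		return false;
	memcpy(dst, slot->page, BLKSZ);
	/* re-check: was the slot recycled while we were copying? */
	return atomic_load_explicit(&slot->endptr, memory_order_acquire) == expected_endptr;
}
```

This compresses the real ordering requirements considerably; the actual code must also pair its barriers with the ones in AdvanceXLInsertBuffer(), which is what the 64-bit-atomic xlblocks change above is about.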
Re: Show WAL write and fsync stats in pg_stat_io
On Wed, Sep 20, 2023 at 1:28 PM Nazir Bilal Yavuz wrote:
>
> Hi,
>
> Thanks for the review!
>
> Current status of the patch is:
> - IOOBJECT_WAL / IOCONTEXT_NORMAL read, write and fsync stats are added.
> - IOOBJECT_WAL / IOCONTEXT_NORMAL write and fsync tests are added.
> - IOOBJECT_WAL / IOCONTEXT_INIT stats are added.
> - pg_stat_io shows different op_bytes for the IOOBJECT_WAL operations.
> - Working on which 'BackendType / IOContext / IOOp' should be banned in pg_stat_io.
> - PendingWalStats.wal_sync and PendingWalStats.wal_write_time / PendingWalStats.wal_sync_time are moved to pgstat_count_io_op_n() / pgstat_count_io_op_time() respectively.
>
> TODOs:
> - Documentation.
> - Try to set op_bytes for BackendType / IOContext.
> - Decide which 'BackendType / IOContext / IOOp' should not be tracked.
> - Add IOOBJECT_WAL / IOCONTEXT_NORMAL read tests.
> - Add IOOBJECT_WAL / IOCONTEXT_INIT tests.

This patchset currently covers:
- IOOBJECT_WAL / IOCONTEXT_NORMAL read, write and fsync.
- IOOBJECT_WAL / IOCONTEXT_INIT write and fsync.

It doesn't cover:
- Streaming replication WAL IO.

Is there any plan to account for WAL read stats in the WALRead() function, which would cover walsenders, i.e., WAL read by logical and streaming replication, WAL read by pg_walinspect, and so on? I see the patch already covers WAL read stats by recovery in XLogPageRead(), but not the other page_read callbacks which will end up in WALRead() eventually. If added, the feature at https://www.postgresql.org/message-id/CALj2ACXKKK%3DwbiG5_t6dGao5GoecMwRkhr7GjVBM_jg54%2BNa%3DQ%40mail.gmail.com can then extend it to also cover stats for WAL read from WAL buffers.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Doubled test for SET statements in pg_stat_statements tests
Hello!

I noticed that the same block

-- SET statements.
-- These use two different strings, still they count as one entry.
SET work_mem = '1MB';
Set work_mem = '1MB';
SET work_mem = '2MB';
RESET work_mem;
SET enable_seqscan = off;
SET enable_seqscan = on;
RESET enable_seqscan;

is checked twice in contrib/pg_stat_statements/sql/utility.sql, on lines 278-286 and 333-341. Is this on purpose? I think the second set of tests is not needed and can be removed, as in the attached patch.

regards, Sergei

diff --git a/contrib/pg_stat_statements/expected/utility.out b/contrib/pg_stat_statements/expected/utility.out
index cc6e898cdf..a913421290 100644
--- a/contrib/pg_stat_statements/expected/utility.out
+++ b/contrib/pg_stat_statements/expected/utility.out
@@ -664,30 +664,3 @@ SELECT pg_stat_statements_reset();
 (1 row)
 
--- SET statements.
--- These use two different strings, still they count as one entry.
-SET work_mem = '1MB';
-Set work_mem = '1MB';
-SET work_mem = '2MB';
-RESET work_mem;
-SET enable_seqscan = off;
-SET enable_seqscan = on;
-RESET enable_seqscan;
-SELECT calls, rows, query FROM pg_stat_statements ORDER BY query COLLATE "C";
- calls | rows |               query
--------+------+------------------------------------
-     1 |    0 | RESET enable_seqscan
-     1 |    0 | RESET work_mem
-     1 |    1 | SELECT pg_stat_statements_reset()
-     1 |    0 | SET enable_seqscan = off
-     1 |    0 | SET enable_seqscan = on
-     2 |    0 | SET work_mem = '1MB'
-     1 |    0 | SET work_mem = '2MB'
-(7 rows)
-
-SELECT pg_stat_statements_reset();
- pg_stat_statements_reset
---------------------------
- 
-(1 row)
-
diff --git a/contrib/pg_stat_statements/sql/utility.sql b/contrib/pg_stat_statements/sql/utility.sql
index 04598e5ae4..3fb8dde68e 100644
--- a/contrib/pg_stat_statements/sql/utility.sql
+++ b/contrib/pg_stat_statements/sql/utility.sql
@@ -329,16 +329,3 @@ DROP TABLE pgss_ctas;
 DROP TABLE pgss_select_into;
 
 SELECT pg_stat_statements_reset();
-
--- SET statements.
--- These use two different strings, still they count as one entry.
-SET work_mem = '1MB';
-Set work_mem = '1MB';
-SET work_mem = '2MB';
-RESET work_mem;
-SET enable_seqscan = off;
-SET enable_seqscan = on;
-RESET enable_seqscan;
-
-SELECT calls, rows, query FROM pg_stat_statements ORDER BY query COLLATE "C";
-SELECT pg_stat_statements_reset();
Re: ensure, not insure
On Wed, 8 Nov 2023 at 19:56, Peter Smith wrote: > gettext_noop("The server will use the fsync() system call in several > places to make " > "sure that updates are physically written to disk. This insures " > "that a database cluster will recover to a consistent state after " > "an operating system or hardware crash.") > > ~ > > I believe the word should have been "ensures"; not "insures". I agree. It's surprisingly ancient, having arrived in b700a672f (June 2003). > In passing I found/fixed a bunch of similar misuses in comments. Those all look fine to me too. David
Re: Remove unnecessary 'always:' from CompilerWarnings task
On 05.09.23 12:25, Nazir Bilal Yavuz wrote: There are multiple 'always:' keywords under the CompilerWarnings task. Instead of that, we can use one 'always:' and move the instructions under this. So, I removed unnecessary ones and rearranged indents according to that change. I'm not sure this change is beneficial. The way the code is currently arranged, it's a bit easier to move or change individual blocks, and it's also easier to read the file, because the "always:" is next to each "script" and doesn't scroll off the screen.
Re: GenBKI emits useless open;close for catalogs without rows
On 08.11.23 08:16, Peter Eisentraut wrote:

> On 19.09.23 20:05, Heikki Linnakangas wrote:
>> One thing caught my eye though: We currently have an "open" command after every "create". Except for bootstrap relations; creating a bootstrap relation opens it implicitly. That seems like a weird inconsistency. If we make "create" always open the relation, we can both make it more consistent and save a few lines. That's maybe worth doing, per the attached. It removes the "open" command altogether, as it's not needed anymore.
>
> This seems like a good improvement to me. It would restrict the bootstrap language so that you can only manipulate a table right after creating it, but I don't see why that wouldn't be sufficient.

Then again, this sort of achieves the opposite of what Matthias was aiming for: you are now forcing some relations to be opened even though we will end up closing them right away.

(In any case, documentation in bki.sgml would need to be updated for this patch.)
Re: Synchronizing slots from primary to standby
Hi,

On 11/8/23 4:50 AM, Amit Kapila wrote:

> On Tue, Nov 7, 2023 at 7:58 PM Drouvot, Bertrand wrote:
>> If we think this window is too short we could:
>> - increase it, or
>> - don't drop the slot once created (even if there is no activity on the primary during PrimaryCatchupWaitAttempt attempts), so that the next loop of attempts will compare with "older" LSN/xmin (as compared to dropping and re-creating the slot). That way the window would be since the initial slot creation.
>
> Yeah, this sounds reasonable but we can't mark such slots to be synced/available for use after failover.

Yeah, currently we are fine as slots are dropped in wait_for_primary_slot_catchup() if we are not in recovery anymore.

> I think if we want to follow this approach then we need to also monitor these slots for any change in the consecutive cycles and if we are able to sync them then accordingly we enable them to use after failover.

What about adding a new field in ReplicationSlotPersistentData indicating that we are waiting for "sync", and dropping such slots during promotion and/or if not in recovery?

> Another somewhat related point is that right now, we just wait for the change on the first slot (the patch refers to it as the monitoring slot) for computing nap_time before which we will recheck all the slots. I think we can improve that as well such that even if any slot's information is changed, we don't consider changing naptime.

Yeah, that sounds reasonable to me.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
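As a toy model of that suggestion (the field and function names here are hypothetical, not from the patch): persistently mark a slot whose sync has not yet completed, and drop any such slot at promotion, so it can never be used after failover in a half-synced state.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Miniature model of the "waiting for sync" idea: a slot carries a
 * persistent sync_pending flag; promotion-time cleanup drops every slot
 * whose sync never finished.
 */
typedef struct
{
	char		name[64];
	bool		sync_pending;	/* still waiting for primary catch-up? */
	bool		dropped;
} ModelSlot;

/* Called once the slot has fully caught up with the primary. */
static void
model_finish_sync(ModelSlot *slot)
{
	slot->sync_pending = false;
}

/* At promotion (or when leaving recovery), drop half-synced slots. */
static void
model_cleanup_at_promotion(ModelSlot *slots, int nslots)
{
	for (int i = 0; i < nslots; i++)
		if (slots[i].sync_pending)
			slots[i].dropped = true;
}
```

The point of persisting the flag (rather than keeping it in memory) is that a crash-restart of the standby must not resurrect a half-synced slot as usable.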
Re: [PGDOCS] Inconsistent linkends to "monitoring" views.
On Tue, Oct 3, 2023 at 4:40 PM Peter Smith wrote:
>
> On Tue, Oct 3, 2023 at 6:30 PM Michael Paquier wrote:
> >
> > On Tue, Oct 03, 2023 at 01:11:15PM +1100, Peter Smith wrote:
> > > I noticed one or two "monitoring" links and linkends that are slightly
> > > inconsistent from all the others.
> >
> > -
> > +
> >
> > Is that really worth bothering for the internal link references?
>
> I preferred 100% consistency instead of 95% consistency. YMMV.
>
> > This can create extra backpatching conflicts.
>
> Couldn't the same be said for every patch that fixes a comment typo?
> This is like a link typo, so what's the difference?

My 2 cents: Comment typos are visible to readers, so more annoying when seen in isolation, and less likely to have surroundings that could change in back branches. Consistency would be preferred, all else being equal, but then again nothing is wrong with the existing links.

In any case, no one has come out in favor of the patch, so it seems like it should be rejected unless that changes.

--
John Naylor
Re: Infinite Interval
On Wed, Nov 8, 2023 at 7:42 AM Dean Rasheed wrote: > > On Tue, 7 Nov 2023 at 14:33, Dean Rasheed wrote: > > > > New version attached doing that, to run it past the cfbot again. > > > > Ah, Windows Server didn't like that. Trying again with "INT64CONST(0)" > instead of just "0" in interval_um(). > > Regards, > Dean I found this: https://developercommunity.visualstudio.com/t/please-implement-integer-overflow-detection/409051 maybe related.
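For context, the linked ticket is about MSVC lacking the checked-arithmetic builtins that GCC and Clang provide (PostgreSQL's own portable overflow helpers live in src/include/common/int.h). A hedged sketch of the builtin-based style, illustrative rather than the PostgreSQL code, showing why negation of INT64_MIN is the one value that needs care and why explicitly typed constants like INT64CONST(0) matter in such expressions:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Checked 64-bit subtraction using the GCC/Clang builtin; returns true
 * on success, false on overflow.  MSVC has no direct equivalent, which
 * is what the linked Visual Studio ticket asks for.
 */
static bool
checked_sub_s64(int64_t a, int64_t b, int64_t *result)
{
	return !__builtin_sub_overflow(a, b, result);
}

/*
 * Negation expressed as "0 - x" with an explicitly 64-bit zero constant.
 * It overflows for exactly one input, INT64_MIN, whose negation is not
 * representable in int64_t.
 */
static bool
checked_neg_s64(int64_t a, int64_t *result)
{
	return checked_sub_s64(INT64_C(0), a, result);
}
```

Requires GCC or Clang for the `__builtin_sub_overflow` intrinsic; a portable fallback would compare against the INT64 bounds before subtracting.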
ensure, not insure
Hi,

While looking at GUC messages today I noticed the following:

gettext_noop("The server will use the fsync() system call in several places to make "
			 "sure that updates are physically written to disk. This insures "
			 "that a database cluster will recover to a consistent state after "
			 "an operating system or hardware crash.")

~

I believe the word should have been "ensures"; not "insures". In passing I found/fixed a bunch of similar misuses in comments.

==
Kind Regards,
Peter Smith.
Fujitsu Australia.

Attachment: v1-0001-ensure-not-insure.patch
Re: Version 14/15 documentation Section "Alter Default Privileges"
Hi, On Tue, Nov 07, 2023 at 05:30:20PM -0500, Bruce Momjian wrote: > On Mon, Nov 6, 2023 at 09:53:50PM +0100, Laurenz Albe wrote: > > + > > + There is no way to change the default privileges for objects created by > > + arbitrary roles. You have run ALTER DEFAULT > > PRIVILEGES > > I find the above sentence odd. What is its purpose? I guess it is to address the main purpose of this patch/confusion with users: they believe setting DEFAULT PRIVILEGES will set grants accordingly for all objects created in the future, no matter who creates them. So hammering in that this is not the case seems fine from my side (modulo the "You have to run" typo). Michael
Re: Syncrep and improving latency due to WAL throttling
Hi,

On 2023-11-04 20:00:46 +0100, Tomas Vondra wrote:
> scope
> -----
> Now, let's talk about scope - what the patch does not aim to do. The patch is explicitly intended for syncrep clusters, not async. There have been proposals to also support throttling for async replicas, logical replication etc. I suppose all of that could be implemented, and I do see the benefit of defining some sort of maximum lag even for async replicas. But the agreement was to focus on the syncrep case, where it's particularly painful, and perhaps extend it in the future.

Perhaps we should take care to make the configuration extensible in that direction in the future?

Hm - is this feature really tied to replication, at all? Pretty much the same situation exists without. On an ok-ish local nvme I ran pgbench with 1 client and -P1. Guess where I started a VACUUM (on a fully cached table, so no continuous WAL flushes):

progress: 64.0 s, 634.0 tps, lat 1.578 ms stddev 0.477, 0 failed
progress: 65.0 s, 634.0 tps, lat 1.577 ms stddev 0.546, 0 failed
progress: 66.0 s, 639.0 tps, lat 1.566 ms stddev 0.656, 0 failed
progress: 67.0 s, 642.0 tps, lat 1.557 ms stddev 0.273, 0 failed
progress: 68.0 s, 556.0 tps, lat 1.793 ms stddev 0.690, 0 failed
progress: 69.0 s, 281.0 tps, lat 3.568 ms stddev 1.050, 0 failed
progress: 70.0 s, 282.0 tps, lat 3.539 ms stddev 1.072, 0 failed
progress: 71.0 s, 273.0 tps, lat 3.663 ms stddev 2.602, 0 failed
progress: 72.0 s, 261.0 tps, lat 3.832 ms stddev 1.889, 0 failed
progress: 73.0 s, 268.0 tps, lat 3.738 ms stddev 0.934, 0 failed

At 32 clients we go from ~10k to 2.5k, with a full 2s of 0. Subtracting pg_current_wal_flush_lsn() from pg_current_wal_insert_lsn(), the "good times" show a delay of ~8kB (note that this includes WAL records that are still being inserted). Once the VACUUM runs, it's ~2-3MB. The picture with more clients is similar.
If I instead severely limit the amount of outstanding (but not the amount of unflushed) WAL by setting wal_buffers to 128, latency dips quite a bit less (down to ~400 instead of ~260 at 1 client, ~10k to ~5k at 32). Of course that's ridiculous and will completely trash performance in many other cases, but it shows that limiting the amount of outstanding WAL could help without replication as well. With remote storage, that'd likely be a bigger difference.

> problems
> --------
> Now let's talk about some problems - both conceptual and technical (essentially review comments for the patch).
>
> 1) The goal of the patch is to limit the impact on latency, but the relationship between WAL amounts and latency may not be linear. But we don't have a good way to predict latency, and WAL lag is the only thing we have, so there's that. Ultimately, it's a best effort.

It's indeed probably not linear. Realistically, to do better, we probably need statistics for the specific system in question - the latency impact will differ hugely between different storage/network.

> 2) The throttling is per backend. That makes it simple, but it means that it's hard to enforce a global lag limit. Imagine the limit is 8MB, and with a single backend that works fine - the lag should not exceed the 8MB value. But if there are N backends, the lag could be up to N-times 8MB, I believe. That's a bit annoying, but I guess the only solution would be to have some autovacuum-like cost balancing, with all backends (or at least those running large stuff) doing the checks more often. I'm not sure we want to do that.

Hm. The average case is likely fine - the throttling of the different backends will intersperse and flush more frequently - but the worst case is presumably part of the issue here. I wonder if we could deal with this by somehow offsetting the points at which backends flush.

I doubt we want to go for something like autovacuum's cost balancing - that doesn't seem to work well - but I think we could take the amount of actually unflushed WAL into account when deciding whether to throttle. We have the necessary state in local memory, IIRC. We'd have to be careful not to throttle every backend at the same time, or we'll introduce latency penalties that way. But what if we scaled synchronous_commit_wal_throttle_threshold depending on the amount of unflushed WAL? By still taking backendWalInserted into account, we'd avoid throttling everyone at the same time, but would still make throttling more aggressive depending on the amount of unflushed/unreplicated WAL.

> 3) The actual throttling (flush and wait for syncrep) happens in ProcessInterrupts(), which mostly works but it has two drawbacks:
>
> * It may not happen "early enough" if the backend inserts a lot of XLOG records without processing interrupts in between.

Does such code exist? And if so, is there a reason not to fix said code?

> * It may happen "too early" if the backend inserts enough WAL to need throttling (i.e. sets XLogDelayPending), but then after processing interrupts it
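The threshold-scaling idea could look roughly like this (a sketch with hypothetical names and an arbitrary linear scaling, not the patch): the per-backend trigger stays based on how much WAL this backend inserted since it last waited, but the effective threshold shrinks as the globally unflushed WAL grows, so throttling becomes more aggressive when the system is behind without stalling every backend at once.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Scale the per-backend throttle threshold down (here: linearly, to 1/8
 * of the base value) as the globally unflushed WAL approaches a limit.
 * The constants and the linear shape are illustrative choices.
 */
static uint64_t
effective_threshold(uint64_t base_threshold,
					uint64_t unflushed, uint64_t unflushed_limit)
{
	uint64_t	floor = base_threshold / 8;

	if (unflushed >= unflushed_limit)
		return floor;			/* far behind: throttle hard */
	return base_threshold - (base_threshold - floor) * unflushed / unflushed_limit;
}

/*
 * A backend throttles when its own WAL inserted since the last wait
 * exceeds the (scaled) threshold, so backends that inserted little
 * are still left alone even under global pressure.
 */
static bool
should_throttle(uint64_t backend_inserted, uint64_t base_threshold,
				uint64_t unflushed, uint64_t unflushed_limit)
{
	return backend_inserted >= effective_threshold(base_threshold,
												   unflushed, unflushed_limit);
}
```

Because each backend still compares against its own insert counter, the wait points stay naturally staggered rather than hitting all backends simultaneously.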
Re: pg_upgrade and logical replication
On Mon, 6 Nov 2023 at 07:51, Peter Smith wrote: > > Here are some review comments for patch v11-0001 > > == > Commit message > > 1. > The subscription's replication origin are needed to ensure > that we don't replicate anything twice. > > ~ > > /are needed/is needed/ Modified > > 2. > Author: Julien Rouhaud > Reviewed-by: FIXME > Discussion: https://postgr.es/m/20230217075433.u5mjly4d5cr4hcfe@jrouhaud > > ~ > > Include Vignesh as another author. Modified > == > doc/src/sgml/ref/pgupgrade.sgml > > 3. > + pg_upgrade attempts to migrate subscription > + dependencies which includes the subscription tables information present > in > + pg_subscription_rel > + system table and the subscription replication origin which > + will help in continuing logical replication from where the old > subscriber > + was replicating. This helps in avoiding the need for setting up the > > I became a bit lost reading paragraph due to the multiple 'which'... > > SUGGESTION > pg_upgrade attempts to migrate subscription dependencies which > includes the subscription table information present in > pg_subscription_rel system > catalog and also the subscription replication origin. This allows > logical replication on the new subscriber to continue from where the > old subscriber was up to. Modified > ~~~ > > 4. > + was replicating. This helps in avoiding the need for setting up the > + subscription objects manually which requires truncating all the > + subscription tables and setting the logical replication slots. Migration > > SUGGESTION > Having the ability to migrate subscription objects avoids the need to > set them up manually, which would require truncating all the > subscription tables and setting the logical replication slots. I have removed this > ~ > > TBH, I am wondering what is the purpose of this sentence. It seems > more like a justification for the patch, but does the user need to > know all this? > > ~~~ > > 5. 
> + > + All the subscription tables in the old subscriber should be in > + i (initialize), r (ready) or > + s (synchronized). This can be verified by checking > +linkend="catalog-pg-subscription-rel">pg_subscription_rel.srsubstate. > + > > /should be in/should be in state/ Modified > ~~~ > > 6. > + > + The replication origin entry corresponding to each of the > subscriptions > + should exist in the old cluster. This can be checking > + pg_subscription and > +linkend="catalog-pg-replication-origin">pg_replication_origin > + system tables. > + > > missing words? > > /This can be checking/This can be found by checking/ Modified > ~~~ > > 7. > + > + The subscriptions will be migrated to new cluster in disabled state, > they > + can be enabled after upgrade by following the steps: > + > > The first bullet also says "Enable the subscription..." so I think > this paragraph should be worded like the below. > > SUGGESTION > The subscriptions will be migrated to the new cluster in a disabled > state. After migration, do this: Modified > == > src/backend/catalog/pg_subscription.c > > 8. > #include "nodes/makefuncs.h" > +#include "replication/origin.h" > +#include "replication/worker_internal.h" > #include "storage/lmgr.h" > > Why does this change need to be in the patch when there are no other > code changes in this file? Modified > == > src/backend/utils/adt/pg_upgrade_support.c > > 9. binary_upgrade_create_sub_rel_state > > IMO a better name for this function would be > 'binary_upgrade_add_sub_rel_state' (because it delegates to > AddSubscriptionRelState). > > Then it would obey the same name pattern as the other function > 'binary_upgrade_replorigin_advance' (which delegates to > replorigin_advance). Modified > ~~~ > > 10. > +/* > + * binary_upgrade_create_sub_rel_state > + * > + * Add the relation with the specified relation state to pg_subscription_rel > + * table. 
> + */
> +Datum
> +binary_upgrade_create_sub_rel_state(PG_FUNCTION_ARGS)
> +{
> +	Relation	rel;
> +	HeapTuple	tup;
> +	Oid			subid;
> +	Form_pg_subscription form;
> +	char	   *subname;
> +	Oid			relid;
> +	char		relstate;
> +	XLogRecPtr	sublsn;
>
> 10a.
> /to pg_subscription_rel table./to pg_subscription_rel catalog./

Modified

> ~
>
> 10b.
> Maybe it would be helpful if the function arguments were documented
> up-front in the function-comment, or in the variable declarations.
>
> SUGGESTION
> char *subname; /* ARG0 = subscription name */
> Oid relid; /* ARG1 = relation Oid */
> char relstate; /* ARG2 = subrel state */
> XLogRecPtr sublsn; /* ARG3 (optional) = subscription lsn */

I felt the variables are self-explanatory in this case and also consistent with other functions.

> ~~~
>
> 11.
> if (PG_ARGISNULL(3))
> sublsn = InvalidXLogRecPtr;
> else
> sublsn = PG_GETARG_LSN(3);
> FWIW, I'd write that as a one-line ternary assignment allowing all the args to be
Re: [PoC] Implementation of distinct in Window Aggregates: take two
Hi,

I went through the cfbot, and some of the test cases are failing for this patch. It seems like some tests are crashing:
https://api.cirrus-ci.com/v1/artifact/task/6291153444667392/crashlog/crashlog-postgres.exe_03b0_2023-11-07_10-41-39-624.txt

[10:46:56.546] Summary of Failures:
[10:46:56.546]
[10:46:56.546] 87/270 postgresql:postgres_fdw / postgres_fdw/regress ERROR 11.10s exit status 1
[10:46:56.546] 82/270 postgresql:regress / regress/regress ERROR 248.55s exit status 1
[10:46:56.546] 99/270 postgresql:recovery / recovery/027_stream_regress ERROR 161.40s exit status 29
[10:46:56.546] 98/270 postgresql:pg_upgrade / pg_upgrade/002_pg_upgrade ERROR 253.31s exit status 29

Links of the failing tests:
https://cirrus-ci.com/task/664299716712
https://cirrus-ci.com/task/4602303584403456
https://cirrus-ci.com/task/5728203491246080
https://cirrus-ci.com/task/5165253537824768?logs=test_world#L511
https://cirrus-ci.com/task/6291153444667392

Thanks
Shlok Kumar Kyal
Re: Fix use of openssl.path() if openssl isn't found
On Tue Nov 7, 2023 at 11:53 PM CST, Michael Paquier wrote:
> On Tue, Nov 07, 2023 at 04:06:56PM -0600, Tristan Partin wrote:
>> Found this issue during my Fedora 39 upgrade. Tested that uninstalling
>> openssl still allows the various ssl tests to run and succeed.
>
> Good catch. You are right that this is inconsistent with what we expect
> in the test.
>
>> +openssl_path = ''
>> +if openssl.found()
>> +  openssl_path = openssl.path()
>> +endif
>> +
>>  tests += {
>>    'name': 'ssl',
>>    'sd': meson.current_source_dir(),
>> @@ -7,7 +12,7 @@ tests += {
>>    'tap': {
>>      'env': {
>>        'with_ssl': ssl_library,
>> -      'OPENSSL': openssl.path(),
>> +      'OPENSSL': openssl_path,
>>      },
>>      'tests': [
>>        't/001_ssltests.pl',
>
> Okay, that's a nit and it leads to the same result, but why not using
> the same one-liner style like all the other meson.build files that rely
> on optional commands? See pg_verifybackup, pg_dump, etc. That would be
> more consistent.

Because I forgot there were ternary statements in Meson :). Thanks for the review. Here is v2.

--
Tristan Partin
Neon (https://neon.tech)

From 5fc0460b0b85b8f04976182c0f6ec650c40df833 Mon Sep 17 00:00:00 2001
From: Tristan Partin
Date: Tue, 7 Nov 2023 15:59:04 -0600
Subject: [PATCH v2] Fix use of openssl.path() if openssl isn't found

openssl(1) is an optional dependency of the Postgres Meson build, but
was inadvertently required when defining some SSL tests.
---
 src/test/ssl/meson.build | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/test/ssl/meson.build b/src/test/ssl/meson.build
index 4cda81f3bc..ed83c438ef 100644
--- a/src/test/ssl/meson.build
+++ b/src/test/ssl/meson.build
@@ -7,7 +7,7 @@ tests += {
   'tap': {
     'env': {
       'with_ssl': ssl_library,
-      'OPENSSL': openssl.path(),
+      'OPENSSL': openssl.found() ? openssl.path() : '',
     },
     'tests': [
       't/001_ssltests.pl',

--
Tristan Partin
Neon (https://neon.tech)
Re: Show WAL write and fsync stats in pg_stat_io
On Wed, Nov 08, 2023 at 10:27:44AM +0900, Michael Paquier wrote:
> Yep, I'd be OK with that as well to maintain compatibility.

By the way, note that the patch is failing to apply, and that I've switched it to waiting on author on 10/26.

--
Michael
Re: Fix use of openssl.path() if openssl isn't found
On Tue, Nov 07, 2023 at 04:06:56PM -0600, Tristan Partin wrote:
> Found this issue during my Fedora 39 upgrade. Tested that uninstalling
> openssl still allows the various ssl tests to run and succeed.

Good catch. You are right that this is inconsistent with what we expect in the test.

> +openssl_path = ''
> +if openssl.found()
> +  openssl_path = openssl.path()
> +endif
> +
>  tests += {
>    'name': 'ssl',
>    'sd': meson.current_source_dir(),
> @@ -7,7 +12,7 @@ tests += {
>    'tap': {
>      'env': {
>        'with_ssl': ssl_library,
> -      'OPENSSL': openssl.path(),
> +      'OPENSSL': openssl_path,
>      },
>      'tests': [
>        't/001_ssltests.pl',

Okay, that's a nit and it leads to the same result, but why not using the same one-liner style like all the other meson.build files that rely on optional commands? See pg_verifybackup, pg_dump, etc. That would be more consistent.

--
Michael
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Wed, Nov 8, 2023 at 10:52 AM Amul Sul wrote: Thanks for the review, Amul. > Here are some minor comments: > > + * By default, we'll use 1MB of for every 1GB of shared buffers, up to the > + * maximum value that slru.c will allow, but always at least 16 buffers. > */ > Size > CommitTsShmemBuffers(void) > { > - return Min(256, Max(4, NBuffers / 256)); > + /* Use configured value if provided. */ > + if (commit_ts_buffers > 0) > + return Max(16, commit_ts_buffers); > + return Min(256, Max(16, NBuffers / 256)); > > Do you mean "4MB of for every 1GB" in the comment? You are right. > -- > > diff --git a/src/include/access/commit_ts.h b/src/include/access/commit_ts.h > index 5087cdce51..78d017ad85 100644 > --- a/src/include/access/commit_ts.h > +++ b/src/include/access/commit_ts.h > @@ -16,7 +16,6 @@ > #include "replication/origin.h" > #include "storage/sync.h" > > - > extern PGDLLIMPORT bool track_commit_timestamp; > > A spurious change. Will fix. > -- > > @@ -168,10 +180,19 @@ SimpleLruShmemSize(int nslots, int nlsns) > > if (nlsns > 0) > sz += MAXALIGN(nslots * nlsns * sizeof(XLogRecPtr));/* > group_lsn[] */ > - > return BUFFERALIGN(sz) + BLCKSZ * nslots; > } > > Another spurious change in 0002 patch. Will fix. > -- > > +/* > + * The slru buffer mapping table is partitioned to reduce contention. To > + * determine which partition lock a given pageno requires, compute the > pageno's > + * hash code with SlruBufTableHashCode(), then apply SlruPartitionLock(). > + */ > > I didn't see SlruBufTableHashCode() & SlruPartitionLock() functions anywhere > in > your patches, is that an outdated comment? Yes, I will fix it; actually, there are some major design changes coming for this.
> -- > > - sz += MAXALIGN(nslots * sizeof(LWLockPadded)); /* buffer_locks[] */ > - sz += MAXALIGN(SLRU_NUM_PARTITIONS * sizeof(LWLockPadded)); /* > part_locks[] */ > + sz += MAXALIGN((nslots + SLRU_NUM_PARTITIONS) * sizeof(LWLockPadded)); > /* locks[] */ > > I am a bit uncomfortable with these changes: merging partition and buffer locks > makes it hard to understand the code. I am not sure what we are getting out of > this? Yes, I do not like this much either, because it is confusing. But the advantage is that we use a single pointer for the lock, which means the next variable, the LRU counter, comes in the same cacheline, so frequent updates of the LRU counter will benefit from this -- although I don't have any numbers to prove it. Currently, I want to focus on all the base patches and keep this patch as an add-on; later, if we find it useful and want to pursue it, we will see how to make it more readable. > > Subject: [PATCH v4 5/5] Ensure slru buffer slots are in multiple of number of > partitions > > I think the 0005 patch can be merged to 0001. Yeah, in the next version it is done that way. I am planning to post it by the end of the day. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: Fix output of zero privileges in psql
On Wed, Nov 8, 2023 at 10:46 AM Laurenz Albe wrote: > > On Sat, 2023-10-21 at 04:29 +0200, Erik Wienhold wrote: > > The attached v3 of my initial patch > > does that. It also includes Laurenz' fix to no longer ignore \pset null > > (minus the doc changes that suggest using \pset null to distinguish > > between default and empty privileges because that's no longer needed). > > Thanks! > > I went over the patch, fixed some problems and added some more stuff from > my patch. > > In particular: > > --- a/doc/src/sgml/ddl.sgml > +++ b/doc/src/sgml/ddl.sgml > @@ -2353,7 +2353,9 @@ GRANT SELECT (col1), UPDATE (col1) ON mytable TO > miriam_rw; > > If the Access privileges column is empty for a given > object, it means the object has default privileges (that is, its > - privileges entry in the relevant system catalog is null). Default > + privileges entry in the relevant system catalog is null). The column > shows > + (none) for empty privileges (that is, no privileges > at > + all, even for the object owner a rare occurrence). Default > privileges always include all privileges for the owner, and can include > some privileges for PUBLIC depending on the object > type, as explained above. The first GRANT > > This description of empty privileges is smack in the middle of describing > default privileges. I thought that was confusing and moved it to its > own paragraph. > > --- a/src/bin/psql/describe.c > +++ b/src/bin/psql/describe.c > @@ -6718,7 +6680,13 @@ static void >printACLColumn(PQExpBuffer buf, const char *colname) >{ > appendPQExpBuffer(buf, > - "pg_catalog.array_to_string(%s, E'\\n') AS \"%s\"", > + "CASE\n" > + " WHEN %s IS NULL THEN ''\n" > + " WHEN pg_catalog.cardinality(%s) = 0 THEN '%s'\n" > + " ELSE pg_catalog.array_to_string(%s, E'\\n')\n" > + "END AS \"%s\"", > + colname, > + colname, gettext_noop("(none)"), > colname, gettext_noop("Access privileges")); >} > > This erroneously displays NULL as empty string and subverts my changes. 
> I have removed the first branch of the CASE expression. > > --- a/src/test/regress/expected/psql.out > +++ b/src/test/regress/expected/psql.out > @@ -6663,3 +6663,97 @@ DROP ROLE regress_du_role0; >DROP ROLE regress_du_role1; >DROP ROLE regress_du_role2; >DROP ROLE regress_du_admin; > +-- Test empty privileges. > +BEGIN; > +WARNING: there is already a transaction in progress > > This warning is caused by a pre-existing error in the regression test, which > forgot to close the transaction. I have added a COMMIT at the appropriate > place. > > +ALTER TABLESPACE regress_tblspace OWNER TO CURRENT_USER; > +REVOKE ALL ON TABLESPACE regress_tblspace FROM CURRENT_USER; > +\db+ regress_tblspace > +List of tablespaces > + Name | Owner |Location | Access > privileges | Options | Size | Description > > +--++-+---+-+-+- > + regress_tblspace | regress_zeropriv_owner | pg_tblspc/16385 | (none) > | | 0 bytes | > +(1 row) > > This test is not stable, since it contains the OID of the tablespace, which > is different every time. > > +ALTER DATABASE :"DBNAME" OWNER TO CURRENT_USER; > +REVOKE ALL ON DATABASE :"DBNAME" FROM CURRENT_USER, PUBLIC; > +\l :"DBNAME" > +List of databases > +Name| Owner | Encoding | Locale Provider | > Collate | Ctype | ICU Locale | ICU Rules | Access privileges > > +++---+-+-+---++---+--- > + regression | regress_zeropriv_owner | SQL_ASCII | libc| C > | C || | (none) > +(1 row) > > This test is also not stable, since it depends on the locale definition > of the regression test database. If you use "make installcheck", that could > be a different locale. > > I think that these tests are not absolutely necessary, and the other tests > are sufficient. Consequently, I took the simple road of removing them. > > I also tried to improve the commit message. > > Patch attached. I tested the patch with the modified changes, and it is working fine. Thanks and regards, Shubham Khanna.
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Fri, Nov 3, 2023 at 10:59 AM Dilip Kumar wrote: > On Mon, Oct 30, 2023 at 11:50 AM Dilip Kumar > wrote: > > [...] > [1] 0001-Make-all-SLRU-buffer-sizes-configurable: This is the same > patch as the previous patch set > [2] 0002-Add-a-buffer-mapping-table-for-SLRUs: Patch to introduce > buffer mapping hash table > [3] 0003-Partition-wise-slru-locks: Partition the hash table and also > introduce partition-wise locks: this is a merge of 0003 and 0004 from > the previous patch set but instead of bank-wise locks it has > partition-wise locks and LRU counter. > [4] 0004-Merge-partition-locks-array-with-buffer-locks-array: merging > buffer locks and bank locks in the same array so that the bank-wise > LRU counter does not fetch the next cache line in a hot function > SlruRecentlyUsed()(same as 0005 from the previous patch set) > [5] 0005-Ensure-slru-buffer-slots-are-in-multiple-of-number-of: Ensure > that the number of slots is in multiple of the number of banks > [...] Here are some minor comments: + * By default, we'll use 1MB of for every 1GB of shared buffers, up to the + * maximum value that slru.c will allow, but always at least 16 buffers. */ Size CommitTsShmemBuffers(void) { - return Min(256, Max(4, NBuffers / 256)); + /* Use configured value if provided. */ + if (commit_ts_buffers > 0) + return Max(16, commit_ts_buffers); + return Min(256, Max(16, NBuffers / 256)); Do you mean "4MB of for every 1GB" in the comment? -- diff --git a/src/include/access/commit_ts.h b/src/include/access/commit_ts.h index 5087cdce51..78d017ad85 100644 --- a/src/include/access/commit_ts.h +++ b/src/include/access/commit_ts.h @@ -16,7 +16,6 @@ #include "replication/origin.h" #include "storage/sync.h" - extern PGDLLIMPORT bool track_commit_timestamp; A spurious change. 
-- @@ -168,10 +180,19 @@ SimpleLruShmemSize(int nslots, int nlsns) if (nlsns > 0) sz += MAXALIGN(nslots * nlsns * sizeof(XLogRecPtr));/* group_lsn[] */ - return BUFFERALIGN(sz) + BLCKSZ * nslots; } Another spurious change in 0002 patch. -- +/* + * The slru buffer mapping table is partitioned to reduce contention. To + * determine which partition lock a given pageno requires, compute the pageno's + * hash code with SlruBufTableHashCode(), then apply SlruPartitionLock(). + */ I didn't see SlruBufTableHashCode() & SlruPartitionLock() functions anywhere in your patches, is that an outdated comment? -- - sz += MAXALIGN(nslots * sizeof(LWLockPadded)); /* buffer_locks[] */ - sz += MAXALIGN(SLRU_NUM_PARTITIONS * sizeof(LWLockPadded)); /* part_locks[] */ + sz += MAXALIGN((nslots + SLRU_NUM_PARTITIONS) * sizeof(LWLockPadded)); /* locks[] */ I am a bit uncomfortable with these changes: merging partition and buffer locks makes it hard to understand the code. I am not sure what we are getting out of this. -- Subject: [PATCH v4 5/5] Ensure slru buffer slots are in multiple of number of partitions I think the 0005 patch can be merged to 0001. Regards, Amul
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Mon, Nov 6, 2023 at 4:44 PM Andrey M. Borodin wrote: > > > > On 6 Nov 2023, at 14:31, Alvaro Herrera wrote: > > > > dynahash is notoriously slow, which is why we have simplehash.h since > > commit b30d3ea824c5. Maybe we could use that instead. > > Dynahash has lock partitioning. Simplehash does not, AFAIK. > The thing is, we do not really need a hash function - pageno is already the > best hash function itself. And we do not need to cope with collisions much > - we can evict a collided buffer. > > Given this, we do not need a hashtable at all. That’s exactly the reasoning behind how > the banks emerged. I started implementing a dynahash patch in April 2021 and found > out that the “banks” approach is cleaner. However, the term “bank” is not common > in software; it’s taken from hardware caches. > I agree that we don't need a hash function to generate a hash value out of pageno, which is itself sufficient, but I don't understand how we can get rid of the hash table itself -- how would we map the pageno to a slot number? Is that mapping not needed at all? Regards, Amul
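[Editor's note] To make the "banks" idea concrete: the pageno-to-slot mapping asked about above does not disappear, it becomes plain arithmetic, so no mapping table has to be maintained. A hedged sketch follows; the bank count, bank size, and names are invented for illustration and are not from any posted patch.

```c
#include <assert.h>
#include <stdint.h>

#define SLRU_NUM_BANKS  8       /* hypothetical number of banks */
#define SLRU_BANK_SIZE  16      /* hypothetical slots per bank */

/* A page can live in only one bank, derived directly from its number. */
static inline int
slru_bank_for_page(int64_t pageno)
{
    return (int) (pageno % SLRU_NUM_BANKS);
}

/*
 * A lookup scans only the SLRU_BANK_SIZE slots of the page's bank,
 * starting here; on a miss, the least recently used slot of that bank
 * is evicted, which is how "collisions" are handled without any
 * mapping table.
 */
static inline int
slru_bank_start_slot(int64_t pageno)
{
    return slru_bank_for_page(pageno) * SLRU_BANK_SIZE;
}
```

The trade-off versus a hash table is that a hot page can only displace pages within its own bank, much like set-associative hardware caches.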
Re: Reconcile stats in find_tabstat_entry() and get rid of PgStat_BackendFunctionEntry
On 2023-10-30 08:25:17 +0900, Michael Paquier wrote: > On Fri, Oct 27, 2023 at 09:45:52AM +0200, Drouvot, Bertrand wrote: > > Done in V8 attached (pgindent has been run on pgstatfuncs.c and > > pgstat_relation.c). > > And applied that after editing a bit the comments. Thank you both!
Re: Add new option 'all' to pg_stat_reset_shared()
Hi, On 2023-11-08 10:08:42 +0900, Kyotaro Horiguchi wrote: > At Mon, 6 Nov 2023 14:00:13 +0530, Bharath Rupireddy > wrote in > > On Mon, Nov 6, 2023 at 11:39 AM torikoshia > > wrote: > > > Since each stats, except wal_prefetch was reset acquiring LWLock, > > > attached PoC patch makes the call atomic by using these LWlocks. > > > > > > If this is the right direction, I'll try to make wal_prefetch also take > > > LWLock. > > I must say, I have reservations about this approach. The main concern > is the duplication of reset code, which has been efficiently > encapsulated for individual targets, into this location. This practice > degrades the maintainability and clarity of the code. Yes, as-is this seems to evolve the code in precisely the wrong direction. We want less central awareness of different types of stats, not more. The proposed new code is far longer than the current pg_stat_reset(), despite doing something conceptually simpler. Greetings, Andres Freund
Re: Add new option 'all' to pg_stat_reset_shared()
Hi, On 2023-11-06 14:00:13 +0530, Bharath Rupireddy wrote: > On Mon, Nov 6, 2023 at 11:39 AM torikoshia wrote: > > > > Thanks all for the comments! > > > > On Fri, Nov 3, 2023 at 5:17 AM Matthias van de Meent > > wrote: > > > Knowing that your metrics have a shared starting point can be quite > > > valuable, as it allows you to do some math that would otherwise be > > > much less accurate when working with stats over a short amount of > > > time. I've not used these stats systems much myself, but skew between > > > metrics caused by different reset points can be difficult to detect > > > and debug, so I think an atomic call to reset all these stats could be > > > worth implementing. > > > > Since each stats, except wal_prefetch was reset acquiring LWLock, > > attached PoC patch makes the call atomic by using these LWlocks. > > > > If this is the right direction, I'll try to make wal_prefetch also take > > LWLock. > > +// Acquire LWLocks > +LWLock *locks[] = {_archiver->lock, _bgwriter->lock, > + _checkpointer->lock, _wal->lock}; > + > +for (int i = 0; i < STATS_SHARED_NUM_LWLOCK; i++) > +LWLockAcquire(locks[i], LW_EXCLUSIVE); > + > +for (int i = 0; i < BACKEND_NUM_TYPES; i++) > +{ > +LWLock*bktype_lock = _io->locks[i]; > +LWLockAcquire(bktype_lock, LW_EXCLUSIVE); > +} > > Well, that's a total of ~17 LWLocks this new function takes to make > the stats reset atomic. I'm not sure if this atomicity is worth the > effort which can easily be misused - what if someone runs something > like SELECT pg_stat_reset_shared() FROM generate_series(1, > 10n); to cause heavy lock acquisition and release cycles? Yea, this seems like an *extremely* bad idea to me. Without careful analysis it could very well cause deadlocks. > IMV, atomicity is not something that applies for the stats reset > operation because stats are approximate numbers by nature after all. 
> If the pg_stat_reset_shared() resets stats for only a bunch of stats > types and fails, it's the basic application programming style that > when a query fails it's the application that needs to have a retry > mechanism. FWIW, the atomicity doesn't apply today if someone wants to > reset stats in a loop for all stats types. Yea. Additionally it's not really atomic regardless of the lwlocks, due to various processes all accumulating in local counters first, and only occasionally updating the shared data. So even after holding all the locks at the same time, the shared stats would still not actually represent a truly atomic state. > 2. > +{ oid => '8000', > + descr => 'statistics: reset collected statistics shared across the > cluster', > + proname => 'pg_stat_reset_shared', provolatile => 'v', prorettype => > 'void', > + proargtypes => '', prosrc => 'pg_stat_reset_shared_all' }, > > Why a new function consuming the oid? Why can't we just do the trick > of proisstrict => 'f' and if (PG_ARGISNULL(0)) { reset all stats} else > {reset specified stats kind} like the pg_stat_reset_slru()? It's not like oids are a precious resource. It's a more confusing API to have to specify a NULL as an argument than not having to do so. If we really want to avoid a separate oid, a more sensible path would be to add a default argument to pg_stat_reset_slru() (by doing a CREATE OR REPLACE in system_functions.sql). Greetings, Andres Freund
Re: [PoC] pg_upgrade: allow to upgrade publisher node
On Wed, Nov 8, 2023 at 8:44 AM vignesh C wrote: > > While verifying upgrade of subscriber patch, I found one issue with > upgrade in verbose mode. > I was able to reproduce this issue by performing a upgrade with a > verbose option. > > The trace for the same is given below: > Program received signal SIGSEGV, Segmentation fault. > __strlen_sse2 () at ../sysdeps/x86_64/multiarch/strlen-vec.S:126 > 126../sysdeps/x86_64/multiarch/strlen-vec.S: No such file or > directory. > (gdb) bt > #0 __strlen_sse2 () at ../sysdeps/x86_64/multiarch/strlen-vec.S:126 > #1 0x5556f572 in dopr (target=0x7fffbb90, > format=0x5557859e "\", plugin: \"%s\", two_phase: %s", > args=0x7fffdc40) at snprintf.c:444 > #2 0x5556ed95 in pg_vsnprintf (str=0x7fffbc10 "slot_name: > \"ication slots within the database:", count=8192, fmt=0x55578590 > "slot_name: \"%s\", plugin: \"%s\", two_phase: %s", > args=0x7fffdc40) at snprintf.c:195 > #3 0x555667e3 in pg_log_v (type=PG_VERBOSE, > fmt=0x55578590 "slot_name: \"%s\", plugin: \"%s\", two_phase: %s", > ap=0x7fffdc40) at util.c:184 > #4 0x55566b38 in pg_log (type=PG_VERBOSE, fmt=0x55578590 > "slot_name: \"%s\", plugin: \"%s\", two_phase: %s") at util.c:264 > #5 0x55561a06 in print_slot_infos (slot_arr=0x55595ed0) > at info.c:813 > #6 0x5556186e in print_db_infos (db_arr=0x55587518 > ) at info.c:782 > #7 0x555606da in get_db_rel_and_slot_infos > (cluster=0x555874a0 , live_check=false) at info.c:308 > #8 0x839a in check_new_cluster () at check.c:215 > #9 0x55563010 in main (argc=13, argv=0x7fffdf08) at > pg_upgrade.c:136 > > This issue occurs because we are accessing uninitialized slot array > information. > Thanks for the report. I'll review your proposed fix. -- With Regards, Amit Kapila.
Re: Synchronizing slots from primary to standby
On Tue, Nov 7, 2023 at 7:58 PM Drouvot, Bertrand wrote: > > On 11/7/23 11:55 AM, Amit Kapila wrote: > >>> > >>> This is not a foolproof solution, but an optimization over the first one. Now > >>> in any sync-cycle, we take 2 attempts for slots-creation (if any slots > >>> are available to be created). In the first attempt, we do not wait > >>> indefinitely on inactive slots, we wait only for a fixed amount of > >>> time, and if the remote slot is still behind, then we add it to the > >>> pending list and move to the next slot. Once we are done with the first > >>> attempt, in the second attempt, we go for the pending ones, and now we wait > >>> on each of them until the primary catches up. > >> > >> Aren't we "just" postponing the "issue"? I mean if there is really no > >> activity > >> on, say, the first created slot, then once we move to the second attempt > >> then any newly > >> created slot from that time would wait to be synced forever, no? > >> > > > > We have to wait at some point in time for such inactive slots and the > > same is true even for manually created slots on standby. Do you have > > any better ideas to deal with it? > > > > What about: > > - get rid of the second attempt and the pending_slot_list > - keep the wait_count and PrimaryCatchupWaitAttempt logic > > so basically, get rid of: > > /* > * Now sync the pending slots which failed to be created in the first > * attempt. > */ > foreach(cell, pending_slot_list) > { > RemoteSlot *remote_slot = (RemoteSlot *) lfirst(cell); > > /* Wait until the primary server catches up */ > PrimaryCatchupWaitAttempt = 0; > > synchronize_one_slot(wrconn, remote_slot, NULL); > } > > and the pending_slot_list list. > > That way, for each slot that has not been created and synced yet: > > - it will be created on the standby > - we will wait up to PrimaryCatchupWaitAttempt attempts > - the slot will be synced or removed on/from the standby > > That way an inactive slot on the primary would not "block" > any other slots on the standby.
> > By "created" here I mean calling ReplicationSlotCreate() (not to be confused > with emitting "ereport(LOG, errmsg("created slot \"%s\" locally", > remote_slot->name)); " > which is confusing as mentioned up-thread). > > The problem I can see with this proposal is that the "sync" window waiting > for slot activity on the primary is "only" during the > PrimaryCatchupWaitAttempt > attempts (as the slot will be dropped/recreated). > > If we think this window is too short we could: > > - increase it > or > - don't drop the slot once created (even if there is no activity > on the primary during PrimaryCatchupWaitAttempt attempts) so that > the next loop of attempts will compare with "older" LSN/xmin (as compare to > dropping and re-creating the slot). That way the window would be since the > initial slot creation. > Yeah, this sounds reasonable but we can't mark such slots to be synced/available for use after failover. I think if we want to follow this approach then we need to also monitor these slots for any change in the consecutive cycles and if we are able to sync them then accordingly we enable them to use after failover. Another somewhat related point is that right now, we just wait for the change on the first slot (the patch refers to it as the monitoring slot) for computing nap_time before which we will recheck all the slots. I think we can improve that as well such that even if any slot's information is changed, we don't consider changing naptime. -- With Regards, Amit Kapila.
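[Editor's note] For readers trying to follow the two-attempt control flow being debated, it can be modeled as a toy simulation. Everything below - the names, the attempt bound, and the polling model - is invented for illustration and is not the patch's code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_WAIT_ATTEMPTS 3     /* hypothetical bound for the first pass */
#define MAX_PENDING       16

/* Simulated remote slot: "catches up" after a fixed number of polls. */
typedef struct
{
    int  polls_until_caught_up;
    bool synced;
} SlotSim;

/* One poll of the (simulated) primary; true once the slot has caught up. */
static bool
slot_poll(SlotSim *slot)
{
    if (slot->polls_until_caught_up > 0)
        slot->polls_until_caught_up--;
    return slot->polls_until_caught_up == 0;
}

/*
 * First attempt: wait only a bounded number of polls per slot, collecting
 * stragglers.  Second attempt: wait on each pending slot until it catches
 * up.  An inactive slot therefore delays only the second pass, not the
 * bounded first pass over the other slots.
 */
static void
sync_cycle(SlotSim *slots, size_t nslots)
{
    SlotSim *pending[MAX_PENDING];
    size_t   npending = 0;

    for (size_t i = 0; i < nslots; i++)
    {
        bool caught_up = false;

        for (int a = 0; a < MAX_WAIT_ATTEMPTS && !caught_up; a++)
            caught_up = slot_poll(&slots[i]);

        if (caught_up)
            slots[i].synced = true;
        else
            pending[npending++] = &slots[i];
    }

    for (size_t i = 0; i < npending; i++)
    {
        while (!slot_poll(pending[i]))
            ;                   /* second attempt waits indefinitely */
        pending[i]->synced = true;
    }
}
```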
Re: [PoC] pg_upgrade: allow to upgrade publisher node
On Tue, 7 Nov 2023 at 13:25, Amit Kapila wrote: > > On Tue, Nov 7, 2023 at 10:01 AM Zhijie Hou (Fujitsu) > wrote: > > > > On Tuesday, November 7, 2023 12:14 PM Kuroda, Hayato/黒田 隼人 > > wrote: > > > > > > Dear hackers, > > > > > > PSA the patch to solve the issue [1]. > > > > > > Kindly Peter E. and Andrew raised an issue that delete_old_cluster.sh is > > > generated in the source directory, even with a VPATH/meson build. > > > This can be avoided by changing the directory explicitly. > > > > > > [1]: > > > https://www.postgresql.org/message-id/flat/7b8a9460-5668-b372-04e6-7b > > > 52e9308493%40dunslane.net#554090099bbbd12c94bf570665a6badf > > > > Thanks for the patch, I have confirmed that the files won't be generated > > in the source directory after applying the patch. > > > > After running: "meson test -C build/ --suite pg_upgrade", > > The files are in the test directory: > > ./build/testrun/pg_upgrade/003_logical_slots/data/delete_old_cluster.sh > > > > Thanks for the patch and verification. Pushed the fix. While verifying the upgrade-of-subscriber patch, I found one issue with upgrade in verbose mode. I was able to reproduce this issue by performing an upgrade with the verbose option. The trace for the same is given below: Program received signal SIGSEGV, Segmentation fault. __strlen_sse2 () at ../sysdeps/x86_64/multiarch/strlen-vec.S:126 126../sysdeps/x86_64/multiarch/strlen-vec.S: No such file or directory.
(gdb) bt #0 __strlen_sse2 () at ../sysdeps/x86_64/multiarch/strlen-vec.S:126 #1 0x5556f572 in dopr (target=0x7fffbb90, format=0x5557859e "\", plugin: \"%s\", two_phase: %s", args=0x7fffdc40) at snprintf.c:444 #2 0x5556ed95 in pg_vsnprintf (str=0x7fffbc10 "slot_name: \"ication slots within the database:", count=8192, fmt=0x55578590 "slot_name: \"%s\", plugin: \"%s\", two_phase: %s", args=0x7fffdc40) at snprintf.c:195 #3 0x555667e3 in pg_log_v (type=PG_VERBOSE, fmt=0x55578590 "slot_name: \"%s\", plugin: \"%s\", two_phase: %s", ap=0x7fffdc40) at util.c:184 #4 0x55566b38 in pg_log (type=PG_VERBOSE, fmt=0x55578590 "slot_name: \"%s\", plugin: \"%s\", two_phase: %s") at util.c:264 #5 0x55561a06 in print_slot_infos (slot_arr=0x55595ed0) at info.c:813 #6 0x5556186e in print_db_infos (db_arr=0x55587518 ) at info.c:782 #7 0x555606da in get_db_rel_and_slot_infos (cluster=0x555874a0 , live_check=false) at info.c:308 #8 0x839a in check_new_cluster () at check.c:215 #9 0x55563010 in main (argc=13, argv=0x7fffdf08) at pg_upgrade.c:136 This issue occurs because we are accessing uninitialized slot array information. We could fix it in a couple of ways: a) Initialize the whole of dbinfos by using pg_malloc0 instead of pg_malloc, which will ensure that the slot information is set to 0. b) Set only the slot information. The attached patch has the changes for both approaches. Thoughts? Regards, Vignesh diff --git a/src/bin/pg_upgrade/info.c b/src/bin/pg_upgrade/info.c index 7f21d26fd2..d2a1815fef 100644 --- a/src/bin/pg_upgrade/info.c +++ b/src/bin/pg_upgrade/info.c @@ -408,7 +408,7 @@ get_db_infos(ClusterInfo *cluster) i_spclocation = PQfnumber(res, "spclocation"); ntups = PQntuples(res); - dbinfos = (DbInfo *) pg_malloc(sizeof(DbInfo) * ntups); + dbinfos = (DbInfo *) pg_malloc0(sizeof(DbInfo) * ntups); for (tupnum = 0; tupnum < ntups; tupnum++) { @@ -640,11 +640,7 @@ get_old_cluster_logical_slot_infos(DbInfo *dbinfo, bool live_check) /* Logical slots can be migrated since PG17.
*/ if (GET_MAJOR_VERSION(old_cluster.major_version) <= 1600) - { - dbinfo->slot_arr.slots = slotinfos; - dbinfo->slot_arr.nslots = num_slots; return; - } conn = connectToServer(_cluster, dbinfo->db_name); diff --git a/src/bin/pg_upgrade/info.c b/src/bin/pg_upgrade/info.c index 7f21d26fd2..21a0b0551a 100644 --- a/src/bin/pg_upgrade/info.c +++ b/src/bin/pg_upgrade/info.c @@ -297,6 +297,11 @@ get_db_rel_and_slot_infos(ClusterInfo *cluster, bool live_check) */ if (cluster == _cluster) get_old_cluster_logical_slot_infos(pDbInfo, live_check); + else + { + pDbInfo->slot_arr.slots = NULL; + pDbInfo->slot_arr.nslots = 0; + } } if (cluster == _cluster)
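[Editor's note] As background on why option (a) works: zero-filling the allocation means any dbinfos entry whose slot information is never populated reads as an empty, NULL-pointer array rather than indeterminate bytes, which is what strlen() tripped over in the backtrace. A generic illustration in plain C follows; calloc stands in for pg_malloc0, and the struct is a simplified stand-in, not pg_upgrade's actual type.

```c
#include <assert.h>
#include <stdlib.h>

/* Shape loosely mirroring pg_upgrade's per-database slot array member. */
typedef struct
{
    char **slots;
    int    nslots;
} SlotArrSim;

/*
 * calloc zero-fills, so every not-yet-populated entry has a zero count
 * and a NULL pointer.  (Strictly, "all-bits-zero is a null pointer" is
 * a property of mainstream platforms rather than an ISO C guarantee.)
 */
static SlotArrSim *
alloc_zeroed_slot_arrays(size_t n)
{
    return (SlotArrSim *) calloc(n, sizeof(SlotArrSim));
}
```

With plain malloc (pg_malloc), the same fields would hold garbage, and any code path that reads them before they are assigned - such as verbose printing of a cluster whose slots were never fetched - can crash exactly as reported.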
Re: Show WAL write and fsync stats in pg_stat_io
On Tue, Nov 07, 2023 at 05:19:28PM -0800, Andres Freund wrote: > Another approach would be to fetch the relevant columns from pg_stat_io in the > pg_stat_wal view. That'd avoid double accounting and breaking existing > monitoring. Yep, I'd be OK with that as well to maintain compatibility. -- Michael
Re: Add new option 'all' to pg_stat_reset_shared()
On Wed, Nov 08, 2023 at 10:08:42AM +0900, Kyotaro Horiguchi wrote: > At Mon, 6 Nov 2023 14:00:13 +0530, Bharath Rupireddy > wrote in > I must say, I have reservations about this approach. The main concern > is the duplication of reset code, which has been efficiently > encapsulated for individual targets, into this location. This practice > degrades the maintainability and clarity of the code. +{ oid => '8000', This OID pick made me smile. >> IMV, atomicity is not something that applies for the stats reset >> operation because stats are approximate numbers by nature after all. >> If the pg_stat_reset_shared() resets stats for only a bunch of stats >> types and fails, it's the basic application programming style that >> when a query fails it's the application that needs to have a retry >> mechanism. FWIW, the atomicity doesn't apply today if someone wants to >> reset stats in a loop for all stats types. > > The infrequent use of this feature, coupled with the fact that there > is no inherent need for these counters to be reset simultaneously, > leads me to think that there is little point in centralizing the locks. Each stat listed with fixed_amount has meaning taken in isolation, so I don't see why this patch has to be that complicated. I'd expect one code path that just calls a series of pgstat_reset_of_kind(). There could be an argument for a new routine in pgstat.c that loops over the pgstat_kind_infos and triggers the callbacks where .fixed_amount is set, but that's less transparent than the other approach. The reset time should be consistent across all the calls as we rely on GetCurrentTimestamp(). -- Michael
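[Editor's note] To contrast the two shapes described above: the transparent option is a function body that is just a series of pgstat_reset_of_kind(PGSTAT_KIND_...) calls, while the table-driven option would look roughly like the toy model below. All names here are invented stand-ins - the real kind table and reset callbacks live in pgstat.c - and an int counter plays the role of each kind's fixed-amount stats.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct
{
    const char *name;
    bool        fixed_amount;   /* shared, fixed-size stats? */
    int        *counter;        /* stand-in for the kind's stats data */
} KindInfoSim;

static int archiver_stat = 5;
static int bgwriter_stat = 7;
static int table_stat = 9;      /* variable-amount: not a shared stat */

static KindInfoSim kind_infos[] = {
    {"archiver", true, &archiver_stat},
    {"bgwriter", true, &bgwriter_stat},
    {"table", false, &table_stat},
};

/* Reset every fixed-amount kind: one callback-equivalent call per kind. */
static void
reset_all_fixed(void)
{
    for (size_t i = 0; i < sizeof(kind_infos) / sizeof(kind_infos[0]); i++)
    {
        if (kind_infos[i].fixed_amount)
            *kind_infos[i].counter = 0; /* stands in for the reset callback */
    }
}
```

Either shape avoids the lock-centralization problem entirely: each per-kind reset takes and releases only its own lock, and the shared reset timestamp can still come from a single GetCurrentTimestamp() passed to every call.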
Re: Show WAL write and fsync stats in pg_stat_io
Hi, On 2023-11-08 09:52:16 +0900, Michael Paquier wrote: > By the way, if the write/sync quantities and times begin to be tracked > by pg_stat_io, I'd see a pretty good argument in removing the > equivalent columns in pg_stat_wal. It looks like this would reduce > the confusion related to the handling of PendingWalStats added in > pgstat_io.c, for one. Another approach would be to fetch the relevant columns from pg_stat_io in the pg_stat_wal view. That'd avoid double accounting and breaking existing monitoring. Greetings, Andres Freund
Re: Atomic ops for unlogged LSN
Hi, On 2023-11-07 00:57:32 +, John Morris wrote: > I found the comment about cache coherency a bit confusing. We are dealing > with a single address, so there should be no memory ordering or coherency > issues. (Did I misunderstand?) I see it more as a race condition. IMO cache coherency covers the value a single variable has in different threads / processes. In fact, the only reason there effectively is a guarantee that you're not seeing an outdated unloggedLSN value during shutdown checkpoints, even without the spinlock or full barrier atomic op, is that the LWLockAcquire(), a few lines above this, would prevent both the compiler and CPU from moving the read of unloggedLSN to much earlier. Obviously that lwlock has a different address... If the patch just had done the minimal conversion, it'd already have been committed... Even if there'd be a performance reason to get rid of the memory barrier around reading unloggedLSN in CreateCheckPoint(), I'd do the conversion in a separate commit. Greetings, Andres Freund
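[Editor's note] The "minimal conversion" referred to above is small enough to sketch with C11 atomics: the spinlock-protected increment becomes a single fetch-and-add, whose default seq_cst ordering is a full barrier, matching the full-barrier semantics PostgreSQL documents for pg_atomic_fetch_add_u64(). This is an illustrative stand-alone model, not the xlog.c patch, and the initial value here is arbitrary.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Stand-in for XLogCtl->unloggedLSN, here a plain C11 atomic. */
static _Atomic uint64_t unlogged_lsn = 1;

/*
 * Analogue of GetFakeLSNForUnloggedRel(): replace
 * "acquire spinlock; value = counter++; release spinlock"
 * with one atomic fetch-add.  Because fetch_add with seq_cst ordering
 * is a full barrier, the conversion does not weaken any coherency
 * guarantee the spinlock version provided.
 */
static uint64_t
get_fake_lsn(void)
{
    return atomic_fetch_add(&unlogged_lsn, 1);
}
```

Each caller gets a unique, monotonically increasing value (fetch_add returns the pre-increment value), which is all the unlogged-LSN counter needs.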
Re: Add new option 'all' to pg_stat_reset_shared()
At Mon, 6 Nov 2023 14:00:13 +0530, Bharath Rupireddy wrote in > On Mon, Nov 6, 2023 at 11:39 AM torikoshia wrote: > > Since each stats, except wal_prefetch was reset acquiring LWLock, > > attached PoC patch makes the call atomic by using these LWlocks. > > > > If this is the right direction, I'll try to make wal_prefetch also take > > LWLock. I must say, I have reservations about this approach. The main concern is the duplication of reset code, which has been efficiently encapsulated for individual targets, into this location. This practice degrades the maintainability and clarity of the code. > Well, that's a total of ~17 LWLocks this new function takes to make > the stats reset atomic. I'm not sure if this atomicity is worth the > effort which can easily be misused - what if someone runs something > like SELECT pg_stat_reset_shared() FROM generate_series(1, > 10n); to cause heavy lock acquisition and release cycles? ... > IMV, atomicity is not something that applies for the stats reset > operation because stats are approximate numbers by nature after all. > If the pg_stat_reset_shared() resets stats for only a bunch of stats > types and fails, it's the basic application programming style that > when a query fails it's the application that needs to have a retry > mechanism. FWIW, the atomicity doesn't apply today if someone wants to > reset stats in a loop for all stats types. The infrequent use of this feature, coupled with the fact that there is no inherent need for these counters to be reset simultaneously, leads me to think that there is little point in centralizing the locks. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Atomic ops for unlogged LSN
Hi, On 2023-11-07 11:02:49 -0600, Nathan Bossart wrote: > On Tue, Nov 07, 2023 at 11:47:46AM -0500, Stephen Frost wrote: > > We only care about the value of the unlogged LSN being correct during > > normal shutdown when we're writing out the shutdown checkpoint, but by > > that time everything else has been shut down and the value absolutely > > should not be changing. > > I agree that's all true. I'm trying to connect how this scenario ensures > we see the most up-to-date value in light of this comment above > pg_atomic_read_u32(): > > * The read is guaranteed to return a value as it has been written by this or > * another process at some point in the past. There's however no cache > * coherency interaction guaranteeing the value hasn't since been written to > * again. > > Is there something special about all other backends being shut down that > ensures this returns the most up-to-date value and not something from "some > point in the past" as the stated contract for this function seems to > suggest? Practically yes - getting to the point of writing the shutdown checkpoint implies having gone through a bunch of code that implies memory barriers (spinlocks, lwlocks). However, even if there's likely some other implied memory barrier that we could piggyback on, the patch is much simpler to understand if it doesn't change coherency rules. There's no way the overhead could matter. Greetings, Andres Freund
Re: Show WAL write and fsync stats in pg_stat_io
On Tue, Nov 07, 2023 at 03:30:48PM -0800, Andres Freund wrote: > I strongly disagree. A significant part of the design of pg_stat_io was to > make it possible to collect multiple sources of IO in a single view, so that > sysadmins don't have to look in dozens of places to figure out what is causing > what kind of IO. Okay. Point taken. > We should over time collect all sources of IO in pg_stat_io. For some things > we might want to also have more detailed information in other views (e.g. it > doesn't make sense to track FPIs in pg_stat_io, but does make sense in > pg_stat_wal) - but that should be in addition, not instead of. Sure. I understand here that you mean the number of FPIs counted when a record is inserted, different from the path where we decide to write and/or flush WAL. The proposed patch seems to be a bit inconsistent regarding wal_sync_time, by the way. By the way, if the write/sync quantities and times begin to be tracked by pg_stat_io, I'd see a pretty good argument in removing the equivalent columns in pg_stat_wal. It looks like this would reduce the confusion related to the handling of PendingWalStats added in pgstat_io.c, for one. -- Michael signature.asc Description: PGP signature
Re: Moving forward with TDE [PATCH v3]
Hi, On 2023-11-06 11:26:44 +0100, Matthias van de Meent wrote: > On Sat, 4 Nov 2023 at 03:38, Andres Freund wrote: > > On 2023-11-02 22:09:40 +0100, Matthias van de Meent wrote: > > > I'm quite surprised at the significant number of changes being made > > > outside the core storage manager files. I thought that changing out > > > mdsmgr with an encrypted smgr (that could wrap mdsmgr if so desired) > > > would be the most obvious change to implement cluster-wide encryption > > > with the least code touched, as relations don't need to know whether > > > the files they're writing are encrypted, right? Is there a reason to > > > not implement this at the smgr level that I overlooked in the > > > documentation of these patches? > > > > You can't really implement encryption transparently inside an smgr without > > significant downsides. You need a way to store an initialization vector > > associated with the page (or you can store that elsewhere, but then you've > > doubled the worst-case amount of random reads/writes). The patch uses the LSN > > as the IV (which I doubt is a good idea). For authenticated encryption, > > additional storage space is required. > > I am unaware of any user of the smgr API that doesn't also use the > buffer cache, and thus implicitly the Page layout with PageHeader > [^1] Everything indeed uses a PageHeader - but there are a number of places that do *not* utilize pd_lower/upper/special. E.g. visibilitymap.c just assumes that those fields are zero - and changing that wouldn't be trivial / free, because we do a lot of bitmasking/shifting with constants derived from #define MAPSIZE (BLCKSZ - MAXALIGN(SizeOfPageHeaderData)) which obviously wouldn't be constant anymore if you could reserve space on the page. > The API of smgr is also tailored to page-sized quanta of data > with mostly relation-level information. 
I don't see why there would be > a veil covering the layout of Page for smgr when all other information > already points to the use of PageHeader and Page layouts. In my view, > it would even make sense to allow the smgr to get exclusive access to > some part of the page in the current Page layout. > > Yes, I agree that there will be an impact on usable page size if you > want authenticated encryption, and that AMs will indeed need to > account for storage space now being used by the smgr - inconvenient, > but it serves a purpose. That would happen regardless of whether smgr > or some higher system decides where to store the data for encryption - > as long as it is on the page, the AM effectively can't use those > bytes. > But I'd say that's best solved by making the Page documentation and > PageInit API explicit about the potential use of that space by the > chosen storage method (encrypted, plain, ...) instead of requiring the > various AMs to manually consider encryption when using Postgres' APIs > for writing data to disk without hitting shared buffers; page space > management is already a task of AMs, but handling the actual > encryption is not. I don't particularly disagree with any detail here - but to me reserving space for nonces etc at PageInit() time pretty much is the opposite of handling encryption inside smgr. > Should the AM really care whether the data on disk is encrypted or > not? I don't think so. When the disk contains encrypted bytes, but > smgrread() and smgrwrite() both produce and accept plaintext data, > who's going to complain? Requiring AMs to be mindful about encryption > on all common paths only adds pitfalls where encryption would be > forgotten by the developer of AMs in one path or another. I agree with that - I think the way the patch currently is designed is not right. There's other stuff you can't trivially do at the smgr level. E.g. 
if checksums or encryption is enabled, you need to copy the buffer to compute checksums / do IO if in shared buffers, because somebody could set a hint bit even with just a shared content lock. But you don't need that when coming from private buffers during index builds. > I think that getting PageInit to allocate the smgr-specific area would > take some effort, too (which would potentially require adding some > relational context to PageInit, so that it knows which page of which > relation it is going to initialize), but IMHO that would be more > natural than requiring all index and table AMs to be aware the actual > encryption of its pages and require manual handling of that encryption > when the page needs to be written to disk, when it otherwise already > conforms to the various buffer management and file extension APIs > currently in use in PostgreSQL. I would expect "transparent" data > encryption to be handled at the file write layer (i.e. smgr), not > inside the AMs. As mentioned above - I agree that the relevant code shouldn't be in index AMs. But I somewhat doubt that smgr is the right level either. For one, the place computing checksums needs awareness of locking / sharing
Re: Call pqPipelineFlush from PQsendFlushRequest
On Tue, Nov 07, 2023 at 12:43:18PM +0100, Alvaro Herrera wrote: > I agree, and I intend to get this patch pushed once the release freeze > is lifted. Thanks! -- Michael signature.asc Description: PGP signature
Re: Force the old transactions logs cleanup even if checkpoint is skipped
On Tue, Oct 17, 2023 at 02:09:21PM +, Zakhlystov, Daniil (Nebius) wrote: > I've stumbled into an interesting problem. Currently, if Postgres > has nothing to write, it would skip the checkpoint creation defined > by the checkpoint timeout setting. However, we might face a > temporary archiving problem (for example, some network issues) that > might lead to a pile of wal files stuck in pg_wal. After this > temporary issue has gone, we would still be unable to archive them > since we effectively skip the checkpoint because we have nothing to > write. I am not sure I understand your last sentence here. Once the archiver is back up, do you mean that the WAL segments that were not previously archived are still not archived? Or do you mean that because of a succession of skipped checkpoints we are just unable to remove them from pg_wal? > That might lead to a problem - suppose you've run out of disk space > because of the temporary failure of the archiver. After this > temporary failure has gone, Postgres would be unable to recover from > it automatically and will require human attention to initiate a > CHECKPOINT call. > > I suggest changing this behavior by trying to clean up the old WAL > even if we skip the main checkpoint routine. I've attached the patch > that does exactly that. > > What do you think? I am not convinced that this is worth the addition in the skipped path. If your system is idle and a set of checkpoints is skipped, the data folder is not going to be under extra space pressure because of database activity (okay, unlogged tables even if these would generate some WAL for init pages), because there is nothing happening in it with no "important" WAL generated. Note that the backend is very unlikely to generate WAL only marked with XLOG_MARK_UNIMPORTANT. More to the point: what's the origin of the disk space issues? System logs, unlogged tables or something else? It is usually a good practice to move logs to a different partition. 
At the end, it sounds to me that removing segments more aggressively is just kicking the can elsewhere, without taking care of the origin of the disk issues. -- Michael signature.asc Description: PGP signature
Re: [PATCHES] Post-special page storage TDE support
Hi, On 2023-05-09 17:08:26 -0500, David Christensen wrote: > From 965309ea3517fa734c4bc89c144e2031cdf6c0c3 Mon Sep 17 00:00:00 2001 > From: David Christensen > Date: Tue, 9 May 2023 16:56:15 -0500 > Subject: [PATCH v4 1/3] Add reserved_page_space to Page structure > > This space is reserved for extended data on the Page structure which will be > ultimately used for > encrypted data, extended checksums, and potentially other things. This data > appears at the end of > the Page, after any `pd_special` area, and will be calculated at runtime > based on specific > ControlFile features. > > No effort is made to ensure this is backwards-compatible with existing > clusters for `pg_upgrade`, as > we will require logical replication to move data into a cluster with > different settings here. The first part of the last paragraph makes it sound like pg_upgrade won't be supported across this commit, rather than just between different settings... I think as a whole this is not an insane idea. A few comments: - IMO the patch touches many places it shouldn't need to touch, because of essentially renaming a lot of existing macro names to *Limit, necessitating modifying a lot of users. I think instead the few places that care about the runtime limit should be modified. As-is the patch would cause a lot of fallout in extensions that just do things like defining an on-stack array of Datums or such - even though all they'd need is to change the define to the *Limit one. Even leaving extensions aside, it just makes reviewing (and I'm sure maintaining) the patch very tedious. - I'm a bit worried about how the extra special page will be managed - if there are multiple features that want to use it, who gets to put their data at what offset? After writing this I saw that 0002 tries to address this - but I don't like the design. It introduces runtime overhead that seems likely to be visible. 
- Checking for features using PageGetFeatureOffset() seems the wrong design to me - instead of a branch for some feature being disabled, perfectly predictable for the CPU, we need to do an external function call every time to figure out that yes, checksums are *still* disabled. - Recomputing offsets every time in PageGetFeatureOffset() seems too expensive. The offsets can't change while running, as PageGetFeatureOffset() has enough information to distinguish between different kinds of relations - so why do we need to recompute offsets on every single page? I'd instead add a distinct offset variable for each feature. - Modifying every single PageInit() call doesn't make sense to me. That'll just create a lot of breakage for - as far as I can tell - no win. - Why is it worth sacrificing space on every page to indicate which features were enabled? I think there'd need to be some convincing reasons for introducing such overhead. - Is it really useful to encode the set of features enabled in a cluster with a bitmask? That pretty much precludes utilizing extra page space in extensions. We could instead just have an extra cluster-wide file that defines a mapping of offset to feature. Greetings, Andres Freund
Re: Cleaning up array_in()
I got back to looking at this today (sorry for delay), and did a pass of code review. I think we are getting pretty close to something committable. The one loose end IMO is this bit in ReadArrayToken: +case '"': + +/* + * XXX "Unexpected %c character" would be more apropos, but + * this message is what the pre-v17 implementation produced, + * so we'll keep it for now. + */ +errsave(escontext, +(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION), + errmsg("malformed array literal: \"%s\"", origStr), + errdetail("Unexpected array element."))); +return ATOK_ERROR; This comes out when you write something like '{foo"bar"}', and I'd say the choice of message is not great. On the other hand, it's consistent with what you get from '{"foo""bar"}', and if we wanted to change that too then some tweaking of the state machine in ReadArrayStr would be required (or else modify ReadArrayToken so it doesn't return instantly upon seeing the second quote mark). I'm not sure that this is worth messing with. Anyway, I think we are well past the point where splitting the patch into multiple parts is worth doing, because we've rewritten pretty much all of this code, and the intermediate versions are not terribly helpful. So I just folded it all into one patch. Some notes about specific points: * Per previous discussion, I undid the change to allow "[1:0]" dimensions, but I left a comment behind about that. * Removing the ArrayGetNItemsSafe/ArrayCheckBoundsSafe calls seems OK, but then we need to be more careful about detecting overflows and disallowed cases in ReadArrayDimensions. * I don't think the ARRAYDEBUG code is of any value whatever. The fact that nobody bothered to improve it to print more than the dim[] values proves it hasn't been used in decades. Let's just nuke it. * We can simplify the state machine in ReadArrayStr some more: it seems to me it's sufficient to track "expect_delim", as long as you realize that that really means "expect typdelim or right brace". 
(Maybe another name would be better? I couldn't think of anything short though.) * I switched to using a StringInfo instead of a fixed-size elembuf, as Heikki speculated about. * I added some more test cases to cover things that evidently weren't sufficiently tested, like the has_escapes business which was flat out broken in v10, and to improve the code coverage report. Comments? regards, tom lane diff --git a/src/backend/utils/adt/arrayfuncs.c b/src/backend/utils/adt/arrayfuncs.c index 3ff13eb419..638e4f0382 100644 --- a/src/backend/utils/adt/arrayfuncs.c +++ b/src/backend/utils/adt/arrayfuncs.c @@ -54,18 +54,16 @@ bool Array_nulls = true; PG_FREE_IF_COPY(array, n); \ } while (0) +/* ReadArrayToken return type */ typedef enum { - ARRAY_NO_LEVEL, - ARRAY_LEVEL_STARTED, - ARRAY_ELEM_STARTED, - ARRAY_ELEM_COMPLETED, - ARRAY_QUOTED_ELEM_STARTED, - ARRAY_QUOTED_ELEM_COMPLETED, - ARRAY_ELEM_DELIMITED, - ARRAY_LEVEL_COMPLETED, - ARRAY_LEVEL_DELIMITED, -} ArrayParseState; + ATOK_LEVEL_START, + ATOK_LEVEL_END, + ATOK_DELIM, + ATOK_ELEM, + ATOK_ELEM_NULL, + ATOK_ERROR, +} ArrayToken; /* Working state for array_iterate() */ typedef struct ArrayIteratorData @@ -91,15 +89,21 @@ typedef struct ArrayIteratorData int current_item; /* the item # we're at in the array */ } ArrayIteratorData; -static int ArrayCount(const char *str, int *dim, char typdelim, - Node *escontext); -static bool ReadArrayStr(char *arrayStr, const char *origStr, - int nitems, int ndim, int *dim, +static bool ReadArrayDimensions(char **srcptr, int *ndim_p, +int *dim, int *lBound, +const char *origStr, Node *escontext); +static bool ReadDimensionInt(char **srcptr, int *result, + const char *origStr, Node *escontext); +static bool ReadArrayStr(char **srcptr, FmgrInfo *inputproc, Oid typioparam, int32 typmod, char typdelim, int typlen, bool typbyval, char typalign, - Datum *values, bool *nulls, - bool *hasnulls, int32 *nbytes, Node *escontext); + int *ndim_p, int *dim, + int *nitems_p, + Datum **values_p, bool 
**nulls_p, + const char *origStr, Node *escontext); +static ArrayToken ReadArrayToken(char **srcptr, StringInfo elembuf, char typdelim, + const char *origStr, Node *escontext); static void ReadArrayBinary(StringInfo buf, int nitems, FmgrInfo *receiveproc, Oid typioparam, int32 typmod, int typlen, bool typbyval, char typalign, @@ -185,12 +189,10 @@ array_in(PG_FUNCTION_ARGS) char typalign; char typdelim; Oid typioparam; - char *string_save, - *p; - int i, -nitems; - Datum *dataPtr; - bool *nullsPtr; + char *p; + int nitems; + Datum *values; + bool *nulls; bool hasnulls; int32 nbytes;
Re: Moving forward with TDE [PATCH v3]
Hi, On 2023-11-06 09:56:37 -0500, Stephen Frost wrote: > * Andres Freund (and...@anarazel.de) wrote: > > I still am quite quite unconvinced that using the LSN as a nonce is a good > > design decision. > > This is a really important part of the overall path to moving this > forward, so I wanted to jump to it and have a specific discussion > around this. I agree that there are downsides to using the LSN, some of > which we could possibly address (eg: include the timeline ID in the IV), > but others that would be harder to deal with. > The question then is- what's the alternative? > > One approach would be to change the page format to include space for an > explicit nonce. I don't see the community accepting such a new page > format as the only format we support though as that would mean no > pg_upgrade support along with wasted space if TDE isn't being used. Right. > Ideally, we'd instead be able to support multiple page formats where > users could decide when they create their cluster what features they > want- and luckily we even have such an effort underway with patches > posted for review [1]. I think there are some details wrong with that patch - IMO the existing macros should just continue to work as-is and instead the places that want the more narrow definition should be moved to the new macros and it changes places that should continue to use compile time constants - but it doesn't seem like a fundamentally bad idea to me. I certainly like it much better than making the page size runtime configurable. (I'll try to reply with the above points to [1]) > Certainly, with the base page-special-feature patch, we could have an option > for users to choose that they want a better nonce than the LSN, or we could > bundle that assumption in with, say, the authenticated-encryption feature > (if you want authenticated encryption, then you want more from the > encryption system than the basics, and therefore we presume you also want a > better nonce than the LSN). 
I don't think we should support using the LSN as a nonce if we have an alternative. The cost and complexity overhead is just not worth it. Yes, it'll be harder for users to migrate to encryption, but adding complexity elsewhere in the system to get an inferior result isn't worth it. > Another approach would be a separate fork, but that then has a number of > downsides too- every write has to touch that too, and a single page of > nonces would cover a pretty large number of pages also. Yea, the costs of doing so are nontrivial. If you were trying to implement encryption on the smgr level - which I doubt you should, but am not certain about - my suggestion would be to interleave pages with metadata like nonces and AEAD with the data pages. I.e. one metadata page would be followed by (BLCKSZ - SizeOfPageHeaderData) / (sizeof(nonce) + sizeof(AEAD)) pages containing actual relation data. That way you still get decent locality during scans and writes. Relation forks were a mistake, we shouldn't use them in more places. I think it'd be much better if we also encrypted forks, rather than just the main fork... Greetings, Andres Freund
Re: Show WAL write and fsync stats in pg_stat_io
Hi, On 2023-10-26 15:28:32 +0900, Michael Paquier wrote: > On top of that pg_stat_io is now for block-based I/O operations, so > that does not fit entirely in the picture, though I guess that Melanie > has thought more on the matter than me. That may be also a matter of > taste. I strongly disagree. A significant part of the design of pg_stat_io was to make it possible to collect multiple sources of IO in a single view, so that sysadmins don't have to look in dozens of places to figure out what is causing what kind of IO. We should over time collect all sources of IO in pg_stat_io. For some things we might want to also have more detailed information in other views (e.g. it doesn't make sense to track FPIs in pg_stat_io, but does make sense in pg_stat_wal) - but that should be in addition, not instead of. Greetings, Andres Freund
Re: speed up a logical replica setup
On Tue, Nov 07, 2023 at 10:00:39PM +0100, Peter Eisentraut wrote: > Speaking of which, would it make sense to put this tool (whatever the name) > into the pg_basebackup directory? It's sort of related, and it also shares > some code. I've read the patch, and the additions to streamutil.h and streamutil.c make it kind of natural to have it sit in pg_basebackup/. There's pg_recvlogical already there. I am wondering about two things, though: - Should the subdirectory pg_basebackup be renamed into something more generic at this point? All these things are frontend tools that deal in some way with the replication protocol to do their work. Say a replication_tools? - And if it would be better to refactor some of the code generic to all these streaming tools to fe_utils. What makes streamutil.h a bit less pluggable are all its extern variables to control the connection, but perhaps that can be an advantage, as well, in some cases. -- Michael signature.asc Description: PGP signature
Re: Moving forward with TDE [PATCH v3]
Greetings, * Bruce Momjian (br...@momjian.us) wrote: > On Mon, Nov 6, 2023 at 09:56:37AM -0500, Stephen Frost wrote: > > The gist is, without a suggestion of things to try, we're left > > to our own devices to try and figure out things which might be > > successful, only to have those turned down too when we come back with > > them, see [1] for what feels like an example of this. Given your > > feedback overall, which I'm very thankful for, I'm hopeful that you see > > that this is, indeed, a useful feature that people are asking for and > > therefore are willing to spend some time on it, but if the feedback is > > that nothing on the page is acceptable to use for the nonce, we can't > > put the nonce somewhere else, and we can't change the page format, then > > everything else is just moving deck chairs around on the titanic that > > has been this effort. > > Yeah, I know the feeling, though I thought XTS was immune enough to > nonce/LSN reuse that it was acceptable. Ultimately it depends on the attack vector you're trying to address, but generally, I think you're right about the XTS tweak reuse not being that big of a deal. XTS isn't CTR or GCM. With FDE (full disk encryption) you're expecting the attacker to steal the physical laptop, hard drive, etc, generally, and so the downside of using the same tweak with XTS over and over again isn't that bad (and is what most FDE solutions do, aiui, by simply using the sector number; we could do something similar to that by using the relfilenode + block number) because that re-use is a problem if the attacker is able to see multiple copies of the same block over time where the block has been encrypted with different data but the same key and tweak. Using the LSN actually is better than what the FDE solutions do because the LSN varies much more often than the sector number. Sure, it doesn't change with every write and maybe an attacker could glean something from that, but that's quite narrow. 
The downside from the LSN based approach with XTS is probably more that using the LSN means that we can't encrypt the LSN itself, and that is a leak too - but then again, we leak that through the simple WAL filenames too, to some extent, so it doesn't strike me as a huge issue. XTS as a block cipher doesn't suffer from the IV-reuse issue that you have with streaming ciphers, where the same key+IV and different data leads to being able to trivially retrieve the plaintext, though, and I worry that maybe that's what people were thinking. The README and comments I don't think were terribly clear on this, and I think may have even been from back when CTR was being considered, where IV reuse would have resulted in plaintext being trivially available. > What got me sunk on the feature was the complexity of adding temporary > file encryption support and that tipped the scales in the negative for > me in community value of the feature vs. added complexity. (Yeah, I used > a Titanic reference in the last sentence. ;-) ) However, I am open to > the community value and complexity values changing over time. My blog > post on the topic: We do need to address the temporary file situation too, and we do have a bit of an issue that how we deal with temporary files today in PG isn't very consistent and there are too many ways to do that. There's a patch that works on that, though it has some bitrot that we're working on addressing currently. There is value in simply fixing that situation wrt temporary file management independent of encryption, though of course then encryption of those temporary files becomes much simpler. Thanks, Stephen signature.asc Description: PGP signature
Re: Version 14/15 documentation Section "Alter Default Privileges"
On Mon, Nov 6, 2023 at 09:53:50PM +0100, Laurenz Albe wrote: > On Mon, 2023-11-06 at 10:55 -0500, Bruce Momjian wrote: > > Okay, I think I have good wording for this. I didn't like the wording > > of other roles, so I restructured that in the attached patch too. > > > > > !Default privileges apply only to the active role; the default > > !privileges of member roles have no affect on object permissions. > > !SET ROLE can be used to change the active user and > > !apply their default privileges. > > ! > > You don't mean member roles, but roles that the active role is a member of, > right? Yes, sorry fixed in the attached patch. > + > + As a non-superuser, you can change default privileges only on objects > created > + by yourself or by roles that you are a member of. However, you don't > inherit > + altered default privileges from roles you are a member of; objects you > create > + will receive the default privileges for your current role. > + I went with different wording since I found the above confusing. You didn't seem to like my SET ROLE suggestion so I removed it. > + > + > + There is no way to change the default privileges for objects created by > + arbitrary roles. You have run ALTER DEFAULT PRIVILEGES I find the above sentence odd. What is its purpose? > + for any role that can create objects whose default privileges should be > + modified. > + > + > + > + Currently, > + only the privileges for schemas, tables (including views and foreign > + tables), sequences, functions, and types (including domains) can be > + altered. For this command, functions include aggregates and procedures. > + The words FUNCTIONS and ROUTINES are > + equivalent in this command. (ROUTINES is preferred > + going forward as the standard term for functions and procedures taken > + together. In earlier PostgreSQL releases, only the > + word FUNCTIONS was allowed. It is not possible to set > + default privileges for functions and procedures separately.) 
> + > + > > Default privileges that are specified per-schema are added to whatever > the global default privileges are for the particular object type. > @@ -136,8 +149,9 @@ REVOKE [ GRANT OPTION FOR ] > target_role > > > - The name of an existing role of which the current role is a member. > - If FOR ROLE is omitted, the current role is assumed. > + Default privileges are changed for objects created by the > + target_role, or the current > + role if unspecified. I like a verb to be first, like "Change" rather than "default privileges". Patch attached. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com Only you can decide what is important to you. diff --git a/doc/src/sgml/ref/alter_default_privileges.sgml b/doc/src/sgml/ref/alter_default_privileges.sgml index 8a6006188d..78744470c8 100644 --- a/doc/src/sgml/ref/alter_default_privileges.sgml +++ b/doc/src/sgml/ref/alter_default_privileges.sgml @@ -88,25 +88,19 @@ REVOKE [ GRANT OPTION FOR ] Description - ALTER DEFAULT PRIVILEGES allows you to set the privileges - that will be applied to objects created in the future. (It does not - affect privileges assigned to already-existing objects.) Currently, - only the privileges for schemas, tables (including views and foreign - tables), sequences, functions, and types (including domains) can be - altered. For this command, functions include aggregates and procedures. - The words FUNCTIONS and ROUTINES are - equivalent in this command. (ROUTINES is preferred - going forward as the standard term for functions and procedures taken - together. In earlier PostgreSQL releases, only the - word FUNCTIONS was allowed. It is not possible to set - default privileges for functions and procedures separately.) + ALTER DEFAULT PRIVILEGES allows you to set the + privileges that will be applied to objects created in the future. + (It does not affect privileges assigned to already-existing objects.) 
+ Privileges can be set globally (i.e., for all objects created in the + current database), or just for objects created in specified schemas. - You can change default privileges only for objects that will be created by - yourself or by roles that you are a member of. The privileges can be set - globally (i.e., for all objects created in the current database), - or just for objects created in specified schemas. + While you can change your own default privileges and the defaults of + roles that you are a member of, at object creation time, new object + permissions are only affected by the default privileges of the current + role, and are not inherited from any roles in which the current role + is a member. @@ -118,6 +112,19 @@ REVOKE [ GRANT OPTION FOR ] ALTER DEFAULT PRIVILEGES. + + Currently,
Fix use of openssl.path() if openssl isn't found
Found this issue during my Fedora 39 upgrade. Tested that uninstalling openssl still allows the various ssl tests to run and succeed. -- Tristan Partin Neon (https://neon.tech) From d684d6fc1546335804d2ed82ff909991965a61a8 Mon Sep 17 00:00:00 2001 From: Tristan Partin Date: Tue, 7 Nov 2023 15:59:04 -0600 Subject: [PATCH v1] Fix use of openssl.path() if openssl isn't found openssl(1) is an optional dependency of the Postgres Meson build, but was inadvertently required when defining some SSL tests. --- src/test/ssl/meson.build | 7 ++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/src/test/ssl/meson.build b/src/test/ssl/meson.build index 4cda81f3bc..c3ffcaa032 100644 --- a/src/test/ssl/meson.build +++ b/src/test/ssl/meson.build @@ -1,5 +1,10 @@ # Copyright (c) 2022-2023, PostgreSQL Global Development Group +openssl_path = '' +if openssl.found() + openssl_path = openssl.path() +endif + tests += { 'name': 'ssl', 'sd': meson.current_source_dir(), @@ -7,7 +12,7 @@ tests += { 'tap': { 'env': { 'with_ssl': ssl_library, - 'OPENSSL': openssl.path(), + 'OPENSSL': openssl_path, }, 'tests': [ 't/001_ssltests.pl', -- Tristan Partin Neon (https://neon.tech)
Re: Fix search_path for all maintenance commands
On Tue, 2023-10-31 at 13:16 -0400, Isaac Morland wrote: > Perhaps the search_path for running a maintenance command should be > the search_path set for the table owner (ALTER ROLE … SET search_path > …)? After some thought, I don't think that's the right approach. It adds another way search path can be changed, which adds to the complexity. Also, by default it's "$user", public; and given that "public" was world-writable until recently, that doesn't seem like a good idea for a change intended to prevent search_path manipulation. Regards, Jeff Davis
Re: Compiler warning on Debian 12, PostgreSQL 16 Beta3
On Wed, Nov 8, 2023 at 8:13 AM Alvaro Herrera wrote: > On 2023-Nov-08, Thomas Munro wrote: > > On Wed, Nov 8, 2023 at 4:46 AM Alvaro Herrera > > wrote: > > > On 2023-Aug-25, Daniel Westermann (DWE) wrote: > > > > > > Yeah, I get this one too. I thought commit 37d5babb5cfa ("jit: Support > > > opaque pointers in LLVM 16.") was going to silence it, but I was quite > > > mistaken. I gave that code a quick look and could not understand what > > > it was complaining about. Is it a bug in the LLVM headers? > > > > I found the commit where they fixed that in 15+: > > > > https://github.com/llvm/llvm-project/commit/1d9086bf054c2e734940620d02d4451156b424e6 > > > > They don't seem to back-patch fixes, generally. > > Ah yeah, I can silence the warning by patching that file locally. Since LLVM only seems to maintain one branch at a time as a matter of policy (I don't see where that is written down but I do see for example their backport request format[1] which strictly goes from main to (currently) release/17.x, and see how the commit history of each release branch ends as a new branch is born), I suppose another angle would be to check if the Debian maintainers carry extra patches for stuff like that. They're the ones creating the dependency on an 'old' LLVM after all. Unlike the RHEL/etc maintainers' fast rolling version policy (that we learned about in the thread for CF #4640). Who wants to ship zombie unmaintained code for years? On the other hand, Debian itself rolls faster than RHEL. [1] https://llvm.org/docs/GitHub.html
Re: speed up a logical replica setup
On 23.10.23 05:53, Euler Taveira wrote: Let's continue with the bike shedding... I agree with Peter E that this name does not express what this tool is. At the moment, it only has one action: create. If I have to suggest other actions I would say that it could support a switchover option too (that removes the infrastructure created by this tool). If we decide to keep this name, it would be a good idea to add an option to indicate what action it is executing (similar to pg_recvlogical) as suggested by Peter. Speaking of which, would it make sense to put this tool (whatever the name) into the pg_basebackup directory? It's sort of related, and it also shares some code.
Re: Add the ability to limit the amount of memory that can be allocated to backends.
Greetings, * Andres Freund (and...@anarazel.de) wrote: > On 2023-11-06 13:02:50 -0500, Stephen Frost wrote: > > > > The max_total_memory limit is checked whenever the global counters are > > > > updated. There is no special error handling if a memory allocation > > > > exceeds > > > > the global limit. That allocation returns a NULL for malloc style > > > > allocations or an ENOMEM for shared memory allocations. Postgres has > > > > existing mechanisms for dealing with out of memory conditions. > > > > > > I still think it's extremely unwise to do tracking of memory and limiting > > > of > > > memory in one patch. You should work towards and acceptable patch that > > > just > > > tracks memory usage in an as simple and low overhead way as possible. > > > Then we > > > later can build on that. > > > > Frankly, while tracking is interesting, the limiting is the feature > > that's needed more urgently imv. > > I agree that we need limiting, but that the tracking needs to be very robust > for that to be usable. Is there an issue with the tracking in the patch that you saw? That's certainly an area that we've tried hard to get right and to match up to numbers from the rest of the system, such as the memory context system. > > We could possibly split it up but there's an awful lot of the same code that > > would need to be changed and that seems less than ideal. Still, we'll look > > into this. > > Shrug. IMO keeping them together just makes it very likely that neither goes > in. I'm happy to hear your support for the limiting part of this- that's encouraging. > > > > For sanity checking, pgstat now includes the > > > > pg_backend_memory_allocation > > > > view showing memory allocations made by the backend process. This view > > > > includes a scan of the top memory context, so it compares memory > > > > allocations > > > > reported through pgstat with actual allocations. The two should match. 
> > > > > > Can't you just do this using the existing pg_backend_memory_contexts view? > > > > Not and get a number that you can compare to the local backend number > > due to the query itself happening and performing allocations and > > creating new contexts. We wanted to be able to show that we are > > accounting correctly and exactly matching to what the memory context > > system is tracking. > > I think creating a separate view for this will be confusing for users, without > really much to show for. Excluding the current query would be useful for other > cases as well, why don't we provide a way to do that with > pg_backend_memory_contexts? Both of these feel very much like power-user views, so I'm not terribly concerned about users getting confused. That said, we could possibly drop this as a view and just have the functions which are then used in the regression tests to catch things should the numbers start to diverge. Having a way to get the memory contexts which don't include the currently running query might be interesting too but is rather independent of what this patch is trying to do. The only reason we collected up the memory-context info is as a cross-check to the tracking that we're doing and while the existing memory-context view is just fine for a lot of other things, it doesn't work for that specific need. Thanks, Stephen signature.asc Description: PGP signature
Re: GUC names in messages
FWIW, I am halfway through doing regex checking of the PG16 source for all GUC names in messages to see what current styles are in use today. Not sure if those numbers will influence the decision. I hope I can post my findings today or tomorrow. == Kind Regards, Peter Smith. Fujitsu Australia
Re: Popcount optimization using AVX512
On Mon, Nov 06, 2023 at 09:53:15PM -0800, Noah Misch wrote: > On Mon, Nov 06, 2023 at 09:59:26PM -0600, Nathan Bossart wrote: >> On Mon, Nov 06, 2023 at 07:15:01PM -0800, Noah Misch wrote: >> > The glibc/gcc "ifunc" mechanism was designed to solve this problem of >> > choosing >> > a function implementation based on the runtime CPU, without incurring >> > function >> > pointer overhead. I would not attempt to use AVX512 on non-glibc systems, >> > and >> > I would use ifunc to select the desired popcount implementation on glibc: >> > https://gcc.gnu.org/onlinedocs/gcc-4.8.5/gcc/Function-Attributes.html >> >> Thanks, that seems promising for the function pointer cases. I'll plan on >> trying to convert one of the existing ones to use it. BTW it looks like >> LLVM has something similar [0]. >> >> IIUC this unfortunately wouldn't help for cases where we wanted to keep >> stuff inlined, such as is_valid_ascii() and the functions in pg_lfind.h, >> unless we applied it to the calling functions, but that doesn't sound >> particularly maintainable. > > Agreed, it doesn't solve inline cases. If the gains are big enough, we should > move toward packages containing N CPU-specialized copies of the postgres > binary, with bin/postgres just exec'ing the right one. I performed a quick test with ifunc on my x86 machine that ordinarily uses the runtime checks for the CRC32C code, and I actually see a consistent 3.5% regression for pg_waldump -z on 100M 65-byte records. I've attached the patch used for testing. The multiple-copies-of-the-postgres-binary idea seems interesting. That's probably not something that could be enabled by default, but perhaps we could add support for a build option.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

diff --git a/src/include/port/pg_crc32c.h b/src/include/port/pg_crc32c.h
index d085f1dc00..6db411ee29 100644
--- a/src/include/port/pg_crc32c.h
+++ b/src/include/port/pg_crc32c.h
@@ -78,7 +78,7 @@ extern pg_crc32c pg_comp_crc32c_loongarch(pg_crc32c crc, const void *data, size_
 #define FIN_CRC32C(crc) ((crc) ^= 0xFFFFFFFF)

 extern pg_crc32c pg_comp_crc32c_sb8(pg_crc32c crc, const void *data, size_t len);
-extern pg_crc32c (*pg_comp_crc32c) (pg_crc32c crc, const void *data, size_t len);
+extern pg_crc32c pg_comp_crc32c(pg_crc32c crc, const void *data, size_t len);

 #ifdef USE_SSE42_CRC32C_WITH_RUNTIME_CHECK
 extern pg_crc32c pg_comp_crc32c_sse42(pg_crc32c crc, const void *data, size_t len);
diff --git a/src/port/pg_crc32c_sse42_choose.c b/src/port/pg_crc32c_sse42_choose.c
index 41ff4a35ad..62bb981ee8 100644
--- a/src/port/pg_crc32c_sse42_choose.c
+++ b/src/port/pg_crc32c_sse42_choose.c
@@ -51,14 +51,14 @@ pg_crc32c_sse42_available(void)
  * so that subsequent calls are routed directly to the chosen implementation.
  */
 static pg_crc32c
-pg_comp_crc32c_choose(pg_crc32c crc, const void *data, size_t len)
+(*pg_comp_crc32c_choose (void))(pg_crc32c crc, const void *data, size_t len)
 {
 	if (pg_crc32c_sse42_available())
-		pg_comp_crc32c = pg_comp_crc32c_sse42;
+		return pg_comp_crc32c_sse42;
 	else
-		pg_comp_crc32c = pg_comp_crc32c_sb8;
-
-	return pg_comp_crc32c(crc, data, len);
+		return pg_comp_crc32c_sb8;
 }

-pg_crc32c (*pg_comp_crc32c) (pg_crc32c crc, const void *data, size_t len) = pg_comp_crc32c_choose;
+pg_crc32c
+pg_comp_crc32c(pg_crc32c crc, const void *data, size_t len)
+	__attribute__ ((ifunc ("pg_comp_crc32c_choose")));
Re: Add the ability to limit the amount of memory that can be allocated to backends.
Hi, On 2023-11-06 13:02:50 -0500, Stephen Frost wrote: > > > The max_total_memory limit is checked whenever the global counters are > > > updated. There is no special error handling if a memory allocation exceeds > > > the global limit. That allocation returns a NULL for malloc style > > > allocations or an ENOMEM for shared memory allocations. Postgres has > > > existing mechanisms for dealing with out of memory conditions. > > > > I still think it's extremely unwise to do tracking of memory and limiting of > > memory in one patch. You should work towards and acceptable patch that just > > tracks memory usage in an as simple and low overhead way as possible. Then > > we > > later can build on that. > > Frankly, while tracking is interesting, the limiting is the feature > that's needed more urgently imv. I agree that we need limiting, but that the tracking needs to be very robust for that to be usable. > We could possibly split it up but there's an awful lot of the same code that > would need to be changed and that seems less than ideal. Still, we'll look > into this. Shrug. IMO keeping them together just makes it very likely that neither goes in. > > > For sanity checking, pgstat now includes the pg_backend_memory_allocation > > > view showing memory allocations made by the backend process. This view > > > includes a scan of the top memory context, so it compares memory > > > allocations > > > reported through pgstat with actual allocations. The two should match. > > > > Can't you just do this using the existing pg_backend_memory_contexts view? > > Not and get a number that you can compare to the local backend number > due to the query itself happening and performing allocations and > creating new contexts. We wanted to be able to show that we are > accounting correctly and exactly matching to what the memory context > system is tracking. I think creating a separate view for this will be confusing for users, without really much to show for. 
Excluding the current query would be useful for other cases as well; why don't we provide a way to do that with pg_backend_memory_contexts? Greetings, Andres Freund
Re: Compiler warning on Debian 12, PostgreSQL 16 Beta3
On 2023-Nov-08, Thomas Munro wrote: > On Wed, Nov 8, 2023 at 4:46 AM Alvaro Herrera wrote: > > On 2023-Aug-25, Daniel Westermann (DWE) wrote: > > > > Yeah, I get this one too. I thought commit 37d5babb5cfa ("jit: Support > > opaque pointers in LLVM 16.") was going to silence it, but I was quite > > mistaken. I gave that code a quick look and could not understand what > > it was complaining about. Is it a bug in the LLVM headers? > > I found the commit where they fixed that in 15+: > > https://github.com/llvm/llvm-project/commit/1d9086bf054c2e734940620d02d4451156b424e6 > > They don't seem to back-patch fixes, generally. Ah yeah, I can silence the warning by patching that file locally. Annoying :-( -- Álvaro Herrera — Breisgau, Deutschland — https://www.EnterpriseDB.com/ "The wise man speaks because he has something to say; the fool, because he has to say something" (Plato).
Re: Compiler warning on Debian 12, PostgreSQL 16 Beta3
On Wed, Nov 8, 2023 at 4:46 AM Alvaro Herrera wrote: > On 2023-Aug-25, Daniel Westermann (DWE) wrote: > > I've just noticed this warning when building on Debian 12: > > > > In file included from > > /usr/lib/llvm-14/include/llvm/Analysis/ModuleSummaryAnalysis.h:17, > > from llvmjit_inline.cpp:51: > > /usr/lib/llvm-14/include/llvm/IR/ModuleSummaryIndex.h: In constructor > > ‘llvm::ModuleSummaryIndex::ModuleSummaryIndex(bool, bool)’: > > /usr/lib/llvm-14/include/llvm/IR/ModuleSummaryIndex.h:1175:73: warning: > > member ‘llvm::ModuleSummaryIndex::Alloc’ is used uninitialized > > [-Wuninitialized] > > 1175 | : HaveGVs(HaveGVs), EnableSplitLTOUnit(EnableSplitLTOUnit), > > Saver(Alloc), > > | > > Yeah, I get this one too. I thought commit 37d5babb5cfa ("jit: Support > opaque pointers in LLVM 16.") was going to silence it, but I was quite > mistaken. I gave that code a quick look and could not understand what > it was complaining about. Is it a bug in the LLVM headers? I found the commit where they fixed that in 15+: https://github.com/llvm/llvm-project/commit/1d9086bf054c2e734940620d02d4451156b424e6 They don't seem to back-patch fixes, generally.
Re: 2023-11-09 release announcement draft
On Tue, Nov 07, 2023 at 09:02:03AM -0500, Jonathan S. Katz wrote: > On 11/6/23 9:52 PM, Noah Misch wrote: > > On Mon, Nov 06, 2023 at 05:04:25PM -0500, Jonathan S. Katz wrote: > > Delete lines starting here ... > > > > > This is the **final release of PostgreSQL 11**. PostgreSQL 10 will no > > > longer > > > receive > > > [security and bug fixes](https://www.postgresql.org/support/versioning/). > > > If you are running PostgreSQL 10 in a production environment, we suggest > > > that > > > you make plans to upgrade. > > > > ... to here. They're redundant with "PostgreSQL 11 EOL Notice" below: > > Initially, I strongly disagreed with this recommendation, as I've seen > enough people say that they were unaware that a community version is EOL. We > can't say this enough. > > However, I did decide to clip it out because the notice is just below. I just figured it was a copy-paste error, given the similarity of nearby sentences. I have no concern with a general goal of saying more about the end of v11.
Re: meson documentation build open issues
Re: Andres Freund > > We might get around that by introducing a new postgresql-manpages-XX > > arch:all package, but that might be too much micropackaging. > > I've not done packaging in, uh, a fair while, but isn't the common solution to > that a -common package? There might be a few more files we could put into one. True. /usr/share/postgresql/17/ is 4.2MB here, with 1.5MB manpages, 1.1MB /extensions/ and some other bits. Will consider, thanks. > > > + > > > + -Dselinux={ disabled | auto | enabled > > > } > > > + > > > + > > > +Build with selinux support, enabling the > > linkend="sepgsql"/> > > > +extension. > > This option defaults to ... auto? > > Not quite sure what you mean? Today it defaults to disabled, a patch changing > that should also change the docs? What I failed to say is that the other options document what the default is, this one doesn't yet. Christoph
Re: Atomic ops for unlogged LSN
On Tue, Nov 07, 2023 at 11:47:46AM -0500, Stephen Frost wrote: > We only care about the value of the unlogged LSN being correct during > normal shutdown when we're writing out the shutdown checkpoint, but by > that time everything else has been shut down and the value absolutely > should not be changing. I agree that's all true. I'm trying to connect how this scenario ensures we see the most up-to-date value in light of this comment above pg_atomic_read_u32(): * The read is guaranteed to return a value as it has been written by this or * another process at some point in the past. There's however no cache * coherency interaction guaranteeing the value hasn't since been written to * again. Is there something special about all other backends being shut down that ensures this returns the most up-to-date value and not something from "some point in the past" as the stated contract for this function seems to suggest? -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: meson documentation build open issues
Hi, On 2023-11-06 10:45:27 +0100, Christoph Berg wrote: > Re: Andres Freund > > > > The reason for that is simply that the docs take too long to build. > > > > > > That why I'd prefer to be able to separate arch:all and arch:any > > > builds, yes. > > > > What's stopping you from doing that? I think the only arch:any content we > > have is the docs, and those you can build separately? Doc builds do trigger > > generation of a handful of files besides the docs, but not more. > > Historically, .deb files have been required to contain the manpages > for all executables even when there's a separate -doc package. This > means we'd need a separate (hopefully fast) manpage build even when > the arch:any binaries are built. Manpages are a bit faster to build than html, but not a whole lot. Both are a lot faster than PDF. > We might get around that by introducing a new postgresql-manpages-XX > arch:all package, but that might be too much micropackaging. I've not done packaging in, uh, a fair while, but isn't the common solution to that a -common package? There might be a few more files we could put itno one. > > + > > + -Dselinux={ disabled | auto | enabled } > > + > > + > > +Build with selinux support, enabling the > > +extension. > > This option defaults to ... auto? Not quite sure what you mean? Today it defaults to disabled, a patch changing that should also change the docs? > > index 90e2c062fa8..003b57498bb 100644 > > --- a/doc/src/sgml/meson.build > > +++ b/doc/src/sgml/meson.build > > @@ -142,6 +142,7 @@ if docs_dep.found() > >'--install-dir-contents', dir_doc_html, html], > > build_always_stale: true, build_by_default: false, > >) > > + alias_target('doc-html', install_doc_html) > >alias_target('install-doc-html', install_doc_html) > > Shouldn't this just build the html docs, without installing? > > > + alias_target('doc-man', install_doc_html) > >alias_target('install-doc-man', install_doc_man) > > ... 
same > > > > + > > + install-install-world > > install-world > > > + > > + install-doc-html > > + > > + > > + Install documentation in man page format. > > install-doc-man Oops. > > + > > +Documentation Targets > > > + > > + docs > > + doc-html > > + > > + > > + Build documentation in multi-page HTML format. Note that > > + docs does not include > > building > > + man page documentation, as man page generation seldom fails when > > + building HTML documentation succeeds. > > Why is that a reason for not building the manpages? I didn't have it that way, and Tom argued strongly for maintaining that behaviour from the make build. Personally I wouldn't. > > + > > +Code Targets > > I would have expected the sections to be in the order > build-docs-install. Having install first seems weird to me. Makes sense to me. I just had the install first because I wrote it first because of our conversation... > > + > > +Other Targets > > + > > + > > + > > + > > + clean > > + > > + > > + Remove all build products > > + > > + > > + > > + > > + > > + test > > + > > + > > + Remove all enabled tests. Support for some classes of tests can be > > + enabled / disabled with > linkend="configure-tap-tests-meson"/> > > + and . > > This should explicitly say if contrib tests are included (or there > needs to be a separate test-world target.) They are included, will state that. And also s/Remove/Run/ > > Subject: [PATCH v1 5/5] meson: Add -Dpkglibdir option > > Will give that a try, thanks! Thanks for the review! Greetings, Andres Freund
Re: meson documentation build open issues
Alvaro Herrera writes: > If the problem is broken doc patches, then maybe a solution is to > include the `xmllint --noout --valid` target in whatever the check-world > equivalent is for meson. +1, but let's do that for the "make" build too. I see that doc/src/sgml/Makefile has a "check" target, but AFAICS it's not wired up to the top-level check-world. regards, tom lane
Re: Atomic ops for unlogged LSN
Greetings, * Nathan Bossart (nathandboss...@gmail.com) wrote: > On Tue, Nov 07, 2023 at 12:57:32AM +, John Morris wrote: > > I incorporated your suggestions and added a few more. The changes are > > mainly related to catching potential errors if some basic assumptions > > aren’t met. > > Hm. Could we move that to a separate patch? We've lived without these > extra checks for a very long time, and I'm not aware of any related issues, > so I'm not sure it's worth the added complexity. And IMO it'd be better to > keep it separate from the initial atomics conversion, anyway. I do see the value in adding in an Assert though I don't want to throw away the info about what the recent unlogged LSN was when we crash. As that basically boils down to a one-line addition, I don't think it really needs to be in a separate patch. > > I found the comment about cache coherency a bit confusing. We are dealing > > with a single address, so there should be no memory ordering or coherency > > issues. (Did I misunderstand?) I see it more as a race condition. Rather > > than merely explaining why it shouldn’t happen, the new version verifies > > the assumptions and throws an Assert() if something goes wrong. > > I was thinking of the comment for pg_atomic_read_u32() that I cited earlier > [0]. This comment also notes that pg_atomic_read_u32/64() has no barrier > semantics. My interpretation of that comment is that these functions > provide no guarantee that the value returned is the most up-to-date value. There seems to be some serious misunderstanding about what is happening here. The value written into the control file for unlogged LSN during normal operation does *not* need to be the most up-to-date value and talking about it as if it needs to be the absolutely most up-to-date and correct value is, if anything, adding to the confusion, not reducing confusion. 
The reason to write in anything other than a zero during these routine checkpoints for unlogged LSN is entirely for forensics purposes, not because we'll ever actually use the value- during crash recovery and backup/restore, we're going to reset the unlogged LSN counter anyway and we're going to throw away all of unlogged table contents across the entire system. We only care about the value of the unlogged LSN being correct during normal shutdown when we're writing out the shutdown checkpoint, but by that time everything else has been shut down and the value absolutely should not be changing. Thanks, Stephen signature.asc Description: PGP signature
Re: meson documentation build open issues
On 2023-Nov-07, Andres Freund wrote: > >I would like to have some set of options that enables it so that the > >standard documentation targets become part of "meson compile" and > >"meson install". > > -0.5 - it's just too painfully slow. For all scripted uses you can just as > well use install-world... If the problem is broken doc patches, then maybe a solution is to include the `xmllint --noout --valid` target in whatever the check-world equivalent is for meson. Looking at doc/src/sgml/meson.build, we don't seem to do that anywhere. Doing the no-output lint run is very fast (375ms real time in my machine, whereas "make html" takes 27s). -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "¿Cómo puedes confiar en algo que pagas y que no ves, y no confiar en algo que te dan y te lo muestran?" (Germán Poo)
Re: meson documentation build open issues
Andres Freund writes: > On November 7, 2023 7:55:37 AM PST, Peter Eisentraut > wrote: >> I would like to have some set of options that enables it so that the >> standard documentation targets become part of "meson compile" and "meson >> install". > -0.5 - it's just too painfully slow. For all scripted uses you can just as > well use install-world... I think we should set up the meson stuff so that "install" and "install-world" cover exactly what they did with "make". Otherwise there will be too much confusion. regards, tom lane
Re: XID formatting and SLRU refactorings (was: Add 64-bit XIDs into PostgreSQL 15)
On Mon, 6 Nov 2023 at 16:07, Alexander Korotkov wrote: > Hi! > > On Wed, Jul 5, 2023 at 4:46 PM Aleksander Alekseev < > aleksan...@timescale.com> wrote: > > PFE the corrected patchset v58. > > I'd like to revive this thread. > Hi! Great news! > BTW, there is a typo in a word "exceeed". > Fixed. > > +static int inline > +SlruFileName(SlruCtl ctl, char *path, int64 segno) > +{ > ... > +} > > I think it is worth adding asserts here to verify there is no overflow making > us map different segments into the same files. > Agree, assertion added. > + return occupied == max_notify_queue_pages; > > I'm not sure if the current code could actually allow occupying more than > max_notify_queue_pages. Probably not even in extreme cases. But I still > think it will be safer and easier to read to write "occupied >= > max_notify_queue_pages" here. > Fixed. > diff --git a/src/test/modules/test_slru/test_slru.c > b/src/test/modules/test_slru/test_slru.c > > The actual 64-bitness of SLRU pages isn't much exercised in our automated > tests. It would be too exhausting to make pg_notify actually use higher > than 2**32 page numbers. Thus, I think test/modules/test_slru is a good > place to give high page numbers a good test. > PFA, I've added a test for 64-bit SLRU pages. By the way, there is another useful thing we may do here. For now the pg_commit_ts functionality is rather strange: if it was enabled, then disabled and then enabled again, all the data from before will be discarded. Meanwhile, users expect to have their commit timestamps for all transactions which were "logged" while this feature was enabled. It's weird. AFAICS, the only reason for this behaviour is transaction wraparound. It may occur while the feature is disabled, and it is then safe to simply remove all the data from the previous period. If we switch to FullTransactionId in commit_ts we can overcome this limitation.
But I'm not sure if it is worth trying to fix this in the current patchset, since it is already non-trivial. -- Best regards, Maxim Orlov. v59-0003-Make-use-FullTransactionId-in-2PC-filenames.patch Description: Binary data v59-0004-Add-SLRU-tests-for-64-bit-page-case.patch Description: Binary data v59-0001-Index-SLRUs-by-64-bit-integers-rather-than-by-32.patch Description: Binary data v59-0002-Use-larger-segment-file-names-for-pg_notify.patch Description: Binary data
Re: ALTER TABLE uses a bistate but not for toast tables
Hi Justin, This patch went stale quite some time ago; CFbot does not seem to have any history of a successful apply attempt, nor do we have any successful build history (which was introduced some time ago already). Are you planning on rebasing this patch? Kind regards, Matthias van de Meent
Re: meson documentation build open issues
Hi, On November 7, 2023 7:55:37 AM PST, Peter Eisentraut wrote: >On 03.11.23 19:19, Christoph Berg wrote: > You can control this with the "docs" option for meson, as of recently. I've been looking into switching the Debian PG 17 build to meson, but I'm running into several problems. * The docs are still not built by default, and -Ddocs=enabled doesn't change that >>> Maybe I am missing something - they aren't built by default in autoconf >>> either? >> True, but the documentation (and this thread) reads like it should. Or >> at least it should, when I explicitly say -Ddocs=enabled. >> >> What would also help is when the tail of the meson output had a list >> of features that are enabled. There's the list of "External libraries" >> which is quite helpful at figuring out what's still missing, but >> perhaps this could be extended: >> >>Features >> LLVM : YES (/usr/bin/llvm-config-16) >> DOCS : YES (html pdf texinfo) >> >> Atm it's hidden in the long initial blurb of "Checking for.." and the >> "NO" in there don't really stand out as much, since some of them are >> normal. > >I don't feel like we have fully worked out how the docs options should fit >together. > >With the make build system, there is a canonical sequence of > >make world >make check-world >make install-world > >that encompasses everything. > >Now with meson to handle the documentation one needs to remember a variety of >additional targets. (There is a risk that once this gets more widespread, >more people will submit broken documentation.) install-world with meson also installs docs. >I would like to have some set of options that enables it so that the standard >documentation targets become part of "meson compile" and "meson install". -0.5 - it's just too painfully slow. For all scripted uses you can just as well use install-world... Andres -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
Re: meson documentation build open issues
On 03.11.23 19:19, Christoph Berg wrote: You can control this with the "docs" option for meson, as of recently. I've been looking into switching the Debian PG 17 build to meson, but I'm running into several problems. * The docs are still not built by default, and -Ddocs=enabled doesn't change that Maybe I am missing something - they aren't built by default in autoconf either? True, but the documentation (and this thread) reads like it should. Or at least it should, when I explicitly say -Ddocs=enabled. What would also help is when the tail of the meson output had a list of features that are enabled. There's the list of "External libraries" which is quite helpful at figuring out what's still missing, but perhaps this could be extended: Features LLVM : YES (/usr/bin/llvm-config-16) DOCS : YES (html pdf texinfo) Atm it's hidden in the long initial blurb of "Checking for.." and the "NO" in there don't really stand out as much, since some of them are normal. I don't feel like we have fully worked out how the docs options should fit together. With the make build system, there is a canonical sequence of make world make check-world make install-world that encompasses everything. Now with meson to handle the documentation one needs to remember a variety of additional targets. (There is a risk that once this gets more widespread, more people will submit broken documentation.) I would like to have some set of options that enables it so that the standard documentation targets become part of "meson compile" and "meson install".
Re: Compiler warning on Debian 12, PostgreSQL 16 Beta3
On 2023-Aug-25, Daniel Westermann (DWE) wrote: > I've just noticed this warning when building on Debian 12: > > In file included from > /usr/lib/llvm-14/include/llvm/Analysis/ModuleSummaryAnalysis.h:17, > from llvmjit_inline.cpp:51: > /usr/lib/llvm-14/include/llvm/IR/ModuleSummaryIndex.h: In constructor > ‘llvm::ModuleSummaryIndex::ModuleSummaryIndex(bool, bool)’: > /usr/lib/llvm-14/include/llvm/IR/ModuleSummaryIndex.h:1175:73: warning: > member ‘llvm::ModuleSummaryIndex::Alloc’ is used uninitialized > [-Wuninitialized] > 1175 | : HaveGVs(HaveGVs), EnableSplitLTOUnit(EnableSplitLTOUnit), > Saver(Alloc), > | Yeah, I get this one too. I thought commit 37d5babb5cfa ("jit: Support opaque pointers in LLVM 16.") was going to silence it, but I was quite mistaken. I gave that code a quick look and could not understand what it was complaining about. Is it a bug in the LLVM headers? Adding Andres and Thomas to CC, because they're the ones touching the LLVM / JIT code. Any clues? -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
Re: Buffer Cache Problem
On Tue, 7 Nov 2023 at 14:28, jacktby jacktby wrote:
>
> Hi, postgres hackers, I’m studying postgres buffer cache part. So I open this
> thread to communicate some buffer cache codes design and try to improve some
> tricky codes.
>
> For Buffer Cache, we know it’s a buffer array, every bucket of this array is
> consist of a data page and its header which is used to describe the state of
> the buffer.
>
> For field wait_backend_pgprocno, the comment is "backend of pin-count
> waiter”, I have problems below:

Did you read the README at src/backend/storage/buffer/README, as well as the comments and documentation in and around the buffer-locking functions?

> 1. it means which processId is waiting this buffer, right?
> 2. and if wait_backend_pgprocno is valid, so it says this buffer is in use by
> one process, right?
> 3. if one buffer is wait by another process, it means all buffers are out of
> use, right? So let’s try this: we have 5 buffers with ids (1,2,3,4,5), and
> they are all in use, now another process with processId 8017 is coming, and
> it choose buffer id 1, so buffer1’s wait_backend_pgprocno is 8017, but later
> buffer4 is released, can process 8017 change to get buffer4? how?

I believe these questions are generally answered by the README and the comments in bufmgr.c/buf_internals.h for the functions that try to lock buffers.

> 4. wait_backend_pgprocno is a “integer” type, not an array, why can one
> buffer be wait by only one process?

Yes, that is correct. It seems like PostgreSQL has yet to find a workload that requires more than one backend to wait for super exclusive access to a buffer at the same time. VACUUM seems to be the only workload that currently can wait and sleep for this exclusive buffer access, and that is already limited to one process per relation, so there are no explicit concurrent super-exclusive waits in the system right now.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)
Re: meson documentation build open issues
On 03.11.23 22:16, Andres Freund wrote: [selinux option] Why isn't it "auto" like the others? I don't really remember why I did that, but it's platform specific, maybe that's why I did it that way? Isn't that kind of the point of autodetecting things? Aren't bonjour and bsd_auth autodetected as well? I'd be happy to change it, unless somebody objects? Makes sense to me to change it to auto.
Re: Protocol question regarding Portal vs Cursor
Dave Cramer writes: > If we use a Portal it is possible to open the portal and do a describe and > then Fetch N records. > Using a Cursor we open the cursor. Is there a corresponding describe and a > way to fetch N records without getting the fields each time. Currently we > have to send the SQL "fetch N" and we get the fields and the > rows. This seems overly verbose. Portals and cursors are pretty much the same thing, so why not use the API that suits you better? regards, tom lane
Re: Relids instead of Bitmapset * in plannode.h
Alvaro Herrera writes: > On 2023-Oct-31, Ashutosh Bapat wrote: >> For some reason plannode.h has declared variable to hold RTIs as >> Bitmapset * instead of Relids like other places. Here's patch to fix >> it. This is superficial change as Relids is typedefed to Bitmapset *. >> Build succeeds for me and also make check passes. > I think the reason for having done it this way, was mostly to avoid > including pathnodes.h in plannodes.h. Yes, I'm pretty sure that's exactly the reason, and I'm strongly against the initially-proposed patch. The include footprint of pathnodes.h would be greatly expanded, for no real benefit. Among other things, that fuzzes the distinction between planner modules and non-planner modules. > While looking at it, I noticed that tcopprot.h includes both plannodes.h > and parsenodes.h, but there's no reason to include the latter (or at > least headerscheck doesn't complain after removing it), so I propose to > remove it, per 0001 here. 0001 is ok, except check #include alphabetization. > I also noticed while looking that I messed up in commit 7103ebb7aae8 > ("Add support for MERGE SQL command") on this point, because I added > #include parsenodes.h to plannodes.h. This is because MergeAction, > which is in parsenodes.h, is also needed by some executor code. But the > real way to fix that is to define that struct in primnodes.h. 0002 does > that. (I'm forced to also move enum OverridingKind there, which is a > bit annoying.) This seems OK. It seems to me that parsenodes.h has been required by plannodes.h for a long time, but if we can decouple them, all the better. > 0003 here is your patch, which I include because of conflicts with my > 0002. Still don't like it. > ... I would be more at ease if we didn't have to include > parsenodes.h in pathnodes.h, though, but that looks more difficult to > achieve. Yeah, that dependency has been there a long time too. 
I'm not too fussed by dependencies on parsenodes.h, because anything involved with either planning or execution will certainly be looking at query trees too. But I don't want to add dependencies that tie planning and execution together. regards, tom lane
MERGE: AFTER ROW trigger failure for cross-partition update
While working on the MERGE RETURNING patch, I noticed a pre-existing MERGE bug --- ExecMergeMatched() should not call ExecUpdateEpilogue() if ExecUpdateAct() indicates that it did a cross-partition update. The immediate consequence is that it incorrectly tries (and fails) to fire AFTER UPDATE ROW triggers, which it should not do if the UPDATE has been turned into a DELETE and an INSERT: DROP TABLE IF EXISTS foo CASCADE; CREATE TABLE foo (a int) PARTITION BY LIST (a); CREATE TABLE foo_p1 PARTITION OF foo FOR VALUES IN (1); CREATE TABLE foo_p2 PARTITION OF foo FOR VALUES IN (2); INSERT INTO foo VALUES (1); CREATE OR REPLACE FUNCTION foo_trig_fn() RETURNS trigger AS $$ BEGIN RAISE NOTICE 'Trigger: % %', TG_WHEN, TG_OP; IF TG_OP = 'DELETE' THEN RETURN OLD; END IF; RETURN NEW; END $$ LANGUAGE plpgsql; CREATE TRIGGER foo_b_trig BEFORE INSERT OR UPDATE OR DELETE ON foo FOR EACH ROW EXECUTE FUNCTION foo_trig_fn(); CREATE TRIGGER foo_a_trig AFTER INSERT OR UPDATE OR DELETE ON foo FOR EACH ROW EXECUTE FUNCTION foo_trig_fn(); UPDATE foo SET a = 2 WHERE a = 1; NOTICE: Trigger: BEFORE UPDATE NOTICE: Trigger: BEFORE DELETE NOTICE: Trigger: BEFORE INSERT NOTICE: Trigger: AFTER DELETE NOTICE: Trigger: AFTER INSERT UPDATE 1 MERGE INTO foo USING (VALUES (1)) AS v(a) ON true WHEN MATCHED THEN UPDATE SET a = v.a; NOTICE: Trigger: BEFORE UPDATE NOTICE: Trigger: BEFORE DELETE NOTICE: Trigger: BEFORE INSERT NOTICE: Trigger: AFTER DELETE NOTICE: Trigger: AFTER INSERT ERROR: failed to fetch tuple2 for AFTER trigger The attached patch fixes that, making the UPDATE path in ExecMergeMatched() consistent with ExecUpdate(). (If there were no AFTER ROW triggers, the old code would go on to do other unnecessary things, like WCO/RLS checks, which I didn't really look into. This patch will stop any of that too.) 
Regards, Dean

diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
new file mode 100644
index 299c2c7..b16fbe9
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -2899,6 +2899,22 @@ lmerge_matched:
 				}
 				result = ExecUpdateAct(context, resultRelInfo, tupleid, NULL,
 									   newslot, false, &updateCxt);
+
+				/*
+				 * As in ExecUpdate(), if ExecUpdateAct() reports that a
+				 * cross-partition update was done, then there's nothing else
+				 * for us to do --- the UPDATE has been turned into a DELETE
+				 * and an INSERT, and we must not perform any of the usual
+				 * post-update tasks.
+				 */
+				if (updateCxt.crossPartUpdate)
+				{
+					mtstate->mt_merge_updated += 1;
+					if (canSetTag)
+						(estate->es_processed)++;
+					return true;
+				}
+
 				if (result == TM_Ok && updateCxt.updated)
 				{
 					ExecUpdateEpilogue(context, &updateCxt, resultRelInfo,
diff --git a/src/test/regress/expected/triggers.out b/src/test/regress/expected/triggers.out
new file mode 100644
index e8de916..6cbfbbe
--- a/src/test/regress/expected/triggers.out
+++ b/src/test/regress/expected/triggers.out
@@ -2309,6 +2309,51 @@ NOTICE: trigger zzz on parted_trig_1_1
 NOTICE: trigger bbb on parted_trig_2 AFTER INSERT for ROW
 NOTICE: trigger zzz on parted_trig_2 AFTER INSERT for ROW
 drop table parted_trig;
+-- Verify that the correct triggers fire for cross-partition updates
+create table parted_trig (a int) partition by list (a);
+create table parted_trig1 partition of parted_trig for values in (1);
+create table parted_trig2 partition of parted_trig for values in (2);
+insert into parted_trig values (1);
+create or replace function trigger_notice() returns trigger as $$
+  begin
+    raise notice 'trigger % on % % % for %', TG_NAME, TG_TABLE_NAME, TG_WHEN, TG_OP, TG_LEVEL;
+    if TG_LEVEL = 'ROW' then
+      if TG_OP = 'DELETE' then
+        return OLD;
+      else
+        return NEW;
+      end if;
+    end if;
+    return null;
+  end;
+  $$ language plpgsql;
+create trigger parted_trig_before_stmt before insert or update or delete on parted_trig
+  for each statement execute procedure trigger_notice();
+create trigger parted_trig_before_row before insert or update or delete on parted_trig
+  for each row execute procedure trigger_notice();
+create trigger parted_trig_after_row after insert or update or delete on parted_trig
+  for each row execute procedure trigger_notice();
+create trigger parted_trig_after_stmt after insert or update or delete on parted_trig
+  for each statement execute procedure trigger_notice();
+update parted_trig set a = 2 where a = 1;
+NOTICE: trigger parted_trig_before_stmt on parted_trig BEFORE UPDATE for STATEMENT
+NOTICE: trigger parted_trig_before_row on parted_trig1 BEFORE UPDATE for ROW
+NOTICE: trigger parted_trig_before_row on parted_trig1 BEFORE DELETE for ROW
+NOTICE: trigger parted_trig_before_row on parted_trig2 BEFORE INSERT for ROW
+NOTICE: trigger parted_trig_after_row on parted_trig1 AFTER DELETE for ROW
+NOTICE: trigger parted_trig_after_row on parted_trig2 AFTER INSERT
Re: Commitfest: older Waiting on Author entries
On Tue, Nov 07, 2023 at 03:27:53PM +0700, John Naylor wrote: > The following entries are WoA but have not had an email update since > July. If there is no actionable update soon, I plan to mark these > Returned with Feedback before end of CF: > > pg_usleep for multisecond delays > vector search support I haven't had a chance to follow up on these in some time, so I went ahead and marked them returned-with-feedback. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: GUC names in messages
On Tue, Nov 07, 2023 at 10:33:03AM +0100, Alvaro Herrera wrote: > On 2023-Nov-01, Nathan Bossart wrote: >> +1, IMHO quoting GUC names makes it abundantly clear that they are special >> identifiers. In de4d456, we quoted the role names in a bunch of messages. >> We didn't quote the attribute/option names, but those are in all-caps, so >> they already stand out nicely. > > I like this, and I propose we codify it in the message style guide. How > about this? We can start looking at code changes to make once we decide > we agree with this. > + > +In messages containing configuration variable names, quotes are > +not necessary when the names are visibly not English natural words, such > +as when they have underscores or are all-uppercase. Otherwise, quotes > +must be added. Do include double-quotes in a message where an arbitrary > +variable name is to be expanded. > + I'd vote for quoting all GUC names, if for no other reason than "visibly not English natural words" feels a bit open to interpretation. But this seems like it's on the right track, so I won't argue too strongly if I'm the only holdout. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: GUC names in messages
Alvaro Herrera writes: > On 2023-Nov-01, Nathan Bossart wrote: >> +1, IMHO quoting GUC names makes it abundantly clear that they are special >> identifiers. In de4d456, we quoted the role names in a bunch of messages. >> We didn't quote the attribute/option names, but those are in all-caps, so >> they already stand out nicely. > I like this, and I propose we codify it in the message style guide. How > about this? We can start looking at code changes to make once we decide > we agree with this. WFM. regards, tom lane
Re: ALTER COLUMN ... SET EXPRESSION to alter stored generated column's expression
On 25.10.23 08:12, Amul Sul wrote: Here is the rebased version for the latest master head (673a17e3120). I haven't made any other changes related to the ON UPDATE trigger since that seems non-trivial; it needs a bit of work to add trigger support in ATRewriteTable(). Also, I am not sure yet if we should be doing these changes, or about the correct direction for that. I did some detailed testing on Db2, Oracle, and MariaDB (the three existing implementations of this feature that I'm tracking), and none of them fire any row or statement triggers when the respective statement to alter the generation expression is run. So I'm withdrawing my comment that this should fire triggers. (I suppose event triggers are available if anyone really needs the functionality.)
Re: Synchronizing slots from primary to standby
Hi,

On 11/7/23 11:55 AM, Amit Kapila wrote: On Tue, Nov 7, 2023 at 3:51 PM Drouvot, Bertrand wrote: On 10/31/23 10:37 AM, shveta malik wrote: On Fri, Oct 27, 2023 at 8:43 PM Drouvot, Bertrand

Agree with your test case, but in my case I was not using pub/sub.

I was not clear, so when I said "create logical_slot1 on the primary (and don't start using it)", I meant don't start decoding from it (like using pg_recvlogical() or pg_logical_slot_get_changes()). By using pub/sub the "don't start using it" is not satisfied. My test case is:

    SELECT * FROM pg_create_logical_replication_slot('logical_slot1', 'test_decoding', false, true, true);
    SELECT * FROM pg_create_logical_replication_slot('logical_slot2', 'test_decoding', false, true, true);
    pg_recvlogical -d postgres -S logical_slot2 --no-loop --start -f -

Okay, I am able to reproduce it now. Thanks for the clarification. I have tried to change the algorithm as per the suggestion by Amit in [1].

[1]: https://www.postgresql.org/message-id/CAA4eK1KBL0110gamQfc62X%3D5JV8-Qjd0dw0Mq0o07cq6kE%2Bq%3Dg%40mail.gmail.com

Thanks!

This is not a foolproof solution but an optimization over the first one. Now in any sync-cycle, we take 2 attempts at slot creation (if any slots are available to be created). In the first attempt, we do not wait indefinitely on inactive slots; we wait only for a fixed amount of time, and if the remote slot is still behind, then we add it to the pending list and move to the next slot. Once we are done with the first attempt, in the second attempt we go for the pending ones and now wait on each of them until the primary catches up.

Aren't we "just" postponing the "issue"? I mean, if there is really no activity on, say, the first created slot, then once we move to the second attempt, any newly created slot from that time would wait to be synced forever, no?

We have to wait at some point in time for such inactive slots, and the same is true even for manually created slots on the standby. Do you have any better ideas to deal with it?
What about:

- get rid of the second attempt and the pending_slot_list
- keep the wait_count and PrimaryCatchupWaitAttempt logic

so basically, get rid of:

    /*
     * Now sync the pending slots which were failed to be created in first
     * attempt.
     */
    foreach(cell, pending_slot_list)
    {
        RemoteSlot *remote_slot = (RemoteSlot *) lfirst(cell);

        /* Wait until the primary server catches up */
        PrimaryCatchupWaitAttempt = 0;

        synchronize_one_slot(wrconn, remote_slot, NULL);
    }

and the pending_slot_list list. That way, for each slot that has not been created and synced yet:

- it will be created on the standby
- we will wait up to PrimaryCatchupWaitAttempt attempts
- the slot will be synced or removed on/from the standby

That way an inactive slot on the primary would not "block" any other slots on the standby. By "created" here I mean calling ReplicationSlotCreate() (not to be confused with emitting "ereport(LOG, errmsg("created slot \"%s\" locally", remote_slot->name));", which is confusing as mentioned up-thread).

The problem I can see with this proposal is that the "sync" window waiting for slot activity on the primary is "only" during the PrimaryCatchupWaitAttempt attempts (as the slot will be dropped/recreated). If we think this window is too short we could:

- increase it, or
- not drop the slot once created (even if there is no activity on the primary during the PrimaryCatchupWaitAttempt attempts) so that the next loop of attempts will compare with "older" LSN/xmin (as compared to dropping and re-creating the slot). That way the window would extend back to the initial slot creation.

Thoughts?

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Re: 2023-11-09 release announcement draft
On 11/6/23 9:52 PM, Noah Misch wrote: On Mon, Nov 06, 2023 at 05:04:25PM -0500, Jonathan S. Katz wrote: The PostgreSQL Global Development Group has released an update to all supported versions of PostgreSQL, including 16.1, 15.5, 14.10, 13.13, 12.17, and 11.22. This release fixes over 55 bugs reported over the last several months. This release includes fixes for indexes where in certain cases, we advise reindexing. Please see the "Update" section for more details. s/"Update" section/"Updating" section/ or change section title below. Fixed. Delete lines starting here ... This is the **final release of PostgreSQL 11**. PostgreSQL 10 will no longer receive [security and bug fixes](https://www.postgresql.org/support/versioning/). If you are running PostgreSQL 10 in a production environment, we suggest that you make plans to upgrade. ... to here. They're redundant with "PostgreSQL 11 EOL Notice" below: Initially, I strongly disagreed with this recommendation, as I've seen enough people say that they were unaware that a community version is EOL. We can't say this enough. However, I did decide to clip it out because the notice is just below. That said, perhaps we should put out a separate announcement that states PostgreSQL 11 is EOL. We may want to consider doing a standalone EOL announcement -- perhaps 6 months out, and then day of, to make it abundantly clear that a version is being deprecated. Finally, I included Matthias' downthread recommendation in this version. Thanks, Jonathan The PostgreSQL Global Development Group has released an update to all supported versions of PostgreSQL, including 16.1, 15.5, 14.10, 13.13, 12.17, and 11.22. This release fixes over 55 bugs reported over the last several months. This release includes fixes for indexes where in certain cases, we advise reindexing. Please see the "Updating" section for more details. For the full list of changes, please review the [release notes](https://www.postgresql.org/docs/release/).
PostgreSQL 11 EOL Notice

**This is the final release of PostgreSQL 11**. PostgreSQL 11 is now end-of-life and will no longer receive security and bug fixes. If you are running PostgreSQL 11 in a production environment, we suggest that you make plans to upgrade to a newer, supported version of PostgreSQL. Please see our [versioning policy](https://www.postgresql.org/support/versioning/) for more information.

Bug Fixes and Improvements
--

This update fixes over 55 bugs that were reported in the last several months. The issues listed below affect PostgreSQL 16. Some of these issues may also affect other supported versions of PostgreSQL.

* Fix issue where GiST indexes had an incorrect behavior during a "page split" operation that could lead to incorrect results in subsequent index searches. Please [reindex](https://www.postgresql.org/docs/current/sql-reindex.html) GiST indexes after installing this update.
* Fix issue where B-tree indexes would incorrectly de-duplicate `interval` columns. Please [reindex](https://www.postgresql.org/docs/current/sql-reindex.html) any B-tree index that includes an `interval` column after installing this update.
* Provide more efficient indexing of `date`, `timestamptz`, and `timestamp` values in BRIN indexes when using a [`minmax_multi` opclass](https://www.postgresql.org/docs/current/brin-builtin-opclasses.html). While not required, we recommend [reindexing](https://www.postgresql.org/docs/current/sql-reindex.html) BRIN indexes that include these data types after installing this update.
* Fix for bulk table insertion into partitioned tables.
* Fix for hash-partitioned tables with multiple partition keys during step generation and runtime pruning that could lead to crashes in some cases.
* Throw the correct error if [`pgrowlocks()`](https://www.postgresql.org/docs/current/pgrowlocks.html) is applied to a partitioned table.
* Fix inconsistent rechecking of concurrently-updated rows during [`MERGE`](https://www.postgresql.org/docs/current/sql-merge.html) when using [`READ COMMITTED`](https://www.postgresql.org/docs/current/transaction-iso.html#XACT-READ-COMMITTED) mode.
* Correctly identify the target table in an inherited `UPDATE`/`DELETE`/`MERGE` even when the parent table is excluded by constraints.
* Fix over-allocation of a constructed [`tsvector`](https://www.postgresql.org/docs/current/datatype-textsearch.html#DATATYPE-TSVECTOR).
* Fix [`ALTER SUBSCRIPTION`](https://www.postgresql.org/docs/current/sql-altersubscription.html) to apply changes in the `run_as_owner` option.
* Several fixes for [`COPY FROM`](https://www.postgresql.org/docs/current/sql-copy.html).
* Several fixes for handling torn reads with [`pg_control`](https://www.postgresql.org/docs/current/wal-internals.html).
* Fix "could not find pathkey item to sort" errors occurring while planning aggregate functions with `ORDER BY` or `DISTINCT` options.
* When
Re: ResourceOwner refactoring
Hello Heikki,

07.11.2023 14:28, Heikki Linnakangas wrote: I feel pretty good about this overall. Barring objections or new cfbot failures, I will commit this in the next few days.

A script that I published in [1] detects several typos in the patches:

betwen
ther
ReourceOwner
ResouceOwnerSort
ReleaseOwnerRelease

Maybe it's worth fixing them now, or we could just accumulate such defects and clean them all up once a year...

[1] https://www.postgresql.org/message-id/27a7998b-d67f-e32a-a28d-e659a2e390c6%40gmail.com

Best regards,
Alexander
Buffer Cache Problem
Hi, postgres hackers, I’m studying the postgres buffer cache part, so I'm opening this thread to discuss some of the buffer cache code design and to try to improve some tricky code. For the buffer cache, we know it’s a buffer array; every bucket of this array consists of a data page and a header that describes the state of the buffer. This is the original code of the buffer header:

typedef struct BufferDesc
{
	BufferTag	tag;			/* ID of page contained in buffer */
	int			buf_id;			/* buffer's index number (from 0) */

	/* state of the tag, containing flags, refcount and usagecount */
	pg_atomic_uint32 state;

	int			wait_backend_pgprocno;	/* backend of pin-count waiter */
	int			freeNext;		/* link in freelist chain */

	LWLock		content_lock;	/* to lock access to buffer contents */
} BufferDesc;

For the field wait_backend_pgprocno, the comment is "backend of pin-count waiter”, and I have the questions below:

1. It means which process id is waiting for this buffer, right?
2. And if wait_backend_pgprocno is valid, does it say this buffer is in use by one process?
3. If one buffer is being waited on by another process, does it mean all buffers are in use? So let’s try this: we have 5 buffers with ids (1,2,3,4,5), and they are all in use; now another process with process id 8017 is coming, and it chooses buffer 1, so buffer 1’s wait_backend_pgprocno is 8017. But later buffer 4 is released; can process 8017 switch to get buffer 4? How?
4. wait_backend_pgprocno is an "integer" type, not an array; why can one buffer be waited on by only one process?

Hoping for your replies, thanks!! I’m willing to contribute after I study the buffer cache implementation.
Re: 2023-11-09 release announcement draft
On Mon, 6 Nov 2023 at 23:04, Jonathan S. Katz wrote: > > Hi, > > Attached is the release announcement draft for the 2023-11-09 release > (16.1 et al.). > > Please review for accuracy and notable omissions. Please have all > feedback in by 2023-11-09 08:00 UTC at the latest (albeit the sooner the > better). > 20231109updaterelease.md > [...] > * Provide more efficient indexing of `date`, `timestamptz`, and `timestamp` > values in BRIN indexes. While not required, we recommend > [reindexing](https://www.postgresql.org/docs/current/sql-reindex.html) BRIN > indexes that include these data types after installing this update. As the type's minmax_multi opclasses are marked as default, I believe it makes sense to explicitly mention that only indexes that use the type's minmax_multi opclasses would need to be reindexed for them to see improved performance. The types' *_bloom and *_minmax opclasses were not affected and therefore do not need to be reindexed. Kind regards, Matthias van de meent.
Re: collect_corrupt_items_vacuum.patch
07.11.2023 14:38, Alexander Korotkov wrote: Hi, Alexander. On Tue, Nov 7, 2023 at 1:00 PM Alexander Lakhin wrote: It looks like the v2 patch doesn't fix the original issue. Maybe I miss something, but with the patch applied, I see the failure immediately, though without the patch several iterations are needed to get it. That's a bug in the patch. Thank you for catching it. It should start the calculation from latestCompletedXid + 1, not InvalidTransactionId. Please check the revised patch.

Thanks for looking at this! Unfortunately, I still see the failure with v3, but not on the first iteration:

...
iteration 316
Error condition in psql-8.log:
create table vacuum_test as select 42 i;
vacuum (disable_page_skipping) vacuum_test;
select * from pg_check_visible('vacuum_test');
 t_ctid
--------
 (0,1)
(1 row)

(I've double-checked that the patch is applied and get_strict_xid_horizon() is called.)

Best regards,
Alexander
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
On Tue, 7 Nov 2023 at 00:03, Peter Geoghegan wrote: > > On Mon, Nov 6, 2023 at 1:28 PM Matthias van de Meent > wrote: > > I'm planning on reviewing this patch tomorrow, but in an initial scan > > through the patch I noticed there's little information about how the > > array keys state machine works in this new design. Do you have a more > > toplevel description of the full state machine used in the new design? > > This is an excellent question. You're entirely right: there isn't > enough information about the design of the state machine. > > I should be able to post v6 later this week. My current plan is to > commit the other nbtree patch first (the backwards scan "boundary > cases" one from the ongoing CF) -- since I saw your review earlier > today. I think that you should probably wait for this v6 before > starting your review. Okay, thanks for the update, then I'll wait for v6 to be posted. Kind regards, Matthias van de Meent Neon (https://neon.tech)
Re: XID formatting and SLRU refactorings (was: Add 64-bit XIDs into PostgreSQL 15)
Hi again,

> PFA the corrected patchset v59.

On second thought, I believe this Assert is incorrect:

```
+else
+{
+    Assert(segno >= 0 && segno <= 0x);
+    return snprintf(path, MAXPGPATH, "%s/%04X", (ctl)->Dir,
+                    (unsigned int) segno);
+}
```

See SlruCorrectSegmentFilenameLength():

```
if (ctl->long_segment_names)
    return (len == 15);     /* see SlruFileName() */
else
    /*
     * Commit 638cf09e76d allowed 5-character lengths. Later commit
     * 73c986adde5 allowed 6-character length.
     *
     * XXX should we still consider such names to be valid?
     */
    return (len == 4 || len == 5 || len == 6);
```

Should we just drop it or check that segno is <= 0xFF?

--
Best regards,
Aleksander Alekseev
Re: XID formatting and SLRU refactorings (was: Add 64-bit XIDs into PostgreSQL 15)
Hi Alexander,

> -#define TransactionIdToCTsPage(xid) \
> -	((xid) / (TransactionId) COMMIT_TS_XACTS_PER_PAGE)
> +
> +/*
> + * Although we return an int64 the actual value can't currently exceeed 2**32.
> + */
> +static inline int64
> +TransactionIdToCTsPage(TransactionId xid)
> +{
> +	return xid / (int64) COMMIT_TS_XACTS_PER_PAGE;
> +}
>
> Is there any reason we transform the macro into a function? If not, I propose to leave this as a macro. BTW, there is a typo in a word "exceeed".

I kept the inline function, as we agreed above. Typo fixed.

> +static int inline
> +SlruFileName(SlruCtl ctl, char *path, int64 segno)
> +{
> +	if (ctl->long_segment_names)
> +		/*
> +		 * We could use 16 characters here but the disadvantage would be that
> +		 * the SLRU segments will be hard to distinguish from WAL segments.
> +		 *
> +		 * For this reason we use 15 characters. It is enough but also means
> +		 * that in the future we can't decrease SLRU_PAGES_PER_SEGMENT easily.
> +		 */
> +		return snprintf(path, MAXPGPATH, "%s/%015llX", ctl->Dir,
> +						(long long) segno);
> +	else
> +		return snprintf(path, MAXPGPATH, "%s/%04X", (ctl)->Dir,
> +						(unsigned int) segno);
> +}
>
> I think it's worth adding asserts here to verify there is no overflow making us map different segments into the same files.

Added. I noticed an off-by-one error in the code snippet proposed above, so my code differs a bit.

> +	return occupied == max_notify_queue_pages;
>
> I'm not sure if the current code could actually allow occupying more than max_notify_queue_pages. Probably not even in extreme cases. But I still think it will be safer and easier to read to write "occupied >= max_notify_queue_pages" here.

Fixed.

> diff --git a/src/test/modules/test_slru/test_slru.c b/src/test/modules/test_slru/test_slru.c
>
> The actual 64-bitness of SLRU pages isn't much exercised in our automated tests. It would be too exhausting to make pg_notify actually use higher than 2**32 page numbers.
> Thus, I think test/modules/test_slru is a good place to give high page numbers a good test.

Fixed. I chose not to change any numbers in the test in order to check any corner cases, etc. The code paths for long_segment_names = true and long_segment_names = false are almost the same, thus it would not improve code coverage. Using the current numbers will allow us to easily switch back to long_segment_names = false in the test if necessary.

PFA the corrected patchset v59.

--
Best regards,
Aleksander Alekseev

v59-0003-Make-use-FullTransactionId-in-2PC-filenames.patch
Description: Binary data

v59-0002-Use-larger-segment-file-names-for-pg_notify.patch
Description: Binary data

v59-0001-Index-SLRUs-by-64-bit-integers-rather-than-by-32.patch
Description: Binary data
Re: RFC: Pluggable TOAST
On Tue, 7 Nov 2023 at 11:06, Nikita Malakhov wrote:
>
> Hi,
>
> I've been thinking about Matthias' proposals for some time and have some
> questions:
>
> > So, in short, I don't think there is a need for a specific "Pluggable
> > toast API" like the one in the patchset at [0] that can be loaded
> > on-demand, but I think that updating our current TOAST system to a
> > system for which types can provide support functions would likely be
> > quite beneficial, for efficient extraction of data from composite
> > values.
>
> As I understand it, one of the reasons against Pluggable TOAST is that differences
> in plugged-in Toasters could result in incompatibility even in different versions
> of the same DB.

That could be part of it, but it definitely wasn't my primary concern. The primary concern remains that the pluggable toaster patch made the jsonb type expose an API for a pluggable toaster that for all intents and purposes only has one implementation, due to its API being specifically tailored for the jsonb internals use case, with similar type-specific API bindings getting built for other types, each having strict expectations about the details of the implementation. I agree that it makes sense to specialize TOASTing for jsonb, but what I don't understand about it is why that would need to be achieved outside the core jsonb code. I understand that the 'pluggable toaster' APIs originate from one of PostgresPRO's forks of PostgreSQL, and I think it shows. That's not to say it's bad, but it seems to be built on different expectations: when maintaining a fork, you have different tradeoffs compared to maintaining the main product. A fork's changes need to be carried across many versions with unknown changes, thus you would want the smallest possible changes to enable the feature - pluggable toast makes sense here, as the changes are limited to a few jsonb internals, but most of the complex code is in an extension.
However, for core PostgreSQL, I think this separation makes very little sense: the complexity of maintaining a toast API for each type (when there can be expected to be only one implementation) is much more work than just building a good set of helper functions that do that same job. It allows for more flexibility, as there is no noticeable black-box API implementation to keep track of.

> The importance of the correct TOAST update is out of question, feel like I
> have to prepare a patch for it. There are some questions though, I'd address
> them later with a patch.
>
> > Example support functions:
> > /* TODO: bikeshedding on names, signatures, further support functions. */
> > Datum typsup_roastsliceofbread(Datum ptr, int sizetarget, char cmethod)
> > Datum typsup_unroastsliceofbread(Datum ptr)
> > void typsup_releaseroastedsliceofbread(Datump ptr) /* in case of
> > non-unitary in-memory datums */
>
> I correctly understand that you mean extending PG_TYPE and type cache,
> by adding a new function set for toasting/detoasting a value in addition to
> in/out, etc?

Yes.

> I see several issues here:
> 1) We could benefit from knowledge of internals of data being toasted (i.e.
> in case of JSON value with key-value structure) only when EXTERNAL
> storage mode is set, otherwise value will be compressed before toasted.
> So we have to keep both TOAST mechanics regarding the storage mode
> being used. It's the same issue as in Pluggable TOAST. Is it OK?

I think it is OK that the storage-related changes of this only start once the toast mechanism is invoked.

> 2) TOAST pointer is very limited in means of data it keeps, we'd have to
> extend it anyway and keep both for backwards compatibility;

Yes. We already have to retain the current (de)toast infrastructure to make sure current data files can still be read, given that we want to retain backward compatibility for currently toasted data.
> 3) There is no API and such an approach would require implementing
> toast and detoast in every data type we want to be custom toasted, resulting
> in multiple files modification. Maybe we have to consider introducing such
> an API?

No. As I mentioned, we can retain the current toast mechanism for current types that do not yet want to use these new toast APIs. If we use one different varatt_1b_e tag for type-owned toast pointers, the system will be opt-in for types, and types that don't (yet) have their own toast slicing design will keep using the old all-or-nothing single-allocation data with the good old compress-then-slice out-of-line toast storage.

> 4) 1 toast relation per regular relation. With an update mechanics this will
> be less limiting, but still a limiting factor because 1 entry in base table
> could have a lot of entries in the toast table. Are we doing something with
> this?

I don't think that is relevant to the topic of type-aware toasting optimization. The toast storage relation growing too large is not unique to jsonb- or bytea-typed columns, so I believe this is better solved in a different thread.
Re: Call pqPipelineFlush from PQsendFlushRequest
On 2023-Nov-07, Michael Paquier wrote:
> On Tue, Nov 07, 2023 at 10:38:04AM +0100, Jelte Fennema-Nio wrote:
> > In pipeline mode after queuing a message to be sent we would flush the
> > buffer if the size of the buffer passed some threshold. The only message
> > type that we didn't do that for was the Flush message. This addresses
> > that oversight.
> >
> > I noticed this discrepancy while reviewing the
> > PQsendSyncMessage/PQpipelinePutSync patchset.
>
> Indeed, it looks a bit strange that there is no flush if the buffer
> threshold is reached once the message is sent, so your suggestion
> sounds right. Alvaro?

I agree, and I intend to get this patch pushed once the release freeze is lifted.

-- 
Álvaro Herrera — Breisgau, Deutschland — https://www.EnterpriseDB.com/
"The Gord often wonders why people threaten never to come back after
they've been told never to return" (www.actsofgord.com)
Re: collect_corrupt_items_vacuum.patch
Hi, Alexander.

On Tue, Nov 7, 2023 at 1:00 PM Alexander Lakhin wrote:
> It looks like the v2 patch doesn't fix the original issue. Maybe I miss
> something, but with the patch applied, I see the failure immediately,
> though without the patch several iterations are needed to get it.

That's a bug in the patch. Thank you for catching it. It should start the calculation from latestCompletedXid + 1, not InvalidTransactionId. Please check the revised patch.

-- 
Regards,
Alexander Korotkov

0001-Fix-false-reports-in-pg_visibility-v3.patch
Description: Binary data
Protocol question regarding Portal vs Cursor
Greetings,

If we use a Portal, it is possible to open the portal, do a Describe, and then Fetch N records. Using a Cursor, we open the cursor. Is there a corresponding describe, and a way to fetch N records without getting the fields each time? Currently we have to send the SQL "fetch N", and we get the fields and the rows. This seems overly verbose.

Dave Cramer
Re: ResourceOwner refactoring
On 25/10/2023 21:07, Andres Freund wrote:
It's not too awful to have it be in this order:

struct ResourceOwnerData {
	ResourceOwner   parent;        /*   0     8 */
	ResourceOwner   firstchild;    /*   8     8 */
	ResourceOwner   nextchild;     /*  16     8 */
	const char     *name;          /*  24     8 */
	_Bool           releasing;     /*  32     1 */
	_Bool           sorted;        /*  33     1 */
	uint8           narr;          /*  34     1 */
	uint8           nlocks;        /*  35     1 */
	uint32          nhash;         /*  36     4 */
	uint32          capacity;      /*  40     4 */
	uint32          grow_at;       /*  44     4 */
	ResourceElem    arr[32];       /*  48   512 */
	/* --- cacheline 8 boundary (512 bytes) was 48 bytes ago --- */
	ResourceElem   *hash;          /* 560     8 */
	LOCALLOCK      *locks[15];     /* 568   120 */

	/* size: 688, cachelines: 11, members: 14 */
	/* last cacheline: 48 bytes */
};

Requires rephrasing a few comments to document that the lengths are separate from the array / hashtable / locks, but otherwise...

Hmm, let's move 'capacity' and 'grow_at' after 'arr', too. They are only needed together with 'hash'.

This reliably shows a decent speed improvement in my stress test [1], on the order of 5%.

[1] pgbench running
DO $$
BEGIN
  FOR i IN 1..5000 LOOP
    PERFORM abalance FROM pgbench_accounts WHERE aid = 17;
  END LOOP;
END;
$$;

I'm seeing similar results, although there's enough noise in the test that I'm not sure how real they would be across different tests.

At that point, the first array entry fits into the first cacheline. If we were to move parent, firstchild, nextchild, name further down the struct, three array entries would be on the first line. Just using the array presumably is a very common case, so that might be worth it?

I got less clear performance results with this one, and it's also quite possible it could hurt, if resowners aren't actually "used", just released. Therefore it's probably not worth it for now.

You're assuming that the ResourceOwner struct begins at a cacheline boundary. That's not usually true, we don't try to cacheline-align it.
So while it's helpful to avoid padding and to keep frequently-accessed fields close to each other, there's no benefit in keeping them at the beginning of the struct. -- Heikki Linnakangas Neon (https://neon.tech)
Re: Call pqPipelineFlush from PQsendFlushRequest
On Tue, Nov 07, 2023 at 10:38:04AM +0100, Jelte Fennema-Nio wrote:
> In pipeline mode after queuing a message to be sent we would flush the
> buffer if the size of the buffer passed some threshold. The only message
> type that we didn't do that for was the Flush message. This addresses
> that oversight.
>
> I noticed this discrepancy while reviewing the
> PQsendSyncMessage/PQpipelinePutSync patchset.

Indeed, it looks a bit strange that there is no flush if the buffer threshold is reached once the message is sent, so your suggestion sounds right. Alvaro?

-- 
Michael

signature.asc
Description: PGP signature
Re: Relids instead of Bitmapset * in plannode.h
Hello, On 2023-Oct-31, Ashutosh Bapat wrote: > For some reason plannode.h has declared variable to hold RTIs as > Bitmapset * instead of Relids like other places. Here's patch to fix > it. This is superficial change as Relids is typedefed to Bitmapset *. > Build succeeds for me and also make check passes. I think the reason for having done it this way, was mostly to avoid including pathnodes.h in plannodes.h. Did you explore what the consequences are? Starting here: https://doxygen.postgresql.org/plannodes_8h.html While looking at it, I noticed that tcopprot.h includes both plannodes.h and parsenodes.h, but there's no reason to include the latter (or at least headerscheck doesn't complain after removing it), so I propose to remove it, per 0001 here. There's a couple of files that need to be repaired for this change. windowfuncs.c is a no-brainer. However, having to edit bootstrap.h is a bit surprising -- I think before dac048f71ebb ("Build all Flex files standalone") this inclusion wasn't necessary, because the .y file already includes parsenodes.h; but after that commit, bootparse.h needs parsenodes.h to declare YYSTYPE, per comments in bootscanner.l. Anyway, this seems a good change. I also noticed while looking that I messed up in commit 7103ebb7aae8 ("Add support for MERGE SQL command") on this point, because I added #include parsenodes.h to plannodes.h. This is because MergeAction, which is in parsenodes.h, is also needed by some executor code. But the real way to fix that is to define that struct in primnodes.h. 0002 does that. (I'm forced to also move enum OverridingKind there, which is a bit annoying.) 0003 here is your patch, which I include because of conflicts with my 0002. After my 0002, plannodes.h is pretty slim, so I'd be hesitant to include pathnodes.h just to be able to change the Bitmapset * to Relids. 
But on the other hand, it doesn't seem to have too bad an effect overall (if only because plannodes.h is included by rather few files), so +0.1 on doing this. I would be more at ease if we didn't have to include parsenodes.h in pathnodes.h, though, but that looks more difficult to achieve.

-- 
Álvaro Herrera — Breisgau, Deutschland — https://www.EnterpriseDB.com/

>From dee38d19b281586a17788856cf169b8344c30da7 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera
Date: Mon, 6 Nov 2023 18:27:42 +0100
Subject: [PATCH v2 1/3] Don't include parsenodes.h in tcopprot.h

---
 src/backend/utils/adt/windowfuncs.c | 1 +
 src/include/bootstrap/bootstrap.h   | 1 +
 src/include/tcop/tcopprot.h         | 1 -
 3 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/src/backend/utils/adt/windowfuncs.c b/src/backend/utils/adt/windowfuncs.c
index b87a624fb2..cbdfba6994 100644
--- a/src/backend/utils/adt/windowfuncs.c
+++ b/src/backend/utils/adt/windowfuncs.c
@@ -14,6 +14,7 @@
 #include "postgres.h"
 
 #include "nodes/supportnodes.h"
+#include "nodes/parsenodes.h"
 #include "utils/builtins.h"
 #include "windowapi.h"
 
diff --git a/src/include/bootstrap/bootstrap.h b/src/include/bootstrap/bootstrap.h
index e1cb73c5f2..6a212122a3 100644
--- a/src/include/bootstrap/bootstrap.h
+++ b/src/include/bootstrap/bootstrap.h
@@ -15,6 +15,7 @@
 #define BOOTSTRAP_H
 
 #include "nodes/execnodes.h"
+#include "nodes/parsenodes.h"
 
 /*
diff --git a/src/include/tcop/tcopprot.h b/src/include/tcop/tcopprot.h
index 6c49db3bb7..b694d85974 100644
--- a/src/include/tcop/tcopprot.h
+++ b/src/include/tcop/tcopprot.h
@@ -15,7 +15,6 @@
 #define TCOPPROT_H
 
 #include "nodes/params.h"
-#include "nodes/parsenodes.h"
 #include "nodes/plannodes.h"
 #include "storage/procsignal.h"
 #include "utils/guc.h"
-- 
2.39.2

>From 870a853434b56d1fecd1a5fe85a56a2f60ff4ca9 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera
Date: Mon, 6 Nov 2023 18:28:08 +0100
Subject: [PATCH v2 2/3] Remove parsenodes.h from plannodes.h

This mistake was made because MergeAction was added to
parsenodes.h instead of primnodes.h, where it should have been. Put it there.

---
 src/include/nodes/parsenodes.h | 24
 src/include/nodes/plannodes.h  |  1 -
 src/include/nodes/primnodes.h  | 27 +++
 3 files changed, 27 insertions(+), 25 deletions(-)

diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index cf7e79062e..e494309da8 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -30,13 +30,6 @@
 #include "partitioning/partdefs.h"
 
-typedef enum OverridingKind
-{
-	OVERRIDING_NOT_SET = 0,
-	OVERRIDING_USER_VALUE,
-	OVERRIDING_SYSTEM_VALUE,
-} OverridingKind;
-
 /* Possible sources of a Query */
 typedef enum QuerySource
 {
@@ -1681,23 +1674,6 @@ typedef struct MergeWhenClause
 	List	   *values;		/* VALUES to INSERT, or NULL */
 } MergeWhenClause;
 
-/*
- * MergeAction -
- *		Transformed representation of a WHEN clause in a MERGE statement
- */
-typedef struct MergeAction
-{
-	NodeTag		type;
-	bool		matched;	/* true=MATCHED, false=NOT MATCHED */
-
Re: collect_corrupt_items_vacuum.patch
Hi Alexander,

06.11.2023 12:30, Alexander Korotkov wrote:
> Surely these would significantly sacrifice accuracy. But we have to do
> so in order to avoid reporting false errors.

I've reduced the dirty reproducer Daniel Shelepanov posted initially to the following:

numdbs=10
for ((d=1;d<=$numdbs;d++)); do
  createdb db$d
  psql db$d -c "create extension pg_visibility"
done

for ((i=1;i<=300;i++)); do
  echo "iteration $i"
  for ((d=1;d<=$numdbs;d++)); do
    (
      echo "
        create table vacuum_test as select 42 i;
        vacuum (disable_page_skipping) vacuum_test;
        select * from pg_check_visible('vacuum_test');
      " | psql db$d -a -q >psql-$d.log 2>&1
    ) &
  done
  wait

  res=0
  for ((d=1;d<=$numdbs;d++)); do
    grep -q '0 rows' psql-$d.log || { echo "Error condition in psql-$d.log:"; cat psql-$d.log; res=1; break; }
    psql db$d -q -c "drop table vacuum_test"
  done
  [ $res == 0 ] || break;
done

It looks like the v2 patch doesn't fix the original issue. Maybe I'm missing something, but with the patch applied, I see the failure immediately, though without the patch several iterations are needed to get it.

Best regards,
Alexander
Re: Synchronizing slots from primary to standby
On Tue, Nov 7, 2023 at 3:51 PM Drouvot, Bertrand wrote: > > On 10/31/23 10:37 AM, shveta malik wrote: > > On Fri, Oct 27, 2023 at 8:43 PM Drouvot, Bertrand > >> > >> Agree with your test case, but in my case I was not using pub/sub. > >> > >> I was not clear, so when I said: > >> > - create logical_slot1 on the primary (and don't start using it) > >> > >> I meant don't start decoding from it (like using pg_recvlogical() or > >> pg_logical_slot_get_changes()). > >> > >> By using pub/sub the "don't start using it" is not satisfied. > >> > >> My test case is: > >> > >> " > >> SELECT * FROM pg_create_logical_replication_slot('logical_slot1', > >> 'test_decoding', false, true, true); > >> SELECT * FROM pg_create_logical_replication_slot('logical_slot2', > >> 'test_decoding', false, true, true); > >> pg_recvlogical -d postgres -S logical_slot2 --no-loop --start -f - > >> " > >> > > > > Okay, I am able to reproduce it now. Thanks for clarification. I have > > tried to change the algorithm as per suggestion by Amit in [1] > > > > [1]: > > https://www.postgresql.org/message-id/CAA4eK1KBL0110gamQfc62X%3D5JV8-Qjd0dw0Mq0o07cq6kE%2Bq%3Dg%40mail.gmail.com > > Thanks! > > > > > This is not full proof solution but optimization over first one. Now > > in any sync-cycle, we take 2 attempts for slots-creation (if any slots > > are available to be created). In first attempt, we do not wait > > indefinitely on inactive slots, we wait only for a fixed amount of > > time and if remote-slot is still behind, then we add that to the > > pending list and move to the next slot. Once we are done with first > > attempt, in second attempt, we go for the pending ones and now we wait > > on each of them until the primary catches up. > > Aren't we "just" postponing the "issue"? I mean if there is really no activity > on, say, the first created slot, then once we move to the second attempt then > any newly > created slot from that time would wait to be synced forever, no? 
> We have to wait at some point in time for such inactive slots and the same is true even for manually created slots on standby. Do you have any better ideas to deal with it? > Looking at V30: > > + /* Update lsns of slot to remote slot's current position */ > + local_slot_update(remote_slot); > + ReplicationSlotPersist(); > + > + ereport(LOG, errmsg("created slot \"%s\" locally", > remote_slot->name)); > > I think this message is confusing as the slot has been created before it, > here: > > + else > + { > + TransactionId xmin_horizon = InvalidTransactionId; > + ReplicationSlot *slot; > + > + ReplicationSlotCreate(remote_slot->name, true, RS_EPHEMERAL, > + remote_slot->two_phase, false); > > So that it shows up in pg_replication_slots before this message is emitted > (and that > specially true/worst for non active slots). > > Maybe something like "newly locally created slot XXX has been synced..."? > > While at it, would that make sense to move > > + slot->data.failover = true; > > once we stop waiting for this slot? I think that would avoid confusion if one > query pg_replication_slots while we are still waiting for this slot to be > synced, > thoughts? (currently we can see pg_replication_slots.synced_slot set to true > while we are still waiting). > The failover property of the slot is different from whether the slot has been synced yet, so we can't change the location of marking it but we can try to improve when to show that slot has been synced. -- With Regards, Amit Kapila.
Re: A recent message added to pg_upgade
On Tue, Nov 7, 2023 at 8:12 AM Michael Paquier wrote:
> On Tue, Nov 07, 2023 at 07:59:46AM +0530, Amit Kapila wrote:
> > Do you mean to say that if 'IsBinaryUpgrade' is true then let's not
> > allow to launch launcher or apply worker? If so, I guess this won't be
> > any better than prohibiting at an early stage or explicitly overriding
> > those with internal values and documenting it, at least that way we
> > can be consistent for both variables (max_logical_replication_workers
> > and max_slot_wal_keep_size).
>
> Yes, I mean to paint an extra IsBinaryUpgrade before registering the
> apply worker launcher. That would be consistent with what we do for
> autovacuum in the postmaster.

But then we don't need the hardcoded value of max_logical_replication_workers as zero by pg_upgrade. I think doing IsBinaryUpgrade for slots won't be neat, so we anyway need to keep using the special value of the max_slot_wal_keep_size GUC. Though the handling for both won't be the same, I guess that, given the situation, it seems like a reasonable thing to do. If we follow that, then we can have this special GUC hook only for the max_slot_wal_keep_size GUC.

-- 
With Regards,
Amit Kapila.
Re: Synchronizing slots from primary to standby
Hi, On 10/31/23 10:37 AM, shveta malik wrote: On Fri, Oct 27, 2023 at 8:43 PM Drouvot, Bertrand Agree with your test case, but in my case I was not using pub/sub. I was not clear, so when I said: - create logical_slot1 on the primary (and don't start using it) I meant don't start decoding from it (like using pg_recvlogical() or pg_logical_slot_get_changes()). By using pub/sub the "don't start using it" is not satisfied. My test case is: " SELECT * FROM pg_create_logical_replication_slot('logical_slot1', 'test_decoding', false, true, true); SELECT * FROM pg_create_logical_replication_slot('logical_slot2', 'test_decoding', false, true, true); pg_recvlogical -d postgres -S logical_slot2 --no-loop --start -f - " Okay, I am able to reproduce it now. Thanks for clarification. I have tried to change the algorithm as per suggestion by Amit in [1] [1]: https://www.postgresql.org/message-id/CAA4eK1KBL0110gamQfc62X%3D5JV8-Qjd0dw0Mq0o07cq6kE%2Bq%3Dg%40mail.gmail.com Thanks! This is not full proof solution but optimization over first one. Now in any sync-cycle, we take 2 attempts for slots-creation (if any slots are available to be created). In first attempt, we do not wait indefinitely on inactive slots, we wait only for a fixed amount of time and if remote-slot is still behind, then we add that to the pending list and move to the next slot. Once we are done with first attempt, in second attempt, we go for the pending ones and now we wait on each of them until the primary catches up. Aren't we "just" postponing the "issue"? I mean if there is really no activity on, say, the first created slot, then once we move to the second attempt then any newly created slot from that time would wait to be synced forever, no? 
Looking at V30: + /* Update lsns of slot to remote slot's current position */ + local_slot_update(remote_slot); + ReplicationSlotPersist(); + + ereport(LOG, errmsg("created slot \"%s\" locally", remote_slot->name)); I think this message is confusing as the slot has been created before it, here: + else + { + TransactionId xmin_horizon = InvalidTransactionId; + ReplicationSlot *slot; + + ReplicationSlotCreate(remote_slot->name, true, RS_EPHEMERAL, + remote_slot->two_phase, false); So that it shows up in pg_replication_slots before this message is emitted (and that specially true/worst for non active slots). Maybe something like "newly locally created slot XXX has been synced..."? While at it, would that make sense to move + slot->data.failover = true; once we stop waiting for this slot? I think that would avoid confusion if one query pg_replication_slots while we are still waiting for this slot to be synced, thoughts? (currently we can see pg_replication_slots.synced_slot set to true while we are still waiting). Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: RFC: Pluggable TOAST
Hi,

I've been thinking about Matthias' proposals for some time and have some questions:

> So, in short, I don't think there is a need for a specific "Pluggable
> toast API" like the one in the patchset at [0] that can be loaded
> on-demand, but I think that updating our current TOAST system to a
> system for which types can provide support functions would likely be
> quite beneficial, for efficient extraction of data from composite
> values.

As I understand, one of the reasons against Pluggable TOAST is that differences in plugged-in Toasters could result in incompatibility even in different versions of the same DB.

The importance of the correct TOAST update is out of question; I feel like I have to prepare a patch for it. There are some questions though, I'd address them later with a patch.

> Example support functions:
> /* TODO: bikeshedding on names, signatures, further support functions. */
> Datum typsup_roastsliceofbread(Datum ptr, int sizetarget, char cmethod)
> Datum typsup_unroastsliceofbread(Datum ptr)
> void typsup_releaseroastedsliceofbread(Datump ptr) /* in case of
> non-unitary in-memory datums */

I correctly understand that you mean extending PG_TYPE and the type cache, by adding a new function set for toasting/detoasting a value in addition to in/out, etc?

I see several issues here:
1) We could benefit from knowledge of internals of data being toasted (i.e. in case of JSON value with key-value structure) only when EXTERNAL storage mode is set, otherwise the value will be compressed before being toasted. So we have to keep both TOAST mechanics regarding the storage mode being used. It's the same issue as in Pluggable TOAST. Is it OK?
2) The TOAST pointer is very limited in terms of the data it keeps; we'd have to extend it anyway and keep both for backwards compatibility.
3) There is no API, and such an approach would require implementing toast and detoast in every data type we want to be custom toasted, resulting in modification of multiple files.
Maybe we have to consider introducing such an API?
4) 1 toast relation per regular relation. With an update mechanics this will be less limiting, but still a limiting factor because 1 entry in the base table could have a lot of entries in the toast table. Are we doing something with this?

> We would probably want at least 2 more subtypes of varattrib_1b_e -
> one for on-disk pointers, and one for in-memory pointers - where the
> payload of those pointers is managed by the type's toast mechanism and
> considered opaque to the rest of PostgreSQL (and thus not compatible
> with the binary transfer protocol). Types are currently already
> expected to be able to handle their own binary representation, so
> allowing types to manage parts of the toast representation should IMHO
> not be too dangerous, though we should make sure that BINARY COERCIBLE
> types share this toast support routine, or be returned to their
> canonical binary version before they are cast to the coerced type, as
> using different detoasting mechanisms could result in corrupted data
> and thus crashes.
>
> Lastly, there is the compression part of TOAST. I think it should be
> relatively straightforward to expose the compression-related
> components of TOAST through functions that can then be used by
> type-specific toast support functions.
>
> Note that this would be opt-in for a type, thus all functions that use
> that type's internals should be aware of the different on-disk format
> for toasted values and should thus be able to handle it gracefully.

Thanks a lot for answers!

-- 
Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company
https://postgrespro.ru/