Re: pgsql: Clean up role created in new subscription test.

2023-11-07 Thread Peter Eisentraut

On 06.07.23 00:00, Daniel Gustafsson wrote:

On 16 May 2023, at 11:17, Daniel Gustafsson  wrote:



Parked in the July CF for now.


Rebased to fix a trivial conflict highlighted by the CFBot.


I think the problem with this approach is that one would need to reapply 
it to each regression test suite separately.  For example, there are 
several tests under contrib/ that create roles.  These would not be 
covered by this automatically.


I think the earlier idea of just counting roles, tablespaces, etc. 
before and after would be sufficient.
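For what it's worth, the counting idea could be as simple as snapshotting global-object counts before the suite and comparing afterwards; a rough SQL sketch (not tied to any particular suite — the choice of catalogs is illustrative):

```sql
-- Sketch of the "count before and after" idea: run this before and after a
-- regression suite and compare the results; any difference means the suite
-- leaked (or dropped) a global object.
SELECT (SELECT count(*) FROM pg_roles)        AS roles,
       (SELECT count(*) FROM pg_tablespace)   AS tablespaces,
       (SELECT count(*) FROM pg_database)     AS databases,
       (SELECT count(*) FROM pg_subscription) AS subscriptions;
```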





Question about the Implementation of vector32_is_highbit_set on ARM

2023-11-07 Thread Xiang Gao
Hi all,
I have some questions about the implementation of vector32_is_highbit_set
on ARM. Below are the comment and the implementation of this function.
/*
 * Exactly like vector8_is_highbit_set except for the input type, so it
 * looks at each byte separately.
 *
 * XXX x86 uses the same underlying type for 8-bit, 16-bit, and 32-bit
 * integer elements, but Arm does not, hence the need for a separate
 * function. We could instead adopt the behavior of Arm's vmaxvq_u32(), i.e.
 * check each 32-bit element, but that would require an additional mask
 * operation on x86.
 */
#ifndef USE_NO_SIMD
static inline bool
vector32_is_highbit_set(const Vector32 v)
{
#if defined(USE_NEON)
return vector8_is_highbit_set((Vector8) v);
#else
return vector8_is_highbit_set(v);
#endif
}
#endif  /* ! USE_NO_SIMD */

But I still don't understand why the vmaxvq_u32 intrinsic is not used on
the ARM platform. We already use the USE_NEON macro to distinguish
platforms. In addition, according to the "Arm Neoverse N1 Software
Optimization Guide", the vmaxvq_u32 intrinsic has half the latency of
vmaxvq_u8 and twice the bandwidth, so I think we could just use
vmaxvq_u32 directly.

Any comments or feedback are welcome.




Re: Improve WALRead() to suck data directly from WAL buffers when possible

2023-11-07 Thread Bharath Rupireddy
On Sat, Nov 4, 2023 at 6:13 AM Andres Freund  wrote:
>
> > + cur_lsn = GetFlushRecPtr(NULL);
> > + if (unlikely(startptr > cur_lsn))
> > + elog(ERROR, "WAL start LSN %X/%X specified for reading from 
> > WAL buffers must be less than current database system WAL LSN %X/%X",
> > +  LSN_FORMAT_ARGS(startptr), LSN_FORMAT_ARGS(cur_lsn));
>
> Hm, why does this check belong here? For some tools it might be legitimate to
> read the WAL before it was fully flushed.

Agreed and removed the check.

> > + /*
> > +  * Holding WALBufMappingLock ensures inserters don't overwrite this 
> > value
> > +  * while we are reading it. We try to acquire it in shared mode so 
> > that
> > +  * the concurrent WAL readers are also allowed. We try to do as less 
> > work
> > +  * as possible while holding the lock to not impact concurrent WAL 
> > writers
> > +  * much. We quickly exit to not cause any contention, if the lock 
> > isn't
> > +  * immediately available.
> > +  */
> > + if (!LWLockConditionalAcquire(WALBufMappingLock, LW_SHARED))
> > + return 0;
>
> That seems problematic - that lock is often heavily contended.  We could
> instead check IsWALRecordAvailableInXLogBuffers() once before reading the
> page, then read the page contents *without* holding a lock, and then check
> IsWALRecordAvailableInXLogBuffers() again - if the page was replaced in the
> interim we read bogus data, but that's a bit of a wasted effort.

In the new approach described upthread here
https://www.postgresql.org/message-id/c3455ab9da42e09ca9d059879b5c512b2d1f9681.camel%40j-davis.com,
there's no lock required for reading from WAL buffers. PSA patches for
more details.

> > + /*
> > +  * The fact that we acquire WALBufMappingLock while reading 
> > the WAL
> > +  * buffer page itself guarantees that no one else initializes 
> > it or
> > +  * makes it ready for next use in AdvanceXLInsertBuffer(). 
> > However, we
> > +  * need to ensure that we are not reading a page that just got
> > +  * initialized. For this, we look at the needed page header.
> > +  */
> > + phdr = (XLogPageHeader) page;
> > +
> > + /* Return, if WAL buffer page doesn't look valid. */
> > + if (!(phdr->xlp_magic == XLOG_PAGE_MAGIC &&
> > +   phdr->xlp_pageaddr == (ptr - (ptr % XLOG_BLCKSZ)) &&
> > +   phdr->xlp_tli == tli))
> > + break;
>
> I don't think this code should ever encounter a page where this is not the
> case?  We particularly shouldn't do so silently, seems that could hide all
> kinds of problems.

I think it's possible to read a just-initialized page with the new
lock-free approach: the page can be read right after it is initialized
and its xlblocks entry is filled in AdvanceXLInsertBuffer(), but before
any actual WAL is written to it.

> > + /*
> > +  * Note that we don't perform all page header checks here to 
> > avoid
> > +  * extra work in production builds; callers will anyway do 
> > those
> > +  * checks extensively. However, in an assert-enabled build, 
> > we perform
> > +  * all the checks here and raise an error if failed.
> > +  */
>
> Why?

Minimal page header checks are performed to ensure we don't read a page
that just got initialized, unlike the full validation done by
XLogReaderValidatePageHeader(). Are you suggesting removing the
XLogReaderValidatePageHeader() checks from assert-enabled builds, or
doing those checks in production builds too?

PSA v16 patch set. Note that the 0004 patch adds support for WAL read
stats (both from WAL files and WAL buffers) to walsenders, and may not
necessarily be the best approach to capture WAL read stats in light of
https://www.postgresql.org/message-id/CALj2ACU_f5_c8F%2BxyNR4HURjG%3DJziiz07wCpQc%3DAqAJUFh7%2B8w%40mail.gmail.com
which adds WAL read/write/fsync stats to pg_stat_io.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
From b678edfccf7ecf490cb792391249cbf85ba0db29 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Tue, 7 Nov 2023 19:20:00 +
Subject: [PATCH v16] Use 64-bit atomics for xlblocks array elements

In AdvanceXLInsertBuffer(), the xlblocks value of a WAL buffer page is
updated only at the end, after the page is initialized with all
zeros. A problem with this approach is that anyone reading
xlblocks and the WAL buffer page without holding WALBufMappingLock
will see wrong page contents if the read happens before the
xlblocks entry is updated at the end of AdvanceXLInsertBuffer().

To fix this issue, xlblocks is made to use 64-bit atomics instead
of XLogRecPtr and the xlblocks value is marked with

Re: Show WAL write and fsync stats in pg_stat_io

2023-11-07 Thread Bharath Rupireddy
On Wed, Sep 20, 2023 at 1:28 PM Nazir Bilal Yavuz  wrote:
>
> Hi,
>
> Thanks for the review!
>
> Current status of the patch is:
> - IOOBJECT_WAL / IOCONTEXT_NORMAL read, write and fsync stats are added.
> - IOOBJECT_WAL / IOCONTEXT_NORMAL write and fsync tests are added.
> - IOOBJECT_WAL / IOCONTEXT_INIT stats are added.
> - pg_stat_io shows different op_bytes for the IOOBJECT_WAL operations.
> - Working on which 'BackendType / IOContext / IOOp' should be banned in 
> pg_stat_io.
> - PendingWalStats.wal_sync and PendingWalStats.wal_write_time / 
> PendingWalStats.wal_sync_time are moved to pgstat_count_io_op_n() / 
> pgstat_count_io_op_time() respectively.
>
> TODOs:
> - Documentation.
> - Try to set op_bytes for BackendType / IOContext.
> - Decide which 'BackendType / IOContext / IOOp' should not be tracked.
> - Add IOOBJECT_WAL / IOCONTEXT_NORMAL read tests.
> - Add IOOBJECT_WAL / IOCONTEXT_INIT tests.

This patchset currently covers:
- IOOBJECT_WAL / IOCONTEXT_NORMAL read, write and fsync.
- IOOBJECT_WAL / IOCONTEXT_INIT write and fsync.

and doesn't cover:
- Streaming replication WAL IO.

Is there any plan to account for WAL read stats in the WALRead()
function, which would cover walsenders (i.e., WAL read by logical and
streaming replication), WAL read by pg_walinspect, and so on? I see the
patch already covers WAL read stats for recovery in XLogPageRead(), but
not the other page_read callbacks, which all end up in WALRead()
eventually. If added, the feature at
https://www.postgresql.org/message-id/CALj2ACXKKK%3DwbiG5_t6dGao5GoecMwRkhr7GjVBM_jg54%2BNa%3DQ%40mail.gmail.com
can then extend it to cover WAL read from WAL buffer stats.

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Doubled test for SET statements in pg_stat_statements tests

2023-11-07 Thread Sergei Kornilov
Hello!

I noticed that the same block

-- SET statements.
-- These use two different strings, still they count as one entry.
SET work_mem = '1MB';
Set work_mem = '1MB';
SET work_mem = '2MB';
RESET work_mem;
SET enable_seqscan = off;
SET enable_seqscan = on;
RESET enable_seqscan;

is checked twice in contrib/pg_stat_statements/sql/utility.sql, on lines 278-286
and 333-341. Is this on purpose? I think the second set of tests is not
needed and can be removed, as in the attached patch.
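This kind of duplication is easy to spot mechanically, since both copies start with the same marker comment. A sketch using a stand-in file (in a source tree the path would be contrib/pg_stat_statements/sql/utility.sql):

```shell
# Build a stand-in file containing the marker comment twice, then count the
# occurrences; a count greater than 1 flags a duplicated block.
sample=$(mktemp)
printf '%s\n' \
  "-- SET statements." "SET work_mem = '1MB';" \
  "-- SET statements." "SET work_mem = '1MB';" > "$sample"
grep -c -- '-- SET statements\.' "$sample"   # prints 2
rm -f "$sample"
```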

regards, Sergei

diff --git a/contrib/pg_stat_statements/expected/utility.out b/contrib/pg_stat_statements/expected/utility.out
index cc6e898cdf..a913421290 100644
--- a/contrib/pg_stat_statements/expected/utility.out
+++ b/contrib/pg_stat_statements/expected/utility.out
@@ -664,30 +664,3 @@ SELECT pg_stat_statements_reset();
  
 (1 row)
 
--- SET statements.
--- These use two different strings, still they count as one entry.
-SET work_mem = '1MB';
-Set work_mem = '1MB';
-SET work_mem = '2MB';
-RESET work_mem;
-SET enable_seqscan = off;
-SET enable_seqscan = on;
-RESET enable_seqscan;
-SELECT calls, rows, query FROM pg_stat_statements ORDER BY query COLLATE "C";
- calls | rows |               query               
--------+------+-----------------------------------
- 1 |0 | RESET enable_seqscan
- 1 |0 | RESET work_mem
- 1 |1 | SELECT pg_stat_statements_reset()
- 1 |0 | SET enable_seqscan = off
- 1 |0 | SET enable_seqscan = on
- 2 |0 | SET work_mem = '1MB'
- 1 |0 | SET work_mem = '2MB'
-(7 rows)
-
-SELECT pg_stat_statements_reset();
- pg_stat_statements_reset 
---
- 
-(1 row)
-
diff --git a/contrib/pg_stat_statements/sql/utility.sql b/contrib/pg_stat_statements/sql/utility.sql
index 04598e5ae4..3fb8dde68e 100644
--- a/contrib/pg_stat_statements/sql/utility.sql
+++ b/contrib/pg_stat_statements/sql/utility.sql
@@ -329,16 +329,3 @@ DROP TABLE pgss_ctas;
 DROP TABLE pgss_select_into;
 
 SELECT pg_stat_statements_reset();
-
--- SET statements.
--- These use two different strings, still they count as one entry.
-SET work_mem = '1MB';
-Set work_mem = '1MB';
-SET work_mem = '2MB';
-RESET work_mem;
-SET enable_seqscan = off;
-SET enable_seqscan = on;
-RESET enable_seqscan;
-
-SELECT calls, rows, query FROM pg_stat_statements ORDER BY query COLLATE "C";
-SELECT pg_stat_statements_reset();


Re: ensure, not insure

2023-11-07 Thread David Rowley
On Wed, 8 Nov 2023 at 19:56, Peter Smith  wrote:
> gettext_noop("The server will use the fsync() system call in several
> places to make "
>   "sure that updates are physically written to disk. This insures "
>   "that a database cluster will recover to a consistent state after "
>   "an operating system or hardware crash.")
>
> ~
>
> I believe the word should have been "ensures"; not "insures".

I agree. It's surprisingly ancient, having arrived in b700a672f (June 2003).

> In passing I found/fixed a bunch of similar misuses in comments.

Those all look fine to me too.

David




Re: Remove unnecessary 'always:' from CompilerWarnings task

2023-11-07 Thread Peter Eisentraut

On 05.09.23 12:25, Nazir Bilal Yavuz wrote:

There are multiple 'always:' keywords under the CompilerWarnings task.
Instead of that, we can use one 'always:' and move the instructions
under this. So, I removed unnecessary ones and rearranged indents
according to that change.


I'm not sure this change is beneficial.  The way the code is currently 
arranged, it's a bit easier to move or change individual blocks, and 
it's also easier to read the file, because the "always:" is next to each 
"script" and doesn't scroll off the screen.






Re: GenBKI emits useless open;close for catalogs without rows

2023-11-07 Thread Peter Eisentraut

On 08.11.23 08:16, Peter Eisentraut wrote:

On 19.09.23 20:05, Heikki Linnakangas wrote:
One thing caught my eye though: We currently have an "open" command 
after every "create". Except for bootstrap relations; creating a 
bootstrap relation opens it implicitly. That seems like a weird 
inconsistency. If we make "create" to always open the relation, we can 
both make it more consistent and save a few lines. That's maybe worth 
doing, per the attached. It removes the "open" command altogether, as 
it's not needed anymore.


This seems like a good improvement to me.

It would restrict the bootstrap language so that you can only manipulate 
a table right after creating it, but I don't see why that wouldn't be 
sufficient.


Then again, this sort of achieves the opposite of what Matthias was 
aiming for: you are now forcing some relations to be opened even though 
we will end up closing them right away.


(In any case, documentation in bki.sgml would need to be updated for 
this patch.)






Re: GenBKI emits useless open;close for catalogs without rows

2023-11-07 Thread Peter Eisentraut

On 19.09.23 20:05, Heikki Linnakangas wrote:
One thing caught my eye though: We currently have an "open" command 
after every "create". Except for bootstrap relations; creating a 
bootstrap relation opens it implicitly. That seems like a weird 
inconsistency. If we make "create" to always open the relation, we can 
both make it more consistent and save a few lines. That's maybe worth 
doing, per the attached. It removes the "open" command altogether, as 
it's not needed anymore.


This seems like a good improvement to me.

It would restrict the bootstrap language so that you can only manipulate 
a table right after creating it, but I don't see why that wouldn't be 
sufficient.






Re: Synchronizing slots from primary to standby

2023-11-07 Thread Drouvot, Bertrand

Hi,

On 11/8/23 4:50 AM, Amit Kapila wrote:

On Tue, Nov 7, 2023 at 7:58 PM Drouvot, Bertrand
 wrote:


If we think this window is too short we could:

- increase it
or
- don't drop the slot once created (even if there is no activity
on the primary during PrimaryCatchupWaitAttempt attempts) so that
the next loop of attempts will compare with "older" LSN/xmin (as compared
to dropping and re-creating the slot). That way the window would be since
the initial slot creation.



Yeah, this sounds reasonable but we can't mark such slots to be
synced/available for use after failover.


Yeah, currently we are fine as slots are dropped in 
wait_for_primary_slot_catchup() if
we are not in recovery anymore.


I think if we want to follow
this approach then we need to also monitor these slots for any change
in the consecutive cycles and if we are able to sync them then
accordingly we enable them to use after failover.


What about adding a new field in ReplicationSlotPersistentData
indicating that we are waiting for "sync", and dropping such slots during
promotion and/or if not in recovery?


Another somewhat related point is that right now, we just wait for the
change on the first slot (the patch refers to it as the monitoring
slot) for computing nap_time before which we will recheck all the
slots. I think we can improve that as well such that even if any
slot's information is changed, we don't consider changing naptime.



Yeah, that sounds reasonable to me.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: [PGDOCS] Inconsistent linkends to "monitoring" views.

2023-11-07 Thread John Naylor
On Tue, Oct 3, 2023 at 4:40 PM Peter Smith  wrote:
>
> On Tue, Oct 3, 2023 at 6:30 PM Michael Paquier  wrote:
> >
> > On Tue, Oct 03, 2023 at 01:11:15PM +1100, Peter Smith wrote:
> > > I noticed one or two "monitoring" links and linkends that are slightly
> > > inconsistent from all the others.
> >
> > -   
> > +   
> >
> > Is that really worth bothering for the internal link references?
>
> I preferred 100% consistency instead of 95% consistency. YMMV.
>
> > This can create extra backpatching conflicts.
>
> Couldn't the same be said for every patch that fixes a comment typo?
> This is like a link typo, so what's the difference?

My 2 cents: comment typos are visible to readers, so more annoying
when seen in isolation, and less likely to have surroundings that
could change in back branches. Consistency would be preferred, all else
being equal, but then again nothing is wrong with the existing links.
In any case, no one has come out in favor of the patch, so it seems
like it should be rejected unless that changes.

--
John Naylor




Re: Infinite Interval

2023-11-07 Thread jian he
On Wed, Nov 8, 2023 at 7:42 AM Dean Rasheed  wrote:
>
> On Tue, 7 Nov 2023 at 14:33, Dean Rasheed  wrote:
> >
> > New version attached doing that, to run it past the cfbot again.
> >
>
> Ah, Windows Server didn't like that. Trying again with "INT64CONST(0)"
> instead of just "0" in interval_um().
>
> Regards,
> Dean

I found this:
https://developercommunity.visualstudio.com/t/please-implement-integer-overflow-detection/409051
maybe related.




ensure, not insure

2023-11-07 Thread Peter Smith
Hi,

While looking at GUC messages today I noticed the following:

gettext_noop("The server will use the fsync() system call in several
places to make "
  "sure that updates are physically written to disk. This insures "
  "that a database cluster will recover to a consistent state after "
  "an operating system or hardware crash.")

~

I believe the word should have been "ensures"; not "insures".

In passing I found/fixed a bunch of similar misuses in comments.

==
Kind Regards,
Peter Smith.
Fujitsu Australia.


v1-0001-ensure-not-insure.patch
Description: Binary data


Re: Version 14/15 documentation Section "Alter Default Privileges"

2023-11-07 Thread Michael Banck
Hi,

On Tue, Nov 07, 2023 at 05:30:20PM -0500, Bruce Momjian wrote:
> On Mon, Nov  6, 2023 at 09:53:50PM +0100, Laurenz Albe wrote:
> > +  
> > +   There is no way to change the default privileges for objects created by
> > +   arbitrary roles.  You have run ALTER DEFAULT 
> > PRIVILEGES
> 
> I find the above sentence odd.  What is its purpose?

I guess it is to address the main purpose of this patch/confusion with
users: they believe setting DEFAULT PRIVILEGES will set grants
accordingly for all objects created in the future, no matter who creates
them. So hammering in that this is not the case seems fine from my side
(modulo the "You have to run" typo).


Michael




Re: Syncrep and improving latency due to WAL throttling

2023-11-07 Thread Andres Freund
Hi,

On 2023-11-04 20:00:46 +0100, Tomas Vondra wrote:
> scope
> -
> Now, let's talk about scope - what the patch does not aim to do. The
> patch is explicitly intended for syncrep clusters, not async. There have
> been proposals to also support throttling for async replicas, logical
> replication etc. I suppose all of that could be implemented, and I do
> see the benefit of defining some sort of maximum lag even for async
> replicas. But the agreement was to focus on the syncrep case, where it's
> particularly painful, and perhaps extend it in the future.

Perhaps we should take care to make the configuration extensible in that
direction in the future?


Hm - is this feature really tied to replication, at all? Pretty much the same
situation exists without. On an ok-ish local nvme I ran pgbench with 1 client
and -P1. Guess where I started a VACUUM (on a fully cached table, so no
continuous WAL flushes):

progress: 64.0 s, 634.0 tps, lat 1.578 ms stddev 0.477, 0 failed
progress: 65.0 s, 634.0 tps, lat 1.577 ms stddev 0.546, 0 failed
progress: 66.0 s, 639.0 tps, lat 1.566 ms stddev 0.656, 0 failed
progress: 67.0 s, 642.0 tps, lat 1.557 ms stddev 0.273, 0 failed
progress: 68.0 s, 556.0 tps, lat 1.793 ms stddev 0.690, 0 failed
progress: 69.0 s, 281.0 tps, lat 3.568 ms stddev 1.050, 0 failed
progress: 70.0 s, 282.0 tps, lat 3.539 ms stddev 1.072, 0 failed
progress: 71.0 s, 273.0 tps, lat 3.663 ms stddev 2.602, 0 failed
progress: 72.0 s, 261.0 tps, lat 3.832 ms stddev 1.889, 0 failed
progress: 73.0 s, 268.0 tps, lat 3.738 ms stddev 0.934, 0 failed

At 32 clients we go from ~10k to 2.5k, with a full 2s of 0.

Subtracting pg_current_wal_flush_lsn() from pg_current_wal_insert_lsn() the
"good times" show a delay of ~8kB (note that this includes WAL records that
are still being inserted). Once the VACUUM runs, it's ~2-3MB.
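For anyone wanting to reproduce the lag measurement: subtracting two pg_lsn values yields the byte difference directly, so the figure quoted above comes out of a query like this:

```sql
-- Bytes of WAL inserted but not yet flushed (as noted above, this includes
-- records that are still being inserted).
SELECT pg_current_wal_insert_lsn() - pg_current_wal_flush_lsn()
       AS unflushed_bytes;
```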

The picture with more clients is similar.

If I instead severely limit the amount of outstanding (but not the amount of
unflushed) WAL by setting wal_buffers to 128, latency dips quite a bit less
(down to ~400 instead of ~260 at 1 client, ~10k to ~5k at 32).  Of course
that's ridiculous and will completely trash performance in many other cases,
but it shows that limiting the amount of outstanding WAL could help without
replication as well.  With remote storage, that'd likely be a bigger
difference.




> problems
> 
> Now let's talk about some problems - both conceptual and technical
> (essentially review comments for the patch).
>
> 1) The goal of the patch is to limit the impact on latency, but the
> relationship between WAL amounts and latency may not be linear. But we
> don't have a good way to predict latency, and WAL lag is the only thing
> we have, so there's that. Ultimately, it's a best effort.

It's indeed probably not linear. Realistically, to do better, we probably need
statistics for the specific system in question - the latency impact will
differ hugely between different storage/network.


> 2) The throttling is per backend. That makes it simple, but it means
> that it's hard to enforce a global lag limit. Imagine the limit is 8MB,
> and with a single backend that works fine - the lag should not exceed
> the 8MB value. But if there are N backends, the lag could be up to
> N-times 8MB, I believe. That's a bit annoying, but I guess the only
> solution would be to have some autovacuum-like cost balancing, with all
> backends (or at least those running large stuff) doing the checks more
> often. I'm not sure we want to do that.

Hm. The average case is likely fine - the throttling of the different backends
will intersperse and flush more frequently - but the worst case is presumably
part of the issue here. I wonder if we could deal with this by somehow
offsetting the points at which the individual backends flush.

I doubt we want to go for something autovacuum balancing like - that doesn't
seem to work well - but I think we could take the amount of actually unflushed
WAL into account when deciding whether to throttle. We have the necessary
state in local memory IIRC. We'd have to be careful to not throttle every
backend at the same time, or we'll introduce latency penalties that way. But
what if we scaled synchronous_commit_wal_throttle_threshold depending on the
amount of unflushed WAL? By still taking backendWalInserted into account, we'd
avoid throttling everyone at the same time, but still would make throttling
more aggressive depending on the amount of unflushed/unreplicated WAL.


> 3) The actual throttling (flush and wait for syncrep) happens in
> ProcessInterrupts(), which mostly works but it has two drawbacks:
>
>  * It may not happen "early enough" if the backends inserts a lot of
> XLOG records without processing interrupts in between.

Does such code exist? And if so, is there a reason not to fix said code?


>  * It may happen "too early" if the backend inserts enough WAL to need
> throttling (i.e. sets XLogDelayPending), but then after processing
> interrupts it 

Re: pg_upgrade and logical replication

2023-11-07 Thread vignesh C
On Mon, 6 Nov 2023 at 07:51, Peter Smith  wrote:
>
> Here are some review comments for patch v11-0001
>
> ==
> Commit message
>
> 1.
> The subscription's replication origin are needed to ensure
> that we don't replicate anything twice.
>
> ~
>
> /are needed/is needed/

Modified

>
> 2.
> Author: Julien Rouhaud
> Reviewed-by: FIXME
> Discussion: https://postgr.es/m/20230217075433.u5mjly4d5cr4hcfe@jrouhaud
>
> ~
>
> Include Vignesh as another author.

Modified

> ==
> doc/src/sgml/ref/pgupgrade.sgml
>
> 3.
> + pg_upgrade attempts to migrate subscription
> + dependencies which includes the subscription tables information present 
> in
> + pg_subscription_rel
> + system table and the subscription replication origin which
> + will help in continuing logical replication from where the old 
> subscriber
> + was replicating. This helps in avoiding the need for setting up the
>
> I became a bit lost reading paragraph due to the multiple 'which'...
>
> SUGGESTION
> pg_upgrade attempts to migrate subscription dependencies which
> includes the subscription table information present in
> pg_subscription_rel system
> catalog and also the subscription replication origin. This allows
> logical replication on the new subscriber to continue from where the
> old subscriber was up to.

Modified

> ~~~
>
> 4.
> + was replicating. This helps in avoiding the need for setting up the
> + subscription objects manually which requires truncating all the
> + subscription tables and setting the logical replication slots. Migration
>
> SUGGESTION
> Having the ability to migrate subscription objects avoids the need to
> set them up manually, which would require truncating all the
> subscription tables and setting the logical replication slots.

I have removed this

> ~
>
> TBH, I am wondering what is the purpose of this sentence. It seems
> more like a justification for the patch, but does the user need to
> know all this?
>
> ~~~
>
> 5.
> +  
> +   All the subscription tables in the old subscriber should be in
> +   i (initialize), r (ready) or
> +   s (synchronized). This can be verified by checking
> +linkend="catalog-pg-subscription-rel">pg_subscription_rel.srsubstate.
> +  
>
> /should be in/should be in state/

Modified

> ~~~
>
> 6.
> +  
> +   The replication origin entry corresponding to each of the 
> subscriptions
> +   should exist in the old cluster. This can be checking
> +   pg_subscription and
> +linkend="catalog-pg-replication-origin">pg_replication_origin
> +   system tables.
> +  
>
> missing words?
>
> /This can be checking/This can be found by checking/

Modified
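For reference, the two documented pre-checks could be approximated with queries along these lines (a sketch; the second assumes the usual 'pg_' || subscription OID naming of subscription replication origins):

```sql
-- Subscription relations not in state i (init), r (ready) or s (synchronized);
-- the upgrade pre-check expects this to return no rows.
SELECT srsubid, srrelid::regclass, srsubstate
FROM pg_subscription_rel
WHERE srsubstate NOT IN ('i', 'r', 's');

-- Subscriptions lacking a matching replication origin entry.
SELECT s.subname
FROM pg_subscription s
     LEFT JOIN pg_replication_origin o ON o.roname = 'pg_' || s.oid
WHERE o.roname IS NULL;
```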

> ~~~
>
> 7.
> +
> + The subscriptions will be migrated to new cluster in disabled state, 
> they
> + can be enabled after upgrade by following the steps:
> +
>
> The first bullet also says "Enable the subscription..." so I think
> this paragraph should be worded like the below.
>
> SUGGESTION
> The subscriptions will be migrated to the new cluster in a disabled
> state. After migration, do this:

Modified

> ==
> src/backend/catalog/pg_subscription.c
>
> 8.
>  #include "nodes/makefuncs.h"
> +#include "replication/origin.h"
> +#include "replication/worker_internal.h"
>  #include "storage/lmgr.h"
>
> Why does this change need to be in the patch when there are no other
> code changes in this file?

Modified

> ==
> src/backend/utils/adt/pg_upgrade_support.c
>
> 9. binary_upgrade_create_sub_rel_state
>
> IMO a better name for this function would be
> 'binary_upgrade_add_sub_rel_state' (because it delegates to
> AddSubscriptionRelState).
>
> Then it would obey the same name pattern as the other function
> 'binary_upgrade_replorigin_advance' (which delegates to
> replorigin_advance).

Modified

> ~~~
>
> 10.
> +/*
> + * binary_upgrade_create_sub_rel_state
> + *
> + * Add the relation with the specified relation state to pg_subscription_rel
> + * table.
> + */
> +Datum
> +binary_upgrade_create_sub_rel_state(PG_FUNCTION_ARGS)
> +{
> + Relation rel;
> + HeapTuple tup;
> + Oid subid;
> + Form_pg_subscription form;
> + char*subname;
> + Oid relid;
> + char relstate;
> + XLogRecPtr sublsn;
>
> 10a.
> /to pg_subscription_rel table./to pg_subscription_rel catalog./

Modified

> ~
>
> 10b.
> Maybe it would be helpful if the function argument were documented
> up-front in the function-comment, or in the variable declarations.
>
> SUGGESTION
> char  *subname;  /* ARG0 = subscription name */
> Oidrelid;/* ARG1 = relation Oid */
> char   relstate; /* ARG2 = subrel state */
> XLogRecPtr sublsn;   /* ARG3 (optional) = subscription lsn */

I felt the variables are self-explanatory in this case, and also
consistent with other functions.

> ~~~
>
> 11.
> if (PG_ARGISNULL(3))
> sublsn = InvalidXLogRecPtr;
> else
> sublsn = PG_GETARG_LSN(3);
> FWIW, I'd write that as a one-line ternary assignment allowing all the
> args to be 

Re: [PoC] Implementation of distinct in Window Aggregates: take two

2023-11-07 Thread Shlok Kyal
Hi,

I went through the Cfbot, and some of the test cases are failing for
this patch. It seems like some tests are crashing:
https://api.cirrus-ci.com/v1/artifact/task/6291153444667392/crashlog/crashlog-postgres.exe_03b0_2023-11-07_10-41-39-624.txt

[10:46:56.546] Summary of Failures:
[10:46:56.546]
[10:46:56.546] 87/270 postgresql:postgres_fdw / postgres_fdw/regress
ERROR 11.10s exit status 1
[10:46:56.546] 82/270 postgresql:regress / regress/regress ERROR
248.55s exit status 1
[10:46:56.546] 99/270 postgresql:recovery /
recovery/027_stream_regress ERROR 161.40s exit status 29
[10:46:56.546] 98/270 postgresql:pg_upgrade /
pg_upgrade/002_pg_upgrade ERROR 253.31s exit status 29

link of tests failing:
https://cirrus-ci.com/task/664299716712
https://cirrus-ci.com/task/4602303584403456
https://cirrus-ci.com/task/5728203491246080
https://cirrus-ci.com/task/5165253537824768?logs=test_world#L511
https://cirrus-ci.com/task/6291153444667392

Thanks
Shlok Kumar Kyal




Re: Fix use of openssl.path() if openssl isn't found

2023-11-07 Thread Tristan Partin

On Tue Nov 7, 2023 at 11:53 PM CST, Michael Paquier wrote:

On Tue, Nov 07, 2023 at 04:06:56PM -0600, Tristan Partin wrote:
> Found this issue during my Fedora 39 upgrade. Tested that uninstalling
> openssl still allows the various ssl tests to run and succeed.

Good catch.  You are right that this is inconsistent with what we
expect in the test.

> +openssl_path = ''
> +if openssl.found()
> +  openssl_path = openssl.path()
> +endif
> +
>  tests += {
>'name': 'ssl',
>'sd': meson.current_source_dir(),
> @@ -7,7 +12,7 @@ tests += {
>'tap': {
>  'env': {
>'with_ssl': ssl_library,
> -  'OPENSSL': openssl.path(),
> +  'OPENSSL': openssl_path,
>  },
>  'tests': [
>'t/001_ssltests.pl',

Okay, that's a nit and it leads to the same result, but why not using
the same one-liner style like all the other meson.build files that
rely on optional commands?  See pg_verifybackup, pg_dump, etc.  That
would be more consistent.


Because I forgot there were ternary statements in Meson :). Thanks for 
the review. Here is v2.


--
Tristan Partin
Neon (https://neon.tech)
From 5fc0460b0b85b8f04976182c0f6ec650c40df833 Mon Sep 17 00:00:00 2001
From: Tristan Partin 
Date: Tue, 7 Nov 2023 15:59:04 -0600
Subject: [PATCH v2] Fix use of openssl.path() if openssl isn't found

openssl(1) is an optional dependency of the Postgres Meson build, but
was inadvertently required when defining some SSL tests.
---
 src/test/ssl/meson.build | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/test/ssl/meson.build b/src/test/ssl/meson.build
index 4cda81f3bc..ed83c438ef 100644
--- a/src/test/ssl/meson.build
+++ b/src/test/ssl/meson.build
@@ -7,7 +7,7 @@ tests += {
   'tap': {
 'env': {
   'with_ssl': ssl_library,
-  'OPENSSL': openssl.path(),
+  'OPENSSL': openssl.found() ? openssl.path() : '',
 },
 'tests': [
   't/001_ssltests.pl',
-- 
Tristan Partin
Neon (https://neon.tech)



Re: Show WAL write and fsync stats in pg_stat_io

2023-11-07 Thread Michael Paquier
On Wed, Nov 08, 2023 at 10:27:44AM +0900, Michael Paquier wrote:
> Yep, I'd be OK with that as well to maintain compatibility.

By the way, note that the patch is failing to apply, and that I've
switched it to waiting on author on 10/26.
--
Michael




Re: Fix use of openssl.path() if openssl isn't found

2023-11-07 Thread Michael Paquier
On Tue, Nov 07, 2023 at 04:06:56PM -0600, Tristan Partin wrote:
> Found this issue during my Fedora 39 upgrade. Tested that uninstalling
> openssl still allows the various ssl tests to run and succeed.

Good catch.  You are right that this is inconsistent with what we
expect in the test.

> +openssl_path = ''
> +if openssl.found()
> +  openssl_path = openssl.path()
> +endif
> +
>  tests += {
>'name': 'ssl',
>'sd': meson.current_source_dir(),
> @@ -7,7 +12,7 @@ tests += {
>'tap': {
>  'env': {
>'with_ssl': ssl_library,
> -  'OPENSSL': openssl.path(),
> +  'OPENSSL': openssl_path,
>  },
>  'tests': [
>'t/001_ssltests.pl',

Okay, that's a nit and it leads to the same result, but why not using
the same one-liner style like all the other meson.build files that
rely on optional commands?  See pg_verifybackup, pg_dump, etc.  That
would be more consistent.
--
Michael




Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

2023-11-07 Thread Dilip Kumar
On Wed, Nov 8, 2023 at 10:52 AM Amul Sul  wrote:

Thanks for the review, Amul.

> Here are some minor comments:
>
> + * By default, we'll use 1MB of for every 1GB of shared buffers, up to the
> + * maximum value that slru.c will allow, but always at least 16 buffers.
>   */
>  Size
>  CommitTsShmemBuffers(void)
>  {
> -   return Min(256, Max(4, NBuffers / 256));
> +   /* Use configured value if provided. */
> +   if (commit_ts_buffers > 0)
> +   return Max(16, commit_ts_buffers);
> +   return Min(256, Max(16, NBuffers / 256));
>
> Do you mean "4MB of for every 1GB"  in the comment?

You are right

> --
>
> diff --git a/src/include/access/commit_ts.h b/src/include/access/commit_ts.h
> index 5087cdce51..78d017ad85 100644
> --- a/src/include/access/commit_ts.h
> +++ b/src/include/access/commit_ts.h
> @@ -16,7 +16,6 @@
>  #include "replication/origin.h"
>  #include "storage/sync.h"
>
> -
>  extern PGDLLIMPORT bool track_commit_timestamp;
>
> A spurious change.

Will fix

> --
>
> @@ -168,10 +180,19 @@ SimpleLruShmemSize(int nslots, int nlsns)
>
> if (nlsns > 0)
> sz += MAXALIGN(nslots * nlsns * sizeof(XLogRecPtr));/* 
> group_lsn[] */
> -
> return BUFFERALIGN(sz) + BLCKSZ * nslots;
>  }
>
> Another spurious change in 0002 patch.

Will fix

> --
>
> +/*
> + * The slru buffer mapping table is partitioned to reduce contention. To
> + * determine which partition lock a given pageno requires, compute the 
> pageno's
> + * hash code with SlruBufTableHashCode(), then apply SlruPartitionLock().
> + */
>
> I didn't see SlruBufTableHashCode() & SlruPartitionLock() functions anywhere 
> in
> your patches, is that outdated comment?

Yes, I will fix it; actually, there are some major design changes to this.

> --
>
> -   sz += MAXALIGN(nslots * sizeof(LWLockPadded));  /* buffer_locks[] */
> -   sz += MAXALIGN(SLRU_NUM_PARTITIONS * sizeof(LWLockPadded)); /* 
> part_locks[] */
> +   sz += MAXALIGN((nslots + SLRU_NUM_PARTITIONS) * sizeof(LWLockPadded));  
> /* locks[] */
>
> I am a bit uncomfortable with these changes; merging the partition and buffer
> locks makes the code harder to understand. I am not sure what we gain from
> this.

Yes, I do not like this much either because it is confusing.  But the
advantage is that we use a single pointer for the locks, which means
the next variable, the LRU counter, comes in the same cacheline, so
frequent updates of the LRU counter should benefit from this --
although I don't have any numbers that prove it.  Currently, I want to
focus on all the base patches and keep this patch as an add-on; later,
if we find it useful and want to pursue it, we will see how to make it
more readable.
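For what it's worth, the layout claim can be written down as a small struct sketch. BankSketch and CACHELINE_SIZE are made-up names for illustration; the real struct is SlruSharedData and Postgres uses PG_CACHE_LINE_SIZE:

```c
#include <assert.h>
#include <stddef.h>

#define CACHELINE_SIZE 64   /* a typical size; Postgres defines PG_CACHE_LINE_SIZE */

/*
 * Illustrative sketch only: with a single pointer covering both the buffer
 * locks and the bank locks, the frequently updated LRU counter lands right
 * after that pointer, i.e. in the same cacheline that is already touched
 * when the locks pointer is read.
 */
typedef struct BankSketch
{
    void   *locks;          /* buffer locks followed by bank/partition locks */
    int     cur_lru_count;  /* bumped on every recently-used hit */
} BankSketch;
```

A static check like `offsetof(BankSketch, cur_lru_count) < CACHELINE_SIZE` is one way such a layout claim could be verified at compile time.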


>
> Subject: [PATCH v4 5/5] Ensure slru buffer slots are in multiple of number of
>  partitions
>
> I think the 0005 patch can be merged to 0001.

Yeah, in the next version it is done that way.  Planning to post by the end of the day.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Fix output of zero privileges in psql

2023-11-07 Thread Shubham Khanna
On Wed, Nov 8, 2023 at 10:46 AM Laurenz Albe  wrote:
>
> On Sat, 2023-10-21 at 04:29 +0200, Erik Wienhold wrote:
> > The attached v3 of my initial patch
> > does that.  It also includes Laurenz' fix to no longer ignore \pset null
> > (minus the doc changes that suggest using \pset null to distinguish
> > between default and empty privileges because that's no longer needed).
>
> Thanks!
>
> I went over the patch, fixed some problems and added some more stuff from
> my patch.
>
> In particular:
>
>   --- a/doc/src/sgml/ddl.sgml
>   +++ b/doc/src/sgml/ddl.sgml
>   @@ -2353,7 +2353,9 @@ GRANT SELECT (col1), UPDATE (col1) ON mytable TO 
> miriam_rw;
>  
>   If the Access privileges column is empty for a given
>   object, it means the object has default privileges (that is, its
>   -   privileges entry in the relevant system catalog is null).  Default
>   +   privileges entry in the relevant system catalog is null).  The column 
> shows
>   +   (none) for empty privileges (that is, no privileges 
> at
>   +   all, even for the object owner, a rare occurrence).  Default
>   privileges always include all privileges for the owner, and can include
>   some privileges for PUBLIC depending on the object
>   type, as explained above.  The first GRANT
>
> This description of empty privileges is smack in the middle of describing
> default privileges.  I thought that was confusing and moved it to its
> own paragraph.
>
>   --- a/src/bin/psql/describe.c
>   +++ b/src/bin/psql/describe.c
>   @@ -6718,7 +6680,13 @@ static void
>printACLColumn(PQExpBuffer buf, const char *colname)
>{
>   appendPQExpBuffer(buf,
>   - "pg_catalog.array_to_string(%s, E'\\n') AS \"%s\"",
>   + "CASE\n"
>   + "  WHEN %s IS NULL THEN ''\n"
>   + "  WHEN pg_catalog.cardinality(%s) = 0 THEN '%s'\n"
>   + "  ELSE pg_catalog.array_to_string(%s, E'\\n')\n"
>   + "END AS \"%s\"",
>   + colname,
>   + colname, gettext_noop("(none)"),
> colname, gettext_noop("Access privileges"));
>}
>
> This erroneously displays NULL as empty string and subverts my changes.
> I have removed the first branch of the CASE expression.
>
>   --- a/src/test/regress/expected/psql.out
>   +++ b/src/test/regress/expected/psql.out
>   @@ -6663,3 +6663,97 @@ DROP ROLE regress_du_role0;
>DROP ROLE regress_du_role1;
>DROP ROLE regress_du_role2;
>DROP ROLE regress_du_admin;
>   +-- Test empty privileges.
>   +BEGIN;
>   +WARNING:  there is already a transaction in progress
>
> This warning is caused by a pre-existing error in the regression test, which
> forgot to close the transaction.  I have added a COMMIT at the appropriate 
> place.
>
>   +ALTER TABLESPACE regress_tblspace OWNER TO CURRENT_USER;
>   +REVOKE ALL ON TABLESPACE regress_tblspace FROM CURRENT_USER;
>   +\db+ regress_tblspace
>   +List of tablespaces
>   +   Name   | Owner  |Location | Access 
> privileges | Options |  Size   | Description
>   
> +--++-+---+-+-+-
>   + regress_tblspace | regress_zeropriv_owner | pg_tblspc/16385 | (none)  
>   | | 0 bytes |
>   +(1 row)
>
> This test is not stable, since it contains the OID of the tablespace, which
> is different every time.
>
>   +ALTER DATABASE :"DBNAME" OWNER TO CURRENT_USER;
>   +REVOKE ALL ON DATABASE :"DBNAME" FROM CURRENT_USER, PUBLIC;
>   +\l :"DBNAME"
>   +List of databases
>   +Name| Owner  | Encoding  | Locale Provider | 
> Collate | Ctype | ICU Locale | ICU Rules | Access privileges
>   
> +++---+-+-+---++---+---
>   + regression | regress_zeropriv_owner | SQL_ASCII | libc| C 
>   | C ||   | (none)
>   +(1 row)
>
> This test is also not stable, since it depends on the locale definition
> of the regression test database.  If you use "make installcheck", that could
> be a different locale.
>
> I think that these tests are not absolutely necessary, and the other tests
> are sufficient.  Consequently, I took the simple road of removing them.
>
> I also tried to improve the commit message.
>
> Patch attached.

I tested the patch with the modified changes and it is working fine.

Thanks and regards,
Shubham Khanna.




Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

2023-11-07 Thread Amul Sul
On Fri, Nov 3, 2023 at 10:59 AM Dilip Kumar  wrote:

> On Mon, Oct 30, 2023 at 11:50 AM Dilip Kumar 
> wrote:
> > [...]
> [1] 0001-Make-all-SLRU-buffer-sizes-configurable: This is the same
> patch as the previous patch set
> [2] 0002-Add-a-buffer-mapping-table-for-SLRUs: Patch to introduce
> buffer mapping hash table
> [3] 0003-Partition-wise-slru-locks: Partition the hash table and also
> introduce partition-wise locks: this is a merge of 0003 and 0004 from
> the previous patch set but instead of bank-wise locks it has
> partition-wise locks and LRU counter.
> [4] 0004-Merge-partition-locks-array-with-buffer-locks-array: merging
> buffer locks and bank locks in the same array so that the bank-wise
> LRU counter does not fetch the next cache line in a hot function
> SlruRecentlyUsed()(same as 0005 from the previous patch set)
> [5] 0005-Ensure-slru-buffer-slots-are-in-multiple-of-number-of: Ensure
> that the number of slots is in multiple of the number of banks
> [...]


Here are some minor comments:

+ * By default, we'll use 1MB of for every 1GB of shared buffers, up to the
+ * maximum value that slru.c will allow, but always at least 16 buffers.
  */
 Size
 CommitTsShmemBuffers(void)
 {
-   return Min(256, Max(4, NBuffers / 256));
+   /* Use configured value if provided. */
+   if (commit_ts_buffers > 0)
+   return Max(16, commit_ts_buffers);
+   return Min(256, Max(16, NBuffers / 256));

Do you mean "4MB of for every 1GB"  in the comment?
--

diff --git a/src/include/access/commit_ts.h b/src/include/access/commit_ts.h
index 5087cdce51..78d017ad85 100644
--- a/src/include/access/commit_ts.h
+++ b/src/include/access/commit_ts.h
@@ -16,7 +16,6 @@
 #include "replication/origin.h"
 #include "storage/sync.h"

-
 extern PGDLLIMPORT bool track_commit_timestamp;

A spurious change.
--

@@ -168,10 +180,19 @@ SimpleLruShmemSize(int nslots, int nlsns)

if (nlsns > 0)
sz += MAXALIGN(nslots * nlsns * sizeof(XLogRecPtr));/*
group_lsn[] */
-
return BUFFERALIGN(sz) + BLCKSZ * nslots;
 }

Another spurious change in 0002 patch.
--

+/*
+ * The slru buffer mapping table is partitioned to reduce contention. To
+ * determine which partition lock a given pageno requires, compute the
pageno's
+ * hash code with SlruBufTableHashCode(), then apply SlruPartitionLock().
+ */

I didn't see SlruBufTableHashCode() & SlruPartitionLock() functions
anywhere in
your patches, is that outdated comment?
--

-   sz += MAXALIGN(nslots * sizeof(LWLockPadded));  /* buffer_locks[] */
-   sz += MAXALIGN(SLRU_NUM_PARTITIONS * sizeof(LWLockPadded)); /*
part_locks[] */
+   sz += MAXALIGN((nslots + SLRU_NUM_PARTITIONS) * sizeof(LWLockPadded));
 /* locks[] */

I am a bit uncomfortable with these changes; merging the partition and buffer
locks makes the code harder to understand. I am not sure what we gain from
this.
--

Subject: [PATCH v4 5/5] Ensure slru buffer slots are in multiple of number of
 partitions

I think the 0005 patch can be merged to 0001.

Regards,
Amul


Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

2023-11-07 Thread Amul Sul
On Mon, Nov 6, 2023 at 4:44 PM Andrey M. Borodin 
wrote:

>
>
> > On 6 Nov 2023, at 14:31, Alvaro Herrera  wrote:
> >
> > dynahash is notoriously slow, which is why we have simplehash.h since
> > commit b30d3ea824c5.  Maybe we could use that instead.
>
> Dynahash has lock partitioning. Simplehash has not, AFAIK.
> The thing is we do not really need a hash function - pageno is already the
> best hash function itself. And we do not need to cope with collisions much
> - we can evict a collided buffer.
>
> Given this we do not need a hashtable at all. That’s exactly how the banks
> emerged: I started implementing a dynahash patch in April 2021 and found
> out that the “banks” approach is cleaner. However, the term “bank” is not
> common in software; it’s taken from hardware caches.
>

I agree that we don't need a hash function to generate a hash value out of
pageno, which is itself sufficient, but I don't understand how we can get rid
of the hash table itself -- how would we map a pageno to a slot number? Is
that mapping not needed at all?
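If I read the earlier description correctly, the mapping is implicit rather than stored: the pageno deterministically selects a bank, and only that bank's few slots are ever searched (or evicted from). A rough, hypothetical sketch of the idea, not the actual patch code:

```c
#include <assert.h>

#define SLOTS_PER_BANK 16   /* hypothetical bank width */

/*
 * Illustrative sketch only: pageno itself serves as the "hash" -- a modulo
 * picks the bank, and only that bank's slots are scanned.  On a miss the
 * caller evicts some victim inside that same bank, so no separate
 * pageno-to-slot mapping table is needed.
 */
static int
bank_lookup(const int *page_buffer, int nbanks, int pageno)
{
    int         bankstart = (pageno % nbanks) * SLOTS_PER_BANK;

    for (int slotno = bankstart; slotno < bankstart + SLOTS_PER_BANK; slotno++)
    {
        if (page_buffer[slotno] == pageno)
            return slotno;  /* cache hit */
    }
    return -1;              /* miss: evict within this bank's slot range */
}
```

A collision then simply means two pagenos compete for the same bank; one of them gets evicted, which is why no chaining or open addressing is required.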

Regards,
Amul


Re: Reconcile stats in find_tabstat_entry() and get rid of PgStat_BackendFunctionEntry

2023-11-07 Thread Andres Freund
On 2023-10-30 08:25:17 +0900, Michael Paquier wrote:
> On Fri, Oct 27, 2023 at 09:45:52AM +0200, Drouvot, Bertrand wrote:
> > Done in V8 attached (pgindent has been run on pgstatfuncs.c and
> > pgstat_relation.c).
> 
> And applied that after editing a bit the comments.

Thank you both!




Re: Add new option 'all' to pg_stat_reset_shared()

2023-11-07 Thread Andres Freund
Hi,

On 2023-11-08 10:08:42 +0900, Kyotaro Horiguchi wrote:
> At Mon, 6 Nov 2023 14:00:13 +0530, Bharath Rupireddy 
>  wrote in 
> > On Mon, Nov 6, 2023 at 11:39 AM torikoshia  
> > wrote:
> > > Since each stats, except wal_prefetch was reset acquiring LWLock,
> > > attached PoC patch makes the call atomic by using these LWlocks.
> > >
> > > If this is the right direction, I'll try to make wal_prefetch also take
> > > LWLock.
> 
> I must say, I have reservations about this approach. The main concern
> is the duplication of reset code, which has been efficiently
> encapsulated for individual targets, into this location. This practice
> degrades the maintainability and clarity of the code.

Yes, as-is this seems to evolve the code in precisely the wrong direction. We
want less central awareness of different types of stats, not more. The
proposed new code is far longer than the current pg_stat_reset(), despite
doing something conceptually simpler.

Greetings,

Andres Freund




Re: Add new option 'all' to pg_stat_reset_shared()

2023-11-07 Thread Andres Freund
Hi,

On 2023-11-06 14:00:13 +0530, Bharath Rupireddy wrote:
> On Mon, Nov 6, 2023 at 11:39 AM torikoshia  wrote:
> >
> > Thanks all for the comments!
> >
> > On Fri, Nov 3, 2023 at 5:17 AM Matthias van de Meent
> >  wrote:
> > > Knowing that your metrics have a shared starting point can be quite
> > > valuable, as it allows you to do some math that would otherwise be
> > > much less accurate when working with stats over a short amount of
> > > time. I've not used these stats systems much myself, but skew between
> > > metrics caused by different reset points can be difficult to detect
> > > and debug, so I think an atomic call to reset all these stats could be
> > > worth implementing.
> >
> > Since each stats, except wal_prefetch was reset acquiring LWLock,
> > attached PoC patch makes the call atomic by using these LWlocks.
> >
> > If this is the right direction, I'll try to make wal_prefetch also take
> > LWLock.
> 
> +// Acquire LWLocks
> +LWLock *locks[] = {_archiver->lock,  _bgwriter->lock,
> +   _checkpointer->lock, _wal->lock};
> +
> +for (int i = 0; i < STATS_SHARED_NUM_LWLOCK; i++)
> +LWLockAcquire(locks[i], LW_EXCLUSIVE);
> +
> +for (int i = 0; i < BACKEND_NUM_TYPES; i++)
> +{
> +LWLock*bktype_lock = _io->locks[i];
> +LWLockAcquire(bktype_lock, LW_EXCLUSIVE);
> +}
> 
> Well, that's a total of ~17 LWLocks this new function takes to make
> the stats reset atomic. I'm not sure if this atomicity is worth the
> effort which can easily be misused - what if someone runs something
> like SELECT pg_stat_reset_shared() FROM generate_series(1,
> 10n); to cause heavy lock acquisition and release cycles?

Yea, this seems like an *extremely* bad idea to me. Without careful analysis
it could very well cause deadlocks.


> IMV, atomicity is not something that applies for the stats reset
> operation because stats are approximate numbers by nature after all.
> If the pg_stat_reset_shared() resets stats for only a bunch of stats
> types and fails, it's the basic application programming style that
> when a query fails it's the application that needs to have a retry
> mechanism. FWIW, the atomicity doesn't apply today if someone wants to
> reset stats in a loop for all stats types.

Yea. Additionally it's not really atomic regardless of the lwlocks, due to
various processes all accumulating in local counters first, and only
occasionally updating the shared data. So even after holding all the locks at
the same time, the shared stats would still not actually represent a truly
atomic state.
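The effect is easy to show with a toy model (all names here are made up; this is not how pgstat is actually structured):

```c
#include <assert.h>

/*
 * Toy model only: a backend counts in backend-local variables and flushes
 * them to shared memory occasionally, so resetting the shared state -- even
 * while holding every lock -- still misses in-flight local counts, which
 * later get flushed on top of the "reset" state.
 */
static int  shared_reads;   /* stand-in for the locked shared-memory stats */
static int  local_reads;    /* stand-in for a backend's pending local counter */

static void count_read(void)    { local_reads++; }
static void flush_pending(void) { shared_reads += local_reads; local_reads = 0; }
static void reset_shared(void)  { shared_reads = 0; /* local_reads survives */ }
```

A read counted before the reset but flushed after it reappears in the totals, so the reset never captured a truly atomic snapshot.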


> 2.
> +{ oid => '8000',
> +  descr => 'statistics: reset collected statistics shared across the 
> cluster',
> +  proname => 'pg_stat_reset_shared', provolatile => 'v', prorettype => 
> 'void',
> +  proargtypes => '', prosrc => 'pg_stat_reset_shared_all' },
> 
> Why a new function consuming the oid? Why can't we just do the trick
> of proisstrict => 'f' and if (PG_ARGISNULL(0)) { reset all stats} else
> {reset specified stats kind} like the pg_stat_reset_slru()?

It's not like oids are a precious resource. It's a more confusing API to
have to specify a NULL as an argument than not having to do so. If we
really want to avoid a separate oid, a more sensible path would be to add a
default argument to pg_stat_reset_slru() (by doing a CREATE OR REPLACE in
system_functions.sql).

Greetings,

Andres Freund




Re: [PoC] pg_upgrade: allow to upgrade publisher node

2023-11-07 Thread Amit Kapila
On Wed, Nov 8, 2023 at 8:44 AM vignesh C  wrote:
>
> While verifying upgrade of subscriber patch, I found one issue with
> upgrade in verbose mode.
> I was able to reproduce this issue by performing an upgrade with the
> verbose option.
>
> The trace for the same is given below:
> Program received signal SIGSEGV, Segmentation fault.
> __strlen_sse2 () at ../sysdeps/x86_64/multiarch/strlen-vec.S:126
> 126    ../sysdeps/x86_64/multiarch/strlen-vec.S: No such file or
> directory.
> (gdb) bt
> #0  __strlen_sse2 () at ../sysdeps/x86_64/multiarch/strlen-vec.S:126
> #1  0x5556f572 in dopr (target=0x7fffbb90,
> format=0x5557859e "\", plugin: \"%s\", two_phase: %s",
> args=0x7fffdc40) at snprintf.c:444
> #2  0x5556ed95 in pg_vsnprintf (str=0x7fffbc10 "slot_name:
> \"ication slots within the database:", count=8192, fmt=0x55578590
> "slot_name: \"%s\", plugin: \"%s\", two_phase: %s",
> args=0x7fffdc40) at snprintf.c:195
> #3  0x555667e3 in pg_log_v (type=PG_VERBOSE,
> fmt=0x55578590 "slot_name: \"%s\", plugin: \"%s\", two_phase: %s",
> ap=0x7fffdc40) at util.c:184
> #4  0x55566b38 in pg_log (type=PG_VERBOSE, fmt=0x55578590
> "slot_name: \"%s\", plugin: \"%s\", two_phase: %s") at util.c:264
> #5  0x55561a06 in print_slot_infos (slot_arr=0x55595ed0)
> at info.c:813
> #6  0x5556186e in print_db_infos (db_arr=0x55587518
> ) at info.c:782
> #7  0x555606da in get_db_rel_and_slot_infos
> (cluster=0x555874a0 , live_check=false) at info.c:308
> #8  0x839a in check_new_cluster () at check.c:215
> #9  0x55563010 in main (argc=13, argv=0x7fffdf08) at
> pg_upgrade.c:136
>
> This issue occurs because we are accessing uninitialized slot array 
> information.
>

Thanks for the report. I'll review your proposed fix.

-- 
With Regards,
Amit Kapila.




Re: Synchronizing slots from primary to standby

2023-11-07 Thread Amit Kapila
On Tue, Nov 7, 2023 at 7:58 PM Drouvot, Bertrand
 wrote:
>
> On 11/7/23 11:55 AM, Amit Kapila wrote:
> >>>
> >>> This is not full proof solution but optimization over first one. Now
> >>> in any sync-cycle, we take 2 attempts for slots-creation (if any slots
> >>> are available to be created). In first attempt, we do not wait
> >>> indefinitely on inactive slots, we wait only for a fixed amount of
> >>> time and if remote-slot is still behind, then we add that to the
> >>> pending list and move to the next slot. Once we are done with first
> >>> attempt, in second attempt, we go for the pending ones and now we wait
> >>> on each of them until the primary catches up.
> >>
> >> Aren't we "just" postponing the "issue"? I mean if there is really no 
> >> activity
> >> on, say, the first created slot, then once we move to the second attempt 
> >> then any newly
> >> created slot from that time would wait to be synced forever, no?
> >>
> >
> > We have to wait at some point in time for such inactive slots and the
> > same is true even for manually created slots on standby. Do you have
> > any better ideas to deal with it?
> >
>
> What about:
>
> - get rid of the second attempt and the pending_slot_list
> - keep the wait_count and PrimaryCatchupWaitAttempt logic
>
> so basically, get rid of:
>
> /*
>  * Now sync the pending slots which were failed to be created in first
>  * attempt.
>  */
> foreach(cell, pending_slot_list)
> {
> RemoteSlot *remote_slot = (RemoteSlot *) lfirst(cell);
>
> /* Wait until the primary server catches up */
> PrimaryCatchupWaitAttempt = 0;
>
> synchronize_one_slot(wrconn, remote_slot, NULL);
> }
>
> and the pending_slot_list list.
>
> That way, for each slot that have not been created and synced yet:
>
> - it will be created on the standby
> - we will wait up to PrimaryCatchupWaitAttempt attempts
> - the slot will be synced or removed on/from the standby
>
> That way an inactive slot on the primary would not "block"
> any other slots on the standby.
>
> By "created" here I mean calling ReplicationSlotCreate() (not to be confused
> with emitting "ereport(LOG, errmsg("created slot \"%s\" locally", 
> remote_slot->name)); "
> which is confusing as mentioned up-thread).
>
> The problem I can see with this proposal is that the "sync" window waiting
> for slot activity on the primary is "only" during the 
> PrimaryCatchupWaitAttempt
> attempts (as the slot will be dropped/recreated).
>
> If we think this window is too short we could:
>
> - increase it
> or
> - don't drop the slot once created (even if there is no activity
> on the primary during PrimaryCatchupWaitAttempt attempts) so that
> the next loop of attempts will compare with "older" LSN/xmin (as compare to
> dropping and re-creating the slot). That way the window would be since the
> initial slot creation.
>

Yeah, this sounds reasonable, but we can't mark such slots as
synced/available for use after failover. I think if we want to follow
this approach then we also need to monitor these slots for any change
in the subsequent cycles, and if we are able to sync them, then
accordingly enable them for use after failover.

Another somewhat related point is that right now we wait only for a
change on the first slot (the patch refers to it as the monitoring
slot) when computing the nap_time before which we will recheck all the
slots. I think we can improve that as well, since currently, even if
some other slot's information has changed, we don't consider changing
the naptime.

-- 
With Regards,
Amit Kapila.




Re: [PoC] pg_upgrade: allow to upgrade publisher node

2023-11-07 Thread vignesh C
On Tue, 7 Nov 2023 at 13:25, Amit Kapila  wrote:
>
> On Tue, Nov 7, 2023 at 10:01 AM Zhijie Hou (Fujitsu)
>  wrote:
> >
> > On Tuesday, November 7, 2023 12:14 PM Kuroda, Hayato/黒田 隼人 
> >  wrote:
> > >
> > > Dear hackers,
> > >
> > > PSA the patch to solve the issue [1].
> > >
> > > Kindly Peter E. and Andrew raised an issue that delete_old_cluster.sh is
> > > generated in the source directory, even when the VPATH/meson build.
> > > This can avoid by changing the directory explicitly.
> > >
> > > [1]:
> > > https://www.postgresql.org/message-id/flat/7b8a9460-5668-b372-04e6-7b
> > > 52e9308493%40dunslane.net#554090099bbbd12c94bf570665a6badf
> >
> > Thanks for the patch, I have confirmed that the files won't be generated
> > in source directory after applying the patch.
> >
> > After running: "meson test -C build/ --suite pg_upgrade",
> > The files are in the test directory:
> > ./build/testrun/pg_upgrade/003_logical_slots/data/delete_old_cluster.sh
> >
>
> Thanks for the patch and verification. Pushed the fix.

While verifying upgrade of subscriber patch, I found one issue with
upgrade in verbose mode.
I was able to reproduce this issue by performing an upgrade with the
verbose option.

The trace for the same is given below:
Program received signal SIGSEGV, Segmentation fault.
__strlen_sse2 () at ../sysdeps/x86_64/multiarch/strlen-vec.S:126
126    ../sysdeps/x86_64/multiarch/strlen-vec.S: No such file or directory.
(gdb) bt
#0  __strlen_sse2 () at ../sysdeps/x86_64/multiarch/strlen-vec.S:126
#1  0x5556f572 in dopr (target=0x7fffbb90,
format=0x5557859e "\", plugin: \"%s\", two_phase: %s",
args=0x7fffdc40) at snprintf.c:444
#2  0x5556ed95 in pg_vsnprintf (str=0x7fffbc10 "slot_name:
\"ication slots within the database:", count=8192, fmt=0x55578590
"slot_name: \"%s\", plugin: \"%s\", two_phase: %s",
args=0x7fffdc40) at snprintf.c:195
#3  0x555667e3 in pg_log_v (type=PG_VERBOSE,
fmt=0x55578590 "slot_name: \"%s\", plugin: \"%s\", two_phase: %s",
ap=0x7fffdc40) at util.c:184
#4  0x55566b38 in pg_log (type=PG_VERBOSE, fmt=0x55578590
"slot_name: \"%s\", plugin: \"%s\", two_phase: %s") at util.c:264
#5  0x55561a06 in print_slot_infos (slot_arr=0x55595ed0)
at info.c:813
#6  0x5556186e in print_db_infos (db_arr=0x55587518
) at info.c:782
#7  0x555606da in get_db_rel_and_slot_infos
(cluster=0x555874a0 , live_check=false) at info.c:308
#8  0x839a in check_new_cluster () at check.c:215
#9  0x55563010 in main (argc=13, argv=0x7fffdf08) at
pg_upgrade.c:136

This issue occurs because we are accessing uninitialized slot array information.

We could fix it in a couple of ways: a) initialize the whole of
dbinfos by using pg_malloc0 instead of pg_malloc, which will ensure
that the slot information is set to 0; or b) set only the slot
information. The attached patch has the changes for both approaches.
Thoughts?

Regards,
Vignesh
diff --git a/src/bin/pg_upgrade/info.c b/src/bin/pg_upgrade/info.c
index 7f21d26fd2..d2a1815fef 100644
--- a/src/bin/pg_upgrade/info.c
+++ b/src/bin/pg_upgrade/info.c
@@ -408,7 +408,7 @@ get_db_infos(ClusterInfo *cluster)
 	i_spclocation = PQfnumber(res, "spclocation");
 
 	ntups = PQntuples(res);
-	dbinfos = (DbInfo *) pg_malloc(sizeof(DbInfo) * ntups);
+	dbinfos = (DbInfo *) pg_malloc0(sizeof(DbInfo) * ntups);
 
 	for (tupnum = 0; tupnum < ntups; tupnum++)
 	{
@@ -640,11 +640,7 @@ get_old_cluster_logical_slot_infos(DbInfo *dbinfo, bool live_check)
 
 	/* Logical slots can be migrated since PG17. */
 	if (GET_MAJOR_VERSION(old_cluster.major_version) <= 1600)
-	{
-		dbinfo->slot_arr.slots = slotinfos;
-		dbinfo->slot_arr.nslots = num_slots;
 		return;
-	}
 
 	conn = connectToServer(_cluster, dbinfo->db_name);
 
diff --git a/src/bin/pg_upgrade/info.c b/src/bin/pg_upgrade/info.c
index 7f21d26fd2..21a0b0551a 100644
--- a/src/bin/pg_upgrade/info.c
+++ b/src/bin/pg_upgrade/info.c
@@ -297,6 +297,11 @@ get_db_rel_and_slot_infos(ClusterInfo *cluster, bool live_check)
 		 */
 		if (cluster == _cluster)
 			get_old_cluster_logical_slot_infos(pDbInfo, live_check);
+		else
+		{
+			pDbInfo->slot_arr.slots = NULL;
+			pDbInfo->slot_arr.nslots = 0;
+		}
 	}
 
 	if (cluster == _cluster)


Re: Show WAL write and fsync stats in pg_stat_io

2023-11-07 Thread Michael Paquier
On Tue, Nov 07, 2023 at 05:19:28PM -0800, Andres Freund wrote:
> Another approach would be to fetch the relevant columns from pg_stat_io in the
> pg_stat_wal view. That'd avoid double accounting and breaking existing
> monitoring.

Yep, I'd be OK with that as well to maintain compatibility.
--
Michael




Re: Add new option 'all' to pg_stat_reset_shared()

2023-11-07 Thread Michael Paquier
On Wed, Nov 08, 2023 at 10:08:42AM +0900, Kyotaro Horiguchi wrote:
> At Mon, 6 Nov 2023 14:00:13 +0530, Bharath Rupireddy 
>  wrote in 
> I must say, I have reservations about this approach. The main concern
> is the duplication of reset code, which has been efficiently
> encapsulated for individual targets, into this location. This practice
> degrades the maintainability and clarity of the code.

+{ oid => '8000',
This OID pick made me smile.

>> IMV, atomicity is not something that applies for the stats reset
>> operation because stats are approximate numbers by nature after all.
>> If the pg_stat_reset_shared() resets stats for only a bunch of stats
>> types and fails, it's the basic application programming style that
>> when a query fails it's the application that needs to have a retry
>> mechanism. FWIW, the atomicity doesn't apply today if someone wants to
>> reset stats in a loop for all stats types.
> 
> The infrequent use of this feature, coupled with the fact that there
> is no inherent need for these counters to be reset simultaneously,
> leads me to think that there is little point in centralizing the
> locks.

Each stat listed with fixed_amount has meaning when taken in isolation, so
I don't see why this patch has to be that complicated.  I'd expect one
code path that just calls a series of pgstat_reset_of_kind().  There
could be an argument for a new routine in pgstat.c that loops over the
pgstat_kind_infos and triggers the callbacks where .fixed_amount is 
set, but that's less transparent than the other approach.  The reset
time should be consistent across all the calls as we rely on
GetCurrentTimestamp().
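A minimal sketch of that loop variant (the types and fields below are illustrative stand-ins; the real code would iterate pgstat_kind_infos and call each kind's reset callback with one timestamp obtained from GetCurrentTimestamp()):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Illustrative stand-in for a pgstat kind-info entry: only kinds flagged
 * fixed_amount get reset, and all of them share one reset timestamp.
 */
typedef struct KindInfoSketch
{
    bool    fixed_amount;   /* does this kind have fixed-numbered stats? */
    long    reset_time;     /* stand-in for the kind's stats reset state */
} KindInfoSketch;

static void
reset_all_fixed(KindInfoSketch *kinds, int nkinds, long now)
{
    for (int i = 0; i < nkinds; i++)
    {
        if (kinds[i].fixed_amount)
            kinds[i].reset_time = now;  /* one consistent reset time for all */
    }
}
```

Passing `now` in from the caller is what keeps the reset time identical across every kind, which is the consistency property mentioned above.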
--
Michael




Re: Show WAL write and fsync stats in pg_stat_io

2023-11-07 Thread Andres Freund
Hi,

On 2023-11-08 09:52:16 +0900, Michael Paquier wrote:
> By the way, if the write/sync quantities and times begin to be tracked
> by pg_stat_io, I'd see a pretty good argument in removing the
> equivalent columns in pg_stat_wal.  It looks like this would reduce
> the confusion related to the handling of PendingWalStats added in
> pgstat_io.c, for one.

Another approach would be to fetch the relevant columns from pg_stat_io in the
pg_stat_wal view. That'd avoid double accounting and breaking existing
monitoring.

Greetings,

Andres Freund




Re: Atomic ops for unlogged LSN

2023-11-07 Thread Andres Freund
Hi,

On 2023-11-07 00:57:32 +, John Morris wrote:
> I found the comment about cache coherency a bit confusing. We are dealing
> with a single address, so there should be no memory ordering or coherency
> issues. (Did I misunderstand?) I see it more as a race condition.

IMO cache coherency covers the value a single variable has in different
threads / processes.

In fact, the only reason there effectively is a guarantee that you're not
seeing an outdated unloggedLSN value during shutdown checkpoints, even without
the spinlock or full barrier atomic op, is that the LWLockAcquire(), a few
lines above this, would prevent both the compiler and CPU from moving the read
of unloggedLSN to much earlier.  Obviously that lwlock has a different
address...


If the patch just had done the minimal conversion, it'd already have been
committed...  Even if there'd be a performance reason to get rid of the memory
barrier around reading unloggedLSN in CreateCheckPoint(), I'd do the
conversion in a separate commit.
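For illustration, the minimal conversion boils down to something like the following, written here with C11 atomics rather than PostgreSQL's own pg_atomic_uint64 wrappers, so treat it as a sketch rather than the patch itself:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/*
 * Sketch only: the spinlock-protected "unloggedLSN++" is replaced by a
 * single atomic fetch-add.  The real code would use pg_atomic_uint64 and
 * pg_atomic_fetch_add_u64, whose full-barrier semantics also cover the
 * shutdown-checkpoint read discussed above.
 */
static atomic_uint_least64_t unlogged_lsn;

static uint64_t
next_unlogged_lsn(void)
{
    /* like the post-increment under the old spinlock: returns the old value */
    return (uint64_t) atomic_fetch_add(&unlogged_lsn, 1);
}
```

Concurrent callers each get a distinct value with no lock acquisition, which is the whole point of the conversion.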

Greetings,

Andres Freund




Re: Add new option 'all' to pg_stat_reset_shared()

2023-11-07 Thread Kyotaro Horiguchi
At Mon, 6 Nov 2023 14:00:13 +0530, Bharath Rupireddy 
 wrote in 
> On Mon, Nov 6, 2023 at 11:39 AM torikoshia  wrote:
> > Since each stats, except wal_prefetch was reset acquiring LWLock,
> > attached PoC patch makes the call atomic by using these LWlocks.
> >
> > If this is the right direction, I'll try to make wal_prefetch also take
> > LWLock.

I must say, I have reservations about this approach. The main concern
is the duplication of reset code, which has been efficiently
encapsulated for individual targets, into this location. This practice
degrades the maintainability and clarity of the code.

> Well, that's a total of ~17 LWLocks this new function takes to make
> the stats reset atomic. I'm not sure if this atomicity is worth the
> effort which can easily be misused - what if someone runs something
> like SELECT pg_stat_reset_shared() FROM generate_series(1,
> 10n); to cause heavy lock acquisition and release cycles?
...
> IMV, atomicity is not something that applies for the stats reset
> operation because stats are approximate numbers by nature after all.
> If the pg_stat_reset_shared() resets stats for only a bunch of stats
> types and fails, it's the basic application programming style that
> when a query fails it's the application that needs to have a retry
> mechanism. FWIW, the atomicity doesn't apply today if someone wants to
> reset stats in a loop for all stats types.

The infrequent use of this feature, coupled with the fact that there
is no inherent need for these counters to be reset simultaneously,
leads me to think that there is little point in centralizing the
locks.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center


Re: Atomic ops for unlogged LSN

2023-11-07 Thread Andres Freund
Hi,

On 2023-11-07 11:02:49 -0600, Nathan Bossart wrote:
> On Tue, Nov 07, 2023 at 11:47:46AM -0500, Stephen Frost wrote:
> > We only care about the value of the unlogged LSN being correct during
> > normal shutdown when we're writing out the shutdown checkpoint, but by
> > that time everything else has been shut down and the value absolutely
> > should not be changing.
> 
> I agree that's all true.  I'm trying to connect how this scenario ensures
> we see the most up-to-date value in light of this comment above
> pg_atomic_read_u32():
> 
>  * The read is guaranteed to return a value as it has been written by this or
>  * another process at some point in the past. There's however no cache
>  * coherency interaction guaranteeing the value hasn't since been written to
>  * again.
> 
> Is there something special about all other backends being shut down that
> ensures this returns the most up-to-date value and not something from "some
> point in the past" as the stated contract for this function seems to
> suggest?

Practically yes - getting to the point of writing the shutdown checkpoint
implies having gone through a bunch of code that implies memory barriers
(spinlocks, lwlocks).

However, even if there's likely some other implied memory barrier that we
could piggyback on, the patch is much simpler to understand if it doesn't change
coherency rules. There's no way the overhead could matter.


Greetings,

Andres Freund




Re: Show WAL write and fsync stats in pg_stat_io

2023-11-07 Thread Michael Paquier
On Tue, Nov 07, 2023 at 03:30:48PM -0800, Andres Freund wrote:
> I strongly disagree. A significant part of the design of pg_stat_io was to
> make it possible to collect multiple sources of IO in a single view, so that
> sysadmins don't have to look in dozens of places to figure out what is causing
> what kind of IO.

Okay.  Point taken.

> We should over time collect all sources of IO in pg_stat_io. For some things
> we might want to also have more detailed information in other views (e.g. it
> doesn't make sense to track FPIs in pg_stat_io, but does make sense in
> pg_stat_wal) - but that should be in addition, not instead of.

Sure.  I understand here that you mean the number of FPIs counted when
a record is inserted, different from the path where we decide to write
and/or flush WAL.  The proposed patch seems to be a bit inconsistent
regarding wal_sync_time, by the way.

By the way, if the write/sync quantities and times begin to be tracked
by pg_stat_io, I'd see a pretty good argument in removing the
equivalent columns in pg_stat_wal.  It looks like this would reduce
the confusion related to the handling of PendingWalStats added in
pgstat_io.c, for one.
--
Michael


signature.asc
Description: PGP signature


Re: Moving forward with TDE [PATCH v3]

2023-11-07 Thread Andres Freund
Hi,

On 2023-11-06 11:26:44 +0100, Matthias van de Meent wrote:
> On Sat, 4 Nov 2023 at 03:38, Andres Freund  wrote:
> > On 2023-11-02 22:09:40 +0100, Matthias van de Meent wrote:
> > > I'm quite surprised at the significant number of changes being made
> > > outside the core storage manager files. I thought that changing out
> > > mdsmgr with an encrypted smgr (that could wrap mdsmgr if so desired)
> > > would be the most obvious change to implement cluster-wide encryption
> > > with the least code touched, as relations don't need to know whether
> > > the files they're writing are encrypted, right? Is there a reason to
> > > not implement this at the smgr level that I overlooked in the
> > > documentation of these patches?
> >
> > You can't really implement encryption transparently inside an smgr without
> > significant downsides. You need a way to store an initialization vector
> > associated with the page (or you can store that elsewhere, but then you've
> doubled the worst case amount of random reads/writes). The patch uses the LSN
> as the IV (which I doubt is a good idea). For authenticated encryption
> further additional storage space is required.
> 
> I am unaware of any user of the smgr API that doesn't also use the
> buffer cache, and thus implicitly the Page layout with PageHeader
> [^1]

Everything indeed uses a PageHeader - but there are a number of places that do
*not* utilize pd_lower/upper/special. E.g. visibilitymap.c just assumes that
those fields are zero - and changing that wouldn't be trivial / free, because
we do a lot of bitmasking/shifting with constants derived from

#define MAPSIZE (BLCKSZ - MAXALIGN(SizeOfPageHeaderData))

which obviously wouldn't be constant anymore if you could reserve space on the
page.


> The API of smgr is also tailored to page-sized quanta of data
> with mostly relation-level information. I don't see why there would be
> a veil covering the layout of Page for smgr when all other information
> already points to the use of PageHeader and Page layouts. In my view,
> it would even make sense to allow the smgr to get exclusive access to
> some part of the page in the current Page layout.
> 
> Yes, I agree that there will be an impact on usable page size if you
> want authenticated encryption, and that AMs will indeed need to
> account for storage space now being used by the smgr - inconvenient,
> but it serves a purpose. That would happen regardless of whether smgr
> or some higher system decides where to store the data for encryption -
> as long as it is on the page, the AM effectively can't use those
> bytes.
> But I'd say that's best solved by making the Page documentation and
> PageInit API explicit about the potential use of that space by the
> chosen storage method (encrypted, plain, ...) instead of requiring the
> various AMs to manually consider encryption when using Postgres' APIs
> for writing data to disk without hitting shared buffers; page space
> management is already a task of AMs, but handling the actual
> encryption is not.

I don't particularly disagree with any detail here - but to me reserving space
for nonces etc at PageInit() time pretty much is the opposite of handling
encryption inside smgr.


> Should the AM really care whether the data on disk is encrypted or
> not? I don't think so. When the disk contains encrypted bytes, but
> smgrread() and smgrwrite() both produce and accept plaintext data,
> who's going to complain? Requiring AMs to be mindful about encryption
> on all common paths only adds pitfalls where encryption would be
> forgotten by the developer of AMs in one path or another.

I agree with that - I think the way the patch currently is designed is not
right.


There's other stuff you can't trivially do at the smgr level. E.g. if
checksums or encryption is enabled, you need to copy the buffer to compute
checksums / do IO if in shared buffers, because somebody could set a hint bit
even with just a shared content lock. But you don't need that when coming from
private buffers during index builds.



> I think that getting PageInit to allocate the smgr-specific area would
> take some effort, too (which would potentially require adding some
> relational context to PageInit, so that it knows which page of which
> relation it is going to initialize), but IMHO that would be more
> natural than requiring all index and table AMs to be aware the actual
> encryption of its pages and require manual handling of that encryption
> when the page needs to be written to disk, when it otherwise already
> conforms to the various buffer management and file extension APIs
> currently in use in PostgreSQL. I would expect "transparent" data
> encryption to be handled at the file write layer (i.e. smgr), not
> inside the AMs.

As mentioned above - I agree that the relevant code shouldn't be in index
AMs. But I somewhat doubt that smgr is the right level either. For one, the
place computing checksums needs awareness of locking / sharing 

Re: Call pqPipelineFlush from PQsendFlushRequest

2023-11-07 Thread Michael Paquier
On Tue, Nov 07, 2023 at 12:43:18PM +0100, Alvaro Herrera wrote:
> I agree, and I intend to get this patch pushed once the release freeze
> is lifted.

Thanks!
--
Michael


signature.asc
Description: PGP signature


Re: Force the old transactions logs cleanup even if checkpoint is skipped

2023-11-07 Thread Michael Paquier
On Tue, Oct 17, 2023 at 02:09:21PM +, Zakhlystov, Daniil (Nebius) wrote:
> I've stumbled into an interesting problem. Currently, if Postgres
> has nothing to write, it would skip the checkpoint creation defined
> by the checkpoint timeout setting. However, we might face a
> temporary archiving problem (for example, some network issues) that
> might lead to a pile of wal files stuck in pg_wal. After this
> temporary issue has gone, we would still be unable to archive them
> since we effectively skip the checkpoint because we have nothing to
> write.

I am not sure to understand your last sentence here.  Once the
archiver is back up, you mean that the WAL segments that were not
previously archived are still not archived?  Or do you mean that
because of a succession of skipped checkpoints we are just unable to
remove them from pg_wal?

> That might lead to a problem - suppose you've run out of disk space
> because of the temporary failure of the archiver. After this
> temporary failure has gone, Postgres would be unable to recover from
> it automatically and will require human attention to initiate a
> CHECKPOINT call.
>
> I suggest changing this behavior by trying to clean up the old WAL
> even if we skip the main checkpoint routine. I've attached the patch
> that does exactly that.
> 
> What do you think?

I am not convinced that this is worth the addition in the skipped
path.  If your system is idle and a set of checkpoints is skipped, the
data folder is not going to be under extra space pressure because of
database activity (okay, unlogged tables even if these would generate
some WAL for init pages), because there is nothing happening in it
with no "important" WAL generated.  Note that the backend is very
unlikely to generate WAL marked only with XLOG_MARK_UNIMPORTANT.

More to the point: what's the origin of the disk space issues?  System
logs, unlogged tables or something else?  It is usually a good
practice to move logs to a different partition.  At the end, it sounds
to me that removing segments more aggressively is just kicking the can
elsewhere, without taking care of the origin of the disk issues.
--
Michael


signature.asc
Description: PGP signature


Re: [PATCHES] Post-special page storage TDE support

2023-11-07 Thread Andres Freund
Hi,

On 2023-05-09 17:08:26 -0500, David Christensen wrote:
> From 965309ea3517fa734c4bc89c144e2031cdf6c0c3 Mon Sep 17 00:00:00 2001
> From: David Christensen 
> Date: Tue, 9 May 2023 16:56:15 -0500
> Subject: [PATCH v4 1/3] Add reserved_page_space to Page structure
>
> This space is reserved for extended data on the Page structure which
> will be ultimately used for encrypted data, extended checksums, and
> potentially other things.  This data appears at the end of the Page,
> after any `pd_special` area, and will be calculated at runtime based
> on specific ControlFile features.
>
> No effort is made to ensure this is backwards-compatible with
> existing clusters for `pg_upgrade`, as we will require logical
> replication to move data into a cluster with different settings here.

The first part of the last paragraph makes it sound like pg_upgrade won't be
supported across this commit, rather than just between different settings...

I think as a whole this is not an insane idea. A few comments:

- IMO the patch touches many places it shouldn't need to touch, because of
  essentially renaming a lot of existing macro names to *Limit,
  necessitating modifying a lot of users. I think instead the few places that
  care about the runtime limit should be modified.

  As-is the patch would cause a lot of fallout in extensions that just do
  things like defining an on-stack array of Datums or such - even though all
  they'd need is to change the define to the *Limit one.

  Even leaving extensions aside, it makes reviewing (and I'm sure
  maintaining) the patch very tedious.


- I'm a bit worried about how the extra special page will be managed - if
  there are multiple features that want to use it, who gets to put their data
  at what offset?

  After writing this I saw that 0002 tries to address this - but I don't like
  the design. It introduces runtime overhead that seems likely to be visible.


- Checking for features using PageGetFeatureOffset() seems the wrong design to
  me - instead of a branch for some feature being disabled, perfectly
  predictable for the CPU, we need to do an external function call every time
  to figure out that yes, checksums are *still* disabled.


- Recomputing offsets every time in PageGetFeatureOffset() seems too
  expensive. The offsets can't change while running as PageGetFeatureOffset()
  has enough information to distinguish between different kinds of relations
  - so why do we need to recompute offsets on every single page?  I'd instead
  add a distinct offset variable for each feature.


- Modifying every single PageInit() call doesn't make sense to me. That'll
  just create a lot of breakage for - as far as I can tell - no win.


- Why is it worth sacrificing space on every page to indicate which features
  were enabled?  I think there'd need to be some convincing reasons for
  introducing such overhead.

- Is it really useful to encode the set of features enabled in a cluster with
  a bitmask? That pretty much precludes utilizing extra page space in
  extensions. We could instead just have an extra cluster-wide file that
  defines a mapping of offset to feature.


Greetings,

Andres Freund




Re: Cleaning up array_in()

2023-11-07 Thread Tom Lane
I got back to looking at this today (sorry for delay), and did a pass
of code review.  I think we are getting pretty close to something
committable.  The one loose end IMO is this bit in ReadArrayToken:

+case '"':
+
+/*
+ * XXX "Unexpected %c character" would be more apropos, but
+ * this message is what the pre-v17 implementation produced,
+ * so we'll keep it for now.
+ */
+errsave(escontext,
+(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
+ errmsg("malformed array literal: \"%s\"", origStr),
+ errdetail("Unexpected array element.")));
+return ATOK_ERROR;

This comes out when you write something like '{foo"bar"}', and I'd
say the choice of message is not great.  On the other hand, it's
consistent with what you get from '{"foo""bar"}', and if we wanted
to change that too then some tweaking of the state machine in
ReadArrayStr would be required (or else modify ReadArrayToken so
it doesn't return instantly upon seeing the second quote mark).
I'm not sure that this is worth messing with.

Anyway, I think we are well past the point where splitting the patch
into multiple parts is worth doing, because we've rewritten pretty
much all of this code, and the intermediate versions are not terribly
helpful.  So I just folded it all into one patch.

Some notes about specific points:

* Per previous discussion, I undid the change to allow "[1:0]"
dimensions, but I left a comment behind about that.

* Removing the ArrayGetNItemsSafe/ArrayCheckBoundsSafe calls
seems OK, but then we need to be more careful about detecting
overflows and disallowed cases in ReadArrayDimensions.

* I don't think the ARRAYDEBUG code is of any value whatever.
The fact that nobody bothered to improve it to print more than
the dim[] values proves it hasn't been used in decades.
Let's just nuke it.

* We can simplify the state machine in ReadArrayStr some more: it
seems to me it's sufficient to track "expect_delim", as long as you
realize that that really means "expect typdelim or right brace".
(Maybe another name would be better?  I couldn't think of anything
short though.)

* I switched to using a StringInfo instead of a fixed-size elembuf,
as Heikki speculated about.

* I added some more test cases to cover things that evidently weren't
sufficiently tested, like the has_escapes business which was flat
out broken in v10, and to improve the code coverage report.

Comments?

regards, tom lane

diff --git a/src/backend/utils/adt/arrayfuncs.c b/src/backend/utils/adt/arrayfuncs.c
index 3ff13eb419..638e4f0382 100644
--- a/src/backend/utils/adt/arrayfuncs.c
+++ b/src/backend/utils/adt/arrayfuncs.c
@@ -54,18 +54,16 @@ bool		Array_nulls = true;
 			PG_FREE_IF_COPY(array, n); \
 	} while (0)
 
+/* ReadArrayToken return type */
 typedef enum
 {
-	ARRAY_NO_LEVEL,
-	ARRAY_LEVEL_STARTED,
-	ARRAY_ELEM_STARTED,
-	ARRAY_ELEM_COMPLETED,
-	ARRAY_QUOTED_ELEM_STARTED,
-	ARRAY_QUOTED_ELEM_COMPLETED,
-	ARRAY_ELEM_DELIMITED,
-	ARRAY_LEVEL_COMPLETED,
-	ARRAY_LEVEL_DELIMITED,
-} ArrayParseState;
+	ATOK_LEVEL_START,
+	ATOK_LEVEL_END,
+	ATOK_DELIM,
+	ATOK_ELEM,
+	ATOK_ELEM_NULL,
+	ATOK_ERROR,
+} ArrayToken;
 
 /* Working state for array_iterate() */
 typedef struct ArrayIteratorData
@@ -91,15 +89,21 @@ typedef struct ArrayIteratorData
 	int			current_item;	/* the item # we're at in the array */
 }			ArrayIteratorData;
 
-static int	ArrayCount(const char *str, int *dim, char typdelim,
-	   Node *escontext);
-static bool ReadArrayStr(char *arrayStr, const char *origStr,
-		 int nitems, int ndim, int *dim,
+static bool ReadArrayDimensions(char **srcptr, int *ndim_p,
+int *dim, int *lBound,
+const char *origStr, Node *escontext);
+static bool ReadDimensionInt(char **srcptr, int *result,
+			 const char *origStr, Node *escontext);
+static bool ReadArrayStr(char **srcptr,
 		 FmgrInfo *inputproc, Oid typioparam, int32 typmod,
 		 char typdelim,
 		 int typlen, bool typbyval, char typalign,
-		 Datum *values, bool *nulls,
-		 bool *hasnulls, int32 *nbytes, Node *escontext);
+		 int *ndim_p, int *dim,
+		 int *nitems_p,
+		 Datum **values_p, bool **nulls_p,
+		 const char *origStr, Node *escontext);
+static ArrayToken ReadArrayToken(char **srcptr, StringInfo elembuf, char typdelim,
+ const char *origStr, Node *escontext);
 static void ReadArrayBinary(StringInfo buf, int nitems,
 			FmgrInfo *receiveproc, Oid typioparam, int32 typmod,
 			int typlen, bool typbyval, char typalign,
@@ -185,12 +189,10 @@ array_in(PG_FUNCTION_ARGS)
 	char		typalign;
 	char		typdelim;
 	Oid			typioparam;
-	char	   *string_save,
-			   *p;
-	int			i,
-nitems;
-	Datum	   *dataPtr;
-	bool	   *nullsPtr;
+	char	   *p;
+	int			nitems;
+	Datum	   *values;
+	bool	   *nulls;
 	bool		hasnulls;
 	int32		nbytes;
 

Re: Moving forward with TDE [PATCH v3]

2023-11-07 Thread Andres Freund
Hi,

On 2023-11-06 09:56:37 -0500, Stephen Frost wrote:
> * Andres Freund (and...@anarazel.de) wrote:
> > I still am quite quite unconvinced that using the LSN as a nonce is a good
> > design decision.
> 
> This is a really important part of the overall path to moving this
> forward, so I wanted to jump to it and have a specific discussion
> around this.  I agree that there are downsides to using the LSN, some of
> which we could possibly address (eg: include the timeline ID in the IV),
> but others that would be harder to deal with.

> The question then is- what's the alternative?
> 
> One approach would be to change the page format to include space for an
> explicit nonce.  I don't see the community accepting such a new page
> format as the only format we support though as that would mean no
> pg_upgrade support along with wasted space if TDE isn't being used.

Right.


> Ideally, we'd instead be able to support multiple page formats where
> users could decide when they create their cluster what features they
> want- and luckily we even have such an effort underway with patches
> posted for review [1].

I think there are some details wrong with that patch - IMO the existing macros
should just continue to work as-is and instead the places that want the more
narrow definition should be moved to the new macros and it changes places that
should continue to use compile time constants - but it doesn't seem like a
fundamentally bad idea to me.  I certainly like it much better than making the
page size runtime configurable.

(I'll try to reply with the above points to [1])


> Certainly, with the base page-special-feature patch, we could have an option
> for users to choose that they want a better nonce than the LSN, or we could
> bundle that assumption in with, say, the authenticated-encryption feature
> (if you want authenticated encryption, then you want more from the
> encryption system than the basics, and therefore we presume you also want a
> better nonce than the LSN).

I don't think we should support using the LSN as a nonce if we have an
alternative. The cost and complexity overhead is just not worth it.  Yes,
it'll be harder for users to migrate to encryption, but adding complexity
elsewhere in the system to get an inferior result isn't worth it.


> Another approach would be a separate fork, but that then has a number of
> downsides too- every write has to touch that too, and a single page of
> nonces would cover a pretty large number of pages also.

Yea, the costs of doing so is nontrivial. If you were trying to implement
encryption on the smgr level - which I doubt you should but am not certain
about - my suggestion would be to interleave pages with metadata like nonces
and AEAD with the data pages. I.e. one metadata page would be followed by
  (BLCKSZ - SizeOfPageHeaderData) / (sizeof(nonce) + sizeof(AEAD))
pages containing actual relation data.  That way you still get decent locality
during scans and writes.

Relation forks were a mistake, we shouldn't use them in more places.


I think it'd be much better if we also encrypted forks, rather than just the
main fork...

Greetings,

Andres Freund




Re: Show WAL write and fsync stats in pg_stat_io

2023-11-07 Thread Andres Freund
Hi,

On 2023-10-26 15:28:32 +0900, Michael Paquier wrote:
> On top of that pg_stat_io is now for block-based I/O operations, so
> that does not fit entirely in the picture, though I guess that Melanie
> has thought more on the matter than me.  That may be also a matter of
> taste.

I strongly disagree. A significant part of the design of pg_stat_io was to
make it possible to collect multiple sources of IO in a single view, so that
sysadmins don't have to look in dozens of places to figure out what is causing
what kind of IO.

We should over time collect all sources of IO in pg_stat_io. For some things
we might want to also have more detailed information in other views (e.g. it
doesn't make sense to track FPIs in pg_stat_io, but does make sense in
pg_stat_wal) - but that should be in addition, not instead of.

Greetings,

Andres Freund




Re: speed up a logical replica setup

2023-11-07 Thread Michael Paquier
On Tue, Nov 07, 2023 at 10:00:39PM +0100, Peter Eisentraut wrote:
> Speaking of which, would it make sense to put this tool (whatever the name)
> into the pg_basebackup directory?  It's sort of related, and it also shares
> some code.

I've read the patch, and the additions to streamutil.h and
streamutil.c make it kind of natural to have it sit in pg_basebackup/.
There's pg_recvlogical already there.  I am wondering about two
things, though:
- Should the subdirectory pg_basebackup be renamed into something more
generic at this point?  All these things are frontend tools that deal
in some way with the replication protocol to do their work.  Say
a replication_tools?
- And if it would be better to refactor some of the code generic to
all these streaming tools to fe_utils.  What makes streamutil.h a bit
less pluggable are all its extern variables to control the connection,
but perhaps that can be an advantage, as well, in some cases.
--
Michael


signature.asc
Description: PGP signature


Re: Moving forward with TDE [PATCH v3]

2023-11-07 Thread Stephen Frost
Greetings,

* Bruce Momjian (br...@momjian.us) wrote:
> On Mon, Nov  6, 2023 at 09:56:37AM -0500, Stephen Frost wrote:
> > The gist is, without a suggestion of things to try, we're left
> > to our own devices to try and figure out things which might be
> > successful, only to have those turned down too when we come back with
> > them, see [1] for what feels like an example of this.  Given your
> > feedback overall, which I'm very thankful for, I'm hopeful that you see
> > that this is, indeed, a useful feature that people are asking for and
> > therefore are willing to spend some time on it, but if the feedback is
> > that nothing on the page is acceptable to use for the nonce, we can't
> > put the nonce somewhere else, and we can't change the page format, then
> > everything else is just moving deck chairs around on the titanic that
> > has been this effort.
> 
> Yeah, I know the feeling, though I thought XTS was immune enough to
> nonce/LSN reuse that it was acceptable.

Ultimately it depends on the attack vector you're trying to address, but
generally, I think you're right about the XTS tweak reuse not being that
big of a deal.  XTS isn't CTR or GCM.

With FDE (full disk encryption) you're expecting the attacker to steal
the physical laptop, hard drive, etc, generally, and so the downside of
using the same tweak with XTS over and over again isn't that bad (and is
what most FDE solutions do, aiui, by simply using the sector number; we
could do something similar to that by using the relfilenode + block
number) because that re-use is a problem if the attacker is able to see
multiple copies of the same block over time where the block has been
encrypted with different data but the same key and tweak.

Using the LSN actually is better than what the FDE solutions do because
the LSN varies much more often than the sector number.  Sure, it doesn't
change with every write and maybe an attacker could glean something from
that, but that's quite narrow.  The downside from the LSN based approach
with XTS is probably more that using the LSN means that we can't encrypt
the LSN itself and that is a leak too- but then again, we leak that
through the simple WAL filenames too, to some extent, so it doesn't
strike me as a huge issue.

XTS as a block cipher doesn't suffer from the IV-reuse issue that you
have with streaming ciphers where the same key+IV and different data
leads to being able to trivially retrieve the plaintext, though, and I worry
that maybe that's what people were thinking.

The README and comments I don't think were terribly clear on this and I
think may have even been from back when CTR was being considered, where
IV reuse would have resulted in plaintext being trivially available.

> What got me sunk on the feature was the complexity of adding temporary
> file encryption support and that tipped the scales in the negative for
> me in community value of the feature vs. added complexity. (Yeah, I used
> a Titanic reference in the last sentence. ;-) )  However, I am open to
> the community value and complexity values changing over time.  My blog
> post on the topic:

We do need to address the temporary file situation too and we do have a
bit of an issue that how we deal with temporary files today in PG isn't
very consistent and there's too many ways to do that.  There's a patch
that works on that, though it has some bitrot that we're working on
addressing currently.

There is value in simply fixing that situation wrt temporary file
management independent of encryption, though of course then encryption
of those temporary files becomes much simpler.

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: Version 14/15 documentation Section "Alter Default Privileges"

2023-11-07 Thread Bruce Momjian
On Mon, Nov  6, 2023 at 09:53:50PM +0100, Laurenz Albe wrote:
> On Mon, 2023-11-06 at 10:55 -0500, Bruce Momjian wrote:
> > Okay, I think I have good wording for this.  I didn't like the wording
> > of other roles, so I restructured that in the attached patch too.
> 
> > 
> > !Default privileges apply only to the active role;  the default
> !privileges of member roles have no effect on object permissions.
> > !SET ROLE can be used to change the active user and
> > !apply their default privileges.
> > !   
> 
> You don't mean member roles, but roles that the active role is a member of,
> right?

Yes, sorry fixed in the attached patch.

> +  
> +   As a non-superuser, you can change default privileges only on
> +   objects created by yourself or by roles that you are a member of.
> +   However, you don't inherit altered default privileges from roles
> +   you are a member of; objects you create will receive the default
> +   privileges for your current role.
> +  

I went with different wording since I found the above confusing.

You didn't seem to like my SET ROLE suggestion so I removed it.

> +
> +  
> +   There is no way to change the default privileges for objects created by
> +   arbitrary roles.  You have to run ALTER DEFAULT PRIVILEGES

I find the above sentence odd.  What is its purpose?

> +   for any role that can create objects whose default privileges should be
> +   modified.
> +  
> +
> +  
> +   Currently,
> +   only the privileges for schemas, tables (including views and foreign
> +   tables), sequences, functions, and types (including domains) can be
> +   altered.  For this command, functions include aggregates and procedures.
> +   The words FUNCTIONS and ROUTINES are
> +   equivalent in this command.  (ROUTINES is preferred
> +   going forward as the standard term for functions and procedures taken
> +   together.  In earlier PostgreSQL releases, only the
> +   word FUNCTIONS was allowed.  It is not possible to set
> +   default privileges for functions and procedures separately.)
> +  
> +
>
> Default privileges that are specified per-schema are added to whatever
> the global default privileges are for the particular object type.
> @@ -136,8 +149,9 @@ REVOKE [ GRANT OPTION FOR ]
>  target_role
>  
>   
> -  The name of an existing role of which the current role is a member.
> -  If FOR ROLE is omitted, the current role is assumed.
> +  Default privileges are changed for objects created by the
> +  target_role, or the current
> +  role if unspecified.

I like a verb to be first, like "Change" rather than "default
privileges".

Patch attached.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.
diff --git a/doc/src/sgml/ref/alter_default_privileges.sgml b/doc/src/sgml/ref/alter_default_privileges.sgml
index 8a6006188d..78744470c8 100644
--- a/doc/src/sgml/ref/alter_default_privileges.sgml
+++ b/doc/src/sgml/ref/alter_default_privileges.sgml
@@ -88,25 +88,19 @@ REVOKE [ GRANT OPTION FOR ]
   Description
 
   
-   ALTER DEFAULT PRIVILEGES allows you to set the privileges
-   that will be applied to objects created in the future.  (It does not
-   affect privileges assigned to already-existing objects.)  Currently,
-   only the privileges for schemas, tables (including views and foreign
-   tables), sequences, functions, and types (including domains) can be
-   altered.  For this command, functions include aggregates and procedures.
-   The words FUNCTIONS and ROUTINES are
-   equivalent in this command.  (ROUTINES is preferred
-   going forward as the standard term for functions and procedures taken
-   together.  In earlier PostgreSQL releases, only the
-   word FUNCTIONS was allowed.  It is not possible to set
-   default privileges for functions and procedures separately.)
+   ALTER DEFAULT PRIVILEGES allows you to set the
+   privileges that will be applied to objects created in the future.
+   (It does not affect privileges assigned to already-existing objects.)
+   Privileges can be set globally (i.e., for all objects created in the
+   current database), or just for objects created in specified schemas.
   
 
   
-   You can change default privileges only for objects that will be created by
-   yourself or by roles that you are a member of.  The privileges can be set
-   globally (i.e., for all objects created in the current database),
-   or just for objects created in specified schemas.
+   While you can change your own default privileges and the defaults of
+   roles that you are a member of, at object creation time, new object
+   permissions are only affected by the default privileges of the current
+   role, and are not inherited from any roles in which the current role
+   is a member.
   
 
   
@@ -118,6 +112,19 @@ REVOKE [ GRANT OPTION FOR ]
ALTER DEFAULT PRIVILEGES.
   
 
+  
+   Currently,

Fix use of openssl.path() if openssl isn't found

2023-11-07 Thread Tristan Partin
Found this issue during my Fedora 39 upgrade. Tested that uninstalling 
openssl still allows the various ssl tests to run and succeed.


--
Tristan Partin
Neon (https://neon.tech)
From d684d6fc1546335804d2ed82ff909991965a61a8 Mon Sep 17 00:00:00 2001
From: Tristan Partin 
Date: Tue, 7 Nov 2023 15:59:04 -0600
Subject: [PATCH v1] Fix use of openssl.path() if openssl isn't found

openssl(1) is an optional dependency of the Postgres Meson build, but
was inadvertently required when defining some SSL tests.
---
 src/test/ssl/meson.build | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/src/test/ssl/meson.build b/src/test/ssl/meson.build
index 4cda81f3bc..c3ffcaa032 100644
--- a/src/test/ssl/meson.build
+++ b/src/test/ssl/meson.build
@@ -1,5 +1,10 @@
 # Copyright (c) 2022-2023, PostgreSQL Global Development Group
 
+openssl_path = ''
+if openssl.found()
+  openssl_path = openssl.path()
+endif
+
 tests += {
   'name': 'ssl',
   'sd': meson.current_source_dir(),
@@ -7,7 +12,7 @@ tests += {
   'tap': {
 'env': {
   'with_ssl': ssl_library,
-  'OPENSSL': openssl.path(),
+  'OPENSSL': openssl_path,
 },
 'tests': [
   't/001_ssltests.pl',
-- 
Tristan Partin
Neon (https://neon.tech)



Re: Fix search_path for all maintenance commands

2023-11-07 Thread Jeff Davis
On Tue, 2023-10-31 at 13:16 -0400, Isaac Morland wrote:
> Perhaps the search_path for running a maintenance command should be
> the search_path set for the table owner (ALTER ROLE … SET search_path
> …)?

After some thought, I don't think that's the right approach. It adds
another way search path can be changed, which adds to the complexity.

Also, by default it's "$user", public; and given that "public" was
world-writable until recently, that doesn't seem like a good idea for a
change intended to prevent search_path manipulation.

Regards,
Jeff Davis





Re: Compiler warning on Debian 12, PostgreSQL 16 Beta3

2023-11-07 Thread Thomas Munro
On Wed, Nov 8, 2023 at 8:13 AM Alvaro Herrera  wrote:
> On 2023-Nov-08, Thomas Munro wrote:
> > On Wed, Nov 8, 2023 at 4:46 AM Alvaro Herrera  
> > wrote:
> > > On 2023-Aug-25, Daniel Westermann (DWE) wrote:
> > >
> > > Yeah, I get this one too.  I thought commit 37d5babb5cfa ("jit: Support
> > > opaque pointers in LLVM 16.") was going to silence it, but I was quite
> > > mistaken.  I gave that code a quick look and could not understand what
> > > it was complaining about.  Is it a bug in the LLVM headers?
> >
> > I found the commit where they fixed that in 15+:
> >
> > https://github.com/llvm/llvm-project/commit/1d9086bf054c2e734940620d02d4451156b424e6
> >
> > They don't seem to back-patch fixes, generally.
>
> Ah yeah, I can silence the warning by patching that file locally.

Since LLVM only seems to maintain one branch at a time as a matter of
policy (I don't see where that is written down, but I do see for example
their backport request format[1] which strictly goes from main to
(currently) release/17.x, and see how the commit history of each
release branch ends as a new branch is born), I suppose another angle
would be to check if the Debian maintainers carry extra patches for
stuff like that.  They're the ones creating the dependency on an 'old'
LLVM after all.  Unlike the RHEL/etc maintainers' fast rolling version
policy (that we learned about in the thread for CF #4640).  Who wants
to ship zombie unmaintained code for years?  On the other hand, Debian
itself rolls faster than RHEL.

[1] https://llvm.org/docs/GitHub.html




Re: speed up a logical replica setup

2023-11-07 Thread Peter Eisentraut

On 23.10.23 05:53, Euler Taveira wrote:

Let's continue with the bike shedding... I agree with Peter E that this name
does not express what this tool is. At the moment, it only has one action:
create. If I have to suggest other actions I would say that it could support
switchover option too (that removes the infrastructure created by this 
tool).

If we decide to keep this name, it would be a good idea to add an option to
indicate what action it is executing (similar to pg_recvlogical) as 
suggested

by Peter.


Speaking of which, would it make sense to put this tool (whatever the 
name) into the pg_basebackup directory?  It's sort of related, and it 
also shares some code.






Re: Add the ability to limit the amount of memory that can be allocated to backends.

2023-11-07 Thread Stephen Frost
Greetings,

* Andres Freund (and...@anarazel.de) wrote:
> On 2023-11-06 13:02:50 -0500, Stephen Frost wrote:
> > > > The max_total_memory limit is checked whenever the global counters are
> > > > updated. There is no special error handling if a memory allocation 
> > > > exceeds
> > > > the global limit. That allocation returns a NULL for malloc style
> > > > allocations or an ENOMEM for shared memory allocations. Postgres has
> > > > existing mechanisms for dealing with out of memory conditions.
> > > 
> > > I still think it's extremely unwise to do tracking of memory and limiting 
> > > of
> > > memory in one patch.  You should work towards an acceptable patch that 
> > > just
> > > tracks memory usage in an as simple and low overhead way as possible. 
> > > Then we
> > > later can build on that.
> > 
> > Frankly, while tracking is interesting, the limiting is the feature
> > that's needed more urgently imv.
> 
> I agree that we need limiting, but that the tracking needs to be very robust
> for that to be usable.

Is there an issue with the tracking in the patch that you saw?  That's
certainly an area that we've tried hard to get right and to match up to
numbers from the rest of the system, such as the memory context system.

> > We could possibly split it up but there's an awful lot of the same code that
> > would need to be changed and that seems less than ideal.  Still, we'll look
> > into this.
> 
> Shrug. IMO keeping them together just makes it very likely that neither goes
> in.

I'm happy to hear your support for the limiting part of this- that's
encouraging.

> > > > For sanity checking, pgstat now includes the 
> > > > pg_backend_memory_allocation
> > > > view showing memory allocations made by the backend process. This view
> > > > includes a scan of the top memory context, so it compares memory 
> > > > allocations
> > > > reported through pgstat with actual allocations. The two should match.
> > > 
> > > Can't you just do this using the existing pg_backend_memory_contexts view?
> > 
> > Not and get a number that you can compare to the local backend number
> > due to the query itself happening and performing allocations and
> > creating new contexts.  We wanted to be able to show that we are
> > accounting correctly and exactly matching to what the memory context
> > system is tracking.
> 
> I think creating a separate view for this will be confusing for users, without
> really much to show for. Excluding the current query would be useful for other
> cases as well, why don't we provide a way to do that with
> pg_backend_memory_contexts?

Both of these feel very much like power-user views, so I'm not terribly
concerned about users getting confused.  That said, we could possibly
drop this as a view and just have the functions which are then used in
the regression tests to catch things should the numbers start to
diverge.

Having a way to get the memory contexts which don't include the
currently running query might be interesting too but is rather
independent of what this patch is trying to do.  The only reason we
collected up the memory-context info is as a cross-check to the tracking
that we're doing and while the existing memory-context view is just fine
for a lot of other things, it doesn't work for that specific need.
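The limiting behavior described upthread (a global counter checked on each allocation, with failures surfacing as an ordinary NULL from malloc-style allocators) can be sketched roughly as follows. This is a minimal single-process illustration only, not the patch's actual implementation; the names limited_malloc and limited_free are invented here, and max_total_memory stands in for the GUC the patch describes:

```c
#include <stdlib.h>
#include <stdint.h>

/*
 * Minimal sketch of the behavior described above: a global counter is
 * updated on each allocation, and an allocation that would push the
 * total past max_total_memory fails by returning NULL, just like
 * malloc, so existing out-of-memory handling applies unchanged.  The
 * real patch tracks per-backend totals through pgstat; this is a
 * single-process stand-in.
 */
static size_t total_allocated = 0;
static size_t max_total_memory = 1024;	/* illustrative limit, in bytes */

void *
limited_malloc(size_t size)
{
	if (total_allocated + size > max_total_memory)
		return NULL;			/* no special handling; caller sees OOM */

	void	   *p = malloc(size);

	if (p != NULL)
		total_allocated += size;
	return p;
}

void
limited_free(void *p, size_t size)
{
	free(p);
	total_allocated -= size;
}
```

The design point under discussion is visible here: the limit check and the tracking update share the same bookkeeping, which is why splitting the patch in two touches so much common code.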

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: GUC names in messages

2023-11-07 Thread Peter Smith
FWIW, I am halfway through doing regex checking of the PG16 source for
all GUC names in messages to see what current styles are in use today.

Not sure if those numbers will influence the decision.

I hope I can post my findings today or tomorrow.

==
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Popcount optimization using AVX512

2023-11-07 Thread Nathan Bossart
On Mon, Nov 06, 2023 at 09:53:15PM -0800, Noah Misch wrote:
> On Mon, Nov 06, 2023 at 09:59:26PM -0600, Nathan Bossart wrote:
>> On Mon, Nov 06, 2023 at 07:15:01PM -0800, Noah Misch wrote:
>> > The glibc/gcc "ifunc" mechanism was designed to solve this problem of 
>> > choosing
>> > a function implementation based on the runtime CPU, without incurring 
>> > function
>> > pointer overhead.  I would not attempt to use AVX512 on non-glibc systems, 
>> > and
>> > I would use ifunc to select the desired popcount implementation on glibc:
>> > https://gcc.gnu.org/onlinedocs/gcc-4.8.5/gcc/Function-Attributes.html
>> 
>> Thanks, that seems promising for the function pointer cases.  I'll plan on
>> trying to convert one of the existing ones to use it.  BTW it looks like
>> LLVM has something similar [0].
>> 
>> IIUC this unfortunately wouldn't help for cases where we wanted to keep
>> stuff inlined, such as is_valid_ascii() and the functions in pg_lfind.h,
>> unless we applied it to the calling functions, but that doesn't sound
>> particularly maintainable.
> 
> Agreed, it doesn't solve inline cases.  If the gains are big enough, we should
> move toward packages containing N CPU-specialized copies of the postgres
> binary, with bin/postgres just exec'ing the right one.

I performed a quick test with ifunc on my x86 machine that ordinarily uses
the runtime checks for the CRC32C code, and I actually see a consistent
3.5% regression for pg_waldump -z on 100M 65-byte records.  I've attached
the patch used for testing.

The multiple-copies-of-the-postgres-binary idea seems interesting.  That's
probably not something that could be enabled by default, but perhaps we
could add support for a build option.
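For comparison, the runtime-check dispatch that ifunc replaces can be sketched as follows. This is an illustrative stand-in, not the actual pg_crc32c code: the exported symbol starts out as a pointer to a "choose" function that probes the CPU once and rebinds the pointer, so only the first call pays the selection cost, while every later call pays the indirect-call overhead that ifunc is meant to avoid. The names (sum_bytes, cpu_has_fast_path) are invented for this sketch:

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Illustrative sketch of the self-rebinding function-pointer dispatch
 * pattern.  cpu_has_fast_path() stands in for a real CPUID/auxval
 * probe; sum_bytes_fast stands in for a SIMD variant (here it is the
 * same scalar code, since the point is the dispatch, not the kernel).
 */
static uint32_t
sum_bytes_portable(const uint8_t *data, size_t len)
{
	uint32_t	s = 0;

	for (size_t i = 0; i < len; i++)
		s += data[i];
	return s;
}

static uint32_t
sum_bytes_fast(const uint8_t *data, size_t len)
{
	return sum_bytes_portable(data, len);	/* pretend this is SIMD */
}

static int
cpu_has_fast_path(void)
{
	return 0;					/* pretend the probe said no */
}

static uint32_t sum_bytes_choose(const uint8_t *data, size_t len);

/* All callers go through this pointer; it starts at the chooser. */
static uint32_t (*sum_bytes) (const uint8_t *data, size_t len) = sum_bytes_choose;

static uint32_t
sum_bytes_choose(const uint8_t *data, size_t len)
{
	/* Rebind once, then dispatch; later calls skip the probe. */
	if (cpu_has_fast_path())
		sum_bytes = sum_bytes_fast;
	else
		sum_bytes = sum_bytes_portable;
	return sum_bytes(data, len);
}
```

With ifunc, the dynamic loader resolves the symbol to the chosen implementation at relocation time instead, eliminating the pointer indirection entirely, which is what makes the 3.5% regression measured above surprising.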

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
diff --git a/src/include/port/pg_crc32c.h b/src/include/port/pg_crc32c.h
index d085f1dc00..6db411ee29 100644
--- a/src/include/port/pg_crc32c.h
+++ b/src/include/port/pg_crc32c.h
@@ -78,7 +78,7 @@ extern pg_crc32c pg_comp_crc32c_loongarch(pg_crc32c crc, const void *data, size_
 #define FIN_CRC32C(crc) ((crc) ^= 0xFFFFFFFF)
 
 extern pg_crc32c pg_comp_crc32c_sb8(pg_crc32c crc, const void *data, size_t len);
-extern pg_crc32c (*pg_comp_crc32c) (pg_crc32c crc, const void *data, size_t len);
+extern pg_crc32c pg_comp_crc32c(pg_crc32c crc, const void *data, size_t len);
 
 #ifdef USE_SSE42_CRC32C_WITH_RUNTIME_CHECK
 extern pg_crc32c pg_comp_crc32c_sse42(pg_crc32c crc, const void *data, size_t len);
diff --git a/src/port/pg_crc32c_sse42_choose.c b/src/port/pg_crc32c_sse42_choose.c
index 41ff4a35ad..62bb981ee8 100644
--- a/src/port/pg_crc32c_sse42_choose.c
+++ b/src/port/pg_crc32c_sse42_choose.c
@@ -51,14 +51,14 @@ pg_crc32c_sse42_available(void)
  * so that subsequent calls are routed directly to the chosen implementation.
  */
 static pg_crc32c
-pg_comp_crc32c_choose(pg_crc32c crc, const void *data, size_t len)
+(*pg_comp_crc32c_choose (void))(pg_crc32c crc, const void *data, size_t len)
 {
 	if (pg_crc32c_sse42_available())
-		pg_comp_crc32c = pg_comp_crc32c_sse42;
+		return pg_comp_crc32c_sse42;
 	else
-		pg_comp_crc32c = pg_comp_crc32c_sb8;
-
-	return pg_comp_crc32c(crc, data, len);
+		return pg_comp_crc32c_sb8;
 }
 
-pg_crc32c	(*pg_comp_crc32c) (pg_crc32c crc, const void *data, size_t len) = pg_comp_crc32c_choose;
+pg_crc32c
+pg_comp_crc32c(pg_crc32c crc, const void *data, size_t len)
+	__attribute__ ((ifunc ("pg_comp_crc32c_choose")));


Re: Add the ability to limit the amount of memory that can be allocated to backends.

2023-11-07 Thread Andres Freund
Hi,

On 2023-11-06 13:02:50 -0500, Stephen Frost wrote:
> > > The max_total_memory limit is checked whenever the global counters are
> > > updated. There is no special error handling if a memory allocation exceeds
> > > the global limit. That allocation returns a NULL for malloc style
> > > allocations or an ENOMEM for shared memory allocations. Postgres has
> > > existing mechanisms for dealing with out of memory conditions.
> > 
> > I still think it's extremely unwise to do tracking of memory and limiting of
> > memory in one patch.  You should work towards an acceptable patch that just
> > tracks memory usage in an as simple and low overhead way as possible. Then 
> > we
> > later can build on that.
> 
> Frankly, while tracking is interesting, the limiting is the feature
> that's needed more urgently imv.

I agree that we need limiting, but that the tracking needs to be very robust
for that to be usable.


> We could possibly split it up but there's an awful lot of the same code that
> would need to be changed and that seems less than ideal.  Still, we'll look
> into this.

Shrug. IMO keeping them together just makes it very likely that neither goes
in.


> > > For sanity checking, pgstat now includes the pg_backend_memory_allocation
> > > view showing memory allocations made by the backend process. This view
> > > includes a scan of the top memory context, so it compares memory 
> > > allocations
> > > reported through pgstat with actual allocations. The two should match.
> > 
> > Can't you just do this using the existing pg_backend_memory_contexts view?
> 
> Not and get a number that you can compare to the local backend number
> due to the query itself happening and performing allocations and
> creating new contexts.  We wanted to be able to show that we are
> accounting correctly and exactly matching to what the memory context
> system is tracking.

I think creating a separate view for this will be confusing for users, without
really much to show for. Excluding the current query would be useful for other
cases as well, why don't we provide a way to do that with
pg_backend_memory_contexts?

Greetings,

Andres Freund




Re: Compiler warning on Debian 12, PostgreSQL 16 Beta3

2023-11-07 Thread Alvaro Herrera
On 2023-Nov-08, Thomas Munro wrote:

> On Wed, Nov 8, 2023 at 4:46 AM Alvaro Herrera  wrote:
> > On 2023-Aug-25, Daniel Westermann (DWE) wrote:
> >
> > Yeah, I get this one too.  I thought commit 37d5babb5cfa ("jit: Support
> > opaque pointers in LLVM 16.") was going to silence it, but I was quite
> > mistaken.  I gave that code a quick look and could not understand what
> > it was complaining about.  Is it a bug in the LLVM headers?
> 
> I found the commit where they fixed that in 15+:
> 
> https://github.com/llvm/llvm-project/commit/1d9086bf054c2e734940620d02d4451156b424e6
> 
> They don't seem to back-patch fixes, generally.

Ah yeah, I can silence the warning by patching that file locally.

Annoying :-(

-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/
"El sabio habla porque tiene algo que decir;
el tonto, porque tiene que decir algo" (Platon).




Re: Compiler warning on Debian 12, PostgreSQL 16 Beta3

2023-11-07 Thread Thomas Munro
On Wed, Nov 8, 2023 at 4:46 AM Alvaro Herrera  wrote:
> On 2023-Aug-25, Daniel Westermann (DWE) wrote:
> > I've just noticed this warning when building on Debian 12:
> >
> > In file included from 
> > /usr/lib/llvm-14/include/llvm/Analysis/ModuleSummaryAnalysis.h:17,
> >  from llvmjit_inline.cpp:51:
> > /usr/lib/llvm-14/include/llvm/IR/ModuleSummaryIndex.h: In constructor 
> > ‘llvm::ModuleSummaryIndex::ModuleSummaryIndex(bool, bool)’:
> > /usr/lib/llvm-14/include/llvm/IR/ModuleSummaryIndex.h:1175:73: warning: 
> > member ‘llvm::ModuleSummaryIndex::Alloc’ is used uninitialized 
> > [-Wuninitialized]
> >  1175 |   : HaveGVs(HaveGVs), EnableSplitLTOUnit(EnableSplitLTOUnit), 
> > Saver(Alloc),
> >   |
>
> Yeah, I get this one too.  I thought commit 37d5babb5cfa ("jit: Support
> opaque pointers in LLVM 16.") was going to silence it, but I was quite
> mistaken.  I gave that code a quick look and could not understand what
> it was complaining about.  Is it a bug in the LLVM headers?

I found the commit where they fixed that in 15+:

https://github.com/llvm/llvm-project/commit/1d9086bf054c2e734940620d02d4451156b424e6

They don't seem to back-patch fixes, generally.




Re: 2023-11-09 release announcement draft

2023-11-07 Thread Noah Misch
On Tue, Nov 07, 2023 at 09:02:03AM -0500, Jonathan S. Katz wrote:
> On 11/6/23 9:52 PM, Noah Misch wrote:
> > On Mon, Nov 06, 2023 at 05:04:25PM -0500, Jonathan S. Katz wrote:

> > Delete lines starting here ...
> > 
> > > This is the **final release of PostgreSQL 11**. PostgreSQL 10 will no 
> > > longer
> > > receive
> > > [security and bug fixes](https://www.postgresql.org/support/versioning/).
> > > If you are running PostgreSQL 10 in a production environment, we suggest 
> > > that
> > > you make plans to upgrade.
> > 
> > ... to here.  They're redundant with "PostgreSQL 11 EOL Notice" below:
> 
> Initially, I strongly disagreed with this recommendation, as I've seen
> enough people say that they were unaware that a community version is EOL. We
> can't say this enough.
> 
> However, I did decide to clip it out because the notice is just below.

I just figured it was a copy-paste error, given the similarity of nearby
sentences.  I have no concern with a general goal of saying more about the end
of v11.




Re: meson documentation build open issues

2023-11-07 Thread Christoph Berg
Re: Andres Freund
> > We might get around that by introducing a new postgresql-manpages-XX
> > arch:all package, but that might be too much micropackaging.
> 
> I've not done packaging in, uh, a fair while, but isn't the common solution to
> that a -common package? There might be a few more files we could put into one.

True. /usr/share/postgresql/17/ is 4.2MB here, with 1.5MB manpages,
1.1MB /extensions/ and some other bits. Will consider, thanks.

> > > + 
> > > +  -Dselinux={ disabled | auto | enabled 
> > > }
> > > +  
> > > +   
> > > +Build with selinux support, enabling the  > > linkend="sepgsql"/>
> > > +extension.
> > 
> > This option defaults to ... auto?
> 
> Not quite sure what you mean? Today it defaults to disabled, a patch changing
> that should also change the docs?

What I failed to say is that the other options document what the
default is; this one doesn't yet.

Christoph




Re: Atomic ops for unlogged LSN

2023-11-07 Thread Nathan Bossart
On Tue, Nov 07, 2023 at 11:47:46AM -0500, Stephen Frost wrote:
> We only care about the value of the unlogged LSN being correct during
> normal shutdown when we're writing out the shutdown checkpoint, but by
> that time everything else has been shut down and the value absolutely
> should not be changing.

I agree that's all true.  I'm trying to connect how this scenario ensures
we see the most up-to-date value in light of this comment above
pg_atomic_read_u32():

 * The read is guaranteed to return a value as it has been written by this or
 * another process at some point in the past. There's however no cache
 * coherency interaction guaranteeing the value hasn't since been written to
 * again.

Is there something special about all other backends being shut down that
ensures this returns the most up-to-date value and not something from "some
point in the past" as the stated contract for this function seems to
suggest?

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: meson documentation build open issues

2023-11-07 Thread Andres Freund
Hi,

On 2023-11-06 10:45:27 +0100, Christoph Berg wrote:
> Re: Andres Freund
> > > > The reason for that is simply that the docs take too long to build.
> > >
> > > That why I'd prefer to be able to separate arch:all and arch:any
> > > builds, yes.
> > 
> > What's stopping you from doing that?  I think the only arch:any content we
> > have is the docs, and those you can build separately? Doc builds do trigger
> > generation of a handful of files besides the docs, but not more.
> 
> Historically, .deb files have been required to contain the manpages
> for all executables even when there's a separate -doc package. This
> means we'd need a separate (hopefully fast) manpage build even when
> the arch:any binaries are built.

Manpages are a bit faster to build than html, but not a whole lot. Both are a
lot faster than PDF.


> We might get around that by introducing a new postgresql-manpages-XX
> arch:all package, but that might be too much micropackaging.

I've not done packaging in, uh, a fair while, but isn't the common solution to
that a -common package? There might be a few more files we could put into one.


> > + 
> > +  -Dselinux={ disabled | auto | enabled }
> > +  
> > +   
> > +Build with selinux support, enabling the 
> > +extension.
> 
> This option defaults to ... auto?

Not quite sure what you mean? Today it defaults to disabled, a patch changing
that should also change the docs?


> > index 90e2c062fa8..003b57498bb 100644
> > --- a/doc/src/sgml/meson.build
> > +++ b/doc/src/sgml/meson.build
> > @@ -142,6 +142,7 @@ if docs_dep.found()
> >'--install-dir-contents', dir_doc_html, html],
> >  build_always_stale: true, build_by_default: false,
> >)
> > +  alias_target('doc-html', install_doc_html)
> >alias_target('install-doc-html', install_doc_html)
> 
> Shouldn't this just build the html docs, without installing?
> 
> > +  alias_target('doc-man', install_doc_html)
> >alias_target('install-doc-man', install_doc_man)
> 
> ... same
> 
> 
> > + 
> > +   install-install-world
> 
> install-world
> 
> > + 
> > +   install-doc-html
> > +   
> > +
> > + Install documentation in man page format.
> 
> install-doc-man

Oops.


> > +   
> > +Documentation Targets
> 
> > + 
> > +   docs
> > +   doc-html
> > +   
> > +
> > + Build documentation in multi-page HTML format.  Note that
> > + docs does not include 
> > building
> > + man page documentation, as man page generation seldom fails when
> > + building HTML documentation succeeds.
> 
> Why is that a reason for not building the manpages?

I didn't have it that way, and Tom argued strongly for maintaining that
behaviour from the make build. Personally I wouldn't.



> > +   
> > +Code Targets
> 
> I would have expected the sections to be in the order
> build-docs-install. Having install first seems weird to me.

Makes sense to me. I just had the install first because I wrote it first
because of our conversation...


> > +   
> > +Other Targets
> > +
> > +
> > +
> > + 
> > +   clean
> > +   
> > +
> > + Remove all build products
> > +
> > +   
> > + 
> > +
> > + 
> > +   test
> > +   
> > +
> > + Remove all enabled tests. Support for some classes of tests can be
> > + enabled / disabled with  > linkend="configure-tap-tests-meson"/>
> > + and .
> 
> This should explicitly say if contrib tests are included (or there
> needs to be a separate test-world target.)

They are included, will state that. And also s/Remove/Run/


> > Subject: [PATCH v1 5/5] meson: Add -Dpkglibdir option
> 
> Will give that a try, thanks!

Thanks for the review!

Greetings,

Andres Freund




Re: meson documentation build open issues

2023-11-07 Thread Tom Lane
Alvaro Herrera  writes:
> If the problem is broken doc patches, then maybe a solution is to
> include the `xmllint --noout --valid` target in whatever the check-world
> equivalent is for meson.

+1, but let's do that for the "make" build too.  I see that
doc/src/sgml/Makefile has a "check" target, but AFAICS it's not
wired up to the top-level check-world.

regards, tom lane




Re: Atomic ops for unlogged LSN

2023-11-07 Thread Stephen Frost
Greetings,

* Nathan Bossart (nathandboss...@gmail.com) wrote:
> On Tue, Nov 07, 2023 at 12:57:32AM +, John Morris wrote:
> > I incorporated your suggestions and added a few more. The changes are
> > mainly related to catching potential errors if some basic assumptions
> > aren’t met.
> 
> Hm.  Could we move that to a separate patch?  We've lived without these
> extra checks for a very long time, and I'm not aware of any related issues,
> so I'm not sure it's worth the added complexity.  And IMO it'd be better to
> keep it separate from the initial atomics conversion, anyway.

I do see the value in adding an Assert, though I don't want to throw
away the info about what the recent unlogged LSN was when we crash.  As
that basically boils down to a one-line addition, I don't think it
really needs to be in a separate patch.

> > I found the comment about cache coherency a bit confusing. We are dealing
> > with a single address, so there should be no memory ordering or coherency
> > issues. (Did I misunderstand?) I see it more as a race condition. Rather
> > than merely explaining why it shouldn’t happen, the new version verifies
> > the assumptions and throws an Assert() if something goes wrong.
> 
> I was thinking of the comment for pg_atomic_read_u32() that I cited earlier
> [0].  This comment also notes that pg_atomic_read_u32/64() has no barrier
> semantics.  My interpretation of that comment is that these functions
> provide no guarantee that the value returned is the most up-to-date value.

There seems to be some serious misunderstanding about what is happening
here.  The value written into the control file for unlogged LSN during
normal operation does *not* need to be the most up-to-date value and
talking about it as if it needs to be the absolutely most up-to-date and
correct value is, if anything, adding to the confusion, not reducing
confusion.  The reason to write in anything other than a zero during
these routine checkpoints for unlogged LSN is entirely for forensics
purposes, not because we'll ever actually use the value- during crash
recovery and backup/restore, we're going to reset the unlogged LSN
counter anyway and we're going to throw away all of unlogged table
contents across the entire system.

We only care about the value of the unlogged LSN being correct during
normal shutdown when we're writing out the shutdown checkpoint, but by
that time everything else has been shut down and the value absolutely
should not be changing.
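The distinction Stephen draws can be sketched with C11 atomics (purely for illustration; PostgreSQL uses its own pg_atomic_* API, and the names here are invented). Relaxed operations guarantee only that a read returns some previously written value; but once every writer has stopped, as at shutdown-checkpoint time, the single remaining reader necessarily observes the final value:

```c
#include <stdatomic.h>
#include <stdint.h>

/*
 * Illustration using C11 atomics, not PostgreSQL's pg_atomic_* API.
 * During normal operation the counter is advanced with relaxed
 * atomics; a routine checkpoint may read a slightly stale value,
 * which is acceptable because that copy is written for forensics
 * only.  At shutdown, after all writers have stopped, the relaxed
 * read trivially returns the final value.
 */
static _Atomic uint64_t unlogged_lsn;

uint64_t
advance_unlogged_lsn(void)
{
	return atomic_fetch_add_explicit(&unlogged_lsn, 1,
									 memory_order_relaxed) + 1;
}

uint64_t
read_unlogged_lsn(void)
{
	/* No barrier: may lag concurrent writers, exact once they stop. */
	return atomic_load_explicit(&unlogged_lsn, memory_order_relaxed);
}
```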

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: meson documentation build open issues

2023-11-07 Thread Alvaro Herrera
On 2023-Nov-07, Andres Freund wrote:

> >I would like to have some set of options that enables it so that the
> >standard documentation targets become part of "meson compile" and
> >"meson install".
> 
> -0.5 - it's just too painfully slow. For all scripted uses you can just as 
> well use install-world...

If the problem is broken doc patches, then maybe a solution is to
include the `xmllint --noout --valid` target in whatever the check-world
equivalent is for meson.  Looking at doc/src/sgml/meson.build, we don't
seem to do that anywhere.  Doing the no-output lint run is very fast
(375ms real time in my machine, whereas "make html" takes 27s).

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"¿Cómo puedes confiar en algo que pagas y que no ves,
y no confiar en algo que te dan y te lo muestran?" (Germán Poo)




Re: meson documentation build open issues

2023-11-07 Thread Tom Lane
Andres Freund  writes:
> On November 7, 2023 7:55:37 AM PST, Peter Eisentraut  
> wrote:
>> I would like to have some set of options that enables it so that the 
>> standard documentation targets become part of "meson compile" and "meson 
>> install".

> -0.5 - it's just too painfully slow. For all scripted uses you can just as 
> well use install-world...

I think we should set up the meson stuff so that "install" and
"install-world" cover exactly what they did with "make".  Otherwise
there will be too much confusion.

regards, tom lane




Re: XID formatting and SLRU refactorings (was: Add 64-bit XIDs into PostgreSQL 15)

2023-11-07 Thread Maxim Orlov
On Mon, 6 Nov 2023 at 16:07, Alexander Korotkov 
wrote:

> Hi!
>
> On Wed, Jul 5, 2023 at 4:46 PM Aleksander Alekseev <
> aleksan...@timescale.com> wrote:
> > PFE the corrected patchset v58.
>
> I'd like to revive this thread.
>
Hi! Great news!


> BTW, there is a typo in a word "exceeed".
>
Fixed.


>
> +static int inline
> +SlruFileName(SlruCtl ctl, char *path, int64 segno)
> +{
> ...
> +}
>
> I think it's worth adding asserts here to verify there is no overflow
> making us map different segments into the same files.
>
Agree, assertion added.
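The kind of check being discussed might look like the following. This is a hypothetical stand-in for SlruFileName, not the patch's actual code; the directory layout, the 15-hex-digit width, and the resulting 2^60 - 1 ceiling are illustrative choices made for this sketch:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical sketch of a 64-bit segment-file name with an overflow
 * assertion: the printed field width must be able to represent every
 * legal segment number uniquely, or two distinct segments would
 * silently map to the same file name.
 */
#define SLRU_SEG_DIGITS 15		/* 15 hex digits: up to 2^60 - 1 */

static int
slru_file_name(char *path, size_t pathlen, const char *dir, int64_t segno)
{
	assert(segno >= 0);
	/* must fit in SLRU_SEG_DIGITS hex digits, else names collide */
	assert(segno <= (int64_t) 0xFFFFFFFFFFFFFFF);
	return snprintf(path, pathlen, "%s/%0*llX",
					dir, SLRU_SEG_DIGITS, (unsigned long long) segno);
}
```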


> +   return occupied == max_notify_queue_pages;
>
> I'm not sure if the current code could actually allow to occupy more than
> max_notify_queue_pages.  Probably not even in extreme cases.  But I still
> think it would be safer and easier to read to write "occupied >=
> max_notify_queue_pages" here.
>
Fixed.


> diff --git a/src/test/modules/test_slru/test_slru.c
> b/src/test/modules/test_slru/test_slru.c
>
> The actual 64-bitness of SLRU pages isn't much exercised in our automated
> tests.  It would be too exhausting to make pg_notify actually use higher
> than 2**32 page numbers.  Thus, I think test/modules/test_slru is a good
> place to give high page numbers a good test.
>
PFA, I've added a test for the 64-bit SLRU page case.

By the way, there is another useful thing we may do here.  For now the
pg_commit_ts functionality is rather strange: if it was enabled, then
disabled, and then enabled again, all the data from before will be
discarded.  Meanwhile, users expect to have their commit timestamps for
all transactions that were "logged" while this feature was enabled.  It's
weird.

AFAICS, the only reason for this behaviour is transaction wraparound.  It
may occur while the feature is disabled, and then it is safe to simply
remove all the data from the previous period.  If we switch to
FullTransactionId in commit_ts we can overcome this limitation.  But I'm
not sure it's worth trying to fix this in the current patchset, since it
is already non-trivial.

-- 
Best regards,
Maxim Orlov.


v59-0003-Make-use-FullTransactionId-in-2PC-filenames.patch
Description: Binary data


v59-0004-Add-SLRU-tests-for-64-bit-page-case.patch
Description: Binary data


v59-0001-Index-SLRUs-by-64-bit-integers-rather-than-by-32.patch
Description: Binary data


v59-0002-Use-larger-segment-file-names-for-pg_notify.patch
Description: Binary data


Re: ALTER TABLE uses a bistate but not for toast tables

2023-11-07 Thread Matthias van de Meent
Hi Justin,

This patch went stale quite some time ago; the CFBot does not seem to
have any history of a successful apply attempt, nor do we have any
successful build history (which was introduced some time ago already).

Are you planning on rebasing this patch?

Kind regards,

Matthias van de Meent




Re: meson documentation build open issues

2023-11-07 Thread Andres Freund
Hi, 

On November 7, 2023 7:55:37 AM PST, Peter Eisentraut  
wrote:
>On 03.11.23 19:19, Christoph Berg wrote:
> You can control this with the "docs" option for meson, as of recently.
 I've been looking into switching the Debian PG 17 build to meson, but
 I'm running into several problems.
 
 * The docs are still not built by default, and -Ddocs=enabled doesn't
change that
>>> Maybe I am missing something - they aren't built by default in autoconf
>>> either?
>> True, but the documentation (and this thread) reads like it should. Or
>> at least it should, when I explicitly say -Ddocs=enabled.
>> 
>> What would also help is when the tail of the meson output had a list
>> of features that are enabled. There's the list of "External libraries"
>> which is quite helpful at figuring out what's still missing, but
>> perhaps this could be extended:
>> 
>>Features
>>  LLVM : YES (/usr/bin/llvm-config-16)
>>  DOCS : YES (html pdf texinfo)
>> 
>> Atm it's hidden in the long initial blurb of "Checking for.." and the
>> "NO" in there don't really stand out as much, since some of them are
>> normal.
>
>I don't feel like we have fully worked out how the docs options should fit 
>together.
>
>With the make build system, there is a canonical sequence of
>
>make world
>make check-world
>make install-world
>
>that encompasses everything.
>
>Now with meson to handle the documentation one needs to remember a variety of 
>additional targets.  (There is a risk that once this gets more widespread, 
>more people will submit broken documentation.)

install-world with meson also installs docs.


>I would like to have some set of options that enables it so that the standard 
>documentation targets become part of "meson compile" and "meson install".

-0.5 - it's just too painfully slow. For all scripted uses you can just as well 
use install-world...

Andres
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.




Re: meson documentation build open issues

2023-11-07 Thread Peter Eisentraut

On 03.11.23 19:19, Christoph Berg wrote:

You can control this with the "docs" option for meson, as of recently.

I've been looking into switching the Debian PG 17 build to meson, but
I'm running into several problems.

* The docs are still not built by default, and -Ddocs=enabled doesn't
   change that

Maybe I am missing something - they aren't built by default in autoconf
either?

True, but the documentation (and this thread) reads like it should. Or
at least it should, when I explicitly say -Ddocs=enabled.

What would also help is when the tail of the meson output had a list
of features that are enabled. There's the list of "External libraries"
which is quite helpful at figuring out what's still missing, but
perhaps this could be extended:

   Features
 LLVM : YES (/usr/bin/llvm-config-16)
 DOCS : YES (html pdf texinfo)

Atm it's hidden in the long initial blurb of "Checking for.." and the
"NO" in there don't really stand out as much, since some of them are
normal.


I don't feel like we have fully worked out how the docs options should 
fit together.


With the make build system, there is a canonical sequence of

make world
make check-world
make install-world

that encompasses everything.

Now with meson to handle the documentation one needs to remember a 
variety of additional targets.  (There is a risk that once this gets 
more widespread, more people will submit broken documentation.)


I would like to have some set of options that enables it so that the 
standard documentation targets become part of "meson compile" and "meson 
install".






Re: Compiler warning on Debian 12, PostgreSQL 16 Beta3

2023-11-07 Thread Alvaro Herrera
On 2023-Aug-25, Daniel Westermann (DWE) wrote:

> I've just noticed this warning when building on Debian 12:
> 
> In file included from 
> /usr/lib/llvm-14/include/llvm/Analysis/ModuleSummaryAnalysis.h:17,
>  from llvmjit_inline.cpp:51:
> /usr/lib/llvm-14/include/llvm/IR/ModuleSummaryIndex.h: In constructor 
> ‘llvm::ModuleSummaryIndex::ModuleSummaryIndex(bool, bool)’:
> /usr/lib/llvm-14/include/llvm/IR/ModuleSummaryIndex.h:1175:73: warning: 
> member ‘llvm::ModuleSummaryIndex::Alloc’ is used uninitialized 
> [-Wuninitialized]
>  1175 |   : HaveGVs(HaveGVs), EnableSplitLTOUnit(EnableSplitLTOUnit), 
> Saver(Alloc),
>   |

Yeah, I get this one too.  I thought commit 37d5babb5cfa ("jit: Support
opaque pointers in LLVM 16.") was going to silence it, but I was quite
mistaken.  I gave that code a quick look and could not understand what
it was complaining about.  Is it a bug in the LLVM headers?

Adding Andres and Thomas to CC, because they're the ones touching the
LLVM / JIT code.

Any clues?

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/




Re: Buffer Cache Problem

2023-11-07 Thread Matthias van de Meent
On Tue, 7 Nov 2023 at 14:28, jacktby jacktby  wrote:
>
> Hi, postgres hackers, I’m studying postgres buffer cache part. So I open this 
> thread to communicate some buffer cache codes design and try to improve some 
> tricky codes.
>
> For Buffer Cache, we know it’s a buffer array, every bucket of this array is 
> consist of a data page and its header which is used to describe the state of 
> the buffer.
>
> For field wait_backend_pgprocno, the comment is "backend of pin-count 
> waiter”, I have problems below:

Did you read the README at src/backend/storage/buffer/README, as well
as the comments and documentation in and around the buffer-locking
functions?

> 1. it means which processId is waiting this buffer, right?
> 2. and if wait_backend_pgprocno is valid, so it says this buffer is in use by 
> one process, right?
> 3. if one buffer is wait by another process, it means all buffers are out of 
> use, right? So let’s try this: we have 5 buffers with ids (1,2,3,4,5), and 
> they  are all in use, now another process  with processId 8017 is coming, and 
> it choose buffer id 1, so  buffer1’s wait_backend_pgprocno is 8017, but later
> buffer4 is released, can process 8017 change to get buffer4? how?

I believe these questions are generally answered by the README and the
comments in bufmgr.c/buf_internal.h for the functions that try to lock
buffers.

> 4. wait_backend_pgprocno is a “integer” type, not an array, why can one 
> buffer be wait by only one process?

Yes, that is correct. It seems like PostgreSQL has yet to find a
workload that requires more than one backend to wait for super-exclusive
access to a buffer at the same time.
VACUUM seems to be the only workload that can currently wait and sleep
for this exclusive buffer access, and it is already limited to one
process per relation, so there are no explicit concurrent
super-exclusive waits in the system right now.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




Re: meson documentation build open issues

2023-11-07 Thread Peter Eisentraut

On 03.11.23 22:16, Andres Freund wrote:
[selinux option]

Why isn't it "auto" like the others?

I don't really remember why I did that, but it's platform specific, maybe
that's why I did it that way?

Isn't that kind the point of autodetecting things? Aren't bonjour and
bsd_auth autodetected as well?

I'd be happy to change it, unless somebody objects?


Makes sense to me to change it to auto.





Re: Protocol question regarding Portal vs Cursor

2023-11-07 Thread Tom Lane
Dave Cramer  writes:
> If we use a Portal it is possible to open the portal and do a describe and
> then Fetch N records.

> Using a Cursor we open the cursor. Is there a corresponding describe and a
> way to fetch N records without getting the fields each time. Currently we
> have to send the SQL  "fetch  N" and we get the fields and the
> rows. This seems overly verbose.

Portals and cursors are pretty much the same thing, so why not use
the API that suits you better?
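For reference, the difference being discussed shows up in the message flow. A
rough sketch of the extended-protocol exchange (message names from the v3
frontend/backend protocol; row counts illustrative):

```
Parse("SELECT ...")          -> ParseComplete
Bind(portal "p1")            -> BindComplete
Describe(portal "p1")        -> RowDescription          (field metadata, sent once)
Execute("p1", max_rows = N)  -> DataRow x N, PortalSuspended
Execute("p1", max_rows = N)  -> DataRow x N, ...        (no RowDescription repeat)
Sync                         -> ReadyForQuery
```

With a cursor driven through the simple protocol, by contrast, every
"FETCH N FROM c" result carries its own RowDescription before the rows, which
is the verbosity complained about above; since a declared cursor is backed by
a portal of the same name, fetching from it with Execute avoids the repetition.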

regards, tom lane




Re: Relids instead of Bitmapset * in plannode.h

2023-11-07 Thread Tom Lane
Alvaro Herrera  writes:
> On 2023-Oct-31, Ashutosh Bapat wrote:
>> For some reason plannode.h has declared variable to hold RTIs as
>> Bitmapset * instead of Relids like other places. Here's patch to fix
>> it. This is superficial change as Relids is typedefed to Bitmapset *.
>> Build succeeds for me and also make check passes.

> I think the reason for having done it this way, was mostly to avoid
> including pathnodes.h in plannodes.h.

Yes, I'm pretty sure that's exactly the reason, and I'm strongly
against the initially-proposed patch.  The include footprint of
pathnodes.h would be greatly expanded, for no real benefit.
Among other things, that fuzzes the distinction between planner
modules and non-planner modules.

> While looking at it, I noticed that tcopprot.h includes both plannodes.h
> and parsenodes.h, but there's no reason to include the latter (or at
> least headerscheck doesn't complain after removing it), so I propose to
> remove it, per 0001 here.

0001 is ok, except check #include alphabetization.

> I also noticed while looking that I messed up in commit 7103ebb7aae8
> ("Add support for MERGE SQL command") on this point, because I added
> #include parsenodes.h to plannodes.h.  This is because MergeAction,
> which is in parsenodes.h, is also needed by some executor code.  But the
> real way to fix that is to define that struct in primnodes.h.  0002 does
> that.  (I'm forced to also move enum OverridingKind there, which is a
> bit annoying.)

This seems OK.  It seems to me that parsenodes.h has been required
by plannodes.h for a long time, but if we can decouple them, all
the better.

> 0003 here is your patch, which I include because of conflicts with my
> 0002.

Still don't like it.

> ... I would be more at ease if we didn't have to include
> parsenodes.h in pathnodes.h, though, but that looks more difficult to
> achieve.

Yeah, that dependency has been there a long time too.  I'm not too
fussed by dependencies on parsenodes.h, because anything involved
with either planning or execution will certainly be looking at
query trees too.  But I don't want to add dependencies that tie
planning and execution together.

regards, tom lane




MERGE: AFTER ROW trigger failure for cross-partition update

2023-11-07 Thread Dean Rasheed
While working on the MERGE RETURNING patch, I noticed a pre-existing
MERGE bug --- ExecMergeMatched() should not call ExecUpdateEpilogue()
if ExecUpdateAct() indicates that it did a cross-partition update.

The immediate consequence is that it incorrectly tries (and fails) to
fire AFTER UPDATE ROW triggers, which it should not do if the UPDATE
has been turned into a DELETE and an INSERT:

DROP TABLE IF EXISTS foo CASCADE;

CREATE TABLE foo (a int) PARTITION BY LIST (a);
CREATE TABLE foo_p1 PARTITION OF foo FOR VALUES IN (1);
CREATE TABLE foo_p2 PARTITION OF foo FOR VALUES IN (2);
INSERT INTO foo VALUES (1);

CREATE OR REPLACE FUNCTION foo_trig_fn() RETURNS trigger AS
$$
BEGIN
  RAISE NOTICE 'Trigger: % %', TG_WHEN, TG_OP;
  IF TG_OP = 'DELETE' THEN RETURN OLD; END IF;
  RETURN NEW;
END
$$ LANGUAGE plpgsql;

CREATE TRIGGER foo_b_trig BEFORE INSERT OR UPDATE OR DELETE ON foo
  FOR EACH ROW EXECUTE FUNCTION foo_trig_fn();
CREATE TRIGGER foo_a_trig AFTER INSERT OR UPDATE OR DELETE ON foo
  FOR EACH ROW EXECUTE FUNCTION foo_trig_fn();

UPDATE foo SET a = 2 WHERE a = 1;

NOTICE:  Trigger: BEFORE UPDATE
NOTICE:  Trigger: BEFORE DELETE
NOTICE:  Trigger: BEFORE INSERT
NOTICE:  Trigger: AFTER DELETE
NOTICE:  Trigger: AFTER INSERT
UPDATE 1

MERGE INTO foo USING (VALUES (1)) AS v(a) ON true
  WHEN MATCHED THEN UPDATE SET a = v.a;

NOTICE:  Trigger: BEFORE UPDATE
NOTICE:  Trigger: BEFORE DELETE
NOTICE:  Trigger: BEFORE INSERT
NOTICE:  Trigger: AFTER DELETE
NOTICE:  Trigger: AFTER INSERT
ERROR:  failed to fetch tuple2 for AFTER trigger

The attached patch fixes that, making the UPDATE path in
ExecMergeMatched() consistent with ExecUpdate().

(If there were no AFTER ROW triggers, the old code would go on to do
other unnecessary things, like WCO/RLS checks, which I didn't really
look into. This patch will stop any of that too.)

Regards,
Dean
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
new file mode 100644
index 299c2c7..b16fbe9
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -2899,6 +2899,22 @@ lmerge_matched:
 }
 result = ExecUpdateAct(context, resultRelInfo, tupleid, NULL,
 	   newslot, false, );
+
+/*
+ * As in ExecUpdate(), if ExecUpdateAct() reports that a
+ * cross-partition update was done, then there's nothing else
+ * for us to do --- the UPDATE has been turned into a DELETE
+ * and an INSERT, and we must not perform any of the usual
+ * post-update tasks.
+ */
+if (updateCxt.crossPartUpdate)
+{
+	mtstate->mt_merge_updated += 1;
+	if (canSetTag)
+		(estate->es_processed)++;
+	return true;
+}
+
 if (result == TM_Ok && updateCxt.updated)
 {
 	ExecUpdateEpilogue(context, , resultRelInfo,
diff --git a/src/test/regress/expected/triggers.out b/src/test/regress/expected/triggers.out
new file mode 100644
index e8de916..6cbfbbe
--- a/src/test/regress/expected/triggers.out
+++ b/src/test/regress/expected/triggers.out
@@ -2309,6 +2309,51 @@ NOTICE:  trigger zzz on parted_trig_1_1
 NOTICE:  trigger bbb on parted_trig_2 AFTER INSERT for ROW
 NOTICE:  trigger zzz on parted_trig_2 AFTER INSERT for ROW
 drop table parted_trig;
+-- Verify that the correct triggers fire for cross-partition updates
+create table parted_trig (a int) partition by list (a);
+create table parted_trig1 partition of parted_trig for values in (1);
+create table parted_trig2 partition of parted_trig for values in (2);
+insert into parted_trig values (1);
+create or replace function trigger_notice() returns trigger as $$
+  begin
+raise notice 'trigger % on % % % for %', TG_NAME, TG_TABLE_NAME, TG_WHEN, TG_OP, TG_LEVEL;
+if TG_LEVEL = 'ROW' then
+  if TG_OP = 'DELETE' then
+return OLD;
+  else
+return NEW;
+  end if;
+end if;
+return null;
+  end;
+  $$ language plpgsql;
+create trigger parted_trig_before_stmt before insert or update or delete on parted_trig
+   for each statement execute procedure trigger_notice();
+create trigger parted_trig_before_row before insert or update or delete on parted_trig
+   for each row execute procedure trigger_notice();
+create trigger parted_trig_after_row after insert or update or delete on parted_trig
+   for each row execute procedure trigger_notice();
+create trigger parted_trig_after_stmt after insert or update or delete on parted_trig
+   for each statement execute procedure trigger_notice();
+update parted_trig set a = 2 where a = 1;
+NOTICE:  trigger parted_trig_before_stmt on parted_trig BEFORE UPDATE for STATEMENT
+NOTICE:  trigger parted_trig_before_row on parted_trig1 BEFORE UPDATE for ROW
+NOTICE:  trigger parted_trig_before_row on parted_trig1 BEFORE DELETE for ROW
+NOTICE:  trigger parted_trig_before_row on parted_trig2 BEFORE INSERT for ROW
+NOTICE:  trigger parted_trig_after_row on parted_trig1 AFTER DELETE for ROW
+NOTICE:  trigger parted_trig_after_row on parted_trig2 AFTER INSERT 

Re: Commitfest: older Waiting on Author entries

2023-11-07 Thread Nathan Bossart
On Tue, Nov 07, 2023 at 03:27:53PM +0700, John Naylor wrote:
> The following entries are WoA but have not had an email update since
> July. If there is no actionable update soon, I plan to mark these
> Returned with Feedback before end of CF:
> 
> pg_usleep for multisecond delays
> vector search support

I haven't had a chance to follow up on these in some time, so I went ahead
and marked them returned-with-feedback.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: GUC names in messages

2023-11-07 Thread Nathan Bossart
On Tue, Nov 07, 2023 at 10:33:03AM +0100, Alvaro Herrera wrote:
> On 2023-Nov-01, Nathan Bossart wrote:
>> +1, IMHO quoting GUC names makes it abundantly clear that they are special
>> identifiers.  In de4d456, we quoted the role names in a bunch of messages.
>> We didn't quote the attribute/option names, but those are in all-caps, so
>> they already stand out nicely.
> 
> I like this, and I propose we codify it in the message style guide.  How
> about this?  We can start looking at code changes to make once we decide
> we agree with this.

> +   
> +In messages containing configuration variable names, quotes are
> +not necessary when the names are visibly not English natural words, such
> +as when they have underscores or are all-uppercase.  Otherwise, quotes
> +must be added.  Do include double-quotes in a message where an arbitrary
> +variable name is to be expanded.
> +   

I'd vote for quoting all GUC names, if for no other reason than "visibly
not English natural words" feels a bit open to interpretation.  But this
seems like it's on the right track, so I won't argue too strongly if I'm
the only holdout.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: GUC names in messages

2023-11-07 Thread Tom Lane
Alvaro Herrera  writes:
> On 2023-Nov-01, Nathan Bossart wrote:
>> +1, IMHO quoting GUC names makes it abundantly clear that they are special
>> identifiers.  In de4d456, we quoted the role names in a bunch of messages.
>> We didn't quote the attribute/option names, but those are in all-caps, so
>> they already stand out nicely.

> I like this, and I propose we codify it in the message style guide.  How
> about this?  We can start looking at code changes to make once we decide
> we agree with this.

WFM.

regards, tom lane




Re: ALTER COLUMN ... SET EXPRESSION to alter stored generated column's expression

2023-11-07 Thread Peter Eisentraut

On 25.10.23 08:12, Amul Sul wrote:

Here is the rebased version for the latest master head (673a17e3120).

I haven't done any other changes related to the ON UPDATE trigger, since
that seems non-trivial; it needs a bit of work to add trigger support in
ATRewriteTable().
Also, I am not sure yet whether we should be doing these changes, and what
the correct direction for that would be.
for that.


I did some detailed testing on Db2, Oracle, and MariaDB (the three 
existing implementations of this feature that I'm tracking), and none of 
them fire any row or statement triggers when the respective statement to 
alter the generation expression is run.  So I'm withdrawing my comment 
that this should fire triggers.  (I suppose event triggers are available 
if anyone really needs the functionality.)






Re: Synchronizing slots from primary to standby

2023-11-07 Thread Drouvot, Bertrand

Hi,

On 11/7/23 11:55 AM, Amit Kapila wrote:

On Tue, Nov 7, 2023 at 3:51 PM Drouvot, Bertrand
 wrote:


On 10/31/23 10:37 AM, shveta malik wrote:

On Fri, Oct 27, 2023 at 8:43 PM Drouvot, Bertrand


Agree with your test case, but in my case I was not using pub/sub.

I was not clear, so when I said:


- create logical_slot1 on the primary (and don't start using it)


I meant don't start decoding from it (like using pg_recvlogical() or
pg_logical_slot_get_changes()).

By using pub/sub the "don't start using it" is not satisfied.

My test case is:

"
SELECT * FROM pg_create_logical_replication_slot('logical_slot1', 
'test_decoding', false, true, true);
SELECT * FROM pg_create_logical_replication_slot('logical_slot2', 
'test_decoding', false, true, true);
pg_recvlogical -d postgres -S logical_slot2 --no-loop --start -f -
"



Okay, I am able to reproduce it now. Thanks for clarification. I have
tried to change the algorithm as per suggestion by Amit in [1]

[1]: 
https://www.postgresql.org/message-id/CAA4eK1KBL0110gamQfc62X%3D5JV8-Qjd0dw0Mq0o07cq6kE%2Bq%3Dg%40mail.gmail.com


Thanks!



This is not a foolproof solution but an optimization over the first one.
Now, in any sync cycle, we take two attempts at slot creation (if any
slots are available to be created). In the first attempt, we do not wait
indefinitely on inactive slots; we wait only for a fixed amount of time,
and if the remote slot is still behind, we add it to the pending list and
move on to the next slot. Once we are done with the first attempt, in the
second attempt we go for the pending ones, and now we wait on each of them
until the primary catches up.


Aren't we "just" postponing the "issue"? I mean if there is really no activity
on, say, the first created slot, then once we move to the second attempt then 
any newly
created slot from that time would wait to be synced forever, no?



We have to wait at some point in time for such inactive slots and the
same is true even for manually created slots on standby. Do you have
any better ideas to deal with it?



What about:

- get rid of the second attempt and the pending_slot_list
- keep the wait_count and PrimaryCatchupWaitAttempt logic

so basically, get rid of:

   /*
* Now sync the pending slots which were failed to be created in first
* attempt.
*/
   foreach(cell, pending_slot_list)
   {
   RemoteSlot *remote_slot = (RemoteSlot *) lfirst(cell);

   /* Wait until the primary server catches up */
   PrimaryCatchupWaitAttempt = 0;

   synchronize_one_slot(wrconn, remote_slot, NULL);
   }

and the pending_slot_list list.

That way, for each slot that have not been created and synced yet:

- it will be created on the standby
- we will wait up to PrimaryCatchupWaitAttempt attempts
- the slot will be synced or removed on/from the standby

That way an inactive slot on the primary would not "block"
any other slots on the standby.

By "created" here I mean calling ReplicationSlotCreate() (not to be confused
with emitting "ereport(LOG, errmsg("created slot \"%s\" locally", 
remote_slot->name)); "
which is confusing as mentioned up-thread).

The problem I can see with this proposal is that the "sync" window waiting
for slot activity on the primary is "only" during the PrimaryCatchupWaitAttempt
attempts (as the slot will be dropped/recreated).

If we think this window is too short we could:

- increase it
or
- don't drop the slot once created (even if there is no activity
on the primary during PrimaryCatchupWaitAttempt attempts) so that
the next loop of attempts will compare with "older" LSN/xmin (as compared to
dropping and re-creating the slot). That way the window would extend back to
the initial slot creation.

Thoughts?

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: 2023-11-09 release announcement draft

2023-11-07 Thread Jonathan S. Katz

On 11/6/23 9:52 PM, Noah Misch wrote:

On Mon, Nov 06, 2023 at 05:04:25PM -0500, Jonathan S. Katz wrote:

The PostgreSQL Global Development Group has released an update to all supported
versions of PostgreSQL, including 16.1, 15.5, 14.10, 13.13, 12.17, and 11.22.
This release fixes over 55 bugs reported over the last several months.

This release includes fixes for indexes where in certain cases, we advise
reindexing. Please see the "Update" section for more details.


s/"Update" section/"Updating" section/ or change section title below.


Fixed.


Delete lines starting here ...


This is the **final release of PostgreSQL 11**. PostgreSQL 10 will no longer
receive
[security and bug fixes](https://www.postgresql.org/support/versioning/).
If you are running PostgreSQL 10 in a production environment, we suggest that
you make plans to upgrade.


... to here.  They're redundant with "PostgreSQL 11 EOL Notice" below:


Initially, I strongly disagreed with this recommendation, as I've seen 
enough people say that they were unaware that a community version is 
EOL. We can't say this enough.


However, I did decide to clip it out because the notice is just below.

That said, perhaps we should put out a separate announcement that states 
PostgreSQL 11 is EOL. We may want to consider doing standalone EOL 
announcements -- perhaps 6 months out, and then on the day of -- to make it 
abundantly clear that a version is reaching end-of-life.


Finally, I included Matthias' downthread recommendation in this version.

Thanks,

Jonathan
The PostgreSQL Global Development Group has released an update to all supported
versions of PostgreSQL, including 16.1, 15.5, 14.10, 13.13, 12.17, and 11.22.
This release fixes over 55 bugs reported over the last several months.

This release includes fixes for indexes; in certain cases, we advise
reindexing. Please see the "Updating" section for more details.

For the full list of changes, please review the
[release notes](https://www.postgresql.org/docs/release/).

PostgreSQL 11 EOL Notice
------------------------

**This is the final release of PostgreSQL 11**. PostgreSQL 11 is now end-of-life
and will no longer receive security and bug fixes. If you are
running PostgreSQL 11 in a production environment, we suggest that you make
plans to upgrade to a newer, supported version of PostgreSQL. Please see our
[versioning policy](https://www.postgresql.org/support/versioning/) for more
information.

Bug Fixes and Improvements
--
 
This update fixes over 55 bugs that were reported in the last several months.
The issues listed below affect PostgreSQL 16. Some of these issues may also
affect other supported versions of PostgreSQL.

* Fix issue where GiST indexes had an incorrect behavior during a "page split"
operation that could lead to incorrect results in subsequent index searches.
Please [reindex](https://www.postgresql.org/docs/current/sql-reindex.html) GiST
indexes after installing this update.
* Fix issue where B-tree indexes would incorrectly de-duplicate `interval`
columns. Please 
[reindex](https://www.postgresql.org/docs/current/sql-reindex.html)
any B-tree index that includes an `interval` column after installing this
update.
* Provide more efficient indexing of `date`, `timestamptz`, and `timestamp`
values in BRIN indexes when using a [`minmax_multi` 
opsclass](https://www.postgresql.org/docs/current/brin-builtin-opclasses.html).
While not required, we recommend
[reindexing](https://www.postgresql.org/docs/current/sql-reindex.html) BRIN
indexes that include these data types after installing this update.
* Fix for bulk table insertion into partitioned tables.
* Fix for hash-partitioned tables with multiple partition keys during step
generation and runtime pruning that could lead to crashes in some cases.
* Throw the correct error if 
[`pgrowlocks()`](https://www.postgresql.org/docs/current/pgrowlocks.html) is 
applied to a partitioned table

* Fix inconsistent rechecking of concurrently-updated rows during
[`MERGE`](https://www.postgresql.org/docs/current/sql-merge.html) when using
[`READ 
COMMITTED`](https://www.postgresql.org/docs/current/transaction-iso.html#XACT-READ-COMMITTED)
mode.
* Correctly identify the target table in an inherited `UPDATE`/`DELETE`/`MERGE`
even when the parent table is excluded by constraints.
* Fix over-allocation of a constructed 
[`tsvector`](https://www.postgresql.org/docs/current/datatype-textsearch.html#DATATYPE-TSVECTOR).
* Fix [`ALTER 
SUBSCRIPTION`](https://www.postgresql.org/docs/current/sql-altersubscription.html)
to apply changes in the `run_as_owner` option.
* Several fixes for [`COPY 
FROM`](https://www.postgresql.org/docs/current/sql-copy.html).
* Several fixes for handling torn reads with 
[`pg_control`](https://www.postgresql.org/docs/current/wal-internals.html).
* Fix "could not find pathkey item to sort" errors occurring while planning
aggregate functions with `ORDER BY` or `DISTINCT` options.
* When 

Re: ResourceOwner refactoring

2023-11-07 Thread Alexander Lakhin

Hello Heikki,

07.11.2023 14:28, Heikki Linnakangas wrote:


I feel pretty good about this overall. Barring objections or new cfbot 
failures, I will commit this in the next few days.



A script, that I published in [1], detects several typos in the patches:
betwen
ther
ReourceOwner
ResouceOwnerSort
ReleaseOwnerRelease

Maybe it's worth fixing them now, or just accumulating and cleaning all such
defects once a year...

[1] 
https://www.postgresql.org/message-id/27a7998b-d67f-e32a-a28d-e659a2e390c6%40gmail.com

Best regards,
Alexander




Buffer Cache Problem

2023-11-07 Thread jacktby jacktby
Hi, postgres hackers, I'm studying the postgres buffer cache, so I'm opening
this thread to discuss some of the buffer cache code design and try to improve
some tricky code.

For the buffer cache, we know it's a buffer array; every bucket of this array
consists of a data page and its header, which is used to describe the state of
the buffer.

This is the original code of the buffer header:

typedef struct BufferDesc
{
	BufferTag	tag;			/* ID of page contained in buffer */
	int			buf_id;			/* buffer's index number (from 0) */

	/* state of the tag, containing flags, refcount and usagecount */
	pg_atomic_uint32 state;

	int			wait_backend_pgprocno;	/* backend of pin-count waiter */
	int			freeNext;		/* link in freelist chain */
	LWLock		content_lock;	/* to lock access to buffer contents */
} BufferDesc;

For the field wait_backend_pgprocno, the comment is "backend of pin-count
waiter"; I have the questions below:
1. it means which process is waiting for this buffer, right?
2. and if wait_backend_pgprocno is valid, it says this buffer is in use by
one process, right?
3. if one buffer is being waited on by another process, it means all buffers
are out of use, right? So let's try this: we have 5 buffers with ids
(1,2,3,4,5), and they are all in use; now another process with process id 8017
is coming, and it chooses buffer 1, so buffer 1's wait_backend_pgprocno is
8017. But later buffer 4 is released; can process 8017 switch to get buffer 4?
How?
4. wait_backend_pgprocno is an "integer" type, not an array; why can one
buffer be waited on by only one process?

Hoping for your reply, thanks!! I'm willing to contribute after I study the
buffer cache implementation.

Re: 2023-11-09 release announcement draft

2023-11-07 Thread Matthias van de Meent
On Mon, 6 Nov 2023 at 23:04, Jonathan S. Katz  wrote:
>
> Hi,
>
> Attached is the release announcement draft for the 2023-11-09 release
> (16.1 et al.).
>
> Please review for accuracy and notable omissions. Please have all
> feedback in by 2023-11-09 08:00 UTC at the latest (albeit the sooner the
> better).

> 20231109updaterelease.md
> [...]
> * Provide more efficient indexing of `date`, `timestamptz`, and `timestamp`
> values in BRIN indexes. While not required, we recommend
> [reindexing](https://www.postgresql.org/docs/current/sql-reindex.html) BRIN
> indexes that include these data types after installing this update.

As the type's minmax_multi opclasses are marked as default, I believe
it makes sense to explicitly mention that only indexes that use the
type's minmax_multi opclasses would need to be reindexed for them to
see improved performance. The types' *_bloom and *_minmax opclasses
were not affected and therefore do not need to be reindexed.

Kind regards,

Matthias van de Meent




Re: collect_corrupt_items_vacuum.patch

2023-11-07 Thread Alexander Lakhin

07.11.2023 14:38, Alexander Korotkov wrote:

Hi, Alexander.

On Tue, Nov 7, 2023 at 1:00 PM Alexander Lakhin  wrote:

It looks like the v2 patch doesn't fix the original issue. Maybe I miss
something, but with the patch applied, I see the failure immediately,
though without the patch several iterations are needed to get it.


That's a bug in the patch.  Thank you for catching it.  It should start
calculation from latestCompletedXid + 1, not InvalidTransactionId.
Please, check the revised patch.


Thanks for looking at this!
Unfortunately, I still see the failure with v3, though not on the first
iteration:
...
iteration 316
Error condition in psql-8.log:
create table vacuum_test as select 42 i;
vacuum (disable_page_skipping) vacuum_test;
select * from pg_check_visible('vacuum_test');
 t_ctid

 (0,1)
(1 row)

(I've double-checked that the patch is applied and get_strict_xid_horizon()
is called.)

Best regards,
Alexander




Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan

2023-11-07 Thread Matthias van de Meent
On Tue, 7 Nov 2023 at 00:03, Peter Geoghegan  wrote:
>
> On Mon, Nov 6, 2023 at 1:28 PM Matthias van de Meent
>  wrote:
> > I'm planning on reviewing this patch tomorrow, but in an initial scan
> > through the patch I noticed there's little information about how the
> > array keys state machine works in this new design. Do you have a more
> > toplevel description of the full state machine used in the new design?
>
> This is an excellent question. You're entirely right: there isn't
> enough information about the design of the state machine.
>
> I should be able to post v6 later this week. My current plan is to
> commit the other nbtree patch first (the backwards scan "boundary
> cases" one from the ongoing CF) -- since I saw your review earlier
> today. I think that you should probably wait for this v6 before
> starting your review.

Okay, thanks for the update, then I'll wait for v6 to be posted.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




Re: XID formatting and SLRU refactorings (was: Add 64-bit XIDs into PostgreSQL 15)

2023-11-07 Thread Aleksander Alekseev
Hi again,

> PFA the corrected patchset v59.

On second thought, I believe this Assert is incorrect:

```
+else
+{
+Assert(segno >= 0 && segno <= 0x);
+return snprintf(path, MAXPGPATH, "%s/%04X", (ctl)->Dir,
+(unsigned int) segno);
+}
```

See SlruCorrectSegmentFilenameLength():

```
if (ctl->long_segment_names)
return (len == 15); /* see SlruFileName() */
else
/*
 * Commit 638cf09e76d allowed 5-character lengths. Later commit
 * 73c986adde5 allowed 6-character length.
 *
 * XXX should we still consider such names to be valid?
 */
return (len == 4 || len == 5 || len == 6);
```

Should we just drop it or check that segno is <= 0xFF?

-- 
Best regards,
Aleksander Alekseev




Re: XID formatting and SLRU refactorings (was: Add 64-bit XIDs into PostgreSQL 15)

2023-11-07 Thread Aleksander Alekseev
Hi Alexander,

> -#define TransactionIdToCTsPage(xid) \
> -   ((xid) / (TransactionId) COMMIT_TS_XACTS_PER_PAGE)
> +
> +/*
> + * Although we return an int64 the actual value can't currently exceeed 
> 2**32.
> + */
> +static inline int64
> +TransactionIdToCTsPage(TransactionId xid)
> +{
> +   return xid / (int64) COMMIT_TS_XACTS_PER_PAGE;
> +}
>
> Is there any reason we transform the macro into a function?  If not, I propose to 
> leave this as a macro.  BTW, there is a typo in a word "exceeed".

I kept the inline function, as we agreed above.

Typo fixed.

> +static int inline
> +SlruFileName(SlruCtl ctl, char *path, int64 segno)
> +{
> +   if (ctl->long_segment_names)
> +   /*
> +* We could use 16 characters here but the disadvantage would be that
> +* the SLRU segments will be hard to distinguish from WAL segments.
> +*
> +* For this reason we use 15 characters. It is enough but also means
> +* that in the future we can't decrease SLRU_PAGES_PER_SEGMENT easily.
> +*/
> +   return snprintf(path, MAXPGPATH, "%s/%015llX", ctl->Dir,
> +   (long long) segno);
> +   else
> +   return snprintf(path, MAXPGPATH, "%s/%04X", (ctl)->Dir,
> +   (unsigned int) segno);
> +}
>
> I think it worth adding asserts here to verify there is no overflow making us 
> mapping different segments into the same files.

Added. I noticed an off-by-one error in the code snippet proposed
above, so my code differs a bit.

> +   return occupied == max_notify_queue_pages;
>
> I'm not sure if the current code could actually allow occupying more than
> max_notify_queue_pages.  Probably not even in extreme cases.  But I still
> think it will be safer and easier to read to write "occupied >=
> max_notify_queue_pages" here.

Fixed.

> diff --git a/src/test/modules/test_slru/test_slru.c 
> b/src/test/modules/test_slru/test_slru.c
>
> The actual 64-bitness of SLRU pages isn't much exercised in our automated 
> tests.  It would be too exhausting to make pg_notify actually use higher than 
> 2**32 page numbers.  Thus, I think test/modules/test_slru is a good place to 
> give high page numbers a good test.

Fixed. I chose not to change any numbers in the test, in order to keep
checking corner cases, etc. The code paths for long_segment_names = true
and long_segment_names = false are almost the same, thus it would not
improve code coverage. Using the current numbers will also allow easily
switching back to long_segment_names = false in the test if necessary.

PFA the corrected patchset v59.

-- 
Best regards,
Aleksander Alekseev


v59-0003-Make-use-FullTransactionId-in-2PC-filenames.patch
Description: Binary data


v59-0002-Use-larger-segment-file-names-for-pg_notify.patch
Description: Binary data


v59-0001-Index-SLRUs-by-64-bit-integers-rather-than-by-32.patch
Description: Binary data


Re: RFC: Pluggable TOAST

2023-11-07 Thread Matthias van de Meent
On Tue, 7 Nov 2023 at 11:06, Nikita Malakhov  wrote:
>
> Hi,
>
> I've been thinking about Matthias' proposals for some time and have some
> questions:
>
> >So, in short, I don't think there is a need for a specific "Pluggable
> >toast API" like the one in the patchset at [0] that can be loaded
> >on-demand, but I think that updating our current TOAST system to a
> >system for which types can provide support functions would likely be
> >quite beneficial, for efficient extraction of data from composite
> >values.
>
> As I understand it, one of the reasons against Pluggable TOAST is that
> differences in plugged-in Toasters could result in incompatibility even
> between different versions of the same DB.

That could be part of it, but it definitely wasn't my primary concern.
The primary concern remains that the pluggable toaster patch made the
jsonb type expose an API for a pluggable toaster that for all intents
and purposes only has one implementation due to its API being
specifically tailored for the jsonb internals use case, with similar
type-specific API bindings getting built for other types, each having
strict expectations about the details of the implementation. I agree
that it makes sense to specialize TOASTing for jsonb, but what I don't
understand about it is why that would need to be achieved outside the
core jsonb code.

I understand that the 'pluggable toaster' APIs originate from one of
PostgresPRO's forks of PostgreSQL, and I think it shows. That's not to
say it's bad, but it seems to be built on different expectations:
When maintaining a fork, you have different tradeoffs when compared to
maintaining the main product. A fork's changes need to be covered
across many versions with unknown changes, thus you would want the
smallest possible changes to enable the feature - pluggable toast makes
sense here, as the changes are limited to a few jsonb internals, but
most complex code is in an extension.
However, for core PostgreSQL, I think this separation makes very
little sense: the complexity of maintaining a toast api for each type
(when there can be expected to be only one implementation) is much
more work than just building a good set of helper functions that do
that same job. It allows for more flexibility, as there is no
noticeable black-box API implementation to keep track of.

> The importance of the correct TOAST update is out of the question; I feel I
> have to prepare a patch for it. There are some questions though; I'd address
> them later with a patch.
>
> >Example support functions:
>
> >/* TODO: bikeshedding on names, signatures, further support functions. */
> >Datum typsup_roastsliceofbread(Datum ptr, int sizetarget, char cmethod)
> >Datum typsup_unroastsliceofbread(Datum ptr)
> >void typsup_releaseroastedsliceofbread(Datump ptr) /* in case of
> >non-unitary in-memory datums */
>
> Do I understand correctly that you mean extending PG_TYPE and the type cache
> by adding a new function set for toasting/detoasting a value, in addition to
> in/out, etc.?

Yes.

> I see several issues here:
> 1) We could benefit from knowledge of internals of data being toasted (i.e.
> in case of JSON value with key-value structure) only when EXTERNAL
> storage mode is set; otherwise the value will be compressed before being toasted.
> So we have to keep both TOAST mechanics regarding the storage mode
> being used. It's the same issue as in Pluggable TOAST. Is it OK?

I think it is OK that the storage-related changes of this only start
once the toast mechanism is

> 2) TOAST pointer is very limited in means of data it keeps, we'd have to
> extend it anyway and keep both for backwards compatibility;

Yes. We already have to retain the current (de)toast infrastructure to
make sure current data files can still be read, given that we want to
retain backward compatibility for currently toasted data.

> 3) There is no API and such an approach would require implementing
> toast and detoast in every data type we want to be custom toasted, resulting
> in modifications to multiple files. Maybe we have to consider introducing such
> an API?

No. As I mentioned, we can retain the current toast mechanism for
current types that do not yet want to use these new toast APIs. If we
use one different varatt_1b_e tag for type-owned toast pointers, the
system will be opt-in for types, and for types that don't (yet) have
their own toast slicing design will keep using the old all-or-nothing
single-allocation data with the good old compress-then-slice
out-of-line toast storage.

> 4) 1 toast relation per regular relation. With an update mechanics this will
> be less limiting, but still a limiting factor because 1 entry in base table
> could have a lot of entries in the toast table. Are we doing something with
> this?

I don't think that is relevant to the topic of type-aware toasting
optimization. The toast storage relation growing too large is not
unique to jsonb- or bytea-typed columns, so I believe this is better
solved in a different thread. Ideas 

Re: Call pqPipelineFlush from PQsendFlushRequest

2023-11-07 Thread Alvaro Herrera
On 2023-Nov-07, Michael Paquier wrote:

> On Tue, Nov 07, 2023 at 10:38:04AM +0100, Jelte Fennema-Nio wrote:
> > In pipeline mode after queuing a message to be sent we would flush the
> > buffer if the size of the buffer passed some threshold. The only message
> > type that we didn't do that for was the Flush message. This addresses
> > that oversight.
> > 
> > I noticed this discrepancy while reviewing the
> > PQsendSyncMessage/PQpipelinePutSync patchset.
> 
> Indeed, it looks a bit strange that there is no flush if the buffer
> threshold is reached once the message is sent, so your suggestion
> sounds right.  Alvaro?

I agree, and I intend to get this patch pushed once the release freeze
is lifted.

-- 
Álvaro Herrera         Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
"The Gord often wonders why people threaten never to come back after they've
been told never to return" (www.actsofgord.com)




Re: collect_corrupt_items_vacuum.patch

2023-11-07 Thread Alexander Korotkov
Hi, Alexander.

On Tue, Nov 7, 2023 at 1:00 PM Alexander Lakhin  wrote:
> It looks like the v2 patch doesn't fix the original issue. Maybe I'm missing
> something, but with the patch applied, I see the failure immediately,
> though without the patch several iterations are needed to get it.


That's a bug in the patch.  Thank you for catching it.  It should start
calculation from latestCompletedXid + 1, not InvalidTransactionId.
Please, check the revised patch.

--
Regards,
Alexander Korotkov


0001-Fix-false-reports-in-pg_visibility-v3.patch
Description: Binary data


Protocol question regarding Portal vs Cursor

2023-11-07 Thread Dave Cramer
Greetings,

If we use a Portal it is possible to open the portal and do a describe and
then Fetch N records.

Using a Cursor we open the cursor. Is there a corresponding describe and a
way to fetch N records without getting the fields each time. Currently we
have to send the SQL  "fetch  N" and we get the fields and the
rows. This seems overly verbose.

Dave Cramer


Re: ResourceOwner refactoring

2023-11-07 Thread Heikki Linnakangas

On 25/10/2023 21:07, Andres Freund wrote:

It's not too awful to have it be in this order:

struct ResourceOwnerData {
 ResourceOwner  parent;   /* 0 8 */
 ResourceOwner  firstchild;   /* 8 8 */
 ResourceOwner  nextchild;/*16 8 */
 const char  *  name; /*24 8 */
 _Bool  releasing;/*32 1 */
 _Bool  sorted;   /*33 1 */
 uint8  narr; /*34 1 */
 uint8  nlocks;   /*35 1 */
 uint32 nhash;/*36 4 */
 uint32 capacity; /*40 4 */
 uint32 grow_at;  /*44 4 */
 ResourceElem   arr[32];  /*48   512 */
 /* --- cacheline 8 boundary (512 bytes) was 48 bytes ago --- */
 ResourceElem * hash; /*   560 8 */
 LOCALLOCK *locks[15];/*   568   120 */

 /* size: 688, cachelines: 11, members: 14 */
 /* last cacheline: 48 bytes */
};

Requires rephrasing a few comments to document that the lengths are separate
from the array / hashtable / locks, but otherwise...


Hmm, let's move 'capacity' and 'grow_at' after 'arr', too. They are only 
needed together with 'hash'.



This reliably shows a decent speed improvement in my stress test [1], on the
order of 5%.

[1] pgbench running
DO $$ BEGIN FOR i IN 1..5000 LOOP PERFORM abalance FROM pgbench_accounts 
WHERE aid = 17;END LOOP;END;$$;


I'm seeing similar results, although there's enough noise in the test 
that I'm not sure how real the difference would be across different tests.



At that point, the first array entry fits into the first cacheline. If we were
to move parent, firstchild, nextchild, name further down the struct, three
array entries would be on the first line. Just using the array presumably is a
very common case, so that might be worth it?

I got less clear performance results with this one, and it's also quite
possible it could hurt, if resowners aren't actually "used", just
released. Therefore it's probably not worth it for now.


You're assuming that the ResourceOwner struct begins at a cacheline 
boundary. That's not usually true, we don't try to cacheline-align it. 
So while it's helpful to avoid padding and to keep frequently-accessed 
fields close to each other, there's no benefit in keeping them at the 
beginning of the struct.


--
Heikki Linnakangas
Neon (https://neon.tech)





Re: Call pqPipelineFlush from PQsendFlushRequest

2023-11-07 Thread Michael Paquier
On Tue, Nov 07, 2023 at 10:38:04AM +0100, Jelte Fennema-Nio wrote:
> In pipeline mode after queuing a message to be sent we would flush the
> buffer if the size of the buffer passed some threshold. The only message
> type that we didn't do that for was the Flush message. This addresses
> that oversight.
> 
> I noticed this discrepancy while reviewing the
> PQsendSyncMessage/PQpipelinePutSync patchset.

Indeed, it looks a bit strange that there is no flush if the buffer
threshold is reached once the message is sent, so your suggestion
sounds right.  Alvaro?
--
Michael


signature.asc
Description: PGP signature


Re: Relids instead of Bitmapset * in plannode.h

2023-11-07 Thread Alvaro Herrera
Hello,

On 2023-Oct-31, Ashutosh Bapat wrote:

> For some reason plannode.h has declared variable to hold RTIs as
> Bitmapset * instead of Relids like other places. Here's patch to fix
> it. This is superficial change as Relids is typedefed to Bitmapset *.
> Build succeeds for me and also make check passes.

I think the reason for having done it this way, was mostly to avoid
including pathnodes.h in plannodes.h.  Did you explore what the
consequences are?  Starting here:
https://doxygen.postgresql.org/plannodes_8h.html

While looking at it, I noticed that tcopprot.h includes both plannodes.h
and parsenodes.h, but there's no reason to include the latter (or at
least headerscheck doesn't complain after removing it), so I propose to
remove it, per 0001 here.  There's a couple of files that need to be
repaired for this change.  windowfuncs.c is a no-brainer.  However,
having to edit bootstrap.h is a bit surprising -- I think before
dac048f71ebb ("Build all Flex files standalone") this inclusion wasn't
necessary, because the .y file already includes parsenodes.h; but after
that commit, bootparse.h needs parsenodes.h to declare YYSTYPE, per
comments in bootscanner.l.  Anyway, this seems a good change.

I also noticed while looking that I messed up in commit 7103ebb7aae8
("Add support for MERGE SQL command") on this point, because I added
#include parsenodes.h to plannodes.h.  This is because MergeAction,
which is in parsenodes.h, is also needed by some executor code.  But the
real way to fix that is to define that struct in primnodes.h.  0002 does
that.  (I'm forced to also move enum OverridingKind there, which is a
bit annoying.)

0003 here is your patch, which I include because of conflicts with my
0002.  After my 0002, plannodes.h is pretty slim, so I'd be hesitant to
include pathnodes.h just to be able to change the Bitmapset * to Relids.
But on the other hand, it doesn't seem to have too bad an effect overall
(if only because plannodes.h is included by rather few files), so +0.1
on doing this.  I would be more at ease if we didn't have to include
parsenodes.h in pathnodes.h, though, but that looks more difficult to
achieve.

-- 
Álvaro Herrera         Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
>From dee38d19b281586a17788856cf169b8344c30da7 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Mon, 6 Nov 2023 18:27:42 +0100
Subject: [PATCH v2 1/3] Don't include parsenodes.h in tcopprot.h

---
 src/backend/utils/adt/windowfuncs.c | 1 +
 src/include/bootstrap/bootstrap.h   | 1 +
 src/include/tcop/tcopprot.h | 1 -
 3 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/src/backend/utils/adt/windowfuncs.c b/src/backend/utils/adt/windowfuncs.c
index b87a624fb2..cbdfba6994 100644
--- a/src/backend/utils/adt/windowfuncs.c
+++ b/src/backend/utils/adt/windowfuncs.c
@@ -14,6 +14,7 @@
 #include "postgres.h"
 
 #include "nodes/supportnodes.h"
+#include "nodes/parsenodes.h"
 #include "utils/builtins.h"
 #include "windowapi.h"
 
diff --git a/src/include/bootstrap/bootstrap.h b/src/include/bootstrap/bootstrap.h
index e1cb73c5f2..6a212122a3 100644
--- a/src/include/bootstrap/bootstrap.h
+++ b/src/include/bootstrap/bootstrap.h
@@ -15,6 +15,7 @@
 #define BOOTSTRAP_H
 
 #include "nodes/execnodes.h"
+#include "nodes/parsenodes.h"
 
 
 /*
diff --git a/src/include/tcop/tcopprot.h b/src/include/tcop/tcopprot.h
index 6c49db3bb7..b694d85974 100644
--- a/src/include/tcop/tcopprot.h
+++ b/src/include/tcop/tcopprot.h
@@ -15,7 +15,6 @@
 #define TCOPPROT_H
 
 #include "nodes/params.h"
-#include "nodes/parsenodes.h"
 #include "nodes/plannodes.h"
 #include "storage/procsignal.h"
 #include "utils/guc.h"
-- 
2.39.2

>From 870a853434b56d1fecd1a5fe85a56a2f60ff4ca9 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Mon, 6 Nov 2023 18:28:08 +0100
Subject: [PATCH v2 2/3] Remove parsenodes.h from plannodes.h

This mistake was made because MergeAction was added to parsenodes.h
instead of primnodes.h, where it should have been.  Put it there.
---
 src/include/nodes/parsenodes.h | 24 
 src/include/nodes/plannodes.h  |  1 -
 src/include/nodes/primnodes.h  | 27 +++
 3 files changed, 27 insertions(+), 25 deletions(-)

diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index cf7e79062e..e494309da8 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -30,13 +30,6 @@
 #include "partitioning/partdefs.h"
 
 
-typedef enum OverridingKind
-{
-	OVERRIDING_NOT_SET = 0,
-	OVERRIDING_USER_VALUE,
-	OVERRIDING_SYSTEM_VALUE,
-} OverridingKind;
-
 /* Possible sources of a Query */
 typedef enum QuerySource
 {
@@ -1681,23 +1674,6 @@ typedef struct MergeWhenClause
 	List	   *values;			/* VALUES to INSERT, or NULL */
 } MergeWhenClause;
 
-/*
- * MergeAction -
- *		Transformed representation of a WHEN clause in a MERGE statement
- */
-typedef struct MergeAction
-{
-	NodeTag		type;
-	bool		matched;		/* true=MATCHED, false=NOT MATCHED */
-	

Re: collect_corrupt_items_vacuum.patch

2023-11-07 Thread Alexander Lakhin

Hi Alexander,

06.11.2023 12:30, Alexander Korotkov wrote:

Surely these would significantly sacrifice accuracy. But we have to do
so in order to avoid reporting false errors.



I've reduced the dirty reproducer Daniel Shelepanov posted initially
to the following:
numdbs=10
for ((d=1;d<=$numdbs;d++)); do
  createdb db$d
  psql db$d -c "create extension pg_visibility"
done

for ((i=1;i<=300;i++)); do
echo "iteration $i"
for ((d=1;d<=$numdbs;d++)); do
(
echo "
create table vacuum_test as select 42 i;
vacuum (disable_page_skipping) vacuum_test;
select * from pg_check_visible('vacuum_test');
" | psql db$d -a -q >psql-$d.log 2>&1
) &
done
wait

res=0
for ((d=1;d<=$numdbs;d++)); do
grep -q '0 rows' psql-$d.log || { echo "Error condition in psql-$d.log:"; cat psql-$d.log; res=1; break; }
psql db$d -q -c "drop table vacuum_test"
done
[ $res == 0 ] || break;
done

It looks like the v2 patch doesn't fix the original issue. Maybe I'm missing
something, but with the patch applied, I see the failure immediately,
though without the patch several iterations are needed to get it.

Best regards,
Alexander




Re: Synchronizing slots from primary to standby

2023-11-07 Thread Amit Kapila
On Tue, Nov 7, 2023 at 3:51 PM Drouvot, Bertrand
 wrote:
>
> On 10/31/23 10:37 AM, shveta malik wrote:
> > On Fri, Oct 27, 2023 at 8:43 PM Drouvot, Bertrand
> >>
> >> Agree with your test case, but in my case I was not using pub/sub.
> >>
> >> I was not clear, so when I said:
> >>
>  - create logical_slot1 on the primary (and don't start using it)
> >>
> >> I meant don't start decoding from it (like using pg_recvlogical() or
> >> pg_logical_slot_get_changes()).
> >>
> >> By using pub/sub the "don't start using it" is not satisfied.
> >>
> >> My test case is:
> >>
> >> "
> >> SELECT * FROM pg_create_logical_replication_slot('logical_slot1', 
> >> 'test_decoding', false, true, true);
> >> SELECT * FROM pg_create_logical_replication_slot('logical_slot2', 
> >> 'test_decoding', false, true, true);
> >> pg_recvlogical -d postgres -S logical_slot2 --no-loop --start -f -
> >> "
> >>
> >
> > Okay, I am able to reproduce it now. Thanks for clarification. I have
> > tried to change the algorithm as per suggestion by Amit in [1]
> >
> > [1]: 
> > https://www.postgresql.org/message-id/CAA4eK1KBL0110gamQfc62X%3D5JV8-Qjd0dw0Mq0o07cq6kE%2Bq%3Dg%40mail.gmail.com
>
> Thanks!
>
> >
> > This is not a foolproof solution, but an optimization over the first one. Now
> > in any sync-cycle, we take 2 attempts for slots-creation (if any slots
> > are available to be created). In first attempt, we do not wait
> > indefinitely on inactive slots, we wait only for a fixed amount of
> > time and if remote-slot is still behind, then we add that to the
> > pending list and move to the next slot. Once we are done with first
> > attempt, in second attempt, we go for the pending ones and now we wait
> > on each of them until the primary catches up.
>
> Aren't we "just" postponing the "issue"? I mean if there is really no activity
> on, say, the first created slot, then once we move to the second attempt, any
> newly created slot from that time would wait to be synced forever, no?
>

We have to wait at some point in time for such inactive slots and the
same is true even for manually created slots on standby. Do you have
any better ideas to deal with it?

> Looking at V30:
>
> +   /* Update lsns of slot to remote slot's current position */
> +   local_slot_update(remote_slot);
> +   ReplicationSlotPersist();
> +
> +   ereport(LOG, errmsg("created slot \"%s\" locally", 
> remote_slot->name));
>
> I think this message is confusing as the slot has been created before it, 
> here:
>
> +   else
> +   {
> +   TransactionId xmin_horizon = InvalidTransactionId;
> +   ReplicationSlot *slot;
> +
> +   ReplicationSlotCreate(remote_slot->name, true, RS_EPHEMERAL,
> + remote_slot->two_phase, false);
>
> So that it shows up in pg_replication_slots before this message is emitted 
> (and that
> specially true/worst for non active slots).
>
> Maybe something like "newly locally created slot XXX has been synced..."?
>
> While at it, would that make sense to move
>
> +   slot->data.failover = true;
>
> once we stop waiting for this slot? I think that would avoid confusion if one
> query pg_replication_slots while we are still waiting for this slot to be 
> synced,
> thoughts? (currently we can see pg_replication_slots.synced_slot set to true
> while we are still waiting).
>

The failover property of the slot is different from whether the slot
has been synced yet, so we can't change the location of marking it but
we can try to improve when to show that slot has been synced.

-- 
With Regards,
Amit Kapila.




Re: A recent message added to pg_upgade

2023-11-07 Thread Amit Kapila
On Tue, Nov 7, 2023 at 8:12 AM Michael Paquier  wrote:
>
> On Tue, Nov 07, 2023 at 07:59:46AM +0530, Amit Kapila wrote:
> > Do you mean to say that if 'IsBinaryUpgrade' is true then let's not
> > allow to launch launcher or apply worker? If so, I guess this won't be
> > any better than prohibiting at an early stage or explicitly overriding
> > those with internal values and documenting it, at least that way we
> > can be consistent for both variables (max_logical_replication_workers
> > and max_slot_wal_keep_size).
>
> Yes, I mean to paint an extra IsBinaryUpgrade before registering the
> apply worker launcher.  That would be consistent with what we do for
> autovacuum in the postmaster.
>

But then we don't need the hardcoded value of
max_logical_replication_workers as zero by pg_upgrade. I think doing
IsBinaryUpgrade for slots won't be neat, so we anyway need to keep
using the special value of max_slot_wal_keep_size GUC. Though the
handling for both won't be the same but I guess given the situation,
that seems like a reasonable thing to do. If we follow that then we
can have this special GUC hook only for max_slot_wal_keep_size GUC.

-- 
With Regards,
Amit Kapila.




Re: Synchronizing slots from primary to standby

2023-11-07 Thread Drouvot, Bertrand

Hi,

On 10/31/23 10:37 AM, shveta malik wrote:

On Fri, Oct 27, 2023 at 8:43 PM Drouvot, Bertrand


Agree with your test case, but in my case I was not using pub/sub.

I was not clear, so when I said:


- create logical_slot1 on the primary (and don't start using it)


I meant don't start decoding from it (like using pg_recvlogical() or
pg_logical_slot_get_changes()).

By using pub/sub the "don't start using it" is not satisfied.

My test case is:

"
SELECT * FROM pg_create_logical_replication_slot('logical_slot1', 
'test_decoding', false, true, true);
SELECT * FROM pg_create_logical_replication_slot('logical_slot2', 
'test_decoding', false, true, true);
pg_recvlogical -d postgres -S logical_slot2 --no-loop --start -f -
"



Okay, I am able to reproduce it now. Thanks for clarification. I have
tried to change the algorithm as per suggestion by Amit in [1]

[1]: 
https://www.postgresql.org/message-id/CAA4eK1KBL0110gamQfc62X%3D5JV8-Qjd0dw0Mq0o07cq6kE%2Bq%3Dg%40mail.gmail.com


Thanks!



This is not a foolproof solution, but an optimization over the first one. Now
in any sync-cycle, we take 2 attempts for slots-creation (if any slots
are available to be created). In first attempt, we do not wait
indefinitely on inactive slots, we wait only for a fixed amount of
time and if remote-slot is still behind, then we add that to the
pending list and move to the next slot. Once we are done with first
attempt, in second attempt, we go for the pending ones and now we wait
on each of them until the primary catches up.


Aren't we "just" postponing the "issue"? I mean if there is really no activity
on, say, the first created slot, then once we move to the second attempt, any
newly created slot from that time would wait to be synced forever, no?

Looking at V30:

+   /* Update lsns of slot to remote slot's current position */
+   local_slot_update(remote_slot);
+   ReplicationSlotPersist();
+
+   ereport(LOG, errmsg("created slot \"%s\" locally", remote_slot->name));

I think this message is confusing as the slot has been created before it, here:

+   else
+   {
+   TransactionId xmin_horizon = InvalidTransactionId;
+   ReplicationSlot *slot;
+
+   ReplicationSlotCreate(remote_slot->name, true, RS_EPHEMERAL,
+ remote_slot->two_phase, false);

So that it shows up in pg_replication_slots before this message is emitted (and 
that
specially true/worst for non active slots).

Maybe something like "newly locally created slot XXX has been synced..."?

While at it, would that make sense to move

+   slot->data.failover = true;

once we stop waiting for this slot? I think that would avoid confusion if one
query pg_replication_slots while we are still waiting for this slot to be 
synced,
thoughts? (currently we can see pg_replication_slots.synced_slot set to true
while we are still waiting).

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: RFC: Pluggable TOAST

2023-11-07 Thread Nikita Malakhov
Hi,

I've been thinking about Matthias' proposals for some time and have some
questions:

>So, in short, I don't think there is a need for a specific "Pluggable
>toast API" like the one in the patchset at [0] that can be loaded
>on-demand, but I think that updating our current TOAST system to a
>system for which types can provide support functions would likely be
>quite beneficial, for efficient extraction of data from composite
>values.

As I understand it, one of the reasons against Pluggable TOAST is that
differences in plugged-in Toasters could result in incompatibility even
between different versions of the same DB.

The importance of the correct TOAST update is out of the question; I feel I
have to prepare a patch for it. There are some questions though; I'd address
them later with a patch.

>Example support functions:

>/* TODO: bikeshedding on names, signatures, further support functions. */
>Datum typsup_roastsliceofbread(Datum ptr, int sizetarget, char cmethod)
>Datum typsup_unroastsliceofbread(Datum ptr)
>void typsup_releaseroastedsliceofbread(Datump ptr) /* in case of
>non-unitary in-memory datums */

Do I understand correctly that you mean extending PG_TYPE and the type cache
by adding a new function set for toasting/detoasting a value, in addition to
in/out, etc.?

I see several issues here:
1) We could benefit from knowledge of internals of data being toasted (i.e.
in case of JSON value with key-value structure) only when EXTERNAL
storage mode is set; otherwise the value will be compressed before being toasted.
So we have to keep both TOAST mechanics regarding the storage mode
being used. It's the same issue as in Pluggable TOAST. Is it OK?

2) TOAST pointer is very limited in means of data it keeps, we'd have to
extend it anyway and keep both for backwards compatibility;

3) There is no API and such an approach would require implementing
toast and detoast in every data type we want to be custom toasted, resulting
in modifications to multiple files. Maybe we have to consider introducing such
an API?

4) 1 toast relation per regular relation. With an update mechanics this will
be less limiting, but still a limiting factor because 1 entry in base table
could have a lot of entries in the toast table. Are we doing something with
this?

>We would probably want at least 2 more subtypes of varattrib_1b_e -
>one for on-disk pointers, and one for in-memory pointers - where the
>payload of those pointers is managed by the type's toast mechanism and
>considered opaque to the rest of PostgreSQL (and thus not compatible
>with the binary transfer protocol). Types are currently already
>expected to be able to handle their own binary representation, so
>allowing types to manage parts of the toast representation should IMHO
>not be too dangerous, though we should make sure that BINARY COERCIBLE
>types share this toast support routine, or be returned to their
>canonical binary version before they are cast to the coerced type, as
>using different detoasting mechanisms could result in corrupted data
>and thus crashes.

>Lastly, there is the compression part of TOAST. I think it should be
>relatively straightforward to expose the compression-related
>components of TOAST through functions that can then be used by
>type-specific toast support functions.
>Note that this would be opt-in for a type, thus all functions that use
>that type's internals should be aware of the different on-disk format
>for toasted values and should thus be able to handle it gracefully.

Thanks a lot for answers!

-- 
Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company
https://postgrespro.ru/

