Re: pg_walfile_name_offset can return inconsistent values

2023-11-10 Thread Andres Freund
Hi,

On 2023-11-09 16:14:07 -0500, Bruce Momjian wrote:
> On Thu, Nov  9, 2023 at 09:49:48PM +0100, Matthias van de Meent wrote:
> > Either way, I think fix #1 is most correct (as was attached in
> > offset2.diff, and quoted verbatim here), because that has no chance of
> > having surprising underflowing behaviour when you use '0/0'::lsn as
> > input.
> 
> Attached is the full patch that changes pg_walfile_name_offset() and
> pg_walfile_name().  There is no need for doc changes.

I think this needs to add tests "documenting" the changed behaviour and
perhaps also for a few other edge cases.  You could e.g. test
  SELECT * FROM pg_walfile_name_offset('0/0')
which today returns bogus values, and which is independent of the wal segment
size.

And with
SELECT setting::int8 AS segment_size FROM pg_settings WHERE name = 'wal_segment_size' \gset
you can test real things too, e.g.:
SELECT segment_number, file_offset FROM pg_walfile_name_offset('0/0'::pg_lsn + :segment_size), pg_split_walfile_name(file_name);
SELECT segment_number, file_offset FROM pg_walfile_name_offset('0/0'::pg_lsn + :segment_size + 1), pg_split_walfile_name(file_name);
SELECT segment_number, file_offset = :segment_size - 1 FROM pg_walfile_name_offset('0/0'::pg_lsn + :segment_size - 1), pg_split_walfile_name(file_name);

Greetings,

Andres Freund




Re: remaining sql/json patches

2023-11-10 Thread Amit Langote
Hi Erik,

On Sat, Nov 11, 2023 at 11:52 Erik Rijkers  wrote:

> Hi,
>
> At the moment, what is the patchset to be tested?  The latest SQL/JSON
> server I have is from September, and it's become unclear to me what
> belongs to the SQL/JSON patchset.  It seems to me cfbot erroneously
> shows green because it successfully compiles later detail-patches (i.e.,
> not the SQL/JSON set itself). Please correct me if I'm wrong and it is
> in fact possible to derive from cfbot the patchset to use to build the
> latest SQL/JSON server.


I’ll be posting a new set that addresses Andres’ comments early next week.



Re: locked reads for atomics

2023-11-10 Thread Nathan Bossart
On Fri, Nov 10, 2023 at 06:48:39PM -0800, Andres Freund wrote:
> Yes. We should optimize pg_atomic_exchange_u32() one of these days - it can be
> done *far* faster than a cmpxchg. When I was adding the atomic abstraction
> there was concern with utilizing too many different atomic instructions. I
> didn't really agree back then, but these days I really don't see a reason to
> not use a few more intrinsics.

I might give this a try, if for no other reason than it'd force me to
improve my mental model of this stuff.  :)

>> It refers to "unlocked writes," but this isn't
>> pg_atomic_unlocked_write_u32_impl().  The original thread for this comment
>> [0] doesn't offer any hints, either.  Does "unlocked" mean something
>> different here, such as "write without any barrier semantics?"
> 
> It's just about not using the spinlock. If we were to *not* use a spinlock
> here, we'd break pg_atomic_compare_exchange_u32(), because the
> spinlock-implementation of pg_atomic_compare_exchange_u32() needs to actually
> be able to rely on no concurrent changes to the value to happen.

Thanks for clarifying.  I thought it might've been hinting at something
beyond the compare/exchange implications.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: remaining sql/json patches

2023-11-10 Thread Erik Rijkers

Hi,

At the moment, what is the patchset to be tested?  The latest SQL/JSON 
server I have is from September, and it's become unclear to me what 
belongs to the SQL/JSON patchset.  It seems to me cfbot erroneously 
shows green because it successfully compiles later detail-patches (i.e., 
not the SQL/JSON set itself). Please correct me if I'm wrong and it is
in fact possible to derive from cfbot the patchset to use to build the
latest SQL/JSON server.


Thanks!

Erik




Re: locked reads for atomics

2023-11-10 Thread Andres Freund
Hi,

On 2023-11-10 20:38:13 -0600, Nathan Bossart wrote:
> On Fri, Nov 10, 2023 at 03:11:50PM -0800, Andres Freund wrote:
> > On 2023-11-10 14:51:28 -0600, Nathan Bossart wrote:
> >> + * This read is guaranteed to read the current value,
> > 
> > It doesn't guarantee that *at all*. What it guarantees is solely that the
> > current CPU won't be doing something that could lead to reading an outdated
> > value. To actually ensure the value is up2date, the modifying side also 
> > needs
> > to have used a form of barrier (in the form of fetch_add, compare_exchange,
> > etc or an explicit barrier).
> 
> Okay, I think I was missing that this doesn't work along with
> pg_atomic_write_u32() because that doesn't have any barrier semantics
> (probably because the spinlock version does).  IIUC you'd want to use
> pg_atomic_exchange_u32() to write the value instead, which seems to really
> just be another compare/exchange under the hood.

Yes. We should optimize pg_atomic_exchange_u32() one of these days - it can be
done *far* faster than a cmpxchg. When I was adding the atomic abstraction
there was concern with utilizing too many different atomic instructions. I
didn't really agree back then, but these days I really don't see a reason to
not use a few more intrinsics.


> Speaking of the spinlock implementation of pg_atomic_write_u32(), I've been
> staring at this comment for a while and can't make sense of it:
> 
>   void
>   pg_atomic_write_u32_impl(volatile pg_atomic_uint32 *ptr, uint32 val)
>   {
>   	/*
>   	 * One might think that an unlocked write doesn't need to acquire the
>   	 * spinlock, but one would be wrong. Even an unlocked write has to cause a
>   	 * concurrent pg_atomic_compare_exchange_u32() (et al) to fail.
>   	 */
>   	SpinLockAcquire((slock_t *) &ptr->sema);
>   	ptr->value = val;
>   	SpinLockRelease((slock_t *) &ptr->sema);
>   }
> 
> It refers to "unlocked writes," but this isn't
> pg_atomic_unlocked_write_u32_impl().  The original thread for this comment
> [0] doesn't offer any hints, either.  Does "unlocked" mean something
> different here, such as "write without any barrier semantics?"

It's just about not using the spinlock. If we were to *not* use a spinlock
here, we'd break pg_atomic_compare_exchange_u32(), because the
spinlock-implementation of pg_atomic_compare_exchange_u32() needs to actually
be able to rely on no concurrent changes to the value to happen.
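
To make that concrete, the spinlock-based fallback for
pg_atomic_compare_exchange_u32() has to look roughly like this (a simplified
sketch from memory, not the exact tree code):

static bool
pg_atomic_compare_exchange_u32_fallback(volatile pg_atomic_uint32 *ptr,
										uint32 *expected, uint32 newval)
{
	bool		ret;

	/*
	 * The compare and the conditional store below are atomic as a unit only
	 * because every modification of ptr->value takes the same spinlock.  A
	 * plain write that skipped the spinlock could land between the
	 * comparison and the store, and be silently overwritten.
	 */
	SpinLockAcquire((slock_t *) &ptr->sema);
	ret = ptr->value == *expected;
	*expected = ptr->value;
	if (ret)
		ptr->value = newval;
	SpinLockRelease((slock_t *) &ptr->sema);

	return ret;
}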

Greetings,

Andres Freund




Re: locked reads for atomics

2023-11-10 Thread Nathan Bossart
On Fri, Nov 10, 2023 at 03:11:50PM -0800, Andres Freund wrote:
> On 2023-11-10 14:51:28 -0600, Nathan Bossart wrote:
>> + * This read is guaranteed to read the current value,
> 
> It doesn't guarantee that *at all*. What it guarantees is solely that the
> current CPU won't be doing something that could lead to reading an outdated
> value. To actually ensure the value is up2date, the modifying side also needs
> to have used a form of barrier (in the form of fetch_add, compare_exchange,
> etc or an explicit barrier).

Okay, I think I was missing that this doesn't work along with
pg_atomic_write_u32() because that doesn't have any barrier semantics
(probably because the spinlock version does).  IIUC you'd want to use
pg_atomic_exchange_u32() to write the value instead, which seems to really
just be another compare/exchange under the hood.

Speaking of the spinlock implementation of pg_atomic_write_u32(), I've been
staring at this comment for a while and can't make sense of it:

void
pg_atomic_write_u32_impl(volatile pg_atomic_uint32 *ptr, uint32 val)
{
	/*
	 * One might think that an unlocked write doesn't need to acquire the
	 * spinlock, but one would be wrong. Even an unlocked write has to cause a
	 * concurrent pg_atomic_compare_exchange_u32() (et al) to fail.
	 */
	SpinLockAcquire((slock_t *) &ptr->sema);
	ptr->value = val;
	SpinLockRelease((slock_t *) &ptr->sema);
}

It refers to "unlocked writes," but this isn't
pg_atomic_unlocked_write_u32_impl().  The original thread for this comment
[0] doesn't offer any hints, either.  Does "unlocked" mean something
different here, such as "write without any barrier semantics?"

[0] https://postgr.es/m/14947.1475690465%40sss.pgh.pa.us

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Adding facility for injection points (or probe points?) for more advanced tests

2023-11-10 Thread Andres Freund
Hi,

On 2023-10-25 13:13:38 +0900, Michael Paquier wrote:
> So, please find attached a patch set that introduces an in-core
> facility to be able to set what I'm calling here an "injection point",
> that consists of being able to register in shared memory a callback
> that can be run within a defined location of the code.  It means that
> it is not possible to trigger a callback before shared memory is set,
> but I've faced far more the case where I wanted to trigger something
> after shmem is set anyway.  Persisting an injection point across
> restarts is also possible by adding some through an extension's shmem
> hook, as we do for custom LWLocks for example, as long as a library is
> loaded.

I would like to see a few example tests using this facility - without that
it's a bit hard to judge what the impact on core code would be and how easy
tests are to write.

It also seems like there's a few bits and pieces missing to actually be able
to write interesting tests. It's one thing to be able to inject code, but what
you commonly want to do for tests is to actually wait for such a spot in the
code to be reached, then perhaps wait inside the "modified" code, and do
something else in the test script. But as-is a decent amount of C code would
need to be written to write such a test, from what I can tell?

Greetings,

Andres Freund




Re: Why do indexes and sorts use the database collation?

2023-11-10 Thread Andres Freund
Hi,

On 2023-11-10 16:03:16 -0800, Jeff Davis wrote:
> An "en_US" user doing:
>
>CREATE TABLE foo(t TEXT PRIMARY KEY);
>
> is providing no indication that they want an index tailored to their
> locale. Yet we are creating the index with the "en_US" collation and
> therefore imposing huge performance costs (something like 2X slower
> index build time than the "C" locale), and also huge dependency
> versioning risks that could lead to index corruption and/or wrong
> results.

I guess you are arguing that the user didn't intend to create an index here? I
don't think that is true - users know that pkeys create indexes. If we used C
here, users would often need to create a second index on the same column using
the actual database collation - I think we'd very commonly end up with
complaints that the pkey index doesn't work, because it's been automatically
created with a different collation than the column.

Also, wouldn't the intent to use a different collation for the column be
expressed by changing the column's collation?


> Similarly, a user doing:
>
>SELECT DISTINCT t FROM bar;
>
> is providing no indication that they care about the collation of "t"
> (we are free to choose a HashAgg which makes no ordering guarantee at
> all). Yet if we choose Sort+GroupAgg, the Sort will be performed in the
> "en_US" locale, which is something like 2X slower than the "C" locale.

OTOH, if we are choosing a groupagg, we might be able to implement that using
an index, which is more likely to exist in the database's collation.  Looks like
we even just look for indexes that are in the database collation.

Might be worth teaching the planner additional smarts here.


> Thoughts?

I seriously doubt it's a good idea to change which collations primary keys use
by default.  But I think there's a decent bit of work we could do in the
planner, e.g:

- Teach the planner to take collation costs into account for costing - right
  now index scans with "C" cost the same as index scans with more expensive
  collations. That seems wrong even for equality lookups and would make it
  hard to make improvements to prefer cheaper collations in other situations.

- Teach the planner to use cheaper collations when ordering for reasons other
  than the user's direct request (e.g. DISTINCT/GROUP BY, merge joins).
  

I think we should also explain in our docs that C can be considerably faster -
I couldn't find anything in a quick look.

Greetings,

Andres Freund




Re: maybe a type_sanity.sql bug

2023-11-10 Thread Michael Paquier
On Sat, Nov 11, 2023 at 08:00:00AM +0800, jian he wrote:
> I am not sure the pg_class "relam" description part is correct. since
> partitioned indexes (relkind "I") also have the access method, but no
> storage.
> "
> If this is a table or an index, the access method used (heap, B-tree,
> hash, etc.); otherwise zero (zero occurs for sequences, as well as
> relations without storage, such as views)
> "

This should be adjusted as well in the docs, IMO.  I would propose
something slightly more complicated:
"
If this is a table, index, materialized view or partitioned index, the
access method used (heap, B-tree, hash, etc.); otherwise zero (zero
occurs for sequences, as well as relations without storage, like
views).
"

> diff --git a/src/test/regress/sql/type_sanity.sql b/src/test/regress/sql/type_sanity.sql
> index a546ba89..6d806941 100644
> --- a/src/test/regress/sql/type_sanity.sql
> +++ b/src/test/regress/sql/type_sanity.sql

Ahah, nice catches.  I'll go adjust that on HEAD like the other one
you pointed out.  Just note that materialized views have a relam
defined, so the first comment you have changed is not completely
correct.
--
Michael




Re: A recent message added to pg_upgade

2023-11-10 Thread Michael Paquier
On Fri, Nov 10, 2023 at 03:27:25PM +0530, Amit Kapila wrote:
> I don't think this comment is correct because there won't be any apply
> activity on the new cluster as after restoration subscriptions should
> be disabled. On the old cluster, I think one problem is that the
> origins may move forward after we copy them which can cause data
> inconsistency issues. The other is that we may not prefer to generate
> additional data and WAL during the upgrade. Also, I am not completely
> sure about using the word 'corruption' in this context.

What is your suggestion here?  Would it be better to just aim for
simplicity and just say that we don't want it to run because "it can
lead to some inconsistent behaviors"?
--
Michael




Why do indexes and sorts use the database collation?

2023-11-10 Thread Jeff Davis
An "en_US" user doing:

   CREATE TABLE foo(t TEXT PRIMARY KEY);

is providing no indication that they want an index tailored to their
locale. Yet we are creating the index with the "en_US" collation and
therefore imposing huge performance costs (something like 2X slower
index build time than the "C" locale), and also huge dependency
versioning risks that could lead to index corruption and/or wrong
results.

Similarly, a user doing:

   SELECT DISTINCT t FROM bar;

is providing no indication that they care about the collation of "t"
(we are free to choose a HashAgg which makes no ordering guarantee at
all). Yet if we choose Sort+GroupAgg, the Sort will be performed in the
"en_US" locale, which is something like 2X slower than the "C" locale.

One of the strongest arguments for using a non-C collation in these
cases is the chance to use a non-deterministic collation, like a case-
insensitive one. But the database collation is always deterministic,
and all deterministic collations have exactly the same definition of
equality, so there's no reason not to use "C".

Another argument is that, if the column is the database collation and
the index is "C", then the index is unusable for text range scans, etc.
But there are two ways to solve that problem:

  1. Set the column collation to "C"; or
  2. Set the index collation to the database collation.
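
In hypothetical DDL (assuming an "en_US" database), those two options would
look like:

  -- 1. opt the column itself into "C"; the pkey index follows the column
  CREATE TABLE foo(t TEXT COLLATE "C" PRIMARY KEY);

  -- 2. keep the column in the database collation, and create an explicit
  --    index in that collation where range scans need it
  CREATE INDEX foo_t_en_us_idx ON foo (t COLLATE "en_US");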

Range scans are often most useful when the text is not actually natural
language, but instead is some kind of formatted text representing
another type of thing, often in ASCII. In that case, the range scan is
really some kind of prefix search or partitioning, and the "C" locale
is probably the right thing to use, and #1 wins.

Granted, there are reasons to want an index to have a particular
collation, in which case it makes sense to opt-in to #2. But in the
common case, the high performance costs and dependency versioning risks
aren't worth it.

Thoughts?

Regards,
Jeff Davis





Re: maybe a type_sanity.sql bug

2023-11-10 Thread jian he
looking around.
I found other three minor issues. attached.

I am not sure the pg_class "relam" description part is correct. since
partitioned indexes (relkind "I") also have the access method, but no
storage.
"
If this is a table or an index, the access method used (heap, B-tree,
hash, etc.); otherwise zero (zero occurs for sequences, as well as
relations without storage, such as views)
"
diff --git a/src/test/regress/sql/type_sanity.sql b/src/test/regress/sql/type_sanity.sql
index a546ba89..6d806941 100644
--- a/src/test/regress/sql/type_sanity.sql
+++ b/src/test/regress/sql/type_sanity.sql
@@ -364,22 +364,22 @@ WHERE relkind NOT IN ('r', 'i', 'S', 't', 'v', 'm', 'c', 'f', 'p', 'I') OR
 relpersistence NOT IN ('p', 'u', 't') OR
 relreplident NOT IN ('d', 'n', 'f', 'i');
 
--- All tables and indexes should have an access method.
+-- All tables and indexes, partitioned indexes should have an access method.
 SELECT c1.oid, c1.relname
 FROM pg_class as c1
-WHERE c1.relkind NOT IN ('S', 'v', 'f', 'c') and
+WHERE c1.relkind NOT IN ('S', 'v', 'f', 'c','p') and
 c1.relam = 0;
 
--- Conversely, sequences, views, types shouldn't have them
+-- Conversely, sequences, views, types, partitioned table shouldn't have them
 SELECT c1.oid, c1.relname
 FROM pg_class as c1
-WHERE c1.relkind IN ('S', 'v', 'f', 'c') and
+WHERE c1.relkind IN ('S', 'v', 'f', 'c','p') and
 c1.relam != 0;
 
--- Indexes should have AMs of type 'i'
+-- Indexes, partitioned indexes should have AMs of type 'i'
 SELECT pc.oid, pc.relname, pa.amname, pa.amtype
 FROM pg_class as pc JOIN pg_am AS pa ON (pc.relam = pa.oid)
-WHERE pc.relkind IN ('i') and
+WHERE pc.relkind IN ('i','I') and
 pa.amtype != 'i';
 
 -- Tables, matviews etc should have AMs of type 't'


Re: Force the old transactions logs cleanup even if checkpoint is skipped

2023-11-10 Thread Andres Freund
Hi,

On 2023-11-09 11:50:10 +, Zakhlystov, Daniil (Nebius) wrote:
> > On 9 Nov 2023, at 01:30, Michael Paquier  wrote:
> >
> > I am not really convinced that this is worth complicating the skipped
> > path for this goal.  In my experience, I've seen complaints where WAL
> > archiving bloat was coming from the archive command not able to keep
> > up with the amount generated by the backend, particularly because the
> > command invocation was taking longer than it takes to generate a new
> > segment.  Even if there is a hole of activity in the server, if too
> > much WAL has been generated it may not be enough to catch up depending
> > on the number of segments that need to be processed.  Others are free
> > to chime in with extra opinions, of course.
>
> I agree that there might be multiple reasons for pg_wal bloat. Please note that
> I am not addressing the WAL archiving issue at all. My proposal is to add a
> small improvement to the WAL cleanup routine for WALs that have been already
> archived successfully to free the disk space.
>
> Yes, it might not be a common case, but a fairly realistic one. It occurred
> multiple times in our production when we had temporary issues with archiving.
> This small complication of the skipped path will help Postgres to return to a
> normal operational state without any human operator / external control
> routine intervention.

I agree that the scenario is worth addressing - it's quite a nasty situation.

But I'm not sure this is the way to address it. If a checkpoint does have to
happen, we might not get to the point of removing the old segments, because we
might fail to emit the WAL record due to running out of space. And if that
doesn't happen - do we really want to wait till a checkpoint finishes to free
up space?

What if we instead made archiver delete WAL files after archiving, if they're
old enough?  Some care would be needed to avoid checkpointer and archiver
trampling on each other, but that doesn't seem too hard.

Greetings,

Andres Freund




Re: ResourceOwner refactoring

2023-11-10 Thread Andres Freund
Hi,

On 2023-11-10 16:26:51 +0200, Heikki Linnakangas wrote:
> The quick, straightforward fix is to move the "CurrentResourceOwner = NULL"
> line earlier in CommitTransaction, per attached
> 0003-Clear-CurrentResourceOwner-earlier-in-CommitTransact.patch. You're not
> allowed to use the resource owner after you start to release it; it was a
> bit iffy even before the ResourceOwner rewrite but now it's explicitly
> forbidden. By clearing CurrentResourceOwner as soon as we start releasing
> it, we can prevent any accidental use.
> 
> When CurrentResourceOwner == NULL, dsm_attach() returns a handle that's not
> associated with any ResourceOwner. That's appropriate for the pgstat case.
> The DSA is "pinned", so the handle is forgotten from the ResourceOwner right
> after calling dsm_attach() anyway.

Yea - I wonder if we should have a version of dsm_attach() that pins at the
same time. It's pretty silly to first attach (remembering the dsm) and then
immediately pin (forgetting the dsm).
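
Something like this hypothetical wrapper (no such function exists today, the
name is invented):

static dsm_segment *
dsm_attach_pinned(dsm_handle h)
{
	dsm_segment *seg = dsm_attach(h);

	/*
	 * Keep the mapping for the backend's lifetime and drop it from the
	 * current resource owner, in one step.
	 */
	if (seg != NULL)
		dsm_pin_mapping(seg);
	return seg;
}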


> Looking closer at dsa.c, I think this is a wider problem though. The
> comments don't make it very clear how it's supposed to interact with
> ResourceOwners. There's just this brief comment in dsa_pin_mapping():
> 
> >  * By default, areas are owned by the current resource owner, which means 
> > they
> >  * are detached automatically when that scope ends.
> 
> The dsa_area struct isn't directly owned by any ResourceOwner though. The
> DSM segments created by dsa_create() or dsa_attach() are.

But there's a relationship from the dsa too, via on_dsm_detach().


> But the functions dsa_allocate() and dsa_get_address() can create or attach
> more DSM segments to the area, and they will be owned by the by the current
> resource owner *at the time of the call*.

Yea, that doesn't seem great. It shouldn't affect the stats code though, due
to the dsa being pinned.

> I think that is surprising behavior from the DSA facility. When you make
> allocations with dsa_allocate() or just call dsa_get_address() on an
> existing dsa_pointer, you wouldn't expect the current resource owner to
> matter. I think dsa_create/attach() should store the current resource owner
> in the dsa_area, for use in subsequent operations on the DSA, per attached
> patch (0002-Fix-dsa.c-with-different-resource-owners.patch).

Yea, that seems the right direction from here.

Greetings,

Andres Freund




Re: locked reads for atomics

2023-11-10 Thread Andres Freund
Hi,

On 2023-11-10 21:49:06 +, John Morris wrote:
> >I wonder if it's worth providing a set of "locked read" functions.
> 
> Most out-of-order machines include “read acquire” and “write release” which
> are pretty close to what you’re suggesting.

Is that really true? It's IA64 lingo. X86 doesn't have them, and while arm has
more granular barriers, they don't neatly map onto acquire/release either.

I don't think an acquire here would actually be equivalent to making this a
full barrier - an acquire barrier allows moving reads or stores from *before*
the barrier to be moved after the barrier. It just prevents the opposite.


And for proper use of acquire/release semantics we'd need to pair operations
much more closely. Right now we often rely on another preceding memory barrier
to ensure correct ordering, having to use paired operations everywhere would
lead to slower code.


I thoroughly dislike how strongly C++11/C11 prefer paired atomics *on the same
address* over "global" fences. It often leads to substantially slower
code. And they don't at all map neatly on hardware, where largely barrier
semantics are *not* tied to individual addresses. And the fence specification
is just about unreadable (although I think they did fix some of the worst
issues).

Greetings,

Andres Freund




Re: locked reads for atomics

2023-11-10 Thread Andres Freund
Hi,

On 2023-11-10 14:51:28 -0600, Nathan Bossart wrote:
> Moving this to a new thread and adding it to the January commitfest.
> 
> On Thu, Nov 09, 2023 at 03:27:33PM -0600, Nathan Bossart wrote:
> > On Tue, Nov 07, 2023 at 04:58:16PM -0800, Andres Freund wrote:
> >> However, even if there's likely some other implied memory barrier that we
> >> could piggyback on, the patch is much simpler to understand if it doesn't
> >> change coherency rules. There's no way the overhead could matter.
> > 
> > I wonder if it's worth providing a set of "locked read" functions.  Those
> > could just do a compare/exchange with 0 in the generic implementation.  For
> > patches like this one where the overhead really shouldn't matter, I'd
> > encourage folks to use those to make it easy to reason about correctness.
> 
> Concretely, like this.

I don't think "locked" is a good name - there's no locking. I think that'll
deter their use, because it will make it sound more expensive than they are.

pg_atomic_read_membarrier_u32()?



> @@ -228,7 +228,8 @@ pg_atomic_init_u32(volatile pg_atomic_uint32 *ptr, uint32 val)
>   * The read is guaranteed to return a value as it has been written by this or
>   * another process at some point in the past. There's however no cache
>   * coherency interaction guaranteeing the value hasn't since been written to
> - * again.
> + * again.  Consider using pg_atomic_locked_read_u32() unless you have a strong
> + * reason (e.g., performance) to use unlocked reads.

I think that's too strong an endorsement. Often there will be no difference in
difficulty analysing correctness, because the barrier pairing / the
interaction with the surrounding code needs to be analyzed just as much.


>   * No barrier semantics.
>   */
> @@ -239,6 +240,24 @@ pg_atomic_read_u32(volatile pg_atomic_uint32 *ptr)
>   return pg_atomic_read_u32_impl(ptr);
>  }
>  
> +/*
> + * pg_atomic_read_u32 - locked read from atomic variable.

Un-updated name...


> + * This read is guaranteed to read the current value,

It doesn't guarantee that *at all*. What it guarantees is solely that the
current CPU won't be doing something that could lead to reading an outdated
value. To actually ensure the value is up2date, the modifying side also needs
to have used a form of barrier (in the form of fetch_add, compare_exchange,
etc or an explicit barrier).


> +#ifndef PG_HAVE_ATOMIC_LOCKED_READ_U32
> +#define PG_HAVE_ATOMIC_LOCKED_READ_U32
> +static inline uint32
> +pg_atomic_locked_read_u32_impl(volatile pg_atomic_uint32 *ptr)
> +{
> + uint32 old = 0;
> +
> + /*
> +  * In the generic implementation, locked reads are implemented as a
> +   * compare/exchange with 0.  That'll fail or succeed, but always return the
> +  * most up-to-date value.  It might also store a 0, but only if the
> +  * previous value was also a zero, i.e., harmless.
> +  */
> + pg_atomic_compare_exchange_u32_impl(ptr, &old, 0);
> +
> + return old;
> +}
> +#endif

I suspect implementing it with an atomic fetch_add of 0 would be faster...
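
E.g., a sketch of the generic fallback using fetch_add, assuming we went that
way:

static inline uint32
pg_atomic_locked_read_u32_impl(volatile pg_atomic_uint32 *ptr)
{
	/*
	 * Adding 0 leaves the value unchanged, but as a full-barrier atomic
	 * read-modify-write it still returns an up-to-date value - and on
	 * e.g. x86 it can compile to LOCK XADD, avoiding cmpxchg's compare step.
	 */
	return pg_atomic_fetch_add_u32_impl(ptr, 0);
}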

Greetings,

Andres Freund




Re: locked reads for atomics

2023-11-10 Thread Nathan Bossart
On Fri, Nov 10, 2023 at 09:49:06PM +, John Morris wrote:
> Most out-of-order machines include “read acquire” and “write release”
> which are pretty close to what you’re suggesting. With the current
> routines, we only have “read relaxed” and  “write relaxed”. I think
> implementing acquire/release semantics is a very good idea.

We do have both pg_atomic_write_u32() and pg_atomic_unlocked_write_u32()
(see commit b0779ab), but AFAICT those only differ in the fallback/spinlock
implementations.  I suppose there could be an unlocked 64-bit write on
platforms that have 8-byte single-copy atomicity but still need to use the
fallback/spinlock implementation for some reason, but that might be a bit
of a stretch, and the use-cases might be few and far between...

> I would also like to clarify the properties of atomics. One very
> important question: Are atomics also volatile?

The PostgreSQL atomics support appears to ensure they are volatile.
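For example, the per-platform type definitions declare the payload volatile;
from memory the definitions look roughly like:

typedef struct pg_atomic_uint32
{
	volatile uint32 value;
} pg_atomic_uint32;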

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Failure during Building Postgres in Windows with Meson

2023-11-10 Thread Andres Freund
Hi, 

On November 10, 2023 10:53:12 AM PST, Tristan Partin  wrote:
>On Thu Nov 9, 2023 at 9:31 AM CST, Nazir Bilal Yavuz wrote:
>> Hi,
>> 
>> On Thu, 9 Nov 2023 at 18:27, Tristan Partin  wrote:
>> >
>> > Can you try with Meson v1.2.3?
>> 
>> I tried with Meson v1.2.3 and upstream, both failed with the same error.

>An employee at Collabora produced a fix[0].

If I understand correctly, you can thus work around the problem by enabling use
of precompiled headers, which also explains why CI didn't show this - normally
on Windows you want to use pch.


> It might still be worthwhile however to see why it happens in one shell and 
> not the other.

It's gcc vs msvc due to path.

Andres


-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.




Re: pgsql: Don't trust unvalidated xl_tot_len.

2023-11-10 Thread Thomas Munro
On Sat, Nov 11, 2023 at 10:42 AM Christoph Berg  wrote:
> > I haven't investigated the details yet, and it's not affecting the
> > builds on apt.postgresql.org, but the Debian amd64 and i386 regression
> > tests just failed this test on PG13 (11 and 15 are ok):
>
> 12 and 14 are also failing, now on Debian unstable. (Again, only in
> the salsa.debian.org tests, not on apt.postgresql.org's buildds.)
>
> 12 amd64: https://salsa.debian.org/postgresql/postgresql/-/jobs/4898857
> 12 i386:  https://salsa.debian.org/postgresql/postgresql/-/jobs/4898858
>
> The tests there are running in Docker containers.

Hmm.  regress_log_039_end_of_wal says:

No such file or directory at
/builds/postgresql/postgresql/debian/output/source_dir/build/../src/test/perl/TestLib.pm
line 655.

In the 13 branch we see that's in the new scan_server_header()
subroutine where it tries to open the header, after asking pg_config
--includedir-server where it lives.  Hmm...




Re: locked reads for atomics

2023-11-10 Thread John Morris
>I wonder if it's worth providing a set of "locked read" functions.

Most out-of-order machines include “read acquire” and “write release” which are 
pretty close to what you’re suggesting. With the current routines, we only have 
“read relaxed” and  “write relaxed”. I think implementing acquire/release 
semantics is a very good idea.
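
For readers less familiar with the terminology, a classic C11 illustration
(not PostgreSQL API) of a read-acquire/write-release pairing:

#include <stdatomic.h>

static atomic_uint ready;
static unsigned payload;

void
producer(void)
{
	payload = 42;		/* plain store */
	/* "write release": the payload store can't be reordered below this */
	atomic_store_explicit(&ready, 1, memory_order_release);
}

unsigned
consumer(void)
{
	/* "read acquire": the payload load can't be reordered above this */
	while (atomic_load_explicit(&ready, memory_order_acquire) == 0)
		;
	return payload;		/* guaranteed to observe 42 */
}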

I would also like to clarify the properties of atomics. One very important 
question: Are atomics also volatile?  If so, the compiler has very limited 
ability to move them around. If not, it is difficult to tell when or where they 
will take place unless the surrounding code is peppered with barriers.


Re: pgsql: Don't trust unvalidated xl_tot_len.

2023-11-10 Thread Christoph Berg
Re: To Thomas Munro
> > src/test/recovery/t/039_end_of_wal.pl   | 460 
> > 
> 
> I haven't investigated the details yet, and it's not affecting the
> builds on apt.postgresql.org, but the Debian amd64 and i386 regression
> tests just failed this test on PG13 (11 and 15 are ok):

12 and 14 are also failing, now on Debian unstable. (Again, only in
the salsa.debian.org tests, not on apt.postgresql.org's buildds.)

12 amd64: https://salsa.debian.org/postgresql/postgresql/-/jobs/4898857
12 i386:  https://salsa.debian.org/postgresql/postgresql/-/jobs/4898858

The tests there are running in Docker containers.

Christoph




Re: Atomic ops for unlogged LSN

2023-11-10 Thread Nathan Bossart
On Thu, Nov 09, 2023 at 03:27:33PM -0600, Nathan Bossart wrote:
> I wonder if it's worth providing a set of "locked read" functions.  Those
> could just do a compare/exchange with 0 in the generic implementation.  For
> patches like this one where the overhead really shouldn't matter, I'd
> encourage folks to use those to make it easy to reason about correctness.

I moved this proposal to a new thread [0].

[0] https://postgr.es/m/20231110205128.GB1315705%40nathanxps13

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




locked reads for atomics

2023-11-10 Thread Nathan Bossart
Moving this to a new thread and adding it to the January commitfest.

On Thu, Nov 09, 2023 at 03:27:33PM -0600, Nathan Bossart wrote:
> On Tue, Nov 07, 2023 at 04:58:16PM -0800, Andres Freund wrote:
>> However, even if there's likely some other implied memory barrier that we
>> could piggyback on, the patch is much simpler to understand if it doesn't change
>> coherency rules. There's no way the overhead could matter.
> 
> I wonder if it's worth providing a set of "locked read" functions.  Those
> could just do a compare/exchange with 0 in the generic implementation.  For
> patches like this one where the overhead really shouldn't matter, I'd
> encourage folks to use those to make it easy to reason about correctness.

Concretely, like this.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
diff --git a/src/include/port/atomics.h b/src/include/port/atomics.h
index bbff945eba..21a6edac3e 100644
--- a/src/include/port/atomics.h
+++ b/src/include/port/atomics.h
@@ -228,7 +228,8 @@ pg_atomic_init_u32(volatile pg_atomic_uint32 *ptr, uint32 val)
  * The read is guaranteed to return a value as it has been written by this or
  * another process at some point in the past. There's however no cache
  * coherency interaction guaranteeing the value hasn't since been written to
- * again.
+ * again.  Consider using pg_atomic_locked_read_u32() unless you have a strong
+ * reason (e.g., performance) to use unlocked reads.
  *
  * No barrier semantics.
  */
@@ -239,6 +240,24 @@ pg_atomic_read_u32(volatile pg_atomic_uint32 *ptr)
 	return pg_atomic_read_u32_impl(ptr);
 }
 
+/*
+ * pg_atomic_read_u32 - locked read from atomic variable.
+ *
+ * This read is guaranteed to read the current value, but note that there's no
+ * guarantee that the value isn't updated between when this function returns
+ * and when you try to use it.  Note that this is less performant than an
+ * unlocked read (i.e., pg_atomic_read_u32()), but it is generally much easier
+ * to reason about correctness with locked reads.
+ *
+ * Full barrier semantics.
+ */
+static inline uint32
+pg_atomic_locked_read_u32(volatile pg_atomic_uint32 *ptr)
+{
+	AssertPointerAlignment(ptr, 4);
+	return pg_atomic_locked_read_u32_impl(ptr);
+}
+
 /*
  * pg_atomic_write_u32 - write to atomic variable.
  *
@@ -429,6 +448,15 @@ pg_atomic_read_u64(volatile pg_atomic_uint64 *ptr)
 	return pg_atomic_read_u64_impl(ptr);
 }
 
+static inline uint64
+pg_atomic_locked_read_u64(volatile pg_atomic_uint64 *ptr)
+{
+#ifndef PG_HAVE_ATOMIC_U64_SIMULATION
+	AssertPointerAlignment(ptr, 8);
+#endif
+	return pg_atomic_locked_read_u64_impl(ptr);
+}
+
 static inline void
 pg_atomic_write_u64(volatile pg_atomic_uint64 *ptr, uint64 val)
 {
diff --git a/src/include/port/atomics/generic.h b/src/include/port/atomics/generic.h
index cb5804adbf..5bb3aea9b7 100644
--- a/src/include/port/atomics/generic.h
+++ b/src/include/port/atomics/generic.h
@@ -49,6 +49,25 @@ pg_atomic_read_u32_impl(volatile pg_atomic_uint32 *ptr)
 }
 #endif
 
+#ifndef PG_HAVE_ATOMIC_LOCKED_READ_U32
+#define PG_HAVE_ATOMIC_LOCKED_READ_U32
+static inline uint32
+pg_atomic_locked_read_u32_impl(volatile pg_atomic_uint32 *ptr)
+{
+	uint32 old = 0;
+
+	/*
+	 * In the generic implementation, locked reads are implemented as a
+	 * compare/exchange with 0.  That'll fail or succeed, but always return the
+	 * most up-to-date value.  It might also store a 0, but only if the
+	 * previous value was also a zero, i.e., harmless.
+	 */
+	pg_atomic_compare_exchange_u32_impl(ptr, &old, 0);
+
+	return old;
+}
+#endif
+
 #ifndef PG_HAVE_ATOMIC_WRITE_U32
 #define PG_HAVE_ATOMIC_WRITE_U32
 static inline void
@@ -325,6 +344,25 @@ pg_atomic_read_u64_impl(volatile pg_atomic_uint64 *ptr)
 #endif /* PG_HAVE_8BYTE_SINGLE_COPY_ATOMICITY && !PG_HAVE_ATOMIC_U64_SIMULATION */
 #endif /* PG_HAVE_ATOMIC_READ_U64 */
 
+#ifndef PG_HAVE_ATOMIC_LOCKED_READ_U64
+#define PG_HAVE_ATOMIC_LOCKED_READ_U64
+static inline uint64
+pg_atomic_locked_read_u64_impl(volatile pg_atomic_uint64 *ptr)
+{
+	uint64 old = 0;
+
+	/*
+	 * In the generic implementation, locked reads are implemented as a
+	 * compare/exchange with 0.  That'll fail or succeed, but always return the
+	 * most up-to-date value.  It might also store a 0, but only if the
+	 * previous value was also a zero, i.e., harmless.
+	 */
+	pg_atomic_compare_exchange_u64_impl(ptr, &old, 0);
+
+	return old;
+}
+#endif
+
 #ifndef PG_HAVE_ATOMIC_INIT_U64
 #define PG_HAVE_ATOMIC_INIT_U64
 static inline void

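For context, usage of the proposed API would look roughly like this (a sketch
only; the variable name is invented, loosely modeled on the unlogged-LSN case
that motivated this):

static pg_atomic_uint64 someCounter;

static uint64
advance_counter(void)
{
	/* writer side: fetch_add already has full barrier semantics */
	return pg_atomic_fetch_add_u64(&someCounter, 1);
}

static uint64
read_counter(void)
{
	/* reader side: the locked read pairs with the writer's barrier */
	return pg_atomic_locked_read_u64(&someCounter);
}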

Re: pg_dump needs SELECT privileges on irrelevant extension table

2023-11-10 Thread Jacob Champion
On Thu, Nov 9, 2023 at 11:02 AM Tom Lane  wrote:
> I'm hearing nothing but crickets :-(

Yeah :/

Based on your arguments above, it sounds like your patch may improve
several other corner cases when backported, so that sounds good
overall to me. My best guess is that Timescale will be happy with this
patch's approach. But I can't speak with any authority.

Aleks -- anything to add?

--Jacob




Re: Adding facility for injection points (or probe points?) for more advanced tests

2023-11-10 Thread Nathan Bossart
On Tue, Nov 07, 2023 at 05:01:16PM +0900, Michael Paquier wrote:
> On Mon, Nov 06, 2023 at 10:28:14PM +0300, Nazir Bilal Yavuz wrote:
>> I liked the idea; thanks for working on this!

+1, this seems very useful.

> +#ifdef USE_INJECTION_POINTS
> +#define INJECTION_POINT_RUN(name) InjectionPointRun(name)
> +#else
> +#define INJECTION_POINT_RUN(name) ((void) name)
> +#endif

nitpick: Why is the non-injection-point version "(void) name"?  I see
"(void) true" used elsewhere for this purpose.

> +
> + Here is an example of callback for
> + InjectionPointCallback:
> +
> +static void
> +custom_injection_callback(const char *name)
> +{
> +elog(NOTICE, "%s: executed custom callback", name);
> +}
> +

Why do we provide the name to the callback functions?

Overall, the design looks pretty good to me.  I think it's a good idea to
keep it simple to start with.  Since this is really only intended for
special tests that run in special builds, it seems like we ought to be able
to change it easily in the future as needed.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Add recovery to pg_control and remove backup_label

2023-11-10 Thread David Steele

On 11/10/23 00:37, Michael Paquier wrote:

On Tue, Nov 07, 2023 at 05:20:27PM +0900, Michael Paquier wrote:

On Mon, Nov 06, 2023 at 05:39:02PM -0400, David Steele wrote:
I've retested today, and miss the failure.  I'll let you know if I see
this again.


I've done a few more dozen runs, and still nothing.  I am wondering
what this disturbance was.


OK, hopefully it was just a blip.


If we set these fields where backup_label was renamed, the logic would not
be exactly the same since pg_control won't be updated until the next time
through the loop. Since the fields should be updated before
UpdateControlFile() I thought it made sense to keep all the updates
together.

Overall I think it is simpler, and we don't need to acquire a lock on
ControlFile.


What you are proposing is the same as what we already do for
backupEndRequired or backupStartPoint in the control file when
initializing recovery, so objection withdrawn.


Thomas included this change in his pg_basebackup changes so I did the same.
Maybe wait a bit before we split this out? Seems like a pretty small
change...


Seems like a pretty good argument for refactoring that now, and let
any other patches rely on it.  Would you like to send a separate
patch?


The split still looks worth doing seen from here, so I am switching
the patch as WoA for now.


This has been split out.


Actually, grouping backupStartPointTLI and minRecoveryPointTLI should
reduce the size more with some alignment magic, no?


I thought about this, but it seemed to me that existing fields had been
positioned to make the grouping logical rather than to optimize alignment,
e.g. minRecoveryPointTLI. Ideally that would have been placed near
backupEndRequired (or vice versa). But if the general opinion is to
rearrange for alignment, I'm OK with that.


I've not tested, but it looks like moving backupStartPointTLI after
backupEndPoint should shave 8 bytes, if you want to maintain a more
coherent group for the LSNs.


OK, I have moved backupStartPointTLI.


+* backupFromStandby indicates that the backup was taken on a standby. It is
+* require to initialize recovery and set to false afterwards.
s/require/required/.


Fixed.


The term "backup recovery", that we've never used in the tree until
now as far as I know.  Perhaps this recovery method should just be
referred as "recovery from backup"?


Well, "backup recovery" is less awkward, I think. For instance "backup 
recovery field" vs "recovery from backup field".



By the way, there is another thing that this patch has forgotten: the
SQL functions that display data from the control file.  Shouldn't
pg_control_recovery() be extended with the new fields?  These fields
may be less critical than the other ones related to recovery, but I
suspect that showing them can become handy at least for debugging and
monitoring purposes.


I guess that depends on whether we reset them or not (discussion below). 
Right now they would not be visible since by the time the user could log 
on they would be reset.



Something in this area is that backupRecoveryRequired is the switch
controlling if the fields set by the recovery initialization.  Could
it be actually useful to leave the other fields as they are and only
reset backupRecoveryRequired before the first control file update?
This would leave a trace of the backup history directly in the control
file.


Since the other recovery fields are cleared in ReachedEndOfBackup() this 
would be a change from what we do now.


None of these fields are ever visible (with the exception of 
minRecoveryPoint/TLI) since they are reset when the database becomes 
consistent and before logons are allowed. Viewing them with 
pg_controldata makes sense, but I'm not sure pg_control_recovery() does.


In fact, are backup_start_lsn, backup_end_lsn, and 
end_of_backup_record_required ever non-zero when logged onto Postgres? 
Maybe I'm missing something?



What about pg_resetwal and RewriteControlFile()?  Shouldn't these
recovery fields be reset as well?


Done.


git diff --check is complaining a bit.


Fixed.

New patches attached based on eb81e8e790.

Regards,
-David

From d241a0e8aa589b3abf66331f8a3af0aabe87c214 Mon Sep 17 00:00:00 2001
From: David Steele 
Date: Fri, 10 Nov 2023 17:50:54 +
Subject: Allow content size to be passed to sendFileWithContent().

sendFileWithContent() previously got the content length by using strlen(),
but it is also possible to pass binary content. Use len == -1 to indicate
that strlen() should be used to get the content length; otherwise honor the
value in len.
---
 src/backend/backup/basebackup.c | 22 +++---
 1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index b537f462197..f216b588422 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -94,7 +94,7 @@ static bool verify_page_checksum(Page page, XLogRecPtr start_lsn,

Re: Failure during Building Postgres in Windows with Meson

2023-11-10 Thread Tristan Partin

On Thu Nov 9, 2023 at 9:31 AM CST, Nazir Bilal Yavuz wrote:

Hi,

On Thu, 9 Nov 2023 at 18:27, Tristan Partin  wrote:
>
> Can you try with Meson v1.2.3?

I tried with Meson v1.2.3 and upstream, both failed with the same error.


An employee at Collabora produced a fix[0]. It might still be worthwhile 
however to see why it happens in one shell and not the other.


[0]: https://github.com/mesonbuild/meson/pull/12498

--
Tristan Partin
Neon (https://neon.tech)




Re: Move bki file pre-processing from initdb to bootstrap

2023-11-10 Thread Krishnakumar R
Thank you for review, Peter.

Makes sense on the split part. I was starting to think along the same lines at
the end of the last iteration. Will come back shortly.

On Fri, Nov 10, 2023 at 12:48 AM Peter Eisentraut 
wrote:

> On 17.10.23 03:32, Krishnakumar R wrote:
> >> The version comparison has been moved from initdb to bootstrap. This
> >> created some compatibility problems with windows tests. For now I kept
> >> the version check to not have \n added, which worked fine and serves
> >> the purpose. However hoping to have something better in v3 in addition
> >> to addressing any other comments.
> >
> > With help from Thomas, figured out that on windows fopen uses binary
> > mode in the backend which causes issues with EOL. Please find the
> > attached patch updated with a fix for this.
>
> I suggest that this patch set be split up into three incremental parts:
>
> 1. Move some build-time settings from initdb to postgres.bki.
> 2. The database locale handling.
> 3. The bki file handling.
>
> Each of these topics really needs a separate detailed consideration.
>
>


Re: pgsql: Don't trust unvalidated xl_tot_len.

2023-11-10 Thread Christoph Berg
Re: To Thomas Munro
> I haven't investigated the details yet, and it's not affecting the
> builds on apt.postgresql.org, but the Debian amd64 and i386 regression
> tests just failed this test on PG13 (11 and 15 are ok):

That's on Debian bullseye, fwiw. (But the 13 build on apt.pg.o/bullseye passed.)

Christoph




Re: pgsql: Don't trust unvalidated xl_tot_len.

2023-11-10 Thread Christoph Berg
Re: Thomas Munro
> Don't trust unvalidated xl_tot_len.

> src/test/recovery/t/039_end_of_wal.pl   | 460 

I haven't investigated the details yet, and it's not affecting the
builds on apt.postgresql.org, but the Debian amd64 and i386 regression
tests just failed this test on PG13 (11 and 15 are ok):

t/039_end_of_wal.pl ..
Dubious, test returned 2 (wstat 512, 0x200)
No subtests run
Test Summary Report
---
t/039_end_of_wal.pl(Wstat: 512 Tests: 0 Failed: 0)
  Non-zero exit status: 2
  Parse errors: No plan found in TAP output
Files=26, Tests=254, 105 wallclock secs ( 0.10 usr  0.02 sys + 19.58 cusr 11.02 
csys = 30.72 CPU)
Result: FAIL
make[2]: *** [Makefile:23: check] Error 1

https://salsa.debian.org/postgresql/postgresql/-/jobs/4910354
https://salsa.debian.org/postgresql/postgresql/-/jobs/4910355

Christoph




Re: Improvements in pg_dump/pg_restore toc format and performances

2023-11-10 Thread Nathan Bossart
On Tue, Oct 03, 2023 at 03:17:57PM +0530, vignesh C wrote:
> Few comments:

Pierre, do you plan to submit a new revision of this patch set for the
November commitfest?  If not, the commitfest entry may be marked as
returned-with-feedback soon.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: SET ROLE documentation improvement

2023-11-10 Thread Nathan Bossart
On Tue, Sep 26, 2023 at 08:33:25AM -0700, Yurii Rashkovskii wrote:
> This is a good start, indeed. I've amended my patch to include it.

Thanks for the new patch.

Looking again, I'm kind of hesitant to add too much qualification to this
note about losing superuser privileges.  If we changed it to

Note that when a superuser chooses to SET ROLE to a non-superuser role,
they lose their superuser privileges, except for the privilege to
change to another role again using SET ROLE or RESET ROLE.

it almost seems to imply that a non-superuser role could obtain the ability
to switch to any role if they first SET ROLE to a superuser.  In practice,
that's true because they could just give the session role SUPERUSER, but I
don't think that's the intent of this section.

I thought about changing it to something like

Note that when a superuser chooses to SET ROLE to a non-superuser role,
they lose their superuser privileges.  However, if the current session
user is a superuser, they retain the ability to set the current user
identifier to any role via SET ROLE and RESET ROLE.

but it seemed weird to me to single out superusers here when it's always
true that the current session user retains the ability to SET ROLE to any
role they have the SET option on.  That is already covered above in the
"Description" section, so I don't really see the need to belabor the point
by adding qualifications to the "Notes" section.  ISTM the point of these
couple of paragraphs in the "Notes" section is to explain the effects on
privileges for schemas, tables, etc.

I still think we should update the existing note about privileges for
SET/RESET ROLE to something like the following:

diff --git a/doc/src/sgml/ref/set_role.sgml b/doc/src/sgml/ref/set_role.sgml
index 13bad1bf66..c91a95f5af 100644
--- a/doc/src/sgml/ref/set_role.sgml
+++ b/doc/src/sgml/ref/set_role.sgml
@@ -41,8 +41,10 @@ RESET ROLE
   
 
   
-   The specified role_name
-   must be a role that the current session user is a member of.
+   The current session user must have the SET option for the
+   specified role_name, either
+   directly or indirectly via a chain of memberships with the
+   SET option.
(If the session user is a superuser, any role can be selected.)
   

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: AdvanceXLInsertBuffers() vs wal_sync_method=open_datasync

2023-11-10 Thread Andres Freund
Hi,

On 2023-11-10 17:16:35 +0200, Heikki Linnakangas wrote:
> On 10/11/2023 05:54, Andres Freund wrote:
> > In this case I had used wal_sync_method=open_datasync - it's often faster 
> > and
> > if we want to scale WAL writes more we'll have to use it more widely (you
> > can't have multiple fdatasyncs in progress and reason about which one 
> > affects
> > what, but you can have multiple DSYNC writes in progress at the same time).
>
> Not sure I understand that. If you issue an fdatasync, it will sync all
> writes that were complete before the fdatasync started. Right? If you have
> multiple fdatasyncs in progress, that's true for each fdatasync. Or is there
> a bottleneck in the kernel with multiple in-progress fdatasyncs or
> something?

Many filesystems only allow a single fdatasync to really be in progress at the
same time, they eventually acquire an inode specific lock.  More problematic
cases include things like a write followed by an fdatasync, followed by a
write of the same block in another process/thread - there's very little
guarantee about which contents of that block are now durable.

But more importantly, using fdatasync doesn't scale because it effectively has
to flush the entire write cache on the device - which often contains plenty of
other dirty data. Whereas O_DSYNC can use FUA writes, which just makes the
individual WAL writes write through the cache, while leaving the rest of the
cache "unaffected".


> > After a bit of confused staring and debugging I figured out that the problem
> > is that the RequestXLogSwitch() within the code for starting a basebackup 
> > was
> > triggering writing back the WAL in individual 8kB writes via
> > GetXLogBuffer()->AdvanceXLInsertBuffer(). With open_datasync each of these
> > writes is durable - on this drive each take about 1ms.
>
> I see. So the assumption in AdvanceXLInsertBuffer() is that XLogWrite() is
> relatively fast. But with open_datasync, it's not.

I'm not sure that was an explicit assumption rather than just how it worked
out.


> > To fix this, I suspect we need to make
> > GetXLogBuffer()->AdvanceXLInsertBuffer() flush more aggressively. In this
> > specific case, we even know for sure that we are going to fill a lot more
> > buffers, so no heuristic would be needed. In other cases however we need 
> > some
> > heuristic to know how much to write out.
>
> +1. Maybe use the same logic as in XLogFlush().

I've actually been wondering about moving all the handling of WALWriteLock to
XLogWrite() and/or a new function called from all the places calling
XLogWrite().

I suspect we can't quite use the same logic in AdvanceXLInsertBuffer() as we
do in XLogFlush() - we e.g. don't ever want to trigger flushing out a
partially filled page, for example. Or really ever want to unnecessarily wait
for a WAL insertion to complete when we don't have to.


> I wonder if the 'flexible' argument to XLogWrite() is too inflexible. It
> would be nice to pass a hard minimum XLogRecPtr that it must write up to,
> but still allow it to write more than that if it's convenient.

Yes, I've also thought that.  In the AIOified WAL code I ended up tracking
"minimum" and "optimal" write/flush locations.

Greetings,

Andres Freund




Re: Buffer Cache Problem

2023-11-10 Thread jacktby jacktby
Hi, I have 3 questions here:
1. I see comments in buf_internals.h below:

 * Also, in places we do one-time reads of the flags without bothering to
 * lock the buffer header; this is generally for situations where we don't
 * expect the flag bit being tested to be changing.

In fact, the flag is in the state field, which is an atomic_u32, so we don't
need to acquire the buffer header lock in any case; but this comment seems to
say that in general we need to hold the buffer header lock when reading the
flags.

2. Another question:

 * We can't physically remove items from a disk page if another backend has
 * the buffer pinned.  Hence, a backend may need to wait for all other pins
 * to go away.  This is signaled by storing its own pgprocno into
 * wait_backend_pgprocno and setting flag bit BM_PIN_COUNT_WAITER.  At present,
 * there can be only one such waiter per buffer.

Per the comments above: for now, if a backend plans to remove items from a
disk page, that is a mutation operation, so the backend must hold an exclusive
lock on the buffer page. In that case there are no other backends pinning this
buffer, so the pin refcount must be 1 (held by this backend), and the backend
can remove the items safely with no need to wait for other backends (because
there are no other backends pinning this buffer). So my question is:
isn't the operation "storing its own pgprocno into wait_backend_pgprocno and
setting flag bit BM_PIN_COUNT_WAITER" too expensive? We should not do it like
this, right?

3. Where is the array?

 * Per-buffer I/O condition variables are currently kept outside this struct in
 * a separate array.  They could be moved in here and still fit within that
 * limit on common systems, but for now that is not done.


Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

2023-11-10 Thread Nathan Bossart
On Fri, Nov 10, 2023 at 10:17:49AM +0530, Dilip Kumar wrote:
> On Thu, Nov 9, 2023 at 4:55 PM Alvaro Herrera  wrote:
>> The only point on which we do not have full consensus yet is the need to
>> have one GUC per SLRU, and a lot of effort seems focused on trying to
>> fix the problem without adding so many GUCs (for example, using shared
>> buffers instead, or use a single "scaling" GUC).  I think that hinders
>> progress.  Let's just add multiple GUCs, and users can leave most of
>> them alone and only adjust the one with which they have a performance
>> problems; it's not going to be the same one for everybody.
> 
> +1

+1

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: CRC32C Parallel Computation Optimization on ARM

2023-11-10 Thread Nathan Bossart
On Tue, Nov 07, 2023 at 08:05:45AM +, Xiang Gao wrote:
> I think I understand what you mean, this is the latest patch. Thank you!

Thanks for the new patch.

+# PGAC_ARMV8_VMULL_INTRINSICS
+# 
+# Check if the compiler supports the vmull_p64
+# intrinsic functions. These instructions
+# were first introduced in ARMv8 crypto Extension.

I wonder if it'd be better to call this PGAC_ARMV8_CRYPTO_INTRINSICS since
this check seems to indicate the presence of +crypto.  Presumably there are
other instructions in this extension that could be used elsewhere, in which
case we could reuse this.

+# Use ARM VMULL if available and ARM CRC32C intrinsic is available too.
+if test x"$USE_ARMV8_VMULL" = x"" && test x"$USE_ARMV8_VMULL_WITH_RUNTIME_CHECK" = x"" && (test x"$USE_ARMV8_CRC32C" = x"1" || test x"$USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK" = x"1"); then
+  if test x"$pgac_armv8_vmull_intrinsics" = x"yes" && test x"$CFLAGS_VMULL" = x""; then
+    USE_ARMV8_VMULL=1
+  else
+    if test x"$pgac_armv8_vmull_intrinsics" = x"yes"; then
+      USE_ARMV8_VMULL_WITH_RUNTIME_CHECK=1
+    fi
+  fi
+fi

I'm not sure I see the need to check USE_ARMV8_CRC32C* when setting these.
Couldn't we set them solely on the results of our
PGAC_ARMV8_VMULL_INTRINSICS check?  It looks like this is what you are
doing in meson.build already.

+extern pg_crc32c pg_comp_crc32c_with_vmull_armv8(pg_crc32c crc, const void *data, size_t len);

nitpick: Maybe pg_comp_crc32c_armv8_parallel()?

-# all versions of pg_crc32c_armv8.o need CFLAGS_CRC
-pg_crc32c_armv8.o: CFLAGS+=$(CFLAGS_CRC)
-pg_crc32c_armv8_shlib.o: CFLAGS+=$(CFLAGS_CRC)
-pg_crc32c_armv8_srv.o: CFLAGS+=$(CFLAGS_CRC)

Why are these lines deleted?

-  ['pg_crc32c_armv8', 'USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK', 'crc'],
+  ['pg_crc32c_armv8', 'USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK'],

What is the purpose of this change?

+__attribute__((target("+crc+crypto")))

I'm not sure we can assume that all compilers will understand this, and I'm
not sure we need it.

+   if (use_vmull)
+   {
+/*
+ * Crc32c parallel computation Input data is divided into three
+ * equal-sized blocks. Block length : 42 words(42 * 8 bytes).
+ * CRC0: 0 ~ 41 * 8,
+ * CRC1: 42 * 8 ~ (42 * 2 - 1) * 8,
+ * CRC2: 42 * 2 * 8 ~ (42 * 3 - 1) * 8.
+ */

Shouldn't we surround this with #ifdefs for USE_ARMV8_VMULL*?

if (pg_crc32c_armv8_available())
+   {
+#if defined(USE_ARMV8_VMULL)
+   pg_comp_crc32c = pg_comp_crc32c_with_vmull_armv8;
+#elif defined(USE_ARMV8_VMULL_WITH_RUNTIME_CHECK)
+   if (pg_vmull_armv8_available())
+   {
+   pg_comp_crc32c = pg_comp_crc32c_with_vmull_armv8;
+   }
+   else
+   {
+   pg_comp_crc32c = pg_comp_crc32c_armv8;
+   }
+#else
pg_comp_crc32c = pg_comp_crc32c_armv8;
+#endif
+   }

IMO it'd be better to move the #ifdefs into the functions so that we can
simplify this to something like

if (pg_crc32c_armv8_available())
{
    if (pg_crc32c_armv8_crypto_available())
        pg_comp_crc32c = pg_comp_crc32c_armv8_parallel;
    else
        pg_comp_crc32c = pg_comp_crc32c_armv8;
}
else
    pg_comp_crc32c = pg_comp_crc32c_sb8;
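
One way to keep that dispatch flat is to fold the #ifdef maze into a helper
along these lines (a sketch reusing the macros and the runtime probe named in
the patch; the helper name itself is mine):

static bool
pg_crc32c_armv8_crypto_available(void)
{
#if defined(USE_ARMV8_VMULL)
	return true;
#elif defined(USE_ARMV8_VMULL_WITH_RUNTIME_CHECK)
	/* runtime CPU feature probe, as in the patch */
	return pg_vmull_armv8_available();
#else
	return false;
#endif
}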

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: AdvanceXLInsertBuffers() vs wal_sync_method=open_datasync

2023-11-10 Thread Heikki Linnakangas

On 10/11/2023 05:54, Andres Freund wrote:

In this case I had used wal_sync_method=open_datasync - it's often faster and
if we want to scale WAL writes more we'll have to use it more widely (you
can't have multiple fdatasyncs in progress and reason about which one affects
what, but you can have multiple DSYNC writes in progress at the same time).


Not sure I understand that. If you issue an fdatasync, it will sync all 
writes that were complete before the fdatasync started. Right? If you 
have multiple fdatasyncs in progress, that's true for each fdatasync. Or 
is there a bottleneck in the kernel with multiple in-progress fdatasyncs 
or something?



After a bit of confused staring and debugging I figured out that the problem
is that the RequestXLogSwitch() within the code for starting a basebackup was
triggering writing back the WAL in individual 8kB writes via
GetXLogBuffer()->AdvanceXLInsertBuffer(). With open_datasync each of these
writes is durable - on this drive each take about 1ms.


I see. So the assumption in AdvanceXLInsertBuffer() is that XLogWrite() 
is relatively fast. But with open_datasync, it's not.



To fix this, I suspect we need to make
GetXLogBuffer()->AdvanceXLInsertBuffer() flush more aggressively. In this
specific case, we even know for sure that we are going to fill a lot more
buffers, so no heuristic would be needed. In other cases however we need some
heuristic to know how much to write out.


+1. Maybe use the same logic as in XLogFlush().

I wonder if the 'flexible' argument to XLogWrite() is too inflexible. It 
would be nice to pass a hard minimum XLogRecPtr that it must write up 
to, but still allow it to write more than that if it's convenient.


--
Heikki Linnakangas
Neon (https://neon.tech)





Re: EXCLUDE COLLATE in CREATE/ALTER TABLE document

2023-11-10 Thread jian he
On Wed, Nov 1, 2023 at 10:30 AM shihao zhong  wrote:
>
> Thank you for your feedback on my previous patch. I have fixed the issue and 
> attached a new patch for your review. Could you please take a look for it if 
> you have a sec? Thanks
>

Your patch works fine. You can see it here:
https://cirrus-ci.com/task/6481922939944960
In an ideal world, since the doc is already built, we could probably
view it as a plain HTML file just by clicking the CI test result.

in src/sgml/ref/create_table.sgml:
"Each exclude_element can optionally specify an operator class and/or
ordering options; these are described fully under CREATE INDEX."

You may need to update this sentence to reflect that exclude_element
can also optionally specify collation.




Re: Buffer Cache Problem

2023-11-10 Thread jacktby jacktby

> On 2023-11-10 at 22:31, jacktby jacktby wrote:
> 
> In buf_internals.h, I see
> 
>  Note: Buffer header lock (BM_LOCKED flag) must be held to examine or change
> tag, state or wait_backend_pgprocno fields.
> 
> As we all know, this buffer header lock is implemented by a bit in the state
> field, and the state field is an atomic u32, so in fact we don't need to
> hold the buffer lock when we update the state; this comment has an error, right?
Oh, sorry, this is true; in fact we never acquire a spinlock when updating the
state.
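
For illustration, a simplified sketch (not the actual PostgreSQL source) of
how the header lock lives inside the atomic state word: acquiring it is just
an atomic fetch-or of the BM_LOCKED bit, so no separate spinlock object is
involved.

static uint32
lock_buf_hdr_sketch(pg_atomic_uint32 *state)
{
	uint32		old_state;

	for (;;)
	{
		/* atomically set the lock bit; old_state tells us whether we won */
		old_state = pg_atomic_fetch_or_u32(state, BM_LOCKED);
		if (!(old_state & BM_LOCKED))
			return old_state | BM_LOCKED;	/* acquired */
		/* someone else holds it; the real code backs off via SpinDelayStatus */
	}
}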

Re: Buffer Cache Problem

2023-11-10 Thread jacktby jacktby
In buf_internals.h, I see

 Note: Buffer header lock (BM_LOCKED flag) must be held to examine or change
tag, state or wait_backend_pgprocno fields.

As we all know, this buffer header lock is implemented by a bit in the state
field, and the state field is an atomic u32, so in fact we don't need to
hold the buffer lock when we update the state; this comment has an error, right?

Re: ResourceOwner refactoring

2023-11-10 Thread Heikki Linnakangas

Thanks for the testing again!

On 10/11/2023 11:00, Alexander Lakhin wrote:

I could see two failure modes:
2023-11-10 08:42:28.870 UTC [1163274] ERROR:  ResourceOwnerEnlarge called after 
release started
2023-11-10 08:42:28.870 UTC [1163274] STATEMENT:  drop table t;
2023-11-10 08:42:28.870 UTC [1163274] WARNING:  AbortTransaction while in 
COMMIT state
2023-11-10 08:42:28.870 UTC [1163274] PANIC:  cannot abort transaction 906, it 
was already committed

2023-11-10 08:43:27.897 UTC [1164148] ERROR:  ResourceOwnerEnlarge called after 
release started
2023-11-10 08:43:27.897 UTC [1164148] STATEMENT:  DROP DATABASE db69;
2023-11-10 08:43:27.897 UTC [1164148] WARNING:  AbortTransaction while in 
COMMIT state
2023-11-10 08:43:27.897 UTC [1164148] PANIC:  cannot abort transaction 1043, it 
was already committed

The stack trace for the second ERROR (ResourceOwnerEnlarge called ...) is:
...
#6  0x558af5b2f35c in ResourceOwnerEnlarge (owner=0x558af716f3c8) at 
resowner.c:455
#7  0x558af5888f18 in dsm_create_descriptor () at dsm.c:1207
#8  0x558af5889205 in dsm_attach (h=3172038420) at dsm.c:697
#9  0x558af5b1ebed in get_segment_by_index (area=0x558af711da18, index=2) 
at dsa.c:1764
#10 0x558af5b1ea4b in dsa_get_address (area=0x558af711da18, 
dp=2199023329568) at dsa.c:970
#11 0x558af5669366 in dshash_seq_next (status=0x7ffdd5912fd0) at 
dshash.c:687
#12 0x558af5901998 in pgstat_drop_database_and_contents (dboid=16444) at 
pgstat_shmem.c:830
#13 0x558af59016f0 in pgstat_drop_entry (kind=PGSTAT_KIND_DATABASE, 
dboid=16444, objoid=0) at pgstat_shmem.c:888
#14 0x558af59044eb in AtEOXact_PgStat_DroppedStats 
(xact_state=0x558af7111ee0, isCommit=true) at pgstat_xact.c:88
#15 0x558af59043c7 in AtEOXact_PgStat (isCommit=true, parallel=false) at 
pgstat_xact.c:55
#16 0x558af53c782e in CommitTransaction () at xact.c:2371
#17 0x558af53c709e in CommitTransactionCommand () at xact.c:306
...


The quick, straightforward fix is to move the "CurrentResourceOwner = 
NULL" line earlier in CommitTransaction, per attached 
0003-Clear-CurrentResourceOwner-earlier-in-CommitTransact.patch. You're 
not allowed to use the resource owner after you start to release it; it 
was a bit iffy even before the ResourceOwner rewrite but now it's 
explicitly forbidden. By clearing CurrentResourceOwner as soon as we 
start releasing it, we can prevent any accidental use.


When CurrentResourceOwner == NULL, dsm_attach() returns a handle that's 
not associated with any ResourceOwner. That's appropriate for the pgstat 
case. The DSA is "pinned", so the handle is forgotten from the 
ResourceOwner right after calling dsm_attach() anyway.


Looking closer at dsa.c, I think this is a wider problem though. The 
comments don't make it very clear how it's supposed to interact with 
ResourceOwners. There's just this brief comment in dsa_pin_mapping():



 * By default, areas are owned by the current resource owner, which means they
 * are detached automatically when that scope ends.


The dsa_area struct isn't directly owned by any ResourceOwner though. 
The DSM segments created by dsa_create() or dsa_attach() are.


But the functions dsa_allocate() and dsa_get_address() can create or 
attach more DSM segments to the area, and they will be owned by
the current resource owner *at the time of the call*. So if you call 
dsa_get_address() while in a different resource owner, things get very 
confusing. The attached new test module demonstrates that 
(0001-Add-test_dsa-module.patch), here's a shortened version:


a = dsa_create(tranche_id);

/* Switch to new resource owner */
oldowner = CurrentResourceOwner;
childowner = ResourceOwnerCreate(oldowner, "temp owner");
CurrentResourceOwner = childowner;

/* make a bunch of allocations */
for (int i = 0; i < 1; i++)
p[i] = dsa_allocate(a, 1000);

/* Release the child resource owner */
CurrentResourceOwner = oldowner;
ResourceOwnerRelease(childowner,
 RESOURCE_RELEASE_BEFORE_LOCKS,
 true, false);
ResourceOwnerRelease(childowner,
 RESOURCE_RELEASE_LOCKS,
 true, false);
ResourceOwnerRelease(childowner,
 RESOURCE_RELEASE_AFTER_LOCKS,
 true, false);
ResourceOwnerDelete(childowner);


dsa_detach(a);

This first prints warnings on The ResourceOwnerRelease() calls:

2023-11-10 13:57:21.475 EET [745346] WARNING:  resource was not closed: 
dynamic shared memory segment 2395813396
2023-11-10 13:57:21.475 EET [745346] WARNING:  resource was not closed: 
dynamic shared memory segment 3922992700
2023-11-10 13:57:21.475 EET [745346] WARNING:  resource was not closed

Re: pg_upgrade and logical replication

2023-11-10 Thread vignesh C
On Thu, 9 Nov 2023 at 07:44, Peter Smith  wrote:
>
> Thanks for addressing my previous review comments.
>
> I re-checked the latest patch v12-0001 and found the following:
>
> ==
> Commit message
>
> 1.
> The new SQL binary_upgrade_create_sub_rel_state function has the following
> syntax:
> SELECT binary_upgrade_create_sub_rel_state(subname text, relid oid,
> state char [,sublsn pg_lsn])
>
> ~
>
> Looks like v12 accidentally forgot to update this to the modified
> function name 'binary_upgrade_add_sub_rel_state'

This is handled in the v13 version patch posted at:
https://www.postgresql.org/message-id/CALDaNm0mGz6_69BiJTmEqC8Q0U0x2nMZOs3w9btKOHZZpfC2ow%40mail.gmail.com

Regards,
Vignesh




Re: pg_upgrade and logical replication

2023-11-10 Thread vignesh C
On Thu, 9 Nov 2023 at 12:23, Michael Paquier  wrote:
>
> On Thu, Nov 09, 2023 at 01:14:05PM +1100, Peter Smith wrote:
> > Looks like v12 accidentally forgot to update this to the modified
> > function name 'binary_upgrade_add_sub_rel_state'
>
> This v12 is overall cleaner than its predecessors.  Nice to see.
>
> +my $result = $publisher->safe_psql('postgres', "SELECT count(*) FROM t1");
> +is($result, qq(1), "check initial t1 table data on publisher");
> +$result = $publisher->safe_psql('postgres', "SELECT count(*) FROM t2");
> +is($result, qq(1), "check initial t1 table data on publisher");
> +$result = $old_sub->safe_psql('postgres', "SELECT count(*) FROM t1");
> +is($result, qq(1), "check initial t1 table data on the old subscriber");
> +$result = $old_sub->safe_psql('postgres', "SELECT count(*) FROM t2");
>
> I'd argue that t1 and t2 should have less generic names.  t1 is used
> to check that the upgrade process works, while t2 is added to the
> publication after upgrading the subscriber.  Say something like
> tab_upgraded or tab_not_upgraded?

Modified

> +my $synced_query =
> +  "SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN 
> ('r');";
>
> Perhaps it would be safer to use a query that checks the number of
> relations in 'r' state?  This query would return true if
> pg_subscription_rel has no tuples.

Modified
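
For reference, the safer formulation might look like this (a sketch; the
expected count depends on how many relations the test sets up, two here for
illustration):

SELECT count(1) = 2 FROM pg_subscription_rel WHERE srsubstate = 'r';

Unlike the original query, an empty pg_subscription_rel now fails the check
instead of passing it.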

> +# Table will be in 'd' (data is being copied) state as table sync will fail
> +# because of primary key constraint error.
> +my $started_query =
> +  "SELECT count(1) = 1 FROM pg_subscription_rel WHERE srsubstate = 'd';";
>
> Relying on a pkey error to enforce an incorrect state is a good trick.
> Nice.

That was better way to get data sync state without manually changing
the pg_subscription_rel catalog

> +command_fails(
> +[
> +'pg_upgrade', '--no-sync','-d', $old_sub->data_dir,
> +'-D', $new_sub->data_dir, '-b', $bindir,
> +'-B', $bindir,'-s', $new_sub->host,
> +'-p', $old_sub->port, '-P', $new_sub->port,
> +$mode,'--check',
> +],
> +'run of pg_upgrade --check for old instance with relation in \'d\' 
> datasync(invalid) state');
> +rmtree($new_sub->data_dir . "/pg_upgrade_output.d");
>
> Okay by me to not stop the cluster for the --check to shave a few
> cycles.  It's a bit sad that we don't cross-check the contents of
> subscription_state.txt before removing pg_upgrade_output.d.  Finding
> the file is easy even if the subdir where it is included is not a
> constant name.  Then it is possible to apply a regexp with the
> contents consumed by a slurp_file().

Modified

> +my $remote_lsn = $old_sub->safe_psql('postgres',
> +"SELECT remote_lsn FROM pg_replication_origin_status");
> Perhaps you've not noticed, but this would be 0/0 most of the time.
> However the intention is to check after a valid LSN to make sure that
> the origin is set, no?

I have added a few more inserts so that remote_lsn is not 0/0

> I am wondering whether this should use a bit more data than just one
> tuple, say at least two transaction, one of them with a multi-value
> INSERT?

Added one more multi-insert

> +# --
> +# Check that pg_upgrade is successful when all tables are in ready state.
> +# --
> This comment is a bit inconsistent with the state that are accepted,
> but why not, at least that's predictible.

The key test validation is mentioned in this style of comment

> +* relation to pg_subscripion_rel table. This will be used only in
>
> Typo: s/pg_subscripion_rel/pg_subscription_rel/.

Modified

> This needs some word-smithing to explain the reasons why a state is
> not needed:
>
> +/*
> + * The subscription relation should be in either i (initialize),
> + * r (ready) or s (synchronized) state as either the replication slot
> + * is not created or the replication slot is already dropped and the
> + * required WAL files will be present in the publisher. The other
> + * states are not ok as the worker has dependency on the replication
> + * slot/origin in these case:
>
> A slot not created yet refers to the 'i' state, while 'r' and 's'
> refer to a slot created previously but already dropped, right?
> Shouldn't this comment tell that rather than mixing the assumptions?

Modified

> + * a) SUBREL_STATE_DATASYNC: In this case, the table sync worker will
> + * try to drop the replication slot but as the replication slots will
> + * be created with old subscription id in the publisher and the
> + * upgraded subscriber will not be able to clean the slots in this
> + * case.
>
> Proposal: A relation upgraded while in this state would retain a
> replication slot, which could not be dropped by the sync worker
> spawned after the upgrade because the subscription ID tracked by the
>

Re: Bug: RLS policy FOR SELECT is used to check new rows

2023-11-10 Thread Laurenz Albe
On Fri, 2023-11-10 at 09:39 +, Dean Rasheed wrote:
> On Thu, 9 Nov 2023 at 18:55, Laurenz Albe  wrote:
> > I think it can be useful to allow a user an UPDATE where the result
> > does not satisfy the USING clause of the FOR SELECT policy.
> > 
> > The idea that an UPDATE should only produce rows you can SELECT is not
> > true today: if you run an UPDATE without a WHERE clause, you can
> > create rows you cannot see.  The restriction is only on UPDATEs with
> > a WHERE clause.  Weird, isn't it?
> 
> That's true, but only if the UPDATE also doesn't have a RETURNING
> clause. What I find weird about your proposal is that it would allow
> an UPDATE ... RETURNING command to return something that would be
> visible just that once, but then subsequently disappear. That seems
> like a cure that's worse than the original disease that kicked off
> this discussion.

What kicked off the discussion was my complaint that FOR SELECT
rules mess with UPDATE, so that's exactly what I would have liked:
an UPDATE that makes the rows vanish.

My naïve expectation was that FOR SELECT policies govern SELECT
and FOR UPDATE policies govern UPDATE.  After all, there is a
WITH CHECK clause for FOR UPDATE policies that checks the result rows.

So, from my perspective, we should never have let FOR SELECT policies
mess with an UPDATE.  But I am too late for that; such a change would
be way too invasive now.  So I'd like to introduce a "back door" by
creating a FOR SELECT policy with WITH CHECK (TRUE).

> As mentioned by others, the intention was that RLS behave like WITH
> CHECK OPTION on an updatable view, so that new rows can't just
> disappear. There are, however, 2 differences between the way it
> currently works for RLS, and an updatable view:
> 
> 1). RLS only does this for UPDATE commands. INSERT commands *can*
> insert new rows that aren't visible, and so disappear.
> 
> 2). It can't be turned off. The WITH CHECK OPTION on an updatable view
> is an option that the user can choose to turn on or off. That's not
> possible with RLS.

Right.  Plus the above-mentioned fact that you can make rows vanish
with an UPDATE that has no WHERE.

> It might be possible to change (2) though, by adding a new table-level
> option (similar to a view's WITH CHECK OPTION) that enabled or
> disabled the checking of new rows for that table, and whose default
> matched the current behaviour.

That would be a viable solution.

Pro: it doesn't make the already hideously complicated RLS system
even more complicated.

Con: yet another storage option...

> Before going too far down that route though, it is perhaps worth
> asking whether this is something users really want. Is there a real
> use-case for being able to UPDATE rows and have them disappear?

What triggered my investigation was this question:
https://stackoverflow.com/q/77346757/6464308

I personally don't have any stake in this.  I just wanted a way to
make RLS behave more like I think it should.

Yours,
Laurenz Albe




Re: Parallel aggregates in PG 16.1

2023-11-10 Thread Matthias van de Meent
On Fri, 10 Nov 2023 at 11:47, ZIMANYI Esteban  wrote:
>
> In MobilityDB
> https://github.com/MobilityDB/MobilityDB
> we have defined a tstzspan type which is a fixed-size equivalent of the 
> tstzrange type in PostgreSQL.
>
> We have a span_union aggregate function which is the equivalent of the 
> range_agg function in PostgreSQL defined as follows
>
> CREATE FUNCTION tstzspan_union_finalfn(internal)
>   RETURNS tstzspanset
>   AS 'MODULE_PATHNAME', 'Span_union_finalfn'
>   LANGUAGE C IMMUTABLE PARALLEL SAFE;
>
> CREATE AGGREGATE span_union(tstzspan) (
>   SFUNC = array_agg_transfn,
>   STYPE = internal,
>   COMBINEFUNC = array_agg_combine,
>   SERIALFUNC = array_agg_serialize,
>   DESERIALFUNC = array_agg_deserialize,
>   FINALFUNC = tstzspan_union_finalfn
> );
>
> As can be seen, we reuse the array_agg function to accumulate the values in 
> an array and the final function just does similar work as the 
> range_agg_finalfn to merge the overlapping spans.

Did you note the following section in the CREATE AGGREGATE documentation [0]?

"""
An aggregate can optionally support partial aggregation, as described
in Section 38.12.4.
This requires specifying the COMBINEFUNC parameter. If the
state_data_type is internal, it's usually also appropriate to provide
the SERIALFUNC and DESERIALFUNC parameters so that parallel
aggregation is possible.
Note that the aggregate must also be marked PARALLEL SAFE to enable
parallel aggregation.
"""

From this, it seems like the PARALLEL = SAFE argument is missing from
your aggregate definition as provided above.
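
Concretely, adding that marking to the definition from the original message
would look like this (a sketch; everything else unchanged):

CREATE AGGREGATE span_union(tstzspan) (
  SFUNC = array_agg_transfn,
  STYPE = internal,
  COMBINEFUNC = array_agg_combine,
  SERIALFUNC = array_agg_serialize,
  DESERIALFUNC = array_agg_deserialize,
  FINALFUNC = tstzspan_union_finalfn,
  PARALLEL = SAFE   -- without this the planner never considers partial aggregation
);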


Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

[0] https://www.postgresql.org/docs/16/sql-createaggregate.html




Re: Add new option 'all' to pg_stat_reset_shared()

2023-11-10 Thread torikoshia

On 2023-11-10 13:18, Andres Freund wrote:

Hi,

On November 8, 2023 11:28:08 PM PST, Michael Paquier
 wrote:

On Thu, Nov 09, 2023 at 01:50:34PM +0900, torikoshia wrote:

PGSTAT_KIND_SLRU cannot be reset by pg_stat_reset_shared(), so I feel
uncomfortable deleting it altogether.
It might be better after pg_stat_reset_shared() has been modified to take
'slru' as an argument, though.


Not sure how to feel about that, TBH, but I would not include SLRUs
here if we have already a separate function.

I thought it would be better to reset statistics even when null is
specified, so that users are not confused by the behavior of
pg_stat_reset_slru().  The attached patch adds pg_stat_reset_shared() in
system_functions.sql mainly for this reason.


I'm OK with doing what your patch does, aka do the work if the value
is NULL or if there's no argument given.

-Resets some cluster-wide statistics counters to zero, depending on the
+Resets cluster-wide statistics counters to zero, depending on the


This does not need to change, aka SLRUs are for example still global
and not included here.

+If the argument is NULL or not specified, all counters shown in all
+of these views are reset.

Missing a <literal> markup around NULL.  I know, we're not consistent
about that, either, but if we are tweaking the area let's be right at
least.  Perhaps "all the counters from the views listed above are
reset"?


I see no reason to not include slrus. We should never have added the
ability to reset them individually, particularly not without a use
case - I couldn't find one skimming some discussion. And what's the
point in not allowing to reset them via pg_stat_reset_shared()?


When including SLRUs, do you think it's better to add an 'slrus' argument
which enables pg_stat_reset_shared() to reset all SLRUs?

As described above, since SLRUs cannot be reset by
pg_stat_reset_shared(), I feel a bit uncomfortable deleting it
altogether.
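
For reference, the entry points being discussed (the behaviour of the
no-argument and NULL forms is per the patch, and still under debate):

SELECT pg_stat_reset_slru();        -- today: resets counters for all SLRU caches
SELECT pg_stat_reset_shared();      -- patch: no argument resets all shared stats
SELECT pg_stat_reset_shared(NULL);  -- patch: treated the same as no argument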



--
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation




Re: trying again to get incremental backup

2023-11-10 Thread Dilip Kumar
On Tue, Nov 7, 2023 at 2:06 AM Robert Haas  wrote:
>
> On Mon, Oct 30, 2023 at 2:46 PM Andres Freund  wrote:
> > After playing with this for a while, I don't see a reason for 
> > wal_summarize_mb
> > from a memory usage POV at least.
>
> Here's v8. Changes:

Review comments, based on what I reviewed so far.

- I think 0001 looks good improvement irrespective of the patch series.

- review 0003
1.
+   be enabled either on a primary or on a standby. WAL summarization can
+   cannot be enabled when wal_level is set to
+   minimal.

Grammatical error
"WAL summarization can cannot" -> WAL summarization cannot

2.
+ 
+  wal_summarize_keep_time (boolean)
+  
+   wal_summarize_keep_time
configuration parameter
+  

I feel the name of the GUC should be either wal_summarizer_keep_time
or wal_summaries_keep_time; I mean we should refer either to the
summarizer process or to the summary files.

3.

+XLogGetOldestSegno(TimeLineID tli)
+{
+
+ /* Ignore files that are not XLOG segments */
+ if (!IsXLogFileName(xlde->d_name))
+ continue;
+
+ /* Parse filename to get TLI and segno. */
+ XLogFromFileName(xlde->d_name, &file_tli, &file_segno,
+ wal_segment_size);
+
+ /* Ignore anything that's not from the TLI of interest. */
+ if (tli != file_tli)
+ continue;
+
+ /* If it's the oldest so far, update oldest_segno. */

Some of the single-line comments end with a full stop whereas others
do not, so better to be consistent.

4.

+ * If start_lsn != InvalidXLogRecPtr, only summaries that end before the
+ * indicated LSN will be included.
+ *
+ * If end_lsn != InvalidXLogRecPtr, only summaries that start before the
+ * indicated LSN will be included.
+ *
+ * The intent is that you can call GetWalSummaries(tli, start_lsn, end_lsn)
+ * to get all WAL summaries on the indicated timeline that overlap the
+ * specified LSN range.
+ */
+List *
+GetWalSummaries(TimeLineID tli, XLogRecPtr start_lsn, XLogRecPtr end_lsn)


Instead of "If start_lsn != InvalidXLogRecPtr, only summaries that end
before the" it should be "If start_lsn != InvalidXLogRecPtr, only
summaries that end after the" because only if the summary files are
Ending after the start_lsn then it will have some overlapping and we
need to return them if ending before start lsn then those files are
not overlapping at all, right?
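
The overlap rule being described can be written as a small predicate (a
sketch, not code from the patch):

static bool
summary_overlaps_range(XLogRecPtr summary_start, XLogRecPtr summary_end,
					   XLogRecPtr start_lsn, XLogRecPtr end_lsn)
{
	/* a summary ending at or before start_lsn cannot overlap the range */
	if (!XLogRecPtrIsInvalid(start_lsn) && summary_end <= start_lsn)
		return false;
	/* a summary starting at or after end_lsn cannot overlap the range */
	if (!XLogRecPtrIsInvalid(end_lsn) && summary_start >= end_lsn)
		return false;
	return true;
}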

5.
In FilterWalSummaries() header also the comment is wrong same as for
GetWalSummaries() function.

6.
+ * If the whole range of LSNs is covered, returns true, otherwise false.
+ * If false is returned, *missing_lsn is set either to InvalidXLogRecPtr
+ * if there are no WAL summary files in the input list, or to the first LSN
+ * in the range that is not covered by a WAL summary file in the input list.
+ */
+bool
+WalSummariesAreComplete(List *wslist, XLogRecPtr start_lsn,

I did not see any usage of this function, but if the whole range is not
covered, why not keep the behavior of '*missing_lsn' uniform: suppose
there is no file at all, then missing_lsn should be start_lsn, because
the very first LSN is missing.

7.
+ nbytes = FileRead(io->file, data, length, io->filepos,
+   WAIT_EVENT_WAL_SUMMARY_READ);
+ if (nbytes < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write file \"%s\": %m",
+ FilePathName(io->file;

/could not write file/ could not read file

8.
+/*
+ * Comparator to sort a List of WalSummaryFile objects by start_lsn.
+ */
+static int
+ListComparatorForWalSummaryFiles(const ListCell *a, const ListCell *b)
+{


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Synchronizing slots from primary to standby

2023-11-10 Thread Drouvot, Bertrand

Hi,

On 11/10/23 4:31 AM, shveta malik wrote:

On Thu, Nov 9, 2023 at 9:15 PM Drouvot, Bertrand
 wrote:

Yeah I think so, because there is a time window when one could "use" the slot
after the promotion and before it is removed. Producing things like:

"
2023-11-09 15:16:50.294 UTC [2580462] LOG:  dropped replication slot 
"logical_slot2" of dbid 5 as it was not sync-ready
2023-11-09 15:16:50.295 UTC [2580462] LOG:  dropped replication slot 
"logical_slot3" of dbid 5 as it was not sync-ready
2023-11-09 15:16:50.297 UTC [2580462] LOG:  dropped replication slot 
"logical_slot4" of dbid 5 as it was not sync-ready
2023-11-09 15:16:50.297 UTC [2580462] ERROR:  replication slot "logical_slot5" 
is active for PID 2594628
"

After the promotion one was able to use logical_slot5 and now we can now drop 
it.


Yes, I was suspicious about this small window which may allow others
to use this slot, that is why I was thinking of putting it in the
promotion flow and thus asked that question earlier. But the slot-sync
worker may end up creating it again in case it has not exited.


Sorry, there is a typo up-thread, I meant "After the promotion one was able to
use logical_slot5 and now we can NOT drop it.". We can not drop it because it
is in use.


So we
need to carefully decide at which places we need to put 'not-in
recovery' checks in slot-sync workers. In the previous version,
synchronize_one_slot() had that check and it was skipping sync if
'!RecoveryInProgress'. But I removed that check in v32, thinking
that the slots which the worker has already fetched from the primary
should all get synced, and it should exit after that, instead of syncing
half and leaving the rest. But now on rethinking: was the previous
behaviour correct, i.e. skipping sync from the point where we see it is
no longer in standby mode, even though a few of the slots have already
been synced in that sync-cycle? Thoughts?



I think we still need to think/discuss the promotion flow. I think we would need
to have the slot sync worker shutdown during the promotion (as suggested by 
Amit in [1])
but before that let the sync slot worker know it is now acting during
promotion.

Something like:

- let the sync worker know it is now acting under promotion
- do what needs to be done while acting under promotion
- shutdown the sync worker

That way we would avoid any "risk" of having the sync worker doing something
we don't expect while not in recovery anymore.

Regarding "do what needs to be done while acting under promotion":

- Ensure all slots in 'r' state are synced
- drop slots that are in 'i' state

Thoughts?

[1]: 
https://www.postgresql.org/message-id/CAA4eK1J2Pc%3D5TOgty5u4bp--y7ZHaQx3_2eWPL%3DVPJ7A_0JF2g%40mail.gmail.com

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Parallel aggregates in PG 16.1

2023-11-10 Thread ZIMANYI Esteban
In MobilityDB
https://github.com/MobilityDB/MobilityDB
we have defined a tstzspan type which is a fixed-size equivalent of the 
tstzrange type in PostgreSQL.

We have a span_union aggregate function which is the equivalent of the 
range_agg function in PostgreSQL defined as follows

CREATE FUNCTION tstzspan_union_finalfn(internal)
  RETURNS tstzspanset
  AS 'MODULE_PATHNAME', 'Span_union_finalfn'
  LANGUAGE C IMMUTABLE PARALLEL SAFE;

CREATE AGGREGATE span_union(tstzspan) (
  SFUNC = array_agg_transfn,
  STYPE = internal,
  COMBINEFUNC = array_agg_combine,
  SERIALFUNC = array_agg_serialize,
  DESERIALFUNC = array_agg_deserialize,
  FINALFUNC = tstzspan_union_finalfn
);

As can be seen, we reuse the array_agg function to accumulate the values in an 
array and the final function just does similar work as the range_agg_finalfn to 
merge the overlapping spans.

I am testing the parallel aggregate features of PG 16.1

test=# select version();
version
---
 PostgreSQL 16.1 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 
9.4.0-1ubuntu1~20.04.2) 9.4.0, 64-bit

I create a table with 1M random spans and another one with the same data 
converted to tstzrange

CREATE TABLE tbl_tstzspan_1M AS
SELECT k, random_tstzspan('2001-01-01', '2002-12-31', 10) AS t
FROM generate_series(1, 1e6) AS k;

CREATE TABLE tbl_tstzrange_1M AS
SELECT k, t::tstzrange
FROM tbl_tstzspan_1M;

test=# analyze;
ANALYZE
test=#

The tstzrange DOES NOT support parallel aggregates

test=# EXPLAIN
SELECT k%10, range_agg(t) AS t
FROM tbl_tstzrange_1M
group by k%10
order by k%10;

 QUERY PLAN
-
 GroupAggregate  (cost=66706.17..203172.65 rows=100 width=64)
   Group Key: ((k % '10'::numeric))
   ->  Gather Merge  (cost=66706.17..183172.65 rows=100 width=54)
 Workers Planned: 2
 ->  Sort  (cost=65706.15..66747.81 rows=416667 width=54)
   Sort Key: ((k % '10'::numeric))
   ->  Parallel Seq Scan on tbl_tstzrange_1m  (cost=0.00..12568.33 
rows=416667 width=54)
(7 rows)

The array_agg function supports parallel aggregates

test=# EXPLAIN
SELECT k%10, array_agg(t) AS t
FROM tbl_tstzspan_1M
group by k%10
order by k%10;
QUERY PLAN
--
 Finalize GroupAggregate  (cost=66706.17..193518.60 rows=100 width=64)
   Group Key: ((k % '10'::numeric))
   ->  Gather Merge  (cost=66706.17..172268.60 rows=84 width=64)
 Workers Planned: 2
 ->  Partial GroupAggregate  (cost=65706.15..75081.15 rows=416667 
width=64)
   Group Key: ((k % '10'::numeric))
   ->  Sort  (cost=65706.15..66747.81 rows=416667 width=56)
 Sort Key: ((k % '10'::numeric))
 ->  Parallel Seq Scan on tbl_tstzspan_1m  
(cost=0.00..12568.33 rows=416667 width=56)
(9 rows)

We are not able to make span_union aggregate support parallel aggregates

test=# EXPLAIN
SELECT k%10, span_union(t) AS t
FROM tbl_tstzspan_1M
group by k%10
order by k%10;
  QUERY PLAN
--
 GroupAggregate  (cost=187879.84..210379.84 rows=100 width=64)
   Group Key: ((k % '10'::numeric))
   ->  Sort  (cost=187879.84..190379.84 rows=100 width=56)
 Sort Key: ((k % '10'::numeric))
 ->  Seq Scan on tbl_tstzspan_1m  (cost=0.00..19860.00 rows=100 
width=56)

Any suggestion?

Thanks

Esteban




Re: A recent message added to pg_upgrade

2023-11-10 Thread Amit Kapila
On Fri, Nov 10, 2023 at 7:50 AM Michael Paquier  wrote:
>
> On Thu, Nov 09, 2023 at 04:52:32PM +0900, Michael Paquier wrote:
> > Thanks!
>
> Also, please see also a patch about switching the logirep launcher to
> rely on IsBinaryUpgrade to prevent its startup.  Any thoughts about
> that?
>

Preventing these
+ * processes from starting while upgrading avoids any activity on the new
+ * cluster before the physical files are put in place, which could cause
+ * corruption on the new cluster upgrading to.

I don't think this comment is correct because there won't be any apply
activity on the new cluster as after restoration subscriptions should
be disabled. On the old cluster, I think one problem is that the
origins may move forward after we copy them which can cause data
inconsistency issues. The other is that we may not prefer to generate
additional data and WAL during the upgrade. Also, I am not completely
sure about using the word 'corruption' in this context.

-- 
With Regards,
Amit Kapila.




Re: Add the ability to limit the amount of memory that can be allocated to backends.

2023-11-10 Thread jian he
hi.

+static void checkAllocations();
should be "static void checkAllocations(void);" ?

In PgStatShared_Memtrack there is a lock, but it seems not initialized and
not used. Can you expand on it?
In the view pg_stat_global_memory_tracking, is the column
"total_memory_reserved" a point-in-time value of the total memory the whole
server has reserved/malloced? Will it change every time you call it?
The function pg_stat_get_global_memory_tracking has provolatile => 's';
should it be a VOLATILE function?


pg_stat_get_memory_reservation, pg_stat_get_global_memory_tracking
should be proretset => 'f'.
+{ oid => '9891',
+  descr => 'statistics: memory utilized by current backend',
+  proname => 'pg_get_backend_memory_allocation', prorows => '1',
proisstrict => 'f',
+  proretset => 't', provolatile => 's', proparallel => 'r',


You declared
+void pgstat_backend_memory_reservation_cb(void);
but there seems to be no definition.


This part is unnecessary since you already declared these in
src/include/catalog/pg_proc.dat?
+/* SQL Callable functions */
+extern Datum pg_stat_get_memory_reservation(PG_FUNCTION_ARGS);
+extern Datum pg_get_backend_memory_allocation(PG_FUNCTION_ARGS);
+extern Datum pg_stat_get_global_memory_tracking(PG_FUNCTION_ARGS);

The last sentence is just a plain link, no explanation. Something is missing?
 
+  Reports how much memory remains available to the server. If a
+  backend process attempts to allocate more memory than remains,
+  the process will fail with an out of memory error, resulting in
+  cancellation of the process's active query/transaction.
+  If memory is not being limited (ie. max_total_memory is zero or not set),
+  this column returns NULL.
+  .
+ 
+ 
+
+  
+   
+static_shared_memory bigint
+   
+  
+   Reports how much static shared memory (non-DSM shared memory)
is being used by
+   the server. Static shared memory is configured by the postmaster at
+   at server startup.
+   .
+  
+  




Re: Bug: RLS policy FOR SELECT is used to check new rows

2023-11-10 Thread Dean Rasheed
On Thu, 9 Nov 2023 at 18:55, Laurenz Albe  wrote:
>
> I think it can be useful to allow a user an UPDATE where the result
> does not satisfy the USING clause of the FOR SELECT policy.
>
> The idea that an UPDATE should only produce rows you can SELECT is not
> true today: if you run an UPDATE without a WHERE clause, you can
> create rows you cannot see.  The restriction is only on UPDATEs with
> a WHERE clause.  Weird, isn't it?
>

That's true, but only if the UPDATE also doesn't have a RETURNING
clause. What I find weird about your proposal is that it would allow
an UPDATE ... RETURNING command to return something that would be
visible just that once, but then subsequently disappear. That seems
like a cure that's worse than the original disease that kicked off
this discussion.

As mentioned by others, the intention was that RLS behave like WITH
CHECK OPTION on an updatable view, so that new rows can't just
disappear. There are, however, 2 differences between the way it
currently works for RLS, and an updatable view:

1). RLS only does this for UPDATE commands. INSERT commands *can*
insert new rows that aren't visible, and so disappear.

2). It can't be turned off. The WITH CHECK OPTION on an updatable view
is an option that the user can choose to turn on or off. That's not
possible with RLS.

In a green field, I would say that it would be better to fix (1), so
that INSERT and UPDATE are consistent. However, I fear that it may be
too late for that, because any such change would risk breaking
existing RLS policy setups in subtle ways.

It might be possible to change (2) though, by adding a new table-level
option (similar to a view's WITH CHECK OPTION) that enabled or
disabled the checking of new rows for that table, and whose default
matched the current behaviour.
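
If such an option were added, it might look something like this (entirely
hypothetical syntax, shown only to make the proposal concrete; no such
reloption exists today):

ALTER TABLE accounts SET (rls_check_updated_rows = off);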

Before going too far down that route though, it is perhaps worth
asking whether this is something users really want. Is there a real
use-case for being able to UPDATE rows and have them disappear?

Regards,
Dean




Re: POC, WIP: OR-clause support for indexes

2023-11-10 Thread Alena Rybakina

On 06.11.2023 16:51, Alena Rybakina wrote:
I also support this approach. I have almost finished writing a patch 
that fixes the first problem related to the quadratic complexity of 
processing expressions by adding a hash table.


I also added a check: if the number of groups is equal to the number 
of OR expressions, we assume that no expressions need to be converted 
and interrupt further execution.


Now I am trying to fix the last problem in this patch: three tests 
have indicated a problem related to incorrect conversion. I don't 
think it can be serious, but I haven't figured out where the mistake 
is yet.


I got a log like this: ERROR:  unrecognized node type: 0.


I fixed this issue and added some cosmetic refactoring.

The changes are presented in the or_patch_changes.diff file.
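
For quick reference, the behaviour the renamed GUC enables is the rewrite
exercised by the regression-test hunk at the end of the diff below:

SET enable_or_transformation = on;
EXPLAIN (COSTS OFF)
SELECT * FROM tenk1
  WHERE thousand = 42 AND (tenthous = 1 OR tenthous = 3 OR tenthous = 42);
-- planned as if the OR chain were written: tenthous IN (1, 3, 42)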

--
Regards,
Alena Rybakina
Postgres Professional
diff --git a/src/backend/parser/parse_expr.c b/src/backend/parser/parse_expr.c
index 25a4235dbd9..46212a77c64 100644
--- a/src/backend/parser/parse_expr.c
+++ b/src/backend/parser/parse_expr.c
@@ -44,7 +44,7 @@
 
 /* GUC parameters */
 bool		Transform_null_equals = false;
-bool		or_transform_limit = false;
+bool		enable_or_transformation = false;
 
 
 static Node *transformExprRecurse(ParseState *pstate, Node *expr);
@@ -108,7 +108,7 @@ typedef struct OrClauseGroupEntry
 	List		   *consts;
 	Oid				scalar_type;
 	Oid				opno;
-	Expr 		   *expr;
+	Node 		   *expr;
 } OrClauseGroupEntry;
 
 static int
@@ -189,16 +189,16 @@ transformBoolExprOr(ParseState *pstate, BoolExpr *expr_orig)
 	  HASH_ELEM | HASH_FUNCTION | HASH_COMPARE);
 
 	/* If this is not an 'OR' expression, skip the transformation */
-	if (expr_orig->boolop != OR_EXPR || !or_transform_limit || len_ors == 1 || !or_group_htab)
+	if (expr_orig->boolop != OR_EXPR || !enable_or_transformation || len_ors == 1 || !or_group_htab)
 		return transformBoolExpr(pstate, (BoolExpr *) expr_orig);
 
 	foreach(lc, expr_orig->args)
 	{
 		Node			   *arg = lfirst(lc);
-		Node			   *orqual;
-		Node			   *const_expr;
-		Node			   *nconst_expr;
-		OrClauseGroupEntry *gentry;
+		Node			   *orqual = NULL;
+		Node			   *const_expr = NULL;
+		Node			   *nconst_expr = NULL;
+		OrClauseGroupEntry *gentry = NULL;
 		bool			found;
 		char		   	   *str;
 
@@ -270,7 +270,7 @@ transformBoolExprOr(ParseState *pstate, BoolExpr *expr_orig)
 		gentry = hash_search(or_group_htab, &str, HASH_ENTER, &found);
 		gentry->node = nconst_expr;
 		gentry->consts = list_make1(const_expr);
-		gentry->expr = (Expr *) orqual;
+		gentry->expr = orqual;
 		gentry->hash_leftvar_key = str;
 	}
 
@@ -283,7 +283,6 @@ transformBoolExprOr(ParseState *pstate, BoolExpr *expr_orig)
 		* transformed bool expression.
 		*/
 		hash_destroy(or_group_htab);
-		list_free(or_list);
 		return (Node *) makeBoolExpr(OR_EXPR, or_list, expr_orig->location);
 	}
 	else
@@ -345,9 +344,9 @@ transformBoolExprOr(ParseState *pstate, BoolExpr *expr_orig)
  * OK: coerce all the right-hand non-Var inputs to the common
  * type and build an ArrayExpr for them.
  */
-List	   *aexprs;
-ArrayExpr  *newa;
-ScalarArrayOpExpr *saopexpr;
+List	   *aexprs = NIL;
+ArrayExpr  *newa = NULL;
+ScalarArrayOpExpr *saopexpr = NULL;
 ListCell *l;
 
 aexprs = NIL;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 54fd09abde7..3411f023df8 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -1049,12 +1049,12 @@ struct config_bool ConfigureNamesBool[] =
 		NULL, NULL, NULL
 	},
 	{
-		{"or_transform_limit", PGC_USERSET, QUERY_TUNING_OTHER,
+		{"enable_or_transformation", PGC_USERSET, QUERY_TUNING_OTHER,
 			gettext_noop("Transform a sequence of OR clauses to an IN expression."),
 			gettext_noop("The planner will replace clauses like 'x=c1 OR x=c2 .."
 		 "to the clause 'x IN (c1,c2,...)'")
 		},
-		&or_transform_limit,
+		&enable_or_transformation,
 		false,
 		NULL, NULL, NULL
 	},
diff --git a/src/include/parser/parse_expr.h b/src/include/parser/parse_expr.h
index 7a6943c116c..3a87de02859 100644
--- a/src/include/parser/parse_expr.h
+++ b/src/include/parser/parse_expr.h
@@ -17,7 +17,7 @@
 
 /* GUC parameters */
 extern PGDLLIMPORT bool Transform_null_equals;
-extern PGDLLIMPORT bool or_transform_limit;
+extern PGDLLIMPORT bool enable_or_transformation;
 
 extern Node *transformExpr(ParseState *pstate, Node *expr, ParseExprKind exprKind);
 
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index 29c2bc6a2b2..dbc8bc3bed0 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -1883,7 +1883,7 @@ SELECT count(*) FROM tenk1
 10
 (1 row)
 
-SET or_transform_limit = on;
+SET enable_or_transformation = on;
 EXPLAIN (COSTS OFF)
 SELECT * FROM tenk1
   WHERE thousand = 42 AND (tenthous = 1 OR tenthous = 3 OR tenthous = 42);
@@ -1997,7 +1997,7 @@ SELECT count(*) FROM tenk1
   

Re: ResourceOwner refactoring

2023-11-10 Thread Alexander Lakhin

Hello Heikki,

09.11.2023 02:48, Heikki Linnakangas wrote:


Thanks for the testing! Fixed. ...


Thank you for the fix!

Please look at one more failure caused by the new implementation of
ResourceOwners:
numdbs=80
for ((i=1;i<=10;i++)); do
echo "ITERATION $i"

for ((d=1;d<=$numdbs;d++)); do createdb db$d; done

for ((d=1;d<=$numdbs;d++)); do
echo "
create table t(t1 text);
drop table t;
" | psql -d db$d >psql-$d.log 2>&1 &
done
wait
grep 'PANIC' server.log  && break;

for ((d=1;d<=$numdbs;d++)); do dropdb db$d; done
grep 'PANIC' server.log  && break;
done

I could see two failure modes:
2023-11-10 08:42:28.870 UTC [1163274] ERROR:  ResourceOwnerEnlarge called after 
release started
2023-11-10 08:42:28.870 UTC [1163274] STATEMENT:  drop table t;
2023-11-10 08:42:28.870 UTC [1163274] WARNING:  AbortTransaction while in 
COMMIT state
2023-11-10 08:42:28.870 UTC [1163274] PANIC:  cannot abort transaction 906, it 
was already committed

2023-11-10 08:43:27.897 UTC [1164148] ERROR:  ResourceOwnerEnlarge called after 
release started
2023-11-10 08:43:27.897 UTC [1164148] STATEMENT:  DROP DATABASE db69;
2023-11-10 08:43:27.897 UTC [1164148] WARNING:  AbortTransaction while in 
COMMIT state
2023-11-10 08:43:27.897 UTC [1164148] PANIC:  cannot abort transaction 1043, it 
was already committed

The stack trace for the second ERROR (ResourceOwnerEnlarge called ...) is:
...
#6  0x558af5b2f35c in ResourceOwnerEnlarge (owner=0x558af716f3c8) at 
resowner.c:455
#7  0x558af5888f18 in dsm_create_descriptor () at dsm.c:1207
#8  0x558af5889205 in dsm_attach (h=3172038420) at dsm.c:697
#9  0x558af5b1ebed in get_segment_by_index (area=0x558af711da18, index=2) 
at dsa.c:1764
#10 0x558af5b1ea4b in dsa_get_address (area=0x558af711da18, 
dp=2199023329568) at dsa.c:970
#11 0x558af5669366 in dshash_seq_next (status=0x7ffdd5912fd0) at 
dshash.c:687
#12 0x558af5901998 in pgstat_drop_database_and_contents (dboid=16444) at 
pgstat_shmem.c:830
#13 0x558af59016f0 in pgstat_drop_entry (kind=PGSTAT_KIND_DATABASE, 
dboid=16444, objoid=0) at pgstat_shmem.c:888
#14 0x558af59044eb in AtEOXact_PgStat_DroppedStats 
(xact_state=0x558af7111ee0, isCommit=true) at pgstat_xact.c:88
#15 0x558af59043c7 in AtEOXact_PgStat (isCommit=true, parallel=false) at 
pgstat_xact.c:55
#16 0x558af53c782e in CommitTransaction () at xact.c:2371
#17 0x558af53c709e in CommitTransactionCommand () at xact.c:306
...

Best regards,
Alexander




Re: Move bki file pre-processing from initdb to bootstrap

2023-11-10 Thread Peter Eisentraut

On 17.10.23 03:32, Krishnakumar R wrote:

The version comparison has been moved from initdb to bootstrap. This
created some compatibility problems with windows tests. For now I kept
the version check to not have \n added, which worked fine and serves
the purpose. However hoping to have something better in v3 in addition
to addressing any other comments.


With help from Thomas, I figured out that on Windows fopen uses binary
mode in the backend, which causes issues with EOL. Please find the
attached patch updated with a fix for this.


I suggest that this patch set be split up into three incremental parts:

1. Move some build-time settings from initdb to postgres.bki.
2. The database locale handling.
3. The bki file handling.

Each of these topics really needs a separate detailed consideration.





Re: Move bki file pre-processing from initdb to bootstrap

2023-11-10 Thread Peter Eisentraut

On 06.10.23 02:24, Krishnakumar R wrote:

elog(INFO, "Open bki file %s\n", bki_file);
+   boot_yyin = fopen(bki_file, "r");

Why is this needed?  It already reads the bki file from stdin?

We no longer open the bki file in initdb and pass it to postgres to parse
from stdin; instead we open the bki file directly in bootstrap and
pass the file stream to the parser. Hence the need to switch yyin.
I have added a comment in the commit logs to capture this.


Why this change?  I mean, there is nothing wrong with it, but I don't 
follow how changing from reading from stdin to reading from a named file 
is related to moving the parameter substitution from initdb to the backend.


One effect of this is that we would now have two different ways initdb 
interacts with the backend.  In bootstrap mode, it reads from a named 
file, and the second run (the one that loads the system views etc.) 
reads from stdin.  It's already confusing enough, so any further 
divergence should be adequately explained.






Re: Synchronizing slots from primary to standby

2023-11-10 Thread Drouvot, Bertrand

Hi,

On 11/10/23 8:55 AM, Amit Kapila wrote:

On Fri, Nov 10, 2023 at 12:50 PM Drouvot, Bertrand
 wrote:


But even if we ERROR out instead of emitting a WARNING, the user would still
need to be notified of / monitor such errors. I agree that then they would
probably come to know earlier, because the slot sync mechanism would be
stopped, but still it is not "guaranteed" (especially if there are no other
"working" synced slots around).




And if they do not, then there is still a risk to use this slot after a
failover thinking this is a "synced" slot.



I think this is another reason that giving an ERROR probably has a better
chance of the user noticing before failover. If, knowing of such errors, the
user still proceeds with the failover, the onus is on her.


Agree. My concern is more when they don't know about the error.


We can
probably document this hazard along with the failover feature so that
users are aware that they either need to be careful while creating
slots on standby or consult ERROR logs. I guess we can even make it
visible in the view also.


Yeah.


Giving it more thought, what about using a dedicated/reserved naming
convention for synced slots, like a sync_ prefix or such, and then:

- prevent users from creating sync_<slot> slots on the standby
- sync <slot> on the primary to sync_<slot> on the standby
- during failover, rename sync_<slot> to <slot>; if <slot> exists, then
emit a WARNING and keep sync_<slot> in place.

That way both slots are still in place (the manually created <slot> and
the sync_<slot> one).

Hmm, I think after failover, users need to rename all slots or we need
to provide a way to rename them so that they can be used by
subscribers which sounds like much more work.


Agree that's much more work for the subscriber case. Maybe that's not worth
the extra work.


Also, the current coding doesn't ensure
we will always give WARNING. If we see the below code that deals with
this WARNING,

+  /* User created slot with the same name exists, emit WARNING. */
+  else if (found && s->data.sync_state == SYNCSLOT_STATE_NONE)
+  {
+ereport(WARNING,
+errmsg("not synchronizing slot %s; it is a user created slot",
+     remote_slot->name));
+  }
+  /* Otherwise create the slot first. */
+  else
+  {
+TransactionId xmin_horizon = InvalidTransactionId;
+ReplicationSlot *slot;
+
+ReplicationSlotCreate(remote_slot->name, true, RS_EPHEMERAL,
+    remote_slot->two_phase, false);

I think this is not a solid check to ensure that the slot existed
before. Because it could be created as soon as the slot sync worker
invokes ReplicationSlotCreate() here.


Agree.



So, having a concrete check to give WARNING would require some more
logic which I don't think is a good idea to handle this boundary case.



Yeah, good point; agreed to just error out in all cases then (if we discard
the sync_ reserved naming proposal, which seems to be the case, as it's
probably not worth the extra work).

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




RE: MinGW compiler warnings in ecpg tests

2023-11-10 Thread Hayato Kuroda (Fujitsu)
Dear Peter, Michael,

Sorry for reviving the old thread. While trying to build postgres on msys2 with
meson, I hit the same warning. The OS is Windows 10.
I faced the same warning. The OS is Windows 10.

```
$ ninja
[2378/2402] Compiling C object src/interfaces/ecpg/test/sql/sqlda.exe.p/meson-generated_.._sqlda.c.obj
../postgres/src/interfaces/ecpg/test/sql/sqlda.pgc: In function 'dump_sqlda':
../postgres/src/interfaces/ecpg/test/sql/sqlda.pgc:45:33: warning: format '%d' expects argument of type 'int', but argument 3 has type 'long long int' [-Wformat=]
   45 |                 "name sqlda descriptor: '%s' value %I64d\n",
      |                                                    ^~~
..
   49 |                 sqlda->sqlvar[i].sqlname.data, *(long long int *)sqlda->sqlvar[i].sqldata);
      |                                                ~~
      |                                                |
      |                                                long long int
```


Before building, I did below steps:

1. Installed required software listed in [1].
2. ran `meson setup -Dcassert=true -Ddebug=true /path/to/builddir`
3. moved to /path/to/builddir
4. ran `ninja`
5. got above warning

The attached file summarizes the result of the meson command, which is output
at the end of it.
Also, belows show the version of meson/ninja.

```
$ ninja --version
1.11.1
$ meson -v
1.2.3
```

I am not quite sure about the Windows build, but I could see that the gcc
compiler was used here. Does that mean the compiler might not like the format
string "%I64d"?
I modified it as below and it compiled without warnings.

```
--- a/src/interfaces/ecpg/test/sql/sqlda.pgc
+++ b/src/interfaces/ecpg/test/sql/sqlda.pgc
@@ -41,7 +41,7 @@ dump_sqlda(sqlda_t *sqlda)
break;
case ECPGt_long_long:
printf(
-#ifdef _WIN32
+#if !defined(__GNUC__)
"name sqlda descriptor: '%s' value %I64d\n",
 #else
"name sqlda descriptor: '%s' value %lld\n",

```
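
An alternative that would sidestep the compiler check entirely is the
<inttypes.h> macros; a sketch (not what the test currently does, and the ecpg
tests may have reasons to avoid it):

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

static void
dump_long_long(const char *name, long long value)
{
	/* PRId64 expands to the right conversion specifier on every
	 * platform, so no _WIN32 / __GNUC__ branching is needed. */
	printf("name sqlda descriptor: '%s' value %" PRId64 "\n",
		   name, (int64_t) value);
}
```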

[1]: 
https://www.postgresql.org/message-id/9f4f22be-f9f1-b350-bc06-521226b87f7a%40dunslane.net

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

postgresql 17devel

  Data layout
data block size: 8 kB
WAL block size : 8 kB
segment size   : 1 GB

  System
host system: windows x86_64
build system   : windows x86_64

  Compiler
linker : ld.bfd
C compiler : gcc 13.2.0

  Compiler Flags
CPP FLAGS  :
C FLAGS, functional: -fno-strict-aliasing -fwrapv -fexcess-precision=standard
C FLAGS, warnings  : -Wmissing-prototypes -Wpointer-arith -Werror=vla -Wendif-labels -Wmissing-format-attribute -Wimplicit-fallthrough=3 -Wcast-function-type -Wshadow=compatible-local -Wformat-security -Wdeclaration-after-statement -Wno-format-truncation -Wno-stringop-truncation
C FLAGS, modules   : -fvisibility=hidden
C FLAGS, user specified:
LD FLAGS   : -Wl,--stack,4194304 -Wl,--allow-multiple-definition -Wl,--disable-auto-import

  Programs
bison  : C:\msys64\usr\bin/bison.EXE 3.8.2
dtrace : NO
flex   : C:\msys64\usr\bin/flex.EXE 2.6.4

  External libraries
bonjour: NO
bsd_auth   : NO
docs   : YES
docs_pdf   : NO
gss: NO
icu: NO
ldap   : YES
libxml : NO
libxslt: NO
llvm   : NO
lz4: NO
nls: YES
openssl: YES 3.1.4
pam: NO
plperl : NO
plpython   : NO
pltcl  : YES 8.6.12
readline   : YES
selinux: NO
systemd: NO
uuid   : NO
zlib   : YES 1.3
zstd   : YES 1.5.5

  User defined options
debug  : true
cassert: true

Found ninja-1.11.1 at C:\msys64\mingw64\bin/ninja.EXE