Re: UUID v7

2024-03-11 Thread Michael Paquier
On Mon, Mar 11, 2024 at 11:27:43PM +0500, Andrey M. Borodin wrote:
> Sorry for this long and vague explanation, if it still seems too
> uncertain we can have a chat or something like that. I don't think
> this number picking stuff deserve to be commented, because it still
> is quite close to random. RFC gives us too much freedom of choice.

Speaking about the RFC, I can see that there is a draft but nothing
formal yet.  The last one I can see is v14 from last November:
https://datatracker.ietf.org/doc/html/draft-ietf-uuidrev-rfc4122bis-14

It does not strike me as a good idea to rush an implementation without
a specification officially approved because there is always a risk of
shipping something that's non-compliant into core.  But perhaps I am
missing something on the RFC side?
--
Michael


signature.asc
Description: PGP signature


Re: Spurious pgstat_drop_replslot() call

2024-03-11 Thread Michael Paquier
On Mon, Mar 11, 2024 at 04:15:40PM +0900, Michael Paquier wrote:
> That's a slight change in behavior, unfortunately, and it cannot be
> called a bug as this improves the visibility of the stats after an
> invalidation, so this is not something that can be backpatched.

I've looked again at that and that was OK on a second look.  May I
suggest the attached additions with LWLockHeldByMeInMode to make sure
that the stats are dropped and created while we hold
ReplicationSlotAllocationLock?
--
Michael
diff --git a/src/backend/utils/activity/pgstat_replslot.c b/src/backend/utils/activity/pgstat_replslot.c
index c61bad1627..889e86ac5a 100644
--- a/src/backend/utils/activity/pgstat_replslot.c
+++ b/src/backend/utils/activity/pgstat_replslot.c
@@ -113,6 +113,8 @@ pgstat_create_replslot(ReplicationSlot *slot)
 	PgStat_EntryRef *entry_ref;
 	PgStatShared_ReplSlot *shstatent;
 
+	Assert(LWLockHeldByMeInMode(ReplicationSlotAllocationLock, LW_EXCLUSIVE));
+
 	entry_ref = pgstat_get_entry_ref_locked(PGSTAT_KIND_REPLSLOT, InvalidOid,
 			ReplicationSlotIndex(slot), false);
 	shstatent = (PgStatShared_ReplSlot *) entry_ref->shared_stats;
@@ -153,6 +155,8 @@ pgstat_acquire_replslot(ReplicationSlot *slot)
 void
 pgstat_drop_replslot(ReplicationSlot *slot)
 {
+	Assert(LWLockHeldByMeInMode(ReplicationSlotAllocationLock, LW_EXCLUSIVE));
+
 	pgstat_drop_entry(PGSTAT_KIND_REPLSLOT, InvalidOid,
 	  ReplicationSlotIndex(slot));
 }


signature.asc
Description: PGP signature


Re: Regardign RecentFlushPtr in WalSndWaitForWal()

2024-03-11 Thread Amit Kapila
On Mon, Mar 11, 2024 at 4:36 PM Amit Kapila  wrote:
>
> On Mon, Mar 11, 2024 at 4:17 PM shveta malik  wrote:
> >
> >
> > Please find the patch attached for the same.
> >
>
> LGTM. I'll push this tomorrow unless I see any comments/objections to
> this change.
>

Pushed.

-- 
With Regards,
Amit Kapila.




Re: Improve eviction algorithm in ReorderBuffer

2024-03-11 Thread Masahiko Sawada
On Fri, Mar 8, 2024 at 12:58 PM Peter Smith  wrote:
>
> On Thu, Mar 7, 2024 at 2:16 PM Masahiko Sawada  wrote:
> >
> > On Tue, Mar 5, 2024 at 3:28 PM Peter Smith  wrote:
> > >
>
> > > 4a.
> > > The comment in simplehash.h says
> > >  *   The following parameters are only relevant when SH_DEFINE is defined:
> > >  *   - SH_KEY - ...
> > >  *   - SH_EQUAL(table, a, b) - ...
> > >  *   - SH_HASH_KEY(table, key) - ...
> > >  *   - SH_STORE_HASH - ...
> > >  *   - SH_GET_HASH(tb, a) - ...
> > >
> > > So maybe it is nicer to reorder the #defines in that same order?
> > >
> > > SUGGESTION:
> > > +#define SH_PREFIX bh_nodeidx
> > > +#define SH_ELEMENT_TYPE bh_nodeidx_entry
> > > +#define SH_KEY_TYPE bh_node_type
> > > +#define SH_SCOPE extern
> > > +#ifdef FRONTEND
> > > +#define SH_RAW_ALLOCATOR pg_malloc0
> > > +#endif
> > >
> > > +#define SH_DEFINE
> > > +#define SH_KEY key
> > > +#define SH_EQUAL(tb, a, b) (memcmp(&a, &b, sizeof(bh_node_type)) == 0)
> > > +#define SH_HASH_KEY(tb, key) \
> > > + hash_bytes((const unsigned char *) &key, sizeof(bh_node_type))
> > > +#include "lib/simplehash.h"
> >
> > I'm really not sure it helps increase readability. For instance, to
> > me it's more readable if SH_DEFINE and SH_DECLARE come last, just
> > before the #include, since that makes it more obvious whether we want
> > to declare, define, or both. Other simplehash.h users also do so.
> >
>
> OK.
>
> > > 5.
> > > + *
> > > + * If 'indexed' is true, we create a hash table to track of each node's
> > > + * index in the heap, enabling to perform some operations such as 
> > > removing
> > > + * the node from the heap.
> > >   */
> > >  binaryheap *
> > > -binaryheap_allocate(int capacity, binaryheap_comparator compare, void 
> > > *arg)
> > > +binaryheap_allocate(int capacity, binaryheap_comparator compare,
> > > + bool indexed, void *arg)
> > >
> > > BEFORE
> > > ... enabling to perform some operations such as removing the node from 
> > > the heap.
> > >
> > > SUGGESTION
> > > ... to help make operations such as removing nodes more efficient.
> > >
> >
> > But these operations literally require the indexed binary heap as we
> > have an assertion:
> >
> > void
> > binaryheap_remove_node_ptr(binaryheap *heap, bh_node_type d)
> > {
> > bh_nodeidx_entry *ent;
> >
> > Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
> > Assert(heap->bh_indexed);
> >
>
> I didn’t quite understand -- the operations mentioned are "operations
> such as removing the node", but binaryheap_remove_node() also removes
> a node from the heap. So I still felt the comment wording of the patch
> is not quite correct.

Now I understand your point. That's a valid point.

>
> Now, if the removal of a node from an indexed heap can *only* be done
> using binaryheap_remove_node_ptr() then:
> - the other removal functions (binaryheap_remove_*) probably need some
> comments to make sure nobody is tempted to call them directly for an
> indexed heap.
> - maybe some refactoring and assertions are needed to ensure those
> *cannot* be called directly for an indexed heap.
>

If 'indexed' is true, the caller can use not only the existing
functions but also the newly added ones such as
binaryheap_remove_node_ptr() and binaryheap_update_up(). How about
something like below?

 * If 'indexed' is true, we create a hash table to track each node's
 * index in the heap, enabling to perform some operations such as
 * binaryheap_remove_node_ptr() etc.
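
For concreteness, a rough usage sketch assembled only from the
signatures quoted in this thread (my_compare and node are placeholder
names):

    binaryheap *heap;

    /* indexed = true additionally builds the node-index hash table */
    heap = binaryheap_allocate(16, my_compare, true, NULL);
    binaryheap_add(heap, PointerGetDatum(node));

    /* these position-aware operations require the indexed variant */
    binaryheap_update_up(heap, PointerGetDatum(node));
    binaryheap_remove_node_ptr(heap, PointerGetDatum(node));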

>
> And, here are some review comments for v8-0002.
>
> ==
> 1. delete_nodeidx
>
> +/*
> + * Remove the node's index from the hash table if the heap is indexed.
> + */
> +static bool
> +delete_nodeidx(binaryheap *heap, bh_node_type node)
> +{
> + if (!binaryheap_indexed(heap))
> + return false;
> +
> + return bh_nodeidx_delete(heap->bh_nodeidx, node);
> +}
>
> 1a.
> In v8 this function was changed to now return bool, so, I think the
> function comment should explain the meaning of that return value.
>
> ~
>
> 1b.
> I felt the function body is better expressed positively: "If this then
> do that", instead of "If not this then do nothing otherwise do that"
>
> SUGGESTION
> if (binaryheap_indexed(heap))
>   return bh_nodeidx_delete(heap->bh_nodeidx, node);
>
> return false;
>
> ~~~
>
> 2.
> +static void
> +replace_node(binaryheap *heap, int index, bh_node_type new_node)
> +{
> + bool found PG_USED_FOR_ASSERTS_ONLY;
> +
> + /* Quick return if not necessary to move */
> + if (heap->bh_nodes[index] == new_node)
> + return;
> +
> + /*
> + * Remove overwritten node's index. The overwritten node's position must
> + * have been tracked, if enabled.
> + */
> + found = delete_nodeidx(heap, heap->bh_nodes[index]);
> + Assert(!binaryheap_indexed(heap) || found);
> +
> + /*
> + * Replace it with the given new node. This node's position must also be
> + * tracked as we assume to replace the node by the existing node.
> + */
> + found = set_node(heap, new_node, index);
> + Assert(!binaryheap_indexed(heap) || found);
> +}
>
> 2a.
> 

Re: [PROPOSAL] Skip test citext_utf8 on Windows

2024-03-11 Thread Oleg Tselebrovskiy

Michael Paquier wrote on 2024-03-12 06:24:

On Mon, Mar 11, 2024 at 03:21:11PM +0700, Oleg Tselebrovskiy wrote:

The proposed patch for skipping test is attached


Your attached patch seems to be in binary format.
--
Michael

Right, I had it saved in a non-UTF-8 encoding. Kind of ironic.

Here's a fixed version.

diff --git a/contrib/citext/expected/citext_utf8.out b/contrib/citext/expected/citext_utf8.out
index 5d988dcd485..6c4069f9469 100644
--- a/contrib/citext/expected/citext_utf8.out
+++ b/contrib/citext/expected/citext_utf8.out
@@ -10,7 +10,8 @@
 SELECT getdatabaseencoding() <> 'UTF8' OR
(SELECT (datlocprovider = 'c' AND datctype = 'C') OR datlocprovider = 'i'
 FROM pg_database
-WHERE datname=current_database())
+WHERE datname=current_database()) OR
+	   (version() ~ 'windows' OR version() ~ 'Visual C\+\+' OR version() ~ 'mingw32')
AS skip_test \gset
 \if :skip_test
 \quit
diff --git a/contrib/citext/expected/citext_utf8_1.out b/contrib/citext/expected/citext_utf8_1.out
index 7065a5da190..d4472b1c36a 100644
--- a/contrib/citext/expected/citext_utf8_1.out
+++ b/contrib/citext/expected/citext_utf8_1.out
@@ -10,7 +10,8 @@
 SELECT getdatabaseencoding() <> 'UTF8' OR
(SELECT (datlocprovider = 'c' AND datctype = 'C') OR datlocprovider = 'i'
 FROM pg_database
-WHERE datname=current_database())
+WHERE datname=current_database()) OR
+	   (version() ~ 'windows' OR version() ~ 'Visual C\+\+' OR version() ~ 'mingw32')
AS skip_test \gset
 \if :skip_test
 \quit
diff --git a/contrib/citext/sql/citext_utf8.sql b/contrib/citext/sql/citext_utf8.sql
index 34b232d64e2..53775cdcd35 100644
--- a/contrib/citext/sql/citext_utf8.sql
+++ b/contrib/citext/sql/citext_utf8.sql
@@ -11,7 +11,8 @@
 SELECT getdatabaseencoding() <> 'UTF8' OR
(SELECT (datlocprovider = 'c' AND datctype = 'C') OR datlocprovider = 'i'
 FROM pg_database
-WHERE datname=current_database())
+WHERE datname=current_database()) OR
+	   (version() ~ 'windows' OR version() ~ 'Visual C\+\+' OR version() ~ 'mingw32')
AS skip_test \gset
 \if :skip_test
 \quit


Re: Race condition in FetchTableStates() breaks synchronization of subscription tables

2024-03-11 Thread Ajin Cherian
On Tue, Mar 12, 2024 at 2:59 PM vignesh C  wrote:

>
> Thanks, I have created the following Commitfest entry for this:
> https://commitfest.postgresql.org/47/4816/
>
> Regards,
> Vignesh
>

Thanks for the patch, I have verified that the fix works well by following
the steps mentioned to reproduce the problem.
Reviewing the patch, it seems good and is well documented. Just one minor
comment: perhaps change the name of the variable
table_states_valid to table_states_validity. The current name made sense
when it was a bool, but now that it is a tri-state enum, it doesn't fit
well.

regards,
Ajin Cherian
Fujitsu Australia


Re: [PATCH] LockAcquireExtended improvement

2024-03-11 Thread Jingxian Li
Hello Robert,
On 2024/3/8 1:02, Robert Haas wrote:
>
> But instead of just complaining, I decided to try writing a version of
> the patch that seemed acceptable to me. Here it is. I took a different
> approach than you. Instead of splitting up ProcSleep(), I just passed
> down the dontWait flag to WaitOnLock() and ProcSleep(). In
> LockAcquireExtended(), I moved the existing code that handles giving
> up in the don't-wait case from before the call to ProcSleep() to
> afterward. As far as I can see, the major way this could be wrong is
> if calling ProcSleep() with dontWait = true and having it fail to
> acquire the lock changes the state in some way that makes the cleanup
> code that I moved incorrect. I don't *think* that's the case, but I
> might be wrong.
>
> What do you think of this version?

Your version changes less code than mine by pushing the nowait flag down
into ProcSleep(). This looks fine in general, except for one suggestion:
I don't think it is necessary to add the 'waiting' suffix to the
process title in WaitOnLock() when dontWait is true, as follows:

--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -1801,8 +1801,12 @@ WaitOnLock(LOCALLOCK *locallock, ResourceOwner owner, 
bool dontWait)
LOCK_PRINT("WaitOnLock: sleeping on lock",
   locallock->lock, locallock->tag.mode);
 
-   /* adjust the process title to indicate that it's waiting */
-   set_ps_display_suffix("waiting");
+   if (!dontWait)
+   {
+   /* adjust the process title to indicate that it's waiting */
+   set_ps_display_suffix("waiting");   
+   }
+
 
awaitedLock = locallock;
awaitedOwner = owner;
@@ -1855,9 +1859,12 @@ WaitOnLock(LOCALLOCK *locallock, ResourceOwner owner, 
bool dontWait)
{
/* In this path, awaitedLock remains set until LockErrorCleanup 
*/
 
-   /* reset ps display to remove the suffix */
-   set_ps_display_remove_suffix();
-
+   if (!dontWait)
+   {
+   /* reset ps display to remove the suffix */
+   set_ps_display_remove_suffix();
+   }
+   
/* and propagate the error */
PG_RE_THROW();
}
@@ -1865,8 +1872,11 @@ WaitOnLock(LOCALLOCK *locallock, ResourceOwner owner, 
bool dontWait)
 
awaitedLock = NULL;
 
-   /* reset ps display to remove the suffix */
-   set_ps_display_remove_suffix();
+   if (!dontWait)
+   {
+   /* reset ps display to remove the suffix */
+   set_ps_display_remove_suffix();
+   }
 
LOCK_PRINT("WaitOnLock: wakeup on lock",
   locallock->lock, locallock->tag.mode);


--
Jingxian Li


Re: [PROPOSAL] Skip test citext_utf8 on Windows

2024-03-11 Thread Thomas Munro
On Tue, Mar 12, 2024 at 2:56 PM Andrew Dunstan  wrote:
> On 2024-03-11 Mo 04:21, Oleg Tselebrovskiy wrote:
> > Greetings, everyone!
> >
> > While running "installchecks" on databases with UTF-8 encoding the test
> > citext_utf8 fails because of Turkish dotted I like this:
> >
> >  SELECT 'i'::citext = 'İ'::citext AS t;
> >   t
> >  ---
> > - t
> > + f
> >  (1 row)
> >
> > I tried to replicate the test's results by hand, and with every collation
> > that I tried (including --locale="Turkish") this test failed.
> >
> > Also an interesting result of my testing: if you initialize your DB
> > with -E utf-8 --locale="Turkish" and then run select LOWER('İ');
> > the output will be this:
> >  lower
> > ---
> >  İ
> > (1 row)
> >
> > Which I find strange since lower() uses collation that was passed
> > (default in this case but still)
>
> Wouldn't we be better off finding a Windows fix for this, instead of
> sweeping it under the rug?

Given the sorry state of our Windows locale support, I've started
wondering about deleting it and telling users to adopt our nascent
built-in support or ICU[1].

This other thread [2] says the sorting is intransitive so I don't
think it really meets our needs anyway.

[1] 
https://www.postgresql.org/message-id/flat/CA%2BhUKGJhV__g_TJ0jVqPbnTuqT%2B%2BM6KFv2wj%2B9AV-cABNCXN6Q%40mail.gmail.com#bc35c0b88962ff8c24c27aecc1bca72e
[2] 
https://www.postgresql.org/message-id/flat/1407a2c0-062b-4e4c-b728-438fdff5cb07%40manitou-mail.org




Re: SQL:2011 application time

2024-03-11 Thread jian he
On Mon, Mar 11, 2024 at 3:46 PM Peter Eisentraut  wrote:
>
> A few general comments on the tests:
>
> - In the INSERT commands, specify the column names explicitly.  This
> makes the tests easier to read (especially since the column order
> between the PK and the FK table is sometimes different).
>
> - Let's try to make it so that the inserted literals match the values
> shown in the various error messages, so it's easier to match them up.
> So, change the int4range literals to half-open notation.  And also maybe
> change the date output format to ISO.
>
Maybe just change the tsrange type to daterange; then the .out file
will be far less verbose.

Minor issues while reviewing v27, 0001 to 0004:
the comments for transformFkeyGetPrimaryKey need updating,
since bool pk_period is also returned now.

+/*
+ * FindFKComparisonOperators -
+ *
+ * Gets the operators for pfeqopOut, ppeqopOut, and ffeqopOut.
+ * Sets old_check_ok if we can avoid re-validating the constraint.
+ * Sets old_pfeqop_item to the old pfeqop values.
+ */
+static void
+FindFKComparisonOperators(Constraint *fkconstraint,
+  AlteredTableInfo *tab,
+  int i,
+  int16 *fkattnum,
+  bool *old_check_ok,
+  ListCell **old_pfeqop_item,
+  Oid pktype, Oid fktype, Oid opclass,
+  Oid *pfeqopOut, Oid *ppeqopOut, Oid *ffeqopOut)

I think the above comment should read
`Sets the operators for pfeqopOut, ppeqopOut, and ffeqopOut.`


+ if (is_temporal)
+ {
+ if (!fkconstraint->fk_with_period)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_FOREIGN_KEY),
+ errmsg("foreign key uses PERIOD on the referenced table but not the
referencing table")));
+ }
can be
if (is_temporal && !fkconstraint->fk_with_period)
ereport(ERROR,
(errcode(ERRCODE_INVALID_FOREIGN_KEY),
errmsg("foreign key uses PERIOD on the referenced table but not the
referencing table")));

+
+ if (is_temporal)
+ {
+ if (!fkconstraint->pk_with_period)
+ /* Since we got pk_attrs, one should be a period. */
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_FOREIGN_KEY),
+ errmsg("foreign key uses PERIOD on the referencing table but not the
referenced table")));
+ }
can be
if (is_temporal && !fkconstraint->pk_with_period)
/* Since we got pk_attrs, one should be a period. */
ereport(ERROR,
(errcode(ERRCODE_INVALID_FOREIGN_KEY),
errmsg("foreign key uses PERIOD on the referencing table but not the
referenced table")));

Refactoring decompile_column_index_array seems unnecessary.
Peter already mentioned it at [1]; I have tried to fix it at [2].


@@ -12141,7 +12245,8 @@ transformFkeyGetPrimaryKey(Relation pkrel, Oid
*indexOid,
  /*
  * Now build the list of PK attributes from the indkey definition (we
- * assume a primary key cannot have expressional elements)
+ * assume a primary key cannot have expressional elements, unless it
+ * has a PERIOD)
  */
  *attnamelist = NIL;
  for (i = 0; i < indexStruct->indnkeyatts; i++)
@@ -12155,6 +12260,8 @@ transformFkeyGetPrimaryKey(Relation pkrel, Oid
*indexOid,
makeString(pstrdup(NameStr(*attnumAttName(pkrel, pkattno);
  }
+ *pk_period = (indexStruct->indisexclusion);

I don't understand the "expressional elements" in the comment; most of
the test cases are like
`
PRIMARY KEY (id, valid_at WITHOUT OVERLAPS)
`
+ *pk_period = (indexStruct->indisexclusion);
can be
`+ *pk_period = indexStruct->indisexclusion;`


[1] https://postgr.es/m/7be8724a-5c25-46d7-8325-1bd8be6fa...@eisentraut.org
[2] 
https://postgr.es/m/CACJufxHVg65raNhG2zBwXgjrD6jqace4NZbePyMhP8-_Q=i...@mail.gmail.com




Re: SQL:2011 application time

2024-03-11 Thread jian he
+ 
+  If the last column is marked with PERIOD,
+  it is treated in a special way.
+  While the non-PERIOD columns are treated normally
+  (and there must be at least one of them),
+  the PERIOD column is not compared for equality.
+  Instead the constraint is considered satisfied
+  if the referenced table has matching records
+  (based on the non-PERIOD parts of the key)
+  whose combined PERIOD values completely cover
+  the referencing record's.
+  In other words, the reference must have a referent for its
entire duration.
+  Normally this column would be a range or multirange type,
+  although any type whose GiST opclass has a "contained by" operator
+  and a referenced_agg support function is allowed.
+  (See .)
+  In addition the referenced table must have a primary key
+  or unique constraint declared with WITHOUT PORTION.
+ 

typo "referenced_agg", in the gist-extensibility.html page is "referencedagg"
WITHOUT PORTION should be WITHOUT OVERLAPS
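
For concreteness, a minimal sketch of the syntax this documentation
describes, using daterange as suggested above (table and column names
are illustrative):

CREATE TABLE referenced (
    id int,
    valid_at daterange,
    PRIMARY KEY (id, valid_at WITHOUT OVERLAPS)
);

CREATE TABLE referencing (
    id int,
    valid_at daterange,
    FOREIGN KEY (id, PERIOD valid_at)
        REFERENCES referenced (id, PERIOD valid_at)
);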

+  While the non-PERIOD columns are treated normally
+  (and there must be at least one of them),
+  the PERIOD column is not compared for equality.
the above sentence doesn't say what "normally" means.
maybe we can do the following:
+  While the non-PERIOD columns are treated
+ normally for equality
+  (and there must be at least one of them),
+  the PERIOD column is not compared for equality.



+
+Datum
+my_range_agg_transfn(PG_FUNCTION_ARGS)
+{
+MemoryContext   aggContext;
+Oid  rngtypoid;
+ArrayBuildState *state;
+
+if (!AggCheckCallContext(fcinfo, &aggContext))
+elog(ERROR, "range_agg_transfn called in non-aggregate context");
+
+rngtypoid = get_fn_expr_argtype(fcinfo->flinfo, 1);
+if (!type_is_range(rngtypoid))
+elog(ERROR, "range_agg must be called with a range");
+
+if (PG_ARGISNULL(0))
+state = initArrayResult(rngtypoid, aggContext, false);
+else
+state = (ArrayBuildState *) PG_GETARG_POINTER(0);
+
+/* skip NULLs */
+if (!PG_ARGISNULL(1))
+accumArrayResult(state, PG_GETARG_DATUM(1), false, rngtypoid,
aggContext);
+
+PG_RETURN_POINTER(state);
+}
+
+Datum
+my_range_agg_finalfn(PG_FUNCTION_ARGS)
+{
+MemoryContext   aggContext;
+Oid  mltrngtypoid;
+TypeCacheEntry  *typcache;
+ArrayBuildState *state;
+int32    range_count;
+RangeType  **ranges;
+int  i;
+
+if (!AggCheckCallContext(fcinfo, &aggContext))
+elog(ERROR, "range_agg_finalfn called in non-aggregate context");
+
+state = PG_ARGISNULL(0) ? NULL : (ArrayBuildState *) PG_GETARG_POINTER(0);
+if (state == NULL)
+/* This shouldn't be possible, but just in case */
+PG_RETURN_NULL();
+
+/* Also return NULL if we had zero inputs, like other aggregates */
+range_count = state->nelems;
+if (range_count == 0)
+PG_RETURN_NULL();
+
+mltrngtypoid = get_fn_expr_rettype(fcinfo->flinfo);
+typcache = multirange_get_typcache(fcinfo, mltrngtypoid);
+
+ranges = palloc0(range_count * sizeof(RangeType *));
+for (i = 0; i < range_count; i++)
+ranges[i] = DatumGetRangeTypeP(state->dvalues[i]);
+
+PG_RETURN_MULTIRANGE_P(make_multirange(mltrngtypoid,
+typcache->rngtype, range_count, ranges));
+}

The error messages in my_range_agg_transfn are inconsistent:
 `elog(ERROR, "range_agg_transfn called in non-aggregate context");`
`elog(ERROR, "range_agg must be called with a range");`
Maybe just say `my_range_agg_transfn` in both, instead of mentioning
{range_agg_transfn|range_agg}.
Similarly, the my_range_agg_finalfn error messages are also inconsistent.

Does my_range_agg_finalfn need a `type_is_multirange(mltrngtypoid)` check?




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-11 Thread Masahiko Sawada
On Mon, Mar 11, 2024 at 5:13 PM Masahiko Sawada  wrote:
>
> In the latest (v69) patch:
>
> - squashed v68-0005 and v68-0006 patches.
> - removed most of the changes in v68-0007 patch.
> - addressed above review comments in v69-0002 patch.
> - v69-0003, 0004, and 0005 are miscellaneous updates.

Since the v69 conflicts with the current HEAD, I've rebased them. In
addition, v70-0008 is the new patch, which cleans up the vacuum
integration patch.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


v70-ART.tar.gz
Description: GNU Zip compressed data


Re: [PROPOSAL] Skip test citext_utf8 on Windows

2024-03-11 Thread Andrew Dunstan



On 2024-03-11 Mo 04:21, Oleg Tselebrovskiy wrote:

Greetings, everyone!

While running "installchecks" on databases with UTF-8 encoding the test
citext_utf8 fails because of Turkish dotted I like this:

 SELECT 'i'::citext = 'İ'::citext AS t;
  t
 ---
- t
+ f
 (1 row)

I tried to replicate the test's results by hand, and with every collation
that I tried (including --locale="Turkish") this test failed.

Also an interesting result of my testing: if you initialize your DB
with -E utf-8 --locale="Turkish" and then run select LOWER('İ');
the output will be this:
 lower
---
 İ
(1 row)

Which I find strange, since lower() uses the collation that was passed
(default in this case, but still)




Wouldn't we be better off finding a Windows fix for this, instead of 
sweeping it under the rug?



cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Reports on obsolete Postgres versions

2024-03-11 Thread Nathan Bossart
On Mon, Mar 11, 2024 at 05:17:13PM -0400, Bruce Momjian wrote:
> On Mon, Mar 11, 2024 at 04:12:04PM -0500, Nathan Bossart wrote:
>> I've read that the use of the term "minor release" can be confusing.  While
>> the versioning page clearly describes what is eligible for a minor release,
>> not everyone reads it, so I suspect that many folks think there are new
>> features, etc. in minor releases.  I think a "minor release" of Postgres is
>> more similar to what other projects would call a "patch version."
> 
> Well, we do say:
> 
>   While upgrading will always contain some level of risk, PostgreSQL
>   minor releases fix only frequently-encountered bugs, security issues,
>   and data corruption problems to reduce the risk associated with
>   upgrading. For minor releases, the community considers not upgrading to
>   be riskier than upgrading. 
> 
> but that is far down the page.  Do we need to improve this?

I think making that note more visible would certainly be an improvement.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Popcount optimization using AVX512

2024-03-11 Thread Nathan Bossart
On Mon, Mar 11, 2024 at 09:59:53PM +, Amonson, Paul D wrote:
> I will be splitting the request into 2 patches. I am attaching the first
> patch (refactoring only) and I updated the commitfest entry to match this
> patch. I have a question however:
> Do I need to wait for the refactor patch to be merged before I post the
> AVX portion of this feature in this thread?

Thanks.  There's no need to wait to post the AVX portion.  I recommend
using "git format-patch" to construct the patch set for the lists.

>> Apologies for harping on this, but I'm still not seeing the need for these
>> SIZEOF_VOID_P changes.  While it's unlikely that this makes any practical
>> difference, I see no reason to more strictly check SIZEOF_VOID_P here.
> 
> I got rid of the second occurrence as I agree it is not needed, but unless
> you see something I don't, how do I know which function to call between a
> 32-bit and 64-bit architecture? Maybe I am missing something obvious?
> What exactly do you suggest here? I am happy to always call either
> pg_popcount32() or pg_popcount64() with the understanding that it may not
> be optimal, but I do need to know which to use.

I'm recommending that we don't change any of the code in the pg_popcount()
function (which is renamed to pg_popcount_slow() in your v6 patch).  If
pointers are 8 or more bytes, we'll try to process the buffer in 64-bit
chunks.  Else, we'll try to process it in 32-bit chunks.  Any remaining
bytes will be processed one-by-one.
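
Condensed, the shape described above is roughly the following sketch
(pg_popcount64/pg_popcount32 and pg_number_of_ones are the existing
helpers in src/port/pg_bitutils.c; this is an illustration, not the
committed code):

uint64
popcount_sketch(const char *buf, int bytes)
{
    uint64      popcnt = 0;

#if SIZEOF_VOID_P >= 8
    /* process aligned 64-bit chunks */
    if (buf == (const char *) TYPEALIGN(8, buf))
        for (; bytes >= 8; bytes -= 8, buf += 8)
            popcnt += pg_popcount64(*(const uint64 *) buf);
#else
    /* process aligned 32-bit chunks */
    if (buf == (const char *) TYPEALIGN(4, buf))
        for (; bytes >= 4; bytes -= 4, buf += 4)
            popcnt += pg_popcount32(*(const uint32 *) buf);
#endif

    /* any remaining bytes, one-by-one */
    while (bytes--)
        popcnt += pg_number_of_ones[(unsigned char) *buf++];

    return popcnt;
}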

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Improve readability by using designated initializers when possible

2024-03-11 Thread Japin Li


On Fri, 08 Mar 2024 at 13:21, Michael Paquier  wrote:
> On Wed, Mar 06, 2024 at 08:24:09AM +0800, Japin Li wrote:
>> On Wed, 06 Mar 2024 at 01:53, Jelte Fennema-Nio  wrote:
>>> I think if you remove the EEO_CASE(EEOP_LAST) block the warning should
>>> go away. That block is clearly marked as unreachable, so it doesn't
>>> really serve a purpose.
>>
>> Thanks! Fixed as you suggested.  Please see v3 patch.
>
> Hmm.  I am not sure if this one is a good idea.

Sorry for the late reply!

> This makes the code a
> bit more complicated to grasp under EEO_USE_COMPUTED_GOTO with the
> reverse dispatch table, to say the least.

I'm not sure I get your point.  Could you explain in more detail?
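
For context, the machinery in question, heavily simplified from
src/backend/executor/execExprInterp.c: under EEO_USE_COMPUTED_GOTO,
ExecInterpExpr() holds a forward table mapping opcode to label address
(this is the array the patch rewrites with designated initializers),
and a reverse_dispatch_table maps a stored label address back to its
opcode for ExecEvalStepOp():

    /* inside ExecInterpExpr(); &&label is GCC's address-of-label */
    static const void *const dispatch_table[] = {
        [EEOP_DONE] = &&CASE_EEOP_DONE,
        [EEOP_INNER_FETCHSOME] = &&CASE_EEOP_INNER_FETCHSOME,
        /* ... one designated entry per ExprEvalOp ... */
    };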




Re: Using the %m printf format more

2024-03-11 Thread Michael Paquier
On Mon, Mar 11, 2024 at 11:19:16AM +, Dagfinn Ilmari Mannsåker wrote:
> On closer look, fprintf() is actually the only errno-clobbering function
> it calls, I was just hedging my bets in that statement.

This makes the whole thing simpler, so I'd be OK with that.  I am wondering
if others have opinions to offer about that.

I've applied v2-0001 for now, as it is worth on its own and it shaves
a bit of code.
--
Michael


signature.asc
Description: PGP signature


Re: Reports on obsolete Postgres versions

2024-03-11 Thread Greg Sabino Mullane
On Mon, Mar 11, 2024 at 4:38 PM Bruce Momjian  wrote:

> https://www.postgresql.org/support/versioning/
>
> This web page should correct the idea that "upgrades are more risky than
> staying with existing versions".  Is there more we can do?  Should we have
> a more consistent response for such reporters?
>

It could be helpful to remove this sentence:

"Upgrading to a minor release does not normally require a dump and restore"

While technically true, "not normally" is quite the understatement, as the
true answer is "never" or at least "not in the last few decades". I have a
hard time even imagining a scenario that would require a minor revision to
do a dump and restore - surely, that in itself would warrant a major
release?


> It would be a crazy idea to report something in the logs if a major
> version is run after a certain date, since we know the date when major
> versions will become unsupported.
>

Could indeed be useful to spit something out at startup. Heck, even minor
versions are fairly regular now. Sure would be nice to be able to point a
client at the database and say "See? Even Postgres itself thinks you should
upgrade from 11.3!!" (totally made up example, not at all related to an
actual production system /sarcasm)

Cheers,
Greg


Re: Patch: Add parse_type Function

2024-03-11 Thread Erik Wienhold
On 2024-03-09 02:42 +0100, David E. Wheeler wrote:
> On Mar 7, 2024, at 23:39, Erik Wienhold  wrote:
> 
> > I think you need to swap the examples.  The text mentions the error case
> > first and the NULL case second.
> 
> 🤦‍♂️
> 
> Thanks, fixed in the attached patch.

Thanks, LGTM.

-- 
Erik




Re: Sequence Access Methods, round two

2024-03-11 Thread Michael Paquier
On Tue, Feb 27, 2024 at 10:27:13AM +0900, Michael Paquier wrote:
> On Thu, Feb 22, 2024 at 05:36:00PM +0100, Tomas Vondra wrote:
>> 0001
>> --
>> 
>> I think this bit in pg_proc.dat is not quite right:
>> 
>>   proallargtypes => '{regclass,bool,int8}', proargmodes => '{i,o,o}',
>>   proargnames => '{seqname,is_called,last_value}',
>> 
>> the first argument should not be "seqname" but rather "seqid".
> 
> Ah, right.  There are not many system functions that use regclass as
> arguments, but the existing ones refer more to IDs, not names.

This patch set is not going to be merged for this release, so I am
going to move it to the next commit fest to continue the discussion in
v18~.

Anyway, there is one piece of this patch set that I think has a lot of
value outside of the discussion with access methods, which is to
redesign pg_sequence_last_value so as it returns a (last_value,
is_called) tuple rather than a (last_value).  This has the benefit of
switching pg_dump to use this function rather than relying on a scan
of the heap table used by a sequence to retrieve the state of a
sequence dumped.  This is the main diff:
-appendPQExpBuffer(query,
-  "SELECT last_value, is_called FROM %s",
-  fmtQualifiedDumpable(tbinfo));
+/*
+ * In versions 17 and up, pg_sequence_last_value() has been switched to
+ * return a tuple with last_value and is_called.
+ */
+if (fout->remoteVersion >= 17)
+appendPQExpBuffer(query,
+  "SELECT last_value, is_called "
+  "FROM pg_sequence_last_value('%s')",
+  fmtQualifiedDumpable(tbinfo));
+else
+appendPQExpBuffer(query,
+  "SELECT last_value, is_called FROM %s",
+  fmtQualifiedDumpable(tbinfo));

Are there any objections to that?  pg_sequence_last_value() is
something that we've only been relying on internally for the catalog 
pg_sequences.
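
For illustration, the redesigned function would be used like this
(hypothetical sequence name and values):

SELECT last_value, is_called FROM pg_sequence_last_value('my_seq'::regclass);
 last_value | is_called
------------+-----------
         42 | t
(1 row)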
--
Michael


signature.asc
Description: PGP signature


Re: Add bump memory context type and use it for tuplesorts

2024-03-11 Thread David Rowley
On Mon, 11 Mar 2024 at 22:09, John Naylor  wrote:
> I ran the test function, but using 256kB and 3MB for the reset
> frequency, and with 8,16,24,32 byte sizes (patched against a commit
> after the recent hot/cold path separation). Images attached. I also
> get a decent speedup with the bump context, but not quite as dramatic
> as on your machine. It's worth noting that slab is the slowest for me.
> This is an Intel i7-10750H.

Thanks for trying this out.  I didn't check if the performance was
susceptible to the memory size before the reset.  It certainly would
be once the allocation crosses some critical threshold of CPU cache
size, but it probably also depends to some extent on the number of
actual mallocs that are required underneath.

I see there's some discussion of bump in [1].  Do you still have a
valid use case for bump for performance/memory usage reasons?

The reason I ask is due to what Tom mentioned in [2] ("It's not
apparent to me that the "bump context" idea is valuable enough to
foreclose ever adding more context types"). So, I'm just probing to
find other possible use cases that reinforce the usefulness of bump.
It would be interesting to try it in a few places to see what
performance gains could be had.  I've not done much scouting around
the codebase for other uses other than non-bounded tuplesorts.

David

[1] 
https://postgr.es/m/canwcazbxxhysytrpyz-wzbdtvrpwoete7rqm1g_+4cb8z6k...@mail.gmail.com
[2] https://postgr.es/m/3537323.1708125...@sss.pgh.pa.us




Re: Add bump memory context type and use it for tuplesorts

2024-03-11 Thread David Rowley
On Tue, 12 Mar 2024 at 12:25, Tomas Vondra
 wrote:
> (b) slab is considerably slower

It would be interesting to modify SlabReset() to, instead of free()ing
the blocks, push the first SLAB_MAXIMUM_EMPTY_BLOCKS of them onto the
emptyblocks list.

That might give us an idea of how much overhead comes from malloc/free.
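
A rough sketch of that experiment, assuming the emptyblocks dclist and
SLAB_MAXIMUM_EMPTY_BLOCKS cap already present in slab.c (not a polished
patch):

    /* in SlabReset(), instead of unconditionally freeing each block */
    if (dclist_count(&slab->emptyblocks) < SLAB_MAXIMUM_EMPTY_BLOCKS)
        dclist_push_head(&slab->emptyblocks, &block->node); /* keep for reuse */
    else
        free(block);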

Having something like this as an option when creating a context might
be a good idea.  generation.c now keeps 1 "freeblock" which currently
does not persist during context resets. Some memory context usages
might suit having an option like this.  Maybe something like the
executor's per-tuple context, which perhaps (c|sh)ould be a generation
context...  However, saying that, I see you measure it to be slightly
slower than aset.

David




Re: [PROPOSAL] Skip test citext_utf8 on Windows

2024-03-11 Thread Michael Paquier
On Mon, Mar 11, 2024 at 03:21:11PM +0700, Oleg Tselebrovskiy wrote:
> The proposed patch for skipping test is attached

Your attached patch seems to be in binary format.
--
Michael


signature.asc
Description: PGP signature


Re: Infinite loop in XLogPageRead() on standby

2024-03-11 Thread Michael Paquier
On Mon, Mar 11, 2024 at 04:43:32PM +0900, Kyotaro Horiguchi wrote:
> At Wed, 6 Mar 2024 11:34:29 +0100, Alexander Kukushkin  
> wrote in 
>> Thank you for spending your time on it!
> 
> You're welcome, but I apologize for the delay in the work.

Thanks for spending time on this.  Everybody is busy with the last
commit fest, and the next minor release set is planned for May so
there should still be time even after the feature freeze:
https://www.postgresql.org/developer/roadmap/

I should be able to come back to this thread around the beginning of
April.  If you are able to do some progress in-between, that would be
surely helpful.

Thanks,
--
Michael


signature.asc
Description: PGP signature


Re: Combine Prune and Freeze records emitted by vacuum

2024-03-11 Thread Heikki Linnakangas

On 09/03/2024 22:41, Melanie Plageman wrote:

On Wed, Mar 6, 2024 at 7:59 AM Heikki Linnakangas  wrote:

Does GlobalVisTestIsRemovableXid() handle FrozenTransactionId correctly?


Okay, so I thought a lot about this, and I don't understand how
GlobalVisTestIsRemovableXid() would not handle FrozenTransactionId
correctly.

vacrel->cutoffs.OldestXmin is computed initially from
GetOldestNonRemovableTransactionId() which uses ComputeXidHorizons().
GlobalVisState is updated by ComputeXidHorizons() (when it is
updated).

I do see that the comment above GlobalVisTestIsRemovableXid() says

  * It is crucial that this only gets called for xids from a source that
  * protects against xid wraparounds (e.g. from a table and thus protected by
  * relfrozenxid).

and then in

  * Convert 32 bit argument to FullTransactionId. We can do so safely
  * because we know the xid has to, at the very least, be between
  * [oldestXid, nextXid), i.e. within 2 billion of xid.

I'm not sure what oldestXid is here.
It's true that I don't see GlobalVisTestIsRemovableXid() being called
anywhere else with an xmin as an input. I think that hints that it is
not safe with FrozenTransactionId. But I don't see what could go
wrong.

Maybe it has something to do with converting it to a FullTransactionId?

FullTransactionIdFromU64(U64FromFullTransactionId(rel)  + (int32)
(xid - rel_xid));

Sorry, I couldn't quite figure it out :(


I just tested it. No, GlobalVisTestIsRemovableXid does not work for 
FrozenTransactionId: with state->definitely_needed == {0, 40} and xid == 
FrozenTransactionId, it incorrectly returned 'false'. It treats 
FrozenTransactionId as if it were a regular xid '2'.
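
A hypothetical caller-side guard sketching the problem (FrozenTransactionId
is the permanent xid 2, which the GlobalVisState machinery only sees as a
very old normal xid):

    if (!TransactionIdIsNormal(xid))
        removable = (xid == FrozenTransactionId); /* frozen precedes any horizon */
    else
        removable = GlobalVisTestIsRemovableXid(vistest, xid);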



The XLOG_HEAP2_FREEZE_PAGE record is a little smaller than
XLOG_HEAP2_PRUNE. But we could optimize the XLOG_HEAP2_PRUNE format for
the case that there's no pruning, just freezing. The record format
(xl_heap_prune) looks pretty complex as it is, I think it could be made
both more compact and more clear with some refactoring.


I'm happy to change up xl_heap_prune format. In its current form,
according to pahole, it has no holes and just 3 bytes of padding at
the end.

One way we could make it smaller is by moving the isCatalogRel member
to directly after snapshotConflictHorizon and then adding a flags
field and defining flags to indicate whether or not other members
exist at all. We could set bits for HAS_FREEZE_PLANS, HAS_REDIRECTED,
HAS_UNUSED, HAS_DEAD. Then I would remove the non-optional uint16
nredirected, nunused, nplans, ndead and just put the number of
redirected/unused/etc at the beginning of the arrays containing the
offsets.


Sounds good.


Then I could write various macros for accessing them. That
would make it smaller, but it definitely wouldn't make it less complex
(IMO).


I don't know, it might turn out not that complex. If you define the 
formats of each of those "sub-record types" as explicit structs, per 
attached sketch, you won't need so many macros. Some care is still 
needed with alignment though.


--
Heikki Linnakangas
Neon (https://neon.tech)

/*
 * XXX: This one combined record type replaces existing
 * XLOG_HEAP2_PRUNE, XLOG_HEAP2_VACUUM, and XLOG_HEAP2_FREEZE_PAGE record types
 */

/*
 * This is what we need to know about page pruning and freezing, both during
 * VACUUM and during opportunistic pruning.
 *
 * If XLPH_HAS_REDIRECTIONS, XLHP_HAS_DEAD_ITEMS or XLHP_HAS_FREEZE_PLANS is
 * set, acquires a full cleanup lock. Otherwise an ordinary exclusive lock is
 * enough (VACUUM's second heap pass generates such records).
 *
 * The data for block reference 0 contains "sub-records" depending on which
 * of the XLPH_HAS_* flags are set. See xlhp_* struct definitions below.
 */
typedef struct xl_heap_prune
{
TransactionId snapshotConflictHorizon;
uint8   flags;
} xl_heap_prune;

#define XLHP_IS_CATALOG_REL 0x01 /* to handle recovery conflict
  * during logical decoding on
  * standby */
#define XLHP_HAS_REDIRECTIONS   0x02
#define XLHP_HAS_DEAD_ITEMS 0x04
#define XLHP_HAS_NOW_UNUSED_ITEMS   0x08
#define XLHP_HAS_FREEZE_PLANS   0x10

#define SizeOfHeapPrune (offsetof(xl_heap_prune, flags) + sizeof(uint8))

/*
 * Sub-record contained in block reference 0 of a prune record if
 * XLHP_HAS_REDIRECTIONS is set
 */
typedef struct xl_heap_prune_redirections
{
uint16  nitems;
struct
{
OffsetNumber from;
OffsetNumber to;
}   items[FLEXIBLE_ARRAY_MEMBER];
} xl_heap_prune_redirections;

/*
 * Sub-record contained in block reference 0 of a prune record if
 * XLHP_HAS_DEAD_ITEMS is set
 */
typedef struct xlhp_dead_items
{
uint16  nitems;
OffsetNumber items[FLEXIBLE_ARRAY_MEMBER];
} xlhp_dead_items;

/*
 * 

RE: Popcount optimization using AVX512

2024-03-11 Thread Amonson, Paul D
> -Original Message-
> From: Nathan Bossart 
> Sent: Thursday, March 7, 2024 1:36 PM
> Subject: Re: Popcount optimization using AVX512

I will be splitting the request into 2 patches. I am attaching the first patch 
(refactoring only) and I updated the commitfest entry to match this patch. I 
have a question however:
Do I need to wait for the refactor patch to be merged before I post the AVX 
portion of this feature in this thread?

> > +  PGAC_AVX512_POPCNT_INTRINSICS([-mavx512vpopcntdq -mavx512f])
> 
> I'm curious why we need both -mavx512vpopcntdq and -mavx512f.  On my
> machine, -mavx512vpopcntdq alone is enough to pass this test, so if there are
> other instructions required that need -mavx512f, then we might need to
> expand the test.

First, nice catch on the required flags to build! When I changed my algorithm, 
dependence on the -mavx512f flag was no longer needed; I will fix this in the 
second patch (AVX-specific).

> I still think it's worth breaking this change into at least 2 patches.  In 
> particular,
> I think there's an opportunity to do the refactoring into pg_popcnt_choose.c
> and pg_popcnt_x86_64_accel.c prior to adding the AVX512 stuff.  These
> changes are likely straightforward, and getting them out of the way early
> would make it easier to focus on the more interesting changes.  IMHO there
> are a lot of moving parts in this patch.

As stated above I am doing this in 2 patches. :)

> > +#undef HAVE__GET_CPUID_COUNT
> > +
> > +/* Define to 1 if you have  immintrin. */ #undef HAVE__IMMINTRIN
> 
> Is this missing HAVE__CPUIDEX?

Yes, I missed it; I will include it in the second patch (AVX-specific) of the 
two patches.

> >  uint64
> > -pg_popcount(const char *buf, int bytes)
> > +pg_popcount_slow(const char *buf, int bytes)
> >  {
> >  uint64  popcnt = 0;
> >
> > -#if SIZEOF_VOID_P >= 8
> > +#if SIZEOF_VOID_P == 8
> >  /* Process in 64-bit chunks if the buffer is aligned. */
> >  if (buf == (const char *) TYPEALIGN(8, buf))
> >  {
> > @@ -311,7 +224,7 @@ pg_popcount(const char *buf, int bytes)
> >
> >  buf = (const char *) words;
> >  }
> > -#else
> > +#elif SIZEOF_VOID_P == 4
> >  /* Process in 32-bit chunks if the buffer is aligned. */
> >  if (buf == (const char *) TYPEALIGN(4, buf))
> >  {
> 
> Apologies for harping on this, but I'm still not seeing the need for these
> SIZEOF_VOID_P changes.  While it's unlikely that this makes any practical
> difference, I see no reason to more strictly check SIZEOF_VOID_P here.

I got rid of the second occurrence as I agree it is not needed, but unless you 
see something I don't, how do I know which function to call between a 32-bit 
and 64-bit architecture? Maybe I am missing something obvious? What exactly do you 
suggest here? I am happy to always call either pg_popcount32() or 
pg_popcount64() with the understanding that it may not be optimal, but I do 
need to know which to use.

> > +/* Process any remaining bytes */
> > +while (bytes--)
> > +popcnt += pg_number_of_ones[(unsigned char) *buf++];
> > +return popcnt;
> > +#else
> > +return pg_popcount_slow(buf, bytes);
> > +#endif  /* USE_AVX512_CODE */
> 
> nitpick: Could we call pg_popcount_slow() in a common section for these
> "remaining bytes?"

Agreed, will fix in the second patch as well.

> > +#if defined(_MSC_VER)
> > +pg_popcount_indirect = pg_popcount512_fast; #else
> > +pg_popcount = pg_popcount512_fast; #endif

> These _MSC_VER sections are interesting.  I'm assuming this is the
> workaround for the MSVC linking issue you mentioned above.  I haven't
> looked too closely, but I wonder if the CRC32C code (see
> src/include/port/pg_crc32c.h) is doing something different to avoid this 
> issue.

Using the latest master branch, I see what the needed changes are; I will 
implement them using the PGDLLIMPORT macro in the second patch.

> Upthread, Alvaro suggested a benchmark [0] that might be useful.  I scanned
> through this thread and didn't see any recent benchmark results for the latest
> form of the patch.  I think it's worth verifying that we are still seeing the
> expected improvements.

I will get new benchmarks using the same process I used before (from Akash) so 
I get apples to apples. These are pending completion of the second patch which 
is still in progress.

Just a reminder, I asked questions above about 1) multi-part dependent patches 
and 2) what specifically to do about the SIZEOF_VOID_P checks. :)

Thanks,
Paul



v7-0001-Refactor-POPCNT.patch
Description: v7-0001-Refactor-POPCNT.patch


Re: Reports on obsolete Postgres versions

2024-03-11 Thread Bruce Momjian
On Mon, Mar 11, 2024 at 04:12:04PM -0500, Nathan Bossart wrote:
> On Mon, Mar 11, 2024 at 04:37:49PM -0400, Bruce Momjian wrote:
> > https://www.postgresql.org/support/versioning/
> > 
> > This web page should correct the idea that "upgrades are more risky than
> > staying with existing versions".  Is there more we can do?  Should we
> > have a more consistent response for such reporters?
> 
> I've read that the use of the term "minor release" can be confusing.  While
> the versioning page clearly describes what is eligible for a minor release,
> not everyone reads it, so I suspect that many folks think there are new
> features, etc. in minor releases.  I think a "minor release" of Postgres is
> more similar to what other projects would call a "patch version."

Well, we do say:

While upgrading will always contain some level of risk, PostgreSQL
minor releases fix only frequently-encountered bugs, security issues,
and data corruption problems to reduce the risk associated with
upgrading. For minor releases, the community considers not upgrading to
be riskier than upgrading. 

but that is far down the page.  Do we need to improve this?

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Re: Reports on obsolete Postgres versions

2024-03-11 Thread Nathan Bossart
On Mon, Mar 11, 2024 at 04:37:49PM -0400, Bruce Momjian wrote:
>   https://www.postgresql.org/support/versioning/
> 
> This web page should correct the idea that "upgrades are more risky than
> staying with existing versions".  Is there more we can do?  Should we
> have a more consistent response for such reporters?

I've read that the use of the term "minor release" can be confusing.  While
the versioning page clearly describes what is eligible for a minor release,
not everyone reads it, so I suspect that many folks think there are new
features, etc. in minor releases.  I think a "minor release" of Postgres is
more similar to what other projects would call a "patch version."

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Confine vacuum skip logic to lazy_scan_skip

2024-03-11 Thread Melanie Plageman
On Sun, Mar 10, 2024 at 11:01 PM Thomas Munro  wrote:
>
> On Mon, Mar 11, 2024 at 5:31 AM Melanie Plageman
>  wrote:
> > I have investigated the interaction between
> > maintenance_io_concurrency, streaming reads, and the vacuum buffer
> > access strategy (BAS_VACUUM).
> >
> > The streaming read API limits max_pinned_buffers to a pinned buffer
> > multiplier (currently 4) * maintenance_io_concurrency buffers with the
> > goal of constructing reads of at least MAX_BUFFERS_PER_TRANSFER size.
> >
> > Since the BAS_VACUUM ring buffer is size 256 kB or 32 buffers with
> > default block size, that means that for a fully uncached vacuum in
> > which all blocks must be vacuumed and will be dirtied, you'd have to
> > set maintenance_io_concurrency at 8 or lower to see the same number of
> > reuses (and shared buffer consumption) as master.
> >
> > Given that we allow users to specify BUFFER_USAGE_LIMIT to vacuum, it
> > seems like we should force max_pinned_buffers to a value that
> > guarantees the expected shared buffer usage by vacuum. But that means
> > that maintenance_io_concurrency does not have a predictable impact on
> > streaming read vacuum.
> >
> > What is the right thing to do here?
> >
> > At the least, the default size of the BAS_VACUUM ring buffer should be
> > BLCKSZ * pinned_buffer_multiplier * default maintenance_io_concurrency
> > (probably rounded up to the next power of two) bytes.
>
> Hmm, does the v6 look-ahead distance control algorithm mitigate that
> problem?  Using the ABC classifications from the streaming read
> thread, I think for A it should now pin only 1, for B 16 and for C, it
> depends on the size of the random 'chunks': if you have a lot of size
> 1 random reads then it shouldn't go above 10 because of (default)
> maintenance_io_concurrency.  The only way to get up to very high
> numbers would be to have a lot of random chunks triggering behaviour
> C, but each made up of long runs of misses.  For example one can
> contrive a BHS query that happens to read pages 0-15 then 20-35 then
> 40-55 etc etc so that we want to get lots of wide I/Os running
> concurrently.  Unless vacuum manages to do something like that, it
> shouldn't be able to exceed 32 buffers very easily.
>
> I suspect that if we taught streaming_read.c to ask the
> BufferAccessStrategy (if one is passed in) what its recommended pin
> limit is (strategy->nbuffers?), we could just clamp
> max_pinned_buffers, and it would be hard to find a workload where that
> makes a difference, and we could think about more complicated logic
> later.
>
> In other words, I think/hope your complaints about excessive pinning
> from v5 WRT all-cached heap scans might have also already improved
> this case by happy coincidence?  I haven't tried it out though, I just
> read your description of the problem...
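
For concreteness, the clamp suggested above might look like this in the
streaming read setup code (GetAccessStrategyBufferCount() already exists
and returns the strategy's ring size; the surrounding variable names are
assumptions):

    if (strategy != NULL)
        max_pinned_buffers = Min(max_pinned_buffers,
                                 GetAccessStrategyBufferCount(strategy));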

I've rebased the attached v10 over top of the changes to
lazy_scan_heap() Heikki just committed and over the v6 streaming read
patch set. I started testing them and see that you are right, we no
longer pin too many buffers. However, the uncached example below is
now slower with streaming read than on master -- it looks to be
because it is doing twice as many WAL writes and syncs. I'm still
investigating why that is.

psql \
-c "create table small (a int) with (autovacuum_enabled=false,
fillfactor=25);" \
-c "insert into small select generate_series(1,20) % 3;" \
-c "update small set a = 6 where a = 1;"

pg_ctl stop
# drop caches
pg_ctl start

psql -c "\timing on" -c "vacuum (verbose) small"

- Melanie
From 2b181d178f0cd55de45659540ec35536918a4c9d Mon Sep 17 00:00:00 2001
From: Melanie Plageman 
Date: Mon, 11 Mar 2024 15:36:39 -0400
Subject: [PATCH v10 1/3] Streaming Read API

---
 src/backend/storage/Makefile |   2 +-
 src/backend/storage/aio/Makefile |  14 +
 src/backend/storage/aio/meson.build  |   5 +
 src/backend/storage/aio/streaming_read.c | 648 +++
 src/backend/storage/buffer/bufmgr.c  | 641 +++---
 src/backend/storage/buffer/localbuf.c|  14 +-
 src/backend/storage/meson.build  |   1 +
 src/include/storage/bufmgr.h |  45 ++
 src/include/storage/streaming_read.h |  52 ++
 src/tools/pgindent/typedefs.list |   3 +
 10 files changed, 1215 insertions(+), 210 deletions(-)
 create mode 100644 src/backend/storage/aio/Makefile
 create mode 100644 src/backend/storage/aio/meson.build
 create mode 100644 src/backend/storage/aio/streaming_read.c
 create mode 100644 src/include/storage/streaming_read.h

diff --git a/src/backend/storage/Makefile b/src/backend/storage/Makefile
index 8376cdfca20..eec03f6f2b4 100644
--- a/src/backend/storage/Makefile
+++ b/src/backend/storage/Makefile
@@ -8,6 +8,6 @@ subdir = src/backend/storage
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-SUBDIRS = buffer file freespace ipc large_object lmgr page smgr sync
+SUBDIRS = aio buffer file freespace ipc large_object lmgr page 

Re: Reducing output size of nodeToString

2024-03-11 Thread Matthias van de Meent
On Mon, 11 Mar 2024 at 14:19, Peter Eisentraut  wrote:
>
> On 22.02.24 16:07, Matthias van de Meent wrote:
> > On Thu, 22 Feb 2024 at 13:37, Matthias van de Meent
> >  wrote:
> >>
> >> On Mon, 19 Feb 2024 at 14:19, Matthias van de Meent
> >>  wrote:
> >>> Attached the updated version of the patch on top of 5497daf3, which
> >>> incorporates this last round of feedback.
> >>
> >> Now attached rebased on top of 93db6cbd to fix conflicts with fbc93b8b
> >> and an issue in the previous patchset: I attached one too many v3-0001
> >> from a previous patch I worked on.
> >
> > ... and now with a fix for not overwriting newly deserialized location
> > attributes with -1, which breaks test output for
> > READ_WRITE_PARSE_PLAN_TREES installations. Again, no other significant
> > changes since the patch of last Monday.
>
> * v7-0002-pg_node_tree-Don-t-store-query-text-locations-in-.patch
>
> This patch looks much more complicated than I was expecting.  I had
> suggested to model this after stringToNodeWithLocations().  This uses a
> global variable to toggle the mode.  Your patch creates a function
> nodeToStringNoQLocs() -- why the different naming scheme?

It doesn't just exclude .location fields, but also Query.len, a
similar field which contains the length of the query's string. The
name has been further refined to nodeToStringNoParseLocs() in the
attached version, but feel free to replace the names in the patch to
anything else you might want.

>  -- and passes
> the flag down as an argument to all the output functions.  I mean, in a
> green field, avoiding global variables can be sensible, of course, but I
> think in this limited scope here it would really be better to keep the
> code for the two directions read and write the same.

I'm a big fan of _not_ using magic global variables as passed context
without resetting on subnormal exits...
For GUCs their usage is understandable (and there is infrastructure to
reset them, and you're not supposed to manually update them), but IMO
here its usage should be a function-scoped variable or in a
passed-by-reference context struct, not a file-local static.
Regardless, attached is an adapted version with the file-local
variable implementation.

> Attached is a small patch that shows what I had in mind.  (It doesn't
> contain any callers, but your patch shows where all those would go.)

Attached a revised version that does it like stringToNodeInternal's
handling of restore_location_fields.

> * v7-0003-gen_node_support.pl-Mark-location-fields-as-type-.patch
>
> This looks sensible, but maybe making Location a global type is a bit
> much?  Maybe something more specific like ParseLocation, or ParseLoc, to
> keep it under 12 characters.

I've gone with ParseLoc in the attached v8 patchset.

Kind regards,

Matthias van de Meent


v8-0003-pg_node_tree-Don-t-store-query-text-locations-in-.patch
Description: Binary data


v8-0001-pg_node_tree-Omit-serialization-of-fields-with-de.patch
Description: Binary data


v8-0005-nodeToString-omit-serializing-NULL-datums-in-Cons.patch
Description: Binary data


v8-0004-gen_node_support.pl-Add-a-TypMod-type-for-signall.patch
Description: Binary data


v8-0002-gen_node_support.pl-Mark-location-fields-as-type-.patch
Description: Binary data


v8-0006-nodeToString-Apply-RLE-on-Bitmapset-and-numeric-L.patch
Description: Binary data


v8-0008-nodeToString-omit-serializing-0s-in-enum-typed-fi.patch
Description: Binary data


v8-0007-gen_node_support.pl-Optimize-serialization-of-fie.patch
Description: Binary data


Re: vacuumdb/clusterdb/reindexdb: allow specifying objects to process in all databases

2024-03-11 Thread Nathan Bossart
On Fri, Mar 08, 2024 at 04:03:22PM -0600, Nathan Bossart wrote:
> On Fri, Mar 08, 2024 at 09:33:19AM +, Dean Rasheed wrote:
>> I think this is good to go.
> 
> Thanks.  In v4, I've added a first draft of the commit messages, and I am
> planning to commit this early next week.

Committed.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Confine vacuum skip logic to lazy_scan_skip

2024-03-11 Thread Melanie Plageman
On Mon, Mar 11, 2024 at 2:47 PM Heikki Linnakangas  wrote:
>
>
> > Otherwise LGTM
>
> Ok, pushed! Thank you, this is much more understandable now!

Cool, thanks!




Reports on obsolete Postgres versions

2024-03-11 Thread Bruce Momjian
I am seeing an increasing number of bug/problem reports on obsolete
Postgres versions, either not running a superseded minor version, or
running an unsupported major version.

What can we do to reduce such reports, or at least give a consistent
response?  It is very helpful that we have this web page, and I have
made a habit of pointing reporters to that page since it has all the
information they need:

https://www.postgresql.org/support/versioning/

This web page should correct the idea that "upgrades are more risky than
staying with existing versions".  Is there more we can do?  Should we
have a more consistent response for such reporters?

It would be a crazy idea to report something in the logs if a major
version is run after a certain date, since we know the date when major
versions will become unsupported.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Re: Statistics Import and Export

2024-03-11 Thread Corey Huinker
>
> > In which case we're failing nearly silently, yes, there is a null
> returned,
> > but we have no idea why there is a null returned. If I were using this
> > function manually I'd want to know what I did wrong, what parameter I
> > skipped, etc.
>
> I can see it both ways and don't feel super strongly about it ... I just
> know that I've had some cases where we returned an ERROR or otherwise
> were a bit noisy on NULL values getting passed into a function and it
> was much more on the annoying side than on the helpful side; to the
> point where we've gone back and pulled out ereport(ERROR) calls from
> functions before because they were causing issues in otherwise pretty
> reasonable queries (consider things like functions getting pushed down
> to below WHERE clauses and such...).
>

I don't have strong feelings either. I think we should get more input on
this. Regardless, it's easy to change...for now.



>
> Sure.  Not a huge deal either way, was just pointing out the difference.
> I do think it'd be good to match what ANALYZE does here, so checking if
> the values in pg_class are different and only updating if they are,
> while keeping the code for pg_statistic where it'll just always update.
>

I agree that mirroring ANALYZE wherever possible is the ideal.



> > I like the symmetry of a consistent interface, but we've already got an
> > asymmetry in that the pg_class update is done non-transactionally (like
> > ANALYZE does).
>
> Don't know that I really consider that to be the same kind of thing when
> it comes to talking about the interface as the other aspects we're
> discussing ...
>

Fair.




>
> > One persistent problem is that there is no _safe equivalent to
> > ARRAY_IN, so that can always fail on us, though it should only do so if
> > the string
> > passed in wasn't a valid array input format, or the values in the array
> > can't coerce to the attribute's basetype.
>
> That would happen before we even get to being called and there's not
> much to do about it anyway.
>

Not sure I follow you here. The ARRAY_IN function calls happen once for
every non-null stavaluesN parameter, and it's done inside the function
because the result type could be the base type for a domain/array type, or
could be the type itself. I suppose we could move that determination to the
caller, but then we'd need to call get_base_element_type() inside a client,
and that seems wrong if it's even possible.
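
For illustration, this is the kind of failure I mean; the cast below goes
through the same array input machinery:

SELECT '{1,2,not_an_int}'::int4[];
ERROR:  invalid input syntax for type integer: "not_an_int"

Anything like that inside the function aborts the whole call, since there is
no soft-error variant to fall back on.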


> > I should also point out that we've lost the ability to check if the
> > export values were of a type, and if the destination column is also of
> > that type.
> > That's a non-issue in binary upgrades, but of course if a field changed
> > from integers to text the histograms would now be highly misleading.
> > Thoughts on adding a typname parameter that the function uses as a cheap
> > validity check?
>
> Seems reasonable to me.
>

I'd like to hear what Tomas thinks about this, as he was the initial
advocate for it.


> > As for pg_dump, I'm currently leaning toward the TOC entry having
> > either a series of commands:
> >
> > SELECT pg_set_relation_stats('foo.bar'::regclass, ...);
> > pg_set_attribute_stats('foo.bar'::regclass, 'id'::name, ...); ...
>
> I'm guessing the above was intended to be SELECT ..; SELECT ..;
>

Yes.


>
> > Or one compound command
> >
> > SELECT pg_set_relation_stats(t.oid, ...)
> >  pg_set_attribute_stats(t.oid, 'id'::name, ...),
> >  pg_set_attribute_stats(t.oid, 'last_name'::name, ...),
> >  ...
> > FROM (VALUES('foo.bar'::regclass)) AS t(oid);
> >
> > The second one has the feature that if any one attribute fails, then the
> > whole update fails, except, of course, for the in-place update of
> > pg_class.
> > This avoids having an explicit transaction block, but we could get that
> > back by having restore wrap the list of commands in a transaction block
> > (and adding the explicit lock commands) when it is safe to do so.
>
> Hm, I like this approach as it should essentially give us the
> transaction block we had been talking about wanting but without needing
> to explicitly do a begin/commit, which would add in some annoying
> complications.  This would hopefully also reduce the locking concern
> mentioned previously, since we'd get the lock needed in the first
> function call and then the others would be able to just see that we've
> already got the lock pretty quickly.
>

True, we'd get the lock needed in the first function call, but wouldn't we
also release that very lock before the subsequent call? Obviously we'd be
shrinking the window in which another process could get in line and take a
superior lock, and the universe of other processes that would even want a
lock that blocks us is nil in the case of an upgrade, identical to existing
behavior in the case of an FDW ANALYZE, and perfectly fine in the case of
someone tinkering with stats.


>
> > Subject: [PATCH v8] Create pg_set_relation_stats, pg_set_attribute_stats.
>
> [...]
>
> > +Datum

Re: remaining sql/json patches

2024-03-11 Thread Shruthi Gowda
Thanks, Alvaro. It works fine when quotes are used around the column name.

On Mon, Mar 11, 2024 at 9:04 PM Alvaro Herrera 
wrote:

> On 2024-Mar-11, Shruthi Gowda wrote:
>
> > *CASE 2:*
> > --
> > SELECT * FROM JSON_TABLE(jsonb '{
> >  "id" : 901,
> >  "age" : 30,
> >  "*FULL_NAME*" : "KATE DANIEL"}',
> > '$'
> > COLUMNS(
> >  FULL_NAME varchar(20),
> >  ID int,
> >  AGE int
> >   )
> >) as t;
>
> I think this is expected: when you use FULL_NAME as a SQL identifier, it
> is down-cased, so it no longer matches the uppercase identifier in the
> JSON data.  You'd have to do it like this:
>
> SELECT * FROM JSON_TABLE(jsonb '{
>  "id" : 901,
>  "age" : 30,
>  "*FULL_NAME*" : "KATE DANIEL"}',
> '$'
> COLUMNS(
>  "FULL_NAME" varchar(20),
>  ID int,
>  AGE int
>   )
>) as t;
>
> so that the SQL identifier is not downcased.
>
> --
> Álvaro Herrera PostgreSQL Developer  —
> https://www.EnterpriseDB.com/
>


Re: Statistics Import and Export

2024-03-11 Thread Stephen Frost
Greetings,

* Corey Huinker (corey.huin...@gmail.com) wrote:
> > Having thought about it a bit more, I generally like the idea of being
> > able to just update one stat instead of having to update all of them at
> > once (and therefore having to go look up what the other values currently
> > are...).  That said, per below, perhaps making it strict is the better
> > plan.
> 
> v8 has it as strict.

Ok.
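
For anyone following along, a generic illustration of what strictness buys us
(demo_strict is a made-up function, not the patch's):

CREATE FUNCTION demo_strict(i int) RETURNS int
    LANGUAGE sql STRICT RETURN i + 1;

SELECT demo_strict(NULL);  -- returns NULL; the body is never evaluated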

> > > > Also, in some cases we allow the function to be called with a
> > > > NULL but then make it a no-op rather than throwing an ERROR (eg, if the
> > > > OID ends up being NULL).
> > >
> > > Thoughts on it emitting a WARN or NOTICE before returning false?
> >
> > Eh, I don't think so?
> >
> > Where this is coming from is that we can often end up with functions
> > like these being called inside of larger queries, and having them spit
> > out WARN or NOTICE will just make them noisy.
> >
> > That leads to my general feeling of just returning NULL if called with a
> > NULL OID, as we would get with setting the function strict.
> 
> In which case we're failing nearly silently, yes, there is a null returned,
> but we have no idea why there is a null returned. If I were using this
> function manually I'd want to know what I did wrong, what parameter I
> skipped, etc.

I can see it both ways and don't feel super strongly about it ... I just
know that I've had some cases where we returned an ERROR or otherwise
were a bit noisy on NULL values getting passed into a function and it
was much more on the annoying side than on the helpful side; to the
point where we've gone back and pulled out ereport(ERROR) calls from
functions before because they were causing issues in otherwise pretty
reasonable queries (consider things like functions getting pushed down
to below WHERE clauses and such...).

> > Well, that code is for pg_statistic while I was looking at pg_class (in
> > vacuum.c:1428-1443, where we track if we're actually changing anything
> > and only make the pg_class change if there's actually something
> > different):
> 
> I can do that, especially since it's only 3 tuples of known types, but my
> reservations are summed up in the next comment.

> > Not sure why we don't treat both the same way though ... although it's
> > probably the case that it's much less likely to have an entire
> > pg_statistic row be identical than the few values in pg_class.
> 
> That would also involve comparing ANYARRAY values, yuk. Also, a matched
> record will never be the case when used in primary purpose of the function
> (upgrades), and not a big deal in the other future cases (if we use it in
> ANALYZE on foreign tables instead of remote table samples, users
> experimenting with tuning queries under hypothetical workloads).

Sure.  Not a huge deal either way, was just pointing out the difference.
I do think it'd be good to match what ANALYZE does here, so checking if
the values in pg_class are different and only updating if they are,
while keeping the code for pg_statistic where it'll just always update.

> > Hmm, that's a valid point, so a NULL passed in would need to set that
> > value actually to NULL, presumably.  Perhaps then we should have
> > pg_set_relation_stats() be strict and have pg_set_attribute_stats()
> > handles NULLs passed in appropriately, and return NULL if the relation
> > itself or attname, or other required (not NULL'able) argument passed in
> > cause the function to return NULL.
> >
> 
> That's how I have relstats done in v8, and could make it do that for attr
> stats.

That'd be my suggestion, at least, but as I mention above, it's not a
position I hold very strongly.

> > (What I'm trying to drive at here is a consistent interface for these
> > functions, but one which does a no-op instead of returning an ERROR on
> > values being passed in which aren't allowable; it can be quite
> > frustrating trying to get a query to work where one of the functions
> > decides to return ERROR instead of just ignoring things passed in which
> > aren't valid.)
> 
> I like the symmetry of a consistent interface, but we've already got an
> asymmetry in that the pg_class update is done non-transactionally (like
> ANALYZE does).

Don't know that I really consider that to be the same kind of thing when
it comes to talking about the interface as the other aspects we're
discussing ...

> One persistent problem is that there is no _safe equivalent to ARRAY_IN, so
> that can always fail on us, though it should only do so if the string
> passed in wasn't a valid array input format, or the values in the array
> can't coerce to the attribute's basetype.

That would happen before we even get to being called and there's not
much to do about it anyway.

> I should also point out that we've lost the ability to check if the export
> values were of a type, and if the destination column is also of that type.
> That's a non-issue in binary upgrades, but of course if a field changed
from integers to text the histograms would now be highly misleading.

Re: Confine vacuum skip logic to lazy_scan_skip

2024-03-11 Thread Heikki Linnakangas

On 11/03/2024 18:15, Melanie Plageman wrote:

On Mon, Mar 11, 2024 at 11:29:44AM +0200, Heikki Linnakangas wrote:

diff --git a/src/backend/access/heap/vacuumlazy.c 
b/src/backend/access/heap/vacuumlazy.c
index ac55ebd2ae..1757eb49b7 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
+



  /*
- * lazy_scan_skip() -- set up range of skippable blocks using visibility 
map.
+ * heap_vac_scan_next_block() -- get next block for vacuum to process
   *
- * lazy_scan_heap() calls here every time it needs to set up a new range of
- * blocks to skip via the visibility map.  Caller passes the next block in
- * line.  We return a next_unskippable_block for this range.  When there are
- * no skippable blocks we just return caller's next_block.  The all-visible
- * status of the returned block is set in *next_unskippable_allvis for caller,
- * too.  Block usually won't be all-visible (since it's unskippable), but it
- * can be during aggressive VACUUMs (as well as in certain edge cases).
+ * lazy_scan_heap() calls here every time it needs to get the next block to
+ * prune and vacuum.  The function uses the visibility map, vacuum options,
+ * and various thresholds to skip blocks which do not need to be processed and
+ * sets blkno to the next block that actually needs to be processed.


I wonder if "need" is too strong a word since this function
(heap_vac_scan_next_block()) specifically can set blkno to a block which
doesn't *need* to be processed but which it chooses to process because
of SKIP_PAGES_THRESHOLD.


Ok yeah, there's a lot of "needs" here :-). Fixed.


   *
- * Sets *skipping_current_range to indicate if caller should skip this range.
- * Costs and benefits drive our decision.  Very small ranges won't be skipped.
+ * The block number and visibility status of the next block to process are set
+ * in *blkno and *all_visible_according_to_vm.  The return value is false if
+ * there are no further blocks to process.
+ *
+ * vacrel is an in/out parameter here; vacuum options and information about
+ * the relation are read, and vacrel->skippedallvis is set to ensure we don't
+ * advance relfrozenxid when we have skipped vacuuming all-visible blocks.  It


Maybe this should say when we have skipped vacuuming all-visible blocks
which are not all-frozen or just blocks which are not all-frozen.


Ok, rephrased.


+ * also holds information about the next unskippable block, as bookkeeping for
+ * this function.
   *
   * Note: our opinion of which blocks can be skipped can go stale immediately.
   * It's okay if caller "misses" a page whose all-visible or all-frozen marking


Wonder if it makes sense to move this note to
find_next_nunskippable_block().


Moved.


@@ -1098,26 +1081,119 @@ lazy_scan_heap(LVRelState *vacrel)
   * older XIDs/MXIDs.  The vacrel->skippedallvis flag will be set here when the
   * choice to skip such a range is actually made, making everything safe.)
   */
-static BlockNumber
-lazy_scan_skip(LVRelState *vacrel, Buffer *vmbuffer, BlockNumber next_block,
-  bool *next_unskippable_allvis, bool 
*skipping_current_range)
+static bool
+heap_vac_scan_next_block(LVRelState *vacrel, BlockNumber *blkno,
+bool 
*all_visible_according_to_vm)
  {
-   BlockNumber rel_pages = vacrel->rel_pages,
-   next_unskippable_block = next_block,
-   nskippable_blocks = 0;
-   bool        skipsallvis = false;
+   BlockNumber next_block;
  
-	*next_unskippable_allvis = true;

-   while (next_unskippable_block < rel_pages)
+   /* relies on InvalidBlockNumber + 1 overflowing to 0 on first call */
+   next_block = vacrel->current_block + 1;
+
+   /* Have we reached the end of the relation? */


No strong opinion on this, but I wonder if being at the end of the
relation counts as a fourth state?


Yeah, perhaps. But I think it makes sense to treat it as a special case.


+   if (next_block >= vacrel->rel_pages)
+   {
+   if (BufferIsValid(vacrel->next_unskippable_vmbuffer))
+   {
+   ReleaseBuffer(vacrel->next_unskippable_vmbuffer);
+   vacrel->next_unskippable_vmbuffer = InvalidBuffer;
+   }
+   *blkno = vacrel->rel_pages;
+   return false;
+   }
+
+   /*
+* We must be in one of the three following states:
+*/
+   if (vacrel->next_unskippable_block == InvalidBlockNumber ||
+   next_block > vacrel->next_unskippable_block)
+   {
+   /*
+* 1. We have just processed an unskippable block (or we're at 
the
+* beginning of the scan).  Find the next unskippable block 
using the
+* visibility map.
+*/


I would reorder the options in the comment or in the if statement since
they seem to be in the 

Re: btree: downlink right separator/HIKEY optimization

2024-03-11 Thread Matthias van de Meent
On Fri, 8 Mar 2024 at 20:14, Peter Geoghegan  wrote:
>
> On Thu, Feb 22, 2024 at 10:42 AM Matthias van de Meent
>  wrote:
> > I forgot to address this in the previous patch, so here's v3 which
> > fixes the issue warning.
>
> What benchmarking have you done here?

I have benchmarked this atop various versions of master when it was
part of the btree specialization patchset, where it showed a 2-9%
increase in btree insert performance over the previous patch in the
patchset on the various index types in that set.
More recently, on an unlogged pgbench with foreign keys enabled (-s400
-j4 -c8) I can't find any obvious regressions (it gains 0-0.7% on
master across 5-minute runs), while being 4.5% faster at inserting
data into a table with an excessively bad index shape (a single index
of 10 columns of empty strings with the non-default "nl-BE-x-icu"
collation, followed by 1 random uuid column, inserted from a 10M-row
dataset). Extrapolation indicates this could indeed exceed a 7%
improvement when the index shape is 31 nondefault-collated nonnull
text columns plus a single random ID column.

> Have you tried just reordering things in _bt_search() instead? If we
> delay the check until after the binary search, then the result of the
> binary search is usually proof enough that we cannot possibly need to
> move right. That has the advantage of not requiring that we copy
> anything to the stack.

I've not tried that, because it would make page-level prefix
truncation more expensive by ~50%: With this patch, we need only 2
full tuple _bt_compares per page before we can establish a prefix,
while without this patch (i.e. if we did a binsrch-first approach)
we'd need 3 on average (assuming linearly randomly distributed
accesses). Because full-key compares can be arbitrarily more expensive
than normal attribute compares, I'd rather not have that 50% overhead.

> > On Fri, Mar 8, 2024 at 2:14 PM Peter Geoghegan wrote:
> > What benchmarking have you done here?
>
> I think that the memcmp() test is subtly wrong:

Good catch, it's been fixed in the attached version, using a new function.

Kind regards,

Matthias van de Meent.


v3-0001-btree-optimize-_bt_moveright-using-downlink-s-rig.patch
Description: Binary data


Re: UUID v7

2024-03-11 Thread Andrey M. Borodin



> On 11 Mar 2024, at 20:56, Jelte Fennema-Nio  wrote:
> 
> Attached a few comment fixes/improvements and a pgindent run (patch 0002-0004)

Thanks!

> Now with the added comments, one thing pops out to me: The comments
> mention that we use "Monotonic Random", but when I read the spec that
> explicitly recommends against using an increment of 1 when using
> monotonic random. I feel like if we use an increment of 1, we're
> better off going for the "Fixed-Length Dedicated Counter Bits" method
> (i.e. change the code to start the counter at 0). See patch 0005 for
> an example of that change.
> 
> I'm also wondering if we really want to use the extra rand_b bits for
> this. The spec says we MAY, but it does remove the amount of
> randomness in our UUIDs.

Method 1 is just Method 2 with specifically picked constants.
But I'll have to use some hand-wavy wordings...

UUID consists of these 128 bits:
a. Mandatory 2 var and 4 ver bits.
b. Flexible but strongly recommended 48 bits unix_ts_ms. These bits contribute 
to global sortability of values generated at frequency less than 1KHz.
c. Counter bits:
c1. Initialised with 0 on any time tick.
c2. Initialised with randomness.
c3*. bit width of a counter step (*not counted in 128 bit capacity, can be 
non-integral)
d. Randomness bits.

Method 1 is when c2=0. My implementation of Method 2 uses c1=1, c2=17.

Consider all UUIDs generated within any given millisecond. The probability of
a collision of two UUIDs generated at frequency less than 1KHz is p = 2^-(c2+d).
The capacity of a counter has expected value c = 2^(c1)*2^(c2-1)/2^c3.
To guess the next UUID, you have to pick correctly from u = 2^(d+c3) possibilities.

First, observe that c3 contributes to unguessability at exactly the same scale
as it decreases counter capacity. There is no difference between using bits in
d directly or in c3, so there is no point in a non-zero c3: every bit that
could be given to c3 can equally be given to d.

Second, observe that c2 bits contribute to both collision protection and
counter capacity! And when the time ticks, c2 also contributes to
unguessability! So, technically, we should consider using all available bits
as c2 bits.

How many c1 bits do we need? I've chosen one - to prevent occasional counter 
capacity reduction.

If c1 = 1, we can distribute 73 bits between c2 and d. I've chosen c2 = 17 and
d = 56 as an arbitrary compromise between the per-millisecond capacity of one
backend and prevention of global collisions.
This compromise is mostly dictated by the maximum frequency of UUID generation
by one backend; I've chosen 200MHz as a sane value.
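
To make that concrete (my arithmetic, just plugging the chosen constants into
the formulas above): with c1=1, c2=17, c3=0, d=56 we get a collision
probability of p = 2^-(17+56) = 2^-73 for sub-1KHz generators, an expected
counter capacity of c = 2^1 * 2^(17-1) = 2^17 ~ 131K increments per
millisecond tick, and u = 2^56 equally likely candidates for anyone trying to
guess the next UUID within the same tick.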


This compromise is much easier when you have 74 spare bits: this crazy amount
of information forgives almost any mistake. Imagine instead that you had to
distribute 10 bits between c2 and d, while trying to prevent collisions between
10 independent devices which each need the capacity to generate IDs at 10KHz
and keep sortability. You would end up with something like c1=1, c2=3, d=6.

Sorry for this long and vague explanation, if it still seems too uncertain we 
can have a chat or something like that. I don't think this number picking stuff 
deserve to be commented, because it still is quite close to random. RFC gives 
us too much freedom of choice.

Thanks!


Best regards, Andrey Borodin.



Re: Statistics Import and Export

2024-03-11 Thread Corey Huinker
>
>
>
>
> Having thought about it a bit more, I generally like the idea of being
> able to just update one stat instead of having to update all of them at
> once (and therefore having to go look up what the other values currently
> are...).  That said, per below, perhaps making it strict is the better
> plan.
>

v8 has it as strict.


>
> > > Also, in some cases we allow the function to be called with a
> > > NULL but then make it a no-op rather than throwing an ERROR (eg, if the
> > > OID ends up being NULL).
> >
> > Thoughts on it emitting a WARN or NOTICE before returning false?
>
> Eh, I don't think so?
>
> Where this is coming from is that we can often end up with functions
> like these being called inside of larger queries, and having them spit
> out WARN or NOTICE will just make them noisy.
>
> That leads to my general feeling of just returning NULL if called with a
> NULL OID, as we would get with setting the function strict.
>

In which case we're failing nearly silently, yes, there is a null returned,
but we have no idea why there is a null returned. If I were using this
function manually I'd want to know what I did wrong, what parameter I
skipped, etc.


> Well, that code is for pg_statistic while I was looking at pg_class (in
> vacuum.c:1428-1443, where we track if we're actually changing anything
> and only make the pg_class change if there's actually something
> different):
>

I can do that, especially since it's only 3 tuples of known types, but my
reservations are summed up in the next comment.



> Not sure why we don't treat both the same way though ... although it's
> probably the case that it's much less likely to have an entire
> pg_statistic row be identical than the few values in pg_class.
>

That would also involve comparing ANYARRAY values, yuk. Also, a matched
record will never be the case when used in primary purpose of the function
(upgrades), and not a big deal in the other future cases (if we use it in
ANALYZE on foreign tables instead of remote table samples, users
experimenting with tuning queries under hypothetical workloads).




> Hmm, that's a valid point, so a NULL passed in would need to set that
> value actually to NULL, presumably.  Perhaps then we should have
> pg_set_relation_stats() be strict and have pg_set_attribute_stats()
> handles NULLs passed in appropriately, and return NULL if the relation
> itself or attname, or other required (not NULL'able) argument passed in
> cause the function to return NULL.
>

That's how I have relstats done in v8, and could make it do that for attr
stats.

(What I'm trying to drive at here is a consistent interface for these
> functions, but one which does a no-op instead of returning an ERROR on
> values being passed in which aren't allowable; it can be quite
> frustrating trying to get a query to work where one of the functions
> decides to return ERROR instead of just ignoring things passed in which
> aren't valid.)
>

I like the symmetry of a consistent interface, but we've already got an
asymmetry in that the pg_class update is done non-transactionally (like
ANALYZE does).

One persistent problem is that there is no _safe equivalent to ARRAY_IN, so
that can always fail on us, though it should only do so if the string
passed in wasn't a valid array input format, or the values in the array
can't coerce to the attribute's basetype.

I should also point out that we've lost the ability to check if the export
values were of a type, and if the destination column is also of that type.
That's a non-issue in binary upgrades, but of course if a field changed
from integers to text the histograms would now be highly misleading.
Thoughts on adding a typname parameter that the function uses as a cheap
validity check?

v8 attached, incorporating these suggestions plus those of Bharath and
Bertrand. Still no pg_dump.

As for pg_dump, I'm currently leaning toward the TOC entry having either a
series of commands:

SELECT pg_set_relation_stats('foo.bar'::regclass, ...);
pg_set_attribute_stats('foo.bar'::regclass, 'id'::name, ...); ...

Or one compound command

SELECT pg_set_relation_stats(t.oid, ...)
 pg_set_attribute_stats(t.oid, 'id'::name, ...),
 pg_set_attribute_stats(t.oid, 'last_name'::name, ...),
 ...
FROM (VALUES('foo.bar'::regclass)) AS t(oid);

The second one has the feature that if any one attribute fails, then the
whole update fails, except, of course, for the in-place update of pg_class.
This avoids having an explicit transaction block, but we could get that
back by having restore wrap the list of commands in a transaction block
(and adding the explicit lock commands) when it is safe to do so.
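
To be explicit, the transaction-wrapped form I have in mind would look roughly
like this (a sketch; the lock level is my assumption, mirroring what ANALYZE
takes):

BEGIN;
LOCK TABLE foo.bar IN SHARE UPDATE EXCLUSIVE MODE;
SELECT pg_set_relation_stats('foo.bar'::regclass, ...);
SELECT pg_set_attribute_stats('foo.bar'::regclass, 'id'::name, ...);
COMMIT;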
From bdfde573f4f79770439a1455c1cb337701eb20dc Mon Sep 17 00:00:00 2001
From: Corey Huinker 
Date: Mon, 11 Mar 2024 14:18:39 -0400
Subject: [PATCH v8] Create pg_set_relation_stats, pg_set_attribute_stats.

These functions will be used by pg_dump/restore and pg_upgrade to convey
relation and attribute statistics.

Alternative SAOP support for GiST

2024-03-11 Thread Michał Kłeczek
Hi All,

During my journey of designing Pg based solution at my work I was severely hit 
by several shortcomings in GiST.
The most severe one is the lack of support for SAOP filters as it makes it 
difficult to have partition pruning and index (only) scans working together.
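
For a concrete (made-up) example of the kind of query I mean:

SELECT *
FROM measurements  -- partitioned by device_id
WHERE device_id = ANY ('{1,7,42}')                      -- SAOP prunes partitions
  AND period && tstzrange('2024-01-01', '2024-02-01');  -- wants the GiST index

The SAOP qual drives partition pruning, but a GiST index on (device_id, period)
cannot currently serve it as an index condition, so it degrades to a filter.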

To overcome the difficulties I implemented a simple extension:
https://github.com/mkleczek/btree_gist_extra

Since it provides a separate operator class from btree_gist it requires 
re-indexing the data.
So I thought maybe it would be a good idea to incorporate it into btree_gist.

I am aware of two patches related to SAOP being discussed at the moment but I 
am not sure if SAOP support for GiST indexes is planned.

Let me know if you think it is a good idea to work on a patch.

More general remark:
I am wondering if SAOP support in core should not be implemented by mapping 
SAOP operations to specialised operators and delegating
all the work to index AMs. That way core could be decoupled from the
implementation details of particular index AMs.


Thanks!

—
Michal

Re: Confine vacuum skip logic to lazy_scan_skip

2024-03-11 Thread Melanie Plageman
On Mon, Mar 11, 2024 at 11:29:44AM +0200, Heikki Linnakangas wrote:
>
> > I see why you removed my treatise-level comment that was here about
> > unskipped skippable blocks. However, when I was trying to understand
> > this code, I did wish there was some comment that explained to me why we
> > needed all of the variables next_unskippable_block,
> > next_unskippable_allvis, all_visible_according_to_vm, and current_block.
> > 
> > The idea that we would choose not to skip a skippable block because of
> > kernel readahead makes sense. The part that I had trouble wrapping my
> > head around was that we want to also keep the visibility status of both
> > the beginning and ending blocks of the skippable range and then use
> > those to infer the visibility status of the intervening blocks without
> > another VM lookup if we decide not to skip them.
> 
> Right, I removed the comment because it looked a little out of place and
> it duplicated the other comments sprinkled in the function. I agree this could
> still use some more comments though.
> 
> Here's yet another attempt at making this more readable. I moved the logic
> to find the next unskippable block to a separate function, and added
> comments to make the states more explicit. What do you think?

Oh, I like the new structure. Very cool! Just a few remarks:

> From c21480e9da61e145573de3b502551dde1b8fa3f6 Mon Sep 17 00:00:00 2001
> From: Heikki Linnakangas 
> Date: Fri, 8 Mar 2024 17:32:19 +0200
> Subject: [PATCH v9 1/2] Confine vacuum skip logic to lazy_scan_skip()
> 
> Rename lazy_scan_skip() to heap_vac_scan_next_block() and move more
> code into the function, so that the caller doesn't need to know about
> ranges or skipping anymore. heap_vac_scan_next_block() returns the
> next block to process, and the logic for determining that block is all
> within the function. This makes the skipping logic easier to
> understand, as it's all in the same function, and makes the calling
> code easier to understand as it's less cluttered. The state variables
> needed to manage the skipping logic are moved to LVRelState.
> 
> heap_vac_scan_next_block() now manages its own VM buffer separately
> from the caller's vmbuffer variable. The caller's vmbuffer holds the
> VM page for the current block its processing, while
> heap_vac_scan_next_block() keeps a pin on the VM page for the next
> unskippable block. Most of the time they are the same, so we hold two
> pins on the same buffer, but it's more convenient to manage them
> separately.
> 
> For readability inside heap_vac_scan_next_block(), move the logic of
> finding the next unskippable block to separate function, and add some
> comments.
> 
> This refactoring will also help future patches to switch to using a
> streaming read interface, and eventually AIO
> (https://postgr.es/m/CA%2BhUKGJkOiOCa%2Bmag4BF%2BzHo7qo%3Do9CFheB8%3Dg6uT5TUm2gkvA%40mail.gmail.com)
> 
> Author: Melanie Plageman, with some changes by me

I'd argue you earned co-authorship by now :)

> Discussion: 
> https://postgr.es/m/CAAKRu_Yf3gvXGcCnqqfoq0Q8LX8UM-e-qbm_B1LeZh60f8WhWA%40mail.gmail.com
> ---
>  src/backend/access/heap/vacuumlazy.c | 233 +--
>  1 file changed, 146 insertions(+), 87 deletions(-)
> 
> diff --git a/src/backend/access/heap/vacuumlazy.c 
> b/src/backend/access/heap/vacuumlazy.c
> index ac55ebd2ae..1757eb49b7 100644
> --- a/src/backend/access/heap/vacuumlazy.c
> +++ b/src/backend/access/heap/vacuumlazy.c
> +

>  /*
> - *   lazy_scan_skip() -- set up range of skippable blocks using visibility 
> map.
> + *   heap_vac_scan_next_block() -- get next block for vacuum to process
>   *
> - * lazy_scan_heap() calls here every time it needs to set up a new range of
> - * blocks to skip via the visibility map.  Caller passes the next block in
> - * line.  We return a next_unskippable_block for this range.  When there are
> - * no skippable blocks we just return caller's next_block.  The all-visible
> - * status of the returned block is set in *next_unskippable_allvis for 
> caller,
> - * too.  Block usually won't be all-visible (since it's unskippable), but it
> - * can be during aggressive VACUUMs (as well as in certain edge cases).
> + * lazy_scan_heap() calls here every time it needs to get the next block to
> + * prune and vacuum.  The function uses the visibility map, vacuum options,
> + * and various thresholds to skip blocks which do not need to be processed 
> and

I wonder if "need" is too strong a word since this function
(heap_vac_scan_next_block()) specifically can set blkno to a block which
doesn't *need* to be processed but which it chooses to process because
of SKIP_PAGES_THRESHOLD.

> + * sets blkno to the next block that actually needs to be processed.
>   *
> - * Sets *skipping_current_range to indicate if caller should skip this range.
> - * Costs and benefits drive our decision.  Very small ranges won't be 
> skipped.
> + * The block number and visibility status of the next block to process are
> + * set in *blkno and *all_visible_according_to_vm.  The return value is false
> + * if there are no further blocks to process.
> 

Re: Avoiding inadvertent debugging mode for pgbench

2024-03-11 Thread Alvaro Herrera
On 2024-Mar-01, Euler Taveira wrote:

> I don't like to break backward compatibility but in this case I suspect that 
> it
> is ok. I don't recall the last time I saw a script that makes use of -d 
> option.
> How often do you need a pgbench debug information?

I wondered what the difference actually is, so I checked.  In -i mode,
the only difference is that if the tables don't exist beforehand, we
receive a NOTICE saying that they don't.  In normal mode, the -d switch
emits so much junk that I would believe it if somebody told me that
passing -d distorted the benchmark results; and it's hard to believe
that such output is valuable for anything other than debugging pgbench
itself.

All in all, I support the original patch.

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"I love the Postgres community. It's all about doing things _properly_. :-)"
(David Garamond)




Re: UUID v7

2024-03-11 Thread Jelte Fennema-Nio
Attached a few comment fixes/improvements and a pgindent run (patch 0002-0004)

Now with the added comments, one thing pops out to me: The comments
mention that we use "Monotonic Random", but when I read the spec that
explicitly recommends against using an increment of 1 when using
monotonic random. I feel like if we use an increment of 1, we're
better off going for the "Fixed-Length Dedicated Counter Bits" method
(i.e. change the code to start the counter at 0). See patch 0005 for
an example of that change.

I'm also wondering if we really want to use the extra rand_b bits for
this. The spec says we MAY, but it does remove the amount of
randomness in our UUIDs.


v21-0001-Implement-UUID-v7.patch
Description: Binary data


v21-0003-Run-pgindent.patch
Description: Binary data


v21-0005-Change-to-Fixed-Length-Dedicated-Counter-Bits.patch
Description: Binary data


v21-0004-Fix-comments-a-bit.patch
Description: Binary data


v21-0002-Fix-typos.patch
Description: Binary data


Re: remaining sql/json patches

2024-03-11 Thread Alvaro Herrera
On 2024-Mar-11, Shruthi Gowda wrote:

> *CASE 2:*
> --
> SELECT * FROM JSON_TABLE(jsonb '{
>  "id" : 901,
>  "age" : 30,
>  "*FULL_NAME*" : "KATE DANIEL"}',
> '$'
> COLUMNS(
>  FULL_NAME varchar(20),
>  ID int,
>  AGE int
>   )
>) as t;

I think this is expected: when you use FULL_NAME as a SQL identifier, it
is down-cased, so it no longer matches the uppercase identifier in the
JSON data.  You'd have to do it like this:

SELECT * FROM JSON_TABLE(jsonb '{
 "id" : 901,
 "age" : 30,
 "*FULL_NAME*" : "KATE DANIEL"}',
'$'
COLUMNS(
 "FULL_NAME" varchar(20),
 ID int,
 AGE int
  )
   ) as t;

so that the SQL identifier is not downcased.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/




Re: remaining sql/json patches

2024-03-11 Thread Shruthi Gowda
Hi,
I was experimenting with the v42 patches, and I tried testing without
providing the path explicitly. There is one difference between the two test
cases that I have highlighted in blue.
The full_name column is empty in the second test case result.  Let me know
if this is an issue or expected behaviour.

*CASE 1:*
---
SELECT * FROM JSON_TABLE(jsonb '{
 "id" : 901,
 "age" : 30,
 "*full_name*" : "KATE DANIEL"}',
'$'
COLUMNS(
 FULL_NAME varchar(20),
 ID int,
 AGE  int
  )

   ) as t;

*RESULT:*
  full_name  | id  | age
-+-+-
 KATE DANIEL | 901 | 30

(1 row)

*CASE 2:*
--
SELECT * FROM JSON_TABLE(jsonb '{
 "id" : 901,
 "age" : 30,
 "*FULL_NAME*" : "KATE DANIEL"}',
'$'
COLUMNS(
 FULL_NAME varchar(20),
 ID int,
 AGE int
  )

   ) as t;

*RESULT:*
 full_name | id  | age
---+-+-
   | 901 | 30
(1 row)


Thanks & Regards,
Shruthi K C
EnterpriseDB: http://www.enterprisedb.com


Disconnect if socket cannot be put into non-blocking mode

2024-03-11 Thread Heikki Linnakangas
While self-reviewing my "Refactoring backend fork+exec code" patches, I 
noticed this in pq_init():



/*
 * In backends (as soon as forked) we operate the underlying socket in
 * nonblocking mode and use latches to implement blocking semantics if
 * needed. That allows us to provide safely interruptible reads and
 * writes.
 *
 * Use COMMERROR on failure, because ERROR would try to send the error 
to
 * the client, which might require changing the mode again, leading to
 * infinite recursion.
 */
#ifndef WIN32
if (!pg_set_noblock(MyProcPort->sock))
ereport(COMMERROR,
(errmsg("could not set socket to nonblocking mode: 
%m")));
#endif

#ifndef WIN32

/* Don't give the socket to any subprograms we execute. */
if (fcntl(MyProcPort->sock, F_SETFD, FD_CLOEXEC) < 0)
elog(FATAL, "fcntl(F_SETFD) failed on socket: %m");
#endif


Using COMMERROR here seems bogus. Firstly, if there was a problem with 
recursion, surely the elog(FATAL) that follows would also be wrong. But 
more seriously, it's not cool to continue using the connection as if 
everything is OK, if we fail to put it into non-blocking mode. We should 
disconnect. (COMMERROR merely logs the error, it does not bail out like 
ERROR does)


Barring objections, I'll commit and backpatch the attached to fix that.

--
Heikki Linnakangas
Neon (https://neon.tech)From fc737586783a05672ff8eb96e245dfdeeadcd506 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Mon, 11 Mar 2024 16:36:37 +0200
Subject: [PATCH 1/1] Disconnect if socket cannot be put into non-blocking mode

Commit 387da18874 moved the code to put the socket into non-blocking
mode from socket_set_nonblocking() into the one-time initialization
function, pq_init(). In socket_set_nonblocking(), there indeed was a
risk of recursion on failure like the comment said, but in pq_init(),
ERROR or FATAL is fine. There's even another elog(FATAL) just after
this, if setting FD_CLOEXEC fails.

Note that COMMERROR merely logged the error; it did not close the
connection, so if putting the socket into non-blocking mode failed we
would use the connection anyway. You might not immediately notice,
because most socket operations in a regular backend wait for the
socket to become readable/writable anyway. But e.g. replication will
be quite broken.

Backpatch to all supported versions.

Discussion: xx
---
 src/backend/libpq/pqcomm.c | 6 +-
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
index c606bf34473..d9e49ca7028 100644
--- a/src/backend/libpq/pqcomm.c
+++ b/src/backend/libpq/pqcomm.c
@@ -189,14 +189,10 @@ pq_init(void)
 	 * nonblocking mode and use latches to implement blocking semantics if
 	 * needed. That allows us to provide safely interruptible reads and
 	 * writes.
-	 *
-	 * Use COMMERROR on failure, because ERROR would try to send the error to
-	 * the client, which might require changing the mode again, leading to
-	 * infinite recursion.
 	 */
 #ifndef WIN32
 	if (!pg_set_noblock(MyProcPort->sock))
-		ereport(COMMERROR,
+		ereport(FATAL,
 (errmsg("could not set socket to nonblocking mode: %m")));
 #endif
 
-- 
2.39.2



Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-11 Thread Nathan Bossart
On Mon, Mar 11, 2024 at 04:09:27PM +0530, Amit Kapila wrote:
> I don't see how it will be easier for the user to choose the default
> value of 'max_slot_xid_age' compared to 'max_slot_wal_keep_size'. But,
> I agree similar to 'max_slot_wal_keep_size', 'max_slot_xid_age' can be
> another parameter to allow vacuum to proceed removing the rows which
> otherwise it wouldn't have been as those would be required by some
> slot.

Yeah, the idea is to help prevent transaction ID wraparound, so I would
expect max_slot_xid_age to ordinarily be set relatively high, i.e., 1.5B+.
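
If the proposed GUC lands, I'd expect the setup to look something like this
(the GUC name is from the patch; the value is just the 1.5B figure above):

ALTER SYSTEM SET max_slot_xid_age = 1500000000;
SELECT pg_reload_conf();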

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: WIP Incremental JSON Parser

2024-03-11 Thread Jacob Champion
On Sun, Mar 10, 2024 at 11:43 PM Andrew Dunstan  wrote:
> I haven't managed to reproduce this. But I'm including some tests for it.

If I remove the newline from the end of the new tests:

> @@ -25,7 +25,7 @@ for (my $size = 64; $size > 0; $size--)
>  foreach my $test_string (@inlinetests)
>  {
> my ($fh, $fname) = tempfile();
> -   print $fh "$test_string\n";
> +   print $fh "$test_string";
> close($fh);
>
> foreach my $size (1..6, 10, 20, 30)

then I can reproduce the same result as my local tests. That extra
whitespace appears to help the partial token logic out somehow.

--Jacob




Re: Add Index-level REINDEX with multiple jobs

2024-03-11 Thread Maxim Orlov
On Tue, 6 Feb 2024 at 09:22, Michael Paquier  wrote:

>
> The problem may be actually trickier than that, no?  Could there be
> other factors to take into account for their classification, like
> their sizes (typically, we'd want to process the biggest one first, I
> guess)?


Sorry for the late reply.  Thanks for the explanation.  This sounds
reasonable to me.
Svetlana has addressed this in patch v2.
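
For reference, a rough sketch of the ordering I'd expect the classification to
use, largest index first ('some_table' is a placeholder):

SELECT indexrelid::regclass
FROM pg_index
WHERE indrelid = 'some_table'::regclass
ORDER BY pg_relation_size(indexrelid) DESC;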

-- 
Best regards,
Maxim Orlov.


Re: Reducing output size of nodeToString

2024-03-11 Thread Peter Eisentraut

On 22.02.24 16:07, Matthias van de Meent wrote:

On Thu, 22 Feb 2024 at 13:37, Matthias van de Meent
 wrote:


On Mon, 19 Feb 2024 at 14:19, Matthias van de Meent
 wrote:

Attached the updated version of the patch on top of 5497daf3, which
incorporates this last round of feedback.


Now attached rebased on top of 93db6cbd to fix conflicts with fbc93b8b
and an issue in the previous patchset: I attached one too many v3-0001
from a previous patch I worked on.


... and now with a fix for not overwriting newly deserialized location
attributes with -1, which breaks test output for
READ_WRITE_PARSE_PLAN_TREES installations. Again, no other significant
changes since the patch of last Monday.


* v7-0002-pg_node_tree-Don-t-store-query-text-locations-in-.patch

This patch looks much more complicated than I was expecting.  I had 
suggested to model this after stringToNodeWithLocations().  This uses a 
global variable to toggle the mode.  Your patch creates a function 
nodeToStringNoQLocs() -- why the different naming scheme? -- and passes 
the flag down as an argument to all the output functions.  I mean, in a 
green field, avoiding global variables can be sensible, of course, but I 
think in this limited scope here it would really be better to keep the 
code for the two directions read and write the same.


Attached is a small patch that shows what I had in mind.  (It doesn't 
contain any callers, but your patch shows where all those would go.)



* v7-0003-gen_node_support.pl-Mark-location-fields-as-type-.patch

This looks sensible, but maybe making Location a global type is a bit 
much?  Maybe something more specific like ParseLocation, or ParseLoc, to 
keep it under 12 characters.
From dee403f637f279813f0711bbc64c365cba82c18c Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Mon, 11 Mar 2024 14:07:52 +0100
Subject: [PATCH] Add nodeToStringWithoutLocations()

---
 src/backend/nodes/outfuncs.c | 30 +-
 src/include/nodes/nodes.h|  1 +
 2 files changed, 30 insertions(+), 1 deletion(-)

diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 29cbc83bd9..ea0a6cb94d 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -25,6 +25,8 @@
 #include "nodes/pg_list.h"
 #include "utils/datum.h"
 
+static bool write_location_fields = true;
+
 static void outChar(StringInfo str, char c);
 static void outDouble(StringInfo str, double d);
 
@@ -88,7 +90,7 @@ static void outDouble(StringInfo str, double d);
 
 /* Write a parse location field (actually same as INT case) */
 #define WRITE_LOCATION_FIELD(fldname) \
-   appendStringInfo(str, " :" CppAsString(fldname) " %d", node->fldname)
+   appendStringInfo(str, " :" CppAsString(fldname) " %d", 
write_location_fields ? node->fldname : -1)
 
 /* Write a Node field */
 #define WRITE_NODE_FIELD(fldname) \
@@ -770,6 +772,32 @@ nodeToString(const void *obj)
return str.data;
 }
 
+/*
+ * nodeToStringWithoutLocations -
+ *returns the ascii representation of the Node as a palloc'd string
+ *
+ * This form prints location fields with their default value (not the actual
+ * field value).  This is useful for serializing node trees for storing into
+ * catalogs, where the location information is not useful since the original
+ * query string is not preserved anyway.
+ */
+char *
+nodeToStringWithoutLocations(const void *obj)
+{
+   StringInfoData str;
+   bool        save_write_location_fields;
+
+   save_write_location_fields = write_location_fields;
+   write_location_fields = false;
+
+   initStringInfo(&str);
+   outNode(&str, obj);
+
+   write_location_fields = save_write_location_fields;
+
+   return str.data;
+}
+
 /*
  * bmsToString -
  *returns the ascii representation of the Bitmapset as a palloc'd 
string
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 2969dd831b..3abe47af94 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -195,6 +195,7 @@ extern void outBitmapset(struct StringInfoData *str,
 extern void outDatum(struct StringInfoData *str, uintptr_t value,
 int typlen, bool typbyval);
 extern char *nodeToString(const void *obj);
+extern char *nodeToStringWithoutLocations(const void *obj);
 extern char *bmsToString(const struct Bitmapset *bms);
 
 /*
-- 
2.44.0



Re: UUID v7

2024-03-11 Thread Aleksander Alekseev
Hi,

> Oops, CFbot expectedly found a problem...
> Sorry for the noise, this version, I hope, will pass all the tests.
> Thanks!
>
> Best regards, Andrey Borodin.

I had some issues applying v19 against the current `master` branch.
PFA the rebased and minorly tweaked v20.

The patch LGTM. I think it could be merged unless there are any open
issues left. I don't think so, but maybe I missed something.

-- 
Best regards,
Aleksander Alekseev


v20-0001-Implement-UUID-v7.patch
Description: Binary data


Re: POC, WIP: OR-clause support for indexes

2024-03-11 Thread Andrei Lepikhov

On 11/3/2024 18:31, Alexander Korotkov wrote:

I'm not convinced about this limit. The initial reason was to combine
long lists of ORs into the array because such a transformation made at
an early stage increases efficiency.
I understand the necessity of this limit in the array decomposition
routine but not in the creation one.


The comment near MAX_SAOP_ARRAY_SIZE says that this limit is because
N^2 algorithms could be applied to arrays.  Are you sure that's not
true for our case?
When you operate on an array, indeed. But not when we transform ORs into
an array. Just check all the places in the optimizer, and even the
executor, where we would pass along the list of ORs. This is why I think
we should use this optimization even more intensively for huge numbers
of ORs in an attempt to speed up the overall query.



3) Better save the original order of clauses by putting hash entries and
untransformable clauses to the same list.  A lot of differences in
regression tests output have gone.

I agree that reducing the number of changes in regression tests looks
better. But to achieve this, you introduced a hack that increases the
complexity of the code. Is it worth it? Maybe it would be better to make
one-time changes in tests instead of taking this burden on board. Or
did you mean something more, like introducing a new node type?


For me the reason is not just a regression test.  The current code
keeps the original order of quals as much as possible.  The OR
transformation code reorders quals even in cases when it doesn't
eventually apply any optimization.  I don't think that's acceptable.
However, less hackery ways for this is welcome for sure.
Why is it unacceptable? Can the user implement some order-dependent 
logic with clauses, and will it be correct?
Otherwise, it is a matter of taste, and generally, this decision is up 
to you.



We don't make array values unique.  That might make query execution
performance somewhat worse, and also makes selectivity estimation
worse.  I suggest Andrei and/or Alena should implement making array
values unique.

The fix Alena has made looks correct. But I urge you to think twice:
The optimizer doesn't care about duplicates, so why do we do it?
What's more, this optimization is intended to speed up queries with long
OR lists. Using the list_append_unique() comparator on such lists could
impact performance. I suggest sticking to the common rule and leaving
the responsibility on the user's shoulders.


I don't see why the optimizer doesn't care about duplicates for OR
lists.  As I showed before in an example, it successfully removes the
duplicate.  So, currently OR transformation clearly introduces a
regression in terms of selectivity estimation.  I think we should
evade that.
I think you are right. It is probably a better place than any other to 
remove duplicates in an array. I just think we should sort and remove 
duplicates from entry->consts in one pass. Thus, this optimisation 
should be applied to sortable constants.





At least, we should do this optimization later, in one pass, with
sorting elements before building the array. But what if we don't have a
sort operator for the type?


It was probably discussed before, but can we do our work later?  There
is a canonicalize_qual() which calls find_duplicate_ors().  This is
the place where currently duplicate OR clauses are removed.  Could our
OR-to-ANY transformation be just another call from
canonicalize_qual()?
Hmm, we already tried to do it at that point. I vaguely recall some 
issues caused by this approach. Anyway, it should be done as quickly as 
possible to increase the effect of the optimization.


--
regards,
Andrei Lepikhov
Postgres Professional





RE: Test failures of 100_bugs.pl

2024-03-11 Thread Zhijie Hou (Fujitsu)
On Monday, April 24, 2023 5:50 PM Yu Shi (Fujitsu)  
wrote:
> 
> On Fri, Apr 21, 2023 1:48 PM Yu Shi (Fujitsu)  wrote:
> >
> > I wrote a patch to dump rel state in wait_for_subscription_sync() as
> suggested.
> > Please see the attached patch.
> > I will try to add some debug logs in code later.
> >
> 
> Please see the attached v2 patch.
> 
> I added some debug logs when invalidating syncing table states and updating
> table_states_not_ready list. I also adjusted the message level in the tests 
> which
> failed before.

Just a reference.

I think a similar issue has been analyzed in another thread[1], and the
reason seems clear: the table state cache invalidation got lost due to a
race condition.
The fix is also being discussed there.

[1] 
https://www.postgresql.org/message-id/flat/711a6afe-edb7-1211-cc27-1bef8239eec7%40gmail.com

Best Regards,
Hou zj



Re: Add new error_action COPY ON_ERROR "log"

2024-03-11 Thread Bharath Rupireddy
On Mon, Mar 11, 2024 at 11:16 AM Michael Paquier  wrote:
>
> At the end of the day, this comes down to what is more helpful to the
> user.  And I'd agree on the side what ON_ERROR does currently, which
> is what your patch relies on: on the first conversion failure, give up
> and skip the rest of the row because we cannot trust its contents.
> That's my way of saying that I am fine with the proposal of your
> patch, and that we cannot provide the full state of a row without
> making the error stack of COPY more invasive.

+1.

> +  verbose, a NOTICE message
> +  containing the line number and column name for each discarded row is
> +  emitted.
>
> This should clarify that the column name refers to the attribute where
> the input conversion has failed, I guess.  Specifying only "column
> name" without more context is a bit confusing.

Done.

Please see the attached v6 patch set.
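
For the archives, usage would look like this (option names per the v6 patches;
the exact spelling may still change):

COPY tab FROM STDIN (ON_ERROR ignore, LOG_VERBOSITY verbose);

With verbose, each skipped row then produces a NOTICE naming the line number
and the column whose input conversion failed.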

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com


v6-0001-Add-LOG_VERBOSITY-option-to-COPY-command.patch
Description: Binary data


v6-0002-Add-detailed-info-when-COPY-skips-soft-errors.patch
Description: Binary data


Re: type cache cleanup improvements

2024-03-11 Thread Aleksander Alekseev
Hi,

> Yep, exactly. One time in 2^32 we reset the whole cache instead of one
> (or several) entries with hash value = 0.

Got it. Here is an updated patch where I added a corresponding comment.

Now the patch LGTM. I'm going to change its status to RfC unless
anyone wants to review it too.

-- 
Best regards,
Aleksander Alekseev


v4-0001-Improve-performance-of-type-cache-cleanup.patch
Description: Binary data


Re: make BuiltinTrancheNames less ugly

2024-03-11 Thread Alvaro Herrera
On 2024-Mar-01, Tristan Partin wrote:

> On Fri Mar 1, 2024 at 8:00 AM CST, Alvaro Herrera wrote:

> > I'm pretty disappointed of not being able to remove
> > generate-lwlocknames.pl (it now generates the .h, no longer the .c
> > file), but I can't find a way to do the equivalent #defines in CPP ...
> > it'd require invoking the C preprocessor twice.  Maybe an intermediate
> > .h file would solve the problem, but I have no idea how would that work
> > with Meson.  I guess I'll do it in Make and let somebody suggest a Meson
> > way.
> 
> I can help you with Meson if you get the Make implementation done.

Actually I realized that we need to keep the Perl script anyway because
we want to be able to cross-check the wait_event_names.txt files to
ensure we generate the correct documentation.  Since it was simple
enough, I added the Meson support already.

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
Essentially, you're proposing Kevlar shoes as a solution for the problem
that you want to walk around carrying a loaded gun aimed at your foot.
(Tom Lane)
>From 6359348fafbefb391aad89e47e5e443eb8ab8e29 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Fri, 1 Mar 2024 13:03:10 +0100
Subject: [PATCH v2] Rework lwlocknames.txt to become lwlocklist.h

This saves some code and some space in BuiltinTrancheNames, as foreseen
in commit 74a730631065.

We still have to build lwlocknames.h using Perl code, which initially I
wanted to avoid, but it gives us the chance to cross-check
wait_event_names.txt.

Discussion: https://postgr.es/m/202401231025.gbv4nnte5fmm@alvherre.pgsql
---
 src/backend/Makefile  |  4 +-
 src/backend/storage/lmgr/.gitignore   |  1 -
 src/backend/storage/lmgr/Makefile |  9 +-
 .../storage/lmgr/generate-lwlocknames.pl  | 60 ++---
 src/backend/storage/lmgr/lwlock.c | 15 ++--
 src/backend/storage/lmgr/lwlocknames.txt  | 60 -
 src/backend/storage/lmgr/meson.build  |  2 -
 .../utils/activity/wait_event_names.txt   |  8 +-
 src/include/storage/lwlock.h  |  4 +-
 src/include/storage/lwlocklist.h  | 85 +++
 src/include/storage/meson.build   | 12 ++-
 src/tools/pginclude/headerscheck  |  1 +
 12 files changed, 136 insertions(+), 125 deletions(-)
 delete mode 100644 src/backend/storage/lmgr/lwlocknames.txt
 create mode 100644 src/include/storage/lwlocklist.h

diff --git a/src/backend/Makefile b/src/backend/Makefile
index d66e2a4b9f..79a12dfd1e 100644
--- a/src/backend/Makefile
+++ b/src/backend/Makefile
@@ -110,8 +110,8 @@ $(top_builddir)/src/port/libpgport_srv.a: | submake-libpgport
 parser/gram.h: parser/gram.y
 	$(MAKE) -C parser gram.h
 
-storage/lmgr/lwlocknames.h: storage/lmgr/generate-lwlocknames.pl storage/lmgr/lwlocknames.txt utils/activity/wait_event_names.txt
-	$(MAKE) -C storage/lmgr lwlocknames.h lwlocknames.c
+storage/lmgr/lwlocknames.h: storage/lmgr/generate-lwlocknames.pl ../include/storage/lwlocklist.h utils/activity/wait_event_names.txt
+	$(MAKE) -C storage/lmgr lwlocknames.h
 
 utils/activity/wait_event_types.h: utils/activity/generate-wait_event_types.pl utils/activity/wait_event_names.txt
 	$(MAKE) -C utils/activity wait_event_types.h pgstat_wait_event.c wait_event_funcs_data.c
diff --git a/src/backend/storage/lmgr/.gitignore b/src/backend/storage/lmgr/.gitignore
index dab4c3f580..8e5b734f15 100644
--- a/src/backend/storage/lmgr/.gitignore
+++ b/src/backend/storage/lmgr/.gitignore
@@ -1,3 +1,2 @@
-/lwlocknames.c
 /lwlocknames.h
 /s_lock_test
diff --git a/src/backend/storage/lmgr/Makefile b/src/backend/storage/lmgr/Makefile
index 1aef423384..3f89548bde 100644
--- a/src/backend/storage/lmgr/Makefile
+++ b/src/backend/storage/lmgr/Makefile
@@ -18,7 +18,6 @@ OBJS = \
 	lmgr.o \
 	lock.o \
 	lwlock.o \
-	lwlocknames.o \
 	predicate.o \
 	proc.o \
 	s_lock.o \
@@ -35,11 +34,7 @@ s_lock_test: s_lock.c $(top_builddir)/src/common/libpgcommon.a $(top_builddir)/s
 		$(TASPATH) -L $(top_builddir)/src/common -lpgcommon \
 		-L $(top_builddir)/src/port -lpgport -lm -o s_lock_test
 
-# see notes in src/backend/parser/Makefile
-lwlocknames.c: lwlocknames.h
-	touch $@
-
-lwlocknames.h: $(top_srcdir)/src/backend/storage/lmgr/lwlocknames.txt $(top_srcdir)/src/backend/utils/activity/wait_event_names.txt generate-lwlocknames.pl
+lwlocknames.h: ../../../include/storage/lwlocklist.h ../../utils/activity/wait_event_names.txt generate-lwlocknames.pl
 	$(PERL) $(srcdir)/generate-lwlocknames.pl $^
 
 check: s_lock_test
@@ -47,4 +42,4 @@ check: s_lock_test
 
 clean:
 	rm -f s_lock_test
-	rm -f lwlocknames.h lwlocknames.c
+	rm -f lwlocknames.h
diff --git a/src/backend/storage/lmgr/generate-lwlocknames.pl b/src/backend/storage/lmgr/generate-lwlocknames.pl
index 7b93ecf6c1..eaddd9d3b9 100644
--- a/src/backend/storage/lmgr/generate-lwlocknames.pl
+++ b/src/backend/storage/lmgr/generate-lwlocknames.pl
@@ -1,6 +1,6 @@
 

Re: Function and Procedure with same signature?

2024-03-11 Thread Hannu Krosing
On Thu, Mar 7, 2024 at 5:46 PM Tom Lane  wrote:
>
> Hannu Krosing  writes:
> > On Sat, Feb 10, 2024 at 12:38 AM Tom Lane  wrote:
> >> Worth noting perhaps that this is actually required by the SQL
> >> standard: per spec, functions and procedures are both "routines"
> >> and share the same namespace,
>
> > Can you point me to a place in the standard where it requires all
> > kinds of ROUTINES to be using the same namespace ?
>
> [ digs around a bit... ]  Well, the spec is vaguer about this than
> I thought.  It is clear on one thing: 11.60 
> conformance rules include
...

Thanks for thorough analysis of the standard.

I went and looked more at what other relevant databases do in this
space, based on their documentation.

TL;DR

* MS SQL Server
   - no overloading allowed anywhere
* MySQL
   - no overloading
* Oracle
   - no overloading at top level
   - overloading and independent namespaces for functions and procedures
* Teradata
   - function overloading allowed
   - not clear from the documentation if this also applies to procedures
   - the function overloading docs do not mention possible clashes with
procedure names anywhere
* DB2
   - function overloading fully supported
   - procedure overloading supported, but only for distinct NUMBER OF ARGUMENTS

I'll try to get access to a Teradata instance to verify the above

So at a high level, most other Serious Databases
  - do support function overloading
  - keep functions and procedures in separate namespaces

I still think that PostgreSQL having functions and procedures share
the same namespace is an unneeded and unjustified restriction


I plan to do some hands-on testing on Teradata and DB2 to understand it

But my current thinking is that we should not be more restrictive than
others unless there is a strong technical reason for it. And currently
I do not see any.

> It could be argued that this doesn't prohibit having both a function
> and a procedure with the same data type list, only that you can't
> write ROUTINE when trying to drop or alter one.  But I think that's
> just motivated reasoning.  The paragraphs for  being
> FUNCTION or PROCEDURE are exactly like the above except they say
> "exactly one function" or "exactly one procedure".  If you argue that
> this text means we must allow functions and procedures with the same
> parameter lists, then you are also arguing that we must allow multiple
> functions with the same parameter lists, and it's just the user's
> tough luck if they need to drop one of them.

The main issue is not dropping them, but the inability to determine
which one to call.

We already have this in the case of two overloaded functions with the
same initial argument types and the rest having defaults, where a call
can be ambiguous:
---
hannuk=# create function get(i int, out o int) begin atomic select i; end;
CREATE FUNCTION
hannuk=# create function get(i int, j int default 0, out o int) begin
atomic select i+j; end;
CREATE FUNCTION
hannuk=# select get(1);
ERROR:  function get(integer) is not unique
LINE 1: select get(1);
   ^
HINT:  Could not choose a best candidate function. You might need to
add explicit type casts.
---

> A related point is that our tolerance for overloading routine
> names isn't unlimited: we don't allow duplicate names with the
> same list of input parameters, even if the output parameters are
> different.

This again has a good reason, as there would be many cases where you
could not decide which one to call.

Not allowing overloading based only on return types is common across
all OO languages.

My point is that this does not apply to FUNCTION vs. PROCEDURE as it
is very clear from the CALL syntax which one is meant.
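
To make this concrete, here is a sketch of what separate namespaces
would allow (hypothetical: PostgreSQL rejects the second statement
today):

---
CREATE FUNCTION get(i int) RETURNS int LANGUAGE sql AS 'SELECT i';
CREATE PROCEDURE get(i int) LANGUAGE sql AS 'SELECT 1';  -- fails today

SELECT get(1);  -- can only mean the function
CALL get(1);    -- can only mean the procedure
---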

> This is not something that the SQL spec cares to
> address, but there are good ambiguity-avoidance reasons for it.
> I think limiting overloading so that a ROUTINE specification is
> unambiguous is also a good thing.

I think ROUTINE being unambiguous is not a very good goal.

What if the next version of the standard introduces DROP DATABASE OBJECT?

> I remain unexcited about changing our definition of this.
> "Oracle allows it" is not something that has any weight in
> my view: they have made a bunch of bad design decisions
> as well as good ones, and I think this is a bad one.

Fully agree that  "Oracle allows it" is a non-argument.

My main point is that there is no strong reason to force objects which
are distinct at the syntax level into the same namespace.

# Oracle is actually much more restrictive in the top-level object namespace -

All of the following share the same namespace - [Packages, Private
synonyms, Sequences, Stand-alone procedures, Stand-alone stored
functions, Tables, User-defined operators, User-defined types, Views].
(I guess this makes the parser easier to write)

The equivalent in PostgreSQL would be [extensions, schemas, tables,
procedures, functions and a few more] all sharing one namespace.

Where Oracle *does* allow overloading is "packaged functions and
procedures" which are probably using a separate 

Re: Have pg_basebackup write "dbname" in "primary_conninfo"?

2024-03-11 Thread Amit Kapila
On Tue, Feb 27, 2024 at 2:07 PM Hayato Kuroda (Fujitsu)
 wrote:
>
> > PSA the patch for implementing it. It is basically the same as Ian's one.
> > However, this patch still cannot satisfy the condition 3).
> >
> > `pg_basebackup -D data_N2 -d "user=postgres" -R`
> > -> dbname would not appear in primary_conninfo.
> >
> > This is because the `if (connection_string)` case in GetConnection()
> > explicitly overrides
> > the dbname to "replication". I've tried to add a dummy entry {"dbname", NULL}
> > pair
> > before the overriding, but it is a no-op, because the replacement of the
> > dbname in
> > pqConnectOptions2() would be done only for the valid (=lastly specified)
> > connection options.
>
> Oh, this patch missed the straightforward case:
>
> pg_basebackup -D data_N2 -d "user=postgres dbname=replication" -R
> -> dbname would not appear in primary_conninfo.
>
> So I think it cannot be applied as-is. Sorry for sharing the bad item.
>

Can you please share the patch that can be considered for review?

-- 
With Regards,
Amit Kapila.




Re: POC, WIP: OR-clause support for indexes

2024-03-11 Thread Alexander Korotkov
Hi Andrei,

Thank you for your response.

On Mon, Mar 11, 2024 at 7:13 AM Andrei Lepikhov
 wrote:
> On 7/3/2024 21:51, Alexander Korotkov wrote:
> > Hi!
> >
> > On Tue, Mar 5, 2024 at 9:59 AM Andrei Lepikhov
> > mailto:a.lepik...@postgrespro.ru>> wrote:
> >  > On 5/3/2024 12:30, Andrei Lepikhov wrote:
> >  > > On 4/3/2024 09:26, jian he wrote:
> >  > ... and the new version of the patchset is attached.
> >
> > I made some revisions for the patchset.
> Great!
> > 1) Use hash_combine() to combine hash values.
> Looks better
> > 2) Upper limit the number of array elements by MAX_SAOP_ARRAY_SIZE.
>
> I'm not convinced about this limit. The initial reason was to combine
> long lists of ORs into the array because such a transformation made at
> an early stage increases efficiency.
> I understand the necessity of this limit in the array decomposition
> routine but not in the creation one.

The comment near MAX_SAOP_ARRAY_SIZE says that this limit is because
N^2 algorithms could be applied to arrays.  Are you sure that's not
true for our case?

> > 3) Better save the original order of clauses by putting hash entries and
> > untransformable clauses to the same list.  A lot of differences in
> > regression tests output have gone.
> I agree that reducing the number of changes in regression tests looks
> better. But to achieve this, you introduced a hack that increases the
> complexity of the code. Is it worth it? Maybe it would be better to make
> one-time changes in tests instead of getting this burden on board. Or
> did you mean something more, like introducing a node type?

For me the reason is not just a regression test.  The current code
keeps the original order of quals as much as possible.  The OR
transformation code reorders quals even in cases when it doesn't
eventually apply any optimization.  I don't think that's acceptable.
However, less hacky ways to achieve this are certainly welcome.

> > We don't make array values unique.  That might make query execution
> > performance somewhat worse, and also makes selectivity estimation
> > worse.  I suggest Andrei and/or Alena should implement making array
> > values unique.
> The fix Alena has made looks correct. But I urge you to think twice:
> The optimizer doesn't care about duplicates, so why do we do it?
> What's more, this optimization is intended to speed up queries with long
> OR lists. Using the list_append_unique() comparator on such lists could
> impact performance. I suggest sticking to the common rule and leaving
> the responsibility on the user's shoulders.

I don't see why the optimizer doesn't care about duplicates for OR
lists.  As I showed before in an example, it successfully removes the
duplicate.  So, currently the OR transformation clearly introduces a
regression in terms of selectivity estimation.  I think we should
avoid that.

> At least, we should do this optimization later, in one pass, with
> sorting elements before building the array. But what if we don't have a
> sort operator for the type?

It was probably discussed before, but can we do our work later?  There
is a canonicalize_qual() which calls find_duplicate_ors().  This is
the place where currently duplicate OR clauses are removed.  Could our
OR-to-ANY transformation be just another call from
canonicalize_qual()?

--
Regards,
Alexander Korotkov




Re: Using the %m printf format more

2024-03-11 Thread Dagfinn Ilmari Mannsåker
Michael Paquier  writes:

> On Wed, Mar 06, 2024 at 07:11:19PM +, Dagfinn Ilmari Mannsåker wrote:
>
>> The attached patch does so everywhere appropriate. One place where it's
>> not appropriate is the TAP-emitting functions in pg_regress, since those
>> call fprintf()
>
> I am not really following your argument with pg_regress.c and
> fprintf().  d6c55de1f99a should make that possible even in the case of
> emit_tap_output_v(), no? 

The problem isn't that emit_tap_output_v() doesn't support %m, which it
does, but that it potentially calls fprintf() to output TAP protocol
elements such as "\n" and "# " before it calls vprintf(…, fmt, …), and
those calls might clobber errno.  An option is to make it save errno at
the start and restore it before the vprintf() calls, as in the second
attached patch.

>> and other potentially errno-modifying functions before
>> evaluating the format string.
>
> Sure.

On closer look, fprintf() is actually the only errno-clobbering function
it calls; I was just hedging my bets in that statement.

- ilmari

From 3727341aad84fbd275acb24f92591cc734fdd6a7 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Dagfinn=20Ilmari=20Manns=C3=A5ker?= 
Date: Wed, 6 Mar 2024 16:58:33 +
Subject: [PATCH v2 1/2] Use %m printf format instead of strerror(errno) where
 appropriate

Now that all live branches (12-) support %m in printf formats, let's
use it everywhere appropriate.

Particularly, we're _not_ converting the TAP output calls in
pg_regress, since those functions call fprintf() and friends before
evaluating the format string, which can clobber errno.
---
 src/backend/postmaster/postmaster.c   | 21 ++--
 src/backend/postmaster/syslogger.c|  2 +-
 src/backend/utils/misc/guc.c  |  9 +-
 src/bin/initdb/findtimezone.c |  4 +-
 src/bin/pg_ctl/pg_ctl.c   | 74 +++---
 src/bin/pg_dump/compress_gzip.c   |  2 +-
 src/bin/pg_dump/compress_none.c   |  5 +-
 src/bin/pg_upgrade/check.c| 41 +++-
 src/bin/pg_upgrade/controldata.c  |  6 +-
 src/bin/pg_upgrade/exec.c | 14 ++-
 src/bin/pg_upgrade/file.c | 98 +--
 src/bin/pg_upgrade/function.c |  3 +-
 src/bin/pg_upgrade/option.c   | 10 +-
 src/bin/pg_upgrade/parallel.c | 12 +--
 src/bin/pg_upgrade/pg_upgrade.c   |  4 +-
 src/bin/pg_upgrade/relfilenumber.c|  5 +-
 src/bin/pg_upgrade/tablespace.c   |  4 +-
 src/bin/pg_upgrade/version.c  |  9 +-
 src/common/psprintf.c |  4 +-
 src/interfaces/ecpg/preproc/ecpg.c| 12 +--
 src/port/path.c   |  3 +-
 src/test/isolation/isolationtester.c  |  2 +-
 .../modules/libpq_pipeline/libpq_pipeline.c   |  4 +-
 src/tools/ifaddrs/test_ifaddrs.c  |  2 +-
 24 files changed, 158 insertions(+), 192 deletions(-)

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index f3c09e8dc0..af8a1efe66 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -1375,12 +1375,12 @@ PostmasterMain(int argc, char *argv[])
 
 			/* Make PID file world readable */
 			if (chmod(external_pid_file, S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH) != 0)
-write_stderr("%s: could not change permissions of external PID file \"%s\": %s\n",
-			 progname, external_pid_file, strerror(errno));
+write_stderr("%s: could not change permissions of external PID file \"%s\": %m\n",
+			 progname, external_pid_file);
 		}
 		else
-			write_stderr("%s: could not write external PID file \"%s\": %s\n",
-		 progname, external_pid_file, strerror(errno));
+			write_stderr("%s: could not write external PID file \"%s\": %m\n",
+		 progname, external_pid_file);
 
 		on_proc_exit(unlink_external_pid_file, 0);
 	}
@@ -1589,8 +1589,8 @@ checkControlFile(void)
 	{
 		write_stderr("%s: could not find the database system\n"
 	 "Expected to find it in the directory \"%s\",\n"
-	 "but could not open file \"%s\": %s\n",
-	 progname, DataDir, path, strerror(errno));
+	 "but could not open file \"%s\": %m\n",
+	 progname, DataDir, path);
 		ExitPostmaster(2);
 	}
 	FreeFile(fp);
@@ -6277,15 +6277,13 @@ read_backend_variables(char *id, Port **port, BackgroundWorker **worker)
 	fp = AllocateFile(id, PG_BINARY_R);
 	if (!fp)
 	{
-		write_stderr("could not open backend variables file \"%s\": %s\n",
-	 id, strerror(errno));
+		write_stderr("could not open backend variables file \"%s\": %m\n", id);
 		exit(1);
 	}
 
 	if (fread(&param, sizeof(param), 1, fp) != 1)
 	{
-		write_stderr("could not read from backend variables file \"%s\": %s\n",
-	 id, strerror(errno));
+		write_stderr("could not read from backend variables file \"%s\": %m\n", id);
 		exit(1);
 	}
 
@@ -6293,8 +6291,7 @@ read_backend_variables(char 

Re: Transaction timeout

2024-03-11 Thread Alexander Korotkov
On Mon, Mar 11, 2024 at 12:53 PM Andrey M. Borodin  wrote:
> > On 7 Mar 2024, at 00:55, Alexander Korotkov  wrote:
> >
> > On Wed, Mar 6, 2024 at 10:22 AM Andrey M. Borodin  
> > wrote:
> >>> On 25 Feb 2024, at 21:50, Alexander Korotkov  wrote:
> >>>
> >>> Thank you for the patches.  I've pushed the 0001 patch to avoid
> >>> further failures on buildfarm.  Let 0004 wait till injections points
> >>> by Mechael are committed.
> >>
> >> Thanks!
> >>
> >> All prerequisites are committed. I propose something in a line with this 
> >> patch.
> >
> > Thank you.  I took a look at the patch.  Should we also check the
> > relevant message after the timeout is fired?  We could check it in
> > psql stderr or log for that.
>
> PFA version which checks log output.
> But I could not come up with a proper use of BackgroundPsql->query_until() to 
> check outputs. And there are multiple possible errors.
>
> We can copy test from src/bin/psql/t/001_basic.pl:
>
> # test behavior and output on server crash
> my ($ret, $out, $err) = $node->psql('postgres',
> "SELECT 'before' AS running;\n"
> . "SELECT pg_terminate_backend(pg_backend_pid());\n"
> . "SELECT 'AFTER' AS not_running;\n");
>
> is($ret, 2, 'server crash: psql exit code');
> like($out, qr/before/, 'server crash: output before crash');
> ok($out !~ qr/AFTER/, 'server crash: no output after crash');
> is( $err,
'psql:<stdin>:2: FATAL: terminating connection due to administrator command
psql:<stdin>:2: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
psql:<stdin>:2: error: connection to server was lost',
'server crash: error message');
>
> But I do not see much value in this.
> What do you think?

I think if checking psql stderr is problematic, checking just the logs is
fine.  Could we wait for the relevant log messages one by one with
$node->wait_for_log(), just like 040_standby_failover_slots_sync.pl does?
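
For reference, the behavior under test can be reproduced by hand like
this (a minimal sketch, assuming the transaction_timeout GUC committed
earlier in this thread; the FATAL message is what the test would wait
for in the log):

---
SET transaction_timeout = '100ms';
BEGIN;
SELECT pg_sleep(1);
-- expected: FATAL:  terminating connection due to transaction timeout
---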

--
Regards,
Alexander Korotkov




Re: Reducing the log spam

2024-03-11 Thread Laurenz Albe
On Mon, 2024-03-11 at 09:33 +0100, Jelte Fennema-Nio wrote:
> -   the subscriber's server log.
> +   the subscriber's server log if you remove 23505 from
> +   .
> 
> This seems like a pretty big regression. Being able to know why your
> replication got closed seems pretty critical.

The actual SQLSTATEs that get suppressed are subject to discussion
(and I have a gut feeling that some people will want the list empty).

As far as this specific functionality is concerned, I think that the
actual problem is a deficiency in PostgreSQL.  The problem is that
the log is the *only* place where you can get this information.  That
will be a problem for many people, even without "log_suppress_errcodes".

I think that this information should be available in some statistics
view.
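
For illustration, adjusting the suppression list would look something
like this (log_suppress_errcodes is the GUC proposed in this patch, not
a committed feature; the value shown is made up):

---
ALTER SYSTEM SET log_suppress_errcodes = '3D000,3F000';  -- 23505 removed
SELECT pg_reload_conf();
---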

Yours,
Laurenz Albe




Re: Regardign RecentFlushPtr in WalSndWaitForWal()

2024-03-11 Thread Amit Kapila
On Mon, Mar 11, 2024 at 4:17 PM shveta malik  wrote:
>
> On Sat, Mar 2, 2024 at 4:44 PM Amit Kapila  wrote:
> >
> > Right, I think the quoted code has check "if (!RecoveryInProgress())".
> >
> > >
> >  But apart from that, your
> > > observation seems accurate, yes.
> > >
> >
> > I also find the observation correct and the code has been like that
> > since commit 5a991ef8 [1]. So, let's wait to see if Robert or Andres
> > remembers the reason, otherwise, we should probably nuke this code.
>
> Please find the patch attached for the same.
>

LGTM. I'll push this tomorrow unless I see any comments/objections to
this change.

-- 
With Regards,
Amit Kapila.




Re: Transaction timeout

2024-03-11 Thread Andrey M. Borodin


> On 7 Mar 2024, at 00:55, Alexander Korotkov  wrote:
> 
> On Wed, Mar 6, 2024 at 10:22 AM Andrey M. Borodin  
> wrote:
>>> On 25 Feb 2024, at 21:50, Alexander Korotkov  wrote:
>>> 
>>> Thank you for the patches.  I've pushed the 0001 patch to avoid
>>> further failures on buildfarm.  Let 0004 wait till injections points
>>> by Mechael are committed.
>> 
>> Thanks!
>> 
>> All prerequisites are committed. I propose something in a line with this 
>> patch.
> 
> Thank you.  I took a look at the patch.  Should we also check the
> relevant message after the timeout is fired?  We could check it in
> psql stderr or log for that.

PFA version which checks log output.
But I could not come up with a proper use of BackgroundPsql->query_until() to 
check outputs. And there are multiple possible errors.

We can copy test from src/bin/psql/t/001_basic.pl:

# test behavior and output on server crash
my ($ret, $out, $err) = $node->psql('postgres',
"SELECT 'before' AS running;\n"
. "SELECT pg_terminate_backend(pg_backend_pid());\n"
. "SELECT 'AFTER' AS not_running;\n");

is($ret, 2, 'server crash: psql exit code');
like($out, qr/before/, 'server crash: output before crash');
ok($out !~ qr/AFTER/, 'server crash: no output after crash');
is( $err,
'psql:<stdin>:2: FATAL: terminating connection due to administrator command
psql:<stdin>:2: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
psql:<stdin>:2: error: connection to server was lost',
'server crash: error message');

But I do not see much value in this.
What do you think?


Best regards, Andrey Borodin.



v3-0001-Add-timeouts-TAP-tests.patch
Description: Binary data


Re: Add comment to specify timeout unit in ConditionVariableTimedSleep()

2024-03-11 Thread shveta malik
On Sat, Mar 9, 2024 at 12:19 PM Michael Paquier  wrote:
>
> On Tue, Mar 05, 2024 at 03:20:48PM +0900, Michael Paquier wrote:
> > That sounds like a good idea to me, so I'm OK with your suggestion.
>
> Applied this one as f160bf06f72a.  Thanks.

Thanks!

thanks
Shveta




Re: Regardign RecentFlushPtr in WalSndWaitForWal()

2024-03-11 Thread shveta malik
On Sat, Mar 2, 2024 at 4:44 PM Amit Kapila  wrote:
>
> Right, I think the quoted code has check "if (!RecoveryInProgress())".
>
> >
>  But apart from that, your
> > observation seems accurate, yes.
> >
>
> I also find the observation correct and the code has been like that
> since commit 5a991ef8 [1]. So, let's wait to see if Robert or Andres
> remembers the reason, otherwise, we should probably nuke this code.

Please find the patch attached for the same.

thanks
Shveta


v1-0001-Remove-redundant-RecentFlushPtr-fetch-in-WalSndWa.patch
Description: Binary data


Re: [HACKERS] make async slave to wait for lsn to be replayed

2024-03-11 Thread Alexander Korotkov
Hi!

I've decided to put my hands on this patch.

On Thu, Mar 7, 2024 at 2:25 PM Amit Kapila  wrote:
> +1 for the second one not only because it avoids new words in grammar
> but also sounds to convey the meaning. I think you can explain in docs
> how this feature can be used basically how will one get the correct
> LSN value to specify.

I picked the second option and left only the AFTER clause for the
BEGIN statement.  I think this should be enough for the beginning.

> As suggested previously also pick one of the approaches (I would
> advocate the second one) and keep an option for the second one by
> mentioning it in the commit message. I hope to see more
> reviews/discussions or usage like how will users get the LSN value to
> be specified on the core logic of the feature at this stage. IF
> possible, state, how real-world applications could leverage this
> feature.

I've added a paragraph to the docs about the usage.  After you have made
some changes on the primary, you run pg_current_wal_insert_lsn().  Then
connect to the replica and run 'BEGIN AFTER lsn' with the LSN just
obtained.  Now you're guaranteed to see the changes made on the primary.
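
A minimal usage sketch (the AFTER clause is from the attached v8 patch,
not a released feature; the LSN value and table name are made up):

---
-- on the primary
SELECT pg_current_wal_insert_lsn();  -- say it returns '0/3000060'

-- on the replica
BEGIN AFTER '0/3000060';  -- waits until that LSN has been replayed
SELECT count(*) FROM my_table;  -- now sees the primary's changes
COMMIT;
---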

Also, I've significantly reworked other aspects of the patch.  The
most significant changes are:
1) Waiters are now stored in the array sorted by LSN.  This saves us
from scanning the whole per-backend array.
2) Waiters are removed from the array immediately once their LSNs are
replayed.  Otherwise, the WAL replayer will keep scanning the shared
memory array till waiters wake up.
3) To clean up after errors, we now call WaitLSNCleanup() on backend
shmem exit.  I think this is preferable over the previous approach to
remove from the queue before ProcessInterrupts().
4) There is now a condition to recheck whether the LSN has been replayed
after adding to the shared memory array.  This should protect against
race conditions.
5) I've renamed overly generic function and file names.

--
Regards,
Alexander Korotkov


v8-0001-Implement-AFTER-clause-for-BEGIN-command.patch
Description: Binary data


Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-11 Thread Amit Kapila
On Wed, Mar 6, 2024 at 2:47 PM Bharath Rupireddy
 wrote:
>
> Thanks. v8-0001 is how it looks. Please see the v8 patch set with this change.
>

Commit message says: "Currently postgres has the ability to invalidate
inactive replication slots based on the amount of WAL (set via
max_slot_wal_keep_size GUC) that will be needed for the slots in case
they become active. However, choosing a default value for
max_slot_wal_keep_size is tricky. Because the amount of WAL a customer
generates, and their allocated storage will vary greatly in
production, making it difficult to pin down a one-size-fits-all value.
It is often easy for developers to set an XID age (age of slot's xmin
or catalog_xmin) of say 1 or 1.5 billion, after which the slots get
invalidated."

I don't see how it will be easier for the user to choose the default
value of 'max_slot_xid_age' compared to 'max_slot_wal_keep_size'. But,
I agree similar to 'max_slot_wal_keep_size', 'max_slot_xid_age' can be
another parameter to allow vacuum to proceed removing the rows which
otherwise it wouldn't have been as those would be required by some
slot. Now, if this understanding is correct, we should probably make
this invalidation happen by (auto)vacuum after computing the age based
on this new parameter.
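
For reference, a sketch of how the proposed parameter would be set (the
GUC name is from the patch set and not committed; the value is
arbitrary):

---
ALTER SYSTEM SET max_slot_xid_age = 1500000000;
SELECT pg_reload_conf();
---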

-- 
With Regards,
Amit Kapila.




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-11 Thread Amit Kapila
On Fri, Mar 8, 2024 at 10:42 PM Bharath Rupireddy
 wrote:
>
> On Wed, Mar 6, 2024 at 4:49 PM Amit Kapila  wrote:
> >
> > You might want to consider its interaction with sync slots on standby.
> > Say, there is no activity on slots in terms of processing the changes
> > for slots. Now, we won't perform sync of such slots on standby showing
> > them inactive as per your new criteria where as same slots could still
> > be valid on primary as the walsender is still active. This may be more
> > of a theoretical point as in running system there will probably be
> > some activity but I think this needs some thougths.
>
> I believe the xmin and catalog_xmin of the sync slots on the standby
> keep advancing depending on the slots on the primary, no? If yes, the
> XID age based invalidation shouldn't be a problem.
>
> I believe there are no walsenders started for the sync slots on the
> standbys, right? If yes, the inactive timeout based invalidation also
> shouldn't be a problem. Because, the inactive timeouts for a slot are
> tracked only for walsenders because they are the ones that typically
> hold replication slots for longer durations and for real replication
> use. We did a similar thing in a recent commit [1].
>
> Is my understanding right?
>

Yes, your understanding is correct. I wanted us to consider having new
parameters like 'inactive_replication_slot_timeout' at slot level
instead of as a GUC. This new parameter doesn't seem to be similar to
'max_slot_wal_keep_size', which leads to truncation of WAL globally and
then invalidates the appropriate slots. OTOH,
'inactive_replication_slot_timeout' doesn't appear to have a similar
global effect. The other thing we should consider is what happens if
checkpoints occur at an interval greater than
'inactive_replication_slot_timeout'. Shall we consider doing the
invalidation via some other background process, or do we think
checkpointer is the best we can have?

>
> Do you still see any problems with it?
>

Sorry, I haven't done any detailed review yet so can't say with
confidence whether there is any problem or not w.r.t sync slots.

-- 
With Regards,
Amit Kapila.




Re: Proposal to add page headers to SLRU pages

2024-03-11 Thread Li, Yong

> On Mar 9, 2024, at 05:22, Jeff Davis  wrote:
> 
> On Wed, 2024-03-06 at 12:01 +, Li, Yong wrote:
>> Rebase the patch against the latest HEAD.
> 
> The upgrade logic could use more comments explaining what's going on
> and why. As I understand it, it's a one-time conversion that needs to
> happen between 16 and 17. Is that right?
> 
> Regards,
>Jeff Davis
> 

> In the new code we effectively store only one LSN per page, which I
> understand is strictly worse.  Maybe the idea of doing away with these
> LSN groups should be reconsidered ... unless I completely misunderstand
> the whole thing.
> 
> --
> Álvaro Herrera PostgreSQL Developer  —


Thanks for the comments on LSN groups and pg_upgrade.

I have updated the patch to address both comments:
- The clog LSN group has been brought back.
  Now the page LSN on each clog page is used for honoring the write-ahead rule
  and it is always the highest LSN of all the LSN groups on the page.
  The LSN groups are used by TransactionIdGetStatus() as before.
- New comments have been added to pg_upgrade to mention the SLRU
  page header change as the reason for upgrading clog files.

Regards,
Yong

slru_page_header_v6.patch
Description: slru_page_header_v6.patch


Re: Statistics Import and Export

2024-03-11 Thread Stephen Frost
Greetings,

* Corey Huinker (corey.huin...@gmail.com) wrote:
> > > +/*
> > > + * Set statistics for a given pg_class entry.
> > > + *
> > > + * pg_set_relation_stats(relation Oid, reltuples double, relpages int)
> > > + *
> > > + * This does an in-place (i.e. non-transactional) update of pg_class,
> > just as
> > > + * is done in ANALYZE.
> > > + *
> > > + */
> > > +Datum
> > > +pg_set_relation_stats(PG_FUNCTION_ARGS)
> > > +{
> > > + const char *param_names[] = {
> > > + "relation",
> > > + "reltuples",
> > > + "relpages",
> > > + };
> > > +
> > > + Oid relid;
> > > + Relationrel;
> > > + HeapTuple   ctup;
> > > + Form_pg_class   pgcform;
> > > +
> > > + for (int i = 0; i <= 2; i++)
> > > + if (PG_ARGISNULL(i))
> > > + ereport(ERROR,
> > > +
> >  (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> > > +  errmsg("%s cannot be NULL",
> > param_names[i])));
> >
> > Why not just mark this function as strict..?  Or perhaps we should allow
> > NULLs to be passed in and just not update the current value in that
> > case?
> 
> Strict could definitely apply here, and I'm inclined to make it so.

Having thought about it a bit more, I generally like the idea of being
able to just update one stat instead of having to update all of them at
once (and therefore having to go look up what the other values currently
are...).  That said, per below, perhaps making it strict is the better
plan.

> > Also, in some cases we allow the function to be called with a
> > NULL but then make it a no-op rather than throwing an ERROR (eg, if the
> > OID ends up being NULL).
> 
> Thoughts on it emitting a WARN or NOTICE before returning false?

Eh, I don't think so?

Where this is coming from is that we can often end up with functions
like these being called inside of larger queries, and having them spit
out WARN or NOTICE will just make them noisy.

That leads to my general feeling of just returning NULL if called with a
NULL OID, as we would get with setting the function strict.
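
A quick illustration of that strict behavior (the function below is a
stand-in, not the actual pg_set_relation_stats):

---
CREATE FUNCTION demo_set_stats(relid oid, ntuples float4) RETURNS bool
STRICT LANGUAGE sql AS 'SELECT true';

SELECT demo_set_stats(NULL, 100);  -- returns NULL; the body never runs
---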

> >   Not sure if that makes sense here or not
> > offhand but figured I'd mention it as something to consider.
> >
> > > + pgcform = (Form_pg_class) GETSTRUCT(ctup);
> > > + pgcform->reltuples = PG_GETARG_FLOAT4(1);
> > > + pgcform->relpages = PG_GETARG_INT32(2);
> >
> > Shouldn't we include relallvisible?
> 
> Yes. No idea why I didn't have that in there from the start.

Ok.

> > Also, perhaps we should use the approach that we have in ANALYZE, and
> > only actually do something if the values are different rather than just
> > always doing an update.
> 
> That was how it worked back in v1, more for the possibility that there was
> no matching JSON to set values.
> 
> Looking again at analyze.c (currently lines 1751-1780), we just check if
> there is a row in the way, and if so we replace it. I don't see where we
> compare existing values to new values.

Well, that code is for pg_statistic while I was looking at pg_class (in
vacuum.c:1428-1443, where we track if we're actually changing anything
and only make the pg_class change if there's actually something
different):

vacuum.c:1531
/* If anything changed, write out the tuple. */
if (dirty)
heap_inplace_update(rd, ctup);

Not sure why we don't treat both the same way though ... although it's
probably the case that it's much less likely to have an entire
pg_statistic row be identical than the few values in pg_class.

> > > +Datum
> > > +pg_set_attribute_stats(PG_FUNCTION_ARGS)
> >
> > > + /* names of columns that cannot be null */
> > > + const char *required_param_names[] = {
> > > + "relation",
> > > + "attname",
> > > + "stainherit",
> > > + "stanullfrac",
> > > + "stawidth",
> > > + "stadistinct",
> > > + "stakind1",
> > > + "stakind2",
> > > + "stakind3",
> > > + "stakind4",
> > > + "stakind5",
> > > + };
> >
> > Same comment here as above wrt NULL being passed in.
> 
> In this case, the last 10 params (stanumbersN and stavaluesN) can be null,
> and are NULL more often than not.

Hmm, that's a valid point, so a NULL passed in would need to actually set
that value to NULL, presumably.  Perhaps then we should have
pg_set_relation_stats() be strict and have pg_set_attribute_stats()
handle NULLs passed in appropriately, returning NULL if the relation
itself, attname, or another required (not NULL'able) argument passed in
is NULL.

(What I'm trying to drive at here is a consistent interface for these
functions, but one which does a no-op instead of returning an ERROR on
values being passed in which aren't allowable; it can be quite
frustrating trying to get a query to work where one of the functions
decides to return ERROR instead of 

Re: Support a wildcard in backtrace_functions

2024-03-11 Thread Jelte Fennema-Nio
On Fri, 8 Mar 2024 at 17:17, Bharath Rupireddy
 wrote:
> Hm, we may not want backtrace_on_internal_error to be on by default.
> AFAICS, all ERRCODE_INTERNAL_ERROR are identifiable with the error
> message, so it's sort of easy for one to track down the cause of it
> without actually needing to log backtrace by default.

While maybe all messages uniquely identify the exact error, these
errors usually originate somewhere deep down the call stack in a
function that is called from many other places. Having the full stack
trace can thus greatly help us to find what caused this specific error
to occur. I think that would be quite useful to enable by default, if
only because it would make many bug reports much more actionable.

> 1. Rename the new GUC backtrace_functions_min_level to backtrace_min_level.
> 2. Add 'internal' to backtrace_min_level_options enum
> +static const struct config_enum_entry backtrace_functions_level_options[] = {
> +   {"internal", INTERNAL, false},
> +{"debug5", DEBUG5, false},
> +{"debug4", DEBUG4, false},
> 3. Remove backtrace_on_internal_error GUC which is now effectively
> covered by backtrace_min_level = 'internal';
>
> Does anyone see a problem with it?

Honestly, it seems a bit confusing to me to add INTERNAL as a level,
since it's an error code, not a log level. Also, merging it in this way
brings up certain questions:
1. How do you get the current backtrace_on_internal_error=true
behaviour? Would you need to set both backtrace_functions='*' and
backtrace_min_level=INTERNAL?
2. What is the default value for backtrace_min_level?
3. You still wouldn't be able to limit INTERNAL errors to FATAL level

I personally think having three GUCs in this patchset make sense,
especially since I think it would be good to turn
backtrace_on_internal_error on by default. The only additional change
that I think might be worth making is to make
backtrace_on_internal_error take a level argument, so that you could
configure postgres to only add stack traces to FATAL internal errors.
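
To spell that out, the configuration I have in mind would look roughly
like this (the '*' wildcard and backtrace_functions_min_level come from
this patch set; backtrace_on_internal_error and backtrace_functions
already exist):

---
ALTER SYSTEM SET backtrace_on_internal_error = on;
ALTER SYSTEM SET backtrace_functions = '*';
ALTER SYSTEM SET backtrace_functions_min_level = 'error';
SELECT pg_reload_conf();
---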

(attached is a patch that should fix the CI issue by adding
GUC_NOT_IN_SAMPLE backtrace_functions_min_level)


v7-0002-Add-wildcard-support-to-backtrace_functions-GUC.patch
Description: Binary data


v7-0001-Add-backtrace_functions_min_level.patch
Description: Binary data


Re: May be BUG. Periodic burst growth of the checkpoint_req counter on replica.

2024-03-11 Thread Alexander Korotkov
On Mon, Mar 11, 2024 at 5:43 AM Anton A. Melnikov
 wrote:
> On 11.03.2024 03:39, Alexander Korotkov wrote:
> > Now that we distinguish stats for checkpoints and
> > restartpoints, we need to update the docs.  Please, check the patch
> > attached.
>
> Maybe bring the pg_stat_get_checkpointer_buffers_written() description in
> line with these changes?
> Like this:
>
> --- a/src/include/catalog/pg_proc.dat
> +++ b/src/include/catalog/pg_proc.dat
> @@ -5740 +5740 @@
> -  descr => 'statistics: number of buffers written by the checkpointer',
> +  descr => 'statistics: number of buffers written during checkpoints and 
> restartpoints',

This makes sense.  I've included this into the revised patch.

> And after I took a fresh look at this patch I noticed a simple way to extract
> write_time and sync_time counters for restartpoints too.
>
> What do you think, does it make sense to do this?

I'm not sure we need this.  The ways we trigger checkpoints and
restartpoints are different.  This is why we needed separate
statistics.  But the process of writing buffers is the same.
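
For the archives, the split statistics can be inspected like this
(column names as in the restartpoint-split commit discussed here; worth
double-checking against the current catalog):

---
SELECT num_timed, num_requested,
       restartpoints_timed, restartpoints_req, restartpoints_done,
       buffers_written
FROM pg_stat_checkpointer;
---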

--
Regards,
Alexander Korotkov


rename_pg_stat_get_checkpointer_fields_v2.patch
Description: Binary data


Re: Confine vacuum skip logic to lazy_scan_skip

2024-03-11 Thread Heikki Linnakangas

On 08/03/2024 19:34, Melanie Plageman wrote:

On Fri, Mar 08, 2024 at 06:07:33PM +0200, Heikki Linnakangas wrote:

On 08/03/2024 02:46, Melanie Plageman wrote:

On Wed, Mar 06, 2024 at 10:00:23PM -0500, Melanie Plageman wrote:

On Wed, Mar 06, 2024 at 09:55:21PM +0200, Heikki Linnakangas wrote:

However, by adding a vmbuffer to next_block_state, the callback may be
able to avoid extra VM fetches from one invocation to the next.


That's a good idea, holding separate VM buffer pins for the next-unskippable
block and the block we're processing. I adopted that approach.


Cool. It can't be avoided with streaming read vacuum, but I wonder if
there would ever be adverse effects to doing it on master? Maybe if we
are doing a lot of skipping and the block of the VM for the heap blocks
we are processing ends up changing each time but we would have had the
right block of the VM if we used the one from
heap_vac_scan_next_block()?

Frankly, I'm in favor of just doing it now because it makes
lazy_scan_heap() less confusing.


+1


I'm pretty happy with the attached patches now. The first one fixes the
existing bug I mentioned in the other email (based on the on-going
discussion that might not how we want to fix it though).


ISTM we should still do the fix you mentioned -- seems like it has more
upsides than downsides?


 From b68cb29c547de3c4acd10f31aad47b453d154666 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Fri, 8 Mar 2024 16:00:22 +0200
Subject: [PATCH v8 1/3] Set all_visible_according_to_vm correctly with
  DISABLE_PAGE_SKIPPING

It's important for 'all_visible_according_to_vm' to correctly reflect
whether the VM bit is set or not, even when we are not trusting the VM
to skip pages, because contrary to what the comment said,
lazy_scan_prune() relies on it.

If it's incorrectly set to 'false', when the VM bit is in fact set,
lazy_scan_prune() will try to set the VM bit again and dirty the page
unnecessarily. As a result, if you used DISABLE_PAGE_SKIPPING, all
heap pages were dirtied, even if there were no changes. We would also
fail to clear any VM bits that were set incorrectly.

This was broken in commit 980ae17310, so backpatch to v16.


LGTM.


Committed and backpatched this.


 From 47af1ca65cf55ca876869b43bff47f9d43f0750e Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Fri, 8 Mar 2024 17:32:19 +0200
Subject: [PATCH v8 2/3] Confine vacuum skip logic to lazy_scan_skip()
---
  src/backend/access/heap/vacuumlazy.c | 256 +++
  1 file changed, 141 insertions(+), 115 deletions(-)

diff --git a/src/backend/access/heap/vacuumlazy.c 
b/src/backend/access/heap/vacuumlazy.c
index ac55ebd2ae5..0aa08762015 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -204,6 +204,12 @@ typedef struct LVRelState
int64   live_tuples;/* # live tuples remaining */
int64   recently_dead_tuples;   /* # dead, but not yet 
removable */
int64   missed_dead_tuples; /* # removable, but not removed */


Perhaps we should add a comment to the blkno member of LVRelState
indicating that it is used for error reporting and logging?


Well, it's already under the "/* Error reporting state */" section. I 
agree this is a little confusing: the name 'blkno' doesn't convey that 
it's supposed to be used just for error reporting. But it's a 
pre-existing issue, so I left it alone. It can be changed with a separate 
patch if we come up with a good idea.



I see why you removed my treatise-level comment that was here about
unskipped skippable blocks. However, when I was trying to understand
this code, I did wish there was some comment that explained to me why we
needed all of the variables next_unskippable_block,
next_unskippable_allvis, all_visible_according_to_vm, and current_block.

The idea that we would choose not to skip a skippable block because of
kernel readahead makes sense. The part that I had trouble wrapping my
head around was that we want to also keep the visibility status of both
the beginning and ending blocks of the skippable range and then use
those to infer the visibility status of the intervening blocks without
another VM lookup if we decide not to skip them.


Right, I removed the comment because it looked a little out of place and 
it duplicated the other comments sprinkled in the function. I agree this 
could still use some more comments though.


Here's yet another attempt at making this more readable. I moved the 
logic to find the next unskippable block to a separate function, and 
added comments to make the states more explicit. What do you think?


--
Heikki Linnakangas
Neon (https://neon.tech)
From c21480e9da61e145573de3b502551dde1b8fa3f6 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Fri, 8 Mar 2024 17:32:19 +0200
Subject: [PATCH v9 1/2] Confine vacuum skip logic to lazy_scan_skip()

Rename lazy_scan_skip() to heap_vac_scan_next_block() and move more
code into the function, so 

Re: automating RangeTblEntry node support

2024-03-11 Thread Peter Eisentraut

On 20.02.24 08:57, Peter Eisentraut wrote:

On 18.02.24 00:06, Matthias van de Meent wrote:

I'm not sure that the cleanup which is done when changing a RTE's
rtekind is also complete enough for this purpose.
Things like inline_cte_walker change the node->rtekind, which could
leave residual junk data in fields that are currently dropped during
serialization (as the rtekind specifically ignores those fields), but
which would add overhead when the default omission is expected to
handle these fields; as they could then contain junk. It looks like
there is some care about zeroing now unused fields, but I haven't
checked that it covers all cases and fields to the extent required so
that removing this specialized serializer would have zero impact on
size once the default omission patch is committed.

An additional patch with a single function that for this purpose
clears junk fields from RTEs that changed kind would be appreciated:
it is often hand-coded at those locations the kind changes, but that's
more sensitive to programmer error.


Yes, interesting idea.  Or maybe an assert-like function that checks an 
existing structure for consistency.  Or maybe both.  I'll try this out.


In the meantime, if there are no remaining concerns, I propose to commit 
the first two patches


Remove custom Constraint node read/write implementations
Remove custom _jumbleRangeTblEntry()


After a few side quests, here is an updated patch set.  (I had committed 
the first of the two patches mentioned above, but not yet the second one.)


v3-0001-Remove-obsolete-comment.patch
v3-0002-Improve-comment.patch

These just update a few comments around the RangeTblEntry definition.

v3-0003-Reformat-some-node-comments.patch
v3-0004-Remove-custom-_jumbleRangeTblEntry.patch

This is pretty much the same patch as before.  I have now split it up to 
first reformat the comments to make room for the node annotations.  This 
patch is now also pgindent-proof.  After some side quest discussions, 
the set of fields to jumble seems correct now, so commit message 
comments to the contrary have been dropped.


v3-0005-Make-RangeTblEntry-dump-order-consistent.patch

I separated that from the 0008 patch below.  I think it is useful even if 
we don't go ahead with 0008 now, for example in dumps from the debugger, 
and just in general to keep everything more consistent.


v3-0006-WIP-AssertRangeTblEntryIsValid.patch

This is in response to some of the discussions where there was some 
doubt whether all fields are always filled and cleared correctly when 
the RTE kind is changed.  Seems correct as far as this goes.  I didn't 
know of a good way to hook this in, so I put it into the write/read 
functions, which is obviously a bit weird if I'm proposing to remove 
them later.  Consider it a proof of concept.


v3-0007-Simplify-range_table_mutator_impl-and-range_table.patch
v3-0008-WIP-Remove-custom-RangeTblEntry-node-read-write-i.patch

At this point, I'm not too stressed about pressing forward with these 
last two patches.  We can look at them again perhaps if we make progress 
on a more compact node output format.  When I started this thread, I had 
a lot of questions about various details about the RangeTblEntry struct, 
and we have achieved many answers during the discussions, so I'm happy 
with the progress.  So for PG17, I'd like to just do patches 0001..0005.
From dc53b7ddfc5aa004a0d222b4084a1c580f05a296 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Mon, 11 Mar 2024 09:15:33 +0100
Subject: [PATCH v3 1/8] Remove obsolete comment

The idea to use a union in the definition of RangeTblEntry is clearly
not being pursued.

Discussion: 
https://www.postgresql.org/message-id/flat/4b27fc50-8cd6-46f5-ab20-88dbaadca...@eisentraut.org
---
 src/include/nodes/parsenodes.h | 6 --
 1 file changed, 6 deletions(-)

diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 2380821600..5113f97363 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -1028,12 +1028,6 @@ typedef struct RangeTblEntry
 
RTEKind rtekind;/* see above */
 
-   /*
-    * XXX the fields applicable to only some rte kinds should be merged into
-    * a union.  I didn't do this yet because the diffs would impact a lot of
-    * code that is being actively worked on.  FIXME someday.
-    */
-
/*
 * Fields valid for a plain relation RTE (else zero):
 *

base-commit: af0e7deb4a1c369bb8154ac55f085d6a93fe5c35
-- 
2.44.0

From 9fd2ae6a561ea72f627057d922e48d92eaafe099 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Mon, 11 Mar 2024 09:15:33 +0100
Subject: [PATCH v3 2/8] Improve comment

Clarify that RangeTblEntry.lateral reflects whether LATERAL was
specified in the statement (as opposed to whether lateralness is
implicit).  Also, the list of applicable entry types was incomplete.

Discussion: 

Re: remaining sql/json patches

2024-03-11 Thread jian he
Hi.
more minor issues.

By searching for `elog(ERROR, "unrecognized node type: %d"`,
I found that generally an enum is cast to int before printing it out.
I also found a related post at [1].

So I added the typecast to int before printing.
Most of the refactored code is unlikely to be reachable, but still.

I also refactored ExecPrepareJsonItemCoercion error message, to make
the error message more explicit.
@@ -4498,7 +4498,9 @@ ExecPrepareJsonItemCoercion(JsonbValue *item, JsonExprState *jsestate,
 	if (throw_error)
 		ereport(ERROR,
 				errcode(ERRCODE_SQL_JSON_ITEM_CANNOT_BE_CAST_TO_TARGET_TYPE),
-				errmsg("SQL/JSON item cannot be cast to target type"));
+				errcode(ERRCODE_SQL_JSON_ITEM_CANNOT_BE_CAST_TO_TARGET_TYPE),
+				errmsg("SQL/JSON item cannot be cast to type %s",
+					   format_type_be(jsestate->jsexpr->returning->typid)));

+ /*
+ * We abuse CaseTestExpr here as placeholder to pass the result of
+ * evaluating the JSON_VALUE/QUERY jsonpath expression as input to the
+ * coercion expression.
+ */
+ CaseTestExpr *placeholder = makeNode(CaseTestExpr);
Typo in the comment: should it be `JSON_VALUE/JSON_QUERY`?

[1] 
https://stackoverflow.com/questions/8012647/can-we-typecast-a-enum-variable-in-c


v42-0001-miscellaneous-fix-based-on-v42.no-cfbot
Description: Binary data


Re: Add bump memory context type and use it for tuplesorts

2024-03-11 Thread John Naylor
On Tue, Mar 5, 2024 at 9:42 AM David Rowley  wrote:
> performance against the recently optimised aset, generation and slab
> contexts.  The attached graph shows the time it took in seconds to
> allocate 1GB of memory performing a context reset after 1MB. The
> function I ran the test on is in the attached
> pg_allocate_memory_test.patch.txt file.
>
> The query I ran was:
>
> select chksz,mtype,pg_allocate_memory_test_reset(chksz,
> 1024*1024,1024*1024*1024, mtype) from (values(8),(16),(32),(64))
> sizes(chksz),(values('aset'),('generation'),('slab'),('bump'))
> cxt(mtype) order by mtype,chksz;

I ran the test function, but using 256kB and 3MB for the reset
frequency, and with 8,16,24,32 byte sizes (patched against a commit
after the recent hot/cold path separation). Images attached. I also
get a decent speedup with the bump context, but not quite as dramatic
as on your machine. It's worth noting that slab is the slowest for me.
This is an Intel i7-10750H.


Re: Statistics Import and Export

2024-03-11 Thread Bertrand Drouvot
Hi,

On Fri, Mar 08, 2024 at 01:35:40AM -0500, Corey Huinker wrote:
> Anyway, here's v7. Eagerly awaiting feedback.

Thanks!

A few random comments:

1 ===

+The purpose of this function is to apply statistics values in an
+upgrade situation that are "good enough" for system operation until

Worth adding a few words about the "influencing the planner" use case?

2 ===

+#include "catalog/pg_type.h"
+#include "fmgr.h"

Are those 2 needed?

3 ===

+   if (!HeapTupleIsValid(ctup))
+   elog(ERROR, "pg_class entry for relid %u vanished during 
statistics import",

s/during statistics import/when setting statistics/?

4 ===

+Datum
+pg_set_relation_stats(PG_FUNCTION_ARGS)
+{
.
.
+   table_close(rel, ShareUpdateExclusiveLock);
+
+   PG_RETURN_BOOL(true);

Why returning a bool? (I mean we'd throw an error or return true).

5 ===

+ */
+Datum
+pg_set_attribute_stats(PG_FUNCTION_ARGS)
+{

This function is not that simple; worth explaining its logic in a comment above?

6 ===

+   if (!HeapTupleIsValid(tuple))
+   {
+   relation_close(rel, NoLock);
+   PG_RETURN_BOOL(false);
+   }
+
+   attr = (Form_pg_attribute) GETSTRUCT(tuple);
+   if (attr->attisdropped)
+   {
+   ReleaseSysCache(tuple);
+   relation_close(rel, NoLock);
+   PG_RETURN_BOOL(false);
+   }

Why is it returning "false" and not throwing an error? (if ok, then I think
we can get rid of returning a bool).

7 ===

+* If this relation is an index and that index has expressions in
+* it, and the attnum specified

s/is an index and that index has/is an index that has/?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Add Index-level REINDEX with multiple jobs

2024-03-11 Thread Svetlana Derevyanko

Andrey M. Borodin wrote on 2024-03-04 09:26:

On 6 Feb 2024, at 11:21, Michael Paquier  wrote:

The problem may be actually trickier than that, no?  Could there be
other factors to take into account for their classification, like
their sizes (typically, we'd want to process the biggest one first, I
guess)?


Maxim, what do you think about it?


Best regards, Andrey Borodin.


I agree that, as is the case with tables in REINDEX SCHEMA, indices
should be sorted according to their size.


Attaching the second version of the patch, with indices sorted by
size.


Best regards,
Svetlana Derevyanko

From 3c2f0fcbcf382cc2cb9786e26ad8027558937ea2 Mon Sep 17 00:00:00 2001
From: Svetlana Derevyanko 
Date: Fri, 29 Dec 2023 08:20:54 +0300
Subject: [PATCH v2] Add Index-level REINDEX with multiple jobs

Author: Maxim Orlov , Svetlana Derevyanko 
---
 doc/src/sgml/ref/reindexdb.sgml|   3 +-
 src/bin/scripts/reindexdb.c| 159 ++---
 src/bin/scripts/t/090_reindexdb.pl |   8 +-
 3 files changed, 151 insertions(+), 19 deletions(-)

diff --git a/doc/src/sgml/ref/reindexdb.sgml b/doc/src/sgml/ref/reindexdb.sgml
index 8d9ced212f..dfb5e534fb 100644
--- a/doc/src/sgml/ref/reindexdb.sgml
+++ b/doc/src/sgml/ref/reindexdb.sgml
@@ -187,8 +187,7 @@ PostgreSQL documentation
 setting is high enough to accommodate all connections.


-Note that this option is incompatible with the --index
-and --system options.
+Note that this option is incompatible with the --system option.

   
  
diff --git a/src/bin/scripts/reindexdb.c b/src/bin/scripts/reindexdb.c
index 6ae30dff31..63d99e7441 100644
--- a/src/bin/scripts/reindexdb.c
+++ b/src/bin/scripts/reindexdb.c
@@ -251,14 +251,6 @@ main(int argc, char *argv[])
 	}
 	else
 	{
-		/*
-		 * Index-level REINDEX is not supported with multiple jobs as we
-		 * cannot control the concurrent processing of multiple indexes
-		 * depending on the same relation.
-		 */
-		if (concurrentCons > 1 && indexes.head != NULL)
-			pg_fatal("cannot use multiple jobs to reindex indexes");
-
 		if (dbname == NULL)
 		{
 			if (getenv("PGDATABASE"))
@@ -279,7 +271,7 @@ main(int argc, char *argv[])
 		if (indexes.head != NULL)
 			reindex_one_database(&cparams, REINDEX_INDEX, &indexes,
  progname, echo, verbose,
- concurrently, 1, tablespace);
+ concurrently, concurrentCons, tablespace);
 
 		if (tables.head != NULL)
 			reindex_one_database(&cparams, REINDEX_TABLE, &tables,
@@ -299,6 +291,67 @@ main(int argc, char *argv[])
 	exit(0);
 }
 
+static SimpleStringList *
+sort_indices_by_tables(PGconn *conn, SimpleStringList *user_list, SimpleStringList *indices_tables_list,
+	   SimpleStringList ***indices_process_list, bool echo)
+{
+	SimpleStringList *process_list = pg_malloc0(sizeof(SimpleStringList));
+	SimpleStringListCell *idx_cell,
+			   *idx_table,
+			   *cell;
+	int			ipl_len = 10,
+current_tables_num = 0,
+i;
+
+	*indices_process_list = pg_malloc0(sizeof(*indices_process_list) * ipl_len);
+
+	for (idx_cell = user_list->head, idx_table = indices_tables_list->head;
+		 idx_cell;
+		 idx_cell = idx_cell->next, idx_table = idx_table->next)
+	{
+		SimpleStringList *list_to_add = NULL;
+
+		if (!current_tables_num)
+		{
+			simple_string_list_append(process_list, idx_table->val);
+			(*indices_process_list)[0] = pg_malloc0(sizeof(SimpleStringList));
+			current_tables_num++;
+		}
+
+		cell = process_list->head;
+		for (i = 0; i < current_tables_num; i++)
+		{
+			if (!strcmp(idx_table->val, cell->val))
+			{
+list_to_add = (*indices_process_list)[i];
+break;
+			}
+
+		}
+
+		if (!list_to_add)
+		{
+			simple_string_list_append(process_list, idx_table->val);
+
+			if (current_tables_num >= ipl_len)
+			{
+ipl_len *= 2;
+*indices_process_list = pg_realloc(*indices_process_list,
+   sizeof(*indices_process_list) * ipl_len);
+			}
+
+			(*indices_process_list)[current_tables_num] = pg_malloc0(sizeof(SimpleStringList));
+			list_to_add = (*indices_process_list)[current_tables_num];
+			current_tables_num++;
+		}
+
+		simple_string_list_append(list_to_add, idx_cell->val);
+
+	}
+
+	return process_list;
+}
+
 static void
 reindex_one_database(ConnParams *cparams, ReindexType type,
 	 SimpleStringList *user_list,
@@ -310,10 +363,13 @@ reindex_one_database(ConnParams *cparams, ReindexType type,
 	SimpleStringListCell *cell;
 	bool		parallel = concurrentCons > 1;
 	SimpleStringList *process_list = user_list;
+	SimpleStringList **indices_process_list = NULL;
+	SimpleStringList *indices_tables_list = NULL;
 	ReindexType process_type = type;
 	ParallelSlotArray *sa;
 	bool		failed = false;
-	int			items_count = 0;
+	int			items_count = 0,
+i = 0;
 
 	conn = connectDatabase(cparams, progname, echo, false, true);
 
@@ -383,8 +439,23 @@ reindex_one_database(ConnParams *cparams, ReindexType type,
 	return;
 break;
 
-			case REINDEX_SYSTEM:
 			case REINDEX_INDEX:
+

Re: Reducing the log spam

2024-03-11 Thread Jelte Fennema-Nio
-   the subscriber's server log.
+   the subscriber's server log if you remove 23505 from
+   .

This seems like a pretty big regression. Being able to know why your
replication got closed seems pretty critical.

On Mon, 11 Mar 2024 at 03:44, Laurenz Albe  wrote:
>
> On Sat, 2024-03-09 at 14:03 +0100, Laurenz Albe wrote:
> > Here is a patch that implements this.
>
> And here is patch v2 that fixes a bug and passes the regression tests.
>
> Yours,
> Laurenz Albe




Fix expecteddir argument in pg_regress

2024-03-11 Thread Anthonin Bonnefoy
Hi all!

pg_regress accepts the expecteddir argument. However, it is never used;
outputdir is used instead to build the expected file paths.

This patch fixes this and uses the expecteddir argument as intended.

Regards,
Anthonin


v01-0001-pg_regress-Use-expecteddir-for-the-expected-file.patch
Description: Binary data


[PROPOSAL] Skip test citext_utf8 on Windows

2024-03-11 Thread Oleg Tselebrovskiy

Greetings, everyone!

While running "installcheck" on databases with UTF-8 encoding, the test
citext_utf8 fails because of the Turkish dotted I, like this:

 SELECT 'i'::citext = 'İ'::citext AS t;
  t
 ---
- t
+ f
 (1 row)

I tried to replicate the test's results by hand, and with every collation
that I tried (including --locale="Turkish") this test failed.

Also, an interesting result from my testing: if you initialize your DB
with -E utf-8 --locale="Turkish" and then run SELECT LOWER('İ');
the output will be this:
 lower
---
 İ
(1 row)

I find this strange, since lower() uses the collation that was passed
(the default in this case, but still).

My PostgreSQL version is this:
postgres=# select version();
   version
--
 PostgreSQL 17devel on x86_64-windows, compiled by gcc-13.1.0, 64-bit

The proposed patch for skipping the test is attached.

Oleg Tselebrovskiy, Postgres Pro

diff --git a/contrib/citext/expected/citext_utf8.out b/contrib/citext/expected/citext_utf8.out
index 5d988dcd485..6c4069f9469 100644
--- a/contrib/citext/expected/citext_utf8.out
+++ b/contrib/citext/expected/citext_utf8.out
@@ -10,7 +10,8 @@
 SELECT getdatabaseencoding() <> 'UTF8' OR
        (SELECT (datlocprovider = 'c' AND datctype = 'C') OR datlocprovider = 'i'
         FROM pg_database
-        WHERE datname=current_database())
+        WHERE datname=current_database()) OR
+	   (version() ~ 'windows' OR version() ~ 'Visual C\+\+' OR version() ~ 'mingw32')
        AS skip_test \gset
 \if :skip_test
 \quit
diff --git a/contrib/citext/expected/citext_utf8_1.out b/contrib/citext/expected/citext_utf8_1.out
index 7065a5da190..d4472b1c36a 100644
--- a/contrib/citext/expected/citext_utf8_1.out
+++ b/contrib/citext/expected/citext_utf8_1.out
@@ -10,7 +10,8 @@
 SELECT getdatabaseencoding() <> 'UTF8' OR
        (SELECT (datlocprovider = 'c' AND datctype = 'C') OR datlocprovider = 'i'
         FROM pg_database
-        WHERE datname=current_database())
+        WHERE datname=current_database()) OR
+	   (version() ~ 'windows' OR version() ~ 'Visual C\+\+' OR version() ~ 'mingw32')
        AS skip_test \gset
 \if :skip_test
 \quit
diff --git a/contrib/citext/sql/citext_utf8.sql b/contrib/citext/sql/citext_utf8.sql
index 34b232d64e2..53775cdcd35 100644
--- a/contrib/citext/sql/citext_utf8.sql
+++ b/contrib/citext/sql/citext_utf8.sql
@@ -11,7 +11,8 @@
 SELECT getdatabaseencoding() <> 'UTF8' OR
        (SELECT (datlocprovider = 'c' AND datctype = 'C') OR datlocprovider = 'i'
         FROM pg_database
-        WHERE datname=current_database())
+        WHERE datname=current_database()) OR
+	   (version() ~ 'windows' OR version() ~ 'Visual C\+\+' OR version() ~ 'mingw32')
        AS skip_test \gset
 \if :skip_test
 \quit



Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-11 Thread Masahiko Sawada
On Mon, Mar 11, 2024 at 12:20 PM John Naylor  wrote:
>
> On Thu, Mar 7, 2024 at 10:35 PM Masahiko Sawada  wrote:
> >
> > I've attached the remaining patches for CI. I've made some minor
> > changes in separate patches and drafted the commit message for
> > tidstore patch.
> >
> > While reviewing the tidstore code, I thought that it would be more
> > appropriate to place tidstore.c under src/backend/lib instead of
> > src/backend/common/access since the tidstore is no longer implemented
> > only for heap or other access methods, and it might also be used by
> > executor nodes in the future. What do you think?
>
> That's a heck of a good question. I don't think src/backend/lib is
> right -- it seems that's for general-purpose data structures.
> Something like backend/utils is also too general.
> src/backend/access/common has things for tuple descriptors, toast,
> sessions, and I don't think tidstore is out of place here. I'm not
> sure there's a better place, but I could be convinced otherwise.

Yeah, I agree that src/backend/lib doesn't seem to be the place for
tidstore. Let's keep it in src/backend/access/common. If others think
differently, we can move it later.

>
> v68-0001:
>
> I'm not sure if commit messages are much a subject of review, and it's
> up to the committer, but I'll share a couple comments just as
> something to think about, not something I would ask you to change: I
> think it's a bit distracting that the commit message talks about the
> justification to use it for vacuum. Let's save that for the commit
> with actual vacuum changes. Also, I suspect saying there are a "wide
> range" of uses is over-selling it a bit, and that paragraph is a bit
> awkward aside from that.

Thank you for the comment; I agree. I've updated the commit message.

>
> + /* Collect TIDs extracted from the key-value pair */
> + result->num_offsets = 0;
> +
>
> This comment has nothing at all to do with this line. If the comment
> is for several lines following, some of which are separated by blank
> lines, there should be a blank line after the comment. Also, why isn't
> tidstore_iter_extract_tids() responsible for setting that to zero?

Agreed, fixed.

I also updated this part so we set result->blkno in
tidstore_iter_extract_tids() too, which seems more readable.

>
> + ts->context = CurrentMemoryContext;
>
> As far as I can tell, this member is never accessed again -- am I
> missing something?

You're right. It was used to re-create the tidstore in the same
context again while resetting it, but we no longer support the reset
API. Considering it again, would it be better to allocate the iterator
struct in the same context as we store the tidstore struct?

>
> + /* DSA for tidstore will be detached at the end of session */
>
> No other test module pins the mapping, but that doesn't necessarily
> mean it's wrong. Is there some advantage over explicitly detaching?

One small benefit of not explicitly detaching dsa_area in
tidstore_destroy() would be simplicity; IIUC if we want to do that, we
need to remember the dsa_area using (for example) a static variable,
and free it if it's non-NULL. I've implemented this idea in the
attached patch.
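
Roughly along these lines (a sketch of the idea only; the names are
illustrative and the attached patch is authoritative):

/* Remember the DSA area backing the shared tidstore, if any */
static dsa_area *tidstore_dsa = NULL;

static void
test_tidstore_destroy(TidStore *ts)
{
	TidStoreDestroy(ts);

	/* Detach explicitly instead of relying on session exit */
	if (tidstore_dsa != NULL)
	{
		dsa_detach(tidstore_dsa);
		tidstore_dsa = NULL;
	}
}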

>
> +-- Add tids in random order.
>
> I don't see any randomization here. I do remember adding row_number to
> remove whitespace in the output, but I don't remember a random order.
> On that subject, the row_number was an easy trick to avoid extra
> whitespace, but maybe we should just teach the setting function to
> return blocknumber rather than null?

Good idea, fixed.

>
> +Datum
> +tidstore_create(PG_FUNCTION_ARGS)
> +{
> ...
> + tidstore = TidStoreCreate(max_bytes, dsa);
>
> +Datum
> +tidstore_set_block_offsets(PG_FUNCTION_ARGS)
> +{
> 
> + TidStoreSetBlockOffsets(tidstore, blkno, offs, noffs);
>
> These names are too similar. Maybe the test module should do
> s/tidstore_/test_/ or similar.

Agreed.

>
> +/* Sanity check if we've called tidstore_create() */
> +static void
> +check_tidstore_available(void)
> +{
> + if (tidstore == NULL)
> + elog(ERROR, "tidstore is not initialized");
> +}
>
> I don't find this very helpful. If a developer wiped out the create
> call, wouldn't the test crash and burn pretty obviously?

Removed.

>
> In general, the .sql file is still very hard-coded. Functions are
> created that contain a VALUES statement. Maybe it's okay for now, but
> wanted to mention it. Ideally, we'd have some randomized tests,
> without having to display it. That could be in addition to (not
> replacing) the small tests we have that display input. (see below)
>

Agreed to add randomized tests in addition to the existing tests.

>
> v68-0002:
>
> @@ -329,6 +381,13 @@ TidStoreIsMember(TidStore *ts, ItemPointer tid)
>
>   ret = (page->words[wordnum] & ((bitmapword) 1 << bitnum)) != 0;
>
> +#ifdef TIDSTORE_DEBUG
> + if (!TidStoreIsShared(ts))
> + {
+ bool ret_debug = ts_debug_is_member(ts, tid);
> + Assert(ret == ret_debug);
> + }
> +#endif
>
> This only checking the 

Re: ExecAppendAsyncEventWait() in REL_14_STABLE can corrupt PG_exception_stack

2024-03-11 Thread Etsuro Fujita
Hi Michael-san,

On Mon, Mar 11, 2024 at 8:12 AM Michael Paquier  wrote:
> On Mon, Feb 26, 2024 at 04:29:44PM +0900, Etsuro Fujita wrote:
> > Will do.  (I was thinking you would get busy from now on.)
>
> Fujita-san, have you been able to look at this thread?

Yeah, I took a look; the patch looks good to me, but I am thinking of
updating some comments in a related function in postgres_fdw.c.  I will
have time to work on this later this week, so I would like to propose
an updated patch then.

Thanks for taking care of this!

Best regards,
Etsuro Fujita




Re: SQL:2011 application time

2024-03-11 Thread Peter Eisentraut

On 01.03.24 22:56, Paul Jungwirth wrote:

On 3/1/24 12:38, Paul Jungwirth wrote:

On 2/29/24 13:16, Paul Jungwirth wrote:
Here is a v26 patch series to fix a cfbot failure in sepgsql. Rebased 
to 655dc31046.


v27 attached, fixing some cfbot failures from 
headerscheck+cpluspluscheck. Sorry for the noise!


I had committed v27-0001-Rename-conwithoutoverlaps-to-conperiod.patch a 
little while ago.


I have reviewed v27-0002 through 0004 now.  I have one semantic question 
below, and there are a few places where more clarification of the 
interfaces could help.  Other than that, I think this is pretty good.


Attached is a small patch that changes the PERIOD keyword to unreserved 
for this patch.  You had said earlier that this didn't work for you. 
The attached patch works for me when applied on top of 0003.



* v27-0002-Add-GiST-referencedagg-support-func.patch

You wrote:

> I'm not sure how else to do it. The issue is that `range_agg` returns
> a multirange, so the result type doesn't match the inputs. But other
> types will likely have the same problem: to combine boxes you may need
> a multibox. To combine mdranges you may need a multimdrange.

Can we just hardcode the use of range_agg for this release?  Might be 
easier.  I don't see all this generality being useful in the near future.


> Btw that part changed a bit since v24 because as jian he pointed out,
> our type system doesn't support anyrange inputs and an anyrange[]
> output. So I changed the support funcs to use SETOF.

I didn't see any SETOF stuff in the patch, or I didn't know where to look.

I'm not sure I follow all the details here.  So more explanations of any 
kind could be helpful.



* v27-0003-Refactor-FK-operator-lookup.patch

I suggest to skip this refactoring patch.  I don't think the way this is 
sliced up is all that great, and it doesn't actually help with the 
subsequent patches.



* v27-0004-Add-temporal-FOREIGN-KEYs.patch

- src/backend/catalog/pg_constraint.c

FindFKPeriodOpersAndProcs() could use a bit more top-level
documentation.  Where does the input opclass come from?  What are the
three output values?  What is the business with "symmetric types"?

- src/backend/commands/indexcmds.c

GetOperatorFromWellKnownStrategy() is apparently changed to accept
InvalidOid for rhstype, but the meaning of this is not explained in
the function header.  It's also not clear to me why an existing caller
is changed.  This should be explained more thoroughly.

- src/backend/commands/tablecmds.c

is_temporal and similar should be renamed to with_period or similar 
throughout this patch.


In transformFkeyGetPrimaryKey():

 * Now build the list of PK attributes from the indkey definition (we
-* assume a primary key cannot have expressional elements)
+* assume a primary key cannot have expressional elements, unless it
+* has a PERIOD)

I think the original statement is still true even with PERIOD.  The 
expressional elements refer to expression indexes.  I don't think we can 
have a PERIOD marker on an expression?


- src/backend/utils/adt/ri_triggers.c

Please remove the separate trigger functions for the period case.  They 
are the same as the non-period ones, so we don't need separate ones. 
The difference is handled lower in the call stack, which I think is a 
good setup.  Removing the separate functions also removes a lot of extra 
code in other parts of the patch.


- src/include/catalog/pg_constraint.h

Should also update catalogs.sgml accordingly.

- src/test/regress/expected/without_overlaps.out
- src/test/regress/sql/without_overlaps.sql

A few general comments on the tests:

- In the INSERT commands, specify the column names explicitly.  This 
makes the tests easier to read (especially since the column order 
between the PK and the FK table is sometimes different).


- Let's try to make it so that the inserted literals match the values 
shown in the various error messages, so it's easier to match them up. 
So, change the int4range literals to half-open notation (see the 
example after this list).  And also maybe change the date output format 
to ISO.


- In various comments, instead of test FK "child", maybe use 
"referencing table"?  Instead of "parent", use "referenced table" (or 
primary key table).  When I read child and parent I was looking for 
inheritance.


- Consider truncating the test tables before each major block of tests 
and refilling them with fresh data.  So it's easier to eyeball the 
tests.  Otherwise, there is too much dependency on what earlier tests 
left behind.
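
As an illustration of the half-open point above (the closed-form literal
is canonicalized on input anyway, so the half-open spelling matches what
later shows up in error messages):

SELECT '[1,1]'::int4range;
 int4range
-----------
 [1,2)
(1 row)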


A specific question:

In this test, a PERIOD marker on the referenced site is automatically 
inferred from the primary key:


+-- with inferred PK on the referenced table:
+CREATE TABLE temporal_fk_rng2rng (
+   id int4range,
+   valid_at tsrange,
+   parent_id int4range,
+   CONSTRAINT temporal_fk_rng2rng_pk PRIMARY KEY (id, valid_at WITHOUT 
OVERLAPS),
+   CONSTRAINT temporal_fk_rng2rng_fk FOREIGN KEY (parent_id, PERIOD 
valid_at)

+   

Re: remaining sql/json patches

2024-03-11 Thread jian he
one more issue.

+-- Extension: non-constant JSON path
+SELECT JSON_EXISTS(jsonb '{"a": 123}', '$' || '.' || 'a');
+SELECT JSON_VALUE(jsonb '{"a": 123}', '$' || '.' || 'a');
+SELECT JSON_VALUE(jsonb '{"a": 123}', '$' || '.' || 'b' DEFAULT 'foo'
ON EMPTY);
+SELECT JSON_QUERY(jsonb '{"a": 123}', '$' || '.' || 'a');
+SELECT JSON_QUERY(jsonb '{"a": 123}', '$' || '.' || 'a' WITH WRAPPER);

The JSON path may not be a plain Const.
Does the following code in expression_tree_walker_impl need to consider
cases where the `jexpr->path_spec` part is not a Const?

+ case T_JsonExpr:
+ {
+ JsonExpr   *jexpr = (JsonExpr *) node;
+
+ if (WALK(jexpr->formatted_expr))
+ return true;
+ if (WALK(jexpr->result_coercion))
+ return true;
+ if (WALK(jexpr->item_coercions))
+ return true;
+ if (WALK(jexpr->passing_values))
+ return true;
+ /* we assume walker doesn't care about passing_names */
+ if (WALK(jexpr->on_empty))
+ return true;
+ if (WALK(jexpr->on_error))
+ return true;
+ }
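
If so, presumably the walker would need an additional check along these
lines, next to the other WALK() calls (written only to illustrate the
question, not as a verified fix):

+ if (WALK(jexpr->path_spec))
+ return true;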




Re: Infinite loop in XLogPageRead() on standby

2024-03-11 Thread Kyotaro Horiguchi
At Wed, 6 Mar 2024 11:34:29 +0100, Alexander Kukushkin  
wrote in 
> Hmm, I think you meant to use wal_segment_size, because 0x10 is just
> 1MB. As a result, currently it works for you by accident.

Oh, I once saw the fix work, but it seems not to be working after some
point. The new issue was a corruption of received WAL records on the
first standby, and it may be related to the setting.

> > Thus, I managed to reproduce precisely the same situation as you
> > described utilizing your script with modifications and some core
> > tweaks, and with the change above, I saw that the behavior was
> > fixed. However, for reasons unclear to me, it shows another issue, and
> > I am running out of time and need more caffeine. I'll continue
> > investigating this tomorrow.
> >
> 
> Thank you for spending your time on it!

You're welcome, but I apologize for the delay in the work.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Fix for timestamp lag issue from emit_log_hook when GUC log_line_prefix has '%m'

2024-03-11 Thread Michael Paquier
On Sat, Mar 09, 2024 at 09:09:39PM +0530, Kambam Vinay wrote:
> We observed a slight lag in timestamp for a few logs from the emit_log_hook
> hook implementation when the log_line_prefix GUC has '%m'.
> 
> Upon debugging, we found that the saved_timeval_set variable is set to
> 'true' in get_formatted_log_time() but is not reset to 'false' until the
> next call to send_message_to_server_log(). Due to this, saved_timeval_set
> will be true during the execution of hook emit_log_hook() which prefixes
> the saved timestamp 'saved_timeval' from the previous log line (our hook
> implementation calls log_line_prefix()).
> 
> Attached patch sets the saved_timeval_set to false before executing the
> emit_log_hook()

Indeed.  If you rely on log_line_prefix() in your hook before the
server side elog, the saved timestamp would not match with what could
be computed in the follow-up send_message_to_server_log or
send_message_to_frontend.

Hmm.  Shouldn't we remove the forced reset of formatted_log_time for
the 'm' case in log_status_format() and remove the reset done at the
beginning of send_message_to_server_log()?  One problem with your
patch is that we would still finish with a different saved_timeval in
the hook and in send_message_to_server_log(), but we should do things
so that the timestamps are the same for the whole duration of
EmitErrorReport(), no?  It seems to me that a more correct solution
would be to reset saved_timeval_set and formatted_log_time[0] once
before the hook, at the beginning of EmitErrorReport().
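
In code terms, the idea would be roughly this at the top of
EmitErrorReport(), before the hook is invoked (a sketch of the
suggestion, not a tested patch):

	/*
	 * Reset the cached timestamp state so that the hook and the in-core
	 * log destinations all compute the same, fresh timestamp for this
	 * report.
	 */
	saved_timeval_set = false;
	formatted_log_time[0] = '\0';

	if (edata->output_to_server && emit_log_hook)
		(*emit_log_hook) (edata);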
--
Michael


signature.asc
Description: PGP signature


Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-11 Thread John Naylor
On Fri, Feb 16, 2024 at 10:05 AM Masahiko Sawada  wrote:
>
> On Thu, Feb 15, 2024 at 8:26 PM John Naylor  wrote:

> > v61-0007: Runtime-embeddable tids -- Optional for v17, but should
> > reduce memory regressions, so should be considered. Up to 3 tids can
> > be stored in the last level child pointer. It's not polished, but I'll
> > only proceed with that if we think we need this. "flags" is called
> > that because it could hold tidbitmap.c booleans (recheck, lossy) in
> > the future, in addition to reserving space for the pointer tag. Note:
> > I hacked the tests to only have 2 offsets per block to demo, but of
> > course both paths should be tested.
>
> Interesting. I've run the same benchmark tests we did[1][2] (the
> median of 3 runs):
[found a big speed-up where we don't expect one]

I tried to reproduce this with a similar patch, rebased on top of a bug
you recently fixed (possibly related?) -- attached; it also shows one
way to address some lack of coverage in the debug build, for as long
as we test that with CI.

Fortunately I cannot see a difference, so I believe it's not affecting
the case in this test at all, as expected:

v68:

INFO:  finished vacuuming "john.public.test": index scans: 1
pages: 0 removed, 442478 remain, 88478 scanned (20.00% of total)
tuples: 19995999 removed, 80003979 remain, 0 are dead but not yet removable
removable cutoff: 770, which was 0 XIDs old when operation ended
frozen: 0 pages from table (0.00% of total) had 0 tuples frozen
index scan needed: 88478 pages from table (20.00% of total) had
19995999 dead item identifiers removed
index "test_x_idx": pages: 274194 in total, 54822 newly deleted, 54822
currently deleted, 0 reusable
avg read rate: 620.356 MB/s, avg write rate: 124.105 MB/s
buffer usage: 758236 hits, 274196 misses, 54854 dirtied
WAL usage: 2 records, 0 full page images, 425 bytes

system usage: CPU: user: 3.74 s, system: 0.68 s, elapsed: 4.45 s
system usage: CPU: user: 3.02 s, system: 0.42 s, elapsed: 3.47 s
system usage: CPU: user: 3.09 s, system: 0.38 s, elapsed: 3.49 s
system usage: CPU: user: 3.00 s, system: 0.43 s, elapsed: 3.45 s

v68 + emb values (that cannot be used because > 3 tids per block):

INFO:  finished vacuuming "john.public.test": index scans: 1
pages: 0 removed, 442478 remain, 88478 scanned (20.00% of total)
tuples: 19995999 removed, 80003979 remain, 0 are dead but not yet removable
removable cutoff: 775, which was 0 XIDs old when operation ended
frozen: 0 pages from table (0.00% of total) had 0 tuples frozen
index scan needed: 88478 pages from table (20.00% of total) had
19995999 dead item identifiers removed
index "test_x_idx": pages: 274194 in total, 54822 newly deleted, 54822
currently deleted, 0 reusable
avg read rate: 570.808 MB/s, avg write rate: 114.192 MB/s
buffer usage: 758236 hits, 274196 misses, 54854 dirtied
WAL usage: 2 records, 0 full page images, 425 bytes

system usage: CPU: user: 3.11 s, system: 0.62 s, elapsed: 3.75 s
system usage: CPU: user: 3.04 s, system: 0.41 s, elapsed: 3.46 s
system usage: CPU: user: 3.05 s, system: 0.41 s, elapsed: 3.47 s
system usage: CPU: user: 3.04 s, system: 0.43 s, elapsed: 3.49 s

I'll continue polishing the runtime-embeddable values patch as time
permits, for later consideration.


WIP-3-embeddable-TIDs.patch.nocfbot
Description: Binary data


Re: Spurious pgstat_drop_replslot() call

2024-03-11 Thread Bertrand Drouvot
Hi,

On Mon, Mar 11, 2024 at 04:15:40PM +0900, Michael Paquier wrote:
> That's a slight change in behavior, unfortunately, and it cannot be
> called a bug as this improves the visibility of the stats after an
> invalidation, so this is not something that can be backpatched.

Yeah, makes sense to me.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Spurious pgstat_drop_replslot() call

2024-03-11 Thread Michael Paquier
On Fri, Mar 08, 2024 at 03:04:10PM +, Bertrand Drouvot wrote:
> The switch in the patch from "drop" to "invalidation" is in [1], see:
> 
> "
> Given the precedent of max_slot_wal_keep_size, I think it's wrong to
just drop
> the logical slots. Instead we should just mark them as
invalid, 
> like InvalidateObsoleteReplicationSlots().
> 
> Makes fully sense and done that way in the attached patch.
> “
> 
> But yeah, hard to be sure why this call is there, at least I don't remember...

Yep, noticed that on Friday.

> We can not be 100% sure that the stats are up to date when the process holding
> the active slot is killed. 
> 
> So v2 attached adds a test where we ensure first that we have non empty stats
> before triggering the invalidation.

Ah, that explains the extra poll.  What you have done here makes sense
to me, and the new test fails immediately if I add back the stats drop
in the invalidation code path.

That's a slight change in behavior, unfortunately, and it cannot be
called a bug as this improves the visibility of the stats after an
invalidation, so this is not something that can be backpatched.
--
Michael


signature.asc
Description: PGP signature


Re: Failures in constraints regression test, "read only 0 of 8192 bytes"

2024-03-11 Thread Heikki Linnakangas

On 10/03/2024 22:59, Thomas Munro wrote:

On Mon, Mar 11, 2024 at 9:30 AM Heikki Linnakangas  wrote:

Barring objections, I'll commit the attached.


+1


Pushed, thanks!


I guess the comment for smgrreleaseall() could also be updated.  It
mentions only PROCSIGNAL_BARRIER_SMGRRELEASE, but I think sinval
overflow (InvalidateSystemCaches()) should also be mentioned?


I removed that comment; people can grep to find the callers.


Hmm, I'm not sure if we even need smgrreleaseall() here anymore. It's
not required for correctness AFAICS. We don't do it in single-rel
invalidation in RelationCacheInvalidateEntry() either.


I think we do, because we have missed sinval messages.  It's unlikely
but a relfilenode might have been recycled, and we might have file
descriptors that point to the unlinked files.  That is, there are new
files with the same names and we need to open those ones.


Gotcha.

--
Heikki Linnakangas
Neon (https://neon.tech)





Re: Hooking into ExplainOneQuery() complicated by missing standard_ExplainOneQuery

2024-03-11 Thread Mats Kindahl
On Mon, Mar 11, 2024 at 12:42 AM Michael Paquier 
wrote:

> On Thu, Mar 07, 2024 at 07:45:01AM +0900, Michael Paquier wrote:
> > That's nice.  You would be able to shave quite a bit of code.  If
> > there are no objections, I propose to apply the change of this thread
> > to have this standard explain wrapper at the beginning of next week.
> > If others have any comments, feel free.
>
> Well, done as of a04ddd077e61.
>

Thanks Michael!
-- 
Best wishes,
Mats Kindahl, Timescale


Re: Properly pathify the union planner

2024-03-11 Thread Richard Guo
On Fri, Mar 8, 2024 at 11:31 AM David Rowley  wrote:

> On Fri, 8 Mar 2024 at 00:39, Richard Guo  wrote:
> > I would like to have another look, but it might take several days.
> > Would that be too late?
>
> Please look. Several days is fine. I'd like to start looking on Monday
> or Tuesday next week.


I've had another look, and here are some comments that came to my mind.

* There are cases where the setop_pathkeys of a subquery does not match
the union_pathkeys generated in generate_union_paths() for sorting by
the whole target list.  In such cases, even if we have explicitly sorted
the paths of the subquery to meet the ordering required for its
setop_pathkeys, we cannot find appropriate ordered paths for
MergeAppend.  For instance,

explain (costs off)
(select a, b from t t1 where a = b) UNION (select a, b from t t2 where a = b);
                        QUERY PLAN
-----------------------------------------------------------
 Unique
   ->  Sort
         Sort Key: t1.a, t1.b
         ->  Append
               ->  Index Only Scan using t_a_b_idx on t t1
                     Filter: (a = b)
               ->  Index Only Scan using t_a_b_idx on t t2
                     Filter: (a = b)
(8 rows)

(Assume t_a_b_idx is a btree index on t(a, b))
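
(A possible setup matching that assumption, in case you want to
reproduce it:

create table t (a int, b int);
create index t_a_b_idx on t (a, b);

possibly with enable_seqscan disabled to get the index-only scan plans.)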

In this query, the setop_pathkeys of the subqueries includes only one
PathKey because 'a' and 'b' are in the same EC inside the subqueries,
while the union_pathkeys of the whole query includes two PathKeys, one
for each target entry.  After we convert the setop_pathkeys to outer
representation, we'd notice that it does not match union_pathkeys.
Consequently, we are unable to recognize that the index scan paths are
already appropriately sorted, leading us to miss the opportunity to
utilize MergeAppend.

Not sure if this case is common enough to be worth paying attention to.

* In build_setop_child_paths() we also create properly sorted partial
paths, which seems unnecessary because we do not support parallel
merge append, right?

* Another is minor and relates to cosmetic matters.  When we unique-ify
the result of a UNION, we take the number of distinct groups as equal to
the total input size.  For the Append path and Gather path, we use
'dNumGroups', which is 'rows' of the Append path.  For the MergeAppend
we use 'rows' of the MergeAppend path.  I believe they are supposed to
be the same, but I think it'd be better to keep them consistent: either
use 'dNumGroups' for all the three kinds of paths, or use 'path->rows'
for each path.

Thanks
Richard


Re: Using the %m printf format more

2024-03-11 Thread Michael Paquier
On Wed, Mar 06, 2024 at 07:11:19PM +, Dagfinn Ilmari Mannsåker wrote:
> I just noticed that commit d93627bc added a bunch of pg_fatal() calls
> using %s and strerror(errno), which could be written more concisely as
> %m.  I'm assuming this was done because the surrounding code also uses
> this pattern, and hadn't been changed to use %m when support for that
> was added to snprintf.c to avoid backporting hazards.  However, that
> support was in v12, which is now the oldest still-supported back branch,
> so we can safely make that change.

Right.  This may still create some spurious conflicts, but that's
manageable for error strings.  The changes in your patch look OK.
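
For reference, the kind of rewrite being discussed looks like this
(an illustrative example, not a hunk from the patch):

-	pg_fatal("could not open file \"%s\": %s", path, strerror(errno));
+	pg_fatal("could not open file \"%s\": %m", path);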

> The attached patch does so everywhere appropriate. One place where it's
> not appropriate is the TAP-emitting functions in pg_regress, since those
> call fprintf()

I am not really following your argument with pg_regress.c and
fprintf().  d6c55de1f99a should make that possible even in the case of
emit_tap_output_v(), no? 

> and other potentially errno-modifying functions before
> evaluating the format string.

Sure.
--
Michael


signature.asc
Description: PGP signature


Re: Amcheck verification of GiST and GIN

2024-03-11 Thread Andrey M. Borodin



> On 20 Jan 2024, at 07:46, vignesh C  wrote:
> 
> I have changed the status of the commitfest entry to "Waiting on
> Author" as there was no follow-up on Alexander's queries. Feel free to
> address them and change the commitfest entry accordingly.

Thanks Vignesh!

At the moment it’s obvious that this change will not be in 17, but I have plans 
to continue work on this. So I’ll move this item to July CF.


Best regards, Andrey Borodin.



Re: Improve eviction algorithm in ReorderBuffer

2024-03-11 Thread Peter Smith
Here are some review comments for v8-0003

==
0. GENERAL -- why the state enum?

This patch introduced a new ReorderBufferMemTrackState with 2 states
(REORDER_BUFFER_MEM_TRACK_NO_MAXHEAP,
REORDER_BUFFER_MEM_TRACK_MAINTAIN_MAXHEAP)

It's the same as having a boolean flag OFF/ON, so I didn't see any
benefit of the enum instead of a simple boolean flag like
'track_txn_sizes'.
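
i.e. something like (sketch only):

	/* Is the max-heap of transaction sizes currently being maintained? */
	bool		track_txn_sizes;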

NOTE: Below in this post (see #11) I would like to propose another
idea, which can simplify much further, eliminating the need for the
state boolean. If adopted that will impact lots of these other review
comments.

==
Commit Message

1.
Previously, when selecting the transaction to evict during logical
decoding, we check all transactions to find the largest
transaction. Which could lead to a significant replication lag
especially in case where there are many subtransactions.

~

/Which could/This could/

/in case/in the case/

==
.../replication/logical/reorderbuffer.c

2.
 *   We use a max-heap with transaction size as the key to efficiently find
 *   the largest transaction. The max-heap state is managed in two states:
 *   REORDER_BUFFER_MEM_TRACK_NO_MAXHEAP and
REORDER_BUFFER_MEM_TRACK_MAINTAIN_MAXHEAP.

/The max-heap state is managed in two states:/The max-heap is managed
in two states:/

~~~

3.
+/*
+ * Threshold of the total number of top-level and sub transactions
that controls
+ * whether we switch the memory track state. While using max-heap to select
+ * the largest transaction is effective when there are many transactions being
+ * decoded, in many systems there is generally no need to use it as long as all
+ * transactions being decoded are top-level transactions. Therefore, we use
+ * MaxConnections as the threshold* so we can prevent switch to the
state unless
+ * we use subtransactions.
+ */
+#define REORDER_BUFFER_MEM_TRACK_THRESHOLD MaxConnections

3a.
/memory track state./memory tracking state./

/While using max-heap/Although using max-heap/

"in many systems" (are these words adding anything?)

/threshold*/threshold/

/so we can prevent switch/so we can prevent switching/

~

3b.
There's nothing really in this name to indicate the units of the
threshold. Consider if there is some more informative name for this
macro: e.g.
MAXHEAP_TX_COUNT_THRESHOLD (?)

~~~

4.
+ /*
+ * Don't start with a lower number than
+ * REORDER_BUFFER_MEM_TRACK_THRESHOLD, since we add at least
+ * REORDER_BUFFER_MEM_TRACK_THRESHOLD entries at once.
+ */
+ buffer->memtrack_state = REORDER_BUFFER_MEM_TRACK_NO_MAXHEAP;
+ buffer->txn_heap = binaryheap_allocate(REORDER_BUFFER_MEM_TRACK_THRESHOLD * 2,
+ReorderBufferTXNSizeCompare,
+true, NULL);
+

IIUC the comment intends to say:

Allocate the initial heap size greater than THRESHOLD because the
txn_heap will not be used until the threshold is exceeded.

Also, maybe the comment should make a point of saying "Note: the
binary heap is INDEXED for faster manipulations", or something
similar.

~~~

5.
 static void
 ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
  ReorderBufferChange *change,
+ ReorderBufferTXN *txn,
  bool addition, Size sz)
 {
- ReorderBufferTXN *txn;
  ReorderBufferTXN *toptxn;

- Assert(change->txn);
-

There seems to be some trick now where the passed 'change' could be NULL,
which was not possible before. e.g., when change is NULL then 'txn' is
not NULL, and vice versa. Some explanation about this logic and the
meaning of these parameters should be written in this function
comment.

~

6.
+ txn = txn != NULL ? txn : change->txn;

IMO it's more natural to code the ternary using the same order as the
parameters:

e.g. txn = change ? change->txn : txn;

~~~

7.
/*
 * Build the max-heap and switch the state. We will run a heap assembly step
 * at the end, which is more efficient.
 */
static void
ReorderBufferBuildMaxHeap(ReorderBuffer *rb)

/We will run a heap assembly step at the end, which is more
efficient./The heap assembly step is deferred until the end, for
efficiency./

~~~

8. ReorderBufferLargestTXN

+ if (hash_get_num_entries(rb->by_txn) < REORDER_BUFFER_MEM_TRACK_THRESHOLD)
+ {
+ HASH_SEQ_STATUS hash_seq;
+ ReorderBufferTXNByIdEnt *ent;
+
+ hash_seq_init(&hash_seq, rb->by_txn);
+ while ((ent = hash_seq_search(&hash_seq)) != NULL)
+ {
+ ReorderBufferTXN *txn = ent->txn;
+
+ /* if the current transaction is larger, remember it */
+ if ((!largest) || (txn->size > largest->size))
+ largest = txn;
+ }
+
+ Assert(largest);
+ }

That Assert(largest) seems redundant because there is another
Assert(largest) immediately after this code anyway.

~~~

9.
+ /* Get the largest transaction from the max-heap */
+ if (rb->memtrack_state == REORDER_BUFFER_MEM_TRACK_MAINTAIN_MAXHEAP)
+ {
+ Assert(binaryheap_size(rb->txn_heap) > 0);
+ largest = (ReorderBufferTXN *)
+ DatumGetPointer(binaryheap_first(rb->txn_heap));
  }
Assert(binaryheap_size(rb->txn_heap) > 0); seemed like a slightly less
readable way of saying:

Assert(!binaryheap_empty(rb->txn_heap));

~~~

10.
+
+/*
+ * Compare 

Re: Invalid query generated by postgres_fdw with UNION ALL and ORDER BY

2024-03-11 Thread Ashutosh Bapat
On Fri, Mar 8, 2024 at 7:43 AM David Rowley  wrote:

> On Fri, 8 Mar 2024 at 00:54, Ashutosh Bapat
>  wrote:
> >
> > On Thu, Mar 7, 2024 at 4:39 PM David Rowley  wrote:
> >> I think the fix should go in appendOrderByClause().  It's at that
> >> point we look for the EquivalenceMember for the relation and can
> >> easily discover if the em_expr is a Const.  I think we can safely just
> >> skip doing any ORDER BY  stuff and not worry about if the
> >> literal format of the const will appear as a reference to an ordinal
> >> column position in the ORDER BY clause.
> >
> > deparseSortGroupClause() calls deparseConst() with showtype = 1.
> > appendOrderByClause() may want to do something similar for consistency.
> > Or remove it from deparseSortGroupClause() as well?
>
> The fix could also be to use deparseConst() in appendOrderByClause()
> and have that handle Const EquivalenceMember instead.  I'd rather just
> skip them. To me, that seems less risky than ensuring deparseConst()
> handles all Const types correctly.
>
> Also, as far as adjusting GROUP BY to eliminate Consts, I don't think
> that's material for a bug fix. If we want to consider doing that,
> that's for master only.
>

If appendOrderByClause() had been using deparseConst() from the
beginning, this bug would not be there. Instead of maintaining two
different ways of deparsing the ORDER BY clause, we could maintain just
one. I think we should unify those. If we can only do it in master, so
be it. I am fine with leaving the back branches with two methods.
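
For reference, the "skip them" approach would presumably look something
like this inside appendOrderByClause()'s loop over the pathkeys (a
sketch; the committed fix may differ):

		em = find_em_for_rel(context->root, pathkey->pk_eclass, baserel);
		em_expr = em->em_expr;

		/* A bare constant contributes nothing to the ordering; skip it */
		if (IsA(em_expr, Const))
			continue;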


>
> >> I wonder if we need a test...
> >
> > Yes.
>
> I've added two of those in the attached.
>
> Thanks. They look fine to me.


-- 
Best Wishes,
Ashutosh Bapat