RE: Synchronizing slots from primary to standby

2024-04-07 Thread Zhijie Hou (Fujitsu)
On Saturday, April 6, 2024 12:43 PM Amit Kapila  wrote:
> On Fri, Apr 5, 2024 at 8:05 PM Bertrand Drouvot
>  wrote:
> >
> > On Fri, Apr 05, 2024 at 06:23:10PM +0530, Amit Kapila wrote:
> > > On Fri, Apr 5, 2024 at 5:17 PM Amit Kapila 
> wrote:
> > > Thinking more on this, it doesn't seem related to
> > > c9920a9068eac2e6c8fb34988d18c0b42b9bf811 as that commit doesn't
> > > change any locking or something like that which impacts write positions.
> >
> > Agree.
> >
> > > I think what has happened here is that running_xact record written
> > > by the background writer [1] is not written to the kernel or disk
> > > (see LogStandbySnapshot()), before pg_current_wal_lsn() checks the
> > > current_lsn to be compared with replayed LSN.
> >
> > Agree, I think it's not visible through pg_current_wal_lsn() yet.
> >
> > Also I think that the DEBUG message in LogCurrentRunningXacts()
> >
> > "
> > elog(DEBUG2,
> >  "snapshot of %d+%d running transaction ids (lsn %X/%X oldest
> xid %u latest complete %u next xid %u)",
> >  CurrRunningXacts->xcnt, CurrRunningXacts->subxcnt,
> >  LSN_FORMAT_ARGS(recptr),
> >  CurrRunningXacts->oldestRunningXid,
> >  CurrRunningXacts->latestCompletedXid,
> >  CurrRunningXacts->nextXid); "
> >
> > should be located after the XLogSetAsyncXactLSN() call. Indeed, the
> > new LSN is visible after the spinlock (XLogCtl->info_lck) in
> > XLogSetAsyncXactLSN() is released,
> >
> 
> I think the new LSN can be visible only when the corresponding WAL is written
> by XLogWrite(). I don't know what in XLogSetAsyncXactLSN() can make it
> visible. In your experiment below, isn't it possible that in the meantime WAL
> writer has written that WAL due to which you are seeing an updated location?
> 
> > see:
> >
> > \watch on Session 1 provides:
> >
> >  pg_current_wal_lsn
> > --------------------
> >  0/87D110
> > (1 row)
> >
> > Until:
> >
> > Breakpoint 2, XLogSetAsyncXactLSN (asyncXactLSN=8900936) at
> xlog.c:2579
> > 2579XLogRecPtr  WriteRqstPtr = asyncXactLSN;
> > (gdb) n
> > 2581boolwakeup = false;
> > (gdb)
> > 2584SpinLockAcquire(&XLogCtl->info_lck);
> > (gdb)
> > 2585RefreshXLogWriteResult(LogwrtResult);
> > (gdb)
> > 2586sleeping = XLogCtl->WalWriterSleeping;
> > (gdb)
> > 2587prevAsyncXactLSN = XLogCtl->asyncXactLSN;
> > (gdb)
> > 2588if (XLogCtl->asyncXactLSN < asyncXactLSN)
> > (gdb)
> > 2589XLogCtl->asyncXactLSN = asyncXactLSN;
> > (gdb)
> > 2590SpinLockRelease(&XLogCtl->info_lck);
> > (gdb) p p/x (uint32) XLogCtl->asyncXactLSN
> > $1 = 0x87d148
> >
> > Then session 1 provides:
> >
> >  pg_current_wal_lsn
> > --------------------
> >  0/87D148
> > (1 row)
> >
> > So, when we see in the log:
> >
> > 2024-04-05 04:37:05.074 UTC [3854278][background writer][:0] DEBUG:
> > snapshot of 0+0 running transaction ids (lsn 0/398 oldest xid 740
> > latest complete 739 next xid 740)
> > 2024-04-05 04:37:05.197 UTC [3866475][client backend][2/4:0] LOG:
> statement: SELECT '0/360' <= replay_lsn AND state = 'streaming'
> >
> > It's indeed possible that the new LSN was not visible yet (spinlock
> > not released?) before the query began (because we cannot rely on the
> > time the DEBUG message was emitted).
> >
> > > Note that the reason why
> > > walsender has picked the running_xact written by background writer
> > > is because it has checked after pg_current_wal_lsn() query, see LOGs [2].
> > > I think we can probably try to reproduce manually via debugger.
> > >
> > > If this theory is correct
> >
> > I think it is.
> >
> > > then I think we will need to use injection points to control the
> > > behavior of bgwriter or use the slots created via SQL API for
> > > syncing in tests.
> > >
> > > Thoughts?
> >
> > I think that maybe as a first step we should move the "elog(DEBUG2,"
> > message as proposed above to help debugging (that could help to confirm
> the above theory).
> >
> 
> I think I am missing how exactly moving DEBUG2 can confirm the above theory.
> 
> > If the theory is proven then I'm not sure we need the extra complexity
> > of injection point here, maybe just relying on the slots created via
> > SQL API could be enough.
> >
> 
> Yeah, that could be the first step. We can probably add an injection point to
> control the bgwriter behavior and then add tests involving walsender
> performing the decoding. But I think it is important to have sufficient tests
> in this area, as I see they are quite helpful in uncovering issues.

Here is the patch to drop the subscription at the beginning so that the
restart_lsn of the lsub1_slot won't be advanced due to concurrent
xl_running_xacts from bgwriter. The subscription will be re-created after all
the slots are sync-ready. I think maybe we can use this to stabilize the test
as a first step and then think about how to make use of the injection point.
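(As a hedged illustration of the alternative discussed above -- relying on
slots created via the SQL API rather than a subscription -- a test could
create the failover slot directly; the slot and plugin names here are only
examples:)

	-- create a sync-ready failover slot without involving a subscription
	SELECT 'init' FROM pg_create_logical_replication_slot(
		'lsub1_slot',	-- slot name used by the test
		'pgoutput',		-- output plugin
		false,			-- temporary
		false,			-- twophase
		true);			-- failover, so the standby can sync it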

Re: Improve tab completion for ALTER DEFAULT PRIVILEGE and ALTER TABLE

2024-04-07 Thread vignesh C
On Mon, 8 Apr 2024 at 10:29, Masahiko Sawada  wrote:
>
> On Fri, Apr 5, 2024 at 1:18 AM vignesh C  wrote:
> >
> > On Tue, 2 Apr 2024 at 13:08, Masahiko Sawada  wrote:
> > >
> > > On Mon, Apr 1, 2024 at 10:41 PM vignesh C  wrote:
> > > >
> > > > On Thu, 28 Mar 2024 at 13:05, Masahiko Sawada  
> > > > wrote:
> > > > >
> > > > > Hi,
> > > > >
> > > > > Thank you for the patch!
> > > > >
> > > > > On Mon, Jul 3, 2023 at 12:12 AM vignesh C  wrote:
> > > > > >
> > > > > > Hi,
> > > > > >
> > > > > > Improved tab completion for "ALTER DEFAULT PRIVILEGE" and "ALTER 
> > > > > > TABLE":
> > > > > > 1) GRANT, REVOKE and FOR USER keywords were not displayed in tab
> > > > > > completion of alter default privileges, as in the statements below:
> > > > > > ALTER DEFAULT PRIVILEGES GRANT INSERT ON tables TO PUBLIC;
> > > > > > ALTER DEFAULT PRIVILEGES REVOKE INSERT ON tables FROM PUBLIC;
> > > > > > ALTER DEFAULT PRIVILEGES FOR USER vignesh revoke INSERT ON tables 
> > > > > > FROM dba1;
> > > > >
> > > > > +1
> > > > >
> > > > > >
> > > > > > 2) USER was not displayed for "ALTER DEFAULT PRIVILEGES IN SCHEMA
> > > > > > public FOR " like in below statement:
> > > > > > ALTER DEFAULT PRIVILEGES IN SCHEMA public FOR USER dba1 GRANT INSERT
> > > > > > ON TABLES TO PUBLIC;
> > > > >
> > > > > Since there is no difference between FOR USER and FOR ROLE, I'm not sure
> > > > > we really want to support both in tab-completion.
> > > >
> > > > I have removed this change
> > > >
> > > > > >
> > > > > > 3) "FOR GRANT OPTION" was not display for "ALTER DEFAULT PRIVILEGES
> > > > > > REVOKE " like in below statement:
> > > > > > alter default privileges revoke grant option for select ON tables 
> > > > > > FROM dba1;
> > > > >
> > > > > +1. But the v3 patch doesn't cover the following case:
> > > > >
> > > > > =# alter default privileges for role masahiko revoke [tab]
> > > > > ALL         CREATE      DELETE      EXECUTE     INSERT      MAINTAIN
> > > > > REFERENCES  SELECT      TRIGGER     TRUNCATE    UPDATE      USAGE
> > > >
> > > > Modified in the updated patch
> > > >
> > > > > And it doesn't cover MAINTAIN either:
> > > > >
> > > > > =# alter default privileges revoke [tab]
> > > > > ALL       DELETE    GRANT OPTION FOR  REFERENCES  TRIGGER   UPDATE
> > > > > CREATE    EXECUTE   INSERT            SELECT      TRUNCATE  USAGE
> > > >
> > > > Modified in the updated patch
> > > >
> > > > > The patch adds the completions for ALTER DEFAULT PRIVILEGES REVOKE,
> > > > > but we handle such case in GRANT and REVOKE part:
> > > > >
> > > > > (around L3958)
> > > > > /*
> > > > >  * With ALTER DEFAULT PRIVILEGES, restrict completion to 
> > > > > grantable
> > > > >  * privileges (can't grant roles)
> > > > >  */
> > > > > if (HeadMatches("ALTER", "DEFAULT", "PRIVILEGES"))
> > > > > COMPLETE_WITH("SELECT", "INSERT", "UPDATE",
> > > > >   "DELETE", "TRUNCATE", "REFERENCES", 
> > > > > "TRIGGER",
> > > > >   "CREATE", "EXECUTE", "USAGE", "MAINTAIN", 
> > > > > "ALL");
> > > >
> > > > The current patch handles the fix from here now.
> > > >
> > > > > Also, I think we can support WITH GRANT OPTION too. For example,
> > > > >
> > > > > =# alter default privileges for role masahiko grant all on tables to
> > > > > public [tab]
> > > >
> > > > I have handled this in the updated patch
> > > >
> > > > > It's already supported in the GRANT statement.
> > > > >
> > > > > >
> > > > > > 4) "DATA TYPE" was missing in "ALTER TABLE table-name ALTER COLUMN
> > > > > > column-name SET" like in:
> > > > > > ALTER TABLE t1 ALTER COLUMN c1 SET DATA TYPE text;
> > > > > >
> > > > >
> > > > > +1. The patch looks good to me, so pushed.
> > > >
> > > > Thanks for committing this.
> > > >
> > > > The updated patch has the changes for the above comments.
> > > >
> > >
> > > Thank you for updating the patch.
> > >
> > > I think it doesn't work well, as "GRANT OPTION FOR" is completed
> > > twice. For example,
> > >
> > > =# alter default privileges for user masahiko revoke [tab]
> > > ALL       DELETE    GRANT OPTION FOR  MAINTAIN    SELECT    TRUNCATE  USAGE
> > > CREATE    EXECUTE   INSERT            REFERENCES  TRIGGER   UPDATE
> > > =# alter default privileges for user masahiko revoke grant option for [tab]
> > > ALL       DELETE    GRANT OPTION FOR  MAINTAIN    SELECT    TRUNCATE  USAGE
> > > CREATE    EXECUTE   INSERT            REFERENCES  TRIGGER   UPDATE
> >
> > Thanks for finding this issue, the attached v5 version patch has the
> > fix for the same.
>
> Thank you for updating the patch! I've pushed with minor adjustments.

Thanks for pushing this patch.
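(For readers following along, a hedged sketch of the style of completion rule
involved, using the real HeadMatches()/TailMatches()/COMPLETE_WITH() macros
from src/bin/psql/tab-complete.c; the exact committed rule may differ:)

	/* ALTER DEFAULT PRIVILEGES ... REVOKE: also offer GRANT OPTION FOR */
	else if (HeadMatches("ALTER", "DEFAULT", "PRIVILEGES") &&
			 TailMatches("REVOKE"))
		COMPLETE_WITH("SELECT", "INSERT", "UPDATE", "DELETE", "TRUNCATE",
					  "REFERENCES", "TRIGGER", "CREATE", "EXECUTE", "USAGE",
					  "MAINTAIN", "ALL", "GRANT OPTION FOR");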

Regards,
Vignesh




Re: Add SPLIT PARTITION/MERGE PARTITIONS commands

2024-04-07 Thread Dmitry Koval

Alexander Lakhin, thanks for the problems you found!

Unfortunately I can't look into them immediately (I'm at an event [1]).
I will try to start working on them in 12-14 hours.

[1] https://pgconf.ru/2024
--
With best regards,
Dmitry Koval

Postgres Professional: http://postgrespro.com




Re: Replace FunctionCall2Coll with FunctionCallInvoke

2024-04-07 Thread Andy Fan



Hello Michael, 

> [[PGP Signed Part:Undecided]]
> On Mon, Apr 08, 2024 at 12:25:00PM +0800, Andy Fan wrote:
>> During the review of using extended statistics to improve join estimates
>> [1], I found some code-level optimization opportunities that apply to
>> existing code as well. To make the discussion easier, I'm opening this new
>> thread.
>
> Is that measurable?

I didn't spend time on testing it. Rather than whether the improvement is
measurable, I'm more interested in whether it is better than before or
not. As for the measurability aspect, I'm with the idea at [1]. Do you
think measurability matters here?

[1]
https://www.postgresql.org/message-id/CA%2BTgmoZJ0a_Dcn%2BST4YSeSrLnnmajmcsi7ZvEpgkKNiF0SwBuw%40mail.gmail.com
 

-- 
Best Regards
Andy Fan





Re: Replace FunctionCall2Coll with FunctionCallInvoke

2024-04-07 Thread Michael Paquier
On Mon, Apr 08, 2024 at 12:25:00PM +0800, Andy Fan wrote:
> During the review of using extended statistics to improve join estimates
> [1], I found some code-level optimization opportunities that apply to
> existing code as well. To make the discussion easier, I'm opening this new
> thread.

Is that measurable?
--
Michael


signature.asc
Description: PGP signature


Re: Speed up clean meson builds by ~25%

2024-04-07 Thread Michael Paquier
On Fri, Apr 05, 2024 at 06:19:20PM +0200, Jelte Fennema-Nio wrote:
> Agreed. While not a full solution, I think this patch is still good to
> address some of the pain: Waiting 10 seconds at the end of my build
> with only 1 of my 10 cores doing anything.
> 
> So while this doesn't decrease CPU time spent it does decrease
> wall-clock time significantly in some cases, with afaict no downsides.

Well, this is also painful with ./configure.  So, even if we are going
to move away from it at this point, we still need to support it for a
couple of years.  It looks to me that letting the clang folks know
about the situation is the best way forward.
--
Michael


signature.asc
Description: PGP signature


Re: Improve WALRead() to suck data directly from WAL buffers when possible

2024-04-07 Thread Bharath Rupireddy
On Thu, Mar 21, 2024 at 11:33 PM Bharath Rupireddy
 wrote:
>
> Please find the v26 patches after rebasing.

Commit f3ff7bf83b added a check in WALReadFromBuffers to ensure the
requested WAL is not past the WAL that's copied to WAL buffers. So,
I've dropped v26-0001 patch.

I've attached v27 patches for further review.

0001 adds a test module to demonstrate patterns for reading from WAL
buffers: the caller ensuring the requested WAL is fully copied to WAL
buffers using WaitXLogInsertionsToFinish, and an implementation of the
xlogreader page_read callback that reads unflushed/not-yet-flushed WAL
directly from WAL buffers.

0002 uses WALReadFromBuffers in more places, such as logical walsenders,
logical decoding functions, and backends reading WAL.
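(A hedged sketch of pattern 1, assuming the WALReadFromBuffers() signature
added by commit 91f2cae7a4e; the function and variable names are illustrative:)

	/* Read WAL that is guaranteed to already be in WAL buffers. */
	static Size
	read_from_buffers_demo(char *buf, XLogRecPtr startptr, Size count,
						   TimeLineID tli)
	{
		/* wait until all WAL up to the end point is copied to WAL buffers */
		WaitXLogInsertionsToFinish(startptr + count);

		/* the read now comes straight from WAL buffers, skipping the kernel */
		return WALReadFromBuffers(buf, startptr, count, tli);
	}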

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
From 77c4cc3320a9c2982b5fdac9bd51a892ca644fae Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Mon, 8 Apr 2024 05:02:07 +
Subject: [PATCH v27 1/2] Add test module to demonstrate reading from WAL 
 buffers patterns

This commit adds a test module to demonstrate a few patterns for
reading from WAL buffers using WALReadFromBuffers added by commit
91f2cae7a4e.

1. This module contains a test function to read the WAL that's
fully copied to WAL buffers. Whether or not the WAL is fully
copied to WAL buffers is ensured by WaitXLogInsertionsToFinish
before WALReadFromBuffers.

2. This module contains an implementation of xlogreader page_read
callback to read unflushed/not-yet-flushed WAL directly from WAL
buffers.
---
 src/backend/access/transam/xlog.c |   3 +-
 src/backend/access/transam/xlogreader.c   |   3 +-
 src/include/access/xlog.h |   1 +
 src/test/modules/Makefile |   1 +
 src/test/modules/meson.build  |   1 +
 .../modules/read_wal_from_buffers/.gitignore  |   4 +
 .../modules/read_wal_from_buffers/Makefile|  23 ++
 .../modules/read_wal_from_buffers/meson.build |  33 ++
 .../read_wal_from_buffers--1.0.sql|  37 ++
 .../read_wal_from_buffers.c   | 318 ++
 .../read_wal_from_buffers.control |   4 +
 .../read_wal_from_buffers/t/001_basic.pl  | 111 ++
 12 files changed, 536 insertions(+), 3 deletions(-)
 create mode 100644 src/test/modules/read_wal_from_buffers/.gitignore
 create mode 100644 src/test/modules/read_wal_from_buffers/Makefile
 create mode 100644 src/test/modules/read_wal_from_buffers/meson.build
 create mode 100644 src/test/modules/read_wal_from_buffers/read_wal_from_buffers--1.0.sql
 create mode 100644 src/test/modules/read_wal_from_buffers/read_wal_from_buffers.c
 create mode 100644 src/test/modules/read_wal_from_buffers/read_wal_from_buffers.control
 create mode 100644 src/test/modules/read_wal_from_buffers/t/001_basic.pl

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index e3fb26f5ab..4d4a840c8e 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -700,7 +700,6 @@ static void ReserveXLogInsertLocation(int size, XLogRecPtr *StartPos,
 	  XLogRecPtr *EndPos, XLogRecPtr *PrevPtr);
 static bool ReserveXLogSwitch(XLogRecPtr *StartPos, XLogRecPtr *EndPos,
 			  XLogRecPtr *PrevPtr);
-static XLogRecPtr WaitXLogInsertionsToFinish(XLogRecPtr upto);
 static char *GetXLogBuffer(XLogRecPtr ptr, TimeLineID tli);
 static XLogRecPtr XLogBytePosToRecPtr(uint64 bytepos);
 static XLogRecPtr XLogBytePosToEndRecPtr(uint64 bytepos);
@@ -1496,7 +1495,7 @@ WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt)
  * uninitialized page), and the inserter might need to evict an old WAL buffer
  * to make room for a new one, which in turn requires WALWriteLock.
  */
-static XLogRecPtr
+XLogRecPtr
 WaitXLogInsertionsToFinish(XLogRecPtr upto)
 {
 	uint64		bytepos;
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 37d2a57961..12dddf64cc 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1033,7 +1033,8 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 	 * record is.  This is so that we can check the additional identification
 	 * info that is present in the first page's "long" header.
 	 */
-	if (targetSegNo != state->seg.ws_segno && targetPageOff != 0)
+	if (state->seg.ws_segno != 0 &&
+		targetSegNo != state->seg.ws_segno && targetPageOff != 0)
 	{
 		XLogRecPtr	targetSegmentPtr = pageptr - targetPageOff;
 
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 76787a8267..74606a6846 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -252,6 +252,7 @@ extern XLogRecPtr GetLastImportantRecPtr(void);
 
 extern void SetWalWriterSleeping(bool sleeping);
 
+extern XLogRecPtr WaitXLogInsertionsToFinish(XLogRecPtr upto);
 extern Size WALReadFromBuffers(char *dstbuf, XLogRecPtr startpt

Re: Improve heapgetpage() performance, overhead from serializable

2024-04-07 Thread Andres Freund
Hi,

On 2024-04-08 16:18:21 +1200, David Rowley wrote:
> On Mon, 8 Apr 2024 at 16:08, Andres Freund  wrote:
> > I think visible would be ok, the serialization checks are IMO about
> > visibility too. But if you'd prefer I'd also be ok with something like
> > page_collect_tuples()?
> 
> That's ok for me.

Cool, pushed that way.

Greetings,

Andres Freund




Re: ExecAppendAsyncEventWait() in REL_14_STABLE can corrupt PG_exception_stack

2024-04-07 Thread Michael Paquier
On Thu, Apr 04, 2024 at 06:08:40PM +0900, Etsuro Fujita wrote:
> Pushed.  Sorry for the delay.

Thanks!
--
Michael


signature.asc
Description: PGP signature


Re: Extension Enhancement: Buffer Invalidation in pg_buffercache

2024-04-07 Thread Thomas Munro
On Mon, Apr 8, 2024 at 11:53 AM Melanie Plageman
 wrote:
> I've reviewed v6. I think you should mention in the docs that it only
> works for shared buffers -- so specifically not buffers containing
> blocks of temp tables.

Thanks for looking!  The whole pg_buffercache extension is for working
with shared buffers only, as mentioned at the top.  I have tried to
improve that paragraph though, as it only mentioned examining them.

> In the function pg_buffercache_invalidate(), why not use the
> BufferIsValid() function?
>
> -   if (buf < 1 || buf > NBuffers)
> +   if (!BufferIsValid(buf) || buf > NBuffers)

It doesn't check the range (it has assertions, not errors).

> I thought the below would be more clear for the comment above
> InvalidateUnpinnedBuffer().
>
> - * Returns true if the buffer was valid and it has now been made invalid.
> - * Returns false if the wasn't valid, or it couldn't be evicted due to a pin,
> - * or if the buffer becomes dirty again while we're trying to write it out.
> + * Returns true if the buffer was valid and has now been made invalid. 
> Returns
> + * false if it wasn't valid, if it couldn't be evicted due to a pin, or if 
> the
> + * buffer becomes dirty again while we're trying to write it out.

Fixed.

> Some of that probably applies for the docs too (i.e. you have some
> similar wording in the docs). There is actually one typo in your
> version, so even if you don't adopt my suggestion, you should fix that
> typo.

Yeah, thanks, improved similarly there.

> I didn't notice anything else out of place. I tried it and it worked
> as expected. I'm excited to have this feature!

Thanks!

> I didn't read through this whole thread, but was there any talk of
> adding other functions to let me invalidate a bunch of buffers at once
> or even some options -- like invalidate every 3rd buffer or whatever?
> (Not the concern of this patch, but just wondering because that would
> be a useful future enhancement IMO).

TBH I tried to resist people steering in that direction because you
can also just define a SQL function to do that built on this, and if
you had specialised functions they'd never be quite right.  IMHO we
succeeded in minimising the engineering and maximising flexibility,
'cause it's for hackers.  Crude, but already able to express a wide
range of stuff by punting the problem to SQL.
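(To illustrate that point, a hedged sketch of such a user-defined wrapper --
for instance the "every 3rd buffer" case mentioned above -- built on the
function as named in the v6 patch:)

	-- evict every 3rd unpinned shared buffer; count how many were evicted
	SELECT count(*) FILTER (WHERE pg_buffercache_invalidate(bufferid))
	FROM pg_buffercache
	WHERE bufferid % 3 = 0;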

Thanks to Palak for the patch.  Pushed.




Re: remaining sql/json patches

2024-04-07 Thread jian he
On Mon, Apr 8, 2024 at 11:21 AM jian he  wrote:
>
> On Mon, Apr 8, 2024 at 12:34 AM jian he  wrote:
> >
> > On Sun, Apr 7, 2024 at 9:36 PM Amit Langote  wrote:
> > > 0002 needs an expanded commit message but I've run out of energy today.
> > >
>
> other than that, it looks good to me.

one more tiny issue.
+EXPLAIN (COSTS OFF, VERBOSE) SELECT * FROM jsonb_table_view1;
+ERROR:  relation "jsonb_table_view1" does not exist
+LINE 1: EXPLAIN (COSTS OFF, VERBOSE) SELECT * FROM jsonb_table_view1...
+   ^
maybe you want
EXPLAIN (COSTS OFF, VERBOSE) SELECT * FROM jsonb_table_view7;
at the end of the sqljson_jsontable.sql.
I guess it will be fine, but the EXPLAIN output for it is quite big.

Also, do you need a `drop table s cascade;` at the end of the test?




Re: Improve tab completion for ALTER DEFAULT PRIVILEGE and ALTER TABLE

2024-04-07 Thread Masahiko Sawada
On Fri, Apr 5, 2024 at 1:18 AM vignesh C  wrote:
>
> On Tue, 2 Apr 2024 at 13:08, Masahiko Sawada  wrote:
> >
> > On Mon, Apr 1, 2024 at 10:41 PM vignesh C  wrote:
> > >
> > > On Thu, 28 Mar 2024 at 13:05, Masahiko Sawada  
> > > wrote:
> > > >
> > > > Hi,
> > > >
> > > > Thank you for the patch!
> > > >
> > > > On Mon, Jul 3, 2023 at 12:12 AM vignesh C  wrote:
> > > > >
> > > > > Hi,
> > > > >
> > > > > Improved tab completion for "ALTER DEFAULT PRIVILEGE" and "ALTER 
> > > > > TABLE":
> > > > > 1) GRANT, REVOKE and FOR USER keywords were not displayed in tab
> > > > > completion of alter default privileges, as in the statements below:
> > > > > ALTER DEFAULT PRIVILEGES GRANT INSERT ON tables TO PUBLIC;
> > > > > ALTER DEFAULT PRIVILEGES REVOKE INSERT ON tables FROM PUBLIC;
> > > > > ALTER DEFAULT PRIVILEGES FOR USER vignesh revoke INSERT ON tables 
> > > > > FROM dba1;
> > > >
> > > > +1
> > > >
> > > > >
> > > > > 2) USER was not displayed for "ALTER DEFAULT PRIVILEGES IN SCHEMA
> > > > > public FOR " like in below statement:
> > > > > ALTER DEFAULT PRIVILEGES IN SCHEMA public FOR USER dba1 GRANT INSERT
> > > > > ON TABLES TO PUBLIC;
> > > >
> > > > Since there is no difference between FOR USER and FOR ROLE, I'm not sure
> > > > we really want to support both in tab-completion.
> > >
> > > I have removed this change
> > >
> > > > >
> > > > > 3) "FOR GRANT OPTION" was not display for "ALTER DEFAULT PRIVILEGES
> > > > > REVOKE " like in below statement:
> > > > > alter default privileges revoke grant option for select ON tables 
> > > > > FROM dba1;
> > > >
> > > > +1. But the v3 patch doesn't cover the following case:
> > > >
> > > > =# alter default privileges for role masahiko revoke [tab]
> > > > ALL         CREATE      DELETE      EXECUTE     INSERT      MAINTAIN
> > > > REFERENCES  SELECT      TRIGGER     TRUNCATE    UPDATE      USAGE
> > >
> > > Modified in the updated patch
> > >
> > > > And it doesn't cover MAINTAIN either:
> > > >
> > > > =# alter default privileges revoke [tab]
> > > > ALL       DELETE    GRANT OPTION FOR  REFERENCES  TRIGGER   UPDATE
> > > > CREATE    EXECUTE   INSERT            SELECT      TRUNCATE  USAGE
> > >
> > > Modified in the updated patch
> > >
> > > > The patch adds the completions for ALTER DEFAULT PRIVILEGES REVOKE,
> > > > but we handle such case in GRANT and REVOKE part:
> > > >
> > > > (around L3958)
> > > > /*
> > > >  * With ALTER DEFAULT PRIVILEGES, restrict completion to 
> > > > grantable
> > > >  * privileges (can't grant roles)
> > > >  */
> > > > if (HeadMatches("ALTER", "DEFAULT", "PRIVILEGES"))
> > > > COMPLETE_WITH("SELECT", "INSERT", "UPDATE",
> > > >   "DELETE", "TRUNCATE", "REFERENCES", "TRIGGER",
> > > >   "CREATE", "EXECUTE", "USAGE", "MAINTAIN", 
> > > > "ALL");
> > >
> > > The current patch handles the fix from here now.
> > >
> > > > Also, I think we can support WITH GRANT OPTION too. For example,
> > > >
> > > > =# alter default privileges for role masahiko grant all on tables to
> > > > public [tab]
> > >
> > > I have handled this in the updated patch
> > >
> > > > It's already supported in the GRANT statement.
> > > >
> > > > >
> > > > > 4) "DATA TYPE" was missing in "ALTER TABLE table-name ALTER COLUMN
> > > > > column-name SET" like in:
> > > > > ALTER TABLE t1 ALTER COLUMN c1 SET DATA TYPE text;
> > > > >
> > > >
> > > > +1. The patch looks good to me, so pushed.
> > >
> > > Thanks for committing this.
> > >
> > > The updated patch has the changes for the above comments.
> > >
> >
> > Thank you for updating the patch.
> >
> > I think it doesn't work well, as "GRANT OPTION FOR" is completed
> > twice. For example,
> >
> > =# alter default privileges for user masahiko revoke [tab]
> > ALL       DELETE    GRANT OPTION FOR  MAINTAIN    SELECT    TRUNCATE  USAGE
> > CREATE    EXECUTE   INSERT            REFERENCES  TRIGGER   UPDATE
> > =# alter default privileges for user masahiko revoke grant option for [tab]
> > ALL       DELETE    GRANT OPTION FOR  MAINTAIN    SELECT    TRUNCATE  USAGE
> > CREATE    EXECUTE   INSERT            REFERENCES  TRIGGER   UPDATE
>
> Thanks for finding this issue, the attached v5 version patch has the
> fix for the same.

Thank you for updating the patch! I've pushed with minor adjustments.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Synchronizing slots from primary to standby

2024-04-07 Thread Amit Kapila
On Sun, Apr 7, 2024 at 3:06 AM Andres Freund  wrote:
>
> On 2024-04-06 10:58:32 +0530, Amit Kapila wrote:
> > On Sat, Apr 6, 2024 at 10:13 AM Amit Kapila  wrote:
> > >
> >
> > There are still a few pending issues to be fixed in this feature but
> > otherwise, we have committed all the main patches, so I marked the CF
> > entry corresponding to this work as committed.
>
> There are a fair number of failures of 040_standby_failover_slots_sync in
> the buildfarm.  It'd be nice to get those fixed soon-ish.
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris&dt=2024-04-06%2020%3A58%3A50
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2024-04-06%2015%3A18%3A08
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=olingo&dt=2024-04-06%2010%3A13%3A58
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=grassquit&dt=2024-04-05%2016%3A04%3A10
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=olingo&dt=2024-04-05%2014%3A59%3A40
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=adder&dt=2024-04-05%2014%3A59%3A07
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=grassquit&dt=2024-04-05%2014%3A18%3A07
>
> The symptoms are similar, but not entirely identical across all of them, I 
> think.
>

I have analyzed these failures and there are two different tests that
are failing but the underlying reason is the same as being discussed
with Bertrand. We are working on the fix.

-- 
With Regards,
Amit Kapila.




Re: Extension Enhancement: Buffer Invalidation in pg_buffercache

2024-04-07 Thread Thomas Munro
On Mon, Apr 8, 2024 at 12:10 PM Andres Freund  wrote:
> On 2024-04-07 11:07:58 +1200, Thomas Munro wrote:
> > I thought of a better name for the bufmgr.c function though:
> > InvalidateUnpinnedBuffer().  That name seemed better to me after I
> > festooned it with warnings about why exactly it's inherently racy and
> > only for testing use.
>
> I still dislike that, fwiw, due to the naming similarity to
> InvalidateBuffer(), which throws away dirty buffer contents too. Which
> obviously isn't acceptable from "userspace".  I'd just name it
> pg_buffercache_evict() - given that the commit message's first paragraph uses
> "it is useful to be able to evict arbitrary blocks" that seems to describe
> things at least as well as "invalidate"?

Alright, sold.  I'll go with EvictUnpinnedBuffer() in bufmgr.c and
pg_buffercache_evict() in the contrib module.




Replace FunctionCall2Coll with FunctionCallInvoke

2024-04-07 Thread Andy Fan

Hello,

During the review of using extended statistics to improve join estimates
[1], I found some code-level optimization opportunities that apply to
existing code as well. To make the discussion easier, I'm opening this new
thread.

I didn't know what to name the thread, so I just used patch 1's subject for
the naming.

commit d228d9734e70b4f43ad824d736fb1279d2aad5fc (HEAD -> misc_stats)
Author: yizhi.fzh 
Date:   Mon Apr 8 11:43:37 2024 +0800

Fast path for empty clauselist in clauselist_selectivity_ext

It should be common in real life; for example:

SELECT * FROM t1, t2 WHERE t1.a = t2.a AND t1.a > 3;

clauses == NIL at the scan level of t2.

This would also make debugging easier.
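(A hedged sketch of that fast path; its placement and exact form in
clauselist_selectivity_ext() are assumptions here, not the attached patch:)

	/* fast path: an empty clause list restricts nothing */
	if (clauses == NIL)
		return (Selectivity) 1.0;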

commit e852ce631f9348d5d29c8a53090ee71f7253767c
Author: yizhi.fzh 
Date:   Mon Apr 8 11:13:57 2024 +0800

Improve FunctionCall2Coll with FunctionCallInvoke

If FunctionCall2Coll is called multiple times with the same FmgrInfo,
turning it into FunctionCallInvoke is helpful since the latter can reuse
a shared FunctionCallInfo.

There are many other places, like GISTSTATE, which have similar
opportunities, but I'd first like to see whether such changes are
worthwhile at all.


[1]
https://www.postgresql.org/message-id/c8c0ff31-3a8a-7562-bbd3-78b2ec65f16c%40enterprisedb.com
 

-- 
Best Regards
Andy Fan
>From e852ce631f9348d5d29c8a53090ee71f7253767c Mon Sep 17 00:00:00 2001
From: "yizhi.fzh" 
Date: Mon, 8 Apr 2024 11:13:57 +0800
Subject: [PATCH v1 1/2] Improve FunctionCall2Coll with FunctionCallInvoke

If FunctionCall2Coll is called multiple times with the same FmgrInfo,
turning it into FunctionCallInvoke is helpful since the latter can reuse
a shared FunctionCallInfo.

There are many other places, like GISTSTATE, which have similar
opportunities, but I'd first like to see whether such changes are worthwhile.
---
 src/backend/utils/adt/selfuncs.c | 35 +---
 1 file changed, 19 insertions(+), 16 deletions(-)

diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 35f8f306ee..8dfeba5444 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -196,7 +196,7 @@ static bool get_variable_range(PlannerInfo *root, VariableStatData *vardata,
 			   Oid sortop, Oid collation,
 			   Datum *min, Datum *max);
 static void get_stats_slot_range(AttStatsSlot *sslot,
- Oid opfuncoid, FmgrInfo *opproc,
+ FunctionCallInfo fcinfo,
  Oid collation, int16 typLen, bool typByVal,
  Datum *min, Datum *max, bool *p_have_data);
 static bool get_actual_variable_range(PlannerInfo *root,
@@ -5905,6 +5905,7 @@ get_variable_range(PlannerInfo *root, VariableStatData *vardata,
 	bool		typByVal;
 	Oid			opfuncoid;
 	FmgrInfo	opproc;
+	LOCAL_FCINFO(fcinfo, 2);
 	AttStatsSlot sslot;
 
 	/*
@@ -5936,8 +5937,6 @@ get_variable_range(PlannerInfo *root, VariableStatData *vardata,
 	   (opfuncoid = get_opcode(sortop
 		return false;
 
-	opproc.fn_oid = InvalidOid; /* mark this as not looked up yet */
-
 	get_typlenbyval(vardata->atttype, &typLen, &typByVal);
 
 	/*
@@ -5957,6 +5956,11 @@ get_variable_range(PlannerInfo *root, VariableStatData *vardata,
 		free_attstatsslot(&sslot);
 	}
 
+	fmgr_info(opfuncoid, &opproc);
+	InitFunctionCallInfoData(*fcinfo, &opproc, 2, collation, NULL, NULL);
+	fcinfo->args[0].isnull = false;
+	fcinfo->args[1].isnull = false;
+
 	/*
 	 * Otherwise, if there is a histogram with some other ordering, scan it
 	 * and get the min and max values according to the ordering we want.  This
@@ -5968,7 +5972,7 @@ get_variable_range(PlannerInfo *root, VariableStatData *vardata,
 		 STATISTIC_KIND_HISTOGRAM, InvalidOid,
 		 ATTSTATSSLOT_VALUES))
 	{
-		get_stats_slot_range(&sslot, opfuncoid, &opproc,
+		get_stats_slot_range(&sslot, fcinfo,
 			 collation, typLen, typByVal,
 			 &tmin, &tmax, &have_data);
 		free_attstatsslot(&sslot);
@@ -6003,7 +6007,7 @@ get_variable_range(PlannerInfo *root, VariableStatData *vardata,
 		}
 
 		if (use_mcvs)
-			get_stats_slot_range(&sslot, opfuncoid, &opproc,
+			get_stats_slot_range(&sslot, fcinfo,
  collation, typLen, typByVal,
  &tmin, &tmax, &have_data);
 		free_attstatsslot(&sslot);
@@ -6021,7 +6025,7 @@ get_variable_range(PlannerInfo *root, VariableStatData *vardata,
  * to what we find in the statistics array.
  */
 static void
-get_stats_slot_range(AttStatsSlot *sslot, Oid opfuncoid, FmgrInfo *opproc,
+get_stats_slot_range(AttStatsSlot *sslot, FunctionCallInfo fcinfo,
 	 Oid collation, int16 typLen, bool typByVal,
 	 Datum *min, Datum *max, bool *p_have_data)
 {
@@ -6031,10 +6035,6 @@ get_stats_slot_range(AttStatsSlot *sslot, Oid opfuncoid, FmgrInfo *opproc,
 	bool		found_tmin = false;
 	bool		found_tmax = false;
 
-	/* Look up the comparison function, if we didn't already do so */
-	if (opproc->fn_oid != opfuncoid)
-		fmgr_info(opfuncoid, opproc);
-
 	/* Scan all the s

Re: Improve heapgetpage() performance, overhead from serializable

2024-04-07 Thread David Rowley
On Mon, 8 Apr 2024 at 16:08, Andres Freund  wrote:
>
> On 2024-04-08 15:43:12 +1200, David Rowley wrote:
> > I understand wanting to avoid the long name.  I'd rather stay clear of
> > "visible", but don't feel as strongly about this as it's static.
>
> I think visible would be ok, the serialization checks are IMO about
> visibility too. But if you'd prefer I'd also be ok with something like
> page_collect_tuples()?

That's ok for me.

David




Re: Allow non-superuser to cancel superuser tasks.

2024-04-07 Thread Michael Paquier
On Fri, Apr 05, 2024 at 08:07:51PM -0500, Nathan Bossart wrote:
> On Sat, Apr 06, 2024 at 08:56:04AM +0900, Michael Paquier wrote:
>> There is pg_read_all_stats as well, so I don't see a big issue in
>> requiring to be a member of this role as well for the sake of what's
>> being proposed here.
> 
> Well, that tells you quite a bit more than just which PIDs correspond to
> autovacuum workers, but maybe that's good enough for now.

That may be a good initial compromise, for now.

>> I'd rather not leak any information at the end for
>> anybody calling pg_signal_backend without access to the stats, so
>> checking the backend type after the role sounds kind of a safer
>> long-term approach for me.
> 
> I'm not following what you mean by this.  Are you suggesting that we should
> keep the existing superuser message for the autovacuum workers?

Mostly.  Just to be clear, the patch has the following problem:
=# CREATE ROLE popo LOGIN;
CREATE ROLE
=# CREATE EXTENSION injection_points;
CREATE EXTENSION
=# select injection_points_attach('autovacuum-start', 'wait');
 injection_points_attach
-------------------------

(1 row)
=# select pid, backend_type from pg_stat_activity
 where wait_event = 'autovacuum-start' LIMIT 1;
  pid  |   backend_type
-------+-------------------
 14163 | autovacuum worker
(1 row)
=> \c postgres popo
You are now connected to database "postgres" as user "popo". 
=> select pg_terminate_backend(14163);
ERROR:  42501: permission denied to terminate autovacuum worker backend
DETAIL:  Only roles with the SUPERUSER attribute or with privileges of
the "pg_signal_autovacuum" role may terminate autovacuum worker
backend
LOCATION:  pg_terminate_backend, signalfuncs.c:267 
=> select backend_type from pg_stat_activity where pid = 14163;
 backend_type
--------------
 null
(1 row)

And we should try to reshape things so that we get an ERROR like
"permission denied to terminate process" or "permission denied to
cancel query" for all the error paths, including autovacuum workers
and backends, so that we never leak any information about the backend
types involved when a role has no permission to issue the signal.
Perhaps that's the most intuitive thing as well, because autovacuum
workers are backends.  One thing that we could do is to mention both
pg_signal_backend and pg_signal_autovacuum in the errdetail, and have
both cases be handled by SIGNAL_BACKEND_NOPERMISSION on failure.
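(A hedged sketch of what that could look like in signalfuncs.c -- the message
wording and combined errdetail are assumptions, not committed code:)

	case SIGNAL_BACKEND_NOPERMISSION:
		ereport(ERROR,
				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
				 errmsg("permission denied to terminate process"),
				 errdetail("Only roles with privileges of the \"%s\" or \"%s\" role may terminate this process.",
						   "pg_signal_backend", "pg_signal_autovacuum")));
		break;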
--
Michael


signature.asc
Description: PGP signature


Re: Improve heapgetpage() performance, overhead from serializable

2024-04-07 Thread Andres Freund
Hi,

On 2024-04-08 15:43:12 +1200, David Rowley wrote:
> On Mon, 8 Apr 2024 at 15:13, Andres Freund  wrote:
> > Off-list Melanie suggested heap_page_collect_visible_tuples(). Given the
> > separate callsites (making long names annoying) and the fact that it's 
> > really
> > specific to one caller, I'm somewhat inclined to just go with
> > collect_visible_tuples() or page_collect_visible(), I think I prefer the
> > latter.
> 
> I understand wanting to avoid the long name.  I'd rather stay clear of
> "visible", but don't feel as strongly about this as it's static.

I think visible would be ok, the serialization checks are IMO about
visibility too. But if you'd prefer I'd also be ok with something like
page_collect_tuples()?

Greetings,

Andres Freund




Re: Add SPLIT PARTITION/MERGE PARTITIONS commands

2024-04-07 Thread Alexander Lakhin

08.04.2024 01:15, Alexander Korotkov wrote:

Thank you for spotting this.  This seems like a missing check.  I'm
going to get a closer look at this tomorrow.



Thanks!

There is also an anomaly with the MERGE command:
CREATE TABLE t1 (i int, a int, b int, c int) PARTITION BY RANGE (a, b);
CREATE TABLE t1p1 PARTITION OF t1 FOR VALUES FROM (1, 1) TO (1, 2);

CREATE TABLE t2 (i int, t text) PARTITION BY RANGE (t);
CREATE TABLE t2pa PARTITION OF t2 FOR VALUES FROM ('A') TO ('C');

CREATE TABLE t3 (i int, t text);

ALTER TABLE t2 MERGE PARTITIONS (t1p1, t2pa, t3) INTO t2p;

leads to:
ERROR:  partition bound for relation "t3" is null
WARNING:  problem in alloc set PortalContext: detected write past chunk end in 
block 0x55f1ef42f820, chunk 0x55f1ef42ff40
WARNING:  problem in alloc set PortalContext: detected write past chunk end in 
block 0x55f1ef42f820, chunk 0x55f1ef42ff40

(I'm also not sure that the error message is clear enough (can't we say
"relation X is not a partition of relation Y" in this context, as in
MarkInheritDetached(), for example?).)

Whilst with
ALTER TABLE t2 MERGE PARTITIONS (t1p1, t2pa) INTO t2p;

I get:
Program terminated with signal SIGSEGV, Segmentation fault.

#0  pg_detoast_datum_packed (datum=0x1) at fmgr.c:1866
1866    if (VARATT_IS_COMPRESSED(datum) || VARATT_IS_EXTERNAL(datum))
(gdb) bt
#0  pg_detoast_datum_packed (datum=0x1) at fmgr.c:1866
#1  0x55d77d00fde2 in bttextcmp (...) at 
../../../../src/include/postgres.h:314
#2  0x55d77d03fa27 in FunctionCall2Coll (...) at fmgr.c:1161
#3  0x55d77ce1572f in partition_rbound_cmp (...) at partbounds.c:3525
#4  0x55d77ce157b9 in qsort_partition_rbound_cmp (...) at partbounds.c:3816
#5  0x55d77d0982ef in qsort_arg (...) at 
../../src/include/lib/sort_template.h:316
#6  0x55d77ce1d109 in calculate_partition_bound_for_merge (...) at 
partbounds.c:5786
#7  0x55d77cc24b2b in transformPartitionCmdForMerge (...) at 
parse_utilcmd.c:3524
#8  0x55d77cc2b555 in transformAlterTableStmt (...) at parse_utilcmd.c:3812
#9  0x55d77ccab17c in ATParseTransformCmd (...) at tablecmds.c:5650
#10 0x55d77ccafd09 in ATExecCmd (...) at tablecmds.c:5589
...

Best regards,
Alexander




Re: Use streaming read API in ANALYZE

2024-04-07 Thread Thomas Munro
On Mon, Apr 8, 2024 at 10:26 AM Melanie Plageman
 wrote:
> On Sun, Apr 07, 2024 at 03:00:00PM -0700, Andres Freund wrote:
> > On 2024-04-07 16:59:26 -0400, Melanie Plageman wrote:
> > > From 862b7ac81cdafcda7b525e02721da14e46265509 Mon Sep 17 00:00:00 2001
> > > From: Melanie Plageman 
> > > Date: Sun, 7 Apr 2024 15:38:41 -0400
> > > Subject: [PATCH v10 3/3] Obsolete BlockSampler_HasMore()
> > >
> > > A previous commit stopped using BlockSampler_HasMore() for flow control
> > > in acquire_sample_rows(). There seems little use now for
> > > BlockSampler_HasMore(). It should be sufficient to return
> > > InvalidBlockNumber from BlockSampler_Next() when BlockSample_HasMore()
> > > would have returned false. Remove BlockSampler_HasMore().
> > >
> > > Author: Melanie Plageman, Nazir Bilal Yavuz
> > > Discussion: 
> > > https://postgr.es/m/flat/CAN55FZ0UhXqk9v3y-zW_fp4-WCp43V8y0A72xPmLkOM%2B6M%2BmJg%40mail.gmail.com
> >
> > The justification here seems somewhat odd. Sure, the previous commit stopped
> > using BlockSampler_HasMore in acquire_sample_rows - but only because it was
> > moved to block_sampling_streaming_read_next()?
>
> It didn't stop using it. It stopped being useful. The reason it existed,
> as far as I can tell, was to use it as the while() loop condition in
> acquire_sample_rows(). I think it makes much more sense for
> BlockSampler_Next() to return InvalidBlockNumber when there are no more
> blocks -- not to assert you don't call it when there aren't any more
> blocks.
>
> I didn't want to change BlockSampler_Next() in the same commit as the
> streaming read user and we can't remove BlockSampler_HasMore() without
> changing BlockSampler_Next().

I agree that the code looks useless if one condition implies the
other, but isn't it good to keep that cross-check, perhaps
reformulated as an assertion?  I didn't look too hard at the maths, I
just saw the words "It is not obvious that this code matches Knuth's
Algorithm S ..." and realised I'm not sure I have time to develop a
good opinion about this today.  So I'll leave the 0002 change out for
now, as it's a tidy-up that can easily be applied in the next cycle.
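(A hedged sketch of keeping that cross-check as an assertion inside the
reformulated BlockSampler_Next(), with K and k as computed in sampling.c;
whether the implication always holds is exactly the Knuth question above:)

	if (K <= 0 || k <= 0)
	{
		/*
		 * Running out of blocks (K <= 0) should imply the sample is already
		 * complete (k <= 0), since n = Min(samplesize, N) at init time.
		 */
		Assert(k <= 0);
		return InvalidBlockNumber;
	}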
From c3b8df8e4720d8b0dfb4c892c0aa3ddaef8f401f Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Mon, 8 Apr 2024 14:38:58 +1200
Subject: [PATCH v12] Remove obsolete BlockSampler_HasMore().

Commit 041b9680 stopped using BlockSampler_HasMore() for flow control in
acquire_sample_rows(). There seems to be little use for it now. We can
just return InvalidBlockNumber from BlockSampler_Next() when
BlockSample_HasMore() would have returned false.

Author: Melanie Plageman 
Author: Nazir Bilal Yavuz 
Reviewed-by: Andres Freund 
Discussion: https://postgr.es/m/flat/CAN55FZ0UhXqk9v3y-zW_fp4-WCp43V8y0A72xPmLkOM%2B6M%2BmJg%40mail.gmail.com
---
 src/backend/commands/analyze.c|  4 +---
 src/backend/utils/misc/sampling.c | 10 +++---
 src/include/utils/sampling.h  |  1 -
 3 files changed, 4 insertions(+), 11 deletions(-)

diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index da27a13a3f..e9fa3470cf 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -,9 +,7 @@ block_sampling_read_stream_next(ReadStream *stream,
 void *callback_private_data,
 void *per_buffer_data)
 {
-	BlockSamplerData *bs = callback_private_data;
-
-	return BlockSampler_HasMore(bs) ? BlockSampler_Next(bs) : InvalidBlockNumber;
+	return BlockSampler_Next(callback_private_data);
 }
 
 /*
diff --git a/src/backend/utils/misc/sampling.c b/src/backend/utils/misc/sampling.c
index 933db06702..6e2bca9739 100644
--- a/src/backend/utils/misc/sampling.c
+++ b/src/backend/utils/misc/sampling.c
@@ -54,12 +54,6 @@ BlockSampler_Init(BlockSampler bs, BlockNumber nblocks, int samplesize,
 	return Min(bs->n, bs->N);
 }
 
-bool
-BlockSampler_HasMore(BlockSampler bs)
-{
-	return (bs->t < bs->N) && (bs->m < bs->n);
-}
-
 BlockNumber
 BlockSampler_Next(BlockSampler bs)
 {
@@ -68,7 +62,9 @@ BlockSampler_Next(BlockSampler bs)
 	double		p;/* probability to skip block */
 	double		V;/* random */
 
-	Assert(BlockSampler_HasMore(bs));	/* hence K > 0 and k > 0 */
+	/* Return if no remaining blocks or no blocks to sample */
+	if (K <= 0 || k <= 0)
+		return InvalidBlockNumber;
 
 	if ((BlockNumber) k >= K)
 	{
diff --git a/src/include/utils/sampling.h b/src/include/utils/sampling.h
index be48ee52ba..fb5d6820a2 100644
--- a/src/include/utils/sampling.h
+++ b/src/include/utils/sampling.h
@@ -38,7 +38,6 @@ typedef BlockSamplerData *BlockSampler;
 
 extern BlockNumber BlockSampler_Init(BlockSampler bs, BlockNumber nblocks,
 	 int samplesize, uint32 randseed);
-extern bool BlockSampler_HasMore(BlockSampler bs);
 extern BlockNumber BlockSampler_Next(BlockSampler bs);
 
 /* Reservoir sampling methods */
-- 
2.44.0



Re: Improve heapgetpage() performance, overhead from serializable

2024-04-07 Thread David Rowley
On Mon, 8 Apr 2024 at 15:13, Andres Freund  wrote:
> I kinda don't like heap_prepare_pagescan(), but heapgetpage() is worse. And I
> don't have a great alternative suggestion.

It came around from having nothing better.  I was keen not to have the
name indicate it was only for checking visibility as we're also
checking for serialization conflicts and pruning the page.  The word
"prepare" made it there as it seemed generic enough to not falsely
indicate it was only checking visibility.  Also, it seemed good to
keep it generic as if we one day end up with something new that needs
to be done before scanning a page in page mode then that new code is
more likely to be put in the function with a generic name rather than
a function that appears to be named for some other unrelated task.  It
would be nice not to end up with two functions to call before scanning
a page in page mode.

> Off-list Melanie suggested heap_page_collect_visible_tuples(). Given the
> separate callsites (making long names annoying) and the fact that it's really
> specific to one caller, I'm somewhat inclined to just go with
> collect_visible_tuples() or page_collect_visible(), I think I prefer the
> latter.

I understand wanting to avoid the long name.  I'd rather stay clear of
"visible", but don't feel as strongly about this as it's static.

David




Re: remaining sql/json patches

2024-04-07 Thread jian he
On Mon, Apr 8, 2024 at 12:34 AM jian he  wrote:
>
> On Sun, Apr 7, 2024 at 9:36 PM Amit Langote  wrote:
> > 0002 needs an expanded commit message but I've run out of energy today.
> >
>

+/*
+ * Fetch next row from a JsonTablePlan's path evaluation result and from
+ * any child nested path(s).
+ *
+ * Returns true if the any of the paths (this or the nested) has more rows to
+ * return.
+ *
+ * By fetching the nested path(s)'s rows based on the parent row at each
+ * level, this essentially joins the rows of different levels.  If any level
+ * has no matching rows, the columns at that level will compute to NULL,
+ * making it an OUTER join.
+ */
+static bool
+JsonTablePlanScanNextRow(JsonTablePlanState *planstate)

"if the any"
should be
"if any" ?

also I think,
 + If any level
+ * has no matching rows, the columns at that level will compute to NULL,
+ * making it an OUTER join.
means
+ If any level rows do not match, the rows at that level will compute to NULL,
+ making it an OUTER join.

other than that, it looks good to me.




Re: Improve heapgetpage() performance, overhead from serializable

2024-04-07 Thread Andres Freund
Hi,

On 2024-04-08 14:43:21 +1200, David Rowley wrote:
> On Sun, 7 Apr 2024 at 19:30, Andres Freund  wrote:
> > Good call. Added and pushed.
> 
> I understand you're already aware of the reference in the comment to
> heapgetpage(), which no longer exists as of 44086b097.

Yea, https://postgr.es/m/20240407172615.cocrsvboqm3ttqe4%40awork3.anarazel.de


> Melanie and I had discussed the heap_prepare_pagescan() name while I
> was reviewing that recent refactor. Aside from fixing the comment, how
> about also renaming heapgetpage_collect() to
> heap_prepare_pagescan_tuples()?

> Patch attached for reference. Not looking for any credit.
> 
> I'm also happy to revisit the heap_prepare_pagescan() name and call
> heapgetpage_collect() some appropriate derivative of whatever we'd
> rename that to.

I kinda don't like heap_prepare_pagescan(), but heapgetpage() is worse. And I
don't have a great alternative suggestion.

Off-list Melanie suggested heap_page_collect_visible_tuples(). Given the
separate callsites (making long names annoying) and the fact that it's really
specific to one caller, I'm somewhat inclined to just go with
collect_visible_tuples() or page_collect_visible(), I think I prefer the
latter.

Greetings,

Andres Freund




Re: Improve heapgetpage() performance, overhead from serializable

2024-04-07 Thread David Rowley
On Sun, 7 Apr 2024 at 19:30, Andres Freund  wrote:
> Good call. Added and pushed.

I understand you're already aware of the reference in the comment to
heapgetpage(), which no longer exists as of 44086b097.

Melanie and I had discussed the heap_prepare_pagescan() name while I
was reviewing that recent refactor. Aside from fixing the comment, how
about also renaming heapgetpage_collect() to
heap_prepare_pagescan_tuples()?

Patch attached for reference. Not looking for any credit.

I'm also happy to revisit the heap_prepare_pagescan() name and call
heapgetpage_collect() some appropriate derivative of whatever we'd
rename that to.

Copied Melanie as she may want to chime in too.

David
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 2663f52d1a..d3c2a60985 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -434,16 +434,15 @@ heap_setscanlimits(TableScanDesc sscan, BlockNumber startBlk, BlockNumber numBlk
 }
 
 /*
- * Per-tuple loop for heapgetpage() in pagemode. Pulled out so it can be
- * called multiple times, with constant arguments for all_visible,
+ * Per-tuple loop for heap_prepare_pagescan().  Pulled out so it can be called
+ * multiple times, with constant arguments for all_visible,
  * check_serializable.
  */
 pg_attribute_always_inline
 static int
-heapgetpage_collect(HeapScanDesc scan, Snapshot snapshot,
-					Page page, Buffer buffer,
-					BlockNumber block, int lines,
-					bool all_visible, bool check_serializable)
+heap_prepare_pagescan_tuples(HeapScanDesc scan, Snapshot snapshot, Page page,
+							 Buffer buffer, BlockNumber block, int lines,
+							 bool all_visible, bool check_serializable)
 {
 	int			ntup = 0;
 	OffsetNumber lineoff;
@@ -547,28 +546,36 @@ heap_prepare_pagescan(TableScanDesc sscan)
 		CheckForSerializableConflictOutNeeded(scan->rs_base.rs_rd, snapshot);
 
 	/*
-	 * We call heapgetpage_collect() with constant arguments, to get the
-	 * compiler to constant fold the constant arguments. Separate calls with
-	 * constant arguments, rather than variables, are needed on several
+	 * We call heap_prepare_pagescan_tuples() with constant arguments, to get
+	 * the compiler to constant fold the constant arguments. Separate calls
+	 * with constant arguments, rather than variables, are needed on several
 	 * compilers to actually perform constant folding.
 	 */
 	if (likely(all_visible))
 	{
 		if (likely(!check_serializable))
-			scan->rs_ntuples = heapgetpage_collect(scan, snapshot, page, buffer,
-												   block, lines, true, false);
+			scan->rs_ntuples = heap_prepare_pagescan_tuples(scan, snapshot,
+															page, buffer,
+															block, lines,
+															true, false);
 		else
-			scan->rs_ntuples = heapgetpage_collect(scan, snapshot, page, buffer,
-												   block, lines, true, true);
+			scan->rs_ntuples = heap_prepare_pagescan_tuples(scan, snapshot,
+															page, buffer,
+															block, lines,
+															true, true);
 	}
 	else
 	{
 		if (likely(!check_serializable))
-			scan->rs_ntuples = heapgetpage_collect(scan, snapshot, page, buffer,
-												   block, lines, false, false);
+			scan->rs_ntuples = heap_prepare_pagescan_tuples(scan, snapshot,
+															page, buffer,
+															block, lines,
+															false, false);
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan

2024-04-07 Thread Peter Geoghegan
On Sun, Apr 7, 2024 at 9:57 PM Peter Geoghegan  wrote:
> On Sun, Apr 7, 2024 at 9:50 PM Tom Lane  wrote:
> > those first two Asserts are redundant with the "if" as well.
>
> I'll get rid of those other two assertions as well, then.

Done that way.

-- 
Peter Geoghegan




Re: Coverity complains about simplehash.h's SH_STAT()

2024-04-07 Thread Andres Freund
On 2024-04-07 21:41:23 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > I can't see a way it could hurt in the back branches, so I'm inclined to
> > backpatch the pfree?
> 
> +1

Done




Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan

2024-04-07 Thread Peter Geoghegan
On Sun, Apr 7, 2024 at 9:50 PM Tom Lane  wrote:
> If you're doing that, then surely
>
> if (j != (BTEqualStrategyNumber - 1) ||
> !(xform[j].skey->sk_flags & SK_SEARCHARRAY))
> {
> ...
> }
> else
> {
> Assert(j == (BTEqualStrategyNumber - 1));
> Assert(xform[j].skey->sk_flags & SK_SEARCHARRAY);
> Assert(xform[j].ikey == array->scan_key);
> Assert(!(cur->sk_flags & SK_SEARCHARRAY));
> }
>
> those first two Asserts are redundant with the "if" as well.

I'll get rid of those other two assertions as well, then.

-- 
Peter Geoghegan




Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan

2024-04-07 Thread Tom Lane
Peter Geoghegan  writes:
> The assertions in question are arguably redundant. There are very
> similar assertions just a little earlier on, as we initially set up
> the array stuff (right before _bt_compare_scankey_args is called).
> I'll just remove the "Assert(xform[j].ikey == array->scan_key)"
> assertion that Coverity doesn't like, in addition to the
> "Assert(!array || array->scan_key == i)" assertion, on the grounds
> that they're redundant.

If you're doing that, then surely

if (j != (BTEqualStrategyNumber - 1) ||
!(xform[j].skey->sk_flags & SK_SEARCHARRAY))
{
...
}
else
{
Assert(j == (BTEqualStrategyNumber - 1));
Assert(xform[j].skey->sk_flags & SK_SEARCHARRAY);
Assert(xform[j].ikey == array->scan_key);
Assert(!(cur->sk_flags & SK_SEARCHARRAY));
}

those first two Asserts are redundant with the "if" as well.

regards, tom lane




Re: Coverity complains about simplehash.h's SH_STAT()

2024-04-07 Thread Tom Lane
Andres Freund  writes:
> On 2024-04-07 21:03:53 -0400, Tom Lane wrote:
>> I realize this function is only debug support, but wouldn't it
>> be appropriate to pfree(collisions) before exiting?

> It indeed looks like that memory should be freed. Very odd that coverity
> started to complain about that just now.  If coverity had started to complain
> after da41d71070d, I'd understand, but that was ~5 years ago.

If we recently added a new simplehash caller, Coverity might see that
as a new bug.  Still doesn't explain why nothing about the old callers.
We might have over-hastily dismissed such warnings as uninteresting,
perhaps.

> I can't see a way it could hurt in the back branches, so I'm inclined to
> backpatch the pfree?

+1

regards, tom lane




Re: Experiments with Postgres and SSL

2024-04-07 Thread Heikki Linnakangas

On 08/04/2024 04:28, Tom Lane wrote:

Heikki Linnakangas  writes:

Committed this. Thank you to everyone involved!


Looks like perlcritic isn't too happy with the test code:
koel and crake say

./src/test/libpq_encryption/t/001_negotiate_encryption.pl: Return value of 
flagged function ignored - chmod at line 138, column 2.  See pages 208,278 of 
PBP.  ([InputOutput::RequireCheckedSyscalls] Severity: 5)
./src/test/libpq_encryption/t/001_negotiate_encryption.pl: Return value of 
flagged function ignored - open at line 184, column 1.  See pages 208,278 of 
PBP.  ([InputOutput::RequireCheckedSyscalls] Severity: 5)


Fixed, thanks.
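(Presumably something like the following perlcritic-clean form -- a sketch,
not the committed hunk; the file and variable names are made up:)

	chmod(0600, $key_file)
		or die "failed to chmod $key_file: $!";
	open(my $out, '>', $log_file)
		or die "failed to open $log_file: $!";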

I'll make a note in my personal TODO list to add perlcritic to cirrus CI 
if possible. I rely heavily on that nowadays to catch issues before the 
buildfarm.


--
Heikki Linnakangas
Neon (https://neon.tech)





Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan

2024-04-07 Thread Peter Geoghegan
On Sun, Apr 7, 2024 at 9:25 PM Tom Lane  wrote:
> Perhaps this'd help:
>
> -Assert(xform[j].ikey == array->scan_key);
> +Assert(array && xform[j].ikey == array->scan_key);
>
> If that doesn't silence it, I'd be prepared to just dismiss the
> warning.

The assertions in question are arguably redundant. There are very
similar assertions just a little earlier on, as we initially set up
the array stuff (right before _bt_compare_scankey_args is called).
I'll just remove the "Assert(xform[j].ikey == array->scan_key)"
assertion that Coverity doesn't like, in addition to the
"Assert(!array || array->scan_key == i)" assertion, on the grounds
that they're redundant.

> Some work in the comment to explain why we must have an array here
> wouldn't be out of place either, perhaps.

There is a comment block about this right above the assertion in question:

/*
 * Both scan keys might have arrays, in which case we'll
 * arbitrarily pass only one of the arrays.  That won't
 * matter, since _bt_compare_scankey_args is aware that two
 * SEARCHARRAY scan keys mean that _bt_preprocess_array_keys
 * failed to eliminate redundant arrays through array merging.
 * _bt_compare_scankey_args just returns false when it sees
 * this; it won't even try to examine either array.
 */

Do you think it needs more work?

-- 
Peter Geoghegan




Re: Popcount optimization using AVX512

2024-04-07 Thread Tom Lane
Nathan Bossart  writes:
> On Sun, Apr 07, 2024 at 08:23:32PM -0500, Nathan Bossart wrote:
>> The Intel documentation for _mm256_undefined_si256() [0]
>> indicates that it is intended to return "undefined elements," so it seems
>> like the use of an uninitialized variable might be intentional.

> See also https://gcc.gnu.org/git/gitweb.cgi?p=gcc.git;h=72af61b122.

Ah, interesting.  That hasn't propagated to stable distros yet,
evidently (and even when it does, I wonder how soon Coverity
will understand it).  Anyway, that does establish that it's
gcc's problem not ours.  Thanks for digging!

regards, tom lane




Re: Coverity complains about simplehash.h's SH_STAT()

2024-04-07 Thread Andres Freund
Hi,

On 2024-04-07 21:03:53 -0400, Tom Lane wrote:
> Today's Coverity run produced this:
> 
> /srv/coverity/git/pgsql-git/postgresql/src/include/lib/simplehash.h: 1138 in 
> bh_nodeidx_stat()
> 1132 avg_collisions = 0;
> 1133 }
> 1134 
> 1135 sh_log("size: " UINT64_FORMAT ", members: %u, filled: %f, total 
> chain: %u, max chain: %u, avg chain: %f, total_collisions: %u, 
> max_collisions: %u, avg_collisions: %f",
> 1136tb->size, tb->members, fillfactor, total_chain_length, 
> max_chain_length, avg_chain_length,
> 1137total_collisions, max_collisions, avg_collisions);
> >>> CID 1596268:  Resource leaks  (RESOURCE_LEAK)
> >>> Variable "collisions" going out of scope leaks the storage it points 
> >>> to.
> 1138 }
> 1139 
> 1140 #endif/* SH_DEFINE */
> 
> I have no idea why we didn't see this warning before --- but AFAICS
> it's quite right, and it looks like a nontrivial amount of memory
> could be at stake:
> 
> uint32   *collisions = (uint32 *) palloc0(tb->size * sizeof(uint32));
> 
> I realize this function is only debug support, but wouldn't it
> be appropriate to pfree(collisions) before exiting?

It indeed looks like that memory should be freed. Very odd that coverity
started to complain about that just now.  If coverity had started to complain
after da41d71070d, I'd understand, but that was ~5 years ago.

I can't see a way it could hurt in the back branches, so I'm inclined to
backpatch the pfree?

Greetings,

Andres Freund




Re: Popcount optimization using AVX512

2024-04-07 Thread Nathan Bossart
On Sun, Apr 07, 2024 at 08:23:32PM -0500, Nathan Bossart wrote:
> The Intel documentation for _mm256_undefined_si256() [0]
> indicates that it is intended to return "undefined elements," so it seems
> like the use of an uninitialized variable might be intentional.

See also https://gcc.gnu.org/git/gitweb.cgi?p=gcc.git;h=72af61b122.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Experiments with Postgres and SSL

2024-04-07 Thread Tom Lane
Heikki Linnakangas  writes:
> Committed this. Thank you to everyone involved!

Looks like perlcritic isn't too happy with the test code:
koel and crake say

./src/test/libpq_encryption/t/001_negotiate_encryption.pl: Return value of 
flagged function ignored - chmod at line 138, column 2.  See pages 208,278 of 
PBP.  ([InputOutput::RequireCheckedSyscalls] Severity: 5)
./src/test/libpq_encryption/t/001_negotiate_encryption.pl: Return value of 
flagged function ignored - open at line 184, column 1.  See pages 208,278 of 
PBP.  ([InputOutput::RequireCheckedSyscalls] Severity: 5)


regards, tom lane




Re: Experiments with Postgres and SSL

2024-04-07 Thread Heikki Linnakangas

Committed this. Thank you to everyone involved!

On 04/04/2024 14:08, Matthias van de Meent wrote:

Patch 0003:

The read size in secure_raw_read is capped to port->raw_buf_remaining
if the raw buf has any data. While the user will probably call into
this function again, I think that's a waste of cycles.


Hmm, yeah, I suppose we could read more data in the same call. It seems 
simpler not to. The case where "raw_buf_remaining > 0" is very rare.



pq_buffer_has_data now doesn't have any protections against
desynchronized state between PqRecvLength and PqRecvPointer. An
Assert(PqRecvLength >= PqRecvPointer) to that value would be
appreciated.


Added.


0006:

In CONNECTION_FAILED, we use connection_failed() to select whether we
need a new connection or stop trying altogether, but that function's
description states:


+ * Out-of-line portion of the CONNECTION_FAILED() macro
+ *
+ * Returns true, if we should retry the connection with different encryption 
method.


Which to me reads like we should reuse the connection, and try a
different method on that same connection. Maybe we can improve the
wording to something like
+ * Returns true, if we should reconnect with a different encryption method.
to make the reconnect part more clear.


Changed to "Returns true, if we should reconnect and retry with a 
different encryption method".



In select_next_encryption_method, there are several copies of this pattern:

if ((remaining_methods & ENC_METHOD) != 0)
{
 conn->current_enc_method = ENC_METHOD;
 return true;
}

I think a helper macro would reduce the verbosity of the scaffolding,
like in the attached SELECT_NEXT_METHOD.diff.txt.


Applied.
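
For readers without the attachment at hand, a helper along the lines of
the quoted suggestion might look like this (a sketch based purely on the
pattern shown above; the committed macro's name and details may differ):

/* Hypothetical helper for select_next_encryption_method(): try one
 * encryption method, and return from the enclosing function if it is
 * still among the remaining candidates. */
#define SELECT_NEXT_METHOD(method) \
	do { \
		if ((remaining_methods & (method)) != 0) \
		{ \
			conn->current_enc_method = (method); \
			return true; \
		} \
	} while (0)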

In addition to the above, I made heavy changes to the tests. I wanted to 
test not just the outcome (SSL, GSSAPI, plaintext, or fail), but also 
the steps and reconnections needed to get there. To facilitate that, I 
rewrote how the expected outcome was represented in the test script. It 
now uses a table-driven approach, with a line for each test iteration, 
ie. for each different combination of options that are tested.


I then added some more logging, so that whenever the server receives an 
SSLRequest or GSSENCRequest packet, it logs a line. That's controlled by 
a new not-in-sample GUC ("trace_connection_negotiation"), intended only 
for the test and debugging. The test scrapes the log for the lines that 
it prints, and the expected output includes a compact trace of expected 
events. For example, the expected output for "user=testuser 
gssencmode=prefer sslmode=prefer sslnegotiation=direct", when GSS and 
SSL are both disabled in the server, looks like this:


# USER  GSSENCMODE  SSLMODE   SSLNEGOTIATION  EVENTS -> OUTCOME
testuser    prefer      prefer    direct          connect, directsslreject, reconnect, sslreject, authok  -> plain


That means, we expect libpq to first try direct SSL, which is rejected 
by the server. It should then reconnect and attempt traditional 
negotiated SSL, which is also rejected. Finally, it should try plaintext 
authentication, without reconnecting, which succeeds.


That actually revealed a couple of slightly bogus behaviors with the 
current code. Here's one example:


# XXX: libpq retries the connection unnecessarily in this case:
nogssuser   require allow connect, gssaccept, authfail, 
reconnect, gssaccept, authfail -> fail


That means, with "gssencmode=require sslmode=allow", if the server 
accepts the GSS encryption but refuses the connection at authentication, 
libpq will reconnect and go through the same motions again. The second 
attempt is pointless, we know it's going to fail. The refactoring to the 
libpq state machine fixed that issue as a side-effect.


I removed the server ssl_enable_alpn and libpq sslalpn options. The idea 
was that they could be useful for testing, but we didn't actually have 
any tests that would use them, and you get the same result by testing 
with an older server or client version. I'm open to adding them back if 
we also add tests that benefit from them, but they were pretty pointless 
as they were.


One important open item now is that we need to register a proper ALPN 
protocol ID with IANA.


--
Heikki Linnakangas
Neon (https://neon.tech)





Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan

2024-04-07 Thread Tom Lane
Peter Geoghegan  writes:
> On Sun, Apr 7, 2024 at 8:48 PM Tom Lane  wrote:
>> Coverity pointed out something that looks like a potentially live
>> problem in 5bf748b86:
>> ... which certainly makes it look like array->scan_key could be
>> a null-pointer dereference.

> But the "Assert(xform[j].ikey == array->scan_key)" assertion is
> located in a block where it's been established that the scan key (the
> one stored in xform[j] at this point in execution) must have an array.

> This is probably very hard for tools like Coverity to understand.

It's not obvious to human readers either ...

> Would Coverity stop complaining if I just removed the assertion? I
> could just do that, I suppose, but that seems backwards to me.

Perhaps this'd help:

-Assert(xform[j].ikey == array->scan_key);
+Assert(array && xform[j].ikey == array->scan_key);

If that doesn't silence it, I'd be prepared to just dismiss the
warning.

Some work in the comment to explain why we must have an array here
wouldn't be out of place either, perhaps.

regards, tom lane




Re: Popcount optimization using AVX512

2024-04-07 Thread Nathan Bossart
On Sun, Apr 07, 2024 at 08:42:12PM -0400, Tom Lane wrote:
> Today's Coverity run produced this warning, which seemingly was
> triggered by one of these commits, but I can't make much sense
> of it:
> 
> *** CID 1596255:  Uninitialized variables  (UNINIT)
> /usr/lib/gcc/x86_64-linux-gnu/10/include/avxintrin.h: 1218 in 
> _mm256_undefined_si256()
> 1214 extern __inline __m256i __attribute__((__gnu_inline__, 
> __always_inline__, __artificial__))
> 1215 _mm256_undefined_si256 (void)
> 1216 {
> 1217   __m256i __Y = __Y;
> >>> CID 1596255:  Uninitialized variables  (UNINIT)
> >>> Using uninitialized value "__Y".
> 1218   return __Y;
> 1219 }
> 
> I see the same code in my local copy of avxintrin.h,
> and I quite agree that it looks like either an undefined
> value or something that properly ought to be an error.
> If we are calling this, why (and from where)?

Nothing in these commits uses this, or even uses the 256-bit registers.
avxintrin.h is included by immintrin.h, which is probably why this is
showing up.  I believe you're supposed to use immintrin.h for the
intrinsics used in these commits, so I don't immediately see a great way to
avoid this.  The Intel documentation for _mm256_undefined_si256() [0]
indicates that it is intended to return "undefined elements," so it seems
like the use of an uninitialized variable might be intentional.

> Anyway, we can certainly just dismiss this warning if it
> doesn't correspond to any real problem in our code.
> But I thought I'd raise the question.

That's probably the right thing to do, unless there's some action we can
take to suppress this warning.

[0] 
https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=_mm256_undefined_si256&ig_expand=6943

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Weird test mixup

2024-04-07 Thread Michael Paquier
On Sat, Apr 06, 2024 at 10:34:46AM +0500, Andrey M. Borodin wrote:
> I find name of the function "injection_points_local()" strange,
> because there is no verb in the name. How about
> "injection_points_set_local"? 

That makes sense.

> I'm not sure if we should refactor anything here, but
> InjectionPointSharedState has singular name, plural wait_counts and
> singular condition.
> InjectionPointSharedState is already an array of injection points,
> maybe let's add there optional pid instead of inventing separate
> array of pids?

Perhaps we could unify these two concepts, indeed, with a "kind" added
to InjectionPointCondition.  Now waits/wakeups are a different beast
than the conditions that could be assigned to a point to filter if it
should be executed.  More runtime conditions coming immediately into
my mind, that could be added to this structure relate mostly to global
objects, like:
- Specific database name and/or OID.
- Specific role(s).
So that's mostly cross-checking states coming from miscadmin.h for
now.
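
As a sketch of that idea (purely illustrative; none of these names exist
in the module today):

/* Hypothetical "kind" discriminator for InjectionPointCondition, so the
 * existing local-PID check and possible future database/role checks can
 * share one structure. */
typedef enum InjectionPointConditionKind
{
	INJ_CONDITION_INVALID = 0,
	INJ_CONDITION_PID,			/* fire only in a specific backend PID */
	INJ_CONDITION_DATABASE,		/* fire only in a given database (name/OID) */
	INJ_CONDITION_ROLE			/* fire only for given role(s) */
} InjectionPointConditionKind;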

> Can we set global point to 'notice', but same local to 'wait'? Looks
> like now we can't, but allowing to do so would make code simpler.

You mean using the same point name with more than one callback?  Not
sure we'd want to be able to do that.  Perhaps you're right, though,
if there is a use case that justifies it.

> Besides this opportunity to simplify stuff, both patches looks good
> to me.

Yeah, this module can always be tweaked more if necessary.  Saying
that, naming the new thing "condition" in InjectionPointSharedState
felt strange, as you said, because it is an array of multiple
conditions.

For now I have applied 997db123c054 to make the GIN tests with
injection points repeatable as it was an independent issue, and
f587338dec87 to add the local function pieces.

Attached is the last piece to switch the GIN test to use local
injection points.  85f65d7a26fc should keep the buildfarm at bay,
but I'd rather not take a bet and accidentally freeze the buildfarm as
it would impact folks who aim at getting patches committed just before
the finish line.  So I am holding on this one for a few more days
until we're past the freeze and the buildfarm is more stable.
--
Michael
From bc2ce072cad0efa94084297330db3102ee346c25 Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Mon, 8 Apr 2024 10:20:19 +0900
Subject: [PATCH v3] Switch GIN tests with injection points to be
 concurrent-safe

---
 src/test/modules/gin/Makefile   | 3 ---
 src/test/modules/gin/expected/gin_incomplete_splits.out | 7 +++
 src/test/modules/gin/meson.build| 2 --
 src/test/modules/gin/sql/gin_incomplete_splits.sql  | 3 +++
 4 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/src/test/modules/gin/Makefile b/src/test/modules/gin/Makefile
index da4c9cea5e..e007e38ac2 100644
--- a/src/test/modules/gin/Makefile
+++ b/src/test/modules/gin/Makefile
@@ -4,9 +4,6 @@ EXTRA_INSTALL = src/test/modules/injection_points
 
 REGRESS = gin_incomplete_splits
 
-# The injection points are cluster-wide, so disable installcheck
-NO_INSTALLCHECK = 1
-
 ifdef USE_PGXS
 PG_CONFIG = pg_config
 PGXS := $(shell $(PG_CONFIG) --pgxs)
diff --git a/src/test/modules/gin/expected/gin_incomplete_splits.out b/src/test/modules/gin/expected/gin_incomplete_splits.out
index 973a8ce6c8..15574e547a 100644
--- a/src/test/modules/gin/expected/gin_incomplete_splits.out
+++ b/src/test/modules/gin/expected/gin_incomplete_splits.out
@@ -12,6 +12,13 @@
 -- This uses injection points to cause errors that leave some page
 -- splits in "incomplete" state
 create extension injection_points;
+-- Make all injection points local to this process, for concurrency.
+SELECT injection_points_set_local();
+ injection_points_set_local 
+----------------------------
+ 
+(1 row)
+
 -- Use the index for all the queries
 set enable_seqscan=off;
 -- Print a NOTICE whenever an incomplete split gets fixed
diff --git a/src/test/modules/gin/meson.build b/src/test/modules/gin/meson.build
index 5ec0760a27..9734b51de2 100644
--- a/src/test/modules/gin/meson.build
+++ b/src/test/modules/gin/meson.build
@@ -12,7 +12,5 @@ tests += {
 'sql': [
   'gin_incomplete_splits',
 ],
-# The injection points are cluster-wide, so disable installcheck
-'runningcheck': false,
   },
 }
diff --git a/src/test/modules/gin/sql/gin_incomplete_splits.sql b/src/test/modules/gin/sql/gin_incomplete_splits.sql
index ea3667b38d..ebf0f620f0 100644
--- a/src/test/modules/gin/sql/gin_incomplete_splits.sql
+++ b/src/test/modules/gin/sql/gin_incomplete_splits.sql
@@ -14,6 +14,9 @@
 -- splits in "incomplete" state
 create extension injection_points;
 
+-- Make all injection points local to this process, for concurrency.
+SELECT injection_points_set_local();
+
 -- Use the index for all the queries
 set enable_seqscan=off;
 
-- 
2.43.0



signature.asc
Description: PGP signature


Re: Use streaming read API in ANALYZE

2024-04-07 Thread Thomas Munro
On Mon, Apr 8, 2024 at 10:26 AM Melanie Plageman
 wrote:
> On Sun, Apr 07, 2024 at 03:00:00PM -0700, Andres Freund wrote:
> > >  src/backend/commands/analyze.c | 89 ++
> > >  1 file changed, 26 insertions(+), 63 deletions(-)
> >
> > That's a very nice demonstration of how this makes good prefetching 
> > easier...
>
> Agreed. Yay streaming read API and Bilal!

+1

I found a few comments to tweak, just a couple of places that hadn't
got the memo after we renamed "read stream", and an obsolete mention
of pinning buffers.  I adjusted those directly.

I ran some tests on a random basic Linux/ARM cloud box with a 7.6GB
table, and I got:

                                      cold    hot
master:                             9025ms  199ms
patched, io_combine_limit=1:        9025ms  191ms
patched, io_combine_limit=default:  8729ms  191ms

Despite being random, occasionally some I/Os must get merged, allowing
slightly better random throughput when accessing disk blocks through a
3000 IOPS drinking straw.  Looking at strace, I see 29144 pread* calls
instead of 30071, which fits that theory.  Let's see... if you roll a
fair 973452-sided dice 30071 times, how many times do you expect to
roll consecutive numbers?  Each time you roll there is a 1/973452
chance that you get the last number + 1, and we have 30071 tries
giving 30071/973452 = ~3%.  9025ms minus 3% is 8754ms.  Seems about
right.

I am not sure why the hot number is faster exactly.  (Anecdotally, I
did notice that in the cases that beat master semi-unexpectedly like
this, my software memory prefetch patch doesn't help or hurt, while in
some cases and on some CPUs there is little difference, and then that
patch seems to get a speed-up like this, which might be a clue.
*Shrug*, investigation needed.)

Pushed.  Thanks Bilal and reviewers!




Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan

2024-04-07 Thread Peter Geoghegan
On Sun, Apr 7, 2024 at 8:48 PM Tom Lane  wrote:
>
> Coverity pointed out something that looks like a potentially live
> problem in 5bf748b86:
>
> /srv/coverity/git/pgsql-git/postgresql/src/backend/access/nbtree/nbtutils.c: 
> 2950 in _bt_preprocess_keys()
> 2944  * need to make sure that we don't throw 
> away an array
> 2945  * scan key.  _bt_compare_scankey_args 
> expects us to
> 2946  * always keep arrays (and discard 
> non-arrays).
> 2947  */
> 2948 Assert(j == (BTEqualStrategyNumber - 1));
> 2949 Assert(xform[j].skey->sk_flags & 
> SK_SEARCHARRAY);
> >>> CID 1596256:  Null pointer dereferences  (FORWARD_NULL)
> >>> Dereferencing null pointer "array".
> 2950 Assert(xform[j].ikey == array->scan_key);
> 2951 Assert(!(cur->sk_flags & SK_SEARCHARRAY));
> 2952 }
> 2953 }
> 2954 else if (j == (BTEqualStrategyNumber - 1))
>
> Above this there is an assertion
>
> Assert(!array || array->num_elems > 0);
>
> which certainly makes it look like array->scan_key could be
> a null-pointer dereference.

But the "Assert(xform[j].ikey == array->scan_key)" assertion is
located in a block where it's been established that the scan key (the
one stored in xform[j] at this point in execution) must have an array.
It has been marked SK_SEARCHARRAY, and uses the equality strategy, so
it had better have one or we're in big trouble either way.

This is probably very hard for tools like Coverity to understand. We
also rely on the fact that only one of the two scan keys (only one of
the pair of scan keys that were passed to _bt_compare_scankey_args)
can have an array at the point of the assertion that Coverity finds
suspicious. It's possible that both of those scan keys actually did
have arrays, but _bt_compare_scankey_args just treats that as a case
of being unable to prove which scan key was redundant/contradictory
due to a lack of suitable cross-type support -- so the assertion won't
be reached.

Would Coverity stop complaining if I just removed the assertion? I
could just do that, I suppose, but that seems backwards to me.

--
Peter Geoghegan




Coverity complains about simplehash.h's SH_STAT()

2024-04-07 Thread Tom Lane
Today's Coverity run produced this:

/srv/coverity/git/pgsql-git/postgresql/src/include/lib/simplehash.h: 1138 in 
bh_nodeidx_stat()
1132 avg_collisions = 0;
1133 }
1134 
1135 sh_log("size: " UINT64_FORMAT ", members: %u, filled: %f, total 
chain: %u, max chain: %u, avg chain: %f, total_collisions: %u, max_collisions: 
%u, avg_collisions: %f",
1136tb->size, tb->members, fillfactor, total_chain_length, 
max_chain_length, avg_chain_length,
1137total_collisions, max_collisions, avg_collisions);
>>> CID 1596268:  Resource leaks  (RESOURCE_LEAK)
>>> Variable "collisions" going out of scope leaks the storage it points to.
1138 }
1139 
1140 #endif/* SH_DEFINE */

I have no idea why we didn't see this warning before --- but AFAICS
it's quite right, and it looks like a nontrivial amount of memory
could be at stake:

uint32   *collisions = (uint32 *) palloc0(tb->size * sizeof(uint32));

I realize this function is only debug support, but wouldn't it
be appropriate to pfree(collisions) before exiting?
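
In code form, the suggested one-liner would land at the end of the quoted
function (a sketch against the excerpt above, not the actual commit):

	sh_log("size: " UINT64_FORMAT ", members: %u, filled: %f, total chain: %u, max chain: %u, avg chain: %f, total_collisions: %u, max_collisions: %u, avg_collisions: %f",
		   tb->size, tb->members, fillfactor, total_chain_length,
		   max_chain_length, avg_chain_length,
		   total_collisions, max_collisions, avg_collisions);
	pfree(collisions);			/* release the palloc0'd scratch array */
}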

regards, tom lane




Re: meson: Specify -Wformat as a common warning flag for extensions

2024-04-07 Thread Sutou Kouhei
Hi Andres,

Thanks for reviewing this!

In <20240407232635.fq4kc5556laha...@awork3.anarazel.de>
  "Re: meson: Specify -Wformat as a common warning flag for extensions" on Sun, 
7 Apr 2024 16:26:35 -0700,
  Andres Freund  wrote:

> This seems like a fair amount of extra configure tests. Particularly because
> /W* isn't ever interesting for Makefile.global - they're msvc flags - because
> you can't use that with msvc.
> 
> I'm also doubtful that it's worth supporting warning_level=3/everything, you
> end up with a completely flood of warnings that way.

OK. I've removed "/W*" flags and warning_level==3/everything
cases.

How about the attached v5 patch?


Thanks,
-- 
kou
From 205ef88c66cf1050eedfc1e72d951de93a02e53a Mon Sep 17 00:00:00 2001
From: Sutou Kouhei 
Date: Fri, 15 Mar 2024 18:27:30 +0900
Subject: [PATCH v5] meson: Restore implicit warning/debug/optimize flags for
 extensions

Meson specifies warning/debug/optimize flags such as "-Wall", "-g" and
"-O2" automatically based on "--warnlevel" and "--buildtype"
options. And we use "--warning_level=1" and
"--buildtype=debugoptimized" by default.

We don't specify warning/debug/optimize flags explicitly when building
PostgreSQL with Meson, because Meson adds them automatically, as noted
above. But Meson doesn't put those flags into Makefile.global and
pg_config, so we need to take care of them manually.

These changes do that. They detect warning/debug/optimize flags based
on the warning_level/debug/optimization option values, because Meson
doesn't tell us which flags it guessed.
---
 meson.build | 41 +
 src/include/meson.build |  4 ++--
 2 files changed, 43 insertions(+), 2 deletions(-)

diff --git a/meson.build b/meson.build
index 5acf083ce3c..11bd56f79a7 100644
--- a/meson.build
+++ b/meson.build
@@ -1848,6 +1848,47 @@ endif
 vectorize_cflags = cc.get_supported_arguments(['-ftree-vectorize'])
 unroll_loops_cflags = cc.get_supported_arguments(['-funroll-loops'])
 
+# They aren't used for building PostgreSQL itself because Meson does
+# everything internally. They are used by extensions via pg_config or
+# Makefile.global.
+common_builtin_flags = []
+
+warning_level = get_option('warning_level')
+# See https://mesonbuild.com/Builtin-options.html#details-for-warning_level for
+# warning_level values.
+#
+# We don't use "/W*" flags here because we don't need to care about MSVC here.
+#
+# We don't have "warning_level == 3" and "warning_level ==
+# 'everything'" here because we don't use these warning levels.
+if warning_level == '1'
+  common_builtin_flags += ['-Wall']
+elif warning_level == '2'
+  common_builtin_flags += ['-Wall', '-Wextra']
+endif
+
+if get_option('debug')
+  common_builtin_flags += ['-g']
+endif
+
+optimization = get_option('optimization')
+if optimization == '0'
+  common_builtin_flags += ['-O0']
+elif optimization == '1'
+  common_builtin_flags += ['-O1']
+elif optimization == '2'
+  common_builtin_flags += ['-O2']
+elif optimization == '3'
+  common_builtin_flags += ['-O3']
+elif optimization == 's'
+  common_builtin_flags += ['-Os']
+endif
+
+cflags_builtin = cc.get_supported_arguments(common_builtin_flags)
+if llvm.found()
+  cxxflags_builtin = cpp.get_supported_arguments(common_builtin_flags)
+endif
+
 common_warning_flags = [
   '-Wmissing-prototypes',
   '-Wpointer-arith',
diff --git a/src/include/meson.build b/src/include/meson.build
index a28f115d867..58b7a9c1e7e 100644
--- a/src/include/meson.build
+++ b/src/include/meson.build
@@ -44,9 +44,9 @@ config_paths_data.set_quoted('MANDIR', dir_prefix / dir_man)
 
 var_cc = ' '.join(cc.cmd_array())
 var_cpp = ' '.join(cc.cmd_array() + ['-E'])
-var_cflags = ' '.join(cflags + cflags_warn + get_option('c_args'))
+var_cflags = ' '.join(cflags + cflags_builtin + cflags_warn + get_option('c_args'))
 if llvm.found()
-  var_cxxflags = ' '.join(cxxflags + cxxflags_warn + get_option('cpp_args'))
+  var_cxxflags = ' '.join(cxxflags + cxxflags_builtin + cxxflags_warn + get_option('cpp_args'))
 else
   var_cxxflags = ''
 endif
-- 
2.43.0



Re: WIP Incremental JSON Parser

2024-04-07 Thread Tom Lane
Coverity complained that this patch leaks memory:

/srv/coverity/git/pgsql-git/postgresql/src/bin/pg_combinebackup/load_manifest.c:
 212 in load_backup_manifest()
206 bytes_left -= rc;
207 json_parse_manifest_incremental_chunk(
208 			inc_state, buffer, rc, bytes_left == 0);
209 }
210 
211 close(fd);
>>> CID 1596259:  (RESOURCE_LEAK)
>>> Variable "inc_state" going out of scope leaks the storage it points to.
212 }
213 
214 /* All done. */
215 pfree(buffer);
216 return result;
217 }

/srv/coverity/git/pgsql-git/postgresql/src/bin/pg_verifybackup/pg_verifybackup.c:
 488 in parse_manifest_file()
482 bytes_left -= rc;
483 json_parse_manifest_incremental_chunk(
484 			inc_state, buffer, rc, bytes_left == 0);
485 }
486 
487 close(fd);
>>> CID 1596257:  (RESOURCE_LEAK)
>>> Variable "inc_state" going out of scope leaks the storage it points to.
488 }
489 
490 /* Done with the buffer. */
491 pfree(buffer);
492 
493 return result;

It's right about that AFAICS, and not only is the "inc_state" itself
leaked but so is its assorted infrastructure.  Perhaps we don't care
too much about that in the existing applications, but ISTM that
isn't going to be a tenable assumption across the board.  Shouldn't
there be a "json_parse_manifest_incremental_shutdown()" or the like
to deallocate all the storage allocated by the parser?
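
At the call sites quoted above, that could look like this (the function
name here is only the suggestion made in this message, not an existing
API):

	close(fd);
	/* Hypothetical teardown call: frees inc_state and the parser's
	 * assorted infrastructure. */
	json_parse_manifest_incremental_shutdown(inc_state);
}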

regards, tom lane




Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan

2024-04-07 Thread Tom Lane
Coverity pointed out something that looks like a potentially live
problem in 5bf748b86:

/srv/coverity/git/pgsql-git/postgresql/src/backend/access/nbtree/nbtutils.c: 
2950 in _bt_preprocess_keys()
2944  * need to make sure that we don't throw away 
an array
2945  * scan key.  _bt_compare_scankey_args expects 
us to
2946  * always keep arrays (and discard non-arrays).
2947  */
2948 Assert(j == (BTEqualStrategyNumber - 1));
2949 Assert(xform[j].skey->sk_flags & 
SK_SEARCHARRAY);
>>> CID 1596256:  Null pointer dereferences  (FORWARD_NULL)
>>> Dereferencing null pointer "array".
2950 Assert(xform[j].ikey == array->scan_key);
2951 Assert(!(cur->sk_flags & SK_SEARCHARRAY));
2952 }
2953 }
2954 else if (j == (BTEqualStrategyNumber - 1))

Above this there is an assertion

Assert(!array || array->num_elems > 0);

which certainly makes it look like array->scan_key could be
a null-pointer dereference.

regards, tom lane




Re: Popcount optimization using AVX512

2024-04-07 Thread Tom Lane
Nathan Bossart  writes:
> Here is what I have staged for commit, which I intend to do shortly.

Today's Coverity run produced this warning, which seemingly was
triggered by one of these commits, but I can't make much sense
of it:

*** CID 1596255:  Uninitialized variables  (UNINIT)
/usr/lib/gcc/x86_64-linux-gnu/10/include/avxintrin.h: 1218 in 
_mm256_undefined_si256()
1214 extern __inline __m256i __attribute__((__gnu_inline__, 
__always_inline__, __artificial__))
1215 _mm256_undefined_si256 (void)
1216 {
1217   __m256i __Y = __Y;
>>> CID 1596255:  Uninitialized variables  (UNINIT)
>>> Using uninitialized value "__Y".
1218   return __Y;
1219 }

I see the same code in my local copy of avxintrin.h,
and I quite agree that it looks like either an undefined
value or something that properly ought to be an error.
If we are calling this, why (and from where)?

Anyway, we can certainly just dismiss this warning if it
doesn't correspond to any real problem in our code.
But I thought I'd raise the question.

regards, tom lane




Re: POC, WIP: OR-clause support for indexes

2024-04-07 Thread Justin Pryzby
On Mon, Apr 08, 2024 at 01:34:37AM +0300, Alexander Korotkov wrote:
> Hi!
> 
> On Mon, Apr 1, 2024 at 9:38 AM Andrei Lepikhov
>  wrote:
> > On 28/3/2024 16:54, Alexander Korotkov wrote:
> > > The current patch has a boolean guc enable_or_transformation.
> > > However, when we have just a few ORs to be transformated, then we
> > > should get less performance gain from the transformation and higher
> > > chances to lose a good bitmap scan plan from that.  When there is a
> > > huge list of ORs to be transformed, then the performance gain is
> > > greater and it is less likely we could lose a good bitmap scan plan.
> > >
> > > What do you think about introducing a GUC threshold value: the minimum
> > > size of list to do OR-to-ANY transformation?
> > > min_list_or_transformation or something.
> > I labelled it or_transformation_limit (see in attachment). Feel free to
> > rename it.
> > It's important to note that the limiting GUC doesn't operate
> > symmetrically for forward, OR -> SAOP, and backward SAOP -> OR
> > operations. In the forward case, it functions as you've proposed.
> > However, in the backward case, we only check whether the feature is
> > enabled or not. This is due to our existing limitation,
> > MAX_SAOP_ARRAY_SIZE, and the fact that we can't match the length of the
> > original OR list with the sizes of the resulting SAOPs. For instance, a
> > lengthy OR list with 100 elements can be transformed into 3 SAOPs, each
> > with a size of around 30 elements.
> > One aspect that requires attention is the potential inefficiency of our
> > OR -> ANY transformation when we have a number of elements less than
> > MAX_SAOP_ARRAY_SIZE. This is because we perform a reverse transformation
> > ANY -> OR at the stage of generating bitmap scans. If the BitmapScan
> > path dominates, we may have done unnecessary work. Is this an occurrence
> > that we should address?
> > But the concern above may just be a point of improvement later: We can
> > add one more strategy to the optimizer: testing each array element as an
> > OR clause; we can also provide a BitmapOr path, where SAOP is covered
> > with a minimal number of partial indexes (likewise, previous version).
> 
> I've revised the patch.  Did some beautification, improvements for
> documentation, commit messages etc.
> 
> I've pushed the 0001 patch without 0002.  I think 0001 is good by
> itself given that there is the or_to_any_transform_limit GUC option.
> The more similar OR clauses there are, the more likely grouping them
> into a SAOP will be a win.  But I've changed the default value to 5.

The sample config file has the wrong default

+#or_to_any_transform_limit = 0

We had a patch to catch this kind of error, but it was closed (which IMO
was itself an error).

-- 
Justin




Re: Extension Enhancement: Buffer Invalidation in pg_buffercache

2024-04-07 Thread Andres Freund
Hi,

On 2024-04-07 11:07:58 +1200, Thomas Munro wrote:
> I thought of a better name for the bufmgr.c function though:
> InvalidateUnpinnedBuffer().  That name seemed better to me after I
> festooned it with warnings about why exactly it's inherently racy and
> only for testing use.

I still dislike that, fwiw, due to the naming similarity to
InvalidateBuffer(), which throws away dirty buffer contents too. Which
obviously isn't acceptable from "userspace".  I'd just name it
pg_buffercache_evict() - given that the commit message's first paragraph uses
"it is useful to be able to evict arbitrary blocks" that seems to describe
things at least as well as "invalidate"?
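
As a shape sketch only (the wrapper below and the underlying
InvalidateUnpinnedBuffer() signature are assumptions based on the names
in this thread, not the committed code):

#include "postgres.h"
#include "fmgr.h"

PG_FUNCTION_INFO_V1(pg_buffercache_evict);

/* Hypothetical SQL-callable wrapper using the suggested name. */
Datum
pg_buffercache_evict(PG_FUNCTION_ARGS)
{
	Buffer		buf = PG_GETARG_INT32(0);

	/* Returns whether the (unpinned) buffer could be evicted. */
	PG_RETURN_BOOL(InvalidateUnpinnedBuffer(buf));
}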

Greetings,

Andres Freund




Re: Extension Enhancement: Buffer Invalidation in pg_buffercache

2024-04-07 Thread Melanie Plageman
On Sat, Apr 6, 2024 at 7:08 PM Thomas Munro  wrote:
>
> On second thoughts, I think the original "invalidate" terminology was
> fine, no need to invent a new term.
>
> I thought of a better name for the bufmgr.c function though:
> InvalidateUnpinnedBuffer().  That name seemed better to me after I
> festooned it with warnings about why exactly it's inherently racy and
> only for testing use.
>
> I suppose someone could propose an additional function
> pg_buffercache_invalidate(db, tbspc, rel, fork, blocknum) that would
> be slightly better in the sense that it couldn't accidentally evict
> some innocent block that happened to replace the real target just
> before it runs, but I don't think it matters much for this purpose and
> it would still be racy on return (vacuum decides to load your block
> back in) so I don't think it's worth bothering with.
>
> So this is the version I plan to commit.

I've reviewed v6. I think you should mention in the docs that it only
works for shared buffers -- so specifically not buffers containing
blocks of temp tables.

In the function pg_buffercache_invalidate(), why not use the
BufferIsValid() function?

-   if (buf < 1 || buf > NBuffers)
+   if (!BufferIsValid(buf) || buf > NBuffers)

I thought the below would be more clear for the comment above
InvalidateUnpinnedBuffer().

- * Returns true if the buffer was valid and it has now been made invalid.
- * Returns false if the wasn't valid, or it couldn't be evicted due to a pin,
- * or if the buffer becomes dirty again while we're trying to write it out.
+ * Returns true if the buffer was valid and has now been made invalid. Returns
+ * false if it wasn't valid, if it couldn't be evicted due to a pin, or if the
+ * buffer becomes dirty again while we're trying to write it out.

Some of that probably applies for the docs too (i.e. you have some
similar wording in the docs). There is actually one typo in your
version, so even if you don't adopt my suggestion, you should fix that
typo.

I didn't notice anything else out of place. I tried it and it worked
as expected. I'm excited to have this feature!

I didn't read through this whole thread, but was there any talk of
adding other functions to let me invalidate a bunch of buffers at once
or even some options -- like invalidate every 3rd buffer or whatever?
(Not the concern of this patch, but just wondering because that would
be a useful future enhancement IMO).

- Melanie




Re: Table AM Interface Enhancements

2024-04-07 Thread Andres Freund
Hi,

On 2024-04-08 02:25:17 +0300, Alexander Korotkov wrote:
> On Mon, Apr 8, 2024 at 12:40 AM Andres Freund  wrote:
> > On 2024-03-30 23:33:04 +0200, Alexander Korotkov wrote:
> > > I've pushed 0001, 0002 and 0006.
> >
> > I briefly looked at 27bc1772fc81 and I don't think the state post this 
> > commit
> > makes sense. Before this commit another block based AM could implement 
> > analyze
> > without much code duplication. Now a large portion of analyze.c has to be
> > copied, because they can't stop acquire_sample_rows() from calling
> > heapam_scan_analyze_next_block().
> >
> > I'm quite certain this will break a few out-of-core AMs in a way that can't
> > easily be fixed.
>
> I was under the impression there are not so many out-of-core table
> AMs, which have non-dummy analysis implementations.

I know of at least 4 that have some production usage.


> And even if there are some, duplicating acquire_sample_rows() isn't a big
> deal.

I don't agree. The code has evolved a bunch over time, duplicating it into
various AMs is a bad idea.


> But given your feedback, I'd like to propose to keep both options
> open.  Turn back the block-level API for analyze, but let table-AM
> implement its own analyze function.  Then existing out-of-core AMs
> wouldn't need to do anything (or probably just set the new API method
> to NULL).

I think this patch simply hasn't been reviewed even close to carefully enough
and should be reverted. It's IMO too late for a redesign.  Sorry for not
looking earlier, I was mostly out sick for the last few months.

I think a dedicated tableam callback for sample acquisition probably makes
sense, but if we want that, we need to provide an easy way for AMs that are
sufficiently block-like to reuse the code, not have two different ways to
implement analyze.

ISTM that ->relation_analyze is quite misleading as a name. For one, it
just sets some callbacks, no? But more importantly, it sounds like it'd
actually allow to wrap the whole analyze process, rather than just the
acquisition of samples.

Greetings,

Andres Freund




Re: Table AM Interface Enhancements

2024-04-07 Thread Alexander Korotkov
On Mon, Apr 8, 2024 at 2:25 AM Alexander Korotkov  wrote:
> On Mon, Apr 8, 2024 at 12:40 AM Andres Freund  wrote:
> > On 2024-03-30 23:33:04 +0200, Alexander Korotkov wrote:
> > > I've pushed 0001, 0002 and 0006.
> >
> > I briefly looked at 27bc1772fc81 and I don't think the state post this 
> > commit
> > makes sense. Before this commit another block based AM could implement 
> > analyze
> > without much code duplication. Now a large portion of analyze.c has to be
> > copied, because they can't stop acquire_sample_rows() from calling
> > heapam_scan_analyze_next_block().
> >
> > I'm quite certain this will break a few out-of-core AMs in a way that can't
> > easily be fixed.
>
> I was under the impression there are not so many out-of-core table
> AMs, which have non-dummy analysis implementations.  And even if there
> are some, duplicating acquire_sample_rows() isn't a big deal.
>
> But given your feedback, I'd like to propose to keep both options
> open.  Turn back the block-level API for analyze, but let table-AM
> implement its own analyze function.  Then existing out-of-core AMs
> wouldn't need to do anything (or probably just set the new API method
> to NULL).

The attached patch was to illustrate the approach.  It surely needs
some polishing.

--
Regards,
Alexander Korotkov




Re: meson: Specify -Wformat as a common warning flag for extensions

2024-04-07 Thread Andres Freund
Hi,

On 2024-03-15 18:36:55 +0900, Sutou Kouhei wrote:
> +warning_level = get_option('warning_level')
> +# See https://mesonbuild.com/Builtin-options.html#details-for-warning_level 
> for
> +# warning_level values.
> +if warning_level == '1'
> +  common_builtin_flags += ['-Wall', '/W2']
> +elif warning_level == '2'
> +  common_builtin_flags += ['-Wall', '-Wextra', '/W3']
> +elif warning_level == '3'
> +  common_builtin_flags += ['-Wall', '-Wextra', '-Wpedantic', '/W4']
> +elif warning_level == 'everything'
> +  common_builtin_flags += ['-Weverything', '/Wall']
> +endif

> +cflags_builtin = cc.get_supported_arguments(common_builtin_flags)
> +if llvm.found()
> +  cxxflags_builtin = cpp.get_supported_arguments(common_builtin_flags)
> +endif

This seems like a fair amount of extra configure tests. Particularly because
/W* isn't ever interesting for Makefile.global - they're msvc flags - because
you can't use that with msvc.

I'm also doubtful that it's worth supporting warning_level=3/everything, you
end up with a completely flood of warnings that way.

Greetings,

Andres Freund




Re: Table AM Interface Enhancements

2024-04-07 Thread Alexander Korotkov
Hi,

On Mon, Apr 8, 2024 at 12:40 AM Andres Freund  wrote:
> On 2024-03-30 23:33:04 +0200, Alexander Korotkov wrote:
> > I've pushed 0001, 0002 and 0006.
>
> I briefly looked at 27bc1772fc81 and I don't think the state post this commit
> makes sense. Before this commit another block based AM could implement analyze
> without much code duplication. Now a large portion of analyze.c has to be
> copied, because they can't stop acquire_sample_rows() from calling
> heapam_scan_analyze_next_block().
>
> I'm quite certain this will break a few out-of-core AMs in a way that can't
> easily be fixed.

I was under the impression there are not so many out-of-core table
AMs, which have non-dummy analysis implementations.  And even if there
are some, duplicating acquire_sample_rows() isn't a big deal.

But given your feedback, I'd like to propose to keep both options
open.  Turn back the block-level API for analyze, but let table-AM
implement its own analyze function.  Then existing out-of-core AMs
wouldn't need to do anything (or probably just set the new API method
to NULL).

> And even for non-block based AMs, the new interface basically requires
> reimplementing all of analyze.c.

A non-block-based AM just needs to provide an alternative implementation
of what acquire_sample_rows() does.  This seems like a reasonable amount
of effort to me, and surely not reimplementing all of analyze.c.

--
Regards,
Alexander Korotkov


v1-0001-Turn-back-the-block-level-API-for-relation-analyz.patch
Description: Binary data


Re: Add bump memory context type and use it for tuplesorts

2024-04-07 Thread Daniel Gustafsson
> On 8 Apr 2024, at 01:04, Andres Freund  wrote:

> What makes you think that's unrelated? To me that looks like the same issue?

Nvm, I misread the assert, ETOOLITTLESLEEP. Sorry for the noise.

--
Daniel Gustafsson





Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-04-07 Thread John Naylor
On Mon, Apr 8, 2024 at 2:07 AM Andres Freund  wrote:
>
> Looking at the code, the failure isn't surprising anymore:
> char		data[MaxBlocktableEntrySize];
> BlocktableEntry *page = (BlocktableEntry *) data;
>
> 'char' doesn't enforce any alignment, but you're storing a BlocktableEntry in
> a char[]. You can't just do that.  Look at how we do that for
> e.g. PGAlignedBlock.
>
>
> With the attached minimal fix, the tests pass again.

Thanks, will push this shortly!
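
For context, the PGAlignedBlock approach referenced above uses a union so
that the raw byte buffer inherits the alignment of the stricter member; a
minimal sketch of the same trick for this case (the union name is
illustrative, not the committed fix):

/* The BlocktableEntry member forces the char[] storage to be suitably
 * aligned for a BlocktableEntry. */
typedef union BlocktableEntryBuffer
{
	BlocktableEntry entry;		/* forces alignment */
	char		data[MaxBlocktableEntrySize];	/* raw storage */
} BlocktableEntryBuffer;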




Re: Add bump memory context type and use it for tuplesorts

2024-04-07 Thread David Rowley
On Mon, 8 Apr 2024 at 09:09, Andres Freund  wrote:
> I suspect that KeeperBlock() isn't returning true, because IsKeeperBlock 
> misses
> the MAXALIGN(). I think that about fits with:

Thanks for investigating that.

I've just pushed a fix for the macro and also adjusted a location
which was *correctly* calculating the keeper block address manually to
use the macro. If I'd used the macro there to start with, the Assert
likely wouldn't have failed, but there'd have been memory alignment
issues.

David




Re: Add bump memory context type and use it for tuplesorts

2024-04-07 Thread Andres Freund
Hi,

On 2024-04-08 00:55:42 +0200, Daniel Gustafsson wrote:
> > On 8 Apr 2024, at 00:41, Tom Lane  wrote:
> > 
> > Tomas Vondra  writes:
> >> Yup, changing it to this:
> > 
> >> #define KeeperBlock(set) ((BumpBlock *) ((char *) (set) +
> >> MAXALIGN(sizeof(BumpContext))))
> > 
> >> fixes the issue for me.
> > 
> > Mamba is happy with that change, too.
> 
> Unrelated to that one, seems like turaco ran into another issue:
> 
> running bootstrap script ... TRAP: failed Assert("total_allocated == 
> context->mem_allocated"), File: "bump.c", Line: 808, PID: 7809
> 
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=turaco&dt=2024-04-07%2022%3A42%3A54

What makes you think that's unrelated? To me that looks like the same issue?

Greetings,

Andres Freund




Re: Cutting support for OpenSSL 1.0.1 and 1.0.2 in 17~?

2024-04-07 Thread Daniel Gustafsson
> On 8 Apr 2024, at 00:46, Michael Paquier  wrote:
> 
> On Sat, Apr 06, 2024 at 07:47:43PM +0200, Daniel Gustafsson wrote:
>> My apologies, I thought I did but clearly failed.  My point was that this is 
>> a
>> special/corner case where we try to find one of two different libraries (with
> different ideas about backwards compatibility etc) for supporting a single
>> thing.  So instead I tested for the explicit versions like how we already 
>> test
>> for the exact Perl version in config/perl.m4 (albeit that a library and a
>> program are two different things of course).
> 
> +  # Function introduced in OpenSSL 1.1.1 and LibreSSL 3.3.2
> +  AC_CHECK_FUNCS([EVP_PKEY_new_CMAC_key], [], [AC_MSG_ERROR([OpenSSL 1.1.1 or later is required for SSL support])])
> 
> I can see why you want to do that, but we've always relied on
> compilation failures while documenting the versions supported.  I
> don't disagree with your point of detecting that earlier, but it sounds
> like this should be a separate patch from the one removing
> support for OpenSSL 1.0.2 and 1.1.0, at least, because you are solving
> two problems at once.
> 
>> In bumping we want to move to 1.1.1 since that's the first version with the
>> rewritten RNG which is fork-safe by design, something PostgreSQL clearly
>> benefits from.  There is no new API for this to gate on though.  For LibreSSL
>> we want 3.3.2 to a) ensure we have coverage in the BF and b) since it's the
>> first version where the tests pass due to error message alignment with 
>> OpenSSL.
>> The combination of these gets rid of lots of specialcased #ifdef soup.  I
>> wasn't however able to find a specific API call which is unique to the two
> versions which we rely on.
> 
> Based on the state of the patch, all the symbols cleaned up in
> pg_config.h.in would be removed only by dropping support for
> 1.0.2.  The routines of protocol_openssl.c exist in 1.1.0.  libpq
> internal locking can be removed by dropping support for 1.0.2.  So
> a bunch of ifdefs are removed with 1.0.2 support, but much, much less
> gets cleaned up once 1.1.0 is removed.

If we are settling for removing 1.0.2 we need to make sure there are 1.1.0
buildfarm animals that actually run the sslcheck.  1.1.0 was never an LTS
release, so its presence in distributions is far less widespread.  I would
prefer 1.1.1 but, either way, we have ample time to discuss that during the v18
cycle.

> And pg_strong_random.c stuff would still remain around.

It stays, but as belts-and-suspenders safety against bugs, not as a requirement.

>> Testing for the presence of an API provided and introduced by both libraries 
>> in
>> the version we're interested in, but which we don't use, is the alternative 
>> but
>> I thought that would be more frowned upon.  EVP_PKEY_new_CMAC_key() was
>> introduced in 1.1.1 and LibreSSL 3.3.2, so an AC_CHECK_FUNCS for that, as in
>> the attached, achieves the version check but pollutes pg_config.h with a 
>> define
>> which will never be used which seemed a bit ugly.
> 
> Two lines in pg_config.h is a minimal cost.  That does not look like a
> big deal to me.
> 
> -  * Make sure processes do not share OpenSSL randomness state.  This is no
> -  * longer required in OpenSSL 1.1.1 and later versions, but until we drop
> -  * support for version < 1.1.1 we need to do this.
> +  * Make sure processes do not share OpenSSL randomness state.  This is in
> +  * theory no longer required in OpenSSL 1.1.1 and later versions, but
> +  * there is no harm in taking extra precautions.
> 
> I was wondering if this should also document what you've mentioned,
> aka that OpenSSL still found ways to break the randomness state and
> that this is a cheap insurance against future mistakes that could
> happen in this area.

I'm not quite sure how stable Github links are over time, but I guess we could
post the commit sha for the bug in OpenSSL (as well as the fix) which is a
stable documentation of how it can be subtly broken.

--
Daniel Gustafsson





Re: Add bump memory context type and use it for tuplesorts

2024-04-07 Thread Daniel Gustafsson
> On 8 Apr 2024, at 00:41, Tom Lane  wrote:
> 
> Tomas Vondra  writes:
>> Yup, changing it to this:
> 
>> #define KeeperBlock(set) ((BumpBlock *) ((char *) (set) +
> MAXALIGN(sizeof(BumpContext))))
> 
>> fixes the issue for me.
> 
> Mamba is happy with that change, too.

Unrelated to that one, seems like turaco ran into another issue:

running bootstrap script ... TRAP: failed Assert("total_allocated == 
context->mem_allocated"), File: "bump.c", Line: 808, PID: 7809

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=turaco&dt=2024-04-07%2022%3A42%3A54

--
Daniel Gustafsson





Re: PostgreSQL 17 Release Management Team & Feature Freeze

2024-04-07 Thread Michael Paquier
On Thu, Apr 04, 2024 at 03:10:27PM +1300, David Rowley wrote:
> Someone asked me about this, so thought it might be useful to post here.

I've received the same question.

> To express this as UTC, It's:
> 
> postgres=# select '2024-04-08 00:00-12:00' at time zone 'UTC';
>   timezone
> ---------------------
>  2024-04-08 12:00:00
> 
> Or, time remaining, relative to now:
> 
> select '2024-04-08 00:00-12:00' - now();

And, as of the moment of typing this email, I get:
=# select '2024-04-08 00:00-12:00' - now() as time_remaining;
 time_remaining
-----------------
 13:10:35.688134
(1 row)

So there is just a bit more than half a day remaining before the
feature freeze is in effect.
--
Michael


signature.asc
Description: PGP signature


Re: Cutting support for OpenSSL 1.0.1 and 1.0.2 in 17~?

2024-04-07 Thread Michael Paquier
On Sat, Apr 06, 2024 at 07:47:43PM +0200, Daniel Gustafsson wrote:
> My apologies, I thought I did but clearly failed.  My point was that this is a
> special/corner case where we try to find one of two different libraries (with
> different ideas about backwards compatability etc) for supporting a single
> thing.  So instead I tested for the explicit versions like how we already test
> for the exact Perl version in config/perl.m4 (albeit that a library and a
> program are two different things of course).

+  # Function introduced in OpenSSL 1.1.1 and LibreSSL 3.3.2
+  AC_CHECK_FUNCS([EVP_PKEY_new_CMAC_key], [], [AC_MSG_ERROR([OpenSSL 1.1.1 or later is required for SSL support])])

I can see why you want to do that, but we've always relied on
compilation failures while documenting the versions supported.  I
don't disagree with your point of detecting that earlier, but it sounds
like this should be a separate patch from the one removing
support for OpenSSL 1.0.2 and 1.1.0, at least, because you are solving
two problems at once.

> In bumping we want to move to 1.1.1 since that's the first version with the
> rewritten RNG which is fork-safe by design, something PostgreSQL clearly
> benefits from.  There is no new API for this to gate on though.  For LibreSSL
> we want 3.3.2 to a) ensure we have coverage in the BF and b) since it's the
> first version where the tests pass due to error message alignment with 
> OpenSSL.
> The combination of these gets rid of lots of specialcased #ifdef soup.  I
> wasn't however able to find a specific API call which is unique to the two
versions which we rely on.

Based on the state of the patch, all the symbols cleaned up in
pg_config.h.in would be removed only by dropping support for
1.0.2.  The routines of protocol_openssl.c exist in 1.1.0.  libpq
internal locking can be removed by dropping support for 1.0.2.  So
a bunch of ifdefs are removed with 1.0.2 support, but much, much less
gets cleaned up once 1.1.0 is removed.  And pg_strong_random.c stuff
would still remain around.

> Testing for the presence of an API provided and introduced by both libraries 
> in
> the version we're interested in, but which we don't use, is the alternative 
> but
> I thought that would be more frowned upon.  EVP_PKEY_new_CMAC_key() was
> introduced in 1.1.1 and LibreSSL 3.3.2, so an AC_CHECK_FUNCS for that, as in
> the attached, achieves the version check but pollutes pg_config.h with a 
> define
> which will never be used which seemed a bit ugly.

Two lines in pg_config.h is a minimal cost.  That does not look like a
big deal to me.

-* Make sure processes do not share OpenSSL randomness state.  This is no
-* longer required in OpenSSL 1.1.1 and later versions, but until we drop
-* support for version < 1.1.1 we need to do this.
+* Make sure processes do not share OpenSSL randomness state.  This is in
+* theory no longer required in OpenSSL 1.1.1 and later versions, but
+* there is no harm in taking extra precautions.

I was wondering if this should also document what you've mentioned,
aka that OpenSSL still found ways to break the randomness state and
that this is a cheap insurance against future mistakes that could
happen in this area.
--
Michael


signature.asc
Description: PGP signature


Re: Add bump memory context type and use it for tuplesorts

2024-04-07 Thread Tom Lane
Tomas Vondra  writes:
> Yup, changing it to this:

> #define KeeperBlock(set) ((BumpBlock *) ((char *) (set) +
> MAXALIGN(sizeof(BumpContext))))

> fixes the issue for me.

Mamba is happy with that change, too.

regards, tom lane




Re: POC, WIP: OR-clause support for indexes

2024-04-07 Thread Alexander Korotkov
Hi!

On Mon, Apr 1, 2024 at 9:38 AM Andrei Lepikhov
 wrote:
> On 28/3/2024 16:54, Alexander Korotkov wrote:
> > The current patch has a boolean guc enable_or_transformation.
> > However, when we have just a few ORs to be transformated, then we
> > should get less performance gain from the transformation and higher
> > chances to lose a good bitmap scan plan from that.  When there is a
> > huge list of ORs to be transformed, then the performance gain is
> > greater and it is less likely we could lose a good bitmap scan plan.
> >
> > What do you think about introducing a GUC threshold value: the minimum
> > size of list to do OR-to-ANY transformation?
> > min_list_or_transformation or something.
> I labelled it or_transformation_limit (see in attachment). Feel free to
> rename it.
> It's important to note that the limiting GUC doesn't operate
> symmetrically for forward, OR -> SAOP, and backward SAOP -> OR
> operations. In the forward case, it functions as you've proposed.
> However, in the backward case, we only check whether the feature is
> enabled or not. This is due to our existing limitation,
> MAX_SAOP_ARRAY_SIZE, and the fact that we can't match the length of the
> original OR list with the sizes of the resulting SAOPs. For instance, a
> lengthy OR list with 100 elements can be transformed into 3 SAOPs, each
> with a size of around 30 elements.
> One aspect that requires attention is the potential inefficiency of our
> OR -> ANY transformation when we have a number of elements less than
> MAX_SAOP_ARRAY_SIZE. This is because we perform a reverse transformation
> ANY -> OR at the stage of generating bitmap scans. If the BitmapScan
> path dominates, we may have done unnecessary work. Is this an occurrence
> that we should address?
> But the concern above may just be a point of improvement later: We can
> add one more strategy to the optimizer: testing each array element as an
> OR clause; we can also provide a BitmapOr path, where SAOP is covered
> with a minimal number of partial indexes (likewise, previous version).

I've revised the patch.  Did some beautification, improvements for
documentation, commit messages etc.

I've pushed the 0001 patch without 0002.  I think 0001 is good by
itself given that there is the or_to_any_transform_limit GUC option.
The more similar OR clauses there are, the more likely grouping them
into a SAOP will be a win.  But I've changed the default value to 5.
This will make it less invasive and affect only queries with obvious
repeating patterns.  That also reduced the changes in the regression
tests expected outputs.

Regarding 0002, it seems questionable since it could cause a planning
slowdown for SAOP's with large arrays.  Also, it might reduce the win
of transformation made by 0001.  So, I think we should skip it for
now.

--
Regards,
Alexander Korotkov




Re: Cluster::restart dumping logs when stop fails

2024-04-07 Thread Daniel Gustafsson
> On 7 Apr 2024, at 18:51, Daniel Gustafsson  wrote:
>> On 7 Apr 2024, at 18:28, Andres Freund  wrote:

>> I'm ok with printing path + some content or just the path.
> 
> I think printing the last 512 bytes or so would be a good approach, I'll take
> care of it later tonight. That would be a backpatchable change IMHO.

In the end I opted for just printing the path to avoid introducing logic (at
this point in the cycle) for ensuring that a negative offset doesn't exceed the
filesize.  Since it changes behavior I haven't backpatched it, but can do that
if we think it's more of a fix than a change (I think it can be argued either
way, I have no strong feelings).

--
Daniel Gustafsson





Re: Use streaming read API in ANALYZE

2024-04-07 Thread Melanie Plageman
On Sun, Apr 07, 2024 at 03:00:00PM -0700, Andres Freund wrote:
> Hi,
> 
> On 2024-04-07 16:59:26 -0400, Melanie Plageman wrote:
> > From 1dc2343661f3edb3b1bc4307afb0e956397eb76c Mon Sep 17 00:00:00 2001
> > From: Melanie Plageman 
> > Date: Sun, 7 Apr 2024 14:55:22 -0400
> > Subject: [PATCH v10 1/3] Make heapam_scan_analyze_next_[tuple|block] static.
> > 
> > 27bc1772fc81 removed the table AM callbacks scan_analyze_next_block and
> > scan_analyze_next_tuple -- leaving their heap AM implementations only
> > called by acquire_sample_rows().
> 
> Ugh, I don't think 27bc1772fc81 makes much sense. But that's unrelated to this
> thread.  I did raise that separately
> https://www.postgresql.org/message-id/20240407214001.jgpg5q3yv33ve6y3%40awork3.anarazel.de
> 
> Unless I seriously missed something, I see no alternative to reverting that
> commit.

Noted. I'll give up on this refactor then. Lots of churn for no gain.
Attached v11 is just Bilal's v8 patch rebased to apply cleanly and with
a few tweaks (I changed one of the loop conditions. All other changes
are to comments and commit message).

> > @@ -1206,11 +1357,13 @@ acquire_sample_rows(Relation onerel, int elevel,
> > break;
> >  
> > prefetch_block = BlockSampler_Next(&prefetch_bs);
> > -   PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, 
> > prefetch_block);
> > +   PrefetchBuffer(scan->rs_base.rs_rd, MAIN_FORKNUM, 
> > prefetch_block);
> > }
> > }
> >  #endif
> >  
> > +   scan->rs_cbuf = InvalidBuffer;
> > +
> > /* Outer loop over blocks to sample */
> > while (BlockSampler_HasMore(&bs))
> > {
> 
> I don't think it's good to move a lot of code *and* change how it is
> structured in the same commit. Makes it much harder to actually see changes /
> makes git blame harder to use / etc.

Yep.

> > From 90d115c2401567be65bcf64393a6d3b39286779e Mon Sep 17 00:00:00 2001
> > From: Melanie Plageman 
> > Date: Sun, 7 Apr 2024 15:28:32 -0400
> > Subject: [PATCH v10 2/3] Use streaming read API in ANALYZE
> >
> > The ANALYZE command prefetches and reads sample blocks chosen by a
> > BlockSampler algorithm. Instead of calling Prefetch|ReadBuffer() for
> > each block, ANALYZE now uses the streaming API introduced in b5a9b18cd0.
> >
> > Author: Nazir Bilal Yavuz
> > Reviewed-by: Melanie Plageman
> > Discussion: 
> > https://postgr.es/m/flat/CAN55FZ0UhXqk9v3y-zW_fp4-WCp43V8y0A72xPmLkOM%2B6M%2BmJg%40mail.gmail.com
> > ---
> >  src/backend/commands/analyze.c | 89 ++
> >  1 file changed, 26 insertions(+), 63 deletions(-)
> 
> That's a very nice demonstration of how this makes good prefetching easier...

Agreed. Yay streaming read API and Bilal!

> > From 862b7ac81cdafcda7b525e02721da14e46265509 Mon Sep 17 00:00:00 2001
> > From: Melanie Plageman 
> > Date: Sun, 7 Apr 2024 15:38:41 -0400
> > Subject: [PATCH v10 3/3] Obsolete BlockSampler_HasMore()
> > 
> > A previous commit stopped using BlockSampler_HasMore() for flow control
> > in acquire_sample_rows(). There seems little use now for
> > BlockSampler_HasMore(). It should be sufficient to return
> > InvalidBlockNumber from BlockSampler_Next() when BlockSampler_HasMore()
> > would have returned false. Remove BlockSampler_HasMore().
> > 
> > Author: Melanie Plageman, Nazir Bilal Yavuz
> > Discussion: 
> > https://postgr.es/m/flat/CAN55FZ0UhXqk9v3y-zW_fp4-WCp43V8y0A72xPmLkOM%2B6M%2BmJg%40mail.gmail.com
> 
> The justification here seems somewhat odd. Sure, the previous commit stopped
> using BlockSampler_HasMore in acquire_sample_rows - but only because it was
> moved to block_sampling_streaming_read_next()?

It didn't stop using it. It stopped being useful. The reason it existed,
as far as I can tell, was to use it as the while() loop condition in
acquire_sample_rows(). I think it makes much more sense for
BlockSampler_Next() to return InvalidBlockNumber when there are no more
blocks -- not to assert you don't call it when there aren't any more
blocks.

I didn't want to change BlockSampler_Next() in the same commit as the
streaming read user and we can't remove BlockSampler_HasMore() without
changing BlockSampler_Next().
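
To make the intended end state concrete, here is a rough sketch of such a
callback (illustrative only; the names follow the discussion above and may
differ from the actual patches):

/*
 * Streaming read callback: BlockSampler_Next() itself returns
 * InvalidBlockNumber when the sample is exhausted, which ends the stream,
 * so no separate BlockSampler_HasMore() test is needed.
 */
static BlockNumber
block_sampling_streaming_read_next(ReadStream *stream,
								   void *callback_private_data,
								   void *per_buffer_data)
{
	BlockSamplerData *bs = callback_private_data;

	return BlockSampler_Next(bs);
}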

- Melanie
From 3cb43693c04554f5d46e0dc9156bef36af642593 Mon Sep 17 00:00:00 2001
From: Melanie Plageman 
Date: Sun, 7 Apr 2024 18:17:01 -0400
Subject: [PATCH v11 1/2] Use streaming read API in ANALYZE

The ANALYZE command prefetches and reads sample blocks chosen by a
BlockSampler algorithm. Instead of calling [Prefetch|Read]Buffer() for
each block, ANALYZE now uses the streaming API introduced in b5a9b18cd0.

Author: Nazir Bilal Yavuz
Reviewed-by: Melanie Plageman, Andres Freund
Discussion: https://postgr.es/m/flat/CAN55FZ0UhXqk9v3y-zW_fp4-WCp43V8y0A72xPmLkOM%2B6M%2BmJg%40mail.gmail.com
---
 src/backend/access/heap/heapam_handler.c | 20 +++---
 src/backend/commands/analyze.c   | 85 +++-
 src/include/access/heapam.h  |  5 +-
 3 files chan

Re: Add SPLIT PARTITION/MERGE PARTITIONS commands

2024-04-07 Thread Alexander Korotkov
Hi, Alexander!

On Sun, Apr 7, 2024 at 10:00 PM Alexander Lakhin  wrote:
> 07.04.2024 01:22, Alexander Korotkov wrote:
> > I've pushed 0001 and 0002.  I didn't push 0003 for the following reasons.
>
> Please try the following (erroneous) query:
> CREATE TABLE t1(i int, t text) PARTITION BY LIST (t);
> CREATE TABLE t1pa PARTITION OF t1 FOR VALUES IN ('A');
>
> CREATE TABLE t2 (i int, t text) PARTITION BY RANGE (t);
> ALTER TABLE t2 SPLIT PARTITION t1pa INTO
>(PARTITION t2a FOR VALUES FROM ('A') TO ('B'),
> PARTITION t2b FOR VALUES FROM ('B') TO ('C'));
>
> that triggers an assertion failure:
> TRAP: failed Assert("datums != NIL"), File: "partbounds.c", Line: 3434, PID: 
> 1841459
>
> or a segfault (in a non-assert build):
> Program terminated with signal SIGSEGV, Segmentation fault.
>
> #0  pg_detoast_datum_packed (datum=0x0) at fmgr.c:1866
> 1866if (VARATT_IS_COMPRESSED(datum) || VARATT_IS_EXTERNAL(datum))
> (gdb) bt
> #0  pg_detoast_datum_packed (datum=0x0) at fmgr.c:1866
> #1  0x55f38c5d5e3f in bttextcmp (...) at varlena.c:1834
> #2  0x55f38c6030dd in FunctionCall2Coll (...) at fmgr.c:1161
> #3  0x55f38c417c83 in partition_rbound_cmp (...) at partbounds.c:3525
> #4  check_partition_bounds_for_split_range (...) at partbounds.c:5221
> #5  check_partitions_for_split (...) at partbounds.c:5688
> #6  0x55f38c256c49 in transformPartitionCmdForSplit (...) at 
> parse_utilcmd.c:3451
> #7  transformAlterTableStmt (...) at parse_utilcmd.c:3810
> #8  0x55f38c2bdf9c in ATParseTransformCmd (...) at tablecmds.c:5650

Thank you for spotting this.  This seems like a missing check.  I'm
going to get a closer look at this tomorrow.

--
Regards,
Alexander Korotkov




Re: Streaming read-ready sequential scan code

2024-04-07 Thread Andres Freund
Hi,

On 2024-04-08 09:36:59 +1200, Thomas Munro wrote:
> I've been on the fence about that flag for sequential scan...  Some
> days I want to consider changing to READ_STREAM_DEFAULT and relying on
> our "anti-heuristics" to suppress advice, which would work out the
> same in most cases but might occasionally win big.

Agreed, it's pretty easy to end up with a fairly "fragmented" set of a
relation's buffers in s_b.  OTOH, there might not be any need for the
heuristic if we actually trigger reads asynchronously.


> BTW looking at the branching in read-stream user patches that have an
> initialisation step like yours, I wonder if it might ever make sense
> to be able to change the callback on the fly from inside the callback,
> so that you finish up with a branchless one doing most of the work.  I
> have no idea if it's worth it...

I was wondering about that too; I dislike those branches. But instead of
changing the callback, it seems like a better design would be to have another
dedicated callback for that?  There already is a dedicated branch for the
"just starting up" path in read_stream_next_buffer(), so it'd be pretty much
free to call another callback there.
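
Sketched out, it might look something like this (purely illustrative; the
ability to register a separate start-up callback is hypothetical, and the
struct and function names are invented):

typedef struct ScanStreamState
{
	BlockNumber next;			/* next block to hand out */
	BlockNumber nblocks;		/* one past the last block */
} ScanStreamState;

/* steady state: no "am I initialised yet?" branch left */
static BlockNumber
scan_stream_next_block(ReadStream *stream, void *callback_private_data,
					   void *per_buffer_data)
{
	ScanStreamState *state = callback_private_data;

	return state->next < state->nblocks ? state->next++ : InvalidBlockNumber;
}

/* hypothetical start-up callback, called once from the "just starting up"
 * branch of read_stream_next_buffer(), then never again */
static BlockNumber
scan_stream_first_block(ReadStream *stream, void *callback_private_data,
						void *per_buffer_data)
{
	ScanStreamState *state = callback_private_data;

	state->next = 0;			/* one-time initialisation lives here */
	return scan_stream_next_block(stream, callback_private_data,
								  per_buffer_data);
}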

Greetings,

Andres Freund




Re: ❓ JSON Path Dot Precedence

2024-04-07 Thread David E. Wheeler
On Apr 7, 2024, at 15:46, Erik Wienhold  wrote:

> I guess jsonpath assumes that hex, octal, and binary literals are
> integers.  So there's no ambiguity about any fractional part that might
> follow.

Yeah, that’s what the comment in the flex file says:

https://github.com/postgres/postgres/blob/b4a71cf/src/backend/utils/adt/jsonpath_scan.l#L102-L105


> Also, is there even a use case for path "0x2.p10"?  The path has to
> start with "$" or ("@" in case of a filter expression), doesn't it?  And
> in that case it doesn't parse:
> 
>test=# select '$.0x2.p10'::jsonpath;
>ERROR:  trailing junk after numeric literal at or near ".0x" of jsonpath 
> input
>LINE 1: select '$.0x2.p10'::jsonpath;
> 
> Even with extra whitespace:
> 
>test=# select '$ . 0x2 . p10'::jsonpath;
>ERROR:  syntax error at or near "0x2" of jsonpath input
>LINE 1: select '$ . 0x2 . p10'::jsonpath;
> 
> Or should it behave like an array accessor?  Similar to:
> 
>test=# select jsonb_path_query('[0,1,{"p10":42},3]', 
> '$[0x2].p10'::jsonpath);
> jsonb_path_query
>--
> 42
>(1 row)

I too am curious why these parse successfully, but don’t appear to be useful.

Best,

David





Re: Use streaming read API in ANALYZE

2024-04-07 Thread Andres Freund
Hi,

On 2024-04-07 16:59:26 -0400, Melanie Plageman wrote:
> From 1dc2343661f3edb3b1bc4307afb0e956397eb76c Mon Sep 17 00:00:00 2001
> From: Melanie Plageman 
> Date: Sun, 7 Apr 2024 14:55:22 -0400
> Subject: [PATCH v10 1/3] Make heapam_scan_analyze_next_[tuple|block] static.
> 
> 27bc1772fc81 removed the table AM callbacks scan_analyze_next_block and
> scan_analyze_next_tuple -- leaving their heap AM implementations only
> called by acquire_sample_rows().

Ugh, I don't think 27bc1772fc81 makes much sense. But that's unrelated to this
thread.  I did raise that separately
https://www.postgresql.org/message-id/20240407214001.jgpg5q3yv33ve6y3%40awork3.anarazel.de

Unless I seriously missed something, I see no alternative to reverting that
commit.


> @@ -1206,11 +1357,13 @@ acquire_sample_rows(Relation onerel, int elevel,
>   break;
>  
>   prefetch_block = BlockSampler_Next(&prefetch_bs);
> - PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, 
> prefetch_block);
> + PrefetchBuffer(scan->rs_base.rs_rd, MAIN_FORKNUM, 
> prefetch_block);
>   }
>   }
>  #endif
>  
> + scan->rs_cbuf = InvalidBuffer;
> +
>   /* Outer loop over blocks to sample */
>   while (BlockSampler_HasMore(&bs))
>   {

I don't think it's good to move a lot of code *and* change how it is
structured in the same commit. Makes it much harder to actually see changes /
makes git blame harder to use / etc.



> From 90d115c2401567be65bcf64393a6d3b39286779e Mon Sep 17 00:00:00 2001
> From: Melanie Plageman 
> Date: Sun, 7 Apr 2024 15:28:32 -0400
> Subject: [PATCH v10 2/3] Use streaming read API in ANALYZE
>
> The ANALYZE command prefetches and reads sample blocks chosen by a
> BlockSampler algorithm. Instead of calling Prefetch|ReadBuffer() for
> each block, ANALYZE now uses the streaming API introduced in b5a9b18cd0.
>
> Author: Nazir Bilal Yavuz
> Reviewed-by: Melanie Plageman
> Discussion: 
> https://postgr.es/m/flat/CAN55FZ0UhXqk9v3y-zW_fp4-WCp43V8y0A72xPmLkOM%2B6M%2BmJg%40mail.gmail.com
> ---
>  src/backend/commands/analyze.c | 89 ++
>  1 file changed, 26 insertions(+), 63 deletions(-)

That's a very nice demonstration of how this makes good prefetching easier...




> From 862b7ac81cdafcda7b525e02721da14e46265509 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman 
> Date: Sun, 7 Apr 2024 15:38:41 -0400
> Subject: [PATCH v10 3/3] Obsolete BlockSampler_HasMore()
> 
> A previous commit stopped using BlockSampler_HasMore() for flow control
> in acquire_sample_rows(). There seems little use now for
> BlockSampler_HasMore(). It should be sufficient to return
> InvalidBlockNumber from BlockSampler_Next() when BlockSampler_HasMore()
> would have returned false. Remove BlockSampler_HasMore().
> 
> Author: Melanie Plageman, Nazir Bilal Yavuz
> Discussion: 
> https://postgr.es/m/flat/CAN55FZ0UhXqk9v3y-zW_fp4-WCp43V8y0A72xPmLkOM%2B6M%2BmJg%40mail.gmail.com

The justification here seems somewhat odd. Sure, the previous commit stopped
using BlockSampler_HasMore in acquire_sample_rows - but only because it was
moved to block_sampling_streaming_read_next()?

Greetings,

Andres Freund




Re: Flushing large data immediately in pqcomm

2024-04-07 Thread Melih Mutlu
On Sat, 6 Apr 2024 at 04:34, David Rowley wrote:

> Does anyone else want to try the attached script on the v5 patch to
> see if their numbers are better?
>

I'm seeing the below results with your script on my machine. I ran it
several times; the results were very similar each time.

master:
Run 100 100 500: 1.627905512
Run 1024 10240 20: 1.603231684
Run 1024 1048576 2000: 2.962812352
Run 1048576 1048576 1000: 2.940766748

v5:
Run 100 100 500: 1.611508155
Run 1024 10240 20: 1.603505596
Run 1024 1048576 2000: 2.727241937
Run 1048576 1048576 1000: 2.721268988


Re: Table AM Interface Enhancements

2024-04-07 Thread Andres Freund
Hi,

On 2024-03-30 23:33:04 +0200, Alexander Korotkov wrote:
> I've pushed 0001, 0002 and 0006.

I briefly looked at 27bc1772fc81 and I don't think the state after this
commit makes sense. Before this commit, another block-based AM could
implement analyze without much code duplication. Now a large portion of
analyze.c has to be copied, because they can't stop acquire_sample_rows()
from calling heapam_scan_analyze_next_block().

I'm quite certain this will break a few out-of-core AMs in a way that can't
easily be fixed.


And even for non-block based AMs, the new interface basically requires
reimplementing all of analyze.c.

What am I missing here?

Greetings,

Andres Freund




Re: Streaming read-ready sequential scan code

2024-04-07 Thread Thomas Munro
On Sun, Apr 7, 2024 at 1:34 PM Melanie Plageman
 wrote:
> Attached v13 0001 is your fix and 0002 is a new version of the
> sequential scan streaming read user. Off-list Andres mentioned that I
> really ought to separate the parallel and serial sequential scan users
> into two different callbacks. I've done that in the attached. It
> actually makes the code used by the callbacks nicer and more readable
> anyway (putting aside performance). I was able to measure a small
> performance difference as well.

Thanks.  I changed a couple of very trivial things before pushing.

+BlockNumber (*cb) (ReadStream *pgsr, void *private_data,
+   void *per_buffer_data);

This type has a friendly name: ReadStreamBlockNumberCB.
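
For reference, that typedef is declared in src/include/storage/read_stream.h:

typedef BlockNumber (*ReadStreamBlockNumberCB) (ReadStream *stream,
												void *callback_private_data,
												void *per_buffer_data);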

+scan->rs_read_stream =
read_stream_begin_relation(READ_STREAM_SEQUENTIAL,

I've been on the fence about that flag for sequential scan...  Some
days I want to consider changing to READ_STREAM_DEFAULT and relying on
our "anti-heuristics" to suppress advice, which would work out the
same in most cases but might occasionally win big.  It might also
hurt, I dunno, so I suspect we'd have to make it better first, which
my patch in [1] was a first swing at, but I haven't researched that
enough.  So, kept this way!

- * Read the next block of the scan relation into a buffer and pin that buffer
- * before saving it in the scan descriptor.
+ * Read the next block of the scan relation from the read stream and pin that
+ * buffer before saving it in the scan descriptor.

Changed to:

 * Read the next block of the scan relation from the read stream and save it
 * in the scan descriptor.  It is already pinned.

+static BlockNumber
+heap_scan_stream_read_next_parallel(ReadStream *pgsr, void *private_data,
+void *per_buffer_data)

Changed argument names to match the function pointer type definition,
"stream" and "callback_private_data".

BTW looking at the branching in read-stream user patches that have an
initialisation step like yours, I wonder if it might ever make sense
to be able to change the callback on the fly from inside the callback,
so that you finish up with a branchless one doing most of the work.  I
have no idea if it's worth it...

[1] 
https://www.postgresql.org/message-id/CA%2BhUKGLLFvou5rx5FDhm-Pc9r4STQTFFmrx6SUV%2Bvk8fwMbreA%40mail.gmail.com




Re: Streaming I/O, vectored I/O (WIP)

2024-04-07 Thread Nazir Bilal Yavuz
Hi,

On Mon, 8 Apr 2024 at 00:01, Nazir Bilal Yavuz  wrote:
>
> Hi,
>
> On Sun, 7 Apr 2024 at 20:33, Nazir Bilal Yavuz  wrote:
> >
> > Hi,
> >
> > On Tue, 2 Apr 2024 at 11:40, Thomas Munro  wrote:
> > >
> > > I had been planning to commit v14 this morning but got cold feet with
> > > the BMR-based interface.  Heikki didn't like it much, and in the end,
> > > neither did I.  I have now removed it, and it seems much better.  No
> > > other significant changes, just parameter types and inlining details.
> > > For example:
> > >
> > >  * read_stream_begin_relation() now takes a Relation, likes its name says
> > >  * StartReadBuffers()'s operation takes smgr and optional rel
> > >  * ReadBuffer_common() takes smgr and optional rel
> >
> > Read stream objects can currently be created only from Relations, but
> > there could be read stream users that have only an SMgrRelation, not a
> > Relation. So, I created another constructor for read streams that uses
> > SMgrRelations instead of Relations. The related patch is attached.
>
> After sending this, I realized that I forgot to add the persistence value
> to the new constructor. While working on it I also realized that the
> current code sets persistence in the PinBufferForBlock() function, and this
> function is called for each block, which can be costly. So, I moved
> setting the persistence out of the PinBufferForBlock() function.
>
> Setting persistence outside of the PinBufferForBlock() function (0001)
> and creating the new constructor that uses SMgrRelations (0002) are
> attached.

Melanie noticed there was a 'sgmr -> smgr' typo in 0002. Fixed in attached.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft
From 04fd860ce8c4c57830930cb362799fd155c92613 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz 
Date: Sun, 7 Apr 2024 22:33:36 +0300
Subject: [PATCH 1/2] Refactor PinBufferForBlock() to remove if checks about
 persistence

There are if checks in the PinBufferForBlock() function to set the
persistence of the relation, and this function is called for each block
in the relation. Instead of that, set the persistence of the relation
before calling PinBufferForBlock().
---
 src/backend/storage/aio/read_stream.c |  2 +-
 src/backend/storage/buffer/bufmgr.c   | 31 +++
 2 files changed, 14 insertions(+), 19 deletions(-)

diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
index f54dacdd914..d155dde5ce3 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -549,7 +549,7 @@ read_stream_begin_relation(int flags,
 	{
 		stream->ios[i].op.rel = rel;
 		stream->ios[i].op.smgr = RelationGetSmgr(rel);
-		stream->ios[i].op.smgr_persistence = 0;
+		stream->ios[i].op.smgr_persistence = rel->rd_rel->relpersistence;
 		stream->ios[i].op.forknum = forknum;
 		stream->ios[i].op.strategy = strategy;
 	}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 06e9ffd2b00..b4fcefed78a 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1067,24 +1067,10 @@ PinBufferForBlock(Relation rel,
 	BufferDesc *bufHdr;
 	IOContext	io_context;
 	IOObject	io_object;
-	char		persistence;
 
 	Assert(blockNum != P_NEW);
 
-	/*
-	 * If there is no Relation it usually implies recovery and thus permanent,
-	 * but we take an argmument because CreateAndCopyRelationData can reach us
-	 * with only an SMgrRelation for an unlogged relation that we don't want
-	 * to flag with BM_PERMANENT.
-	 */
-	if (rel)
-		persistence = rel->rd_rel->relpersistence;
-	else if (smgr_persistence == 0)
-		persistence = RELPERSISTENCE_PERMANENT;
-	else
-		persistence = smgr_persistence;
-
-	if (persistence == RELPERSISTENCE_TEMP)
+	if (smgr_persistence == RELPERSISTENCE_TEMP)
 	{
 		io_context = IOCONTEXT_NORMAL;
 		io_object = IOOBJECT_TEMP_RELATION;
@@ -1101,7 +1087,7 @@ PinBufferForBlock(Relation rel,
 	   smgr->smgr_rlocator.locator.relNumber,
 	   smgr->smgr_rlocator.backend);
 
-	if (persistence == RELPERSISTENCE_TEMP)
+	if (smgr_persistence == RELPERSISTENCE_TEMP)
 	{
 		bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, foundPtr);
 		if (*foundPtr)
@@ -1109,7 +1095,7 @@ PinBufferForBlock(Relation rel,
 	}
 	else
 	{
-		bufHdr = BufferAlloc(smgr, persistence, forkNum, blockNum,
+		bufHdr = BufferAlloc(smgr, smgr_persistence, forkNum, blockNum,
 			 strategy, foundPtr, io_context);
 		if (*foundPtr)
 			pgBufferUsage.shared_blks_hit++;
@@ -1157,6 +1143,7 @@ ReadBuffer_common(Relation rel, SMgrRelation smgr, char smgr_persistence,
 	ReadBuffersOperation operation;
 	Buffer		buffer;
 	int			flags;
+	char		persistence;
 
 	/*
 	 * Backward compatibility path, most code should use ExtendBufferedRel()
@@ -1195,7 +1182,15 @@ ReadBuffer_common(Relation rel, SMgrRelation smgr, char smgr_persistence,
 		flags = 0;
 	operation.smgr = smgr;
 	operation.rel = rel;
-	operation.smgr_persistence = smgr_persistence;
+
+	if (rel)
+		persistence = rel->rd_rel->relper

Re: Add bump memory context type and use it for tuplesorts

2024-04-07 Thread Tomas Vondra



On 4/7/24 23:09, Andres Freund wrote:
> Hi,
> 
> On 2024-04-07 22:35:47 +0200, Tomas Vondra wrote:
>> I haven't investigated, but considering it works on 64-bit, I guess
>> it's not considering alignment somewhere. I can dig more if needed.
> 
> I think I may have found the problem:
> 
> 
> #define KeeperBlock(set) ((BumpBlock *) ((char *) (set) + 
> sizeof(BumpContext)))
> #define IsKeeperBlock(set, blk) (KeeperBlock(set) == (blk))
> 
> BumpContextCreate():
> ...
>   /* Fill in the initial block's block header */
>   block = (BumpBlock *) (((char *) set) + MAXALIGN(sizeof(BumpContext)));
>   /* determine the block size and initialize it */
>   firstBlockSize = allocSize - MAXALIGN(sizeof(BumpContext));
>   BumpBlockInit(set, block, firstBlockSize);
> ...
>   ((MemoryContext) set)->mem_allocated = allocSize;
> 
> void
> BumpCheck(MemoryContext context)
> ...
>   if (IsKeeperBlock(bump, block))
>   total_allocated += block->endptr - (char *) bump;
> ...
> 
> I suspect that KeeperBlock() isn't returning true, because IsKeeperBlock
> misses the MAXALIGN(). I think that about fits with:
> 
>> #4  0x008f0088 in BumpCheck (context=0x131e330) at bump.c:808
>> 808 Assert(total_allocated == context->mem_allocated);
>> (gdb) p total_allocated
>> $1 = 8120
>> (gdb) p context->mem_allocated
>> $2 = 8192
> 

Yup, changing it to this:

#define KeeperBlock(set) ((BumpBlock *) ((char *) (set) +
MAXALIGN(sizeof(BumpContext))))

fixes the issue for me.
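
For anyone curious, the mismatch is easy to reproduce standalone (the 212-byte
size below is an invented stand-in for a 32-bit sizeof(BumpContext), chosen
only because it is not already maxaligned):

#include <stdio.h>

#define MAXIMUM_ALIGNOF 8
#define MAXALIGN(LEN) \
	(((size_t) (LEN) + (MAXIMUM_ALIGNOF - 1)) & ~((size_t) (MAXIMUM_ALIGNOF - 1)))

int
main(void)
{
	size_t		ctx_size = 212;	/* pretend sizeof(BumpContext) */

	/* where BumpContextCreate() actually places the keeper block */
	printf("block offset: %zu\n", MAXALIGN(ctx_size));	/* 216 */
	/* where the unfixed KeeperBlock() macro pointed */
	printf("macro offset: %zu\n", ctx_size);	/* 212 */
	return 0;
}

On 64-bit builds sizeof(BumpContext) was presumably already maxaligned, which
would explain why only the 32-bit animals tripped the Assert.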


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: pub/sub - specifying optional parameters without values.

2024-04-07 Thread Tom Lane
Peter Smith  writes:
> Here is a similar update for another page: "55.4 Streaming Replication
> Protocol" [0]. This patch was prompted by a review comment reply at
> [1] (#2).

> I've used text almost the same as the boilerplate text added by the
> previous commit [2]

Pushed, except I put it at the bottom of the section not the top.

regards, tom lane




Re: Add bump memory context type and use it for tuplesorts

2024-04-07 Thread Andres Freund
Hi,

On 2024-04-07 22:35:47 +0200, Tomas Vondra wrote:
> I haven't investigated, but considering it works on 64-bit, I guess
> it's not considering alignment somewhere. I can dig more if needed.

I think I may have found the problem:


#define KeeperBlock(set) ((BumpBlock *) ((char *) (set) + sizeof(BumpContext)))
#define IsKeeperBlock(set, blk) (KeeperBlock(set) == (blk))

BumpContextCreate():
...
/* Fill in the initial block's block header */
block = (BumpBlock *) (((char *) set) + MAXALIGN(sizeof(BumpContext)));
/* determine the block size and initialize it */
firstBlockSize = allocSize - MAXALIGN(sizeof(BumpContext));
BumpBlockInit(set, block, firstBlockSize);
...
((MemoryContext) set)->mem_allocated = allocSize;

void
BumpCheck(MemoryContext context)
...
if (IsKeeperBlock(bump, block))
total_allocated += block->endptr - (char *) bump;
...

I suspect that KeeperBlock() isn't returning true, because IsKeeperBlock misses
the MAXALIGN(). I think that about fits with:

> #4  0x008f0088 in BumpCheck (context=0x131e330) at bump.c:808
> 808 Assert(total_allocated == context->mem_allocated);
> (gdb) p total_allocated
> $1 = 8120
> (gdb) p context->mem_allocated
> $2 = 8192


Greetings,

Andres Freund




Re: Streaming I/O, vectored I/O (WIP)

2024-04-07 Thread Nazir Bilal Yavuz
Hi,

On Sun, 7 Apr 2024 at 20:33, Nazir Bilal Yavuz  wrote:
>
> Hi,
>
> On Tue, 2 Apr 2024 at 11:40, Thomas Munro  wrote:
> >
> > I had been planning to commit v14 this morning but got cold feet with
> > the BMR-based interface.  Heikki didn't like it much, and in the end,
> > neither did I.  I have now removed it, and it seems much better.  No
> > other significant changes, just parameter types and inlining details.
> > For example:
> >
> >  * read_stream_begin_relation() now takes a Relation, likes its name says
> >  * StartReadBuffers()'s operation takes smgr and optional rel
> >  * ReadBuffer_common() takes smgr and optional rel
>
> Read stream objects can currently be created only from Relations, but
> there could be read stream users that have only an SMgrRelation, not a
> Relation. So, I created another constructor for read streams that uses
> SMgrRelations instead of Relations. The related patch is attached.

After sending this, I realized that I forgot to add the persistence value
to the new constructor. While working on it I also realized that the
current code sets persistence in the PinBufferForBlock() function, and this
function is called for each block, which can be costly. So, I moved
setting the persistence out of the PinBufferForBlock() function.

Setting persistence outside of the PinBufferForBlock() function (0001)
and creating the new constructor that uses SMgrRelations (0002) are
attached.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft
From 04fd860ce8c4c57830930cb362799fd155c92613 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz 
Date: Sun, 7 Apr 2024 22:33:36 +0300
Subject: [PATCH 1/2] Refactor PinBufferForBlock() to remove if checks about
 persistence

There are if checks in the PinBufferForBlock() function to set the
persistence of the relation, and this function is called for each block
in the relation. Instead of that, set the persistence of the relation
before calling PinBufferForBlock().
---
 src/backend/storage/aio/read_stream.c |  2 +-
 src/backend/storage/buffer/bufmgr.c   | 31 +++
 2 files changed, 14 insertions(+), 19 deletions(-)

diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
index f54dacdd914..d155dde5ce3 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -549,7 +549,7 @@ read_stream_begin_relation(int flags,
 	{
 		stream->ios[i].op.rel = rel;
 		stream->ios[i].op.smgr = RelationGetSmgr(rel);
-		stream->ios[i].op.smgr_persistence = 0;
+		stream->ios[i].op.smgr_persistence = rel->rd_rel->relpersistence;
 		stream->ios[i].op.forknum = forknum;
 		stream->ios[i].op.strategy = strategy;
 	}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 06e9ffd2b00..b4fcefed78a 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1067,24 +1067,10 @@ PinBufferForBlock(Relation rel,
 	BufferDesc *bufHdr;
 	IOContext	io_context;
 	IOObject	io_object;
-	char		persistence;
 
 	Assert(blockNum != P_NEW);
 
-	/*
-	 * If there is no Relation it usually implies recovery and thus permanent,
-	 * but we take an argmument because CreateAndCopyRelationData can reach us
-	 * with only an SMgrRelation for an unlogged relation that we don't want
-	 * to flag with BM_PERMANENT.
-	 */
-	if (rel)
-		persistence = rel->rd_rel->relpersistence;
-	else if (smgr_persistence == 0)
-		persistence = RELPERSISTENCE_PERMANENT;
-	else
-		persistence = smgr_persistence;
-
-	if (persistence == RELPERSISTENCE_TEMP)
+	if (smgr_persistence == RELPERSISTENCE_TEMP)
 	{
 		io_context = IOCONTEXT_NORMAL;
 		io_object = IOOBJECT_TEMP_RELATION;
@@ -1101,7 +1087,7 @@ PinBufferForBlock(Relation rel,
 	   smgr->smgr_rlocator.locator.relNumber,
 	   smgr->smgr_rlocator.backend);
 
-	if (persistence == RELPERSISTENCE_TEMP)
+	if (smgr_persistence == RELPERSISTENCE_TEMP)
 	{
 		bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, foundPtr);
 		if (*foundPtr)
@@ -1109,7 +1095,7 @@ PinBufferForBlock(Relation rel,
 	}
 	else
 	{
-		bufHdr = BufferAlloc(smgr, persistence, forkNum, blockNum,
+		bufHdr = BufferAlloc(smgr, smgr_persistence, forkNum, blockNum,
 			 strategy, foundPtr, io_context);
 		if (*foundPtr)
 			pgBufferUsage.shared_blks_hit++;
@@ -1157,6 +1143,7 @@ ReadBuffer_common(Relation rel, SMgrRelation smgr, char smgr_persistence,
 	ReadBuffersOperation operation;
 	Buffer		buffer;
 	int			flags;
+	char		persistence;
 
 	/*
 	 * Backward compatibility path, most code should use ExtendBufferedRel()
@@ -1195,7 +1182,15 @@ ReadBuffer_common(Relation rel, SMgrRelation smgr, char smgr_persistence,
 		flags = 0;
 	operation.smgr = smgr;
 	operation.rel = rel;
-	operation.smgr_persistence = smgr_persistence;
+
+	if (rel)
+		persistence = rel->rd_rel->relpersistence;
+	else if (smgr_persistence == 0)
+		persistence = RELPERSISTENCE_PERMANENT;
+	else
+		persistence = smgr_persistence;
+	operation.smgr_persistence = persistence;
+
 	operation.forknum = forkN

Re: Use streaming read API in ANALYZE

2024-04-07 Thread Melanie Plageman
On Sun, Apr 7, 2024 at 3:57 PM Melanie Plageman
 wrote:
>
> On Thu, Apr 04, 2024 at 02:03:30PM +0300, Nazir Bilal Yavuz wrote:
> >
> > On Wed, 3 Apr 2024 at 23:44, Melanie Plageman  
> > wrote:
> > >
> > > I don't see the point of BlockSampler_HasMore() anymore. I removed it in
> > > the attached and made BlockSampler_Next() return InvalidBlockNumber
> > > under the same conditions. Is there a reason not to do this? There
> > > aren't other callers. If the BlockSampler_Next() wasn't part of an API,
> > > we could just make it the streaming read callback, but that might be
> > > weird as it is now.
> >
> > I agree. There is no reason to have BlockSampler_HasMore() after
> > streaming read API changes.
> >
> > > That and my other ideas in attached. Let me know what you think.
> >
> > I agree with your changes but I am not sure if others agree with all
> > the changes you have proposed. So, I didn't merge 0001 and your ideas
> > yet, instead I wrote a commit message, added some comments, changed ->
> > 'if (bs->t >= bs->N || bs->m >= bs->n)' to 'if (K <= 0 || k <= 0)' and
> > attached it as 0002.
>
> I couldn't quite let go of those changes to acquire_sample_rows(), so
> attached v9 0001 implements them as a preliminary patch before your
> analyze streaming read user. I inlined heapam_scan_analyze_next_block()
> entirely and made heapam_scan_analyze_next_tuple() a static function in
> commands/analyze.c (and tweaked the name).
>
> I made a few tweaks to your patch since it is on top of those changes
> instead of preceding them. Then 0003 is removing BlockSampler_HasMore()
> since it doesn't make sense to remove it before the streaming read user
> was added.

I realized there were a few outdated comments. Fixed in attached v10.

- Melanie
From 1dc2343661f3edb3b1bc4307afb0e956397eb76c Mon Sep 17 00:00:00 2001
From: Melanie Plageman 
Date: Sun, 7 Apr 2024 14:55:22 -0400
Subject: [PATCH v10 1/3] Make heapam_scan_analyze_next_[tuple|block] static.

27bc1772fc81 removed the table AM callbacks scan_analyze_next_block and
scan_analyze_next_tuple -- leaving their heap AM implementations only
called by acquire_sample_rows().

Move heapam_scan_analyze_next_tuple() to analyze.c as a static helper for
acquire_sample_rows() and inline heapam_scan_analyze_next_block().

Author: Melanie Plageman
Discussion: https://postgr.es/m/flat/CAN55FZ0UhXqk9v3y-zW_fp4-WCp43V8y0A72xPmLkOM%2B6M%2BmJg%40mail.gmail.com
---
 src/backend/access/heap/heapam_handler.c | 193 +--
 src/backend/commands/analyze.c   | 181 -
 src/include/access/heapam.h  |   9 --
 3 files changed, 179 insertions(+), 204 deletions(-)

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 58de2c82a70..364dd0b165b 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1054,189 +1054,6 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 	pfree(isnull);
 }
 
-/*
- * Prepare to analyze block `blockno` of `scan`.  The scan has been started
- * with SO_TYPE_ANALYZE option.
- *
- * This routine holds a buffer pin and lock on the heap page.  They are held
- * until heapam_scan_analyze_next_tuple() returns false.  That is until all the
- * items of the heap page are analyzed.
- */
-void
-heapam_scan_analyze_next_block(TableScanDesc scan, BlockNumber blockno,
-			   BufferAccessStrategy bstrategy)
-{
-	HeapScanDesc hscan = (HeapScanDesc) scan;
-
-	/*
-	 * We must maintain a pin on the target page's buffer to ensure that
-	 * concurrent activity - e.g. HOT pruning - doesn't delete tuples out from
-	 * under us.  Hence, pin the page until we are done looking at it.  We
-	 * also choose to hold sharelock on the buffer throughout --- we could
-	 * release and re-acquire sharelock for each tuple, but since we aren't
-	 * doing much work per tuple, the extra lock traffic is probably better
-	 * avoided.
-	 */
-	hscan->rs_cblock = blockno;
-	hscan->rs_cindex = FirstOffsetNumber;
-	hscan->rs_cbuf = ReadBufferExtended(scan->rs_rd, MAIN_FORKNUM,
-		blockno, RBM_NORMAL, bstrategy);
-	LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_SHARE);
-}
-
-/*
- * Iterate over tuples in the block selected with
- * heapam_scan_analyze_next_block().  If a tuple that's suitable for sampling
- * is found, true is returned and a tuple is stored in `slot`.  When no more
- * tuples for sampling, false is returned and the pin and lock acquired by
- * heapam_scan_analyze_next_block() are released.
- *
- * *liverows and *deadrows are incremented according to the encountered
- * tuples.
- */
-bool
-heapam_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
-			   double *liverows, double *deadrows,
-			   TupleTableSlot *slot)
-{
-	HeapScanDesc hscan = (HeapScanDesc) scan;
-	Page		targpage;
-	OffsetNumber maxoffset;
-	BufferHeapTupleTableSlot *hslot;
-
-	Assert(TTS_IS_BUFFERTUPLE(slot));
-
-	hslot = (BufferH

Re: Add bump memory context type and use it for tuplesorts

2024-04-07 Thread Tomas Vondra




On 4/7/24 22:35, Tomas Vondra wrote:
> On 4/7/24 14:37, David Rowley wrote:
>> On Sun, 7 Apr 2024 at 22:05, John Naylor  wrote:
>>>
>>> On Sat, Apr 6, 2024 at 7:37 PM David Rowley  wrote:
>>>>
>>>> I'm planning on pushing these, pending a final look at 0002 and 0003
>>>> on Sunday morning NZ time (UTC+12), likely in about 10 hours time.
>>>
>>> +1
>>
>> I've now pushed all 3 patches.   Thank you for all the reviews on
>> these and for the extra MemoryContextMethodID bit, Matthias.
>>
>>> I haven't looked at v6, but I've tried using it in situ, and it seems
>>> to work as well as hoped:
>>>
>>> https://www.postgresql.org/message-id/CANWCAZZQFfxvzO8yZHFWtQV%2BZ2gAMv1ku16Vu7KWmb5kZQyd1w%40mail.gmail.com
>>
>> I'm already impressed with the radix tree work.  Nice to see bump
>> allowing a little more memory to be saved for TID storage.
>>
>> David
> 
> There seems to be some issue with this on 32-bit machines. A couple
> animals (grison, mamba) already complained about an assert in
> BumpCheck() during initdb; I get the same crash on my rpi5 running
> 32-bit debian - see the backtrace attached.
> 
> I haven't investigated, but considering it works on 64-bit, I guess
> it's not considering alignment somewhere. I can dig more if needed.
> 

I did try running it under valgrind, and there doesn't seem to be
anything particularly wrong - just a bit of noise about calculating CRC
on uninitialized bytes.

regards
-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Security lessons from liblzma

2024-04-07 Thread Ranier Vilela
Hi,

Sorry, this has no connection with the technical part of the thread,
but I couldn't help recording it and congratulating Andres Freund on
the excellent work.
It's not every day that the major press, in Brazil or around the world,
publishes something about technology people.

Today I came across a publication in Brazil's most widely circulated
newspaper, calling him a hero for preventing the attack.

programador-conserta-falha-no-linux-evita-ciberataque-e-vira-heroi-da-internet.shtml


Congratulations Andres Freund, nice work.

best regards,
Ranier Vilela


Re: Add bump memory context type and use it for tuplesorts

2024-04-07 Thread Tomas Vondra
On 4/7/24 14:37, David Rowley wrote:
> On Sun, 7 Apr 2024 at 22:05, John Naylor  wrote:
>>
>> On Sat, Apr 6, 2024 at 7:37 PM David Rowley  wrote:
>>>
>>> I'm planning on pushing these, pending a final look at 0002 and 0003
>>> on Sunday morning NZ time (UTC+12), likely in about 10 hours time.
>>
>> +1
> 
> I've now pushed all 3 patches.   Thank you for all the reviews on
> these and for the extra MemoryContextMethodID bit, Matthias.
> 
>> I haven't looked at v6, but I've tried using it in situ, and it seems
>> to work as well as hoped:
>>
>> https://www.postgresql.org/message-id/CANWCAZZQFfxvzO8yZHFWtQV%2BZ2gAMv1ku16Vu7KWmb5kZQyd1w%40mail.gmail.com
> 
> I'm already impressed with the radix tree work.  Nice to see bump
> allowing a little more memory to be saved for TID storage.
> 
> David

There seems to be some issue with this on 32-bit machines. A couple
animals (grison, mamba) already complained about an assert in
BumpCheck() during initdb; I get the same crash on my rpi5 running
32-bit debian - see the backtrace attached.

I haven't investigated, but considering it works on 64-bit, I guess
it's not considering alignment somewhere. I can dig more if needed.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

warning: Can't open file /SYSV003414ed (deleted) during file-backed mapping
note processing
[New LWP 20363]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/arm-linux-gnueabihf/libthread_db.so.1".
Core was generated by 
`/mnt/data/postgres/tmp_install/home/debian/pg-master/bin/postgres --boot -F 
-c'.
Program terminated with signal SIGABRT, Aborted.
#0  __pthread_kill_implementation (threadid=4152716800, signo=6, 
no_tid=) at pthread_kill.c:44
44  pthread_kill.c: No such file or directory.
(gdb) bt
#0  __pthread_kill_implementation (threadid=4152716800, signo=6, 
no_tid=) at pthread_kill.c:44
#1  0xf721bcfc in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#2  0xf72050a0 in __GI_abort () at abort.c:79
#3  0x008b29fc in ExceptionalCondition (conditionName=0xb1c3cc "total_allocated 
== context->mem_allocated", fileName=0xb1bf98 "bump.c", lineNumber=808) at 
assert.c:66
#4  0x008f0088 in BumpCheck (context=0x131e330) at bump.c:808
#5  0x008ef1bc in BumpReset (context=0x131e330) at bump.c:258
#6  0x008ef310 in BumpDelete (context=0x131e330) at bump.c:287
#7  0x008fab58 in MemoryContextDeleteOnly (context=0x131e330) at mcxt.c:528
#8  0x008faa00 in MemoryContextDelete (context=0x131e330) at mcxt.c:482
#9  0x008fac1c in MemoryContextDeleteChildren (context=0x131c328) at mcxt.c:548
#10 0x008fa75c in MemoryContextReset (context=0x131c328) at mcxt.c:389
#11 0x0090a1a8 in tuplesort_free (state=0x131a3b8) at tuplesort.c:958
#12 0x0090a1fc in tuplesort_end (state=0x131a3b8) at tuplesort.c:973
#13 0x00149250 in _bt_spooldestroy (btspool=0x1222b20) at nbtsort.c:517
#14 0x00149200 in _bt_spools_heapscan (heap=0xebf9f250, index=0xebf426e0, 
buildstate=0xffde9ed0, indexInfo=0x12e8f90) at nbtsort.c:504
#15 0x00148e30 in btbuild (heap=0xebf9f250, index=0xebf426e0, 
indexInfo=0x12e8f90) at nbtsort.c:321
#16 0x001fd164 in index_build (heapRelation=0xebf9f250, 
indexRelation=0xebf426e0, indexInfo=0x12e8f90, isreindex=false, parallel=false) 
at index.c:3021
#17 0x001e0404 in build_indices () at bootstrap.c:962
#18 0x001da92c in boot_yyparse () at bootparse.y:388
#19 0x001de988 in BootstrapModeMain (argc=6, argv=0x119c114, check_only=false) 
at bootstrap.c:357
#20 0x004389e4 in main (argc=7, argv=0x119c110) at main.c:186
(gdb) up
#1  0xf721bcfc in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
26  ../sysdeps/posix/raise.c: No such file or directory.
(gdb) up
#2  0xf72050a0 in __GI_abort () at abort.c:79
79  abort.c: No such file or directory.
(gdb) up
#3  0x008b29fc in ExceptionalCondition (conditionName=0xb1c3cc "total_allocated 
== context->mem_allocated", fileName=0xb1bf98 "bump.c", lineNumber=808) at 
assert.c:66
66  abort();
(gdb) up
#4  0x008f0088 in BumpCheck (context=0x131e330) at bump.c:808
808 Assert(total_allocated == context->mem_allocated);
(gdb) p total_allocated
$1 = 8120
(gdb) p context->mem_allocated
$2 = 8192


Re: Document efficient self-joins / UPDATE LIMIT techniques.

2024-04-07 Thread Tom Lane
Corey Huinker  writes:
>> I've also used the technique quite a lot, but only using the PK,
>> didn't know about the ctid trick, so many thanks for documenting it.

> tid-scans only became a thing a few versions ago (12?). Prior to that, PK
> was the only way to go.

I think we had TID scans for a while before it was possible to use
them in joins, although I don't recall the details of that.
Anyway, pushed after some additional wordsmithing.

regards, tom lane




Re: Use streaming read API in ANALYZE

2024-04-07 Thread Melanie Plageman
On Thu, Apr 04, 2024 at 02:03:30PM +0300, Nazir Bilal Yavuz wrote:
> 
> On Wed, 3 Apr 2024 at 23:44, Melanie Plageman  
> wrote:
> >
> > I don't see the point of BlockSampler_HasMore() anymore. I removed it in
> > the attached and made BlockSampler_Next() return InvalidBlockNumber
> > under the same conditions. Is there a reason not to do this? There
> > aren't other callers. If the BlockSampler_Next() wasn't part of an API,
> > we could just make it the streaming read callback, but that might be
> > weird as it is now.
> 
> I agree. There is no reason to have BlockSampler_HasMore() after
> streaming read API changes.
> 
> > That and my other ideas in attached. Let me know what you think.
> 
> I agree with your changes but I am not sure if others agree with all
> the changes you have proposed. So, I didn't merge 0001 and your ideas
> yet, instead I wrote a commit message, added some comments, changed ->
> 'if (bs->t >= bs->N || bs->m >= bs->n)' to 'if (K <= 0 || k <= 0)' and
> attached it as 0002.

I couldn't quite let go of those changes to acquire_sample_rows(), so
attached v9 0001 implements them as a preliminary patch before your
analyze streaming read user. I inlined heapam_scan_analyze_next_block()
entirely and made heapam_scan_analyze_next_tuple() a static function in
commands/analyze.c (and tweaked the name).

I made a few tweaks to your patch since it is on top of those changes
instead of preceding them. Then 0003 is removing BlockSampler_HasMore()
since it doesn't make sense to remove it before the streaming read user
was added.

- Melanie
From b301a73e2ede8eb1d95e6a8c8d89f09959d9c22a Mon Sep 17 00:00:00 2001
From: Melanie Plageman 
Date: Sun, 7 Apr 2024 14:55:22 -0400
Subject: [PATCH v9 1/3] Make heapam_scan_analyze_next_[tuple|block] static.

27bc1772fc81 removed the table AM callbacks scan_analyze_next_block and
scan_analyze_next_tuple -- leaving their heap AM implementations only
called by acquire_sample_rows().

Move heapam_scan_analyze_next_tuple() to analyze.c as a static helper for
acquire_sample_rows() and inline heapam_scan_analyze_next_block().

Author: Melanie Plageman
Discussion: https://postgr.es/m/flat/CAN55FZ0UhXqk9v3y-zW_fp4-WCp43V8y0A72xPmLkOM%2B6M%2BmJg%40mail.gmail.com
---
 src/backend/access/heap/heapam_handler.c | 183 ---
 src/backend/commands/analyze.c   | 181 +-
 src/include/access/heapam.h  |   9 --
 3 files changed, 174 insertions(+), 199 deletions(-)

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 58de2c82a70..5edac76ceb6 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1054,189 +1054,6 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 	pfree(isnull);
 }
 
-/*
- * Prepare to analyze block `blockno` of `scan`.  The scan has been started
- * with SO_TYPE_ANALYZE option.
- *
- * This routine holds a buffer pin and lock on the heap page.  They are held
- * until heapam_scan_analyze_next_tuple() returns false.  That is until all the
- * items of the heap page are analyzed.
- */
-void
-heapam_scan_analyze_next_block(TableScanDesc scan, BlockNumber blockno,
-			   BufferAccessStrategy bstrategy)
-{
-	HeapScanDesc hscan = (HeapScanDesc) scan;
-
-	/*
-	 * We must maintain a pin on the target page's buffer to ensure that
-	 * concurrent activity - e.g. HOT pruning - doesn't delete tuples out from
-	 * under us.  Hence, pin the page until we are done looking at it.  We
-	 * also choose to hold sharelock on the buffer throughout --- we could
-	 * release and re-acquire sharelock for each tuple, but since we aren't
-	 * doing much work per tuple, the extra lock traffic is probably better
-	 * avoided.
-	 */
-	hscan->rs_cblock = blockno;
-	hscan->rs_cindex = FirstOffsetNumber;
-	hscan->rs_cbuf = ReadBufferExtended(scan->rs_rd, MAIN_FORKNUM,
-		blockno, RBM_NORMAL, bstrategy);
-	LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_SHARE);
-}
-
-/*
- * Iterate over tuples in the block selected with
- * heapam_scan_analyze_next_block().  If a tuple that's suitable for sampling
- * is found, true is returned and a tuple is stored in `slot`.  When no more
- * tuples for sampling, false is returned and the pin and lock acquired by
- * heapam_scan_analyze_next_block() are released.
- *
- * *liverows and *deadrows are incremented according to the encountered
- * tuples.
- */
-bool
-heapam_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
-			   double *liverows, double *deadrows,
-			   TupleTableSlot *slot)
-{
-	HeapScanDesc hscan = (HeapScanDesc) scan;
-	Page		targpage;
-	OffsetNumber maxoffset;
-	BufferHeapTupleTableSlot *hslot;
-
-	Assert(TTS_IS_BUFFERTUPLE(slot));
-
-	hslot = (BufferHeapTupleTableSlot *) slot;
-	targpage = BufferGetPage(hscan->rs_cbuf);
-	maxoffset = PageGetMaxOffsetNumber(targpage);
-
-	/* Inner loop over all tuples on the selected page */
-	for (; hs

Re: ❓ JSON Path Dot Precedence

2024-04-07 Thread Erik Wienhold
On 2024-04-07 18:13 +0200, David E. Wheeler wrote:
> A question about the behavior of the JSON Path parser. The docs[1]
> have this to say about numbers:
> 
> >  Numeric literals in SQL/JSON path expressions follow JavaScript
> >  rules, which are different from both SQL and JSON in some minor
> >  details. For example, SQL/JSON path allows .1 and 1., which are
> >  invalid in JSON.
> 
> In other words, this is valid:
> 
> david=# select '2.'::jsonpath;
>  jsonpath 
> --
>  2
> 
> But this feature creates a bit of a conflict with the use of a dot for
> path expressions. Consider `0x2.p10`. How should that be parsed? As an
> invalid decimal expression ("trailing junk after numeric literal”), or
> as a valid integer 2 followed by the path segment “p10”? Here’s the
> parser’s answer:
> 
> david=# select '0x2.p10'::jsonpath;
>  jsonpath  
> ---
>  (2)."p10"
> 
> So it would seem that, other things being equal, a path key expression
> (`.foo`) has slightly higher precedence than a decimal expression. Is
> that intentional/correct?

I guess jsonpath assumes that hex, octal, and binary literals are
integers.  So there's no ambiguity about any fractional part that might
follow.

> Discovered while writing my Go lexer and throwing all of Go’s floating
> point literal examples[2] at it and comparing to the Postgres path
> parser. Curiously, this is only an issue for 0x/0o/0b numeric
> expressions; a decimal expression does not behave in the same way:
> 
> david=# select '2.p10'::jsonpath;
> ERROR:  trailing junk after numeric literal at or near "2.p" of jsonpath input
> LINE 1: select '2.p10'::jsonpath;

It scans the decimal "2." and then finds junk "p10".

Works with a full decimal:

test=# select '3.14.p10'::jsonpath;
   jsonpath
--
 (3.14)."p10"
(1 row)

And with extra whitespace to resolve the ambiguity:

test=# select '2 .p10'::jsonpath;
 jsonpath
---
 (2)."p10"
(1 row)

> Which maybe seems a bit inconsistent.
> 
> Thoughts on what the “correct” behavior should be?

I'd say a member accessor after a number doesn't really make sense
because object keys are strings.  One could argue that path "$.2.p10"
should match JSON '{"2":{"p10":42}}', i.e. the numeric accessor is
converted to a string.  For example, in nodejs I can do:

> var x = {2: {p10: 42}}
> x[2].p10
42

But that's JavaScript, not JSON.

Also, is there even a use case for path "0x2.p10"?  The path has to
start with "$" or ("@" in case of a filter expression), doesn't it?  And
in that case it doesn't parse:

test=# select '$.0x2.p10'::jsonpath;
ERROR:  trailing junk after numeric literal at or near ".0x" of jsonpath 
input
LINE 1: select '$.0x2.p10'::jsonpath;

Even with extra whitespace:

test=# select '$ . 0x2 . p10'::jsonpath;
ERROR:  syntax error at or near "0x2" of jsonpath input
LINE 1: select '$ . 0x2 . p10'::jsonpath;

Or should it behave like an array accessor?  Similar to:

test=# select jsonb_path_query('[0,1,{"p10":42},3]', 
'$[0x2].p10'::jsonpath);
 jsonb_path_query
--
 42
(1 row)

>   [1]: 
> https://www.postgresql.org/docs/devel/datatype-json.html#DATATYPE-JSONPATH
>   [2]: https://tip.golang.org/ref/spec#Floating-point_literals

-- 
Erik




Re: LogwrtResult contended spinlock

2024-04-07 Thread Jeff Davis
On Sun, 2024-04-07 at 14:19 +0200, Alvaro Herrera wrote:
> I pushed the "copy" pointer now, except that I renamed it to
> "insert",
> which is what we call the operation being tracked.  I also added some
> comments.

The "copy" pointer, as I called it, is not the same as the "insert"
pointer (as returned by GetXLogInsertRecPtr()).

LSNs before the "insert" pointer are reserved for WAL inserters (and
may or may not have made it to WAL buffers yet). LSNs before the "copy"
pointer are written to WAL buffers with CopyXLogRecordToWAL() (and may
or may not have been evicted to the OS file yet).
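
A toy model of the ordering may help (the names here are descriptive
stand-ins, not the actual XLogCtl fields):

#include <assert.h>
#include <stdint.h>

typedef struct WalPointers
{
	uint64_t	insert;			/* LSNs below this are reserved by inserters */
	uint64_t	copy;			/* LSNs below this are in WAL buffers */
	uint64_t	write;			/* LSNs below this are written out */
} WalPointers;

/* a record is first reserved, then copied, then written */
static void
check_invariant(const WalPointers *p)
{
	assert(p->write <= p->copy && p->copy <= p->insert);
}

int
main(void)
{
	WalPointers p = {.insert = 300, .copy = 200, .write = 100};

	check_invariant(&p);
	return 0;
}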

Other than that, looks good, thank you.

Regards,
Jeff Davis





Re: EXCLUDE COLLATE in CREATE/ALTER TABLE document

2024-04-07 Thread Tom Lane
"Daniel Verite"  writes:
> Also the use of "and/or" in the previous version conveys the fact
> that operator class and ordering options are not mutually
> exclusive. But when using "any of the following" in the new text,
> doesn't it lose that meaning?

Yeah; and/or is perfectly fine here and doesn't need to be improved
on.

There's a bigger problem though, which is that these bits
are *also* missing any reference to opclass parameters.
I fixed that and pushed it.

regards, tom lane




Re: Table AM Interface Enhancements

2024-04-07 Thread Pavel Borisov
Hi, Alexander!

On Sun, 7 Apr 2024 at 12:34, Pavel Borisov  wrote:

> Hi, Alexander!
>
> On Sun, 7 Apr 2024 at 07:33, Alexander Korotkov 
> wrote:
>
>> Hi, Pavel!
>>
>> On Fri, Apr 5, 2024 at 6:58 PM Pavel Borisov 
>> wrote:
>> > On Tue, 2 Apr 2024 at 19:17, Jeff Davis  wrote:
>> >>
>> >> On Tue, 2024-04-02 at 11:49 +0300, Alexander Korotkov wrote:
>> >> > I don't like the idea that every custom table AM reltoptions should
>> >> > begin with StdRdOptions.  I would rather introduce the new data
>> >> > structure with table options, which need to be accessed outside of
>> >> > table AM.  Then reloptions will be a backbox only directly used in
>> >> > table AM, while table AM has a freedom on what to store in reloptions
>> >> > and how to calculate externally-visible options.  What do you think?
>> >>
>> >> Hi Alexander!
>> >>
>> >> I agree with all of that. It will take some refactoring to get there,
>> >> though.
>> >>
>> >> One idea is to store StdRdOptions like normal, but if an unrecognized
>> >> option is found, ask the table AM if it understands the option. In that
>> >> case I think we'd just use a different field in pg_class so that it can
>> >> use whatever format it wants to represent its options.
>> >>
>> >> Regards,
>> >> Jeff Davis
>> >
>> > I tried to rework a patch regarding table am according to the input
>> from Alexander and Jeff.
>> >
>> > It splits table reloptions into two categories:
>> > - common for all tables (stored in a fixed size structure and could be
>> accessed from outside)
>> > - table-am specific (variable size, parsed and accessed by access
>> method only)
>>
>> Thank you for your work.  Please, check the revised patch.
>>
>> It makes CommonRdOptions a separate data structure, not directly
>> involved in parsing the reloption.  Instead table AM can fill it on
>> the base of its reloptions or calculate the other way.  Patch comes
>> with a test module, which comes with heap-based table AM.  This table
>> AM has "enable_parallel" reloption, which is used as the base to set
>> the value of CommonRdOptions.parallel_workers.
>>
> To me, patch v10 looks good.
>
> I think the comment for RelationData now applies only to rd_options, not
> to rd_common_options.
> >NULLs means "use defaults".
>
> Regards,
> Pavel
>

I made minor changes to the patch. Please find v11 attached.

Regards,
Pavel.


v11-0001-Custom-reloptions-for-table-AM.patch
Description: Binary data


Re: MultiXact\SLRU buffers configuration

2024-04-07 Thread Andrey M. Borodin



> On 7 Apr 2024, at 21:41, Alvaro Herrera  wrote:
> 
> Well, it would be nice to have *some* test, but as you say it is way
> more complex than the thing being tested, and it zooms in on the
> functioning of the multixact creation in insane quantum-physics ways ...
> to the point that you can no longer trust that multixact works the same
> way with the test as without it.  So what I did was manually run other
> tests (pgbench) to verify that the corner case in multixact creation is
> being handled correctly, and pushed the patch after a few corrections,
> in particular so that it would follow the CV sleeping protocol a little
> more closely to what the CV documentation suggests, which is not at all
> what the patch did.  I also fixed a few comments that had been neglected
> and changed the name and description of the CV in the docs.

That's excellent! Thank you!

> Now, maybe we can still add the test later, but it needs a rebase.

Sure. 'wait' injection points are there already, so I'll produce a patch with
"prepared" injection points and re-implement the test on top of that. I'll put
it in the July CF.


Best regards, Andrey Borodin.



Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-04-07 Thread Andres Freund
Hi,

On 2024-04-01 11:53:28 +0900, Masahiko Sawada wrote:
> On Fri, Mar 29, 2024 at 4:21 PM John Naylor  wrote:
> > I've marked it Ready for Committer.
>
> Thank you! I've attached the patch that I'm going to push tomorrow.

Locally I ran a 32bit build with ubsan enabled (by accident actually), which
complains:

performing post-bootstrap initialization ...
--- stderr ---
../../../../../home/andres/src/postgresql/src/backend/access/common/tidstore.c:341:24:
 runtime error: member access within misaligned address 0xffb6258e for type 
'struct BlocktableEntry', which requires 4 byte alignment
0xffb6258e: note: pointer points here
 00 00 02 00 01 40  dc e9 83 0b 80 48 70 ee  00 00 00 00 00 00 00 01  17 00 00 
00 f8 d4 a6 ee  e8 25
 ^
#0 0x814097e in TidStoreSetBlockOffsets 
../../../../../home/andres/src/postgresql/src/backend/access/common/tidstore.c:341
#1 0x826560a in dead_items_add 
../../../../../home/andres/src/postgresql/src/backend/access/heap/vacuumlazy.c:2889
#2 0x825f8da in lazy_scan_prune 
../../../../../home/andres/src/postgresql/src/backend/access/heap/vacuumlazy.c:1502
#3 0x825da71 in lazy_scan_heap 
../../../../../home/andres/src/postgresql/src/backend/access/heap/vacuumlazy.c:977
#4 0x825ad8f in heap_vacuum_rel 
../../../../../home/andres/src/postgresql/src/backend/access/heap/vacuumlazy.c:499
#5 0x8697e97 in table_relation_vacuum 
../../../../../home/andres/src/postgresql/src/include/access/tableam.h:1725
#6 0x869fca6 in vacuum_rel 
../../../../../home/andres/src/postgresql/src/backend/commands/vacuum.c:2206
#7 0x869a0fd in vacuum 
../../../../../home/andres/src/postgresql/src/backend/commands/vacuum.c:622
#8 0x869986b in ExecVacuum 
../../../../../home/andres/src/postgresql/src/backend/commands/vacuum.c:449
#9 0x8e5f832 in standard_ProcessUtility 
../../../../../home/andres/src/postgresql/src/backend/tcop/utility.c:859
#10 0x8e5e5f6 in ProcessUtility 
../../../../../home/andres/src/postgresql/src/backend/tcop/utility.c:523
#11 0x8e5b71a in PortalRunUtility 
../../../../../home/andres/src/postgresql/src/backend/tcop/pquery.c:1158
#12 0x8e5be80 in PortalRunMulti 
../../../../../home/andres/src/postgresql/src/backend/tcop/pquery.c:1315
#13 0x8e59f9b in PortalRun 
../../../../../home/andres/src/postgresql/src/backend/tcop/pquery.c:791
#14 0x8e4d5f3 in exec_simple_query 
../../../../../home/andres/src/postgresql/src/backend/tcop/postgres.c:1274
#15 0x8e55159 in PostgresMain 
../../../../../home/andres/src/postgresql/src/backend/tcop/postgres.c:4680
#16 0x8e54445 in PostgresSingleUserMain 
../../../../../home/andres/src/postgresql/src/backend/tcop/postgres.c:4136
#17 0x88bb55e in main 
../../../../../home/andres/src/postgresql/src/backend/main/main.c:194
#18 0xf76f47c4  (/lib/i386-linux-gnu/libc.so.6+0x237c4) (BuildId: 
fe79efe6681a919714a4e119da2baac3a4953fbf)
#19 0xf76f4887 in __libc_start_main (/lib/i386-linux-gnu/libc.so.6+0x23887) 
(BuildId: fe79efe6681a919714a4e119da2baac3a4953fbf)
#20 0x80d40f7 in _start 
(/srv/dev/build/postgres/m-dev-assert-32/tmp_install/srv/dev/install/postgres/m-dev-assert-32/bin/postgres+0x80d40f7)

SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior 
../../../../../home/andres/src/postgresql/src/backend/access/common/tidstore.c:341:24
 in
Aborted (core dumped)
child process exited with exit code 134
initdb: data directory 
"/srv/dev/build/postgres/m-dev-assert-32/tmp_install/initdb-template" not 
removed at user's request


At first I was confused about why CI didn't find this. It turns out that, for
me, this is only triggered without compiler optimizations: I had used -O0,
while CI builds with some optimizations enabled.
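
For anyone puzzled by the error class: forming a pointer to a struct at an
address that does not satisfy the struct's alignment requirement is undefined
behavior in C, even on hardware that tolerates the access, and that is exactly
what UBSan's "member access within misaligned address" check catches. A
minimal standalone illustration of the pattern and the usual memcpy fix (the
names below are invented; this is not the actual tidstore.c code):

#include <stdint.h>
#include <string.h>

typedef struct FourByteAligned
{
	uint32_t	header;			/* gives the struct 4-byte alignment */
	uint16_t	count;
} FourByteAligned;

/*
 * Undefined behavior: buf + off need not be 4-byte aligned, so merely
 * dereferencing the cast pointer is what UBSan flags, whether or not
 * the load would fault on a given CPU.
 */
static uint16_t
read_count_unsafe(const char *buf, size_t off)
{
	const FourByteAligned *p = (const FourByteAligned *) (buf + off);

	return p->count;
}

/*
 * Safe alternative: copy the bytes into a properly aligned local
 * first; compilers typically turn the memcpy into plain loads.
 */
static uint16_t
read_count_safe(const char *buf, size_t off)
{
	FourByteAligned tmp;

	memcpy(&tmp, buf + off, sizeof(tmp));
	return tmp.count;
}
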

Backtrace:
#9  0x0814097f in TidStoreSetBlockOffsets (ts=0xb8dfde4, blkno=15, 
offsets=0xffb6275c, num_offsets=11)
at 
../../../../../home/andres/src/postgresql/src/backend/access/common/tidstore.c:341
#10 0x0826560b in dead_items_add (vacrel=0xb8df6d4, blkno=15, 
offsets=0xffb6275c, num_offsets=11)
at 
../../../../../home/andres/src/postgresql/src/backend/access/heap/vacuumlazy.c:2889
#11 0x0825f8db in lazy_scan_prune (vacrel=0xb8df6d4, buf=24, blkno=15, 
page=0xeeb6c000 "", vmbuffer=729, all_visible_according_to_vm=false,
has_lpdead_items=0xffb62a1f) at 
../../../../../home/andres/src/postgresql/src/backend/access/heap/vacuumlazy.c:1502
#12 0x0825da72 in lazy_scan_heap (vacrel=0xb8df6d4) at 
../../../../../home/andres/src/postgresql/src/backend/access/heap/vacuumlazy.c:977
#13 0x0825ad90 in heap_vacuum_rel (rel=0xb872810, params=0xffb62e90, 
bstrategy=0xb99d5e0)
at 
../../../../../home/andres/src/postgresql/src/backend/access/heap/vacuumlazy.c:499
#14 0x08697e98 in table_relation_vacuum (rel=0xb872810, params=0xffb62e90, 
bstrategy=0xb99d5e0)
at 
../../../../../home/andres/src/postgresql/src/include/access/tableam.h:1725
#15 0x0869fca7 in vacuum_rel (relid=1249, relation=0x0, params=0xff

Re: Add SPLIT PARTITION/MERGE PARTITIONS commands

2024-04-07 Thread Alexander Lakhin

Hi Alexander and Dmitry,

07.04.2024 01:22, Alexander Korotkov wrote:

I've pushed 0001 and 0002.  I didn't push 0003 for the following reasons.


Please try the following (erroneous) query:
CREATE TABLE t1(i int, t text) PARTITION BY LIST (t);
CREATE TABLE t1pa PARTITION OF t1 FOR VALUES IN ('A');

CREATE TABLE t2 (i int, t text) PARTITION BY RANGE (t);
ALTER TABLE t2 SPLIT PARTITION t1pa INTO
  (PARTITION t2a FOR VALUES FROM ('A') TO ('B'),
   PARTITION t2b FOR VALUES FROM ('B') TO ('C'));

that triggers an assertion failure:
TRAP: failed Assert("datums != NIL"), File: "partbounds.c", Line: 3434, PID: 
1841459

or a segfault (in a non-assert build):
Program terminated with signal SIGSEGV, Segmentation fault.

#0  pg_detoast_datum_packed (datum=0x0) at fmgr.c:1866
1866    if (VARATT_IS_COMPRESSED(datum) || VARATT_IS_EXTERNAL(datum))
(gdb) bt
#0  pg_detoast_datum_packed (datum=0x0) at fmgr.c:1866
#1  0x55f38c5d5e3f in bttextcmp (...) at varlena.c:1834
#2  0x55f38c6030dd in FunctionCall2Coll (...) at fmgr.c:1161
#3  0x55f38c417c83 in partition_rbound_cmp (...) at partbounds.c:3525
#4  check_partition_bounds_for_split_range (...) at partbounds.c:5221
#5  check_partitions_for_split (...) at partbounds.c:5688
#6  0x55f38c256c49 in transformPartitionCmdForSplit (...) at 
parse_utilcmd.c:3451
#7  transformAlterTableStmt (...) at parse_utilcmd.c:3810
#8  0x55f38c2bdf9c in ATParseTransformCmd (...) at tablecmds.c:5650
...
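
For contrast, a well-formed use of the new command names a partition that
actually belongs to the table being altered, with bounds matching its
partitioning strategy. A sketch following the same syntax (table names
invented):

CREATE TABLE tr (i int, t text) PARTITION BY RANGE (t);
CREATE TABLE tr_ac PARTITION OF tr FOR VALUES FROM ('A') TO ('C');

ALTER TABLE tr SPLIT PARTITION tr_ac INTO
  (PARTITION tr_a FOR VALUES FROM ('A') TO ('B'),
   PARTITION tr_b FOR VALUES FROM ('B') TO ('C'));

The erroneous query above instead hands t1pa, a LIST-bounded partition of t1,
to a SPLIT on the RANGE-partitioned t2, and that mismatch is what the code
fails to reject before reaching the assertion.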

Best regards,
Alexander




Re: Trigger violates foreign key constraint

2024-04-07 Thread Tom Lane
Laurenz Albe  writes:
> Patch v3 is attached.

I agree with documenting this hazard, but I think it'd be better
to do so in the "Triggers" chapter.  There is no hazard unless
you are writing user-defined triggers, which is surely far fewer
people than use foreign keys.  So I suggest something like the
attached.

regards, tom lane

diff --git a/doc/src/sgml/trigger.sgml b/doc/src/sgml/trigger.sgml
index a5390ff644..99d8b75fc5 100644
--- a/doc/src/sgml/trigger.sgml
+++ b/doc/src/sgml/trigger.sgml
@@ -354,6 +354,20 @@
 to avoid infinite recursion in such scenarios.

 
+   
+PostgreSQL implements foreign key
+constraints via system-defined AFTER triggers.
+If a foreign key constraint specifies referential actions (that is,
+cascading updates or deletes), these triggers issue ordinary SQL
+update or delete commands on the referencing table to perform the
+actions.  In particular, any user-defined triggers on the referencing
+table will be fired.  If such a trigger modifies or blocks the effect
+of one of these commands, the end result could be to break referential
+integrity.  It is the trigger programmer's responsibility to avoid
+that.  (Similar problems can arise in any case where different
+triggers are working at cross-purposes.)
+   
+

 
  trigger
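
To make the hazard concrete, here is a sketch of a user-defined trigger
defeating a cascaded delete (the table, function, and trigger names are
invented for the example):

CREATE TABLE parent (id int PRIMARY KEY);
CREATE TABLE child (
    id        int PRIMARY KEY,
    parent_id int REFERENCES parent (id) ON DELETE CASCADE
);

-- A BEFORE trigger that silently suppresses deletes on the
-- referencing table.
CREATE FUNCTION block_delete() RETURNS trigger AS $$
BEGIN
    RETURN NULL;    -- skip the triggering delete
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER child_block_delete
    BEFORE DELETE ON child
    FOR EACH ROW EXECUTE FUNCTION block_delete();

INSERT INTO parent VALUES (1);
INSERT INTO child VALUES (10, 1);

DELETE FROM parent WHERE id = 1;
-- The parent row is gone, but child (10, 1) survives: the trigger
-- blocked the DELETE issued by the ON DELETE CASCADE action, leaving
-- referential integrity broken.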


Re: MultiXact\SLRU buffers configuration

2024-04-07 Thread Alvaro Herrera
On 2024-Feb-03, Andrey M. Borodin wrote:

> Here's the test draft. This test reliably reproduces sleep on a CV when 
> waiting for the next multixact to be filled into the "members" SLRU.
> Cost of having this test:
> 1. We need a new injection point type "wait" (in addition to "error" and 
> "notice"). It cannot be avoided, because we need to sync at least 3 processes 
> to observe the condition we want.
> 2. We need a new way to declare an injection point that can happen inside a 
> critical section. I've called it a "prepared injection point".
> 
> The complexity of having this test is higher than the complexity of the 
> CV-sleep patch itself. Do we want it? If so, I can produce a cleaner version; 
> currently all multixact tests are in the injection_points test module.

Well, it would be nice to have *some* test, but as you say it is way
more complex than the thing being tested, and it zooms in on the
functioning of multixact creation in insane quantum-physics ways ...
to the point that you can no longer trust that multixact works the same
way with the test as without it.  So what I did was manually run other
tests (pgbench) to verify that the corner case in multixact creation is
being handled correctly, and push the patch after a few corrections,
in particular so that it follows the CV sleeping protocol a little more
closely, as the CV documentation suggests, which is not at all what the
patch did.  I also fixed a few comments that had been neglected
and changed the name and description of the CV in the docs.
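
For reference, the sleeping protocol documented in
src/include/storage/condition_variable.h boils down to the pattern below.
This is only a sketch: MySharedState, the "done" flag, and my_wait_event_info
are placeholders, not the actual multixact code, and real callers must read
the watched condition with whatever locking it normally requires.

#include "postgres.h"
#include "storage/condition_variable.h"

typedef struct MySharedState
{
	ConditionVariable cv;
	bool		done;			/* the condition waiters watch */
} MySharedState;

static void
wait_until_done(MySharedState *state, uint32 my_wait_event_info)
{
	/* Recheck after every sleep; CV wakeups can be spurious. */
	ConditionVariablePrepareToSleep(&state->cv);
	while (!state->done)
		ConditionVariableSleep(&state->cv, my_wait_event_info);
	ConditionVariableCancelSleep();
}

static void
mark_done_and_wake(MySharedState *state)
{
	/* Update the shared state first, then wake all waiters. */
	state->done = true;
	ConditionVariableBroadcast(&state->cv);
}
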

Now, maybe we can still add the test later, but it needs a rebase.

-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/
“Cuando no hay humildad las personas se degradan” (A. Christie)




Re: Streaming I/O, vectored I/O (WIP)

2024-04-07 Thread Melanie Plageman
On Sun, Apr 7, 2024 at 1:33 PM Nazir Bilal Yavuz  wrote:
>
> Hi,
>
> On Tue, 2 Apr 2024 at 11:40, Thomas Munro  wrote:
> >
> > I had been planning to commit v14 this morning but got cold feet with
> > the BMR-based interface.  Heikki didn't like it much, and in the end,
> > neither did I.  I have now removed it, and it seems much better.  No
> > other significant changes, just parameter types and inlining details.
> > For example:
> >
> >  * read_stream_begin_relation() now takes a Relation, like its name says
> >  * StartReadBuffers()'s operation takes smgr and optional rel
> >  * ReadBuffer_common() takes smgr and optional rel
>
> Read stream objects can be created only using Relations now. There
> could be read stream users which do not have a Relation but
> SMgrRelations. So, I created another constructor for the read streams
> which use SMgrRelations instead of Relations. Related patch is
> attached.

This patch LGTM

- Melanie




Re: pg_combinebackup --copy-file-range

2024-04-07 Thread Tomas Vondra
On 4/7/24 19:46, Tomas Vondra wrote:
> On 4/5/24 21:43, Tomas Vondra wrote:
>> Hi,
>>
>> ...
>>
>> 2) The prefetching is not a huge improvement, at least not for these
>> three filesystems (btrfs, ext4, xfs). From the color scale it might seem
>> like it helps, but those values are relative to the baseline, so when
>> the non-prefetching value is 5% and with prefetching 10%, that means the
>> prefetching makes it slower. And that's very often true.
>>
>> This is visible more clearly in prefetching.pdf, comparing the
>> non-prefetching and prefetching results for each patch, not to baseline.
>> That makes it quite clear there's a lot of "red" where prefetching
>> makes it slower. It certainly does help for larger increments (which
>> makes sense, because the modified blocks are distributed randomly, and
>> thus come from random files, making long streaks unlikely).
>>
>> I've imagined the prefetching could be made a bit smarter to ignore the
>> streaks (=sequential patterns), but once again - this only matters with
>> the batching, which we don't have. And without the batching it looks
>> like a net loss (that's the first column in the prefetching PDF).
>>
>> I did start thinking about prefetching because of ZFS, where it was
>> necessary to get decent performance. And that's still true. But (a) even
>> with the explicit prefetching it's still 2-3x slower than any of these
>> filesystems, so I assume performance-sensitive use cases won't use it.
>> And (b) the prefetching seems necessary in all cases, no matter how
>> large the increment is. Which goes directly against the idea of looking
>> at how random the blocks are and prefetching only the sufficiently
>> random patterns. That doesn't seem like a great thing.
>>
> 
> I finally got more complete ZFS results, and I also decided to get
> some numbers without the ZFS tuning I did. And boy oh boy ...
> 
> All the tests I did with ZFS were tuned the way I've seen recommended
> when using ZFS for PostgreSQL, that is
> 
>   zfs set recordsize=8K logbias=throughput compression=none
> 
> and this performed quite poorly - pg_combinebackup took 4-8x longer than
> with the traditional filesystems (btrfs, xfs, ext4) and the only thing
> that improved that measurably was prefetching.
> 
> But once I reverted back to the default recordsize of 128kB the
> performance is waay better - entirely comparable to ext4/xfs, while
> btrfs remains faster with --copy-file-range --no-manifest (by a factor
> of 2-3x).
> 

I forgot to explicitly say that I think this confirms the decision not to
push the patch adding explicit prefetching to pg_combinebackup. It's not
needed/beneficial even for ZFS when using a suitable configuration.
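
For context, --copy-file-range makes pg_combinebackup reconstruct files via
the copy_file_range(2) syscall, letting the kernel move the data (or, on
reflink-capable filesystems such as btrfs, share extents) without bouncing
it through user space. A minimal standalone sketch of the call, with invented
file names, not pg_combinebackup's actual code:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
	int			src = open("src.data", O_RDONLY);
	int			dst = open("dst.data", O_WRONLY | O_CREAT, 0644);
	off_t		src_off = 0;
	off_t		dst_off = 0;
	size_t		len = 8192;		/* e.g. one relation block */

	if (src < 0 || dst < 0)
	{
		perror("open");
		return 1;
	}

	/* The kernel may copy fewer bytes than requested, so loop. */
	while (len > 0)
	{
		ssize_t		n = copy_file_range(src, &src_off, dst, &dst_off, len, 0);

		if (n < 0)
		{
			perror("copy_file_range");
			return 1;
		}
		if (n == 0)
			break;				/* hit EOF on the source */
		len -= n;
	}
	return 0;
}
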


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Streaming I/O, vectored I/O (WIP)

2024-04-07 Thread Nazir Bilal Yavuz
Hi,

On Tue, 2 Apr 2024 at 11:40, Thomas Munro  wrote:
>
> I had been planning to commit v14 this morning but got cold feet with
> the BMR-based interface.  Heikki didn't like it much, and in the end,
> neither did I.  I have now removed it, and it seems much better.  No
> other significant changes, just parameter types and inlining details.
> For example:
>
>  * read_stream_begin_relation() now takes a Relation, like its name says
>  * StartReadBuffers()'s operation takes smgr and optional rel
>  * ReadBuffer_common() takes smgr and optional rel

Read stream objects can be created only using Relations now. There
could be read stream users which do not have a Relation but
SMgrRelations. So, I created another constructor for the read streams
which use SMgrRelations instead of Relations. Related patch is
attached.
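
To illustrate, here is a sketch of how a caller that has only an SMgrRelation
might drive the new constructor, based on the declarations in the attached
patch (ScanPos, scan_fork, and the choice of READ_STREAM_FULL are invented
for the example):

#include "postgres.h"
#include "storage/bufmgr.h"
#include "storage/read_stream.h"

typedef struct ScanPos
{
	BlockNumber next;
	BlockNumber nblocks;
} ScanPos;

/* Callback: hand the stream ascending block numbers, then stop. */
static BlockNumber
next_block_cb(ReadStream *stream, void *callback_private_data,
			  void *per_buffer_data)
{
	ScanPos    *pos = (ScanPos *) callback_private_data;

	if (pos->next >= pos->nblocks)
		return InvalidBlockNumber;	/* end of stream */
	return pos->next++;
}

static void
scan_fork(SMgrRelation smgr, BlockNumber nblocks)
{
	ScanPos		pos = {0, nblocks};
	ReadStream *stream;
	Buffer		buf;

	stream = read_stream_begin_sgmr_relation(READ_STREAM_FULL,
											 NULL,	/* no strategy */
											 smgr,
											 MAIN_FORKNUM,
											 next_block_cb,
											 &pos,
											 0);	/* no per-buffer data */

	while ((buf = read_stream_next_buffer(stream, NULL)) != InvalidBuffer)
	{
		/* ... inspect the buffer here ... */
		ReleaseBuffer(buf);
	}
	read_stream_end(stream);
}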

-- 
Regards,
Nazir Bilal Yavuz
Microsoft
From 38b57ec7e54a54a0c7b0117a0ecaaf68c643e1b0 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz 
Date: Sun, 7 Apr 2024 20:16:00 +0300
Subject: [PATCH] Add a way to create read stream object by using SMgrRelation

Currently a read stream object can be created only by using a Relation;
there is no way to create one by using an SMgrRelation.

To achieve that, the read_stream_begin_impl() function is created and
the contents of read_stream_begin_relation() are moved into this
function.

Both read_stream_begin_relation() and read_stream_begin_sgmr_relation()
call read_stream_begin_impl(), passing a Relation and an SMgrRelation
respectively.
---
 src/include/storage/read_stream.h |  8 +++
 src/backend/storage/aio/read_stream.c | 74 ++-
 2 files changed, 69 insertions(+), 13 deletions(-)

diff --git a/src/include/storage/read_stream.h b/src/include/storage/read_stream.h
index fae09d2b4cc..601a7fcf92b 100644
--- a/src/include/storage/read_stream.h
+++ b/src/include/storage/read_stream.h
@@ -15,6 +15,7 @@
 #define READ_STREAM_H
 
 #include "storage/bufmgr.h"
+#include "storage/smgr.h"
 
 /* Default tuning, reasonable for many users. */
 #define READ_STREAM_DEFAULT 0x00
@@ -56,6 +57,13 @@ extern ReadStream *read_stream_begin_relation(int flags,
 			  ReadStreamBlockNumberCB callback,
 			  void *callback_private_data,
 			  size_t per_buffer_data_size);
+extern ReadStream *read_stream_begin_sgmr_relation(int flags,
+   BufferAccessStrategy strategy,
+   SMgrRelation smgr,
+   ForkNumber forknum,
+   ReadStreamBlockNumberCB callback,
+   void *callback_private_data,
+   size_t per_buffer_data_size);
 extern Buffer read_stream_next_buffer(ReadStream *stream, void **per_buffer_private);
 extern void read_stream_reset(ReadStream *stream);
 extern void read_stream_end(ReadStream *stream);
diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
index f54dacdd914..a9a3b0de6c9 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -406,14 +406,15 @@ read_stream_look_ahead(ReadStream *stream, bool suppress_advice)
  * write extra data for each block into the space provided to it.  It will
  * also receive callback_private_data for its own purposes.
  */
-ReadStream *
-read_stream_begin_relation(int flags,
-		   BufferAccessStrategy strategy,
-		   Relation rel,
-		   ForkNumber forknum,
-		   ReadStreamBlockNumberCB callback,
-		   void *callback_private_data,
-		   size_t per_buffer_data_size)
+static ReadStream *
+read_stream_begin_impl(int flags,
+	   BufferAccessStrategy strategy,
+	   Relation rel,
+	   SMgrRelation smgr,
+	   ForkNumber forknum,
+	   ReadStreamBlockNumberCB callback,
+	   void *callback_private_data,
+	   size_t per_buffer_data_size)
 {
 	ReadStream *stream;
 	size_t		size;
@@ -422,9 +423,6 @@ read_stream_begin_relation(int flags,
 	int			strategy_pin_limit;
 	uint32		max_pinned_buffers;
 	Oid			tablespace_id;
-	SMgrRelation smgr;
-
-	smgr = RelationGetSmgr(rel);
 
 	/*
 	 * Decide how many I/Os we will allow to run at the same time.  That
@@ -434,7 +432,7 @@ read_stream_begin_relation(int flags,
 	 */
 	tablespace_id = smgr->smgr_rlocator.locator.spcOid;
 	if (!OidIsValid(MyDatabaseId) ||
-		IsCatalogRelation(rel) ||
+		(rel && IsCatalogRelation(rel)) ||
 		IsCatalogRelationOid(smgr->smgr_rlocator.locator.relNumber))
 	{
 		/*
@@ -548,7 +546,7 @@ read_stream_begin_relation(int flags,
 	for (int i = 0; i < max_ios; ++i)
 	{
 		stream->ios[i].op.rel = rel;
-		stream->ios[i].op.smgr = RelationGetSmgr(rel);
+		stream->ios[i].op.smgr = smgr;
 		stream->ios[i].op.smgr_persistence = 0;
 		stream->ios[i].op.forknum = forknum;
 		stream->ios[i].op.strategy = strategy;
@@ -557,6 +555,56 @@ read_stream_begin_relation(int flags,
 	return stream;
 }
 
+/*
+ * Create a new read stream for reading a relation.
+ * See read_stream_begin_impl() for the detailed explanation.
+ */
+ReadStream *
+read_stream_begin_relation(int flags,
+		   Bu

Re: LDAP & Kerberos test woes

2024-04-07 Thread Heikki Linnakangas

On 07/04/2024 13:19, Heikki Linnakangas wrote:

1st patch fixes the LDAP setup tests, and 2nd patch fixes the error
handling in the END blocks.


Committed and backpatched these test fixes.

--
Heikki Linnakangas
Neon (https://neon.tech)





Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan

2024-04-07 Thread Peter Geoghegan
On Sun, Apr 7, 2024 at 1:00 PM Alexander Lakhin  wrote:
> Please look at an assertion failure (reproduced starting from 5bf748b86),
> triggered by the following query:
> CREATE TABLE t (a int, b int);
> CREATE INDEX t_idx ON t (a, b);
> INSERT INTO t (a, b) SELECT g, g FROM generate_series(0, 999) g;
> ANALYZE t;
> SELECT * FROM t WHERE a < ANY (ARRAY[1]) AND b < ANY (ARRAY[1]);
>
> TRAP: failed Assert("so->numArrayKeys"), File: "nbtutils.c", Line: 560, PID: 
> 3251267

I immediately see what's up here. Will fix this in the next short
while. There is no bug here in builds without assertions, but it's
probably best to keep the assertion, and to just make sure not to call
_bt_preprocess_array_keys_final() unless it has real work to do.

The assertion failure demonstrates that
_bt_preprocess_array_keys_final can currently be called when there can
be no real work for it to do. The problem is that we condition the
call to _bt_preprocess_array_keys_final() on whether or not we had to
do "real work" back when _bt_preprocess_array_keys() was called, at
the start of _bt_preprocess_keys(). That usually leaves us with
"so->numArrayKeys > 0", because the "real work" typically includes
equality type array keys. But not here -- here you have two SAOP
inequalities, which become simple scalar scan keys through early
array-specific preprocessing in _bt_preprocess_array_keys(). There are
no arrays left at this point, so "so->numArrayKeys == 0".

FWIW I missed this because the tests only cover cases with one SAOP
inequality, which will always return early from _bt_preprocess_keys
(by taking its generic single scan key fast path). Your test case has
2 scan keys, avoiding the fast path, allowing us to reach
_bt_preprocess_array_keys_final().
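
To spell out the reduction (an illustrative equivalence, not literal planner
output): an inequality ScalarArrayOp is folded into an ordinary scalar key by
the early array preprocessing, so by the time the final step runs there are
no array keys left:

SELECT * FROM t WHERE a < ANY (ARRAY[1]) AND b < ANY (ARRAY[1]);
-- is preprocessed as if it were written:
SELECT * FROM t WHERE a < 1 AND b < 1;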

-- 
Peter Geoghegan




Re: Cluster::restart dumping logs when stop fails

2024-04-07 Thread Andres Freund
On 2024-04-07 18:51:40 +0200, Daniel Gustafsson wrote:
> > On 7 Apr 2024, at 18:28, Andres Freund  wrote:
> > 
> > On 2024-04-07 16:52:05 +0200, Daniel Gustafsson wrote:
> >>> On 7 Apr 2024, at 14:51, Andrew Dunstan  wrote:
> >>> On 2024-04-06 Sa 20:49, Andres Freund wrote:
> >> 
>  That's probably unnecessary optimization, but it seems a tad silly to 
>  read an
>  entire, potentially sizable, file to just use the last 1k. Not sure if 
>  the way
>  slurp_file() uses seek supports negative offsets, the docs read to me 
>  like that
>  may only be supported with SEEK_END.
> >>> 
> >>> We should enhance slurp_file() so it uses SEEK_END if the offset is 
> >>> negative.
> >> 
> >> Absolutely agree.  Reading the thread I think Andres argues for not 
> >> printing
> >> anything at all in this case but we should support negative offsets 
> >> anyways, it
> >> will for sure come in handy.
> > 
> > I'm ok with printing path + some content or just the path.
> 
> I think printing the last 512 bytes or so would be a good approach, I'll take
> care of it later tonight. That would be a backpatchable change IMHO.

+1 - thanks for quickly improving this.
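
For the archives, a sketch of what a negative-offset tail read could look
like in Perl (illustrative only, not the actual test-utility code; slurp_tail
and the file name are invented):

use strict;
use warnings;
use Fcntl qw(SEEK_END SEEK_SET);

# Read at most the last $tail bytes of $path without slurping the whole
# file.  seek() accepts a negative offset with SEEK_END, but fails if
# the file is shorter than $tail, so fall back to the start then.
sub slurp_tail
{
	my ($path, $tail) = @_;
	open my $fh, '<', $path or die "could not open $path: $!";
	seek($fh, -$tail, SEEK_END)
	  or seek($fh, 0, SEEK_SET)
	  or die "seek failed on $path: $!";
	local $/;    # slurp mode: read everything that remains
	my $data = <$fh>;
	close $fh;
	return $data;
}

print slurp_tail('server.log', 512);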




Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan

2024-04-07 Thread Alexander Lakhin

Hello Peter,

03.04.2024 22:53, Peter Geoghegan wrote:

On Mon, Apr 1, 2024 at 6:33 PM Peter Geoghegan  wrote:

Note: v18 doesn't have any adjustments to the costing, as originally
planned. I'll probably need to post a revised patch with improved (or
at least polished) costing in the next few days, so that others will
have the opportunity to comment before I commit the patch.

Attached is v19, which dealt with remaining concerns I had about the
costing in selfuncs.c. My current plan is to commit this on Saturday
morning (US Eastern time).


Please look at an assertion failure (reproduced starting from 5bf748b86),
triggered by the following query:
CREATE TABLE t (a int, b int);
CREATE INDEX t_idx ON t (a, b);
INSERT INTO t (a, b) SELECT g, g FROM generate_series(0, 999) g;
ANALYZE t;
SELECT * FROM t WHERE a < ANY (ARRAY[1]) AND b < ANY (ARRAY[1]);

TRAP: failed Assert("so->numArrayKeys"), File: "nbtutils.c", Line: 560, PID: 
3251267

Best regards,
Alexander




Re: Cluster::restart dumping logs when stop fails

2024-04-07 Thread Daniel Gustafsson
> On 7 Apr 2024, at 18:28, Andres Freund  wrote:
> 
> On 2024-04-07 16:52:05 +0200, Daniel Gustafsson wrote:
>>> On 7 Apr 2024, at 14:51, Andrew Dunstan  wrote:
>>> On 2024-04-06 Sa 20:49, Andres Freund wrote:
>> 
 That's probably unnecessary optimization, but it seems a tad silly to read 
 an
 entire, potentially sizable, file to just use the last 1k. Not sure if the 
 way
 slurp_file() uses seek supports negative offsets, the docs read to me like 
 that
 may only be supported with SEEK_END.
>>> 
>>> We should enhance slurp_file() so it uses SEEK_END if the offset is 
>>> negative.
>> 
>> Absolutely agree.  Reading the thread I think Andres argues for not printing
>> anything at all in this case but we should support negative offsets anyways, 
>> it
>> will for sure come in handy.
> 
> I'm ok with printing path + some content or just the path.

I think printing the last 512 bytes or so would be a good approach, I'll take
care of it later tonight. That would be a backpatchable change IMHO.

--
Daniel Gustafsson




