Re: remap the .text segment into huge pages at run time

2022-11-05 Thread John Naylor
On Sat, Nov 5, 2022 at 3:27 PM Andres Freund  wrote:

> > simplified it. Interestingly enough, looking through the commit history,
> > they used to align the segments via linker flags, but took it out here:
> >
> > https://github.com/intel/iodlr/pull/25#discussion_r397787559
> >
> > ...saying "I'm not sure why we added this". :/
>
> That was about using a linker script, not really linker flags though.

Oops, the commit I was referring to pointed to that discussion, but I
should have shown it instead:

--- a/large_page-c/example/Makefile
+++ b/large_page-c/example/Makefile
@@ -28,7 +28,6 @@ OBJFILES=  \
   filler16.o   \

 OBJS=$(addprefix $(OBJDIR)/,$(OBJFILES))
-LDFLAGS=-Wl,-z,max-page-size=2097152

But from what you're saying, this flag wouldn't have been enough anyway...

> I don't think the dummy functions are a good approach, there were plenty
> things after it when I played with them.

To be technical, the point wasn't to have no code after it, but to have no
*hot* code *before* it, since with the iodlr approach the first 1.99MB of
.text is below the first aligned boundary within that section. But yeah,
I'm happy to ditch that hack entirely.

> > > With these flags the "R E" segments all start on a 0x200000/2MiB boundary and
> > > are padded to the next 2MiB boundary. However the OS / dynamic loader only
> > > maps the necessary part, not all the zero padding.
> > >
> > > This means that if we were to issue a MADV_COLLAPSE, we can before it do an
> > > mremap() to increase the length of the mapping.
> >
> > I see, interesting. What location are you passing for madvise() and
> > mremap()? The beginning of the segment (for me has .init/.plt) or an
> > aligned boundary within .text?

>    /*
>     * Make huge pages out of it. Requires at least linux 6.1.  We could
>     * fall back to MADV_HUGEPAGE if it fails, but it doesn't do all that
>     * much in older kernels.
>     */

About madvise(), I take it MADV_HUGEPAGE and MADV_COLLAPSE only work for
THP? The man page seems to indicate that.

In the support work I've done, the standard recommendation is to turn THP
off, especially if they report sudden performance problems. If explicit
HP's are used for shared mem, maybe THP is less of a risk? I need to look
back at the tests that led to that advice...

> A real version would have to open /proc/self/maps and do this for at least

I can try and generalize your above sketch into a v2 patch.

> postgres' r-xp mapping. We could do it for libraries too, if they're suitably
> aligned (both in memory and on-disk).

It looks like plpgsql is only 27 standard pages in size...

Regarding glibc, we could try moving a couple of the hotter functions into
PG, using smaller and simpler coding, if that has better frontend cache
behavior. The paper "Understanding and Mitigating Front-End Stalls in
Warehouse-Scale Computers" talks about this, particularly section 4.4
regarding memcmp().

> > I quickly tried to align the segments with the linker and then in my patch
> > have the address for mmap() rounded *down* from the .text start to the
> > beginning of that segment. It refused to start without logging an error.
>
> Hm, what linker was that? I did note that you need some additional flags for
> some of the linkers.

BFD, but I wouldn't worry about that failure too much, since the
mremap()/madvise() strategy has a lot fewer moving parts.
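For reference, here is a minimal sketch of that strategy (not the actual
patch), assuming "start" and "len" were parsed from postgres' r-xp line in
/proc/self/maps and that the binary was linked with 2MB-aligned segments:

#define _GNU_SOURCE
#include <sys/mman.h>

#define HUGE_SZ (2 * 1024 * 1024)

static void
collapse_text_segment(void *start, size_t len)
{
	/* extend the mapping up to the next 2MB boundary, as described above */
	size_t		newlen = (len + HUGE_SZ - 1) & ~(size_t) (HUGE_SZ - 1);

	if (newlen > len && mremap(start, len, newlen, 0) == MAP_FAILED)
		return;				/* leave the mapping as it was */

#ifdef MADV_COLLAPSE
	/* needs Linux 6.1+; MADV_HUGEPAGE would be a weaker fallback */
	(void) madvise(start, newlen, MADV_COLLAPSE);
#endif
}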

On the subject of linkers, though, one thing that tripped me up was trying
to change the linker with Meson. First I tried

-Dc_args='-fuse-ld=lld'

but that led to warnings like this:
/usr/bin/ld: warning: -z separate-loadable-segments ignored

When using this in the top level meson.build

elif host_system == 'linux'
  sema_kind = 'unnamed_posix'
  cppflags += '-D_GNU_SOURCE'
  # Align the loadable segments to 2MB boundaries to support remapping to
  # huge pages.
  ldflags += cc.get_supported_link_arguments([
    '-Wl,-zmax-page-size=0x200000',
    '-Wl,-zcommon-page-size=0x200000',
    '-Wl,-zseparate-loadable-segments'
  ])


According to

https://mesonbuild.com/howtox.html#set-linker

I need to add CC_LD=lld to the env vars before invoking, which got rid of
the warning. Then I wanted to verify that lld was actually used, and in

https://releases.llvm.org/14.0.0/tools/lld/docs/index.html

it says I can run this and it should show “Linker: LLD”, but that doesn't
appear for me:

$ readelf --string-dump .comment inst-perf/bin/postgres

String dump of section '.comment':
  [ 0]  GCC: (GNU) 12.2.1 20220819 (Red Hat 12.2.1-2)


--
John Naylor
EDB: http://www.enterprisedb.com


RE: Perform streaming logical transactions by background workers and parallel apply

2022-11-05 Thread houzj.f...@fujitsu.com
On Saturday, November 5, 2022 1:43 PM Amit Kapila 
> 
> On Fri, Nov 4, 2022 at 7:35 PM houzj.f...@fujitsu.com
>  wrote:
> >
> > On Friday, November 4, 2022 4:07 PM Amit Kapila
>  wrote:
> > >
> > > On Thu, Nov 3, 2022 at 6:36 PM houzj.f...@fujitsu.com
> > >  wrote:
> > > >
> > > > Thanks for the analysis and summary !
> > > >
> > > > I tried to implement the above idea and here is the patch set.
> > > >
> > >
> > > Few comments on v42-0001
> > > ===
> >
> > Thanks for the comments.
> >
> > >
> > > 10.
> > > + winfo->shared->stream_lock_id = parallel_apply_get_unique_id();
> > > + winfo->shared->transaction_lock_id =
> > > + winfo->shared->parallel_apply_get_unique_id();
> > >
> > > Why can't we use xid (remote_xid) for one of these and local_xid
> > > (one generated by parallel apply) for the other?
> ...
> ...
> >
> > I also considered using xid for these locks, but it seems the objsubid
> > for the shared object lock is 16bit while xid is 32 bit. So, I tried
> > to generate a unique 16bit id here.
> >
> 
> Okay, I see your point. Can we think of having a new lock tag for this with
> classid, objid, objsubid for the first three fields of the locktag? We can
> use a new macro SET_LOCKTAG_APPLY_TRANSACTION and a common function to set
> the tag and acquire the lock. One more point related to this is that I am
> suggesting classid by referring to SET_LOCKTAG_OBJECT as that is used in the
> current patch, but do you think we need it for our purpose? Won't subscription
> id and xid uniquely identify the tag?

I agree that it could be better to have a new lock tag. Another point is that
the remote xid and Local xid could be the same in some rare cases, so I think
we might need to add another identifier to make it unique.

Maybe:
locktag_field1 : subscription oid
locktag_field2 : xid (remote or local)
locktag_field3 : 0 (lock for stream block) / 1 (lock for transaction)
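For illustration, a macro along those lines might look roughly like this (the
tag type name and field layout simply follow the proposal above, so the real
patch may differ); a common helper would then fill the tag this way and call
LockAcquire() on it:

#define SET_LOCKTAG_APPLY_TRANSACTION(tag, suboid, xid, objid) \
	((tag).locktag_field1 = (suboid), \
	 (tag).locktag_field2 = (xid), \
	 (tag).locktag_field3 = (objid),	/* 0 = stream block, 1 = transaction */ \
	 (tag).locktag_field4 = 0, \
	 (tag).locktag_type = LOCKTAG_APPLY_TRANSACTION, \
	 (tag).locktag_lockmethodid = DEFAULT_LOCKMETHOD)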

Best regards,
Hou zj


bug: pg_dump use strange tag for trigger

2022-11-05 Thread Pavel Stehule
Hi,

When I played with regression tests for pg_restore, I tested -T filtering of
triggers too, and had problems restoring triggers. I found that the tag for a
trigger uses the pattern "tablename triggername" rather than just the
(correct) trigger name.

I propose to generate the tag from just the trigger name.

proposed patch attached

regards

Pavel
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index da427f4d4a..8ffc6024ea 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -17148,7 +17148,7 @@ dumpTrigger(Archive *fout, const TriggerInfo *tginfo)
 	appendPQExpBuffer(trigprefix, "TRIGGER %s ON",
 	  fmtId(tginfo->dobj.name));
 
-	tag = psprintf("%s %s", tbinfo->dobj.name, tginfo->dobj.name);
+	tag = psprintf("%s", tginfo->dobj.name);
 
 	if (tginfo->dobj.dump & DUMP_COMPONENT_DEFINITION)
 		ArchiveEntry(fout, tginfo->dobj.catId, tginfo->dobj.dumpId,


Re: explain analyze rows=%.0f

2022-11-05 Thread Tom Lane
Robert Haas  writes:
> On Fri, Jul 22, 2022 at 6:47 AM Amit Kapila  wrote:
>> I feel the discussion has slightly deviated which makes it unclear
>> whether this patch is required or not?

> My opinion is that showing some fractional digits at least when
> loops>1 would be better than what we have now. It might not be the
> best thing we could do, but it would be better than doing nothing.

Yeah, I think that is a reasonable compromise.

I took a brief look through the patch, and I have some review
comments:

* Code like this is pretty awful:

appendStringInfo(es->str,
                 (nloops == 1 || !HAS_DECIMAL(rows)) ?
                 " rows=%.0f loops=%.0f)" : " rows=%.2f loops=%.0f)",
                 rows, nloops);

Don't use variable format strings.  They're hard to read and they
probably defeat compile-time checks that the arguments match the
format string.  You could use a "*" field width instead, ie

appendStringInfo(es->str,
                 " rows=%.*f loops=%.0f)",
                 (nloops == 1 || !HAS_DECIMAL(rows)) ? 2 : 0,
                 rows, nloops);

That'd also allow you to reduce the code churn you've added by
splitting some appendStringInfo calls.

* I'm fairly concerned about how stable this'll be in the buildfarm,
in particular I fear HAS_DECIMAL() is not likely to give consistent
results across platforms.  I gather that an earlier version of the patch
tried to check whether the fractional part would be zero to two decimal
places, rather than whether it's exactly zero.  Probably want to put
back something like that.

* Another thought is that the non-text formats tend to prize output
consistency over readability, so maybe we should just always use 2
fractional digits there, rather than trying to minimize visible changes.

* We need some doc adjustments, surely, to explain what the heck this
means.

regards, tom lane




Re: new option to allow pg_rewind to run without full_page_writes

2022-11-05 Thread Andres Freund
Hi,

On 2022-11-03 16:54:13 +0100, Jérémie Grauer wrote:
> Currently pg_rewind refuses to run if full_page_writes is off. This is to
> prevent it to run into a torn page during operation.
>
> This is usually a good call, but some file systems like ZFS are naturally
> immune to torn page (maybe btrfs too, but I don't know for sure for this
> one).

Note that this isn't about torn pages in case of crashes, but about reading
pages while they're being written to.

Right now, that definitely allows for torn reads, because of the way
pg_read_binary_file() is implemented.  We only ensure a 4k read size from the
view of our code, which obviously can lead to torn 8k page reads, no matter
what the filesystem guarantees.

Also, for reasons I don't understand we use C streaming IO or
pg_read_binary_file(), so you'd also need to ensure that the buffer size used
by the stream implementation can't cause the reads to happen in smaller
chunks.  Afaict we really shouldn't use file streams here, then we'd at least
have control over that aspect.


Does ZFS actually guarantee that there never can be short reads? As soon as
they are possible, full page writes are needed.



This isn't a fundamental issue - we could have a version of
pg_read_binary_file() for relation data that prevents the page being written
out concurrently by locking the buffer page. In addition it could often avoid
needing to read the page from the OS / disk, if present in shared buffers
(perhaps minus cases where we haven't flushed the WAL yet, but we could also
flush the WAL in those).

Greetings,

Andres Freund




Re: Suppressing useless wakeups in walreceiver

2022-11-05 Thread Nathan Bossart
On Sat, Nov 05, 2022 at 03:00:55PM -0700, Nathan Bossart wrote:
> On Mon, Oct 17, 2022 at 03:21:18PM +0900, Kyotaro Horiguchi wrote:
>> Now that I see the fix for the implicit conversion:
>> 
>> L527: +  nap = Max(0, (nextWakeup - now + 999) / 1000);
>> ..
>> L545: +  (int) Min(INT_MAX, nap),
>> 
>> 
>> I think limiting the naptime at use is confusing. Don't we place these
>> adjacently?  Then we could change the nap to an integer.  Or we can
>> just assert out for the nap time longer than INT_MAX (but this would
>> require another int64 variable). I believe we won't need such a long
>> nap (or that is no longer a nap?)
> 
> Yeah, I guess this deserves a comment.  I could also combine it easily:
> 
>   nap = (int) Min(INT_MAX, Max(0, (nextWakeup - now + 999) / 1000));

Here is a new version of the patch that addresses this feedback.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From e2c7b170058bc62ed2e3ea84506f80c698c1df02 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Thu, 27 Jan 2022 21:43:17 +1300
Subject: [PATCH v8 1/1] Suppress useless wakeups in walreceiver.

Instead of waking up 10 times per second to check for various timeout
conditions, keep track of when we next have periodic work to do.

Reviewed-by: Kyotaro Horiguchi, Bharath Rupireddy, Alvaro Herrera
Discussion: https://postgr.es/m/CA%2BhUKGJGhX4r2LPUE3Oy9BX71Eum6PBcS8L3sJpScR9oKaTVaA%40mail.gmail.com
---
 src/backend/replication/walreceiver.c | 171 +-
 1 file changed, 114 insertions(+), 57 deletions(-)

diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 6cbb67c92a..0f84c229ad 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -95,8 +95,6 @@ bool		hot_standby_feedback;
 static WalReceiverConn *wrconn = NULL;
 WalReceiverFunctionsType *WalReceiverFunctions = NULL;
 
-#define NAPTIME_PER_CYCLE 100	/* max sleep time between cycles (100ms) */
-
 /*
  * These variables are used similarly to openLogFile/SegNo,
  * but for walreceiver to write the XLOG. recvFileTLI is the TimeLineID
@@ -116,6 +114,23 @@ static struct
 	XLogRecPtr	Flush;			/* last byte + 1 flushed in the standby */
 }			LogstreamResult;
 
+/*
+ * Reasons to wake up and perform periodic tasks.
+ */
+typedef enum WalRcvWakeupReason
+{
+	WALRCV_WAKEUP_TERMINATE,
+	WALRCV_WAKEUP_PING,
+	WALRCV_WAKEUP_REPLY,
+	WALRCV_WAKEUP_HSFEEDBACK,
+	NUM_WALRCV_WAKEUPS
+} WalRcvWakeupReason;
+
+/*
+ * Wake up times for periodic tasks.
+ */
+TimestampTz	wakeup[NUM_WALRCV_WAKEUPS];
+
 static StringInfoData reply_message;
 static StringInfoData incoming_message;
 
@@ -132,6 +147,7 @@ static void XLogWalRcvClose(XLogRecPtr recptr, TimeLineID tli);
 static void XLogWalRcvSendReply(bool force, bool requestReply);
 static void XLogWalRcvSendHSFeedback(bool immed);
 static void ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime);
+static void WalRcvComputeNextWakeup(WalRcvWakeupReason reason, TimestampTz now);
 
 /*
  * Process any interrupts the walreceiver process may have received.
@@ -179,9 +195,7 @@ WalReceiverMain(void)
 	TimeLineID	primaryTLI;
 	bool		first_stream;
 	WalRcvData *walrcv = WalRcv;
-	TimestampTz last_recv_timestamp;
-	TimestampTz starttime;
-	bool		ping_sent;
+	TimestampTz now;
 	char	   *err;
 	char	   *sender_host = NULL;
 	int			sender_port = 0;
@@ -192,7 +206,7 @@ WalReceiverMain(void)
 	 */
 	Assert(walrcv != NULL);
 
-	starttime = GetCurrentTimestamp();
+	now = GetCurrentTimestamp();
 
 	/*
 	 * Mark walreceiver as running in shared memory.
@@ -248,7 +262,7 @@ WalReceiverMain(void)
 
 	/* Initialise to a sanish value */
 	walrcv->lastMsgSendTime =
-		walrcv->lastMsgReceiptTime = walrcv->latestWalEndTime = starttime;
+		walrcv->lastMsgReceiptTime = walrcv->latestWalEndTime = now;
 
 	/* Report the latch to use to awaken this process */
 	walrcv->latch = &MyProc->procLatch;
@@ -414,9 +428,10 @@ WalReceiverMain(void)
 			initStringInfo(&reply_message);
 			initStringInfo(&incoming_message);
 
-			/* Initialize the last recv timestamp */
-			last_recv_timestamp = GetCurrentTimestamp();
-			ping_sent = false;
+			/* Initialize nap wakeup times. */
+			now = GetCurrentTimestamp();
+			for (int i = 0; i < NUM_WALRCV_WAKEUPS; ++i)
+WalRcvComputeNextWakeup(i, now);
 
 			/* Loop until end-of-streaming or error */
 			for (;;)
@@ -426,6 +441,8 @@ WalReceiverMain(void)
 bool		endofwal = false;
 pgsocket	wait_fd = PGINVALID_SOCKET;
 int			rc;
+TimestampTz	nextWakeup;
+int			nap;
 
 /*
  * Exit walreceiver if we're not in recovery. This should not
@@ -443,11 +460,15 @@ WalReceiverMain(void)
 {
 	ConfigReloadPending = false;
 	ProcessConfigFile(PGC_SIGHUP);
+	now = GetCurrentTimestamp();
+	for (int i = 0; i < NUM_WALRCV_WAKEUPS; ++i)
+		WalRcvComputeNextWakeup(i, now);
 	XLogWalRcvS

Re: Suppressing useless wakeups in walreceiver

2022-11-05 Thread Nathan Bossart
On Wed, Oct 26, 2022 at 03:05:20PM +0530, Bharath Rupireddy wrote:
> A comment on the patch:
> Isn't it better we track the soonest wakeup time or min of wakeup[]
> array (update the min whenever the array element is updated in
> WalRcvComputeNextWakeup()) instead of recomputing everytime by looping
> for NUM_WALRCV_WAKEUPS times? I think this will save us some CPU
> cycles, because the for loop, in which the below code is placed, runs
> till the walreceiver end of life cycle. We may wrap wakeup[] and min
> inside a structure for better code organization or just add min as
> another static variable alongside wakeup[].
> 
> +/* Find the soonest wakeup time, to limit our nap. */
> +nextWakeup = PG_INT64_MAX;
> +for (int i = 0; i < NUM_WALRCV_WAKEUPS; ++i)
> +nextWakeup = Min(wakeup[i], nextWakeup);
> +nap = Max(0, (nextWakeup - now + 999) / 1000);

While that might save a few CPU cycles when computing the nap time, I don't
think it's worth the added complexity and CPU cycles in
WalRcvComputeNextWakeup().  I suspect it'd be difficult to demonstrate any
meaningful difference between the two approaches.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Suppressing useless wakeups in walreceiver

2022-11-05 Thread Nathan Bossart
On Mon, Oct 17, 2022 at 03:21:18PM +0900, Kyotaro Horiguchi wrote:
> Now that I see the fix for the implicit conversion:
> 
> L527: +   nap = Max(0, (nextWakeup - now + 999) / 1000);
> ..
> L545: +   (int) Min(INT_MAX, nap),
> 
> 
> I think limiting the naptime at use is confusing. Don't we place these
> adjacently?  Then we could change the nap to an integer.  Or we can
> just assert out for the nap time longer than INT_MAX (but this would
> require another int64 variable). I believe we won't need such a long
> nap (or that is no longer a nap?)

Yeah, I guess this deserves a comment.  I could also combine it easily:

nap = (int) Min(INT_MAX, Max(0, (nextWakeup - now + 999) / 1000));

We could probably just remove the WL_TIMEOUT flag and set timeout to -1
whenever "nap" is calculated to be > INT_MAX, but I didn't think it was
worth complicating the code in order to save an extra wakeup every ~25
days.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: thinko in basic_archive.c

2022-11-05 Thread Nathan Bossart
On Fri, Oct 21, 2022 at 09:30:16PM +0530, Bharath Rupireddy wrote:
> On Fri, Oct 21, 2022 at 10:43 AM Kyotaro Horiguchi
>  wrote:
>> Thanks, but we don't need to wipe out the all bytes. Just putting \0
>> at the beginning of the buffer is sufficient.
> 
> Nah, that's not a clean way IMO.

Why not?  This is a commonly-used technique.  I see over 80 existing uses
in PostgreSQL.  Plus, your shutdown callback only checks for the first
byte, anyway.

+   if (tempfilepath[0] == '\0')
+   return;

As noted upthread [0], I think we should be cautious to only remove the
temporary file if we know we created it.  I still feel that trying to add
this cleanup logic to basic_archive is probably more trouble than it's
worth, but if there is a safe and effective way to do so, I won't object.

[0] https://postgr.es/m/20221015211026.GA1821022%40nathanxps13

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Documentation for building with meson

2022-11-05 Thread Andres Freund
Hi,

On 2022-10-30 20:51:59 -0700, samay sharma wrote:
> +# setup and enter build directory (done only first time)
> +meson setup build src --prefix=$PWD/install

This command won't work on windows, I think.


> + 
> +  Configuring the build
> +
> +   
> +The first step of the installation procedure is to configure the
> +source tree for your system and choose the options you would like. To

s/source tree/build tree/?


> +create and configure the build directory, you can start with the
> +meson setup command.
> +   
> +
> +
> +meson setup build
> +
> +
> +   
> +The setup command takes a builddir and a 
> srcdir
> +argument. If no srcdir is given, Meson will deduce the
> +srcdir based on the current directory and the location
> +of meson.build. The builddir is 
> mandatory.
> +   
> +
> +   
> +Meson then loads the build configuration file and sets up the build 
> directory.
> +Additionally, the invocation can pass options to Meson. The list of 
> commonly
> +used options is in subsequent sections. A few examples of specifying 
> different
> +build options are:

Somehow the "tone" is descriptive in a distanced way, rather than instructing
what to do.


> +   
> +Installation Locations
> +
> + 
> +  These options control where ninja install (or meson 
> install) will put
> +  the files.  The --prefix option is sufficient for
> +  most cases.

Perhaps the short version use of prefix could be a link here? Not sure if
that's a good idea.



> + 
> +  
> +   
> --prefix=PREFIX
> +   
> +
> + Install all files under the directory 
> PREFIX
> + instead of /usr/local/pgsql. The actual
> + files will be installed into various subdirectories; no files
> + will ever be installed directly into the
> + PREFIX directory.
> +
> +   
> +  

Hm, need to mention windows here likely. By default the installation will go
to :/usr/local/pgsql.


> +  
> +   
> --bindir=DIRECTORY
> +   
> +
> + Specifies the directory for executable programs. The default
> + is PREFIX/bin, which
> + normally means /usr/local/pgsql/bin.
> +
> +   
> +  

Hm, do we really want the "which normally means" part? That'll make the OS
stuff more complicated.


> +  
> +   
> --sysconfdir=DIRECTORY
> +   
> +
> + Sets the directory for various configuration files,
> + PREFIX/etc by 
> default.
> +
> +   
> +  

Need to check what windows does here.


> +  
> +   
> -Dnls=auto/enabled/disabled

Wonder if we should define an entity for
auto/enabled/disabled? There are a lot of repetitions
of it.


> +   
> +
> + Enables or disables Native Language Support 
> (NLS),
> + that is, the ability to display a program's messages in a
> + language other than English. It defaults to auto, meaning that it
> + will be enabled automatically if an implementation of the
> + Gettext API is found.
> +

Do we really want to repeat the "It defaults to auto, meaning that it will be
enabled automatically if ..." for each of these? Perhaps we could say
'defaults to auto'?


> +
> + By default,
> + 
> pkg-configpkg-config
> + will be used to find the required compilation options.  This is
> + supported for ICU4C version 4.6 and 
> later.
> + 
> +

I'd just remove this paragraph. Right now the meson build will solely use
pkg-config files for icu.  I don't think we need to care about 4.6
(from 2010) anymore.


> +  
> +   
> -Dllvm=auto/enabled/disabled
> +   
> +
> + Build with support for LLVM based
> + JIT compilation + condition="standalone-ignore"> (see  + linkend="jit"/>).  This
> + requires the LLVM library to be 
> installed.
> + The minimum required version of LLVM is
> + currently 3.9. It is set to disabled by default.
> +
> +
> + 
> llvm-configllvm-config
> + will be used to find the required compilation options.
> + llvm-config, and then
> + llvm-config-$major-$minor for all supported
> + versions, will be searched for in your PATH.
> + 
> +

Probably a link to the docs for meson native files suffices here for
now. Since the autoconf docs were written there's only
llvm-config-$version; llvm stopped having separate major/minor versions
somewhere around llvm 4. I think it'd suffice to say llvm-config-$version?


> +
> + LLVM support requires a compatible
> + clang compiler (specified, if necessary, using 
> the
> + CLANG environment variable), and a working C++
> + compiler (specified, if necessary, using the CXX
> + environment variable).
> +

For clang we don't look for CLA

Re: archive modules

2022-11-05 Thread Nathan Bossart
On Mon, Oct 17, 2022 at 02:49:51PM +0900, Michael Paquier wrote:
> And, by the way, this patch would prevent the existence of archive
> modules that need to be loaded but *want* an archive_command with
> what they want to achieve.  That does not strike me as a good idea if
> we want to have a maximum of flexibility with this facility.

Such a module could define a custom GUC that accepts a shell command.  I
don't think we should overload the meaning of archive_command based on the
whims of whatever archive module is loaded.  Besides the potential end-user
confusion, your archive_command might be unexpectedly used incorrectly if
you forget to set archive_library.
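For example, a module's _PG_init() could register such a GUC itself, roughly
like this (module and GUC names are invented, just to illustrate the idea):

static char *my_archive_command = NULL;

void
_PG_init(void)
{
	DefineCustomStringVariable("my_archiver.command",
							   "Shell command this archive module runs per WAL file.",
							   NULL,
							   &my_archive_command,
							   "",
							   PGC_SIGHUP,
							   0,
							   NULL, NULL, NULL);
}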

Perhaps we could eventually move the archive_command functionality to a
contrib module (i.e., "shell_archive") so that users must always set
archive_library.  But until then, I suspect it's better to treat modules
and commands as two separate interfaces to ease migration from older major
versions (even though archive_command is now essentially a built-in archive
module).

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: archive modules

2022-11-05 Thread Nathan Bossart
On Fri, Nov 04, 2022 at 12:05:26PM +0900, Ian Lawrence Barwick wrote:
> cfbot reports the patch no longer applies [1].  As CommitFest 2022-11 is
> currently underway, this would be an excellent time to update the patch.

Indeed.  Here is a new version of the patch.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 559eb898a9..2ffd82ab66 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3597,9 +3597,11 @@ include_dir 'conf.d'


 This parameter can only be set in the postgresql.conf
-file or on the server command line.  It is ignored unless
+file or on the server command line.  It is only used if
 archive_mode was enabled at server start and
-archive_library is set to an empty string.
+archive_library is set to an empty string.  If both
+archive_command and archive_library
+are set, archiving will fail.
 If archive_command is an empty string (the default) while
 archive_mode is enabled (and archive_library
 is set to an empty string), WAL archiving is temporarily
@@ -3624,7 +3626,9 @@ include_dir 'conf.d'

 The library to use for archiving completed WAL file segments.  If set to
 an empty string (the default), archiving via shell is enabled, and
- is used.  Otherwise, the specified
+ is used.  If both
+archive_command and archive_library
+are set, archiving will fail.  Otherwise, the specified
 shared library is used for archiving. The WAL archiver process is
 restarted by the postmaster when this parameter changes. For more
 information, see  and
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 2670e41666..3e11a4ce12 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -792,6 +792,12 @@ HandlePgArchInterrupts(void)
 		ConfigReloadPending = false;
 		ProcessConfigFile(PGC_SIGHUP);
 
+		if (XLogArchiveLibrary[0] != '\0' && XLogArchiveCommand[0] != '\0')
+			ereport(ERROR,
+	(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+	 errmsg("both archive_command and archive_library specified"),
+	 errdetail("Only one of archive_command, archive_library may be set.")));
+
 		archiveLibChanged = strcmp(XLogArchiveLibrary, archiveLib) != 0;
 		pfree(archiveLib);
 
@@ -825,6 +831,12 @@ LoadArchiveLibrary(void)
 {
 	ArchiveModuleInit archive_init;
 
+	if (XLogArchiveLibrary[0] != '\0' && XLogArchiveCommand[0] != '\0')
+		ereport(ERROR,
+(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("both archive_command and archive_library specified"),
+ errdetail("Only one of archive_command, archive_library may be set.")));
+
 	memset(&ArchiveContext, 0, sizeof(ArchiveModuleCallbacks));
 
 	/*


Re: Do we need to pass down nonnullable_vars when reducing outer joins?

2022-11-05 Thread Tom Lane
Richard Guo  writes:
> AFAICS, the Vars forced nonnullable by given clause are only used to
> check if we can reduce JOIN_LEFT to JOIN_ANTI, and it is checking the
> join's own quals there. It seems to me we do not need to pass down
> nonnullable_vars by upper quals to the children of a join.

Hmm, you are right, we are not doing anything useful with that data.
I can't remember if I had a concrete plan for doing something with it
or not, but we sure aren't using it now.  So pushed.

regards, tom lane




Re: proposal: possibility to read dumped table's name from file

2022-11-05 Thread Pavel Stehule
Hi

On Fri, Nov 4, 2022 at 14:28 Julien Rouhaud  wrote:

> On Fri, Nov 04, 2022 at 07:19:27AM -0500, Justin Pryzby wrote:
> > On Fri, Nov 04, 2022 at 07:59:01PM +0800, Julien Rouhaud wrote:
> > > On Thu, Nov 03, 2022 at 10:22:15PM +0100, Pavel Stehule wrote:
> > > > On Thu, Nov 3, 2022 at 5:09 Julien Rouhaud  wrote:
> > > > updated patch attached
> > > >
> > > > big thanks for these comments and tips
> > >
> > > Thanks for the updated patch!  As far as I'm concerned the patch is in
> a good
> > > shape, passes the CI and I don't have anything more to say so I'm
> marking it as
> > > Ready for Committer!
> >
> > +1
> >
> > I started looking to see if it's possible to simplify the patch at all,
> > but nothing to show yet.
> >
> > But one thing I noticed is that "optarg" looks wrong here:
> >
> > simple_string_list_append(&opts->triggerNames, optarg);
>
> Ah indeed, good catch!  Maybe there should be an explicit test for every
> (include|exclude) / objtype combination?  It would be a bit verbose (and
> possibly hard to maintain).
>

yes - pg_restore is not well covered by tests; fixed

I found another issue. pg_restore requires the full signature of a function
and is pretty sensitive to white space. I made a mistake when I partially
parsed patterns like SQL identifiers: it works for simple cases, but it stops
working when parsing a function's signature. So I rewrote the pattern-parsing
part. Now I just read the input string and try to reduce spaces; multiline
identifiers are still supported. Compared to the previous method of pattern
parsing, I needed to change just one regress test - now I am not able to
detect garbage after a pattern :-/. It is possible to enter types like
"double precision" or "timestamp with time zone" without needing to check
them on the server side.

While writing the regress tests I found some issues with pg_restore's
filtering options (not related to this patch):

* function filtering doesn't support schemas - when the name of a function
is specified with a schema, the function is not found

* a function has to be specified with an argument type list - the separator
has to be exactly the ", " string. Without the space, or with one extra
space, the filtering doesn't work (the new implementation of pattern parsing
reduces this white-space sensitivity). This is not a bug, but it is not well
documented.

* trigger filtering is probably broken (on the pg_restore side). The name
has to be entered in the form "tablename triggername"
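For illustration, a filter file in the format added by this patch might look
like this (the object names are invented):

# dump schema public, but skip the audit table's data
include schema public
exclude table_data public.audit_log
# also include data from one foreign server
include foreign_data remote_orders_server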

attached updated patch

Regards

Pavel
diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml
index 8b9d9f4cad..d5a6e2c7ee 100644
--- a/doc/src/sgml/ref/pg_dump.sgml
+++ b/doc/src/sgml/ref/pg_dump.sgml
@@ -779,6 +779,80 @@ PostgreSQL documentation
   
  
 
+ 
+  --filter=filename
+  
+   
+Specify a filename from which to read patterns for objects to include
+or exclude from the dump. The patterns are interpreted according to the
+same rules as the corresponding options:
+-t/--table for tables,
+-n/--schema for schemas,
+--include-foreign-data for data on foreign servers and
+--exclude-table-data for table data.
+To read from STDIN, use - as the
+filename.  The --filter option can be specified in
+conjunction with the above listed options for including or excluding
+objects, and can also be specified more than once for multiple filter
+files.
+   
+
+   
+The file lists one object pattern per row, with the following format:
+
+{ include | exclude } { table | schema | foreign_data | table_data } PATTERN
+
+   
+
+   
+The first keyword specifies whether the objects matched by the pattern
+are to be included or excluded. The second keyword specifies the type
+of object to be filtered using the pattern:
+
+ 
+  
+   table: tables, works like
+   -t/--table
+  
+ 
+ 
+  
+   schema: schemas, works like
+   -n/--schema
+  
+ 
+ 
+  
+   foreign_data: data on foreign servers, works like
+   --include-foreign-data. This keyword can only be
+   used with the include keyword.
+  
+ 
+ 
+  
+   table_data: table data, works like
+   --exclude-table-data. This keyword can only be
+   used with the exclude keyword.
+  
+ 
+
+   
+
+   
+Lines starting with # are considered comments and
+ignored. Comments can be placed after filter as well. Blank lines
+are also ignored. See  for how to
+perform quoting in patterns.
+   
+
+   
+Example files are listed below in the 
+section.
+   
+
+  
+ 
+
  
   --if-exists
   
@@ -1119,6 +1193,7 @@ PostgreSQL documentation
 sche

Re: Check SubPlan clause for nonnullable rels/Vars

2022-11-05 Thread Tom Lane
Richard Guo  writes:
> [ v2-0001-Check-SubPlan-clause-for-nonnullable-rels-Vars.patch ]

Pushed with cosmetic changes:

* I don't believe in "add at the end" as a principle for placement
of new code.  There's usually some other logic that will give more
consistent results.  In cases like this, ordering the treatment of
Node types in the same way as they appear in the include/nodes/
headers is the standard answer.  (Not that everybody's been totally
consistent about that :-( ... but that's not an argument for
introducing even more entropy.)

* I rewrote the comments a bit.

* I didn't like the test case too much: spinning up a whole new set
of tables seems like a lot of useless cycles.  Plus it makes it
harder to experiment with the test query manually.  I usually like
to write such queries using the regression database's standard tables,
so I rewrote this example that way.

regards, tom lane




Re: Lockless queue of waiters in LWLock

2022-11-05 Thread Andres Freund
Hi,

On 2022-11-05 12:05:43 +0300, Alexander Korotkov wrote:
> On Fri, Nov 4, 2022 at 10:07 PM Andres Freund  wrote:
> > The use of cmpxchg vs lock inc/lock add/xadd is one of the major reasons why
> > lwlocks are slower than a spinlock (but obviously are better under 
> > contention
> > nonetheless).
> >
> >
> > I have a benchmark program that starts a thread for each physical core and
> > just increments a counter on an atomic value.
> 
> Thank you for this insight!  I didn't know xadd is much cheaper than
> cmpxchg unless there are retries.

The magnitude of the effect is somewhat surprising, I agree. Some difference
makes sense to me, but...
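For reference, the two update patterns being compared are roughly these (a
standalone C11 sketch, not the actual benchmark program):

#include <stdatomic.h>
#include <stdint.h>

/* lock xadd: a single instruction, no retry loop */
static void
increment_xadd(_Atomic uint64_t *counter)
{
	atomic_fetch_add(counter, 1);
}

/* lock cmpxchg: a read-modify-write that has to loop when it loses a race */
static void
increment_cmpxchg(_Atomic uint64_t *counter)
{
	uint64_t	old = atomic_load(counter);

	while (!atomic_compare_exchange_weak(counter, &old, old + 1))
		;	/* "old" is refreshed with the current value on failure */
}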


> I also wonder how cmpxchg becomes faster with higher concurrency.

If you're referring to the leading 32/64 that's not concurrency, that's
32/64bit values. Sorry for not being clearer on that.

Greetings,

Andres Freund




Re: pg_reload_conf() synchronously

2022-11-05 Thread Andres Freund
Hi,

On 2022-11-05 12:03:49 -0700, Gurjeet Singh wrote:
> For example, I thought of implementing DSM based synchronisation between
> processes, so that the invoking backend can be sure that _all_ children of
> Postmaster have processed the reload. But that will run into problems as
> different backends get created, and as they exit.

If you just want something like that you really can just use the global
barrier mechanism.  The hard part is how to deal with situations like two
backends waiting at the same time. Possibly the best way would be to not
actually offer a blocking API but just a way to ask whether a change has been
processed everywhere, and require explicit polling on the client side.


> The proposed patch solves a specific problem, that of a single backend
> reloading conf before the next command, without adding any complexity
> for any other cases.

I'm not sure that's true btw - I seem to recall that there's code somewhere
noting that it relies on postmaster being the first to process a config
change. Which wouldn't be true with this change anymore. I remember not being
convinced by that logic, because successive config changes can still lead to
backends processing the config file first.

Greetings,

Andres Freund




Re: pg_reload_conf() synchronously

2022-11-05 Thread Gurjeet Singh
On Sat, Nov 5, 2022 at 11:23 AM Andres Freund  wrote:
> > As far as I know, Gurjeet has been annoyed only with non-user-settable
> > GUCs for one connection (correct me of course), there was nothing
> > fancy with isolation tests, yet.  Not saying that this is useless for
> > isolation tests, this would have its cases for assumptions where
> > multiple GUCs need to be synced across multiple sessions, but it seems
> > to me that we have two different cases in need of two different
> > solutions.
>
> I didn't say we need to go for something more complete. Just that it's worth
> thinking about.

FWIW, I have considered a few different approaches, but all of them
were not only more work, they were also fragile and difficult to prove
correct. For example, I thought of implementing DSM based
synchronisation between processes, so that the invoking backend can be
sure that _all_ children of Postmaster have processed the reload. But
that will run into problems as different backends get created, and as
they exit.

The invoking client may be interested in just client-facing backends
honoring the reload before moving on, so it would have to wait until
all the other backends finish their current command and return to the
main loop; but that may never happen, because one of the backends may
be processing a long-running query. Or, for some reason, the
caller may be interested in only the autovacuum processes honoring
its reload request. So I thought about creating _classes_ of backends:
client-facing, other always-on children of Postmaster, BGWorkers, etc.
But that just makes the problem more difficult to solve.

I hadn't considered the possibilty of deadlocks that Tom raised, so
that's another downside of the other approaches.

The proposed patch solves a specific problem, that of a single backend
reloading conf before the next command, without adding any complexity
for any other cases. It sure is worth solving the case where a backend
waits for other backend(s) to process the conf files, but it will
have to be a separate solution, because it's much more difficult to
get it right.

Best regards,
Gurjeet
http://Gurje.et




Re: ssl tests aren't concurrency safe due to get_free_port()

2022-11-05 Thread Andres Freund
Hi,

On 2022-11-03 14:16:51 -0400, Andrew Dunstan wrote:
> > Here's a patch which I think does the right thing.
> Updated with a couple of thinkos fixed.

Thanks!


> diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm 
> b/src/test/perl/PostgreSQL/Test/Cluster.pm
> index d80134b26f..aceca353d3 100644
> --- a/src/test/perl/PostgreSQL/Test/Cluster.pm
> +++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
> @@ -93,9 +93,9 @@ use warnings;
>  
>  use Carp;
>  use Config;
> -use Fcntl qw(:mode);
> +use Fcntl qw(:mode :flock :seek O_CREAT O_RDWR);

Does this do anything useful on windows?



>  # the minimum version we believe to be compatible with this package without
>  # subclassing.
> @@ -140,6 +140,27 @@ INIT
>  
>   # Tracking of last port value assigned to accelerate free port lookup.
>   $last_port_assigned = int(rand() * 16384) + 49152;
> +
> + # Set the port lock directory
> +
> + # If we're told to use a directory (e.g. from a buildfarm client)
> + # explicitly, use that
> + $portdir = $ENV{PG_TEST_PORT_DIR};
> + # Otherwise, try to use a directory at the top of the build tree
> + if (! $portdir && $ENV{MESON_BUILD_ROOT})
> + {
> + $portdir = $ENV{MESON_BUILD_ROOT} . '/portlock'
> + }
> + elsif (! $portdir && ($ENV{TESTDATADIR} || "") =~ /\W(src|contrib)\W/p)
> + {
> + my $dir = ${^PREMATCH};
> + $portdir = "$dir/portlock" if $dir;
> + }
> + # As a last resort use a directory under tmp_check
> + $portdir ||= $PostgreSQL::Test::Utils::tmp_check . '/portlock';
> + $portdir =~ s!\\!/!g;
> + # Make sure the directory exists
> + mkpath($portdir) unless -d $portdir;
>  }
>  
>  =pod
> @@ -1505,6 +1526,7 @@ sub get_free_port
>   last;
>   }
>   }
> + $found = _reserve_port($port) if $found;
>   }
>   }
>  
> @@ -1535,6 +1557,38 @@ sub can_bind
>   return $ret;
>  }
>  
> +# Internal routine to reserve a port number
> +# Returns 1 if successful, 0 if port is already reserved.
> +sub _reserve_port
> +{
> + my $port = shift;
> + # open in rw mode so we don't have to reopen it and lose the lock
> + sysopen(my $portfile, "$portdir/$port.rsv", O_RDWR|O_CREAT)
> +   || die "opening port file";
> + # take an exclusive lock to avoid concurrent access
> + flock($portfile, LOCK_EX) || die "locking port file";
> + # see if someone else has or had a reservation of this port
> + my $pid = <$portfile>;
> + chomp $pid;
> + if ($pid +0 > 0)

Gotta love perl.


> + {
> + if (kill 0, $pid)

Does this work on windows?


Greetings,

Andres Freund




Re: [PATCH] Add native windows on arm64 support

2022-11-05 Thread Andres Freund
Hi,

On 2022-11-03 11:06:46 +, Niyas Sait wrote:
> I've attached a new version of the patch which excludes the already merged
> ASLR changes and add
> small changes to handle latest changes in the build scripts.

Note that we're planning to remove the custom windows build scripts before the
next release, relying on the meson build instead.


> diff --git a/src/include/storage/s_lock.h b/src/include/storage/s_lock.h
> index 8b19ab160f..bf6a6dba35 100644
> --- a/src/include/storage/s_lock.h
> +++ b/src/include/storage/s_lock.h
> @@ -708,13 +708,21 @@ typedef LONG slock_t;
>  #define SPIN_DELAY() spin_delay()
>  
>  /* If using Visual C++ on Win64, inline assembly is unavailable.
> - * Use a _mm_pause intrinsic instead of rep nop.
> + * Use _mm_pause (x64) or __isb(arm64) intrinsic instead of rep nop.
>   */
>  #if defined(_WIN64)
>  static __forceinline void
>  spin_delay(void)
>  {
> +#ifdef _M_ARM64
> + /*
> +  * arm64 way of hinting processor for spin loops optimisations
> +  * ref: 
> https://community.arm.com/support-forums/f/infrastructure-solutions-forum/48654/ssetoneon-faq
> + */
> + __isb(_ARM64_BARRIER_SY);
> +#else
>   _mm_pause();
> +#endif
>  }
>  #else
>  static __forceinline void

I think we should just apply this, there seems very little risk of making
anything worse, given the gating to _WIN64 && _M_ARM64.


> diff --git a/src/port/pg_crc32c_armv8.c b/src/port/pg_crc32c_armv8.c
> index 9e301f96f6..981718752f 100644
> --- a/src/port/pg_crc32c_armv8.c
> +++ b/src/port/pg_crc32c_armv8.c
> @@ -14,7 +14,9 @@
>   */
>  #include "c.h"
>  
> +#ifndef _MSC_VER
>  #include 
> +#endif
>  
>  #include "port/pg_crc32c.h"

This won't suffice with the meson build, since the relevant configure test
also uses arm_acle.h:
elif host_cpu == 'arm' or host_cpu == 'aarch64'

  prog = '''
#include 

int main(void)
{
unsigned int crc = 0;
crc = __crc32cb(crc, 0);
crc = __crc32ch(crc, 0);
crc = __crc32cw(crc, 0);
crc = __crc32cd(crc, 0);

/* return computed value, to prevent the above being optimized away */
return crc == 0;
}
'''

  if cc.links(prog, name: '__crc32cb, __crc32ch, __crc32cw, and __crc32cd 
without -march=armv8-a+crc',
  args: test_c_args)
# Use ARM CRC Extension unconditionally
cdata.set('USE_ARMV8_CRC32C', 1)
have_optimized_crc = true
  elif cc.links(prog, name: '__crc32cb, __crc32ch, __crc32cw, and __crc32cd 
with -march=armv8-a+crc',
  args: test_c_args + ['-march=armv8-a+crc'])
# Use ARM CRC Extension, with runtime check
cflags_crc += '-march=armv8-a+crc'
cdata.set('USE_ARMV8_CRC32C', false)
cdata.set('USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK', 1)
have_optimized_crc = true
  endif
endif

The meson checking logic is used both for msvc and other compilers, so this
will need to work with both.


> diff --git a/src/tools/msvc/gendef.pl b/src/tools/msvc/gendef.pl
> index d6bed1ce15..f1e0ff446b 100644
> --- a/src/tools/msvc/gendef.pl
> +++ b/src/tools/msvc/gendef.pl
> @@ -120,9 +120,9 @@ sub writedef
>   {
>   my $isdata = $def->{$f} eq 'data';
>  
> - # Strip the leading underscore for win32, but not x64
> + # Strip the leading underscore for win32, but not x64 and arm64
>   $f =~ s/^_//
> -   unless ($arch eq "x86_64");
> +   unless ($arch ne "x86");
>  
>   # Emit just the name if it's a function symbol, or emit the name
>   # decorated with the DATA option for variables.
> @@ -143,7 +143,7 @@ sub writedef
>  sub usage
>  {
>   die("Usage: gendef.pl --arch  --deffile  --tempdir 
>  files-or-directories\n"
> -   . "arch: x86 | x86_64\n"
> +   . "arch: x86 | x86_64 | arm64 \n"
> . "deffile: path of the generated file\n"
> . "tempdir: directory for temporary files\n"
> . "files or directories: object files or directory 
> containing object files\n"
> @@ -160,7 +160,7 @@ GetOptions(
>   'tempdir:s' => \$tempdir,) or usage();
>  
>  usage("arch: $arch")
> -  unless ($arch eq 'x86' || $arch eq 'x86_64');
> +  unless ($arch eq 'x86' || $arch eq 'x86_64' || $arch eq 'arm64');
>  
>  my @files;
>  

Seems reasonable.

Greetings,

Andres Freund




Re: pg_reload_conf() synchronously

2022-11-05 Thread Andres Freund
Hi,

On 2022-11-05 14:26:44 +0900, Michael Paquier wrote:
> On Fri, Nov 04, 2022 at 09:51:08PM -0700, Andres Freund wrote:
> > On 2022-11-04 23:35:21 -0400, Tom Lane wrote:
> >> True, but do we have any such cases?
> > 
> > I know I hit it twice and gave up on the tests.
> 
> Still, is there any need for a full-blown facility for the case of an
> individual backend?

No, and I didn't claim it was.


> Using a new function to track that all the changes are in effect is useful
> for isolation tests

I hit it in tap tests, fwiw.


> As far as I know, Gurjeet has been annoyed only with non-user-settable
> GUCs for one connection (correct me of course), there was nothing
> fancy with isolation tests, yet.  Not saying that this is useless for
> isolation tests, this would have its cases for assumptions where
> multiple GUCs need to be synced across multiple sessions, but it seems
> to me that we have two different cases in need of two different
> solutions.

I didn't say we need to go for something more complete. Just that it's worth
thinking about.

Greetings,

Andres Freund




Re: [PATCH] ALTER TABLE ... SET STORAGE default

2022-11-05 Thread Tom Lane
Aleksander Alekseev  writes:
> Hi Nikita,
>> This seems a little bit confusing and thus very unfriendly for the user, 
>> because the actual meaning
>> of the same 'DEFAULT' option will be different for each data type, and to 
>> check storage mode user
>> has to query full table (or column) description.
>> I'd rather add a paragraph in documentation describing each data type 
>> default storage mode.

> I agree that "SET STORAGE default" syntax leaves much to be desired.

FWIW, I don't buy that argument at all.  If you believe that then
you must also think that

INSERT INTO mytab VALUES (..., DEFAULT, ...);

is a poorly-designed feature because you have to go consult the table
definition to find out what will be inserted.  (Well, maybe you do
think that, but the SQL committee won't agree with you ;-))  So I don't
see any problem with DEFAULT representing a data-type-specific default
in this situation.

> Personally I would prefer "RESET STORAGE" and "RESET COMPRESSION".

Perhaps, but what's done is done, and I agree that STORAGE had better
follow the existing precedent.

I've not read the patch in any detail, but I don't see a problem
with the design.

regards, tom lane




Re: pg_dump: Refactor code that constructs ALTER ... OWNER TO commands

2022-11-05 Thread Corey Huinker
On Wed, Nov 2, 2022 at 5:30 PM Peter Eisentraut <
peter.eisentr...@enterprisedb.com> wrote:

> On 01.11.22 13:59, Corey Huinker wrote:
> > On Mon, Oct 24, 2022 at 5:54 AM Peter Eisentraut
> >  > > wrote:
> >
> > Avoid having to list all the possible object types twice.  Instead,
> > only
> > _getObjectDescription() needs to know about specific object types.
> It
> > communicates back to _printTocEntry() whether an owner is to be set.
> >
> > In passing, remove the logic to use ALTER TABLE to set the owner of
> > views and sequences.  This is no longer necessary.  Furthermore, if
> > pg_dump doesn't recognize the object type, this is now a fatal error,
> > not a warning.
> >
> >
> > Makes sense, passes all tests.
>
> Committed.
>
> > It's clearly out of scope for this very focused patch, but would it make
> > sense for the TocEntry struct to be populated with an type enumeration
> > integer as well as the type string to make for clearer and faster
> > sifting later?
>
> That could be better, but wouldn't that mean a change of the format of
> pg_dump archives?
>

Sorry for the confusion, I was thinking strictly of the in memory
representation after it is extracted from the dictionary.


Re: Schema variables - new implementation for Postgres 15

2022-11-05 Thread Julien Rouhaud
Hi,

On Sat, Nov 05, 2022 at 05:04:31PM +0100, Tomas Vondra wrote:
>
> I did a quick initial review of this patch series - attached is a
> version with "review" commits for some of the parts. The current patch
> seems in pretty good shape, most of what I noticed are minor issues. I
> plan to do a more thorough review later.

Thanks!

I agree with all of your comments, just a few answers below

> - NamesFromList and IdentifyVariable seem introduced unnecessarily
> early, as they are only used in 0002 and 0003 parts (in the original
> patch series). Not sure if the plan is to squash everything into a
> single patch, or commit individual patches.

The split was mostly done to make the patch easier to review, as it adds quite
a bit of infrastructure.

There have been some previous comments to have a more logical separation and
fix similar issues, but there are still probably other oddities like that
laying around.  I personally didn't focus much on it as I don't know if the
future committer will choose to squash everything or not.

> - AFAIK patches don't need to modify typedefs.list.

I think this was discussed a year or so ago, and my understanding is that the
general rule is that it's now welcome, if not recommended, to maintain
typedefs.list in each patchset.

> Which I think means this:
>
> if (filter_lxid && svar->drop_lxid == MyProc->lxid)
> continue;
>
> accesses drop_lxid, which was not initialized in init_session_variable.

Agreed.




Re: [patch] Have psql's \d+ indicate foreign partitions

2022-11-05 Thread Tom Lane
Michael Paquier  writes:
> On Mon, Oct 24, 2022 at 09:44:18PM +0900, Ian Lawrence Barwick wrote:
>> Recently I have been working a lot with partitioned tables which contain a 
>> mix
>> of local and foreign partitions, and find it would be very useful to be able 
>> to
>> easily obtain an overview of which partitions are foreign and where they are
>> located.

> Hmm.  I am not sure that we should add this much amount of
> information, particularly for the server bits.

FWIW, I am also in favor of adding ", FOREIGN" but no more.
My concern is that as submitted, the patch greatly increases
the cost of the underlying query by adding two more catalogs
to the join.  I don't think imposing such a cost on everybody
(whether they use foreign partitions or not) is worth that.  But
we can add ", FOREIGN" for free since we have the relkind anyway.

regards, tom lane




Re: Temporary tables versus wraparound... again

2022-11-05 Thread Tom Lane
Greg Stark  writes:
> Simple Rebase

I took a little bit of a look through these.

* I find 0001 a bit scary, specifically that it's decided it's
okay to apply extract_autovac_opts, pgstat_fetch_stat_tabentry_ext,
and especially relation_needs_vacanalyze to another session's
temp table.  How safe is that really?

* Don't see much point in renaming checkTempNamespaceStatus.
That doesn't make it not an ABI break.  If we want to back-patch
this we'll have to do something different than what you did here.

* In 0002, I don't especially approve of what you've done with
the relminmxid calculation --- that seems to move it out of
"pure bug fix" territory and into "hmm, I wonder if this
creates new bugs" territory.  Also, skipping that update
for non-temp tables immediately falsifies ResetVacStats'
claimed charter of "resetting to the same values used when
creating tables".  Surely GetOldestMultiXactId isn't *that*
expensive, especially compared to the costs of file truncation.
I think you should just do GetOldestMultiXactId straight up,
and maybe submit a separate performance-improvement patch
to make it do the other thing (in both places) for temp tables.

* I wonder if this is the best place for ResetVacStats --- it
doesn't seem to be very close to the code it needs to stay in
sync with.  If there's no better place, perhaps adding cross-
reference comments in both directions would be advisable.

* 0003 says it's running temp.sql by itself to avoid interference
from other sessions, but sadly that cannot avoid interference
from background autovacuum/autoanalyze.  I seriously doubt this
patch would survive contact with the buildfarm.  Do we actually
need a new test case?  It's not like the code won't get exercised
without it --- we have plenty of temp table truncations, surely.

regards, tom lane




Re: Add common function ReplicationOriginName.

2022-11-05 Thread Tom Lane
Aleksander Alekseev  writes:
> This leaves us one patch to deal with.
> [ v4-0001-Pass-Size-size_t-as-a-2nd-argument-of-snprintf.patch ]

I looked at this and am inclined to reject it.  None of these
places realistically need to deal with strings longer than
MAXPATHLEN or so, let alone multiple gigabytes.  So it's just
code churn, creating backpatch hazards (admittedly not big ones)
for no real gain.

regards, tom lane




Re: explain_regress, explain(MACHINE), and default to explain(BUFFERS) (was: BUFFERS enabled by default in EXPLAIN (ANALYZE))

2022-11-05 Thread Tom Lane
Justin Pryzby  writes:
> You set the patch to "waiting on author", which indicates that there's
> no need for further input or review.  But, I think that's precisely
> what's needed - without input from more people, what could I do to
> progress the patch ?  I don't think it's reasonable to put 001 first and
> change thousands (actually, 1338) of regression results.  If nobody
> wants to discuss 001, then this patch series won't progress.

Well ...

1. 0001 invents a new GUC but provides no documentation for it.
That certainly isn't committable, and it's discouraging the
discussion you seek because people have to read the whole patch
in detail to understand what is being proposed.

2. The same complaint for 0004, which labors under the additional
problem that MACHINE is one of the worst option names I've ever
seen proposed around here.  It conveys *nothing* about what it does
or is good for.  The fact that it's got no relationship to the GUC
name, and yet they're intimately connected, doesn't improve my
opinion of it.

3. I'm not really on board with the entire approach.  Making
EXPLAIN work significantly differently for developers and test
critters than it does for everyone else seems likely to promote
confusion and hide bugs.  I don't think getting rid of the need
for filter functions in test scripts is worth that.

4. I don't see how the current patches could pass under "make
installcheck" --- it looks to me like it only takes care of
applying the GUC change in installations built by pg_regress
or Cluster.pm.  To make this actually work, you'd have to
add "explain_regress = on" to the ALTER DATABASE SET commands
issued for regression databases.  That'd substantially increase
the problem of "it works differently than what users see", at
least for me --- I do a lot of stuff interactively in the
regression database.

5. The change in postgres_fdw.c in 0001 concerns me too.
Yeah, it will fix things if an updated postgres_fdw talks to
an updated server, but what about an older postgres_fdw?
Don't really want to see that case break; communication with
servers of different major versions is one of postgres_fdw's
main goals.

As a side note, 0001 also creates a hard break in postgres_fdw's
ability to work with pre-9.0 remote servers, which won't accept
"EXPLAIN (COSTS)".  Maybe we don't care about that anymore, given
the policy we've adopted elsewhere that we won't test or support
interoperability with versions before 9.2.  But we should either
make the command spelling conditional on remote version, or
officially move that goalpost.

regards, tom lane




Re: explain_regress, explain(MACHINE), and default to explain(BUFFERS) (was: BUFFERS enabled by default in EXPLAIN (ANALYZE))

2022-11-05 Thread Justin Pryzby
On Fri, Nov 04, 2022 at 11:46:03AM +0100, Laurenz Albe wrote:

> "explain_regress" reduces the EXPLAIN options you need for regression tests.
> This is somewhat useful, but not a big win.  Also, it will make backpatching
> regression tests slightly harder for the next 5 years.

But it doesn't cause backpatching to be harder; if a patch is meant to
be backpatched, its author can still choose to write the old way, with
(COSTS OFF, ...)

> For me, an important question is: do we really want more EXPLAIN (ANALYZE)
> in the regression tests?  It will probably slow the tests down, and I wonder
> if there is a lot of added value, if we have to remove all the 
> machine-specific
> output anyway.

I don't intend to encourage putting lots of "explain analyze" in the
tests.  Rather, it makes that a reasonable possibility instead of
something that people are discouraged from doing by its difficulty.
Using "explain analyze" still needs to be evaluated (for speed and
otherwise), same as always.

In any case, I suggested deferring review of patches 004 and beyond,
which are optional and distract from the essential discussion about
001.

> I would really like to see 0003 go in, but it currently depends on 0001, which
> I am not so sure about.
> I understand that you did that so that "explain_regress" can turn off BUFFERS
> and there is no extra churn in the regression tests.
> 
> Still, it would be a shame if resistance against "explain_regress" would
> be a show-stopper for 0003.
> 
> If I could get my way, I'd want two separate patches: first, one to turn
> BUFFERS on, and second one for "explain_regress" with its current 
> functionality
> on top of that.

You set the patch to "waiting on author", which indicates that there's
no need for further input or review.  But, I think that's precisely
what's needed - without input from more people, what could I do to
progress the patch ?  I don't think it's reasonable to put 001 first and
change thousands (actually, 1338) of regression results.  If nobody
wants to discuss 001, then this patch series won't progress.

-- 
Justin




Re: New docs chapter on Transaction Management and related changes

2022-11-05 Thread Simon Riggs
On Fri, 4 Nov 2022 at 15:17, Laurenz Albe  wrote:
>
> On Sat, 2022-10-15 at 21:08 -0400, Bruce Momjian wrote:
> > I therefore merged all three paragraphs into
> > one and tried to make the text saner;  release_savepoint.sgml diff
> > attached, URL content updated.
>
> I wanted to have a look at this, but I am confused.  The original patch
> was much bigger.  Is this just an incremental patch?  If yes, it would
> be nice to have a "grand total" patch, so that I can read it all
> in one go.

Agreed; new compilation patch attached, including mine and then
Robert's suggested rewordings.

-- 
Simon Riggs    http://www.EnterpriseDB.com/


xact_docs.v9.patch
Description: Binary data


Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-11-05 Thread John Naylor
On Fri, Nov 4, 2022 at 10:25 PM Masahiko Sawada 
wrote:
>
> For parallel heap pruning, multiple workers will insert key-value
> pairs to the radix tree concurrently. The simplest solution would be a
> single lock to protect writes but the performance will not be good.
> Another solution would be that we can divide the tables into multiple
> ranges so that keys derived from TIDs are not conflicted with each
> other and have parallel workers process one or more ranges. That way,
> parallel vacuum workers can build *sub-trees* and the leader process
> can merge them. In use cases of lazy vacuum, since the write phase and
> read phase are separated the readers don't need to worry about
> concurrent updates.

It's a good idea to use ranges for a different reason -- readahead. See
commit 56788d2156fc3, which aimed to improve readahead for sequential
scans. It might work to use that as a model: Each worker prunes a range of
64 pages, keeping the dead tids in a local array. At the end of the range:
lock the tid store, enter the tids into the store, unlock, free the local
array, and get the next range from the leader. It's possible contention
won't be too bad, and I suspect using small local arrays as-we-go would be
faster and use less memory than merging multiple sub-trees at the end.
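
To make the shape of that loop concrete, here is a stand-alone sketch of the
batching pattern (plain C with pthreads, not PostgreSQL code; the block
counts, process_block(), and the shared array are all invented for
illustration): each worker drains a 64-block range into a private buffer and
takes the shared lock only once per range to merge it.

#include <pthread.h>
#include <stdio.h>

#define RANGE_BLOCKS    64          /* blocks handed out per worker at a time */
#define TOTAL_BLOCKS    1024
#define NWORKERS        4

static pthread_mutex_t range_lock = PTHREAD_MUTEX_INITIALIZER;
static int      next_block = 0;     /* stand-in for the leader handing out ranges */

static pthread_mutex_t store_lock = PTHREAD_MUTEX_INITIALIZER;
static long     shared_store[TOTAL_BLOCKS];     /* stand-in for the shared TID store */
static int      shared_count = 0;

/* Pretend to prune one block, returning one "dead tid". */
static long
process_block(int blk)
{
    return (long) blk;
}

static void *
worker(void *arg)
{
    (void) arg;

    for (;;)
    {
        long    local[RANGE_BLOCKS];
        int     nlocal = 0;
        int     start;

        /* get the next range from the "leader" */
        pthread_mutex_lock(&range_lock);
        start = next_block;
        next_block += RANGE_BLOCKS;
        pthread_mutex_unlock(&range_lock);
        if (start >= TOTAL_BLOCKS)
            break;

        /* collect results locally, holding no shared lock */
        for (int blk = start; blk < start + RANGE_BLOCKS; blk++)
            local[nlocal++] = process_block(blk);

        /* one short critical section per range to merge into the store */
        pthread_mutex_lock(&store_lock);
        for (int i = 0; i < nlocal; i++)
            shared_store[shared_count++] = local[i];
        pthread_mutex_unlock(&store_lock);
    }
    return NULL;
}

int
main(void)
{
    pthread_t   workers[NWORKERS];

    for (int i = 0; i < NWORKERS; i++)
        pthread_create(&workers[i], NULL, worker, NULL);
    for (int i = 0; i < NWORKERS; i++)
        pthread_join(workers[i], NULL);
    printf("merged %d entries\n", shared_count);
    return 0;
}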

> I've attached a draft patch for lazy vacuum integration that can be
> applied on top of v8 patches. The patch adds a new module called
> TIDStore, an efficient storage for TID backed by radix tree. Lazy
> vacuum and parallel vacuum use it instead of a TID array. The patch
> also introduces rt_detach() that was missed in 0002 patch. It's a very
> rough patch but I hope it helps in considering lazy vacuum
> integration, radix tree APIs, and shared radix tree functionality.

It does help, good to see this.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: Lockless queue of waiters in LWLock

2022-11-05 Thread Alexander Korotkov
Hi, Andres!

On Fri, Nov 4, 2022 at 10:07 PM Andres Freund  wrote:
> The use of cmpxchg vs lock inc/lock add/xadd is one of the major reasons why
> lwlocks are slower than a spinlock (but obviously are better under contention
> nonetheless).
>
>
> I have a benchmark program that starts a thread for each physical core and
> just increments a counter on an atomic value.

Thank you for this insight!  I didn't know xadd is much cheaper than
cmpxchg unless there are retries.  I also wonder how cmpxchg becomes
faster with higher concurrency.
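
For illustration, a minimal stand-alone comparison of the two increment
styles using C11 atomics (not PostgreSQL's pg_atomic_* API): on x86-64,
atomic_fetch_add typically compiles to a single lock xadd, while the
compare-exchange version has to loop and may retry under contention.

#include <stdatomic.h>
#include <stdio.h>

static _Atomic unsigned long counter;

/* Unconditional atomic add: a single "lock xadd" on x86-64. */
static void
increment_xadd(void)
{
    atomic_fetch_add(&counter, 1);
}

/* CAS loop: read the old value and retry if another thread won the race. */
static void
increment_cmpxchg(void)
{
    unsigned long old = atomic_load(&counter);

    while (!atomic_compare_exchange_weak(&counter, &old, old + 1))
        ;                       /* 'old' is refreshed on failure, just retry */
}

int
main(void)
{
    increment_xadd();
    increment_cmpxchg();
    printf("counter = %lu\n", atomic_load(&counter));
    return 0;
}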

--
Regards,
Alexander Korotkov




Re: Add proper planner support for ORDER BY / DISTINCT aggregates

2022-11-05 Thread Pavel Luzanov

Hello,

While playing with the patch I found a situation where the performance 
may be degraded compared to previous versions.


The test case is below.
If you create the proper index for the query, on (a,c), version 16 wins. On my 
notebook, the query runs ~50% faster.
But if there is no index on (a,c), only on (a,b), previous versions use that 
index, while with this patch a full table scan is selected.



create table t (a text, b text, c text);
insert into t (a,b,c) select x,y,x from generate_series(1,100) as x, 
generate_series(1,1) y;

create index on t (a,b);
vacuum analyze t;

explain (analyze, buffers)
select a, array_agg(c order by c) from t group by a;


v 14.5
                                                             QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=0.42..46587.76 rows=100 width=34) (actual time=3.077..351.526 rows=100 loops=1)
   Group Key: a
   Buffers: shared hit=193387 read=2745
   ->  Index Scan using t_a_b_idx on t  (cost=0.42..41586.51 rows=100 width=4) (actual time=0.014..155.095 rows=100 loops=1)
         Buffers: shared hit=193387 read=2745
 Planning:
   Buffers: shared hit=9
 Planning Time: 0.059 ms
 Execution Time: 351.581 ms
(9 rows)


v 16
                                                           QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=128728.34..136229.59 rows=100 width=34) (actual time=262.930..572.915 rows=100 loops=1)
   Group Key: a
   Buffers: shared hit=5396, temp read=1950 written=1964
   ->  Sort  (cost=128728.34..131228.34 rows=100 width=4) (actual time=259.423..434.105 rows=100 loops=1)
         Sort Key: a, c
         Sort Method: external merge  Disk: 15600kB
         Buffers: shared hit=5396, temp read=1950 written=1964
         ->  Seq Scan on t  (cost=0.00..15396.00 rows=100 width=4) (actual time=0.005..84.104 rows=100 loops=1)
               Buffers: shared hit=5396
 Planning:
   Buffers: shared hit=9
 Planning Time: 0.055 ms
 Execution Time: 575.146 ms
(13 rows)

--
Pavel Luzanov
Postgres Professional: https://postgrespro.com
The Russian Postgres Company





Re: psql: Add command to use extended query protocol

2022-11-05 Thread Pavel Stehule
On Sat, Nov 5, 2022 at 7:35, Corey Huinker 
wrote:

> On Fri, Nov 4, 2022 at 11:45 AM Peter Eisentraut <
> peter.eisentr...@enterprisedb.com> wrote:
>
>> On 02.11.22 01:18, Corey Huinker wrote:
>> >
>> >   SELECT $1, $2 \gp 'foo' 'bar'
>> >
>> >
>> > I think this is a great idea, but I foresee people wanting to send that
>> > output to a file or a pipe like \g allows. If we assume everything after
>> > the \gp is a param, don't we paint ourselves into a corner?
>>
>> Any thoughts on how that syntax could be generalized?
>>
>
> A few:
>
> The most compact idea I can think of is to have \bind and \endbind (or
> more terse equivalents \bp and \ebp)
>
> SELECT * FROM foo WHERE type_id = $1 AND cost > $2 \bind 'param1' 'param2'
> \endbind $2 \g filename.csv
>
> Maybe the end-bind param isn't needed at all, we just insist that bind
> params be single quoted strings or numbers, so the next slash command ends
> the bind list.
>
> If that proves difficult, we might save bind params like registers
>
> something like this, positional:
>
> \bind 1 'param1'
> \bind 2 'param2'
> SELECT * FROM foo WHERE type_id = $1 AND cost > $2 \g filename.csv
> \unbind
>
> or all the binds on one line
>
> \bindmany 'param1' 'param2'
> SELECT * FROM foo WHERE type_id = $1 AND cost > $2 \g filename.csv
> \unbind
>
> Then psql would merely have to check if it had any bound registers, and if
> so, the next query executed is extended query protocol, and \unbind wipes
> out the binds to send us back to regular mode.
>

what about introducing new syntax for psql variables that should be passed
as bind variables?

like

SELECT * FROM foo WHERE x = $x \g

any time this syntax is used, the extended query protocol will be used

and without any variable, the extended query protocol can be forced by a psql
config variable

like

\set EXTENDED_QUERY_PROTOCOL true
SELECT 1;

Regards

Pavel
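
For reference, the extended query protocol that these proposals would drive
is the same one libpq exposes through PQexecParams(). A minimal stand-alone
sketch of what a \gp-style command could do under the hood (the connection
string and query are placeholders, not part of any patch):

#include <stdio.h>
#include <libpq-fe.h>

int
main(void)
{
    const char *paramValues[2] = {"foo", "bar"};    /* the \gp 'foo' 'bar' values */
    PGconn     *conn;
    PGresult   *res;

    conn = PQconnectdb("dbname=postgres");          /* placeholder conninfo */
    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        PQfinish(conn);
        return 1;
    }

    /* Parse/Bind/Execute: query text and parameter values travel separately */
    res = PQexecParams(conn,
                       "SELECT $1::text, $2::text",
                       2,           /* number of parameters */
                       NULL,        /* let the server infer parameter types */
                       paramValues,
                       NULL, NULL,  /* text-format params need no lengths/formats */
                       0);          /* text-format result */

    if (PQresultStatus(res) == PGRES_TUPLES_OK)
        printf("%s %s\n", PQgetvalue(res, 0, 0), PQgetvalue(res, 0, 1));
    else
        fprintf(stderr, "query failed: %s", PQerrorMessage(conn));

    PQclear(res);
    PQfinish(conn);
    return 0;
}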




Re: remap the .text segment into huge pages at run time

2022-11-05 Thread Andres Freund
Hi,

On 2022-11-05 12:54:18 +0700, John Naylor wrote:
> On Sat, Nov 5, 2022 at 1:33 AM Andres Freund  wrote:
> > I hacked in a MADV_COLLAPSE (with setarch -R, so that I could just
> > hardcode the address / length), and it seems to work nicely.
> >
> > With the weird caveat that on some filesystems one needs to make sure
> > that the executable doesn't use reflinks to reuse parts of other files,
> > and that the mold linker and cp do... Not a concern on ext4, but on xfs.
> > I took to copying the postgres binary with cp --reflink=never
>
> What happens otherwise? That sounds like a difficult thing to guard against.

MADV_COLLAPSE fails, but otherwise things continue on. I think it's mostly an
issue on dev systems, not on prod systems, because there the files will be
unpacked from a package or such.


> > On 2022-11-03 10:21:23 -0700, Andres Freund wrote:
> > > > - Add a "cold" __asm__ filler function that just takes up space,
> > > > enough to push the end of the .text segment over the next aligned
> > > > boundary, or to ~8MB in size.
> > >
> > > I don't understand why this is needed - as long as the pages are
> > > aligned to 2MB, why do we need to fill things up on disk? The
> > > in-memory contents are the relevant bit, no?
> >
> > I now assume it's because you either observed the mappings set up by the
> > loader to not include the space between the segments?
>
> My knowledge is not quite that deep. The iodlr repo has an example "hello
> world" program, which links with 8 filler objects, each with 32768
> __attribute((used)) dummy functions. I just cargo-culted that idea and
> simplified it. Interestingly enough, looking through the commit history,
> they used to align the segments via linker flags, but took it out here:
>
> https://github.com/intel/iodlr/pull/25#discussion_r397787559
>
> ...saying "I'm not sure why we added this". :/

That was about using a linker script, not really linker flags though.

I don't think the dummy functions are a good approach, there were plenty of
things after it when I played with them.



> I quickly tried to align the segments with the linker and then in my patch
> have the address for mmap() rounded *down* from the .text start to the
> beginning of that segment. It refused to start without logging an error.

Hm, what linker was that? I did note that you need some additional flags for
some of the linkers.


> > With these flags the "R E" segments all start on a 0x20/2MiB boundary
> > and are padded to the next 2MiB boundary. However the OS / dynamic loader
> > only maps the necessary part, not all the zero padding.
> >
> > This means that if we were to issue a MADV_COLLAPSE, we can before it do
> > an mremap() to increase the length of the mapping.
>
> I see, interesting. What location are you passing for madvise() and
> mremap()? The beginning of the segment (for me has .init/.plt) or an
> aligned boundary within .text?

I started postgres with setarch -R, looked at /proc/$pid/[s]maps to see the
start/end of the r-xp mapped segment.  Here's my hacky code, with a bunch of
comments added.

    void *addr = (void *) 0x5580;
    void *end = (void *) 0x55e09000;
    size_t advlen = (uintptr_t) end - (uintptr_t) addr;

    /* round the length up to the next 2MiB boundary */
    const size_t bound = 1024 * 1024 * 2;
    size_t advlen_up = (advlen + bound - 1) & ~(bound - 1);
    void *r2;
    int r;

    /*
     * Increase size of mapping to cover the trailing padding to the next
     * segment. Otherwise all the code in that range can't be put into
     * a huge page (access in the non-mapped range needs to cause a fault,
     * hence can't be in the huge page).
     * XXX: Should probably assert that that space is actually zeroes.
     */
    r2 = mremap(addr, advlen, advlen_up, 0);
    if (r2 == MAP_FAILED)
        fprintf(stderr, "mremap failed: %m\n");
    else if (r2 != addr)
        fprintf(stderr, "mremap wrong addr: %m\n");
    else
        advlen = advlen_up;

    /*
     * The docs for MADV_COLLAPSE say there should be at least one page
     * in the mapped space "for every eligible hugepage-aligned/sized
     * region to be collapsed". I just forced that. But probably not
     * necessary.
     */
    r = madvise(addr, advlen, MADV_WILLNEED);
    if (r != 0)
        fprintf(stderr, "MADV_WILLNEED failed: %m\n");

    r = madvise(addr, advlen, MADV_POPULATE_READ);
    if (r != 0)
        fprintf(stderr, "MADV_POPULATE_READ failed: %m\n");

    /*
     * Make huge pages out of it. Requires at least linux 6.1.  We could
     * fall back to MADV_HUGEPAGE if it fails, but it doesn't do all that
     * much in older kernels.
     */
#define MADV_COLLAPSE 25
    r = madvise(addr, advlen, MADV_COLLAPSE);
    if (r != 0)
        fprintf(stderr, "MADV_COLLAPSE failed: %m\n");

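Instead of hardcoding the addresses as above, the mapping can be discovered
at run time. A minimal stand-alone sketch (mine, not from the patch) of
scanning /proc/self/maps for the executable's r-xp segment:

#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
    char        exe[4096] = {0};
    char        line[4096];
    FILE       *maps;

    /* path of the running binary, to match against the mapping's pathname */
    if (readlink("/proc/self/exe", exe, sizeof(exe) - 1) < 0)
        return 1;

    maps = fopen("/proc/self/maps", "r");
    if (maps == NULL)
        return 1;

    while (fgets(line, sizeof(line), maps) != NULL)
    {
        unsigned long start, end;
        char        perms[5];
        char        path[4096] = {0};

        /* format: start-end perms offset dev inode   pathname */
        if (sscanf(line, "%lx-%lx %4s %*s %*s %*s %4095s",
                   &start, &end, perms, path) < 3)
            continue;

        if (strcmp(perms, "r-xp") == 0 && strcmp(path, exe) == 0)
        {
            printf("executable text mapping: %#lx-%#lx\n", start, end);
            break;
        }
    }
    fclose(maps);
    return 0;
}
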

A real version would have to open /proc/self/maps and do this for at least
postgres' r-xp mapping. We could do it for librarie