Re: remap the .text segment into huge pages at run time

2022-11-04 Thread John Naylor
On Sat, Nov 5, 2022 at 1:33 AM Andres Freund  wrote:

> > I wonder how far we can get with just using the linker hints to align
> > sections. I know that the linux folks are working on promoting sufficiently
> > aligned executable pages to huge pages too, and might have succeeded already.
> >
> > IOW, adding the linker flags might be a good first step.
>
> Indeed, I did see that that works to some degree on the 5.19 kernel I was
> running. However, it never seems to get around to using huge pages
> sufficiently to compete with explicit use of huge pages.

Oh nice, I didn't know that! There might be some threshold of pages mapped
before it does so. At least, that issue is mentioned in that paper linked
upthread for FreeBSD.

> More interestingly, a few days ago, a new madvise hint, MADV_COLLAPSE, was
> added into linux 6.1. That explicitly remaps a region and uses huge pages for
> it. Of course that's going to take a while to be widely available, but it
> seems like a safer approach than the remapping approach from this thread.

I didn't know that either, funny timing.

> I hacked in a MADV_COLLAPSE (with setarch -R, so that I could just hardcode
> the address / length), and it seems to work nicely.
>
> With the weird caveat that on some filesystems one needs to make sure that the
> executable doesn't use reflinks to reuse parts of other files, and that the
> mold linker and cp do... Not a concern on ext4, but on xfs. I took to copying
> the postgres binary with cp --reflink=never

What happens otherwise? That sounds like a difficult thing to guard against.

> The difference in itlb.itlb_flush between pipelined / non-pipelined cases
> unsurprisingly is stark.
>
> While the pipelined case still sees a good bit reduced itlb traffic, the total
> amount of cycles in which a walk is active is just not large enough to matter,
> by the looks of it.

Good to know, thanks for testing. Maybe the pipelined case is something
devs should consider when microbenchmarking, to reduce noise from context
switches.

On Sat, Nov 5, 2022 at 4:21 AM Andres Freund  wrote:
>
> Hi,
>
> On 2022-11-03 10:21:23 -0700, Andres Freund wrote:
> > > - Add a "cold" __asm__ filler function that just takes up space, enough to
> > > push the end of the .text segment over the next aligned boundary, or to
> > > ~8MB in size.
> >
> > I don't understand why this is needed - as long as the pages are aligned to
> > 2MB, why do we need to fill things up on disk? The in-memory contents are the
> > relevant bit, no?
>
> I now assume it's because you observed the mappings set up by the
> loader to not include the space between the segments?

My knowledge is not quite that deep. The iodlr repo has an example "hello
world" program, which links with 8 filler objects, each with 32768
__attribute__((used)) dummy functions. I just cargo-culted that idea and
simplified it. Interestingly enough, looking through the commit history,
they used to align the segments via linker flags, but took it out here:

https://github.com/intel/iodlr/pull/25#discussion_r397787559

...saying "I'm not sure why we added this". :/
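For anyone unfamiliar with the iodlr trick, the filler has roughly this shape. This is a purely illustrative sketch, not the actual patch: the macro and function names are invented, and the real thing stamps out tens of thousands of these to push the end of .text past the next aligned boundary.

```c
#include <assert.h>

/* Dummy functions marked used so the compiler cannot discard them;
 * they exist only to occupy space in the .text segment. */
#define PG_TEXT_FILLER(n) \
	__attribute__((used, noinline)) void pg_text_filler_##n(void) {}

PG_TEXT_FILLER(0)
PG_TEXT_FILLER(1)
PG_TEXT_FILLER(2)
/* ...the real filler repeats this many thousands of times... */
```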

I quickly tried to align the segments with the linker and then in my patch
have the address for mmap() rounded *down* from the .text start to the
beginning of that segment. It refused to start without logging an error.

BTW, that's what I meant before, although I wasn't clear:

> > Since the front is all-cold, and there is very little at the end,
> > practically all hot pages are now remapped. The biggest problem with the
> > hackish filler function (in addition to maintainability) is, if explicit
> > huge pages are turned off in the kernel, attempting mmap() with MAP_HUGETLB
> > causes complete startup failure if the .text segment is larger than 8MB.
>
> I would expect MAP_HUGETLB to always fail if not enabled in the kernel,
> independent of the .text segment size?

With the file-level hack, it would just fail without a trace with .text >
8MB (I have yet to enable core dumps on this new OS I have...), whereas
without it I did see the failures in the log, and successful fallback.

> With these flags the "R E" segments all start on a 0x20/2MiB boundary and
> are padded to the next 2MiB boundary. However the OS / dynamic loader only
> maps the necessary part, not all the zero padding.
>
> This means that if we were to issue a MADV_COLLAPSE, we can before it do an
> mremap() to increase the length of the mapping.

I see, interesting. What location are you passing for madvise() and
mremap()? The beginning of the segment (which for me has .init/.plt) or an
aligned boundary within .text?
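The sequence being discussed (segments padded to 2MiB at link time, mremap() over the zero padding, then MADV_COLLAPSE) can be sketched as below. A hedged illustration only: the function names are invented, the addresses would really come from the ELF program headers, and MADV_COLLAPSE needs Linux 6.1+.

```c
#include <stdint.h>
#include <sys/mman.h>

/* mremap() is Linux-specific; declare it directly in case _GNU_SOURCE
 * was not set before the libc headers were pulled in. */
extern void *mremap(void *old_addr, size_t old_size, size_t new_size,
					int flags, ...);

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25		/* value from the Linux 6.1 uapi headers */
#endif

#define HUGE_2MB 0x200000UL

static uintptr_t
round_up_2mb(uintptr_t p)
{
	return (p + HUGE_2MB - 1) & ~(HUGE_2MB - 1);
}

static int
collapse_padded_segment(void *seg_start, size_t mapped_len)
{
	size_t		padded = round_up_2mb((uintptr_t) seg_start + mapped_len)
		- (uintptr_t) seg_start;

	/* Extend the mapping in place to cover the on-disk zero padding
	 * that the loader did not map. */
	if (mremap(seg_start, mapped_len, padded, 0) == MAP_FAILED)
		return -1;

	/* Now ask the kernel to promote the whole range to huge pages. */
	return madvise(seg_start, padded, MADV_COLLAPSE);
}
```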

--
John Naylor
EDB: http://www.enterprisedb.com


Re: Perform streaming logical transactions by background workers and parallel apply

2022-11-04 Thread Amit Kapila
On Fri, Nov 4, 2022 at 7:35 PM houzj.f...@fujitsu.com wrote:
>
> On Friday, November 4, 2022 4:07 PM Amit Kapila  wrote:
> >
> > On Thu, Nov 3, 2022 at 6:36 PM houzj.f...@fujitsu.com wrote:
> > >
> > > Thanks for the analysis and summary !
> > >
> > > I tried to implement the above idea and here is the patch set.
> > >
> >
> > Few comments on v42-0001
> > ===
>
> Thanks for the comments.
>
> >
> > 10.
> > + winfo->shared->stream_lock_id = parallel_apply_get_unique_id();
> > + winfo->shared->transaction_lock_id = parallel_apply_get_unique_id();
> >
> > Why can't we use xid (remote_xid) for one of these and local_xid (one
> > generated by parallel apply) for the other?
...
...
>
> I also considered using xid for these locks, but it seems the objsubid for the
> shared object lock is 16bit while xid is 32 bit. So, I tried to generate a
> unique 16bit id here.
>

Okay, I see your point. Can we think of having a new lock tag for this,
with classid, objid, and objsubid as the first three fields of the
locktag? We can use a new macro SET_LOCKTAG_APPLY_TRANSACTION and a
common function to set the tag and acquire the lock. One more point
related to this: I am suggesting classid by analogy with
SET_LOCKTAG_OBJECT, as that is used in the current patch, but do you
think we need it for our purpose? Won't subscription id and xid
uniquely identify the tag?
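The suggested tag could look roughly like this. A hedged sketch where the struct and function are illustrative stand-ins, not lock.h's actual LOCKTAG: subscription OID plus remote xid identify the lock, and a small sub-id (which fits the 16-bit objsubid) distinguishes the stream lock from the transaction lock.

```c
#include <stdint.h>

/* Illustrative stand-in for the generic LOCKTAG field layout. */
typedef struct ApplyLockTagSketch
{
	uint32_t	db_oid;			/* database OID (locktag_field1) */
	uint32_t	sub_oid;		/* subscription OID (locktag_field2) */
	uint32_t	remote_xid;		/* remote transaction id (locktag_field3) */
	uint16_t	objsubid;		/* e.g. 0 = stream lock, 1 = transaction lock */
} ApplyLockTagSketch;

/* Common helper in the spirit of SET_LOCKTAG_APPLY_TRANSACTION; the
 * real thing would also set the lock method and tag type. */
static void
set_locktag_apply_transaction(ApplyLockTagSketch *tag, uint32_t db,
							  uint32_t sub, uint32_t xid, uint16_t subid)
{
	tag->db_oid = db;
	tag->sub_oid = sub;
	tag->remote_xid = xid;
	tag->objsubid = subid;
}
```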

-- 
With Regards,
Amit Kapila.




Re: proposal: possibility to read dumped table's name from file

2022-11-04 Thread Julien Rouhaud
On Fri, Nov 04, 2022 at 02:37:05PM +0100, Pavel Stehule wrote:
> On Fri, Nov 4, 2022 at 14:28, Julien Rouhaud  wrote:
>
> > >
> > > But one thing I noticed is that "optarg" looks wrong here:
> > >
> > > simple_string_list_append(>triggerNames, optarg);
> >
> > Ah indeed, good catch!  Maybe there should be an explicit test for every
> > (include|exclude) / objtype combination?  It would be a bit verbose (and
> > possibly hard to maintain).
> >
>
> I'll do it

Thanks a lot Pavel!  I switched the CF entry back to "Waiting on Author" in the
meantime.




Re: pg_reload_conf() synchronously

2022-11-04 Thread Michael Paquier
On Fri, Nov 04, 2022 at 09:51:08PM -0700, Andres Freund wrote:
> On 2022-11-04 23:35:21 -0400, Tom Lane wrote:
>> True, but do we have any such cases?
> 
> I know I hit it twice and gave up on the tests.

Still, is there any need for a full-blown facility for the case of an
individual backend?  Using a new function to track that all the
changes are in effect is useful for isolation tests, but less so for
single sessions, where a function to wait for all the backends is no
different from a \c to force a reload: both tests would need an extra
step (on top of making parallel tests longer if something runs a long
transaction).

As far as I know, Gurjeet has been annoyed only with non-user-settable
GUCs for one connection (correct me, of course); there was nothing
fancy with isolation tests yet.  Not saying that this is useless for
isolation tests, as this would have its use cases where multiple GUCs
need to be synced across multiple sessions, but it seems to me that
we have two different cases in need of two different solutions.
--
Michael




Re: pg_reload_conf() synchronously

2022-11-04 Thread Andres Freund
Hi,

On 2022-11-04 23:35:21 -0400, Tom Lane wrote:
Andres Freund  writes:
> > Worth noting that this doesn't necessarily suffice to avoid race conditions in
> > tests, if the test depends on *other* backends having seen the configuration
> > changes.
> 
> True, but do we have any such cases?

I know I hit it twice and gave up on the tests.


> > It might be worth using the global barrier mechanism to count which backends
> > have reloaded configuration and to provide a function / option to pg_sleep
> > that waits for that.
> 
> That ... seems like a lot of mechanism.  And it could easily result
> in undetected deadlocks, if any other session is blocked on you.
> I seriously doubt that we should go there.

Yea, it's not great. Probably ok enough for a test, but ...

Greetings,

Andres Freund




Re: pg_reload_conf() synchronously

2022-11-04 Thread Tom Lane
Andres Freund  writes:
> Worth noting that this doesn't necessarily suffice to avoid race conditions in
> tests, if the test depends on *other* backends having seen the configuration
> changes.

True, but do we have any such cases?

> It might be worth using the global barrier mechanism to count which backends
> have reloaded configuration and to provide a function / option to pg_sleep
> that waits for that.

That ... seems like a lot of mechanism.  And it could easily result
in undetected deadlocks, if any other session is blocked on you.
I seriously doubt that we should go there.

regards, tom lane




Re: Incorrect include file order in guc-file.l

2022-11-04 Thread Michael Paquier
On Fri, Nov 04, 2022 at 07:55:20AM +0700, John Naylor wrote:
> I've removed it.

Thanks.

Aha, there were three more of these, in rewriteheap.c, copydir.c
and pgtz.c, that I also forgot to clean up in bfb9dfd..
--
Michael




Re: pg_reload_conf() synchronously

2022-11-04 Thread Andres Freund
Hi,

On 2022-11-04 10:26:38 -0700, Gurjeet Singh wrote:
> The attached patch makes the pg_reload_conf() function set
> ConfigReloadPending to true, which will force the postgres main
> command-processing loop to process and apply config changes _before_
> executing the next command.

Worth noting that this doesn't necessarily suffice to avoid race conditions in
tests, if the test depends on *other* backends having seen the configuration
changes.
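The pattern Gurjeet's patch uses can be sketched as below, a hedged illustration with invented loop and counter names: pg_reload_conf() only raises a flag, and the main command loop applies the new configuration before executing the next command. As noted, this makes the reload synchronous for the local backend only; other backends may still lag.

```c
#include <signal.h>

static volatile sig_atomic_t ConfigReloadPending = 0;
static int	configs_applied = 0;

static void
pg_reload_conf_sketch(void)
{
	/* the real function also signals the postmaster to broadcast SIGHUP */
	ConfigReloadPending = 1;
}

static void
main_loop_iteration(void)
{
	/* apply pending config changes before executing the next command */
	if (ConfigReloadPending)
	{
		ConfigReloadPending = 0;
		configs_applied++;		/* stands in for ProcessConfigFile() */
	}
	/* ... read and execute the next command ... */
}
```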

It might be worth using the global barrier mechanism to count which backends
have reloaded configuration and to provide a function / option to pg_sleep
that waits for that.

Greetings,

Andres Freund




Re: Add tracking of backend memory allocated to pg_stat_activity

2022-11-04 Thread Andres Freund
Hi,

On 2022-11-04 08:56:13 -0400, Reid Thompson wrote:
> From a8de5d29c0c6f10962181926a49ad4fec1e52bd1 Mon Sep 17 00:00:00 2001
> From: Reid Thompson 
> Date: Thu, 11 Aug 2022 12:01:25 -0400
> Subject: [PATCH] Add tracking of backend memory allocated to pg_stat_activity
> 
> This new field displays the current bytes of memory allocated to the
> backend process. It is updated as memory for the process is
> malloc'd/free'd. Memory allocated to items on the freelist is included in
> the displayed value.  Dynamic shared memory allocations are included
> only in the value displayed for the backend that created them, they are
> not included in the value for backends that are attached to them to
> avoid double counting. On occasion, orphaned memory segments may be
> cleaned up on postmaster startup. This may result in decreasing the sum
> without a prior increment. We limit the floor of backend_mem_allocated
> to zero. Updated pg_stat_activity documentation for the new column.

I'm not convinced that counting DSM values this way is quite right. There are
a few uses of DSMs that track shared resources, with the biggest likely being
the stats for relations etc.  I suspect that tracking that via backend memory
usage will be quite confusing, because fairly random backends that had to grow
the shared state end up being blamed with the memory usage in perpetuity - and
then suddenly that memory usage will vanish when that backend exits, despite
the memory continuing to exist.



> @@ -734,6 +747,7 @@ AllocSetAlloc(MemoryContext context, Size size)
>   return NULL;
>  
>   context->mem_allocated += blksize;
> + pgstat_report_backend_allocated_bytes_increase(blksize);
>  
>   block->aset = set;
>   block->freeptr = block->endptr = ((char *) block) + blksize;
> @@ -944,6 +958,7 @@ AllocSetAlloc(MemoryContext context, Size size)
>   return NULL;
>  
>   context->mem_allocated += blksize;
> + pgstat_report_backend_allocated_bytes_increase(blksize);
>  
>   block->aset = set;
>   block->freeptr = ((char *) block) + ALLOC_BLOCKHDRSZ;
> @@ -1043,6 +1058,7 @@ AllocSetFree(void *pointer)
>   block->next->prev = block->prev;
>  
>   set->header.mem_allocated -= block->endptr - ((char *) block);
> + pgstat_report_backend_allocated_bytes_decrease(block->endptr - ((char *) block));
>  
>  #ifdef CLOBBER_FREED_MEMORY
>   wipe_mem(block, block->freeptr - ((char *) block));

I suspect this will be noticeable cost-wise. Even though these paths aren't the
hottest memory allocation paths, by nature of going down into malloc, adding
an external function call that then does a bunch of branching etc. seems
likely to add up to something.  See below for how I think we can deal with that...


> +
> +/* 
> + * pgstat_report_backend_allocated_bytes_increase() -
> + *
> + * Called to report increase in memory allocated for this backend
> + * 
> + */
> +void
> +pgstat_report_backend_allocated_bytes_increase(uint64 allocation)
> +{
> + volatile PgBackendStatus *beentry = MyBEEntry;
> +
> + if (!beentry || !pgstat_track_activities)
> + {
> + /*
> +  * Account for memory before pgstats is initialized. This will be
> +  * migrated to pgstats on initialization.
> +  */
> + backend_allocated_bytes += allocation;
> +
> + return;
> + }
> +
> + /*
> +  * Update my status entry, following the protocol of bumping
> +  * st_changecount before and after.  We use a volatile pointer here to
> +  * ensure the compiler doesn't try to get cute.
> +  */
> + PGSTAT_BEGIN_WRITE_ACTIVITY(beentry);
> + beentry->backend_allocated_bytes += allocation;
> + PGSTAT_END_WRITE_ACTIVITY(beentry);
> +}

This is quite a few branches, including write/read barriers.

It doesn't really make sense to use the PGSTAT_BEGIN_WRITE_ACTIVITY() pattern
here - you're just updating a single value, there's nothing to be gained by
it. The point of PGSTAT_BEGIN_*ACTIVITY() stuff is to allow readers to get a
consistent view of multiple values - but there aren't multiple values here!


To avoid the overhead of checking (!beentry || !pgstat_track_activities) and
the external function call, I think you'd be best off copying the trickery I
introduced for pgstat_report_wait_start(), in 225a22b19ed.

I.e. make pgstat_report_backend_allocated_bytes_increase() a static inline
function that unconditionally updates something like
*my_backend_allocated_memory.  To deal with the case of (!beentry ||
!pgstat_track_activities), that variable initially points to some backend
local state and is set to the shared state in pgstat_bestart().

This additionally has the nice benefit that you can track memory usage from
before pgstat_bestart(), it'll be in the local variable.
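The trick suggested here can be sketched as below, modeled on pgstat_report_wait_start() (commit 225a22b19ed) but with all names invented for illustration: the reporting functions are branch-free static inlines that bump whatever the pointer targets; before shared stats exist the pointer targets a backend-local counter, and a pgstat_bestart()-time hook repoints it at the shared entry after migrating the early allocations.

```c
#include <stdint.h>

static uint64_t local_allocated_bytes = 0;

/* Initially points at backend-local storage; no NULL check ever needed. */
static uint64_t *my_allocated_bytes = &local_allocated_bytes;

static inline void
pgstat_report_allocated_bytes_increase(uint64_t n)
{
	*my_allocated_bytes += n;	/* no branches, no barriers */
}

static inline void
pgstat_report_allocated_bytes_decrease(uint64_t n)
{
	*my_allocated_bytes -= n;
}

/* Called from pgstat_bestart(): switch to the shared slot, migrating
 * whatever was accounted before stats initialization. */
static void
pgstat_attach_allocated_bytes(uint64_t *shared_slot)
{
	*shared_slot = *my_allocated_bytes;
	my_allocated_bytes = shared_slot;
}
```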


> +void
> 

Re: CI and test improvements

2022-11-04 Thread Andres Freund
Hi,

On 2022-11-04 18:54:12 -0500, Justin Pryzby wrote:
> Subject: [PATCH 1/8] meson: PROVE is not required
> Subject: [PATCH 3/8] meson: rename 'main' tasks to 'regress' and 'isolation'

Pushed, thanks for the patches.


> Subject: [PATCH 2/8] meson: other fixes for cygwin
>
> XXX: what about HAVE_BUGGY_STRTOF ?

What are you wondering about here? Shouldn't that continue to be set via
src/include/port/cygwin.h?


> diff --git a/src/test/regress/meson.build b/src/test/regress/meson.build
> index 3dcfc11278f..6ec3c77af53 100644
> --- a/src/test/regress/meson.build
> +++ b/src/test/regress/meson.build
> @@ -10,7 +10,7 @@ regress_sources = pg_regress_c + files(
>  # patterns like ".*-.*-mingw.*". We probably can do better, but for now just
>  # replace 'gcc' with 'mingw' on windows.
>  host_tuple_cc = cc.get_id()
> -if host_system == 'windows' and host_tuple_cc == 'gcc'
> +if host_system in ['windows', 'cygwin'] and host_tuple_cc == 'gcc'
>host_tuple_cc = 'mingw'
>  endif

This doesn't quite seem right - shouldn't it say cygwin? Not that it makes a
difference right now, given the contents of resultmap:
float4:out:.*-.*-cygwin.*=float4-misrounded-input.out
float4:out:.*-.*-mingw.*=float4-misrounded-input.out


> From 0acbbd2fdd97bbafc5c4552e26f92d52c597eea9 Mon Sep 17 00:00:00 2001
> From: Justin Pryzby 
> Date: Wed, 25 May 2022 21:53:22 -0500
> Subject: [PATCH 4/8] cirrus/windows: add compiler_warnings_script
>
> I'm not sure how to write this test in windows shell; it's also not easy to
> write it in posix sh, since windows shell is somehow interpreting && and ||...
>
> https://www.postgresql.org/message-id/20220212212310.f645c6vw3njkgxka%40alap3.anarazel.de
>
> See also:
> 8a1ce5e54f6d144e4f8e19af7c767b026ee0c956
> https://cirrus-ci.com/task/6241060062494720
> https://cirrus-ci.com/task/6496366607204352
>
> ci-os-only: windows
> ---
>  .cirrus.yml| 10 +-
>  src/tools/ci/windows-compiler-warnings | 24 
>  2 files changed, 33 insertions(+), 1 deletion(-)
>  create mode 100755 src/tools/ci/windows-compiler-warnings
>
> diff --git a/.cirrus.yml b/.cirrus.yml
> index 9f2282471a9..99ac09dc679 100644
> --- a/.cirrus.yml
> +++ b/.cirrus.yml
> @@ -451,12 +451,20 @@ task:
>
>build_script: |
>  vcvarsall x64
> -ninja -C build
> +ninja -C build |tee build/meson-logs/build.txt
> +REM Since pipes lose exit status of the preceding command, rerun compilation,
> +REM without the pipe exiting now if it fails, rather than trying to run checks
> +ninja -C build > nul

This seems mighty grotty :(. but I guess it's quick enough not worry about,
and I can't come up with a better plan.

It doesn't seem quite right to redirect into meson-logs/ to me, my
interpretation is that that's "meson's namespace".  Why not just store it in
build/?



> From e85fe83fd8a4b4c79a96d2bf66cd6a5e1bdfcd1e Mon Sep 17 00:00:00 2001
> From: Justin Pryzby 
> Date: Sat, 26 Feb 2022 19:34:35 -0600
> Subject: [PATCH 5/8] cirrus: build docs as a separate task..
>
> This will run the doc build if any docs have changed, even if Linux
> fails, to allow catching doc build failures.
>
> This'll automatically show up as a separate "column" on cfbot.
>
> Also, in the future, this will hopefully upload each patch's changed HTML docs
> as an artifact, for easy review.
>
> Note that this is currently building docs with both autoconf and meson.
>
> ci-os-only: html
> ---
>  .cirrus.yml | 62 +
>  1 file changed, 44 insertions(+), 18 deletions(-)
>
> diff --git a/.cirrus.yml b/.cirrus.yml
> index 99ac09dc679..37fd79e5b77 100644
> --- a/.cirrus.yml
> +++ b/.cirrus.yml
> @@ -472,6 +472,9 @@ task:
>type: text/plain
>
>
> +###
> +# Test that code can be built with gcc/clang without warnings
> +###
>  task:
>name: CompilerWarnings
>
> @@ -515,10 +518,6 @@ task:
>  #apt-get update
>  #DEBIAN_FRONTEND=noninteractive apt-get -y install ...
>
> -  ###
> -  # Test that code can be built with gcc/clang without warnings
> -  ###
> -

Why remove this?


>setup_script: echo "COPT=-Werror" > src/Makefile.custom
>
># Trace probes have a history of getting accidentally broken. Use the
> @@ -580,20 +579,6 @@ task:
>make -s -j${BUILD_JOBS} clean
>time make -s -j${BUILD_JOBS} world-bin
>
> -  ###
> -  # Verify docs can be built
> -  ###
> -  # XXX: Only do this if there have been changes in doc/ since last build
> -  always:
> -docs_build_script: |
> -  time ./configure \
> ---cache gcc.cache \
> -CC="ccache gcc" \
> -CXX="ccache g++" \
> -CLANG="ccache clang"
> -  make -s -j${BUILD_JOBS} clean
> -  time make -s -j${BUILD_JOBS} -C doc
> -
>###
># Verify headerscheck / cpluspluscheck succeed
>#
> @@ -617,3 +602,44 @@ task:
>
>always:
>  upload_caches: ccache
> +
> +
> +###
> +# Verify docs can be built
> +# changesInclude() will skip 

Re: Direct I/O

2022-11-04 Thread Andres Freund
Hi,

On 2022-11-04 14:47:31 -0500, Jim Nasby wrote:
> On 11/1/22 2:36 AM, Thomas Munro wrote:
> > Here is a patch to allow PostgreSQL to use $SUBJECT.  It is from the
> 
> This is exciting to see! There's two other items to add to the TODO list
> before this would be ready for production:
> 
> 1) work_mem. This is a significant impediment to scaling shared buffers the
> way you'd want to.

I don't really think that's closely enough related to tackle together. Yes,
it'd be easier to set a large s_b if we had better work_mem management, but
it's a completely distinct problem, and in a lot of cases you could use DIO
without tackling the work_mem issue.


> 2) Clock sweep. Specifically, currently the only thing that drives
> usage_count is individual backends running the clock hand. On large systems
> with 75% of memory going to shared_buffers, that becomes a very significant
> problem, especially when the backend running the clock sweep is doing so in
> order to perform an operation like a b-tree page split. I suspect it
> shouldn't be too hard to deal with this issue by just having bgwriter or
> another bgworker proactively ensuring some reasonable number of buffers with
> usage_count=0 exist.

I agree this isn't great, but I don't think the replacement efficiency is that
big a problem. Replacing the wrong buffers is a bigger issue.

I've run tests with s_b=768GB (IIRC) without it showing up as a major
issue. If you have an extreme replacement rate at such a large s_b you have a
lot of other problems.

I don't want to discourage anybody from tackling the clock replacement issues,
the contrary, but AIO+DIO can show significant wins without those
changes. It's already a humongous project...
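For readers following the usage_count discussion above, the clock-sweep policy in question reduces to roughly this. A minimal sketch with illustrative names, not PostgreSQL's freelist.c: each pass of the hand decrements usage_count and evicts the first buffer that reaches zero, which is why a single backend (say, mid page split) can end up paying for a long sweep, and why a background worker keeping a pool of usage_count == 0 buffers would help.

```c
#include <stddef.h>

#define NBUFFERS 8

typedef struct
{
	int			usage_count;
} BufferSketch;

static BufferSketch buffers[NBUFFERS];
static size_t clock_hand = 0;

static size_t
clock_sweep_victim(void)
{
	for (;;)
	{
		BufferSketch *b = &buffers[clock_hand];
		size_t		victim = clock_hand;

		clock_hand = (clock_hand + 1) % NBUFFERS;
		if (b->usage_count == 0)
			return victim;		/* evict this one */
		b->usage_count--;		/* recently used: give it another pass */
	}
}
```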


> One other thing to be aware of: overflowing as SLRU becomes a massive
> problem if there isn't a filesystem backing the SLRU. Obviously only an
> issue if you try and apply DIO to SLRU files.

Which would be a very bad idea for now. Thomas does have a patch for moving
them into the main buffer pool.

Greetings,

Andres Freund




Re: CI and test improvements

2022-11-04 Thread Justin Pryzby
On Sat, Sep 10, 2022 at 03:05:42PM -0500, Justin Pryzby wrote:
> On Sun, Aug 28, 2022 at 02:28:02PM -0700, Andres Freund wrote:
> > On 2022-08-28 12:10:29 -0500, Justin Pryzby wrote:
> > > On Sun, Aug 28, 2022 at 09:07:52AM -0700, Andres Freund wrote:
> > > > > --- /dev/null
> > > > > +++ b/src/tools/ci/windows-compiler-warnings
> > > >
> > > > Wouldn't that be doable as something like
> > > > sh -c 'if test -s file; then cat file;exit 1; fi"
> > > > inside .cirrus.yml?
> > >
> > > I had written it inline in a couple ways, like
> > > - sh -exc 'f=msbuild.warn.log; if [ -s "$f" ]; then cat "$f"; exit 1; else exit 0; fi'
> > >
> > > but then separated it out as you suggested in
> > > 20220227010908.vz2a7dmfzgwg7...@alap3.anarazel.de
> > >
> > > after I complained about cmd.exe requiring escaping for && and ||
> > > That makes writing any shell script a bit perilous and a separate script
> > > seems better.
> > 
> > I remember that I suggested it - but note that the way I wrote above doesn't
> > have anything needing escaping. 
> 
> It doesn't require it, but that still gives the impression that it's
> normally possible to write one-liner shell scripts there, which is
> misleading/wrong, and the reason why I took your suggestion to use a
> separate script file.
> 
> > Anyway, what do you think of the multiline split I suggested?
> 
> Done, and sorted.

Rewrote this and rebased some of the other stuff on top of the meson
commit, for which I also include some new patches.
>From 3825287f88ad03ed3335ab6c7ea1ca2b95d88356 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Fri, 30 Sep 2022 08:56:07 -0500
Subject: [PATCH 1/8] meson: PROVE is not required

It ought to be possible to build the application without running tests,
or without running TAP tests.  And it may be needed to support
cygwin/buildfarm, where TAP tests consistently fail..

See: 20221021034040.gt16...@telsasoft.com
---
 meson.build | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/meson.build b/meson.build
index bfacbdc0af6..ce2f223a409 100644
--- a/meson.build
+++ b/meson.build
@@ -323,7 +323,7 @@ python = find_program(get_option('PYTHON'), required: true, native: true)
 flex = find_program(get_option('FLEX'), native: true, version: '>= 2.5.35')
 bison = find_program(get_option('BISON'), native: true, version: '>= 2.3')
 sed = find_program(get_option('SED'), 'sed', native: true)
-prove = find_program(get_option('PROVE'), native: true)
+prove = find_program(get_option('PROVE'), native: true, required: false)
 tar = find_program(get_option('TAR'), native: true)
 gzip = find_program(get_option('GZIP'), native: true)
 program_lz4 = find_program(get_option('LZ4'), native: true, required: false)
-- 
2.25.1

>From a994127b0361e2bc3ff59919176ec5f81faa426e Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Fri, 30 Sep 2022 13:39:43 -0500
Subject: [PATCH 2/8] meson: other fixes for cygwin

XXX: what about HAVE_BUGGY_STRTOF ?

See: 20221021034040.gt16...@telsasoft.com
---
 meson.build  | 8 ++--
 src/port/meson.build | 4 
 src/test/regress/meson.build | 2 +-
 3 files changed, 11 insertions(+), 3 deletions(-)

diff --git a/meson.build b/meson.build
index ce2f223a409..ed24370672a 100644
--- a/meson.build
+++ b/meson.build
@@ -211,6 +211,10 @@ if host_system == 'aix'
 
 elif host_system == 'cygwin'
   cppflags += '-D_GNU_SOURCE'
+  dlsuffix = '.dll'
+  mod_link_args_fmt = ['@0@']
+  mod_link_with_name = 'lib@0@.exe.a'
+  mod_link_with_dir = 'libdir'
 
 elif host_system == 'darwin'
   dlsuffix = '.dylib'
@@ -2301,8 +2305,8 @@ gnugetopt_dep = cc.find_library('gnugetopt', required: false)
 #   (i.e., allow '-' as a flag character), so use our version on those platforms
 # - We want to use system's getopt_long() only if the system provides struct
 #   option
-always_replace_getopt = host_system in ['windows', 'openbsd', 'solaris']
-always_replace_getopt_long = host_system == 'windows' or not cdata.has('HAVE_STRUCT_OPTION')
+always_replace_getopt = host_system in ['windows', 'cygwin', 'openbsd', 'solaris']
+always_replace_getopt_long = host_system in ['windows', 'cygwin'] or not cdata.has('HAVE_STRUCT_OPTION')
 
 # Required on BSDs
 execinfo_dep = cc.find_library('execinfo', required: false)
diff --git a/src/port/meson.build b/src/port/meson.build
index c696f1b..0ba83cc7930 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -40,6 +40,10 @@ if host_system == 'windows'
 'win32setlocale.c',
 'win32stat.c',
   )
+elif host_system == 'cygwin'
+  pgport_sources += files(
+'dirmod.c',
+  )
 endif
 
 if cc.get_id() == 'msvc'
diff --git a/src/test/regress/meson.build b/src/test/regress/meson.build
index 3dcfc11278f..6ec3c77af53 100644
--- a/src/test/regress/meson.build
+++ b/src/test/regress/meson.build
@@ -10,7 +10,7 @@ regress_sources = pg_regress_c + files(
 # patterns like ".*-.*-mingw.*". We probably can do better, but for now just
 # replace 'gcc' with 'mingw' on windows.
 

Re: PL/pgSQL cursors should get generated portal names by default

2022-11-04 Thread Tom Lane
Jan Wieck  writes:
> I need to do some testing on this. I seem to recall that the naming was 
> originally done because a reference cursor is basically a named cursor 
> that can be handed around between functions and even the top SQL level 
> of the application. For the latter to work the application needs to know 
> the name of the portal.

Right.  With this patch, it'd be necessary to hand back the actual
portal name (by returning the refcursor value), or else manually
set the refcursor value before OPEN to preserve the previous behavior.
But as far as I saw, all our documentation examples show handing back
the portal name, so I'm hoping most people do it like that already.

> I am currently down with Covid and have trouble focusing. But I hope to 
> get to it some time next week.

Get well soon!

regards, tom lane




Re: PL/pgSQL cursors should get generated portal names by default

2022-11-04 Thread Jan Wieck

On 11/4/22 03:22, Pavel Stehule wrote:

Hi


On Wed, Nov 2, 2022 at 0:39, Tom Lane  wrote:


There's a complaint at [1] about how you can't re-use the same
cursor variable name within a routine called from another routine
that's already using that name.  The complaint is itself a bit
under-documented, but I believe it is referring to this ancient
bit of behavior:

          A bound cursor variable is initialized to the string value
          representing its name, so that the portal name is the same as
          the cursor variable name, unless the programmer overrides it
          by assignment before opening the cursor.

So if you try to nest usage of two bound cursor variables of the
same name, it blows up on the portal-name conflict.  But it'll work
fine if you use unbound cursors (i.e., plain "refcursor" variables):

          But an unbound cursor
          variable defaults to the null value initially, so it will receive
          an automatically-generated unique name, unless overridden.

I wonder why we did it like that; maybe it's to be bug-compatible with
some Oracle PL/SQL behavior or other?  Anyway, this seems non-orthogonal
and contrary to all principles of structured programming.  We don't even
offer an example of the sort of usage that would benefit from it, ie
that calling code could "just know" what the portal name is.

I propose that we should drop this auto initialization and let all
refcursor variables start out null, so that they'll get unique
portal names unless you take explicit steps to do something else.
As attached.

(Obviously this would be a HEAD-only fix, but maybe there's scope for
improving the back-branch docs along lines similar to these changes.)


I am sending a review of this patch:

1. Patching and compilation work without any problems
2. All tests passed
3. The implemented change is documented well
4. Although this is a potential compatibility break, we want this feature.
It allows using cursor variables in recursive calls by default, and it
allows shadowing of cursor variables

5. This patch is short and almost trivial, just removing code.

I'll mark this patch as ready for committer


I need to do some testing on this. I seem to recall that the naming was 
originally done because a reference cursor is basically a named cursor 
that can be handed around between functions and even the top SQL level 
of the application. For the latter to work the application needs to know 
the name of the portal.


I am currently down with Covid and have trouble focusing. But I hope to 
get to it some time next week.



Regards, Jan





Re: WIN32 pg_import_system_collations

2022-11-04 Thread Juan José Santamaría Flecha
On Mon, Oct 31, 2022 at 3:09 PM Peter Eisentraut <peter.eisentr...@enterprisedb.com> wrote:

Thanks for taking a look into this patch.

>
> Consider this function you are introducing:
>
> +/*
> + * Create a collation if the input locale is valid for so.
> + * Also keeps track of the number of valid locales and collations created.
> + */
> +static int
> +CollationFromLocale(char *isolocale, char *localebuf, int *nvalid,
> +   int *ncreated, int nspid)
>
> This declaration is incomprehensible without studying all the callers
> and the surrounding code.
>
> Start with the name: What does "collation from locale" mean?  Does it
> make a collation?  Does it convert one?  Does it find one?  There should
> be a verb in there.
>
> (I think in the context of this file, a lower case name would be more
> appropriate for a static function.)
>
> Then the arguments.  The input arguments should be "const".  All the
> arguments should be documented.  What is "isolocale", what is
> "localebuf", how are they different?  What is being counted by "valid"
> (collations?, locales?), and what makes a thing valid and invalid?  What
> is being "created"?  What is nspid?  What is the return value?
>
> Please make another pass over this.

Ok, I can definitely improve the comments for that function.


> Also consider describing in the commit message what you are doing in
> more detail, including some of the things that have been discussed in
> this thread.
>
Going through the thread for the commit message, I think that maybe the
collation naming remarks were not properly addressed. In the current
version the collations retain their native name, but an alias is created
for those whose names have a shape from which we can assume a POSIX
equivalent exists.

Please find attached a new version.

Regards,

Juan José Santamaría Flecha


v7-0002-Add-collate.windows.win1252-test.patch
Description: Binary data


v7-0001-WIN32-pg_import_system_collations.patch
Description: Binary data


Re: Refactor to introduce pg_strcoll().

2022-11-04 Thread Jeff Davis
On Tue, 2022-11-01 at 13:36 +0100, Peter Eisentraut wrote:
> But I think putting the Windows UTF-8 code (win32_utf8_wcscoll())
> from 
> varstr_cmp() into pg_strcoll() might create future complications. 
> Normally, it would be up to varstr_sortsupport() to set up a 
> Windows/UTF-8 specific comparator function, but it says there it's
> not 
> implemented.  But someone who wanted to implement that would have to 
> lift it back out of pg_strcoll, or rearrange the code in some other
> way.

Is there a reason that it wouldn't work to just use
varlenafastcmp_locale() after my patch? The comment says:

  /*  
   * There is a further exception on Windows.  When the database  
   * encoding is UTF-8 and we are not using the C collation, complex  
   * hacks are required... 

But I think the complex hacks are just the transformation into UTF 16
and calling of wcscoll(). If that's done from within pg_strcoll()
(after my patch), then I think it just works, right?

I can't easily test on windows, so perhaps I'm missing something. Does
the buildfarm have enough coverage here? Is it reasonable to try a
commit that removes the:

  #ifdef WIN32
if (GetDatabaseEncoding() == PG_UTF8 &&
!(locale && locale->provider == COLLPROVIDER_ICU))
return;
  #endif

along with my patch and see what I get out of the buildfarm?

> Perhaps you have some follow-up work 
> planned, in which case it might be better to consider this in more 
> context. 

I was also considering an optimization to use stack allocation for
short strings when doing the non-UTF8 icu comparison. I am seeing some
benefit there, but it seems to depend on the size of the stack buffer.
That suggests that maybe TEXTBUFSIZE is too large (1024) and perhaps we
should drop it down, but I need to investigate more.

In general, I'm trying to slowly get the locale stuff more isolated.

Regards,
Jeff Davis




Re: remap the .text segment into huge pages at run time

2022-11-04 Thread Andres Freund
Hi,

On 2022-11-03 10:21:23 -0700, Andres Freund wrote:
> > - Add a "cold" __asm__ filler function that just takes up space, enough to
> > push the end of the .text segment over the next aligned boundary, or to
> > ~8MB in size.
>
> I don't understand why this is needed - as long as the pages are aligned to
> 2MB, why do we need to fill things up on disk? The in-memory contents are the
> relevant bit, no?

I now assume it's because you observed that the mappings set up by the
loader don't include the space between the segments?

With the right linker flags the segments are sufficiently aligned both on
disk and in memory to just map more:

bfd: -Wl,-zmax-page-size=0x200000,-zcommon-page-size=0x200000
  Type   Offset VirtAddr   PhysAddr
 FileSizMemSiz  Flags  Align
...
  LOAD   0x00000000 0x00000000 0x00000000
 0x000c7f58 0x000c7f58  R  0x200000
  LOAD   0x00200000 0x00200000 0x00200000
 0x00921d39 0x00921d39  R E0x200000
  LOAD   0x00c00000 0x00c00000 0x00c00000
 0x002626b8 0x002626b8  R  0x200000
  LOAD   0x00fdf510 0x011df510 0x011df510
 0x00037fd6 0x0006a310  RW 0x200000

gold -Wl,-zmax-page-size=0x200000,-zcommon-page-size=0x200000,--rosegment
  Type   Offset VirtAddr   PhysAddr
 FileSizMemSiz  Flags  Align
...
  LOAD   0x00000000 0x00000000 0x00000000
 0x009230f9 0x009230f9  R E0x200000
  LOAD   0x00a00000 0x00a00000 0x00a00000
 0x0033a738 0x0033a738  R  0x200000
  LOAD   0x00ddf4e0 0x00fdf4e0 0x00fdf4e0
 0x0003800a 0x0006a340  RW 0x200000

lld: -Wl,-zmax-page-size=0x200000,-zseparate-loadable-segments
  LOAD   0x00000000 0x00000000 0x00000000
 0x0033710c 0x0033710c  R  0x200000
  LOAD   0x00400000 0x00400000 0x00400000
 0x00921cb0 0x00921cb0  R E0x200000
  LOAD   0x00e00000 0x00e00000 0x00e00000
 0x00020ae0 0x00020ae0  RW 0x200000
  LOAD   0x01000000 0x01000000 0x01000000
 0x000174ea 0x00049820  RW 0x200000

mold 
-Wl,-zmax-page-size=0x200000,-zcommon-page-size=0x200000,-zseparate-loadable-segments
  Type   Offset VirtAddr   PhysAddr
 FileSizMemSiz  Flags  Align
...
  LOAD   0x00000000 0x00000000 0x00000000
 0x0032dde9 0x0032dde9  R  0x200000
  LOAD   0x00400000 0x00400000 0x00400000
 0x00921cbe 0x00921cbe  R E0x200000
  LOAD   0x00e00000 0x00e00000 0x00e00000
 0x002174e8 0x00249820  RW 0x200000

With these flags the "R E" segments all start on a 0x200000/2MiB boundary and
are padded to the next 2MiB boundary. However the OS / dynamic loader only
maps the necessary part, not all the zero padding.

This means that before issuing MADV_COLLAPSE, we can do an mremap() to
increase the length of the mapping.


MADV_COLLAPSE without mremap:

tps = 1117335.766756 (without initial connection time)

 Performance counter stats for 'system wide':

 1,169,012,466,070  cycles  
 (55.53%)
   729,146,640,019  instructions #0.62  insn per 
cycle   (66.65%)
 7,062,923  itlb.itlb_flush 
 (66.65%)
 1,041,825,587  iTLB-loads  
 (66.65%)
   634,272,420  iTLB-load-misses #   60.88% of all iTLB 
cache accesses  (66.66%)
27,018,254,873  itlb_misses.walk_active 
 (66.68%)
   610,639,252  itlb_misses.walk_completed_4k   
 (44.47%)
24,262,549  itlb_misses.walk_completed_2m_4m
 (44.46%)
 2,948  itlb_misses.walk_completed_1g   
 (44.43%)

  10.039217004 seconds time elapsed


MADV_COLLAPSE with mremap:

tps = 1140869.853616 (without initial connection time)

 Performance counter stats for 'system wide':

 1,173,272,878,934  

Re: User functions for building SCRAM secrets

2022-11-04 Thread Jacob Champion
On Tue, Nov 1, 2022 at 4:02 PM Jacob Champion  wrote:
> I guess I have fewer problems with this use case in theory, but I'm
> wondering if better client-side support might also solve this one as
> well, without the additional complication. Is there a reason it would
> not?

To expand on this question, after giving it some more thought:

It seems to me that the use case here is extremely similar to the one
being tackled by Peter E's client-side encryption [1]. People want to
write SQL to perform a cryptographic operation using a secret, and
then send the resulting ciphertext (or in this case, a one-way hash)
to the server, but ideally the server should not actually have the
secret.

I don't think it's helpful for me to try to block progress on this
patchset behind the other one. But is there a way for me to help this
proposal skate in the same general direction? Could Peter's encryption
framework expand to fit this case in the future?

Thanks,
--Jacob

[1] 
https://www.postgresql.org/message-id/flat/89157929-c2b6-817b-6025-8e4b2d89d88f%40enterprisedb.com




Re: Commit fest 2022-11

2022-11-04 Thread Jacob Champion
On 11/3/22 22:18, Ian Lawrence Barwick wrote:
> On Fri, Nov 4, 2022 at 10:23, Ian Lawrence Barwick wrote:
>> On Fri, Nov 4, 2022 at 9:43, Justin Pryzby wrote:
>> > If I'm not wrong, Jacob used the CF app to bulk-mail people about
>> > patches not applying and similar things.  That seemed to work well, and
>> > doesn't require sending mails to dozens of threads.
>>
>> I don't see anything like that in the CF app (though I may be looking in the
>> wrong place).

I just used the "email author" checkboxes.

>> I also don't see how it would be possible to filter on patches
>> not applying in cfbot, as AFAICT the former is not aware of the latter.

That was the hard part. I ended up manually merging the two pages locally.

> Also, having gone through all the cfbot items with non-applying 
> patches (single red "X"), sending a reminder without checking
> further doesn't seem the right thing to do - in two cases the patch
> was not applying because it had already been committed, and with
> another the consensus was to return it with feedback. With others,
> it's obvious the threads were recently active and I don't think a
> reminder is necessary right now.

True. One nice thing about the author-only email is that, as long as you
don't send reminders too often (I think there'd been talk before of
once, maximum twice, per CF?) then if an author feels there's no reason
to take action, they don't have to. That low-effort strategy also scales
a bit better than making a CFM scan manually, and it's a bit closer in
my opinion to the automated reminder feature that's been requested
frequently.

> There is an option for each entry to send an email from the CF app, but it 
> comes
> with a note "Please ensure that the email settings for your domain (DKIM, SPF)
> allow emails from external sources." which I fear would lead to email
> delivery issues.

I know some of my bulk emails were delivered to spam folders, so it is a
fair concern.

--Jacob




Re: Direct I/O

2022-11-04 Thread Jim Nasby

On 11/1/22 2:36 AM, Thomas Munro wrote:


Hi,

Here is a patch to allow PostgreSQL to use $SUBJECT.  It is from the


This is exciting to see! There are two other items to add to the TODO list
before this would be ready for production:


1) work_mem. This is a significant impediment to scaling shared buffers 
the way you'd want to.


2) Clock sweep. Specifically, currently the only thing that drives 
usage_count is individual backends running the clock hand. On large 
systems with 75% of memory going to shared_buffers, that becomes a very 
significant problem, especially when the backend running the clock sweep 
is doing so in order to perform an operation like a b-tree page split. I 
suspect it shouldn't be too hard to deal with this issue by just having 
bgwriter or another bgworker proactively ensuring some reasonable number 
of buffers with usage_count=0 exist.



One other thing to be aware of: overflowing an SLRU becomes a massive 
problem if there isn't a filesystem backing the SLRU. Obviously only an 
issue if you try to apply DIO to SLRU files.






Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL

2022-11-04 Thread David Christensen
On Fri, Nov 4, 2022 at 1:38 PM Justin Pryzby  wrote:
> > > As I recall, that's due to relying on "cp".  And "rsync", which
> > > shouldn't be assumed to exist by regression tests).

Will poke around other TAP tests to see if there's a more consistent
interface, what perl version we can assume and available modules, etc.
If there's not some trivial wrapper at this point so all TAP tests
could use it regardless of OS, it would definitely be good to
introduce such a method.

> > I will work on supporting the windows compatibility here. Is there some 
> > list of guidelines for what you can and can’t use? I don’t have a windows 
> > machine available to develop on.
>
> I think a lot (most?) developers here don't have a windows environment
> available, so they have been using cirrusci's tests to verify.  If you
> haven't used cirrusci directly (not via cfbot) before, start at:
> src/tools/ci/README

Thanks, good starting point.

> > Was it failing on windows? I was attempting to skip it as I recall.
>
> I don't see anything about skipping, and cirrus's logs from 2
> commitfests ago were pruned.  I looked at this patch earlier this year,
> but never got around to replacing the calls to rsync and cp.

Ah, it's skipped (not fixed) in my git repo, but never got around to
submitting that version through email.  That explains it.

David




Re: Lockless queue of waiters in LWLock

2022-11-04 Thread Andres Freund
Hi,

On 2022-11-03 14:50:11 +0400, Pavel Borisov wrote:
> Or maybe there is another explanation for now small performance
> difference around 20 connections described in [0]?
> Thoughts?

Using xadd is quite a bit cheaper than cmpxchg, and now every lock release
uses a compare-exchange, I think.

In the past I had a more complicated version of LWLockAcquire which tried to
use an xadd to acquire locks. IIRC (and this is long enough ago that I might
not) that proved to be a benefit, but I was worried about the complexity. And
just getting in the version that didn't always use a spinlock was the higher
priority.

The use of cmpxchg vs lock inc/lock add/xadd is one of the major reasons why
lwlocks are slower than a spinlock (but obviously are better under contention
nonetheless).


I have a benchmark program that starts a thread for each physical core and
just increments a counter on an atomic value.

On my dual Xeon Gold 5215 workstation:

cmpxchg:
32: throughput per thread: 0.55M/s, total: 11.02M/s
64: throughput per thread: 0.63M/s, total: 12.68M/s

lock add:
32: throughput per thread: 2.10M/s, total: 41.98M/s
64: throughput per thread: 2.12M/s, total: 42.40M/s

xadd:
32: throughput per thread: 2.10M/s, total: 41.91M/s
64: throughput per thread: 2.04M/s, total: 40.71M/s


and even when there's no contention, every thread just updating its own
cacheline:

cmpxchg:
32: throughput per thread: 88.83M/s, total: 1776.51M/s
64: throughput per thread: 96.46M/s, total: 1929.11M/s

lock add:
32: throughput per thread: 166.07M/s, total: 3321.31M/s
64: throughput per thread: 165.86M/s, total: 3317.22M/s

add (no lock):
32: throughput per thread: 530.78M/s, total: 10615.62M/s
64: throughput per thread: 531.22M/s, total: 10624.35M/s

xadd:
32: throughput per thread: 165.88M/s, total: 3317.51M/s
64: throughput per thread: 165.93M/s, total: 3318.53M/s


Greetings,

Andres Freund




Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL

2022-11-04 Thread Justin Pryzby
On Fri, Nov 04, 2022 at 09:16:29AM -0500, David Christensen wrote:
> On Nov 4, 2022, at 9:02 AM, Justin Pryzby  wrote:
> > On Fri, Nov 04, 2022 at 11:52:59AM +0900, Ian Lawrence Barwick wrote:
> >> On Tue, May 3, 2022 at 8:45, David Christensen wrote:
> >>> 
> >>> ...and pushing a couple fixups pointed out by cfbot, so here's v4.
> >> 
> >> cfbot reports the patch no longer applies [1]. As CommitFest 2022-11 is
> >> currently underway, this would be an excellent time to update the patch.
> > 
> > More important than needing to be rebased, the patch has never passed
> > its current tests on windows.
> > 
> > As I recall, that's due to relying on "cp".  And "rsync", which
> > shouldn't be assumed to exist by regression tests).
> 
> I will work on supporting the windows compatibility here. Is there some list 
> of guidelines for what you can and can’t use? I don’t have a windows machine 
> available to develop on. 

I think a lot (most?) developers here don't have a windows environment
available, so they have been using cirrusci's tests to verify.  If you
haven't used cirrusci directly (not via cfbot) before, start at:
src/tools/ci/README

There's not much assumed about the build environment, and not much more
assumed about the test environment.  Most of the portability is handled
by using C and perl.  I think there's even no assumption that "tar" is
available (except maybe for building releases).  This patch should avoid
relying on tools that aren't already required.

As a practical matter, cfbot needs to pass, not only to demonstrate that
the patch consistently passes tests, but also because if the patch were
merged while it failed tests in cfbot, it would cause every other patch
to start to fail, too.

> Was it failing on windows? I was attempting to skip it as I recall. 

I don't see anything about skipping, and cirrus's logs from 2
commitfests ago were pruned.  I looked at this patch earlier this year,
but never got around to replacing the calls to rsync and cp.

-- 
Justin




Re: remap the .text segment into huge pages at run time

2022-11-04 Thread Andres Freund
Hi,

This nerd-sniped me badly :)

On 2022-11-03 10:21:23 -0700, Andres Freund wrote:
> On 2022-11-02 13:32:37 +0700, John Naylor wrote:
> > I found an MIT-licensed library "iodlr" from Intel [3] that allows one to
> > remap the .text segment to huge pages at program start. Attached is a
> > hackish, Meson-only, "works on my machine" patchset to experiment with this
> > idea.
>
> I wonder how far we can get with just using the linker hints to align
> sections. I know that the linux folks are working on promoting sufficiently
> aligned executable pages to huge pages too, and might have succeeded already.
>
> IOW, adding the linker flags might be a good first step.

Indeed, I did see that that works to some degree on the 5.19 kernel I was
running. However, it never seems to get around to using huge pages
sufficiently to compete with explicit use of huge pages.

More interestingly, a few days ago, a new madvise hint, MADV_COLLAPSE, was
added into linux 6.1. That explicitly remaps a region and uses huge pages for
it. Of course that's going to take a while to be widely available, but it
seems like a safer approach than the remapping approach from this thread.

I hacked in a MADV_COLLAPSE (with setarch -R, so that I could just hardcode
the address / length), and it seems to work nicely.

With the weird caveat that on some filesystems one needs to make sure that
the executable doesn't use reflinks to share parts of other files, and that
the mold linker and cp do create them... Not a concern on ext4, but it is on
xfs. I took to copying the postgres binary with cp --reflink=never


FWIW, you can see the state of the page mapping in more detail with the
kernel's page-types tool

sudo /home/andres/src/kernel/tools/vm/page-types -L -p 12297 -a 
0x55800,0x56122
sudo /home/andres/src/kernel/tools/vm/page-types -f 
/srv/dev/build/m-opt/src/backend/postgres2


Perf results:

c=150;psql -f ~/tmp/prewarm.sql;perf stat -a -e 
cycles,iTLB-loads,iTLB-load-misses,itlb_misses.walk_active,itlb_misses.walk_completed_4k,itlb_misses.walk_completed_2m_4m,itlb_misses.walk_completed_1g
 pgbench -n -M prepared -S -P1 -c$c -j$c -T10

without MADV_COLLAPSE:

tps = 1038230.070771 (without initial connection time)

 Performance counter stats for 'system wide':

 1,184,344,476,152  cycles  
 (71.41%)
 2,846,146,710  iTLB-loads  
 (71.43%)
 2,021,885,782  iTLB-load-misses #   71.04% of all iTLB 
cache accesses  (71.44%)
75,633,850,933  itlb_misses.walk_active 
 (71.44%)
 2,020,962,930  itlb_misses.walk_completed_4k   
 (71.44%)
 1,213,368  itlb_misses.walk_completed_2m_4m
 (57.12%)
 2,293  itlb_misses.walk_completed_1g   
 (57.11%)

  10.064352587 seconds time elapsed



with MADV_COLLAPSE:

tps = 1113717.114278 (without initial connection time)

 Performance counter stats for 'system wide':

 1,173,049,140,611  cycles  
 (71.42%)
 1,059,224,678  iTLB-loads  
 (71.44%)
   653,603,712  iTLB-load-misses #   61.71% of all iTLB 
cache accesses  (71.44%)
26,135,902,949  itlb_misses.walk_active 
 (71.44%)
   628,314,285  itlb_misses.walk_completed_4k   
 (71.44%)
25,462,916  itlb_misses.walk_completed_2m_4m
 (57.13%)
 2,228  itlb_misses.walk_completed_1g   
 (57.13%)

Note that while the rate of itlb-misses stays roughly the same, the total
number of iTLB loads reduced substantially, and the number of cycles in which
an itlb miss was in progress is 1/3 of what it was before.


A lot of the remaining misses are from the context switches. The iTLB is
flushed on context switches, and of course pgbench -S is extremely context
switch heavy.

Comparing plain -S with 10 pipelined -S transactions (using -t 10 / -t
1 to compare the same amount of work) I get:


without MADV_COLLAPSE:

not pipelined:

tps = 1037732.722805 (without initial connection time)

 Performance counter stats for 'system wide':

 1,691,411,678,007  cycles  
 (62.48%)
 8,856,107  itlb.itlb_flush 
 (62.48%)
 4,600,041,062  iTLB-loads  
 (62.48%)
 2,598,218,236  iTLB-load-misses #   56.48% of all iTLB 
cache accesses  (62.50%)
   100,095,862,126  itlb_misses.walk_active 
  

Re: Privileges on PUBLICATION

2022-11-04 Thread Antonin Houska
Mark Dilger  wrote:

> > On Nov 4, 2022, at 12:28 AM, Antonin Houska  wrote:
> > 
> > I thought about the whole concept a bit more and I doubt if the PUBLICATION
> > privilege is the best approach. In particular, the user specified in CREATE
> > SUBSCRIPTION ... CONNECTION ... (say "subscription user") needs to have 
> > SELECT
> > privilege on the tables replicated. So if the DBA excludes some columns from
> > the publication's column list and sets the (publication) privileges in such 
> > a
> > way that the user cannot get the column values via other publications, the
> > user still can connect to the database directly and get values of the 
> > excluded
> > columns.
> > 
> > As an alternative to the publication privileges, I think that the CREATE
> > SUBSCRIPTION command could grant ACL_SELECT automatically to the 
> > subscription
> > user on the individual columns contained in the publication column list, and
> > DROP SUBSCRIPTION would revoke that privilege.
> > 
> > Of course a question is what to do if the replication user already has that
> > privilege on some columns: either the CREATE SUBSCRIPTION command should 
> > raise
> > ERROR, or we should introduce a new privilege (e.g. ACL_SELECT_PUB) for this
> > purpose, which would effectively be ACL_SELECT, but hidden from the users 
> > of
> > GRANT / REVOKE.
> > 
> > If this approach was taken, the USAGE privilege on schema would need to be
> > granted / revoked in a similar way.
> 
> When you talk about a user needing to have privileges, it sounds like you
> mean privileges on the publishing database.  But then you talk about CREATE
> SUBSCRIPTION granting privileges, which would necessarily be on the
> subscriber database.  Can you clarify what you have in mind?

Right, the privileges need to be added on the publishing side, but the user
that needs those privileges is specified on the subscription side. I didn't
think much in detail how it would work. The "subscription user" certainly
cannot connect to the publisher database and add grant the privileges to
itself. Perhaps some of the workers on the publisher side could do it on
startup.

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com




Re: [PATCH] Support % wildcard in extension upgrade filenames

2022-11-04 Thread Sandro Santilli
On Mon, Oct 31, 2022 at 01:55:05AM -0400, Regina Obe wrote:
> 
> Sandro, can you submit an updated version of this patch.
> I was testing it out and looked good first time.

Attached updated version of the patch.

> If everyone is okay with this patch, we'll go ahead and add tests and
> documentation to go with it.

Yes, please let us know if there's any chance of getting this
mechanism approved or if we should keep advertising workarounds for
this limitation (it's not just installing 100s of files, but also
still missing needed upgrade paths when doing so).


--strk;
From 8725c616c31a9bc25478c7128ea81ce753e51089 Mon Sep 17 00:00:00 2001
From: Sandro Santilli 
Date: Wed, 14 Sep 2022 11:10:10 +0200
Subject: [PATCH] Allow wildcard (%) in extension upgrade paths

A wildcard character "%" will be accepted in the
"source" side of the upgrade script and be considered
usable to upgrade any version to the "target" side.

Using wildcards needs to be explicitly requested by
extensions via a "wildcard_upgrades" setting in their
control file.
---
 src/backend/commands/extension.c | 58 
 1 file changed, 52 insertions(+), 6 deletions(-)

diff --git a/src/backend/commands/extension.c b/src/backend/commands/extension.c
index 1a62e5dac5..ea8825fcff 100644
--- a/src/backend/commands/extension.c
+++ b/src/backend/commands/extension.c
@@ -86,6 +86,7 @@ typedef struct ExtensionControlFile
 	bool		relocatable;	/* is ALTER EXTENSION SET SCHEMA supported? */
 	bool		superuser;		/* must be superuser to install? */
 	bool		trusted;		/* allow becoming superuser on the fly? */
+	bool		wildcard_upgrades;  /* allow using wildcards in upgrade scripts */
 	int			encoding;		/* encoding of the script file, or -1 */
 	List	   *requires;		/* names of prerequisite extensions */
 } ExtensionControlFile;
@@ -128,6 +129,7 @@ static void ApplyExtensionUpdates(Oid extensionOid,
   bool cascade,
   bool is_create);
 static char *read_whole_file(const char *filename, int *length);
+static bool file_exists(const char *name);
 
 
 /*
@@ -579,6 +581,14 @@ parse_extension_control_file(ExtensionControlFile *control,
 						 errmsg("parameter \"%s\" requires a Boolean value",
 										item->name)));
 		}
+		else if (strcmp(item->name, "wildcard_upgrades") == 0)
+		{
+			if (!parse_bool(item->value, &control->wildcard_upgrades))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("parameter \"%s\" requires a Boolean value",
+								item->name)));
+		}
 		else if (strcmp(item->name, "encoding") == 0)
 		{
 			control->encoding = pg_valid_server_encoding(item->value);
@@ -636,6 +646,7 @@ read_extension_control_file(const char *extname)
 	control->relocatable = false;
 	control->superuser = true;
 	control->trusted = false;
+	control->wildcard_upgrades = false;
 	control->encoding = -1;
 
 	/*
@@ -890,7 +901,15 @@ execute_extension_script(Oid extensionOid, ExtensionControlFile *control,
 	if (from_version == NULL)
 		elog(DEBUG1, "executing extension script for \"%s\" version '%s'", control->name, version);
 	else
+	{
+		if ( control->wildcard_upgrades && ! file_exists(filename) )
+		{
+			elog(DEBUG1, "extension upgrade script \"%s\" does not exist, will try wildcard", filename);
+			/* if filename does not exist, try wildcard */
+			filename = get_extension_script_filename(control, "%", version);
+		}
 		elog(DEBUG1, "executing extension script for \"%s\" update from version '%s' to '%s'", control->name, from_version, version);
+	}
 
 	/*
 	 * If installing a trusted extension on behalf of a non-superuser, become
@@ -1215,13 +1234,23 @@ identify_update_path(ExtensionControlFile *control,
 	/* Find shortest path */
 	result = find_update_path(evi_list, evi_start, evi_target, false, false);
 
-	if (result == NIL)
-		ereport(ERROR,
-				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
-				 errmsg("extension \"%s\" has no update path from version \"%s\" to version \"%s\"",
-						control->name, oldVersion, newVersion)));
+	if (result != NIL)
+		return result;
 
-	return result;
+	/* Find wildcard path, if allowed by control file */
+	if ( control->wildcard_upgrades )
+	{
+		evi_start = get_ext_ver_info("%", &evi_list);
+		result = find_update_path(evi_list, evi_start, evi_target, false, false);
+
+		if (result != NIL)
+			return result;
+	}
+
+	ereport(ERROR,
+			(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+			 errmsg("extension \"%s\" has no update path from version \"%s\" to version \"%s\"",
+					control->name, oldVersion, newVersion)));
 }
 
 /*
@@ -3392,3 +3421,20 @@ read_whole_file(const char *filename, int *length)
 	buf[*length] = '\0';
 	return buf;
 }
+
+static bool
+file_exists(const char *name)
+{
+	struct stat st;
+
+	Assert(name != NULL);
+
	if (stat(name, &st) == 0)
+		return !S_ISDIR(st.st_mode);
+	else if (!(errno == ENOENT || errno == ENOTDIR || errno == EACCES))
+		ereport(ERROR,
+(errcode_for_file_access(),
+ errmsg("could not access file \"%s\": %m", name)));
+
	return false;
}

Re: Draft back-branch release notes are up

2022-11-04 Thread Justin Pryzby
On Fri, Nov 04, 2022 at 12:47:40PM -0400, Tom Lane wrote:
> I committed draft release notes at
> 
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=bc62182f0afe6b01fec45b8d26df03fc9de4599a

+  Fix planner failure with extended statistics on partitioned tables
+  (Richard Guo, Justin Pryzby)

This can also happen with inheritance tables.

+  Add missing guards for NULL connection pointer

Maybe should be NULL or NULL ?

-- 
Justin




pg_reload_conf() synchronously

2022-11-04 Thread Gurjeet Singh
In an internal conversation it was seen that for some tests that want
to enforce a behaviour controlled by a GUC, we _need_ to perform a
sleep for an arbitrary amount of time.
Alternatively, executing the rest of the test on a new connection also
helps to get the expected behaviour. Following is a sample snippet of
such a test.

ALTER SYSTEM SET param TO value;
SELECT pg_reload_conf();
-- Either pg_sleep(0.1) or \connect here for next command to behave as expected.
ALTER ROLE ... PASSWORD '...';

This is because the SIGHUP, sent to the postmaster by this backend,
takes some time to get back to the issuing backend.

Although a pg_sleep() call works to alleviate the pain in most cases,
it does not provide any certainty. Following the Principle Of Least
Astonishment, we want a command that loads the configuration
_synchronously_, so that the subsequent commands from the client can
be confident that the requisite parameter changes have taken effect.

The attached patch makes the pg_reload_conf() function set
ConfigReloadPending to true, which will force the postgres main
command-processing loop to process and apply config changes _before_
executing the next command.

The only downside I can think of with this approach is that it _might_
cause the issuing backend to process the config file twice; once after
it has processed the current command, and another time when the SIGHUP
signal comes from the Postmaster. If the SIGHUP signal from the
Postmaster comes before the end of current command, then the issuing
backend will process the config only once, as before the patch. In any
case, this extra processing of the config seems harmless, and the
benefit outweighs the extra processing done.

The alternate methods that I considered (see branch reload_my_conf at
[2])  were to either implement the SQL command RELOAD CONFIGURATION [
FOR ALL ], or to create an overloaded version of function
pg_reload_conf(). But both those approaches had much more significant
downsides, like additional GRANTs, etc.

There have been issues identified in the past (see [1]) that may be
alleviated by this approach of applying config changes synchronously.

[1]: https://www.postgresql.org/message-id/2138662.1623460441%40sss.pgh.pa.us
[2]: https://github.com/gurjeet/postgres/tree/reload_my_conf

Best regards,
Gurjeet
http://Gurje.et


reload_conf_synchronouly.patch
Description: Binary data


Re: Privileges on PUBLICATION

2022-11-04 Thread Mark Dilger



> On Nov 4, 2022, at 12:28 AM, Antonin Houska  wrote:
> 
> I thought about the whole concept a bit more and I doubt if the PUBLICATION
> privilege is the best approach. In particular, the user specified in CREATE
> SUBSCRIPTION ... CONNECTION ... (say "subscription user") needs to have SELECT
> privilege on the tables replicated. So if the DBA excludes some columns from
> the publication's column list and sets the (publication) privileges in such a
> way that the user cannot get the column values via other publications, the
> user still can connect to the database directly and get values of the excluded
> columns.
> 
> As an alternative to the publication privileges, I think that the CREATE
> SUBSCRIPTION command could grant ACL_SELECT automatically to the subscription
> user on the individual columns contained in the publication column list, and
> DROP SUBSCRIPTION would revoke that privilege.
> 
> Of course a question is what to do if the replication user already has that
> privilege on some columns: either the CREATE SUBSCRIPTION command should raise
> ERROR, or we should introduce a new privilege (e.g. ACL_SELECT_PUB) for this
> purpose, which would effectively be ACL_SELECT, but hidden from the users of
> GRANT / REVOKE.
> 
> If this approach was taken, the USAGE privilege on schema would need to be
> granted / revoked in a similar way.

When you talk about a user needing to have privileges, it sounds like you mean 
privileges on the publishing database.  But then you talk about CREATE 
SUBSCRIPTION granting privileges, which would necessarily be on the subscriber 
database.  Can you clarify what you have in mind?

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Draft back-branch release notes are up

2022-11-04 Thread Tom Lane
I committed draft release notes at

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=bc62182f0afe6b01fec45b8d26df03fc9de4599a

Please send comments/corrections by Sunday.

regards, tom lane




Re: Add non-blocking version of PQcancel

2022-11-04 Thread Jacob Champion
On 10/5/22 06:23, Jelte Fennema wrote:
> In my first version of this patch, this is exactly what I did. But then
> I got this feedback from Jacob, so I changed it to reusing PGconn:
> 
>>  [snip]
> 
> I changed it back to use PGcancelConn as per your suggestion and I 
> agree that the API got better because of it.

Sorry for the whiplash!

Is the latest attachment the correct version? I don't see any difference
between the latest 0001 and the previous version's 0002 -- it has no
references to PG_TEST_TIMEOUT_DEFAULT, PGcancelConn, etc.

Thanks,
--Jacob




Re: heapgettup refactoring

2022-11-04 Thread Melanie Plageman
Thanks for the review!
Attached is v2 with feedback addressed.

On Mon, Oct 31, 2022 at 9:09 PM Andres Freund  wrote:
> > From 9d8b01960463dc64ff5b111d523ff80fce3017af Mon Sep 17 00:00:00 2001
> > From: Melanie Plageman 
> > Date: Mon, 31 Oct 2022 13:40:06 -0400
> > Subject: [PATCH v1 2/3] Turn HeapKeyTest macro into function
> >
> > This should always be inlined appropriately now. It is easier to read as
> > a function. Also, remove unused include in catcache.c.
> > ---
> >  src/backend/access/heap/heapam.c   | 10 ++--
> >  src/backend/utils/cache/catcache.c |  1 -
> >  src/include/access/valid.h | 76 --
> >  3 files changed, 36 insertions(+), 51 deletions(-)
> >
> > diff --git a/src/backend/access/heap/heapam.c 
> > b/src/backend/access/heap/heapam.c
> > index 12be87efed..1c995faa12 100644
> > --- a/src/backend/access/heap/heapam.c
> > +++ b/src/backend/access/heap/heapam.c
> > @@ -716,8 +716,10 @@ heapgettup(HeapScanDesc scan,
> > 
> >   snapshot);
> >
> >   if (valid && key != NULL)
> > - HeapKeyTest(tuple, 
> > RelationGetDescr(scan->rs_base.rs_rd),
> > - nkeys, key, 
> > valid);
> > + {
> > + valid = HeapKeyTest(tuple, 
> > RelationGetDescr(scan->rs_base.rs_rd),
> > + nkeys, key);
> > + }
> >
> >   if (valid)
> >   {
>
> superfluous parens.

fixed.

> > From a894ce38c488df6546392b9f3bd894b67edf951e Mon Sep 17 00:00:00 2001
> > From: Melanie Plageman 
> > Date: Mon, 31 Oct 2022 13:40:29 -0400
> > Subject: [PATCH v1 3/3] Refactor heapgettup* and heapgetpage
> >
> > Simplify heapgettup(), heapgettup_pagemode(), and heapgetpage(). All
> > three contained several unnecessary local variables, duplicate code, and
> > nested if statements. Streamlining these improves readability and
> > extensibility.
>
> It'd be nice to break this into slightly smaller chunks.

I can do that. Since incorporating feedback will be harder once I break
it up into smaller chunks, I'm inclined to wait to do so until I know
that the structure I have now is the one we will go with. (I know smaller
chunks will make it more reviewable.)

> > +
> > +static inline void heapgettup_no_movement(HeapScanDesc scan)
> > +{
>
> FWIW, for function definitions we keep the return type (and with that also the
> the "static inline") on a separate line.

Fixed

>
> > + ItemId  lpp;
> > + OffsetNumber lineoff;
> > + BlockNumber page;
> > + Page dp;
> > + HeapTuple   tuple = &(scan->rs_ctup);
> > + /*
> > + * ``no movement'' scan direction: refetch prior tuple
> > + */
> > +
> > + /* Since the tuple was previously fetched, needn't lock page here */
> > + if (!scan->rs_inited)
> > + {
> > + Assert(!BufferIsValid(scan->rs_cbuf));
> > + tuple->t_data = NULL;
> > + return;
>
> Is it possible to have a no-movement scan with an uninitialized scan? That
> doesn't really seem to make sense. At least that's how I understand the
> explanation for NoMovement nearby:
>  * dir == NoMovementScanDirection means "re-fetch the tuple indicated
>  * by scan->rs_ctup".
>
> We can't have a rs_ctup without an already started scan, no?
>
> Looks like this is pre-existing code that you just moved, but it still seems
> wrong.

Changed to an assert

>
> > + }
> > + page = ItemPointerGetBlockNumber(&(tuple->t_self));
> > + if (page != scan->rs_cblock)
> > + heapgetpage((TableScanDesc) scan, page);
>
>
> We have a
> BlockNumber page;
> and
> Page dp;
> in this code which seems almost intentionally confusing. This again is a
> pre-existing issue but perhaps we could clean it up first?

In the attached:
page -> block
dp -> page
in basically all locations in heapam.c (should that be its own commit?)

> > +static inline Page heapgettup_continue_page(HeapScanDesc scan, BlockNumber 
> > page, ScanDirection dir,
> > + int *linesleft, OffsetNumber *lineoff)
> > +{
> > + HeapTuple   tuple = &(scan->rs_ctup);
>
> Hm. Finding the next offset via rs_ctup doesn't seem quite right. For one,
> it's not actually that cheap to extract the offset from an ItemPointer because
> of the way we pack it into ItemPointerData.

So, it was like this before [1].
What about saving the lineoff in rs_cindex instead?

It is smaller, but that seems okay, right?
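For context, the packing Andres refers to can be sketched roughly like this (simplified from PostgreSQL's storage/block.h and storage/itemptr.h; the helper name is invented here):

```c
#include <assert.h>
#include <stdint.h>

/* Simplified layout modeled on PostgreSQL's ItemPointerData: the 32-bit
 * block number is stored as two uint16 halves for alignment reasons, so
 * recovering it costs a shift and an OR rather than a plain load. */
typedef struct BlockIdData
{
    uint16_t bi_hi;             /* high 16 bits of the block number */
    uint16_t bi_lo;             /* low 16 bits of the block number */
} BlockIdData;

typedef struct ItemPointerData
{
    BlockIdData ip_blkid;       /* packed block number */
    uint16_t    ip_posid;       /* offset number within the page */
} ItemPointerData;

/* invented helper, equivalent in spirit to BlockIdGetBlockNumber() */
static uint32_t
item_pointer_block(const ItemPointerData *tid)
{
    return ((uint32_t) tid->ip_blkid.bi_hi << 16) | tid->ip_blkid.bi_lo;
}
```

Because the block number is split across two halves, re-deriving position data from a stored tuple pointer is never entirely free, which is part of the motivation for caching the offset elsewhere.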

> > + Page dp = BufferGetPage(scan->rs_cbuf);
> > + TestForOldSnapshot(scan->rs_base.rs_snapshot, scan->rs_base.rs_rd, 
> > dp);
>
> Newlines between definitions and code :)

k

> Perhaps worth asserting that the scan is initialized 

Re: psql: Add command to use extended query protocol

2022-11-04 Thread Peter Eisentraut

On 02.11.22 01:18, Corey Huinker wrote:


      SELECT $1, $2 \gp 'foo' 'bar'


I think this is a great idea, but I foresee people wanting to send that 
output to a file or a pipe like \g allows. If we assume everything after 
the \gp is a param, don't we paint ourselves into a corner?


Any thoughts on how that syntax could be generalized?




Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-11-04 Thread Masahiko Sawada
On Thu, Nov 3, 2022 at 1:59 PM John Naylor  wrote:
>
> On Mon, Oct 31, 2022 at 12:47 PM Masahiko Sawada  
> wrote:
> >
> > I've attached v8 patches. 0001, 0002, and 0003 patches incorporated
> > the comments I got so far. 0004 patch is a DSA support patch for PoC.
>
> Thanks for the new patchset. This is not a full review, but I have some 
> comments:
>
> 0001 and 0002 look okay on a quick scan -- I will use this as a base for 
> further work that we discussed. However, before I do so I'd like to request 
> another revision regarding the following:
>
> > In 0004 patch, the basic idea is to use rt_node_ptr in all inner nodes
> > to point its children, and we use rt_node_ptr as either rt_node* or
> > dsa_pointer depending on whether the radix tree is shared or not (ie,
> > by checking radix_tree->dsa == NULL).
>

Thank you for the comments!

> 0004: Looks like a good start, but this patch has a large number of changes 
> like these, making it hard to read:
>
> - if (found && child_p)
> - *child_p = child;
> + if (found && childp_p)
> + *childp_p = childp;
> ...
>   rt_node_inner_32 *new32;
> + rt_node_ptr new32p;
>
>   /* grow node from 4 to 32 */
> - new32 = (rt_node_inner_32 *) rt_copy_node(tree, (rt_node *) n4,
> -  RT_NODE_KIND_32);
> + new32p = rt_copy_node(tree, (rt_node *) n4, RT_NODE_KIND_32);
> + new32 = (rt_node_inner_32 *) node_ptr_get_local(tree, new32p);
>
> It's difficult to keep in my head what all the variables refer to. I thought 
> a bit about how to split this patch up to make this easier to read. Here's 
> what I came up with:
>
> typedef struct rt_node_ptr
> {
>   uintptr_t encoded;
>   rt_node * decoded;
> }
>
> Note that there is nothing about "dsa or local". That's deliberate. That way, 
> we can use the "encoded" field for a tagged pointer as well, as I hope we can 
> do (at least for local pointers) in the future. So an intermediate patch 
> would have "static inline void" functions  node_ptr_encode() and  
> node_ptr_decode(), which would only copy from one member to another. I 
> suspect that: 1. The actual DSA changes will be *much* smaller and easier to 
> reason about. 2. Experimenting with tagged pointers will be easier.

Good idea. Will try in the next version patch.
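For the archives, a minimal sketch of the intermediate step John describes, where encode/decode merely copy between the two members (the node type is a stand-in, not the patch's; only rt_node_ptr and the two helper names follow his sketch):

```c
#include <assert.h>
#include <stdint.h>

/* hypothetical local node type standing in for the patch's rt_node */
typedef struct rt_node
{
    uint8_t kind;
} rt_node;

/* the two-member pointer struct sketched upthread */
typedef struct rt_node_ptr
{
    uintptr_t encoded;          /* later: dsa_pointer or tagged pointer */
    rt_node  *decoded;          /* directly dereferenceable local address */
} rt_node_ptr;

static inline rt_node_ptr
node_ptr_encode(rt_node *node)
{
    rt_node_ptr p;

    p.encoded = (uintptr_t) node;   /* today, just a copy of the address */
    p.decoded = node;
    return p;
}

static inline rt_node *
node_ptr_decode(rt_node_ptr p)
{
    /* later: dsa_get_address() for shared trees, or untagging */
    return (rt_node *) p.encoded;
}
```

With this in place, the later DSA patch only has to change the bodies of the two helpers, which should keep that diff much smaller.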

>
> Also, quick question: 0004 has a new function rt_node_update_inner() -- is 
> that necessary because of DSA?, or does this ideally belong in 0002? What's 
> the reason for it?

Oh, this was needed at one point when I was initially writing the DSA
support, but thinking about it again now, I think we can remove it and
use rt_node_insert_inner() with parent = NULL instead.

>
> Regarding the performance, I've
> > added another boolean argument to bench_seq/shuffle_search(),
> > specifying whether to use the shared radix tree or not. Here are
> > benchmark results in my environment,
>
> > [...]
>
> > In non-shared radix tree cases (the fourth argument is false), I don't
> > see a visible performance degradation. On the other hand, in shared
> > radix tree cases (the fourth argument is true), I see visible overheads
> > because of dsa_get_address().
>
> Thanks, this is useful.
>
> > Please note that the current shared radix tree implementation doesn't
> > support any locking, so it cannot be read while written by someone.
>
> I think at the very least we need a global lock to enforce this.
>
> > Also, only one process can iterate over the shared radix tree. When it
> > comes to parallel vacuum, these don't become a restriction, as the leader
> > process writes the radix tree while scanning heap and the radix tree
> > is read by multiple processes while vacuuming indexes. And only the
> > leader process can do heap vacuum by iterating the key-value pairs in
> > the radix tree. If we want to use it for other cases too, we would
> > need to support locking, RCU or something.
>
> A useful exercise here is to think about what we'd need to do parallel heap 
> pruning. We don't need to go that far for v16 of course, but what's the 
> simplest thing we can do to make that possible? Other use cases can change to 
> more sophisticated schemes if need be.

For parallel heap pruning, multiple workers will insert key-value
pairs to the radix tree concurrently. The simplest solution would be a
single lock to protect writes but the performance will not be good.
Another solution would be that we can divide the tables into multiple
ranges so that keys derived from TIDs do not conflict with each
other and have parallel workers process one or more ranges. That way,
parallel vacuum workers can build *sub-trees* and the leader process
can merge them. In use cases of lazy vacuum, since the write phase and
read phase are separated the readers don't need to worry about
concurrent updates.
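The range-splitting idea can be sketched as follows (purely illustrative; none of these names exist in the patches):

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t BlockNumber;

/* Hypothetical sketch: split [0, nblocks) into nworkers contiguous,
 * non-overlapping ranges so each worker builds its own sub-tree over
 * disjoint TID keys, and the leader merges the sub-trees afterwards. */
typedef struct BlockRange
{
    BlockNumber start;          /* inclusive */
    BlockNumber end;            /* exclusive */
} BlockRange;

static BlockRange
worker_block_range(BlockNumber nblocks, int nworkers, int worker)
{
    BlockRange  r;
    BlockNumber per = nblocks / nworkers;
    BlockNumber rem = nblocks % nworkers;

    /* the first `rem` workers each take one extra block */
    r.start = per * worker + (worker < (int) rem ? (BlockNumber) worker : rem);
    r.end = r.start + per + (worker < (int) rem ? 1 : 0);
    return r;
}
```

Since TID keys are ordered by block number, disjoint block ranges guarantee disjoint key ranges, so no write synchronization between workers is needed during the build phase.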

I've attached a draft patch for lazy vacuum integration that can be
applied on top of v8 patches. The patch adds a new module called
TIDStore, an efficient storage for TID backed by radix tree. Lazy
vacuum and parallel vacuum use it instead of a TID array. The patch

Re: New docs chapter on Transaction Management and related changes

2022-11-04 Thread Laurenz Albe
On Sat, 2022-10-15 at 21:08 -0400, Bruce Momjian wrote:
> I therefore merged all three paragraphs into
> one and tried to make the text saner;  release_savepoint.sgml diff
> attached, URL content updated.

I wanted to have a look at this, but I am confused.  The original patch
was much bigger.  Is this just an incremental patch?  If yes, it would
be nice to have a "grand total" patch, so that I can read it all
in one go.

Yours,
Laurenz Albe




Re: Schema variables - new implementation for Postgres 15

2022-11-04 Thread Dmitry Dolgov
> On Fri, Nov 04, 2022 at 03:17:18PM +0100, Pavel Stehule wrote:
> > I've stumbled upon something that looks weird to me (inspired by the
> > example from tests):
> >
> > =# create variable v2 as int;
> > =# let v2 = 3;
> > =# create view vv2 as select coalesce(v2, 0) + 1000 as result
> >
> > =# select * from vv2;
> >  result
> > --------
> >    1003
> >
> > =# set force_parallel_mode to on;
> > =# select * from vv2;
> >  result
> > --------
> >    1000
> >
> > In the second select the actual work is done from a worker backend.
> > Since values of session variables are stored in the backend local
> > memory, it's not being shared with the worker and the value is not found
> > in the hash map. Is this supposed to be like that?
> >
>
> It looks like a bug, but parallel queries should be supported.
>
> The value of the variable is passed as a parameter to the worker backend. But
> probably somewhere the original reference was not replaced by the parameter.
>
> On Fri, Nov 04, 2022 at 10:17:13PM +0800, Julien Rouhaud wrote:
> Hi,
>
> There's code to serialize and restore all used variables for parallel workers
> (see code about PARAM_VARIABLE and queryDesc->num_session_variables /
> queryDesc->plannedstmt->sessionVariables).  I haven't reviewed that part yet,
> but it's supposed to be working.  Blind guess would be that it's missing
> something in expression walker.

I see, thanks. I'll take a deeper look; my initial assumption was due to
the fact that in the worker case create_sessionvars_hashtables is
getting called for every query.




Re: Schema variables - new implementation for Postgres 15

2022-11-04 Thread Pavel Stehule
Hi


On Fri, Nov 4, 2022 at 15:08, Dmitry Dolgov <9erthali...@gmail.com>
wrote:

> > On Fri, Nov 04, 2022 at 05:58:06AM +0100, Pavel Stehule wrote:
> > Hi
> >
> > fix clang warning
>
> I've stumbled upon something that looks weird to me (inspired by the
> example from tests):
>
> =# create variable v2 as int;
> =# let v2 = 3;
> =# create view vv2 as select coalesce(v2, 0) + 1000 as result
>
> =# select * from vv2;
>  result
> --------
>    1003
>
> =# set force_parallel_mode to on;
> =# select * from vv2;
>  result
> --------
>    1000
>
> In the second select the actual work is done from a worker backend.
> Since values of session variables are stored in the backend local
> memory, it's not being shared with the worker and the value is not found
> in the hash map. Is this supposed to be like that?
>

It looks like a bug, but parallel queries should be supported.

The value of the variable is passed as a parameter to the worker backend. But
probably somewhere the original reference was not replaced by the parameter.


Re: Schema variables - new implementation for Postgres 15

2022-11-04 Thread Julien Rouhaud
Hi,

On Fri, Nov 04, 2022 at 03:07:48PM +0100, Dmitry Dolgov wrote:
> > On Fri, Nov 04, 2022 at 05:58:06AM +0100, Pavel Stehule wrote:
> > Hi
> >
> > fix clang warning
>
> I've stumbled upon something that looks weird to me (inspired by the
> example from tests):
>
> =# create variable v2 as int;
> =# let v2 = 3;
> =# create view vv2 as select coalesce(v2, 0) + 1000 as result
>
> =# select * from vv2;
>  result
> --------
>    1003
>
> =# set force_parallel_mode to on;
> =# select * from vv2;
>  result
> --------
>    1000
>
> In the second select the actual work is done from a worker backend.
> Since values of session variables are stored in the backend local
> memory, it's not being shared with the worker and the value is not found
> in the hash map. Is this supposed to be like that?

There's code to serialize and restore all used variables for parallel workers
(see code about PARAM_VARIABLE and queryDesc->num_session_variables /
queryDesc->plannedstmt->sessionVariables).  I haven't reviewed that part yet,
but it's supposed to be working.  Blind guess would be that it's missing
something in expression walker.




Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL

2022-11-04 Thread David Christensen
On Nov 4, 2022, at 9:02 AM, Justin Pryzby  wrote:
> 
> On Fri, Nov 04, 2022 at 11:52:59AM +0900, Ian Lawrence Barwick wrote:
>> On Tue, May 3, 2022 at 8:45, David Christensen wrote:
>>> 
>>> ...and pushing a couple fixups pointed out by cfbot, so here's v4.
>> 
>> cfbot reports the patch no longer applies [1]. As CommitFest 2022-11 is
>> currently underway, this would be an excellent time to update the patch.
> 
> More important than needing to be rebased, the patch has never passed
> its current tests on windows.
> 
> As I recall, that's due to relying on "cp" and "rsync" (which
> shouldn't be assumed to exist by regression tests).

I will work on supporting Windows compatibility here. Is there some list of
guidelines for what you can and can't use? I don't have a Windows machine
available to develop on.

Was it failing on Windows? I was attempting to skip it there, as I recall.

Best,

David





Re: Schema variables - new implementation for Postgres 15

2022-11-04 Thread Dmitry Dolgov
> On Fri, Nov 04, 2022 at 05:58:06AM +0100, Pavel Stehule wrote:
> Hi
>
> fix clang warning

I've stumbled upon something that looks weird to me (inspired by the
example from tests):

=# create variable v2 as int;
=# let v2 = 3;
=# create view vv2 as select coalesce(v2, 0) + 1000 as result

=# select * from vv2;
 result
--------
   1003

=# set force_parallel_mode to on;
=# select * from vv2;
 result
--------
   1000

In the second select the actual work is done from a worker backend.
Since values of session variables are stored in the backend local
memory, it's not being shared with the worker and the value is not found
in the hash map. Is this supposed to be like that?




Re: Fix proposal for comparaison bugs in PostgreSQL::Version

2022-11-04 Thread Jehan-Guillaume de Rorthais
On Thu, 3 Nov 2022 13:11:18 -0500
Justin Pryzby  wrote:

> On Tue, Jun 28, 2022 at 06:17:40PM -0400, Andrew Dunstan wrote:
> > Nice catch, but this looks like massive overkill. I think we can very
> > simply fix the test in just a few lines of code, instead of a 190 line
> > fix and a 130 line TAP test.
> > 
> > It was never intended to be able to compare markers like rc1 vs rc2, and
> > I don't see any need for it. If you can show me a sane use case I'll
> > have another look, but right now it seems quite unnecessary.
> > 
> > Here's my proposed fix.
> > 
> > diff --git a/src/test/perl/PostgreSQL/Version.pm
> > b/src/test/perl/PostgreSQL/Version.pm index 8f70491189..8d4dbbf694 100644
> > --- a/src/test/perl/PostgreSQL/Version.pm  
> 
> Is this still an outstanding issue ?

The issue still exists on current HEAD:

  $ perl -Isrc/test/perl/ -MPostgreSQL::Version -le \
  'print "bug" if PostgreSQL::Version->new("9.6") <= 9.0'
  bug
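For comparison, the part-wise comparison that the overloaded operator should reduce to can be sketched outside Perl (names invented here, C used for clarity):

```c
#include <assert.h>

/* Illustrative only: compare two (major, minor) versions part-wise,
 * returning <0, 0, >0 like strcmp. With this, "9.6" compares greater
 * than "9.0", unlike the buggy overload shown above. */
static int
vercmp(int maj_a, int min_a, int maj_b, int min_b)
{
    if (maj_a != maj_b)
        return (maj_a < maj_b) ? -1 : 1;
    if (min_a != min_b)
        return (min_a < min_b) ? -1 : 1;
    return 0;
}
```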

Regards,





RE: Perform streaming logical transactions by background workers and parallel apply

2022-11-04 Thread houzj.f...@fujitsu.com
On Friday, November 4, 2022 4:07 PM Amit Kapila  wrote:
> 
> On Thu, Nov 3, 2022 at 6:36 PM houzj.f...@fujitsu.com
>  wrote:
> >
> > Thanks for the analysis and summary !
> >
> > I tried to implement the above idea and here is the patch set.
> >
> 
> Few comments on v42-0001
> ===

Thanks for the comments.

> 
> 10.
> + winfo->shared->stream_lock_id = parallel_apply_get_unique_id();
> + winfo->shared->transaction_lock_id = parallel_apply_get_unique_id();
> 
> Why can't we use xid (remote_xid) for one of these and local_xid (one 
> generated
> by parallel apply) for the other? I was a bit worried about the local_xid 
> because it
> will be generated only after applying the first message but the patch already
> seems to be waiting for it in parallel_apply_wait_for_xact_finish as seen in 
> the
> below code.
> 
> +void
> +parallel_apply_wait_for_xact_finish(ParallelApplyWorkerShared *wshared)
> +{
> + /*
> + * Wait until the parallel apply worker handles the first message and
> + * set the flag to true.
> + */
> + parallel_apply_wait_for_in_xact(wshared, PARALLEL_TRANS_STARTED);
> +
> + /* Wait for the transaction lock to be released. */
> + parallel_apply_lock(wshared->transaction_lock_id);

I also considered using xid for these locks, but it seems the objsubid for the
shared object lock is 16-bit while xid is 32-bit. So, I tried to generate a
unique 16-bit id here. I will think more on this and maybe I need to add some
comments to explain this.
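The width mismatch described here is easy to see in a tiny sketch: truncating a 32-bit xid into a 16-bit lock field makes distinct xids collide (the helper name is invented):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative only: if a 32-bit xid were stuffed directly into the
 * 16-bit objsubid field of an object lock tag, the high 16 bits would
 * be lost, and xids 65536 apart would map to the same lock. */
static uint16_t
lock_id_from_xid(uint32_t xid)
{
    return (uint16_t) xid;      /* drops the high 16 bits */
}
```

This is why the patch generates its own unique 16-bit ids instead of reusing the xid directly.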

Best regards,
Hou zj


Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL

2022-11-04 Thread Justin Pryzby
On Fri, Nov 04, 2022 at 11:52:59AM +0900, Ian Lawrence Barwick wrote:
> On Tue, May 3, 2022 at 8:45, David Christensen wrote:
> >
> > ...and pushing a couple fixups pointed out by cfbot, so here's v4.
> 
> cfbot reports the patch no longer applies [1]. As CommitFest 2022-11 is
> currently underway, this would be an excellent time to update the patch.

More important than needing to be rebased, the patch has never passed
its current tests on windows.

As I recall, that's due to relying on "cp" and "rsync" (which
shouldn't be assumed to exist by regression tests).

https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/38/3628




Re: proposal: possibility to read dumped table's name from file

2022-11-04 Thread Pavel Stehule
On Fri, Nov 4, 2022 at 14:28, Julien Rouhaud
wrote:

> On Fri, Nov 04, 2022 at 07:19:27AM -0500, Justin Pryzby wrote:
> > On Fri, Nov 04, 2022 at 07:59:01PM +0800, Julien Rouhaud wrote:
> > > On Thu, Nov 03, 2022 at 10:22:15PM +0100, Pavel Stehule wrote:
> > > > On Thu, Nov 3, 2022 at 5:09, Julien Rouhaud wrote:
> > > > updated patch attached
> > > >
> > > > big thanks for these comments and tips
> > >
> > > Thanks for the updated patch!  As far as I'm concerned the patch is in
> a good
> > > shape, passes the CI and I don't have anything more to say so I'm
> marking it as
> > > Ready for Committer!
> >
> > +1
> >
> > I started looking to see if it's possible to simplify the patch at all,
> > but nothing to show yet.
> >
> > But one thing I noticed is that "optarg" looks wrong here:
> >
simple_string_list_append(&opts->triggerNames, optarg);
>
> Ah indeed, good catch!  Maybe there should be an explicit test for every
> (include|exclude) / objtype combination?  It would be a bit verbose (and
> possibly hard to maintain).
>

I'll do it


Re: proposal: possibility to read dumped table's name from file

2022-11-04 Thread Julien Rouhaud
On Fri, Nov 04, 2022 at 07:19:27AM -0500, Justin Pryzby wrote:
> On Fri, Nov 04, 2022 at 07:59:01PM +0800, Julien Rouhaud wrote:
> > On Thu, Nov 03, 2022 at 10:22:15PM +0100, Pavel Stehule wrote:
> > > On Thu, Nov 3, 2022 at 5:09, Julien Rouhaud wrote:
> > > updated patch attached
> > >
> > > big thanks for these comments and tips
> > 
> > Thanks for the updated patch!  As far as I'm concerned the patch is in a 
> > good
> > shape, passes the CI and I don't have anything more to say so I'm marking 
> > it as
> > Ready for Committer!
> 
> +1
> 
> I started looking to see if it's possible to simplify the patch at all,
> but nothing to show yet.
> 
> But one thing I noticed is that "optarg" looks wrong here:
> 
> simple_string_list_append(&opts->triggerNames, optarg);

Ah indeed, good catch!  Maybe there should be an explicit test for every
(include|exclude) / objtype combination?  It would be a bit verbose (and
possibly hard to maintain).




Re: Add index scan progress to pg_stat_progress_vacuum

2022-11-04 Thread Imseih (AWS), Sami
Resubmitting the patches in the proper format.

Thanks

Sami Imseih
Amazon Web Services (AWS)







v14-0002-Function-to-return-currently-vacuumed-or-cleaned.patch
Description: v14-0002-Function-to-return-currently-vacuumed-or-cleaned.patch


v14-0001-Add-2-new-columns-to-pg_stat_progress_vacuum.-Th.patch
Description: v14-0001-Add-2-new-columns-to-pg_stat_progress_vacuum.-Th.patch


Re: Add explicit casts in four places to simplehash.h

2022-11-04 Thread Tom Lane
David Geier  writes:
> On 11/3/22 15:50, Tom Lane wrote:
>> The bigger picture here is that we do actually endeavor to keep
>> (most of) our headers C++ clean, but our tool cpluspluscheck misses
>> these problems because it doesn't try to use these macros.
>> I wonder whether there is a way to do better.

> What about having a custom header alongside cpluspluscheck which 
> references all macros we care about?

Can't see that that would help, because of the hand maintenance
required to make it useful.

regards, tom lane




Re: Add tracking of backend memory allocated to pg_stat_activity

2022-11-04 Thread Reid Thompson
On Fri, 2022-11-04 at 11:06 +0100, Drouvot, Bertrand wrote:
> Hi,
> 
> It looks like the patch does not apply anymore since b1099eca8f.
> 
> Regards,
> 

Thanks,

rebased to current master attached.

-- 
Reid Thompson
Senior Software Engineer
Crunchy Data, Inc.

reid.thomp...@crunchydata.com
www.crunchydata.com
From a8de5d29c0c6f10962181926a49ad4fec1e52bd1 Mon Sep 17 00:00:00 2001
From: Reid Thompson 
Date: Thu, 11 Aug 2022 12:01:25 -0400
Subject: [PATCH] Add tracking of backend memory allocated to pg_stat_activity

This new field displays the current bytes of memory allocated to the
backend process. It is updated as memory for the process is
malloc'd/free'd. Memory allocated to items on the freelist is included in
the displayed value.  Dynamic shared memory allocations are included
only in the value displayed for the backend that created them, they are
not included in the value for backends that are attached to them to
avoid double counting. On occasion, orphaned memory segments may be
cleaned up on postmaster startup. This may result in decreasing the sum
without a prior increment. We limit the floor of backend_mem_allocated
to zero. Updated pg_stat_activity documentation for the new column.
---
 doc/src/sgml/monitoring.sgml|  12 +++
 src/backend/catalog/system_views.sql|   1 +
 src/backend/storage/ipc/dsm_impl.c  |  81 +++
 src/backend/utils/activity/backend_status.c | 105 
 src/backend/utils/adt/pgstatfuncs.c |   4 +-
 src/backend/utils/mmgr/aset.c   |  18 
 src/backend/utils/mmgr/generation.c |  15 +++
 src/backend/utils/mmgr/slab.c   |  21 
 src/include/catalog/pg_proc.dat |   6 +-
 src/include/utils/backend_status.h  |   7 +-
 src/test/regress/expected/rules.out |   9 +-
 11 files changed, 270 insertions(+), 9 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index e5d622d514..972805b85a 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -947,6 +947,18 @@ postgres   27093  0.0  0.0  30096  2752 ?Ss   11:34   0:00 postgres: ser
  
 
 
+
+ 
+  backend_allocated_bytes bigint
+ 
+ 
+  The byte count of memory allocated to this backend. Dynamic shared memory
+  allocations are included only in the value displayed for the backend that
+  created them, they are not included in the value for backends that are
+  attached to them to avoid double counting.
+ 
+
+
  
   
query text
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 2d8104b090..84d462aa97 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -864,6 +864,7 @@ CREATE VIEW pg_stat_activity AS
 S.state,
 S.backend_xid,
 s.backend_xmin,
+S.backend_allocated_bytes,
 S.query_id,
 S.query,
 S.backend_type
diff --git a/src/backend/storage/ipc/dsm_impl.c b/src/backend/storage/ipc/dsm_impl.c
index 6ddd46a4e7..eae03159c3 100644
--- a/src/backend/storage/ipc/dsm_impl.c
+++ b/src/backend/storage/ipc/dsm_impl.c
@@ -66,6 +66,7 @@
 #include "postmaster/postmaster.h"
 #include "storage/dsm_impl.h"
 #include "storage/fd.h"
+#include "utils/backend_status.h"
 #include "utils/guc.h"
 #include "utils/memutils.h"
 
@@ -232,6 +233,14 @@ dsm_impl_posix(dsm_op op, dsm_handle handle, Size request_size,
 			name)));
 			return false;
 		}
+
+		/*
+		 * Detach and destroy pass through here, only decrease the memory
+		 * shown allocated in pg_stat_activity when the creator destroys the
+		 * allocation.
+		 */
+		if (op == DSM_OP_DESTROY)
+			pgstat_report_backend_allocated_bytes_decrease(*mapped_size);
 		*mapped_address = NULL;
 		*mapped_size = 0;
 		if (op == DSM_OP_DESTROY && shm_unlink(name) != 0)
@@ -332,6 +341,36 @@ dsm_impl_posix(dsm_op op, dsm_handle handle, Size request_size,
 		name)));
 		return false;
 	}
+
+	/*
+	 * Attach and create pass through here, only update backend memory
+	 * allocated in pg_stat_activity for the creator process.
+	 */
+	if (op == DSM_OP_CREATE)
+	{
+		/*
+		 * Posix creation calls dsm_impl_posix_resize implying that resizing
+		 * occurs or may be added in the future. As implemented
+		 * dsm_impl_posix_resize utilizes fallocate or truncate, passing the
+		 * whole new size as input, growing the allocation as needed (only
+		 * truncate supports shrinking). We update by replacing the old
+		 * allocation with the new.
+		 */
+#if defined(HAVE_POSIX_FALLOCATE) && defined(__linux__)
+		/*
+		 * posix_fallocate does not shrink allocations, adjust only on
+		 * allocation increase.
+		 */
+		if (request_size > *mapped_size)
+		{
+			pgstat_report_backend_allocated_bytes_decrease(*mapped_size);
+			pgstat_report_backend_allocated_bytes_increase(request_size);
+		}
+#else
+		
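A sketch of the floor-at-zero rule described in the commit message (assumed names; per the diffstat, the real counters live in backend_status.c):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative only: a decrease may arrive without a matching prior
 * increase (e.g. orphaned segment cleanup at postmaster startup), so
 * the counter is clamped at zero instead of going "negative". */
static uint64_t backend_allocated_bytes = 0;

static void
allocated_bytes_increase(uint64_t n)
{
    backend_allocated_bytes += n;
}

static void
allocated_bytes_decrease(uint64_t n)
{
    if (n > backend_allocated_bytes)
        backend_allocated_bytes = 0;    /* clamp: never underflow */
    else
        backend_allocated_bytes -= n;
}
```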

Re: Lockless queue of waiters in LWLock

2022-11-04 Thread Alexander Korotkov
Hi, Pavel!

On Thu, Nov 3, 2022 at 1:51 PM Pavel Borisov  wrote:
> > I'm planning to gather more detailed statistics on different
> > LWLockAcquire calls soon to understand prospects for further
> > optimizations.
>
> So, I've made more measurements.
>
> 1. Applied measuring patch 0001 to a patch with lockless queue
> optimization (v2) from [0] in this thread and run the same concurrent
> insert test described in [1] on 20 pgbench connections.
> The new results for ProcArray lwlock are as follows:
> exacq 45132  // Overall number of exclusive locks taken
> ex_attempt[0] 20755  // Exclusive locks taken immediately
> ex_attempt[1] 18800  // Exclusive locks taken after one waiting on semaphore
> ex_attempt[2] 5577   // Exclusive locks taken after two or more
> waiting on semaphore
> shacq 494871  // .. same stats for shared locks
> sh_attempt[0] 463211 // ..
> sh_attempt[1] 29767   // ..
> sh_attempt[2] 1893 // .. same stats for shared locks
> sh_wake_calls 31070 // Number of calls to wake up shared waiters
> sh_wakes 36190// Number of shared waiters woken up.
> GroupClearXid 55300 // Number of calls of ProcArrayGroupClearXid
> EndTransactionInternal: 236193 // Number of calls
> ProcArrayEndTransactionInternal
>
> 2. Applied measuring patch 0002 to a Andres Freund's patch v3 from [2]
> and run the same concurrent insert test described in [1] on 20 pgbench
> connections.
> The results for ProcArray lwlock are as follows:
> exacq 49300   // Overall number of exclusive locks taken
> ex_attempt1[0] 18353  // Exclusive locks taken immediately by first
> call of LWLockAttemptLock in LWLockAcquire loop
> ex_attempt2[0] 18144  // Exclusive locks taken immediately by second
> call of LWLockAttemptLock in LWLockAcquire loop
> ex_attempt1[1] 9985   // Exclusive locks taken after one waiting on
> semaphore by first call of LWLockAttemptLock in LWLockAcquire loop
> ex_attempt2[1] 1838   // Exclusive locks taken after one waiting on
> semaphore by second call of LWLockAttemptLock in LWLockAcquire loop
> ex_attempt1[2] 823    // Exclusive locks taken after two or more
> waiting on semaphore by first call of LWLockAttemptLock in
> LWLockAcquire loop
> ex_attempt2[2] 157    // Exclusive locks taken after two or more
> waiting on semaphore by second call of LWLockAttemptLock in
> LWLockAcquire loop
> shacq 508131 // .. same stats for shared locks
> sh_attempt1[0] 469410  //..
> sh_attempt2[0] 27858   //..
> sh_attempt1[1] 10309   //..
> sh_attempt2[1] 460     //..
> sh_attempt1[2] 90      //..
> sh_attempt2[2] 4       // .. same stats for shared locks
> dequeue_self 48461     // Number of dequeue_self calls
> sh_wake_calls 27560   // Number of calls to wake up
> shared waiters
> sh_wakes 19408  // Number of shared waiters woken up.
> GroupClearXid 65021  // Number of calls of
> ProcArrayGroupClearXid
> EndTransactionInternal: 249003 // Number of calls
> ProcArrayEndTransactionInternal
>
> It seems that the two lock-attempt calls in each loop in Andres's (and
> master's) code help avoid semaphore-waiting loops that may be relatively expensive.
> The probable reason for this is that the small delay between these two
> calls is sometimes enough for concurrent takers to release the spinlock
> protecting the queue modification. Could we get even more performance if we do
> three or more tries to take the lock in the queue? Will this degrade
> performance in some other cases?

Thank you for gathering the statistics. Let me do some relative
analysis of that.

*Lockless queue patch*

1. Shared lockers
1.1. 93.60% of them acquire lock without sleeping on semaphore
1.2. 6.02% of them acquire lock after sleeping on semaphore 1 time
1.3. 0.38% of them acquire lock after sleeping on semaphore 2 or more times
2. Exclusive lockers
2.1. 45.99% of them acquire lock without sleeping on semaphore
2.2. 41.66% of them acquire lock after sleeping on semaphore 1 time
2.3. 12.36% of them acquire lock after sleeping on semaphore 2 or more times

In general, about 10% of lockers sleep on the semaphore.

*Andres's patch*

1. Shared lockers
1.1. 97.86% of them acquire lock without sleeping on the semaphore
(92.38% do this immediately and 5.48% after queuing)
1.2. 2.12% of them acquire lock after sleeping on semaphore 1 time
(2.03% do this immediately and 0.09% after queuing)
1.3. 0.02% of them acquire lock after sleeping on semaphore 2 or more
times (0.02% do this immediately and 0.00% after queuing)
2. Exclusive lockers
2.1. 74.03% of them acquire lock without sleeping on the semaphore
(37.23% do this immediately and 36.80% after queuing)
2.2. 23.98% of them acquire lock after sleeping on semaphore 1 time
(20.25% do this immediately and 3.73% after queuing)
2.3. 1.99% of them acquire lock after sleeping on semaphore 2 or more
times (1.67% do this immediately and 0.32% after queuing)

In general, about 4% of lockers sleep on the semaphore.
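The shares above can be recomputed directly from the raw counters quoted earlier (a quick check, using the numbers reported for Andres's patch):

```c
/* Raw counters from the ProcArray lwlock run reported upthread. */
static const double shacq = 508131;
static const double exacq = 49300;
static const double sh_no_sleep = 469410 + 27858;	/* sh_attempt1[0] + sh_attempt2[0] */
static const double ex_no_sleep = 18353 + 18144;	/* ex_attempt1[0] + ex_attempt2[0] */

static double
pct(double part, double total)
{
	return 100.0 * part / total;
}
```

pct(sh_no_sleep, shacq) gives 97.86% of shared lockers acquiring without a semaphore sleep, and pct(ex_no_sleep, exacq) gives 74.03% for exclusive lockers, matching the breakdown above.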

Re: proposal: possibility to read dumped table's name from file

2022-11-04 Thread Justin Pryzby
On Fri, Nov 04, 2022 at 07:59:01PM +0800, Julien Rouhaud wrote:
> On Thu, Nov 03, 2022 at 10:22:15PM +0100, Pavel Stehule wrote:
> > čt 3. 11. 2022 v 5:09 odesílatel Julien Rouhaud  napsal:
> > updated patch attached
> >
> > big thanks for these comments and tips
> 
> Thanks for the updated patch!  As far as I'm concerned the patch is in a good
> shape, passes the CI and I don't have anything more to say, so I'm marking it
> as Ready for Committer!

+1

I started looking to see if it's possible to simplify the patch at all,
but nothing to show yet.

But one thing I noticed is that "optarg" looks wrong here:

simple_string_list_append(&dopt->triggerNames, optarg);

-- 
Justin




Re: proposal: possibility to read dumped table's name from file

2022-11-04 Thread Julien Rouhaud
On Thu, Nov 03, 2022 at 10:22:15PM +0100, Pavel Stehule wrote:
> čt 3. 11. 2022 v 5:09 odesílatel Julien Rouhaud  napsal:
>
> >
> > Is there any interest with the initial pg_strip_crlf?  AFAICT all the rest
> > of
> > the code will ignore such caracters using isspace() so it wouldn't change
> > anything.
> >
>
> I think reading multiline identifiers is a little bit easier, because I
> don't need to check the ending \n and \r
> When I read multiline identifiers, I cannot ignore white spaces.

Ok.  I don't have a strong objection to it.

>
> updated patch attached
>
> big thanks for these comments and tips

Thanks for the updated patch!  As far as I'm concerned the patch is in a good
shape, passes the CI and I don't have anything more to say so I'm marking it as
Ready for Committer!




Re: Allow single table VACUUM in transaction block

2022-11-04 Thread Rahila Syed
Hi,

On Fri, Nov 4, 2022 at 2:39 PM Simon Riggs 
wrote:

> Hi Rahila,
>
> Thanks for your review.
>
> On Fri, 4 Nov 2022 at 07:37, Rahila Syed  wrote:
>
> >> I would like to bring up a few points that I came across while looking
> into the vacuum code.
> >>
> >> 1.  As a result of this change to allow VACUUM inside a user
> transaction, I think there is some chance of causing
> >> a block/delay of concurrent VACUUMs if a VACUUM is being run under a
> long running transaction.
> >> 2. Also, if a user runs VACUUM in a transaction, performance
> optimizations like PROC_IN_VACUUM won't work.
> >> 3. Also, if VACUUM happens towards the end of a long running
> transaction, the snapshot will be old
> >> and xmin horizon for vacuum would be somewhat old as compared to
> current lazy vacuum which
> >> acquires a new snapshot just before scanning the table.
> >>
> >> So, while I understand the need of the feature, I am wondering if there
> should be some mention
> >> of above caveats in documentation with the recommendation that VACUUM
> should be run outside
> >> a transaction, in general.
> >>
> >
> > Sorry, I just noticed that you have already mentioned some of these in
> the documentation as follows, so it seems
> > it is already taken care of.
> >
> > +VACUUM cannot be executed inside a transaction
> block,
> > +unless a single table is specified and FULL is
> not
> > +specified.  When executing inside a transaction block the vacuum
> scan can
> > +hold back the xmin horizon and does not update the database
> datfrozenxid,
> > +as a result this usage is not useful for database maintenance, but
> is provided
> > +to allow vacuuming in special circumstances, such as temporary or
> private
> > +work tables.
>
> Yes, I wondered whether we should have a NOTICE or WARNING to remind
> people of those points?
>

+1. My vote is for NOTICE over WARNING because I think it is useful
information for the user rather than a warning about a potential problem.

Thank you,
Rahila Syed


Re: Perform streaming logical transactions by background workers and parallel apply

2022-11-04 Thread Amit Kapila
On Fri, Nov 4, 2022 at 1:36 PM Amit Kapila  wrote:
>
> On Thu, Nov 3, 2022 at 6:36 PM houzj.f...@fujitsu.com
>  wrote:
> >
> > Thanks for the analysis and summary !
> >
> > I tried to implement the above idea and here is the patch set.
> >
>
> Few comments on v42-0001
> ===
>

Few more comments on v42-0001
===
1. In parallel_apply_send_data(), it seems winfo->serialize_changes
and switching_to_serialize are set to indicate that we have changed
parallel to serialize mode. Isn't using just the
switching_to_serialize sufficient? Also, it would be better to name
switching_to_serialize as parallel_to_serialize or something like
that.

2. In parallel_apply_send_data(), the patch has already initialized
the fileset, and then again in apply_handle_stream_start(), it will do
the same if we fail while sending stream_start message to the parallel
worker. It seems we don't need to initialize fileset again for
TRANS_LEADER_PARTIAL_SERIALIZE state in apply_handle_stream_start()
unless I am missing something.

3.
apply_handle_stream_start(StringInfo s)
{
...
+ if (!first_segment)
+ {
+ /*
+ * Unlock the shared object lock so that parallel apply worker
+ * can continue to receive and apply changes.
+ */
+ parallel_apply_unlock(winfo->shared->stream_lock_id);
...
}

Can we have an assert before this unlock call that the lock must be
held? Similarly, if there are other places then we can have assert
there as well.

4. It is not very clear to me how maintaining ParallelApplyLockids
list is helpful.

5.
/*
+ * Handle STREAM START message when the transaction was spilled to disk.
+ *
+ * Inintialize fileset if not yet and open the file.
+ */
+void
+serialize_stream_start(TransactionId xid, bool first_segment)
+{
+ /*
+ * Start a transaction on stream start,

This function's name and comments seem to indicate that it is to
handle stream_start message. Is that really the case? It is being
called from parallel_apply_send_data() which made me think it can be
used from other places as well.

-- 
With Regards,
Amit Kapila.




Re: confirmed_flush_lsn shows LSN of the data that has not yet been received by the logical subscriber.

2022-11-04 Thread Ashutosh Sharma
On Fri, Nov 4, 2022 at 3:29 PM Amit Kapila  wrote:
>
> On Thu, Nov 3, 2022 at 4:49 PM Amit Kapila  wrote:
> >
> > On Mon, Sep 19, 2022 at 8:09 PM Ashutosh Sharma  
> > wrote:
> > >
> > > Thanks for the clarification. Attached is the patch with the changes.
> > > Please have a look.
> > >
> >
> > LGTM. I'll push this tomorrow unless there are any other
> > comments/suggestions for this.
> >
>
> Pushed!
>

Thanks, Amit!

--
With Regards,
Ashutosh Sharma.




RE: Improve logging when using Huge Pages

2022-11-04 Thread Shinoda, Noriyoshi (PN Japan FSIP)
Thanks for your comment. 
I understand that some people don't like log noise. However, I don't
understand the objection to the one-line log message that is output when the
instance is started.
In both MySQL and Oracle Database, a log is output if it fails to acquire 
HugePages with the same behavior as huge_pages=try in PostgreSQL.

Regards,
Noriyoshi Shinoda


-Original Message-
From: Thomas Munro  
Sent: Thursday, November 3, 2022 10:10 AM
To: Shinoda, Noriyoshi (PN Japan FSIP) 
Cc: Jacob Champion ; Masahiko Sawada 
; Fujii Masao ; Justin 
Pryzby ; PostgreSQL-development 
; Julien Rouhaud ; Tom Lane 
; Kyotaro Horiguchi 
Subject: Re: Improve logging when using Huge Pages

On Wed, Aug 3, 2022 at 8:42 PM Shinoda, Noriyoshi (PN Japan FSIP) 
 wrote:
> > As discussed in [1], we're taking this opportunity to return some patchsets 
> > that don't appear to be getting enough reviewer interest.
> Thank you for your helpful comments.
> As you say, my patch doesn't seem to be of much interest to reviewers either.
> I will discard the patch I proposed this time and consider it again.

I wonder if the problem here is that people are reluctant to add noise at
startup on every system.  There are people who have not configured their
system and don't want to see that noise, and then some people have configured
their system and would like to know about it if it doesn't work, so they can
be aware of that, but don't want to use "off" because they don't want a hard
failure.  Would it be better if there were a new level "try_log" (or
something), which only logs a message if it tries and fails?


Re: explain_regress, explain(MACHINE), and default to explain(BUFFERS) (was: BUFFERS enabled by default in EXPLAIN (ANALYZE))

2022-11-04 Thread Laurenz Albe
Thanks for the updated patch set!

On Fri, 2022-10-28 at 17:59 -0500, Justin Pryzby wrote:
> 
> > 0004, 0005, 0006, 0007: EXPLAIN (MACHINE)
> > 
> >   I think it is confusing that these are included in this patch set.
> >   EXPLAIN (MACHINE OFF) is similar to "explain_regress = on", only it goes 
> > even further:
> >   no query ID, no Heap Fetches, no Sort details, ...
> >   Why not add this functionality to the GUC?
> 
> Good question, and one I've asked myself.
> 
> explain_regress affects pre-existing uses of explain in the regression
> tests, aiming to simplify current (or maybe only future) uses.  And
> is obviously used to allow enabling BUFFERS by default.
> 
> explain(MACHINE off) is mostly about allowing "explain ANALYZE" to be
> used in regression tests.  Which, right now, is rare.  Maybe I shouldn't
> include those patches until the earlier patches progress (or, maybe we
> should just defer their discussion).

Essentially, "explain_regress" is to simplify the current regression tests,
and MACHINE OFF is to simplify future regression tests.  That does not seem
like a meaningful distinction to me.  Both are only useful for the regression
tests, and I see no need for two different knobs.

My opinion is now like this:

+1 on enabling BUFFERS by default for EXPLAIN (ANALYZE)

+0.2 on "explain_regress"

-1 on EXPLAIN (MACHINE) in addition to "explain_regress"

-0.1 on adding the MACHINE OFF functionality to "explain_regress"

"explain_regress" reduces the EXPLAIN options you need for regression tests.
This is somewhat useful, but not a big win.  Also, it will make backpatching
regression tests slightly harder for the next 5 years.

Hence the +0.2 for "explain_regress".

For me, an important question is: do we really want more EXPLAIN (ANALYZE)
in the regression tests?  It will probably slow the tests down, and I wonder
if there is a lot of added value, if we have to remove all the machine-specific
output anyway.

Hence the -0.1.

> >   0005 suppresses "rows removed by filter", but how is that machine 
> > dependent?
> 
> It can vary with the number of parallel workers (see 13e8b2ee8), which
> may not be "machine dependent" but the point of that patch is to allow
> predictable output of EXPLAIN ANALYZE.  Maybe it needs a better name (or
> maybe it should be folded into the first patch).

Now it makes sense to me.  I just didn't think of that.

> I'm not actually sure if this patch should try to address parallel
> workers at all, or (if so) if what it's doing now is adequate.
> 
> > > BTW, I think it may be that the GUC should be marked PGDLLIMPORT ?
> > 
> > I think it is project policy to apply this mark wherever it is needed.  Do 
> > you think
> > that third-party extensions might need to use this in C code?
> 
> Since v15 (8ec569479), the policy is:
> > On Windows, export all the server's global variables using PGDLLIMPORT 
> > markers (Robert Haas)
> 
> I'm convinced now that's what's intended here.

You convinced me too.

> > This is not enough.  The patch would have to update all the examples
> > that use EXPLAIN ANALYZE.
> 
> Fair enough.  I've left a comment to handle this later if the patch
> gains any traction and 001 is progressed.

I agree with that.

I would really like to see 0003 go in, but it currently depends on 0001, which
I am not so sure about.
I understand that you did that so that "explain_regress" can turn off BUFFERS
and there is no extra churn in the regression tests.

Still, it would be a shame if resistance against "explain_regress" would
be a show-stopper for 0003.

If I could get my way, I'd want two separate patches: first, one to turn
BUFFERS on, and second one for "explain_regress" with its current functionality
on top of that.

Yours,
Laurenz Albe




Re: Error for row-level triggers with transition tables on partitioned tables

2022-11-04 Thread Etsuro Fujita
On Thu, Nov 3, 2022 at 2:20 AM Tom Lane  wrote:
> Etsuro Fujita  writes:
> > We do not currently allow row-level triggers on partitions to have
> > transition tables either, and the error message for that is “ROW
> > triggers with transition tables are not supported on partitions.”.
> > How about changing the DETAIL message to something similar to this
> > like “ROW triggers with transition tables are not supported on
> > partitioned tables.”, to avoid confusion?  Patch attached.  Will add
> > this to the upcoming CF.
>
> +1, this wording is better.  I marked it RFC.

Cool!  I have committed the patch.

Thanks for reviewing!

Best regards,
Etsuro Fujita




found the document `README.parallel` to be a little bit incorrect

2022-11-04 Thread qinghao huang
Hi hackers,
Recently, I've been looking into the implementation of parallel index
builds. I noticed that `README.parallel` states that `we allow no writes to
the database and no DDL`. But if I understand it correctly, index building,
a significant DDL operation, now supports parallelism (though `_bt_load` is
still single-process). So I guess we should update the outdated document.

 The related file is here. 
https://github.com/postgres/postgres/blob/8c7146790811ac4eee23fab2226db14b636e1ac5/src/backend/access/transam/README.parallel#L85


Best,
Qinghao


Re: Logical replication timeout problem

2022-11-04 Thread Fabrice Chapuis
Hello Wang,
I tested the draft patch in my lab on Postgres 14.4; the refresh of the
materialized view ran without triggering the timeout on the worker.
Do you plan to propose this patch at the next commitfest?

Regards,
Fabrice

On Wed, Oct 19, 2022 at 10:15 AM wangw.f...@fujitsu.com <
wangw.f...@fujitsu.com> wrote:

> On Tue, Oct 18, 2022 at 22:35 PM Fabrice Chapuis 
> wrote:
> > Hello Amit,
> >
> > In version 14.4 the timeout problem for logical replication happens
> again despite
> > the patch provided for this issue in this version. When bulky
> materialized views
> > are reloaded it broke logical replication. It is possible to solve this
> problem by
> > using your new "streaming" option.
> > Have you ever had this issue reported to you?
> >
> > Regards
> >
> > Fabrice
> >
> > 2022-10-10 17:19:02 CEST [538424]: [17-1]
> > user=postgres,db=dbxxxa00,client=[local] CONTEXT:  SQL statement "REFRESH
> > MATERIALIZED VIEW sxxxa00.table_base"
> > PL/pgSQL function refresh_materialized_view(text) line 5 at
> EXECUTE
> > 2022-10-10 17:19:02 CEST [538424]: [18-1]
> > user=postgres,db=dbxxxa00,client=[local] STATEMENT:  select
> > refresh_materialized_view('sxxxa00.table_base');
> > 2022-10-10 17:19:02 CEST [538424]: [19-1]
> > user=postgres,db=dbxxxa00,client=[local] LOG:  duration: 264815.652
> > ms  statement: select refresh_materialized_view('sxxxa00.table_base');
> > 2022-10-10 17:19:27 CEST [559156]: [1-1] user=,db=,client= LOG:
> automatic
> > vacuum of table "dbxxxa00.sxxxa00.table_base": index scans: 0
> > pages: 0 removed, 296589 remain, 0 skipped due to pins, 0
> skipped frozen
> > tuples: 0 removed, 48472622 remain, 0 are dead but not yet
> removable,
> > oldest xmin: 1501528
> > index scan not needed: 0 pages from table (0.00% of total) had 0
> dead item
> > identifiers removed
> > I/O timings: read: 1.494 ms, write: 0.000 ms
> > avg read rate: 0.028 MB/s, avg write rate: 107.952 MB/s
> > buffer usage: 593301 hits, 77 misses, 294605 dirtied
> > WAL usage: 296644 records, 46119 full page images, 173652718
> bytes
> > system usage: CPU: user: 17.26 s, system: 0.29 s, elapsed: 21.32
> s
> > 2022-10-10 17:19:28 CEST [559156]: [2-1] user=,db=,client= LOG:
> automatic
> > analyze of table "dbxxxa00.sxxxa00.table_base"
> > I/O timings: read: 0.043 ms, write: 0.000 ms
> > avg read rate: 0.026 MB/s, avg write rate: 0.026 MB/s
> > buffer usage: 30308 hits, 2 misses, 2 dirtied
> > system usage: CPU: user: 0.54 s, system: 0.00 s, elapsed: 0.59 s
> > 2022-10-10 17:19:34 CEST [3898111]: [6840-1] user=,db=,client= LOG:
> checkpoint
> > complete: wrote 1194 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0
> recycled;
> > write=269.551 s, sync=0.002 s, total=269.560 s; sync files=251,
> longest=0.00
> > 1 s, average=0.001 s; distance=583790 kB, estimate=583790 kB
> > 2022-10-10 17:20:02 CEST [716163]: [2-1] user=,db=,client= ERROR:
> terminating
> > logical replication worker due to timeout
> > 2022-10-10 17:20:02 CEST [3897921]: [13-1] user=,db=,client= LOG:
> background
> > worker "logical replication worker" (PID 716163) exited with exit code 1
> > 2022-10-10 17:20:02 CEST [561346]: [1-1] user=,db=,client= LOG:  logical
> > replication apply worker for subscription "subxxx_sxxxa00" has started
>
> Thanks for reporting!
>
> There is one thing I want to confirm:
> Is the statement `select refresh_materialized_view('sxxxa00.table_base');`
> executed on the publisher-side?
>
> If so, I think the reason for this timeout problem could be that during DDL
> (`REFRESH MATERIALIZED VIEW`), lots of temporary data is generated due to
> rewrite. Since these temporary data will not be processed by the pgoutput
> plugin, our previous fix for DML had no impact on this case.
> I think setting "streaming" option to "on" could work around this problem.
>
> I tried to write a draft patch (see attachment) on REL_14_4 to fix this.
> I tried it locally and it seems to work.
> Could you please confirm whether this problem is fixed after applying this
> draft patch?
>
> If this draft patch works, I will improve it and try to fix this problem.
>
> Regards,
> Wang wei
>


Re: Add tracking of backend memory allocated to pg_stat_activity

2022-11-04 Thread Drouvot, Bertrand

Hi,

On 10/25/22 8:59 PM, Reid Thompson wrote:

On Tue, 2022-10-25 at 14:51 -0400, Reid Thompson wrote:

patch rebased to current master


actually attach the patch



It looks like the patch does not apply anymore since b1099eca8f.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: confirmed_flush_lsn shows LSN of the data that has not yet been received by the logical subscriber.

2022-11-04 Thread Amit Kapila
On Thu, Nov 3, 2022 at 4:49 PM Amit Kapila  wrote:
>
> On Mon, Sep 19, 2022 at 8:09 PM Ashutosh Sharma  wrote:
> >
> > Thanks for the clarification. Attached is the patch with the changes.
> > Please have a look.
> >
>
> LGTM. I'll push this tomorrow unless there are any other
> comments/suggestions for this.
>

Pushed!

-- 
With Regards,
Amit Kapila.




RE: Perform streaming logical transactions by background workers and parallel apply

2022-11-04 Thread Hayato Kuroda (Fujitsu)
> While testing yours, I found that the leader apply worker has been crashed in 
> the
> following case.
> I will dig the failure more, but I reported here for records.

I found the reason why the leader apply worker crashes.
In parallel_apply_free_worker() the leader sends the pending message to 
parallel apply worker:

```
+   /*
+* Resend the pending message to parallel apply worker to cleanup the
+* queue. Note that parallel apply worker will just ignore this message
+* as it has already handled this message while applying spooled
+* messages.
+*/
+   result = shm_mq_send(winfo->mq_handle, strlen(winfo->pending_msg),
+winfo->pending_msg, false, true);
```

...but the message length must not be calculated with strlen(), because a
logical replication message can contain embedded '\0' bytes.
PSA the patch to fix it. It can be applied on v42 patch set.
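To illustrate the failure mode (a standalone sketch, not the patch itself): logical replication messages are binary, so strlen() stops at the first embedded NUL and under-reports the length. The length has to be carried alongside the buffer.

```c
#include <string.h>

/* A 5-byte binary message with an embedded '\0' after the type byte. */
static const char msg[] = {'S', '\0', 1, 2, 3};

static size_t
wrong_len(void)
{
	return strlen(msg);			/* stops at the embedded '\0' */
}

static size_t
right_len(void)
{
	return sizeof(msg);			/* length tracked explicitly */
}
```

Here wrong_len() reports only 1 byte of a 5-byte message, which is why shm_mq_send() must be given the stored message length rather than strlen() of the buffer.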


Best Regards,
Hayato Kuroda
FUJITSU LIMITED



0001-fix-wrong-message-length-estimation.patch


Re: Support logical replication of DDLs

2022-11-04 Thread vignesh C
On Wed, 2 Nov 2022 at 05:13, vignesh C  wrote:
>
> On Mon, 31 Oct 2022 at 16:17, vignesh C  wrote:
> >
> > On Thu, 27 Oct 2022 at 16:02, vignesh C  wrote:
> > >
> > > On Thu, 27 Oct 2022 at 02:09, Zheng Li  wrote:
> > > >
> > > > > Adding support for deparsing of CREATE/ALTER/DROP LANGUAGE for ddl 
> > > > > replication.
> > > >
> > > > Adding support for deparsing of:
> > > > COMMENT
> > > > ALTER DEFAULT PRIVILEGES
> > > > CREATE/DROP ACCESS METHOD
> > >
> > > Adding support for deparsing of:
> > > ALTER/DROP ROUTINE
> > >
> > > The patch also includes fixes for the following issues:
> >
>
Few comments:
1) If the user has specified a non-existing object, then we will throw
the wrong error.
+Datum
+publication_deparse_ddl_command_start(PG_FUNCTION_ARGS)
+{
+   EventTriggerData *trigdata;
+   char   *command = psprintf("Drop table command start");
+   DropStmt   *stmt;
+   ListCell   *cell1;
+
+   if (!CALLED_AS_EVENT_TRIGGER(fcinfo))
+   elog(ERROR, "not fired by event trigger manager");
+
+   trigdata = (EventTriggerData *) fcinfo->context;
+   stmt = (DropStmt *) trigdata->parsetree;
+
+   /* extract the relid from the parse tree */
+   foreach(cell1, stmt->objects)
+   {
+   charrelpersist;
+   Node   *object = lfirst(cell1);
+   ObjectAddress address;
+   Relationrelation = NULL;
+
+   address = get_object_address(stmt->removeType,
+object,
+&relation,
+AccessExclusiveLock,
+true);
+
+   relpersist = get_rel_persistence(address.objectId);

We could check whether relation is NULL after getting the address and skip
processing that object.

2) Materialized view handling is missing:
+   switch (rel->rd_rel->relkind)
+   {
+   case RELKIND_RELATION:
+   case RELKIND_PARTITIONED_TABLE:
+   reltype = "TABLE";
+   break;
+   case RELKIND_INDEX:
+   case RELKIND_PARTITIONED_INDEX:
+   reltype = "INDEX";
+   break;
+   case RELKIND_VIEW:
+   reltype = "VIEW";
+   break;
+   case RELKIND_COMPOSITE_TYPE:
+   reltype = "TYPE";
+   istype = true;
+   break;

We could use this scenario for debugging and verifying:
ALTER MATERIALIZED VIEW testschema.amv SET TABLESPACE regress_tblspace;

3) The re-addition of statistics by ALTER TABLE is not handled:

+   case AT_DropIdentity:
+   tmpobj = new_objtree_VA("ALTER COLUMN %{column}I DROP IDENTITY", 2,
+   "type", ObjTypeString, "drop identity",
+   "column", ObjTypeString, subcmd->name);
+
+   append_string_object(tmpobj, "%{if_not_exists}s",
+subcmd->missing_ok ? "IF EXISTS" : "");
+
+   subcmds = lappend(subcmds, new_object_object(tmpobj));
+   break;
+   default:
+   elog(WARNING, "unsupported alter table subtype %d",
+subcmd->subtype);
+   break;
+   }


We could use this scenario for debugging and verifying:
CREATE TABLE functional_dependencies (
filler1 TEXT,
filler2 NUMERIC,
a INT,
b TEXT,
filler3 DATE,
c INT,
d TEXT
)
WITH (autovacuum_enabled = off);
CREATE STATISTICS func_deps_stat (dependencies) ON a, b, c FROM
functional_dependencies;
TRUNCATE functional_dependencies;
ALTER TABLE functional_dependencies ALTER COLUMN c TYPE numeric;

4) "Alter sequence as" option not hanlded

+   if (strcmp(elem->defname, "cache") == 0)
+   newelm = deparse_Seq_Cache(alterSeq, seqform, false);
+   else if (strcmp(elem->defname, "cycle") == 0)
+   newelm = deparse_Seq_Cycle(alterSeq, seqform, false);
+   else if (strcmp(elem->defname, "increment") == 0)
+   newelm = deparse_Seq_IncrementBy(alterSeq,
seqform, false);
+   else if (strcmp(elem->defname, "minvalue") == 0)
+   newelm = deparse_Seq_Minvalue(alterSeq, seqform, false);
+   else if (strcmp(elem->defname, "maxvalue") == 0)
+   newelm = deparse_Seq_Maxvalue(alterSeq, seqform, false);
+   else if (strcmp(elem->defname, "start") == 0)
+   newelm = deparse_Seq_Startwith(alterSeq,
seqform, false);
+   else if (strcmp(elem->defname, "restart") == 0)
+   newelm = deparse_Seq_Restart(alterSeq, seqdata);
+   else if (strcmp(elem->defname, "owned_by") == 0)
+   

Re: Proposal to provide the facility to set binary format output for specific OID's per session

2022-11-04 Thread Dave Cramer
Hi Ian,

Thanks, will do
Dave Cramer


On Thu, 3 Nov 2022 at 21:36, Ian Lawrence Barwick  wrote:

> 2022年9月6日(火) 21:32 Dave Cramer :
> >
> >
> >
> >
> > On Tue, 6 Sept 2022 at 02:30, Ibrar Ahmed  wrote:
> >>
> >>
> >>
> >> On Fri, Aug 12, 2022 at 5:48 PM Dave Cramer 
> wrote:
> >>>
> >>>
> >>>
> >>> On Fri, 5 Aug 2022 at 17:51, Justin Pryzby 
> wrote:
> 
>  On Tue, Jul 26, 2022 at 08:11:04AM -0400, Dave Cramer wrote:
>  > Attached patch to correct these deficiencies.
> 
>  You sent a patch to be applied on top of the first patch, but cfbot
> doesn't
>  know that, so it says the patch doesn't apply.
>  http://cfbot.cputube.org/dave-cramer.html
> 
>  BTW, a previous discussion about this idea is here:
> 
> https://www.postgresql.org/message-id/flat/40cbb35d-774f-23ed-3079-03f938aac...@2ndquadrant.com
> >>>
> >>>
> >>> squashed patch attached
> >>>
> >>> Dave
> >>
> >> The patch does not apply successfully; a rebase is required.
> >>
> >> === applying patch ./0001-add-format_binary.patch
> >> patching file src/backend/tcop/postgres.c
> >> Hunk #1 succeeded at 97 (offset -8 lines).
> >> patching file src/backend/tcop/pquery.c
> >> patching file src/backend/utils/init/globals.c
> >> patching file src/backend/utils/misc/guc.c
> >> Hunk #1 succeeded at 144 (offset 1 line).
> >> Hunk #2 succeeded at 244 with fuzz 2 (offset 1 line).
> >> Hunk #3 succeeded at 4298 (offset -1 lines).
> >> Hunk #4 FAILED at 12906.
> >> 1 out of 4 hunks FAILED -- saving rejects to file
> src/backend/utils/misc/guc.c.rej
> >> patching file src/include/miscadmin.h
> >>
> >
> > Thanks,
> >
> > New rebased patch attached
>
> Hi
>
> cfbot reports the patch no longer applies [1].  As CommitFest 2022-11 is
> currently underway, this would be an excellent time to update the patch
> again.
>
> [1] http://cfbot.cputube.org/patch_40_3777.log
>
> Thanks
>
> Ian Barwick
>


Re: Pluggable toaster

2022-11-04 Thread Aleksander Alekseev
Hi again,

> [1]: 
> https://www.postgresql.org/message-id/flat/CAJ7c6TOtAB0z1UrksvGTStNE-herK-43bj22=5xvbg7s4vr...@mail.gmail.com

I added a link but forgot to use it, d'oh!

Please note that if the goal is to merely implement "type-aware
sub-TOASTers" then compression dictionaries [1] arguably provide the
same value with MUCH less complexity.

-- 
Best regards,
Aleksander Alekseev




Re: [BUG] parenting a PK constraint to a self-FK one (Was: Self FK oddity when attaching a partition)

2022-11-04 Thread Jehan-Guillaume de Rorthais
On Thu, 3 Nov 2022 20:44:16 +0100
Alvaro Herrera  wrote:

> On 2022-Oct-05, Alvaro Herrera wrote:
> 
> > I've been giving the patches a look and it caused me to notice two
> > additional bugs in the same area:
> > 
> > - FKs in partitions are sometimes marked NOT VALID.  This is because of
> >   missing initialization when faking up a Constraint node in
> >   CloneFkReferencing.  Easy to fix, have patch, running tests now.  
> 
> I have pushed the fix for this now.

Thank you Alvaro!




Re: Pluggable toaster

2022-11-04 Thread Aleksander Alekseev
Hi Nikita,

> Setting TOAST for table and database is a subject for discussion. There is 
> already default
> Toaster. Also, there is not much sense in setting Jsonb Toaster as default 
> even for table, do
> not say database, because table could contain other TOASTable columns not of 
> Json type.

True, but besides Jsonb Toaster one can implement a universal one as
well (your own example: encryption, or a Toaster that bypasses a 1 GB
value limitation). However we can probably keep this out of scope of
this particular patchset for now. As mentioned before this is going to
be just an additional syntax for the user convenience.

> Custom Toasters will work with Extended storage, but as I answered in 
> previous email -
> there is no much use of it, because they would deal with compressed data.

Compression is actually a part of the four-stage TOAST algorithm. So
what you're doing is using the default TOAST most of the time, but if
the storage strategy is EXTERNAL and a custom TOASTer is specified
then you use a type-aware "TOASTer".

If the goal is to implement true "Pluggable TOASTers" then the TOAST
should be substituted entirely. If the goal is to implement type-aware
sub-TOASTers we should be honest about this and call it otherwise:
e.g. "Type-aware TOASTer" or perhaps "subTOASTer". Additionally in
this case there should be no validate() method since this is going to
work only with the default heapam that implements the default TOASTer.

So to clarify, the goal is to deliver true "Pluggable TOASTers" or
only "type-aware sub-TOASTers"?

> For TOASTer you SHOULD distinguish insert and update operations, really. 
> Because for
> TOASTed data these operations affect many tuples, and AM does know which of 
> them
> were updated and which were not - that's very serious limitation of current 
> TOAST, and
> TOAST mechanics shoud care about updating affected tuples only instead of 
> marking
> whole record dead and inserting new one. This is also an argument for not 
> using EXTENDED
> storage mode - because with compressed data you do not have such choice, you 
> should
> drop the whole record.

This may actually be a good point. I suggest reflecting it in the documentation.

> >Users should be able to DROP extension. I seriously doubt that the
> >patch is going to be accepted as long as it has this limitation.
>
> There is a mention in documentation and previous discussion that this 
> operation would lead
> to loss of data TOASTed with this custom Toaster.

I don't see any reason why the semantics for Toasters should be any
different from user-defined types implemented in an extension. If
there are columns that use a given Toaster we should forbid DROPping
the extension. Otherwise "DROP extension" should succeed. No one says
that this should be a fast operation but it should be possible.

[1]: 
https://www.postgresql.org/message-id/flat/CAJ7c6TOtAB0z1UrksvGTStNE-herK-43bj22=5xvbg7s4vr...@mail.gmail.com

-- 
Best regards,
Aleksander Alekseev




Re: Add SHELL_EXIT_CODE to psql

2022-11-04 Thread Corey Huinker
Oops, that sample output was from a previous run, should have been:

-- SHELL_EXIT_CODE is undefined
\echo :SHELL_EXIT_CODE
:SHELL_EXIT_CODE
-- bad \!
\! borp
sh: line 1: borp: command not found
\echo :SHELL_EXIT_CODE
127
-- bad backtick
\set var `borp`
sh: line 1: borp: command not found
\echo :SHELL_EXIT_CODE
127
-- good \!
\! true
\echo :SHELL_EXIT_CODE
0
-- play with exit codes
\! exit 4
\echo :SHELL_EXIT_CODE
4
\set var `exit 3`
\echo :SHELL_EXIT_CODE
3


On Fri, Nov 4, 2022 at 5:08 AM Corey Huinker 
wrote:

>
> Over in
> https://www.postgresql.org/message-id/eaf326ad693e74eba068f33a7f518...@oss.nttdata.com
>  Justin
> Pryzby suggested that psql might need the ability to capture the shell exit
> code.
>
> This is a POC patch that does that, but doesn't touch on the ON_ERROR_STOP
> stuff.
>
> I've added some very rudimentary tests, but haven't touched the
> documentation, because I strongly suspect that someone will suggest a
> better name for the variable.
>
> But basically, it works like this
>
> -- SHELL_EXIT_CODE is undefined
> \echo :SHELL_EXIT_CODE
> :SHELL_EXIT_CODE
> -- bad \!
> \! borp
> sh: line 1: borp: command not found
> \echo :SHELL_EXIT_CODE
> 32512
> -- bad backtick
> \set var `borp`
> sh: line 1: borp: command not found
> \echo :SHELL_EXIT_CODE
> 127
> -- good \!
> \! true
> \echo :SHELL_EXIT_CODE
> 0
> -- play with exit codes
> \! exit 4
> \echo :SHELL_EXIT_CODE
> 1024
> \set var `exit 3`
> \echo :SHELL_EXIT_CODE
> 3
>
>
> Feedback welcome.
>


Re: Make ON_ERROR_STOP stop on shell script failure

2022-11-04 Thread Corey Huinker
>
> I think it'd be a lot better to expose the script status to psql.
> (without having to write "foo; echo status=$?").
>

I agree, and I hacked up a proof of concept, but started another thread at
https://www.postgresql.org/message-id/CADkLM=cWao2x2f+UDw15W1JkVFr_bsxfstw=ngea7r9m4j-...@mail.gmail.com
so as not to clutter up this one.


Re: Allow single table VACUUM in transaction block

2022-11-04 Thread Simon Riggs
Hi Rahila,

Thanks for your review.

On Fri, 4 Nov 2022 at 07:37, Rahila Syed  wrote:

>> I would like to bring up a few points that I came across while looking into 
>> the vacuum code.
>>
>> 1.  As a result of this change to allow VACUUM inside a user transaction, I 
>> think there is some chance of causing
>> a block/delay of concurrent VACUUMs if a VACUUM is being run under a long 
>> running transaction.
>> 2. Also, if a user runs VACUUM in a transaction, performance optimizations 
>> like PROC_IN_VACUUM won't work.
>> 3. Also, if VACUUM happens towards the end of a long running transaction, 
>> the snapshot will be old
>> and xmin horizon for vacuum would be somewhat old as compared to current 
>> lazy vacuum which
>> acquires a new snapshot just before scanning the table.
>>
>> So, while I understand the need of the feature, I am wondering if there 
>> should be some mention
>> of above caveats in documentation with the recommendation that VACUUM should 
>> be run outside
>> a transaction, in general.
>>
>
> Sorry, I just noticed that you have already mentioned some of these in the 
> documentation as follows, so it seems
> it is already taken care of.
>
> +VACUUM cannot be executed inside a transaction block,
> +unless a single table is specified and FULL is not
> +specified.  When executing inside a transaction block the vacuum scan can
> +hold back the xmin horizon and does not update the database datfrozenxid,
> +as a result this usage is not useful for database maintenance, but is 
> provided
> +to allow vacuuming in special circumstances, such as temporary or private
> +work tables.

Yes, I wondered whether we should have a NOTICE or WARNING to remind
people of those points?

-- 
Simon Riggs
http://www.EnterpriseDB.com/




Add SHELL_EXIT_CODE to psql

2022-11-04 Thread Corey Huinker
Over in
https://www.postgresql.org/message-id/eaf326ad693e74eba068f33a7f518...@oss.nttdata.com
Justin
Pryzby suggested that psql might need the ability to capture the shell exit
code.

This is a POC patch that does that, but doesn't touch on the ON_ERROR_STOP
stuff.

I've added some very rudimentary tests, but haven't touched the
documentation, because I strongly suspect that someone will suggest a
better name for the variable.

But basically, it works like this

-- SHELL_EXIT_CODE is undefined
\echo :SHELL_EXIT_CODE
:SHELL_EXIT_CODE
-- bad \!
\! borp
sh: line 1: borp: command not found
\echo :SHELL_EXIT_CODE
32512
-- bad backtick
\set var `borp`
sh: line 1: borp: command not found
\echo :SHELL_EXIT_CODE
127
-- good \!
\! true
\echo :SHELL_EXIT_CODE
0
-- play with exit codes
\! exit 4
\echo :SHELL_EXIT_CODE
1024
\set var `exit 3`
\echo :SHELL_EXIT_CODE
3


Feedback welcome.
From 119837575f4d0da804d92ec797bbf11e8075e595 Mon Sep 17 00:00:00 2001
From: coreyhuinker 
Date: Fri, 4 Nov 2022 04:45:39 -0400
Subject: [PATCH] POC: expose shell exit code as a psql variable

---
 src/bin/psql/command.c |  4 
 src/bin/psql/help.c|  2 ++
 src/bin/psql/psqlscanslash.l   | 28 +---
 src/test/regress/expected/psql.out | 11 +++
 src/test/regress/sql/psql.sql  |  7 +++
 5 files changed, 49 insertions(+), 3 deletions(-)

diff --git a/src/bin/psql/command.c b/src/bin/psql/command.c
index ab613dd49e..4666a63051 100644
--- a/src/bin/psql/command.c
+++ b/src/bin/psql/command.c
@@ -4957,6 +4957,7 @@ static bool
 do_shell(const char *command)
 {
 	int			result;
+	char		exit_code_buf[32];
 
 	fflush(NULL);
 	if (!command)
@@ -4984,6 +4985,9 @@ do_shell(const char *command)
 	else
 		result = system(command);
 
+	snprintf(exit_code_buf, sizeof(exit_code_buf), "%d", WEXITSTATUS(result));
+	SetVariable(pset.vars, "SHELL_EXIT_CODE", exit_code_buf);
+
 	if (result == 127 || result == -1)
 	{
 		pg_log_error("\\!: failed");
diff --git a/src/bin/psql/help.c b/src/bin/psql/help.c
index f8ce1a0706..04f996332e 100644
--- a/src/bin/psql/help.c
+++ b/src/bin/psql/help.c
@@ -454,6 +454,8 @@ helpVariables(unsigned short int pager)
 		  "show all results of a combined query (\\;) instead of only the last\n");
 	HELP0("  SHOW_CONTEXT\n"
 		  "controls display of message context fields [never, errors, always]\n");
+	HELP0("  SHELL_EXIT_CODE\n"
+		  "Exit code of the last shell command\n");
 	HELP0("  SINGLELINE\n"
 		  "if set, end of line terminates SQL commands (same as -S option)\n");
 	HELP0("  SINGLESTEP\n"
diff --git a/src/bin/psql/psqlscanslash.l b/src/bin/psql/psqlscanslash.l
index a467b72144..30e6f5dcd4 100644
--- a/src/bin/psql/psqlscanslash.l
+++ b/src/bin/psql/psqlscanslash.l
@@ -27,6 +27,8 @@
 
 %{
 #include "fe_utils/psqlscan_int.h"
+#include "settings.h"
+#include "variables.h"
 
 /*
  * We must have a typedef YYSTYPE for yylex's first argument, but this lexer
@@ -774,6 +776,8 @@ evaluate_backtick(PsqlScanState state)
 	bool		error = false;
 	char		buf[512];
 	size_t		result;
+	int			exit_code = 0;
+	char		exit_code_buf[32];
 
 	initPQExpBuffer(&cmd_output);
 
@@ -783,6 +787,7 @@ evaluate_backtick(PsqlScanState state)
 	{
 		pg_log_error("%s: %m", cmd);
 		error = true;
+		exit_code = -1;
 	}
 
 	if (!error)
@@ -800,10 +805,25 @@ evaluate_backtick(PsqlScanState state)
 		} while (!feof(fd));
 	}
 
-	if (fd && pclose(fd) == -1)
+	if (fd)
 	{
-		pg_log_error("%s: %m", cmd);
-		error = true;
+		exit_code = pclose(fd);
+		if (exit_code == -1)
+		{
+			pg_log_error("%s: %m", cmd);
+			error = true;
+		}
+		if (WIFEXITED(exit_code))
+		{
+			exit_code=WEXITSTATUS(exit_code);
+		}
+		else if(WIFSIGNALED(exit_code)) {
+			exit_code=WTERMSIG(exit_code);
+		}
+		else if(WIFSTOPPED(exit_code)) {
+			//If you need to act upon the process stopping, do it here.
+			exit_code=WSTOPSIG(exit_code);
+		}
 	}
 
 	if (PQExpBufferDataBroken(cmd_output))
@@ -826,5 +846,7 @@ evaluate_backtick(PsqlScanState state)
 		appendBinaryPQExpBuffer(output_buf, cmd_output.data, cmd_output.len);
 	}
 
+	snprintf(exit_code_buf, sizeof(exit_code_buf), "%d", exit_code);
+	SetVariable(pset.vars, "SHELL_EXIT_CODE", exit_code_buf);
 	termPQExpBuffer(&cmd_output);
 }
diff --git a/src/test/regress/expected/psql.out b/src/test/regress/expected/psql.out
index a7f5700edc..33202ffda7 100644
--- a/src/test/regress/expected/psql.out
+++ b/src/test/regress/expected/psql.out
@@ -1275,6 +1275,17 @@ execute q;
 ++-+
 
 deallocate q;
+-- test SHELL_EXIT_CODE
+\! nosuchcommand
+sh: line 1: nosuchcommand: command not found
+\echo :SHELL_EXIT_CODE
+127
+\set nosuchvar `nosuchcommand`
+sh: line 1: nosuchcommand: command not found
+\! nosuchcommand
+sh: line 1: nosuchcommand: command not found
+\echo :SHELL_EXIT_CODE
+127
 -- test single-line header and data
 prepare q as select repeat('x',2*n) as "0123456789abcdef", repeat('y',20-2*n) as "0123456789" from generate_series(1,10) as n;
 \pset 

Re: [PATCH] Compression dictionaries for JSONB

2022-11-04 Thread Aleksander Alekseev
Hi hackers,

> For the record, Nikita and I agreed offlist that Nikita will join this
> effort as a co-author in order to implement the suggested improvements
> (and perhaps some improvements that were not suggested yet). Meanwhile
> I'm going to keep the current version of the patch up to date with the
> `master` branch.

8272749e added a few more arguments to CastCreate(). Here is the rebased patch.

-- 
Best regards,
Aleksander Alekseev


v9-0001-Compression-dictionaries-for-JSONB.patch
Description: Binary data


Re: Moving forward with TDE

2022-11-04 Thread Dilip Kumar
On Fri, Nov 4, 2022 at 3:36 AM David Christensen
 wrote:
>
> > Unless somebody in the community remembers open questions/issues with
> > TDE that were never addressed I suggest simply iterating with our
> > usual testing/reviewing process. For now I'm going to change the
> > status of the CF entry [1] to "Waiting for Author" since the patchset
> > doesn't pass the CI [2].
>
> Thanks, enclosed is a new version that is rebased on HEAD and fixes a
> bug that the new pg_control_init() test picked up.

I was looking into the documentation patches 0001 and 0002, I think
the explanation is very clear.  I have a few questions/comments

+By not using the database id in the IV, CREATE DATABASE can copy the
+heap/index files from the old database to a new one without
+decryption/encryption.  Both page copies are valid.  Once a database
+changes its pages, it gets new LSNs, and hence new IV.

How about the WAL_LOG method for creating a database? Because there we
get new LSNs for the pages in the new database, do we re-encrypt? If
yes, this documentation needs to be updated; otherwise we might need to
add that code.

+changes its pages, it gets new LSNs, and hence new IV.  Using only the
+LSN and page number also avoids requiring pg_upgrade to preserve
+database oids, tablespace oids, and relfilenodes.

I think this line needs to be changed, because now we are already
preserving dbid/tbsid/relfilenode.  So even though we are not using
those in IV there is no point in saying we are avoiding that
requirement.

I will review the remaining patches soon.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Privileges on PUBLICATION

2022-11-04 Thread Antonin Houska
Amit Kapila  wrote:

> On Thu, Nov 3, 2022 at 11:12 AM Antonin Houska  wrote:
> >
> > Peter Eisentraut  wrote:
> >
> > > The CF entry is about privileges on publications.  Please rebase that 
> > > patch
> > > and repost it so that the CF app and the CF bot are up to date.
> >
> > The rebased patch (with regression tests added) is attached here.
> >
> > There's still one design issue that I haven't mentioned yet: if the USAGE
> > privilege on a publication is revoked after the synchronization phase
> > completed, the missing privilege on a publication causes ERROR in the output
> > plugin. If the privilege is then granted, the error does not disappear 
> > because
> > the same (historical) snapshot we use to decode the failed data change again
> > is also used to check the privileges in the catalog, so the output plugin 
> > does
> > not see that the privilege has already been granted.
> >
> 
> We have a similar problem even when publication is dropped/created.
> The replication won't be able to proceed.
> 
> > The only solution seems to be to drop the publication from the subscription
> > and add it again, or to drop and re-create the whole subscription. I haven't
> > added a note about this problem to the documentation yet, in case someone 
> > has
> > better idea how to approach the problem.
> >
> 
> I think one possibility is that the user advances the slot used in
> replication by using pg_replication_slot_advance() at or after the
> location where the privilege is granted. Some other ideas have been
> discussed in the thread [1], in particular, see email [2] and
> discussion after that but we didn't reach any conclusion.

Thanks for feedback. Regarding the publications, I'm not sure what the user
should expect if he revokes the permissions in the middle of processing. In
such a situation, he should be aware that some privileged data could already
have been replicated. My preference in such a situation would be to truncate
the table(s) on the subscriber side and to restart the replication, rather
than fix the replication and keep the subscriber's access to the leaked data.

> [1] - 
> https://www.postgresql.org/message-id/CAHut%2BPvMbCsL8PAz1Qc6LNoL0Ag0y3YJtPVJ8V0xVXJOPb%2B0xw%40mail.gmail.com
> [2] - 
> https://www.postgresql.org/message-id/CAA4eK1JTwOAniPua04o2EcOXfzRa8ANax%3D3bpx4H-8dH7M2p%3DA%40mail.gmail.com

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com




Re: Report checkpoint progress with pg_stat_progress_checkpoint (was: Report checkpoint progress in server logs)

2022-11-04 Thread Drouvot, Bertrand

Hi,

On 7/28/22 11:38 AM, Nitin Jadhav wrote:

To understand the performance effects of the above, I have taken the
average of five checkpoints with the patch and without the patch in my
environment. Here are the results.
With patch: 269.65 s
Without patch: 269.60 s


Those look like timed checkpoints - if the checkpoints are sleeping a
part of the time, you're not going to see any potential overhead.


Yes. The above data is collected from timed checkpoints.

create table t1(a int);
insert into t1 select * from generate_series(1,1000);

I generated a lot of data by using the above queries which would in
turn trigger the checkpoint (wal).
---


To see whether this has an effect you'd have to make sure there's a
certain number of dirty buffers (e.g. by doing CREATE TABLE AS
some_query) and then do a manual checkpoint and time how long that
takes.


For this case I have generated data by using below queries.

create table t1(a int);
insert into t1 select * from generate_series(1,800);

This does not trigger the checkpoint automatically. I have issued the
CHECKPOINT manually and measured the performance by considering an
average of 5 checkpoints. Here are the details.

With patch: 2.457 s
Without patch: 2.334 s

Please share your thoughts.



v6 was not applying anymore, due to a change in 
doc/src/sgml/ref/checkpoint.sgml done by b9eb0ff09e (Rename 
pg_checkpointer predefined role to pg_checkpoint).


Please find attached a rebase in v7.

While working on this rebase, I also noticed that "pg_checkpointer" is 
still mentioned in some translation files:

"
$ git grep pg_checkpointer
src/backend/po/de.po:msgid "must be superuser or have privileges of 
pg_checkpointer to do CHECKPOINT"
src/backend/po/ja.po:msgid "must be superuser or have privileges of 
pg_checkpointer to do CHECKPOINT"
src/backend/po/ja.po:msgstr 
"CHECKPOINTを実行するにはスーパーユーザーであるか、またはpg_checkpointerの権限を持つ必要があります"
src/backend/po/sv.po:msgid "must be superuser or have privileges of 
pg_checkpointer to do CHECKPOINT"

"

I'm not familiar with how the translation files are handled (looks like 
they have their own set of commits, see 3c0bcdbc66 for example) but 
wanted to mention that "pg_checkpointer" is still mentioned (even if 
that may be expected as the last commit related to translation files 
(aka 3c0bcdbc66) is older than the one that renamed pg_checkpointer to 
pg_checkpoint (aka b9eb0ff09e)).


That said, back to this patch: I did not look closely but noticed that 
the buffers_total reported by pg_stat_progress_checkpoint:


postgres=# select type,flags,start_lsn,phase,buffers_total,new_requests 
from pg_stat_progress_checkpoint;
    type    |        flags         | start_lsn  |         phase         | buffers_total | new_requests
------------+----------------------+------------+-----------------------+---------------+--------------
 checkpoint | immediate force wait | 1/E6C523A8 | checkpointing buffers |       1024275 | false
(1 row)

is a little bit different from what is logged once completed:

2022-11-04 08:18:50.806 UTC [3488442] LOG:  checkpoint complete: wrote 
1024278 buffers (97.7%);


Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

commit 5198f5010be0febd019b117e2d78b8e42b21
Author: bdrouvot 
Date:   Thu Nov 3 12:59:10 2022 +

v7-0001-pg_stat_progress_checkpoint-view.patch

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index e5d622d514..ceb7d60ffa 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -414,6 +414,13 @@ postgres   27093  0.0  0.0  30096  2752 ?Ss   
11:34   0:00 postgres: ser
See .
   
  
+
+ 
+  
pg_stat_progress_checkpointpg_stat_progress_checkpoint
+  One row only, showing the progress of the checkpoint.
+   See .
+  
+ 
 

   
@@ -5784,7 +5791,7 @@ FROM pg_stat_get_backend_idset() AS backendid;
which support progress reporting are ANALYZE,
CLUSTER,
CREATE INDEX, VACUUM,
-   COPY,
+   COPY, CHECKPOINT
and  (i.e., replication
command that  issues to take
a base backup).
@@ -7072,6 +7079,402 @@ FROM pg_stat_get_backend_idset() AS backendid;
   
  
 
+ 
+  Checkpoint Progress Reporting
+
+  
+   pg_stat_progress_checkpoint
+  
+
+  
+   Whenever the checkpoint operation is running, the
+   pg_stat_progress_checkpoint view will contain a
+   single row indicating the progress of the checkpoint. The tables below
+   describe the information that will be reported and provide information about
+   how to interpret it.
+  
+
+  
+   pg_stat_progress_checkpoint View
+   
+
+ 
+  
+   Column Type
+  
+  
+   Description
+  
+ 
+
+
+
+ 
+  
+   pid integer
+  
+  
+   Process ID of the checkpointer process.
+  
+ 
+
+ 
+  
+   type text
+  
+  
+   Type of the checkpoint. See .
+  
+ 
+
+  

Re: Add explicit casts in four places to simplehash.h

2022-11-04 Thread David Geier

On 11/3/22 15:50, Tom Lane wrote:

Seems reasonable, so done (I fixed one additional spot you missed).

Thanks!

The bigger picture here is that we do actually endeavor to keep
(most of) our headers C++ clean, but our tool cpluspluscheck misses
these problems because it doesn't try to use these macros.
I wonder whether there is a way to do better.


What about having a custom header alongside cpluspluscheck which 
references all macros we care about?
We could start with the really big macros like the ones from 
simplehash.h and add as we go.

I could give this a try if deemed useful.

--
David Geier
(ServiceNow)





Reviving lost replication slots

2022-11-04 Thread sirisha chamarthi
Hi,

A replication slot can be lost when a subscriber is not able to catch up
with the load on the primary and the WAL to catch up exceeds
max_slot_wal_keep_size. When this happens, target has to be reseeded
(pg_dump) from scratch, and this can take a long time. I am investigating the
options to revive a lost slot. With the attached patch and copying the WAL
files from the archive to pg_wal directory I was able to revive the lost
slot. I also verified that a lost slot doesn't let vacuum cleanup the
catalog tuples deleted by any later transaction than catalog_xmin. One side
effect of this approach is that the checkpointer creates .ready files
corresponding to the copied WAL files in the archive_status folder. Archive
command has to handle this case. At the same time, checkpointer can
potentially delete the file again before the subscriber consumes the file
again. In the proposed patch, I am not setting restart_lsn
to InvalidXLogRecPtr but instead relying on the invalidated_at field to
tell if the slot is lost. Was the intent of setting restart_lsn to
InvalidXLogRecPtr to disallow reviving the slot?

If the overall direction seems OK, I will continue the work to revive the
slot by copying the wal files from the archive. Appreciate your feedback.

Thanks,
Sirisha


0001-Allow-revive-a-lost-replication-slot.patch
Description: Binary data


Re: Perform streaming logical transactions by background workers and parallel apply

2022-11-04 Thread Amit Kapila
On Thu, Nov 3, 2022 at 6:36 PM houzj.f...@fujitsu.com
 wrote:
>
> Thanks for the analysis and summary !
>
> I tried to implement the above idea and here is the patch set.
>

Few comments on v42-0001
===
1.
+ /*
+ * Set the xact_state flag in the leader instead of the
+ * parallel apply worker to avoid the race condition where the leader has
+ * already started waiting for the parallel apply worker to finish
+ * processing the transaction while the child process has not yet
+ * processed the first STREAM_START and has not set the
+ * xact_state to true.
+ */
+ SpinLockAcquire(&winfo->shared->mutex);
+ winfo->shared->xact_state = PARALLEL_TRANS_UNKNOWN;

The comments and code for xact_state don't seem to match.

2.
+ * progress. This could happend as we don't wait for transaction rollback
+ * to finish.
+ */

/happend/happen

3.
+/* Helper function to release a lock with lockid */
+void
+parallel_apply_lock(uint16 lockid)
...
...
+/* Helper function to take a lock with lockid */
+void
+parallel_apply_unlock(uint16 lockid)

Here, the comments seem to be reversed.

4.
+parallel_apply_lock(uint16 lockid)
+{
+ MemoryContext oldcontext;
+
+ if (list_member_int(ParallelApplyLockids, lockid))
+ return;
+
+ LockSharedObjectForSession(SubscriptionRelationId, MySubscription->oid,
+lockid, am_leader_apply_worker() ?
+AccessExclusiveLock:
+AccessShareLock);

This appears odd to me because it forecloses the option of the parallel
apply worker ever acquiring this lock in exclusive mode. I think it
would be better to have lock_mode as one of the parameters in this
API.

5.
+ * Inintialize fileset if not yet and open the file.
+ */
+void
+serialize_stream_start(TransactionId xid, bool first_segment)

Typo. /Inintialize/Initialize

6.
parallel_apply_setup_dsm()
{
...
+ shared->xact_state = false;

xact_state should be set with one of the values of ParallelTransState.

7.
/*
+ * Don't use SharedFileSet here because the fileset is shared by the leader
+ * worker and the fileset in leader need to survive after releasing the
+ * shared memory

This comment seems a bit unclear to me. Should there be an "and" between
"leader" and "worker"? If so, then the following 'and' won't make sense.

8.
+apply_handle_stream_stop(StringInfo s)
{
...
+ case TRANS_PARALLEL_APPLY:
+
+ /*
+ * If there is no message left, wait for the leader to release the
+ * lock and send more messages.
+ */
+ if (pg_atomic_sub_fetch_u32(&(MyParallelShared->left_message), 1) == 0)
+ parallel_apply_lock(MyParallelShared->stream_lock_id);

As per Sawada-San's email [1], this lock should be released
immediately after we acquire it. If we do so, then we don't need to
unlock separately in apply_handle_stream_start() in the below code and
at similar places in stream_prepare, stream_commit, and stream_abort.
Is there a reason for doing it differently?

apply_handle_stream_start(StringInfo s)
{
...
+ case TRANS_PARALLEL_APPLY:
...
+ /*
+ * Unlock the shared object lock so that the leader apply worker
+ * can continue to send changes.
+ */
+ parallel_apply_unlock(MyParallelShared->stream_lock_id);


9.
+parallel_apply_spooled_messages(void)
{
...
+ if (fileset_valid)
+ {
+ in_streamed_transaction = false;
+
+ parallel_apply_lock(MyParallelShared->transaction_lock_id);

Is there a reason to acquire this lock here if the parallel apply
worker will acquire it at stream_start?

10.
+ winfo->shared->stream_lock_id = parallel_apply_get_unique_id();
+ winfo->shared->transaction_lock_id = parallel_apply_get_unique_id();

Why can't we use xid (remote_xid) for one of these and local_xid (one
generated by parallel apply) for the other? I was a bit worried about
the local_xid because it will be generated only after applying the
first message but the patch already seems to be waiting for it in
parallel_apply_wait_for_xact_finish as seen in the below code.

+void
+parallel_apply_wait_for_xact_finish(ParallelApplyWorkerShared *wshared)
+{
+ /*
+ * Wait until the parallel apply worker handles the first message and
+ * set the flag to true.
+ */
+ parallel_apply_wait_for_in_xact(wshared, PARALLEL_TRANS_STARTED);
+
+ /* Wait for the transaction lock to be released. */
+ parallel_apply_lock(wshared->transaction_lock_id);

[1] - 
https://www.postgresql.org/message-id/CAD21AoCWovvhGBD2uKcQqbk6px6apswuBrs6dR9%2BWhP1j2LdsQ%40mail.gmail.com

-- 
With Regards,
Amit Kapila.




RE: Perform streaming logical transactions by background workers and parallel apply

2022-11-04 Thread Hayato Kuroda (Fujitsu)
Dear Hou,

Thank you for updating the patch!
While testing it, I found that the leader apply worker crashes in the
following case. I will dig into the failure more, but I am reporting it
here for the record.


1. Change macros for forcing to write a temporary file.

```
-#define CHANGES_THRESHOLD  1000
-#define SHM_SEND_TIMEOUT_MS1
+#define CHANGES_THRESHOLD  10
+#define SHM_SEND_TIMEOUT_MS100
```

2. Set logical_decoding_work_mem to 64kB on publisher

3. Insert huge data on publisher

```
publisher=# \d tbl 
Table "public.tbl"
 Column |  Type   | Collation | Nullable | Default 
+-+---+--+-
 c  | integer |   |  | 
Publications:
"pub"


publisher=# BEGIN;
BEGIN
publisher=*# INSERT INTO tbl SELECT i FROM generate_series(1, 500) s(i);
INSERT 0 500
publisher=*# COMMIT;
```

-> The leader apply worker (LA) crashes on the subscriber! The backtrace follows.


```
(gdb) bt
#0  0x7f2663ae4387 in raise () from /lib64/libc.so.6
#1  0x7f2663ae5a78 in abort () from /lib64/libc.so.6
#2  0x00ad0a95 in ExceptionalCondition (conditionName=0xcabdd0 
"mqh->mqh_partial_bytes <= nbytes", 
fileName=0xcabc30 "../src/backend/storage/ipc/shm_mq.c", lineNumber=420) at 
../src/backend/utils/error/assert.c:66
#3  0x008eaeb7 in shm_mq_sendv (mqh=0x271ebd8, iov=0x7ffc664a2690, 
iovcnt=1, nowait=false, force_flush=true)
at ../src/backend/storage/ipc/shm_mq.c:420
#4  0x008eac5a in shm_mq_send (mqh=0x271ebd8, nbytes=1, data=0x271f3c0, 
nowait=false, force_flush=true)
at ../src/backend/storage/ipc/shm_mq.c:338
#5  0x00880e18 in parallel_apply_free_worker (winfo=0x271f270, xid=735, 
stop_worker=true)
at ../src/backend/replication/logical/applyparallelworker.c:368
#6  0x008a3638 in apply_handle_stream_commit (s=0x7ffc664a2790) at 
../src/backend/replication/logical/worker.c:2081
#7  0x008a54da in apply_dispatch (s=0x7ffc664a2790) at 
../src/backend/replication/logical/worker.c:3195
#8  0x008a5a76 in LogicalRepApplyLoop (last_received=378674872) at 
../src/backend/replication/logical/worker.c:3431
#9  0x008a72ac in start_apply (origin_startpos=0) at 
../src/backend/replication/logical/worker.c:4245
#10 0x008a7d77 in ApplyWorkerMain (main_arg=0) at 
../src/backend/replication/logical/worker.c:4555
#11 0x0084983c in StartBackgroundWorker () at 
../src/backend/postmaster/bgworker.c:861
#12 0x00854192 in do_start_bgworker (rw=0x26c0d20) at 
../src/backend/postmaster/postmaster.c:5801
#13 0x0085457c in maybe_start_bgworkers () at 
../src/backend/postmaster/postmaster.c:6025
#14 0x0085350b in sigusr1_handler (postgres_signal_arg=10) at 
../src/backend/postmaster/postmaster.c:5182
#15 
#16 0x7f2663ba3b23 in __select_nocancel () from /lib64/libc.so.6
#17 0x0084edbc in ServerLoop () at 
../src/backend/postmaster/postmaster.c:1768
#18 0x0084e737 in PostmasterMain (argc=3, argv=0x2690f60) at 
../src/backend/postmaster/postmaster.c:1476
#19 0x0074adfb in main (argc=3, argv=0x2690f60) at 
../src/backend/main/main.c:197
``` 

PSA the script that can reproduce the failure on my environment. 

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



repro.sh
Description: repro.sh


Re: Allow single table VACUUM in transaction block

2022-11-04 Thread Rahila Syed
Hi Simon,

On Fri, Nov 4, 2022 at 10:15 AM Rahila Syed  wrote:

> Hi Simon,
>
> On Thu, Nov 3, 2022 at 3:53 PM Simon Riggs 
> wrote:
>
>> On Tue, 1 Nov 2022 at 23:56, Simon Riggs 
>> wrote:
>>
>> > > I haven't checked the rest of the patch, but +1 for allowing VACUUM
>> FULL
>> > > within a user txn.
>> >
>> > My intention was to prevent that. I am certainly quite uneasy about
>> > changing anything related to CLUSTER/VF, since they are old, complex
>> > and bug prone.
>> >
>> > So for now, I will block VF, as was my original intent.
>> >
>> > I will be guided by what others think... so you may yet get your wish.
>> >
>> >
>> > > Maybe the error message needs to be qualified "...when multiple
>> > > relations are specified".
>> > >
>> > > ERROR:  VACUUM cannot run inside a transaction block
>> >
>> > Hmm, that is standard wording based on the statement type, but I can
>> > set a CONTEXT message also. Will update accordingly.
>> >
>> > Thanks for your input.
>>
>> New version attached, as described.
>>
>> Other review comments and alternate opinions welcome.
>>
>>
> I applied and did some basic testing on the patch, it works as described.
>
> I would like to bring up a few points that I came across while looking
> into the vacuum code.
>
> 1.  As a result of this change to allow VACUUM inside a user transaction,
> I think there is some chance of causing
> a block/delay of concurrent VACUUMs if a VACUUM is being run under a long
> running transaction.
> 2. Also, if a user runs VACUUM in a transaction, performance optimizations
> like PROC_IN_VACUUM won't work.
> 3. Also, if VACUUM happens towards the end of a long running transaction,
> the snapshot will be old
> and xmin horizon for vacuum would be somewhat old as compared to current
> lazy vacuum which
> acquires a new snapshot just before scanning the table.
>
> So, while I understand the need of the feature, I am wondering if there
> should be some mention
> of above caveats in documentation with the recommendation that VACUUM
> should be run outside
> a transaction, in general.
>
>
Sorry, I just noticed that you have already mentioned some of these in the
documentation as follows, so it seems
it is already taken care of.

+VACUUM cannot be executed inside a transaction
block,
+unless a single table is specified and FULL is not
+specified.  When executing inside a transaction block the vacuum scan
can
+hold back the xmin horizon and does not update the database
datfrozenxid,
+as a result this usage is not useful for database maintenance, but is
provided
+to allow vacuuming in special circumstances, such as temporary or
private
+work tables.

Thank you,
Rahila Syed


Re: Privileges on PUBLICATION

2022-11-04 Thread Antonin Houska
Peter Eisentraut  wrote:

> On 03.11.22 01:43, Antonin Houska wrote:
> > Peter Eisentraut  wrote:
> > 
> >> The CF entry is about privileges on publications.  Please rebase that patch
> >> and repost it so that the CF app and the CF bot are up to date.
> > The rebased patch (with regression tests added) is attached here.
> 
> Some preliminary discussion:
> 
> What is the upgrade strategy?  I suppose the options are either that
> publications have a default acl that makes them publicly accessible, 
> thus preserving the existing behavior by default, or pg_dump would need to
> create additional GRANT statements when upgrading from pre-PG16.  I don't see
> anything like either of these mentioned in the patch.  What is your plan?

So far I have considered the first option.

> You might be interested in this patch, which relates to yours:
> https://commitfest.postgresql.org/40/3955/

ok, I'll check.

> Looking at your patch, I would also like to find a way to refactor away the
> ExecGrant_Publication() function.  I'll think about that.
> 
> I think you should add some tests under src/test/regress/ for the new GRANT
> and REVOKE statements, just to have some basic coverage that it works.
> sql/publication.sql would be appropriate, I think.

I thought about the whole concept a bit more and I doubt that the PUBLICATION
privilege is the best approach. In particular, the user specified in CREATE
SUBSCRIPTION ... CONNECTION ... (say the "subscription user") needs to have
SELECT privilege on the replicated tables. So if the DBA excludes some columns
from the publication's column list and sets the (publication) privileges in
such a way that the user cannot get the column values via other publications,
the user can still connect to the database directly and read the values of the
excluded columns.

As an alternative to the publication privileges, I think that the CREATE
SUBSCRIPTION command could grant ACL_SELECT automatically to the subscription
user on the individual columns contained in the publication column list, and
DROP SUBSCRIPTION would revoke that privilege.

Of course a question is what to do if the replication user already has that
privilege on some columns: either the CREATE SUBSCRIPTION command should raise
an ERROR, or we should introduce a new privilege (e.g. ACL_SELECT_PUB) for this
purpose, which would effectively be ACL_SELECT, but hidden from the users of
GRANT / REVOKE.

If this approach was taken, the USAGE privilege on schema would need to be
granted / revoked in a similar way.

What do you think about that?

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com




Re: PL/pgSQL cursors should get generated portal names by default

2022-11-04 Thread Pavel Stehule
Hi


On Wed, Nov 2, 2022 at 0:39 Tom Lane  wrote:

> There's a complaint at [1] about how you can't re-use the same
> cursor variable name within a routine called from another routine
> that's already using that name.  The complaint is itself a bit
> under-documented, but I believe it is referring to this ancient
> bit of behavior:
>
>  A bound cursor variable is initialized to the string value
>  representing its name, so that the portal name is the same as
>  the cursor variable name, unless the programmer overrides it
>  by assignment before opening the cursor.
>
> So if you try to nest usage of two bound cursor variables of the
> same name, it blows up on the portal-name conflict.  But it'll work
> fine if you use unbound cursors (i.e., plain "refcursor" variables):
>
>  But an unbound cursor
>  variable defaults to the null value initially, so it will receive
>  an automatically-generated unique name, unless overridden.
>
> I wonder why we did it like that; maybe it's to be bug-compatible with
> some Oracle PL/SQL behavior or other?  Anyway, this seems non-orthogonal
> and contrary to all principles of structured programming.  We don't even
> offer an example of the sort of usage that would benefit from it, ie
> that calling code could "just know" what the portal name is.
>
> I propose that we should drop this auto initialization and let all
> refcursor variables start out null, so that they'll get unique
> portal names unless you take explicit steps to do something else.
> As attached.
>
> (Obviously this would be a HEAD-only fix, but maybe there's scope for
> improving the back-branch docs along lines similar to these changes.)
>
>
I am sending a review of this patch:

1. The patch applies and compiles without any problems.
2. All tests passed.
3. The implemented change is well documented.
4. Although this is a potential compatibility break, we want this feature. It
allows cursor variables to be used in recursive calls by default, and it
allows shadowing of cursor variables.
5. This patch is short and almost trivial; it just removes code.

I'll mark this patch as ready for committer.

Regards

Pavel



> regards, tom lane
>
> [1]
> https://www.postgresql.org/message-id/166689990972.627.16269382598283029015%40wrigleys.postgresql.org
>
>