Re: Combine pg_walinspect till_end_of_wal functions with others

2023-03-12 Thread Michael Paquier
On Fri, Mar 10, 2023 at 04:45:06PM +0530, Bharath Rupireddy wrote:
> Any comments on the attached v5 patch?

I have reviewed the patch, and found it pretty messy.  The tests
should have been divided into their own patch, I think.  This is
rather straightforward once the six functions have their checks
grouped together.  The result was pretty good, so I have begun by
applying that as 1f282c2.  This also includes most of the refinements
you have proposed for the whitespaces in the tests.  Note that we were
missing a few spots with the bound checks for the six functions, so
now the coverage should be full.

After that comes the rest of the patch, and I have found a couple of
mistakes.

-  pg_get_wal_records_info(start_lsn pg_lsn, end_lsn pg_lsn)
+  pg_get_wal_records_info(start_lsn pg_lsn, end_lsn pg_lsn DEFAULT NULL)
   returns setof record
[...]
-  pg_get_wal_stats(start_lsn pg_lsn, end_lsn pg_lsn, per_record boolean DEFAULT false)
+  pg_get_wal_stats(start_lsn pg_lsn, end_lsn pg_lsn DEFAULT NULL, per_record boolean DEFAULT false)

This part of the documentation is now incorrect.

+-- Make sure checkpoints don't interfere with the test.
+SELECT 'init' FROM pg_create_physical_replication_slot('regress_pg_walinspect_slot', true, false);

Adding a physical slot is better for stability of course, but the test
also has to drop it or installcheck would cause an existing cluster to
have that still around.  The third argument could be true as well, so
that we'd use a temporary slot.

-  If start_lsn
-  or end_lsn are not yet available, the
-  function will raise an error. For example:
+  If a future end_lsn (i.e. the LSN server
+  doesn't know about) is specified, it returns stats till end of WAL. It
+  will raise an error, if the server doesn't have WAL available at given
+  start_lsn or if the
+  start_lsn is in future or is past the
+  end_lsn. For example, usage of the function is
+  as follows:

Hmm.  I would simplify that, and just mention that an error is raised
when the start LSN is not available, without caring about the rest (a
valid end LSN being past the current insert LSN, and an error if start
is past end, the second being obvious).

+ 
+  
+   Note that pg_get_wal_records_info_till_end_of_wal and
+   pg_get_wal_stats_till_end_of_wal functions have been
+   removed in the pg_walinspect version
+   1.1. The same functionality can be achieved with
+   pg_get_wal_records_info and
+   pg_get_wal_stats functions by specifying a future
+   end_lsn. However, till_end_of_wal
+   functions will still work if the extension is installed explicitly with
+   version 1.0.
+  
+ 

Not convinced that this is necessary.

+   GetInputLSNs(fcinfo, &start_lsn, &end_lsn, till_end_of_wal);
+
+   stats_per_record = PG_GETARG_BOOL(2);

This code in GetWalStats() is incorrect.
pg_get_wal_stats_till_end_of_wal() has a stats_per_record, but as
*second* argument, so this would be broken.

+   GetInputLSNs(fcinfo, &start_lsn, &end_lsn, till_end_of_wal); 

Coming from the last point, I think that this interface is confusing,
and actually incorrect.  From what I can see, we should be doing what
~15 has by grabbing the argument values within the main function
calls, and just passing them down to the internal routines
GetWalStats() and GetWALRecordsInfo().

-static bool
-IsFutureLSN(XLogRecPtr lsn, XLogRecPtr *curr_lsn)
+static XLogRecPtr
+GetCurrentLSN(void)

This wrapper is actually a good idea.

At the end, I am finishing with the attached.  ValidateInputLSNs()
ought to be called, IMO, when the caller of the SQL functions can
directly specify an end_lsn.  This means that there is no point to do
this check in the two till_end_* functions.  This has as cost two
extra checks to make sure that the start_lsn is not higher than the
current LSN, but I am fine to live with that.  It seemed rather
natural to me to let ValidateInputLSNs() do a refresh of the end_lsn
if it sees that it is higher than the current LSN.  And if you look
closely, you will see that we only call *once* GetCurrentLSN() for
each function call, so the maths are more precise.  

I have cleaned up the comments of the modules, while on it, as there
was not much value in copy-pasting how a function fails while there is
a centralized validation code path.  The tests for the till_end()
functions have been moved to the test path where we install 1.0.

With all these cleanups done, there is less code than at the
beginning; the reduction comes from the docs, so the actual code does
not change in size:
 7 files changed, 173 insertions(+), 206 deletions(-)
--
Michael
From 74b873e5d61173719ec1961adbc30d711b63f79b Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Mon, 13 Mar 2023 15:51:51 +0900
Subject: [PATCH v6] Rework pg_walinspect functions

---
 doc/src/sgml/pgwalinspect.sgml|  53 +
 .../pg_walinspect/expected/oldextversions.out |  45 +++-
 .../pg_walinspect/expected/pg_walinspect.out  |  42 ++--
 ...

Re: CI and test improvements

2023-03-12 Thread Peter Eisentraut

On 03.02.23 15:26, Justin Pryzby wrote:

rebased, and re-including a patch to show code coverage of changed
files.


This constant flow of patches under one subject doesn't lend itself well 
to the commit fest model of trying to finish things up.  I can't quite 
tell which of these patches are ready and agreed upon, and which ones 
are work in progress or experimental.


> e4534821ef5 cirrus/ccache: use G rather than GB suffix..

This one seems obvious.  I have committed it.


9baf41674ad pg_upgrade: tap test: exercise --link and --clone


This seems like a good idea.


7e09035f588 WIP: ci/meson: allow showing only failed tests ..


I'm not sure I like this one.  I sometimes look up the logs of 
non-failed tests to compare them with failed tests, to get context on 
what could lead to failures.  Maybe we can make this behavior adjustable. 
But I've not been bothered by the current behavior.






Re: meson: Non-feature feature options

2023-03-12 Thread Peter Eisentraut

On 09.03.23 14:54, Daniel Gustafsson wrote:

On 9 Mar 2023, at 14:45, Peter Eisentraut  
wrote:



How about we just hardcode "openssl" here instead?  We could build that array 
dynamically, of course, but maybe we leave that until we actually have a need?


At least for 16 keeping it hardcoded is an entirely safe bet so +1 for leaving
additional complexity for when needed.


I have committed it like this.

I didn't like the other variants, because they would cause the openssl 
line to stick out for purely implementation reasons (e.g., we don't have 
a line "compression: YES (lz4)").  If we get support for another ssl 
library, we can easily reconsider this.






Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher

2023-03-12 Thread Önder Kalacı
Hi Shi Yu,


> > > Reading [1], I think I can follow what you suggest. So, basically,
> > > if the leftmost column is not filtered, we have the following:
> > >
> > >> but the entire index would have to be scanned, so in most cases
> > >> the planner would prefer a sequential table scan over using the
> > >> index.
> > >
> > > So, in our case, we could follow a similar approach. If the
> > > leftmost column of the index is not sent over the wire from the
> > > pub, we can prefer the sequential scan.
> > >
> > > Is my understanding of your suggestion accurate?
> > >
> >
> > Yes. I request an opinion from Shi-San who has reported the problem.
> >
>
> I also agree with this.
> And I think we can mention this in the comments if we do so.
>

Already commented on FindUsableIndexForReplicaIdentityFull() on v44.


Thanks,
Onder KALACI


Re: [Proposal] Add foreign-server health checks infrastructure

2023-03-12 Thread Katsuragi Yuta

On 2023-03-10 18:07, Katsuragi Yuta wrote:

On 2023-03-08 13:40, Hayato Kuroda (Fujitsu) wrote:

Dear Vignesh,

Thank you for reviewing! PSA new version.


Hi,

Thank you for the comments, Vignesh.
Thank you for updating the patch, Kuroda-san. This fix looks fine to 
me.


And also, there seems no other comments from this post [1].
So, I'm planning to update the status to ready for committer
on next Monday.


[1]:
https://www.postgresql.org/message-id/TYAPR01MB5866F8EA3C06E1B43E42E4E4F5B79%40TYAPR01MB5866.jpnprd01.prod.outlook.com

regards,


Hi,

I updated the status of the patch to ready for committer.

regards,

--
Katsuragi Yuta
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Microsecond-based timeouts

2023-03-12 Thread Thomas Munro
Hi,

Over in [1], I thought for a moment that a new function
WaitLatchUs(..., timeout_us, ...) was going to be useful to fix that
bug report, at least in master, until I realised the required Linux
syscall is a little too new (for example RHEL 9 shipped May '22,
Debian 12 is expected to be declared "stable" in a few months).  So
I'm kicking this proof-of-concept over into a new thread to talk about
in the next cycle, in case it turns out to be useful later.

There probably isn't too much call for very high resolution sleeping.
Most time-based sleeping is probably bad, but when it's legitimately
used to spread CPU or I/O out (instead of illegitimate use for
polling-based algorithms), it seems nice to be able to use all the
accuracy your hardware can provide, and yet it is still important to
be able to process other kinds of events, so WaitLatchUs() seems like
a better building block than pg_usleep().

One question is whether it'd be better to use nanoseconds instead,
since the relevant high-resolution primitives use those under the
covers (struct timespec).  On the other hand, microseconds are a good
match for our TimestampTz which is the ultimate source of many of our
timeout decisions.  I suppose we could also consider an interface with
an absolute timeout instead, and then stop thinking about the units so
much.

As mentioned in that other thread, the only systems that currently
seem to be able to sleep less than 1ms through these multiplexing APIs
are: Linux 5.11+ (epoll_pwait2()), FreeBSD (kevent()), macOS (ditto).
Everything else will round up to milliseconds at the kernel interface
(because poll(), epoll_wait() and WaitForMultipleObjects() take those)
or later inside the kernel due to kernel tick rounding.  There might
be ways to do better on Windows with separate timer events, but I
don't know.

[1] 
https://www.postgresql.org/message-id/flat/CAAKRu_b-q0hXCBUCAATh0Z4Zi6UkiC0k2DFgoD3nC-r3SkR3tg%40mail.gmail.com
From e99b7d31831f31888a9433a83d3e64ccbe2cc5c7 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Fri, 10 Mar 2023 15:16:47 +1300
Subject: [PATCH 1/3] Support microsecond based timeouts in WaitEventSet API.

WaitLatch() can only wait for whole numbers of milliseconds, a
limitation inherited ultimately from poll() and similar interfaces.  In
the past it didn't matter much as sleep times were very inaccurate in
practice on common systems, but Linux and others can now be accurate
down to small fractions of a millisecond.  In order to be able to
replace pg_usleep() calls, provide WaitLatchUs().  Just like
pg_usleep(), the actual resolution of the sleeping depends on the OS and
hardware.

For Linux, this requires epoll_pwait2() (Linux 5.11), otherwise we have
to round to milliseconds for epoll_wait().  For macOS and *BSD, kevent()
has always supported nanosecond-based timeouts, but only macOS and
FreeBSD are known to support high resolution timers (other BSDs tested
currently round up to kernel ticks so WaitLatch() already couldn't sleep
for only 1ms).  For Solaris and AIX, we currently use poll() and that
requires rounding up to milliseconds, so no improvement over WaitLatch()
there.  Likewise for Windows (which already couldn't sleep for only 1ms
due to internal rounding to tick size).

Discussion: https://postgr.es/m/CAAKRu_b-q0hXCBUCAATh0Z4Zi6UkiC0k2DFgoD3nC-r3SkR3tg%40mail.gmail.com
---
 configure   |   2 +-
 configure.ac|   1 +
 meson.build |   1 +
 src/backend/storage/ipc/latch.c | 146 
 src/include/pg_config.h.in  |   3 +
 src/include/storage/latch.h |  13 ++-
 src/tools/msvc/Solution.pm  |   1 +
 7 files changed, 128 insertions(+), 39 deletions(-)

diff --git a/configure b/configure
index e35769ea73..914361f91b 100755
--- a/configure
+++ b/configure
@@ -15699,7 +15699,7 @@ fi
 LIBS_including_readline="$LIBS"
 LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'`
 
-for ac_func in backtrace_symbols copyfile getifaddrs getpeerucred inet_pton kqueue mbstowcs_l memset_s posix_fallocate ppoll pthread_is_threaded_np setproctitle setproctitle_fast strchrnul strsignal syncfs sync_file_range uselocale wcstombs_l
+for ac_func in backtrace_symbols copyfile epoll_pwait2 getifaddrs getpeerucred inet_pton kqueue mbstowcs_l memset_s posix_fallocate ppoll pthread_is_threaded_np setproctitle setproctitle_fast strchrnul strsignal syncfs sync_file_range uselocale wcstombs_l
 do :
   as_ac_var=`$as_echo "ac_cv_func_$ac_func" | $as_tr_sh`
 ac_fn_c_check_func "$LINENO" "$ac_func" "$as_ac_var"
diff --git a/configure.ac b/configure.ac
index af23c15cb2..4249f8002c 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1794,6 +1794,7 @@ LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'`
 AC_CHECK_FUNCS(m4_normalize([
 	backtrace_symbols
 	copyfile
+	epoll_pwait2
 	getifaddrs
 	getpeerucred
 	inet_pton
diff --git a/meson.build b/meson.build
index d4384f1bf6..fe9b0470aa 100644
--- a/meson.build
+++ b

Re: Compilation error after redesign of the archive modules

2023-03-12 Thread Michael Paquier
On Fri, Mar 10, 2023 at 05:16:53PM +0900, Michael Paquier wrote:
> On Fri, Mar 10, 2023 at 01:41:07PM +0530, Sravan Kumar wrote:
>> I have attached a patch that fixes the problem. Can you please review
>> if it makes sense to push this patch?
> 
> Indeed, reproduced here.  I'll fix that in a bit..

(Sorry for the late reply, I thought that I sent that on Friday but it
was stuck in my drafts.)

Note that your patch took only care of the ./configure part of the
installation process, but it was missing meson.  Applied a fix for
both as of 6ad5793.
--
Michael


signature.asc
Description: PGP signature


Re: psql \watch 2nd argument: iteration count

2023-03-12 Thread Andrey Borodin
Michael, thanks for reviewing  this!

On Sun, Mar 12, 2023 at 6:17 PM Michael Paquier  wrote:
>
> On Sun, Mar 12, 2023 at 01:05:39PM -0700, Andrey Borodin wrote:
> > In the review above Kyotaro-san suggested that message should contain
> > information on what it expects... So, maybe then
> > pg_log_error("\\watch interval must be non-negative number, but
> > argument is '%s'", opt); ?
> > Or perhaps with articles? pg_log_error("\\watch interval must be a
> > non-negative number, but the argument is '%s'", opt);
>
> -   HELP0("  \\watch [SEC]   execute query every SEC seconds\n");
> +   HELP0("  \\watch [SEC [N]]   execute query every SEC seconds N times\n");
>
> Is that really the interface we'd want to work with in the long-term?
> For one, this does not give the option to specify only an interval
> while relying on the default number of seconds.  This may be fine, but
> it does  not strike me as the best choice.  How about doing something
> more extensible, for example:
> \watch [ (option=value [, option=value] .. ) ] [SEC]
>
> I am not sure that this will be the last option we'll ever add to
> \watch, so I'd rather have us choose a design more flexible than
> what's proposed here, in a way similar to \g or \gx.
I've attached an implementation of this proposed interface (no tests
and help message yet, though, sorry).
I tried it a little bit, and it works for me.
Fire the query 3 times:
SELECT 1;\watch c=3 0
or with a 200ms interval:
SELECT 1;\watch i=.2 c=3
nonsense, but correct:
SELECT 1;\watch i=1e-100 c=1

Actually Nik was asking for the feature. Nik, what do you think?

On Sun, Mar 12, 2023 at 6:26 PM Michael Paquier  wrote:
>
> While on it, I have some comments about 0001.
>
> -sleep = strtod(opt, NULL);
> -if (sleep <= 0)
> -sleep = 1;
> +char *opt_end;
> +sleep = strtod(opt, &opt_end);
> +if (sleep < 0 || *opt_end)
> +{
> +pg_log_error("\\watch interval must be non-negative number, "
> + "but argument is '%s'", opt);
> +free(opt);
> +resetPQExpBuffer(query_buf);
> +psql_scan_reset(scan_state);
> +return PSQL_CMD_ERROR;
> +}
>
> Okay by me to make this behavior a bit better, though it is not
> something I would backpatch as it can influence existing workflows,
> even if they worked in an inappropriate way.
+1

> Anyway, are you sure that this is actually OK?  It seems to me that
> this needs to check for three things:
> - If sleep is a negative value.
> - errno should be non-zero.
I think we can treat errno and negative values equally.
> - *opt_end == opt.
>
> So this needs three different error messages to show the exact error
> to the user?
I've tried this approach, but could not come up with sufficiently
different error messages...

>  Wouldn't it be better to have a couple of regression
> tests, as well?
Added two tests.

Thanks!


Best regards, Andrey Borodin.


v6-0002-Add-iteration-count-argument-to-psql-watch-comman.patch
Description: Binary data


v6-0001-Fix-incorrect-argument-handling-in-psql-watch.patch
Description: Binary data


Re: Sub-millisecond [autovacuum_]vacuum_cost_delay broken

2023-03-12 Thread Thomas Munro
On Sat, Mar 11, 2023 at 11:49 AM Tom Lane  wrote:
> Thomas Munro  writes:
> > I think this is the minimal back-patchable change.  I propose to go
> > ahead and do that, and then to kick the ideas about latch API changes
> > into a new thread for the next commitfest.
>
> OK by me, but then again 4753ef37 wasn't my patch.

I'll wait another day to see if Stephen or anyone else who hasn't hit
Monday yet wants to object.

Here also are those other minor tweaks, for master only.  I see now
that nanosleep() has already been proposed before:

https://www.postgresql.org/message-id/flat/CABQrizfxpBLZT5mZeE0js5oCh1tqEWvcGF3vMRCv5P-RwUY5dQ%40mail.gmail.com
https://www.postgresql.org/message-id/flat/4902.1552349020%40sss.pgh.pa.us

There I see the question of whether it should loop on EINTR to keep
waiting the remaining time.  Generally it seems like a job for
something higher level to deal with interruption policy, and of course
all the race condition and portability problems inherent with signals
are fixed by using latches instead, so I don't think there really is a
good answer to that question -- if you loop, you break our programming
rules by wilfully ignoring eg global barriers, but if you don't loop,
it implies you're relying on the interrupt to cause you to do
something and yet you might have missed it if it was delivered just
before the syscall.  At the time of the earlier thread, maybe it was
more acceptable as it could only delay cancel for that backend, but
now it might even delay arbitrary other backends, and neither answer
to that question can fix that in a race-free way.  Also, back then
latches had a SIGUSR1 handler on common systems, but now they don't,
so (racy unreliable) latch responsiveness has decreased since then.
So I think we should just leave the interface as it is, and build
better things and then eventually retire it.  This general topic is
also currently being discussed at:

https://www.postgresql.org/message-id/flat/20230209205929.GA720594%40nathanxps13

I propose to go ahead and make this small improvement anyway because
it'll surely be a while before we delete the last pg_usleep() call,
and it's good to spring-clean old confusing commentary about signals
and portability.
From 03c7bd1a8df9f3bd3b6b00ad86c8ca734b45f9bd Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Sat, 11 Mar 2023 10:42:20 +1300
Subject: [PATCH 1/3] Fix fractional vacuum_cost_delay.

Commit 4753ef37 changed vacuum_delay_point() to use the WaitLatch() API,
to fix the problem that vacuum could keep running for a very long time
after the postmaster died.

Unfortunately, that broke commit caf626b2's support for fractional
vacuum_cost_delay, which shipped in PostgreSQL 12.  WaitLatch() works in
whole milliseconds.

For now, revert the change from commit 4753ef37, but add an explicit
check for postmaster death.  That's an extra system call on systems
other than Linux and FreeBSD, but that overhead doesn't matter much
considering that we willingly went to sleep and woke up again.  (In
later work, we might add higher resolution timeouts to the latch API so
that we could do this in the standard programming pattern, but that
wouldn't be back-patched.)

Back-patch to 14, where commit 4753ef37 arrived.

Reported-by: Melanie Plageman 
Discussion: https://postgr.es/m/CAAKRu_b-q0hXCBUCAATh0Z4Zi6UkiC0k2DFgoD3nC-r3SkR3tg%40mail.gmail.com
---
 src/backend/commands/vacuum.c | 18 +-
 1 file changed, 13 insertions(+), 5 deletions(-)

diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 2e12baf8eb..c54360a6a0 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -50,6 +50,7 @@
 #include "postmaster/bgworker_internals.h"
 #include "storage/bufmgr.h"
 #include "storage/lmgr.h"
+#include "storage/pmsignal.h"
 #include "storage/proc.h"
 #include "storage/procarray.h"
 #include "utils/acl.h"
@@ -2232,11 +2233,18 @@ vacuum_delay_point(void)
 		if (msec > VacuumCostDelay * 4)
 			msec = VacuumCostDelay * 4;
 
-		(void) WaitLatch(MyLatch,
-		 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
-		 msec,
-		 WAIT_EVENT_VACUUM_DELAY);
-		ResetLatch(MyLatch);
+		pgstat_report_wait_start(WAIT_EVENT_VACUUM_DELAY);
+		pg_usleep(msec * 1000);
+		pgstat_report_wait_end();
+
+		/*
+		 * We don't want to ignore postmaster death during very long vacuums
+		 * with vacuum_cost_delay configured.  We can't use the usual
+		 * WaitLatch() approach here because we want microsecond-based sleep
+		 * durations above.
+		 */
+		if (IsUnderPostmaster && !PostmasterIsAlive())
+			exit(1);
 
 		VacuumCostBalance = 0;
 
-- 
2.39.2

From f0c3b6506e040933d2466ee4061f29a9a946a155 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Mon, 13 Mar 2023 09:50:06 +1300
Subject: [PATCH 2/3] Update obsolete comment about pg_usleep() accuracy.

There are still some systems that use traditional tick or jiffy-based
sleep timing, but many including Linux (see "man 7 time"), FreeBSD and
macOS are limited only

RE: Rework LogicalOutputPluginWriterUpdateProgress

2023-03-12 Thread wangw.f...@fujitsu.com
On Fri, Mar 10, 2023 20:17 PM Osumi, Takamichi/大墨 昂道 
 wrote:
> Hi,
> 
> 
> On Friday, March 10, 2023 6:32 PM Wang, Wei/王 威 
> wrote:
> > Attach the new patch set.
> Thanks for updating the patch ! One review comment on v7-0005.

Thanks for your comment.

> stream_start_cb_wrapper and stream_stop_cb_wrapper don't call the pair of
> threshold check and UpdateProgressAndKeepalive unlike other write wrapper
> functions like below. But, both of them write some data to the output plugin, 
> set
> the flag of did_write and thus it updates the subscriber's last_recv_timestamp
> used for timeout check in LogicalRepApplyLoop. So, it looks adding the pair to
> both functions can be more accurate, in order to reset the counter in
> changes_count on the publisher ?
> 
> @@ -1280,6 +1282,8 @@ stream_start_cb_wrapper(ReorderBuffer *cache,
> ReorderBufferTXN *txn,
> 
> /* Pop the error context stack */
> error_context_stack = errcallback.previous;
> +
> +   /* No progress has been made, so don't call 
> UpdateProgressAndKeepalive */
>  }

Since I think stream_start_cb/stream_stop_cb are different from
change_cb -- they don't represent records in WAL -- the LSNs
corresponding to these two messages are the LSNs of other records. So,
we don't call the function UpdateProgressAndKeepalive here. Also, for
the reasons described in [1].#05, I didn't reset the counter here.

[1] - 
https://www.postgresql.org/message-id/OS3PR01MB6275374EBE7C8CABBE6730099EAF9%40OS3PR01MB6275.jpnprd01.prod.outlook.com

Regards,
Wang wei


RE: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher

2023-03-12 Thread shiy.f...@fujitsu.com
On Fri, Mar 10, 2023 8:17 PM Amit Kapila  wrote:
> 
> On Fri, Mar 10, 2023 at 5:16 PM Önder Kalacı  wrote:
> >
> >>
> >> wip_for_optimize_index_column_match
> >> +static bool
> >> +IndexContainsAnyRemoteColumn(IndexInfo  *indexInfo,
> >> + LogicalRepRelation  *remoterel)
> >> +{
> >> + for (int i = 0; i < indexInfo->ii_NumIndexAttrs; i++)
> >> + {
> >>
> >> Wouldn't it be better to just check if the first column is not part of
> >> the remote column then we can skip that index?
> >
> >
> > Reading [1], I think I can follow what you suggest. So, basically,
> > if the leftmost column is not filtered, we have the following:
> >
> >> but the entire index would have to be scanned, so in most cases
> >> the planner would prefer a sequential table scan over using the
> >> index.
> >
> >
> > So, in our case, we could follow a similar approach. If the leftmost
> > column of the index is not sent over the wire from the pub, we can
> > prefer the sequential scan.
> >
> > Is my understanding of your suggestion accurate?
> >
> 
> Yes. I request an opinion from Shi-San who has reported the problem.
> 

I also agree with this.
And I think we can mention this in the comments if we do so.

Regards,
Shi Yu


Re: [PoC] Improve dead tuple storage for lazy vacuum

2023-03-12 Thread Masahiko Sawada
On Sun, Mar 12, 2023 at 12:54 AM John Naylor
 wrote:
>
> On Fri, Mar 10, 2023 at 9:30 PM Masahiko Sawada  wrote:
> >
> > On Fri, Mar 10, 2023 at 3:42 PM John Naylor
> >  wrote:
>
> > > I'd suggest sharing your todo list in the meanwhile, it'd be good to 
> > > discuss what's worth doing and what is not.
> >
> > Apart from more rounds of reviews and tests, my todo items that need
> > discussion and possibly implementation are:
>
> Quick thoughts on these:
>
> > * The memory measurement in radix trees and the memory limit in
> > tidstores. I've implemented it in v30-0007 through 0009 but we need to
> > review it. This is the highest priority for me.
>
> Agreed.
>
> > * Additional size classes. It's important for an alternative of path
> > compression as well as supporting our decoupling approach. Middle
> > priority.
>
> I'm going to push back a bit and claim this doesn't bring much gain, while it 
> does have a complexity cost. The node1 from Andres's prototype is 32 bytes in 
> size, same as our node3, so it's roughly equivalent as a way to ameliorate 
> the lack of path compression.

But does it mean that our node1 would help reduce the memory further,
since our base node type (i.e. RT_NODE) is smaller than the base node
type of Andres's prototype? The result I shared before showed 1.2GB
vs. 1.9GB.

> I say "roughly" because the loop in node3 is probably noticeably slower. A 
> new size class will by definition still use that loop.

I've evaluated the performance of node1 but the result seems to show
the opposite. I used the test query:

select * from bench_search_random_nodes(100 * 1000 * 1000,
'0xFFFF');

This makes the radix tree that has node1 look like:

max_val = 18446744073709551615
num_keys = 65536
height = 7, n1 = 1536, n3 = 0, n15 = 0, n32 = 0, n61 = 0, n256 = 257

All internal nodes except for the root node are node1. The radix tree
that doesn't have node1 is:

max_val = 18446744073709551615
num_keys = 65536
height = 7, n3 = 1536, n15 = 0, n32 = 0, n125 = 0, n256 = 257

Here is the result:

* w/ node1
 mem_allocated | load_ms | search_ms
---------------+---------+-----------
        573448 |    1848 |      1707
(1 row)

* w/o node1
 mem_allocated | load_ms | search_ms
---------------+---------+-----------
        598024 |    2014 |      1825
(1 row)

Am I missing something?

>
> About a smaller node125-type class: I'm actually not even sure we need
> to have any sub-max node bigger than about 64 (node size 768 bytes).
> I'd just let 65+ go to the max node -- there won't be many of them, at
> least in the synthetic workloads we've seen so far.

Makes sense to me.

>
> > * Node shrinking support. Low priority.
>
> This is an architectural wart that's been neglected since the tid store 
> doesn't perform deletion. We'll need it sometime. If we're not going to make 
> this work, why ship a deletion API at all?
>
> I took a look at this a couple weeks ago, and fixing it wouldn't be that 
> hard. I even had an idea of how to detect when to shrink size class within a 
> node kind, while keeping the header at 5 bytes. I'd be willing to put effort 
> into that, but to have a chance of succeeding, I'm unwilling to make it more 
> difficult by adding more size classes at this point.

I think that the deletion (and locking support) doesn't have use cases
in the core (i.e. tidstore) but is implemented so that external
extensions can use it. There might not be such extensions. Given the
lack of use cases in the core (and the limited time remaining), I
think it's okay even if the implementation of such an API is minimal
and not optimized enough.  For instance, the implementation of
dshash.c is minimalist, and doesn't have resizing. We can improve them
in the future if extensions or other core features want it.

Personally I think we should focus on addressing the feedback we get
and improving the existing use cases in the time remaining. That's why
considering a min-max size class has a higher priority than node
shrinking support in my todo list.

FYI, I've run TPC-C workload over the weekend, and didn't get any
failures of the assertion proving tidstore and the current tid lookup
return the same result.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: psql \watch 2nd argument: iteration count

2023-03-12 Thread Michael Paquier
On Mon, Mar 13, 2023 at 10:17:12AM +0900, Michael Paquier wrote:
> I am not sure that this will be the last option we'll ever add to
> \watch, so I'd rather have us choose a design more flexible than
> what's proposed here, in a way similar to \g or \gx.

While on it, I have some comments about 0001.

-sleep = strtod(opt, NULL);
-if (sleep <= 0)
-sleep = 1;
+char *opt_end;
+sleep = strtod(opt, &opt_end);
+if (sleep < 0 || *opt_end)
+{
+pg_log_error("\\watch interval must be non-negative number, "
+ "but argument is '%s'", opt);
+free(opt);
+resetPQExpBuffer(query_buf);
+psql_scan_reset(scan_state);
+return PSQL_CMD_ERROR;
+}

Okay by me to make this behavior a bit better, though it is not
something I would backpatch as it can influence existing workflows,
even if they worked in an inappropriate way.

Anyway, are you sure that this is actually OK?  It seems to me that
this needs to check for three things:
- If sleep is a negative value.
- errno should be non-zero.
- *opt_end == opt.

So this needs three different error messages to show the exact error
to the user?  Wouldn't it be better to have a couple of regression
tests, as well?
--
Michael


signature.asc
Description: PGP signature


Re: psql \watch 2nd argument: iteration count

2023-03-12 Thread Michael Paquier
On Sun, Mar 12, 2023 at 01:05:39PM -0700, Andrey Borodin wrote:
> In the review above Kyotaro-san suggested that message should contain
> information on what it expects... So, maybe then
> pg_log_error("\\watch interval must be non-negative number, but
> argument is '%s'", opt); ?
> Or perhaps with articles? pg_log_error("\\watch interval must be a
> non-negative number, but the argument is '%s'", opt);

-   HELP0("  \\watch [SEC]   execute query every SEC seconds\n");
+   HELP0("  \\watch [SEC [N]]   execute query every SEC seconds N times\n");

Is that really the interface we'd want to work with in the long-term?
For one, this does not give the option to specify only an interval
while relying on the default number of seconds.  This may be fine, but
it does  not strike me as the best choice.  How about doing something
more extensible, for example:
\watch [ (option=value [, option=value] .. ) ] [SEC]

I am not sure that this will be the last option we'll ever add to
\watch, so I'd rather have us choose a design more flexible than
what's proposed here, in a way similar to \g or \gx.
--
Michael


signature.asc
Description: PGP signature


Re: Requiring recovery.signal or standby.signal when recovering with a backup_label

2023-03-12 Thread Michael Paquier
On Fri, Mar 10, 2023 at 03:59:04PM +0900, Michael Paquier wrote:
> My apologies for the long message, but this deserves some attention,
> IMHO.

Note: A CF entry has been added as of [1], and I have added an item to
the list of live issues on the open items page for 16 [2].

[1]: https://commitfest.postgresql.org/43/4244/
[2]: https://wiki.postgresql.org/wiki/PostgreSQL_16_Open_Items#Live_issues
--
Michael


signature.asc
Description: PGP signature


Re: Progress report of CREATE INDEX for nested partitioned tables

2023-03-12 Thread Tom Lane
I wrote:
> Justin Pryzby  writes:
>> count_leaf_partitions() is also called for sub-partitions, in the case
>> that a matching "partitioned index" already exists, and the progress
>> report needs to be incremented by the number of leaves for which indexes
>> were ATTACHED.

> Can't you increment progress by one at the point where the actual attach
> happens?

Oh, never mind; now I realize that the point is that you didn't ever
iterate over those leaf indexes.

However, consider a thought experiment: assume for whatever reason that
all the actual index builds happen first, then all the cases where you
succeed in attaching a sub-partitioned index happen at the end of the
command.  In that case, the percentage-done indicator would go from
some-number to 100% more or less instantly.

What if we simply do nothing at sub-partitioned indexes?  Or if that's
slightly too radical, just increase the PARTITIONS_DONE counter by 1?
That would look indistinguishable from the case where all the attaches
happen at the end.

regards, tom lane




Re: Progress report of CREATE INDEX for nested partitioned tables

2023-03-12 Thread Tom Lane
Justin Pryzby  writes:
> On Sun, Mar 12, 2023 at 04:14:06PM -0400, Tom Lane wrote:
>> Hm.  Could we get rid of count_leaf_partitions by doing the work in
>> ProcessUtilitySlow?  Or at least passing that OID list forward instead
>> of recomputing it?

> count_leaf_partitions() is called in two places:

> Once to get PROGRESS_CREATEIDX_PARTITIONS_TOTAL.  It'd be easy enough to
> pass an integer total via IndexStmt (but I think we wanted to avoid
> adding anything there, since it's not a part of the statement).

I agree that adding such a field to IndexStmt would be a very bad idea.
However, adding another parameter to DefineIndex doesn't seem like a
problem.  Or could we just call pgstat_progress_update_param directly from
ProcessUtilitySlow, after counting the partitions in the existing loop?

> count_leaf_partitions() is also called for sub-partitions, in the case
> that a matching "partitioned index" already exists, and the progress
> report needs to be incremented by the number of leaves for which indexes
> were ATTACHED.

Can't you increment progress by one at the point where the actual attach
happens?

I also wonder whether leaving non-leaf partitions out of the total
is making things more complicated rather than simpler ...

> We'd need a mapping from OID => npartitions (or to
> compile some data structure of all the partitioned partitions).  I guess
> CreateIndex() could call CreatePartitionDirectory().  But it looks like
> that would be *more* expensive.

The reason I find this annoying is that the non-optional nature of the
progress reporting mechanism was sold on the basis that it would add
only negligible overhead.  Adding extra pass(es) over pg_inherits
breaks that promise.  Maybe it's cheap enough to not matter in the
big scheme of things, but we should not be having to make arguments
like that one.

regards, tom lane




Re: Progress report of CREATE INDEX for nested partitioned tables

2023-03-12 Thread Justin Pryzby
On Sun, Mar 12, 2023 at 04:14:06PM -0400, Tom Lane wrote:
> Justin Pryzby  writes:
> > On Fri, Mar 10, 2023 at 03:36:10PM -0500, Tom Lane wrote:
> >> I took a look through this.  It seems like basically a good solution,
> >> but the count_leaf_partitions() function is bothering me, for two
> >> reasons:
> 
> > ... find_all_inheritors() will also have been called by
> > ProcessUtilitySlow().  Maybe it's sufficient to mention that ?
> 
> Hm.  Could we get rid of count_leaf_partitions by doing the work in
> ProcessUtilitySlow?  Or at least passing that OID list forward instead
> of recomputing it?

count_leaf_partitions() is called in two places:

Once to get PROGRESS_CREATEIDX_PARTITIONS_TOTAL.  It'd be easy enough to
pass an integer total via IndexStmt (but I think we wanted to avoid
adding anything there, since it's not a part of the statement).

count_leaf_partitions() is also called for sub-partitions, in the case
that a matching "partitioned index" already exists, and the progress
report needs to be incremented by the number of leaves for which indexes
were ATTACHED.  We'd need a mapping from OID => npartitions (or to
compile some data structure of all the partitioned partitions).  I guess
CreateIndex() could call CreatePartitionDirectory().  But it looks like
that would be *more* expensive.

-- 
Justin




Re: WIP Patch: pg_dump structured

2023-03-12 Thread Attila Soki



> On 12 Mar 2023, at 21:50, Tom Lane  wrote:
> 
> Attila Soki  writes:
>> This patch adds the structured output format to pg_dump.
>> This format is a plaintext output split up into multiple files and the
>> resulting small files are stored in a directory path based on the dumped 
>> object.
> 
> Won't this fail completely with SQL objects whose names aren't suitable
> to be pathname components?  "A/B" is a perfectly good name so far as
> SQL is concerned.  You could also have problems with collisions on
> case-insensitive filesystems.

The “A/B” case is handled in the _CleanFilename function; the slash and other
problematic characters are replaced.
You are right about case-insensitivity; this is not handled and will fail. I
forgot to handle that, and I am trying to find a way to do so.

> 
>> This format can be restored by feeding its plaintext toc file 
>> (restore-dump.sql)
>> to psql. The output is also suitable for manipulating the files with standard
>> editing tools.
> 
> This seems a little contradictory: if you want to edit the individual
> files, you'd have to also update restore-dump.sql, or else it's pointless.
> It might make more sense to consider this as a write-only dump format
> and not worry about whether it can be restored directly.

The main motivation was to track changes with a VCS at the file (object)
level; editing small files was intended as a second possible use case.
I did not know that a write-only dump format would be acceptable.

> 
>> What do you think of this feature, any chance it will be added to pg_dump 
>> once
>> the patch is ready?
> 
> I'm not clear on how big the use-case is.  It's not really obvious to
> me that this'd have any benefit over the existing plain-text dump
> capability.  You can edit those files too, at least till the schema
> gets too big for your editor.  (But if you've got many many thousand
> SQL objects, a file-per-SQL-object directory will also be no fun to
> deal with.)


I use something like this (a previous version) to track several thousand
objects, but I'm not sure whether it would have a wide user base.
Hence this WIP, to see if there is interest in the feature.
I think the advantage of having many small files is that it is recognizable
which file (object) is involved in a commit and that the SQL functions and
tables get a change history.

Thank you for your feedback.

Regards,
Attila Soki



Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher

2023-03-12 Thread Peter Smith
On Sat, Mar 11, 2023 at 6:05 PM Amit Kapila  wrote:
>
> On Fri, Mar 10, 2023 at 7:58 PM Önder Kalacı  wrote:
> >
> >
> > I think one option could be to drop some cases altogether, but not sure 
> > we'd want that.
> >
> > As a semi-related question: Are you aware of any setting that'd make 
> > pg_stat_all_indexes
> > reflect the changes sooner? It is hard to debug what is the bottleneck in 
> > the tests, but
> > I have a suspicion that there might be several poll_query_until() calls on
> > pg_stat_all_indexes, which might be the reason?
> >
>
> Yeah, I also think poll_query_until() calls on pg_stat_all_indexes is
> the main reason for these tests taking more time. When I commented
> those polls, it drastically reduces the test time. On looking at
> pgstat_report_stat(), it seems we don't report stats sooner than 1s
> and as most of this patch's test relies on stats, it leads to taking
> more time. I don't have a better idea to verify this patch without
> checking whether the index scan is really used by referring to
> pg_stat_all_indexes. I think trying to reduce the poll calls may help
> in reducing the test timings further. Some ideas on those lines are as
> follows:

If the reason for the stats polling was only to know if some index is
chosen or not, I was wondering if you can just convey the same
information to the TAP test via some conveniently placed (DEBUG?)
logging.

This way the TAP test can do a 'wait_for_log' instead of the
'poll_query_until'. It will probably generate lots of extra logging,
but it still might be much faster than the current code because it
won't incur the 1s overhead of the stats.

--
Kind Regards,
Peter Smith.
Fujitsu Australia.




Re: WIP Patch: pg_dump structured

2023-03-12 Thread Tom Lane
Attila Soki  writes:
> This patch adds the structured output format to pg_dump.
> This format is a plaintext output split up into multiple files and the
> resulting small files are stored in a directory path based on the dumped 
> object.

Won't this fail completely with SQL objects whose names aren't suitable
to be pathname components?  "A/B" is a perfectly good name so far as
SQL is concerned.  You could also have problems with collisions on
case-insensitive filesystems.

> This format can be restored by feeding its plaintext toc file 
> (restore-dump.sql)
> to psql. The output is also suitable for manipulating the files with standard
> editing tools.

This seems a little contradictory: if you want to edit the individual
files, you'd have to also update restore-dump.sql, or else it's pointless.
It might make more sense to consider this as a write-only dump format
and not worry about whether it can be restored directly.

> What do you think of this feature, any chance it will be added to pg_dump once
> the patch is ready?

I'm not clear on how big the use-case is.  It's not really obvious to
me that this'd have any benefit over the existing plain-text dump
capability.  You can edit those files too, at least till the schema
gets too big for your editor.  (But if you've got many many thousand
SQL objects, a file-per-SQL-object directory will also be no fun to
deal with.)

regards, tom lane




WIP Patch: pg_dump structured

2023-03-12 Thread Attila Soki
Hi all,

I was looking for a way to track actual schema changes after database migrations
in a VCS. Preferably, the schema definition should come from a trusted source
like pg_dump and should consist of small files.
This patch was born out of that need.

This patch adds the structured output format to pg_dump.
This format is a plaintext output split up into multiple files and the
resulting small files are stored in a directory path based on the dumped object.
This format can be restored by feeding its plaintext toc file (restore-dump.sql)
to psql. The output is also suitable for manipulating the files with standard
editing tools.

This patch is a WIP (V1). The patch is against master and it compiles
successfully on macOS 13.2.1 aarch64 and on Debian 11 arm64.
To test, execute pg_dump --format=structured --file=/path/to/outputdir dbname

What do you think of this feature, any chance it will be added to pg_dump once
the patch is ready?
Is the chosen name "structured" appropriate?

Thanks for any feedback.

--
Attila Soki



v1-wip-pg_dump_structured.patch
Description: Binary data


Re: Progress report of CREATE INDEX for nested partitioned tables

2023-03-12 Thread Tom Lane
Justin Pryzby  writes:
> On Fri, Mar 10, 2023 at 03:36:10PM -0500, Tom Lane wrote:
>> I took a look through this.  It seems like basically a good solution,
>> but the count_leaf_partitions() function is bothering me, for two
>> reasons:

> ... find_all_inheritors() will also have been called by
> ProcessUtilitySlow().  Maybe it's sufficient to mention that ?

Hm.  Could we get rid of count_leaf_partitions by doing the work in
ProcessUtilitySlow?  Or at least passing that OID list forward instead
of recomputing it?

regards, tom lane




Re: Progress report of CREATE INDEX for nested partitioned tables

2023-03-12 Thread Justin Pryzby
On Fri, Mar 10, 2023 at 03:36:10PM -0500, Tom Lane wrote:
> Justin Pryzby  writes:
> > Update to address a compiler warning in the supplementary patches adding
> > assertions.
> 
> I took a look through this.  It seems like basically a good solution,
> but the count_leaf_partitions() function is bothering me, for two
> reasons:
> 
> 1. It seems like a pretty expensive thing to do.  Don't we have the
> info at hand somewhere already?

I don't know where that would be.  We need the list of both direct *and*
indirect partitions.  See:
https://www.postgresql.org/message-id/5073D187-4200-4A2D-BAC0-91C657E3C22E%40gmail.com

If it would help to avoid the concern, then I might consider proposing
not to call get_rel_relkind() ...

> 2. Is it really safe to do find_all_inheritors with NoLock?  If so,
> a comment explaining why would be good.

In both cases (both for the parent and for case of a partitioned child
with pre-existing indexes being ATTACHed), the table itself is already
locked by DefineIndex():

lockmode = concurrent ? ShareUpdateExclusiveLock : ShareLock;
rel = table_open(relationId, lockmode);

and
childrel = table_open(childRelid, lockmode);
...
table_close(childrel, NoLock);

And, find_all_inheritors() will also have been called by
ProcessUtilitySlow().  Maybe it's sufficient to mention that ?

-- 
Justin




Re: psql \watch 2nd argument: iteration count

2023-03-12 Thread Andrey Borodin
On Thu, Mar 9, 2023 at 11:25 AM Nathan Bossart  wrote:
>
> +   pg_log_error("Watch period must be 
> non-negative number, but argument is '%s'", opt);
>
> After looking around at the other error messages in this file, I think we
> should make this more concise.  Maybe something like
>
> pg_log_error("\\watch: invalid delay interval: %s", opt);
In the review above Kyotaro-san suggested that message should contain
information on what it expects... So, maybe then
pg_log_error("\\watch interval must be non-negative number, but
argument is '%s'", opt); ?
Or perhaps with articles? pg_log_error("\\watch interval must be a
non-negative number, but the argument is '%s'", opt);

>
> +   free(opt);
> +   resetPQExpBuffer(query_buf);
> +   return PSQL_CMD_ERROR;
>
> Is this missing psql_scan_reset(scan_state)?
Yes, fixed.

Best regards, Andrey Borodin.


v5-0002-Add-iteration-count-argument-to-psql-watch-comman.patch
Description: Binary data


v5-0001-Fix-incorrect-argument-handling-in-psql-watch.patch
Description: Binary data


Re: pg_dump versus hash partitioning

2023-03-12 Thread Tom Lane
Justin Pryzby  writes:
> On Sun, Mar 12, 2023 at 03:46:52PM -0400, Tom Lane wrote:
>> What I propose we do about that is further tweak things so that
>> load-via-partition-root forces dumping via COPY.  AFAIK the only
>> compelling use-case for dump-as-INSERTs is in transferring data
>> to a non-Postgres database, which is a context in which dumping
>> partitioned tables as such is pretty hopeless anyway.  (I wonder if
>> we should have some way to dump all the contents of a partitioned
>> table as if it were unpartitioned, to support such migration.)

> I think that's what this other thread is about.
> https://commitfest.postgresql.org/42/4130/
> pg_dump all child tables with the root table

As far as I understood (didn't actually read the latest patch) that
one is just about easily selecting all the partitions of a partitioned
table when doing a selective dump.  It's not helping you produce a
non-Postgres-specific dump.

Although I guess by combining load-via-partition-root, data-only mode,
and dump-as-inserts you could produce a clean collection of
non-partition-dependent INSERT commands ... so maybe we'd better not
force dump-as-inserts off.  I'm starting to like the te->defn hack
more.

regards, tom lane




Re: pg_dump versus hash partitioning

2023-03-12 Thread Justin Pryzby
On Sun, Mar 12, 2023 at 03:46:52PM -0400, Tom Lane wrote:
> What I propose we do about that is further tweak things so that
> load-via-partition-root forces dumping via COPY.  AFAIK the only
> compelling use-case for dump-as-INSERTs is in transferring data
> to a non-Postgres database, which is a context in which dumping
> partitioned tables as such is pretty hopeless anyway.  (I wonder if
> we should have some way to dump all the contents of a partitioned
> table as if it were unpartitioned, to support such migration.)

I think that's what this other thread is about.

https://commitfest.postgresql.org/42/4130/
pg_dump all child tables with the root table




Re: pg_dump versus hash partitioning

2023-03-12 Thread Tom Lane
Julien Rouhaud  writes:
> The BEGIN + TRUNCATE is only there to avoid generating WAL records just in 
> case
> the wal_level is minimal.  I don't remember if that optimization still exists,
> but if yes we could avoid doing that if the server's wal_level is replica or
> higher?  That's not perfect but it would help in many cases.

After thinking about this, it seems like a better idea is to skip the
TRUNCATE if we're doing load-via-partition-root.  In that case it's
clearly a dangerous thing to do regardless of deadlock worries, since
it risks discarding previously-loaded data that came over from another
partition.  (IOW this seems like an independent, pre-existing bug in
load-via-partition-root mode.)

The trick is to detect in pg_restore whether pg_dump chose to do
load-via-partition-root.  If we have a COPY statement we can fairly
easily examine it to see if the target table is what we expect or
something else.  However, if the table was dumped as INSERT statements
it'd be far messier; the INSERTs are not readily accessible from the
code that needs to make the decision.

What I propose we do about that is further tweak things so that
load-via-partition-root forces dumping via COPY.  AFAIK the only
compelling use-case for dump-as-INSERTs is in transferring data
to a non-Postgres database, which is a context in which dumping
partitioned tables as such is pretty hopeless anyway.  (I wonder if
we should have some way to dump all the contents of a partitioned
table as if it were unpartitioned, to support such migration.)

An alternative could be to extend the archive TOC format to record
directly whether a given TABLE DATA object loads data via partition
root or normally.  Potentially we could do that without an archive
format break by defining te->defn for TABLE DATA to be empty for
normal dumps (as it is now) or something like "-- load via partition root"
for the load-via-partition-root case.  However, depending on examination
of the COPY command would already work for the vast majority of existing
archive files, so I feel like it might be the preferable choice.

Thoughts?

regards, tom lane




Re: Add LZ4 compression in pg_dump

2023-03-12 Thread Tomas Vondra



On 3/12/23 11:07, Peter Eisentraut wrote:
> On 11.03.23 07:00, Alexander Lakhin wrote:
>> Hello,
>> 23.02.2023 23:24, Tomas Vondra wrote:
>>> On 2/23/23 16:26, Tomas Vondra wrote:
 Thanks for v30 with the updated commit messages. I've pushed 0001 after
 fixing a comment typo and removing (I think) an unnecessary change
 in an
 error message.

 I'll give the buildfarm a bit of time before pushing 0002 and 0003.

>>> I've now pushed 0002 and 0003, after minor tweaks (a couple typos etc.),
>>> and marked the CF entry as committed. Thanks for the patch!
>>>
>>> I wonder how difficult would it be to add the zstd compression, so that
>>> we don't have the annoying "unsupported" cases.
>>
>> With the patch 0003 committed, a single warning -Wtype-limits appeared
>> in the
>> master branch:
>> $ CPPFLAGS="-Og -Wtype-limits" ./configure --with-lz4 -q && make -s -j8
>> compress_lz4.c: In function ‘LZ4File_gets’:
>> compress_lz4.c:492:19: warning: comparison of unsigned expression in
>> ‘< 0’ is always false [-Wtype-limits]
>>    492 | if (dsize < 0)
>>    |
>> (I wonder, is it accidental that there are no other places that trigger
>> the warning, or did some buildfarm animals have this check enabled before?)
> 
> I think there is an underlying problem in this code that it dances back
> and forth between size_t and int in an unprincipled way.
> 
> In the code that triggers the warning, dsize is size_t.  dsize is the
> return from LZ4File_read_internal(), which is declared to return int.
> The variable that LZ4File_read_internal() returns in the success case is
> size_t, but in case of an error it returns -1.  (So the code that is
> warning is meaning to catch this error case, but it won't ever work.)
> Further below LZ4File_read_internal() calls LZ4File_read_overflow(),
> which is declared to return int, but in some cases it returns
> fs->overflowlen, which is size_t.
> 

I agree. I just got home so I looked at this only very briefly, but I
think it's clearly wrong to assign the LZ4File_read_internal() result to
a size_t variable (and it seems to me LZ4File_gets does the same mistake
with LZ4File_read_internal() result).

I'll get this fixed early next week, I'm too tired to do that now
without likely causing further issues.

> This should be cleaned up.
> 
> AFAICT, the upstream API in lz4.h uses int for size values, but
> lz4frame.h uses size_t, so I don't know what the correct approach is.

Yeah, that's a good point. I think Justin is right we should be using
the LZ4F stuff, so ultimately we'll probably switch to size_t. But IMO
it's definitely better to correct the current code first, and only then
switch to LZ4F (from one correct state to another).


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [Proposal] Allow pg_dump to include all child tables with the root table

2023-03-12 Thread Stéphane Tachoires
Hi Gilles,

On Ubuntu 22.04.2, with all debs updated, I have an error on a test.
I'll try to find where and why, but I think you should know first.

1/1 postgresql:pg_dump / pg_dump/002_pg_dump  ERROR  24.40s  exit status 1
――――――――――――――――――――――――――――――――――――――――
stderr:
#   Failed test 'only_dump_measurement: should dump CREATE TABLE test_compression'
#   at /media/hddisk/stephane/postgresql/src/postgresql/src/bin/pg_dump/t/002_pg_dump.pl line 4729.
# Review only_dump_measurement results in
/media/hddisk/stephane/postgresql/build/testrun/pg_dump/002_pg_dump/data/tmp_test_iJxJ
# Looks like you failed 1 test of 10264.

(test program exited with status code 1)



Summary of Failures:

1/1 postgresql:pg_dump / pg_dump/002_pg_dump  ERROR  24.40s  exit status 1


Ok: 0
Expected Fail:  0
Fail:   1
Unexpected Pass:0
Skipped:0
Timeout:0

Attached: only_dump_measurement.sql from
/media/hddisk/stephane/postgresql/build/testrun/pg_dump/002_pg_dump/data/tmp_test_iJxJ
If you need more information, please ask...

Stéphane.


On Sun, 12 Mar 2023 at 10:04, Gilles Darold  wrote:

> On 11/03/2023 at 19:51, Gilles Darold wrote:
> > On 04/03/2023 at 20:18, Tom Lane wrote:
> >> As noted, "childs" is bad English and "partitions" is flat out wrong
> >> (unless you change it to recurse only to partitions, which doesn't
> >> seem like a better definition).  We could go with
> >> --[exclude-]table-and-children, or maybe
> >> --[exclude-]table-and-child-tables, but those are getting into
> >> carpal-tunnel-syndrome-inducing territory 🙁.  I lack a better
> >> naming suggestion offhand.
> >
> >
> > In attachment is version 3 of the patch, it includes the use of
> > options suggested by Stephane and Tom:
> >
> > --table-and-children,
> >
> > --exclude-table-and-children
> >
> > --exclude-table-data-and-children.
> >
> >  Documentation have been updated too.
> >
> >
> > Thanks
> >
>
> New version v4 of the patch attached with a typo in documentation fixed.
>
> --
> Gilles Darold.
>


-- 
"Where did the swallows perch before the invention of the telephone?"
  -- Grégoire Lacroix


only_dump_measurement.sql
Description: application/sql


Re: Improve WALRead() to suck data directly from WAL buffers when possible

2023-03-12 Thread Bharath Rupireddy
On Sun, Mar 12, 2023 at 12:52 AM Nitin Jadhav
 wrote:
>
> I went through the v8 patch.

Thanks for looking at it. Please post the responses in-line, not above
the entire previous message for better readability.

> Following are my thoughts to improve the
> WAL buffer hit ratio.

Note that the motive of this patch is to read WAL from WAL buffers
*when possible* without affecting concurrent WAL writers.

> Currently the no-longer-needed WAL data present in WAL buffers gets
> cleared in XLogBackgroundFlush() which is called based on the
> wal_writer_delay config setting. Once the data is flushed to the disk,
> it is treated as no-longer-needed and it will be cleared as soon as
> possible based on some config settings.

Opportunistically pre-initializing as many WAL buffer pages as possible
is done for a purpose; there's an illuminating comment [1] explaining it,
so removing it fully is a no-go IMO. For instance, it makes WAL buffer
pages available for concurrent writers, so there will be less work for
writers in GetXLogBuffer. I'm sure removing the opportunistic
pre-initialization of the WAL buffer pages will hurt performance in a
highly concurrent write workload.

/*
 * Great, done. To take some work off the critical path, try to initialize
 * as many of the no-longer-needed WAL buffers for future use as we can.
 */
AdvanceXLInsertBuffer(InvalidXLogRecPtr, insertTLI, true);

> Second, In WALRead(), we try to read the data from disk whenever we
> don't find the data from WAL buffers. We don't store this data in the
> WAL buffer. We just read the data, use it and leave it. If we store
> this data to the WAL buffer, then we may avoid a few disk reads.

Again this is going to hurt concurrent writers. Note that wal_buffers
aren't used as full cache per-se, there'll be multiple writers to it,
*when possible* readers will try to read from it without hurting
writers.

> The patch attached takes care of this.

Please post the new proposal as a text file (not a .patch file) or as
plain text in the email itself if the change is small, or attach all
the patches if the patch is over-and-above the proposed patches.
Attaching a single over-and-above patch will make CFBot unhappy and
will force authors to repost the original patches. Typically, we
follow this practice. Having said that, I have some review comments to
fix on v8-0001, so I'll be sending out the v9 patch-set soon.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Implement IF NOT EXISTS for CREATE PUBLICATION AND CREATE SUBSCRIPTION

2023-03-12 Thread Tom Lane
vignesh C  writes:
> Currently we don't support "IF NOT EXISTS" for Create publication and
> Create subscription, I felt it would be useful to add this "IF NOT
> EXISTS" which will create publication/subscription only if the object
> does not exist.
> Attached patch for handling the same.
> Thoughts?

I generally dislike IF NOT EXISTS options, because they are so
semantically squishy: when the command is over, you cannot make any
assumptions whatsoever about the properties of the object, beyond
the bare fact that it exists.  I do not think we should implement
such options without a pretty compelling argument that there is a
use-case for them.  "I felt it would be useful" doesn't meet the
bar IMO.

CREATE OR REPLACE doesn't have this semantic problem, but I'm not
sure whether it's a useful approach for these types of objects.

regards, tom lane




Re: POC: Lock updated tuples in tuple_update() and tuple_delete()

2023-03-12 Thread Alexander Korotkov
On Fri, Mar 10, 2023 at 8:17 PM Chris Travers  wrote:
> "Right, the improvement this patch gives to the heap is not the full 
> motivation. Another motivation is the improvement it gives to TableAM API. 
> Our current API implies that the effort on locating the tuple by tid is 
> small. This is more or less true for the heap, where we just need to pin and 
> lock the buffer. But imagine other TableAM implementations, where locating a 
> tuple is more expensive."
>
> Yeah. Our TableAM API is a very nice start to getting pluggable storage, but 
> we still have a long ways to go to have an ability to really provide a wide 
> variety of pluggable storage engines.
>
> In particular, the following approaches are likely to have much more 
> expensive tid lookups:
>  - columnar storage (may require a lot of random IO to reconstruct a tuple)
>  - index oriented storage (tid no longer physically locatable in the file via 
> seek)
>  - compressed cold storage like pg_ctyogen (again seek may be problematic).
>
> To my mind I think the performance benefits are a nice side benefit, but the 
> main interest I have on this is regarding improvements in the TableAM 
> capabilities.  I cannot see how to do this without a lot more infrastructure.

Chris, thank you for your feedback.

The revised patch set is attached.  Some comments are improved.  Also,
we implicitly skip the new facility for the MERGE case.  As I get Dean
Rasheed is going to revise the locking for MERGE soon [1].

Pavel, could you please re-run your test case on the revised patch?

1. 
https://www.postgresql.org/message-id/caezatcu9e9ccbi70ynbccf7xvz+zrjid0_6eeq2zeza1p+7...@mail.gmail.com

--
Regards,
Alexander Korotkov
From a06e0864fd45b3fcac961a45e8c66f4dad6aac47 Mon Sep 17 00:00:00 2001
From: Alexander Korotkov 
Date: Wed, 11 Jan 2023 15:00:49 +0300
Subject: [PATCH 1/2] Avoid extra table_tuple_fetch_row_version() in
 ExecUpdate()/ExecDelete()

When we lock a tuple using table_tuple_lock(), we fetch the locked tuple
into the slot at the same time.  In this case we can skip the extra
table_tuple_fetch_row_version(), since we've already fetched the 'old' tuple
and nobody can change it concurrently while it's locked.

Discussion: https://postgr.es/m/CAPpHfdua-YFw3XTprfutzGp28xXLigFtzNbuFY8yPhqeq6X5kg%40mail.gmail.com
Reviewed-by: Aleksander Alekseev, Pavel Borisov, Vignesh C, Mason Sharp
---
 src/backend/executor/nodeModifyTable.c | 46 ++
 1 file changed, 33 insertions(+), 13 deletions(-)

diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 6f44d71f16b..839e8fe0d04 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -1589,6 +1589,21 @@ ldelete:
 	{
 		case TM_Ok:
 			Assert(context->tmfd.traversed);
+
+			/*
+			 * Save locked tuple for further processing of
+			 * RETURNING clause.
+			 */
+			if (processReturning &&
+resultRelInfo->ri_projectReturning &&
+!resultRelInfo->ri_FdwRoutine)
+			{
+TupleTableSlot *returningSlot;
+returningSlot = ExecGetReturningSlot(estate, resultRelInfo);
+ExecCopySlot(returningSlot, inputslot);
+ExecMaterializeSlot(returningSlot);
+			}
+
 			epqslot = EvalPlanQual(context->epqstate,
    resultRelationDesc,
    resultRelInfo->ri_RangeTableIndex,
@@ -1703,12 +1718,16 @@ ldelete:
 		}
 		else
 		{
+			/*
+			 * Tuple can be already fetched to the returning slot in case we've
+			 * previously locked it.  Fetch the tuple only if the slot is empty.
+			 */
 			slot = ExecGetReturningSlot(estate, resultRelInfo);
 			if (oldtuple != NULL)
 			{
 ExecForceStoreHeapTuple(oldtuple, slot, false);
 			}
-			else
+			else if (TupIsNull(slot))
 			{
 if (!table_tuple_fetch_row_version(resultRelationDesc, tupleid,
    SnapshotAny, slot))
@@ -2409,6 +2428,19 @@ redo_act:
 		case TM_Ok:
 			Assert(context->tmfd.traversed);
 
+			/* Make sure ri_oldTupleSlot is initialized. */
+			if (unlikely(!resultRelInfo->ri_projectNewInfoValid))
+				ExecInitUpdateProjection(context->mtstate,
+										 resultRelInfo);
+
+			/*
+			 * Save the locked tuple for further calculation of
+			 * the new tuple.
+			 */
+			oldSlot = resultRelInfo->ri_oldTupleSlot;
+			ExecCopySlot(oldSlot, inputslot);
+			ExecMaterializeSlot(oldSlot);
+
 			epqslot = EvalPlanQual(context->epqstate,
    resultRelationDesc,
    resultRelInfo->ri_RangeTableIndex,
@@ -2417,18 +2449,6 @@ redo_act:
 /* Tuple not passing quals anymore, exiting... */
 return NULL;
 
-			/* Make sure ri_oldTupleSlot is initialized. */
-			if (unlikely(!resultRelInfo->ri_projectNewInfoValid))
-				ExecInitUpdateProjection(context->mtstate,
-										 resultRelInfo);
-
-			/* Fetch the most recent version of old tuple. */
-			oldSlot =

Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher

2023-03-12 Thread Önder Kalacı
Hi Amit, all


> >> I think we can add such a test (which relies on existing buggy
> >> behavior) later after fixing the existing bug. For now, it would be
> >> better to remove that test and add it after we fix dropped columns
> >> issue in HEAD.
> >
> >
> > Alright, when I push the next version (hopefully tomorrow), I'll follow
> this suggestion.
> >
>
> Okay, thanks. See, if you can also include your changes in the patch
> wip_for_optimize_index_column_match (after discussed modification).
> Few other minor comments:
>

Sure, done. Please check RemoteRelContainsLeftMostColumnOnIdx() function.

Note that we already have a test for that in SOME NULL VALUES AND MISSING
COLUMN. Previously we'd check whether test_replica_id_full_idy is used; now we
don't, because it is not used anymore :) I initially used poll_query_until
with idx_scan=0, but that also seems confusing to read in the test, and it
looks like it could be prone to race conditions, as poll_query_until with
idx_scan=0 does not guarantee anything.


>
> 1.
> +   are enforced for primary keys.  Internally, we follow a similar
> approach for
> +   supporting index scans within logical replication scope.  If there are
> no
>
> I think we can remove the above line: "Internally, we follow a similar
> approach for supporting index scans within logical replication scope."
> This didn't seem useful for users.
>
>
removed


> 2.
> diff --git a/src/backend/executor/execReplication.c
> b/src/backend/executor/execReplication.c
> index bc6409f695..646e608eb7 100644
> --- a/src/backend/executor/execReplication.c
> +++ b/src/backend/executor/execReplication.c
> @@ -83,11 +83,8 @@ build_replindex_scan_key(ScanKey skey, Relation
> rel, Relation idxrel,
> if (!AttributeNumberIsValid(table_attno))
> {
> /*
> -* XXX: For a non-primary/unique index with an
> additional
> -* expression, we do not have to continue at
> this point. However,
> -* the below code assumes the index scan is
> only done for simple
> -* column references. If we can relax the
> assumption in the below
> -* code-block, we can also remove the continue.
> +* XXX: Currently, we don't support
> expressions in the scan key,
> +* see code below.
>  */
>
>
> I have tried to simplify the above comment. See, if that makes sense to
> you.
>

Makes sense


>
> 3.
> /*
> + * We only need to allocate once. This is allocated within per
> + * tuple context -- ApplyMessageContext -- hence no need to
> + * explicitly pfree().
> + */
>
> We normally don't write why we don't need to explicitly pfree. It is
> good during the review but not sure if it is a good idea to keep it in
> the final code.
>
>
Sounds good, applied

4. I have modified the proposed commit message as follows, see if that
> makes sense to you, and let me know if I missed anything especially
> the review/author credit.
>
> Allow the use of indexes other than PK and REPLICA IDENTITY on the
> subscriber.
>
> Using REPLICA IDENTITY FULL on the publisher can lead to a full table
> scan per tuple change on the subscription when REPLICA IDENTITY or PK
> index is not available. This makes REPLICA IDENTITY FULL impractical
> to use apart from some small number of use cases.
>
> This patch allows using indexes other than PRIMARY KEY or REPLICA
> IDENTITY on the subscriber during apply of update/delete. The index
> that can be used must be a btree index, not a partial index, and it
> must have at least one column reference (i.e. cannot consist of only
> expressions). We can uplift these restrictions in the future. There is
> no smart mechanism to pick the index. If there is more than one index
> that
> satisfies these requirements, we just pick the first one. We discussed
> using some of the optimizer's low-level APIs for this but ruled it out
> as that can be a maintenance burden in the long run.
>
> This patch improves the performance in the vast majority of cases and
> the improvement is proportional to the amount of data in the table.
> However, there could be some regression in a small number of cases
> where the indexes have a lot of duplicate and dead rows. It was
> discussed that those are mostly impractical cases but we can provide a
> table or subscription level option to disable this feature if
> required.
>
> Author: Onder Kalaci
> Reviewed-by: Peter Smith, Shi yu, Hou Zhijie, Vignesh C, Kuroda
> Hayato, Amit Kapila
> Discussion:
> https://postgr.es/m/CACawEhVLqmAAyPXdHEPv1ssU2c=dqoniigz7g73hfys7+ng...@mail.gmail.com
>
>
I also see 2 mails/reviews from Wang wei, but I'm not sure what qualifies
as "reviewer" for this group. Should we add that name as well? I think you
can guide us on this.

Other than that, I only fixed one extra newline between 'that' and
'satisfies'. It looks pretty good!

Thanks

Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher

2023-03-12 Thread Önder Kalacı
Hi Amit, all

>
> 1.
> +# Testcase start: SUBSCRIPTION USES INDEX WITH PUB/SUB DIFFERENT DATA VIA
> +# A UNIQUE INDEX THAT IS NOT PRIMARY KEY OR REPLICA IDENTITY
>
> No need to use Delete test separate for this.
>

Yeah, there is really no difference between update/delete for this patch,
so it makes sense. I initially added it for completeness of coverage,
but as it adds performance overhead to the tests, I agree that we can
drop some of those.


>
> 2.
> +$node_publisher->safe_psql('postgres',
> + "UPDATE test_replica_id_full SET x = x + 1 WHERE x = 1;");
> +$node_publisher->safe_psql('postgres',
> + "UPDATE test_replica_id_full SET x = x + 1 WHERE x = 3;");
> +
> +# check if the index is used even when the index has NULL values
> +$node_subscriber->poll_query_until(
> + 'postgres', q{select idx_scan=2 from pg_stat_all_indexes where
> indexrelname = 'test_replica_id_full_idx';}
> +) or die "Timed out while waiting for check subscriber
> tap_sub_rep_full updates test_replica_id_full table";
>
> Here, I think only one update is sufficient.
>

Done. I guess you requested this change so that we would wait
for idx_scan=1 rather than idx_scan=2, which could help.


> 3.
> +$node_subscriber->safe_psql('postgres',
> + "CREATE INDEX people_last_names ON people(lastname)");
> +
> +# wait until the index is created
> +$node_subscriber->poll_query_until(
> + 'postgres', q{select count(*)=1 from pg_stat_all_indexes where
> indexrelname = 'people_last_names';}
> +) or die "Timed out while waiting for creating index people_last_names";
>
> I don't think we need this poll.
>

True, not sure why I had this; none of the other tests does this anyway.


> 4.
> +# update 2 rows
> +$node_publisher->safe_psql('postgres',
> + "UPDATE people SET firstname = 'no-name' WHERE firstname =
> 'first_name_1';");
> +$node_publisher->safe_psql('postgres',
> + "UPDATE people SET firstname = 'no-name' WHERE firstname =
> 'first_name_3' AND lastname = 'last_name_3';");
> +
> +# wait until the index is used on the subscriber
> +$node_subscriber->poll_query_until(
> + 'postgres', q{select idx_scan=2 from pg_stat_all_indexes where
> indexrelname = 'people_names';}
> +) or die "Timed out while waiting for check subscriber
> tap_sub_rep_full updates two rows via index scan with index on
> expressions and columns";
> +
> +$node_publisher->safe_psql('postgres',
> + "DELETE FROM people WHERE firstname = 'no-name';");
> +
> +# wait until the index is used on the subscriber
> +$node_subscriber->poll_query_until(
> + 'postgres', q{select idx_scan=4 from pg_stat_all_indexes where
> indexrelname = 'people_names';}
> +) or die "Timed out while waiting for check subscriber
> tap_sub_rep_full deletes two rows via index scan with index on
> expressions and columns";
>
> I think having one update or delete should be sufficient.
>

So, I dropped the 2nd update, but kept 1 update and 1 delete.
The latter deletes the tuple updated by the former. Seems like
an interesting test to keep.

Still, I dropped one of the extra poll_query_until calls, which is probably
good enough for this one? Let me know if you think otherwise.


>
> 5.
> +# update rows, moving them to other partitions
> +$node_publisher->safe_psql('postgres',
> + "UPDATE users_table_part SET value_1 = 0 WHERE user_id = 4;");
> +
> +# wait until the index is used on the subscriber
> +$node_subscriber->poll_query_until(
> + 'postgres', q{select sum(idx_scan)=1 from pg_stat_all_indexes where
> indexrelname ilike 'users_table_part_%';}
> +) or die "Timed out while waiting for updates on partitioned table with
> index";
> +
> +# delete rows from different partitions
> +$node_publisher->safe_psql('postgres',
> + "DELETE FROM users_table_part WHERE user_id = 1 and value_1 = 1;");
> +$node_publisher->safe_psql('postgres',
> + "DELETE FROM users_table_part WHERE user_id = 12 and value_1 = 12;");
> +
> +# wait until the index is used on the subscriber
> +$node_subscriber->poll_query_until(
> + 'postgres', q{select sum(idx_scan)=3 from pg_stat_all_indexes where
> indexrelname ilike 'users_table_part_%';}
> +) or die "Timed out while waiting for check subscriber
> tap_sub_rep_full updates partitioned table";
> +
>
> Can we combine these two polls?
>

Looking at it closely, the first one seems like an unnecessary poll anyway.
We can simply check idx_scan at the end of the test; I don't see
value in checking earlier.


>
> 6.
> +# Testcase start: SUBSCRIPTION USES INDEX WITH MULTIPLE ROWS AND COLUMNS,
> ALSO
> +# DROPS COLUMN
>
> In this test, let's try to update/delete 2-3 rows instead of 20. And
> after drop columns, let's keep just one of the update or delete.
>

changed to 3 rows


>
> 7. Apart from the above, I think it is better to use
> wait_for_catchup() consistently before trying to verify the data on
> the subscriber. We always use it in other tests. I guess here you are
> relying on the poll for index scans to ensure that data is replicated
> but I feel it may still be better to use wait_for_catchup().
>

Yes, that 

Implement IF NOT EXISTS for CREATE PUBLICATION AND CREATE SUBSCRIPTION

2023-03-12 Thread vignesh C
Hi,

Currently we don't support "IF NOT EXISTS" for CREATE PUBLICATION and
CREATE SUBSCRIPTION. I felt it would be useful to add "IF NOT EXISTS",
which will create the publication/subscription only if the object
does not exist.
Attached is a patch for handling the same.
Thoughts?
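For illustration, a hypothetical session with the patch applied would behave as below (the ERROR and NOTICE texts are taken from the patch's ereport calls; the publication name is made up):

```sql
CREATE PUBLICATION mypub FOR ALL TABLES;

CREATE PUBLICATION mypub FOR ALL TABLES;
-- ERROR:  publication "mypub" already exists

CREATE PUBLICATION IF NOT EXISTS mypub FOR ALL TABLES;
-- NOTICE:  publication "mypub" already exists, skipping
```

The same pattern applies to CREATE SUBSCRIPTION with its CONNECTION and PUBLICATION clauses.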

Regards,
Vignesh
From 63b53a87253af6e5d08f473cb0d568c2321a2bd3 Mon Sep 17 00:00:00 2001
From: Vignesh C 
Date: Fri, 3 Mar 2023 15:54:00 +0530
Subject: [PATCH v1] Implement IF NOT EXISTS for CREATE PUBLICATION AND CREATE
 SUBSCRIPTION

Implement IF NOT EXISTS for CREATE PUBLICATION AND CREATE SUBSCRIPTION
---
 doc/src/sgml/ref/create_publication.sgml   | 14 ++-
 doc/src/sgml/ref/create_subscription.sgml  | 14 ++-
 src/backend/commands/publicationcmds.c | 20 --
 src/backend/commands/subscriptioncmds.c| 18 +++--
 src/backend/parser/gram.y  | 45 ++
 src/include/nodes/parsenodes.h |  2 +
 src/test/regress/expected/publication.out  |  2 +
 src/test/regress/expected/subscription.out |  2 +
 src/test/regress/sql/publication.sql   |  1 +
 src/test/regress/sql/subscription.sql  |  1 +
 10 files changed, 109 insertions(+), 10 deletions(-)

diff --git a/doc/src/sgml/ref/create_publication.sgml b/doc/src/sgml/ref/create_publication.sgml
index a2946feaa3..b2f17bd7d8 100644
--- a/doc/src/sgml/ref/create_publication.sgml
+++ b/doc/src/sgml/ref/create_publication.sgml
@@ -21,7 +21,7 @@ PostgreSQL documentation
 
  
 
-CREATE PUBLICATION name
+CREATE PUBLICATION [ IF NOT EXISTS ] name
 [ FOR ALL TABLES
   | FOR publication_object [, ... ] ]
 [ WITH ( publication_parameter [= value] [, ... ] ) ]
@@ -54,6 +54,18 @@ CREATE PUBLICATION name
   Parameters
 
   
+   
+IF NOT EXISTS
+
+ 
+  Do not throw an error if a publication with the same name already exists.
+  A notice is issued in this case.  Note that there is no guarantee that
+  the existing publication is anything like the one that would have been
+  created.
+ 
+
+   
+

 name
 
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index 51c45f17c7..4c7fe5dccd 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -21,7 +21,7 @@ PostgreSQL documentation
 
  
 
-CREATE SUBSCRIPTION subscription_name
+CREATE SUBSCRIPTION [ IF NOT EXISTS ] subscription_name
 CONNECTION 'conninfo'
 PUBLICATION publication_name [, ...]
 [ WITH ( subscription_parameter [= value] [, ... ] ) ]
@@ -61,6 +61,18 @@ CREATE SUBSCRIPTION subscription_nameParameters
 
   
+   
+IF NOT EXISTS
+
+ 
+  Do not throw an error if a subscription with the same name already
+  exists. A notice is issued in this case.  Note that there is no guarantee
+  that the existing subscription is anything like the one that would have
+  been created.
+ 
+
+   
+

 subscription_name
 
diff --git a/src/backend/commands/publicationcmds.c b/src/backend/commands/publicationcmds.c
index f4ba572697..41e5f2c3e1 100644
--- a/src/backend/commands/publicationcmds.c
+++ b/src/backend/commands/publicationcmds.c
@@ -766,10 +766,22 @@ CreatePublication(ParseState *pstate, CreatePublicationStmt *stmt)
 	puboid = GetSysCacheOid1(PUBLICATIONNAME, Anum_pg_publication_oid,
 			 CStringGetDatum(stmt->pubname));
 	if (OidIsValid(puboid))
-		ereport(ERROR,
-(errcode(ERRCODE_DUPLICATE_OBJECT),
- errmsg("publication \"%s\" already exists",
-		stmt->pubname)));
+	{
+		if (stmt->if_not_exists)
+		{
+			ereport(NOTICE,
+	errcode(ERRCODE_DUPLICATE_OBJECT),
+	errmsg("publication \"%s\" already exists, skipping",
+		   stmt->pubname));
+			table_close(rel, RowExclusiveLock);
+			return InvalidObjectAddress;
+		}
+		else
+			ereport(ERROR,
+	(errcode(ERRCODE_DUPLICATE_OBJECT),
+	errmsg("publication \"%s\" already exists",
+		   stmt->pubname)));
+	}
 
 	/* Form a tuple. */
 	memset(values, 0, sizeof(values));
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 464db6d247..f2dd59244b 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -593,10 +593,20 @@ CreateSubscription(ParseState *pstate, CreateSubscriptionStmt *stmt,
 			MyDatabaseId, CStringGetDatum(stmt->subname));
 	if (OidIsValid(subid))
 	{
-		ereport(ERROR,
-(errcode(ERRCODE_DUPLICATE_OBJECT),
- errmsg("subscription \"%s\" already exists",
-		stmt->subname)));
+		if (stmt->if_not_exists)
+		{
+			ereport(NOTICE,
+	errcode(ERRCODE_DUPLICATE_OBJECT),
+	errmsg("subscription \"%s\" already exists, skipping",
+		   stmt->subname));
+			table_close(rel, RowExclusiveLock);
+			return InvalidObjectAddress;
+		}
+		else
+			ereport(ERROR,
+	(errcode(ERRCODE_DUPLICATE_OBJECT),
+	errmsg("subscription \"%s\" already exists",
+			stmt->subname)));
 	}
 
 	if (!IsS

Re: Account löschen

2023-03-12 Thread Alvaro Herrera
On 2023-Mar-12, Koray Ili wrote:

>Dear Sir or Madam,
> 
>I kindly request the deletion of my account with the username
>[...]

This has been taken care of.

-- 
Álvaro Herrera




Re: unsafe_tests module

2023-03-12 Thread Andrew Dunstan


On 2023-03-11 Sa 19:12, Andres Freund wrote:

Hi,

On 2023-02-22 06:47:34 -0500, Andrew Dunstan wrote:

Given its nature and purpose as a module we don't want to run against an
installed instance, shouldn't src/test/modules/unsafe_tests have
NO_INSTALLCHECK=1 in its Makefile and runningcheck:false in its meson.build?

Seems like a good idea to me.



Thanks, done.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com


Account löschen

2023-03-12 Thread Koray Ili
Dear Sir or Madam,

I kindly request the deletion of my account with the username korayili2...@gmx.de. Please confirm. Thank you very much.

Best regards,

Koray Ili

Email: korayili2...@gmx.de
Tel.: +49 (0) 179 3474321
Address: Hansaring 21, 46483 Wesel, Germany




Re: Add LZ4 compression in pg_dump

2023-03-12 Thread Peter Eisentraut

On 11.03.23 07:00, Alexander Lakhin wrote:

Hello,
23.02.2023 23:24, Tomas Vondra wrote:

On 2/23/23 16:26, Tomas Vondra wrote:

Thanks for v30 with the updated commit messages. I've pushed 0001 after
fixing a comment typo and removing (I think) an unnecessary change in an
error message.

I'll give the buildfarm a bit of time before pushing 0002 and 0003.


I've now pushed 0002 and 0003, after minor tweaks (a couple typos etc.),
and marked the CF entry as committed. Thanks for the patch!

I wonder how difficult would it be to add the zstd compression, so that
we don't have the annoying "unsupported" cases.


With the patch 0003 committed, a single -Wtype-limits warning appeared in the
master branch:
$ CPPFLAGS="-Og -Wtype-limits" ./configure --with-lz4 -q && make -s -j8
compress_lz4.c: In function ‘LZ4File_gets’:
compress_lz4.c:492:19: warning: comparison of unsigned expression in ‘< 0’ is always false [-Wtype-limits]
  492 |         if (dsize < 0)
      |
(I wonder, is it accidental that there are no other places that trigger
the warning, or did some buildfarm animals have this check enabled before?)

I think there is an underlying problem in this code in that it dances back
and forth between size_t and int in an unprincipled way.

In the code that triggers the warning, dsize is size_t.  dsize is the
return value of LZ4File_read_internal(), which is declared to return int.
The variable that LZ4File_read_internal() returns in the success case is
size_t, but in case of an error it returns -1.  (So the code that is
warning is meant to catch this error case, but it won't ever work.)
Further below, LZ4File_read_internal() calls LZ4File_read_overflow(),
which is declared to return int, but in some cases it returns
fs->overflowlen, which is size_t.

This should be cleaned up.

AFAICT, the upstream API in lz4.h uses int for size values, but
lz4frame.h uses size_t, so I don't know what the correct approach is.





Re: [Proposal] Allow pg_dump to include all child tables with the root table

2023-03-12 Thread Gilles Darold

On 11/03/2023 at 19:51, Gilles Darold wrote:

On 04/03/2023 at 20:18, Tom Lane wrote:

As noted, "childs" is bad English and "partitions" is flat out wrong
(unless you change it to recurse only to partitions, which doesn't
seem like a better definition).  We could go with
--[exclude-]table-and-children, or maybe
--[exclude-]table-and-child-tables, but those are getting into
carpal-tunnel-syndrome-inducing territory 🙁.  I lack a better
naming suggestion offhand.



Attached is version 3 of the patch; it includes the options
suggested by Stephane and Tom:


    --table-and-children,

    --exclude-table-and-children,

    --exclude-table-data-and-children.

Documentation has been updated too.


Thanks



New version v4 of the patch attached with a typo in documentation fixed.

--
Gilles Darold.
diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml
index 334e4b7fd1..8ce4d5ff41 100644
--- a/doc/src/sgml/ref/pg_dump.sgml
+++ b/doc/src/sgml/ref/pg_dump.sgml
@@ -775,6 +775,16 @@ PostgreSQL documentation
   
  
 
+ 
+  --exclude-table-and-children=pattern
+  
+   
+Same as -T/--exclude-table option but automatically
+exclude partitions or child tables of the specified tables if any.
+   
+  
+ 
+
  
   --exclude-table-data=pattern
   
@@ -793,6 +803,17 @@ PostgreSQL documentation
   
  
 
+ 
+  --exclude-table-data-and-children=pattern
+  
+   
+Same as --exclude-table-data  but automatically
+exclude partitions or child tables of the specified tables if any.
+   
+  
+ 
+
+
  
   --extra-float-digits=ndigits
   
@@ -1158,6 +1179,16 @@ PostgreSQL documentation
   
  
 
+ 
+  --table-and-children=pattern
+  
+   
+Same as -t/--table option but automatically
+include partitions or child tables of the specified tables if any.
+   
+  
+ 
+
  
   --use-set-session-authorization
   
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 4217908f84..09d37991d6 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -130,6 +130,10 @@ static SimpleOidList foreign_servers_include_oids = {NULL, NULL};
 static SimpleStringList extension_include_patterns = {NULL, NULL};
 static SimpleOidList extension_include_oids = {NULL, NULL};
 
+static SimpleStringList table_include_patterns_and_children = {NULL, NULL};
+static SimpleStringList table_exclude_patterns_and_children = {NULL, NULL};
+static SimpleStringList tabledata_exclude_patterns_and_children = {NULL, NULL};
+
 static const CatalogId nilCatalogId = {0, 0};
 
 /* override for standard extra_float_digits setting */
@@ -180,7 +184,8 @@ static void expand_foreign_server_name_patterns(Archive *fout,
 static void expand_table_name_patterns(Archive *fout,
 	   SimpleStringList *patterns,
 	   SimpleOidList *oids,
-	   bool strict_names);
+	   bool strict_names,
+	   bool with_child_tables);
 static void prohibit_crossdb_refs(PGconn *conn, const char *dbname,
   const char *pattern);
 
@@ -421,6 +426,9 @@ main(int argc, char **argv)
 		{"on-conflict-do-nothing", no_argument, &dopt.do_nothing, 1},
 		{"rows-per-insert", required_argument, NULL, 10},
 		{"include-foreign-data", required_argument, NULL, 11},
+		{"table-and-children", required_argument, NULL, 12},
+		{"exclude-table-and-children", required_argument, NULL, 13},
+		{"exclude-table-data-and-children", required_argument, NULL, 14},
 
 		{NULL, 0, NULL, 0}
 	};
@@ -631,6 +639,19 @@ main(int argc, char **argv)
 		  optarg);
 break;
 
+			case 12:			/* include table(s) with children */
+simple_string_list_append(&table_include_patterns_and_children, optarg);
+dopt.include_everything = false;
+break;
+
+			case 13:			/* exclude table(s) with children */
+simple_string_list_append(&table_exclude_patterns_and_children, optarg);
+break;
+
+			case 14:/* exclude table(s) data */
+simple_string_list_append(&tabledata_exclude_patterns_and_children, optarg);
+break;
+
 			default:
 /* getopt_long already emitted a complaint */
 pg_log_error_hint("Try \"%s --help\" for more information.", progname);
@@ -814,17 +835,34 @@ main(int argc, char **argv)
 	{
 		expand_table_name_patterns(fout, &table_include_patterns,
    &table_include_oids,
-   strict_names);
+   strict_names, false);
 		if (table_include_oids.head == NULL)
 			pg_fatal("no matching tables were found");
 	}
 	expand_table_name_patterns(fout, &table_exclude_patterns,
 			   &table_exclude_oids,
-			   false);
+			   false, false);
 
 	expand_table_name_patterns(fout, &tabledata_exclude_patterns,
 			   &tabledata_exclude_oids,
-			   false);
+			   false, false);
+
+	/* Expand table and children selection patterns into OID lists */
+	if (table_include_patterns_an

Re: [PoC] Implementation of distinct in Window Aggregates

2023-03-12 Thread Ankit Kumar Pandey

Attaching the updated patch with a fix for an issue in the window function.

I have also fixed the naming convention of the patch, as the last patch
had an incompatible name.


Note:

1. Pending: Investigation of test cases failures.


Regards,

Ankit
diff --git a/contrib/amcheck/verify_heapam.c b/contrib/amcheck/verify_heapam.c
index 4fcfd6df72..94ddccd23a 100644
--- a/contrib/amcheck/verify_heapam.c
+++ b/contrib/amcheck/verify_heapam.c
@@ -1574,14 +1574,40 @@ check_tuple(HeapCheckContext *ctx)
 static FullTransactionId
 FullTransactionIdFromXidAndCtx(TransactionId xid, const HeapCheckContext *ctx)
 {
-	uint32		epoch;
+	uint64		nextfxid_i;
+	int32		diff;
+	FullTransactionId fxid;
+
+	Assert(TransactionIdIsNormal(ctx->next_xid));
+	Assert(FullTransactionIdIsNormal(ctx->next_fxid));
+	Assert(XidFromFullTransactionId(ctx->next_fxid) == ctx->next_xid);
 
 	if (!TransactionIdIsNormal(xid))
 		return FullTransactionIdFromEpochAndXid(0, xid);
-	epoch = EpochFromFullTransactionId(ctx->next_fxid);
-	if (xid > ctx->next_xid)
-		epoch--;
-	return FullTransactionIdFromEpochAndXid(epoch, xid);
+
+	nextfxid_i = U64FromFullTransactionId(ctx->next_fxid);
+
+	/* compute the 32bit modulo difference */
+	diff = (int32) (ctx->next_xid - xid);
+
+	/*
+	 * In cases of corruption we might see a 32bit xid that is before epoch
+	 * 0. We can't represent that as a 64bit xid, due to 64bit xids being
+	 * unsigned integers, without the modulo arithmetic of 32bit xid. There's
+	 * no really nice way to deal with that, but it works ok enough to use
+	 * FirstNormalFullTransactionId in that case, as a freshly initdb'd
+	 * cluster already has a newer horizon.
+	 */
+	if (diff > 0 && (nextfxid_i - FirstNormalTransactionId) < (int64) diff)
+	{
+		Assert(EpochFromFullTransactionId(ctx->next_fxid) == 0);
+		fxid = FirstNormalFullTransactionId;
+	}
+	else
+		fxid = FullTransactionIdFromU64(nextfxid_i - diff);
+
+	Assert(FullTransactionIdIsNormal(fxid));
+	return fxid;
 }
 
 /*
@@ -1597,8 +1623,8 @@ update_cached_xid_range(HeapCheckContext *ctx)
 	LWLockRelease(XidGenLock);
 
 	/* And compute alternate versions of the same */
-	ctx->oldest_fxid = FullTransactionIdFromXidAndCtx(ctx->oldest_xid, ctx);
 	ctx->next_xid = XidFromFullTransactionId(ctx->next_fxid);
+	ctx->oldest_fxid = FullTransactionIdFromXidAndCtx(ctx->oldest_xid, ctx);
 }
 
 /*
diff --git a/contrib/pg_trgm/expected/pg_word_trgm.out b/contrib/pg_trgm/expected/pg_word_trgm.out
index 936d489390..c66a67f30e 100644
--- a/contrib/pg_trgm/expected/pg_word_trgm.out
+++ b/contrib/pg_trgm/expected/pg_word_trgm.out
@@ -1044,3 +1044,9 @@ select t,word_similarity('Kabankala',t) as sml from test_trgm2 where t %> 'Kaban
  Waikala  |  0.3
 (89 rows)
 
+-- test unsatisfiable pattern
+select * from test_trgm2 where t ~ '.*$x';
+ t 
+---
+(0 rows)
+
diff --git a/contrib/pg_trgm/sql/pg_word_trgm.sql b/contrib/pg_trgm/sql/pg_word_trgm.sql
index d9fa1c55e5..d2ada49133 100644
--- a/contrib/pg_trgm/sql/pg_word_trgm.sql
+++ b/contrib/pg_trgm/sql/pg_word_trgm.sql
@@ -43,3 +43,6 @@ select t,word_similarity('Baykal',t) as sml from test_trgm2 where 'Baykal' <% t
 select t,word_similarity('Kabankala',t) as sml from test_trgm2 where 'Kabankala' <% t order by sml desc, t;
 select t,word_similarity('Baykal',t) as sml from test_trgm2 where t %> 'Baykal' order by sml desc, t;
 select t,word_similarity('Kabankala',t) as sml from test_trgm2 where t %> 'Kabankala' order by sml desc, t;
+
+-- test unsatisfiable pattern
+select * from test_trgm2 where t ~ '.*$x';
diff --git a/contrib/pg_trgm/trgm_regexp.c b/contrib/pg_trgm/trgm_regexp.c
index 9a00564ae4..06cd3db67b 100644
--- a/contrib/pg_trgm/trgm_regexp.c
+++ b/contrib/pg_trgm/trgm_regexp.c
@@ -1947,9 +1947,7 @@ packGraph(TrgmNFA *trgmNFA, MemoryContext rcontext)
 arcsCount;
 	HASH_SEQ_STATUS scan_status;
 	TrgmState  *state;
-	TrgmPackArcInfo *arcs,
-			   *p1,
-			   *p2;
+	TrgmPackArcInfo *arcs;
 	TrgmPackedArc *packedArcs;
 	TrgmPackedGraph *result;
 	int			i,
@@ -2021,17 +2019,25 @@ packGraph(TrgmNFA *trgmNFA, MemoryContext rcontext)
 	qsort(arcs, arcIndex, sizeof(TrgmPackArcInfo), packArcInfoCmp);
 
 	/* We could have duplicates because states were merged. Remove them. */
-	/* p1 is probe point, p2 is last known non-duplicate. */
-	p2 = arcs;
-	for (p1 = arcs + 1; p1 < arcs + arcIndex; p1++)
+	if (arcIndex > 1)
 	{
-		if (packArcInfoCmp(p1, p2) > 0)
+		/* p1 is probe point, p2 is last known non-duplicate. */
+		TrgmPackArcInfo *p1,
+				   *p2;
+
+		p2 = arcs;
+		for (p1 = arcs + 1; p1 < arcs + arcIndex; p1++)
 		{
-			p2++;
-			*p2 = *p1;
+			if (packArcInfoCmp(p1, p2) > 0)
+			{
+				p2++;
+				*p2 = *p1;
+			}
 		}
+		arcsCount = (p2 - arcs) + 1;
 	}
-	arcsCount = (p2 - arcs) + 1;
+	else
+		arcsCount = arcIndex;
 
 	/* Create packed representation */
 	result = (TrgmPackedGraph *)
diff --git a/meson.build b/meson.build
index 2409cc2254..d4384f1bf6 100644
--- a/meson.build
+++ b/meson.build
@@ -1268,7 +1268,7 @@ if uuidopt != 'none'
   elif uuidop