Re: Is there any documentation on how to correctly create extensions in HA(primary-standby) setup?
Hi, On Thu, Dec 09, 2021 at 08:19:06AM +0530, Bharath Rupireddy wrote: > > Thanks. Attaching v1 patch specifying the notes there. Please review. I think that the common terminology is "module", not "extension". That's especially important here as this information is also relevant for modules that may come with an SQL-level extension. This should be made clear in that new documentation, same for the CREATE EXTENSION part that may not be relevant. It also seems that this documentation is only aimed at physical replication. That should also be explicitly stated, as it might not be obvious to the intended readers. + [...] set it either via ALTER SYSTEM + command or postgresql.conf file on both primary and + standys, reload the postgresql.conf file and restart + the servers. Isn't the reload terrible advice? By definition, changing shared_preload_libraries isn't compatible with a simple reload and will emit an error. + [...] Create the extension on the primary, there is no need to + create it on the standbys as the CREATE EXTENSION + command is replicated. The "no need" here is quite ambiguous, as it seems to indicate that trying to create the extension on the standby will work but is unnecessary.
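For reference, the restart requirement discussed above can be sketched as a config fragment (the module name is purely illustrative, not from the patch):

```
# postgresql.conf -- set identically on the primary and on every standby
shared_preload_libraries = 'some_module'   # takes effect only at server start;
                                           # a reload (SIGHUP / pg_reload_conf()) is not enough
```

This is what makes "reload ... and restart" in the proposed wording misleading: only the restart matters for this parameter.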
Re: generic plans and "initial" pruning
On Tue, 11 Jan 2022 at 16:22, Robert Haas wrote: > This is just a relatively simple example and I think there are > probably a bunch of others. There are a lot of kinds of DDL that could > be performed on a partition that gets pruned away: DROP INDEX is just > one example. I haven't followed this in any detail, but this patch and its goal of reducing the O(N) drag effect on partition execution time is very important. Locking a long list of objects that then get pruned is very wasteful, as the results show. Ideally, we want an O(1) algorithm for single-partition access, and DDL is rare. So perhaps that is the starting point for a safe design - invent a single lock or cache that allows us to check whether the partition hierarchy has changed in any way: if so, replan; if not, skip the locks. Please excuse me if this idea falls short; if so, please just note my comment about how important this is. Thanks. -- Simon Riggs http://www.EnterpriseDB.com/
Re: 32TB relation size make mdnblocks overflow
Hi, On Tue, Jan 18, 2022 at 02:21:14PM +0800, 陈佳昕(步真) wrote: > > We know that PostgreSQL doesn't support a single relation size over 32TB, > limited by the MaxBlockNumber. But if we just 'insert into' one relation over > 32TB, it will get an error message 'unexpected data beyond EOF in block 0 of > relation' in ReadBuffer_common. The '0 block' is from mdnblocks function > where the segment number is over 256 and make segno * RELSEG_SIZE over > uint32's max value. So is it necessary to make the error message more > readable like 'The relation size is over max value ...' and elog in > mdnblocks? I didn't try it but this is supposed to be caught by mdextend(): /* * If a relation manages to grow to 2^32-1 blocks, refuse to extend it any * more --- we mustn't create a block whose number actually is * InvalidBlockNumber. (Note that this failure should be unreachable * because of upstream checks in bufmgr.c.) */ if (blocknum == InvalidBlockNumber) ereport(ERROR, (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED), errmsg("cannot extend file \"%s\" beyond %u blocks", relpath(reln->smgr_rnode, forknum), InvalidBlockNumber))); Didn't you hit this?
Re: pg_ls_tmpdir to show directories and shared filesets (and pg_ls_*)
Hi, On Sun, Jan 02, 2022 at 02:55:04PM +0100, Fabien COELHO wrote: > > > Here is my review about v32: > > I forgot to tell that doc generation for the cumulated changes also works. Unfortunately the patchset doesn't apply anymore: http://cfbot.cputube.org/patch_36_2377.log === Applying patches on top of PostgreSQL commit ID 4483b2cf29bfe8091b721756928ccbe31c5c8e14 === === applying patch ./v32-0003-Add-pg_ls_dir_metadata-to-list-a-dir-with-file-m.patch [...] patching file src/test/regress/expected/misc_functions.out Hunk #1 succeeded at 274 (offset 7 lines). patching file src/test/regress/expected/tablespace.out Hunk #1 FAILED at 16. 1 out of 1 hunk FAILED -- saving rejects to file src/test/regress/expected/tablespace.out.rej Justin, could you send a rebased version?
Re: missing indexes in indexlist with partitioned tables
Hi, On Mon, Jan 17, 2022 at 08:32:40PM +, Arne Roland wrote: > > Afaiac the join pruning where the outer table is a partitioned table is the > relevant case. The last version of the patch now fails on all platforms, with plan changes. For instance: https://cirrus-ci.com/task/4825629131538432 https://api.cirrus-ci.com/v1/artifact/task/4825629131538432/regress_diffs/src/test/regress/regression.diffs

diff -U3 /tmp/cirrus-ci-build/src/test/regress/expected/partition_join.out /tmp/cirrus-ci-build/src/test/regress/results/partition_join.out
--- /tmp/cirrus-ci-build/src/test/regress/expected/partition_join.out	2022-01-17 23:08:47.158198249 +0000
+++ /tmp/cirrus-ci-build/src/test/regress/results/partition_join.out	2022-01-17 23:12:34.163488567 +0000
@@ -4887,37 +4887,23 @@
 SET enable_partitionwise_join = on;
 EXPLAIN (COSTS OFF) SELECT * FROM fract_t x LEFT JOIN fract_t y USING (id) ORDER BY id ASC LIMIT 10;
-                           QUERY PLAN
------------------------------------------------------------------
+                      QUERY PLAN
+---------------------------------------------------------
 Limit
-   ->  Merge Append
-         Sort Key: x.id
-         ->  Merge Left Join
-               Merge Cond: (x_1.id = y_1.id)
-               ->  Index Only Scan using fract_t0_pkey on fract_t0 x_1
-               ->  Index Only Scan using fract_t0_pkey on fract_t0 y_1
-         ->  Merge Left Join
-               Merge Cond: (x_2.id = y_2.id)
-               ->  Index Only Scan using fract_t1_pkey on fract_t1 x_2
-               ->  Index Only Scan using fract_t1_pkey on fract_t1 y_2
-(11 rows)
+   ->  Append
+         ->  Index Only Scan using fract_t0_pkey on fract_t0 x_1
+         ->  Index Only Scan using fract_t1_pkey on fract_t1 x_2
+(4 rows)
 
 EXPLAIN (COSTS OFF) SELECT * FROM fract_t x LEFT JOIN fract_t y USING (id) ORDER BY id DESC LIMIT 10;
-                              QUERY PLAN
--------------------------------------------------------------------------
+                          QUERY PLAN
+------------------------------------------------------------------
 Limit
-   ->  Merge Append
-         Sort Key: x.id DESC
-         ->  Nested Loop Left Join
-               ->  Index Only Scan Backward using fract_t0_pkey on fract_t0 x_1
-               ->  Index Only Scan using fract_t0_pkey on fract_t0 y_1
-                     Index Cond: (id = x_1.id)
-         ->  Nested Loop Left Join
-               ->  Index Only Scan Backward using fract_t1_pkey on fract_t1 x_2
-               ->  Index Only Scan using fract_t1_pkey on fract_t1 y_2
-                     Index Cond: (id = x_2.id)
-(11 rows)
+   ->  Append
+         ->  Index Only Scan Backward using fract_t1_pkey on fract_t1 x_2
+         ->  Index Only Scan Backward using fract_t0_pkey on fract_t0 x_1
+(4 rows)
Re: BUFFERS enabled by default in EXPLAIN (ANALYZE)
Hi, On Wed, Dec 01, 2021 at 06:58:20PM -0600, Justin Pryzby wrote: > > The attached patch series now looks like this (some minor patches are not > included in this list): This version of the patchset doesn't apply anymore: http://cfbot.cputube.org/patch_36_3409.log === Applying patches on top of PostgreSQL commit ID 4483b2cf29bfe8091b721756928ccbe31c5c8e14 === === applying patch ./0003-Make-explain-default-to-BUFFERS-TRUE.patch patching file doc/src/sgml/config.sgml Hunk #1 FAILED at 7615. 1 out of 1 hunk FAILED -- saving rejects to file doc/src/sgml/config.sgml.rej By the way, could you add versioning to the patchsets you send, e.g. git format-patch -vXXX. It would make it easier to know which version is being used or discussed. I'm switching the CF entry to Waiting on Author.
Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)
On Tue, Jan 18, 2022 at 12:35 AM Robert Haas wrote: > > On Tue, Dec 14, 2021 at 1:21 PM Shruthi Gowda wrote: > > Thanks, Robert for the updated version. I reviewed the changes and it > > looks fine. > > I also tested the patch. The patch works as expected. > > Thanks. > > > > - I adjusted the function header comment for heap_create. Your > > > proposed comment seemed like it was pretty detailed but not 100% > > > correct. It also made one of the lines kind of long because you didn't > > > wrap the text in the surrounding style. I decided to make it simpler > > > and shorter instead of longer still and 100% correct. > > > > The comment update looks fine. However, I still feel it would be good to > > mention on which (rare) circumstance a valid relfilenode can get passed. > > In general, I think it's the job of a function parameter comment to > describe what the parameter does, not how the callers actually use it. > One problem with describing the latter is that, if someone later adds > another caller, there is a pretty good chance that they won't notice > that the comment needs to be changed. More fundamentally, the > parameter function comments should be like an instruction manual for > how to use the function. If you are trying to figure out how to use > this function, it is not helpful to know that "most callers like to > pass false" for this parameter. What you need to know is what value > your new call site should pass, and knowing what "most callers" do or > that something is "rare" doesn't really help. If we want to make this > comment more detailed, we should approach it from the point of view of > explaining how it ought to be set. It's clear now. Thanks for clarifying. > I've committed the v8-0001 patch you attached. I'll write separately > about v8-0002. Sure. Thank you.
Re: Use generation context to speed up tuplesorts
Hi, On Fri, Jan 07, 2022 at 12:03:55PM +0100, Ronan Dunklau wrote: > > (Sorry for trying to merge back the discussion on the two sides of the thread) > > In https://www.postgresql.org/message-id/4776839.iZASKD2KPV%40aivenronan, I > expressed the idea of being able to tune glibc's malloc behaviour. > > I implemented that (patch 0001) to provide a new hook which is called on > backend startup, and anytime we set work_mem. This hook is # defined > depending > on the malloc implementation: currently a default, no-op implementation is > provided as well as a glibc's malloc implementation. This patch apparently misses something to have the no-op on windows: https://cirrus-ci.com/task/4978247640285184 [03:03:01.676] Build FAILED. [03:03:01.676] [03:03:01.676] c:\cirrus\contrib\malloc_stats\malloc_stats.c(19,15): warning C4013: 'mallinfo2' undefined; assuming extern returning int [c:\cirrus\malloc_stats.vcxproj] [03:03:01.676] c:\cirrus\contrib\malloc_stats\malloc_stats.c(18,19): error C2079: 'm' uses undefined struct 'mallinfo2' [c:\cirrus\malloc_stats.vcxproj] [03:03:01.676] c:\cirrus\contrib\malloc_stats\malloc_stats.c(32,1): error C2224: left of '.arena' must have struct/union type [c:\cirrus\malloc_stats.vcxproj] [03:03:01.676] c:\cirrus\contrib\malloc_stats\malloc_stats.c(32,1): error C2224: left of '.hblkhd' must have struct/union type [c:\cirrus\malloc_stats.vcxproj] [03:03:01.676] c:\cirrus\contrib\malloc_stats\malloc_stats.c(32,1): error C2224: left of '.uordblks' must have struct/union type [c:\cirrus\malloc_stats.vcxproj] [03:03:01.676] c:\cirrus\contrib\malloc_stats\malloc_stats.c(32,1): error C2224: left of '.fordblks' must have struct/union type [c:\cirrus\malloc_stats.vcxproj] [03:03:01.676] c:\cirrus\contrib\malloc_stats\malloc_stats.c(32,1): error C2224: left of '.keepcost' must have struct/union type [c:\cirrus\malloc_stats.vcxproj] I'm also a bit confused about which patch(es) should actually be reviewed in that thread. 
My understanding is that the original patch might not be relevant anymore, but I might be wrong. The CF entry should probably be updated to clarify that, with an annotation / title change / author update or something. In the meantime I will switch the entry to Waiting on Author.
Re: Support for NSS as a libpq TLS backend
Hi, On Mon, Jan 17, 2022 at 03:09:11PM +0100, Daniel Gustafsson wrote: > > I must've fat-fingered the "git add -p" for v50 as the fix was in configure.ac > but not configure. Fixed now. Thanks! Apparently this version now fails on all OS, e.g.: https://cirrus-ci.com/task/4643868095283200 [22:17:39.965] # Failed test 'certificate authorization succeeds with correct client cert in PEM format' [22:17:39.965] # at t/001_ssltests.pl line 456. [22:17:39.965] # got: '2' [22:17:39.965] # expected: '0' [22:17:39.965] [22:17:39.965] # Failed test 'certificate authorization succeeds with correct client cert in PEM format: no stderr' [22:17:39.965] # at t/001_ssltests.pl line 456. [22:17:39.965] # got: 'psql: error: connection to server at "127.0.0.1", port 50023 failed: certificate present, but not private key file "/home/postgres/.postgresql/postgresql.key"' [22:17:39.965] # expected: '' [22:17:39.965] [22:17:39.965] # Failed test 'certificate authorization succeeds with correct client cert in DER format' [22:17:39.965] # at t/001_ssltests.pl line 475. [22:17:39.965] # got: '2' [22:17:39.965] # expected: '0' [...]
Re: SQL/JSON: functions
Hi, The last version conflicts with recent c4cc2850f4d1 (Rename value node fields). Can you send a rebased version?
Re: SQL/JSON: JSON_TABLE
Hi, On Tue, Jan 04, 2022 at 09:03:05AM -0500, Andrew Dunstan wrote: > > rebased again. This version conflicts with recent c4cc2850f4d1 (Rename value node fields). Can you send a rebased version?
Re: simplifying foreign key/RI checks
Thanks for the review. On Tue, Dec 21, 2021 at 5:54 PM Zhihong Yu wrote: > Hi, > > + int lockflags = 0; > + TM_Result test; > + > + lockflags = TUPLE_LOCK_FLAG_LOCK_UPDATE_IN_PROGRESS; > > The above assignment can be merged with the line where variable lockflags is > declared. Sure. > + GetUserIdAndSecContext(&save_userid, &save_sec_context); > > save_userid -> saved_userid > save_sec_context -> saved_sec_context I agree that's better, though I guess I had kept the names as they were in other functions. Fixed nevertheless. > +* the transaction-snapshot mode. If we didn't push one already, do > > didn't push -> haven't pushed Done. > For ri_PerformCheck(): > > + bool source_is_pk = true; > > It seems the value of source_is_pk doesn't change - the value true can be > plugged into ri_ExtractValues() calls directly. OK, done. v13 is attached. -- Amit Langote EDB: http://www.enterprisedb.com v13-0002-Avoid-using-SPI-for-some-RI-checks.patch Description: Binary data v13-0001-Add-isolation-tests-for-snapshot-behavior-in-ri_.patch Description: Binary data
Re: In-placre persistance change of a relation
Hi, On Fri, Jan 14, 2022 at 11:43:10AM +0900, Kyotaro Horiguchi wrote: > I found a bug. > > mdmarkexists() didn't close the tentatively opened fd. This is a silent > leak on Linux and similar systems, and it causes a delete failure on Windows. > It was the reason for the CI failure. > > 027_persistence_change.pl uses interactive_psql(), which doesn't work on > the Windows VM on the CI. > > In this version the following changes have been made in 0001. > > - Properly close the file descriptor in mdmarkexists. > > - Skip some tests when IO::Pty is not available. > It might be better to separate that part. > > Looking again at the ALTER TABLE ALL IN TABLESPACE SET LOGGED patch, I > noticed that it doesn't implement the OWNED BY part and doesn't have tests > or documentation (it was a PoC). Added all of them to 0002. The cfbot is failing on all OS with this version of the patch. Apparently v16-0002 introduces some usage of a "testtablespace" client-side variable that's never defined, e.g. https://api.cirrus-ci.com/v1/artifact/task/4670105480069120/regress_diffs/src/bin/pg_upgrade/tmp_check/regress/regression.diffs: diff -U3 /tmp/cirrus-ci-build/src/test/regress/expected/tablespace.out /tmp/cirrus-ci-build/src/bin/pg_upgrade/tmp_check/regress/results/tablespace.out --- /tmp/cirrus-ci-build/src/test/regress/expected/tablespace.out 2022-01-18 04:26:38.744707547 +0000 +++ /tmp/cirrus-ci-build/src/bin/pg_upgrade/tmp_check/regress/results/tablespace.out 2022-01-18 04:30:37.557078083 +0000 @@ -948,76 +948,71 @@ CREATE SCHEMA testschema; GRANT CREATE ON SCHEMA testschema TO regress_tablespace_user1; CREATE TABLESPACE regress_tablespace LOCATION :'testtablespace'; +ERROR: syntax error at or near ":" +LINE 1: CREATE TABLESPACE regress_tablespace LOCATION :'testtablespa...
Re: tweak to a few index tests to hits ambuildempty() routine.
On Mon, Nov 29, 2021 at 10:34 AM Amul Sul wrote: > Hi, > > The attached patch makes small changes to the brin, gin & gist index tests > to use an unlogged table, without changing the original intention of > those tests, so that they hit the ambuildempty() routine which is > otherwise not reachable by the current tests. > +1 for the idea, as it gives better code coverage. > > -- > Regards, > Amul Sul > EDB: http://www.enterprisedb.com > -- Rushabh Lathia
32TB relation size make mdnblocks overflow
Hello

We know that PostgreSQL doesn't support a single relation size over 32TB, limited by MaxBlockNumber. But if we just 'insert into' one relation past 32TB, we get the error message 'unexpected data beyond EOF in block 0 of relation' in ReadBuffer_common. The '0 block' comes from the mdnblocks function, where the segment number exceeds 256 and makes segno * RELSEG_SIZE exceed uint32's max value. So is it necessary to make the error message more readable, like 'The relation size is over the max value ...', and elog in mdnblocks?

The scenario we hit is shown below: the 'mov %ebx,%eax; shl $0x18,%eax' sequence turns the 256 held in %ebx into 0, which makes segno wrap from 256 to 0.

   0x00c2cc51 <+289>:  callq  0xc657f0
   0x00c2cc56 <+294>:  mov    -0x8(%r15),%rdi
   0x00c2cc5a <+298>:  mov    %r15,%rsi
   0x00c2cc5d <+301>:  mov    %eax,%r14d
   0x00c2cc60 <+304>:  mov    0x10(%rdi),%rax
   0x00c2cc64 <+308>:  callq  *0x8(%rax)
   0x00c2cc67 <+311>:  test   %r14d,%r14d
   0x00c2cc6a <+314>:  jns    0xc2cd68
=> 0x00c2cc70 <+320>:  add    $0x28,%rsp
   0x00c2cc74 <+324>:  mov    %ebx,%eax
   0x00c2cc76 <+326>:  shl    $0x18,%eax
   0x00c2cc79 <+329>:  pop    %rbx
   0x00c2cc7a <+330>:  pop    %r12
   0x00c2cc7c <+332>:  pop    %r13
   0x00c2cc7e <+334>:  pop    %r14
   0x00c2cc80 <+336>:  pop    %r15
   0x00c2cc82 <+338>:  pop    %rbp
   0x00c2cc83 <+339>:  retq
Re: a misbehavior of partition row movement (?)
On Tue, Jan 18, 2022 at 2:41 PM Julien Rouhaud wrote: > On Tue, Jan 18, 2022 at 02:33:39PM +0900, Michael Paquier wrote: > > On Tue, Jan 18, 2022 at 12:16:23PM +0800, Julien Rouhaud wrote: > > > I'm not sure why this test failed as it doesn't seem like something > > > impacted by > > > the patch, but I may have missed something as I only had a quick look at > > > the > > > patch and discussion. > > > > This issue is discussed here: > > https://www.postgresql.org/message-id/20220117203746.oj43254j5jurb...@alap3.anarazel.de > > Oh I missed it, thanks! Sorry for the noise. Thanks, it had puzzled me too when I first saw it this morning. -- Amit Langote EDB: http://www.enterprisedb.com
Re: Skipping logical replication transactions on subscriber side
On Tue, Jan 18, 2022 at 2:37 PM osumi.takami...@fujitsu.com wrote: > > On Tuesday, January 18, 2022 1:39 PM Masahiko Sawada > wrote: > > I've attached an updated patch. All comments I got so far were incorporated > > into this patch unless I'm missing something. > > Hi, thank you for your new patch v7. > For your information, I've encountered a failure to apply patch v7 > on top of the latest commit (d3f4532) > > $ git am v7-0001-Add-ALTER-SUBSCRIPTION-.-SKIP-to-skip-the-transac.patch > Applying: Add ALTER SUBSCRIPTION ... SKIP to skip the transaction on > subscriber nodes > error: patch failed: src/backend/parser/gram.y:9954 > error: src/backend/parser/gram.y: patch does not apply > > Could you please rebase it when it's necessary ? Thank you for reporting! I've attached a rebased patch. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/ v8-0001-Add-ALTER-SUBSCRIPTION-.-SKIP-to-skip-the-transac.patch Description: Binary data
Re: Push down time-related SQLValue functions to foreign server
> > Hmm ... not really, because for these particular functions, the > point is exactly that we *don't* translate them to some function > call on the remote end. We evaluate them locally and push the > resulting constant to the far side, thus avoiding issues like > clock skew. > Ah, my pattern matching brain was so excited to see a use for routine mapping that I didn't notice that important detail.
Re: Proposal: More structured logging
Le lundi 17 janvier 2022, 09:18:04 CET Ronan Dunklau a écrit : > Le samedi 15 janvier 2022, 07:09:59 CET Julien Rouhaud a écrit : > > Hi, > > > > On Tue, Jan 11, 2022 at 11:05:26AM +0100, Ronan Dunklau wrote: > > > Done, and I added anoher commit per your suggestion to add this comment. > > > > The cfbot reports that the patchset doesn't apply anymore: > > > > http://cfbot.cputube.org/patch_36_3293.log > > === Applying patches on top of PostgreSQL commit ID > > 43c2175121c829c8591fc5117b725f1f22bfb670 === [...] > > === applying patch ./v4-0003-Output-error-tags-in-CSV-logs.patch > > [...] > > patching file src/backend/utils/error/elog.c > > Hunk #2 FAILED at 3014. > > 1 out of 2 hunks FAILED -- saving rejects to file > > src/backend/utils/error/elog.c.rej > > > > Could you send a rebased version? In the meantime I will switch the cf > > entry to Waiting on Author. > > Thank you for this notification ! > > The csvlog has been refactored since I wrote the patch, and the new jsonlog > destination has been introduced. I rebased to fix the first patch, and added > a new patch to add support for tags in jsonlog output. > > Best regards, Hum, there was a missing import in csvlog.c from the fix above. Sorry about that. 
-- Ronan Dunklau

v6-0001-Add-ErrorTag-support.patch ([PATCH v6 1/5] Add ErrorTag support; touches src/backend/utils/error/elog.c and src/include/utils/elog.h)
Description: inline patch text, garbled and truncated in the archive
Re: Push down time-related SQLValue functions to foreign server
Corey Huinker writes: > I'm very late to the party, but it seems to me that this effort is > describing a small subset of what "routine mapping" seems to be for: > defining function calls that can be pushed down to the foreign server, and > the analogous function on the foreign side. We may want to consider > implementing just enough of CREATE ROUTINE MAPPING and DROP ROUTINE MAPPING > to support these specific fixed functions. Hmm ... not really, because for these particular functions, the point is exactly that we *don't* translate them to some function call on the remote end. We evaluate them locally and push the resulting constant to the far side, thus avoiding issues like clock skew. Having said that: why couldn't that implementation sketch be used for ANY stable subexpression? What's special about the datetime SQLValueFunctions? regards, tom lane
Re: a misbehavior of partition row movement (?)
Hi, On Tue, Jan 18, 2022 at 02:33:39PM +0900, Michael Paquier wrote: > On Tue, Jan 18, 2022 at 12:16:23PM +0800, Julien Rouhaud wrote: > > I'm not sure why this test failed as it doesn't seem like something > > impacted by > > the patch, but I may have missed something as I only had a quick look at the > > patch and discussion. > > This issue is discussed here: > https://www.postgresql.org/message-id/20220117203746.oj43254j5jurb...@alap3.anarazel.de Oh I missed it, thanks! Sorry for the noise.
RE: Skipping logical replication transactions on subscriber side
On Tuesday, January 18, 2022 1:39 PM Masahiko Sawada wrote: > I've attached an updated patch. All comments I got so far were incorporated > into this patch unless I'm missing something. Hi, thank you for your new patch v7. For your information, I've encountered a failure to apply patch v7 on top of the latest commit (d3f4532) $ git am v7-0001-Add-ALTER-SUBSCRIPTION-.-SKIP-to-skip-the-transac.patch Applying: Add ALTER SUBSCRIPTION ... SKIP to skip the transaction on subscriber nodes error: patch failed: src/backend/parser/gram.y:9954 error: src/backend/parser/gram.y: patch does not apply Could you please rebase it when it's necessary ? Best Regards, Takamichi Osumi
Re: a misbehavior of partition row movement (?)
On Tue, Jan 18, 2022 at 12:16:23PM +0800, Julien Rouhaud wrote: > I'm not sure why this test failed as it doesn't seem like something impacted > by > the patch, but I may have missed something as I only had a quick look at the > patch and discussion. This issue is discussed here: https://www.postgresql.org/message-id/20220117203746.oj43254j5jurb...@alap3.anarazel.de -- Michael signature.asc Description: PGP signature
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
On Mon, Jan 17, 2022 at 8:13 PM Robert Haas wrote: > On Mon, Jan 17, 2022 at 5:41 PM Peter Geoghegan wrote: > > That just seems like semantics to me. The very next sentence after the > > one you quoted in your reply was "And so it's highly unlikely that any > > given VACUUM will ever *completely* fail to advance relfrozenxid". > > It's continuous *within* each VACUUM. As far as I can tell there is > > pretty much no way that the patch series will ever fail to advance > > relfrozenxid *by at least a little bit*, barring pathological cases > > with cursors and whatnot. > > I mean this boils down to saying that VACUUM will advance relfrozenxid > except when it doesn't. It actually doesn't boil down, at all. The world is complicated and messy, whether we like it or not. > > I never said that anti-wraparound vacuums just won't happen anymore. I > > said that they'll be limited to cases like the stock table or > > customers table case. I was very clear on that point. > > I don't know how I'm supposed to sensibly respond to a statement like > this. If you were very clear, then I'm being deliberately obtuse if I > fail to understand. I don't know if I'd accuse you of being obtuse, exactly. Mostly I just think it's strange that you don't seem to take what I say seriously when it cannot be proven very easily. I don't think that you intend this to be disrespectful, and I don't take it personally. I just don't understand it. > > It isn't that hard to see that the cases where we continue to get any > > anti-wraparound VACUUMs with the patch seem to be limited to cases > > like the stock/customers table, or cases like the pathological idle > > cursor cases we've been discussing. Pretty narrow cases, overall. > > Don't take my word for it - see for yourself. > > I don't think that's really possible. Words like "narrow" and > "pathological" are value judgments, not factual statements. 
If I do an > experiment where no wraparound autovacuums happen, as I'm sure I can, > then those are the normal cases where the patch helps. If I do an > experiment where they do happen, as I'm sure that I also can, you'll > probably say either that the case in question is like the > stock/customers table, or that it's pathological. What will any of > this prove? You seem to be suggesting that I used words like "pathological" in some kind of highly informal, totally subjective way, when I did no such thing. I quite clearly said that you'll only get an anti-wraparound VACUUM with the patch applied when the only factor that *ever* causes *any* autovacuum worker to VACUUM the table (assuming the workload is stable) is the anti-wraparound/autovacuum_freeze_max_age cutoff. With a table like this, even increasing autovacuum_freeze_max_age to its absolute maximum of 2 billion would not make it any more likely that we'd get a non-aggressive VACUUM -- it would merely make the anti-wraparound VACUUMs less frequent. No big change should be expected with a table like that. Also, since the patch is not magic, and doesn't even change the basic invariants for relfrozenxid, it's still true that any scenario in which it's fundamentally impossible for VACUUM to keep up will also have anti-wraparound VACUUMs. But that's the least of the user's trouble -- in the long run we're going to have the system refuse to allocate new XIDs with such a workload. The claim that I have made is 100% testable. Even if it was flat out incorrect, not getting anti-wraparound VACUUMs per se is not the important part. The important part is that the work is managed intelligently, and the burden is spread out over time. I am particularly concerned about the "freezing cliff" we get when many pages are all-visible but not also all-frozen. 
Consistently avoiding an anti-wraparound VACUUM (except with very particular workload characteristics) is really just a side effect -- it's something that makes the overall benefit relatively obvious, and relatively easy to measure. I thought that you'd appreciate that. -- Peter Geoghegan
Re: Push down time-related SQLValue functions to foreign server
> The implementation of converting now() to CURRENT_TIMESTAMP > seems like an underdocumented kluge, too. > I'm very late to the party, but it seems to me that this effort is describing a small subset of what "routine mapping" seems to be for: defining function calls that can be pushed down to the foreign server, and the analogous function on the foreign side. We may want to consider implementing just enough of CREATE ROUTINE MAPPING and DROP ROUTINE MAPPING to support these specific fixed functions.
Re: TAP test to cover "EndOfLogTLI != replayTLI" case
At Tue, 18 Jan 2022 12:25:15 +0800, Julien Rouhaud wrote in
> Hi,
>
> On Mon, Jan 17, 2022 at 09:33:57PM +0800, Julien Rouhaud wrote:
> >
> > Thanks for the updated patch! Note that thanks to Andres and Thomas work,
> > you can now easily rely on the exact same CI than the cfbot on your own
> > github repository, if you need to debug something on a platform you don't
> > have access to. You can check the documentation at [1] for more detail on
> > how to setup the CI.
>
> The cfbot reports that the patch still fails on Windows but also failed on
> macos with the same error: https://cirrus-ci.com/task/5655777858813952:
>
> [14:20:43.950] # Failed test 'check standby content before timeline switch 0/500FAF8'
> [14:20:43.950] #   at t/003_recovery_targets.pl line 239.
> [14:20:43.950] #          got: '6000'
> [14:20:43.950] #     expected: '7000'
> [14:20:43.950] # Looks like you failed 1 test of 10.
>
> I'm switching the CF entry to Waiting on Author.

The most significant advantages of the local CI setup are:

- CI immediately responds to push.
- You can dirty the code with additional logging aid as much as you like to
  see closely what is going on. It makes us hesitant to do the same on this
  ML:p

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
Re: row filtering for logical replication
On Tue, Jan 18, 2022 at 2:31 PM Amit Kapila wrote: > > On Tue, Jan 18, 2022 at 8:41 AM Greg Nancarrow wrote: > > > > On Tue, Jan 18, 2022 at 2:31 AM houzj.f...@fujitsu.com > > wrote: > > > > > > > (2) GetTopMostAncestorInPublication > > > > Is there a reason why there is no "break" after finding a > > > > topmost_relid? Why keep searching and potentially overwrite a > > > > previously-found topmost_relid? If it's intentional, I think that a > > > > comment should be added to explain it. > > > > > > The code was moved from get_rel_sync_entry, and was trying to get the > > > last oid in the ancestor list which is published by the publication. Do > > > you > > > have some suggestions for the comment ? > > > > > > > Maybe the existing comment should be updated to just spell it out like that: > > > > /* > > * Find the "topmost" ancestor that is in this publication, by getting the > > * last Oid in the ancestors list which is published by the publication. > > */ > > > > I am not sure that is helpful w.r.t what Peter is looking for as that > is saying what code is doing and he wants to know why it is so? I > think one can understand this by looking at get_partition_ancestors > which will return the top-most ancestor as the last element. I feel > either we can say see get_partition_ancestors or maybe explain how the > ancestors are stored in this list. > (note: I asked the original question about why there is no "break", not Peter) Maybe instead, an additional comment could be added to the GetTopMostAncestorInPublication function to say "Note that the ancestors list is ordered such that the topmost ancestor is at the end of the list". Unfortunately the get_partition_ancestors function currently doesn't explicitly say that the topmost ancestors are returned at the end of the list (I guess you could conclude it by then looking at get_partition_ancestors_worker code which it calls). 
Also, this leads me to wonder if searching the ancestors list backwards might be better here, and break at the first match? Perhaps there is only a small gain in doing that ... Regards, Greg Nancarrow Fujitsu Australia
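The two strategies being compared in GetTopMostAncestorInPublication can be sketched for illustration. This is a hypothetical Python stand-in for the C code (not pgoutput itself); it assumes, as the discussion notes, that the ancestors list is ordered bottom-up with the topmost ancestor as the last element, as get_partition_ancestors returns it:

```python
# Two equivalent ways to find the "topmost" ancestor that is in the
# publication, given a bottom-up ancestors list (topmost is LAST).


def topmost_forward(ancestors, published):
    """Current coding: scan forward without break, keep overwriting,
    so the last published ancestor in the list wins."""
    topmost = None
    for rel in ancestors:
        if rel in published:
            topmost = rel
    return topmost


def topmost_backward(ancestors, published):
    """Suggested variant: scan from the end and break (return) at the
    first published ancestor found."""
    for rel in reversed(ancestors):
        if rel in published:
            return rel
    return None
```

Both return the same ancestor, so the backward scan is purely an early-exit optimization; with typically shallow partition hierarchies the gain is indeed small, as noted above.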
Re: Null commitTS bug
At Tue, 18 Jan 2022 10:43:55 +0900, Michael Paquier wrote in
> On Sun, Jan 16, 2022 at 11:01:25PM -0500, Tom Lane wrote:
> > Isn't that a very bad way to write "i = j + 1"?
> >
> > I agree with Horiguchi-san that
> > 	for (i = 0, headxid = xid;;)
>
> Okay. Horiguchi-san, would you like to write a patch?

Yes, I will.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
Re: Skipping logical replication transactions on subscriber side
On Tue, Jan 18, 2022 at 12:20 PM osumi.takami...@fujitsu.com wrote: > > On Monday, January 17, 2022 9:52 PM Masahiko Sawada > wrote: > > Thank you for the comments! > .. > > > (2) Minor improvement suggestion of comment in > > > src/backend/replication/logical/worker.c > > > > > > + * reset during that. Also, we don't skip receiving the changes in > > > + streaming > > > + * cases, since we decide whether or not to skip applying the changes > > > + when > > > > > > I sugguest that you don't use 'streaming cases', because what > > > "streaming cases" means sounds a bit broader than actual your > > implementation. > > > We do skip transaction of streaming cases but not during the spooling > > > phase, > > right ? > > > > > > I suggest below. > > > > > > "We don't skip receiving the changes at the phase to spool streaming > > transactions" > > > > I might be missing your point but I think it's correct that we don't skip > > receiving > > the change of the transaction that is sent via streaming protocol. And it > > doesn't > > sound broader to me. Could you elaborate on that? > OK. Excuse me for lack of explanation. > > I felt "streaming cases" implies "non-streaming cases" > to compare a diffference (in my head) when it is > used to explain something at first. > I imagined the contrast between those, when I saw it. > > Thus, I thought "streaming cases" meant > whole flow of streaming transactions which consists of messages > surrounded by stream start and stream stop and which are finished by > stream commit/stream abort (including 2PC variations). 
> > When I come back to the subject, you wrote below in the comment > > "we don't skip receiving the changes in streaming cases, > since we decide whether or not to skip applying the changes > when starting to apply changes" > > The first part of this sentence > ("we don't skip receiving the changes in streaming cases") > gives me an impression where we don't skip changes in the streaming cases > (of my understanding above), but the last part > ("we decide whether or not to skip applying the changes > when starting to apply change") means we skip transactions for streaming at > apply phase. > > So, this sentence looked confusing to me slightly. > Thus, I suggested below (and when I connect it with existing part) > > "we don't skip receiving the changes at the phase to spool streaming > transactions > since we decide whether or not to skip applying the changes when starting to > apply changes" > > For me this looked better, but of course, this is a suggestion. Thank you for your explanation. I've modified the comment with some changes since "the phase to spool streaming transaction" seems not commonly be used in worker.c. > > > > > > > (3) in the comment of apply_handle_prepare_internal, two full-width > > characters. > > > > > > 3-1 > > > +* won’t be resent in a case where the server crashes between > > them. > > > > > > 3-2 > > > +* COMMIT PREPARED or ROLLBACK PREPARED. But that’s okay > > > + because this > > > > > > You have full-width characters for "won't" and "that's". > > > Could you please check ? > > > > Which characters in "won't" are full-width characters? I could not find > > them. > All characters I found and mentioned as full-width are single quotes. > > It might be good that you check the entire patch once > by some tool that helps you to detect it. Thanks! 
> > > > (5) > > > > > > I can miss something here but, in one of the past discussions, there > > > seems a consensus that if the user specifies XID of a subtransaction, > > > it would be better to skip only the subtransaction. > > > > > > This time, is it out of the range of the patch ? > > > If so, I suggest you include some description about it either in the > > > commit message or around codes related to it. > > > > How can the user know subtransaction XID? I suppose you refer to streaming > > protocol cases but while applying spooled changes we don't report > > subtransaction XID neither in server log nor pg_stat_subscription_workers. > Yeah, usually subtransaction XID is not exposed to the users. I agree. > > But, clarifying the target of this feature is only top-level transactions > sounds better to me. Thank you Amit-san for your support > about how we should write it in [1] ! Yes, I've included the sentence proposed by Amit in the latest patch. > > > > (6) > > > > > > I feel it's a better idea to include a test whether to skip aborted > > > streaming transaction clears the XID in the TAP test for this feature, > > > in a sense to cover various new code paths. Did you have any special > > > reason to omit the case ? > > > > Which code path is newly covered by this aborted streaming transaction > > tests? > > I think that this patch is already covered even by the test for a > > committed-and-streamed transaction. It doesn't matter whether the streamed > > transaction is committed or aborted because an error occurs while applying > > the > > spooled changes
Re: Skipping logical replication transactions on subscriber side
On Tue, Jan 18, 2022 at 12:04 PM tanghy.f...@fujitsu.com wrote: > > On Mon, Jan 17, 2022 2:18 PM Masahiko Sawada wrote: > > > > I've attached an updated patch. Please review it. > > > > Thanks for updating the patch. Few comments: > > 1) > /* Two_phase is only supported in v15 and higher */ > if (pset.sversion >= 15) > appendPQExpBuffer(&buf, > - ", subtwophasestate > AS \"%s\"\n", > - gettext_noop("Two > phase commit")); > + ", subtwophasestate > AS \"%s\"\n" > + ", subskipxid AS > \"%s\"\n", > + gettext_noop("Two > phase commit"), > + gettext_noop("Skip > XID")); > > appendPQExpBuffer(&buf, > ", subsynccommit AS > \"%s\"\n" > > I think "skip xid" should be mentioned in the comment. Maybe it could be > changed to: > "Two_phase and skip XID are only supported in v15 and higher" Added. > > 2) The following two places are not consistent in whether "= value" is > surround > with square brackets. > > +ALTER SUBSCRIPTION name SKIP ( > skip_option [= class="parameter">value] [, ... ] ) > > +SKIP ( class="parameter">skip_option = class="parameter">value [, ... ] ) > > Should we modify the first place to: > +ALTER SUBSCRIPTION name SKIP ( > skip_option = class="parameter">value [, ... ] ) > > Because currently there is only one skip_option - xid, and a parameter must be > specified when using it. Good catch. Fixed. > > 3) > +* Protect subskip_xid of pg_subscription from being concurrently > updated > +* while clearing it. > > "subskip_xid" should be "subskipxid" I think. Fixed. > > 4) > +/* > + * Start skipping changes of the transaction if the given XID matches the > + * transaction ID specified by skip_xid option. > + */ > > The option name was "skip_xid" in the previous version, and it is "xid" in > latest patch. So should we modify "skip_xid option" to "skip xid option", or > "skip option xid", or something else? > > Also the following place has similar issue: > + * the subscription if hte user has specified skip_xid. Once we start > skipping Fixed. 
I've attached an updated patch. All comments I got so far were incorporated into this patch unless I'm missing something.

Regards,

-- 
Masahiko Sawada
EDB: https://www.enterprisedb.com/

From 9faf874a7388368f86c500e1fef9616ecf86e5b5 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada
Date: Fri, 10 Dec 2021 14:41:30 +0900
Subject: [PATCH v7] Add ALTER SUBSCRIPTION ... SKIP to skip the transaction on
 subscriber nodes

If incoming change violates any constraint, logical replication stops until it's resolved. This commit introduces another way to skip the transaction in question, other than manually updating the subscriber's database or using pg_replication_origin_advance().

The user can specify XID by ALTER SUBSCRIPTION ... SKIP (xid = XXX), updating pg_subscription.subskipxid field, telling the apply worker to skip the transaction. The apply worker skips all data modification changes within the specified transaction. After skipping the transaction the apply worker clears pg_subscription.subskipxid.
---
 doc/src/sgml/catalogs.sgml                 |  10 +
 doc/src/sgml/logical-replication.sgml      |  49 +++-
 doc/src/sgml/ref/alter_subscription.sgml   |  42
 src/backend/catalog/pg_subscription.c      |   1 +
 src/backend/commands/subscriptioncmds.c    |  53 +
 src/backend/parser/gram.y                  |   9 +
 src/backend/replication/logical/worker.c   | 265 -
 src/bin/pg_dump/pg_dump.c                  |   4 +
 src/bin/psql/describe.c                    |  10 +-
 src/bin/psql/tab-complete.c                |   8 +-
 src/include/catalog/pg_subscription.h      |   4 +
 src/include/nodes/parsenodes.h             |   3 +-
 src/test/regress/expected/subscription.out | 124 ++
 src/test/regress/sql/subscription.sql      |  17 ++
 src/test/subscription/t/028_skip_xact.pl   | 217 +
 15 files changed, 755 insertions(+), 61 deletions(-)
 create mode 100644 src/test/subscription/t/028_skip_xact.pl

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 2aeb2ef346..16f429b853 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -7746,6 +7746,16 @@ SCRAM-SHA-256$:&l
+
+
+      subskipxid xid
+
+
+      ID of the transaction whose changes are to be skipped, if a valid
+      transaction ID; otherwise 0.
+
+
+
       subconninfo text

diff --git a/doc/src/sgml/logical-replication.sgml b/doc/src/sgml/log
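The behavior the commit message describes can be summarized with a small state-machine sketch. This is a hypothetical Python illustration of the described flow (match the incoming XID against pg_subscription.subskipxid, suppress data-modification changes, clear the field afterwards), not the patch's actual C implementation:

```python
# Minimal sketch, assuming the behavior described in the commit message.
# Class and method names are illustrative, not from the patch.


class ApplyWorker:
    def __init__(self, subskipxid=0):
        self.subskipxid = subskipxid   # 0 means "no transaction to skip"
        self.skipping = False
        self.applied = []

    def begin(self, xid):
        # Start skipping when the transaction's XID matches subskipxid.
        self.skipping = (self.subskipxid != 0 and xid == self.subskipxid)

    def apply_change(self, change):
        # Data-modification changes within the skipped transaction are
        # suppressed; everything else is applied normally.
        if not self.skipping:
            self.applied.append(change)

    def commit(self):
        # After the skipped transaction finishes, clear subskipxid so
        # only that one transaction is ever skipped.
        if self.skipping:
            self.subskipxid = 0
            self.skipping = False
```

Under this sketch a transaction whose XID matches is silently dropped exactly once; the next transaction (even with a different error) replicates normally, which is why the field must be cleared after use.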
Re: libpq compression (part 2)
On Tue, Jan 18, 2022 at 02:06:32AM +0500, Daniil Zakhlystov wrote:
> > => Since March, errmsg doesn't need extra parenthesis around it (e3a87b4).
> I’ve resolved the stuck tests and added zlib support for CI Windows builds to
> patch 0003-*. Thanks for the suggestion, all tests seem to be OK now, except
> the macOS one. It won't schedule in Cirrus CI for some reason, but I guess it
> happens because of my GitHub account limitation.

I don't know about your github account, but it works for cfbot, which is now green. Thanks for implementing zlib for windows.

Did you try this with default compressions set to lz4 and zstd ?

The thread from 2019 is very long, and starts off with the guidance that compression had been implemented at the wrong layer. It looks like this hasn't changed since then. secure_read/write are passed as function pointers to the ZPQ interface, which then calls back to them to read and flush its compression buffers. As I understand, the suggestion was to leave the socket reads and writes alone. And then conditionally de/compress buffers after reading / before writing from the socket if compression was negotiated.

It's currently done like this:

  pq_recvbuf() => secure_read()           -- when compression is disabled
  pq_recvbuf() => ZPQ => secure_read()    -- when compression is enabled

Dmitri sent a partial, POC patch which changes the de/compression to happen in secure_read/write, which is changed to call ZPQ:
https://www.postgresql.org/message-id/ca+q6zcuprssnars+fyobsd-f2stk1roa-4sahfofajowlzi...@mail.gmail.com

  pq_recvbuf() => secure_read() => ZPQ

The same thing is true of the frontend: function pointers to pqsecure_read/write are being passed to zpq_create, and then the ZPQ interface called instead of the original functions. Those are the functions which read using SSL, so they should also handle compression. That's where SSL is handled, and it seems like the right place to handle compression. Have you evaluated that way to do things ?
Konstantin said he put ZPQ at that layer to 1) avoid code duplication between client/server; and 2) allow compression to happen before SSL, to allow both (if the admin decides it's okay). But I don't see why compression can't happen before sending to SSL, or after reading from it?

-- 
Justin
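As an illustration of the layering being suggested, where de/compression is folded into the secure read path so that callers like pq_recvbuf() need no changes, here is a hypothetical Python sketch with zlib standing in for whatever codec is negotiated (function names and signatures are invented for the sketch, not from libpq):

```python
# Sketch of "decompress just above the socket/SSL read": the caller keeps
# calling a single read function; compression is an internal detail.
import zlib


def secure_read(sock_buf, bufsize):
    """Stand-in for the raw (possibly SSL-wrapped) socket read: consume
    up to bufsize bytes from a bytearray acting as the wire."""
    data, sock_buf[:] = sock_buf[:bufsize], sock_buf[bufsize:]
    return bytes(data)


def make_decompressing_read(read_fn, compression_enabled):
    """Wrap the secure read so de/compression happens right after it,
    only when compression was negotiated; otherwise pass through."""
    if not compression_enabled:
        return read_fn
    d = zlib.decompressobj()

    def read(sock_buf, bufsize):
        return d.decompress(read_fn(sock_buf, bufsize))

    return read
```

The point of the sketch is the shape of the call chain: the caller sees `read()`, which internally does `secure_read()` then decompression, matching the `pq_recvbuf() => secure_read() => ZPQ` ordering described above rather than routing reads through the compression layer first.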
Re: TAP test to cover "EndOfLogTLI != replayTLI" case
Hi,

On Mon, Jan 17, 2022 at 09:33:57PM +0800, Julien Rouhaud wrote:
>
> Thanks for the updated patch! Note that thanks to Andres and Thomas work, you
> can now easily rely on the exact same CI than the cfbot on your own github
> repository, if you need to debug something on a platform you don't have access
> to. You can check the documentation at [1] for more detail on how to setup the
> CI.

The cfbot reports that the patch still fails on Windows but also failed on macos with the same error: https://cirrus-ci.com/task/5655777858813952:

[14:20:43.950] # Failed test 'check standby content before timeline switch 0/500FAF8'
[14:20:43.950] #   at t/003_recovery_targets.pl line 239.
[14:20:43.950] #          got: '6000'
[14:20:43.950] #     expected: '7000'
[14:20:43.950] # Looks like you failed 1 test of 10.

I'm switching the CF entry to Waiting on Author.
Re: a misbehavior of partition row movement (?)
Hi,

On Mon, Jan 17, 2022 at 08:40:54PM +0900, Amit Langote wrote:
>
> Okay, I created versions of the patch series for branches 13 and 14
> (.txt files). The one for HEAD is also re-attached.

FYI the patch failed today on FreeBSD, while it was previously quite stable on all platforms (https://cirrus-ci.com/build/4551468081479680), with this error:
https://api.cirrus-ci.com/v1/artifact/task/6360787076775936/regress_diffs/src/test/recovery/tmp_check/regression.diffs

diff -U3 /tmp/cirrus-ci-build/src/test/recovery/../regress/expected/reloptions.out /tmp/cirrus-ci-build/src/test/recovery/tmp_check/results/reloptions.out
--- /tmp/cirrus-ci-build/src/test/recovery/../regress/expected/reloptions.out 2022-01-18 00:12:52.533086000 +
+++ /tmp/cirrus-ci-build/src/test/recovery/tmp_check/results/reloptions.out 2022-01-18 00:28:00.000524000 +
@@ -133,7 +133,7 @@
 SELECT pg_relation_size('reloptions_test') = 0;
  ?column?
 ----------
- t
+ f
 (1 row)

I'm not sure why this test failed as it doesn't seem like something impacted by the patch, but I may have missed something as I only had a quick look at the patch and discussion.
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
On Mon, Jan 17, 2022 at 5:41 PM Peter Geoghegan wrote: > That just seems like semantics to me. The very next sentence after the > one you quoted in your reply was "And so it's highly unlikely that any > given VACUUM will ever *completely* fail to advance relfrozenxid". > It's continuous *within* each VACUUM. As far as I can tell there is > pretty much no way that the patch series will ever fail to advance > relfrozenxid *by at least a little bit*, barring pathological cases > with cursors and whatnot. I mean this boils down to saying that VACUUM will advance relfrozenxid except when it doesn't. > Actually, I think that even the people in the first category might > well have about the same improved experience. Not just because of this > patch series, mind you. It would also have a lot to do with the > autovacuum_vacuum_insert_scale_factor stuff in Postgres 13. Not to > mention the freeze map. What version are these users on? I think it varies. I expect the increase in the default cost limit to have had a much more salutary effect than autovacuum_vacuum_insert_scale_factor, but I don't know for sure. At any rate, if you make the database big enough and generate dirty data fast enough, it doesn't matter what the default limits are. > I never said that anti-wraparound vacuums just won't happen anymore. I > said that they'll be limited to cases like the stock table or > customers table case. I was very clear on that point. I don't know how I'm supposed to sensibly respond to a statement like this. If you were very clear, then I'm being deliberately obtuse if I fail to understand. If I say you weren't very clear, then we're just contradicting each other. > It isn't that hard to see that the cases where we continue to get any > anti-wraparound VACUUMs with the patch seem to be limited to cases > like the stock/customers table, or cases like the pathological idle > cursor cases we've been discussing. Pretty narrow cases, overall. 
> Don't take my word for it - see for yourself. I don't think that's really possible. Words like "narrow" and "pathological" are value judgments, not factual statements. If I do an experiment where no wraparound autovacuums happen, as I'm sure I can, then those are the normal cases where the patch helps. If I do an experiment where they do happen, as I'm sure that I also can, you'll probably say either that the case in question is like the stock/customers table, or that it's pathological. What will any of this prove? I think we're reaching the point of diminishing returns in this conversation. What I want to know is that users aren't going to be harmed - even in cases where they have behavior that is like the stock/customers table, or that you consider pathological, or whatever other words we want to use to describe the weird things that happen to people. And I think we've made perhaps a bit of modest progress in exploring that issue, but certainly less than I'd like. I don't want to spend the next several days going around in circles about it though. That does not seem likely to make anyone happy. -- Robert Haas EDB: http://www.enterprisedb.com
Re: ICU for global collation
Hi, On Thu, Jan 13, 2022 at 09:39:42AM +0100, Peter Eisentraut wrote: > On 11.01.22 12:08, Julien Rouhaud wrote: > > > So, unless there are concerns, I'm going to see about making a patch to > > > call > > > pg_newlocale_from_collation() even with the default collation. That would > > > make the actual feature patch quite a bit smaller, since we won't have to > > > patch every call site of pg_newlocale_from_collation(). > > +1 for me! > > Here is that patch. The patch is quite straightforward so I don't have much to say, it all looks good and passes all regression tests. > If this is applied, then in my estimation all these hunks will completely > disappear from the global ICU patch. So this would be a significant win. Agreed, the patch will be quite smaller and also easier to review. Are you planning to apply it on its own or add it to the default ICU patchset? Given the possible backpatch conflicts it can bring I'm not sure it's worthy enough to apply on its own.
Re: Skipping logical replication transactions on subscriber side
On Tue, Jan 18, 2022 at 12:37 PM Amit Kapila wrote:
>
> On Tue, Jan 18, 2022 at 8:34 AM tanghy.f...@fujitsu.com wrote:
> >
> > On Mon, Jan 17, 2022 2:18 PM Masahiko Sawada wrote:
> > >
> > 2) The following two places are not consistent in whether "= value" is
> > surrounded with square brackets.
> >
> > +ALTER SUBSCRIPTION name SKIP ( skip_option [= value] [, ... ] )
> >
> > +SKIP ( skip_option = value [, ... ] )
> >
> > Should we modify the first place to:
> > +ALTER SUBSCRIPTION name SKIP ( skip_option = value [, ... ] )
> >
> > Because currently there is only one skip_option - xid, and a parameter must
> > be specified when using it.
>
> Good observation. Do we really need [, ... ] as currently, we support
> only one value for XID?

I think no. In the doc, it should be:

ALTER SUBSCRIPTION name SKIP ( skip_option = value )

Regards,

-- 
Masahiko Sawada
EDB: https://www.enterprisedb.com/
Re: ICU for global collation
Hi, On Mon, Jan 17, 2022 at 07:07:38PM +, Finnerty, Jim wrote: > On 10.01.22 12:49, Daniel Verite wrote: > > > I think some users would want their db-wide ICU collation to be > > case/accent-insensitive. > ... > > IIRC, that was the context for some questions where people were > > enquiring about db-wide ICU collations. > > +1. There is the DEFAULT_COLLATION_OID, which is the cluster-level default > collation, a.k.a. the "global collation", as distinct from the "db-wide" > database-level default collation, which controls the default type of the > collatable types within that database. There's no cluster-level default collation, and DEFAULT_COLLATION_OID is always database-level (and template1 having a specific default collation). The template0 database is there to be able to support different databases with different default collations, among other things. So this patchset would allow per-database default ICU collation, although not non-deterministic ones.
Re: Skipping logical replication transactions on subscriber side
On Tue, Jan 18, 2022 at 8:34 AM tanghy.f...@fujitsu.com wrote:
>
> On Mon, Jan 17, 2022 2:18 PM Masahiko Sawada wrote:
> >
> 2) The following two places are not consistent in whether "= value" is
> surrounded with square brackets.
>
> +ALTER SUBSCRIPTION name SKIP ( skip_option [= value] [, ... ] )
>
> +SKIP ( skip_option = value [, ... ] )
>
> Should we modify the first place to:
> +ALTER SUBSCRIPTION name SKIP ( skip_option = value [, ... ] )
>
> Because currently there is only one skip_option - xid, and a parameter must be
> specified when using it.

Good observation. Do we really need [, ... ] as currently, we support only one value for XID?

-- 
With Regards,
Amit Kapila.
Re: row filtering for logical replication
On Tue, Jan 18, 2022 at 8:41 AM Greg Nancarrow wrote: > > On Tue, Jan 18, 2022 at 2:31 AM houzj.f...@fujitsu.com > wrote: > > > > > (2) GetTopMostAncestorInPublication > > > Is there a reason why there is no "break" after finding a > > > topmost_relid? Why keep searching and potentially overwrite a > > > previously-found topmost_relid? If it's intentional, I think that a > > > comment should be added to explain it. > > > > The code was moved from get_rel_sync_entry, and was trying to get the > > last oid in the ancestor list which is published by the publication. Do you > > have some suggestions for the comment ? > > > > Maybe the existing comment should be updated to just spell it out like that: > > /* > * Find the "topmost" ancestor that is in this publication, by getting the > * last Oid in the ancestors list which is published by the publication. > */ > I am not sure that is helpful w.r.t what Peter is looking for as that is saying what code is doing and he wants to know why it is so? I think one can understand this by looking at get_partition_ancestors which will return the top-most ancestor as the last element. I feel either we can say see get_partition_ancestors or maybe explain how the ancestors are stored in this list. -- With Regards, Amit Kapila.
RE: Skipping logical replication transactions on subscriber side
On Monday, January 17, 2022 9:52 PM Masahiko Sawada wrote:
> Thank you for the comments!
..
> > (2) Minor improvement suggestion of comment in
> > src/backend/replication/logical/worker.c
> >
> > + * reset during that. Also, we don't skip receiving the changes in streaming
> > + * cases, since we decide whether or not to skip applying the changes when
> >
> > I suggest that you don't use 'streaming cases', because what
> > "streaming cases" means sounds a bit broader than your actual implementation.
> > We do skip transaction of streaming cases but not during the spooling phase,
> > right ?
> >
> > I suggest below.
> >
> > "We don't skip receiving the changes at the phase to spool streaming
> > transactions"
>
> I might be missing your point but I think it's correct that we don't skip
> receiving the change of the transaction that is sent via streaming protocol.
> And it doesn't sound broader to me. Could you elaborate on that?

OK. Excuse me for lack of explanation.

I felt "streaming cases" implies "non-streaming cases" to compare a difference (in my head) when it is used to explain something at first. I imagined the contrast between those, when I saw it.

Thus, I thought "streaming cases" meant the whole flow of streaming transactions which consists of messages surrounded by stream start and stream stop and which are finished by stream commit/stream abort (including 2PC variations).

When I come back to the subject, you wrote below in the comment:

"we don't skip receiving the changes in streaming cases, since we decide whether or not to skip applying the changes when starting to apply changes"

The first part of this sentence ("we don't skip receiving the changes in streaming cases") gives me an impression where we don't skip changes in the streaming cases (of my understanding above), but the last part ("we decide whether or not to skip applying the changes when starting to apply change") means we skip transactions for streaming at apply phase.
So, this sentence looked confusing to me slightly. Thus, I suggested below (and when I connect it with existing part) "we don't skip receiving the changes at the phase to spool streaming transactions since we decide whether or not to skip applying the changes when starting to apply changes" For me this looked better, but of course, this is a suggestion. > > > > (3) in the comment of apply_handle_prepare_internal, two full-width > characters. > > > > 3-1 > > +* won’t be resent in a case where the server crashes between > them. > > > > 3-2 > > +* COMMIT PREPARED or ROLLBACK PREPARED. But that’s okay > > + because this > > > > You have full-width characters for "won't" and "that's". > > Could you please check ? > > Which characters in "won't" are full-width characters? I could not find them. All characters I found and mentioned as full-width are single quotes. It might be good that you check the entire patch once by some tool that helps you to detect it. > > (5) > > > > I can miss something here but, in one of the past discussions, there > > seems a consensus that if the user specifies XID of a subtransaction, > > it would be better to skip only the subtransaction. > > > > This time, is it out of the range of the patch ? > > If so, I suggest you include some description about it either in the > > commit message or around codes related to it. > > How can the user know subtransaction XID? I suppose you refer to streaming > protocol cases but while applying spooled changes we don't report > subtransaction XID neither in server log nor pg_stat_subscription_workers. Yeah, usually subtransaction XID is not exposed to the users. I agree. But, clarifying the target of this feature is only top-level transactions sounds better to me. Thank you Amit-san for your support about how we should write it in [1] ! 
> > (6) > > > > I feel it's a better idea to include a test whether to skip aborted > > streaming transaction clears the XID in the TAP test for this feature, > > in a sense to cover various new code paths. Did you have any special > > reason to omit the case ? > > Which code path is newly covered by this aborted streaming transaction tests? > I think that this patch is already covered even by the test for a > committed-and-streamed transaction. It doesn't matter whether the streamed > transaction is committed or aborted because an error occurs while applying the > spooled changes. Oh, this was my mistake. What I expressed as a new patch is apply_handle_stream_abort -> clear_subscription_skip_xid. But, this was totally wrong as you explained. > > > > (7) > > > > I want more explanation for the reason to restart the subscriber in > > the TAP test because this is not mandatory operation. > > (We can pass the TAP tests without this restart) > > > > From : > > # Restart the subscriber node to restart logical replication with no > > interval > > > > IIUC, below would be better. > > > > To : > > # As an optimization to finish test
Re: row filtering for logical replication
On Tue, Jan 18, 2022 at 2:31 AM houzj.f...@fujitsu.com wrote:
>
> > (2) GetTopMostAncestorInPublication
> > Is there a reason why there is no "break" after finding a
> > topmost_relid? Why keep searching and potentially overwrite a
> > previously-found topmost_relid? If it's intentional, I think that a
> > comment should be added to explain it.
>
> The code was moved from get_rel_sync_entry, and was trying to get the
> last oid in the ancestor list which is published by the publication. Do you
> have some suggestions for the comment ?

Maybe the existing comment should be updated to just spell it out like that:

/*
 * Find the "topmost" ancestor that is in this publication, by getting the
 * last Oid in the ancestors list which is published by the publication.
 */

Regards,
Greg Nancarrow
Fujitsu Australia
RE: Skipping logical replication transactions on subscriber side
On Mon, Jan 17, 2022 2:18 PM Masahiko Sawada wrote:
>
> I've attached an updated patch. Please review it.

Thanks for updating the patch. Few comments:

1)
	/* Two_phase is only supported in v15 and higher */
	if (pset.sversion >= 15)
		appendPQExpBuffer(&buf,
-						  ", subtwophasestate AS \"%s\"\n",
-						  gettext_noop("Two phase commit"));
+						  ", subtwophasestate AS \"%s\"\n"
+						  ", subskipxid AS \"%s\"\n",
+						  gettext_noop("Two phase commit"),
+						  gettext_noop("Skip XID"));

	appendPQExpBuffer(&buf,
					  ", subsynccommit AS \"%s\"\n"

I think "skip xid" should be mentioned in the comment. Maybe it could be changed to:
"Two_phase and skip XID are only supported in v15 and higher"

2) The following two places are not consistent in whether "= value" is surrounded with square brackets.

+ALTER SUBSCRIPTION name SKIP ( skip_option [= value] [, ... ] )

+SKIP ( skip_option = value [, ... ] )

Should we modify the first place to:
+ALTER SUBSCRIPTION name SKIP ( skip_option = value [, ... ] )

Because currently there is only one skip_option - xid, and a parameter must be specified when using it.

3)
+	 * Protect subskip_xid of pg_subscription from being concurrently updated
+	 * while clearing it.

"subskip_xid" should be "subskipxid" I think.

4)
+/*
+ * Start skipping changes of the transaction if the given XID matches the
+ * transaction ID specified by skip_xid option.
+ */

The option name was "skip_xid" in the previous version, and it is "xid" in latest patch. So should we modify "skip_xid option" to "skip xid option", or "skip option xid", or something else?

Also the following place has similar issue:
+ * the subscription if hte user has specified skip_xid. Once we start skipping

Regards,
Tang
Re: a misbehavior of partition row movement (?)
On Tue, Jan 18, 2022 at 7:15 AM Alvaro Herrera wrote: > On 2022-Jan-17, Tom Lane wrote: > > But could we please do it in a way that is designed to keep the > > code readable, rather than to minimize the number of lines of diff? > > It makes zero sense to have the bits in AFTER_TRIGGER_TUP_BITS not > > be adjacent. So what should happen here is to renumber the symbols > > in between to move their bits over one place. > > Is it typical to enumerate bits starting from the right of each byte, > when doing it from the high bits of the word? DONE and IN_PROGRESS have > been defined as 0x1 and 0x2 of that byte for a very long time and I > found that very strange. I am inclined to count from the left, so I'd > pick 8 first, defining the set like this:
>
> #define AFTER_TRIGGER_OFFSET          0x07FFFFFF  /* must be low-order bits */
> #define AFTER_TRIGGER_DONE            0x80000000
> #define AFTER_TRIGGER_IN_PROGRESS     0x40000000
> /* bits describing the size and tuple sources of this event */
> #define AFTER_TRIGGER_FDW_REUSE       0x00000000
> #define AFTER_TRIGGER_FDW_FETCH       0x20000000
> #define AFTER_TRIGGER_1CTID           0x10000000
> #define AFTER_TRIGGER_2CTID           0x30000000
> #define AFTER_TRIGGER_CP_UPDATE       0x08000000
> #define AFTER_TRIGGER_TUP_BITS        0x38000000
Agree that the ultimate code readability trumps diff minimalism. :-) Would you like me to update the patch with the above renumbering or are you working on a new version yourself? -- Amit Langote EDB: http://www.enterprisedb.com
Re: Skipping logical replication transactions on subscriber side
On Tue, Jan 18, 2022 at 8:02 AM Masahiko Sawada wrote: > > On Mon, Jan 17, 2022 at 10:15 PM Amit Kapila wrote: > > > > On Mon, Jan 17, 2022 at 6:22 PM Masahiko Sawada > > wrote: > > > > > > > > > > > (5) > > > > > > > > I can miss something here but, in one of > > > > the past discussions, there seems a consensus that > > > > if the user specifies XID of a subtransaction, > > > > it would be better to skip only the subtransaction. > > > > > > > > This time, is it out of the range of the patch ? > > > > If so, I suggest you include some description about it > > > > either in the commit message or around codes related to it. > > > > > > How can the user know subtransaction XID? I suppose you refer to > > > streaming protocol cases but while applying spooled changes we don't > > > report subtransaction XID neither in server log nor > > > pg_stat_subscription_workers. > > > > > > > I also think in the current system users won't be aware of > > subtransaction's XID but I feel Osumi-San's point is valid that we > > should at least add it in docs that we allow to skip only top-level > > xacts. Also, in the future, it won't be impossible to imagine that we > > can have subtransaction's XID info also available to users as we have > > that in the case of streaming xacts (See subxact_data). > > Fair point and more accurate, but I'm a bit concerned that using these > words could confuse the user. There are some places in the doc where > we use the words “top-level transaction” and "sub transactions” but > these are not commonly used in the doc. The user normally would not be > aware that sub transactions are used to implement SAVEPOINTs. Also, > the publisher's subtransaction ID doesn’t appear anywhere on the > subscriber. So if we want to mention it, I think we should use other > words instead of them but I don’t have a good idea for that. Do you > have any ideas? 
> How about changing existing text: + Specifies the ID of the transaction whose changes are to be skipped + by the logical replication worker. Setting NONE + resets the transaction ID. to Specifies the top-level transaction identifier whose changes are to be skipped by the logical replication worker. We don't support skipping individual subtransactions. Setting NONE resets the transaction ID. -- With Regards, Amit Kapila.
Re: Skipping logical replication transactions on subscriber side
On Tue, Jan 18, 2022 at 10:36 AM Greg Nancarrow wrote: > > On Mon, Jan 17, 2022 at 5:18 PM Masahiko Sawada wrote: > > > > I've attached an updated patch. Please review it. > > > > Some review comments for the v6 patch: Thank you for the comments! > > > doc/src/sgml/logical-replication.sgml > > (1) Expanded output > > Since the view output is shown in "expanded output" mode, perhaps the > doc should say that, or alternatively add the following lines prior to > it, to make it clear: > > postgres=# \x > Expanded display is on. I'm not sure it's really necessary. A similar example would be perform.sgml but it doesn't say "\x". > > > (2) Message output in server log > > The actual CONTEXT text now just says "at ..." instead of "with commit > timestamp ...", so the doc needs to be updated as follows: > > BEFORE: > +CONTEXT: processing remote data during "INSERT" for replication > target relation "public.test" in transaction 716 with commit timestamp > 2021-09-29 15:52:45.165754+00 > AFTER: > +CONTEXT: processing remote data during "INSERT" for replication > target relation "public.test" in transaction 716 at 2021-09-29 > 15:52:45.165754+00 Will fix. > > (3) > The wording "the change" doesn't seem right here, so I suggest the > following update: > > BEFORE: > + Skipping the whole transaction includes skipping the change that > may not violate > AFTER: > + Skipping the whole transaction includes skipping changes that may > not violate > > > doc/src/sgml/ref/alter_subscription.sgml Will fix. > > (4) > I have a number of suggested wording improvements: > > BEFORE: > + Skips applying changes of the particular transaction. If incoming data > + violates any constraints the logical replication will stop until it is > + resolved. The resolution can be done either by changing data on the > + subscriber so that it doesn't conflict with incoming change or > by skipping > + the whole transaction. 
The logical replication worker skips all data > + modification changes within the specified transaction including > the changes > + that may not violate the constraint, so, it should only be used as a > last > + resort. This option has no effect on the transaction that is already > + prepared by enabling two_phase on subscriber. > > AFTER: > + Skips applying all changes of the specified transaction. If > incoming data > + violates any constraints, logical replication will stop until it is > + resolved. The resolution can be done either by changing data on the > + subscriber so that it doesn't conflict with incoming change or > by skipping > + the whole transaction. Using the SKIP option, the logical > replication worker skips all data > + modification changes within the specified transaction, including > changes > + that may not violate the constraint, so, it should only be used as a > last > + resort. This option has no effect on transactions that are already > + prepared by enabling two_phase on the subscriber. > Will fix. > > (5) > change -> changes > > BEFORE: > + subscriber so that it doesn't conflict with incoming change or > by skipping > AFTER: > + subscriber so that it doesn't conflict with incoming changes or > by skipping Will fix. > > > src/backend/replication/logical/worker.c > > (6) Missing word? > The following should say "worth doing" or "worth it"? > > + * doesn't seem to be worth since it's a very minor case. > WIll fix > > src/test/regress/sql/subscription.sql > > (7) Misleading test case > I think the following test case is misleading and should be removed, > because the "1.1" xid value is only regarded as invalid because "1" is > an invalid xid (and there's already a test case for a "1" xid) - the > fractional part gets thrown away, and doesn't affect the validity > here. > >+ALTER SUBSCRIPTION regress_testsub SKIP (xid = 1.1); > Good point. Will remove. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
Re: row filtering for logical replication
On Mon, Jan 17, 2022 at 9:00 PM houzj.f...@fujitsu.com wrote: > > On Mon, Jan 17, 2022 12:34 PM Peter Smith wrote: > > > > Here are some review comments for v65-0001 (review of updates since > > v64-0001) > > Thanks for the comments! > > > ~~~ > > > > 1. src/include/commands/publicationcmds.h - rename func > > > > +extern bool contain_invalid_rfcolumn(Oid pubid, Relation relation, > > +List *ancestors, AttrNumber *invalid_rfcolumn); > > > > I thought that function should be called "contains_..." instead of > > "contain_...". > > > > ~~~ > > > > 2. src/backend/commands/publicationcmds.c - rename funcs > > > > Suggested renaming (same as above #1). > > > > "contain_invalid_rfcolumn_walker" --> "contains_invalid_rfcolumn_walker" > > "contain_invalid_rfcolumn" --> "contains_invalid_rfcolumn" > > > > Also, update it in the comment for rf_context: > > +/* > > + * Information used to validate the columns in the row filter > > +expression. See > > + * contain_invalid_rfcolumn_walker for details. > > + */ > > I am not sure about the name because many existing > functions are named contain_xxx_xxx. > (for example contain_mutable_functions) > I also see many similar functions whose name start with contain_* like contain_var_clause, contain_agg_clause, contain_window_function, etc. So, it is probably okay to retain the name as it is in the patch. -- With Regards, Amit Kapila.
Re: Skipping logical replication transactions on subscriber side
On Mon, Jan 17, 2022 at 10:15 PM Amit Kapila wrote: > > On Mon, Jan 17, 2022 at 6:22 PM Masahiko Sawada wrote: > > > > > > > > (5) > > > > > > I can miss something here but, in one of > > > the past discussions, there seems a consensus that > > > if the user specifies XID of a subtransaction, > > > it would be better to skip only the subtransaction. > > > > > > This time, is it out of the range of the patch ? > > > If so, I suggest you include some description about it > > > either in the commit message or around codes related to it. > > > > How can the user know subtransaction XID? I suppose you refer to > > streaming protocol cases but while applying spooled changes we don't > > report subtransaction XID neither in server log nor > > pg_stat_subscription_workers. > > > > I also think in the current system users won't be aware of > subtransaction's XID but I feel Osumi-San's point is valid that we > should at least add it in docs that we allow to skip only top-level > xacts. Also, in the future, it won't be impossible to imagine that we > can have subtransaction's XID info also available to users as we have > that in the case of streaming xacts (See subxact_data). Fair point and more accurate, but I'm a bit concerned that using these words could confuse the user. There are some places in the doc where we use the words “top-level transaction” and "sub transactions” but these are not commonly used in the doc. The user normally would not be aware that sub transactions are used to implement SAVEPOINTs. Also, the publisher's subtransaction ID doesn’t appear anywhere on the subscriber. So if we want to mention it, I think we should use other words instead of them but I don’t have a good idea for that. Do you have any ideas? > > Few minor points: > === > 1. > + * the subscription if hte user has specified skip_xid. > > Typo. /hte/the Will fix. > > 2. > + * PREPARED won’t be resent but subskipxid is left. > > In diffmerge tool, won't is showing some funny chars. 
When I manually > removed 't and added it again, everything is fine. I am not sure why > it is so? I think Osumi-San has also raised this complaint. Oh I didn't realize that. I'll check it again by using diffmerge tool. > > 3. > + /* > + * We don't expect that the user set the XID of the transaction that is > + * rolled back but if the skip XID is set, clear it. > + */ > > /user set/user to set/ Will fix. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
Re: Rewriting the test of pg_upgrade as a TAP test - take three - remastered set
On Sat, Jan 15, 2022 at 01:52:39PM +0800, Julien Rouhaud wrote: > libpq environment variable PGHOST has a non-local server value: > C:/Users/ContainerAdministrator/AppData/Local/Temp/FhBIlsw6SV > Failure, exiting > not ok 3 - run of pg_upgrade for new instance There are two things here, as far as I understand: 1) This is a valid Windows path. So shouldn't we fix pg_upgrade's server.c to be a bit more compliant with Windows paths? The code accepts only paths beginning with '/' as local paths, so this breaks. 2) It looks safer in the long run to disable completely PGHOST and PGHOSTADDR when running the pg_upgrade command in the test, and we'd better not use Cluster::command_ok() or we would fall down to each node's local environment. This could be done in the tests as of the attached, I guess, and this would bypass the problem coming from 1). The patch needed a refresh as --make-testtablespace-dir has been removed as of d6d317d. -- Michael From 5b4c3f279781418aa4bc24689342b41926d681e4 Mon Sep 17 00:00:00 2001 From: Michael Paquier Date: Tue, 18 Jan 2022 11:13:09 +0900 Subject: [PATCH v7] Switch tests of pg_upgrade to use TAP --- src/bin/pg_upgrade/.gitignore| 5 + src/bin/pg_upgrade/Makefile | 23 +- src/bin/pg_upgrade/TESTING | 33 ++- src/bin/pg_upgrade/t/001_basic.pl| 9 + src/bin/pg_upgrade/t/002_pg_upgrade.pl | 286 +++ src/bin/pg_upgrade/test.sh | 279 -- src/test/perl/PostgreSQL/Test/Cluster.pm | 25 ++ src/tools/msvc/vcregress.pl | 94 +--- 8 files changed, 366 insertions(+), 388 deletions(-) create mode 100644 src/bin/pg_upgrade/t/001_basic.pl create mode 100644 src/bin/pg_upgrade/t/002_pg_upgrade.pl delete mode 100644 src/bin/pg_upgrade/test.sh diff --git a/src/bin/pg_upgrade/.gitignore b/src/bin/pg_upgrade/.gitignore index 2d3bfeaa50..3b64522ab6 100644 --- a/src/bin/pg_upgrade/.gitignore +++ b/src/bin/pg_upgrade/.gitignore @@ -7,3 +7,8 @@ /loadable_libraries.txt /log/ /tmp_check/ + +# Generated by pg_regress +/sql/ +/expected/ +/results/ diff --git 
a/src/bin/pg_upgrade/Makefile b/src/bin/pg_upgrade/Makefile index 44d06be5a6..35b6c123a5 100644 --- a/src/bin/pg_upgrade/Makefile +++ b/src/bin/pg_upgrade/Makefile @@ -28,6 +28,12 @@ OBJS = \ override CPPFLAGS := -DDLSUFFIX=\"$(DLSUFFIX)\" -I$(srcdir) -I$(libpq_srcdir) $(CPPFLAGS) LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils $(libpq_pgport) +# required for 002_pg_upgrade.pl +REGRESS_SHLIB=$(abs_top_builddir)/src/test/regress/regress$(DLSUFFIX) +export REGRESS_SHLIB +REGRESS_OUTPUTDIR=$(abs_top_builddir)/src/bin/pg_upgrade +export REGRESS_OUTPUTDIR + all: pg_upgrade pg_upgrade: $(OBJS) | submake-libpq submake-libpgport submake-libpgfeutils @@ -49,17 +55,8 @@ clean distclean maintainer-clean: pg_upgrade_dump_globals.sql \ pg_upgrade_dump_*.custom pg_upgrade_*.log -# When $(MAKE) is present, make automatically infers that this is a -# recursive make. which is not actually what we want here, as that -# e.g. prevents output synchronization from working (as make thinks -# that the subsidiary make knows how to deal with that itself, but -# we're invoking a shell script that doesn't know). Referencing -# $(MAKE) indirectly avoids that behaviour. 
-# See https://www.gnu.org/software/make/manual/html_node/MAKE-Variable.html#MAKE-Variable -NOTSUBMAKEMAKE=$(MAKE) +check: + $(prove_check) -check: test.sh all temp-install - MAKE=$(NOTSUBMAKEMAKE) $(with_temp_install) bindir=$(abs_top_builddir)/tmp_install/$(bindir) EXTRA_REGRESS_OPTS="$(EXTRA_REGRESS_OPTS)" $(SHELL) $< - -# installcheck is not supported because there's no meaningful way to test -# pg_upgrade against a single already-running server +installcheck: + $(prove_installcheck) diff --git a/src/bin/pg_upgrade/TESTING b/src/bin/pg_upgrade/TESTING index 78b9747908..43a71566e2 100644 --- a/src/bin/pg_upgrade/TESTING +++ b/src/bin/pg_upgrade/TESTING @@ -2,21 +2,30 @@ THE SHORT VERSION - On non-Windows machines, you can execute the testing process -described below by running +described below by running the following command in this directory: make check -in this directory. This will run the shell script test.sh, performing -an upgrade from the version in this source tree to a new instance of -the same version. -To test an upgrade from a different version, you must have a built -source tree for the old version as well as this version, and you -must have done "make install" for both versions. Then do: +This will run the TAP tests to run pg_upgrade, performing an upgrade +from the version in this source tree to a new instance of the same +version. + +To test an upgrade from a different version, there are two options +available: + +1) You have a built source tree for the old version as well as this +version's binaries. Then set up the following variables before +launching the test: export oldsrc=...s
Re: Null commitTS bug
On Sun, Jan 16, 2022 at 11:01:25PM -0500, Tom Lane wrote: > Isn't that a very bad way to write "i = j + 1"? > > I agree with Horiguchi-san that > for (i = 0, headxid = xid;;) Okay. Horiguchi-san, would you like to write a patch? -- Michael
Re: Add last commit LSN to pg_last_committed_xact()
On Mon, Jan 17, 2022 at 4:34 PM Alvaro Herrera wrote: > > On 2022-Jan-14, James Coleman wrote: > > > The logical slot can't flush past the > > last commit, so even if there's 100s of megabytes of unflushed WAL on > > the slot there may be zero lag (in terms of what's possible to > > process). > > > > I've attached a simple patch (sans tests and documentation) to get > > feedback early. After poking around this afternoon it seemed to me > > that the simplest approach was to hook into the commit timestamps > > infrastructure and store the commit's XLogRecPtr in the cache of the > > most recent value (but of course don't write it out to disk). > > Maybe it would work to have a single LSN in shared memory, as an atomic > variable, which uses monotonic advance[1] to be updated. Whether this is > updated or not would depend on a new GUC, maybe track_latest_commit_lsn. > Causing performance pain during transaction commit is not great, but at > least this way it shouldn't be *too* a large hit. > > [1] part of a large patch at > https://www.postgresql.org/message-id/202111222156.xmo2yji5ifi2%40alvherre.pgsql I'd be happy to make it a separate GUC, though it seems adding an additional atomic access is worse (assuming we can convince ourselves putting this into the commit timestamps infrastructure is acceptable) given here we're already under a lock. Thanks, James Coleman
Re: Add last commit LSN to pg_last_committed_xact()
On Mon, Jan 17, 2022 at 4:20 PM Robert Haas wrote: > > On Fri, Jan 14, 2022 at 7:42 PM James Coleman wrote: > > I've attached a simple patch (sans tests and documentation) to get > > feedback early. After poking around this afternoon it seemed to me > > that the simplest approach was to hook into the commit timestamps > > infrastructure and store the commit's XLogRecPtr in the cache of the > > most recent value (but of course don't write it out to disk). That the > > downside of making this feature dependent on "track_commit_timestamps > > = on", but that seems reasonable: > > > > 1. Getting the xid of the last commit is similarly dependent on commit > > timestamps infrastructure. > > 2. It's a simple place to hook into and avoids new shared data and locking. > > > > Thoughts? > > It doesn't seem great to me. It's making commit_ts do something other > than commit timestamps, which looks kind of ugly. I wondered about that, but commit_ts already does more than commit timestamps by recording the xid of the last commit. For that matter, keeping a cache of last commit metadata in shared memory is arguably not obviously implied by "track_commit_timestamps", which leads to the below... > In general, I'm concerned about the cost of doing something like this. > Extra shared memory updates as part of the process of committing a > transaction are not (and can't be made) free. It seems to me that to the degree there's a hot path concern here we ought to separate out the last commit metadata caching from the "track_commit_timestamps" feature (at least in terms of how it's controlled by GUCs). If that were done we could also, in theory, allow controlling which items are tracked to reduce hot path cost if only a subset is needed. For that matter it'd also allow turning on this metadata caching without enabling the commit timestamp storage. 
I'm curious, though: I realize it's in the hot path, and I realize that there's an accretive cost to even small features, but given we're already paying the lock cost and updating memory in what is presumably the same cache line, would you expect this cost to be clearly measurable? Thanks, James Coleman
Re: Skipping logical replication transactions on subscriber side
On Mon, Jan 17, 2022 at 5:18 PM Masahiko Sawada wrote: > > I've attached an updated patch. Please review it. > Some review comments for the v6 patch: doc/src/sgml/logical-replication.sgml (1) Expanded output Since the view output is shown in "expanded output" mode, perhaps the doc should say that, or alternatively add the following lines prior to it, to make it clear: postgres=# \x Expanded display is on. (2) Message output in server log The actual CONTEXT text now just says "at ..." instead of "with commit timestamp ...", so the doc needs to be updated as follows: BEFORE: +CONTEXT: processing remote data during "INSERT" for replication target relation "public.test" in transaction 716 with commit timestamp 2021-09-29 15:52:45.165754+00 AFTER: +CONTEXT: processing remote data during "INSERT" for replication target relation "public.test" in transaction 716 at 2021-09-29 15:52:45.165754+00 (3) The wording "the change" doesn't seem right here, so I suggest the following update: BEFORE: + Skipping the whole transaction includes skipping the change that may not violate AFTER: + Skipping the whole transaction includes skipping changes that may not violate doc/src/sgml/ref/alter_subscription.sgml (4) I have a number of suggested wording improvements: BEFORE: + Skips applying changes of the particular transaction. If incoming data + violates any constraints the logical replication will stop until it is + resolved. The resolution can be done either by changing data on the + subscriber so that it doesn't conflict with incoming change or by skipping + the whole transaction. The logical replication worker skips all data + modification changes within the specified transaction including the changes + that may not violate the constraint, so, it should only be used as a last + resort. This option has no effect on the transaction that is already + prepared by enabling two_phase on subscriber. AFTER: + Skips applying all changes of the specified transaction. 
If incoming data + violates any constraints, logical replication will stop until it is + resolved. The resolution can be done either by changing data on the + subscriber so that it doesn't conflict with incoming change or by skipping + the whole transaction. Using the SKIP option, the logical replication worker skips all data + modification changes within the specified transaction, including changes + that may not violate the constraint, so, it should only be used as a last + resort. This option has no effect on transactions that are already + prepared by enabling two_phase on the subscriber. (5) change -> changes BEFORE: + subscriber so that it doesn't conflict with incoming change or by skipping AFTER: + subscriber so that it doesn't conflict with incoming changes or by skipping src/backend/replication/logical/worker.c (6) Missing word? The following should say "worth doing" or "worth it"? + * doesn't seem to be worth since it's a very minor case. src/test/regress/sql/subscription.sql (7) Misleading test case I think the following test case is misleading and should be removed, because the "1.1" xid value is only regarded as invalid because "1" is an invalid xid (and there's already a test case for a "1" xid) - the fractional part gets thrown away, and doesn't affect the validity here. +ALTER SUBSCRIPTION regress_testsub SKIP (xid = 1.1); Regards, Greg Nancarrow Fujitsu Australia
Re: Refactoring of compression options in pg_basebackup
On Mon, Jan 17, 2022 at 12:48:12PM -0500, Robert Haas wrote: > Alvaro's proposal is fine with me. I don't see any value in replacing > --compress with --compression. It's longer but not superior in any way > that I can see. Having both seems worst of all -- that's just > confusing. Okay, that looks like a consensus, then. Robert, would it be better to gather all that on the thread that deals with the server-side compression? Doing that here would be fine by me, with the option to only specify the client. Now it would be a bit weird to do things with only the client part and not the server part :) -- Michael
Re: generic plans and "initial" pruning
On Fri, Jan 14, 2022 at 11:10 PM Amit Langote wrote: > On Thu, Jan 6, 2022 at 3:45 PM Amul Sul wrote: > > Here are few comments for v1 patch: > > Thanks Amul. I'm thinking about Robert's latest comments addressing > which may need some rethinking of this whole design, but I decided to > post a v2 taking care of your comments. cfbot tells me there is an unused variable warning, which is fixed in the attached v3. -- Amit Langote EDB: http://www.enterprisedb.com v3-0001-Teach-AcquireExecutorLocks-to-acquire-fewer-locks.patch
Re: row filtering for logical replication
Here are some review comments for v66-0001 (review of updates since v65-0001) ~~~ 1. src/backend/catalog/pg_publication.c - GetTopMostAncestorInPublication @@ -276,17 +276,45 @@ GetPubPartitionOptionRelations(List *result, PublicationPartOpt pub_partopt, } /* + * Returns the relid of the topmost ancestor that is published via this + * publication if any, otherwise return InvalidOid. + */ Suggestion: "otherwise return InvalidOid." --> "otherwise returns InvalidOid." ~~~ 2. src/backend/commands/publicationcmds.c - contain_invalid_rfcolumn_walker @@ -235,6 +254,337 @@ CheckObjSchemaNotAlreadyInPublication(List *rels, List *schemaidlist, } /* + * Returns true, if any of the columns used in the row filter WHERE clause are + * not part of REPLICA IDENTITY, false, otherwise. Suggestion: ", false, otherwise" --> ", otherwise returns false." ~~~ 3. src/backend/replication/logical/tablesync.c - fetch_remote_table_info + * We do need to copy the row even if it matches one of the publications, + * so, we later combine all the quals with OR. Suggestion: BEFORE * We do need to copy the row even if it matches one of the publications, * so, we later combine all the quals with OR. AFTER * We need to copy the row even if it matches just one of the publications, * so, we later combine all the quals with OR. ~~~ 4. src/backend/replication/pgoutput/pgoutput.c - pgoutput_row_filter_exec_expr + ret = ExecEvalExprSwitchContext(state, econtext, &isnull); + + elog(DEBUG3, "row filter evaluates to %s (isnull: %s)", + DatumGetBool(ret) ? "true" : "false", + isnull ? "false" : "true"); + + if (isnull) + return false; + + return DatumGetBool(ret); That change to the logging looks incorrect - the "(isnull: %s)" value is backwards now. I guess maybe the intent was to change it something like below: elog(DEBUG3, "row filter evaluates to %s (isnull: %s)", isnull ? "false" : DatumGetBool(ret) ? "true" : "false", isnull ? "true" : "false"); -- Kind Regards, Peter Smith. Fujitsu Australia
Re: docs: pg_replication_origin_oid() description does not match behaviour
On Tue, Jan 18, 2022 at 10:19:41AM +0900, Ian Lawrence Barwick wrote: > Given that the code has remained unchanged since the function was > introduced in 9.5, it seems reasonable to change the documentation > to match the function behaviour rather than the other way round. Obviously. I'll go fix that as you suggest, if there are no objections. -- Michael
docs: pg_replication_origin_oid() description does not match behaviour
Hi From the documentation for pg_replication_origin_oid() [1]: > Looks up a replication origin by name and returns the internal ID. > If no such replication origin is found an error is thrown. However, it actually returns NULL if the origin does not exist: postgres=# SELECT * FROM pg_replication_origin; roident | roname -+ (0 rows) postgres=# SELECT pg_replication_origin_oid('foo'), pg_replication_origin_oid('foo') IS NULL; pg_replication_origin_oid | ?column? ---+-- | t (1 row) Given that the code has remained unchanged since the function was introduced in 9.5, it seems reasonable to change the documentation to match the function behaviour rather than the other way round. Regards Ian Barwick [1] https://www.postgresql.org/docs/current/functions-admin.html#FUNCTIONS-REPLICATION-TABLE -- EnterpriseDB: https://www.enterprisedb.com commit 9e0d549002d97b20261554c5f45ad5149916994a Author: Ian Barwick Date: Tue Jan 18 09:52:16 2022 +0900 doc: fix pg_replication_origin_oid() description If the specified replication origin does not exist, NULL is returned; an error is not thrown. diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index a270f89dfe..027dfa4b82 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -26524,7 +26524,7 @@ postgres=# SELECT * FROM pg_walfile_name_offset(pg_stop_backup()); Looks up a replication origin by name and returns the internal ID. If -no such replication origin is found an error is thrown. +no such replication origin is found, NULL is returned.
Re: drop tablespace failed when location contains .. on win32
On Tue, Jan 18, 2022 at 01:08:01AM +, wangsh.f...@fujitsu.com wrote: > Yes, I will send a new version before next weekend Thanks! -- Michael
RE: drop tablespace failed when location contains .. on win32
Hi > This patch is a wanted bugfix and has been waiting for an update for 2 months. > > Do you plan to send a new version soon? Yes, I will send a new version before next weekend Regards Shenhao Wang
Re: \d with triggers: more than one row returned by a subquery used as an expression
I wrote: > Justin Pryzby writes: >> Is there any reason why WITH ORDINALITY can't work ? >> This is passing the smoke test. > How hard did you try to break it? It still seems to me that > this can be fooled by an unrelated trigger with the same tgname. Hmm ... no, it does work, because we'll stop at the first trigger with tgparentid = 0, so unrelated triggers further up the partition stack don't matter. But this definitely requires commentary. (And I'm not too happy with burying such a complicated thing inside a conditional inside a printf, either.) Will see about cleaning it up. regards, tom lane
Re: [EXTERNAL] Re: PQcancel does not use tcp_user_timeout, connect_timeout and keepalive settings
It seems the man page of TCP_USER_TIMEOUT does not align with reality then. When I use it on my local machine it is effectively used as a connection timeout too. The second command times out after two seconds: sudo iptables -A INPUT -p tcp --destination-port 5432 -j DROP psql 'host=localhost tcp_user_timeout=2000' The keepalive settings only apply once you get to the recv however. And yes, it is pretty unlikely for the connection to break right when it is waiting for data. But it has happened for us. And when it happens it is really bad, because the process will be blocked forever. Since it is a blocking call. After investigation when this happened it seemed to be a combination of a few things making this happen: 1. The way citus uses cancelation requests: A Citus query on the coordinator creates multiple connections to a worker and with 2PC for distributed transactions. If one connection receives an error it sends a cancel request for all others. 2. When a machine is under heavy CPU or memory pressure things don't work well: i. errors can occur pretty frequently, causing lots of cancels to be sent by Citus. ii. postmaster can be slow in handling new cancelation requests. iii. Our failover system can think the node is down, because health checks are failing. 3. Our failover system effectively cuts the power and the network of the primary when it triggers a fail over to the secondary This all together can result in a cancel request being interrupted right at that wrong moment. And when it happens a distributed query on the Citus coordinator, becomes blocked forever. We've had queries stuck in this state for multiple days. The only way to get out of it at that point is either by restarting postgres or manually closing the blocked socket (either with ss or gdb). Jelte
Re: \d with triggers: more than one row returned by a subquery used as an expression
Justin Pryzby writes: > On Mon, Jan 17, 2022 at 05:02:00PM -0500, Tom Lane wrote: >> ISTM the real problem is the assumption that only related triggers could >> share a tgname, which evidently isn't true. I think this query needs to >> actually match on tgparentid, rather than taking shortcuts. > I don't think that should be needed - tgparentid should match > pg_partition_ancestors(). Uh, what? tgparentid is a trigger OID, not a relation OID. > Is there any reason why WITH ORDINALITY can't work ? > This is passing the smoke test. How hard did you try to break it? It still seems to me that this can be fooled by an unrelated trigger with the same tgname. regards, tom lane
Re: \d with triggers: more than one row returned by a subquery used as an expression
On Mon, Jan 17, 2022 at 05:02:00PM -0500, Tom Lane wrote: > Justin Pryzby writes: > > On Fri, Dec 17, 2021 at 09:43:56AM -0600, Justin Pryzby wrote: > >> I want to mention that the 2nd problem I mentioned here is still broken. > >> https://www.postgresql.org/message-id/20210717010259.gu20...@telsasoft.com > >> It happens if non-inheritted triggers on child and parent have the same > >> name. > > > This is the fix I was proposing > > It depends on pg_partition_ancestors() to return its partitions in order: > > this partition => parent => ... => root. > > I don't think that works at all. I might be willing to accept the > assumption about pg_partition_ancestors()'s behavior, but you're also > making an assumption about how the output of pg_partition_ancestors() > is joined to "pg_trigger AS u", and I really don't think that's safe. > ISTM the real problem is the assumption that only related triggers could > share a tgname, which evidently isn't true. I think this query needs to > actually match on tgparentid, rather than taking shortcuts. I don't think that should be needed - tgparentid should match pg_partition_ancestors(). > If we don't > want to use a recursive CTE, maybe we could define it as only looking up to > the immediate parent, rather than necessarily finding the root? I think that defeats the intent of c33869cc3. Is there any reason why WITH ORDINALITY can't work ? This is passing the smoke test. -- Justin >From 99f8ee921258d9b2e75299e69826004fda3b8919 Mon Sep 17 00:00:00 2001 From: Justin Pryzby Date: Fri, 3 Apr 2020 22:43:26 -0500 Subject: [PATCH 1/2] psql: Add newlines and spacing in trigger query for \d cosmetic change to c33869cc3 to make the output of psql -E look pretty. 
---
 src/bin/psql/describe.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index 61cec118aca..ed9f320b015 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -2993,12 +2993,12 @@ describeOneTableDetails(const char *schemaname,
 					  "t.tgenabled, t.tgisinternal, %s\n"
 					  "FROM pg_catalog.pg_trigger t\n"
 					  "WHERE t.tgrelid = '%s' AND ",
-					  (pset.sversion >= 13 ?
-					   "(SELECT (NULLIF(a.relid, t.tgrelid))::pg_catalog.regclass"
-					   " FROM pg_catalog.pg_trigger AS u, "
-					   " pg_catalog.pg_partition_ancestors(t.tgrelid) AS a"
-					   " WHERE u.tgname = t.tgname AND u.tgrelid = a.relid"
-					   " AND u.tgparentid = 0) AS parent" :
+					  (pset.sversion >= 13 ? "\n"
+					   " (SELECT (NULLIF(a.relid, t.tgrelid))::pg_catalog.regclass\n"
+					   "  FROM pg_catalog.pg_trigger AS u,\n"
+					   "   pg_catalog.pg_partition_ancestors(t.tgrelid) AS a\n"
+					   "  WHERE u.tgname = t.tgname AND u.tgrelid = a.relid\n"
+					   "AND u.tgparentid = 0) AS parent" :
 					   "NULL AS parent"),
 					  oid);
-- 
2.17.1

>From d866624abbd523afdbc3a539f14c935c42fc345f Mon Sep 17 00:00:00 2001
From: Justin Pryzby
Date: Fri, 16 Jul 2021 19:57:00 -0500
Subject: [PATCH 2/2] psql: fix \d for statement triggers with same name on
 partition and its parent

This depends on pg_partition_ancestors() to return its partitions in
order: this partition => parent => ... => root.

20210716193319.gs20...@telsasoft.com
---
 src/bin/psql/describe.c                |  4 ++--
 src/test/regress/expected/triggers.out | 13 +++++++++++++
 src/test/regress/sql/triggers.sql      |  4 ++++
 3 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index ed9f320b015..8787849e8be 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -2996,9 +2996,9 @@ describeOneTableDetails(const char *schemaname,
 					  (pset.sversion >= 13 ? "\n"
 					   " (SELECT (NULLIF(a.relid, t.tgrelid))::pg_catalog.regclass\n"
 					   "  FROM pg_catalog.pg_trigger AS u,\n"
-					   "   pg_catalog.pg_partition_ancestors(t.tgrelid) AS a\n"
+					   "   pg_catalog.pg_partition_ancestors(t.tgrelid) WITH ORDINALITY AS a(relid,depth)\n"
 					   "  WHERE u.tgname = t.tgname AND u.tgrelid = a.relid\n"
-					   "AND u.tgparentid = 0) AS parent" :
+					   "AND u.tgparentid = 0 ORDER BY depth LIMIT 1) AS parent" :
 					   "NULL AS parent"),
 					  oid);

diff --git a/src/test/regress/expected/triggers.out b/src/test/regress/expected/triggers.out
index 5c0e7c2b79e..9278cc07237 100644
--- a/src/test/regress/expected/triggers.out
+++ b/src/test/regress/expected/triggers.out
@@ -2122,6 +2122,19 @@ Triggers:
 alter table trigpart attach partition trigpart3 FOR VALUES FROM (2000) to (3000); -- fail
 ERROR:  trigger "trg1" for relation "trigpart3" already exists
 drop table trigpart3;
+create trigger samename after delete on trigpart execute function trigger_nothing();
+create trigger samename after delete on trigpart1 execute function trigger_nothing();
+\d trigpart1
+ Table "public.
Re: a misbehavior of partition row movement (?)
Alvaro Herrera writes: > On 2022-Jan-17, Tom Lane wrote: >> It makes zero sense to have the bits in AFTER_TRIGGER_TUP_BITS not >> be adjacent. So what should happen here is to renumber the symbols >> in between to move their bits over one place. > Is it typical to enumerate bits starting from the right of each byte, > when doing it from the high bits of the word? DONE and IN_PROGRESS have > been defined as 0x1 and 0x2 of that byte for a very long time and I > found that very strange. I am inclined to count from the left, so I'd > pick 8 first, defining the set like this: Doesn't matter to me either way, as long as the values look like they were all defined by the same person ;-) > (The fact that FDW_REUSE bits is actually an empty mask comes from > 7cbe57c34dec, specifically [1]) That seemed a bit odd to me too, but this is not the patch to be changing it in, I suppose. regards, tom lane
Re: Push down time-related SQLValue functions to foreign server
Alexander Pyhalov writes: >> Perhaps in a MACRO? > Changed this check to a macro, also fixed condition in > is_foreign_param() and added test for it. > Also fixed comment in prepare_query_params(). I took a quick look at this. I'm unconvinced that you need the TIME_RELATED_SQLVALUE_FUNCTION macro, mainly because I think testing that in is_foreign_param() is pointless. The only way we'll be seeing a SQLValueFunction in is_foreign_param() is if we decided it was shippable, so you really don't need two duplicate tests. (In the same vein, I would not bother with including a switch in deparseSQLValueFunction that knows about these opcodes explicitly. Just use the typmod field; exprTypmod() does.) I also find it pretty bizarre that contain_unsafe_functions isn't placed beside its one caller. Maybe that indicates that is_foreign_expr is mis-located and should be in shippable.c. More generally, it's annoying that you had to copy-and-paste all of contain_mutable_functions to make this. That creates a rather nasty maintenance hazard for future hackers, who will need to keep these widely-separated functions in sync. Not sure what to do about that though. Do we want to extend contain_mutable_functions itself to cover this use-case? The test cases seem a bit overkill --- what is the point of the two nigh-identical PREPARE tests, or the GROUP BY test? If anything is broken about GROUP BY, surely it's not specific to this patch. 
I'm not really convinced by the premise of 0002, particularly this bit:

 static bool
-contain_mutable_functions_checker(Oid func_id, void *context)
+contain_unsafe_functions_checker(Oid func_id, void *context)
 {
-	return (func_volatile(func_id) != PROVOLATILE_IMMUTABLE);
+	/* now() is stable, but we can ship it as it's replaced by parameter */
+	return !(func_volatile(func_id) == PROVOLATILE_IMMUTABLE || func_id == F_NOW);
 }

The point of the check_functions_in_node callback API is to verify the mutability of functions that are embedded in various sorts of expression nodes ... but they might not be in a plain FuncExpr node, which is the only case you'll deparse correctly. It might be that now() cannot legally appear in any of the other node types that check_functions_in_node knows about, but I'm not quite convinced of that, and even less convinced that that'll stay true as we add more expression node types. Also, if we commit this, for sure some poor soul will try to expand the logic to some other stable function that *can* appear in those contexts, and that'll be broken. The implementation of converting now() to CURRENT_TIMESTAMP seems like an underdocumented kluge, too. On the whole I'm a bit inclined to drop 0002; I'm not sure it's worth the trouble.

regards, tom lane
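The idea in the quoted checker can be sketched with a toy expression walker (hypothetical names and a stand-in catalog, not the real pg_proc or Node machinery): immutable functions are shippable, and stable now() gets a special pass because it is replaced by a parameter before the query is sent to the remote server. Note the toy only handles one "func" node shape, which is exactly the limitation being pointed out.

```python
PROVOLATILE_IMMUTABLE = "i"

# stand-in catalog: function name -> provolatile ('i'mmutable/'s'table/'v'olatile)
CATALOG = {"lower": "i", "now": "s", "random": "v"}
F_NOW = "now"

def contain_unsafe_functions(expr):
    """expr is a nested tuple: ("func", name, *args) or ("const", value).
    Returns True if the tree contains a function that is not safe to ship."""
    if expr[0] == "func":
        name, args = expr[1], expr[2:]
        # immutable is shippable; stable now() is special-cased because it
        # is sent as a parameter rather than deparsed
        if not (CATALOG[name] == PROVOLATILE_IMMUTABLE or name == F_NOW):
            return True
        return any(contain_unsafe_functions(a) for a in args)
    return False
```

A real implementation must also handle operator, coercion, and other node types that embed function calls, which is where the special case becomes fragile.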
Re: [PATCH] reduce page overlap of GiST indexes built using sorted method
Hi Aliaksandr,

Nice work on this. I've been following it a bit since the regression was noted, and it sparked renewed interest in R-tree structure and optimization for me.

As for ideas: I'm not deep into the details of postgresql and gist, but I've learned that the node size for gist indexes is equal to the page size, and as such the nodes are quite large in this context. This is what caused the overly flat tree when increasing the fill factor, if I'm not mistaken, and why it helps to decrease the fill factor. I think there is large potential if the tree node size could be tweaked to fit the use case, combining a higher fill factor with optimal tree depth. It's data dependent, so it would even make sense to make it an index parameter, if possible. There might be some deep reason in the architecture that I'm unaware of that could make it difficult to affect the node size, but regardless, I believe there could be a substantial win if node size could be controlled.

Regards,
Björn

On Mon, 17 Jan 2022 at 23:46, Aliaksandr Kalenik wrote: > Hey! > > Postgres 14 introduces an option to create a GiST index using a sort > method. It allows to create indexes much faster but as it had been > mentioned in sort support patch discussion the faster build performance > comes at cost of higher degree of overlap between pages than for indexes > built with regular method. > > Sort support was implemented for GiST opclass in PostGIS but eventually > got removed as default behaviour in latest 3.2 release because as it had > been discovered by Paul Ramsey > https://lists.osgeo.org/pipermail/postgis-devel/2021-November/029225.html > performance of queries might degrade by 50%. > > Together with Darafei Praliaskouski, Andrey Borodin and me we tried > several approaches to solve query performance degrade: > >- The first attempt was try to decide whether to make a split >depending on direction of curve (Z-curve for Postgres geometry type, >Hilbert curve for PostGIS).
It was implemented by filling a page until >fillfactor / 2 and then checking the penalty for every next item, keeping >inserting into the current page if the penalty is 0 or starting a new page if the penalty is >not 0. It turned out that with this approach the index becomes significantly >larger whereas page overlap still remains high. >- Andrey Borodin implemented LRU + split: a fixed amount of pages are >kept in memory and the best candidate page to insert the next item in is >selected by minimum penalty among these pages. If the best page for >insertion is full, it gets split into multiple pages, and if the amount >of candidate pages after split exceeds the limit, the pages insertion to >which has not happened recently are flushed. > https://github.com/x4m/postgres_g/commit/0f2ed5f32a00f6c3019048e0c145b7ebda629e73. >We made some tests and while query performance using the index built with >this approach is fine, the size of the index is extremely large. > > Eventually we made an implementation of an idea outlined in the sort support > patch discussion here > https://www.postgresql.org/message-id/flat/08173bd0-488d-da76-a904-912c35da4...@iki.fi#09ac9751a4cde897c99b99b2170faf3a > that several pages can be collected and then divided into actual index > pages by calling picksplit. My benchmarks with data provided in > postgis-devel show that query performance using the index built with patched > sort support is comparable with the performance of a query using an index built with > the regular method. The size of the index also matches the size of an index built with > the non-sorting method. > > It should be noted that with the current implementation of the sorting > build method, pages are always filled up to fillfactor. This patch changes > this behavior to what it would be if using a non-sorting method, and pages > are not always filled to fillfactor for the sake of query performance. I'm > interested in improving it and I wonder if there are any ideas on this.
> > > Benchmark summary: > > > create index roads_rdr_idx on roads_rdr using gist (geom); > > > with sort support before patch / CREATE INDEX 76.709 ms > > with sort support after patch / CREATE INDEX 225.238 ms > > without sort support / CREATE INDEX 446.071 ms > > > select count(*) from roads_rdr a, roads_rdr b where a.geom && b.geom; > > > with sort support before patch / SELECT 5766.526 ms > > with sort support after patch / SELECT 2646.554 ms > > without sort support / SELECT 2721.718 ms > > > index size > > > with sort support before patch / IDXSIZE 2940928 bytes > > with sort support after patch / IDXSIZE 4956160 bytes > > without sort support / IDXSIZE 5447680 bytes > > More detailed: > > Before patch using sorted method: > > > postgres=# create index roads_rdr_geom_idx_sortsupport on roads_rdr using > gist(geom); > > CREATE INDEX > > Time: 76.709 ms > > > postgres=# select count(*) from roads_rdr a, roads_rdr b where a.geom && > b.geom; > > count > > > > 505806 > > (1 row) > > > Time:
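For readers following along, the "direction of curve" discussed in this thread refers to the order produced by a space-filling curve key; a minimal two-dimensional Z-order (Morton) key can be sketched as follows. Consecutive runs of this order can still yield overlapping page bounding boxes, which is what the collect-several-pages-then-picksplit approach addresses.

```python
def morton_key(x: int, y: int, bits: int = 16) -> int:
    """Interleave the bits of x and y into a Z-order (Morton) key.
    The sorted GiST build orders entries along such a curve
    (Z-curve for Postgres geometry, Hilbert curve in PostGIS)."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)      # x bits go to even positions
        key |= ((y >> i) & 1) << (2 * i + 1)  # y bits go to odd positions
    return key
```

Sorting points by this key keeps spatially close points mostly adjacent, but adjacent runs can still jump across the plane at curve boundaries, producing the page overlap measured in the benchmarks above.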
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
On Mon, Jan 17, 2022 at 2:13 PM Robert Haas wrote: > On Mon, Jan 17, 2022 at 4:28 PM Peter Geoghegan wrote: > > Updating relfrozenxid should now be thought of as a continuous thing, > > not a discrete thing. > > I think that's pretty nearly 100% wrong. The most simplistic way of > expressing that is to say - clearly it can only happen when VACUUM > runs, which is not all the time. That just seems like semantics to me. The very next sentence after the one you quoted in your reply was "And so it's highly unlikely that any given VACUUM will ever *completely* fail to advance relfrozenxid". It's continuous *within* each VACUUM. As far as I can tell there is pretty much no way that the patch series will ever fail to advance relfrozenxid *by at least a little bit*, barring pathological cases with cursors and whatnot. > That's a bit facile, though; let me > try to say something a little smarter. There are real production > systems that exist today where essentially all vacuums are > anti-wraparound vacuums. And there are also real production systems > that exist today where virtually none of the vacuums are > anti-wraparound vacuums. So if we ship your proposed patches, the > frequency with which relfrozenxid gets updated is going to increase by > a large multiple, perhaps 100x, for the second group of people, who > will then perceive the movement of relfrozenxid to be much closer to > continuous than it is today even though, technically, it's still a > step function. But the people in the first category are not going to > see any difference at all. Actually, I think that even the people in the first category might well have about the same improved experience. Not just because of this patch series, mind you. It would also have a lot to do with the autovacuum_vacuum_insert_scale_factor stuff in Postgres 13. Not to mention the freeze map. What version are these users on? I have actually seen this for myself. 
With BenchmarkSQL, the largest table (the order lines table) starts out having its autovacuums driven entirely by autovacuum_vacuum_insert_scale_factor, even though there is a fair amount of bloat from updates. It stays like that for hours on HEAD. But even with my reasonably tuned setup, there is eventually a switchover point. Eventually all autovacuums end up as aggressive anti-wraparound VACUUMs -- this happens once the table gets sufficiently large (this is one of the two that is append-only, with one update to every inserted row from the delivery transaction, which happens hours after the initial insert). With the patch series, we have a kind of virtuous circle with freezing and with advancing relfrozenxid with the same order lines table. As far as I can tell, we fix the problem with the patch series. Because there are about 10 tuples inserted per new order transaction, the actual "XID consumption rate of the table" is much lower than the "worst case XID consumption" for such a table. It's also true that even with the patch we still get anti-wraparound VACUUMs for two fixed-size, hot-update-only tables: the stock table, and the customers table. But that's no big deal. It only happens because nothing else will ever trigger an autovacuum, no matter the autovacuum_freeze_max_age setting. > And therefore the reasoning that says - anti-wraparound vacuums just > aren't going to happen any more - or - relfrozenxid will advance > continuously seems like dangerous wishful thinking to me. I never said that anti-wraparound vacuums just won't happen anymore. I said that they'll be limited to cases like the stock table or customers table case. I was very clear on that point. With pgbench, whether or not you ever see any anti-wraparound VACUUMs will depend on how heap fillfactor for the accounts table -- set it low enough (maybe to 90) and you will still get them, since there won't be any other reason to VACUUM. 
As for the branches table, and the tellers table, they'll get VACUUMs in any case, regardless of heap fillfactor. And so they'll always advance relfrozenxid during each VACUUM, and never have even one anti-wraparound VACUUM. > It's only > true if (# of vacuums) / (# of wraparound vacuums) >> 1. And that need > not be true in any particular environment, which to me means that all > conclusions based on the idea that it has to be true are pretty > dubious. There's no doubt in my mind that advancing relfrozenxid > opportunistically is a good idea. However, I'm not sure how reasonable > it is to change any other behavior on the basis of the fact that we're > doing it, because we don't know how often it really happens. It isn't that hard to see that the cases where we continue to get any anti-wraparound VACUUMs with the patch seem to be limited to cases like the stock/customers table, or cases like the pathological idle cursor cases we've been discussing. Pretty narrow cases, overall. Don't take my word for it - see for yourself. -- Peter Geoghegan
Re: Pluggable toaster
Hi,

>This sounds interesting, but very much like column compression, which
>was proposed some time ago. If we haven't made much progress with that
>patch (AFAICS), what's the likelihood we'll succeed here, when it's
>combined with yet more complexity?

The main concern is that this patch provides an open API for Toast functionality, so new Toasters could be written as extensions and plugged in instead of the default one, for any column (or datatype). That was not possible before, because Toast functionality was part of the Postgres core code and was not meant to be modified.

>Maybe doing that kind of compression in TOAST is somehow simpler, but I
>don't see it.

A [custom] Toaster extension itself is not restricted to only toasting data; it could be used for compression, encryption, etc. - any case when data is meant to be transformed in some complicated way before being stored and (possibly) transformed back while being selected from the table, along with some simpler but not so obvious transformations, like removing common parts shared by all data in a column before storing it and restoring the column value in full during selection.

>Seems you'd need a mapping table, to allow M:N mapping between types and
>toasters, linking it to all "compatible" types. It's not clear to me how
>would this work with custom data types, domains etc.

Any suitable [custom] Toaster could be plugged in for any table column, or [debatable] for a datatype and assigned by default to the according column of any table using this datatype.

>Also, what happens to existing values when you change the toaster? What
>if the toasters don't use the same access method to store the chunks
>(heap vs. btree)? And so on.

For newer data there is no problem - it will be toasted by the newly assigned Toaster. For detoasting - the Toaster ID is stored in the Toast pointer, and in principle all data could be detoasted by the according toaster if it is available.
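The dispatch-by-toaster-ID idea can be sketched as a tiny registry (a purely hypothetical API, not the patch's actual C interface): the pointer carries the ID of the toaster that produced it, and detoasting fails cleanly when that toaster is no longer installed, which is exactly the open question in the thread.

```python
TOASTERS = {}  # toaster_id -> (toast_fn, detoast_fn); stand-in registry

def register_toaster(toaster_id, toast_fn, detoast_fn):
    TOASTERS[toaster_id] = (toast_fn, detoast_fn)

def toast(value, toaster_id):
    toast_fn, _ = TOASTERS[toaster_id]
    # the "toast pointer" records which toaster produced the payload
    return (toaster_id, toast_fn(value))

def detoast(pointer):
    toaster_id, payload = pointer
    if toaster_id not in TOASTERS:
        # the problem case: the producing toaster was uninstalled or replaced
        raise LookupError("toaster %r is not available" % toaster_id)
    return TOASTERS[toaster_id][1](payload)
```

The two mitigations discussed above map directly onto this model: either forbid unregistering a toaster that still owns data, or re-toast every affected datum under the new toaster before dropping the old one.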
But this is a topic for discussion and we are open to proposals, because there are possible cases where the older Toaster is not available - the previously used Toaster extension is not installed at all, was uninstalled, or was upgraded to a newer version. Currently we see two ways of handling this case - restricting changing the toaster, or re-toasting all toasted data, which could be very heavy if a Toaster is assigned to a widely used datatype - and we're looking forward to any ideas.

>So you suggest we move all of this to toaster? I'd say -1 to that,
>because it makes it much harder to e.g. add custom compression method, etc.

Original compression methods, etc. are not affected by this patch.

Regards,
-- 
Nikita Malakhov
Postgres Professional
https://postgrespro.ru/

On Mon, Jan 17, 2022 at 10:23 PM Tomas Vondra wrote: > > > On 1/14/22 19:41, Teodor Sigaev wrote: > > > >> In my understanding, we want to be able to > >> 1. Access data from a toasted object one slice at a time, by using > >> knowledge of the structure > >> 2. If toasted data is updated, then update a minimum number of > >> slice(s), without rewriting the existing slices > >> 3. If toasted data is expanded, then allow new slices to be appended to > >> the object without rewriting the existing slices > > > > There are more options: > > 1 share common parts between not only versions of row but between all > > rows in a column. Seems strange but examples: > >- urls often have a common prefix and so storing in a prefix tree (as > > SP-GiST does) allows significantly decrease storage size > >- the same for json - it's often use case with common part of its > > hierarchical structure > >- one more usecase for json. If json use only a few schemes > > (structure) it's possible to store in toast storage only values and > > don't store keys and structure > > This sounds interesting, but very much like column compression, which > was proposed some time ago.
If we haven't made much progrees with that > patch (AFAICS), what's the likelihood we'll succeed here, when it's > combined with yet more complexity? > > Maybe doing that kind of compression in TOAST is somehow simpler, but I > don't see it. > > > 2 Current toast storage stores chunks in heap accesses method and to > > provide fast access by toast id it makes an index. Ideas: > >- store chunks directly in btree tree, pgsql's btree already has an > > INCLUDE columns, so, chunks and visibility data will be stored only > > in leaf pages. Obviously it reduces number of disk's access for > > "untoasting". > >- use another access method for chunk storage > > > > Maybe, but that probably requires more thought - e.g. btree requires the > values to be less than 1/3 page, so I wonder how would that play with > toasting of values. > > >> ISTM that we would want the toast algorithm to be associated with the > >> datatype, not the column? > >> Can you explain your thinking? > > Hm. I'll try to explain my motivation. > > 1) Datatype could have more th
Re: a misbehavior of partition row movement (?)
On 2022-Jan-17, Tom Lane wrote:

> But could we please do it in a way that is designed to keep the
> code readable, rather than to minimize the number of lines of diff?
> It makes zero sense to have the bits in AFTER_TRIGGER_TUP_BITS not
> be adjacent. So what should happen here is to renumber the symbols
> in between to move their bits over one place.

Is it typical to enumerate bits starting from the right of each byte, when doing it from the high bits of the word? DONE and IN_PROGRESS have been defined as 0x1 and 0x2 of that byte for a very long time and I found that very strange. I am inclined to count from the left, so I'd pick 8 first, defining the set like this:

#define AFTER_TRIGGER_OFFSET		0x07FFFFFF	/* must be low-order bits */
#define AFTER_TRIGGER_DONE			0x80000000
#define AFTER_TRIGGER_IN_PROGRESS	0x40000000
/* bits describing the size and tuple sources of this event */
#define AFTER_TRIGGER_FDW_REUSE		0x00000000
#define AFTER_TRIGGER_FDW_FETCH		0x20000000
#define AFTER_TRIGGER_1CTID			0x10000000
#define AFTER_TRIGGER_2CTID			0x30000000
#define AFTER_TRIGGER_CP_UPDATE		0x08000000
#define AFTER_TRIGGER_TUP_BITS		0x38000000

(The fact that the FDW_REUSE bit pattern is actually an empty mask comes from 7cbe57c34dec, specifically [1])

Is this what you were thinking?

[1] https://postgr.es/m/20140306033644.ga3527...@tornado.leadboat.com

-- 
Álvaro Herrera 39°49'30"S 73°17'W — https://www.EnterpriseDB.com/
"All rings of power are equal, But some rings of power are more equal than others." (George Orwell's The Lord of the Rings)
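Tom's adjacency requirement can be sanity-checked mechanically. A quick Python sketch of Alvaro's proposed layout (assuming the hex constants, truncated in the mail archive, are the usual 32-bit masks, e.g. 0x07FF... is 0x07FFFFFF):

```python
AFTER_TRIGGER_OFFSET      = 0x07FFFFFF  # must be low-order bits
AFTER_TRIGGER_DONE        = 0x80000000
AFTER_TRIGGER_IN_PROGRESS = 0x40000000
AFTER_TRIGGER_FDW_REUSE   = 0x00000000
AFTER_TRIGGER_FDW_FETCH   = 0x20000000
AFTER_TRIGGER_1CTID       = 0x10000000
AFTER_TRIGGER_2CTID       = 0x30000000
AFTER_TRIGGER_CP_UPDATE   = 0x08000000
AFTER_TRIGGER_TUP_BITS    = 0x38000000

def layout_ok() -> bool:
    # the tuple-source bits are adjacent: TUP_BITS covers exactly
    # FDW_FETCH | 1CTID | CP_UPDATE, and 2CTID is FDW_FETCH | 1CTID
    adjacent = AFTER_TRIGGER_TUP_BITS == (AFTER_TRIGGER_FDW_FETCH
                                          | AFTER_TRIGGER_1CTID
                                          | AFTER_TRIGGER_CP_UPDATE)
    composed = AFTER_TRIGGER_2CTID == (AFTER_TRIGGER_FDW_FETCH
                                       | AFTER_TRIGGER_1CTID)
    # and they overlap neither the offset field nor DONE/IN_PROGRESS
    disjoint = ((AFTER_TRIGGER_TUP_BITS & AFTER_TRIGGER_OFFSET) == 0
                and (AFTER_TRIGGER_TUP_BITS
                     & (AFTER_TRIGGER_DONE | AFTER_TRIGGER_IN_PROGRESS)) == 0)
    return adjacent and composed and disjoint
```

Under those assumed values, all three properties hold, so CP_UPDATE fits in the bit released from the offset field without disturbing the others.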
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
On Mon, Jan 17, 2022 at 4:28 PM Peter Geoghegan wrote: > Updating relfrozenxid should now be thought of as a continuous thing, > not a discrete thing. I think that's pretty nearly 100% wrong. The most simplistic way of expressing that is to say - clearly it can only happen when VACUUM runs, which is not all the time. That's a bit facile, though; let me try to say something a little smarter. There are real production systems that exist today where essentially all vacuums are anti-wraparound vacuums. And there are also real production systems that exist today where virtually none of the vacuums are anti-wraparound vacuums. So if we ship your proposed patches, the frequency with which relfrozenxid gets updated is going to increase by a large multiple, perhaps 100x, for the second group of people, who will then perceive the movement of relfrozenxid to be much closer to continuous than it is today even though, technically, it's still a step function. But the people in the first category are not going to see any difference at all. And therefore the reasoning that says - anti-wraparound vacuums just aren't going to happen any more - or - relfrozenxid will advance continuously seems like dangerous wishful thinking to me. It's only true if (# of vacuums) / (# of wraparound vacuums) >> 1. And that need not be true in any particular environment, which to me means that all conclusions based on the idea that it has to be true are pretty dubious. There's no doubt in my mind that advancing relfrozenxid opportunistically is a good idea. However, I'm not sure how reasonable it is to change any other behavior on the basis of the fact that we're doing it, because we don't know how often it really happens. If someone says "every time I travel to Europe on business, I will use the opportunity to bring you back a nice present," you can't evaluate how much impact that will have on your life without knowing how often they travel to Europe on business. 
And that varies radically from "never" to "a lot" based on the person. -- Robert Haas EDB: http://www.enterprisedb.com
Re: \d with triggers: more than one row returned by a subquery used as an expression
Justin Pryzby writes: > On Fri, Dec 17, 2021 at 09:43:56AM -0600, Justin Pryzby wrote: >> I want to mention that the 2nd problem I mentioned here is still broken. >> https://www.postgresql.org/message-id/20210717010259.gu20...@telsasoft.com >> It happens if non-inheritted triggers on child and parent have the same name. > This is the fix I was proposing > It depends on pg_partition_ancestors() to return its partitions in order: > this partition => parent => ... => root. I don't think that works at all. I might be willing to accept the assumption about pg_partition_ancestors()'s behavior, but you're also making an assumption about how the output of pg_partition_ancestors() is joined to "pg_trigger AS u", and I really don't think that's safe. ISTM the real problem is the assumption that only related triggers could share a tgname, which evidently isn't true. I think this query needs to actually match on tgparentid, rather than taking shortcuts. If we don't want to use a recursive CTE, maybe we could define it as only looking up to the immediate parent, rather than necessarily finding the root? regards, tom lane
Re: Add last commit LSN to pg_last_committed_xact()
On Mon, Jan 17, 2022 at 4:34 PM Alvaro Herrera wrote:
> On 2022-Jan-14, James Coleman wrote:
> > The logical slot can't flush past the last commit, so even if there's
> > 100s of megabytes of unflushed WAL on the slot there may be zero lag
> > (in terms of what's possible to process).
> >
> > I've attached a simple patch (sans tests and documentation) to get
> > feedback early. After poking around this afternoon it seemed to me
> > that the simplest approach was to hook into the commit timestamps
> > infrastructure and store the commit's XLogRecPtr in the cache of the
> > most recent value (but of course don't write it out to disk).
>
> Maybe it would work to have a single LSN in shared memory, as an atomic
> variable, which uses monotonic advance[1] to be updated. Whether this is
> updated or not would depend on a new GUC, maybe track_latest_commit_lsn.
> Causing performance pain during transaction commit is not great, but at
> least this way it shouldn't be *too* large a hit.

I don't know if it would or not, but it's such a hot path that I find the idea a bit worrisome. Atomics aren't free - especially inside of a loop.

-- 
Robert Haas
EDB: http://www.enterprisedb.com
Re: pg14 psql broke \d datname.nspname.relname
On Mon, Jan 17, 2022 at 1:06 PM Mark Dilger wrote:
> > On Jan 15, 2022, at 12:28 AM, Julien Rouhaud wrote:
> > Could you send a rebased version?
> Yes. Here it is:

This is not a full review, but I just noticed that:

+ * dotcnt: how many separators were parsed from the pattern, by reference.
+ *		Can be NULL.

But then:

+	Assert(dotcnt != NULL);

On a related note, it's unclear why you've added three new arguments to processSQLNamePattern() but only one of them gets a mention in the function header comment. It's also pretty clear that the behavior of patternToSQLRegex() is changing, but the function header comments are not.

-- 
Robert Haas
EDB: http://www.enterprisedb.com
Re: a misbehavior of partition row movement (?)
Alvaro Herrera writes: > So this patch releases one bit from AFTER_TRIGGER_OFFSET and makes it > become AFTER_TRIGGER_CP_UPDATE. As far as I can tell there is no harm > in doing so. I agree that taking a bit away from AFTER_TRIGGER_OFFSET is okay (it could spare even a couple more, if we need them). But could we please do it in a way that is designed to keep the code readable, rather than to minimize the number of lines of diff? It makes zero sense to have the bits in AFTER_TRIGGER_TUP_BITS not be adjacent. So what should happen here is to renumber the symbols in between to move their bits over one place. (Since this data is only known within trigger.c, I don't even see an ABI-stability argument for not changing these assignments.) regards, tom lane
Re: Add last commit LSN to pg_last_committed_xact()
On 2022-Jan-14, James Coleman wrote:

> The logical slot can't flush past the last commit, so even if there's
> 100s of megabytes of unflushed WAL on the slot there may be zero lag
> (in terms of what's possible to process).
>
> I've attached a simple patch (sans tests and documentation) to get
> feedback early. After poking around this afternoon it seemed to me
> that the simplest approach was to hook into the commit timestamps
> infrastructure and store the commit's XLogRecPtr in the cache of the
> most recent value (but of course don't write it out to disk).

Maybe it would work to have a single LSN in shared memory, as an atomic variable, which uses monotonic advance[1] to be updated. Whether this is updated or not would depend on a new GUC, maybe track_latest_commit_lsn. Causing performance pain during transaction commit is not great, but at least this way it shouldn't be *too* large a hit.

[1] part of a large patch at https://www.postgresql.org/message-id/202111222156.xmo2yji5ifi2%40alvherre.pgsql

-- 
Álvaro Herrera Valdivia, Chile — https://www.EnterpriseDB.com/
"Find a bug in a program, and fix it, and the program will work today. Show the program how to find and fix a bug, and the program will work forever" (Oliver Silfridge)
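The "monotonic advance" being proposed can be sketched in Python (a toy model: the real patch would use a compare-and-swap loop on an atomic u64 in shared memory; a lock stands in for the CAS here). The key property is that a straggling committer with an older LSN can never move the shared value backwards.

```python
import threading

class MonotonicCommitLsn:
    """A single shared last-commit LSN that only ever moves forward."""

    def __init__(self):
        self._lock = threading.Lock()  # stand-in for the CAS retry loop
        self._lsn = 0

    def advance(self, commit_lsn: int) -> int:
        with self._lock:
            if commit_lsn > self._lsn:  # monotonic: never regress
                self._lsn = commit_lsn
            return self._lsn

    def read(self) -> int:
        with self._lock:
            return self._lsn

last_commit = MonotonicCommitLsn()
```

Robert's concern above is about exactly this `advance` call sitting on the commit path: even a short CAS loop is not free when every committing backend executes it.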
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
On Mon, Jan 17, 2022 at 7:12 AM Robert Haas wrote: > On Thu, Jan 13, 2022 at 4:27 PM Peter Geoghegan wrote: > > 1. Cases where our inability to get a cleanup lock signifies nothing > > at all about the page in question, or any page in the same table, with > > the same workload. > > > > 2. Pathological cases. Cases where we're at least at the mercy of the > > application to do something about an idle cursor, where the situation > > may be entirely hopeless on a long enough timeline. (Whether or not it > > actually happens in the end is less significant.) > > Sure. I'm worrying about case (2). I agree that in case (1) waiting > for the lock is almost always the wrong idea. I don't doubt that we'd each have little difficulty determining which category (1 or 2) a given real world case should be placed in, using a variety of methods that put the issue in context (e.g., looking at the application code, talking to the developers or the DBA). Of course, it doesn't follow that it would be easy to teach vacuumlazy.c how to determine which category the same "can't get cleanup lock" falls under, since (just for starters) there is no practical way for VACUUM to see all that context. That's what I'm effectively trying to work around with this "wait and see approach" that demotes FreezeLimit to a backstop (and so justifies removing the vacuum_freeze_min_age GUC that directly dictates our FreezeLimit today). The cure may be worse than the disease, and the cure isn't actually all that great at the best of times, so we should wait until the disease visibly gets pretty bad before being "interventionist" by waiting for a cleanup lock. I've already said plenty about why I don't like vacuum_freeze_min_age (or FreezeLimit) due to XIDs being fundamentally the wrong unit. But that's not the only fundamental problem that I see. The other problem is this: vacuum_freeze_min_age also dictates when an aggressive VACUUM will start to wait for a cleanup lock. 
But why should the first thing be the same as the second thing? I see absolutely no reason for it. (Hence the idea of making FreezeLimit a backstop, and getting rid of the GUC itself.) > > This is my concern -- what I've called category 2 cases have this > > exact quality. So given that, why not freeze what you can, elsewhere, > > on other pages that don't have the same issue (presumably the vast > > vast majority in the table)? That way you have the best possible > > chance of recovering once the DBA gets a clue and fixes the issue. > > That's the part I'm not sure I believe. To be clear, I think that I have yet to adequately demonstrate that this is true. It's a bit tricky to do so -- absence of evidence isn't evidence of absence. I think that your principled skepticism makes sense right now. Fortunately the early refactoring patches should be uncontroversial. The controversial parts are all in the last patch in the patch series, which isn't too much code. (Plus another patch to at least get rid of vacuum_freeze_min_age, and maybe vacuum_freeze_table_age too, that hasn't been written just yet.) > Imagine a table with a > gigantic number of pages that are not yet all-visible, a small number > of all-visible pages, and one page containing very old XIDs on which a > cursor holds a pin. I don't think it's obvious that not waiting is > best. Maybe you're going to end up vacuuming the table repeatedly and > doing nothing useful. If you avoid vacuuming it repeatedly, you still > have a lot of work to do once the DBA locates a clue. Maybe this is a simpler way of putting it: I want to delay waiting on a pin until it's pretty clear that we truly have a pathological case, which should in practice be limited to an anti-wraparound VACUUM, which will now be naturally rare -- most individual tables will literally never have even one anti-wraparound VACUUM. 
We don't need to reason about the vacuuming schedule this way, since anti-wraparound VACUUMs are driven by age(relfrozenxid) -- we don't really have to predict anything. Maybe we'll need to do an anti-wraparound VACUUM immediately after a non-aggressive autovacuum runs, without getting a cleanup lock (due to an idle cursor pathological case). We won't be able to advance relfrozenxid until the anti-wraparound VACUUM runs (at the earliest) in this scenario, but it makes no difference. Rather than predicting the future, we're covering every possible outcome (at least to the extent that that's possible). > I think there's probably an important principle buried in here: the > XID threshold that forces a vacuum had better also force waiting for > pins. If it doesn't, you can tight-loop on that table without getting > anything done. I absolutely agree -- that's why I think that we still need FreezeLimit. Just as a backstop, that in practice very rarely influences our behavior. Probably just in those remaining cases that are never vacuumed except for the occasional anti-wraparound VACUUM (even then it might not be very important).
Re: a misbehavior of partition row movement (?)
> @@ -3398,7 +3432,7 @@ typedef SetConstraintStateData *SetConstraintState;
>  */
> typedef uint32 TriggerFlags;
>
> -#define AFTER_TRIGGER_OFFSET			0x0FFF	/* must be low-order bits */
> +#define AFTER_TRIGGER_OFFSET			0x07FF	/* must be low-order bits */
> #define AFTER_TRIGGER_DONE				0x1000
> #define AFTER_TRIGGER_IN_PROGRESS		0x2000
> /* bits describing the size and tuple sources of this event */
> @@ -3406,7 +3440,8 @@ typedef uint32 TriggerFlags;
> #define AFTER_TRIGGER_FDW_FETCH		0x8000
> #define AFTER_TRIGGER_1CTID			0x4000
> #define AFTER_TRIGGER_2CTID			0xC000
> -#define AFTER_TRIGGER_TUP_BITS			0xC000
> +#define AFTER_TRIGGER_CP_UPDATE		0x0800
> +#define AFTER_TRIGGER_TUP_BITS			0xC800

So this patch releases one bit from AFTER_TRIGGER_OFFSET and makes it become AFTER_TRIGGER_CP_UPDATE. As far as I can tell there is no harm in doing so.

-- Álvaro Herrera 39°49'30"S 73°17'W — https://www.EnterpriseDB.com/
Re: Add last commit LSN to pg_last_committed_xact()
On Fri, Jan 14, 2022 at 7:42 PM James Coleman wrote: > I've attached a simple patch (sans tests and documentation) to get > feedback early. After poking around this afternoon it seemed to me > that the simplest approach was to hook into the commit timestamps > infrastructure and store the commit's XLogRecPtr in the cache of the > most recent value (but of course don't write it out to disk). That's the > downside of making this feature dependent on "track_commit_timestamps > = on", but that seems reasonable: > > 1. Getting the xid of the last commit is similarly dependent on commit > timestamps infrastructure. > 2. It's a simple place to hook into and avoids new shared data and locking. > > Thoughts? It doesn't seem great to me. It's making commit_ts do something other than commit timestamps, which looks kind of ugly. In general, I'm concerned about the cost of doing something like this. Extra shared memory updates as part of the process of committing a transaction are not (and can't be made) free. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Adding CI to our tree
On 1/17/22 13:19, Andres Freund wrote: > Hi, > > On 2022-01-17 10:25:12 -0500, Andrew Dunstan wrote: >> The buildfarm is moving in the opposite direction, to disaggregate >> steps. > I'm a bit confused as to where you want changes to vcregress.pl > going. Upthread you argued against adding more granular targets to > vcregress. But this seems to be arguing against that? OK, let me clear that up. Yes I argued upthread for a 'make check-world' equivalent, because it's useful for developers on Windows. But I don't really want to use it in the buildfarm client, where I prefer to run fine-grained tests. > > >> There are several reasons for that, including that it makes for >> less log output that you need to churn through to find out what's gone >> wrong in a particular case, and that it makes disabling certain test >> sets via the buildfarm client's `skip-steps' feature possible. > FWIW, to me this shouldn't require a lot of separate manual test > invocations. And continuing to have lots of granular test invocations from the > buildfarm client is *bad*, because it requires constantly syncing up the set > of test targets. Well, the logic we use for TAP tests is we run them for the following if they have a t subdirectory, subject to configuration settings like PG_TEST_EXTRA:

src/test/bin/*
contrib/*
src/test/modules/*
src/test/*

As long as any new TAP test gets placed in such a location nothing new is required in the buildfarm client. > > It also makes the buildfarm far slower than necessary, because it runs a lot > of stuff serially that it could run in parallel. That's a legitimate concern. I have made some strides towards gross parallelism in the buildfarm by providing for different branches to be run in parallel. This has proven to be fairly successful (i.e. I have not run into any complaints, and several of my own animals utilize it). I can certainly look at doing something of the sort for an individual branch run. 
> This is particularly a > problem for things like valgrind runs, where individual tests are quite slow - > but just throwing more CPUs at it would help a lot. When I ran a valgrind animal, `make check` was horribly slow, and it's already parallelized. But it was on a VM and I forget how many CPUs the VM had. > > We should set things up so that: > > a) The output of each test can easily be associated with the corresponding set > of log files. > b) The list of tests run can be determined generically by looking at the > filesystems > c) For each test run, it's easy to see whether it failed or succeeded > > As part of the meson stuff (which has its own test runner), I managed to get a > bit towards this by changing the log output hierarchy so that each test gets > its own directory for log files (regress_log_*, initdb.log, postmaster.log, > regression.diffs, server log files). What's missing is a > test.{failed,succeeded} marker or such, to make it easy to report the log > files for just failed tasks. One thing I have been working on is a way to split the log output of an individual buildfarm step, hitherto just a text blob, so that you can easily navigate to say regress_log_0003-foo-step.log without having to page through myriads of stuff. It's been on the back burner but I need to prioritize it, maybe when the current CF is done. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)
On Mon, Jan 17, 2022 at 9:57 AM Shruthi Gowda wrote: > I have rebased and generated the patches on top of PostgreSQL commit > ID cf925936ecc031355cd56fbd392ec3180517a110. > Kindly apply v8-0001-pg_upgrade-Preserve-relfilenodes-and-tablespace-O.patch > first and then v8-0002-Preserve-database-OIDs-in-pg_upgrade.patch. OK, so looking over 0002, I noticed a few things: 1. datlastsysoid isn't being used for anything any more. That's not a defect in your patch, but I've separately proposed to remove it. 2. I realized that the whole idea here depends on not having initdb create more than one database without a fixed OID. The patch solves that by nailing down the OID of template0, which is a sufficient solution. However, I think nailing down the (initial) OID of postgres as well would be a good idea, just in case somebody in the future decides to add another system-created database. 3. The changes to gram.y don't do anything. Sure, you've added a new "OID" token, but nothing generates that token, so it has no effect. The reason the syntax works is that createdb_opt_name accepts "IDENT", which means any string that's not in the keyword list (see kwlist.h). But that's there already, so you don't need to do anything in this file. 4. I felt that the documentation and comments could be somewhat improved. Here's an updated version in which I've reverted the changes to gram.y and tried to improve the comments and documentation. Could you have a look at implementing (2) above? -- Robert Haas EDB: http://www.enterprisedb.com v9-0001-pg_upgrade-perserve-database-OID-patch-from-shrut.patch Description: Binary data
Re: slowest tap tests - split or accelerate?
Hi, On 2022-01-17 15:13:57 -0500, Robert Haas wrote: > I guess there must be something explaining it, but I don't know what > it could be. The client and the server are each running the checksum > algorithm over the same data. If that's not the same speed then I > don't get it. Unless, somehow, they're using different implementations > of it? I think that actually might be the issue. On linux a test of pg_verifybackup was much faster than on windows (as in 10x). But if I disable openssl, it's only 2x. On the windows instance I *do* have openssl enabled. But I suspect something is off and the windows buildsystem ends up with our hand-rolled implementation on the client side, but not the server side. Which'd explain the times I'm seeing: We have a fast CRC implementation, but the rest is pretty darn unoptimized. Greetings, Andres Freund
Re: slowest tap tests - split or accelerate?
Andres Freund writes: > I've occasionally pondered caching initdb results and reusing them across > tests - just the locking around it seems a bit nasty, but perhaps that could > be done as part of the tmp_install step. Of course, it'd need to deal with > different options etc... I'd actually built a prototype to do that, based on making a reference cluster and then "cp -a"'ing it instead of re-running initdb. I gave up when I found that on slower, disk-bound machines it was hardly any faster. Thinking about it now, I wonder why not just re-use one cluster for many tests, only dropping and re-creating the database in which the testing happens. regards, tom lane
Re: removing datlastsysoid
Robert Haas writes: > Since that doesn't seem like an especially good idea, PFA a patch to > remove it. Note that, even prior to that commit, it wasn't being used > for anything when dumping modern servers, so it would still have been > OK to remove it from the current system catalog structure. Now, > though, we can remove all references to it. +1. Another reason to get rid of it is that it has nothing to do with the system OID ranges defined in access/transam.h. regards, tom lane
Re: [EXTERNAL] Re: PQcancel does not use tcp_user_timeout, connect_timeout and keepalive settings
Jelte Fennema writes: > Thanks for all the cleanup and adding of windows support. To me it now looks > good to merge. I was about to commit this when I started to wonder if it actually does anything useful. In particular, I read in the Linux tcp(7) man page:

TCP_USER_TIMEOUT (since Linux 2.6.37)
    ... This option can be set during any state of a TCP connection, but is effective only during the synchronized states of a connection (ESTABLISHED, FIN-WAIT-1, FIN-WAIT-2, CLOSE-WAIT, CLOSING, and LAST-ACK).

ISTM that the case we care about is where the server fails to respond to the TCP connection request. If it does so, it seems pretty unlikely that it wouldn't then eat the small amount of data we're going to send. While the keepalive options aren't so explicitly documented, I'd bet that they too don't have any effect until the connection is known established. So I'm unconvinced that setting these values really has much effect to make PQcancel more robust (or more accurately, make it fail faster). I would like to know what scenarios you tested to convince you that this is worth doing. regards, tom lane
Re: missing indexes in indexlist with partitioned tables
Hi! Afaiac the join pruning where the outer table is a partitioned table is the relevant case. I am not sure whether there are other cases. The join pruning, which works great for plain relations since 9.0, falls short for partitioned tables, since the optimizer fails to prove uniqueness there. In practical cases inner and outer tables are almost surely different ones, but I reattached a simpler example. It's the one I came up with back in September. I've seen this can be a reason to avoid partitioning for the time being, if the application relies on join pruning. I think generic views make it almost necessary to have it. If you had a different answer in mind, please don't hesitate to ask again. Regards Arne From: Alvaro Herrera Sent: Monday, January 17, 2022 7:16:08 PM To: Arne Roland Cc: Julien Rouhaud; pgsql-hackers Subject: Re: missing indexes in indexlist with partitioned tables Hmm, can you show cases of queries for which having this new partIndexlist changes plans? -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "Saca el libro que tu religión considere como el indicado para encontrar la oración que traiga paz a tu alma. Luego rebootea el computador y ve si funciona" (Carlos Duclós) indexlist_to_late.sql Description: indexlist_to_late.sql
Re: Adding CI to our tree
Hi, On 2022-01-17 14:30:53 -0500, Tom Lane wrote: > Andres Freund writes: > > I think it's not actually that hard, with something like I described in the > > email upthread, with each test going into a prescribed location, and the > > on-disk status being inspectable in an automated way. check-world could > > invoke > > a command to summarize the tests at the end in a .NOTPARALLEL, to make the > > local case easier. > > That sounds a bit, um, make-centric. At this point it seems to me > we ought to be thinking about how it'd work under meson. Some of this is a lot easier with meson. It has a builtin test runner, which understands tap (thereby not requiring prove anymore). Those tests can be listed (meson test --list). Depending on the option this results in a list of all tests with just the "topline" name of passing tests and error output from failing tests, or all output all the time or ... At the end it prints a summary of test counts and a list of failed tests. Here's an example (the leading timestamps are from CI, not meson), on windows: https://api.cirrus-ci.com/v1/task/6009638771490816/logs/check.log The test naming isn't necessarily great, but that's my fault. Running all the tests with meson takes a good bit less time than running most, but far from all, tests using vcregress.pl: https://cirrus-ci.com/build/4611852939296768 meson test makes it far easier to spot which tests failed, it's consistent across operating systems, allows to skip individual tests, etc. However: It doesn't address the log collection issue in itself. For that we'd still need to collect them in a way that's easier to associate with individual tests. In the meson branch I made it so that each test (including individual tap ones) has its own log directory, which makes it easier to select all the logs for a failing test etc. > > This subthread is about the windows tests specifically, where it's even > > worse
> > That's precisely because the windows build doesn't use make. > We shouldn't be thinking about inventing two separate dead-end > solutions to this problem. Agreed. I think some improvements, e.g. around making the logs easier to associate with an individual test, are orthogonal to the buildsystem issue. I think it might still be worth adding a stopgap way of running all tap tests on windows though. Having a vcregress.pl function to find all directories with t/ and run the tests there, shouldn't be a lot of code... Greetings, Andres Freund
Re: slowest tap tests - split or accelerate?
On Mon, Jan 17, 2022 at 2:57 PM Andres Freund wrote: > I wonder if there's something explaining why pg_verifybackup is greatly slowed > down by sha224 but not crc32c, but the server's runtime only differs by ~20ms? > It seems incongruous that pg_basebackup, with all the complexity of needing to > communicate with the server, transferring the backup over network, and *also* > computing checksums, takes as long as the pg_verifybackup invocation? I guess there must be something explaining it, but I don't know what it could be. The client and the server are each running the checksum algorithm over the same data. If that's not the same speed then I don't get it. Unless, somehow, they're using different implementations of it? > I've occasionally pondered caching initdb results and reusing them across > tests - just the locking around it seems a bit nasty, but perhaps that could > be done as part of the tmp_install step. Of course, it'd need to deal with > different options etc... It's a thought, but it does seem like a bit of a pain to implement. -- Robert Haas EDB: http://www.enterprisedb.com
removing datlastsysoid
Hi, While reviewing another patch, I noticed that it slightly adjusted the treatment of datlastsysoid. That made me wonder what datlastsysoid is used for, so I started poking around and discovered that the answer, at least insofar as I can determine, is "nothing". The documentation claims that the value is useful "particularly to pg_dump," which turns out not to be true any more. Tom's recent commit, 30e7c175b81d53c0f60f6ad12d1913a6d7d77008, to remove pg_dump/pg_dumpall support for dumping from pre-9.2 servers, removed all remaining uses of this value from the source tree. It's still maintained. We just don't do anything with it. Since that doesn't seem like an especially good idea, PFA a patch to remove it. Note that, even prior to that commit, it wasn't being used for anything when dumping modern servers, so it would still have been OK to remove it from the current system catalog structure. Now, though, we can remove all references to it. -- Robert Haas EDB: http://www.enterprisedb.com 0001-Remove-datlastsysoid.patch Description: Binary data
Re: SLRUs in the main buffer pool, redux
On Mon, Jan 17, 2022 at 11:23 PM Heikki Linnakangas wrote: > IIRC one issue with this has been performance. When an SLRU is working > well, a cache hit in the SLRU is very cheap. Linearly scanning the SLRU > array is cheap, compared to computing the hash and looking up a buffer > in the buffer cache. Would be good to do some benchmarking of that. One trick I want to experiment with is trying to avoid the mapping table lookup by using ReadRecentBuffer(). In the prototype patch I do that for one buffer per SLRU cache (so that'll often point to the head CLOG page), but it could be extended to a small array of recently accessed buffers to scan linearly, much like the current SLRU happy case except that it's not shared and doesn't need a lock so it's even happier. I'm half joking here, but that would let us keep calling this subsystem SLRU :-) > I wanted to do this with the CSN (Commit Sequence Number) work a long > time ago. That would benefit from something like an SLRU with very fast > access, to look up CSNs. Even without CSNs, if the CLOG was very fast to > access we might not need hint bits anymore. I guess trying to make SLRUs > faster than they already are is moving the goalposts, but at a minimum > let's make sure they don't get any slower. One idea I've toyed with is putting a bitmap into shared memory where each bit corresponds to a range of (say) 256 xids, accessed purely with atomics. If a bit is set we know they're all committed, and otherwise you have to do more pushups. I've attached a quick and dirty experimental patch along those lines, but I don't think it's quite right, especially with people waving 64 bit xid patches around. Perhaps it'd need to be a smaller sliding window, somehow, and perhaps people will balk at the wild and unsubstantiated assumption that transactions generally commit. 
From ee0db89b7c054ca0297d42ef94058b37bb8b9436 Mon Sep 17 00:00:00 2001
From: Thomas Munro
Date: Fri, 14 Jan 2022 11:15:48 +1300
Subject: [PATCH] WIP: Try to cache CLOG results.

To make CLOG lookups faster, maintain a bitmap of xid groups that are
known to be all committed, using atomic ops.

XXX Just an experiment...
---
 src/backend/access/transam/clog.c | 144 +-
 1 file changed, 143 insertions(+), 1 deletion(-)

diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index de787c3d37..0678fa828c 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -41,6 +41,7 @@
 #include "miscadmin.h"
 #include "pg_trace.h"
 #include "pgstat.h"
+#include "port/atomics.h"
 #include "storage/proc.h"
 #include "storage/sync.h"
@@ -81,6 +82,127 @@
  */
 #define THRESHOLD_SUBTRANS_CLOG_OPT	5
+
+/*
+ * The number of transactions to summarize into a single bit of shared memory
+ * that says 'everything in this group has committed'.  Must be a power of two
+ * and <= CLOG_XACTS_PER_PAGE.  Currently set so that it corresponds to one
+ * cache line of CLOG page, so that it's cheap to scan a group on every
+ * commit.  That results in a 1MB summary array.  XXX Many other policies
+ * including lazy computation and course granularity are possible.
+ */
+#define CLOG_XACTS_PER_SUMMARY_GROUP (64 * CLOG_XACTS_PER_BYTE)
+
+/* Fixed size of the array of all-committed summary bits, in bytes. */
+#define CLOG_SUMMARY_SIZE (0x8000 / CLOG_XACTS_PER_SUMMARY_GROUP / 8)
+
+static pg_atomic_uint32 *CLOGSummary;
+
+/*
+ * Write the all-committed bit for the summary group containing 'xid'.
+ */
+static void
+TransactionXidSetSummaryBit(TransactionId xid, bool value)
+{
+	size_t		bit_index = (xid & 0x7fff) / CLOG_XACTS_PER_SUMMARY_GROUP;
+	size_t		bits_per_word = sizeof(uint32) * 8;
+	size_t		word_index = bit_index / bits_per_word;
+	uint32		mask = 1 << (bit_index % bits_per_word);
+
+	if (value)
+		pg_atomic_fetch_or_u32(&CLOGSummary[word_index], mask);
+	else
+		pg_atomic_fetch_and_u32(&CLOGSummary[word_index], ~mask);
+}
+
+/*
+ * Read the all-committed bit for the summary group containing 'xid'.
+ */
+static bool
+TransactionXidGetSummaryBit(TransactionId xid)
+{
+	size_t		bit_index = (xid & 0x7fff) / CLOG_XACTS_PER_SUMMARY_GROUP;
+	size_t		bits_per_word = sizeof(uint32) * 8;
+	size_t		word_index = bit_index / bits_per_word;
+	uint32		mask = 1 << (bit_index % bits_per_word);
+
+	return (pg_atomic_read_u32(&CLOGSummary[word_index]) & mask) != 0;
+}
+
+/*
+ * Summarize a cache line's worth of CLOG statuses into an all-committed bit,
+ * if possible.  Called for each CLOG commit, so had better be fast.
+ */
+static void
+TransactionXidSummarizeGroup(TransactionId xid, char *page)
+{
+	int			first_chunk;
+	int			end_chunk;
+	uint64		all_committed_chunk;
+	uint64	   *page_chunks = (uint64 *) page;
+
+	/* Already set? */
+	if (unlikely(TransactionXidGetSummaryBit(xid)))
+		return;
+
+	/*
+	 * XXX
Re: slowest tap tests - split or accelerate?
Hi, On 2022-01-17 14:05:17 -0500, Robert Haas wrote: > On Mon, Jan 17, 2022 at 1:41 PM Andres Freund wrote: > > The reason these in particular are slow is that they do a lot of > > pg_basebackups without either / one-of -cfast / --no-sync. The lack of > > -cfast > > in particularly is responsible for a significant proportion of the test > > time. The only reason this didn't cause the tests to take many minutes is > > that > > spread checkpoints only throttle when writing out a buffer and there aren't > > that many dirty buffers... > > Adding -cfast to 002_algorithm.pl seems totally reasonable. I'm not > sure what else can realistically be done to speed it up without losing > the point of the test. And it's basically just a single loop, so > splitting it up doesn't seem to make a lot of sense either. It's also not that slow compared other tests after the -cfast addition. However, I'm a bit surprised at how long the individual pg_verifybackup calls take on windows - about as long as the pg_basebackup call itself. 
Running: pg_basebackup -D C:/dev/postgres/.\src\bin\pg_verifybackup\/tmp_check/t_002_algorithm_primary_data/backup/sha224 --manifest-checksums sha224 --no-sync -cfast
# timing: [4.798 + 0.704s]: complete
# Running: pg_verifybackup -e C:/dev/postgres/.\src\bin\pg_verifybackup\/tmp_check/t_002_algorithm_primary_data/backup/sha224
backup successfully verified
# timing: [5.507 + 0.697s]: completed

Interestingly, with crc32c, this is not so:

# Running: pg_basebackup -D C:/dev/postgres/.\src\bin\pg_verifybackup\/tmp_check/t_002_algorithm_primary_data/backup/crc32c --manifest-checksums crc32c --no-sync -cfast
# timing: [3.500 + 0.688s]: completed
ok 5 - backup ok with algorithm "crc32c"
ok 6 - crc32c is mentioned many times in the manifest
# Running: pg_verifybackup -e C:/dev/postgres/.\src\bin\pg_verifybackup\/tmp_check/t_002_algorithm_primary_data/backup/crc32c
backup successfully verified
# timing: [4.194 + 0.197s]: completed

I wonder if there's something explaining why pg_verifybackup is greatly slowed down by sha224 but not crc32c, but the server's runtime only differs by ~20ms? It seems incongruous that pg_basebackup, with all the complexity of needing to communicate with the server, transferring the backup over network, and *also* computing checksums, takes as long as the pg_verifybackup invocation? > pg_basebackup's 010_pg_basebackup.pl looks like it could be split up, > though. That one, at least to me, looks like people have just kept > adding semi-related things into the same test file. Yea. It's generally interesting how much time initdb takes in these tests. It's about 1.1s on my linux workstation, and 2.1s on windows. I've occasionally pondered caching initdb results and reusing them across tests - just the locking around it seems a bit nasty, but perhaps that could be done as part of the tmp_install step. Of course, it'd need to deal with different options etc... Greetings, Andres Freund
Re: Blank archive_command
Robert Haas writes: > It might be nice to do something about the fact that you can't change > archive_mode without a server restart, though. I suspect we had a good > reason for that limitation from an engineering perspective, but from a > user perspective, it sucks pretty hard. Agreed. I don't recall what the motivation for that was, but maybe it could be fixed with some more elbow grease. regards, tom lane
Re: Adding CI to our tree
Andres Freund writes: > I think it's not actually that hard, with something like I described in the > email upthread, with each test going into a prescribed location, and the > on-disk status being inspectable in an automated way. check-world could invoke > a command to summarize the tests at the end in a .NOTPARALLEL, to make the > local case easier. That sounds a bit, um, make-centric. At this point it seems to me we ought to be thinking about how it'd work under meson. > This subthread is about the windows tests specifically, where it's even worse > - there's no way to run all tests. That's precisely because the windows build doesn't use make. We shouldn't be thinking about inventing two separate dead-end solutions to this problem. regards, tom lane
Re: Adding CI to our tree
Hi, On 2022-01-17 13:50:04 -0500, Robert Haas wrote: > On Mon, Jan 17, 2022 at 1:19 PM Andres Freund wrote: > > FWIW, to me this shouldn't require a lot of separate manual test > > invocations. And continuing to have lots of granular test invocations from > > the > > buildfarm client is *bad*, because it requires constantly syncing up the set > > of test targets. > > I have a lot of sympathy with Andrew here, actually. If you just do > 'make check-world' and assume that will cover everything, you get one > giant output file. That is not great at all. I very much agree with check-world output being near unusable. > That's really hard to accomplish if you just run all the tests with one > invocation - any technique you use to find the boundaries between one test's > output and the next will prove to be unreliable. I think it's not actually that hard, with something like I described in the email upthread, with each test going into a prescribed location, and the on-disk status being inspectable in an automated way. check-world could invoke a command to summarize the tests at the end in a .NOTPARALLEL, to make the local case easier. pg_regress/isolation style tests have the "summary" output in regression.out, but we remove it on success. Tap tests have their output in the regress_log_* files, however these are far more verbose than the output from running the tests directly. I wonder if we should keep regression.out and also keep a copy of the "shorter" tap test output in a file? > But having said that, I also agree that it sucks to have to keep > updating the BF client every time we want to do any kind of > test-related changes in the main source tree. It also sucks locally. I *hate* having to dig through a long check-world output to find the first failure. This subthread is about the windows tests specifically, where it's even worse - there's no way to run all tests. Nor a list of test targets that's even halfway complete :/. 
One just has to know that one has to invoke a myriad of "vcregress.pl taptest path/to/directory" commands.

Greetings,

Andres Freund
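A rough sketch of what "run all the TAP tests" currently amounts to on Windows: one vcregress.pl invocation per test directory. The directory list here is illustrative and deliberately incomplete, which is exactly the complaint above.

```python
# Hypothetical sketch: no single "run everything" target exists on Windows,
# so each TAP suite needs its own vcregress.pl invocation. The directory
# list is illustrative, not a complete inventory.
tap_dirs = [
    "src/bin/pg_basebackup",
    "src/bin/pg_verifybackup",
    "src/test/recovery",
]

commands = [f"perl src/tools/msvc/vcregress.pl taptest {d}" for d in tap_dirs]
for cmd in commands:
    print(cmd)
```

Keeping such a list in sync by hand is the "constantly syncing up the set of test targets" problem mentioned earlier in the thread.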
Re: Pluggable toaster
On 1/14/22 19:41, Teodor Sigaev wrote:

> In my understanding, we want to be able to
> 1. Access data from a toasted object one slice at a time, by using knowledge of the structure
> 2. If toasted data is updated, then update a minimal number of slice(s), without rewriting the existing slices
> 3. If toasted data is expanded, then allow new slices to be appended to the object without rewriting the existing slices
>
> There are more options:
>
> 1. Share common parts not only between versions of a row but between all rows in a column. Seems strange, but some examples:
>    - URLs often have a common prefix, so storing them in a prefix tree (as SP-GiST does) allows a significant decrease in storage size
>    - the same for json - a common use case is documents sharing part of their hierarchical structure
>    - one more use case for json: if the documents use only a few schemes (structures), it's possible to store only the values in toast storage, without storing keys and structure

This sounds interesting, but very much like column compression, which was proposed some time ago. If we haven't made much progress with that patch (AFAICS), what's the likelihood we'll succeed here, when it's combined with yet more complexity? Maybe doing that kind of compression in TOAST is somehow simpler, but I don't see it.

> 2. The current toast storage stores chunks in the heap access method and, to provide fast access by toast id, builds an index. Ideas:
>    - store chunks directly in a btree; pgsql's btree already has INCLUDE columns, so chunks and visibility data would be stored only in leaf pages. Obviously that reduces the number of disk accesses for "untoasting".
>    - use another access method for chunk storage

Maybe, but that probably requires more thought - e.g. btree requires values to be less than 1/3 of a page, so I wonder how that would play with toasting of values.

> > ISTM that we would want the toast algorithm to be associated with the datatype, not the column? Can you explain your thinking?
>
> Hm. I'll try to explain my motivation.
> 1) A datatype could have more than one suitable toaster, for different use cases: fast retrieval, compact storage, fast update, etc. As I said above, for jsonb there are several optimal toasting strategies: for values with a few distinct structures, for close-to-hierarchical structures, for values whose parts differ by access mode (it's easy to imagine json with some keys used for search and some keys only for output to the user).
>
> 2) A toaster could be designed to work with different data types. The suggested appendable toaster is designed to work with bytea but could work with text.
>
> Looking at this point, I have doubts about where to store the connection between toaster and datatype. If we add toasteroid to pg_type, how do we deal with several toasters for one datatype? (And we could want different toasters in one table!) If we add typoid to pg_toaster, how will it work with several datatypes? Adding a new many-to-many connection table seems workable, but that raises further questions, such as whether any toaster will work with any table access method. To resolve this bundle of questions, we propose a validate() method of the toaster, to be called during DDL operations, i.e. when a toaster is assigned to a column or a column's datatype is changed.

Seems you'd need a mapping table to allow an M:N mapping between types and toasters, linking each toaster to all "compatible" types. It's not clear to me how this would work with custom data types, domains, etc. Also, what happens to existing values when you change the toaster? What if the toasters don't use the same access method to store the chunks (heap vs. btree)? And so on.

> More thoughts: Postgres now has two options for a column, storage and compression, and now we add toaster. To me that seems too redundant. Storage should be a binary value: inplace (plain, as now) or toastable.
> All other variations, such as the toast limit, compression enabling, and compression kind, should be per-column options of the toaster (that's why we suggest a valid toaster oid for any column with a varlena/toastable datatype). It looks like a good abstraction, but we will have a problem with backward compatibility, and I'm afraid I can't implement it very fast.

So you suggest we move all of this to the toaster? I'd say -1 to that, because it makes it much harder to e.g. add custom compression methods, etc.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
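The prefix-sharing idea discussed above (URLs in a prefix tree) can be illustrated with a minimal front-coding sketch - not the proposed toaster implementation, just the storage intuition: in a sorted list, each entry is stored as the length of the prefix it shares with its predecessor plus the remaining suffix, so common prefixes are kept only once.

```python
# Minimal front-coding sketch (illustrative only, not the toaster API):
# store each sorted entry as (shared-prefix length, suffix).
import os


def front_encode(sorted_urls):
    prev, out = "", []
    for u in sorted_urls:
        # length of the prefix shared with the previous entry
        n = len(os.path.commonprefix([prev, u]))
        out.append((n, u[n:]))
        prev = u
    return out


def front_decode(encoded):
    prev, out = "", []
    for n, suffix in encoded:
        u = prev[:n] + suffix
        out.append(u)
        prev = u
    return out


urls = sorted([
    "https://example.com/a/1",
    "https://example.com/a/2",
    "https://example.com/b/1",
])
enc = front_encode(urls)
assert front_decode(enc) == urls
```

After encoding, only the first entry stores a full URL; the others store short suffixes, which is where the storage savings come from.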
Re: Adding CI to our tree
Robert Haas writes: > I have a lot of sympathy with Andrew here, actually. If you just do > 'make check-world' and assume that will cover everything, you get one > giant output file. That is not great at all. Yeah. I agree with Andrew that we want output that is more modular, not less so. But we do need to find a way to have less knowledge hard-wired in the buildfarm client script. > But having said that, I also agree that it sucks to have to keep > updating the BF client every time we want to do any kind of > test-related changes in the main source tree. One way around that > would be to put a file in the main source tree that the build farm > client can read to know what to do. Another would be to have the BF > client download the latest list of steps from somewhere instead of > having it in the source code, so that it can be updated without > everyone needing to update their machine. The obvious place for "somewhere" is "the main source tree", so I doubt your second suggestion is better than your first. But your first does seem like a plausible way to proceed. Another way to think of it, maybe, is to migrate chunks of the buildfarm client script itself into the source tree. I'd rather that developers not need to become experts on the buildfarm client to make adjustments to the test process --- but I suspect that a simple script like "run make check in these directories" is not going to be flexible enough for everything. regards, tom lane
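The "file in the main source tree that the buildfarm client reads" idea could be sketched like this. The file name, schema, and step list below are invented for illustration; the point is only that the client learns its steps from the tree rather than hard-coding them.

```python
# Hypothetical sketch: a test-step manifest shipped in the source tree, read
# by the buildfarm client. Name, schema, and steps are all invented here.
import json

MANIFEST = """
{
  "steps": [
    {"name": "regress",   "command": ["make", "-C", "src/test/regress", "check"]},
    {"name": "isolation", "command": ["make", "-C", "src/test/isolation", "check"]}
  ]
}
"""

steps = json.loads(MANIFEST)["steps"]
for step in steps:
    # A real client would run each command and file its output under step["name"],
    # giving the per-test red/green granularity discussed in the thread.
    print(step["name"], "->", " ".join(step["command"]))
```

Updating the manifest in the same commit that adds a test would keep the client and the tree in sync without a new client release.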
Re: ICU for global collation
On 10.01.22 12:49, Daniel Verite wrote:
> I think some users would want their db-wide ICU collation to be
> case/accent-insensitive. ...
> IIRC, that was the context for some questions where people were
> enquiring about db-wide ICU collations.

+1. There is the DEFAULT_COLLATION_OID, which is the cluster-level default collation, a.k.a. the "global collation", as distinct from the "db-wide" database-level default collation, which controls the default collation of the collatable types within that database.

> With the current patch, it's not possible, AFAICS, because the user
> can't tell that the collation is non-deterministic. Presumably this
> would require another option to CREATE DATABASE and another
> column to store that bit of information.

On 1/11/22, 6:24 AM, "Peter Eisentraut" wrote:
> Adding this would be easy, but since pattern matching currently does not
> support nondeterministic collations, if you make a global collation
> nondeterministic, a lot of system views, psql, pg_dump queries etc.
> would break, so it's not practical. I view this as an orthogonal
> project. Once we can support this without breaking system views etc.,
> then it's easy to enable with a new column in pg_database.

So this patch only enables the default cluster collation (DEFAULT_COLLATION_OID) to be a deterministic ICU collation, but doesn't extend the metadata in a way that would enable database collations to be ICU collations? Waiting for the pattern matching problem to be solved in general before creating the metadata to support ICU collations everywhere will make it more difficult for members of the community to help solve the pattern matching problem. What additional metadata changes would be required to enable an ICU collation to be specified at either the cluster level or the database level, even if new checks need to be added to disallow a nondeterministic collation at the cluster level for now?
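What "nondeterministic" means here can be sketched outside ICU: under a case- and accent-insensitive collation, byte-wise different strings compare equal. This rough approximation uses Python's Unicode machinery rather than ICU, purely to show why prefix-based pattern matching becomes hard.

```python
# Rough sketch (Python's unicodedata, not ICU) of a case/accent-insensitive
# comparison key, i.e. what a nondeterministic collation does.
import unicodedata


def ci_ai_key(s):
    # strip accents: decompose, drop combining marks, then casefold
    decomposed = unicodedata.normalize("NFD", s)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.casefold()


assert ci_ai_key("Résumé") == ci_ai_key("resume")
# Equality no longer implies byte equality, which is why LIKE and other
# pattern-matching optimizations break under such a collation:
assert "Résumé" != "resume"
```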
Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)
On Tue, Dec 14, 2021 at 1:21 PM Shruthi Gowda wrote:
> Thanks, Robert, for the updated version. I reviewed the changes and they
> look fine. I also tested the patch. The patch works as expected.

Thanks.

> > - I adjusted the function header comment for heap_create. Your
> > proposed comment seemed like it was pretty detailed but not 100%
> > correct. It also made one of the lines kind of long because you didn't
> > wrap the text in the surrounding style. I decided to make it simpler
> > and shorter instead of longer still and 100% correct.
>
> The comment update looks fine. However, I still feel it would be good to
> mention in which (rare) circumstances a valid relfilenode can get passed.

In general, I think it's the job of a function parameter comment to describe what the parameter does, not how the callers actually use it. One problem with describing the latter is that, if someone later adds another caller, there is a pretty good chance that they won't notice that the comment needs to be changed. More fundamentally, function parameter comments should be like an instruction manual for how to use the function. If you are trying to figure out how to use this function, it is not helpful to know that "most callers like to pass false" for this parameter. What you need to know is what value your new call site should pass, and knowing what "most callers" do or that something is "rare" doesn't really help. If we want to make this comment more detailed, we should approach it from the point of view of explaining how it ought to be set.

I've committed the v8-0001 patch you attached. I'll write separately about v8-0002.

--
Robert Haas
EDB: http://www.enterprisedb.com
Re: slowest tap tests - split or accelerate?
On Mon, Jan 17, 2022 at 1:41 PM Andres Freund wrote: > The reason these in particular are slow is that they do a lot of > pg_basebackups without either / one-of -cfast / --no-sync. The lack of -cfast > in particularly is responsible for a significant proportion of the test > time. The only reason this didn't cause the tests to take many minutes is that > spread checkpoints only throttle when writing out a buffer and there aren't > that many dirty buffers... Adding -cfast to 002_algorithm.pl seems totally reasonable. I'm not sure what else can realistically be done to speed it up without losing the point of the test. And it's basically just a single loop, so splitting it up doesn't seem to make a lot of sense either. pg_basebackup's 010_pg_basebackup.pl looks like it could be split up, though. That one, at least to me, looks like people have just kept adding semi-related things into the same test file. -- Robert Haas EDB: http://www.enterprisedb.com
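The fix under discussion (implemented in Perl in the patch posted in this thread) boils down to prepending shared speed-oriented defaults to every pg_basebackup invocation. A small Python sketch of the same idiom:

```python
# Sketch of the shared-defaults idiom the patch's @pg_basebackup_defs array
# uses: -cfast requests an immediate checkpoint instead of a spread one, and
# --no-sync skips fsync'ing the (throwaway) backup files.
BASEBACKUP_DEFAULTS = ["pg_basebackup", "--no-sync", "-cfast"]


def basebackup_cmd(*extra_args):
    """Build a pg_basebackup command line with the shared defaults first."""
    return BASEBACKUP_DEFAULTS + list(extra_args)


print(" ".join(basebackup_cmd("-D", "/tmp/backup")))
```

Centralizing the defaults means a test that genuinely needs a spread checkpoint can still override them explicitly, as Andres suggests for one test case.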
Re: Adding CI to our tree
On Mon, Jan 17, 2022 at 1:19 PM Andres Freund wrote: > FWIW, to me this shouldn't require a lot of separate manual test > invocations. And continuing to have lots of granular test invocations from the > buildfarm client is *bad*, because it requires constantly syncing up the set > of test targets. I have a lot of sympathy with Andrew here, actually. If you just do 'make check-world' and assume that will cover everything, you get one giant output file. That is not great at all. People who are looking through buildfarm results do not want to have to look through giant logfiles hunting for the failure; they want to look at the stuff that's just directly relevant to the failure they saw. The current BF is actually pretty bad at this. You can click on various things on a buildfarm results page, but it's not very clear where those links are taking you, and the pages at least in my browser (normally Chrome) render so slowly as to make the whole thing almost unusable. I'd like to have a thing where the buildfarm shows a list of tests in red or green and I can click links next to each test to see the various logs that test produced. That's really hard to accomplish if you just run all the tests with one invocation - any technique you use to find the boundaries between one test's output and the next will prove to be unreliable. But having said that, I also agree that it sucks to have to keep updating the BF client every time we want to do any kind of test-related changes in the main source tree. One way around that would be to put a file in the main source tree that the build farm client can read to know what to do. Another would be to have the BF client download the latest list of steps from somewhere instead of having it in the source code, so that it can be updated without everyone needing to update their machine. There might well be other approaches that are even better. 
But the "ask Andrew to adjust the BF client and then have everybody install the new version" approach upon which we have been relying up until now is not terribly scalable. -- Robert Haas EDB: http://www.enterprisedb.com
Re: slowest tap tests - split or accelerate?
Hi,

On 2021-12-31 11:25:28 -0800, Andres Freund wrote:
> cfbot now runs most tests on windows, the windows task is by far the slowest,
> and the task limited most in concurrency [2]. Running tap tests is the
> biggest part of that. This is a bigger issue on windows because we don't have
> infrastructure (yet) to run tests in parallel.
>
> There are a few tests which stand out in their slowness, which seem worth
> addressing even if we tackle test parallelism on windows at some point. I
> often find them to be the slowest tests on linux too.
>
> Picking a random successful cfbot run [1] I see the following tap tests taking
> more than 20 seconds:
>
> 67188 ms pg_basebackup t/010_pg_basebackup.pl
> 25751 ms pg_verifybackup t/002_algorithm.pl

The reason these in particular are slow is that they do a lot of pg_basebackups without one or both of -cfast / --no-sync. The lack of -cfast in particular is responsible for a significant proportion of the test time. The only reason this didn't cause the tests to take many minutes is that spread checkpoints only throttle when writing out a buffer, and there aren't that many dirty buffers...

Attached is a patch changing the parameters in all the instances I found. Testing on a local instance, it roughly halves the runtime of t/010_pg_basebackup.pl on linux and windows (but there's still a 2x time difference between the two); the gain is smaller when running the tests concurrently in CI.

It might be worth having one explicit use of -cspread, perhaps combined with an explicit checkpoint beforehand?

Greetings,

Andres Freund

>From 4abed85e9085786f82d87667f2451821c01d37c0 Mon Sep 17 00:00:00 2001
From: Andres Freund
Date: Mon, 17 Jan 2022 00:51:43 -0800
Subject: [PATCH v1 1/4] tests: Consistently use pg_basebackup -cfast --no-sync
 to accelerate tests.

Most tests invoking pg_basebackup themselves did not yet use -cfast, which makes pg_basebackup take considerably longer.
The only reason this didn't cause the tests to take many minutes is that spread checkpoints only throttle when writing out a buffer and there aren't that many dirty buffers...

Author:
Reviewed-By:
Discussion: https://postgr.es/m/
Backpatch:
---
 src/bin/pg_basebackup/t/010_pg_basebackup.pl | 104 ++-
 src/bin/pg_verifybackup/t/002_algorithm.pl   |   2 +-
 src/bin/pg_verifybackup/t/003_corruption.pl  |   2 +-
 src/bin/pg_verifybackup/t/004_options.pl     |   2 +-
 src/bin/pg_verifybackup/t/006_encoding.pl    |   2 +-
 src/bin/pg_verifybackup/t/007_wal.pl         |   4 +-
 6 files changed, 61 insertions(+), 55 deletions(-)

diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index 872fec3d350..ebd102cd098 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -20,6 +20,13 @@
 my $tempdir = PostgreSQL::Test::Utils::tempdir;
 
 my $node = PostgreSQL::Test::Cluster->new('main');
 
+# For nearly all tests some options should be specified, to keep test times
+# reasonable. Using @pg_basebackup_defs as the first element of the array
+# passed to IPC::Run interpolates the array (as it is not a reference to an
+# array).
+my @pg_basebackup_defs = ('pg_basebackup', '--no-sync', '-cfast');
+
+
 # Set umask so test directories and files are created with default permissions
 umask(0077);
 
@@ -43,7 +50,7 @@ $node->set_replication_conf();
 system_or_bail 'pg_ctl', '-D', $pgdata, 'reload';
 
 $node->command_fails(
-	[ 'pg_basebackup', '-D', "$tempdir/backup" ],
+	[ @pg_basebackup_defs, '-D', "$tempdir/backup" ],
 	'pg_basebackup fails because of WAL configuration');
 ok(!-d "$tempdir/backup", 'backup directory was cleaned up');
@@ -54,7 +61,7 @@ mkdir("$tempdir/backup") or BAIL_OUT("unable to create $tempdir/backup");
 append_to_file("$tempdir/backup/dir-not-empty.txt", "Some data");
 
-$node->command_fails([ 'pg_basebackup', '-D', "$tempdir/backup", '-n' ],
+$node->command_fails([ @pg_basebackup_defs, '-D', "$tempdir/backup", '-n' ],
 	'failing run with no-clean option');
 
 ok(-d "$tempdir/backup", 'backup directory was created and left behind');
@@ -105,7 +112,7 @@ foreach my $filename (@tempRelationFiles)
 }
 
 # Run base backup.
-$node->command_ok([ 'pg_basebackup', '-D', "$tempdir/backup", '-X', 'none' ],
+$node->command_ok([ @pg_basebackup_defs, '-D', "$tempdir/backup", '-X', 'none' ],
 	'pg_basebackup runs');
 ok(-f "$tempdir/backup/PG_VERSION", 'backup was created');
 ok(-f "$tempdir/backup/backup_manifest", 'backup manifest included');
@@ -165,9 +172,9 @@ rmtree("$tempdir/backup");
 
 $node->command_ok(
 	[
-		'pg_basebackup',    '-D',
-		"$tempdir/backup2", '--no-manifest',
-		'--waldir',         "$tempdir/xlog2"
+		@pg_basebackup_defs, '-D',
+		"$tempdir/backup2",  '--no-manifest',
+		'--waldir',          "$tempdir/xlog2"
 	],
 	'separate xlog directory');
 ok(-f "$tempdir/backup2/PG_VERSION", 'backup was created');
@@ -176,31 +183,31 @@ ok(-d "$tempdir/xlog2/",