Re: [PATCH] Improve amcheck to also check UNIQUE constraint in btree index.

2022-01-11 Thread Pavel Borisov
>
> The cfbot reports that you have mixed declarations and code
> (https://cirrus-ci.com/task/6407449413419008):
>
> [17:21:26.926] pg_amcheck.c: In function ‘main’:
> [17:21:26.926] pg_amcheck.c:634:4: error: ISO C90 forbids mixed
> declarations and code [-Werror=declaration-after-statement]
> [17:21:26.926] 634 | int vmaj = 0,
> [17:21:26.926] | ^~~
>

Corrected this, thanks!
Also added more comments on this part of the code.
PFA v8 of the patch.

-- 
Best regards,
Pavel Borisov

Postgres Professional: http://postgrespro.com 


v8-0001-Add-option-for-amcheck-and-pg_amcheck-to-check-un.patch
Description: Binary data


Re: Add index scan progress to pg_stat_progress_vacuum

2022-01-11 Thread Masahiko Sawada
(We had better avoid top-posting[1])


On Tue, Jan 11, 2022 at 10:01 AM Imseih (AWS), Sami  wrote:
>
> I agree. Renaming "index_vacuum_count" can be taken up in a separate 
> discussion.
>
> I have attached the 3rd revision of the patch which also includes the 
> documentation changes. Also attached is a rendered html of the docs for 
> review.

Thank you for updating the patch!

Regarding the new pg_stat_progress_vacuum_index view, why do we need
to have a separate view? Users will have to check two views. If this
view is expected to be used together with and joined to
pg_stat_progress_vacuum, why don't we provide one view that has full
information from the beginning? In particular, I don't think it's useful
that the total number of indexes to vacuum (num_indexes_to_vacuum
column) and the current number of indexes that have been vacuumed
(index_ordinal_position column) are shown in separate views.

Also, I’m not sure how useful index_tuples_removed is; what can we
infer from this value (without a total number)?

Regards,

[1] https://en.wikipedia.org/wiki/Posting_style#Top-posting

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: Time to drop plpython2?

2022-01-11 Thread Peter Eisentraut

On 11.01.22 17:06, Tom Lane wrote:

Nonetheless, we need to make a recommendation to the
buildfarm owners about what's the minimum python3 version we intend
to support going forward.  Do we want to just set it at 3.6, with
the expectation that the meson move will happen before too long?


Well, the minimum supported version has always been the oldest version 
that actually works.  I don't think we have ever said, about any 
dependency, that we support >= X even though < X still actually works.


I don't care much to tie this to Meson right now.  Meson might well move 
to 3.8 next week and ruin this whole scheme.


I'm okay with issuing some sort of recommendation for what is reasonable 
to test, and 3.5 or 3.6 seems like a good cutoff, considering what LTS 
OSes currently ship.  But I'm not sure if that is the same as "support".






Re: row filtering for logical replication

2022-01-11 Thread Peter Smith
Here are some review comments for v62-0002

~~~

1. src/backend/replication/pgoutput/pgoutput.c -
pgoutput_row_filter_update_check

+ * If the change is to be replicated this function returns true, else false.
+ *
+ * Examples: Let's say the old tuple satisfies the row filter but the new tuple
+ * doesn't. Since the old tuple satisfies, the initial table synchronization
+ * copied this row (or another method was used to guarantee that there is data
+ * consistency).  However, after the UPDATE the new tuple doesn't satisfy the

The word "Examples:" should be on a line by itself; not merged with
the 1st example "Let's say...".

~~~

2. src/backend/replication/pgoutput/pgoutput.c -
pgoutput_row_filter_update_check

+ /*
+ * For updates, both the new tuple and old tuple needs to be checked
+ * against the row filter. The new tuple might not have all the replica
+ * identity columns, in which case it needs to be copied over from the old
+ * tuple.
+ */

Typo: "needs to be checked" --> "need to be checked"

~~~

3. src/backend/replication/pgoutput/pgoutput.c - pgoutput_row_filter_init

Missing blank line before a couple of block comments, here:

bool pub_no_filter = false;
List*schemarelids = NIL;
/*
* If the publication is FOR ALL TABLES then it is treated the
* same as if this table has no row filters (even if for other
* publications it does).
*/
if (pub->alltables)
pub_no_filter = true;
/*
* If the publication is FOR ALL TABLES IN SCHEMA and it overlaps
* with the current relation in the same schema then this is also
* treated same as if this table has no row filters (even if for
* other publications it does).
*/
else

~~~

4. src/backend/replication/pgoutput/pgoutput.c - pgoutput_row_filter_init

This function was refactored out of the code of pgoutput_row_filter
in the v62-0001 patch. So several comments from my earlier v62-0001
review [1] of that pgoutput_row_filter function probably also apply to
this pgoutput_row_filter_init function.

~~~

5. src/tools/pgindent/typedefs.list - ReorderBufferChangeType

Actually, the typedef for ReorderBufferChangeType was added in
v62-0001, so this typedef change should've been done in patch 0001 and
can be removed from patch 0002.

--
[1] 
https://www.postgresql.org/message-id/CAHut%2BPucFM3Bt-gaTT7Pr-Y_x%2BR0y%3DL7uqbhjPMUsSPhdLhRpA%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Asynchronous and "direct" IO support for PostgreSQL.

2022-01-11 Thread Julien Rouhaud
Hi,

On Wed, Sep 1, 2021 at 1:57 PM Andres Freund  wrote:
>
> I've attached the code for posterity, but the series is large enough that I
> don't think it makes sense to do that all that often...

Agreed.

> The code is at
> https://github.com/anarazel/postgres/tree/aio

Just FYI, the cfbot says that this version of the patchset doesn't
apply anymore, and it seems that your branch is only rebased up to
43c1c4f (Sept. 21st), which doesn't rebase cleanly:

error: could not apply 8a20594f2f... lwlock, xlog: Report caller wait
event for LWLockWaitForVar.

Since it's still a WIP and a huge patchset, I'm not sure whether I
should switch the cf entry to Waiting on Author, as it's probably going
to rot quite fast anyway.  Just to be safe I'll go ahead and change
the status.  If that's unhelpful just let me know and I'll switch it
back to Needs Review, as people motivated enough to review the patch
can still work with 43c1c4f as a starting point.




Re: cfbot wrangling (was Re: Add checkpoint and redo LSN to LogCheckpointEnd log message)

2022-01-11 Thread Julien Rouhaud
On Wed, Jan 12, 2022 at 2:52 PM Thomas Munro  wrote:
>
> By way of documentation, I've just now tried to answer these questions
> in the new FAQ at:
>
> https://wiki.postgresql.org/wiki/Cfbot

Great!  Thanks a lot!




Re: Proposal: allow database-specific role memberships

2022-01-11 Thread Julien Rouhaud
Hi,

On Thu, Dec 2, 2021 at 2:26 AM Kenaniah Cerny  wrote:
>
> Attached is a rebased version of the patch that omits catversion.h in order 
> to avoid conflicts.

Unfortunately even without that the patch doesn't apply anymore
according to the cfbot: http://cfbot.cputube.org/patch_36_3374.log

1 out of 3 hunks FAILED -- saving rejects to file src/backend/parser/gram.y.rej
[...]
2 out of 8 hunks FAILED -- saving rejects to file
src/bin/pg_dump/pg_dumpall.c.rej

Could you send a rebased version?

In the meantime I'm switching the patch to Waiting on Author.




Re: cfbot wrangling (was Re: Add checkpoint and redo LSN to LogCheckpointEnd log message)

2022-01-11 Thread Tom Lane
Thomas Munro  writes:
> By way of documentation, I've just now tried to answer these questions
> in the new FAQ at:
> https://wiki.postgresql.org/wiki/Cfbot

Thanks!

regards, tom lane




Re: [PATCH] Proof of concept for GUC improvements

2022-01-11 Thread Julien Rouhaud
Hi,

On Thu, Nov 4, 2021 at 5:50 AM David Christensen  wrote:
>
> Hi, enclosed is a v3 [...]

According to the cfbot, the patch doesn't apply anymore and needs a
rebase: http://cfbot.cputube.org/patch_36_3290.log

> 43 out of 133 hunks FAILED -- saving rejects to file 
> src/backend/utils/misc/guc.c.rej

I'm switching the patch to Waiting on Author.




Re: cfbot wrangling (was Re: Add checkpoint and redo LSN to LogCheckpointEnd log message)

2022-01-11 Thread Thomas Munro
On Wed, Jan 12, 2022 at 7:37 PM Tom Lane  wrote:
> Julien Rouhaud  writes:
> > On Wed, Jan 12, 2022 at 01:19:22AM -0500, Tom Lane wrote:
> >> 2. You are attaching some random files, and would like to not
> >> displace the cfbot's idea of the latest patchset.
>
> > I'm assuming that someone wanting to send an additional patch to be applied 
> > on
> > top of the OP patchset is part of 2?
>
> AFAIK, if you're submitting a patch then you have to attach a complete
> patchset, or the cfbot will be totally lost.  Getting the bot to
> understand incremental patches would be useful for sure ... but it's
> outside the scope of what I'm asking for now, which is just clear
> documentation of what the bot can do already.

By way of documentation, I've just now tried to answer these questions
in the new FAQ at:

https://wiki.postgresql.org/wiki/Cfbot




Re: cfbot wrangling (was Re: Add checkpoint and redo LSN to LogCheckpointEnd log message)

2022-01-11 Thread Julien Rouhaud
On Wed, Jan 12, 2022 at 2:37 PM Tom Lane  wrote:
>
> AFAIK, if you're submitting a patch then you have to attach a complete
> patchset, or the cfbot will be totally lost.  Getting the bot to
> understand incremental patches would be useful for sure ... but it's
> outside the scope of what I'm asking for now, which is just clear
> documentation of what the bot can do already.

Right, but the use case I'm mentioning is a bit different: providing
another patch while *not* triggering the cfbot.  I've seen cases in
the past where people want to share some code with the OP, and it seems
reasonable to allow that without risking triggering the cfbot, at
least not without the OP validating or adapting the changes.




Re: do only critical work during single-user vacuum?

2022-01-11 Thread Masahiko Sawada
On Wed, Jan 12, 2022 at 10:57 AM Peter Geoghegan  wrote:
>
> On Tue, Jan 11, 2022 at 4:59 PM John Naylor
>  wrote:
> > I've attached a PoC *untested* patch to show what it would look like
> > as a top-level statement. If the "shape" is uncontroversial, I'll put
> > work into testing it and fleshing it out.
>
> Great!

+1

>
> > For the PoC I wanted to try re-using existing keywords. I went with
> > "VACUUM LIMIT" since LIMIT is already a keyword that cannot be used as
> > a table name. It also brings "wraparound limit" to mind. We could add
> > a single-use unreserved keyword (such as VACUUM_MINIMAL or
> > VACUUM_FAST), but that doesn't seem great.
>
> This seems reasonable, but you could add a new option instead, without
> much downside. While INDEX_CLEANUP kind of looks like a keyword, it
> isn't really a keyword. (Perhaps you knew this already.)
>
> Making this a new option is a little awkward, admittedly. It's not
> clear what it means to "VACUUM (LIMIT) my_table" -- do you just throw
> an error for stuff like that? So perhaps your approach of adding
> VacuumMinimalStmt (a minimal variant of the VACUUM command) is better.

It seems to me that adding new syntax instead of a new option is less
flexible. In the future, for instance, when we support parallel heap
scan for VACUUM, we may want to add a parallel-related option to both
the VACUUM statement and the VACUUM LIMIT statement. Wouldn't the
VACUUM LIMIT statement then end up looking just like the VACUUM statement?

As another idea, we might be able to add a new option that takes an
optional integer value, like VACUUM (MIN_XID), VACUUM (MIN_MXID), and
VACUUM (MIN_XID 50). We would vacuum only tables whose age is greater
than the given value. If the value is omitted, we would vacuum only
tables whose age exceeds a threshold (say autovacuum_freeze_max_age *
0.95), which can be used in an emergency and could be mentioned in the
GetNewTransactionId() WARNING output. vacuumdb's --min-xid-age and
--min-mxid-age could use this option instead of fetching the list of
tables from the server.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: [PATCH] Improve amcheck to also check UNIQUE constraint in btree index.

2022-01-11 Thread Julien Rouhaud
Hi,

On Fri, Dec 24, 2021 at 2:06 AM Максим Орлов  wrote:
>
> Thanks for your review! Fixed all these remaining things from patch v6.
> PFA v7 patch.

The cfbot reports that you have mixed declarations and code
(https://cirrus-ci.com/task/6407449413419008):

[17:21:26.926] pg_amcheck.c: In function ‘main’:
[17:21:26.926] pg_amcheck.c:634:4: error: ISO C90 forbids mixed
declarations and code [-Werror=declaration-after-statement]
[17:21:26.926] 634 | int vmaj = 0,
[17:21:26.926] | ^~~




Re: cfbot wrangling (was Re: Add checkpoint and redo LSN to LogCheckpointEnd log message)

2022-01-11 Thread Tom Lane
Julien Rouhaud  writes:
> On Wed, Jan 12, 2022 at 01:19:22AM -0500, Tom Lane wrote:
>> 2. You are attaching some random files, and would like to not
>> displace the cfbot's idea of the latest patchset.

> I'm assuming that someone wanting to send an additional patch to be applied on
> top of the OP patchset is part of 2?

AFAIK, if you're submitting a patch then you have to attach a complete
patchset, or the cfbot will be totally lost.  Getting the bot to
understand incremental patches would be useful for sure ... but it's
outside the scope of what I'm asking for now, which is just clear
documentation of what the bot can do already.

regards, tom lane




Re: cfbot wrangling (was Re: Add checkpoint and redo LSN to LogCheckpointEnd log message)

2022-01-11 Thread Julien Rouhaud
On Wed, Jan 12, 2022 at 01:19:22AM -0500, Tom Lane wrote:
> Julien Rouhaud  writes:
> > For this kind of situation I think that the usual solution is to use a
> > .txt extension to make sure that the cfbot won't try to apply it.
> 
> Yeah ... this has come up before.  Is there a documented way to
> attach files that the cfbot will ignore?

Not that I know of, unfortunately.  I think the apply part is done by
https://github.com/macdice/cfbot/blob/master/cfbot_patchburner_chroot_ctl.sh#L93-L120,
which seems reasonable.  So basically any extension other than the ones
handled there could be used, excluding of course anything clearly suspicious
that would be rejected by many email providers.

> Two specific scenarios seem to be interesting:
> 
> 1. You are attaching patch(es) plus some non-patch files
> 
> 2. You are attaching some random files, and would like to not
> displace the cfbot's idea of the latest patchset.

I'm assuming that someone wanting to send an additional patch to be applied on
top of the OP patchset is part of 2?




Re: Add jsonlog log_destination for JSON server logs

2022-01-11 Thread Michael Paquier
On Tue, Jan 11, 2022 at 08:34:26PM +, Bossart, Nathan wrote:
> I've been looking at the latest patch set intermittently and playing
> around with jsonlog a little.  It seems to work well, and I don't have
> any significant comments about the code.  0001 and 0002 seem
> straightforward and uncontroversial.

Thanks.  I have looked again at 0001 and 0002 today and applied both,
so it means that we are done with all the refactoring pieces proposed
up to now.

> IIUC 0003 simply introduces jsonlog using the existing framework.

This part will have to wait a bit more, but yes, this piece should be
straight-forward.

> I wonder if we should consider tracking each log destination as a set
> of function pointers.  The main logging code would just loop through
> the enabled log destinations and use these functions, and it otherwise
> would be completely detached (i.e., no "if jsonlog" blocks).  This
> might open up the ability to define custom log destinations via
> modules, too.  However, I don't know if there's any real demand for
> something like this, and it should probably be done separately from
> introducing jsonlog, anyway.
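
For illustration, the per-destination dispatch described above could look
roughly like the sketch below.  LogDestinationRoutine and
emit_to_enabled_destinations() are made-up names (not existing elog.c API);
write_csvlog()/write_jsonlog() and the LOG_DESTINATION_* bits are the ones
used by the patches in this thread, and the usual postgres.h/elog.h context
is assumed:

typedef struct LogDestinationRoutine
{
	int			dest_flag;		/* LOG_DESTINATION_* bit */
	void		(*emit_log) (ErrorData *edata); /* writer for that destination */
} LogDestinationRoutine;

static const LogDestinationRoutine log_destinations[] =
{
	{LOG_DESTINATION_CSVLOG, write_csvlog},
	{LOG_DESTINATION_JSONLOG, write_jsonlog},
	/* stderr/syslog/eventlog would need similar per-destination wrappers */
};

static void
emit_to_enabled_destinations(ErrorData *edata)
{
	/* Loop over the known destinations and emit to each enabled one. */
	for (int i = 0; i < lengthof(log_destinations); i++)
	{
		if (Log_destination & log_destinations[i].dest_flag)
			log_destinations[i].emit_log(edata);
	}
}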

I am not sure that this is worth the complications, either.
--
Michael
From b5b0e35d66e491df0f75c874ff4dd73fb7607c68 Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Wed, 12 Jan 2022 15:19:16 +0900
Subject: [PATCH v10] JSON logging

---
 src/include/postmaster/syslogger.h|   1 +
 src/include/utils/elog.h  |   2 +
 src/backend/postmaster/syslogger.c| 105 ++-
 src/backend/utils/adt/misc.c  |   5 +-
 src/backend/utils/error/Makefile  |   3 +-
 src/backend/utils/error/elog.c|  18 ++
 src/backend/utils/error/jsonlog.c | 294 ++
 src/backend/utils/misc/guc.c  |   4 +-
 src/backend/utils/misc/postgresql.conf.sample |  13 +-
 src/bin/pg_ctl/t/004_logrotate.pl |  20 +-
 doc/src/sgml/config.sgml  | 203 +++-
 11 files changed, 632 insertions(+), 36 deletions(-)
 create mode 100644 src/backend/utils/error/jsonlog.c

diff --git a/src/include/postmaster/syslogger.h b/src/include/postmaster/syslogger.h
index 2df68a196e..1ca326e52e 100644
--- a/src/include/postmaster/syslogger.h
+++ b/src/include/postmaster/syslogger.h
@@ -64,6 +64,7 @@ typedef union
 /* log destinations */
 #define PIPE_PROTO_DEST_STDERR	0x10
 #define PIPE_PROTO_DEST_CSVLOG	0x20
+#define PIPE_PROTO_DEST_JSONLOG	0x40
 
 /* GUC options */
 extern bool Logging_collector;
diff --git a/src/include/utils/elog.h b/src/include/utils/elog.h
index 5bc38663cb..3eb8de3966 100644
--- a/src/include/utils/elog.h
+++ b/src/include/utils/elog.h
@@ -436,6 +436,7 @@ extern bool syslog_split_messages;
 #define LOG_DESTINATION_SYSLOG	 2
 #define LOG_DESTINATION_EVENTLOG 4
 #define LOG_DESTINATION_CSVLOG	 8
+#define LOG_DESTINATION_JSONLOG	16
 
 /* Other exported functions */
 extern void DebugFileOpen(void);
@@ -453,6 +454,7 @@ extern void write_pipe_chunks(char *data, int len, int dest);
 
 /* Destination-specific functions */
 extern void write_csvlog(ErrorData *edata);
+extern void write_jsonlog(ErrorData *edata);
 
 #ifdef HAVE_SYSLOG
 extern void set_syslog_parameters(const char *ident, int facility);
diff --git a/src/backend/postmaster/syslogger.c b/src/backend/postmaster/syslogger.c
index 2256f072aa..8d26573817 100644
--- a/src/backend/postmaster/syslogger.c
+++ b/src/backend/postmaster/syslogger.c
@@ -86,9 +86,11 @@ static bool pipe_eof_seen = false;
 static bool rotation_disabled = false;
 static FILE *syslogFile = NULL;
 static FILE *csvlogFile = NULL;
+static FILE *jsonlogFile = NULL;
 NON_EXEC_STATIC pg_time_t first_syslogger_file_time = 0;
 static char *last_sys_file_name = NULL;
 static char *last_csv_file_name = NULL;
+static char *last_json_file_name = NULL;
 
 /*
  * Buffers for saving partial messages from different backends.
@@ -281,6 +283,8 @@ SysLoggerMain(int argc, char *argv[])
 	last_sys_file_name = logfile_getname(first_syslogger_file_time, NULL);
 	if (csvlogFile != NULL)
 		last_csv_file_name = logfile_getname(first_syslogger_file_time, ".csv");
+	if (jsonlogFile != NULL)
+		last_json_file_name = logfile_getname(first_syslogger_file_time, ".json");
 
 	/* remember active logfile parameters */
 	currentLogDir = pstrdup(Log_directory);
@@ -367,6 +371,14 @@ SysLoggerMain(int argc, char *argv[])
 (csvlogFile != NULL))
 rotation_requested = true;
 
+			/*
+			 * Force a rotation if JSONLOG output was just turned on or off
+			 * and we need to open or close jsonlogFile accordingly.
+			 */
+			if (((Log_destination & LOG_DESTINATION_JSONLOG) != 0) !=
+(jsonlogFile != NULL))
+rotation_requested = true;
+
 			/*
 			 * If rotation time parameter changed, reset next rotation time,
 			 * but don't immediately force a rotation.
@@ -417,6 +429,12 @@ SysLoggerMain(int argc, char *argv[])
 rotation_requested = true;
 size_rotation_for 

cfbot wrangling (was Re: Add checkpoint and redo LSN to LogCheckpointEnd log message)

2022-01-11 Thread Tom Lane
Julien Rouhaud  writes:
> For this kind of situation I think that the usual solution is to use a
> .txt extension to make sure that the cfbot won't try to apply it.

Yeah ... this has come up before.  Is there a documented way to
attach files that the cfbot will ignore?

Two specific scenarios seem to be interesting:

1. You are attaching patch(es) plus some non-patch files

2. You are attaching some random files, and would like to not
displace the cfbot's idea of the latest patchset.

I don't know exactly how to do either of those.

regards, tom lane




Re: Add connection active, idle time to pg_stat_activity

2022-01-11 Thread Julien Rouhaud
Hi,

On Mon, Nov 29, 2021 at 11:04 PM Kuntal Ghosh
 wrote:
>
> You also need to update the documentation.

You also need to update rules.sql: https://cirrus-ci.com/task/6145265819189248




Re: support for MERGE

2022-01-11 Thread Jaime Casanova
On Wed, Dec 22, 2021 at 11:35:56AM +, Simon Riggs wrote:
> On Mon, 15 Nov 2021 at 22:45, Alvaro Herrera  wrote:
> >
> > On 2021-Nov-15, Alvaro Herrera wrote:
> >
> > > Thanks everyone for the feedback.  I attach a version with the fixes
> > > that were submitted, as well as some additional changes:
> >
> > Attachment failure.
> 
> I rebased this, please check.
> 

Hi,

I found two crashes.  Actually, I found them on the original patch Álvaro
sent in November, but I just checked that they still exist.

I configured with:

CFLAGS="-ggdb -Og -g3 -fno-omit-frame-pointer" ./configure 
--prefix=/opt/var/pgdg/15/merge --enable-debug --enable-depend --enable-cassert 
--with-llvm --enable-tap-tests --with-pgport=54315

And tested on the regression database.

Attached are the SQL files for the crashes and their respective stack traces.
FWIW, the second crash doesn't appear to be caused by the MERGE patch,
but I cannot trigger it any other way.


-- 
Jaime Casanova
Director de Servicios Profesionales
SystemGuards - Consultores de PostgreSQL


merge1.sql
Description: application/sql
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
set = {__val = {4194304, 140726414547632, 2, 6, 5309120, 
94249822928768, 4611686018427388799, 140446649031334, 0, 281470681751456, 0, 0, 
0, 0, 0, 0}}
pid = 
tid = 
ret = 
#1  0x7fbc4867c535 in __GI_abort () at abort.c:79
save_stage = 1
act = {__sigaction_handler = {sa_handler = 0x0, sa_sigaction = 0x0}, 
sa_mask = {__val = {0, 0, 0, 0, 0, 140446646788085, 2, 4062868317402242624, 
7017002064575424051, 94249829619760, 
  7003715557358071819, 674, 12862997737215894016, 140726414547872, 
140726414547888, 140726414548720}}, sa_flags = 22, sa_restorer = 0x7ffd6bf31bb0}
sigs = {__val = {32, 0 }}
#2  0x55b83f88c8b5 in ExceptionalCondition 
(conditionName=conditionName@entry=0x55b83f9d61c8 "resultnum >= 0 && resultnum 
< resultslot->tts_tupleDescriptor->natts", 
errorType=errorType@entry=0x55b83f8e700b "FailedAssertion", 
fileName=fileName@entry=0x55b83f9d5830 "execExprInterp.c", 
lineNumber=lineNumber@entry=674) at assert.c:69
No locals.
#3  0x55b83f58ea5f in ExecInterpExpr (state=0x55b840b0ac18, 
econtext=0x55b840b0aa10, isnull=0x7ffd6bf31fff) at execExprInterp.c:674
resultnum = 
op = 0x55b840b0adb0
resultslot = 0x55b840b0ab40
innerslot = 0x55b840b08660
outerslot = 0x0
scanslot = 0x0
dispatch_table = {0x55b83f58e5c4 , 0x55b83f58e5e4 
, 0x55b83f58e616 , 0x55b83f58e648 
, 0x55b83f58e67a , 
  0x55b83f58e6db , 0x55b83f58e73c 
, 0x55b83f58e79d , 0x55b83f58e7bc 
, 0x55b83f58e7db , 
  0x55b83f58e7fa , 0x55b83f58e815 
, 0x55b83f58e8b7 , 0x55b83f58e959 
, 0x55b83f58e9fb , 
  0x55b83f58ea5f , 0x55b83f58eaeb 
, 0x55b83f58eb0c , 0x55b83f58eb39 
, 0x55b83f58eb92 , 
  0x55b83f58ebad , 0x55b83f58ebc8 
, 0x55b83f58ebcf , 0x55b83f58ec16 
, 0x55b83f58ec4c , 
  0x55b83f58ec53 , 0x55b83f58ec9a 
, 0x55b83f58ecd0 , 0x55b83f58ecf5 
, 0x55b83f58ed43 , 
  0x55b83f58ed64 , 0x55b83f58ed9a 
, 0x55b83f58edd0 , 0x55b83f58ee10 
, 0x55b83f58ee3f , 
  0x55b83f58ee6e , 0x55b83f58ee89 
, 0x55b83f58eea4 , 0x55b83f58eecb 
, 0x55b83f58ef0d , 
  0x55b83f58ef4f , 0x55b83f58ef76 
, 0x55b83f58ef91 , 0x55b83f58efac 
, 0x55b83f58efc5 , 
  0x55b83f58f053 , 0x55b83f58f08b 
, 0x55b83f58f1c9 , 0x55b83f58f24a 
, 0x55b83f58f2ba , 
  0x55b83f58f323 , 0x55b83f58f33a 
, 0x55b83f58f345 , 0x55b83f58f35c 
, 0x55b83f58f373 , 
  0x55b83f58f38e , 0x55b83f58f3a5 
, 0x55b83f58f467 , 0x55b83f58f511 
, 0x55b83f58f528 , 
  0x55b83f58f540 , 0x55b83f58f558 
, 0x55b83f58f570 , 0x55b83f58f5a8 
, 0x55b83f58f5aa , 
  0x55b83f58f5aa , 0x55b83f58f00c 
, 0x55b83f58f604 , 0x55b83f58f618 
, 0x55b83f58f5c0 , 
  0x55b83f58f5d8 , 0x55b83f58f5ec 
, 0x55b83f58f62c , 0x55b83f58f643 
, 0x55b83f58f69d , 
  0x55b83f58f6b4 , 0x55b83f58f715 
, 0x55b83f58f730 , 0x55b83f58f75b 
, 0x55b83f58f7d9 , 
  0x55b83f58f829 , 0x55b83f58f874 
, 0x55b83f58f8e4 , 0x55b83f58f9f6 
, 0x55b83f58faef , 
  0x55b83f58fbe1 , 0x55b83f58fd30 
, 0x55b83f58fe61 , 0x55b83f58ff85 
, 0x55b83f58ffa0 , 
  0x55b83f58ffbb }
#4  0x55b83f58ae18 in ExecInterpExprStillValid (state=0x55b840b0ac18, 
econtext=0x55b840b0aa10, isNull=0x7ffd6bf31fff) at execExprInterp.c:1824
No locals.
#5  0x55b83f597f7b in ExecEvalExprSwitchContext (isNull=0x7ffd6bf31fff, 
econtext=0x55b840b0aa10, state=0x55b840b0ac18) at 
../../../src/include/executor/executor.h:339
retDatum = 
oldContext = 0x55b840ad7b40
retDatum = 
oldContext = 
#6  ExecProject (projInfo=0x55b840b0ac10) at 
../../../src/include/executor/executor.h:373
econtext = 0x55b840b0aa10
state = 0x55b840b0ac18
slot = 0x55b840b0ab40
isnull = false
econtext = 
state = 
slot = 
isnull = 

Re: Add checkpoint and redo LSN to LogCheckpointEnd log message

2022-01-11 Thread Julien Rouhaud
Hi,

On Tue, Dec 28, 2021 at 10:56 AM Bharath Rupireddy
 wrote:
>
> attaching v1-0001-XXX from the initial mail again just for the sake of
> completion:

Unfortunately this breaks the cfbot as it tries to apply this patch
too: http://cfbot.cputube.org/patch_36_3474.log.

For this kind of situation I think that the usual solution is to use a
.txt extension to make sure that the cfbot won't try to apply it.




Re: Skipping logical replication transactions on subscriber side

2022-01-11 Thread Masahiko Sawada
On Mon, Jan 10, 2022 at 6:27 PM vignesh C  wrote:
>
> On Fri, Jan 7, 2022 at 11:23 AM Masahiko Sawada  wrote:
> >
> > On Fri, Jan 7, 2022 at 10:04 AM Masahiko Sawada  
> > wrote:
> > >
> > > On Wed, Jan 5, 2022 at 12:31 PM Amit Kapila  
> > > wrote:
> > > >
> > > > On Mon, Dec 27, 2021 at 9:54 AM Masahiko Sawada  
> > > > wrote:
> > > > >
> > > > > On Thu, Dec 16, 2021 at 2:42 PM Masahiko Sawada 
> > > > >  wrote:
> > > > > >
> > > > > > On Thu, Dec 16, 2021 at 2:21 PM Amit Kapila 
> > > > > >  wrote:
> > > > > > >
> > > > > > > On Thu, Dec 16, 2021 at 10:37 AM Masahiko Sawada 
> > > > > > >  wrote:
> > > > > > > >
> > > > > > > > On Thu, Dec 16, 2021 at 11:43 AM Amit Kapila 
> > > > > > > >  wrote:
> > > > > > > > >
> > > > > > > > > I thought we just want to lock before clearing the skip_xid 
> > > > > > > > > something
> > > > > > > > > like take the lock, check if the skip_xid in the catalog is 
> > > > > > > > > the same
> > > > > > > > > as we have skipped, if it is the same then clear it, 
> > > > > > > > > otherwise, leave
> > > > > > > > > it as it is. How will that disallow users to change skip_xid 
> > > > > > > > > when we
> > > > > > > > > are skipping changes?
> > > > > > > >
> > > > > > > > Oh I thought we wanted to keep holding the lock while skipping 
> > > > > > > > changes
> > > > > > > > (changing skip_xid requires acquiring the lock).
> > > > > > > >
> > > > > > > > So if skip_xid is already changed, the apply worker would do
> > > > > > > > replorigin_advance() with WAL logging, instead of committing the
> > > > > > > > catalog change?
> > > > > > > >
> > > > > > >
> > > > > > > Right. BTW, how are you planning to advance the origin? Normally, 
> > > > > > > a
> > > > > > > commit transaction would do it but when we are skipping all 
> > > > > > > changes,
> > > > > > > the commit might not do it as there won't be any transaction id
> > > > > > > assigned.
> > > > > >
> > > > > > I've not tested it yet but replorigin_advance() with wal_log = true
> > > > > > seems to work for this case.
> > > > >
> > > > > I've tested it and realized that we cannot use replorigin_advance()
> > > > > for this purpose without changes. That is, the current
> > > > > replorigin_advance() doesn't allow to advance the origin by the owner:
> > > > >
> > > > > /* Make sure it's not used by somebody else */
> > > > > if (replication_state->acquired_by != 0)
> > > > > {
> > > > > ereport(ERROR,
> > > > > (errcode(ERRCODE_OBJECT_IN_USE),
> > > > >  errmsg("replication origin with OID %d is already
> > > > > active for PID %d",
> > > > > replication_state->roident,
> > > > > replication_state->acquired_by)));
> > > > > }
> > > > >
> > > > > So we need to change it so that the origin owner can advance its
> > > > > origin, which makes sense to me.
> > > > >
> > > > > Also, when we have to update the origin instead of committing the
> > > > > catalog change while updating the origin, we cannot record the origin
> > > > > timestamp.
> > > > >
> > > >
> > > > Is it because we currently update the origin timestamp with commit 
> > > > record?
> > >
> > > Yes.
> > >
> > > >
> > > > > This behavior makes sense to me because we skipped the
> > > > > transaction. But ISTM it’s not good if we emit the origin timestamp
> > > > > only when directly updating the origin. So probably we need to always
> > > > > omit origin timestamp.
> > > > >
> > > >
> > > > Do you mean to say that you want to omit it even when we are
> > > > committing the changes?
> > >
> > > Yes, it would be better to record only origin lsn in terms of consistency.
> > >
> > > >
> > > > > Apart from that, I'm vaguely concerned that the logic seems to be
> > > > > getting complex. Probably it comes from the fact that we store
> > > > > skip_xid in the catalog and update the catalog to clear/set the
> > > > > skip_xid. It might be worth revisiting the idea of storing skip_xid on
> > > > > shmem (e.g., ReplicationState)?
> > > > >
> > > >
> > > > IIRC, the problem with that idea was that we won't remember skip_xid
> > > > information after server restart and the user won't even know that it
> > > > has to set it again.
> > >
> > > Right, I agree that it’s not convenient when the server restarts or
> > > crashes, but these problems could not be critical in the situation
> > > where users have to use this feature; the subscriber already entered
> > > an error loop so they can know xid again and it’s an uncommon case
> > > that they need to restart during skipping changes.
> > >
> > > Anyway, I'll submit an updated patch soon so we can discuss complexity
> > > vs. convenience.
> >
> > Attached an updated patch. Please review it.

Thank you for the comments!

>
> Thanks for the updated patch, few comments:
> 1) Should this be case insensitive to support NONE too:
> +   /* Setting xid = NONE is treated as 

Re: Skipping logical replication transactions on subscriber side

2022-01-11 Thread Masahiko Sawada
On Wed, Jan 12, 2022 at 12:21 PM Amit Kapila  wrote:
>
> On Wed, Jan 12, 2022 at 5:49 AM Masahiko Sawada  wrote:
> >
> > On Tue, Jan 11, 2022 at 7:08 PM Amit Kapila  wrote:
> > >
> > > On Tue, Jan 11, 2022 at 1:51 PM Masahiko Sawada  
> > > wrote:
> > > >
> > > > On second thought, the same is true for other cases, for example,
> > > > preparing the transaction and clearing skip_xid while handling a
> > > > prepare message. That is, currently we don't clear skip_xid while
> > > > handling a prepare message but do that while handling commit/rollback
> > > > prepared message, in order to avoid the worst case. If we do both
> > > > while handling a prepare message and the server crashes between them,
> > > > it ends up that skip_xid is cleared and the transaction will be
> > > > resent, which is identical to the worst-case above.
> > > >
> > >
> > > How are you thinking to update the skip xid before prepare? If we do
> > > it in the same transaction then the changes in the catalog will be
> > > part of the prepared xact but won't be committed. Now, say if we do it
> > > after prepare, then the situation won't be the same because after
> > > restart the same xact won't appear again.
> >
> > I was thinking to commit the catalog change first in a separate
> > transaction while not updating origin LSN and then prepare an empty
> > transaction while updating origin LSN.
> >
>
> But, won't it complicate the handling if in the future we try to
> enhance this API such that it skips partial changes like skipping only
> for particular relation(s) or particular operations as discussed
> previously in this thread?

Right. I was thinking that if we accept the situation where the user
has to set skip_xid again in case the server crashes, we might also be
able to accept the situation where the user has to clear skip_xid in
case the server crashes. But it seems the former is less problematic.

I've attached an updated patch that incorporates all the comments I got so far.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/


v3-0001-Add-ALTER-SUBSCRIPTION-.-SKIP-to-skip-the-transac.patch
Description: Binary data


Re: ResourceOwner refactoring

2022-01-11 Thread Julien Rouhaud
Hi,

On Fri, Nov 26, 2021 at 8:41 PM Aleksander Alekseev
 wrote:
>
> > I will submit the actual code review in the follow-up email
>
> The patchset is in a good shape. I'm changing the status to "Ready for
> Committer".

The 2nd patch doesn't apply anymore due to a conflict on
resowner_private.h: http://cfbot.cputube.org/patch_36_3364.log.

I'm switching the entry to Waiting on Author.




Re: [PATCH] Full support for index LP_DEAD hint bits on standby

2022-01-11 Thread Julien Rouhaud
Hi,

On Wed, Nov 10, 2021 at 3:06 AM Michail Nikolaev
 wrote:
>
> > Attached is a proposal for a minor addition that would make sense to me, add
> > it if you think it's appropriate.
>
> Added. Also, I updated the documentation a little.
>
> > I have changed approach, so it is better to start from this email:
>
> Oops, I was thinking the comments feature in the commitfest app works
> in a different way :)

The cfbot reports that this patch is currently failing at least on
Linux and Windows, e.g. https://cirrus-ci.com/task/6532060239101952.

I'm switching this patch on Waiting on Author.




2022-01 Commitfest

2022-01-11 Thread Julien Rouhaud
Hi,

The January commitfest should have started almost two weeks ago, but given that
nothing happened until now I think that it's safe to assume that either
everyone forgot or no one wanted to volunteer.

I'm therefore volunteering to manage this commitfest, although since it's
already quite late it's probably going to be a bit chaotic and best-effort,
but it's better than nothing.

As of today, there is a total of 292 patches for this commitfest and 240
still-active patches, 15 of which have been around for 10 or more commitfests.

Status summary:

- Needs review: 190.
- Waiting on Author: 23.
- Ready for Committer: 27.
- Committed: 43.
- Returned with Feedback: 1.
- Withdrawn: 7.
- Rejected: 1.

Note that I don't have admin permissions on the cf app, so I'd be glad if
someone could grant them!




Re: row filtering for logical replication

2022-01-11 Thread Peter Smith
Here are my review comments for v62-0001

~~~

1. src/backend/catalog/pg_publication.c - GetTopMostAncestorInPublication

@@ -276,17 +276,46 @@ GetPubPartitionOptionRelations(List *result,
PublicationPartOpt pub_partopt,
 }

 /*
+ * Check if any of the ancestors are published in the publication. If so,
+ * return the relid of the topmost ancestor that is published via this
+ * publication, otherwise InvalidOid.
+ */

The GetTopMostAncestorInPublication function header comment seems to
be saying the same thing twice. I think it can be simplified

Suggested function comment:

"Return the relid of the topmost ancestor that is published via this
publication, otherwise return InvalidOid."

~~~

2. src/backend/commands/publicationcmds.c - AlterPublicationTables

- /* Calculate which relations to drop. */
+ /*
+ * In order to recreate the relation list for the publication, look
+ * for existing relations that need not be dropped.
+ */

Suggested minor rewording of comment:

"... look for existing relations that do not need to be dropped."

~~~

3. src/backend/commands/publicationcmds.c - AlterPublicationTables

+
+ /*
+ * Look if any of the new set of relations match with the
+ * existing relations in the publication. Additionally, if the
+ * relation has an associated where-clause, check the
+ * where-clauses also match. Drop the rest.
+ */
  if (RelationGetRelid(newpubrel->relation) == oldrelid)

Suggested minor rewording of comment:

"Look if any..." --> "Check if any..."

~~~

4. src/backend/executor/execReplication.c - CheckCmdReplicaIdentity

+ /*
+ * It is only safe to execute UPDATE/DELETE when all columns referenced in
+ * the row filters from publications which the relation is in are valid,
+ * which means all referenced columns are part of REPLICA IDENTITY, or the
+ * table does not publish UPDATES or DELETES.
+ */

Suggested minor rewording of comment:

"... in are valid, which means all ..." --> "... in are valid - i.e.
when all ..."

~~~

5. src/backend/parser/gram.y - ColId OptWhereClause

+ /*
+ * The OptWhereClause must be stored here but it is
+ * valid only for tables. If the ColId was mistakenly
+ * not a table this will be detected later in
+ * preprocess_pubobj_list() and an error is thrown.
+ */

Suggested minor rewording of comment:

"... and an error is thrown." --> "... and an error will be thrown."
~~~

6. src/backend/replication/pgoutput/pgoutput.c - pgoutput_row_filter

+ * ALL TABLES IN SCHEMA implies "don't use row filter expression" if
+ * the schema is the same as the table schema.
+ */
+ foreach(lc, data->publications)
+ {
+ Publication *pub = lfirst(lc);
+ HeapTuple rftuple = NULL;
+ Datum rfdatum = 0;
+ bool rfisnull;
+ bool pub_no_filter = false;
+ List*schemarelids = NIL;

Not all of these variables need to be declared at the top of the loop
like this. Consider moving some of them (e.g. rfisnull, schemarelids)
lower down to declare only in the scope that uses them.

~~~

7. src/backend/replication/pgoutput/pgoutput.c - pgoutput_row_filter

+ if (pub->pubactions.pubinsert)
+ no_filter[idx_ins] = true;
+ if (pub->pubactions.pubupdate)
+ no_filter[idx_upd] = true;
+ if (pub->pubactions.pubdelete)
+ no_filter[idx_del] = true;

This code can be simplified I think. e.g.

no_filter[idx_ins] |= pub->pubactions.pubinsert;
no_filter[idx_upd] |= pub->pubactions.pubupdate;
no_filter[idx_del] |= pub->pubactions.pubdelete;

~~~

8. src/backend/replication/pgoutput/pgoutput.c - get_rel_sync_entry

@@ -1245,9 +1650,6 @@ get_rel_sync_entry(PGOutputData *data, Oid relid)
  entry->pubactions.pubtruncate |= pub->pubactions.pubtruncate;
  }

- if (entry->pubactions.pubinsert && entry->pubactions.pubupdate &&
- entry->pubactions.pubdelete && entry->pubactions.pubtruncate)
- break;
  }

I was not sure why that code was removed. Is it deliberate/correct?

~~~

9. src/backend/utils/cache/relcache.c - rowfilter_column_walker

@@ -5521,39 +5535,98 @@ RelationGetExclusionInfo(Relation indexRelation,
  MemoryContextSwitchTo(oldcxt);
 }

+
+
 /*
- * Get publication actions for the given relation.
+ * Check if any columns used in the row filter WHERE clause are not part of
+ * REPLICA IDENTITY and save the invalid column number in
+ * rf_context::invalid_rfcolnum.
  */

There is an extra blank line before the function comment.

~~~

10. src/backend/utils/cache/relcache.c - GetRelationPublicationInfo

+ if (HeapTupleIsValid(rftuple))
+ {
+ Datum rfdatum;
+ bool rfisnull;
+ Node*rfnode;
+
+ context.pubviaroot = pubform->pubviaroot;
+ context.parentid = publish_as_relid;
+ context.relid = relid;
+
+ rfdatum = SysCacheGetAttr(PUBLICATIONRELMAP, rftuple,
+   Anum_pg_publication_rel_prqual,
+   &rfisnull);
+
+ if (!rfisnull)
+ {
+ rfnode = stringToNode(TextDatumGetCString(rfdatum));
+ rfcol_valid = !rowfilter_column_walker(rfnode, &context);
+ invalid_rfcolnum = context.invalid_rfcolnum;
+ pfree(rfnode);
+ }
+
+ ReleaseSysCache(rftuple);
+ }

Those 3 assignments to the context.pubviaroot/parentid/relid can be
moved to be 

Re: pg_upgrade parallelism

2022-01-11 Thread Jaime Casanova
On Wed, Nov 17, 2021 at 08:04:41PM +, Jacob Champion wrote:
> On Wed, 2021-11-17 at 14:44 -0500, Jaime Casanova wrote:
> > I'm trying to add more parallelism by copying individual segments
> > of a relfilenode in different processes. Does anyone one see a big
> > problem in trying to do that? I'm asking because no one did it before,
> > that could not be a good sign.
> 
> I looked into speeding this up a while back, too. For the use case I
> was looking at -- Greenplum, which has huge numbers of relfilenodes --
> spinning disk I/O was absolutely the bottleneck and that is typically
> not easily parallelizable. (In fact I felt at the time that Andres'
> work on async I/O might be a better way forward, at least for some
> filesystems.)
> 
> But you mentioned that you were seeing disks that weren't saturated, so
> maybe some CPU optimization is still valuable? I am a little skeptical
> that more parallelism is the way to do that, but numbers trump my
> skepticism.
> 

Sorry for being unresponsive for so long. I did add a new --jobs-per-disk
option; this is a simple patch I made for the customer, and I ignored all
the WIN32 parts because I don't know anything about that area. I was
planning to complete that part, but it has been in the same state for two
months now.

AFAIU, it seems there is a different struct for the parameters of the
function that will be called on the thread.

I also decided to create a new reap_*_child() function for use with
the new parameter.

Now, the customer went from copying 25 TB in 6 hours to 4h 45min, which
is an improvement of about 20%!


> > - why we read()/write() at all? is not a faster way of copying the file?
> >   i'm asking that because i don't actually know.
> 
> I have idly wondered if something based on splice() would be faster,
> but I haven't actually tried it.
> 

I tried it and got no better result.
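
For anyone curious, a minimal sketch of what a splice()-based copy could
look like on Linux follows (copy_via_splice is a hypothetical helper, not
part of the patch; splice() needs a pipe on one side, so data is moved
source -> pipe -> destination):

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* Copy src_fd to dst_fd through a pipe; returns 0 on success, -1 on error. */
static int
copy_via_splice(int src_fd, int dst_fd)
{
	int			pipefd[2];
	int			rc = -1;

	if (pipe(pipefd) < 0)
		return -1;

	for (;;)
	{
		/* Pull up to 1MB from the source file into the pipe. */
		ssize_t		nread = splice(src_fd, NULL, pipefd[1], NULL,
								   1024 * 1024, SPLICE_F_MOVE);

		if (nread < 0)
			goto done;
		if (nread == 0)
		{
			rc = 0;				/* EOF: copy complete */
			goto done;
		}

		/* Push everything that entered the pipe out to the destination. */
		while (nread > 0)
		{
			ssize_t		nwritten = splice(pipefd[0], NULL, dst_fd, NULL,
										  (size_t) nread, SPLICE_F_MOVE);

			if (nwritten <= 0)
				goto done;
			nread -= nwritten;
		}
	}

done:
	close(pipefd[0]);
	close(pipefd[1]);
	return rc;
}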

> But there is now support for copy-on-write with the clone mode, isn't
> there? Or are you not able to take advantage of it?
> 

That's sadly not possible because those are different disks, and yes, I
know that's something that pg_upgrade normally doesn't allow, but it is
not difficult to make it happen.

-- 
Jaime Casanova
Director de Servicios Profesionales
SystemGuards - Consultores de PostgreSQL
>From 0d04f79cb51d6be0ced9c6561cfca5bfe18c4bdd Mon Sep 17 00:00:00 2001
From: Jaime Casanova 
Date: Wed, 15 Dec 2021 12:14:44 -0500
Subject: [PATCH] Add --jobs-per-disk option to allow multiple processes per
 tablespace

This option is independent of the --jobs one. It will fork new processes
to copy the different segments of a relfilenode in parallel.
---
 src/bin/pg_upgrade/option.c  |  8 ++-
 src/bin/pg_upgrade/parallel.c| 93 
 src/bin/pg_upgrade/pg_upgrade.h  |  4 ++
 src/bin/pg_upgrade/relfilenode.c | 59 +++-
 4 files changed, 139 insertions(+), 25 deletions(-)

diff --git a/src/bin/pg_upgrade/option.c b/src/bin/pg_upgrade/option.c
index 66fe16964e..46b1913a42 100644
--- a/src/bin/pg_upgrade/option.c
+++ b/src/bin/pg_upgrade/option.c
@@ -54,6 +54,7 @@ parseCommandLine(int argc, char *argv[])
 		{"link", no_argument, NULL, 'k'},
 		{"retain", no_argument, NULL, 'r'},
 		{"jobs", required_argument, NULL, 'j'},
+		{"jobs-per-disks", required_argument, NULL, 'J'},
 		{"socketdir", required_argument, NULL, 's'},
 		{"verbose", no_argument, NULL, 'v'},
 		{"clone", no_argument, NULL, 1},
@@ -103,7 +104,7 @@ parseCommandLine(int argc, char *argv[])
 	if (os_user_effective_id == 0)
 		pg_fatal("%s: cannot be run as root\n", os_info.progname);
 
-	while ((option = getopt_long(argc, argv, "d:D:b:B:cj:kNo:O:p:P:rs:U:v",
+	while ((option = getopt_long(argc, argv, "d:D:b:B:cj:J:kNo:O:p:P:rs:U:v",
  long_options, )) != -1)
 	{
 		switch (option)
@@ -132,6 +133,10 @@ parseCommandLine(int argc, char *argv[])
 user_opts.jobs = atoi(optarg);
 break;
 
+			case 'J':
+user_opts.jobs_per_disk = atoi(optarg);
+break;
+
 			case 'k':
 user_opts.transfer_mode = TRANSFER_MODE_LINK;
 break;
@@ -291,6 +296,7 @@ usage(void)
 	printf(_("  -d, --old-datadir=DATADIR old cluster data directory\n"));
 	printf(_("  -D, --new-datadir=DATADIR new cluster data directory\n"));
 	printf(_("  -j, --jobs=NUMnumber of simultaneous processes or threads to use\n"));
+	printf(_("  -J, --jobs_per_disk=NUM   number of simultaneous processes or threads to use per tablespace\n"));
 	printf(_("  -k, --linklink instead of copying files to new cluster\n"));
 	printf(_("  -N, --no-sync do not wait for changes to be written safely to disk\n"));
 	printf(_("  -o, --old-options=OPTIONS old cluster options to pass to the server\n"));
diff --git a/src/bin/pg_upgrade/parallel.c b/src/bin/pg_upgrade/parallel.c
index ee7364da3b..82f698a9ab 100644
--- a/src/bin/pg_upgrade/parallel.c
+++ b/src/bin/pg_upgrade/parallel.c
@@ -17,6 +17,9 @@
 #include "pg_upgrade.h"
 
 static int	parallel_jobs;
+static int	

Re: Can there ever be out of sequence WAL files?

2022-01-11 Thread Julien Rouhaud
On Wed, Jan 12, 2022 at 01:10:25PM +0900, Michael Paquier wrote:
> 
> xlog.c can be a good read to check the assumptions WAL replay relies
> on, with things like CheckRecoveryConsistency() or
> reachedConsistency.

That should only stand for WAL that is expected to be missing, right?  For
something unexpected, it should fail in XLogReadRecord() when trying to fetch
a missing block?




Re: [EXTERNAL] Re: PQcancel does not use tcp_user_timeout, connect_timeout and keepalive settings

2022-01-11 Thread Tom Lane
... btw, speaking of signal-safe functions: I am dismayed to
notice that strerror (and strerror_r) are *not* in POSIX's
list of async-signal-safe functions.  This is really quite
unsurprising, considering that they are chartered to return
locale-dependent strings.  Unless the data has already been
collected in the current process, that'd imply reading something
from the locale definition files, allocating memory to hold it,
etc.

So I'm now thinking this bit in PQcancel is completely unsafe:

strncat(errbuf, SOCK_STRERROR(SOCK_ERRNO, sebuf, sizeof(sebuf)),
maxlen);

Seems we have to give up on providing any details beyond the
name of the function that failed.
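
Roughly along these lines, perhaps (a sketch only; append_cancel_error is a
made-up name, and it assumes libpgport's strlcat() for the bounded append):

static void
append_cancel_error(char *errbuf, int maxlen, const char *funcname)
{
	/* Append only fixed text -- the failing function's name -- and skip
	 * strerror()/strerror_r() and any other locale machinery entirely. */
	strlcat(errbuf, funcname, maxlen);
	strlcat(errbuf, "() failed\n", maxlen);
}

Call sites would then just do append_cancel_error(errbuf, maxlen, "connect")
and the like.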

regards, tom lane




Re: Can there ever be out of sequence WAL files?

2022-01-11 Thread Michael Paquier
On Wed, Jan 12, 2022 at 10:18:11AM +0800, Julien Rouhaud wrote:
> On Wed, Jan 12, 2022 at 07:19:48AM +0530, Bharath Rupireddy wrote:
>>> Can the postgres server ever have/generate out of sequence WAL files?
>>> For instance, 0001020C00A2, 0001020C00A3,
>>> 0001020C00A5 and so on, missing 0001020C00A4.
>>> Manual/Accidental deletion of the WAL files can happen, but are there
>>> any other extreme situations (like recycling, removing old WAL files
>>> etc.) caused by the postgres server leading to missing WAL files?
> 
> By definition there shouldn't be such situation, as it would otherwise be a
> (critical) bug.

I have seen that in the past, in cases where a system got harshly
unplugged and then replugged, and a segment file flush went missing.  But
that was just a flaky system; Postgres just relied on something that was
wrong.  So the answer is that this should not happen.

>>> What happens when postgres server finds missing WAL file during
>>> crash/standby recovery?
> 
> The recovery should fail.

xlog.c can be a good read to check the assumptions WAL replay relies
on, with things like CheckRecoveryConsistency() or
reachedConsistency.
--
Michael


signature.asc
Description: PGP signature


Re: pg_dump/restore --no-tableam

2022-01-11 Thread Justin Pryzby
On Tue, Jan 11, 2022 at 04:50:23PM +0900, Michael Paquier wrote:
> On Mon, Jan 03, 2022 at 03:44:24PM -0600, Justin Pryzby wrote:
> > + 
> > + 
> > +  --no-table-am
> > +  
> > +   
> > +Do not output commands to select table access methods.
> > +With this option, all objects will be created with whichever
> > +table access method is the default during restore.
> > +   
> 
> Hmm.  --no-table-am may not be the best choice.  Should this be called
> --no-table-access-method instead?

I suppose you're right - I had previously renamed it from no-tableam.

> > -   no_toast_compression => {
> > -   dump_cmd => [
> > -   'pg_dump', '--no-sync',
> > -   "--file=$tempdir/no_toast_compression.sql",
> > -   '--no-toast-compression', 'postgres',
> > -   ],
> > -   },
> 
> Why is this command moved down?

Because it looks like this is intended to be mostly alphabetical, but that
wasn't preserved by 63db0ac3f.  It's most apparent in "my %full_runs".

The same could be said of no-privs, defaults_custom_format, pg_dumpall_globals,
section_data, but they've been that way forever.

-- 
Justin
>From 38da310269c1959994f2d7ab1cec9614209b60bb Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Sun, 7 Mar 2021 19:35:37 -0600
Subject: [PATCH] Add pg_dump/restore --no-table-access-method..

This was for some reason omitted from 3b925e905.
---
 doc/src/sgml/ref/pg_dump.sgml| 17 ++
 doc/src/sgml/ref/pg_restore.sgml | 11 +
 src/bin/pg_dump/pg_backup.h  |  2 ++
 src/bin/pg_dump/pg_backup_archiver.c | 12 ++
 src/bin/pg_dump/pg_dump.c|  3 +++
 src/bin/pg_dump/pg_restore.c |  4 
 src/bin/pg_dump/t/002_pg_dump.pl | 34 ++--
 7 files changed, 71 insertions(+), 12 deletions(-)

diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml
index 0e1cfe0f8d6..2f0042fd968 100644
--- a/doc/src/sgml/ref/pg_dump.sgml
+++ b/doc/src/sgml/ref/pg_dump.sgml
@@ -931,6 +931,23 @@ PostgreSQL documentation
   
  
 
+ 
+  --no-table-access-method
+  
+   
+Do not output commands to select table access methods.
+With this option, all objects will be created with whichever
+table access method is the default during restore.
+   
+
+   
+This option is ignored when emitting an archive (non-text) output
+file.  For the archive formats, you can specify the option when you
+call pg_restore.
+   
+  
+ 
+
  
   --no-tablespaces
   
diff --git a/doc/src/sgml/ref/pg_restore.sgml b/doc/src/sgml/ref/pg_restore.sgml
index 93ea937ac8e..526986eadb1 100644
--- a/doc/src/sgml/ref/pg_restore.sgml
+++ b/doc/src/sgml/ref/pg_restore.sgml
@@ -649,6 +649,17 @@ PostgreSQL documentation
   
  
 
+ 
+  --no-table-access-method
+  
+   
+Do not output commands to select table access methods.
+With this option, all objects will be created with whichever
+access method is the default during restore.
+   
+  
+ 
+
  
   --no-tablespaces
   
diff --git a/src/bin/pg_dump/pg_backup.h b/src/bin/pg_dump/pg_backup.h
index 753252e05e0..47741ad6406 100644
--- a/src/bin/pg_dump/pg_backup.h
+++ b/src/bin/pg_dump/pg_backup.h
@@ -93,6 +93,7 @@ typedef struct _restoreOptions
 {
 	int			createDB;		/* Issue commands to create the database */
 	int			noOwner;		/* Don't try to match original object owner */
+	int			noTableAm;	/* Don't issue tableAM-related commands */
 	int			noTablespace;	/* Don't issue tablespace-related commands */
 	int			disable_triggers;	/* disable triggers during data-only
 	 * restore */
@@ -179,6 +180,7 @@ typedef struct _dumpOptions
 	int			no_unlogged_table_data;
 	int			serializable_deferrable;
 	int			disable_triggers;
+	int			outputNoTableAm;
 	int			outputNoTablespaces;
 	int			use_setsessauth;
 	int			enable_row_security;
diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c
index 8903a694ae9..49bf0907cd2 100644
--- a/src/bin/pg_dump/pg_backup_archiver.c
+++ b/src/bin/pg_dump/pg_backup_archiver.c
@@ -194,6 +194,7 @@ dumpOptionsFromRestoreOptions(RestoreOptions *ropt)
 	dopt->outputSuperuser = ropt->superuser;
 	dopt->outputCreateDB = ropt->createDB;
 	dopt->outputNoOwner = ropt->noOwner;
+	dopt->outputNoTableAm = ropt->noTableAm;
 	dopt->outputNoTablespaces = ropt->noTablespace;
 	dopt->disable_triggers = ropt->disable_triggers;
 	dopt->use_setsessauth = ropt->use_setsessauth;
@@ -3171,6 +3172,11 @@ _reconnectToDB(ArchiveHandle *AH, const char *dbname)
 	if (AH->currSchema)
 		free(AH->currSchema);
 	AH->currSchema = NULL;
+
+	if (AH->currTableAm)
+		free(AH->currTableAm);
+	AH->currTableAm = NULL;
+
 	if (AH->currTablespace)
 		free(AH->currTablespace);
 	AH->currTablespace = NULL;
@@ -3340,10 +3346,15 @@ 

Re: pg_upgrade should truncate/remove its logs before running

2022-01-11 Thread Justin Pryzby
On Wed, Jan 12, 2022 at 12:59:54PM +0900, Michael Paquier wrote:
> On Tue, Jan 11, 2022 at 02:03:07PM -0600, Justin Pryzby wrote:
> > There's no reason not to.  We created the dir, and the user didn't specify 
> > to
> > preserve it.  It'd be their fault if they put something valuable there after
> > starting pg_upgrade.
> 
> This is a path for the data internal to pg_upgrade.  My take is that
> the code simplifications the new option brings are more valuable than
> this assumption, which I guess is unlikely to happen.  I may be wrong,
> of course.  By the way, while thinking about that, should we worry
> about --logdir="."?

I asked about that before.  Right now, it'll exit(1) when mkdir fails.

I had written a patch to allow "." by skipping mkdir (or allowing it to fail if
errno == EEXIST), but it seems like an awfully bad idea to try to make that
work with rmtree().
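
(For clarity, the EEXIST-tolerant approach could look roughly like this -- a
sketch only, with an illustrative log_opts.logdir name, not the actual patch:)

	/* Sketch: tolerate a pre-existing log directory instead of failing. */
	if (mkdir(log_opts.logdir, pg_dir_create_mode) < 0 && errno != EEXIST)
		pg_fatal("could not create directory \"%s\": %s\n",
				 log_opts.logdir, strerror(errno));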

-- 
Justin




Re: pg_upgrade should truncate/remove its logs before running

2022-01-11 Thread Michael Paquier
On Tue, Jan 11, 2022 at 02:03:07PM -0600, Justin Pryzby wrote:
> I added mkdir() before the other stuff that messes with logfiles, because it
> needs to happen before that.
> 
> Are you suggesting to change the pre-existing behavior of when logfiles are
> created, like 0002 ?

Yes, something like that.

> There's no reason not to.  We created the dir, and the user didn't specify to
> preserve it.  It'd be their fault if they put something valuable there after
> starting pg_upgrade.

This is a path for the data internal to pg_upgrade.  My take is that
the code simplifications the new option brings are more valuable than
this assumption, which I guess is unlikely to happen.  I may be wrong,
of course.  By the way, while thinking about that, should we worry
about --logdir="."?
--
Michael


signature.asc
Description: PGP signature


Re: Improve error handling of HMAC computations and SCRAM

2022-01-11 Thread Michael Paquier
On Tue, Jan 11, 2022 at 11:08:59AM +0300, Sergey Shinderuk wrote:
> Yeah, that's better.  I thought "providing errors about an error" was a
> typo, but now I see the same comment was committed in b69aba745.  Is it just
> me? :)

It is not only you :)  I have applied a fix to correct the comments on
HEAD and REL_14_STABLE.

Attached is a rebased patch for the HMAC portions, with a couple of
fixes I noticed while going through this stuff again (mostly around
SASLprep and pg_fe_scram_build_secret), and a fix for a conflict
coming from 9cb5518.  psql's \password is wrong to assume that the
only error that can happen for scran-sha-256 is an OOM, but we'll get
there.
--
Michael


signature.asc
Description: PGP signature


Re: Skipping logical replication transactions on subscriber side

2022-01-11 Thread Amit Kapila
On Wed, Jan 12, 2022 at 5:49 AM Masahiko Sawada  wrote:
>
> On Tue, Jan 11, 2022 at 7:08 PM Amit Kapila  wrote:
> >
> > On Tue, Jan 11, 2022 at 1:51 PM Masahiko Sawada  
> > wrote:
> > >
> > > On second thought, the same is true for other cases, for example,
> > > preparing the transaction and clearing skip_xid while handling a
> > > prepare message. That is, currently we don't clear skip_xid while
> > > handling a prepare message but do that while handling commit/rollback
> > > prepared message, in order to avoid the worst case. If we do both
> > > while handling a prepare message and the server crashes between them,
> > > it ends up that skip_xid is cleared and the transaction will be
> > > resent, which is identical to the worst-case above.
> > >
> >
> > How are you thinking to update the skip xid before prepare? If we do
> > it in the same transaction then the changes in the catalog will be
> > part of the prepared xact but won't be committed. Now, say if we do it
> > after prepare, then the situation won't be the same because after
> > restart the same xact won't appear again.
>
> I was thinking to commit the catalog change first in a separate
> transaction while not updating origin LSN and then prepare an empty
> transaction while updating origin LSN.
>

But, won't it complicate the handling if in the future we try to
enhance this API such that it skips partial changes like skipping only
for particular relation(s) or particular operations as discussed
previously in this thread?

-- 
With Regards,
Amit Kapila.




Re: do only critical work during single-user vacuum?

2022-01-11 Thread Justin Pryzby
On Tue, Jan 11, 2022 at 07:58:56PM -0500, John Naylor wrote:
> + // FIXME: also check reloption
> + // WIP: 95% is a starting point for discussion
> + if ((table_xid_age < autovacuum_freeze_max_age * 0.95) ||
> + (table_mxid_age < autovacuum_multixact_freeze_max_age * 
> 0.95))
> + continue;

Should be &&
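
i.e., assuming the intent is to skip a table only when both ages are below
the cutoff:

	if ((table_xid_age < autovacuum_freeze_max_age * 0.95) &&
		(table_mxid_age < autovacuum_multixact_freeze_max_age * 0.95))
		continue;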

Should this emergency vacuum "order by age() DESC"?

-- 
Justin




Re: Can there ever be out of sequence WAL files?

2022-01-11 Thread Julien Rouhaud
On Wed, Jan 12, 2022 at 07:19:48AM +0530, Bharath Rupireddy wrote:
> >
> > Can the postgres server ever have/generate out of sequence WAL files?
> > For instance, 0001020C00A2, 0001020C00A3,
> > 0001020C00A5 and so on, missing 0001020C00A4.
> > Manual/Accidental deletion of the WAL files can happen, but are there
> > any other extreme situations (like recycling, removing old WAL files
> > etc.) caused by the postgres server leading to missing WAL files?

By definition there shouldn't be such a situation, as it would otherwise be a
(critical) bug.

> > What happens when postgres server finds missing WAL file during
> > crash/standby recovery?

The recovery should fail.




Re: do only critical work during single-user vacuum?

2022-01-11 Thread Peter Geoghegan
On Tue, Jan 11, 2022 at 4:59 PM John Naylor
 wrote:
> I've attached a PoC *untested* patch to show what it would look like
> as a top-level statement. If the "shape" is uncontroversial, I'll put
> work into testing it and fleshing it out.

Great!

> For the PoC I wanted to try re-using existing keywords. I went with
> "VACUUM LIMIT" since LIMIT is already a keyword that cannot be used as
> a table name. It also brings "wraparound limit" to mind. We could add
> a single-use unreserved keyword (such as VACUUM_MINIMAL or
> VACUUM_FAST), but that doesn't seem great.

This seems reasonable, but you could add a new option instead, without
much downside. While INDEX_CLEANUP kind of looks like a keyword, it
isn't really a keyword. (Perhaps you knew this already.)

Making this a new option is a little awkward, admittedly. It's not
clear what it means to "VACUUM (LIMIT) my_table" -- do you just throw
an error for stuff like that? So perhaps your approach of adding
VacuumMinimalStmt (a minimal variant of the VACUUM command) is better.

> I'm not sure what the right trade-off is, but as written I used 95% of
> max age. It might be undesirable to end up so close to kicking off
> uninterruptible vacuums, but the point is to get out of single-user
> mode and back to streaming WAL as quickly as possible. It might also
> be worth overriding the min ages as well, but haven't done so here.

I wonder if we should keep autovacuum_freeze_max_age out of it -- its
default is too conservative in general. I'm concerned that applying
this autovacuum_freeze_max_age test during VACUUM LIMIT doesn't go far
enough -- it may require VACUUM LIMIT to do significantly more work
than is needed to get the system back online (while leaving a sensible
amount of headroom). Also seems like it might be a good idea to avoid
relying on the user configuration, given that VACUUM LIMIT is only run
when everything is already in disarray. (Besides, it's not clear that
it's okay to use the autovacuum_freeze_max_age GUC without also using
the reloption of the same name.)

What do you think of applying a similar test using a generic 1 billion
XID (and 1 billion MXID) age cutoff? When VACUUM LIMIT is run, we've
already learned that the *entire* XID space wasn't sufficient for the
user workload, so we're not really in a position to promise much.
Often the real problem will be something like a leaked replication
slot, or application code that's seriously misbehaving. It's really
the DBA's job to *keep* the system up. VACUUM LIMIT is just a tool
that allows the DBA to do this without excessive downtime.
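
To make that concrete, roughly what I have in mind is something like the
following (just a sketch -- the names and exact cutoff values are made up,
and note that the test has to treat the XID and MXID ages independently,
i.e. vacuum the table if *either* one is past its cutoff):

/* Hypothetical sketch only; names and values are placeholders. */
#define MINVAC_XID_AGE_CUTOFF   1000000000  /* ~1 billion XIDs */
#define MINVAC_MXID_AGE_CUTOFF  1000000000  /* ~1 billion multixacts */

static bool
table_needs_minimal_vacuum(int32 table_xid_age, int32 table_mxid_age)
{
    /* vacuum if either age has crossed its cutoff */
    return table_xid_age >= MINVAC_XID_AGE_CUTOFF ||
           table_mxid_age >= MINVAC_MXID_AGE_CUTOFF;
}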

The GetNewTransactionId() WARNINGs ought to be changed to reference
VACUUM LIMIT. (You probably just didn't get around to that in this
POC, but couldn't hurt to remind you.)

-- 
Peter Geoghegan




Re: Can there ever be out of sequence WAL files?

2022-01-11 Thread Bharath Rupireddy
On Tue, Dec 28, 2021 at 7:45 AM Bharath Rupireddy
 wrote:
>
> Hi,
>
> Can the postgres server ever have/generate out of sequence WAL files?
> For instance, 0001020C00A2, 0001020C00A3,
> 0001020C00A5 and so on, missing 0001020C00A4.
> Manual/Accidental deletion of the WAL files can happen, but are there
> any other extreme situations (like recycling, removing old WAL files
> etc.) caused by the postgres server leading to missing WAL files?
>
> What happens when postgres server finds missing WAL file during
> crash/standby recovery?
>
> Thoughts?

Hi Hackers, a gentle ping for the above question. I think I sent it
earlier during the holiday season.

Regards,
Bharath Rupireddy.




Re: [Ext:] Re: Stream Replication not working

2022-01-11 Thread Kyotaro Horiguchi
Hi.

At Tue, 11 Jan 2022 15:05:55 +, Allie Crawford 
 wrote in 
> er it. How do I figure out which database and relation is db:16384
> and relation:12141.?

On any database,

select datname from pg_database where oid = 16384;

Then on the shown database,

select relname from pg_class where oid = 12141;

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: do only critical work during single-user vacuum?

2022-01-11 Thread John Naylor
On Tue, Dec 21, 2021 at 4:56 PM Peter Geoghegan  wrote:

> But if we're going to add a new option to the VACUUM command (or
> something of similar scope), then we might as well add a new behavior
> that is reasonably exact -- something that (say) only *starts* a
> VACUUM for those tables whose relfrozenxid age currently exceeds half
> the autovacuum_freeze_max_age for the table (usually taken from the
> GUC, sometimes taken from the reloption), which also forces the
> failsafe. And with similar handling for
> relminmxid/autovacuum_multixact_freeze_max_age.

> This new command/facility should probably not be a new flag to the
> VACUUM command, as such. Rather, I think that it should either be an
> SQL-callable function, or a dedicated top-level command (that doesn't
> accept any tables). The only reason to have this is for scenarios
> where the user is already in a tough spot with wraparound failure,
> like that client of yours. Nobody wants to force the failsafe for one
> specific table. It's not general purpose, at all, and shouldn't claim
> to be.

I've attached a PoC *untested* patch to show what it would look like
as a top-level statement. If the "shape" is uncontroversial, I'll put
work into testing it and fleshing it out.

For the PoC I wanted to try re-using existing keywords. I went with
"VACUUM LIMIT" since LIMIT is already a keyword that cannot be used as
a table name. It also brings "wraparound limit" to mind. We could add
a single-use unreserved keyword (such as VACUUM_MINIMAL or
VACUUM_FAST), but that doesn't seem great.

> In other words, while triggering the failsafe is important, simply *not
> starting* VACUUM for relations where there is really no need for it is
> at least as important. We shouldn't even think about pruning or
> freezing with these tables. (ISTM that the only thing that might be a
> bit controversial about any of this is my definition of "safe", which
> seems like about the right trade-off to me.)

I'm not sure what the right trade-off is, but as written I used 95% of
max age. It might be undesirable to end up so close to kicking off
uninterruptible vacuums, but the point is to get out of single-user
mode and back to streaming WAL as quickly as possible. It might also
be worth overriding the min ages as well, but haven't done so here.

It can be executed in normal mode (although it's not expected to be),
which makes testing easier and allows for a future possibility of not
requiring shutdown at all, by e.g. terminating non-superuser
connections.

-- 
John Naylor
EDB: http://www.enterprisedb.com
 src/backend/commands/vacuum.c  | 133 +
 src/backend/nodes/copyfuncs.c  |   3 +
 src/backend/nodes/equalfuncs.c |   3 +
 src/backend/parser/gram.y  |  10 +++-
 src/backend/tcop/utility.c |  13 
 src/include/commands/vacuum.h  |   1 +
 src/include/nodes/nodes.h  |   1 +
 src/include/nodes/parsenodes.h |  12 
 8 files changed, 165 insertions(+), 11 deletions(-)

diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 287098e4d0..d1c59a78e9 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -52,6 +52,7 @@
 #include "storage/proc.h"
 #include "storage/procarray.h"
 #include "utils/acl.h"
+#include "utils/builtins.h"
 #include "utils/fmgroids.h"
 #include "utils/guc.h"
 #include "utils/memutils.h"
@@ -271,10 +272,132 @@ ExecVacuum(ParseState *pstate, VacuumStmt *vacstmt, bool isTopLevel)
 	/* user-invoked vacuum never uses this parameter */
 	params.log_min_duration = -1;
 
+	/*
+	 * Create special memory context for cross-transaction storage.
+	 *
+	 * Since it is a child of PortalContext, it will go away eventually even
+	 * if we suffer an error; there's no need for special abort cleanup logic.
+	 */
+	vac_context = AllocSetContextCreate(PortalContext,
+		"Vacuum",
+		ALLOCSET_DEFAULT_SIZES);
+
 	/* Now go through the common routine */
	vacuum(vacstmt->rels, &params, NULL, isTopLevel);
 }
 
+/*
+ * Like ExecVacuum, but specialized for recovering quickly from anti-wraparound
+ * shutdown.
+ */
+void
+ExecVacuumMinimal(VacuumMinimalStmt *fmstmt, bool isTopLevel)
+{
+	VacuumParams params;
+	List	   *vacrels = NIL;
+	Relation	pgclass;
+	TableScanDesc scan;
+	HeapTuple	tuple;
+	int32 table_xid_age;
+	int32 table_mxid_age;
+	int32 save_VacuumCostDelay;
+
+	/* use defaults */
+	// WIP: It might be worth trying to do less work here
+	params.freeze_min_age = -1;
+	params.multixact_freeze_min_age = -1;
+
+	/* it's unlikely any table we choose will not be eligible for aggressive vacuum, but make sure */
+	params.freeze_table_age = 0;
+	params.multixact_freeze_table_age = 0;
+
+	/* skip unnecessary work, as in failsafe mode */
+	params.index_cleanup = VACOPTVALUE_DISABLED;
+	params.truncate = VACOPTVALUE_DISABLED;
+
+	/* user-invoked vacuum is never "for wraparound" */
+	params.is_wraparound = false;
+
+	/* user-invoked vacuum never uses this parameter */
+	

Re: Column Filtering in Logical Replication

2022-01-11 Thread Alvaro Herrera
On 2022-Jan-11, Alvaro Herrera wrote:

> On 2022-Jan-10, Alvaro Herrera wrote:
> 
> > Hmm.  So you're saying that we should only raise errors about the column
> > list if we are publishing UPDATE or DELETE, but otherwise let the
> > replica identity be anything.  OK, I'll see if I can come up with a
> > reasonable set of rules ...
> 
> This is an attempt to do it that way.  Now you can add a table to a
> publication without regards for how column filter compares to the
> replica identity, as long as the publication does not include updates
> and inserts.

I discovered a big hole in this, which is that ALTER PUBLICATION SET
(publish='insert,update') can add UPDATE publishing to a publication
that was only publishing INSERTs.  It's easy to implement a fix: in
AlterPublicationOptions, scan the list of tables and raise an error if
any of them has a column list that doesn't include all the columns in
the replica identity.

However, that proposal has an ugly flaw: there is no index on
pg_publication_rel.prpubid, which means that the only way to find the
relations we need to inspect is to seqscan pg_publication_rel.

Also, psql's query for \dRp+ uses a seqscan in pg_publication_rel.

Therefore, I propose to add an index on pg_publication_rel.prpubid.

-- 
Álvaro Herrera  Valdivia, Chile  —  https://www.EnterpriseDB.com/
"¿Qué importan los años?  Lo que realmente importa es comprobar que
a fin de cuentas la mejor edad de la vida es estar vivo"  (Mafalda)




Re: [EXTERNAL] Re: PQcancel does not use tcp_user_timeout, connect_timeout and keepalive settings

2022-01-11 Thread Tom Lane
I wrote:
> (The fact that we now have test coverage for PQcancel makes me a lot
> more willing to try this than I might otherwise be.  Will be
> interested to see the cfbot's results.)

On closer look, I'm not sure that psql/t/020_cancel.pl is actually doing
anything on Windows; the cfbot's test transcript says it ran without
skipping, but looking at the test itself, it seems like it would skip on
Windows.

Meanwhile, the warnings build noticed that we need to #ifdef
out the optional_setsockopt function in some cases.  Revised
0002 attached.

regards, tom lane

diff --git a/doc/src/sgml/libpq.sgml b/doc/src/sgml/libpq.sgml
index 14f35d37f6..e0ab7cd555 100644
--- a/doc/src/sgml/libpq.sgml
+++ b/doc/src/sgml/libpq.sgml
@@ -5707,8 +5707,8 @@ int PQrequestCancel(PGconn *conn);
PGconn object, and in case of failure stores the
error message in the PGconn object (whence it can
be retrieved by ).  Although
-   the functionality is the same, this approach creates hazards for
-   multiple-thread programs and signal handlers, since it is possible
+   the functionality is the same, this approach is not safe within
+   multiple-thread programs or signal handlers, since it is possible
that overwriting the PGconn's error message will
mess up the operation currently in progress on the connection.
   
diff --git a/src/interfaces/libpq/fe-connect.c b/src/interfaces/libpq/fe-connect.c
index 5fc16be849..9e75abd304 100644
--- a/src/interfaces/libpq/fe-connect.c
+++ b/src/interfaces/libpq/fe-connect.c
@@ -4394,14 +4394,17 @@ PQfreeCancel(PGcancel *cancel)
 
 
 /*
- * PQcancel and PQrequestCancel: attempt to request cancellation of the
- * current operation.
+ * PQcancel: request query cancel
  *
  * The return value is true if the cancel request was successfully
  * dispatched, false if not (in which case an error message is available).
  * Note: successful dispatch is no guarantee that there will be any effect at
  * the backend.  The application must read the operation result as usual.
  *
+ * On failure, an error message is stored in *errbuf, which must be of size
+ * errbufsize (recommended size is 256 bytes).  *errbuf is not changed on
+ * success return.
+ *
  * CAUTION: we want this routine to be safely callable from a signal handler
  * (for example, an application might want to call it in a SIGINT handler).
  * This means we cannot use any C library routine that might be non-reentrant.
@@ -4409,13 +4412,9 @@ PQfreeCancel(PGcancel *cancel)
  * just as dangerous.  We avoid sprintf here for that reason.  Building up
  * error messages with strcpy/strcat is tedious but should be quite safe.
  * We also save/restore errno in case the signal handler support doesn't.
- *
- * internal_cancel() is an internal helper function to make code-sharing
- * between the two versions of the cancel function possible.
  */
-static int
-internal_cancel(SockAddr *raddr, int be_pid, int be_key,
-char *errbuf, int errbufsize)
+int
+PQcancel(PGcancel *cancel, char *errbuf, int errbufsize)
 {
 	int			save_errno = SOCK_ERRNO;
 	pgsocket	tmpsock = PGINVALID_SOCKET;
@@ -4427,18 +4426,26 @@ internal_cancel(SockAddr *raddr, int be_pid, int be_key,
 		CancelRequestPacket cp;
 	}			crp;
 
+	if (!cancel)
+	{
+		strlcpy(errbuf, "PQcancel() -- no cancel object supplied", errbufsize);
+		/* strlcpy probably doesn't change errno, but be paranoid */
+		SOCK_ERRNO_SET(save_errno);
+		return false;
+	}
+
 	/*
 	 * We need to open a temporary connection to the postmaster. Do this with
 	 * only kernel calls.
 	 */
-	if ((tmpsock = socket(raddr->addr.ss_family, SOCK_STREAM, 0)) == PGINVALID_SOCKET)
+	if ((tmpsock = socket(cancel->raddr.addr.ss_family, SOCK_STREAM, 0)) == PGINVALID_SOCKET)
 	{
 		strlcpy(errbuf, "PQcancel() -- socket() failed: ", errbufsize);
 		goto cancel_errReturn;
 	}
 retry3:
-	if (connect(tmpsock, (struct sockaddr *) &raddr->addr,
-raddr->salen) < 0)
+	if (connect(tmpsock, (struct sockaddr *) &cancel->raddr.addr,
+cancel->raddr.salen) < 0)
 	{
 		if (SOCK_ERRNO == EINTR)
 			/* Interrupted system call - we'll just try again */
@@ -4455,8 +4462,8 @@ retry3:
 
 	crp.packetlen = pg_hton32((uint32) sizeof(crp));
 	crp.cp.cancelRequestCode = (MsgType) pg_hton32(CANCEL_REQUEST_CODE);
-	crp.cp.backendPID = pg_hton32(be_pid);
-	crp.cp.cancelAuthCode = pg_hton32(be_key);
+	crp.cp.backendPID = pg_hton32(cancel->be_pid);
+	crp.cp.cancelAuthCode = pg_hton32(cancel->be_key);
 
 retry4:
	if (send(tmpsock, (char *) &crp, sizeof(crp), 0) != (int) sizeof(crp))
@@ -4508,27 +4515,6 @@ cancel_errReturn:
 	return false;
 }
 
-/*
- * PQcancel: request query cancel
- *
- * Returns true if able to send the cancel request, false if not.
- *
- * On failure, an error message is stored in *errbuf, which must be of size
- * errbufsize (recommended size is 256 bytes).  *errbuf is not changed on
- * success return.
- */
-int
-PQcancel(PGcancel *cancel, char *errbuf, int errbufsize)
-{
-	if 

Re: More data files / forks

2022-01-11 Thread Tomas Vondra


On 1/11/22 19:39, Chris Cleveland wrote:
> I'm working on a table access method that stores indexes in a structure
> that looks like an LSM tree. Changes get written to small segment files,
> which then get merged into larger segment files.
> 
> It's really tough to manage these files using existing fork/buffer/page
> files, because when you delete a large segment it leaves a lot of empty
> space. It's a lot easier to write the segments into separate files on
> disk and then delete them as needed.
> 

And is that empty space actually a problem? You can reuse that for new
data, no? It's a bit like empty space in regular data files - we could
try to keep it much lower, but doing so would be harmful in practice.

> I could do that, but then I lose the advantages of having data in native
> Postgres files, including support for buffering and locking.
> 
> It's important to have the segments stored contiguously on disk. I've
> benchmarked it; it makes a huge performance difference.
> 

Yeah, I'm sure it's beneficial for sequential scans, readahead, etc. But
you can get most of that benefit by a smart allocation strategy - instead
of working with individual pages, allocate larger chunks of pages. So
instead of grabbing pages one by one, "reserve" them in e.g. 1MB chunks,
or something.
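
Just to sketch the shape of that idea (untested, and a real AM would of
course need the relation extension lock and proper free-space
book-keeping on top of it):

#include "postgres.h"
#include "storage/bufmgr.h"
#include "utils/rel.h"

#define SEG_CHUNK_PAGES 128             /* 1MB of 8kB pages per reservation */

typedef struct SegmentAllocator
{
    BlockNumber next_free;              /* next unused page in current chunk */
    BlockNumber chunk_end;              /* first page past the current chunk */
} SegmentAllocator;                     /* start with both fields zero */

/* Hand out one page, extending the relation a whole chunk at a time. */
static BlockNumber
segment_alloc_page(SegmentAllocator *alloc, Relation rel)
{
    if (alloc->next_free >= alloc->chunk_end)
    {
        BlockNumber chunk_start = RelationGetNumberOfBlocks(rel);

        /* reserve SEG_CHUNK_PAGES contiguous pages in one go */
        for (int i = 0; i < SEG_CHUNK_PAGES; i++)
            ReleaseBuffer(ReadBuffer(rel, P_NEW));

        alloc->next_free = chunk_start;
        alloc->chunk_end = chunk_start + SEG_CHUNK_PAGES;
    }
    return alloc->next_free++;
}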

Not sure how exactly you do the book-keeping, ofc. I wonder if BRIN
might serve as an inspiration, as it maintains revmap and actual index
tuples in the same fork. Not the same thing, but perhaps similar?

The other thing that comes to mind is logtape.c, which works with
multiple "logical tapes" stored in a single file - a bit like the
segments you're talking about. But maybe the assumptions about segments
being written/read exactly once is too limiting for your use case.

> Questions:
> 
> 1. Are there any other disadvantages to storing data in my own files on
> disk, instead of in files managed by Postgres?
> 

Well, you simply don't get many of the built-in benefits you mentioned,
various tools may not expect that, and so on.

> 2. Is it possible to increase the number of forks? I could store each
> level of the LSM tree in its own fork very efficiently. Forks could get
> truncated as needed. A dozen forks would handle it nicely.
> 

You're right the number of forks is fixed, and it's one of the places
that's not extensible. I don't recall any proposals to change that,
though, and even if we decided to do that, I doubt we'd allow the number
of forks to be entirely dynamic.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Skipping logical replication transactions on subscriber side

2022-01-11 Thread Masahiko Sawada
On Tue, Jan 11, 2022 at 7:11 PM Amit Kapila  wrote:
>
> On Tue, Jan 11, 2022 at 8:52 AM Masahiko Sawada  wrote:
> >
> > On Mon, Jan 10, 2022 at 8:50 PM Amit Kapila  wrote:
> > >
> > >
> > > Few other comments on the latest patch:
> > > =
> > > 1.
> > > A conflict will produce an error and will stop the replication; it must be
> > > resolved manually by the user.  Details about the conflict can be 
> > > found in
> > > -   the subscriber's server log.
> > > +as well as 
> > > the
> > > +   subscriber's server log.
> > >
> > > Can we slightly change the modified line to: "Details about the
> > > conflict can be found in  > > linkend="monitoring-pg-stat-subscription-workers"/> and the
> > > subscriber's server log."?
> >
> > Will fix it.
> >
> > >  I think we can commit this change
> > > separately as this is true even without this patch.
> >
> > Right. It seems an oversight of 8d74fc96db. I've attached the patch.
> >
>
> Pushed.

Thanks!

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: Skipping logical replication transactions on subscriber side

2022-01-11 Thread Masahiko Sawada
On Tue, Jan 11, 2022 at 7:08 PM Amit Kapila  wrote:
>
> On Tue, Jan 11, 2022 at 1:51 PM Masahiko Sawada  wrote:
> >
> > On Tue, Jan 11, 2022 at 3:12 PM Amit Kapila  wrote:
> > >
> > > On Tue, Jan 11, 2022 at 8:52 AM Masahiko Sawada  
> > > wrote:
> > > >
> > > > On Mon, Jan 10, 2022 at 8:50 PM Amit Kapila  
> > > > wrote:
> > > > >
> > > > > I was thinking what if we don't advance origin explicitly in this
> > > > > case? Actually, that will be no different than the transactions where
> > > > > the apply worker doesn't apply any change because the initial sync is
> > > > > in progress (see should_apply_changes_for_rel()) or we have received
> > > > > an empty transaction. In those cases also, the origin lsn won't be
> > > > > advanced even though we acknowledge the advanced last_received
> > > > > location because of keep_alive messages. Now, it is possible after the
> > > > > restart we send the old start_lsn location because the replication
> > > > > origin was not updated before restart but we handle that case in the
> > > > > server by starting from the last confirmed location. See below code:
> > > > >
> > > > > CreateDecodingContext()
> > > > > {
> > > > > ..
> > > > > else if (start_lsn < slot->data.confirmed_flush)
> > > > > ..
> > > >
> > > > Good point. Probably one minor thing that is different from the
> > > > transaction where the apply worker applied an empty transaction is a
> > > > case where the server restarts/crashes before sending an
> > > > acknowledgment of the flush location. That is, in the case of the
> > > > empty transaction, the publisher sends an empty transaction again. On
> > > > the other hand in the case of skipping the transaction, a non-empty
> > > > transaction will be sent again but skip_xid is already changed or
> > > > cleared, therefore the user will have to specify skip_xid again. If we
> > > > write replication origin WAL record to advance the origin lsn, it
> > > > reduces the possibility of that. But I think it’s a very minor case so
> > > > we won’t need to deal with that.
> > > >
> > >
> > > Yeah, in the worst case, it will lead to conflict again and the user
> > > needs to set the xid again.
> >
> > On second thought, the same is true for other cases, for example,
> > preparing the transaction and clearing skip_xid while handling a
> > prepare message. That is, currently we don't clear skip_xid while
> > handling a prepare message but do that while handling commit/rollback
> > prepared message, in order to avoid the worst case. If we do both
> > while handling a prepare message and the server crashes between them,
> > it ends up that skip_xid is cleared and the transaction will be
> > resent, which is identical to the worst-case above.
> >
>
> How are you thinking to update the skip xid before prepare? If we do
> it in the same transaction then the changes in the catalog will be
> part of the prepared xact but won't be committed. Now, say if we do it
> after prepare, then the situation won't be the same because after
> restart the same xact won't appear again.

I was thinking to commit the catalog change first in a separate
transaction while not updating origin LSN and then prepare an empty
transaction while updating origin LSN. If the server crashes between
them, the skip_xid is cleared but the transaction will be resent.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: [EXTERNAL] Re: PQcancel does not use tcp_user_timeout, connect_timeout and keepalive settings

2022-01-11 Thread Tom Lane
Jelte Fennema  writes:
> Attached are 3 patches that address the feedback from Andres about code 
> duplication 
> and splitting up commits. I completely removed internal_cancel now, since it 
> only had 
> one caller at this point.

Here's some cleaned-up versions of 0001 and 0002.  I have not bothered
with 0003 because I for one will not be convinced to commit it.  The
risk-vs-reward ratio looks far too high on that, and it's not obvious
why 0002 doesn't already solve whatever problem there is.

A couple of notes:

0001 introduces malloc/free into PQrequestCancel, which promotes it
from being "probably unsafe in a signal handler" to "definitely unsafe
in a signal handler".  Before, maybe you could have gotten away with
it if you were sure the PGconn object was idle, but now it's no-go for
sure.  I don't have a big problem with that, given that it's been
deprecated for decades, but I made the warning text in libpq.sgml a
little stronger.

As for 0002, I don't see a really good reason why we shouldn't try
to do it on Windows too.  If connect() will work, then it seems
likely that setsockopt() and WSAIOCtl() will too.  Moreover, note
that at least in our own programs, PQcancel doesn't *really* run in a
signal handler on Windows: see commentary in src/fe_utils/cancel.c.
(The fact that we now have test coverage for PQcancel makes me a lot
more willing to try this than I might otherwise be.  Will be
interested to see the cfbot's results.)

Also, I was somewhat surprised while working on this to realize
that PQconnectPoll doesn't call setTCPUserTimeout if keepalives
are disabled per useKeepalives().  I made PQcancel act the same,
but I wonder if that was intentional or a bug.  I'd personally
have thought that those settings were independent.

regards, tom lane

diff --git a/doc/src/sgml/libpq.sgml b/doc/src/sgml/libpq.sgml
index 14f35d37f6..e0ab7cd555 100644
--- a/doc/src/sgml/libpq.sgml
+++ b/doc/src/sgml/libpq.sgml
@@ -5707,8 +5707,8 @@ int PQrequestCancel(PGconn *conn);
PGconn object, and in case of failure stores the
error message in the PGconn object (whence it can
be retrieved by ).  Although
-   the functionality is the same, this approach creates hazards for
-   multiple-thread programs and signal handlers, since it is possible
+   the functionality is the same, this approach is not safe within
+   multiple-thread programs or signal handlers, since it is possible
that overwriting the PGconn's error message will
mess up the operation currently in progress on the connection.
   
diff --git a/src/interfaces/libpq/fe-connect.c b/src/interfaces/libpq/fe-connect.c
index 5fc16be849..9e75abd304 100644
--- a/src/interfaces/libpq/fe-connect.c
+++ b/src/interfaces/libpq/fe-connect.c
@@ -4394,14 +4394,17 @@ PQfreeCancel(PGcancel *cancel)
 
 
 /*
- * PQcancel and PQrequestCancel: attempt to request cancellation of the
- * current operation.
+ * PQcancel: request query cancel
  *
  * The return value is true if the cancel request was successfully
  * dispatched, false if not (in which case an error message is available).
  * Note: successful dispatch is no guarantee that there will be any effect at
  * the backend.  The application must read the operation result as usual.
  *
+ * On failure, an error message is stored in *errbuf, which must be of size
+ * errbufsize (recommended size is 256 bytes).  *errbuf is not changed on
+ * success return.
+ *
  * CAUTION: we want this routine to be safely callable from a signal handler
  * (for example, an application might want to call it in a SIGINT handler).
  * This means we cannot use any C library routine that might be non-reentrant.
@@ -4409,13 +4412,9 @@ PQfreeCancel(PGcancel *cancel)
  * just as dangerous.  We avoid sprintf here for that reason.  Building up
  * error messages with strcpy/strcat is tedious but should be quite safe.
  * We also save/restore errno in case the signal handler support doesn't.
- *
- * internal_cancel() is an internal helper function to make code-sharing
- * between the two versions of the cancel function possible.
  */
-static int
-internal_cancel(SockAddr *raddr, int be_pid, int be_key,
-char *errbuf, int errbufsize)
+int
+PQcancel(PGcancel *cancel, char *errbuf, int errbufsize)
 {
 	int			save_errno = SOCK_ERRNO;
 	pgsocket	tmpsock = PGINVALID_SOCKET;
@@ -4427,18 +4426,26 @@ internal_cancel(SockAddr *raddr, int be_pid, int be_key,
 		CancelRequestPacket cp;
 	}			crp;
 
+	if (!cancel)
+	{
+		strlcpy(errbuf, "PQcancel() -- no cancel object supplied", errbufsize);
+		/* strlcpy probably doesn't change errno, but be paranoid */
+		SOCK_ERRNO_SET(save_errno);
+		return false;
+	}
+
 	/*
 	 * We need to open a temporary connection to the postmaster. Do this with
 	 * only kernel calls.
 	 */
-	if ((tmpsock = socket(raddr->addr.ss_family, SOCK_STREAM, 0)) == PGINVALID_SOCKET)
+	if ((tmpsock = socket(cancel->raddr.addr.ss_family, SOCK_STREAM, 0)) == 

Re: Add index scan progress to pg_stat_progress_vacuum

2022-01-11 Thread Bossart, Nathan
On 1/11/22, 12:33 PM, "Imseih (AWS), Sami"  wrote:
> What about something like "The number of indexes that are eligible for
> vacuuming"?
> This covers the cases where either an individual index is skipped or the 
> entire "index vacuuming" phase is skipped.

Hm.  I don't know if "eligible" is the right word.  An index can be
eligible for vacuuming but skipped because we set INDEX_CLEANUP to
false.  Maybe we should just stick with "The number of indexes that
will be vacuumed."  The only thing we may want to clarify is whether
this value will change in some cases (e.g., vacuum failsafe takes
effect).

Nathan



Re: Windows crash / abort handling

2022-01-11 Thread Andrew Dunstan


On 1/11/22 16:13, Andres Freund wrote:
> Hi,
>
> On 2022-01-11 12:01:42 -0500, Andrew Dunstan wrote:
>> On 1/11/22 02:51, Andres Freund wrote:
>>> It'd be a bit of a fight with cdb's awfully documented and quirky
>>> scripting [1], but the best solution would probably be to just use an
>>> environment variable from the target process to determine the dump
>>> location. Then each buildfarm config could set a BF_BACKTRACE_LOCATION
>>> variable or such...
>>>
>>> [1] So there's !envvar. But that yields a string like
>>> BF_BACKTRACE_LOCATION = value of environment variable when set to an
>>> alias.  And I haven't found an easy way to get rid of the "variablename
>>> = ". There is .foreach /pS [2] which could be used to skip over the
>>> varname =, but that then splits on all whitespaces. Gah.
>>>
>>> [2] 
>>> https://docs.microsoft.com/en-us/windows-hardware/drivers/debugger/-foreach
>>>
>> Ugly as heck.
> Indeed. I think I figured it out:
>
> 0:000> !envvar frak
> frak = C:\Program Files\Application Verifier\
> 0:000> ad /q path; .foreach /pS 2 (component {!envvar frak}){ .if 
> (${/d:path}) {aS ${/v:path} ${/f:path} ${component}} .else {aS ${/v:path} 
> ${component}}}; .block {.echo ${path}}
> C:\Program Files\Application Verifier\
>
> I mean, no explanation needed, right?
>
>
>> But in theory these animals could be running in parallel, and in theory
>> each animal could have more than one branch being run concurrently. In
>> fact locking them against each other can be difficult/impossible.
> The environment variable solution directing dumps for each animal / branch to
> different directories should take care of that, right?
>
>
> Do you have a preference where a script file implementing the necessary
> cdb.exe commands would reside? It's getting too long to comfortably implement
> it inside the registry setting itself... That script would be used by all
> branches, rather than be branch specific.



Well, the buildfarm code sets up a file for gdb in the branch's pgsql.build:


    my $cmdfile = "./gdbcmd";
    my $handle;
    open($handle, '>', $cmdfile) || die "opening $cmdfile: $!";
    print $handle "bt\n";
    print $handle 'p $_siginfo', "\n";
    close($handle);

    my @trace = ("\n\n");

    foreach my $core (@cores)
    {
    my @onetrace =
  run_log("gdb -x $cmdfile --batch $bindir/postgres $core");
    push(@trace,
    "$log_file_marker stack trace: $core $log_file_marker\n",
    @onetrace);
    }


So by analogy we could do something similar for the Windows debugger. Or
we could pick up a file from the repo if we wanted to commit it in
src/tools somewhere.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Parallel Full Hash Join

2022-01-11 Thread Melanie Plageman
On Fri, Nov 26, 2021 at 3:11 PM Thomas Munro  wrote:
>
> On Sun, Nov 21, 2021 at 4:48 PM Justin Pryzby  wrote:
> > On Wed, Nov 17, 2021 at 01:45:06PM -0500, Melanie Plageman wrote:
> > > Yes, this looks like that issue.
> > >
> > > I've attached a v8 set with the fix I suggested in [1] included.
> > > (I added it to 0001).
> >
> > This is still crashing :(
> > https://cirrus-ci.com/task/6738329224871936
> > https://cirrus-ci.com/task/4895130286030848
>
> I added a core file backtrace to cfbot's CI recipe a few days ago, so
> now we have:
>
> https://cirrus-ci.com/task/5676480098205696
>
> #3 0x009cf57e in ExceptionalCondition (conditionName=0x29cae8
> "BarrierParticipants(>shared->batch_barrier) == 1",
> errorType=, fileName=0x2ae561 "nodeHash.c",
> lineNumber=lineNumber@entry=2224) at assert.c:69
> No locals.
> #4 0x0071575e in ExecParallelScanHashTableForUnmatched
> (hjstate=hjstate@entry=0x80a60a3c8,
> econtext=econtext@entry=0x80a60ae98) at nodeHash.c:2224

I believe this assert can be safely removed.

It is possible for a worker to attach to the batch barrier after the
"last" worker was elected to scan and emit unmatched inner tuples. This
is safe because the batch barrier is already in phase PHJ_BATCH_SCAN
and this newly attached worker will simply detach from the batch
barrier and look for a new batch to work on.

The order of events would be as follows:

W1: advances batch to PHJ_BATCH_SCAN
W2: attaches to batch barrier in ExecParallelHashJoinNewBatch()
W1: calls ExecParallelScanHashTableForUnmatched() (2 workers attached to
barrier at this point)
W2: detaches from the batch barrier

The attached v10 patch removes this assert and updates the comment in
ExecParallelScanHashTableForUnmatched().

I'm not sure if I should add more detail about this scenario in
ExecParallelHashJoinNewBatch() under PHJ_BATCH_SCAN or if the detail in
ExecParallelScanHashTableForUnmatched() is sufficient.

- Melanie
From 998a1cc5bbbe26ce4c0bf23be22867fd1635e2c6 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Sat, 6 Mar 2021 12:06:16 +1300
Subject: [PATCH v10 2/3] Improve the naming of Parallel Hash Join phases.

Commit 3048898e dropped -ING from some wait event names.  Update the
corresponding barrier phases names to match.

While we're here making cosmetic changes, also rename "DONE" to "FREE".
That pairs better with "ALLOCATE", and describes the activity that
actually happens in that phase (as we do for the other phases) rather
than describing a state.  As for the growth barriers, rename their
"ALLOCATE" phase to "REALLOCATE", which is probably a better description
of what happens then.
---
 src/backend/executor/nodeHash.c | 68 +-
 src/backend/executor/nodeHashjoin.c | 91 +
 src/backend/utils/activity/wait_event.c |  8 +--
 src/include/executor/hashjoin.h | 38 +--
 src/include/utils/wait_event.h  |  4 +-
 5 files changed, 106 insertions(+), 103 deletions(-)

diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 049d718425..4e233f07c4 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -246,10 +246,10 @@ MultiExecParallelHash(HashState *node)
 	 */
 	pstate = hashtable->parallel_state;
	build_barrier = &pstate->build_barrier;
-	Assert(BarrierPhase(build_barrier) >= PHJ_BUILD_ALLOCATING);
+	Assert(BarrierPhase(build_barrier) >= PHJ_BUILD_ALLOCATE);
 	switch (BarrierPhase(build_barrier))
 	{
-		case PHJ_BUILD_ALLOCATING:
+		case PHJ_BUILD_ALLOCATE:
 
 			/*
 			 * Either I just allocated the initial hash table in
@@ -259,7 +259,7 @@ MultiExecParallelHash(HashState *node)
 			BarrierArriveAndWait(build_barrier, WAIT_EVENT_HASH_BUILD_ALLOCATE);
 			/* Fall through. */
 
-		case PHJ_BUILD_HASHING_INNER:
+		case PHJ_BUILD_HASH_INNER:
 
 			/*
 			 * It's time to begin hashing, or if we just arrived here then
@@ -271,10 +271,10 @@ MultiExecParallelHash(HashState *node)
 			 * below.
 			 */
			if (PHJ_GROW_BATCHES_PHASE(BarrierAttach(&pstate->grow_batches_barrier)) !=
-PHJ_GROW_BATCHES_ELECTING)
+PHJ_GROW_BATCHES_ELECT)
 ExecParallelHashIncreaseNumBatches(hashtable);
			if (PHJ_GROW_BUCKETS_PHASE(BarrierAttach(&pstate->grow_buckets_barrier)) !=
-PHJ_GROW_BUCKETS_ELECTING)
+PHJ_GROW_BUCKETS_ELECT)
 ExecParallelHashIncreaseNumBuckets(hashtable);
 			ExecParallelHashEnsureBatchAccessors(hashtable);
 			ExecParallelHashTableSetCurrentBatch(hashtable, 0);
@@ -338,17 +338,17 @@ MultiExecParallelHash(HashState *node)
 	 * Unless we're completely done and the batch state has been freed, make
 	 * sure we have accessors.
 	 */
-	if (BarrierPhase(build_barrier) < PHJ_BUILD_DONE)
+	if (BarrierPhase(build_barrier) < PHJ_BUILD_FREE)
 		ExecParallelHashEnsureBatchAccessors(hashtable);
 
 	/*
 	 * The next synchronization point is in ExecHashJoin's HJ_BUILD_HASHTABLE
-	 * case, which will bring the build phase to PHJ_BUILD_RUNNING (if it isn't
+	 * case, which will bring the build 

Re: row filtering for logical replication

2022-01-11 Thread Alvaro Herrera
I just looked at 0002 because of Justin Pryzby's comment in the column
filtering thread, and realized that the pgoutput row filtering has a
very strange API, which receives both heap tuples and slots; and we seem
to convert to and from slots in seemingly unprincipled ways.  I don't
think this is going to fly.  I think it's OK for the initial entry into
pgoutput to be HeapTuple (but only because that's what
ReorderBufferTupleBuf has), but it should be converted to a slot right when
it enters pgoutput, and then used as a slot throughout.
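
Roughly what I mean, at the point where the change first arrives in
pgoutput (untested sketch, assuming the plain heap-tuple slot ops):

    TupleTableSlot *slot;

    slot = MakeSingleTupleTableSlot(RelationGetDescr(relation),
                                    &TTSOpsHeapTuple);
    ExecStoreHeapTuple(&change->data.tp.newtuple->tuple, slot, false);

    /* the row filter and logicalrep_write_*() would only ever see "slot" */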

I think this is mostly sensible in 0001 (which was evidently developed
earlier), but 0002 makes a nonsensical change to the API, with poor
results.

(This is one of the reasons I've been saying that these patches should
be squashed together -- so that we can see that the overall API
transformation we're making are sensible.)

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/




Re: Windows crash / abort handling

2022-01-11 Thread Andres Freund
Hi,

On 2022-01-11 12:01:42 -0500, Andrew Dunstan wrote:
> On 1/11/22 02:51, Andres Freund wrote:
> > It'd be a bit of a fight with cdb's awfully documented and quirky
> > scripting [1], but the best solution would probably be to just use an
> > environment variable from the target process to determine the dump
> > location. Then each buildfarm config could set a BF_BACKTRACE_LOCATION
> > variable or such...
> >
> > [1] So there's !envvar. But that yields a string like
> > BF_BACKTRACE_LOCATION = value of environment variable when set to an
> > alias.  And I haven't found an easy way to get rid of the "variablename
> > = ". There is .foreach /pS [2] which could be used to skip over the
> > varname =, but that then splits on all whitespaces. Gah.
> >
> > [2] 
> > https://docs.microsoft.com/en-us/windows-hardware/drivers/debugger/-foreach
> >
> 
> Ugly as heck.

Indeed. I think I figured it out:

0:000> !envvar frak
frak = C:\Program Files\Application Verifier\
0:000> ad /q path; .foreach /pS 2 (component {!envvar frak}){ .if (${/d:path}) 
{aS ${/v:path} ${/f:path} ${component}} .else {aS ${/v:path} ${component}}}; 
.block {.echo ${path}}
C:\Program Files\Application Verifier\

I mean, no explanation needed, right?


> But in theory these animals could be running in parallel, and in theory
> each animal could have more than one branch being run concurrently. In
> fact locking them against each other can be difficult/impossible.

The environment variable solution directing dumps for each animal / branch to
different directories should take care of that, right?


Do you have a preference where a script file implementing the necessary
cdb.exe commands would reside? It's getting too long to comfortably implement
it inside the registry setting itself... That script would be used by all
branches, rather than be branch specific.

Greetings,

Andres Freund




Re: Column Filtering in Logical Replication

2022-01-11 Thread Justin Pryzby
Is there any coordination between the "column filter" patch and the "row
filter" patch ?  Are they both on track for PG15 ?  Has anybody run them
together ?

Whichever patch is merged 2nd should include tests involving a subset of
columns along with a WHERE clause.

I have a suggestion: for the functions for which both patches are adding
additional argument types, define a filtering structure for both patches to
use.  Similar to what we did for some utility statements in a3dc92600.

I'm referring to:
logicalrep_write_update()
logicalrep_write_tuple()

That would avoid some rebase conflicts on April 9, and avoid functions
with 7,8,9 arguments, and maybe simplify adding arguments in the future.

-- 
Justin




Re: Use -fvisibility=hidden for shared libraries

2022-01-11 Thread Tom Lane
Andres Freund  writes:
> What is bugging me is that I am fairly sure that my local compilers at some
> point complained about such mismatches on linux as well. But I can't reproduce
> that right now :/

> Now I wonder if I just saw it when cross compiling locally...

I still don't understand what are the conditions for MSVC to complain.
The rule is evidently not that every extern must agree with the function
definition, because for example you added

+extern PGDLLEXPORT void _PG_init(void);

in fmgr.h, but you didn't change any of the existing extern declarations
or definitions for _PG_init functions, and yet everything seems to work.

I had concluded that gcc/clang follow the rule "use an attribute if it
appears on at least one extern for the function", and this seems like
evidence that it works like that in MSVC too.
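
That is, if the rule really is "the attribute wins if it appears on any
declaration", then a contrived mix like the following ought to be accepted
by both toolchains:

/* in a header: the declaration carrying the attribute */
extern PGDLLEXPORT void _PG_init(void);

/* in an extension's .c file: unattributed declaration plus the definition */
extern void _PG_init(void);

void
_PG_init(void)
{
    /* ... module initialization ... */
}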

regards, tom lane




Re: Add jsonlog log_destination for JSON server logs

2022-01-11 Thread Bossart, Nathan
On 1/10/22, 4:51 AM, "Michael Paquier"  wrote:
> The issue comes from an incorrect change in syslogger_parseArgs()
> where I missed that the incrementation of argv by 3 has no need to be
> changed.  A build with -DEXEC_BACKEND is enough to show the failure,
> which caused a crash when starting up the syslogger because of a NULL
> pointer dereference.  The attached v9 should be enough to switch the
> CF bot to green.

I've been looking at the latest patch set intermittently and playing
around with jsonlog a little.  It seems to work well, and I don't have
any significant comments about the code.  0001 and 0002 seem
straightforward and uncontroversial.  IIUC 0003 simply introduces
jsonlog using the existing framework.

I wonder if we should consider tracking each log destination as a set
of function pointers.  The main logging code would just loop through
the enabled log destinations and use these functions, and it otherwise
would be completely detached (i.e., no "if jsonlog" blocks).  This
might open up the ability to define custom log destinations via
modules, too.  However, I don't know if there's any real demand for
something like this, and it should probably be done separately from
introducing jsonlog, anyway.
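
Something like the following is the shape I have in mind (nothing here
exists today; the emitter names are stand-ins for whatever the per-format
routines would be):

/* hypothetical emitters, one per output format */
extern void write_stderr_log(ErrorData *edata);
extern void write_csvlog(ErrorData *edata);
extern void write_jsonlog(ErrorData *edata);

typedef struct LogDestination
{
    const char *name;                   /* "stderr", "csvlog", "jsonlog", ... */
    bool        enabled;                /* derived from the log_destination GUC */
    void      (*emit_log) (ErrorData *edata);
} LogDestination;

static LogDestination log_destinations[] =
{
    {"stderr", true, write_stderr_log},
    {"csvlog", false, write_csvlog},
    {"jsonlog", false, write_jsonlog},
};

static void
emit_to_all_destinations(ErrorData *edata)
{
    for (int i = 0; i < lengthof(log_destinations); i++)
    {
        if (log_destinations[i].enabled)
            log_destinations[i].emit_log(edata);
    }
}

A loadable module could then register an extra entry instead of patching
elog.c, which is what would make custom destinations possible.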

Nathan



Re: Add index scan progress to pg_stat_progress_vacuum

2022-01-11 Thread Imseih (AWS), Sami
On 1/11/22, 1:01 PM, "Bossart, Nathan"  wrote:

On 1/10/22, 5:01 PM, "Imseih (AWS), Sami"  wrote:
> I have attached the 3rd revision of the patch which also includes the 
documentation changes. Also attached is a rendered html of the docs for review.
>
> "max_index_vacuum_cycle_time" has been removed.
> "index_rows_vacuumed" renamed to "index_tuples_removed". "tuples" is a 
more consistent with the terminology used.
> "vacuum_cycle_ordinal_position" renamed to "index_ordinal_position".

Thanks for the new version of the patch!

nitpick: I get one whitespace error when applying the patch.

Applying: Expose progress for the "vacuuming indexes" phase of a 
VACUUM operation.
.git/rebase-apply/patch:44: tab in indent.
   Whenever  is triggered, 
index
warning: 1 line adds whitespace errors.

That was missed. Will fix it.

+ 
+  
+   num_indexes_to_vacuum bigint
+  
+  
+   The number of indexes that will be vacuumed. Only indexes with
+   pg_index.indisready set to "true" will be 
vacuumed.
+   Whenever  is triggered, 
index
+   vacuuming will be bypassed.
+  
+ 
+
+   
+  

We may want to avoid exhaustively listing the cases when this value
will be zero.  I would suggest saying, "When index cleanup is skipped,
this value will be zero" instead.

What about something like "The number of indexes that are eligible for
vacuuming"?
This covers the cases where either an individual index is skipped or the entire 
"index vacuuming" phase is skipped.

+ 
+  
+   relid oid
+  
+  
+   OID of the table being vacuumed.
+  
+ 

Do we need to include this field?  I would expect indexrelid to go
here.

Having indexrelid and relid makes the pg_stat_progress_vacuum_index view 
"self-contained". A user can lookup the index and table being vacuumed without 
joining back to pg_stat_progress_vacuum.

+ 
+  
+   leader_pid bigint
+  
+  
+   Process ID of the parallel group leader. This field is 
NULL
+   if this process is a parallel group leader or the
+   vacuuming indexes phase is not performed in 
parallel.
+  
+ 

Are there cases where the parallel group leader will have an entry in
this view when parallelism is enabled?

Yes. A parallel group leader can perform an index vacuum just like a parallel 
worker. If you do something like "vacuum (parallel 3) ", you may have up to 4 
processes vacuuming indexes. The leader + 3 workers. 

+ 
+  
+   index_ordinal_position 
bigint
+  
+  
+   The order in which the index is being vacuumed. Indexes are 
vacuumed by OID in ascending order.
+  
+ 

Should we include the bit about the OID ordering?  I suppose that is
unlikely to change in the near future, but I don't know if it is
relevant information.  Also, do we need to include the "index_"
prefix?  This view is specific for indexes.  (I have the same question
for index_tuples_removed.)

I was on the fence about both of these as well. Will make a change to this.

Should this new table go after the "VACUUM phases" table?  It might
make sense to keep the phases table closer to where it is referenced.

I did not think that would read better. The introduction discusses both views 
and the "phase" table is linked from the pg_stat_progress_vacuum 

+/* Advertise the number of indexes to vacuum if we are not in failsafe 
mode */
+if (!lazy_check_wraparound_failsafe(vacrel))
+pgstat_progress_update_param(PROGRESS_VACUUM_TOTAL_INDEX_VACUUM, 
vacrel->nindexes);

Shouldn't this be 0 when INDEX_CLEANUP is off, too?

This view is only covering the "vacuum index" phase, but it should cover the
index_cleanup phase as well. Will update the patch.

+#define PROGRESS_VACUUM_CURRENT_INDRELID 7
+#define PROGRESS_VACUUM_LEADER_PID   8
+#define PROGRESS_VACUUM_INDEX_ORDINAL9
+#define PROGRESS_VACUUM_TOTAL_INDEX_VACUUM   10
+#define PROGRESS_VACUUM_DEAD_TUPLES_VACUUMED 11

nitpick: I would suggest the following names to match the existing
style:

PROGRESS_VACUUM_NUM_INDEXES_TO_VACUUM
PROGRESS_VACUUM_INDEX_LEADER_PID
PROGRESS_VACUUM_INDEX_INDEXRELID
PROGRESS_VACUUM_INDEX_ORDINAL_POSITION
PROGRESS_VACUUM_INDEX_TUPLES_REMOVED

That looks better.

Nathan




Re: Boyer-More-Horspool searching LIKE queries

2022-01-11 Thread Heikki Linnakangas

On 11/01/2022 15:55, Atsushi Ogawa wrote:

I have created a patch to enable the Boyer-More-Horspool search
algorithm (B-M-H) for LIKE queries.


Cool!


The conditions under which B-M-H can be used are as follows.

(1) B-M-H in LIKE search supports only single-byte character sets and UTF8.
Other multibyte character sets are not supported, because one character's byte
sequence may contain bytes that form another character. For UTF-8, it works fine, because in
UTF-8 the byte sequence of one character cannot contain another character.


You can make it work with any encoding, if you check for that case after 
you find a match. See how text_position() does it.



(3) The pattern string should be at least 4 characters.
For example, '%AB%' can use B-M-H.


To be precise, the patch checks that the pattern string is at least 4 
*bytes* long. A pattern like E'%\U0001F418%' would benefit too.


If I'm reading the code correctly, it doesn't account for escapes 
correctly. It will use B-M-H for a pattern like '%\\%', even though 
that's just searching for a single backslash and won't benefit from B-M-H.



(4) The first and last character of the pattern string should be '%'.


I wonder if we can do better than that. If you have a pattern like 
'%foo%bar', it's pretty obvious (to a human) that you can quickly check 
if the string ends in 'bar', and then check if it also contains the 
substring 'foo'. Is there some way to generalize that?
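
For the simple two-fragment case the check could be as cheap as this
(plain C sketch that ignores '_', escapes, collations and multibyte
concerns):

#include <string.h>
#include <stdbool.h>

/* Does "text" match the pattern '%' infix '%' suffix, e.g. '%foo%bar'? */
static bool
match_infix_then_suffix(const char *text, const char *infix, const char *suffix)
{
    size_t      tlen = strlen(text);
    size_t      ilen = strlen(infix);
    size_t      slen = strlen(suffix);
    const char *hit;

    /* cheap test first: the text must end with the literal suffix */
    if (tlen < slen || memcmp(text + tlen - slen, suffix, slen) != 0)
        return false;

    /* the infix must occur entirely before the suffix match */
    hit = strstr(text, infix);
    return hit != NULL && (size_t) (hit - text) + ilen <= tlen - slen;
}

Generalizing would presumably mean anchoring the last fragment at the end
of the string, the first fragment at the start when the pattern doesn't
begin with '%', and searching the remaining fragments left to right.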


Looking at MatchText() in like.c, there is this piece of code:


else if (*p == '%')
{
charfirstpat;

/*
 * % processing is essentially a search for a text 
position at
 * which the remainder of the text matches the 
remainder of the
 * pattern, using a recursive call to check each 
potential match.
 *
 * If there are wildcards immediately following the %, 
we can skip
 * over them first, using the idea that any sequence of 
N _'s and
 * one or more %'s is equivalent to N _'s and one % 
(ie, it will
 * match any sequence of at least N text characters).  
In this way
 * we will always run the recursive search loop using a 
pattern
 * fragment that begins with a literal 
character-to-match, thereby
 * not recursing more than we have to.
 */
NextByte(p, plen);

while (plen > 0)
{
if (*p == '%')
NextByte(p, plen);
else if (*p == '_')
{
/* If not enough text left to match the 
pattern, ABORT */
if (tlen <= 0)
return LIKE_ABORT;
NextChar(t, tlen);
NextByte(p, plen);
}
else
break;  /* Reached a 
non-wildcard pattern char */
}


Could we use B-M-H to replace that piece of code?

How does the performance compare with regular expressions? Would it be 
possible to use this for trivial regular expressions too? Or could we 
speed up the regexp engine to take advantage of B-M-H, and use it for 
LIKE? Or something like that?



I have measured the performance with the following query.


Setting up the B-M-H table adds some initialization overhead, so this 
would be a loss for cases where the LIKE is executed only once, and/or 
the haystack strings are very small. That's probably OK, the overhead is 
probably small, and those cases are probably not performance-critical. 
But it would be nice to measure that too.
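
For reference, the preprocessing is only this much work (a generic Horspool
bad-character table, O(256 + pattern bytes)), so the break-even point is
probably low, but measuring would settle it:

#define ALPHABET_SIZE 256

/* Build the bad-character skip table used by Boyer-Moore-Horspool. */
static void
bmh_build_skip_table(const unsigned char *needle, int nlen,
                     int skip[ALPHABET_SIZE])
{
    int         i;

    /* a byte not in the needle lets us shift by the whole needle length */
    for (i = 0; i < ALPHABET_SIZE; i++)
        skip[i] = nlen;

    /* bytes occurring in the needle (except its last byte) shift by less */
    for (i = 0; i < nlen - 1; i++)
        skip[needle[i]] = nlen - 1 - i;
}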


- Heikki




Re: Why is src/test/modules/committs/t/002_standby.pl flaky?

2022-01-11 Thread Tom Lane
Thomas Munro  writes:
> Ouch.  I think our options at this point are:
> 1.  Revert 6051857fc (and put it back when we have a working
> long-lived WES as I showed).  This is not very satisfying, now that we
> understand the bug, because even without that change I guess you must
> be able to reach the hanging condition by using Windows postgres_fdw
> to talk to a non-Windows server (ie a normal TCP stack with graceful
> shutdown/linger on process exit).

It'd be worth checking, perhaps.  One thing I've been wondering all
along is how much of this behavior is specific to the local-loopback
case where Windows can see both ends of the connection.  You'd think
that they couldn't long get away with such blatant violations of the
TCP specs when talking to external servers, because the failures
would be visible to everyone with a web browser.

regards, tom lane




Re: Why is src/test/modules/committs/t/002_standby.pl flaky?

2022-01-11 Thread Thomas Munro
On Wed, Jan 12, 2022 at 4:00 AM Alexander Lakhin  wrote:
> So here we get similar hanging on WaitLatchOrSocket().
> Just to make sure that it's indeed the same issue, I've removed socket
> shutdown and the test executed to the end (several times). Argh.

Ouch.  I think our options at this point are:
1.  Revert 6051857fc (and put it back when we have a working
long-lived WES as I showed).  This is not very satisfying, now that we
understand the bug, because even without that change I guess you must
be able to reach the hanging condition by using Windows postgres_fdw
to talk to a non-Windows server (ie a normal TCP stack with graceful
shutdown/linger on process exit).
2.  Put your poll() check into the READABLE side.  There's some
precedent for that sort of kludge on the WRITEABLE side (and a
rejection of the fragile idea that clients of latch.c should only
perform "safe" sequences):

/*
 * Windows does not guarantee to log an FD_WRITE network event
 * indicating that more data can be sent unless the previous send()
 * failed with WSAEWOULDBLOCK.  While our caller might well have made
 * such a call, we cannot assume that here.  Therefore, if waiting for
 * write-ready, force the issue by doing a dummy send().  If the dummy
 * send() succeeds, assume that the socket is in fact write-ready, and
 * return immediately.  Also, if it fails with something other than
 * WSAEWOULDBLOCK, return a write-ready indication to let our caller
 * deal with the error condition.
 */
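
For option 2, the readable-side check could be a small self-contained
helper along these lines (untested Windows-only sketch; whether a MSG_PEEK
recv() plays nicely with the FD_READ event machinery is exactly what would
need verifying):

#include <winsock2.h>
#include <stdbool.h>

/*
 * Return true if the socket is already read-ready: data is buffered,
 * the peer has performed an orderly shutdown, or an error is pending.
 */
static bool
socket_already_readable(SOCKET sock)
{
    char        ch;
    int         rc = recv(sock, &ch, 1, MSG_PEEK);

    if (rc > 0)
        return true;            /* buffered data available */
    if (rc == 0)
        return true;            /* graceful close: report EOF as readable */
    return WSAGetLastError() != WSAEWOULDBLOCK; /* hard error counts too */
}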




Re: pg_upgrade should truncate/remove its logs before running

2022-01-11 Thread Justin Pryzby
On Tue, Jan 11, 2022 at 04:41:58PM +0900, Michael Paquier wrote:
> On Sat, Jan 08, 2022 at 12:48:57PM -0600, Justin Pryzby wrote:
> > I fixed it by calling get_restricted_token() before parseCommandLine().
> > There's precedent for that in pg_regress (but the 3 other callers do it
> > differently).
> >
> > It seems more ideal to always call get_restricted_token sooner than later, 
> > but
> > for now I only changed pg_upgrade.  It's probably also better if
> > parseCommandLine() only parses the commandline, but for now I added on to 
> > the
> > logfile stuff that's already there.
>
> Well, the routine does a bit more than just parsing the options as it
> creates the directory infrastructure as well.  As you say, I think
> that it would be better to have the option parsing and the
> loading-into-structure portions in one routine, and the creation of
> the paths in a second one.  So, the new contents of the patch could
> just be moved in a new routine, after getting the restricted token.
> Moving get_restricted_token() before or after the option parsing as
> you do is not a big deal, but your patch is introducing in the
> existing routine more than what's currently done there as of HEAD.

I added mkdir() before the other stuff that messes with logfiles, because it
needs to happen before that.

Are you suggesting to change the pre-existing behavior of when logfiles are
created, like 0002 ?

> > Maybe the commandline argument should be callled something other than 
> > "logdir"
> > since it also outputs dumps there.  But the dumps are more or less not
> > user-facing.  But -d and -o are already used.  Maybe it shouldn't be
> > configurable at all?
> 
> If the choice of a short option becomes confusing, I'd be fine with
> just a long option, but -l is fine IMO.  Including the internal dumps
> in the directory is fine to me, and using a subdir, as you do, makes
> things more organized.
> 
> - "--binary-upgrade %s -f %s",
> + "--binary-upgrade %s -f %s/dump/%s",
> Some quotes seem to be missing here.

Yes, good catch

> Is it intentional to not use rmtree() here?  If you put all the data
> in the same directory, cleanup() gets simpler.

There's no reason not to.  We created the dir, and the user didn't specify to
preserve it.  It'd be their fault if they put something valuable there after
starting pg_upgrade.

-- 
Justin
>From 9b228bb1b529d0998340263dcb4238beeccd2ac9 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Sat, 11 Dec 2021 19:19:50 -0600
Subject: [PATCH 1/2] pg_upgrade: write log files and dumps in subdir..

The subdir must not exist when pg_upgrade is started.
This avoids appending to logfiles which have pre-existing errors in them.
And allows more easily cleaning up after pg_upgrade.
---
 doc/src/sgml/ref/pgupgrade.sgml | 15 -
 src/bin/pg_upgrade/.gitignore   |  3 ---
 src/bin/pg_upgrade/Makefile |  4 +---
 src/bin/pg_upgrade/check.c  | 12 +++
 src/bin/pg_upgrade/dump.c   |  6 --
 src/bin/pg_upgrade/exec.c   |  5 -
 src/bin/pg_upgrade/function.c   |  3 ++-
 src/bin/pg_upgrade/option.c | 31 +++
 src/bin/pg_upgrade/pg_upgrade.c | 38 -
 src/bin/pg_upgrade/pg_upgrade.h |  1 +
 src/bin/pg_upgrade/server.c |  6 --
 11 files changed, 74 insertions(+), 50 deletions(-)

diff --git a/doc/src/sgml/ref/pgupgrade.sgml b/doc/src/sgml/ref/pgupgrade.sgml
index c5ce732ee98..e6d730a2574 100644
--- a/doc/src/sgml/ref/pgupgrade.sgml
+++ b/doc/src/sgml/ref/pgupgrade.sgml
@@ -130,6 +130,19 @@ PostgreSQL documentation
   cluster
  
 
+ 
+  -l
+  --logdir
+  Directory where SQL and log files will be written.
+  The directory will be created and must not exist before calling
+  pg_upgrade.
+  It will be removed after a successful upgrade unless
+  --retain is specified.
+  The default is pg_upgrade_output.d in the current
+  working directory.
+  
+ 
+
  
   -N
   --no-sync
@@ -768,7 +781,7 @@ psql --username=postgres --file=script.sql postgres
 
   
pg_upgrade creates various working files, such
-   as schema dumps, in the current working directory.  For security, be sure
+   as schema dumps, below the current working directory.  For security, be sure
that that directory is not readable or writable by any other users.
   
 
diff --git a/src/bin/pg_upgrade/.gitignore b/src/bin/pg_upgrade/.gitignore
index 2d3bfeaa502..7222f8a6b1f 100644
--- a/src/bin/pg_upgrade/.gitignore
+++ b/src/bin/pg_upgrade/.gitignore
@@ -1,9 +1,6 @@
 /pg_upgrade
 # Generated by test suite
-/pg_upgrade_internal.log
 /delete_old_cluster.sh
 /delete_old_cluster.bat
-/reindex_hash.sql
-/loadable_libraries.txt
 /log/
 /tmp_check/
diff --git a/src/bin/pg_upgrade/Makefile b/src/bin/pg_upgrade/Makefile
index 44d06be5a61..ab21cee3601 100644
--- a/src/bin/pg_upgrade/Makefile
+++ b/src/bin/pg_upgrade/Makefile
@@ -45,9 +45,7 @@ uninstall:
 clean 

Re: improve CREATE EXTENSION error message

2022-01-11 Thread Bossart, Nathan
On 1/11/22, 11:23 AM, "Tom Lane"  wrote:
> "Bossart, Nathan"  writes:
>> Okay, the message looks like this in v5:
>
>> postgres=# CREATE EXTENSION does_not_exist;
>> ERROR:  extension "does_not_exist" is not available
>> DETAIL:  Could not open extension control file 
>> "/usr/local/pgsql/share/extension/does_not_exist.control": No such file or 
>> directory.
>> HINT:  The extension must first be installed on the system where 
>> PostgreSQL is running.
>
> Nobody complained about that wording, so pushed.

Thanks!

Nathan



Re: improve CREATE EXTENSION error message

2022-01-11 Thread Tom Lane
"Bossart, Nathan"  writes:
> Okay, the message looks like this in v5:

> postgres=# CREATE EXTENSION does_not_exist;
> ERROR:  extension "does_not_exist" is not available
> DETAIL:  Could not open extension control file 
> "/usr/local/pgsql/share/extension/does_not_exist.control": No such file or 
> directory.
> HINT:  The extension must first be installed on the system where 
> PostgreSQL is running.

Nobody complained about that wording, so pushed.

regards, tom lane




Re: Add index scan progress to pg_stat_progress_vacuum

2022-01-11 Thread Bossart, Nathan
On 1/10/22, 5:01 PM, "Imseih (AWS), Sami"  wrote:
> I have attached the 3rd revision of the patch which also includes the 
> documentation changes. Also attached is a rendered html of the docs for 
> review.
>
> "max_index_vacuum_cycle_time" has been removed.
> "index_rows_vacuumed" renamed to "index_tuples_removed". "tuples" is a more 
> consistent with the terminology used.
> "vacuum_cycle_ordinal_position" renamed to "index_ordinal_position".

Thanks for the new version of the patch!

nitpick: I get one whitespace error when applying the patch.

Applying: Expose progress for the "vacuuming indexes" phase of a VACUUM 
operation.
.git/rebase-apply/patch:44: tab in indent.
   Whenever  is triggered, 
index
warning: 1 line adds whitespace errors.

+ 
+  
+   num_indexes_to_vacuum bigint
+  
+  
+   The number of indexes that will be vacuumed. Only indexes with
+   pg_index.indisready set to "true" will be vacuumed.
+   Whenever  is triggered, index
+   vacuuming will be bypassed.
+  
+ 
+
+   
+  

We may want to avoid exhaustively listing the cases when this value
will be zero.  I would suggest saying, "When index cleanup is skipped,
this value will be zero" instead.

+ 
+  
+   relid oid
+  
+  
+   OID of the table being vacuumed.
+  
+ 

Do we need to include this field?  I would expect indexrelid to go
here.

+ 
+  
+   leader_pid bigint
+  
+  
+   Process ID of the parallel group leader. This field is 
NULL
+   if this process is a parallel group leader or the
+   vacuuming indexes phase is not performed in parallel.
+  
+ 

Are there cases where the parallel group leader will have an entry in
this view when parallelism is enabled?

+ 
+  
+   index_ordinal_position bigint
+  
+  
+   The order in which the index is being vacuumed. Indexes are vacuumed by 
OID in ascending order.
+  
+ 

Should we include the bit about the OID ordering?  I suppose that is
unlikely to change in the near future, but I don't know if it is
relevant information.  Also, do we need to include the "index_"
prefix?  This view is specific for indexes.  (I have the same question
for index_tuples_removed.)

Should this new table go after the "VACUUM phases" table?  It might
make sense to keep the phases table closer to where it is referenced.

+/* Advertise the number of indexes to vacuum if we are not in failsafe 
mode */
+if (!lazy_check_wraparound_failsafe(vacrel))
+pgstat_progress_update_param(PROGRESS_VACUUM_TOTAL_INDEX_VACUUM, 
vacrel->nindexes);

Shouldn't this be 0 when INDEX_CLEANUP is off, too?
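
For example, something along these lines (a sketch only; the field names are
taken from my reading of vacuumlazy.c, not from the patch itself):

    /* Advertise the number of indexes only if we will actually vacuum them */
    if (vacrel->nindexes > 0 && vacrel->do_index_vacuuming &&
        !lazy_check_wraparound_failsafe(vacrel))
        pgstat_progress_update_param(PROGRESS_VACUUM_TOTAL_INDEX_VACUUM,
                                     vacrel->nindexes);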

+#define PROGRESS_VACUUM_CURRENT_INDRELID 7
+#define PROGRESS_VACUUM_LEADER_PID   8
+#define PROGRESS_VACUUM_INDEX_ORDINAL9
+#define PROGRESS_VACUUM_TOTAL_INDEX_VACUUM   10
+#define PROGRESS_VACUUM_DEAD_TUPLES_VACUUMED 11

nitpick: I would suggest the following names to match the existing
style:

PROGRESS_VACUUM_NUM_INDEXES_TO_VACUUM
PROGRESS_VACUUM_INDEX_LEADER_PID
PROGRESS_VACUUM_INDEX_INDEXRELID
PROGRESS_VACUUM_INDEX_ORDINAL_POSITION
PROGRESS_VACUUM_INDEX_TUPLES_REMOVED

Nathan



Re: [PATCH] Add reloption for views to enable RLS

2022-01-11 Thread Laurenz Albe
On Fri, 2021-12-17 at 18:31 +0100, Christoph Heiss wrote:
> As part of a customer project we are looking to implement a reloption
> for views which, when set, runs the subquery as invoked by the user
> rather than the view owner, as is currently the case.
> The rewrite rule's table references are then checked as if the user were 
> referencing the table(s) directly.
> 
> This feature is similar to so-called 'SECURITY INVOKER' views in other DBMS.
> Although such permission checking could be implemented using views which 
> SELECT from a table function and further using triggers, that approach 
> has obvious performance downsides.

This has been requested before, see for example
https://stackoverflow.com/q/33858030/6464308

Row Level Security is only one use case; there may be other situations
when it is useful to check permissions on the underlying objects with
the current user rather than with the view owner.

> Our initial thought on implementing this was to simply add another 
> reloption for views, just like the already existing `security_barrier`. 
> With this in place, we then can conditionally evaluate in 
> RelationBuildRuleLock() if we need to call setRuleCheckAsUser() or not.
> The new reloption has been named `security`, which is an enum currently 
> only supporting a single value: `relation_permissions`.


You made that an enum with only a single value.
What other values could you imagine in the future?

I think that this should be a boolean reloption, for example "security_definer".
If unset or set to "off", you would get the current behavior.
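
For illustration, the desired behavior is something like this (a sketch with
a made-up option name, just to show the idea):

    CREATE VIEW phone_numbers WITH (security_invoker = true) AS
        SELECT person, phone FROM phone_data;

    -- With the option set, SELECT on the view checks privileges (and any
    -- row security policies) on phone_data against the user querying the
    -- view, rather than against the view owner as today.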


> Finally, patch 0003 updates the documentation for this new reloption.

diff --git a/doc/src/sgml/ddl.sgml b/doc/src/sgml/ddl.sgml
index 64d9030652..760ea2f794 100644
--- a/doc/src/sgml/ddl.sgml
+++ b/doc/src/sgml/ddl.sgml
@@ -2292,6 +2292,10 @@ GRANT SELECT (col1), UPDATE (col1) ON mytable TO 
miriam_rw;
are not subject to row security.
   
 
+  
+   For views, the policies are applied as being referenced through the view 
owner by default, rather than the user referencing the view. To apply row 
security policies as defined for the invoking
user, the security option can be set on views (see CREATE VIEW) to get the same behavior.
+  
+
   
Row security policies can be specific to commands, or to roles, or to
both.  A policy can be specified to apply to ALL

Please avoid long lines like that.  Also, I don't think that the documentation 
on
RLS policies is the correct place for this.  It should be on a page dedicated 
to views
or permissions.

The CREATE VIEW page already has a paragraph about this, starting with
"Access to tables referenced in the view is determined by permissions of the 
view owner."
This looks like the best place to me (and it would need to be adapted anyway).

Yours,
Laurenz Albe





Re: Column Filtering in Logical Replication

2022-01-11 Thread Alvaro Herrera
On 2022-Jan-10, Alvaro Herrera wrote:

> Hmm.  So you're saying that we should only raise errors about the column
> list if we are publishing UPDATE or DELETE, but otherwise let the
> replica identity be anything.  OK, I'll see if I can come up with a
> reasonable set of rules ...

This is an attempt to do it that way.  Now you can add a table to a
publication without regard for how the column filter compares to the
replica identity, as long as the publication does not publish updates
or deletes.
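
To illustrate the intended rule (a sketch using the patch's syntax; suppose
tbl's replica identity is the column "a"):

    -- accepted: the column list omits "a", but only inserts are published
    CREATE PUBLICATION pub_ins FOR TABLE tbl (b, c)
        WITH (publish = 'insert');

    -- rejected: same column list, but updates/deletes are published too
    CREATE PUBLICATION pub_all FOR TABLE tbl (b, c);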

-- 
Álvaro Herrera  Valdivia, Chile  —  https://www.EnterpriseDB.com/
"La fuerza no está en los medios físicos
sino que reside en una voluntad indomable" (Gandhi)
>From e3350af9023b5181b70ce480d039b3df46e9c019 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Tue, 11 Jan 2022 15:46:07 -0300
Subject: [PATCH v17] Support column lists for logical replication of tables

Add the capability of specifying a column list for individual tables as
part of a publication.  Columns not in the list are not published.  This
enables replicating to a table with only a subset of the columns.

If no column list is specified, all the columns are replicated, as
previously

Author: Rahila Syed 
Discussion: https://postgr.es/m/CAH2L28vddB_NFdRVpuyRBJEBWjz4BSyTB=_ektnrh8nj1jf...@mail.gmail.com
---
 doc/src/sgml/catalogs.sgml  |  13 +
 doc/src/sgml/protocol.sgml  |   4 +-
 doc/src/sgml/ref/alter_publication.sgml |  20 +-
 doc/src/sgml/ref/create_publication.sgml|  11 +-
 src/backend/catalog/pg_publication.c| 353 +++-
 src/backend/commands/publicationcmds.c  |  67 +++-
 src/backend/commands/tablecmds.c|  87 -
 src/backend/nodes/copyfuncs.c   |   1 +
 src/backend/nodes/equalfuncs.c  |   1 +
 src/backend/parser/gram.y   |  60 +++-
 src/backend/replication/logical/proto.c |  66 ++--
 src/backend/replication/logical/tablesync.c | 119 ++-
 src/backend/replication/pgoutput/pgoutput.c | 118 ++-
 src/bin/pg_dump/pg_dump.c   |  41 ++-
 src/bin/pg_dump/pg_dump.h   |   1 +
 src/bin/psql/describe.c |  26 +-
 src/bin/psql/tab-complete.c |   2 +
 src/include/catalog/pg_publication.h|   6 +
 src/include/catalog/pg_publication_rel.h|   3 +
 src/include/nodes/parsenodes.h  |   4 +-
 src/include/replication/logicalproto.h  |   6 +-
 src/test/regress/expected/publication.out   |  59 +++-
 src/test/regress/sql/publication.sql|  39 ++-
 src/test/subscription/t/028_column_list.pl  | 164 +
 24 files changed, 1184 insertions(+), 87 deletions(-)
 create mode 100644 src/test/subscription/t/028_column_list.pl

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 03e2537b07..b7b75f64a2 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -6311,6 +6311,19 @@ SCRAM-SHA-256$iteration count:
Reference to relation
   
  
+
+ 
+  
+   prattrs int2vector
+   (references pg_attribute.attnum)
+  
+  
+   This is an array of values that indicates which table columns are
+   part of the publication.  For example a value of 1 3
+   would mean that the first and the third table columns are published.
+   A null value indicates that all attributes are published.
+  
+ 
 

   
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 34a7034282..5bc2e7a591 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -6877,7 +6877,9 @@ Relation
 
 
 
-Next, the following message part appears for each column (except generated columns):
+Next, the following message part appears for each column (except
+generated columns and other columns that don't appear in the column
+filter list, for tables that have one):
 
 
 
diff --git a/doc/src/sgml/ref/alter_publication.sgml b/doc/src/sgml/ref/alter_publication.sgml
index bb4ef5e5e2..d0e97243b8 100644
--- a/doc/src/sgml/ref/alter_publication.sgml
+++ b/doc/src/sgml/ref/alter_publication.sgml
@@ -25,12 +25,13 @@ ALTER PUBLICATION name ADD name SET publication_object [, ...]
 ALTER PUBLICATION name DROP publication_object [, ...]
 ALTER PUBLICATION name SET ( publication_parameter [= value] [, ... ] )
+ALTER PUBLICATION name ALTER TABLE publication_object SET COLUMNS { ( name [, ...] ) | ALL }
 ALTER PUBLICATION name OWNER TO { new_owner | CURRENT_ROLE | CURRENT_USER | SESSION_USER }
 ALTER PUBLICATION name RENAME TO new_name
 
 where publication_object is one of:
 
-TABLE [ ONLY ] table_name [ * ] [, ... ]
+TABLE [ ONLY ] table_name [ * ] [ ( column_name, [, ... ] ) ] [, ... ]
 ALL TABLES IN SCHEMA { schema_name | CURRENT_SCHEMA } [, ... ]
 
  
@@ -62,6 +63,11 @@ ALTER PUBLICATION name RENAME TO 
 
+  
+   The ALTER TABLE ... SET COLUMNS variant allows changing
+   the set of columns that are 

Re: sequences vs. synchronous replication

2022-01-11 Thread Tomas Vondra
On 12/24/21 11:40, Tomas Vondra wrote:
> 
> ...
> 
> FWIW I plan to explore the idea of looking at sequence page LSN, and
> flushing up to that position.
> 

So, I explored the page LSN idea, and it seems to be working pretty
nicely. There still is some impact on the workload doing just nextval
calls, but it's much better than WAL-logging everything. The patch is
available at [1].

While working on that patch, I realized this actually affects even
systems without any replicas - it's trivial to get a sequence going
backwards. Imagine you do this:

  BEGIN;
  SELECT nextval('s') FROM generate_series(1,50) s(i);
  ROLLBACK;

  SELECT nextval('s');

  -- kill -9 postgres

It's pretty likely that a nextval('s') after restarting the server returns a
value from before the last nextval('s'), if we did not manage to flush the
WAL before the kill.

The patch deals with this by updating XactLastRecEnd and then flushing
up to that point in RecordTransactionCommit(). But for that to happen,
we have to do the flush/wait even without a valid XID (which we may not
have when nextval gets called outside a transaction).
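
In pseudocode, the idea is roughly this (a simplified sketch, not the exact
patch):

    /* nextval_internal(): after touching the sequence page, remember its LSN */
    if (PageGetLSN(page) > XactLastRecEnd)
        XactLastRecEnd = PageGetLSN(page);

    /* RecordTransactionCommit(): flush (or wait for) WAL up to
     * XactLastRecEnd even when the transaction never assigned an XID */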

So I was wondering what other places do the same thing (generates WAL
without setting a XID), because that might either have similar issues
with not flushing data, and/or be affected by this change.

RecordTransactionCommit() says about such cases this:

  /*
   * Check if we want to commit asynchronously.  We can allow the
   * XLOG flush to happen asynchronously if synchronous_commit=off,
   * or if the current transaction has not performed any WAL-logged
   * operation or didn't assign an xid.  The transaction can end up
   * not writing any WAL, even if it has an xid, if it only wrote to
   * temporary and/or unlogged tables.  It can end up having written
   * WAL without an xid if it did HOT pruning.  In case of a crash,
   * the loss of such a transaction will be irrelevant; temp tables
   * will be lost anyway, unlogged tables will be truncated and HOT
   * pruning will be done again later. (Given the foregoing, you
   * might think that it would be unnecessary to emit the XLOG record
   * at all in this case, but we don't currently try to do that.  It
   * would certainly cause problems at least in Hot Standby mode,
   * where the KnownAssignedXids machinery requires tracking every
   * XID assignment.  It might be OK to skip it only when wal_level <
   * replica, but for now we don't.)
   *
   * However, if we're doing cleanup of any non-temp rels or
   * committing any command that wanted to force sync commit, then we
   * must flush XLOG immediately.  (We must not allow asynchronous
   * commit if there are any non-temp tables to be deleted, because
   * we might delete the files before the COMMIT record is flushed to
   * disk.  We do allow asynchronous commit if all to-be-deleted
   * tables are temporary though, since they are lost anyway if we
   * crash.)
   */

Note: This relates only to XLogFlush vs. XLogSetAsyncXactLSN, not about
waiting for sync standby. For that we ignore forceSyncCommit, which
seems a bit weird ...

Anyway, I was wondering what happens in practice, so I added very simple
logging to RecordTransactionCommit():

    if (wrote_log && !markXidCommitted)
        elog(WARNING, "not flushing at %X/%X",
             (uint32) (XactLastRecEnd >> 32),
             (uint32) XactLastRecEnd);

and then ran installcheck, which produces ~700 messages. Looking at the
WAL (last few records before the LSN reported by the log message), most
of this is related to HOT pruning (i.e. PRUNE), but there's plenty of
other WAL records. And I'm not sure if it's OK to just lose (some of)
those messages, as the comment claims for PRUNE.

It's quite possible I miss something basic/important, and everything is
fine and dandy, but here's a couple non-pruning examples - command
triggering the log message, along with the last few WAL records without
XID assigned right before RecordTransactionCommit() was called.

A more complete data set (full WAL dump, regression.diffs etc.) is
available at [2].



VACUUM ANALYZE num_exp_add;
---
VISIBLE cutoff xid 37114 flags 0x01, blkref #0: rel 1663/63341/3697 ...
INPLACE off 39, blkref #0: rel 1663/63341/1259 blk 5


SELECT proname, provolatile FROM pg_proc
   WHERE oid in ('functest_B_1'::regproc,
 'functest_B_2'::regproc,
 'functest_B_3'::regproc,
 'functest_B_4'::regproc) ORDER BY proname;

VACUUM nunused 223, blkref #0: rel 1663/63341/2608 blk 39
VISIBLE cutoff xid 39928 flags 0x01, blkref #0: rel 1663/63341/2608 ...
VACUUM nunused 6, blkref #0: rel 1663/63341/2608 blk 40
META_CLEANUP last_cleanup_num_delpages 5, blkref #0: rel 1663/63341 ...
META_CLEANUP last_cleanup_num_delpages 1, blkref #0: rel 1663/63341 ...
INPLACE off 13, blkref #0: rel 1663/63341/1259 blk 4

More data files / forks

2022-01-11 Thread Chris Cleveland
I'm working on a table access method that stores indexes in a structure
that looks like an LSM tree. Changes get written to small segment files,
which then get merged into larger segment files.

It's really tough to manage these files using existing fork/buffer/page
files, because when you delete a large segment it leaves a lot of empty
space. It's a lot easier to write the segments into separate files on disk
and then delete them as needed.

I could do that, but then I lose the advantages of having data in native
Postgres files, including support for buffering and locking.

It's important to have the segments stored contiguously on disk. I've
benchmarked it; it makes a huge performance difference.

Questions:

1. Are there any other disadvantages to storing data in my own files on
disk, instead of in files managed by Postgres?

2. Is it possible to increase the number of forks? I could store each level
of the LSM tree in its own fork very efficiently. Forks could get truncated
as needed. A dozen forks would handle it nicely.
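
For reference, the set of forks is currently a fixed enum in
src/include/common/relpath.h, so adding more would mean extending this enum
and everything that iterates up to MAX_FORKNUM:

    typedef enum ForkNumber
    {
        InvalidForkNumber = -1,
        MAIN_FORKNUM = 0,
        FSM_FORKNUM,
        VISIBILITYMAP_FORKNUM,
        INIT_FORKNUM
    } ForkNumber;

    #define MAX_FORKNUM        INIT_FORKNUM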


Re: fix crash with Python 3.11

2022-01-11 Thread Jacob Champion
On Wed, 2021-12-22 at 09:24 +0100, Peter Eisentraut wrote:
> The fix is that we need to catch the PostgreSQL error and turn it into a 
> Python exception, like we do for other places where plpy.* methods call 
> into PostgreSQL internals.

Tested on Ubuntu 20.04, with 3.11.0a3 (built by hand) and 3.8.10 (from
the repositories). The tests pass, so LGTM. I agree that tidying up
some of the code duplication would be nice.

Thanks,
--Jacob


Re: Should we improve "PID XXXX is not a PostgreSQL server process" warning for pg_terminate_backend(<>)?

2022-01-11 Thread Bossart, Nathan
On 1/11/22, 10:06 AM, "John Naylor"  wrote:
> I pushed this with one small change -- I felt the comment didn't need
> to explain the warning message, since it now simply matches the coding
> more exactly. Also, v5 was a big enough change from v4 that I put
> Nathan as the first author.

Thanks!

Nathan



Re: Should we improve "PID XXXX is not a PostgreSQL server process" warning for pg_terminate_backend(<>)?

2022-01-11 Thread John Naylor
On Tue, Dec 7, 2021 at 10:51 PM Bossart, Nathan  wrote:
>
> On 12/7/21, 5:21 PM, "Bharath Rupireddy" 
>  wrote:
> > On Wed, Dec 8, 2021 at 4:17 AM Bossart, Nathan  wrote:
> >> I agree with Tom.  I would just s/server/backend/ (as per the
> >> attached) and call it a day.
> >
> > Thanks. v5 patch looks good to me.
>
> I've marked the commitfest entry as ready-for-committer.

I pushed this with one small change -- I felt the comment didn't need
to explain the warning message, since it now simply matches the coding
more exactly. Also, v5 was a big enough change from v4 that I put
Nathan as the first author.

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: sepgsql logging

2022-01-11 Thread Tom Lane
Andrew Dunstan  writes:
> I am not that person either. I agree this looks reasonable, but I also
> would like the opinion of an expert, if we have one.

I'm not sure we do anymore.  Anyway, I tried this on Fedora 35 and
confirmed that it compiles and the (very tedious) test process
described in the sepgsql docs still passes.  Looking in the system's
logs, it appears that Dave didn't precisely emulate how SELinux
logs this setting, because I see messages like

Jan  4 12:25:46 nuc1 audit[1754]: AVC avc:  denied  { setgid } for  pid=1754 
comm="sss_cache" capability=6  
scontext=unconfined_u:unconfined_r:useradd_t:s0-s0:c0.c1023 
tcontext=unconfined_u:unconfined_r:useradd_t:s0-s0:c0.c1023 tclass=capability 
permissive=0

So it looks like their plan is to unconditionally write "permissive=0"
or "permissive=1", while Dave's patch just prints nothing in enforcing
mode.  While I can see some virtue in brevity, I think that doing
exactly what SELinux does is probably a better choice.  For one thing,
it'd remove doubt about whether one is looking at a log from a sepgsql
version that logs this or one that doesn't.

Other than that nitpick, I think we should just push this.

regards, tom lane




Re: Avoiding smgrimmedsync() during nbtree index builds

2022-01-11 Thread Melanie Plageman
On Mon, Jan 10, 2022 at 5:50 PM Melanie Plageman
 wrote:
>
> I have attached a v3 which includes two commits -- one of which
> implements the directmgr API and uses it and the other which adds
> functionality to use either directmgr or bufmgr API during index build.
>
> Also registering for march commitfest.

Forgot directmgr.h. Attached v4

- Melanie
From 291d5ea92e52f5ac4ff637f383c9bc92973c9e2b Mon Sep 17 00:00:00 2001
From: Melanie Plageman 
Date: Mon, 10 Jan 2022 17:34:01 -0500
Subject: [PATCH v4 2/2] Use shared buffers when possible for index build

When there are not too many tuples, building the index in shared buffers
makes sense. It allows the buffer manager to handle how best to do the
IO.
---
 src/backend/access/nbtree/nbtree.c  |  34 ++--
 src/backend/access/nbtree/nbtsort.c | 268 ++--
 2 files changed, 225 insertions(+), 77 deletions(-)

diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 6ab6651420..0714c7fdd6 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -151,33 +151,27 @@ bthandler(PG_FUNCTION_ARGS)
 void
 btbuildempty(Relation index)
 {
+	/*
+	 * Since this only writes one page, use shared buffers.
+	 */
 	Page		metapage;
-	UnBufferedWriteState wstate;
-
-	wstate.smgr_rel = RelationGetSmgr(index);
-
-	unbuffered_prep();
-
-	/* Construct metapage. */
-	metapage = (Page) palloc(BLCKSZ);
-	_bt_initmetapage(metapage, P_NONE, 0, _bt_allequalimage(index, false));
+	Buffer metabuf;
 
 	/*
-	 * Write the page and log it.  It might seem that an immediate sync would
-	 * be sufficient to guarantee that the file exists on disk, but recovery
-	 * itself might remove it while replaying, for example, an
-	 * XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE record.  Therefore, we need
-	 * this even when wal_level=minimal.
+	 * Allocate a buffer for metapage and initialize metapage.
 	 */
-	unbuffered_write(, INIT_FORKNUM, BTREE_METAPAGE, metapage);
-	log_newpage((index)->smgr_rnode.node, INIT_FORKNUM,
-BTREE_METAPAGE, metapage, true);
+	metabuf = ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_ZERO_AND_LOCK, NULL);
+	metapage = BufferGetPage(metabuf);
+	_bt_initmetapage(metapage, P_NONE, 0, _bt_allequalimage(index, false));
 
 	/*
-	 * Even though we xlog'd the page, a concurrent checkpoint may have moved
-	 * the redo pointer past our xlog record, so we may still need to fsync.
+	 * Mark metapage buffer as dirty and XLOG it
 	 */
-	unbuffered_finish(, INIT_FORKNUM);
+	START_CRIT_SECTION();
+	MarkBufferDirty(metabuf);
+	log_newpage_buffer(metabuf, true);
+	END_CRIT_SECTION();
+	_bt_relbuf(index, metabuf);
 }
 
 /*
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 5687acd99d..68b53e6c04 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -234,6 +234,7 @@ typedef struct BTPageState
 {
 	Page		btps_page;		/* workspace for page building */
 	BlockNumber btps_blkno;		/* block # to write this page at */
+	Buffer btps_buf; /* buffer to write this page to */
 	IndexTuple	btps_lowkey;	/* page's strict lower bound pivot tuple */
 	OffsetNumber btps_lastoff;	/* last item offset loaded */
 	Size		btps_lastextra; /* last item's extra posting list space */
@@ -251,10 +252,25 @@ typedef struct BTWriteState
 	Relation	index;
 	BTScanInsert inskey;		/* generic insertion scankey */
 	bool		btws_use_wal;	/* dump pages to WAL? */
-	BlockNumber btws_pages_alloced; /* # pages allocated */
-	BlockNumber btws_pages_written; /* # pages written out */
+	BlockNumber btws_pages_alloced; /* # pages allocated for index builds outside SB */
+	BlockNumber btws_pages_written; /* # pages written out for index builds outside SB */
 	Page		btws_zeropage;	/* workspace for filling zeroes */
 	UnBufferedWriteState ub_wstate;
+	/*
+	 * Allocate a new btree page. This does not initialize the page.
+	 */
+	Page (*_bt_bl_alloc_page) (struct BTWriteState *wstate, BlockNumber
+			*blockno, Buffer *buf);
+	/*
+	 * Emit a completed btree page, and release the working storage.
+	 */
+	void (*_bt_blwritepage) (struct BTWriteState *wstate, Page page,
+			BlockNumber blkno, Buffer buf);
+
+	void (*_bt_bl_unbuffered_prep) (UnBufferedWriteState *wstate);
+
+	void (*_bt_bl_unbuffered_finish) (UnBufferedWriteState *wstate, ForkNumber
+			forknum);
 } BTWriteState;
 
 
@@ -263,10 +279,22 @@ static double _bt_spools_heapscan(Relation heap, Relation index,
 static void _bt_spooldestroy(BTSpool *btspool);
 static void _bt_spool(BTSpool *btspool, ItemPointer self,
 	  Datum *values, bool *isnull);
-static void _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2);
+static void _bt_leafbuild(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2);
 static void _bt_build_callback(Relation index, ItemPointer tid, Datum *values,
 			   bool *isnull, bool tupleIsAlive, void *state);
-static Page _bt_blnewpage(uint32 level);
+
+static Page

Re: Windows crash / abort handling

2022-01-11 Thread Andrew Dunstan


On 1/11/22 02:51, Andres Freund wrote:
> Hi,
>
> On 2022-01-10 10:57:00 -0500, Andrew Dunstan wrote:
>> On 10/5/21 15:30, Andres Freund wrote
>>> The above ends up dumping all crashes into a single file, but that can
>>> probably be improved. But cdb is so gnarly that I wanted to stop looking 
>>> once
>>> I got this far...
> FWIW, I figured out how to put the dumps into separate files by now...
>
>
>>> Andrew, I wonder if something like this could make sense for windows BF 
>>> animals?
>> Very possibly. I wonder how well it will work on machines where I have
>> more than one animal .e.g. lorikeet (cygwin) jacana (msys) and bowerbird
>> (MSVC) are all on the same machine. Likewise drongo (MSVC) and fairywren
>> (msys2).
> Hm. I can see a few ways to deal with it. Are they running concurrently?
> If not then it's easy enough to deal with.
>
> It'd be a bit of a fight with cdb's awfully documented and quirky
> scripting [1], but the best solution would probably be to just use an
> environment variable from the target process to determine the dump
> location. Then each buildfarm config could set a BF_BACKTRACE_LOCATION
> variable or such...
>
> [1] So there's !envvar. But that yields a string like
> BF_BACKTRACE_LOCATION = value of environment variable when set to an
> alias.  And I haven't found an easy way to get rid of the "variablename
> = ". There is .foreach /pS [2] which could be used to skip over the
> varname =, but that then splits on all whitespaces. Gah.
>
> [2] 
> https://docs.microsoft.com/en-us/windows-hardware/drivers/debugger/-foreach
>

Ugly as heck. I normally don't use locations with spaces in them. Let's
assume we don't have to deal with that issue at least.

But in theory these animals could be running in parallel, and in theory
each animal could have more than one branch being run concurrently. In
fact locking them against each other can be difficult/impossible. From
experience, three different perls might not agree on how file locking
works ... In the case of fairywren/drongo I have had to set things up so
that there is a single batch file that runs one job after the other.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: sepgsql logging

2022-01-11 Thread Andrew Dunstan


On 1/11/22 10:40, Dave Page wrote:
>
>
> On Wed, 2021-04-14 at 09:49 -0400, Robert Haas wrote:
> > Looks superficially reasonable on first glance, but I think we should
> > try to get an opinion from someone who knows more about SELinux.
>
> I am not that someone, but this looks straightforward, it's been
> stalled for a while, and I think it should probably go in.
>
>
> I'd like to see that. Thanks for the review. 
>

I am not that person either. I agree this looks reasonable, but I also
would like the opinion of an expert, if we have one.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: generic plans and "initial" pruning

2022-01-11 Thread Robert Haas
On Fri, Dec 24, 2021 at 10:36 PM Amit Langote  wrote:
> However, using an idea that Robert suggested to me off-list a little
> while back, it seems possible to determine the set of partitions that
> we can safely skip locking.  The idea is to look at the "initial" or
> "pre-execution" pruning instructions contained in a given Append or
> MergeAppend node when AcquireExecutorLocks() is collecting the
> relations to lock and consider relations from only those sub-nodes
> that survive performing those instructions.   I've attempted
> implementing that idea in the attached patch.

Hmm. The first question that occurs to me is whether this is fully safe.

Currently, AcquireExecutorLocks calls LockRelationOid for every
relation involved in the query. That means we will probably lock at
least one relation on which we previously had no lock and thus
AcceptInvalidationMessages(). That will end up marking the query as no
longer valid and CheckCachedPlan() will realize this and tell the
caller to replan. In the corner case where we already hold all the
required locks, we will not accept invalidation messages at this
point, but must have done so after acquiring the last of the locks
required, and if that didn't mark the plan invalid, it can't be
invalid now either. Either way, everything is fine.
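
For reference, a simplified sketch of the existing mechanism being relied on
(roughly what plancache.c and lmgr.c do today, with details elided):

    /* AcquireExecutorLocks(), simplified */
    foreach(lc, plannedstmt->rtable)
    {
        RangeTblEntry *rte = lfirst_node(RangeTblEntry, lc);

        if (rte->rtekind == RTE_RELATION)
            LockRelationOid(rte->relid, rte->rellockmode);
    }

    /* LockRelationOid(), simplified: a lock we didn't already hold makes us
     * process pending invalidations, which can mark the cached plan invalid,
     * and CheckCachedPlan() then tells the caller to replan. */
    res = LockAcquireExtended(&tag, lockmode, false, false, true, &locallock);
    if (res != LOCKACQUIRE_ALREADY_CLEAR)
        AcceptInvalidationMessages();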

With the proposed patch, we might never lock some of the relations
involved in the query. Therefore, if one of those relations has been
modified in some way that would invalidate the plan, we will
potentially fail to discover this, and will use the plan anyway. For
instance, suppose there's one particular partition that has an extra
index and the plan involves an Index Scan using that index. Now
suppose that the scan of the partition in question is pruned, but
meanwhile, the index has been dropped. Now we're running a plan that
scans a nonexistent index. Admittedly, we're not running that part of
the plan. But is that enough for this to be safe? There are things
(like EXPLAIN or auto_explain) that we might try to do even on a part
of the plan tree that we don't try to run. Those things might break,
because for example we won't be able to look up the name of an index
in the catalogs for EXPLAIN output if the index is gone.

This is just a relatively simple example and I think there are
probably a bunch of others. There are a lot of kinds of DDL that could
be performed on a partition that gets pruned away: DROP INDEX is just
one example. The point is that to my knowledge we have no existing
case where we try to use a plan that might be only partly valid, so if
we introduce one, there's some risk there. I thought for a while, too,
about whether changes to some object in a part of the plan that we're
not executing could break things for the rest of the plan even if we
never do anything with the plan but execute it. I can't quite see any
actual hazard. For example, I thought about whether we might try to
get the tuple descriptor for the pruned-away object and get a
different tuple descriptor than we were expecting. I think we can't,
because (1) the pruned object has to be a partition, and tuple
descriptors have to match throughout the partitioning hierarchy,
except for column ordering, which currently can't be changed
after-the-fact and (2) IIRC, the tuple descriptor is stored in the
plan and not reconstructed at runtime and (3) if we don't end up
opening the relation because it's pruned, then we certainly can't do
anything with its tuple descriptor. But it might be worth giving more
thought to the question of whether there's any other way we could be
depending on the details of an object that ended up getting pruned.

> Note that "initial" pruning steps are now performed twice when
> executing generic plans: once in AcquireExecutorLocks() to find
> partitions to be locked, and a 2nd time in ExecInit[Merge]Append() to
> determine the set of partition sub-nodes to be initialized for
> execution, though I wasn't able to come up with a good idea to avoid
> this duplication.

I think this is something that will need to be fixed somehow. Apart
from the CPU cost, it's scary to imagine that the set of nodes on
which we acquired locks might be different from the set of nodes that
we initialize. If we do the same computation twice, there must be some
non-zero probability of getting a different answer the second time,
even if the circumstances under which it would actually happen are
remote. Consider, for example, a function that is labeled IMMUTABLE
but is really VOLATILE. Now maybe you can get the system to lock one
set of partitions and then initialize a different set of partitions. I
don't think we want to try to reason about what consequences that
might have and prove that somehow it's going to be OK; I think we want
to nail the door shut very tightly to make sure that it can't.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: sequences vs. synchronous replication

2022-01-11 Thread Tomas Vondra


On 12/22/21 18:50, Fujii Masao wrote:
> 
> 
> On 2021/12/22 21:11, Tomas Vondra wrote:
>> Interesting idea, but I think it has a couple of issues :-(
> 
> Thanks for the review!
> 
>> 1) We'd need to know the LSN of the last WAL record for any given
>> sequence, and we'd need to communicate that between backends somehow.
>> Which seems rather tricky to do without affecting performance.
> 
> How about using the page lsn for the sequence? nextval_internal()
> already uses that to check whether it's less than or equal to checkpoint
> redo location.
> 

I explored the idea of using page LSN a bit, and there's some good and
bad news.

The patch from 22/12 simply checks if the change should/would wait for
sync replica, and if yes it WAL-logs the sequence increment. There's a
couple problems with this, unfortunately:

1) Imagine a high-concurrency environment, with a lot of sessions doing
nextval('s') at the same time. One session WAL-logs the increment, but
before the WAL gets flushed / sent to replica, another session calls
nextval. SyncRepNeedsWait() says true, so it WAL-logs it again, moving
the page LSN forward. And so on. So in a high-concurrency environments,
this simply makes the matters worse - it causes an avalanche of WAL
writes instead of saving anything.

(You don't even need multiple sessions - a single session calling
nextval would have the same issue, WAL-logging every call.)


2) It assumes having a synchronous replica, but that's wrong. It's
partially my fault because I formulated this issue as if it was just
about sync replicas, but that's just one symptom. It applies even to
systems without any replicas.

Imagine you do

  BEGIN;
  SELECT nextval('s') FROM generate_series(1,40);
  ROLLBACK;

  SELECT nextval('s');

and then you murder the server by "kill -9". If you restart it and do a
nextval('s') again, the value will likely go back, generating duplicate
values :-(


So I think this approach is not really an improvement over WAL-logging
every increment. But there's a better way, I think - we don't need to
generate WAL, we just need to ensure we wait for it to be flushed at
transaction end in RecordTransactionCommit().

That is, instead of generating more WAL, simply update XactLastRecEnd
and then ensure RecordTransactionCommit flushes/waits etc. Attached is a
patch doing that - the changes in sequence.c are trivial, changes in
RecordTransactionCommit simply ensure we flush/wait even without XID
(this actually raises some additional questions that I'll discuss in a
separate message in this thread).

I repeated the benchmark measurements with nextval/insert workloads, to
compare this with the other patch (WAL-logging every increment). I had
to use a different machine, so the the results are not directly
comparable to the numbers presented earlier.

On btrfs, it looks like this (the numbers next to nextval are the cache
size, with 1 being the default). log-all is the first patch, page-lsn is
the new patch using page LSN. The first columns are raw pgbench tps
values, the last two columns are the comparison to master:

  client  test         master   log-all  page-lsn   log-all  page-lsn
  --------------------------------------------------------------------
       1  insert          829       807       802       97%       97%
          nextval/1     16491       814     16465        5%      100%
          nextval/32    24487     16462     24632       67%      101%
          nextval/64    24516     24918     24671      102%      101%
          nextval/128   32337     33178     32863      103%      102%

  client  test         master   log-all  page-lsn   log-all  page-lsn
  --------------------------------------------------------------------
       4  insert         1577      1590      1546      101%       98%
          nextval/1     45607      1579     21220        3%       47%
          nextval/32    68453     49141     51170       72%       75%
          nextval/64    66928     65534     66408       98%       99%
          nextval/128   83502     81835     82576       98%       99%

The results seem clearly better, I think.

For "insert" there's no drop at all (same as before), because as soon as
a transaction generates any WAL, it has to flush/wait anyway.

And for "nextval" there's a drop, but only with 4 clients, and it's much
smaller (53% instead of 97%). And increasing the cache size eliminates
even that.

Out of curiosity I ran the tests on tmpfs too, which should show overhead
not related to I/O. The results are similar:

  client  test         master   log-all  page-lsn   log-all  page-lsn
  --------------------------------------------------------------------
       1  insert        44033     43740     43215       99%       98%
          nextval/1     58640     48384     59243       83%      101%
          nextval/32    61089     60901     60830      100%      100%
          nextval/64    60412     61315     61550      101%      102%
          nextval/128

Re: Time to drop plpython2?

2022-01-11 Thread Tom Lane
Peter Eisentraut  writes:
> On 15.11.21 19:52, Peter Eisentraut wrote:
>> I think we should just write to the build farm owners, we plan to drop 
>> python2 support in, say, 60 days, please update your setup to use 
>> python3 or disable python support.

> This discussion stalled.  I think we should do *something* for 
> PostgreSQL 15.

Agreed.  We want plpython2 out of there for v15, or we'll be adding
another year to the length of time we're on the hook to support
python2 in the back branches --- which, as you say, is getting
ever more painful to do.

> I suspect at this point, the Meson move isn't going to happen for 
> PostgreSQL 15.

Also agreed.  Nonetheless, we need to make a recommendation to the
buildfarm owners about what's the minimum python3 version we intend
to support going forward.  Do we want to just set it at 3.6, with
the expectation that the meson move will happen before too long?

(I'm not going to accept being fuzzy on this point, because I need
to know what to install on prairiedog and gaur.  I intend to install
whatever our minimum supported version is, not only to keep us honest
on that being actually supported, but because I anticipate that
newer python versions will be correspondingly harder to build on
ancient platforms.)

regards, tom lane




Re: sepgsql logging

2022-01-11 Thread Dave Page
Hi

On Tue, Jan 11, 2022 at 12:04 AM Jacob Champion 
wrote:

> On Wed, Apr 14, 2021 at 8:42 AM Dave Page  wrote:
> > Attached is a patch to clean this up. It will log denials as such
> > regardless of whether or not either selinux or sepgsql is in
> > permissive mode. When either is in permissive mode, it'll add "
> > permissive=1" to the end of the log messages. e.g.
>
> Dave,
>
> Just to clarify -- it looks like this patch *only* adds the
> "permissive=1" part, right? I don't see any changes around denied-vs-
> allowed.
>

Right. denied-vs-allowed is shown at the beginning of the log line. From my
earlier output:

2021-04-14 13:20:30.401 BST [23073] LOG:  SELinux: allowed { select }
scontext=user_u:user_r:user_t:s0
tcontext=unconfined_u:object_r:sepgsql_table_t:s0 tclass=db_column
name="column salt of table tb_users" permissive=1
2021-04-14 13:20:30.401 BST [23073] LOG:  SELinux: denied { select }
scontext=user_u:user_r:user_t:s0
tcontext=system_u:object_r:sepgsql_secret_table_t:s0 tclass=db_column
name="column phash of table tb_users" permissive=1


>
> I read the previous posts to mean that you were seeing "allowed" when
> you should have been seeing "denied".


That's what I *thought* was happening originally, because I was mistakenly
in permissive mode (if memory serves).


> I don't see that behavior --
> without this patch, I see the correct "denied" entries even when
> running in permissive mode. (It's been a while since the patch was
> posted, so I checked to make sure there hadn't been any relevant
> changes in the meantime, and none jumped out at me.)
>

Right. The point is that if permissive mode is enabled, access will not be
denied. Effectively if you see permissive=1, then "denied" really means
"would be denied if enforcing mode was enabled".

The idea is that you can run a production system in permissive mode to see
what would be denied without breaking things for users. You can use that
info to build your policy, and then when you no longer see any unexpected
denials in the logs, switch to enforcing mode.
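
Roughly, the workflow looks like this (a sketch; sepgsql also has its own
sepgsql.permissive GUC that can be used the same way independently of the
system-wide SELinux setting):

    getenforce                  # -> Permissive
    # run the workload, watch the "denied ... permissive=1" entries in the
    # PostgreSQL log, and extend the policy until no unexpected denials remain
    setenforce 1                # then switch SELinux to enforcing mode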


>
> That said, the patch looks good as-is and seems to be working for me on
> a Rocky 8 VM. (You weren't kidding about the setup difficulty.) Having
> permissive mode show up in the logs seems very useful.
>
> As an aside, I don't see the "allowed" verbiage that sepgsql uses in
> any of the SELinux documentation. I do see third-party references to
> "granted", though, as in e.g.
>
> avc: granted { execute } for ...
>
> That's not something that I think this patch should touch, but it
> seemed tangentially relevant for future convergence work.
>

Interesting. I never spotted that one. I'm not sure it matters much, except
for consistency. It's not like the various tools for analyzing SELinux logs
would be likely to work on a PostgreSQL log.


>
> On Wed, 2021-04-14 at 09:49 -0400, Robert Haas wrote:
> > Looks superficially reasonable on first glance, but I think we should
> > try to get an opinion from someone who knows more about SELinux.
>
> I am not that someone, but this looks straightforward, it's been
> stalled for a while, and I think it should probably go in.
>

I'd like to see that. Thanks for the review.

-- 
Dave Page
Blog: https://pgsnake.blogspot.com
Twitter: @pgsnake

EDB: https://www.enterprisedb.com


Re: make tuplestore helper function

2022-01-11 Thread Melanie Plageman
On Wed, Jan 5, 2022 at 7:57 PM Justin Pryzby  wrote:
>
> On Wed, Jan 05, 2022 at 12:09:16PM -0500, Melanie Plageman wrote:
> > On Fri, Dec 17, 2021 at 3:04 PM Justin Pryzby  wrote:
> > > There's a couples places that you're checking expectedDesc where it wasn't
> > > being checked before.  Is that some kind of live bug ?
> > > pg_config() text_to_table()
> >
> > Yes, expectedDesc was accessed in these locations without checking that
> > it is not NULL. Should that be a separate patch?
>
> I don't know.  The patch is already easy to review, since it's mostly limited
> to removing code and fixing inconsistencies (NULL check) and possible
> inefficiencies (randomAccess).
>
> If the expectedDesc NULL check were an 0001 patch, then 0002 (the main patch)
> would be even easier to review.  Only foreign.c is different.

I'll wait to do that until a committer says they'd prefer it.
Are you imagining that patch 0001 would only add the check for
expectedDesc that is missing from pg_config() and text_to_table()?

> > > You removed one call to tuplestore_donestoring(), but not the others.
> > > I guess you could remove them all (optionally as a separate patch).
> >
> > I removed it in that one location because I wanted to get rid of the
> > local variable it used.
>
> What local variable ?  I see now that logicalfuncs.c never had a local 
> variable
> called tupstore.  I'm sure it was intended since 2014 (b89e15105) to say
> tupestore_donestoring(p->tupstore).  But donestoring(typo) causes no error,
> since the define is a NOP.
>
> src/include/utils/tuplestore.h:#define tuplestore_donestoring(state) ((void) 
> 0)

Yes, I mean the local variable, tupstore.

> > I am fine with removing the other occurrences,
> > but I thought there might be some reason why instead of removing it,
> > they made it into a no-op.
>
> I assume Tom left it (actually, added it back in dd04e958c) to avoid breaking
> extensions for no reason.  And maybe to preserve the possibility of at some
> point in the future doing something again during the "done" step.
>
> I'd leave it for input from a committer about those:
>
>  - remove tuplestore_donestoring() calls ?
>  - split expectedDesc NULL check to an 0001 patch ?
>  - anything other opened questions ?
>
> I'm marking this as RFC, with the caveat that the newline before
> MakeFuncResultTuplestore went missing again :)

oops. I've attached v6 with the newline.

- Melanie
From e57adc4950ed259a018e5a2c6fd21af127a39e44 Mon Sep 17 00:00:00 2001
From: Melanie Plageman 
Date: Tue, 2 Nov 2021 12:20:55 -0400
Subject: [PATCH v6] Add helper to make tuplestore

And use it to refactor out existing duplicate code.

Note that for func-api callers previously calling
tuplestore_begin_heap() with a constant value for the randomAccess
parameter, by using MakeFuncResultTuplestore(), they are now passing
true or false based on whether or not SFRM_Materialize_Random is set in
rsinfo->allowedModes.

This is consistent with the instructions in
src/backend/utils/fmgr/README, which state that tuplestores "must be
created with randomAccess = true if SFRM_Materialize_Random is set in
allowedModes, but it can (and preferably should) be created with
randomAccess = false if not".

Author: Melanie Plageman 
Reviewed-by: Justin Pryzby 
Discussion: https://www.postgresql.org/message-id/flat/CAAKRu_azyd1Z3W_r7Ou4sorTjRCs%2BPxeHw1CWJeXKofkE6TuZg%40mail.gmail.com
---
 src/backend/access/transam/xlogfuncs.c| 16 +---
 src/backend/commands/event_trigger.c  | 32 +--
 src/backend/commands/extension.c  | 48 +--
 src/backend/commands/prepare.c| 15 +---
 src/backend/foreign/foreign.c | 51 +--
 src/backend/libpq/hba.c   | 19 +
 src/backend/replication/logical/launcher.c| 16 +---
 .../replication/logical/logicalfuncs.c| 18 +---
 src/backend/replication/logical/origin.c  | 19 +
 src/backend/replication/slotfuncs.c   | 16 +---
 src/backend/replication/walsender.c   | 16 +---
 src/backend/storage/ipc/shmem.c   | 16 +---
 src/backend/utils/adt/datetime.c  | 17 +---
 src/backend/utils/adt/genfile.c   | 31 +--
 src/backend/utils/adt/jsonfuncs.c | 84 ++-
 src/backend/utils/adt/mcxtfuncs.c | 16 +---
 src/backend/utils/adt/misc.c  | 14 +---
 src/backend/utils/adt/pgstatfuncs.c   | 48 +--
 src/backend/utils/adt/varlena.c   | 15 ++--
 src/backend/utils/fmgr/funcapi.c  | 37 
 src/backend/utils/misc/guc.c  | 12 +--
 src/backend/utils/misc/pg_config.c| 11 ++-
 src/backend/utils/mmgr/portalmem.c| 15 +---
 src/include/funcapi.h |  2 +
 24 files changed, 124 insertions(+), 460 deletions(-)

diff --git a/src/backend/access/transam/xlogfuncs.c b/src/backend/access/transam/xlogfuncs.c
index d8af5aad58..6af2c6b909 100644
--- 

Re: [Ext:] Re: Stream Replication not working

2022-01-11 Thread Allie Crawford
Amit,
Thank you for your help in trying to understand the information that the
pg_locks table is showing. Regarding your question, I am not sure how to
answer it. How do I figure out which database and relation db:16384 and
relation:12141 refer to?
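
The OIDs from pg_locks can be looked up in the system catalogs (a minimal
sketch):

    SELECT datname FROM pg_database WHERE oid = 16384;
    -- then, connected to that database:
    SELECT relname, relnamespace::regnamespace FROM pg_class WHERE oid = 12141;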

Thanks,
Allie

From: Amit Kapila 
Date: Tuesday, January 11, 2022 at 6:28 AM
To: Allie Crawford 
Cc: SATYANARAYANA NARLAPURAM , 
pgsql-hackers@lists.postgresql.org 
Subject: Re: [Ext:] Re: Stream Replication not working
It seems both master and standby have an exclusive lock on db:16384
and relation:12141. Which is this database/relation and why is the
app/database holding a lock on it?


--
With Regards,
Amit Kapila.

From: Allie Crawford 
Date: Tuesday, January 11, 2022 at 7:47 AM
To: Sushant Postgres 
Cc: SATYANARAYANA NARLAPURAM , 
pgsql-hackers@lists.postgresql.org 
Subject: Re: [Ext:] Re: Stream Replication not working
Satya,
I am a newbie with PostgreSQL; I have no previous experience with it, and I
need to get this replication working. Looking at the data that pg_locks is
showing, I do not know how to interpret it.
I would really appreciate any help you can give me in resolving this issue.
Regards,
Allie

From: Sushant Postgres 
Date: Tuesday, January 11, 2022 at 12:49 AM
To: Allie Crawford 
Cc: SATYANARAYANA NARLAPURAM , 
pgsql-hackers@lists.postgresql.org 
Subject: Re: [Ext:] Re: Stream Replication not working
Hi All,

For us also, the logs are being applied on the slave server, but very slowly.
While checking, we have also seen the same set of locks on the master and
slave servers.
Please suggest a solution for that.
Many thanks in advance!
Thanks

On Tue, Jan 11, 2022 at 2:12 AM Allie Crawford 
mailto:crawfor...@churchofjesuschrist.org>> 
wrote:
Thank you so much for your help on this Satya. I have detailed right below the 
output of the query you asked me to run.

Master

postgresql@ ~>psql

psql (13.5)

Type "help" for help.



postgresql=# select * from pg_locks;

  locktype  | database | relation | page | tuple | virtualxid | transactionid | classid | objid | objsubid | virtualtransaction |   pid   |      mode       | granted | fastpath
------------+----------+----------+------+-------+------------+---------------+---------+-------+----------+--------------------+---------+-----------------+---------+----------
 relation   |    16384 |    12141 |      |       |            |               |         |       |          | 3/6715             | 2669949 | AccessShareLock | t       | t
 virtualxid |          |          |      |       | 3/6715     |               |         |       |          | 3/6715             | 2669949 | ExclusiveLock   | t       | t
(2 rows)



postgresql=#


Standby
postgresql@ ~>psql
psql (13.5)
Type "help" for help.

postgresql=# select * from pg_locks;
  locktype  | database | relation | page | tuple | virtualxid | transactionid | classid | objid | objsubid | virtualtransaction |  pid   |      mode       | granted | fastpath
------------+----------+----------+------+-------+------------+---------------+---------+-------+----------+--------------------+--------+-----------------+---------+----------
 relation   |    16384 |    12141 |      |       |            |               |         |       |          | 2/50               | 642064 | AccessShareLock | t       | t
 virtualxid |          |          |      |       | 2/50       |               |         |       |          | 2/50               | 642064 | ExclusiveLock   | t       | t
 virtualxid |          |          |      |       | 1/1        |               |         |       |          | 1/0                |  17333 | ExclusiveLock   | t       | t
(3 rows)

postgresql=#




From: SATYANARAYANA NARLAPURAM 
mailto:satyanarlapu...@gmail.com>>
Date: Monday, January 10, 2022 at 1:06 PM
To: Allie Crawford 
Cc: 
pgsql-hackers@lists.postgresql.org 
mailto:pgsql-hackers@lists.postgresql.org>>
Subject: [Ext:] Re: Stream Replication not working
[External Email]
Seems there is a problem with the replay on your standby. Either it is too
slow or stuck behind some locks (a replay_lag of 20:38:47.00904 indicates
this, and the flush_lsn is the same as the lsn on the primary). Query
pg_locks to see if the replay is stuck behind a lock.



On Mon, Jan 10, 2022 at 11:53 AM Allie Crawford 
mailto:crawfor...@churchofjesuschrist.org>> 
wrote:
Hi All,
I have implemented streaming replication in one of my environments, and for
some reason, even though all the health checks show that replication is
working, when I run manual tests to see if changes are being replicated, the
changes are not replicated to the standby PostgreSQL environment. I have been
researching for two days and I cannot find any documentation that talks about
the case I am running into. I would appreciate it if anybody could take a look
at the details below and give me some guidance on where the problem might be
that is preventing my changes from being replicated.

Re: Why is src/test/modules/committs/t/002_standby.pl flaky?

2022-01-11 Thread Alexander Lakhin
10.01.2022 23:52, Thomas Munro wrote:
>> I'm yet to find out whether the other 
>> WaitLatchOrSocket' users (e. g. postgres_fdw) can suffer from the
>> disconnected socket state, but this approach definitely works for
>> walreceiver.
> I see where you're going: there might be safe call sequences and
> unsafe call sequences, and maybe walreceiver is asking for trouble by
> double-polling.  I'm not sure about that; I got the impression
> recently that it's possible to get FD_CLOSE while you still have
> buffered data to read, so then the next recv() will return > 0 and
> then we don't have any state left anywhere to remember that we saw
> FD_CLOSE, even if you're careful to poll and read in the ideal
> sequence.  I could be wrong, and it would be nice if there is an easy
> fix along those lines...  The documentation around FD_CLOSE is
> unclear.
I had no strong opinion regarding unsafe call sequences, though initially I
suspected that exactly the second libpqrcv_PQgetResult call could cause
the issue. But after digging into WaitLatchOrSocket I was inclined to put
the fix deeper, to satisfy all possible callers.
On the other hand, I shared Tom's concerns regarding other clients that
can get stuck on WaitForMultipleObjects() just as walreceiver does, and
hoped that only walreceiver suffers from a graceful server socket closing.
So to get these doubts cleared, I've made a simple test for postgres_fdw
(please look at the attachment; you can put it into
contrib/postgres_fdw/t and run `vcregress taptest contrib\postgres_fdw`).
This test shows for me:
===
...
t/001_disconnection.pl .. # 12:13:39.481084 executing query...
# 12:13:43.245277 result:   0
# 0|0

# 12:13:43.246342 executing query...
# 12:13:46.525924 result:   0
# 0|0

# 12:13:46.527097 executing query...
# 12:13:47.745176 result:   3
#
# psql::1: WARNING:  no connection to the server
# psql::1: ERROR:  FATAL:  terminating connection due to
administrator co
mmand
# server closed the connection unexpectedly
#   This probably means the server terminated abnormally
#   before or while processing the request.
# CONTEXT:  remote SQL command: FETCH 100 FROM c1
# 12:13:47.794612 executing query...
# 12:13:51.073318 result:   0
# 0|0

# 12:13:51.074347 executing query...
===

With the simple logging added to connection.c:
                /* Sleep until there's something to do */
elog(LOG, "pgfdw_get_result before WaitLatchOrSocket");
                wc = WaitLatchOrSocket(MyLatch,
                                       WL_LATCH_SET | WL_SOCKET_READABLE |
                                       WL_EXIT_ON_PM_DEATH,
                                       PQsocket(conn),
                                       -1L, PG_WAIT_EXTENSION);
elog(LOG, "pgfdw_get_result after WaitLatchOrSocket");

I see in 001_disconnection_local.log:
...
2022-01-11 15:13:52.875 MSK|Administrator|postgres|61dd747f.5e4|LOG: 
pgfdw_get_result after WaitLatchOrSocket
2022-01-11 15:13:52.875
MSK|Administrator|postgres|61dd747f.5e4|STATEMENT:  SELECT * FROM large
WHERE a = fx2(a)
2022-01-11 15:13:52.875 MSK|Administrator|postgres|61dd747f.5e4|LOG: 
pgfdw_get_result before WaitLatchOrSocket
2022-01-11 15:13:52.875
MSK|Administrator|postgres|61dd747f.5e4|STATEMENT:  SELECT * FROM large
WHERE a = fx2(a)
2022-01-11 15:14:36.976 MSK|||61dd74ac.840|DEBUG:  autovacuum:
processing database "postgres"
2022-01-11 15:14:51.088 MSK|Administrator|postgres|61dd747f.5e4|LOG: 
pgfdw_get_result after WaitLatchOrSocket
2022-01-11 15:14:51.088
MSK|Administrator|postgres|61dd747f.5e4|STATEMENT:  SELECT * FROM large
WHERE a = fx2(a)
2022-01-11 15:14:51.089 MSK|Administrator|postgres|61dd747f.5e4|LOG: 
pgfdw_get_result before WaitLatchOrSocket
2022-01-11 15:14:51.089
MSK|Administrator|postgres|61dd747f.5e4|STATEMENT:  SELECT * FROM large
WHERE a = fx2(a)
2022-01-11 15:15:37.006 MSK|||61dd74e9.9e8|DEBUG:  autovacuum:
processing database "postgres"
2022-01-11 15:16:37.116 MSK|||61dd7525.ad0|DEBUG:  autovacuum:
processing database "postgres"
2022-01-11 15:17:37.225 MSK|||61dd7561.6a0|DEBUG:  autovacuum:
processing database "postgres"
2022-01-11 15:18:36.916 MSK|||61dd7470.704|LOG:  checkpoint starting: time
...
2022-01-11 15:36:38.225 MSK|||61dd79d6.2a0|DEBUG:  autovacuum:
processing database "postgres"
...

So here we get a similar hang on WaitLatchOrSocket().
Just to make sure that it's indeed the same issue, I removed the socket
shutdown and the test ran to completion (several times). Argh.

Best regards,
Alexander


001_disconnection.pl
Description: Perl program


Re: [Ext:] Re: Stream Replication not working

2022-01-11 Thread Allie Crawford
Satya,
I am new to PostgreSQL; I have no previous experience with it, and I need to get
this replication working. Looking at the data that pg_locks is showing, I do not
know how to interpret it.
I would really appreciate any help you can give me in resolving this issue.
Regards,
Allie

From: Sushant Postgres 
Date: Tuesday, January 11, 2022 at 12:49 AM
To: Allie Crawford 
Cc: SATYANARAYANA NARLAPURAM , 
pgsql-hackers@lists.postgresql.org 
Subject: Re: [Ext:] Re: Stream Replication not working
Hi All,

For us also, the logs are being applied on the slave server, but very slowly. While
checking, we have also seen the same set of locks on the master and slave servers.
Please suggest a solution for this.
Many Thanks in Advance !!
Thanks

On Tue, Jan 11, 2022 at 2:12 AM Allie Crawford 
mailto:crawfor...@churchofjesuschrist.org>> 
wrote:
Thank you so much for your help on this Satya. I have detailed right below the 
output of the query you asked me to run.

Master

postgresql@ ~>psql

psql (13.5)

Type "help" for help.



postgresql=# select * from pg_locks;

  locktype  | database | relation | page | tuple | virtualxid | transactionid | 
classid | objid | objsubid | virtualtransaction |   pid   |  mode   | 
granted | fastpath

+--+--+--+---++---+-+---+--++-+-+-+--

 relation   |16384 |12141 |  |   ||   | 
|   |  | 3/6715 | 2669949 | AccessShareLock | t 
  | t

 virtualxid |  |  |  |   | 3/6715 |   | 
|   |  | 3/6715 | 2669949 | ExclusiveLock   | t 
  | t

(2 rows)



postgresql=#


Standby
postgresql@ ~>psql
psql (13.5)
Type "help" for help.

postgresql=# select * from pg_locks;
  locktype  | database | relation | page | tuple | virtualxid | transactionid | 
classid | objid | objsubid | virtualtransaction |  pid   |  mode   | 
granted | fastpath
+--+--+--+---++---+-+---+--+++-+-+--
 relation   |16384 |12141 |  |   ||   | 
|   |  | 2/50   | 642064 | AccessShareLock | t  
 | t
 virtualxid |  |  |  |   | 2/50   |   | 
|   |  | 2/50   | 642064 | ExclusiveLock   | t  
 | t
 virtualxid |  |  |  |   | 1/1|   | 
|   |  | 1/0|  17333 | ExclusiveLock   | t  
 | t
(3 rows)

postgresql=#




From: SATYANARAYANA NARLAPURAM 
mailto:satyanarlapu...@gmail.com>>
Date: Monday, January 10, 2022 at 1:06 PM
To: Allie Crawford 
Cc: 
pgsql-hackers@lists.postgresql.org 
mailto:pgsql-hackers@lists.postgresql.org>>
Subject: [Ext:] Re: Stream Replication not working
[External Email]
It seems there is a problem with the replay on your standby. Either it is too slow
or it is stuck behind some locks (the replay_lag of 20:38:47.00904 indicates this,
and the flush_lsn is the same as the lsn on the primary). Run pg_locks to see if the
replay is stuck behind a lock.



On Mon, Jan 10, 2022 at 11:53 AM Allie Crawford 
mailto:crawfor...@churchofjesuschrist.org>> 
wrote:
Hi All,
I have implemented Stream replication in one of my environments, and for some 
reason even though all the health checks are showing that the replication is 
working, when I run manual tests to see if changes are being replicated, the 
changes are not replicated to the standby postgresql environment. I have been 
researching for two day and I cannot find any documentation that talks about 
the case I am running into. I will appreciate if anybody could take a look at 
the details I have detailed below and give me some guidance on where the 
problem might be that is preventing my changes for being replicated. Even 
though I was able to instantiate the standby while firewalld was enabled, I 
decided to disable it just in case that it was causing any issue to the manual 
changes, but disabling firewalld has not had any effect, I am still not able to 
get the manual changes test to be replicated to the standby site. As you will 
see in the details below, the streaming is working, both sites are in sync to 
the latest WAL but for some reasons the latest changes are not on the standby 
site. How is it possible that the standby site is completely in sync but yet 
does not contain the latest changes?

Thanks in advance for any help you can give me with this problem.

Regards,
Allie

Details:

Master postgresql Environment

postgresql=# select * from pg_stat_replication;

-[ RECORD 1 ]+--

pid  | 1979089

usesysid | 16404

usename   

Re: Logical replication timeout problem

2022-01-11 Thread Fabrice Chapuis
Can you explain why you think this will help in solving your current
problem?

Indeed you are right, this function won't help; we have to look elsewhere.

It is still not clear to me why the problem happened? IIUC, after restoring
4096 changes from snap files, we send them to the subscriber, and then
apply worker should apply those one by one. Now, is it taking one minute to
restore 4096 changes due to which apply worker is timed out?

Now I can easily reproduce the problem.
In a first phase, snap files are generated and stored in pg_replslot. This
process ends when 1420 files are present in pg_replslot (this is related to the
statements that must be replayed from WAL). In the pg_stat_replication
view, the state field is set to *catchup*.
In a second phase, the snap files must be decoded. However, after one minute
(the wal_receiver_timeout parameter is set to 1 minute) the worker process stops
with a timeout.

I can put a debug point to check whether a timeout is sent to the worker
process. Do you have any other clues?

Thank you for your help

Fabrice




On Fri, Jan 7, 2022 at 11:26 AM Amit Kapila  wrote:

> On Wed, Dec 29, 2021 at 5:02 PM Fabrice Chapuis 
> wrote:
>
>> I put the instance with high level debug mode.
>> I try to do some log interpretation: After having finished writing the
>> modifications generated by the insert in the snap files,
>> then these files are read (restored). One minute after this work starts,
>> the worker process exit with an error code = 1.
>> I see that keepalive messages were sent before the work process work
>> leave.
>>
>> 2021-12-28 10:50:01.894 CET [55792] LOCATION:  WalSndKeepalive,
>> walsender.c:3365
>> ...
>> 2021-12-28 10:50:31.854 CET [55792] STATEMENT:  START_REPLICATION SLOT
>> "sub008_s012a00" LOGICAL 17/27240748 (proto_version '1', publication_names
>> '"pub008_s012a00"')
>> 2021-12-28 10:50:31.907 CET [55792] DEBUG:  0: StartTransaction(1)
>> name: unnamed; blockState: DEFAULT; state: INPROGR, xid/subid/cid: 0/1/0
>> 2021-12-28 10:50:31.907 CET [55792] LOCATION:  ShowTransactionStateRec,
>> xact.c:5075
>> 2021-12-28 10:50:31.907 CET [55792] STATEMENT:  START_REPLICATION SLOT
>> "sub008_s012a00" LOGICAL 17/27240748 (proto_version '1', publication_names
>> '"pub008_s012a00"')
>> 2021-12-28 10:50:31.907 CET [55792] DEBUG:  0: spill 2271 changes in
>> XID 14312 to disk
>> 2021-12-28 10:50:31.907 CET [55792] LOCATION:  ReorderBufferSerializeTXN,
>> reorderbuffer.c:2245
>> 2021-12-28 10:50:31.907 CET [55792] STATEMENT:  START_REPLICATION SLOT
>> "sub008_s012a00" LOGICAL 17/27240748 (proto_version '1', publication_names
>> '"pub008_s012a00"')
>> *2021-12-28 10:50:32.110 CET [55792] DEBUG:  0: restored
>> 4096/22603999 changes from disk*
>> 2021-12-28 10:50:32.110 CET [55792] LOCATION:  ReorderBufferIterTXNNext,
>> reorderbuffer.c:1156
>> 2021-12-28 10:50:32.110 CET [55792] STATEMENT:  START_REPLICATION SLOT
>> "sub008_s012a00" LOGICAL 17/27240748 (proto_version '1', publication_names
>> '"pub008_s012a00"')
>> 2021-12-28 10:50:32.138 CET [55792] DEBUG:  0: restored 4096/22603999
>> changes from disk
>> ...
>>
>> *2021-12-28 10:50:35.341 CET [55794] DEBUG:  0: sending replication
>> keepalive2021-12-28 10:50:35.341 CET [55794] LOCATION:  WalSndKeepalive,
>> walsender.c:3365*
>> ...
>> *2021-12-28 10:51:31.995 CET [55791] ERROR:  XX000: terminating logical
>> replication worker due to timeout*
>>
>> *2021-12-28 10:51:31.995 CET [55791] LOCATION:  LogicalRepApplyLoop,
>> worker.c:1267*
>>
>>
> It is still not clear to me why the problem happened? IIUC, after
> restoring 4096 changes from snap files, we send them to the subscriber, and
> then apply worker should apply those one by one. Now, is it taking one
> minute to restore 4096 changes due to which apply worker is timed out?
>
> Could this function in* Apply main loop* in worker.c help to find a
>> solution?
>>
>> rc = WaitLatchOrSocket(MyLatch,
>> WL_SOCKET_READABLE | WL_LATCH_SET |
>> WL_TIMEOUT | WL_POSTMASTER_DEATH,
>> fd, wait_time,
>> WAIT_EVENT_LOGICAL_APPLY_MAIN);
>>
>
> Can you explain why you think this will help in solving your current
> problem?
>
> --
> With Regards,
> Amit Kapila.
>


Re: enhance pg_log_backend_memory_contexts() to log memory contexts of auxiliary processes

2022-01-11 Thread Fujii Masao




On 2022/01/08 1:50, Bharath Rupireddy wrote:

PSA v7 patch.


Thanks for updating the patch!
I applied some cosmetic changes and pushed the patch. Thanks!

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Boyer-More-Horspool searching LIKE queries

2022-01-11 Thread Atsushi Ogawa
I have created a patch to enable the Boyer-Moore-Horspool search
algorithm (B-M-H) for LIKE queries.

B-M-H needs to initialize the skip table and keep it during SQL execution.
In this patch, flinfo->fn_extra is used to keep the skip table.

The conditions under which B-M-H can be used are as follows.

(1) B-M-H in LIKE search supports only single-byte character sets and UTF-8.
Other multibyte character sets are not supported, because a character's byte
sequence may contain another character. UTF-8 works fine, because in UTF-8
the byte sequence of one character cannot contain another character.

(2) The pattern string should be a stable parameter, because B-M-H needs to
keep the skip table generated from the pattern string during the execution of
the query.

(3) The pattern string should be at least 4 characters.
For example, '%AB%' can use B-M-H.

(4) The first and last character of the pattern string should be '%'.

(5) Characters other than the first and last of the pattern string
should not be '%', '_'.  However, escaped characters such as
'\%', '\_' are available.
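
For readers who want to see the technique itself, here is a minimal,
self-contained sketch of the Horspool skip table and search loop. It is only an
illustration of the algorithm, not the patch's actual code; the patch builds the
table once and caches it in flinfo->fn_extra as described above.

#include <stddef.h>
#include <string.h>

#define BMH_ALPHABET 256

/*
 * Build the skip table: how far the search may shift on a mismatch,
 * keyed by the haystack byte aligned with the last byte of the needle.
 */
static void
bmh_build_skip(const unsigned char *needle, size_t nlen,
               size_t skip[BMH_ALPHABET])
{
    size_t      i;

    for (i = 0; i < BMH_ALPHABET; i++)
        skip[i] = nlen;         /* byte absent from needle: shift whole length */
    for (i = 0; i + 1 < nlen; i++)
        skip[needle[i]] = nlen - 1 - i;     /* rightmost occurrence wins */
}

/* Return a pointer to the first match of needle in haystack, or NULL. */
static const unsigned char *
bmh_search(const unsigned char *haystack, size_t hlen,
           const unsigned char *needle, size_t nlen,
           const size_t skip[BMH_ALPHABET])
{
    size_t      pos = 0;

    if (nlen == 0 || nlen > hlen)
        return NULL;
    while (pos + nlen <= hlen)
    {
        if (memcmp(haystack + pos, needle, nlen) == 0)
            return haystack + pos;
        /* shift by the skip value of the byte under the needle's last position */
        pos += skip[haystack[pos + nlen - 1]];
    }
    return NULL;
}

For a pattern such as '%greater%', the table is built once for "greater" and
reused for every row scanned, which is where the speedup measured below comes
from.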

Also, this patch changes the collation validity check in functions
(such as textlike) to be performed at the first execution of the query,
instead of each function execution.
I have measured the performance with the following query.

-- -- -- -- -- -- --
SET client_min_messages TO notice;

\timing

DO $$
DECLARE
  cnt integer := 0;
  total integer := 0;
BEGIN
  FOR i IN 1..500 LOOP
select count(*) into cnt
  from pg_catalog.pg_description d
  where d.description like '%greater%';

total := total + cnt;
  END LOOP;

  RAISE NOTICE 'TOTAL: %', total;
END
$$
;
-- -- -- -- -- -- --

Result
Without patch: 257.504ms
With patch: 191.638ms

Regards,
Atsushi Ogawa


like_bmh.patch
Description: Binary data


Re: [Ext:] Re: Stream Replication not working

2022-01-11 Thread Amit Kapila
On Tue, Jan 11, 2022 at 2:12 AM Allie Crawford
 wrote:
>
> Thank you so much for your help on this Satya. I have detailed right below 
> the output of the query you asked me to run.
>
>
>
> Master
>
> postgresql@ ~>psql
>
> psql (13.5)
>
> Type "help" for help.
>
>
>
> postgresql=# select * from pg_locks;
>
>   locktype  | database | relation | page | tuple | virtualxid | transactionid 
> | classid | objid | objsubid | virtualtransaction |   pid   |  mode   
> | granted | fastpath
>
> +--+--+--+---++---+-+---+--++-+-+-+--
>
>  relation   |16384 |12141 |  |   ||   
> | |   |  | 3/6715 | 2669949 | AccessShareLock 
> | t   | t
>
>  virtualxid |  |  |  |   | 3/6715 |   
> | |   |  | 3/6715 | 2669949 | ExclusiveLock   
> | t   | t
>
> (2 rows)
>
>
>
> postgresql=#
>
>
>
>
>
> Standby
>
> postgresql@ ~>psql
>
> psql (13.5)
>
> Type "help" for help.
>
>
>
> postgresql=# select * from pg_locks;
>
>   locktype  | database | relation | page | tuple | virtualxid | transactionid 
> | classid | objid | objsubid | virtualtransaction |  pid   |  mode   
> | granted | fastpath
>
> +--+--+--+---++---+-+---+--+++-+-+--
>
>  relation   |16384 |12141 |  |   ||   
> | |   |  | 2/50   | 642064 | AccessShareLock 
> | t   | t
>
>  virtualxid |  |  |  |   | 2/50   |   
> | |   |  | 2/50   | 642064 | ExclusiveLock   
> | t   | t
>
>  virtualxid |  |  |  |   | 1/1|   
> | |   |  | 1/0|  17333 | ExclusiveLock   
> | t   | t
>
> (3 rows)
>

It seems both master and standby have an exclusive lock on db:16384
and relation:12141. Which is this database/relation and why is the
app/database holding a lock on it?


-- 
With Regards,
Amit Kapila.




Re: Time to drop plpython2?

2022-01-11 Thread Peter Eisentraut

On 15.11.21 19:52, Peter Eisentraut wrote:
I think we should just write to the build farm owners, we plan to drop 
python2 support in, say, 60 days, please update your setup to use 
python3 or disable python support.


This discussion stalled.  I think we should do *something* for 
PostgreSQL 15.


I suspect at this point, the Meson move isn't going to happen for 
PostgreSQL 15.  Even if the code were ready, which it is not, then this 
kind of thing would surely be something more sensible to put into PG16 
early rather than PG15 late.  So I'm setting aside any concerns about 
which Python 3.x is the appropriate minimum level.


There is the discussion [0], which involves raising the minimum Python 
version from 2.6 to 2.7, which would affect some build farm members. 
But if we're going to drop 2.x anyway for PG15, say, then we don't need 
to spend time organizing a 2.6 to 2.7 transition.  (Although 
backpatching might be required.)


Also, I was recently doing some bug fixing work in PL/Python, and just 
getting a Python 2 installation is not straightforward these days on the 
sort of OS that might be on a development machine.  So there is also 
that driver for dropping Python 2.


I think what I wrote in the message I'm responding to is the best way 
forward, although maybe with less than 60 days now.


Thoughts?


[0]: 
https://www.postgresql.org/message-id/flat/c74add3c-09c4-a9dd-1a03-a846e5b2fc52%40enterprisedb.com





[PATCH]Add tab completion for foreigh table

2022-01-11 Thread tanghy.f...@fujitsu.com
Hi

Attached is a patch to improve the tab completion for foreign tables.

Also modified some doc descriptions of ALTER TABLE at [1] in accordance with
CREATE INDEX at [2].

In [1], we use "ALTER INDEX ATTACH PARTITION"
In [2], we use "ALTER INDEX ... ATTACH PARTITION"

I think the format in [2] is better.

[1] https://www.postgresql.org/docs/devel/sql-altertable.html
[2] https://www.postgresql.org/docs/devel/sql-createindex.html

Regards,
Tang


v1-0001-add-tab-completion-for-PARTITION-OF-when-creating.patch
Description:  v1-0001-add-tab-completion-for-PARTITION-OF-when-creating.patch


Re: ICU for global collation

2022-01-11 Thread Julien Rouhaud
On Tue, Jan 11, 2022 at 12:36:46PM +0100, Daniel Verite wrote:
> 
> If CREATE DATABASE referred to a collation in the template db,
> either that collation already exists, or the user would have to add it
> to the template db with CREATE COLLATION.
> initdb already populates the template databases with a full set of
> ICU collations through pg_import_system_collations().
> I don't quite see what change you're seeing that would be needed in
> initdb.

Yes, the system collations are already imported.  But you still need to
make sure that that collation does exist in the template database, and it's
still impossible to connect to that database to check when processing the
CREATE DATABASE.  Also, if the wanted collation wasn't imported by
pg_import_system_collations() and isn't the server's default collation, then
the user would have to allow connection on the template0 database and create
the wanted collation there.  It doesn't seem like something that should be
recommended for a reasonably standard use case.




Re: Query regarding replication slots

2022-01-11 Thread Julien Rouhaud
On Tue, Jan 11, 2022 at 04:48:59PM +0530, RKN Sai Krishna wrote:
> Hi All,
> 
> I have a very basic question related to replication slots. Why should
> the master/primary server maintain the replication slot info like lsn
> corresponding to each standby server etc. Instead, why can't each
> standby server send the lsn that it needs, and master/primary server
> maintain the minimum lsn across all of the standby servers so that the
> information could be used for cleanup/removal of WAL segments?

Because the information is needed even if the standby servers are not
available, and if you only have a global xmin then you can't do much when
removing one of the slots.




Re: ICU for global collation

2022-01-11 Thread Daniel Verite
Julien Rouhaud wrote:

> > I guess there's still the possibility of requiring that the ICU db-wide
> > collation of the new database does exist in the template database,
> > and then the CREATE DATABASE would refer to that collation instead of
> > an independent locale string.
> 
> That could work.  However if having the collation in the template database a
> strict requirement the something should also be done for initdb, and it will
> probably be a bigger problem.

If CREATE DATABASE referred to a collation in the template db,
either that collation already exists, or the user would have to add it
to the template db with CREATE COLLATION.
initdb already populates the template databases with a full set of
ICU collations through pg_import_system_collations().
I don't quite see what change you're seeing that would be needed in
initdb.


Best regards,
-- 
Daniel Vérité
PostgreSQL-powered mailer: https://www.manitou-mail.org
Twitter: @DanielVerite




Re: [PATCH] Allow multiple recursive self-references

2022-01-11 Thread Denis Hirn
> I have been studying this a bit more.  I don't understand your argument here.
> Why would this query have different semantics than, say
>
> WITH RECURSIVE t(n) AS (
> VALUES (1)
>   UNION ALL
> VALUES (2)
>   UNION ALL
> SELECT n+1 FROM t WHERE n < 100
> ) SELECT * FROM t LIMIT 100;
>
> The order of UNION branches shouldn't be semantically relevant.

WITH RECURSIVE (ab)uses the (typically associative and commutative) UNION [ALL]
clause, and fundamentally changes the semantics – associativity and commutativity
no longer apply. I think your confusion stems from this ambiguity.

Let me briefly reiterate WITH RECURSIVE. Basically, you always have a query 
like this:

> WITH RECURSIVE w(c1,...) AS (
>   
> UNION [ALL]
>   
> ) q;

There must be a non-recursive part that does not refer to w, and -- without
the patch -- exactly one recursive part that refers to w. The non-recursive part
and the recursive part are combined using UNION [ALL].

However, let's assume a different, unambiguous syntax just to make my point:

> WITH RECURSIVE w(c1,...) AS (
>   
> RUNION [ALL]
>   
> ) q;

Everything remains the same except that the non-recursive part and the recursive
part are combined using RUNION [ALL] instead of UNION [ALL].

Now let me rephrase your examples using this syntax:

> WITH RECURSIVE t(n) AS (
>(VALUES (1) UNION ALL VALUES (2))
>  RUNION ALL
>SELECT n+1 FROM t WHERE n < 100
> )
> SELECT sum(n) FROM t;

> WITH RECURSIVE t(n) AS (
>VALUES (1) RUNION ALL (VALUES (2)
>  UNION ALL
>SELECT n+1 FROM t WHERE n < 100)
> )
> SELECT sum(n) FROM t;

I hope this shows that this is not the same. The first query has two
non-recursive cases and one recursive case. The second query has two recursive
cases instead.

The rewrites of those examples:

>          RUNION                             RUNION
>         /      \           good            /      \
>  VALUES(1)     UNION        -->      VALUES(1)     UNION
>               /     \                             /     \
>    SELECT n+1...   VALUES(2)              VALUES(2)   SELECT n+1...

This rewrite would be valid. The patch would not do that, however.

>          RUNION                             RUNION
>         /      \           bad             /      \
>  VALUES(1)     UNION        -->         UNION     SELECT n+1...
>               /     \                  /     \
>    SELECT n+1...   VALUES(2)      VALUES(1)   VALUES(2)

This is the rewrite you would expect. But it is not valid, because it changes 
semantics.
Therefore the patch does not do that.

> I'm having difficulties understanding which subset of cases your patch wants 
> to address.

With this patch an extended WITH RECURSIVE syntax is enabled:

> WITH RECURSIVE w(c1,...) AS (
>   
> UNION [ALL]
>   
> ...
> UNION [ALL]
>   
> ) q;

But really, it is:

> WITH RECURSIVE w(c1,...) AS (
>   
> ...
>   
> UNION [ALL]
>   
> ...
> UNION [ALL]
>   
> ) q;

We can have arbitrarily many non-recursive branches (that is possible without
the patch), as well as arbitrarily many recursive UNION [ALL] branches. Problem
here: UNION [ALL] associates to the left. This means that we end up with a
left-deep parse tree that looks something like this:

>          RUNION
>         /      \
>       ...       m
>       /
>    UNION
>    /   \
>   n    n+1
>   /
>  ...
>  /
> UNION
> /   \
> 1    2

That is not correct, because all non-recursive branches must be part of the left
RUNION branch, and all recursive branches must be part of the right one.
Postgres performs this check in parse_cte.c using the checkWellFormedRecursion()
function.

Having said the above, let me once again define the rewrite case the patch 
implements:

>          RUNION                             RUNION
>         /      \         rotate            /      \
>       ...      n+m        --->         UNION      UNION
>       /                               /     \    /     \
>    UNION                            ...      n  n+1    UNION
>    /   \                            /                  /   \
>   n    n+1                       UNION               ...    m
>   /                              /   \
>  ...                            1     2
>  /
> UNION
> /   \
> 1    2

By using tree rotation, the patch transforms the parse tree on the left
into the one on the right. All non-recursive terms 1..n are found in the
left branch of RUNION, all recursive terms n+1..m in the right branch.
This tree now passes the checkWellFormedRecursion() check.
I hope this clarifies things.
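
To make the target shape concrete, here is a small illustrative sketch of the
regrouping described above. It deliberately uses a toy Node type rather than
PostgreSQL's parse-node structures, and it is not the patch's code; it only
shows how terms 1..n and n+1..m end up grouped under the two RUNION branches
(assuming at least one term on each side).

#include <stdlib.h>

typedef struct Node
{
    const char *label;          /* "UNION", "RUNION", or a leaf term */
    struct Node *left;
    struct Node *right;
} Node;

static Node *
make_node(const char *label, Node *left, Node *right)
{
    Node       *node = malloc(sizeof(Node));

    node->label = label;
    node->left = left;
    node->right = right;
    return node;
}

/* Fold terms[from..to] into a left-deep chain of UNION nodes. */
static Node *
fold_union(Node **terms, int from, int to)
{
    Node       *result = terms[from];

    for (int i = from + 1; i <= to; i++)
        result = make_node("UNION", result, terms[i]);
    return result;
}

/*
 * terms[0..n-1] are the non-recursive branches, terms[n..m-1] the recursive
 * ones.  The result has the shape the well-formedness check expects:
 * RUNION(UNION(non-recursive terms), UNION(recursive terms)).
 */
static Node *
rebuild_recursive_union(Node **terms, int n, int m)
{
    return make_node("RUNION",
                     fold_union(terms, 0, n - 1),
                     fold_union(terms, n, m - 1));
}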

> The order of UNION branches shouldn't be semantically relevant.

I agree. However, a strict distinction must be made between RUNION and UNION 
clauses.
These must not be interchanged.


> I wonder why in ExecWorkTableScan() and ExecReScanWorkTableScan() you call

Re: a misbehavior of partition row movement (?)

2022-01-11 Thread Alvaro Herrera
On 2022-Jan-11, Amit Langote wrote:

> As for the fix to make cross-partition updates work correctly with
> foreign keys, I just realized it won't work for the users' existing
> foreign keys, because the parent table's triggers that are needed for
> the fix to work would not be present.  Were you thinking that we'd ask
> users of v13 and v14 to drop and recreate those constraints?

Yeah, more or less.  Also, any tables created from 13.6 onwards.

I was mainly thinking that we'll still have people creating new clusters
using pg13 for half a decade.

-- 
Álvaro Herrera  Valdivia, Chile  —  https://www.EnterpriseDB.com/
"Right now the sectors on the hard disk run clockwise, but I heard a rumor that
you can squeeze 0.2% more throughput by running them counterclockwise.
It's worth the effort. Recommended."  (Gerry Pourwelle)




Query regarding replication slots

2022-01-11 Thread RKN Sai Krishna
Hi All,

I have a very basic question related to replication slots. Why should
the master/primary server maintain the replication slot info like lsn
corresponding to each standby server etc. Instead, why can't each
standby server send the lsn that it needs, and master/primary server
maintain the minimum lsn across all of the standby servers so that the
information could be used for cleanup/removal of WAL segments?

The minimum lsn could as well be streamed to all of the standby
servers while streaming the WAL records, so that the cleanup on the
standby server as well happens as per the minimum lsn. Also, even if
the primary server crashes, any standby server becoming the master is
well aware of the minimum lsn and the WAL records required for all of
the remaining standby servers are intact.

Thanks,
RKN




Re: ICU for global collation

2022-01-11 Thread Peter Eisentraut

On 10.01.22 12:49, Daniel Verite wrote:

With the current patch, it's not possible, AFAICS, because the user
can't tell that the collation is non-deterministic. Presumably this
would require another option to CREATE DATABASE and another
column to store that bit of information.


Adding this would be easy, but since pattern matching currently does not 
support nondeterministic collations, if you make a global collation 
nondeterministic, a lot of system views, psql, pg_dump queries etc. 
would break, so it's not practical.  I view this as an orthogonal 
project.  Once we can support this without breaking system views etc., 
then it's easy to enable with a new column in pg_database.



The "daticucol" column also suggests that we don't expect to add
other collation providers in the future. Maybe a pair of columns like
(datcollprovider, datcolllocale) would be more future-proof,
or a (datcollprovider, datcolllocale, datcollisdeterministic)
triplet if non-deterministic collations are allowed.


I don't expect many new collation providers.  So I don't think an 
EAV-like storage would be helpful.  The other problem is that we don't 
know what we need.  For example, the libc provider needs both a collate 
and a ctype value, so that wouldn't fit into that scheme nicely.



Also, pg_collation has "collversion" to detect a mismatch between
the ICU runtime and existing indexes. I don't see that field
for the db-wide ICU collation, so maybe we currently miss the capability
to detect the mismatch for the db-wide collation?


Yeah, I think I need to add a datcollversion field and the associated 
checks.





Re: ICU for global collation

2022-01-11 Thread Peter Eisentraut



On 07.01.22 10:03, Julien Rouhaud wrote:

I changed the datcollate, datctype, and the new daticucoll fields to type
text (from name).  That way, the daticucoll field can be set to null if it's
not applicable.  Also, the limit of 63 characters can actually be a problem
if you want to use some combination of the options that ICU locales offer.
And for less extreme uses, having variable-length fields will save some
storage, since typical locale names are much shorter.


I understand the need to have daticucoll as text, however it's not clear to me
why this has to be changed for datcollate and datctype?  IIUC those will only
ever contain libc-based collation and are still mandatory?


Right.  I just did this for consistency.  It would be strange otherwise 
to have some fields as name and some as text.  Arguably, using "name" 
here was wrong to begin with, since they are not really object names. 
Maybe there used to be a reason to avoid variable-length fields in 
pg_database, but there isn't one now AFAICT.



- pg_upgrade

It checks (in check_locale_and_encoding()) the compatibility for each database,
and it looks like the daticucoll field should also be verified.  Other than
that I don't think there is anything else needed for the pg_upgrade part as
everything else should be handled by pg_dump (I didn't look at the changes yet
given the below problems).


Ok, I have added this and will include it in my next patch submission.


- CREATE DATABASE



- initdb


Ok, some work is needed to make these interfaces behave sensibly in 
various combinations.  I will look into that.





Re: ICU for global collation

2022-01-11 Thread Julien Rouhaud
On Tue, Jan 11, 2022 at 10:10:25AM +0100, Peter Eisentraut wrote:
> 
> On 10.01.22 07:00, Julien Rouhaud wrote:
> > 
> > So I tried to run Noah's benchmark to see if I could reproduce the slowdown.
> > Unfortunately the results I'm getting don't really make sense as removing 
> > the
> > optimisation brings a 15% speedup, and with a few more runs I can see that I
> > have about 25% noise, so there isn't much I can do to help.
> 
> Heh, I had that same experience, it actually got faster without the
> optimization, but then got lost in the noise on further testing.

Ah, so it's not just my machine :)

> Looking back at those discussions, I don't think those old test results are
> relevant anymore.  In the patch that was being tested there,
> pg_newlocale_from_collation(), did not contain
> 
> if (collid == DEFAULT_COLLATION_OID)
> return (pg_locale_t) 0;
> 
> so the default collation actually went through most or all of the function
> and did a lot of work.  That would understandably be quite slow.  But just
> calling a function and returning immediately should not be a problem.
> Otherwise, the call to check_collation_set() in varstr_cmp() and elsewhere
> would be just as bad.

I didn't notice that.  That definitely explains why the performance concern
isn't valid anymore.

> So, unless there are concerns, I'm going to see about making a patch to call
> pg_newlocale_from_collation() even with the default collation. That would
> make the actual feature patch quite a bit smaller, since we won't have to
> patch every call site of pg_newlocale_from_collation().

+1 for me!




Re: Skipping logical replication transactions on subscriber side

2022-01-11 Thread Amit Kapila
On Tue, Jan 11, 2022 at 8:52 AM Masahiko Sawada  wrote:
>
> On Mon, Jan 10, 2022 at 8:50 PM Amit Kapila  wrote:
> >
> >
> > Few other comments on the latest patch:
> > =
> > 1.
> > A conflict will produce an error and will stop the replication; it must be
> > resolved manually by the user.  Details about the conflict can be found 
> > in
> > -   the subscriber's server log.
> > +as well as the
> > +   subscriber's server log.
> >
> > Can we slightly change the modified line to: "Details about the
> > conflict can be found in  > linkend="monitoring-pg-stat-subscription-workers"/> and the
> > subscriber's server log."?
>
> Will fix it.
>
> >  I think we can commit this change
> > separately as this is true even without this patch.
>
> Right. It seems an oversight of 8d74fc96db. I've attached the patch.
>

Pushed.

-- 
With Regards,
Amit Kapila.




Re: Skipping logical replication transactions on subscriber side

2022-01-11 Thread Amit Kapila
On Tue, Jan 11, 2022 at 1:51 PM Masahiko Sawada  wrote:
>
> On Tue, Jan 11, 2022 at 3:12 PM Amit Kapila  wrote:
> >
> > On Tue, Jan 11, 2022 at 8:52 AM Masahiko Sawada  
> > wrote:
> > >
> > > On Mon, Jan 10, 2022 at 8:50 PM Amit Kapila  
> > > wrote:
> > > >
> > > > I was thinking what if we don't advance origin explicitly in this
> > > > case? Actually, that will be no different than the transactions where
> > > > the apply worker doesn't apply any change because the initial sync is
> > > > in progress (see should_apply_changes_for_rel()) or we have received
> > > > an empty transaction. In those cases also, the origin lsn won't be
> > > > advanced even though we acknowledge the advanced last_received
> > > > location because of keep_alive messages. Now, it is possible after the
> > > > restart we send the old start_lsn location because the replication
> > > > origin was not updated before restart but we handle that case in the
> > > > server by starting from the last confirmed location. See below code:
> > > >
> > > > CreateDecodingContext()
> > > > {
> > > > ..
> > > > else if (start_lsn < slot->data.confirmed_flush)
> > > > ..
> > >
> > > Good point. Probably one minor thing that is different from the
> > > transaction where the apply worker applied an empty transaction is a
> > > case where the server restarts/crashes before sending an
> > > acknowledgment of the flush location. That is, in the case of the
> > > empty transaction, the publisher sends an empty transaction again. On
> > > the other hand in the case of skipping the transaction, a non-empty
> > > transaction will be sent again but skip_xid is already changed or
> > > cleared, therefore the user will have to specify skip_xid again. If we
> > > write replication origin WAL record to advance the origin lsn, it
> > > reduces the possibility of that. But I think it’s a very minor case so
> > > we won’t need to deal with that.
> > >
> >
> > Yeah, in the worst case, it will lead to conflict again and the user
> > needs to set the xid again.
>
> On second thought, the same is true for other cases, for example,
> preparing the transaction and clearing skip_xid while handling a
> prepare message. That is, currently we don't clear skip_xid while
> handling a prepare message but do that while handling commit/rollback
> prepared message, in order to avoid the worst case. If we do both
> while handling a prepare message and the server crashes between them,
> it ends up that skip_xid is cleared and the transaction will be
> resent, which is identical to the worst-case above.
>

How are you thinking of updating the skip xid before prepare? If we do
it in the same transaction then the changes in the catalog will be
part of the prepared xact but won't be committed. Now, say if we do it
after prepare, then the situation won't be the same because after
restart the same xact won't appear again.

-- 
With Regards,
Amit Kapila.




Re: Proposal: More structured logging

2022-01-11 Thread Ronan Dunklau
On Wednesday, 29 December 2021 at 14:59:16 CET, Justin Pryzby wrote:
> > Subject: [PATCH v3 2/3] Add test module for the new tag functionality.
> 
> ...
> 
> > +test_logging(PG_FUNCTION_ARGS)
> > +{
> 
> ...
> 
> > +(errmsg("%s", message),
> > + ({
> > +   forboth(lk, keys, lv, values)
> > +   {
> > +   (errtag(lfirst(lk), "%s", (char *) 
lfirst(lv)));
> > +   }})
> > +   ));
> 
> The windows build fails with that syntax.
> http://cfbot.cputube.org/ronan-dunklau.html
> https://ci.appveyor.com/project/postgresql-cfbot/postgresql/build/1.0.157923

Thank you. I switched to an explicit sequence of errstart / errfinish to avoid
putting too many things in nested macro calls. As I don't have any Windows
knowledge, I am very grateful for the new cirrus-ci integration, which allowed
me to build on Windows without hassle.
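
For what it's worth, a rough sketch of that explicit sequence (mirroring what
the ereport() macro expands to in current sources) could look like the
following. This is only an illustration, not the exact code from the v4 patch;
errtag() is the function added by this series, the NOTICE level is arbitrary,
and message/keys/values are the variables of the test function quoted above.

if (errstart(NOTICE, TEXTDOMAIN))
{
    ListCell   *lk,
               *lv;

    errmsg("%s", message);
    /* attach one tag per key/value pair, outside any macro argument */
    forboth(lk, keys, lv, values)
        errtag(lfirst(lk), "%s", (char *) lfirst(lv));
    errfinish(__FILE__, __LINE__, PG_FUNCNAME_MACRO);
}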

> > Subject: [PATCH v3 3/3] Output error tags in CSV logs
> > +++ b/doc/src/sgml/config.sgml
> > @@ -7370,6 +7371,7 @@ CREATE TABLE postgres_log
> > 
> >backend_type text,
> >leader_pid integer,
> >query_id bigint,
> > 
> > +  tags jsonb
> > 
> >PRIMARY KEY (session_id, session_line_num)
> >  
> >  );
> >  
> 
> That's invalid sql due to missing a trailing ",".

Thanks, fixed.

> 
> You should also update file-fdw.sgml - which I only think of since we forgot
> to update it before e568ed0eb and 0830d21f5.  config.sgml should have a
> comment as a reminder to do that.

Done, and I added another commit per your suggestion to add this comment.

Thank you for this review.

Regards,

-- 
Ronan Dunklau>From 487c34465b34ef6ef8cb466247cbaaa8cf4bc534 Mon Sep 17 00:00:00 2001
From: Ronan Dunklau 
Date: Tue, 11 Jan 2022 10:16:53 +0100
Subject: [PATCH v4 4/4] Add comment in config.sgml as a reminder to also
 update file_fdw.sgml

---
 doc/src/sgml/config.sgml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 4df79fcbcc..d4c20f40ac 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -7375,7 +7375,7 @@ log_line_prefix = '%m [%p] %q%u@%d/%a '
 query id,
 and optional tags added by the logger as JSON.
 Here is a sample table definition for storing CSV-format log output:
-
+
 
 CREATE TABLE postgres_log
 (
-- 
2.34.1

>From a1c4ae874bf156a05da436838d6b2b73f6621905 Mon Sep 17 00:00:00 2001
From: Ronan Dunklau 
Date: Fri, 13 Aug 2021 15:03:32 +0200
Subject: [PATCH v4 3/4] Output error tags in CSV logs

---
 doc/src/sgml/config.sgml   |  4 +++-
 doc/src/sgml/file-fdw.sgml |  1 +
 src/backend/utils/error/elog.c | 25 -
 3 files changed, 28 insertions(+), 2 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index afbb6c35e3..4df79fcbcc 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -7372,7 +7372,8 @@ log_line_prefix = '%m [%p] %q%u@%d/%a '
 location of the error in the PostgreSQL source code
 (if log_error_verbosity is set to verbose),
 application name, backend type, process ID of parallel group leader,
-and query id.
+query id,
+and optional tags added by the logger as JSON.
 Here is a sample table definition for storing CSV-format log output:
 
 
@@ -7404,6 +7405,7 @@ CREATE TABLE postgres_log
   backend_type text,
   leader_pid integer,
   query_id bigint,
+  tags jsonb,
   PRIMARY KEY (session_id, session_line_num)
 );
 
diff --git a/doc/src/sgml/file-fdw.sgml b/doc/src/sgml/file-fdw.sgml
index 5b98782064..ccb4e9d8dd 100644
--- a/doc/src/sgml/file-fdw.sgml
+++ b/doc/src/sgml/file-fdw.sgml
@@ -268,6 +268,7 @@ CREATE FOREIGN TABLE pglog (
   backend_type text,
   leader_pid integer,
   query_id bigint
+  tags jsonb,
 ) SERVER pglog
 OPTIONS ( filename 'log/pglog.csv', format 'csv' );
 
diff --git a/src/backend/utils/error/elog.c b/src/backend/utils/error/elog.c
index d43e1c2c31..1e6c7222c3 100644
--- a/src/backend/utils/error/elog.c
+++ b/src/backend/utils/error/elog.c
@@ -80,6 +80,7 @@
 #include "storage/proc.h"
 #include "tcop/tcopprot.h"
 #include "utils/guc.h"
+#include "utils/json.h"
 #include "utils/memutils.h"
 #include "utils/ps_status.h"
 
@@ -3013,10 +3014,32 @@ write_csvlog(ErrorData *edata)
			appendStringInfo(&buf, "%d", leader->pid);
	}
	appendStringInfoChar(&buf, ',');
-
	/* query id */
	appendStringInfo(&buf, "%lld", (long long) pgstat_get_my_query_id());
+	appendStringInfoChar(&buf, ',');
+	if (edata->tags != NIL)
+	{
+		StringInfoData tagbuf;
+		ListCell   *lc;
+		bool		first = true;

+		initStringInfo(&tagbuf);
+		appendStringInfoChar(&tagbuf, '{');
+		foreach(lc, edata->tags)
+		{
+			ErrorTag   *etag = lfirst(lc);

+			if (!first)
+				appendStringInfoChar(&tagbuf, ',');
+			escape_json(&tagbuf, etag->tagname);
+			appendStringInfoChar(&tagbuf, ':');
+			escape_json(&tagbuf, etag->tagvalue);
+			first = false;
+		}
+		appendStringInfoChar(&tagbuf, '}');
+		appendCSVLiteral(&buf, tagbuf.data);
+		pfree(tagbuf.data);
+	}
 	
