Re: Refactor backup related code (was: Is it correct to say, "invalid data in file \"%s\"", BACKUP_LABEL_FILE in do_pg_backup_stop?)

2022-09-20 Thread Michael Paquier
On Wed, Sep 21, 2022 at 03:45:49PM +0900, Fujii Masao wrote:
> Instead of making allocate_backup_state() or reset_backup_state()
> store the label name in BackupState before do_pg_backup_start(),
> how about making do_pg_backup_start() do that after checking its length?
> Seems this can simplify the code very much.
> 
> If so, ISTM that we can replace allocate_backup_state() and
> reset_backup_state() with just palloc0() and MemSet(), respectively.
> Also we can remove fill_backup_label_name().

Yep, agreed.  Having all these routines feels a bit overengineered.
--
Michael




Re: Refactor backup related code (was: Is it correct to say, "invalid data in file \"%s\"", BACKUP_LABEL_FILE in do_pg_backup_stop?)

2022-09-20 Thread Michael Paquier
On Tue, Sep 20, 2022 at 05:13:49PM +0530, Bharath Rupireddy wrote:
> On Tue, Sep 20, 2022 at 7:20 AM Michael Paquier  wrote:
> I've separated out these variables from the backup state, please see
> the attached v9 patch.

Thanks, the separation looks cleaner.

>>> I've also taken help of the error callback mechanism to clean up the
>>> allocated memory in case of a failure. For do_pg_abort_backup() cases,
>>> I think we don't need to clean the memory because that gets called on
>>> proc exit (before_shmem_exit()).
>>
>> Memory could still bloat while the process running the SQL functions
>> is running depending on the error code path, anyway.
> 
> I didn't get your point. Can you please elaborate it? I think adding
> error callbacks at the right places would free up the memory for us.
> Please note that we already are causing memory leaks on HEAD today.

I mean that HEAD makes no effort in freeing this memory in
TopMemoryContext on session ERROR.

> I addressed the above review comments. I also changed a wrong comment
> [1] that lies before pg_backup_start() even after the removal of
> exclusive backup.
> 
> I'm attaching v9 patch set herewith, 0001 - refactors the backup code
> with backup state, 0002 - adds error callbacks to clean up the memory
> allocated for backup variables. Please review them further.

I have a few comments on 0001.

+#include 
Double quotes wanted here.

deallocate_backup_variables() is the only part of xlogbackup.c that
includes references to the tablespace map and backup_label
StringInfos.  I would be tempted to fully decouple that from
xlogbackup.c/h for the time being.

-   tblspc_map_file = makeStringInfo();
Not sure that there is a need for a rename here.

+void
+build_backup_content(BackupState *state, bool ishistoryfile, StringInfo str)
+{
Would it be more natural to have build_backup_content() do the
initialization of the StringInfo for the contents of backup_label by
itself and return it as the result of the routine?  This is short-lived in
xlogfuncs.c when the backup ends.
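For illustration, the shape I have in mind would be something like this (a
rough sketch only, not code from the patch):

```c
/*
 * Sketch: the routine allocates and returns the StringInfo itself
 * instead of filling in one passed by the caller.
 */
StringInfo
build_backup_content(BackupState *state, bool ishistoryfile)
{
	StringInfo	result = makeStringInfo();

	/* ... append the backup_label / backup history file lines from *state ... */

	return result;
}
```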

@@ -248,18 +250,25 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
[...]
+   /* Construct backup_label contents. */
+   build_backup_content(backup_state, false, backup_label);

Actually, for base backups, perhaps it would be more intuitive to
build and free the StringInfo of the backup_label when we send it for
base.tar rather than initializing it at the beginning and freeing it
at the end?

- * pg_backup_start: set up for taking an on-line backup dump
+ * pg_backup_start: start taking an on-line backup.
  *
- * Essentially what this does is to create a backup label file in $PGDATA,
- * where it will be archived as part of the backup dump.  The label file
- * contains the user-supplied label string (typically this would be used
- * to tell where the backup dump will be stored) and the starting time and
- * starting WAL location for the dump.
+ * This function starts the backup and creates tablespace_map contents.

The last part of the comment is still correct while the former is not,
so this loses some information.
--
Michael




Assign TupleTableSlot->tts_tableOid duplicated in table AM.

2022-09-20 Thread Wenchao Zhang
Hi hackers,

Recently, we discovered that the tts_tableOid field of TupleTableSlot is
assigned redundantly in the table AM interface, which is not necessary. For
example, in table_scan_getnextslot,

```
static inline bool
table_scan_getnextslot(TableScanDesc sscan, ScanDirection direction,
					   TupleTableSlot *slot)
{
	slot->tts_tableOid = RelationGetRelid(sscan->rs_rd);

	/*
	 * We don't expect direct calls to table_scan_getnextslot with valid
	 * CheckXidAlive for catalog or regular tables.  See detailed comments in
	 * xact.c where these variables are declared.
	 */
	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
		elog(ERROR, "unexpected table_scan_getnextslot call during logical decoding");

	return sscan->rs_rd->rd_tableam->scan_getnextslot(sscan, direction, slot);
}
```

We can see that it assigns tts_tableOid, while the call to
sscan->rs_rd->rd_tableam->scan_getnextslot, which is implemented by
heap_getnextslot, also assigns tts_tableOid in the call to
ExecStoreBufferHeapTuple.

```
TupleTableSlot *
ExecStoreBufferHeapTuple(HeapTuple tuple,
						 TupleTableSlot *slot,
						 Buffer buffer)
{
	/*
	 * sanity checks
	 */
	Assert(tuple != NULL);
	Assert(slot != NULL);
	Assert(slot->tts_tupleDescriptor != NULL);
	Assert(BufferIsValid(buffer));

	if (unlikely(!TTS_IS_BUFFERTUPLE(slot)))
		elog(ERROR, "trying to store an on-disk heap tuple into wrong type of slot");
	tts_buffer_heap_store_tuple(slot, tuple, buffer, false);

	slot->tts_tableOid = tuple->t_tableOid;

	return slot;
}
```

By reading the code, we can see that the two assigned values are the same.
Maybe we should remove one of them?

What's more, we think that assigning slot->tts_tableOid only in the outer
interface, such as table_scan_getnextslot, may be better and more friendly
when importing other pluggable storage formats. It would avoid duplicated
assignments in every implementation of the table AM interfaces.
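For example, a new table AM's callback could then look roughly like this
(MyScanDesc and my_fetch_next_tuple are made-up names, just for illustration):

```c
/*
 * Hypothetical scan_getnextslot callback for a new table AM: it relies on
 * table_scan_getnextslot() having already set slot->tts_tableOid.
 */
static bool
my_scan_getnextslot(TableScanDesc sscan, ScanDirection direction,
					TupleTableSlot *slot)
{
	MyScanDesc *scan = (MyScanDesc *) sscan;

	/* fetch the next tuple in the AM's own format (made-up helper) */
	if (!my_fetch_next_tuple(scan, direction, slot))
	{
		ExecClearTuple(slot);
		return false;			/* end of scan */
	}

	/* no slot->tts_tableOid assignment needed here: the wrapper did it */
	return true;
}
```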

Regards,
Wenchao


Re: ICU for global collation

2022-09-20 Thread Marina Polyakova

On 2022-09-20 12:59, Peter Eisentraut wrote:

On 17.09.22 10:33, Marina Polyakova wrote:

3.

The locale provider is ICU, but it has not yet been set from the 
template database:



$ initdb --locale-provider icu --icu-locale en-US -D data &&
pg_ctl -D data -l logfile start &&
createdb --icu-locale ru-RU --template template0 mydb
...
createdb: error: database creation failed: ERROR:  ICU locale cannot be
specified unless locale provider is ICU


Please see attached patch for a fix.  Does that work for you?


Yes, it works. The following test checks this fix:

diff --git a/src/bin/scripts/t/020_createdb.pl b/src/bin/scripts/t/020_createdb.pl
index b87d8fc63b5246b02bcd4499aae815269b60df7c..c2464a99618cd7ca5616cc21121e1e4379b52baf 100644
--- a/src/bin/scripts/t/020_createdb.pl
+++ b/src/bin/scripts/t/020_createdb.pl
@@ -71,6 +71,14 @@ if ($ENV{with_icu} eq 'yes')
 	$node2->command_ok(
 		[ 'createdb', '-T', 'template0', '--locale-provider=libc', 'foobar55' ],
 		'create database with libc provider from template database with icu provider');
+
+	$node2->command_ok(
+		[
+			'createdb', '-T', 'template0', '--icu-locale',
+			'en-US', 'foobar56'
+		],
+		'create database with icu locale from template database with icu provider'
+	);
 }
 else
 {

--
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: Refactor backup related code (was: Is it correct to say, "invalid data in file \"%s\"", BACKUP_LABEL_FILE in do_pg_backup_stop?)

2022-09-20 Thread Fujii Masao




On 2022/09/20 20:43, Bharath Rupireddy wrote:

-   if (strlen(backupidstr) > MAXPGPATH)
+   if (state->name_overflowed == true)
 ereport(ERROR,
 (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
  errmsg("backup label too long (max %d bytes)",
It does not strike me as a huge issue to force a truncation of such
backup label names.  1024 is large enough for basically all users,
in my opinion.  Or you could just truncate in the internal logic, but
still complain at MAXPGPATH - 1 as the last byte would be for the zero
termination.  In short, there is no need to complicate things with
name_overflowed.


We currently allow MAXPGPATH bytes of label name excluding null
termination. I don't want to change this behaviour. In the attached v9
patch, I'm truncating the larger names to MAXPGPATH + 1 bytes in the
backup state (one extra byte for representing that the name has
overflowed, and another extra byte for null termination).


This looks too complicated to me.

Instead of making allocate_backup_state() or reset_backup_state()
store the label name in BackupState before do_pg_backup_start(),
how about making do_pg_backup_start() do that after checking its length?
Seems this can simplify the code very much.

If so, ISTM that we can replace allocate_backup_state() and
reset_backup_state() with just palloc0() and MemSet(), respectively.
Also we can remove fill_backup_label_name().
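For illustration, the kind of simplification I have in mind is roughly the
following (a sketch with assumed field names, not actual code):

```c
/* Caller side: a plain allocation, no allocate_backup_state() helper. */
state = (BackupState *) palloc0(sizeof(BackupState));

/*
 * Inside do_pg_backup_start(), after validating the length.  This assumes
 * BackupState has a name[MAXPGPATH + 1] field that palloc0()/MemSet() has
 * zeroed, so the copy below stays NUL-terminated.
 */
if (strlen(backupidstr) > MAXPGPATH)
	ereport(ERROR,
			(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
			 errmsg("backup label too long (max %d bytes)", MAXPGPATH)));
memcpy(state->name, backupidstr, strlen(backupidstr));
```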

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: make additional use of optimized linear search routines

2022-09-20 Thread Richard Guo
On Wed, Sep 21, 2022 at 1:40 PM Michael Paquier  wrote:

> In short, switching those code paths to use the linear search routines
> looks like a good thing in the long-term, so I would like to apply
> this patch.  If you have any comments or objections, please feel
> free.


Yeah, I agree that the changes in the patch are meaningful even if the
performance gain is limited.

I wonder if there are other code paths we can replace with the linear
search routines. I tried to search for them but no luck.

Thanks
Richard


Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-09-20 Thread John Naylor
On Tue, Sep 20, 2022 at 3:19 PM Masahiko Sawada 
wrote:
>
> On Fri, Sep 16, 2022 at 4:54 PM John Naylor
>  wrote:

> > Here again, I'd rather put this off and focus on getting the "large
> > details" in good enough shape so we can go towards integrating with
> > vacuum.
>
> Thank you for the comments! These above comments are addressed by
> Nathan in a newly derived thread. I'll work on the patch.

I still seem to be out-voted on when to tackle this particular
optimization, so I've extended the v6 benchmark code with a hackish
function that populates a fixed number of keys, but with different fanouts.
(diff attached as a text file)

I didn't take particular care to make this scientific, but the following
seems pretty reproducible. Note what happens to load and search performance
when node16 has 15 entries versus 16:

 fanout | nkeys  | rt_mem_allocated | rt_load_ms | rt_search_ms
--------+--------+------------------+------------+--------------
     15 | 327680 |          3776512 |         39 |           20
(1 row)

num_keys = 327680, height = 4, n4 = 1, n16 = 23408, n32 = 0, n128 = 0, n256 = 0

 fanout | nkeys  | rt_mem_allocated | rt_load_ms | rt_search_ms
--------+--------+------------------+------------+--------------
     16 | 327680 |          3514368 |         25 |           11
(1 row)

num_keys = 327680, height = 4, n4 = 0, n16 = 21846, n32 = 0, n128 = 0, n256 = 0

In trying to wrap the SIMD code behind layers of abstraction, the latest
patch (and Nathan's cleanup) threw it away in almost all cases. To explain,
we need to talk about how vectorized code deals with the "tail" that is too
small for the register:

1. Use a one-by-one algorithm, like we do for the pg_lfind* variants.
2. Read some junk into the register and mask off false positives from the
result.

There are advantages to both depending on the situation.

Patch v5 and earlier used #2. Patch v6 used #1, so if a node16 has 15
elements or less, it will iterate over them one-by-one exactly like a
node4. Only when full with 16 will the vector path be taken. When another
entry is added, the elements are copied to the next bigger node, so there's
a *small* window where it's fast.
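To make the trade-off concrete, here is a minimal SSE2 sketch of approach #2
(my illustration only, not code from any patch version; it assumes the chunk
array is always allocated at the node's full 16-byte capacity so the
full-width load stays in bounds, and it uses a GCC/Clang builtin):

```c
#include <stdint.h>
#include <emmintrin.h>			/* SSE2 intrinsics; x86-only illustration */

/*
 * Search a node16-style chunk array for 'key'.  Load all 16 bytes even when
 * only 'count' entries are valid, then mask off matches in the junk lanes.
 */
static inline int
node16_find_chunk(const uint8_t *chunks, int count, uint8_t key)
{
	__m128i		spread = _mm_set1_epi8((char) key);
	__m128i		haystack = _mm_loadu_si128((const __m128i *) chunks);
	uint32_t	cmp = (uint32_t) _mm_movemask_epi8(_mm_cmpeq_epi8(haystack, spread));

	cmp &= (UINT32_C(1) << count) - 1;		/* drop false positives from junk */

	return cmp ? __builtin_ctz(cmp) : -1;	/* index of first match, or -1 */
}
```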

In short, this code needs to be lower level so that we still have full
control while being portable. I will work on this, and also the related
code for node dispatch.

Since v6 has some good infrastructure to do low-level benchmarking, I also
want to do some experiments with memory management.

(I have further comments about the code, but I will put that off until
later)

> I'll consider how to integrate with vacuum as the next step. One
> concern for me is how to limit the memory usage to
> maintenance_work_mem. Unlike using a flat array, memory space for
> adding one TID varies depending on the situation. If we want strictly
> not to allow using memory more than maintenance_work_mem, probably we
> need to estimate the memory consumption in a conservative way.

+1

--
John Naylor
EDB: http://www.enterprisedb.com
commit 18407962e96ccec6c9aeeba97412edd762a5a4fe
Author: John Naylor 
Date:   Wed Sep 21 11:44:43 2022 +0700

Add special benchmark function to test effect of fanout

diff --git a/contrib/bench_radix_tree/Makefile 
b/contrib/bench_radix_tree/Makefile
index b8f70e12d1..952bb0ceae 100644
--- a/contrib/bench_radix_tree/Makefile
+++ b/contrib/bench_radix_tree/Makefile
@@ -7,7 +7,7 @@ OBJS = \
 EXTENSION = bench_radix_tree
 DATA = bench_radix_tree--1.0.sql
 
-REGRESS = bench
+REGRESS = bench_fixed_height
 
 ifdef USE_PGXS
 PG_CONFIG = pg_config
diff --git a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql 
b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
index 6663abe6a4..f2fee15b17 100644
--- a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
+++ b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
@@ -40,3 +40,15 @@ OUT load_ms int8)
 returns record
 as 'MODULE_PATHNAME'
 LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_fixed_height_search(
+fanout int4,
+OUT fanout int4,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT rt_search_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
diff --git a/contrib/bench_radix_tree/bench_radix_tree.c 
b/contrib/bench_radix_tree/bench_radix_tree.c
index 5806ef7519..0778da2d7b 100644
--- a/contrib/bench_radix_tree/bench_radix_tree.c
+++ b/contrib/bench_radix_tree/bench_radix_tree.c
@@ -13,6 +13,7 @@
 #include "fmgr.h"
 #include "funcapi.h"
 #include "lib/radixtree.h"
+#include 
 #include "miscadmin.h"
 #include "utils/timestamp.h"
 
@@ -24,6 +25,7 @@ PG_MODULE_MAGIC;
 PG_FUNCTION_INFO_V1(bench_seq_search);
 PG_FUNCTION_INFO_V1(bench_shuffle_search);
 PG_FUNCTION_INFO_V1(bench_load_random_int);
+PG_FUNCTION_INFO_V1(bench_fixed_height_search);
 
 static radix_tree *rt = NULL;
 static ItemPointer itemptrs = NULL;
@@ -299,3 +301,108 @@ bench_load_random_int(PG_FUNCTION_ARGS)
rt_free(rt);
PG_RETURN_DATUM(HeapTupleGetDatum(h

Re: Fix snapshot name for SET TRANSACTION documentation

2022-09-20 Thread Fujii Masao




On 2022/09/21 12:01, Japin Li wrote:


Hi hackers,

In 6c2003f8a1bbc7c192a2e83ec51581c018aa162f, we changed the snapshot name
used when exporting a snapshot; however, there is one place where we missed
updating the snapshot name in the documentation.  Attached is a patch to fix it.


Thanks for the patch! Looks good to me.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: make additional use of optimized linear search routines

2022-09-20 Thread Michael Paquier
On Sat, Sep 03, 2022 at 10:06:58AM +0900, Michael Paquier wrote:
> Ohoh.  This sounds like a good idea to me, close to what John has
> applied lately.  I'll take a closer look..

So, the two code paths patched here are rather isolated.  The one in
TransactionIdIsInProgress() requires an overflowed set of subxids
still running, something similar to what the isolation test
subxid-overflow does.  XidIsConcurrent() is also kind of hard to
reason about with a benchmark.

Anyway, I did not know about the work done with SIMD instructions in
pg_lfind.h, and after playing with the API I have run some micro-benchmarks
with pg_lfind32() and I can see some improvements.  With a range of
100~10k elements in a fixed number of repeated calls with a for loop
and lfind(), I could not get up to the 40% speedup.  That was somewhat
closer to 15%~20% on x86 and 20%~25% with arm64.  There is a trend
where things got better with a higher number of elements with
lfind().
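For reference, the kind of loop I mean is roughly this (a rough sketch, not
the exact harness I used):

```c
#include "postgres.h"
#include "port/pg_lfind.h"

/*
 * Call pg_lfind32() repeatedly over a fixed array; the timing is then
 * compared against a plain one-by-one loop doing the same search.
 */
static bool
bench_lfind32(uint32 key, uint32 *elements, uint32 nelements, int nreps)
{
	bool		found = false;

	for (int rep = 0; rep < nreps; rep++)
		found |= pg_lfind32(key, elements, nelements);

	return found;
}
```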

In short, switching those code paths to use the linear search routines
looks like a good thing in the long-term, so I would like to apply
this patch.  If you have any comments or objections, please feel
free.
--
Michael




Re: [PATCH]Feature improvement for MERGE tab completion

2022-09-20 Thread Fujii Masao



On 2022/09/21 0:51, Alvaro Herrera wrote:

The rules starting at line 4111 make me a bit nervous, since nowhere
we're restricting them to operating only on MERGE lines.  I don't think
it's a real problem since USING is not terribly common anyway.  Likewise
for the ones with WHEN [NOT] MATCHED.  I kinda wish we had a way to
search for stuff like "keyword MERGE appears earlier in the command",
but we don't have that.


Yeah, I was thinking the same when updating the patch.

How about adding something like PartialMatches() that checks whether
the keywords are included in the input string or not? If so, we can restrict
some tab-completion rules to operating only on MERGE, as follows. I attached
the WIP patch (0002 patch) that introduces PartialMatches().
Is this approach over-complicated? Thoughts?

+   else if (PartialMatches("MERGE", "INTO", MatchAny, "USING") ||
+            PartialMatches("MERGE", "INTO", MatchAny, "AS", MatchAny, "USING") ||
+            PartialMatches("MERGE", "INTO", MatchAny, MatchAny, "USING"))
+   {
+       /* Complete MERGE INTO ... ON with target table attributes */
+       if (TailMatches("INTO", MatchAny, "USING", MatchAny, "ON"))
+           COMPLETE_WITH_ATTR(prev4_wd);
+       else if (TailMatches("INTO", MatchAny, "AS", MatchAny, "USING", MatchAny, "AS", MatchAny, "ON"))
+           COMPLETE_WITH_ATTR(prev8_wd);
+       else if (TailMatches("INTO", MatchAny, MatchAny, "USING", MatchAny, MatchAny, "ON"))
+           COMPLETE_WITH_ATTR(prev6_wd);


Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

From 671b4dff2e259b4d148dcbfaee0611fc2bddce85 Mon Sep 17 00:00:00 2001
From: Fujii Masao 
Date: Wed, 21 Sep 2022 11:46:59 +0900
Subject: [PATCH 1/2] psql: Improve tab-completion for MERGE.

---
 src/bin/psql/tab-complete.c | 102 +++-
 1 file changed, 67 insertions(+), 35 deletions(-)

diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index f3465adb85..820f47d23a 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -1669,7 +1669,7 @@ psql_completion(const char *text, int start, int end)
"COMMENT", "COMMIT", "COPY", "CREATE", "DEALLOCATE", "DECLARE",
"DELETE FROM", "DISCARD", "DO", "DROP", "END", "EXECUTE", 
"EXPLAIN",
"FETCH", "GRANT", "IMPORT FOREIGN SCHEMA", "INSERT INTO", 
"LISTEN", "LOAD", "LOCK",
-   "MERGE", "MOVE", "NOTIFY", "PREPARE",
+   "MERGE INTO", "MOVE", "NOTIFY", "PREPARE",
"REASSIGN", "REFRESH MATERIALIZED VIEW", "REINDEX", "RELEASE",
"RESET", "REVOKE", "ROLLBACK",
"SAVEPOINT", "SECURITY LABEL", "SELECT", "SET", "SHOW", "START",
@@ -3641,7 +3641,7 @@ psql_completion(const char *text, int start, int end)
  */
else if (Matches("EXPLAIN"))
COMPLETE_WITH("SELECT", "INSERT INTO", "DELETE FROM", "UPDATE", 
"DECLARE",
- "MERGE", "EXECUTE", "ANALYZE", 
"VERBOSE");
+ "MERGE INTO", "EXECUTE", "ANALYZE", 
"VERBOSE");
else if (HeadMatches("EXPLAIN", "(*") &&
 !HeadMatches("EXPLAIN", "(*)"))
{
@@ -3660,12 +3660,12 @@ psql_completion(const char *text, int start, int end)
}
else if (Matches("EXPLAIN", "ANALYZE"))
COMPLETE_WITH("SELECT", "INSERT INTO", "DELETE FROM", "UPDATE", 
"DECLARE",
- "MERGE", "EXECUTE", "VERBOSE");
+ "MERGE INTO", "EXECUTE", "VERBOSE");
else if (Matches("EXPLAIN", "(*)") ||
 Matches("EXPLAIN", "VERBOSE") ||
 Matches("EXPLAIN", "ANALYZE", "VERBOSE"))
COMPLETE_WITH("SELECT", "INSERT INTO", "DELETE FROM", "UPDATE", 
"DECLARE",
- "MERGE", "EXECUTE");
+ "MERGE INTO", "EXECUTE");
 
 /* FETCH && MOVE */
 
@@ -4065,58 +4065,90 @@ psql_completion(const char *text, int start, int end)
else if (HeadMatches("LOCK") && TailMatches("IN", "SHARE"))
COMPLETE_WITH("MODE", "ROW EXCLUSIVE MODE",
  "UPDATE EXCLUSIVE MODE");
+
+   /* Complete LOCK [TABLE] [ONLY]  [IN lockmode MODE] with 
"NOWAIT" */
+   else if (HeadMatches("LOCK") && TailMatches("MODE"))
+   COMPLETE_WITH("NOWAIT");
+
 /* MERGE --- can be inside EXPLAIN */
else if (TailMatches("MERGE"))
COMPLETE_WITH("INTO");
else if (TailMatches("MERGE", "INTO"))
COMPLETE_WITH_SCHEMA_QUERY(Query_for_list_of_mergetargets);
+
+   /* Complete MERGE INTO  [[AS] ] with USING */
else if (TailMatches("MERGE", "INTO", MatchAny))
 

Re: Add common function ReplicationOriginName.

2022-09-20 Thread Amit Kapila
On Tue, Sep 20, 2022 at 2:06 PM Aleksander Alekseev
 wrote:
>
> Hi Amit,
>
> > I think it is better to use Size. Even though, it may not fail now as
> > the size of names for origin will always be much lesser but it is
> > better if we are consistent. If we agree with this, then as a first
> > patch, we can make it to use Size in existing places and then
> > introduce this new function.
>
> OK, here is the updated patchset.
>
> * 0001 replaces int's with Size's in the existing code
>

Pushed this one.

> * 0002 applies Peter's patch on top of 0001
>

Can't we use the existing function ReplicationOriginNameForTablesync()
by passing relid as InvalidOid for this purpose? We need a check
inside to decide which name to construct, otherwise, it should be
fine. If we agree with this, then we can change the name of the
function to something like ReplicationOriginNameForLogicalRep or
ReplicationOriginNameForLogicalRepWorkers.
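For illustration, the combined helper could look roughly like this (a sketch
only; the final name and shape are up for discussion):

```c
/*
 * Sketch: build the origin name used by logical replication workers.
 * "pg_%u_%u" is the existing tablesync convention and "pg_%u" the existing
 * subscription-level one.
 */
static void
ReplicationOriginNameForLogicalRep(Oid suboid, Oid relid,
								   char *originname, Size szoriginname)
{
	if (OidIsValid(relid))		/* tablesync worker */
		snprintf(originname, szoriginname, "pg_%u_%u", suboid, relid);
	else						/* apply worker */
		snprintf(originname, szoriginname, "pg_%u", suboid);
}
```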

-- 
With Regards,
Amit Kapila.




Re: [RFC] building postgres with meson - v13

2022-09-20 Thread Andres Freund
Hi,

On 2022-09-21 09:52:48 +0700, John Naylor wrote:
> On Wed, Sep 21, 2022 at 7:11 AM Andres Freund  wrote:
> >
> > Hi,
> >
> > On 2022-09-19 19:16:30 -0700, Andres Freund wrote:
> > > To have some initial "translation" for other developers I've started a
> wiki
> > > page with a translation table. Still very WIP:
> > > https://wiki.postgresql.org/wiki/Meson
> > >
> > > For now, a bit of polishing aside, I'm just planning to add a minimal
> > > explanation of what's happening, and a reference to this thread.
> >
> > I added installation instructions for meson for a bunch of platforms, but
> 
> Small typo: The homebrew section is still labeled with "find MacPorts
> libraries".

Thanks, fixed. I wrote these blindly, so there's probably more wrong than this
- although Thomas was helpful enough to provide some information / testing.

Greetings,

Andres Freund




Re: predefined role(s) for VACUUM and ANALYZE

2022-09-20 Thread Nathan Bossart
On Wed, Sep 21, 2022 at 10:31:47AM +0900, Michael Paquier wrote:
> Did you just run an aclupdate()?  4% for aclitem[] sounds like quite a
> number to me :/  It may be worth looking at if these operations could
> be locally optimized more, as well.  I'd like to think that we could
> live with that to free up enough bits in AclItems for the next 20
> years, anyway.  Any opinions?

Yes, the test was mostly for aclupdate().  Looking at that function, I bet
most of its time is spent in palloc0() and memcpy().  It might be possible
to replace the linear search if the array was sorted, but I'm skeptical
that will help much.  In the end, I'm not sure it's worth worrying too much
about 2,000 calls to aclupdate() with an array of 2,000 ACLs taking 5.3
seconds instead of 5.1 seconds.

I bet a more pressing concern is the calls to aclmask() since checking
privileges is probably done more frequently than updating them.  That
appears to use a linear search, too, so maybe sorting the aclitem arrays is
actually worth exploring.  I still doubt there will be much noticeable
impact from expanding AclMode outside of the most extreme cases.
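To make that idea concrete, sorting and searching would only need a
comparator on the grantee, something like the following sketch (purely
illustrative, not a proposal):

```c
/*
 * Order aclitem entries by grantee OID so aclmask() could use a binary
 * search (bsearch) instead of a linear scan.
 */
static int
aclitem_grantee_cmp(const void *a, const void *b)
{
	const AclItem *ia = (const AclItem *) a;
	const AclItem *ib = (const AclItem *) b;

	if (ia->ai_grantee < ib->ai_grantee)
		return -1;
	if (ia->ai_grantee > ib->ai_grantee)
		return 1;
	return 0;
}
```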

> For the column sizes of the catalogs, I was wondering about how
> pg_column_size() changes when they hold ACL information.  Unoptimized
> alignment could cause an unnecessary increase in the structure sizes,
> so the addition of new fields or changes in object size could have
> unexpected side effects.

After a few tests, I haven't discovered any changes to the output of
pg_column_size().

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Virtual tx id

2022-09-20 Thread Julien Rouhaud
Hi,

On Wed, Sep 21, 2022 at 01:58:47PM +1000, James Sewell wrote:
> Hello Hackers!
>
> Is it possible to get the current virtual txid from C somehow?
>
> I've looked through the code, but can't seem to find anything other than
> getting a NULL when there is no (real) xid assigned. Maybe I'm missing
> something?

It should be MyProc->lxid, and prepend it with MyBackendId if you want to
compare something like pg_locks.virtualtransaction.
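For instance, something along these lines (an illustrative sketch only):

```c
/*
 * Requires miscadmin.h (MyBackendId) and storage/proc.h (MyProc).
 * Builds the same textual form that pg_locks.virtualtransaction shows.
 */
char		vxid[64];

snprintf(vxid, sizeof(vxid), "%d/%u", MyBackendId, MyProc->lxid);
```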




Re: Virtual tx id

2022-09-20 Thread Zhang Mingli
HI,
On Sep 21, 2022, 11:59 +0800, James Sewell , wrote:
> Hello Hackers!
>
> Is it possible to get the current virtual txid from C somehow?
>
> I've looked through the code, but can't seem to find anything other than 
> getting a NULL when there is no (real) xid assigned. Maybe I'm missing 
> something?
>
> Cheers,
> James

Virtual xid is meaningful only inside a read-only transaction.

It’s made up of MyProc->backendId and MyProc->lxid (a LocalTransactionId).

To catch it, you can begin a transaction, but don’t execute any SQL that could
change the database.

Then attach gdb to the process to see it (you must execute a read-only SQL
statement to call StartTransaction, otherwise MyProc->lxid is not assigned).

```
Begin;
// do nothing

```
gdb on it and see

```
p MyProc->lxid
p MyProc->backendId

```




Regards,
Zhang Mingli


Re: Virtual tx id

2022-09-20 Thread Japin Li


On Wed, 21 Sep 2022 at 11:58, James Sewell  wrote:
> Hello Hackers!
>
> Is it possible to get the current virtual txid from C somehow?
>
The virtual txid consists of MyProc->backendId and MyProc->lxid.  Do you
mean a C function that returns the virtual txid?

> I've looked through the code, but can't seem to find anything other than
> getting a NULL when there is no (real) xid assigned. Maybe I'm missing
> something?
>
Do you mean using an SQL function to check the virtual txid?  IIRC, there is
no such function.  Maybe you can use pg_export_snapshot() to get the virtual
txid (v10 and later); the filename of the exported snapshot consists of
MyProc->backendId and MyProc->lxid.


-- 
Regards,
Japin Li.
ChengDu WenWu Information Technology Co.,Ltd.




Virtual tx id

2022-09-20 Thread James Sewell
Hello Hackers!

Is it possible to get the current virtual txid from C somehow?

I've looked through the code, but can't seem to find anything other than
getting a NULL when there is no (real) xid assigned. Maybe I'm missing
something?

Cheers,
James


Fix snapshot name for SET TRANSACTION documentation

2022-09-20 Thread Japin Li

Hi hackers,

In 6c2003f8a1bbc7c192a2e83ec51581c018aa162f, we changed the snapshot name
used when exporting a snapshot; however, there is one place where we missed
updating the snapshot name in the documentation.  Attached is a patch to fix it.

-- 
Regards,
Japin Li.
ChengDu WenWu Information Technology Co.,Ltd.

From 5eac3c0ca274da30b2e2cf0e19887892d5f92788 Mon Sep 17 00:00:00 2001
From: Japin Li 
Date: Wed, 21 Sep 2022 10:51:22 +0800
Subject: [PATCH v1] Fix snapshot name for set transaction documentation

---
 doc/src/sgml/ref/set_transaction.sgml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/doc/src/sgml/ref/set_transaction.sgml b/doc/src/sgml/ref/set_transaction.sgml
index e062e2461e..d394e622f8 100644
--- a/doc/src/sgml/ref/set_transaction.sgml
+++ b/doc/src/sgml/ref/set_transaction.sgml
@@ -164,7 +164,7 @@ SET SESSION CHARACTERISTICS AS TRANSACTION transa
snapshot identifier, which must be given to SET TRANSACTION
SNAPSHOT to specify which snapshot is to be imported.  The
identifier must be written as a string literal in this command, for example
-   '03A1-1'.
+   '0003-001B-1'.
SET TRANSACTION SNAPSHOT can only be executed at the
start of a transaction, before the first query or
data-modification statement (SELECT,
-- 
2.25.1



Re: [RFC] building postgres with meson - v13

2022-09-20 Thread John Naylor
On Wed, Sep 21, 2022 at 7:11 AM Andres Freund  wrote:
>
> Hi,
>
> On 2022-09-19 19:16:30 -0700, Andres Freund wrote:
> > To have some initial "translation" for other developers I've started a
wiki
> > page with a translation table. Still very WIP:
> > https://wiki.postgresql.org/wiki/Meson
> >
> > For now, a bit of polishing aside, I'm just planning to add a minimal
> > explanation of what's happening, and a reference to this thread.
>
> I added installation instructions for meson for a bunch of platforms, but

Small typo: The homebrew section is still labeled with "find MacPorts
libraries".

--
John Naylor
EDB: http://www.enterprisedb.com


RE: why can't a table be part of the same publication as its schema

2022-09-20 Thread houzj.f...@fujitsu.com
On Wednesday, September 21, 2022 4:06 AM Mark Dilger 
 wrote:
> > On Sep 20, 2022, at 12:36 PM, Jonathan S. Katz 
> wrote:
> >
> > This behavior exists "FOR ALL TABLES" without the "IN SCHEMA" qualifier.
> This was discussed multiple times on the original thread[1].
> 
> Yes, nobody is debating that as far as I can see.  And I do take your point 
> that
> this stuff was discussed in other threads quite a while back.
> 
> > I tried to diligently read the sections where we talk about granting +
> privileges[2][3] to see what it says about "ALL * IN SCHEMA". Unless I missed 
> it,
> and I read through it twice, it does not explicitly state whether or not 
> "GRANT"
> applies to all objects at only that given moment, or to future objects of that
> type which are created in that schema. Maybe the behavior is implied or is 
> part
> of the standard, but it's not currently documented.
> 
> Interesting.  Thanks for that bit of research.
> 
> > We do link to "ALTER DEFAULT PRIVILEGES" at the bottom of the GRANT[2]
> docs, but we don't give any indication as to why.
> >
> > (This is also to say we should document in GRANT that ALL * IN SCHEMA does
> not apply to future objects;
> 
> Yes, I agree this should be documented.
> 
> > if you need that behavior use ALTER DEFAULT PRIVILEGES. Separate thread :)
> >
> > I understand there is a risk of confusion of the similar grammar across
> commands, but the current command in logical replication has this is building
> on the existing behavior.
> 
> I don't complain that it is building on the existing behavior.  I'm *only*
> concerned about the keywords we're using for this.  Consider the following:
> 
>-- AS ADMIN
>CREATE USER bob NOSUPERUSER;
>GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA foo TO bob;
>SET ROLE bob;
>CREATE PUBLICATION bobs_pub FOR ALL TABLES IN SCHEMA foo;
> 
> We're going to have that fail in pg15 because the FOR ALL TABLES IN SCHEMA
> option is reserved to superusers.  But we agreed that was a stop-gap solution
> that we'd potentially loosen in the future.  Certainly we'll need wiggle room 
> in
> the syntax to perform that loosening:
> 
>--- Must be superuser for this in pg15, and in subsequent releases.
>CREATE PUBLICATION bobs_pub FOR ALL FUTURE TABLES IN SCHEMA foo;
> 
>--- Not supported in pg15, but reserved for some future pg versions to
> allow
>--- non-superusers to create publications on tables currently in schema 
> foo,
>--- assuming they have sufficient privileges on those tables
>CREATE PUBLICATION bobs_pub FOR ALL TABLES IN SCHEMA foo;
> 
> Doing it this way makes the syntax consistent between the GRANT...TO bob and
> the CREATE PUBLICATION bobs_pub.  Surely this makes more sense?

Thanks for the suggestion.

My concern is that I am not sure whether we really want to add a feature that
only publishes all the current tables (not future tables).

I think, if possible, it would be better to find an approach that can release
the superuser restriction for both FOR ALL TABLES and FOR ALL TABLES IN SCHEMA
in a future release. Another solution might be to introduce a new
publication option (like: include_future).

When user execute:
CREATE PUBLICATION ... FOR ALL TABLES IN SCHEMA ... WITH (include_future)

it means we publish all current and future tables, and it requires superuser. We
can set the default value of this option to 'true', and users can set it to false
if they only want to publish the current tables and don't want to use superuser.
With this approach, we don't need to change the syntax.

Best regards,
Hou zj


Re: Make ON_ERROR_STOP stop on shell script failure

2022-09-20 Thread Fujii Masao




On 2022/09/20 15:15, bt22nakamorit wrote:

I thought that this action is rather unexpected since, based on the
word "ON_ERROR_STOP", one may expect that failures of shell
scripts should halt the incoming instructions as well.
One clear solution is to let failures of shell scripts stop incoming
queries just like errors of SQL statements do currently. Thoughts?


+1



I edited the documentation for ON_ERROR_STOP.
Any other suggestions?


Thanks for the patch!
Could you add it to the next CommitFest so that we don't forget it?


We can execute the shell commands via psql in various ways
other than \! meta-command. For example,

* `command`
* \g | command
* \gx | command
* \o | command
* \w | command
* \copy ... program 'command'

Should ON_ERROR_STOP handle not only \! but also all the above in the same way?


One concern about this patch is that some applications already depend on
the current behavior of ON_ERROR_STOP, i.e., psql doesn't stop even when
the shell command returns non-zero exit code. If so, we might need to
extend ON_ERROR_STOP so that it accepts the following setting values.

* off - don't stop even when either sql or shell fails (same as the current behavior)
* on or sql - stop only when sql fails (same as the current behavior)
* shell - stop only when shell fails
* all - stop when either sql or shell fails

Thoughts?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: pg_receivewal fail to streams when the partial file to write is not fully initialized present in the wal receiver directory

2022-09-20 Thread Bharath Rupireddy
On Fri, Aug 19, 2022 at 5:27 PM Bharath Rupireddy
 wrote:
>
> Please review the attached v6 patch.

I'm attaching the v7 patch rebased on to the latest HEAD.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
From 5ccee7e8ed82952bad6ef410176952aec2733e4f Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Wed, 21 Sep 2022 01:44:44 +
Subject: [PATCH v7] Make WAL file initialization by pg_receivewal atomic

While initializing WAL files with zeros (i.e. padding), pg_receivewal
can fail for any reason, for instance, no space left on device or a
host server crash, which may leave partially padded WAL files, due to
which pg_receivewal won't come up after the host server restarts. The
error it emits is "error: write-ahead log file "foo.partial" has x
bytes, should be 0 or y" (where y is typically the size of a WAL file
and x < y).

To fix this, one needs to manually intervene and delete the partially
padded WAL file, which requires an understanding of what the issue is
and is often a time-consuming process in production environments.

The above problem can also exist for pg_basebackup as it uses the
same walmethods.c function for initialization.

To address the above problem, make WAL file initialization atomic
i.e. first create a temporary file, pad it and then rename it to the
target file. This is similar to what postgres does to make WAL file
initialization atomic.

Authors: Bharath Rupireddy, SATYANARAYANA NARLAPURAM
Reviewed-by: Cary Huang, mahendrakar s
Reviewed-by: Kyotaro Horiguchi, Michael Paquier
Discussion: https://www.postgresql.org/message-id/CAHg%2BQDcVUss6ocOmbLbV5f4YeGLhOCt%2B1x2yLNfG2H_eDwU8tw%40mail.gmail.com
---
 src/bin/pg_basebackup/pg_receivewal.c | 22 ++
 src/bin/pg_basebackup/walmethods.c| 96 ---
 2 files changed, 109 insertions(+), 9 deletions(-)

diff --git a/src/bin/pg_basebackup/pg_receivewal.c b/src/bin/pg_basebackup/pg_receivewal.c
index f98ec557db..1cbc992e67 100644
--- a/src/bin/pg_basebackup/pg_receivewal.c
+++ b/src/bin/pg_basebackup/pg_receivewal.c
@@ -237,6 +237,28 @@ is_xlogfilename(const char *filename, bool *ispartial,
 		return true;
 	}
 
+	/*
+	 * File looks like a temporary partial uncompressed WAL file that was
+	 * leftover during pre-padding phase from a previous crash, delete it now
+	 * as we don't need it.
+	 */
+	if (fname_len == XLOG_FNAME_LEN + strlen(".partial.tmp") &&
+		strcmp(filename + XLOG_FNAME_LEN, ".partial.tmp") == 0)
+	{
+		/* 12 is length of string ".partial.tmp" */
+		char	path[MAXPGPATH + XLOG_FNAME_LEN + 12];
+
+		snprintf(path, sizeof(path), "%s/%s", basedir, filename);
+
+		if (unlink(path) < 0)
+			pg_log_error("could not remove file \"%s\"", path);
+
+		if (verbose)
+			pg_log_info("removed file \"%s\"", path);
+
+		return false;
+	}
+
 	/* File does not look like something we know */
 	return false;
 }
diff --git a/src/bin/pg_basebackup/walmethods.c b/src/bin/pg_basebackup/walmethods.c
index bc2e83d02b..4d34d4346a 100644
--- a/src/bin/pg_basebackup/walmethods.c
+++ b/src/bin/pg_basebackup/walmethods.c
@@ -118,7 +118,10 @@ dir_open_for_write(WalWriteMethod *wwmethod, const char *pathname,
    const char *temp_suffix, size_t pad_to_size)
 {
 	DirectoryMethodData *dir_data = (DirectoryMethodData *) wwmethod;
-	char		tmppath[MAXPGPATH];
+	char	targetpath[MAXPGPATH];
+	char 	tmpsuffixpath[MAXPGPATH];
+	bool	shouldcreatetempfile = false;
+	int			flags;
 	char	   *filename;
 	int			fd;
 	DirectoryMethodFile *f;
@@ -134,7 +137,7 @@ dir_open_for_write(WalWriteMethod *wwmethod, const char *pathname,
 	clear_error(wwmethod);
 
 	filename = dir_get_file_name(wwmethod, pathname, temp_suffix);
-	snprintf(tmppath, sizeof(tmppath), "%s/%s",
+	snprintf(targetpath, sizeof(targetpath), "%s/%s",
 			 dir_data->basedir, filename);
 	pg_free(filename);
 
@@ -143,12 +146,57 @@ dir_open_for_write(WalWriteMethod *wwmethod, const char *pathname,
 	 * the file descriptor is important for dir_sync() method as gzflush()
 	 * does not do any system calls to fsync() to make changes permanent on
 	 * disk.
+	 *
+	 * Create a temporary file for padding and no compression cases to ensure
+	 * a fully initialized file is created.
 	 */
-	fd = open(tmppath, O_WRONLY | O_CREAT | PG_BINARY, pg_file_create_mode);
+	if (pad_to_size > 0 &&
+		wwmethod->compression_algorithm == PG_COMPRESSION_NONE)
+	{
+		shouldcreatetempfile = true;
+		flags = O_WRONLY | PG_BINARY;
+	}
+	else
+	{
+		shouldcreatetempfile = false;
+		flags = O_WRONLY | O_CREAT | PG_BINARY;
+	}
+
+	fd = open(targetpath, flags, pg_file_create_mode);
 	if (fd < 0)
 	{
-		wwmethod->lasterrno = errno;
-		return NULL;
+		if (errno == ENOENT && shouldcreatetempfile)
+		{
+			/*
+			 * Actual file doesn't exist. Now, create a temporary file pad it
+			 * and rename to the target file. The temporary file may exist from
+			 * the last failed attempt (failed during partial padding or
+			 * renaming or some ot

RE: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher

2022-09-20 Thread kuroda.hay...@fujitsu.com
> One last thing - do you think there is any need to mention this
> behaviour in the pgdocs, or is OK just to be a hidden performance
> improvement?

FYI - I put my opinion.
We have following sentence in the logical-replication.sgml:

```
...
If the table does not have any suitable key, then it can be set
   to replica identity full, which means the entire row becomes
   the key.  This, however, is very inefficient and should only be used as a
   fallback if no other solution is possible.
...
```

Here the word "very inefficient" may mean that sequential scans will be
executed every time.
I think some description could be added around here.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



Re: why can't a table be part of the same publication as its schema

2022-09-20 Thread Jonathan S. Katz

[personal views, not RMT]

On 9/20/22 4:06 PM, Mark Dilger wrote:


I don't complain that it is building on the existing behavior.  I'm *only* 
concerned about the keywords we're using for this.  Consider the following:

-- AS ADMIN
CREATE USER bob NOSUPERUSER;
GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA foo TO bob;
SET ROLE bob;
CREATE PUBLICATION bobs_pub FOR ALL TABLES IN SCHEMA foo;

We're going to have that fail in pg15 because the FOR ALL TABLES IN SCHEMA 
option is reserved to superusers.  But we agreed that was a stop-gap solution 
that we'd potentially loosen in the future.  Certainly we'll need wiggle room 
in the syntax to perform that loosening:

--- Must be superuser for this in pg15, and in subsequent releases.
CREATE PUBLICATION bobs_pub FOR ALL FUTURE TABLES IN SCHEMA foo;

--- Not supported in pg15, but reserved for some future pg versions to allow
--- non-superusers to create publications on tables currently in schema foo,
--- assuming they have sufficient privileges on those tables
CREATE PUBLICATION bobs_pub FOR ALL TABLES IN SCHEMA foo;

Doing it this way makes the syntax consistent between the GRANT...TO bob and 
the CREATE PUBLICATION bobs_pub.  Surely this makes more sense?


When you put it that way, I see your point. However, for the 
lesser-privileged user, will the behavior be that it will 
continue to add all future tables in a schema to the publication so long 
as they have sufficient privileges on those tables? Or would that mirror 
the current behavior with GRANT?


While I understand it makes it consistent, the one concern I raise is 
that it means the less privileged user could have a less convenient user 
experience than the privileged user. Perhaps that's OK, but worth noting.



I'm not a huge fan of the keyword "FUTURE" here, but I found a reference to 
another database that uses that keyword for what I think is a similar purpose.


I did try doing research on this previously, but hadn't thought to 
incorporate "future" into my searches.


Doing so, I probably found the same database that you did that used the 
"FUTURE" word for adding permissions to future objects (and this is 
fresh, as the docs for it were published last week). That's definitely 
interesting.


I did see some notes on a legacy database system that offered similar 
advice to what we do for GRANT if you're not using ALTER DEFAULT PRIVILEGES.



 We should choose *something* for this, though, if we want things to be 
rational going forward.


That all said, while I understand your point and open to the suggestion 
on "FUTURE", I'm not convinced on the syntax change. But I'll sleep on it.


Jonathan




RE: Perform streaming logical transactions by background workers and parallel apply

2022-09-20 Thread wangw.f...@fujitsu.com
> FYI -
> 
> The latest patch 30-0001 fails to apply, it seems due to a recent commit [1].
> 
> [postgres@CentOS7-x64 oss_postgres_misc]$ git apply
> ../patches_misc/v30-0001-Perform-streaming-logical-transactions-by-
> parall.patch
> error: patch failed: src/include/replication/logicalproto.h:246
> error: src/include/replication/logicalproto.h: patch does not apply

Thanks for your kindly reminder.

I rebased the patch set and attached them in [1].

[1] - 
https://www.postgresql.org/message-id/OS3PR01MB6275298521AE1BBEF5A055EE9E4F9%40OS3PR01MB6275.jpnprd01.prod.outlook.com

Regards,
Wang wei


Re: Perform streaming logical transactions by background workers and parallel apply

2022-09-20 Thread Peter Smith
FYI -

The latest patch 30-0001 fails to apply, it seems due to a recent commit [1].

[postgres@CentOS7-x64 oss_postgres_misc]$ git apply
../patches_misc/v30-0001-Perform-streaming-logical-transactions-by-parall.patch
error: patch failed: src/include/replication/logicalproto.h:246
error: src/include/replication/logicalproto.h: patch does not apply

--
[1] 
https://github.com/postgres/postgres/commit/bfcf1b34805f70df48eedeec237230d0cc1154a6

Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Hash index build performance tweak from sorting

2022-09-20 Thread David Rowley
On Tue, 2 Aug 2022 at 03:37, Simon Riggs  wrote:
> Using the above test case, I'm getting a further 4-7% improvement on
> already committed code with the attached patch, which follows your
> proposal.
>
> The patch passes info via a state object, useful to avoid API churn in
> later patches.

Hi Simon,

I took this patch for a spin and saw a 2.5% performance increase using
the random INT test that Tom posted. The index took an average of
7227.47 milliseconds on master and 7045.05 with the patch applied.

On making a pass of the changes, I noted down a few things.

1. In _h_spoolinit() the HSpool is allocated with palloc and then
you're setting the istate field to a pointer to the HashInsertState
which is allocated on the stack by the only calling function
(hashbuild()).  Looking at hashbuild(), it looks like the return value
of _h_spoolinit is never put anywhere to make it available outside of
the function, so it does not seem like there is an actual bug there.
However, it just seems like a bug waiting to happen. If _h_spoolinit()
is pallocing memory, then we really shouldn't be setting pointer
fields in that memory to point to something on the stack.  It might be
nicer if the istate field in HSpool was a HashInsertStateData and
_h_spoolinit() just memcpy'd the contents of that parameter.  That
would make HSpool 4 bytes smaller and save additional pointer
dereferences in _hash_doinsert().

2. There are quite a few casts that are not required. e.g:

_hash_doinsert(rel, itup, heapRel, (HashInsertState) &istate);
buildstate.spool = _h_spoolinit(heap, index, num_buckets,
(HashInsertState) &insertstate);
buildstate.istate = (HashInsertState) &insertstate;

This is just my opinion, but I don't really see the value in having a
typedef for a pointer to HashInsertStateData. I can understand that if
the struct was local to a .c file, but you've got the struct and
pointer typedef in the same header. I understand we often do this in
the code, but I feel like we do it less often in newer code. e.g we do
it in aset.c but not generation.c (which is much newer than aset.c).
My personal preference would be just to name the struct
HashInsertState and have no extra pointer typedefs.

3. Just a minor nitpick. Line wraps at 80 chars. You're doing this
sometimes but not others. This seems just to be due to the additional
function parameters that have been added.

4. I added the following Assert to _hash_pgaddtup() as I expected the
itup_off to be set to the same thing before and after this change. I
see the Assert is failing in the regression tests.

Assert(PageGetMaxOffsetNumber(page) + 1 ==
   _hash_binsearch(page, _hash_get_indextuple_hashkey(itup)));

I think this is because _hash_binsearch() returns the offset with the
first tuple with the given hashkey, so if there are duplicate hashkey
values then it looks like PageAddItemExtended() will set needshuffle
and memmove() the existing item(s) up one slot.  I don't know this
hash index building code very well, but I wonder if it's worth having
another version of _hash_binsearch() that can be used to make
_hash_pgaddtup() put any duplicate hashkeys after the existing ones
rather than before and shuffle the others up? It sounds like that
might speed up normal insertions when there are many duplicate values
to hash.

I wonder if this might be the reason the random INT test didn't come
out as good as your original test which had unique values. The unique
values test would do less shuffling during PageAddItemExtended(). If
so, that implies that skipping the binary search is only part of the
gains here and that not shuffling tuples accounts for quite a bit of
the gain you're seeing. If so, then it would be good to not have to
shuffle duplicate hashkey tuples up in the page during normal
insertions as well as when building the index.

In any case, it would be nice to have some way to assert that we don't
accidentally pass sorted==true to _hash_pgaddtup() when there's an
existing item on the page with a higher hash value. Maybe we could
just look at the hash value of the last tuple on the page and ensure
it's <= to the current one?

5. I think it would be nicer to move the insertstate.sorted = false;
into the else branch in hashbuild(). However, you might have to do
that anyway if you were to do what I mentioned in #1.

David




Re: predefined role(s) for VACUUM and ANALYZE

2022-09-20 Thread Michael Paquier
On Tue, Sep 20, 2022 at 04:50:10PM -0700, Nathan Bossart wrote:
> On Tue, Sep 20, 2022 at 04:31:17PM -0700, Nathan Bossart wrote:
>> On Tue, Sep 20, 2022 at 11:05:33AM -0700, Nathan Bossart wrote:
>>> On Tue, Sep 20, 2022 at 02:45:52PM +0900, Michael Paquier wrote:
 Any impact for the column sizes of the catalogs holding ACL
 information?  Just asking while browsing the patch set.
>>> 
>>> Since each aclitem requires 16 bytes instead of 12, I assume so.  However,
>>> in my testing, I hit a "row is too big" error with the same number of
>>> aclitems in a pg_class row before and after the change.  I might be missing
>>> something in my patch, or maybe I am misunderstanding how arrays of
>>> aclitems are stored on disk.
>> 
>> Ah, it looks like relacl is compressed.  The column is marked "extended,"
>> but pg_class doesn't appear to have a TOAST table, so presumably no
>> out-of-line storage can be used.  I found a couple of threads about this
>> [0] [1] [2].

Adding a toast table to pg_class has been a sensitive topic over the
years.  Based on my recollection of the events, there were worries
about the potential cross-dependencies with pg_class and pg_attribute
that this would create.

> I suppose there is some risk that folks with really long aclitem arrays
> might be unable to pg_upgrade to a version with uint64 AclModes, but I
> suspect that risk is limited to extreme cases (i.e., multiple thousands of
> aclitems).  I'm not sure whether that's worth worrying about too much.

Did you just run an aclupdate()?  4% for aclitem[] sounds like quite a
number to me :/  It may be worth looking at if these operations could
be locally optimized more, as well.  I'd like to think that we could
live with that to free up enough bits in AclItems for the next 20
years, anyway.  Any opinions?

For the column sizes of the catalogs, I was wondering about how
pg_column_size() changes when they hold ACL information.  Unoptimized
alignment could cause an unnecessary increase in the structure sizes,
so the addition of new fields or changes in object size could have
unexpected side effects.
--
Michael




Re: Support pg_attribute_aligned and noreturn in MSVC

2022-09-20 Thread Michael Paquier
On Tue, Sep 20, 2022 at 08:01:20AM -0400, James Coleman wrote:
> I don't have access to a Windows machine for testing, but re-reading
> the documentation it looks like the issue is that our noreturn macro
> is used after the definition while the MSVC equivalent is used before.

A CI setup would do the job for example, see src/tools/ci/README that
explains how to set up things.

> I've removed that for now (and commented about it); it's not as
> valuable anyway since it's mostly an indicator for code analysis
> (human and machine).

Except for the fact that the patch missed defining
pg_attribute_noreturn() in the MSVC branch, this looks fine to me.  I
have been looking at what you meant with packing, and I can see the
business with PACK(), something actually doable with gcc.

That's a first step, at least, and I see no reason not to do it, so
applied.
--
Michael




Re: Summary function for pg_buffercache

2022-09-20 Thread Andres Freund
Hi,

On 2022-09-20 12:45:24 +0300, Aleksander Alekseev wrote:
> > I'm not sure how to avoid any undefined behaviour without locks though.
> > Even with locks, performance is much better. But is it good enough for 
> > production?
>
> Potentially you could avoid taking locks by utilizing atomic
> operations and lock-free algorithms. But these algorithms are
> typically error-prone and not always produce a faster code than the
> lock-based ones. I'm pretty confident this is out of scope of this
> particular patch.

Why would you need lockfree operations?  All you need to do is to read
BufferDesc->state into a local variable and then make decisions based on that?


> + for (int i = 0; i < NBuffers; i++)
> + {
> + BufferDesc *bufHdr;
> + uint32  buf_state;
> +
> + bufHdr = GetBufferDescriptor(i);
> +
> + /* Lock each buffer header before inspecting. */
> + buf_state = LockBufHdr(bufHdr);
> +
> + /* Invalid RelFileNumber means the buffer is unused */
> + if (RelFileNumberIsValid(BufTagGetRelNumber(&bufHdr->tag)))
> + {
> + buffers_used++;
> + usagecount_avg += BUF_STATE_GET_USAGECOUNT(buf_state);
> +
> + if (buf_state & BM_DIRTY)
> + buffers_dirty++;
> + }
> + else
> + buffers_unused++;
> +
> + if (BUF_STATE_GET_REFCOUNT(buf_state) > 0)
> + buffers_pinned++;
> +
> + UnlockBufHdr(bufHdr, buf_state);
> + }

I.e. instead of locking the buffer header as done above, this could just do
something along these lines:

BufferDesc *bufHdr;
uint32  buf_state;

bufHdr = GetBufferDescriptor(i);

buf_state = pg_atomic_read_u32(&bufHdr->state);

if (buf_state & BM_VALID)
{
buffers_used++;
usagecount_avg += BUF_STATE_GET_USAGECOUNT(buf_state);

if (buf_state & BM_DIRTY)
buffers_dirty++;
}
else
buffers_unused++;

if (BUF_STATE_GET_REFCOUNT(buf_state) > 0)
buffers_pinned++;


Without a memory barrier you can get very slightly "out-of-date" values of the
state, but that's fine in this case.

Greetings,

Andres Freund




Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher

2022-09-20 Thread Peter Smith
Hi Onder,

Thanks for addressing all my previous feedback. I checked the latest
v12-0001, and have no more comments at this time.

One last thing - do you think there is any need to mention this
behaviour in the pgdocs, or is OK just to be a hidden performance
improvement?

--
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: [RFC] building postgres with meson - v13

2022-09-20 Thread Andres Freund
Hi,

On 2022-09-19 19:16:30 -0700, Andres Freund wrote:
> To have some initial "translation" for other developers I've started a wiki
> page with a translation table. Still very WIP:
> https://wiki.postgresql.org/wiki/Meson
> 
> For now, a bit of polishing aside, I'm just planning to add a minimal
> explanation of what's happening, and a reference to this thread.

I added installation instructions for meson for a bunch of platforms, but
failed to figure out how to do so in a rhel9 container. I don't have a rhel
subscription, and apparently the repos with developer tools now require a
subscription. Great way to make it easy for projects to test anything on RHEL.

Greetings,

Andres Freund




Re: predefined role(s) for VACUUM and ANALYZE

2022-09-20 Thread Nathan Bossart
On Tue, Sep 20, 2022 at 04:31:17PM -0700, Nathan Bossart wrote:
> On Tue, Sep 20, 2022 at 11:05:33AM -0700, Nathan Bossart wrote:
>> On Tue, Sep 20, 2022 at 02:45:52PM +0900, Michael Paquier wrote:
>>> Any impact for the column sizes of the catalogs holding ACL
>>> information?  Just asking while browsing the patch set.
>> 
>> Since each aclitem requires 16 bytes instead of 12, I assume so.  However,
>> in my testing, I hit a "row is too big" error with the same number of
>> aclitems in a pg_class row before and after the change.  I might be missing
>> something in my patch, or maybe I am misunderstanding how arrays of
>> aclitems are stored on disk.
> 
> Ah, it looks like relacl is compressed.  The column is marked "extended,"
> but pg_class doesn't appear to have a TOAST table, so presumably no
> out-of-line storage can be used.  I found a couple of threads about this
> [0] [1] [2].

I suppose there is some risk that folks with really long aclitem arrays
might be unable to pg_upgrade to a version with uint64 AclModes, but I
suspect that risk is limited to extreme cases (i.e., multiple thousands of
aclitems).  I'm not sure whether that's worth worrying about too much.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: default sorting behavior for index

2022-09-20 Thread Tom Lane
Zhihong Yu  writes:
> I was looking at this check in src/backend/parser/parse_utilcmd.c w.r.t.
> constraint:
> ...
> If the index has DESC sorting order, why can't it be used to back a
> constraint?
> A concrete sample would help me understand this.

Please read the nearby comments, particularly

 * Insist on default opclass, collation, and sort options.
 * While the index would still work as a constraint with
 * non-default settings, it might not provide exactly the same
 * uniqueness semantics as you'd get from a normally-created
 * constraint; and there's also the dump/reload problem
 * mentioned above.

The "mentioned above" refers to this:

 * Insist on it being a btree.  That's the only kind that supports
 * uniqueness at the moment anyway; but we must have an index that
 * exactly matches what you'd get from plain ADD CONSTRAINT syntax,
 * else dump and reload will produce a different index (breaking
 * pg_upgrade in particular).

The concern about whether the uniqueness semantics are the same probably
mostly applies to just the opclass and collation properties.  However,
rd_indoption contains AM-specific options, and we have little ability
to be sure in this code exactly what those bits might do.  In any case
we'd definitely have a risk of things breaking during pg_upgrade if we
ignore rd_indoption.

regards, tom lane




Re: predefined role(s) for VACUUM and ANALYZE

2022-09-20 Thread Nathan Bossart
On Tue, Sep 20, 2022 at 11:05:33AM -0700, Nathan Bossart wrote:
> On Tue, Sep 20, 2022 at 02:45:52PM +0900, Michael Paquier wrote:
>> Any impact for the column sizes of the catalogs holding ACL
>> information?  Just asking while browsing the patch set.
> 
> Since each aclitem requires 16 bytes instead of 12, I assume so.  However,
> in my testing, I hit a "row is too big" error with the same number of
> aclitems in a pg_class row before and after the change.  I might be missing
> something in my patch, or maybe I am misunderstanding how arrays of
> aclitems are stored on disk.

Ah, it looks like relacl is compressed.  The column is marked "extended,"
but pg_class doesn't appear to have a TOAST table, so presumably no
out-of-line storage can be used.  I found a couple of threads about this
[0] [1] [2].

[0] https://postgr.es/m/17245.964897719%40sss.pgh.pa.us
[1] https://postgr.es/m/200309040531.h845ViP05881%40candle.pha.pa.us
[2] https://postgr.es/m/29061.1265327626%40sss.pgh.pa.us

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: [PoC] Federated Authn/z with OAUTHBEARER

2022-09-20 Thread Jacob Champion
Hi Mahendrakar, thanks for your interest and for the patch!

On Mon, Sep 19, 2022 at 10:03 PM mahendrakar s
 wrote:
> The changes for each component are summarized below.
>
> 1. Provider-specific extension:
> Each OAuth provider implements their own token validator as an
> extension. Extension registers an OAuth provider hook which is matched
> to a line in the HBA file.

How easy is it to write a Bearer validator using C? My limited
understanding was that most providers were publishing libraries in
higher-level languages.

Along those lines, sample validators will need to be provided, both to
help in review and to get the pytest suite green again. (And coverage
for the new code is important, too.)

> 2. Add support to pass on the OAuth bearer token. In this
> obtaining the bearer token is left to 3rd party application or user.
>
> ./psql -U  -d 'dbname=postgres
> oauth_client_id= oauth_bearer_token=

This hurts, but I think people are definitely going to ask for it, given
the frightening practice of copy-pasting these (incredibly sensitive
secret) tokens all over the place... Ideally I'd like to implement
sender constraints for the Bearer token, to *prevent* copy-pasting (or,
you know, outright theft). But I'm not sure that sender constraints are
well-implemented yet for the major providers.

> 3. HBA: An additional param ‘provider’ is added for the oauth method.
> Defining "oauth" as method + passing provider, issuer endpoint
> and expected audience
>
> * * * * oauth   provider=
> issuer= scope=

Naming aside (this conflicts with Samay's previous proposal, I think), I
have concerns about the implementation. There's this code:

> + if (oauth_provider && oauth_provider->name)
> + {
> + ereport(ERROR,
> + (errmsg("OAuth provider \"%s\" is already 
> loaded.",
> + oauth_provider->name)));
> + }

which appears to prevent loading more than one global provider. But
there's also code that deals with a provider list? (Again, it'd help to
have test code covering the new stuff.)

>b)  libpq optionally compiled for the clients which
> explicitly need libpq to orchestrate OAuth communication with the
> issuer (it depends heavily on 3rd party library iddawc as Jacob
> already pointed out. The library seems to be supporting all the OAuth
> flows.)

Speaking of iddawc, I don't think it's a dependency we should choose to
rely on. For all the code that it has, it doesn't seem to provide
compatibility with several real-world providers.

Google, for one, chose not to follow the IETF spec it helped author, and
iddawc doesn't support its flavor of Device Authorization. At another
point, I think iddawc tried to decode Azure's Bearer tokens, which is
incorrect...

I haven't been able to check if those problems have been fixed in a
recent version, but if we're going to tie ourselves to a huge
dependency, I'd at least like to believe that said dependency is
battle-tested and solid, and personally I don't feel like iddawc is.

> - auth_method = I_TOKEN_AUTH_METHOD_NONE;
> - if (conn->oauth_client_secret && *conn->oauth_client_secret)
> - auth_method = I_TOKEN_AUTH_METHOD_SECRET_BASIC;

This code got moved, but I'm not sure why? It doesn't appear to have
made a change to the logic.

> + if (conn->oauth_client_secret && *conn->oauth_client_secret)
> + {
> + session_response_type = I_RESPONSE_TYPE_CLIENT_CREDENTIALS;
> + }

Is this an Azure-specific requirement? Ideally a public client (which
psql is) shouldn't have to provide a secret to begin with, if I
understand that bit of the protocol correctly. I think Google also
required provider-specific changes in this part of the code, and
unfortunately I don't think they looked the same as yours.

We'll have to figure all that out... Standards are great; everyone has
one of their own. :)

Thanks,
--Jacob




Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

2022-09-20 Thread Nathan Bossart
On Tue, Sep 20, 2022 at 04:00:26PM -0700, Nathan Bossart wrote:
> On Mon, Aug 08, 2022 at 06:10:23PM +0530, Bharath Rupireddy wrote:
>> I'm attaching v5 patch-set. I've addressed review comments received so
>> far and fixed a compiler warning that CF bot complained about.
>> 
>> Please review it further.
> 
> 0001 looks reasonable to me.
> 
> +errno = 0;
> +rc = pg_pwritev_zeros(fd, pad_to_size);
> 
> Do we need to reset errno?  pg_pwritev_zeros() claims to set errno
> appropriately.
> 
> +/*
> + * PWRITEV_BLCKSZ is same as XLOG_BLCKSZ for now, however it may change if
> + * writing more bytes per pg_pwritev_with_retry() call is proven to be more
> + * performant.
> + */
> +#define PWRITEV_BLCKSZ  XLOG_BLCKSZ
> 
> This seems like something we should sort out now instead of leaving as
> future work.  Given your recent note, I think we should just use
> XLOG_BLCKSZ and PGAlignedXLogBlock and add a comment about the performance
> findings with different buffer sizes.

I also noticed that the latest patch set no longer applies, so I've marked
the commitfest entry as waiting-on-author.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

2022-09-20 Thread Nathan Bossart
On Mon, Aug 08, 2022 at 06:10:23PM +0530, Bharath Rupireddy wrote:
> I'm attaching v5 patch-set. I've addressed review comments received so
> far and fixed a compiler warning that CF bot complained about.
> 
> Please review it further.

0001 looks reasonable to me.

+errno = 0;
+rc = pg_pwritev_zeros(fd, pad_to_size);

Do we need to reset errno?  pg_pwritev_zeros() claims to set errno
appropriately.

+/*
+ * PWRITEV_BLCKSZ is same as XLOG_BLCKSZ for now, however it may change if
+ * writing more bytes per pg_pwritev_with_retry() call is proven to be more
+ * performant.
+ */
+#define PWRITEV_BLCKSZ  XLOG_BLCKSZ

This seems like something we should sort out now instead of leaving as
future work.  Given your recent note, I think we should just use
XLOG_BLCKSZ and PGAlignedXLogBlock and add a comment about the performance
findings with different buffer sizes.
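
To make that concrete, here is a rough sketch of zero-padding done with
pg_pwritev_with_retry() and an XLOG_BLCKSZ-sized aligned buffer.  This is an
illustration only, not the patch's pg_pwritev_zeros(); it assumes the
pg_pwritev_with_retry() declaration from common/file_utils.h and the
PGAlignedXLogBlock type:

#include "postgres.h"

#include "common/file_utils.h"
#include "port/pg_iovec.h"

/*
 * Write pad_to_size bytes of zeros starting at offset 0, one XLOG_BLCKSZ
 * chunk at a time.  Returns 0 on success, -1 (with errno set by
 * pg_pwritev_with_retry) on failure.
 */
static int
zero_pad_file(int fd, size_t pad_to_size)
{
	PGAlignedXLogBlock zbuf;
	off_t		offset = 0;

	memset(zbuf.data, 0, XLOG_BLCKSZ);

	while ((size_t) offset < pad_to_size)
	{
		struct iovec iov;
		ssize_t		written;

		iov.iov_base = zbuf.data;
		iov.iov_len = Min(XLOG_BLCKSZ, pad_to_size - offset);

		written = pg_pwritev_with_retry(fd, &iov, 1, offset);
		if (written < 0)
			return -1;
		offset += written;
	}

	return 0;
}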

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Auto explain after query timeout

2022-09-20 Thread Robert Haas
On Tue, Sep 20, 2022 at 5:08 PM James Coleman  wrote:
> - A safe explain (e.g., disallow catalog access) that is potentially
> missing information.

This would be pretty useless I think, because you'd be missing all
relation names.

> - A safe way to interrupt queries such as "safe shutdown" of a node
> (e.g., a seq scan could stop returning tuples early) and allow a
> configurable buffer of time after the statement timeout before firing
> a hard abort of the query (and transaction).

This might be useful, but it seems extremely difficult to get working.
You'd not only have to design the safe shutdown mechanism itself, but
also find a way to safely engage it at the right times.

> Alternatively I wonder if it's possible (this would maybe assume no
> catalog changes in the current transaction -- or at least none that
> would be referenced by the current query) to open a new transaction
> (with the same horizon information) and duplicate the plan over to
> that transaction and run the explain there. This way you do it *after*
> the error is raised. That's some serious spit-balling -- I'm not
> saying that's doable, just trying to imagine how one might
> comprehensively address the concerns.

Doesn't work, because the new transaction's snapshot wouldn't be the
same as that of the old one. Imagine that you create a table and run a
query on it in the same transaction. Then you migrate the plan tree to
a new transaction and try to find out the table name. But in the new
transaction, that table doesn't exist: it was destroyed by the
previous rollback.

Honestly I have no very good ideas how to create the feature you want
here. I guess the only thing I can think of is to separate the EXPLAIN
process into two phases: a first phase that runs when the plan tree is
set up and gathers all of the information that we might need later,
like relation names, and then a second phase that runs later when you
want to generate the output and does nothing that can fail, or at least
nothing that touches the database: maybe it's allowed to allocate memory,
for example.
But that sounds like a big and perhaps painful refactoring exercise,
and I can imagine that there might be reasons why it doesn't work out.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Reducing the WAL overhead of freezing in VACUUM by deduplicating per-tuple freeze plans

2022-09-20 Thread Peter Geoghegan
On Mon, Sep 12, 2022 at 2:01 PM Peter Geoghegan  wrote:
> I'd like to talk about one such technique on this thread. The attached
> WIP patch reduces the size of xl_heap_freeze_page records by applying
> a simple deduplication process.

Attached is v2, which I'm just posting to keep CFTester happy. No real
changes here.

--
Peter Geoghegan
From c555f716a79c7de1f26a5e1e87265f2702b980ee Mon Sep 17 00:00:00 2001
From: Peter Geoghegan 
Date: Sat, 30 Jul 2022 10:47:59 -0700
Subject: [PATCH v2] Shrink freeze WAL records via deduplication.

---
 src/include/access/heapam.h|  25 ++
 src/include/access/heapam_xlog.h   |  34 +--
 src/backend/access/heap/heapam.c   | 310 -
 src/backend/access/heap/vacuumlazy.c   |  39 +---
 src/backend/access/rmgrdesc/heapdesc.c |   4 +-
 5 files changed, 289 insertions(+), 123 deletions(-)

diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 9dab35551..579ef32b0 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -99,6 +99,19 @@ typedef enum
 	HEAPTUPLE_DELETE_IN_PROGRESS	/* deleting xact is still in progress */
 } HTSV_Result;
 
+/* heap_prepare_freeze_tuple state describing how to freeze a tuple */
+typedef struct HeapFreezeTuple
+{
+	/* Fields describing how to process tuple */
+	uint16		t_infomask2;
+	uint16		t_infomask;
+	TransactionId xmax;
+	uint8		frzflags;
+
+	/* Page offset number for tuple */
+	OffsetNumber offset;
+} HeapFreezeTuple;
+
 /* 
  *		function prototypes for heap access method
  *
@@ -164,6 +177,18 @@ extern TM_Result heap_lock_tuple(Relation relation, HeapTuple tuple,
  Buffer *buffer, struct TM_FailureData *tmfd);
 
 extern void heap_inplace_update(Relation relation, HeapTuple tuple);
+extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
+	  TransactionId relfrozenxid,
+	  TransactionId relminmxid,
+	  TransactionId cutoff_xid,
+	  TransactionId cutoff_multi,
+	  HeapFreezeTuple *frz,
+	  bool *totally_frozen,
+	  TransactionId *relfrozenxid_out,
+	  MultiXactId *relminmxid_out);
+extern void heap_freeze_prepared_tuples(Relation rel, Buffer buffer,
+		TransactionId latestRemovedXid,
+		HeapFreezeTuple *tuples, int ntuples);
 extern bool heap_freeze_tuple(HeapTupleHeader tuple,
 			  TransactionId relfrozenxid, TransactionId relminmxid,
 			  TransactionId cutoff_xid, TransactionId cutoff_multi);
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 34220d93c..4e3a023eb 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -315,34 +315,36 @@ typedef struct xl_heap_inplace
 
 /*
  * This struct represents a 'freeze plan', which is what we need to know about
- * a single tuple being frozen during vacuum.
+ * a group of tuples being frozen from the same page.
  */
 /* 0x01 was XLH_FREEZE_XMIN */
 #define		XLH_FREEZE_XVAC		0x02
 #define		XLH_INVALID_XVAC	0x04
 
-typedef struct xl_heap_freeze_tuple
+typedef struct xl_heap_freeze_plan
 {
 	TransactionId xmax;
-	OffsetNumber offset;
+	uint16		ntuples;
 	uint16		t_infomask2;
 	uint16		t_infomask;
 	uint8		frzflags;
-} xl_heap_freeze_tuple;
+} xl_heap_freeze_plan;
 
 /*
  * This is what we need to know about a block being frozen during vacuum
  *
- * Backup block 0's data contains an array of xl_heap_freeze_tuple structs,
- * one for each tuple.
+ * Backup block 0's data contains an array of xl_heap_freeze_plan structs.
  */
 typedef struct xl_heap_freeze_page
 {
-	TransactionId cutoff_xid;
-	uint16		ntuples;
+	TransactionId latestRemovedXid;
+	uint16		nplans;
+
+	/* FREEZE PLANS FOLLOW */
+	/* OFFSET NUMBERS FOLLOW  */
 } xl_heap_freeze_page;
 
-#define SizeOfHeapFreezePage (offsetof(xl_heap_freeze_page, ntuples) + sizeof(uint16))
+#define SizeOfHeapFreezePage (offsetof(xl_heap_freeze_page, nplans) + sizeof(uint16))
 
 /*
  * This is what we need to know about setting a visibility map bit
@@ -401,20 +403,6 @@ extern void heap2_desc(StringInfo buf, XLogReaderState *record);
 extern const char *heap2_identify(uint8 info);
 extern void heap_xlog_logical_rewrite(XLogReaderState *r);
 
-extern XLogRecPtr log_heap_freeze(Relation reln, Buffer buffer,
-  TransactionId cutoff_xid, xl_heap_freeze_tuple *tuples,
-  int ntuples);
-extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
-	  TransactionId relfrozenxid,
-	  TransactionId relminmxid,
-	  TransactionId cutoff_xid,
-	  TransactionId cutoff_multi,
-	  xl_heap_freeze_tuple *frz,
-	  bool *totally_frozen,
-	  TransactionId *relfrozenxid_out,
-	  MultiXactId *relminmxid_out);
-extern void heap_execute_freeze_tuple(HeapTupleHeader tuple,
-	  xl_heap_freeze_tuple *frz);
 extern XLogRecPtr log_heap_visible(RelFileLocator rlocator, Buffer heap_buffer,
    Buffer vm_buffer, TransactionId cutoff_xid, u

Re: Auto explain after query timeout

2022-09-20 Thread James Coleman
On Tue, Sep 20, 2022 at 3:06 PM Robert Haas  wrote:
>
> On Tue, Sep 20, 2022 at 2:35 PM James Coleman  wrote:
> > Either I'm missing something (and/or this was fixed in a later PG
> > version), but I don't think this is how it works. We have this
> > specific problem now: we set auto_explain.log_min_duration to 200 (ms)
> > and statement_timeout set to 30s, but when a statement times out we do
> > not get the plan logged with auto-explain.
>
> I think you're correct. auto_explain uses the ExecutorEnd hook, but
> that hook will not be fired in the case of an error. Indeed, if an
> error has already been thrown, it would be unsafe to try to
> auto-explain anything. For instance -- and this is just one problem of
> probably many -- ExplainTargetRel() performs catalog lookups, which is
> not OK in a failed transaction.
>
> To make this work, I think you'd need a hook that fires just BEFORE
> the error is thrown. However, previous attempts to introduce hooks
> into ProcessInterrupts() have not met with a warm response from Tom, so
> it might be a tough sell. And maybe for good reason. I see at least
> two problems. The first is that explaining a query is a pretty
> complicated operation that involves catalog lookups and lots of
> complicated stuff, and I don't think that it would be safe to do all
> of that at any arbitrary point in the code where ProcessInterrupts()
> happened to fire. What if one of the syscache lookups misses the cache
> and we have to open the underlying catalog? Then
> AcceptInvalidationMessages() will fire, but we don't currently assume
> that any old CHECK_FOR_INTERRUPTS() can process invalidations. What if
> the running query has called a user-defined function or procedure
> which is running DDL which is in the middle of changing catalog state
> for some relation involved in the query at the moment that the
> statement timeout arrives? I feel like there are big problems here.
>
> The other problem I see is that ProcessInterrupts() is our mechanism
> for allowing people to escape from queries that would otherwise run
> forever by hitting ^C. But what if the explain code goes crazy and
> itself needs to be interrupted, when we're already inside
> ProcessInterrupts()? Maybe that would work out OK... or maybe it
> wouldn't. I'm not really sure. But it's another reason to be very,
> very cautious about putting complicated logic inside
> ProcessInterrupts().

This is exactly the kind of background I was hoping someone would
provide; thank you, Robert.

It seems like one could imagine addressing all of these by having one of:

- A safe explain (e.g., disallow catalog access) that is potentially
missing information.
- A safe way to interrupt queries such as "safe shutdown" of a node
(e.g., a seq scan could stop returning tuples early) and allow a
configurable buffer of time after the statement timeout before firing
a hard abort of the query (and transaction).

Both of those seem like a significant amount of work.

Alternatively I wonder if it's possible (this would maybe assume no
catalog changes in the current transaction -- or at least none that
would be referenced by the current query) to open a new transaction
(with the same horizon information) and duplicate the plan over to
that transaction and run the explain there. This way you do it *after*
the error is raised. That's some serious spit-balling -- I'm not
saying that's doable, just trying to imagine how one might
comprehensively address the concerns.

Does any of that seem at all like a path you could imagine being fruitful?

Thanks,
James Coleman




Re: Making C function declaration parameter names consistent with corresponding definition names

2022-09-20 Thread Peter Geoghegan
On Mon, Sep 19, 2022 at 11:36 PM Peter Geoghegan  wrote:
> Attached revision v4 fixes those pg_dump patch items.
>
> It also breaks out the ecpg changes into their own patch.

I pushed much of this just now. All that remains to bring the entire
codebase into compliance is the ecpg patch and the pg_dump patch.
Those two areas are relatively tricky. But it's now unlikely that I'll
need to push a commit that makes relatively many CF patches stop
applying against HEAD -- that part is over.

Once we're done with ecpg and pg_dump, we can talk about the actual
practicalities of formally adopting a project policy on consistent
parameter names. I mostly use clang-tidy via my editor's support for
the clangd language server -- clang-tidy is primarily a linter, so it
isn't necessarily run in bulk all that often. I'll need to come up
with instructions for running clang-tidy from the command line that
are easy to follow.

I've found that the run_clang_tidy script (AKA run-clang-tidy.py)
works, but the whole experience feels cobbled together. I think that
we really need something like a build target for this -- something
comparable to what we do to support GCOV. That would also allow us to
use additional clang-tidy checks, which might be useful. We might even
find it useful to come up with some novel check of our own. Apparently
it's not all that difficult to write one from scratch, to implement
custom rules. There are already custom rules for big open source
projects such as the Linux Kernel, Chromium, and LLVM itself.

-- 
Peter Geoghegan




Re: why can't a table be part of the same publication as its schema

2022-09-20 Thread Mark Dilger



> On Sep 20, 2022, at 12:36 PM, Jonathan S. Katz  wrote:
> 
> This behavior exists "FOR ALL TABLES" without the "IN SCHEMA" qualifier. This 
> was discussed multiple times on the original thread[1].

Yes, nobody is debating that as far as I can see.  And I do take your point 
that this stuff was discussed in other threads quite a while back.

> I tried to diligently read the sections where we talk about granting + 
> privileges[2][3] to see what it says about "ALL * IN SCHEMA". Unless I missed 
> it, and I read through it twice, it does not explicitly state whether or not 
> "GRANT" applies to all objects at only that given moment, or to future 
> objects of that type which are created in that schema. Maybe the behavior is 
> implied or is part of the standard, but it's not currently documented.

Interesting.  Thanks for that bit of research.

> We do link to "ALTER DEFAULT PRIVILEGES" at the bottom of the GRANT[2] docs, 
> but we don't give any indication as to why.
> 
> (This is also to say we should document in GRANT that ALL * IN SCHEMA does 
> not apply to future objects;

Yes, I agree this should be documented.

> if you need that behavior use ALTER DEFAULT PRIVILEGES. Separate thread :)
> 
> I understand there is a risk of confusion from the similar grammar across
> commands, but the current command in logical replication is building
> on the existing behavior.

I don't complain that it is building on the existing behavior.  I'm *only* 
concerned about the keywords we're using for this.  Consider the following:

   -- AS ADMIN
   CREATE USER bob NOSUPERUSER;
   GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA foo TO bob;
   SET ROLE bob;
   CREATE PUBLICATION bobs_pub FOR ALL TABLES IN SCHEMA foo;

We're going to have that fail in pg15 because the FOR ALL TABLES IN SCHEMA 
option is reserved to superusers.  But we agreed that was a stop-gap solution 
that we'd potentially loosen in the future.  Certainly we'll need wiggle room 
in the syntax to perform that loosening:

   --- Must be superuser for this in pg15, and in subsequent releases.
   CREATE PUBLICATION bobs_pub FOR ALL FUTURE TABLES IN SCHEMA foo;

   --- Not supported in pg15, but reserved for some future pg versions to allow
   --- non-superusers to create publications on tables currently in schema foo,
   --- assuming they have sufficient privileges on those tables
   CREATE PUBLICATION bobs_pub FOR ALL TABLES IN SCHEMA foo;

Doing it this way makes the syntax consistent between the GRANT...TO bob and 
the CREATE PUBLICATION bobs_pub.  Surely this makes more sense?

I'm not a huge fan of the keyword "FUTURE" here, but I found a reference to 
another database that uses that keyword for what I think is a similar purpose.  
We should choose *something* for this, though, if we want things to be rational 
going forward.


—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: why can't a table be part of the same publication as its schema

2022-09-20 Thread Jonathan S. Katz

(RMT hat on, unless otherwise noted)

On 9/20/22 9:42 AM, Robert Haas wrote:

On Mon, Sep 19, 2022 at 11:03 PM Jonathan S. Katz  wrote:

For #1 (allowing calls that have schema/table overlap...), there appears
to be both a patch that allows this (reversing[8]), and a suggestion for
dealing with a corner-case that is reasonable, i.e. disallowing adding
schemas to a publication when specifying column-lists. Do we think we
can have consensus on this prior to the RC1 freeze?


I am not sure whether we can or should rush a fix in that fast, but I
agree with this direction.


The RMT met today to discuss this.

We did agree that the above is an open item that should be resolved 
before this release. While it is an accepted pattern for us to "ERROR" 
on unsupported behavior and then later introduce said behavior, we do 
agree with Peter's original post in this thread and would like it resolved.


As for the state of the fix, the patch has been iterated on and Amit 
felt ready to commit it[1]. We do want to hear how others feel about 
this, but the folks behind this feature have been working on this patch 
since this was reported.



For #2 ("ALL TABLES IN SCHEMA" syntax), this was heavily discussed on
the original thread[1][3][4][5][7]. I thought Tom's proposal on the
syntax[3] was reasonable as it "future proofs" for when we allow other
schema-scoped objects to be published and give control over which ones
can be published.


All right, well, I still don't like it and think it's confusing, but
perhaps I'm in the minority.


The RMT discussed this as well. The RMT feels that there should not be 
any changes to syntax/behavior for this release. This doesn't preclude 
future work in this area (e.g. having a toggle for "all future 
behavior"), but based on all the discussions and existing behavior in 
this feature, we do not see a need to make changes or delay the release 
on this.



I don't think we should change this behavior that's already in logical
replication. While I understand the reasons why "GRANT ... ALL TABLES IN
SCHEMA" has a different behavior (i.e. it's not applied to future
objects) and do not advocate to change it, I have personally been
affected where I thought a permission would be applied to all future
objects, only to discover otherwise. I believe it's more intuitive to
think that "ALL" applies to "everything, always."


Nah, there's room for multiple behaviors here. It's reasonable to want
to add all the tables currently in the schema to a publication (or
grant permissions on them) and it's reasonable to want to include all
current and future tables in the schema in a publication (or grant
permissions on them) too. The reason I don't like the ALL TABLES IN
SCHEMA syntax is that it sounds like the former, but actually is the
latter. Based on your link to the email from Tom, I understand now the
reason why it's like that, but it's still counterintuitive to me.



I understand your view on "multiple behaviors" and I do agree with your 
reasoning. I still think we should leave this as is, but perhaps this 
opens up an option we add later to specify the behavior.






For #3 (more superuser-only), in general I do agree that we shouldn't be
adding more of these. However, we have in this release, and not just to
this feature. ALTER SUBSCRIPTION ... SKIP[11] requires superuser. I
think it's easier for us to "relax" privileges (e.g. w/predefined roles)
than to make something "superuser-only" in the future, so I don't see
this being a blocker for v15. The feature will continue to work for
users even if we remove "superuser-only" in the future.


Yeah, this is clearly not a release blocker, I think.


The RMT concurs. We do recommend future work on "relaxing" the 
superuser-only requirement.


Thanks,

Jonathan

[1] 
https://www.postgresql.org/message-id/CAA4eK1LDhoBM8K5uVme8PZ%2BkxNOfVpRh%3DoO42JtFdqBgBuj1bA%40mail.gmail.com


OpenPGP_signature
Description: OpenPGP digital signature


Re: why can't a table be part of the same publication as its schema

2022-09-20 Thread Jonathan S. Katz

On 9/20/22 10:55 AM, Mark Dilger wrote:




On Sep 19, 2022, at 8:03 PM, Jonathan S. Katz  wrote:

"When a partitioned table is added to a publication, all of its existing and future 
partitions are implicitly considered to be part of the publication."[10]

Additionally, this is the behavior that is already present in "FOR ALL TABLES":

"Marks the publication as one that replicates changes for all tables in the 
database, including tables created in the future."[10]

I don't think we should change this behavior that's already in logical 
replication.


The existing behavior in logical replication doesn't have any "IN SCHEMA" 
qualifiers.


This behavior exists "FOR ALL TABLES" without the "IN SCHEMA" qualifier. 
This was discussed multiple times on the original thread[1].





While I understand the reasons why "GRANT ... ALL TABLES IN SCHEMA" has a different behavior (i.e. 
it's not applied to future objects) and do not advocate to change it, I have personally been affected where I 
thought a permission would be applied to all future objects, only to discover otherwise. I believe it's more 
intuitive to think that "ALL" applies to "everything, always."


The conversation is focusing on what "ALL TABLES" means, but the ambiguous part is what "IN SCHEMA" means.  In GRANT it 
means "currently in schema, computed now."  We are about to create confusion by adding the "IN SCHEMA" phrase to 
publication commands meaning "later in schema, computed then."  A user who diligently consults the documentation for one command 
to discover what "IN SCHEMA" means may fairly, but wrongly, assume it means the same thing in another command.


I tried to diligently read the sections where we talk about granting + 
privileges[2][3] to see what it says about "ALL * IN SCHEMA". Unless I 
missed it, and I read through it twice, it does not explicitly state 
whether or not "GRANT" applies to all objects at only that given moment, 
or to future objects of that type which are created in that schema. 
Maybe the behavior is implied or is part of the standard, but it's not 
currently documented. We do link to "ALTER DEFAULT PRIVILEGES" at the 
bottom of the GRANT[2] docs, but we don't give any indication as to why.


(This is also to say we should document in GRANT that ALL * IN SCHEMA 
does not apply to future objects; if you need that behavior use ALTER 
DEFAULT PRIVILEGES. Separate thread :)


I understand there is a risk of confusion from the similar grammar across 
commands, but the current command in logical replication is 
building on the existing behavior.


Thanks,

Jonathan

[1] 
https://www.postgresql.org/message-id/flat/CALDaNm0OANxuJ6RXqwZsM1MSY4s19nuH3734j4a72etDwvBETQ%40mail.gmail.com

[2] https://www.postgresql.org/docs/current/sql-grant.html
[3] https://www.postgresql.org/docs/current/ddl-priv.html


OpenPGP_signature
Description: OpenPGP digital signature


Support logical replication of large objects

2022-09-20 Thread Borui Yang
Hello,

I’m working on a patch to support logical replication of large objects
(LOBs). This is a useful feature when a database in logical
replication has lots of tables, functions, and other objects that
change over time, such as during an online cross-major-version
upgrade. For example, this lets users replicate large objects between
different PostgreSQL major versions.

The topic was previously discussed in [1]. We need to address the
following three challenges; I have worked on some designs and would
appreciate feedback:

1. Replication of the change stream of LOBs
My suggestion is that we add a check when any LOB function or API is
called and executed in the backend, and then log a simple SQL command
in the WAL to do the replication. Take lo_unlink as an example[2]: we
can create a “simple” SQL command like SELECT lo_unlink(); and log it
in the WAL, so that the replica only needs to replay that command. The
LOBs are then unlinked on the replica accordingly.
Pros:
a. We do not need to add much additional functionality.
b. For most of the LOB-related APIs, we don’t need to touch whole
LOBs, except for the creation of LOBs.
Cons:
a. For the creation of LOBs, we may need to split the whole LOB
content into the WAL, which would increase the volume of WAL and
replicated writes dramatically. This could be avoided if we can make
sure the original file is publicly shared, e.g. a URL from cloud
storage, or exists on the replica host as well.

2. Initializing replication of LOBs
When a subscription is established, LOBs on the source should be
replicated even if they do not yet exist on the replica. Here are two
options to approach this problem:
Option 1: Support LOB-related system catalogs in logical replication.
We can make an exception in this line[3] in the
check_publication_add_schema() function.
   Pros: All required LOBs can be transferred to the replica.
   Cons: There is currently no support for logical replication of
system catalogs.
Option 2: Run a function or a script on the source instance when it
detects that logical replication has been established.
The point is that we can ask the source to replicate whole LOBs when a
new subscription is created. Maybe we can copy the LOB-related system
catalogs, replicate the copy to the replica, and then restore the
original LOBs on the replica from the copy.
   Cons: This will increase the volume of WAL and replicated writes
dramatically.
I currently do not have a preference for either option. I would like
to see if others have thoughts on how we could approach this.

3. OID conflicts
A common case is that the OID we want to publish is already in use on
the subscriber.
Option 1 (my recommendation): Create/update an existing system catalog
to map the OID when a conflict happens.
Maybe we could add another column, named something like mapping_oid,
to the system catalog pg_largeobject_metadata on the replica. When the
replica detects that the OID the source is replicating (e.g. 16400) is
already used on the replica, it could store 16400 as mapping_oid and
create a new OID (e.g. 16500) to be used locally, so whatever
operation is done on 16400 on the source just needs to be performed on
16500 on the replica.
   Cons: We would need to add additional columns to the system catalog.
Option 2: Raise an error on the replica and let the user handle it manually.
   Cons: This is not user-friendly.

Please let me know your thoughts.

Borui Yang
Amazon RDS/Aurora for PostgreSQL

[1] 
https://www.postgresql.org/message-id/VisenaEmail.16.35d9e854e3626e81.15cd0de93df%40tc7-visena
[2] 
https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/libpq/be-fsstubs.c#l312
[3] 
https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/catalog/pg_publication.c#l98




Re: Auto explain after query timeout

2022-09-20 Thread Gurjeet
On Tue Sep 20, 2022 at 11:34 AM PDT, James Coleman wrote:
> On Tue, Sep 20, 2022 at 2:12 PM Gurjeet  wrote:
> >
> > For someone who would like to achieve this in the field today, I believe
> > they can set auto_explain.log_min_duration equal to, or less than,
> > statement_timeout.
>
> Either I'm missing something (and/or this was fixed in a later PG
> version), but I don't think this is how it works. We have this
> specific problem now: we set auto_explain.log_min_duration to 200 (ms)
> and statement_timeout set to 30s, but when a statement times out we do
> not get the plan logged with auto-explain.

My DBA skills are rusty, so I'll take your word for it.

If this is the current behaviour, I'm inclined to treat this as a bug,
or at least a desirable improvement, and see if auto_explain can be
improved to emit the plan on statement_timeout.

From what I understand, the behaviour of auto_explain is that it waits
for the query to finish before it emits the plan. This information is
very useful for diagnosing long-running queries that are still running.
Many a time you encounter such queries in production workloads, and
reproducing such a scenario later on is either undesirable, expensive, or
even impossible. So being able to see the plan of a query that has
crossed auto_explain.log_min_duration as soon as possible would be highly
desirable.

Best regards,
Gurjeet
http://Gurje.et




Re: Auto explain after query timeout

2022-09-20 Thread Robert Haas
On Tue, Sep 20, 2022 at 2:35 PM James Coleman  wrote:
> Either I'm missing something (and/or this was fixed in a later PG
> version), but I don't think this is how it works. We have this
> specific problem now: we set auto_explain.log_min_duration to 200 (ms)
> and statement_timeout set to 30s, but when a statement times out we do
> not get the plan logged with auto-explain.

I think you're correct. auto_explain uses the ExecutorEnd hook, but
that hook will not be fired in the case of an error. Indeed, if an
error has already been thrown, it would be unsafe to try to
auto-explain anything. For instance -- and this is just one problem of
probably many -- ExplainTargetRel() performs catalog lookups, which is
not OK in a failed transaction.
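
For illustration, a minimal sketch of the ExecutorEnd hook wiring involved
(simplified; this is not auto_explain's actual code).  The key point is that
the hook body only runs when the executor shuts down normally, so an ERROR
such as a statement timeout never reaches it:

#include "postgres.h"

#include "executor/executor.h"
#include "fmgr.h"

PG_MODULE_MAGIC;

static ExecutorEnd_hook_type prev_ExecutorEnd = NULL;

static void
sketch_ExecutorEnd(QueryDesc *queryDesc)
{
	/* Only reached on successful completion of the query. */
	elog(LOG, "query completed: %s", queryDesc->sourceText);

	/* Chain to the previous hook, or fall back to the standard routine. */
	if (prev_ExecutorEnd)
		prev_ExecutorEnd(queryDesc);
	else
		standard_ExecutorEnd(queryDesc);
}

void
_PG_init(void)
{
	prev_ExecutorEnd = ExecutorEnd_hook;
	ExecutorEnd_hook = sketch_ExecutorEnd;
}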

To make this work, I think you'd need a hook that fires just BEFORE
the error is thrown. However, previous attempts to introduce hooks
into ProcessInterrupts() have not met with a warm response from Tom, so
it might be a tough sell. And maybe for good reason. I see at least
two problems. The first is that explaining a query is a pretty
complicated operation that involves catalog lookups and lots of
complicated stuff, and I don't think that it would be safe to do all
of that at any arbitrary point in the code where ProcessInterrupts()
happened to fire. What if one of the syscache lookups misses the cache
and we have to open the underlying catalog? Then
AcceptInvalidationMessages() will fire, but we don't currently assume
that any old CHECK_FOR_INTERRUPTS() can process invalidations. What if
the running query has called a user-defined function or procedure
which is running DDL which is in the middle of changing catalog state
for some relation involved in the query at the moment that the
statement timeout arrives? I feel like there are big problems here.

The other problem I see is that ProcessInterrupts() is our mechanism
for allowing people to escape from queries that would otherwise run
forever by hitting ^C. But what if the explain code goes crazy and
itself needs to be interrupted, when we're already inside
ProcessInterrupts()? Maybe that would work out OK... or maybe it
wouldn't. I'm not really sure. But it's another reason to be very,
very cautious about putting complicated logic inside
ProcessInterrupts().

--
Robert Haas
EDB: http://www.enterprisedb.com




Re: A question about wording in messages

2022-09-20 Thread Tom Lane
Alvaro Herrera  writes:
> After spending way too much time editing this line, I ended up with
> exactly what Tom proposed, so +1 for his version.  I think "This
> controls" adds nothing very useful, and we don't have it anywhere else,
> except tcp_keepalives_count from where I also propose to remove it.

LGTM.

regards, tom lane




Re: Support tls-exporter as channel binding for TLSv1.3

2022-09-20 Thread Jacob Champion
On Tue, Sep 20, 2022 at 11:01 AM Jacob Champion  wrote:
> Well, I'm working on a next version, but it's ballooning in complexity
> as I try to navigate the fix for OpenSSL 1.0.1 (which is currently
> failing the tests, unsurprisingly).

To be more specific: I think I'm hitting the case that Heikki pointed
out several years ago [1]:

> The problematic case is when e.g. the server
> only supports tls-unique and the client only supports
> tls-server-end-point. What we would (usually) like to happen, is to fall
> back to not using channel binding. But it's not clear how to make that
> work, and still protect from downgrade attacks.

The problem was deferred when tls-unique was removed. We might have to
actually solve it now.

bcc: Heikki, in case he would like to weigh in.

--Jacob

[1] 
https://www.postgresql.org/message-id/ec787074-2305-c6f4-86aa-6902f98485a4%40iki.fi




Re: Auto explain after query timeout

2022-09-20 Thread James Coleman
On Tue, Sep 20, 2022 at 2:12 PM Gurjeet  wrote:
>
> On Tue Sep 20, 2022 at 10:34 AM PDT, James Coleman wrote:
> > Hopefully I'm not missing something obvious, but as far as I know
> > there's no way to configure auto_explain to fire when
> > statement_timeout fires.
>
> I believe you're correct.
>
> > I'd like to look into this at some point, but I'm wondering if anyone
> > has thought about it before, and, if so, is there any obvious
> > impediment to doing so?
>
> This would be a neat feature. Since the changes would be fairly
> localized to the contrib module, this would be a great first patch for
> someone new to contributing.
>
> This can be exposed at a new GUC auto_explain.log_on_statement_timeout.
> I wish our conventions allowed for creating hierarchies of GUC
> parameters, e.g. auto_explain.when.statement_timeout.
>
> For someone who would like to achieve this in the field today, I believe
> they can set auto_explain.log_min_duration equal to, or less than,
> statement_timeout.

Either I'm missing something (and/or this was fixed in a later PG
version), but I don't think this is how it works. We have this
specific problem now: we set auto_explain.log_min_duration to 200 (ms)
and statement_timeout set to 30s, but when a statement times out we do
not get the plan logged with auto-explain.

James Coleman




Re: XMAX_LOCK_ONLY and XMAX_COMMITTED (fk/multixact code)

2022-09-20 Thread Nathan Bossart
Here is a rebased patch for cfbot.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From 9ae1f5409ddeee17b99a9565247da885cbb86be9 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Tue, 22 Mar 2022 15:35:34 -0700
Subject: [PATCH v6 1/1] Disallow setting XMAX_COMMITTED and XMAX_LOCK_ONLY
 together.

Even though Postgres doesn't set both the XMAX_COMMITTED and
XMAX_LOCK_ONLY infomask bits on the same tuple header, the heap
visibility logic still accepts it.  However, other functions like
compute_new_xmax_infomask() seem to assume that this bit pattern
is not possible.

This change marks this bit pattern as disallowed, removes the heap
visibility logic that handles it, and adds checks like those for
other disallowed infomask bit combinations (e.g., XMAX_COMMITTED
and XMAX_IS_MULTI).  Besides simplifying the heap visibility logic
a bit, this change aims to reduce ambiguity about the legal tuple
header states.

Note that this change also disallows XMAX_COMMITTED together with
the special pre-v9.3 locked-only bit pattern that
HEAP_XMAX_IS_LOCKED_ONLY checks for.  This locked-only bit pattern
may still be present on servers pg_upgraded from pre-v9.3 versions.
---
 contrib/amcheck/verify_heapam.c | 19 
 src/backend/access/heap/README.tuplock  |  2 +-
 src/backend/access/heap/heapam.c| 10 +++
 src/backend/access/heap/heapam_visibility.c | 97 ++---
 src/backend/access/heap/hio.c   |  2 +
 5 files changed, 96 insertions(+), 34 deletions(-)

diff --git a/contrib/amcheck/verify_heapam.c b/contrib/amcheck/verify_heapam.c
index d33f33f170..f74f88afc5 100644
--- a/contrib/amcheck/verify_heapam.c
+++ b/contrib/amcheck/verify_heapam.c
@@ -663,6 +663,25 @@ check_tuple_header(HeapCheckContext *ctx)
 		 */
 	}
 
+	if ((ctx->tuphdr->t_infomask & HEAP_XMAX_COMMITTED) &&
+		(ctx->tuphdr->t_infomask & HEAP_XMAX_LOCK_ONLY))
+	{
+		report_corruption(ctx,
+		  pstrdup("locked-only should not be marked committed"));
+
+		/* As above, do not skip further checks. */
+	}
+
+	/* also check for pre-v9.3 lock-only bit pattern */
+	if ((ctx->tuphdr->t_infomask & HEAP_XMAX_COMMITTED) &&
+		HEAP_XMAX_IS_LOCKED_ONLY(ctx->tuphdr->t_infomask))
+	{
+		report_corruption(ctx,
+		  pstrdup("tuple with HEAP_XMAX_EXCL_LOCK set and HEAP_XMAX_IS_MULTI unset should not be marked committed"));
+
+		/* As above, do not skip further checks. */
+	}
+
 	if (infomask & HEAP_HASNULL)
 		expected_hoff = MAXALIGN(SizeofHeapTupleHeader + BITMAPLEN(ctx->natts));
 	else
diff --git a/src/backend/access/heap/README.tuplock b/src/backend/access/heap/README.tuplock
index 6441e8baf0..4e565e60ab 100644
--- a/src/backend/access/heap/README.tuplock
+++ b/src/backend/access/heap/README.tuplock
@@ -152,4 +152,4 @@ The following infomask bits are applicable:
   whether the XMAX is a TransactionId or a MultiXactId.
 
 We currently never set the HEAP_XMAX_COMMITTED when the HEAP_XMAX_IS_MULTI bit
-is set.
+or the HEAP_XMAX_LOCK_ONLY bit is set.
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index eb811d751e..616df576c3 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -5115,6 +5115,16 @@ l5:
 		MultiXactStatus status;
 		MultiXactStatus new_status;
 
+		/*
+		 * Currently we don't allow XMAX_LOCK_ONLY and XMAX_COMMITTED to both be
+		 * set in a tuple header, so cross-check.
+		 *
+		 * Note that this also checks for the special locked-only bit pattern
+		 * that may be present on servers that were pg_upgraded from pre-v9.3
+		 * versions.
+		 */
+		Assert(!HEAP_XMAX_IS_LOCKED_ONLY(old_infomask));
+
 		if (old_infomask2 & HEAP_KEYS_UPDATED)
 			status = MultiXactStatusUpdate;
 		else
diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
index 6e33d1c881..e6ee8ff1fa 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -78,6 +78,31 @@
 #include "utils/snapmgr.h"
 
 
+/*
+ * InfomaskAssertions()
+ *
+ * Checks for invalid infomask bit patterns.
+ */
+static inline void
+InfomaskAssertions(HeapTupleHeader tuple)
+{
+	/*
+	 * XMAX_COMMITTED and XMAX_LOCK_ONLY cannot both be set in a tuple header.
+	 *
+	 * Note that this also checks for the special locked-only bit pattern that
+	 * may be present on servers that were pg_upgraded from pre-v9.3 versions.
+	 */
+	Assert(!((tuple->t_infomask & HEAP_XMAX_COMMITTED) &&
+			 HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask)));
+
+	/*
+	 * XMAX_COMMITTED and XMAX_IS_MULTI cannot both be set in a tuple header.
+	 */
+	Assert(!((tuple->t_infomask & HEAP_XMAX_COMMITTED) &&
+			 (tuple->t_infomask & HEAP_XMAX_IS_MULTI)));
+}
+
+
 /*
  * SetHintBits()
  *
@@ -113,6 +138,8 @@ static inline void
 SetHintBits(HeapTupleHeader tuple, Buffer buffer,
 			uint16 infomask, TransactionId xid)
 {
+	InfomaskAssertions(tuple);
+
 	if (TransactionIdIsValid(xid))
 	{
 		/* NB: xid must be known committed here! 

Re: Auto explain after query timeout

2022-09-20 Thread Gurjeet
On Tue Sep 20, 2022 at 10:34 AM PDT, James Coleman wrote:
> Hopefully I'm not missing something obvious, but as far as I know
> there's no way to configure auto_explain to fire when
> statement_timeout fires.

I believe you're correct.

> I'd like to look into this at some point, but I'm wondering if anyone
> has thought about it before, and, if so, is there any obvious
> impediment to doing so?

This would be a neat feature. Since the changes would be fairly
localized to the contrib module, this would be a great first patch for
someone new to contributing.

This can be exposed at a new GUC auto_explain.log_on_statement_timeout.
I wish our conventions allowed for creating hierarchies of GUC
parameters, e.g. auto_explain.when.statement_timeout.
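
For illustration, a minimal sketch of how such a (hypothetical) GUC could be
registered in the module using the standard custom-GUC API; the parameter
name is only the proposal above, not an existing auto_explain setting:

#include "postgres.h"

#include "fmgr.h"
#include "utils/guc.h"

PG_MODULE_MAGIC;

/* Hypothetical setting proposed above; not an existing auto_explain GUC. */
static bool log_on_statement_timeout = false;

void
_PG_init(void)
{
	DefineCustomBoolVariable("auto_explain.log_on_statement_timeout",
							 "Log the plan when a statement is cancelled by statement_timeout.",
							 NULL,
							 &log_on_statement_timeout,
							 false,
							 PGC_SUSET,
							 0,
							 NULL,	/* check_hook */
							 NULL,	/* assign_hook */
							 NULL);	/* show_hook */
}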

For someone who would like to achieve this in the field today, I believe
they can set auto_explain.log_min_duration equal to, or less than,
statement_timeout.

Best regards,
Gurjeet
http://Gurje.et




Re: predefined role(s) for VACUUM and ANALYZE

2022-09-20 Thread Nathan Bossart
On Tue, Sep 20, 2022 at 02:45:52PM +0900, Michael Paquier wrote:
> I have gone through the thread, and I'd agree that getting more
> granularity when it comes to assigning ACLs to relations, rather than
> just an on/off switch for the objects of a given type, would be nice.
> I've been looking at the whole use of AclMode and AclItem in the code,
> and I don't quite see why a larger size could have a noticeable
> impact.  There are a few things that could handle a large number of
> AclItems, though, say for array operations like aclupdate().  These
> could be easily checked with some micro-benchmarking or some SQL
> queries that emulate a large number of items in aclitem[] arrays.

I performed a few quick tests with a couple thousand ACLs on my laptop, and
I'm consistently seeing a 4.3% regression.

> Any impact for the column sizes of the catalogs holding ACL
> information?  Just asking while browsing the patch set.

Since each aclitem requires 16 bytes instead of 12, I assume so.  However,
in my testing, I hit a "row is too big" error with the same number of
aclitems in a pg_class row before and after the change.  I might be missing
something in my patch, or maybe I am misunderstanding how arrays of
aclitems are stored on disk.

> Some comments in utils/acl.h need a refresh as the number of lower and
> upper bits looked at from ai_privs changes.

Oops, I missed that one.  I fixed it in the attached patch set.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From 15e9a60b990070d36f4e1f749cbfbb638b37a666 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Wed, 7 Sep 2022 22:25:29 -0700
Subject: [PATCH v6 1/4] Change AclMode from a uint32 to a uint64.

---
 src/backend/nodes/outfuncs.c|  2 +-
 src/bin/pg_upgrade/check.c  | 35 +
 src/include/catalog/pg_type.dat |  2 +-
 src/include/nodes/parsenodes.h  |  6 +++---
 src/include/utils/acl.h | 28 +-
 5 files changed, 54 insertions(+), 19 deletions(-)

diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 60610e3a4b..dbb9ad52da 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -528,7 +528,7 @@ _outRangeTblEntry(StringInfo str, const RangeTblEntry *node)
 	WRITE_BOOL_FIELD(lateral);
 	WRITE_BOOL_FIELD(inh);
 	WRITE_BOOL_FIELD(inFromCl);
-	WRITE_UINT_FIELD(requiredPerms);
+	WRITE_UINT64_FIELD(requiredPerms);
 	WRITE_OID_FIELD(checkAsUser);
 	WRITE_BITMAPSET_FIELD(selectedCols);
 	WRITE_BITMAPSET_FIELD(insertedCols);
diff --git a/src/bin/pg_upgrade/check.c b/src/bin/pg_upgrade/check.c
index f1bc1e6886..615a53a864 100644
--- a/src/bin/pg_upgrade/check.c
+++ b/src/bin/pg_upgrade/check.c
@@ -28,6 +28,7 @@ static void check_for_incompatible_polymorphics(ClusterInfo *cluster);
 static void check_for_tables_with_oids(ClusterInfo *cluster);
 static void check_for_composite_data_type_usage(ClusterInfo *cluster);
 static void check_for_reg_data_type_usage(ClusterInfo *cluster);
+static void check_for_aclitem_data_type_usage(ClusterInfo *cluster);
 static void check_for_jsonb_9_4_usage(ClusterInfo *cluster);
 static void check_for_pg_role_prefix(ClusterInfo *cluster);
 static void check_for_new_tablespace_dir(ClusterInfo *new_cluster);
@@ -107,6 +108,13 @@ check_and_dump_old_cluster(bool live_check)
 	check_for_reg_data_type_usage(&old_cluster);
 	check_for_isn_and_int8_passing_mismatch(&old_cluster);
 
+	/*
+	 * PG 16 increased the size of the 'aclitem' type, which breaks the on-disk
+	 * format for existing data.
+	 */
+	if (GET_MAJOR_VERSION(old_cluster.major_version) <= 1500)
+		check_for_aclitem_data_type_usage(&old_cluster);
+
 	/*
 	 * PG 14 changed the function signature of encoding conversion functions.
 	 * Conversions from older versions cannot be upgraded automatically
@@ -1319,6 +1327,33 @@ check_for_reg_data_type_usage(ClusterInfo *cluster)
 		check_ok();
 }
 
+/*
+ * check_for_aclitem_data_type_usage
+ *
+ *	aclitem changed its storage format in 16, so check for it.
+ */
+static void
+check_for_aclitem_data_type_usage(ClusterInfo *cluster)
+{
+	char		output_path[MAXPGPATH];
+
+	prep_status("Checking for incompatible aclitem data type in user tables");
+
+	snprintf(output_path, sizeof(output_path), "tables_using_aclitem.txt");
+
+	if (check_for_data_type_usage(cluster, "pg_catalog.aclitem", output_path))
+	{
+		pg_log(PG_REPORT, "fatal");
+		pg_fatal("Your installation contains the \"aclitem\" data type in user tables.\n"
+ "The internal format of \"aclitem\" changed in PostgreSQL version 16\n"
+ "so this cluster cannot currently be upgraded.  You can drop the\n"
+ "problem columns and restart the upgrade.  A list of the problem\n"
+ "columns is in the file:\n"
+ "%s", output_path);
+	}
+	else
+		check_ok();
+}
 
 /*
  * check_for_jsonb_9_4_usage()
diff --git a/src/include/catalog/pg_type.dat b/src/include/catalog/pg_type.dat
index df45879463..484dec39e8 100644
--- a/src/include/catalog/pg_type.dat
+

Re: Support tls-exporter as channel binding for TLSv1.3

2022-09-20 Thread Jacob Champion
On Mon, Sep 19, 2022 at 5:54 PM Michael Paquier  wrote:
> X509_get_signature_nid() has been introduced in 1.0.2.
> SSL_export_keying_material() is older than that, being present since
> 1.0.1.  Considering the fact that we want to always have
> tls-server-end-point as default, it seems to me that we should always
> publish SCRAM-SHA-256-PLUS and support channel binding when building
> with OpenSSL >= 1.0.2.  The same can be said about the areas where we
> have HAVE_BE_TLS_GET_[PEER_]CERTIFICATE_HASH.

Should we advertise support even if the client is too old to provide
an extended master secret?

> I was planning to continue working on this patch based on your latest
> review.  Anyway, as that's originally your code, I am fine to let you
> take the lead here.  Just let me know which way you prefer, and I'll
> stick to it :)

Well, I'm working on a next version, but it's ballooning in complexity
as I try to navigate the fix for OpenSSL 1.0.1 (which is currently
failing the tests, unsurprisingly). You mentioned not wanting to add
maintenance burden for 1.0.1, and I'm curious to see whether the
approach you have in mind would be easier than what mine is turning
out to be. Maybe we can compare notes?

--Jacob




Re: oat_post_create expected behavior

2022-09-20 Thread Jeff Davis
On Tue, 2022-08-02 at 13:30 -0700, Mary Xu wrote:
> > Right, same thing I'm saying.  I also think we should discourage
> > people from doing cowboy CCIs inside their OAT hooks, because that
> > makes the testability problem even worse.  Maybe that means we
> > need to uniformly move the CREATE hooks to after a system-provided
> > CCI, but I've not thought hard about the implications of that.
> 
> I like this approach, however, I am relatively new to the PG scene
> and
> am not sure how or what I should look into in terms of the
> implications that Tom mentioned. Are there any tips? What should be
> the next course of action here? I could update my patch to always
> call
> CCI before the create hooks.

I didn't see a clear consensus that we should call OAT_POST_CREATE
after CCI, so I went ahead and updated the comment. We can always
update the behavior later when we do have consensus, but until that
time, at least the comment will be more helpful.

If you are satisfied you can mark the CF issue as "committed", or you
can leave it open if you think it's still unresolved.

-- 
Jeff Davis
PostgreSQL Contributor Team - AWS






Re: CFM Manager

2022-09-20 Thread Jacob Champion
On Thu, Sep 8, 2022 at 2:34 PM Jacob Champion  wrote:
> I still have yet to update the section "5 to 7 days before end of CF"
> and onward.

Well, I've saved the hardest part for last...

Ibrar, Hamid, have the checklist rewrites been helpful so far? Are you
planning on doing an (optional!) triage, and if so, are there any
pieces in particular you'd like me to document?

--Jacob




Tighten pg_get_object_address argument checking

2022-09-20 Thread Peter Eisentraut

For publication schemas (OBJECT_PUBLICATION_NAMESPACE) and user
mappings (OBJECT_USER_MAPPING), pg_get_object_address() checked the
array length of the second argument, but not of the first argument.
If the first argument was too long, it would just silently ignore
everything but the first argument.  Fix that by checking the length of
the first argument as well.

I wouldn't be surprised if there were more holes like this in this area. 
I just happened to find these while working on something related.

From eb80c87a083464160a1436e5f983df840b282085 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Tue, 20 Sep 2022 13:37:27 -0400
Subject: [PATCH] Tighten pg_get_object_address argument checking

For publication schemas (OBJECT_PUBLICATION_NAMESPACE) and user
mappings (OBJECT_USER_MAPPING), pg_get_object_address() checked the
array length of the second argument, but not of the first argument.
If the first argument was too long, it would just silently ignore
everything but the first argument.  Fix that by checking the length of
the first argument as well.
---
 src/backend/catalog/objectaddress.c  | 10 --
 src/test/regress/expected/object_address.out | 16 +++-
 src/test/regress/sql/object_address.sql  |  2 +-
 3 files changed, 20 insertions(+), 8 deletions(-)

diff --git a/src/backend/catalog/objectaddress.c 
b/src/backend/catalog/objectaddress.c
index 8377b4f7d4d1..27616ac2ad26 100644
--- a/src/backend/catalog/objectaddress.c
+++ b/src/backend/catalog/objectaddress.c
@@ -2239,10 +2239,16 @@ pg_get_object_address(PG_FUNCTION_ARGS)
 */
switch (type)
{
+   case OBJECT_PUBLICATION_NAMESPACE:
+   case OBJECT_USER_MAPPING:
+   if (list_length(name) != 1)
+   ereport(ERROR,
+   
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+errmsg("name list length must 
be exactly %d", 1)));
+   /* fall through to check args length */
+   /* FALLTHROUGH */
case OBJECT_DOMCONSTRAINT:
case OBJECT_CAST:
-   case OBJECT_USER_MAPPING:
-   case OBJECT_PUBLICATION_NAMESPACE:
case OBJECT_PUBLICATION_REL:
case OBJECT_DEFACL:
case OBJECT_TRANSFORM:
diff --git a/src/test/regress/expected/object_address.out 
b/src/test/regress/expected/object_address.out
index 4117fc27c9a5..cbb99c7b9f94 100644
--- a/src/test/regress/expected/object_address.out
+++ b/src/test/regress/expected/object_address.out
@@ -105,7 +105,7 @@ BEGIN
('text search template'), ('text search configuration'),
('policy'), ('user mapping'), ('default acl'), ('transform'),
('operator of access method'), ('function of access method'),
-   ('publication relation')
+   ('publication namespace'), ('publication relation')
LOOP
FOR names IN VALUES ('{eins}'), ('{addr_nsp, zwei}'), ('{eins, 
zwei, drei}')
LOOP
@@ -285,10 +285,10 @@ WARNING:  error for policy,{eins,zwei,drei},{}: schema 
"eins" does not exist
 WARNING:  error for policy,{eins,zwei,drei},{integer}: schema "eins" does not 
exist
 WARNING:  error for user mapping,{eins},{}: argument list length must be 
exactly 1
 WARNING:  error for user mapping,{eins},{integer}: user mapping for user 
"eins" on server "integer" does not exist
-WARNING:  error for user mapping,{addr_nsp,zwei},{}: argument list length must 
be exactly 1
-WARNING:  error for user mapping,{addr_nsp,zwei},{integer}: user mapping for 
user "addr_nsp" on server "integer" does not exist
-WARNING:  error for user mapping,{eins,zwei,drei},{}: argument list length 
must be exactly 1
-WARNING:  error for user mapping,{eins,zwei,drei},{integer}: user mapping for 
user "eins" on server "integer" does not exist
+WARNING:  error for user mapping,{addr_nsp,zwei},{}: name list length must be 
exactly 1
+WARNING:  error for user mapping,{addr_nsp,zwei},{integer}: name list length 
must be exactly 1
+WARNING:  error for user mapping,{eins,zwei,drei},{}: name list length must be 
exactly 1
+WARNING:  error for user mapping,{eins,zwei,drei},{integer}: name list length 
must be exactly 1
 WARNING:  error for default acl,{eins},{}: argument list length must be 
exactly 1
 WARNING:  error for default acl,{eins},{integer}: unrecognized default ACL 
object type "i"
 WARNING:  error for default acl,{addr_nsp,zwei},{}: argument list length must 
be exactly 1
@@ -313,6 +313,12 @@ WARNING:  error for function of access 
method,{addr_nsp,zwei},{}: name list leng
 WARNING:  error for function of access method,{addr_nsp,zwei},{integer}: name 
list length must be at least 3
 WARNING:  error for function of access method,{eins,zwei,drei},{}: argument 
list length must be exactly 2
 WARNING:  error for function of access meth

Auto explain after query timeout

2022-09-20 Thread James Coleman
Hopefully I'm not missing something obvious, but as far as I know
there's no way to configure auto explain to fire when
statement_timeout fires.

I'd like to look into this at some point, but I'm wondering if anyone
has thought about it before, and, if so, is there any obvious
impediment to doing so?
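For concreteness, here is a minimal sketch of the gap (these are the real GUC
names; the behavior is as described above, i.e. nothing gets logged for the
cancelled query):

  LOAD 'auto_explain';
  SET auto_explain.log_min_duration = '1s';  -- logs plans of queries that complete
  SET statement_timeout = '5s';              -- cancels long queries before completion

  -- Cancelled by statement_timeout, so auto_explain never logs its plan.
  SELECT pg_sleep(10);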

Thanks,
James Coleman




Re: A question about wording in messages

2022-09-20 Thread Alvaro Herrera
On 2022-Sep-16, Amit Kapila wrote:

> We only want to commit the changes to 15 (a) if those fixes a problem
> introduced in 15, or (b) those are for a bug fix. I think the error
> message improvements fall into none of those categories, we can map it
> to (b) but I feel those are an improvement in the current messages and
> don't seem critical to me.

IMO at least the GUC one does fix a problem related to the wording of a
user-visible message, which also flows into the translations.  I prefer
to have that one fixed in 15 also.  The other messages (errors) don't
seem very interesting because they're not as visible, so I don't care if
those are not backpatched.

-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/
"This is a foot just waiting to be shot"(Andrew Dunstan)




Re: Patch proposal: make use of regular expressions for the username in pg_hba.conf

2022-09-20 Thread Jacob Champion
On Mon, Sep 19, 2022 at 9:09 PM Tom Lane  wrote:
> You have to assume that somebody (a) has a role or DB name starting
> with slash, (b) has an explicit reference to that name in their
> pg_hba.conf, (c) doesn't read the release notes, and (d) doesn't
> notice that things are misbehaving until after some hacker manages
> to break into their installation on the strength of the misbehaving
> entry.  OK, I'll grant that the probability of (c) is depressingly
> close to unity; but each of the other steps seems quite low probability.
> All four of them happening in one installation is something I doubt
> will happen.

I can't argue with (a) or (b), but (d) seems decently likely to me. If
your normal user base consists of people who are authorized to access
your system, what clues would you have that your HBA is silently
failing open?
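As a concrete (hypothetical) illustration of that failure mode:

  # pg_hba.conf entry written for a role literally named "/admin".  If a
  # leading slash starts to mean "regular expression", this line silently
  # begins matching role names by pattern instead of the one literal role
  # it was written for.
  host    all    /admin    10.0.0.0/8    scram-sha-256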

--Jacob




Re: making relfilenodes 56 bits

2022-09-20 Thread Robert Haas
On Fri, Sep 9, 2022 at 6:02 AM Dilip Kumar  wrote:
> [ new patch ]

+typedef pg_int64 RelFileNumber;

This seems really random to me. First, why isn't this an unsigned
type? OID is unsigned and I don't see a reason to change to a signed
type. But even if we were going to change to a signed type, why
pg_int64? That is declared like this:

/* Define a signed 64-bit integer type for use in client API declarations. */
typedef PG_INT64_TYPE pg_int64;

Surely this is not a client API declaration.

Note that if we change this, a lot of references to INT64_FORMAT will
need to become UINT64_FORMAT.

I think we should use int64 at the SQL level, because we don't have an
unsigned 64-bit SQL type, and a signed 64-bit type can hold 56 bits.
So it would still be Int64GetDatum((int64) rd_rel->relfilenode) or
similar. But internally I think using unsigned is cleaner.

+ * RelFileNumber is unique within a cluster.

Not really, because of CREATE DATABASE. Probably just drop this line.
Or else expand it: we never assign the same RelFileNumber twice within
the lifetime of the same cluster, but there can be multiple relations
with the same RelFileNumber e.g. because CREATE DATABASE duplicates
the RelFileNumber values from the template database. But maybe we
don't need this here, as it's already explained in relfilelocator.h.

+ret = (int8) (tag->relForkDetails[0] >> BUFTAG_RELNUM_HIGH_BITS);

Why not declare ret as ForkNumber instead of casting twice?

+uint64  relnum;
+
+Assert(relnumber <= MAX_RELFILENUMBER);
+Assert(forknum <= MAX_FORKNUM);
+
+relnum = relnumber;

Perhaps it'd be better to write uint64 relnum = relnumber instead of
initializing on a separate line.

+#define RELNUMBERCHARS  20  /* max chars printed by %llu */

Maybe instead of %llu we should say UINT64_FORMAT (or INT64_FORMAT if
there's some reason to stick with a signed type).

+elog(ERROR, "relfilenumber is out of bound");

It would have to be "out of bounds", with an "s". But maybe "is too
large" would be better.

+nextRelFileNumber = ShmemVariableCache->nextRelFileNumber;
+loggedRelFileNumber = ShmemVariableCache->loggedRelFileNumber;
+flushedRelFileNumber = ShmemVariableCache->flushedRelFileNumber;

Maybe it would be a good idea to assert that next <= flushed and
flushed <= logged?

+#ifdef USE_ASSERT_CHECKING
+
+{
+RelFileLocatorBackend rlocator;
+char   *rpath;

Let's add a comment here, like "Because the RelFileNumber counter only
ever increases and never wraps around, it should be impossible for the
newly-allocated RelFileNumber to already be in use. But, if Asserts
are enabled, double check that there's no main-fork relation file with
the new RelFileNumber already on disk."

+elog(ERROR, "cannot forward RelFileNumber during recovery");

forward -> set (or advance)

+if (relnumber >= ShmemVariableCache->loggedRelFileNumber)

It probably doesn't make any difference, but to me it seems better to
test flushedRelFileNumber rather than logRelFileNumber here. What do
you think?

 /*
  * We set up the lockRelId in case anything tries to lock the dummy
- * relation.  Note that this is fairly bogus since relNumber may be
- * different from the relation's OID.  It shouldn't really matter though.
- * In recovery, we are running by ourselves and can't have any lock
- * conflicts.  While syncing, we already hold AccessExclusiveLock.
+ * relation.  Note we are setting relId to just FirstNormalObjectId which
+ * is completely bogus.  It shouldn't really matter though. In recovery,
+ * we are running by ourselves and can't have any lock conflicts.  While
+ * syncing, we already hold AccessExclusiveLock.
  */
 rel->rd_lockInfo.lockRelId.dbId = rlocator.dbOid;
-rel->rd_lockInfo.lockRelId.relId = rlocator.relNumber;
+rel->rd_lockInfo.lockRelId.relId = FirstNormalObjectId;

Boy, this makes me uncomfortable. The existing logic is pretty bogus,
and we're replacing it with some other bogus thing. Do we know whether
anything actually does try to use this for locking?

One notable difference between the existing logic and your change is
that, with the existing logic, we use a bogus value that will differ
from one relation to the next, whereas with this change, it will
always be the same value. Perhaps rel->rd_lockInfo.lockRelId.relId =
(Oid) rlocator.relNumber would be a more natural adaptation?

+#define CHECK_RELFILENUMBER_RANGE(relfilenumber)\
+do {\
+if ((relfilenumber) < 0 || (relfilenumber) > MAX_RELFILENUMBER) \
+ereport(ERROR,  \
+errcode(ERRCODE_INVALID_PARAMETER_VALUE),   \
+errmsg("relfilenumber %lld is out of range",\
+(long long) (relfilenumber))); \
+} while (0)

Here, you take the approach of casting the relfilenumber to 

Re: cataloguing NOT NULL constraints

2022-09-20 Thread Robert Haas
On Tue, Sep 20, 2022 at 10:39 AM Alvaro Herrera  wrote:
> That said, the patch I posted for this ~10 years ago used a separate
> contype and was simpler than what I ended up with now, but amusingly
> enough it was returned at the time with the argument that it would be
> better to treat them as normal CHECK constraints; so I want to be very
> sure that we're not just going around in circles.

I don't have an intrinsic view on whether we ought to have one contype
or two, but I think this comment of yours from a few messages ago is
right on point: ".. though I'm now wondering if there's additional
overhead from checking
the constraint twice on each row: first the attnotnull bit, then the
CHECK itself.  Hmm.  That's probably quite bad." For that exact
reason, it seems absolutely necessary to be able to somehow identify
the "redundant" check constraints, so that we don't pay the expense of
revalidating them. Another contype would be one way of identifying
such constraints, but it could probably be done in other ways, too.
Perhaps it could even be discovered dynamically, like when we build a
relcache entry. I actually have no idea what design is best.

I am a little confused as to why we want to do this, though. While
we're on the topic of what is more complicated and simpler, what
functionality do we get out of adding all of these additional catalog
entries that then have to be linked back to the corresponding
attnotnull markings? And couldn't we get that functionality in some
much simpler way? Like, if you want to track whether the NOT NULL
constraint has been validated, we could add an attnotnullvalidated
column, or probably better, change the existing attnotnull column to a
character used as an enum, or maybe an integer bit-field of some kind.
I'm not trying to blow up your patch with dynamite or anything, but
I'm a little suspicious that this may be one of those cases where
pgsql-hackers discussion turns a complicated project into an even more
complicated project to no real benefit.

One thing that I don't particularly like about this whole design is
that it feels like it creates a bunch of catalog bloat. Now all of the
attnotnull flags also generate additional pg_constraint rows. The
catalogs in the default install will be bigger than before, and the
catalogs after user tables are created will be bigger still. If we get
some nifty benefit out of all that, cool! But if we're just
anti-optimizing the catalog storage out of some feeling that the
result will be intellectually purer than some competing design, maybe
we should reconsider. It's not stupid to optimize for common special
cases, and marking a column as NOT NULL is probably at least one and
maybe several orders of magnitude more common than putting some other
CHECK constraint on it.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




default sorting behavior for index

2022-09-20 Thread Zhihong Yu
Hi,
I was looking at this check in src/backend/parser/parse_utilcmd.c w.r.t.
constraint:

    if (indclass->values[i] != defopclass ||
        attform->attcollation != index_rel->rd_indcollation[i] ||
        attoptions != (Datum) 0 ||
        index_rel->rd_indoption[i] != 0)
        ereport(ERROR,
                (errcode(ERRCODE_WRONG_OBJECT_TYPE),
                 errmsg("index \"%s\" column number %d does not have default sorting behavior",
                        index_name, i + 1),
                 errdetail("Cannot create a primary key or unique constraint using such an index."),

It seems this first came in via `Indexes with INCLUDE columns and their
support in B-tree`

If the index has DESC sorting order, why can't it be used to back a
constraint?
Some concrete sample would help me understand this.
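For what it's worth, here is a minimal way to reproduce the error I am asking
about (hypothetical names):

  CREATE TABLE t (a int);
  CREATE UNIQUE INDEX t_a_desc ON t (a DESC);

  -- Rejected with the error above: the index column does not use the default
  -- sorting behavior (ASC NULLS LAST), so it cannot back the constraint.
  ALTER TABLE t ADD CONSTRAINT t_pkey PRIMARY KEY USING INDEX t_a_desc;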

Thanks


Re: A question about wording in messages

2022-09-20 Thread Alvaro Herrera
On 2022-Sep-14, Kyotaro Horiguchi wrote:

> At Tue, 13 Sep 2022 22:38:46 -0400, Tom Lane  wrote in 
> > Kyotaro Horiguchi  writes:
> > > I saw the following message recently modified.
> > >> This controls the maximum distance we can read ahead in the WAL to 
> > >> prefetch referenced data blocks.
> > > Maybe the "we" means "PostgreSQL program and you" but I see it
> > > somewhat out of place.
> > 
> > +1, I saw that today and thought it was outside our usual style.
> > The whole thing is awfully verbose for a GUC description, too.
> > Maybe
> > 
> > "Maximum distance to read ahead in WAL to prefetch data blocks."

I failed to notice this issue.  I agree it's unusual and +1 for changing it.

> It seems to sufficiently work for average users and rather easy to
> read, but it looks a short description.

> So, taking the middle of them, how about the following?
> 
> Short: Buffer size for reading ahead in the WAL during recovery.
> Extra: This controls the maximum distance to read ahead in WAL to prefetch 
> data blocks."

But why do we care that it's short?  We don't need it to be long .. we
only need it to explain what it needs to explain.

After spending way too much time editing this line, I ended up with
exactly what Tom proposed, so +1 for his version.  I think "This
controls" adds nothing very useful, and we don't have it anywhere else,
except tcp_keepalives_count from where I also propose to remove it.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"Hay quien adquiere la mala costumbre de ser infeliz" (M. A. Evans)
>From 5b8bf15ed5d9f1a21150039da33a557f027640d4 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Tue, 20 Sep 2022 18:19:34 +0200
Subject: [PATCH] fix some GUC strings

---
 src/backend/utils/misc/guc_tables.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 550e95056c..ab3140ff3a 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2591,7 +2591,7 @@ struct config_int ConfigureNamesInt[] =
 	{
 		{"wal_decode_buffer_size", PGC_POSTMASTER, WAL_RECOVERY,
 			gettext_noop("Buffer size for reading ahead in the WAL during recovery."),
-			gettext_noop("This controls the maximum distance we can read ahead in the WAL to prefetch referenced data blocks."),
+			gettext_noop("Maximum distance to read ahead in the WAL to prefetch referenced data blocks."),
 			GUC_UNIT_BYTE
 		},
 		&wal_decode_buffer_size,
@@ -3248,7 +3248,7 @@ struct config_int ConfigureNamesInt[] =
 	{
 		{"tcp_keepalives_count", PGC_USERSET, CONN_AUTH_TCP,
 			gettext_noop("Maximum number of TCP keepalive retransmits."),
-			gettext_noop("This controls the number of consecutive keepalive retransmits that can be "
+			gettext_noop("Number of consecutive keepalive retransmits that can be "
 		 "lost before a connection is considered dead. A value of 0 uses the "
 		 "system default."),
 		},
-- 
2.30.2



Re: Tree-walker callbacks vs -Wdeprecated-non-prototype

2022-09-20 Thread Tom Lane
I wrote:
> (That verbiage is from the gcc manual; clang seems to act the same
> except that -Wcast-function-type is selected by -Wall, or perhaps is
> even on by default.)

Nah, scratch that: the reason -Wcast-function-type is on is that
we explicitly enable it, and have done so since de8feb1f3 (v14).
I did not happen to see this warning with gcc because the test runs
I made with this patch already had c35ba141d, whereas I did my
clang test on another machine that wasn't quite up to HEAD.
So we should have good warning coverage for bogus walker signatures
on both compilers.

regards, tom lane




Re: [PATCH]Feature improvement for MERGE tab completion

2022-09-20 Thread Alvaro Herrera
On 2022-Sep-18, Fujii Masao wrote:

> The tab-completion code for MERGE was added in the middle of that for LOCK 
> TABLE.
> This would be an oversight of the commit that originally supported 
> tab-completion
> for MERGE. I fixed this issue.

Argh, thanks.

> "MERGE" was tab-completed with just after "EXPLAIN" or "EXPLAIN ANALYZE", etc.
> Since "INTO" always follows "MERGE", it's better to complete with "MERGE INTO"
> there. I replaced "MERGE" with "MERGE INTO" in those tab-completions.

OK, that would be similar to REFRESH MATERIALIZED VIEW.

The rules starting at line 4111 make me a bit nervous, since nowhere
are we restricting them to operating only on MERGE lines.  I don't think
it's a real problem since USING is not terribly common anyway.  Likewise
for the ones with WHEN [NOT] MATCHED.  I kinda wish we had a way to
search for stuff like "keyword MERGE appears earlier in the command",
but we don't have that.

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"I'm always right, but sometimes I'm more right than other times."
  (Linus Torvalds)




Re: Proposal to use JSON for Postgres Parser format

2022-09-20 Thread Alvaro Herrera
On 2022-Sep-20, Matthias van de Meent wrote:

> Allow me to add: compressability
> 
> In the thread surrounding [0] there were complaints about the size of
> catalogs, and specifically the template database. Significant parts of
> that (688kB of 8080kB a fresh PG14 database) are in pg_rewrite, which
> consists mostly of serialized Nodes. If we're going to replace our
> current NodeToText infrastructure, we'd better know we can effectively
> compress this data.

True.  Currently, the largest ev_action values compress pretty well.  I
think if we wanted this to be more succint, we would have to invent some
binary format -- perhaps something like Protocol Buffers: it'd be stored
in the binary format in catalogs, but for output it would be converted
into something easy to read (we already do this for
pg_statistic_ext_data for example).  We'd probably lose compressibility,
but that'd be okay because the binary format would already remove most
of the redundancy by nature.

Do we want to go there?
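For reference, a rough way to measure that on a given database
(pg_column_size reports the stored, possibly compressed, size of each
ev_action value):

  SELECT pg_size_pretty(pg_total_relation_size('pg_rewrite')) AS pg_rewrite_total,
         pg_size_pretty(sum(pg_column_size(ev_action))::bigint) AS ev_action_stored
  FROM pg_rewrite;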

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"Java is clearly an example of money oriented programming"  (A. Stepanov)




Re: why can't a table be part of the same publication as its schema

2022-09-20 Thread Mark Dilger



> On Sep 19, 2022, at 8:03 PM, Jonathan S. Katz  wrote:
> 
> "When a partitioned table is added to a publication, all of its existing and 
> future partitions are implicitly considered to be part of the 
> publication."[10]
> 
> Additionally, this is the behavior that is already present in "FOR ALL 
> TABLES":
> 
> "Marks the publication as one that replicates changes for all tables in the 
> database, including tables created in the future."[10]
> 
> I don't think we should change this behavior that's already in logical 
> replication.

The existing behavior in logical replication doesn't have any "IN SCHEMA" 
qualifiers.

> While I understand the reasons why "GRANT ... ALL TABLES IN SCHEMA" has a 
> different behavior (i.e. it's not applied to future objects) and do not 
> advocate to change it, I have personally been affected where I thought a 
> permission would be applied to all future objects, only to discover 
> otherwise. I believe it's more intuitive to think that "ALL" applies to 
> "everything, always."

The conversation is focusing on what "ALL TABLES" means, but the ambiguous part 
is what "IN SCHEMA" means.  In GRANT it means "currently in schema, computed 
now."  We are about to create confusion by adding the "IN SCHEMA" phrase to 
publication commands meaning "later in schema, computed then."  A user who 
diligently consults the documentation for one command to discover what "IN 
SCHEMA" means may fairly, but wrongly, assume it means the same thing in 
another command.
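To spell the contrast out with the two commands as they stand (a sketch):

  -- GRANT: "ALL TABLES IN SCHEMA" is computed now; a table created in s1
  -- afterwards is not covered.
  GRANT SELECT ON ALL TABLES IN SCHEMA s1 TO reader;

  -- Publication: the same phrase also covers tables created in s1 later.
  CREATE PUBLICATION pub1 FOR ALL TABLES IN SCHEMA s1;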

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: RFC: Logging plan of the running query

2022-09-20 Thread torikoshia

On 2022-09-19 17:47, Алена Рыбакина wrote:
Thanks for your review and comments!


Hi,

I'm sorry,if this message is duplicated previous this one, but I'm not
sure that the previous message is sent correctly. I sent it from email
address a.rybak...@postgrespro.ru and I couldn't send this one email
from those address.


I've successfully received your mail from both a.rybak...@postgrespro.ru 
and lena.riback...@yandex.ru.



I like the idea of creating a patch for logging the query plan. After
reviewing this code I noticed some points and I'd like to ask you some
questions.

Firstly, I suggest some editing of the commit message. I think it turned
out more laconic and just as clear. I wrote it below since I can't think
of any other way to add it.



```
Currently, we have to wait for the query execution to finish before we
can check its plan. This is not so convenient when investigating
long-running queries on production environments where we cannot use
debuggers.



To improve this situation, this patch adds the pg_log_query_plan()
function, which requests logging the plan of the specified backend
process.



By default, only superusers are allowed to request logging of the plan;
otherwise, allowing any user to issue this request could cause lots of
log messages and could lead to denial of service.

At the next CHECK_FOR_INTERRUPTS(), the target backend logs its plan at
LOG_SERVER_ONLY level, so the plan appears in the server log only and is
not sent to the client.

Thanks, I have incorporated your comments.
Since the latter part of the original message comes from the commit
message of pg_log_backend_memory_contexts (43620e328617c), I left it
as it was for consistency.




Secondly, I have a question about deleting USE_ASSERT_CHECKING in lock.h.
It was supposed to be checked in other places of the code by matching
values. I am worried about missing errors due to not testing with the
assert option in the places where GetLockMethodLocalHash() participates,
and we won't be able to get a core file in segfault cases. I might not
understand something; can you please explain it to me?
Since GetLockMethodLocalHash() was only used for assertions, it was only
defined when USE_ASSERT_CHECKING is enabled.
This patch uses GetLockMethodLocalHash() not only for assertions, so I
removed the "ifdef USE_ASSERT_CHECKING" around this function.

I believe it does not lead to skipping errors.

Thirdly, I don't understand why save_ActiveQueryDesc is declared in
pquery.h. It seems to me that save_ActiveQueryDesc is used only once, in
the ExecutorRun function, so its declaration there is superfluous. I made
this change in the attached patch.

Exactly.


Fourthly, it seems to me there are not enough explanatory comments in
the code. I also added them in the attached patch.

Thanks!

| +   /*
| +* Save value of ActiveQueryDesc before having called
| +* ExecutorRun_hook function due to having reset by
| +* AbortSubTransaction.
| +*/
| +
| save_ActiveQueryDesc = ActiveQueryDesc;
| ActiveQueryDesc = queryDesc;
|
| @@ -314,6 +320,7 @@ ExecutorRun(QueryDesc *queryDesc,
| else
| standard_ExecutorRun(queryDesc, direction, count, 
execute_once);

|
| +   /* We set the actual value of ActiveQueryDesc */
| ActiveQueryDesc = save_ActiveQueryDesc;

Since these processes are needed for nested queries, not only for
AbortSubTransaction[1], added comments on it.

| +/* Function to handle the signal to output the query plan. */
|  extern void HandleLogQueryPlanInterrupt(void);

I feel this comment is unnecessary since the explanation of 
HandleLogQueryPlanInterrupt() is written in explain.c and no functions 
in explain.h have comments in it.



Lastly, I don't fully understand the signal handling, since I haven't
used it before. Could another signal prevent this signal from logging
the query plan? I noticed this signal is checked last in the
procsignal_sigusr1_handler function.
Are you concerned that one signal will not be processed when multiple 
signals are sent in succession?
AFAIU both of them are processed since SendProcSignal flags 
ps_signalFlags for each signal.


```
 SendProcSignal(pid_t pid, ProcSignalReason reason, BackendId backendId)
 {
 volatile ProcSignalSlot *slot;
...(snip)...
278 if (slot->pss_pid == pid)
279 {
280 /* Atomically set the proper flag */
281 slot->pss_signalFlags[reason] = true;
282 /* Send signal */
283 return kill(pid, SIGUSR1);
```

Comments of ProcSignalReason also say 'We can cope with concurrent 
signals for different reasons'.


```C
 /*
  * Reasons for signaling a Postgres child process (a backend or an 
auxiliary
  * process, like checkpointer).  We can cope with concurrent signals 
for different
  * reasons.  However, if the same reason is signaled multiple times in 
quick
  * succession, the process is likely to observe only one notification 
of

Re: cataloguing NOT NULL constraints

2022-09-20 Thread Alvaro Herrera
On 2022-Sep-20, Isaac Morland wrote:

> On Tue, 20 Sept 2022 at 06:56, Alvaro Herrera 
> wrote:
> 
> > .. though I'm now wondering if there's additional overhead from checking
> > the constraint twice on each row: first the attnotnull bit, then the
> > CHECK itself.  Hmm.  That's probably quite bad.
> 
> Another reason to treat NOT NULL-implementing constraints differently.

Yeah.

> My thinking is that pg_constraint entries for NOT NULL columns are mostly
> an implementation detail. I've certainly never cared whether I had an
> actual constraint corresponding to my NOT NULL columns.

Naturally, all catalog entries are implementation details; a user never
really cares if an entry exists or not, only that the desired semantics
are provided.  In this case, we want the constraint row because it gives
us some additional features, such as the ability to mark NOT NULL
constraints NOT VALID and validating them later, which is a useful thing
to do in large production databases.  We have some hacks to provide part
of that functionality using straight CHECK constraints, but you cannot
cleanly get the `attnotnull` flag set for a column (which means it's
hard to add a primary key, for example).
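For example, the workaround available today looks roughly like this
(hypothetical table and column names):

  -- An ordinary CHECK constraint can be added without an immediate scan and
  -- validated later, which is what matters on large production tables ...
  ALTER TABLE big_table ADD CONSTRAINT big_table_col_not_null
      CHECK (col IS NOT NULL) NOT VALID;
  ALTER TABLE big_table VALIDATE CONSTRAINT big_table_col_not_null;

  -- ... but it never sets pg_attribute.attnotnull for the column:
  SELECT attnotnull FROM pg_attribute
  WHERE attrelid = 'big_table'::regclass AND attname = 'col';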

It is also supposed to fix some inconsistencies, such as disallowing the
removal of a constraint on a table when it is implied by a constraint on
an ancestor table.  Right now we have ad-hoc protections for partitions,
but we don't do that for legacy inheritance.

That said, the patch I posted for this ~10 years ago used a separate
contype and was simpler than what I ended up with now, but amusingly
enough it was returned at the time with the argument that it would be
better to treat them as normal CHECK constraints; so I want to be very
sure that we're not just going around in circles.

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/




Re: making relfilenodes 56 bits

2022-09-20 Thread Amul Sul
On Fri, Sep 9, 2022 at 3:32 PM Dilip Kumar  wrote:
>
> On Thu, Sep 8, 2022 at 4:10 PM Dilip Kumar  wrote:
>
> > On a separate note, while reviewing the latest patch I see there is some 
> > risk of using the unflushed relfilenumber in GetNewRelFileNumber() 
> > function.  Basically, in the current code, the flushing logic is tightly 
> > coupled with the logging new relfilenumber logic and that might not work 
> > with all the values of the VAR_RELNUMBER_NEW_XLOG_THRESHOLD.  So the idea 
> > is we need to keep the flushing logic separate from the logging, I am 
> > working on the idea and I will post the patch soon.
>
> I have fixed the issue, so now we will track nextRelFileNumber,
> loggedRelFileNumber and flushedRelFileNumber.  So whenever
> nextRelFileNumber is just VAR_RELNUMBER_NEW_XLOG_THRESHOLD behind the
> loggedRelFileNumber we will log VAR_RELNUMBER_PER_XLOG more
> relfilenumbers.  And whenever nextRelFileNumber reaches the
> flushedRelFileNumber then we will do XlogFlush for WAL upto the last
> loggedRelFileNumber.  Ideally flushedRelFileNumber should always be
> VAR_RELNUMBER_PER_XLOG number behind the loggedRelFileNumber so we can
> avoid tracking the flushedRelFileNumber.  But I feel keeping track of
> the flushedRelFileNumber looks cleaner and easier to understand.  For
> more details refer to the code in GetNewRelFileNumber().
>

Here are a few minor suggestions I came across while reading this
patch, might be useful:

+#ifdef USE_ASSERT_CHECKING
+
+   {

Unnecessary space after USE_ASSERT_CHECKING.
--

+   return InvalidRelFileNumber;/* placate compiler */

I don't think we needed this after the error on the latest branches.
--

+   LWLockAcquire(RelFileNumberGenLock, LW_SHARED);
+   if (shutdown)
+   checkPoint.nextRelFileNumber = ShmemVariableCache->nextRelFileNumber;
+   else
+   checkPoint.nextRelFileNumber = ShmemVariableCache->loggedRelFileNumber;
+
+   LWLockRelease(RelFileNumberGenLock);

This is done for the good reason, I think, it should have a comment
describing different checkPoint.nextRelFileNumber assignment
need and crash recovery perspective.
--

+#define SizeOfRelFileLocatorBackend \
+   (offsetof(RelFileLocatorBackend, backend) + sizeof(BackendId))

Can append empty parenthesis "()" to the macro name, to look like a
function call at use or change the macro name to uppercase?
--

 +   if (val < 0 || val > MAX_RELFILENUMBER)
..
 if ((relfilenumber) < 0 || (relfilenumber) > MAX_RELFILENUMBER) \

How about adding a macro for this condition as RelFileNumberIsValid()?
We can replace all the checks referring to MAX_RELFILENUMBER with this.

Regards,
Amul




Re: cataloguing NOT NULL constraints

2022-09-20 Thread Isaac Morland
On Tue, 20 Sept 2022 at 06:56, Alvaro Herrera 
wrote:

The NULL checks would still be mostly done by the attnotnull checks
> internally, so there shouldn't be too much of a difference.
>
> .. though I'm now wondering if there's additional overhead from checking
> the constraint twice on each row: first the attnotnull bit, then the
> CHECK itself.  Hmm.  That's probably quite bad.
>

Another reason to treat NOT NULL-implementing constraints differently.

My thinking is that pg_constraint entries for NOT NULL columns are mostly
an implementation detail. I've certainly never cared whether I had an
actual constraint corresponding to my NOT NULL columns. So I think marking
them as such, or a different contype, and excluding them from \d+ display,
probably makes sense. Just need to deal with the issue of trying to create
a constraint and having its name conflict with a NOT NULL constraint. Could
it work to reserve [field name]_notnull for NOT NULL-implementing
constraints? I'd be worried about what happens with field renames; renaming
the constraint automatically seems a bit weird, but maybe…


Re: On login trigger: take three

2022-09-20 Thread Daniel Gustafsson
> On 20 Sep 2022, at 15:43, Sergey Shinderuk  wrote:
> 
> On 02.09.2022 18:36, Daniel Gustafsson wrote:
>> This had bitrotted a fair bit, attached is a rebase along with (mostly)
>> documentation fixes.  0001 adds a generic GUC for ignoring event triggers and
>> 0002 adds the login event with event trigger support, and hooks it up to the
>> GUC such that broken triggers won't require single-user mode.  Moving the CF
>> entry back to Needs Review.

> There is a race around setting and clearing of dathasloginevt.

Thanks for the report!  The whole dathasloginevt mechanism is IMO too clunky
and inelegant to go ahead with; I wouldn't be surprised if there are other bugs
lurking there.  Since the original authors seem to have retired from the patch
(I've only rebased it to try and help) I am inclined to mark it as returned
with feedback.

--
Daniel Gustafsson   https://vmware.com/





Re: why can't a table be part of the same publication as its schema

2022-09-20 Thread Alvaro Herrera
On 2022-Sep-13, Kyotaro Horiguchi wrote:

> At Mon, 12 Sep 2022 04:26:48 +, "houzj.f...@fujitsu.com" 
>  wrote in 

> > I feel we'd better compare the syntax with the existing publication command:
> > FOR ALL TABLES. If you create a publication FOR ALL TABLES, it means 
> > publishing
> > all the tables in the database *including* tables created in the future. I
> > think both the syntax and meaning of ALL TABLES IN SCHEMA are consistent 
> > with
> > the existing FOR ALL TABLES.
> 
> IMHO, I feel closer to Robert. "ALL TABLES IN SCHEMA" sounds like the
> concrete tables at the time of invocation.  While I agree that it is
> not directly comparable to GRANT, 

What if we remove the ALL keyword from there?  That would leave us with
"FOR TABLES IN SCHEMA", which seems to better convey that it doesn't
restrict to current tables in there.


> but if I see "ALTER PUBLICATION p1 ADD SCHEMA s1", I automatically
> translate that into "all tables in the schema s1 at the time of using
> this publication".

... but that translation is wrong if replication supports other kinds of
objects, as it inevitably will in the near future.  Clearly the fact
that we spell out TABLES there is important.  When we add support for
sequences, we could have combinations

ADD [ALL] TABLES IN SCHEMA s
ADD [ALL] SEQUENCES IN SCHEMA s
ADD [ALL] TABLES AND SEQUENCES IN SCHEMA s

and at that point, the unadorned ADD SCHEMA one will become ambiguous.

> At least, it would cause less confusion when it were "ALT PUB p1 DROP
> SCEMA s1" aginst "DROP ALL TABLES IN SCHEMA s1".

I'm not sure what you mean here.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"En las profundidades de nuestro inconsciente hay una obsesiva necesidad
de un universo lógico y coherente. Pero el universo real se halla siempre
un paso más allá de la lógica" (Irulan)




Re: On login trigger: take three

2022-09-20 Thread Sergey Shinderuk

On 02.09.2022 18:36, Daniel Gustafsson wrote:

This had bitrotted a fair bit, attached is a rebase along with (mostly)
documentation fixes.  0001 adds a generic GUC for ignoring event triggers and
0002 adds the login event with event trigger support, and hooks it up to the
GUC such that broken triggers won't require single-user mode.  Moving the CF
entry back to Needs Review.



Hello!

There is a race around setting and clearing of dathasloginevt.

Steps to reproduce:

1. Create a trigger:

  CREATE FUNCTION on_login_proc() RETURNS event_trigger AS $$
  BEGIN
RAISE NOTICE 'You are welcome!';
  END;
  $$ LANGUAGE plpgsql;

  CREATE EVENT TRIGGER on_login_trigger ON login EXECUTE PROCEDURE 
on_login_proc();


2. Then drop it, but don't start new sessions:

  DROP EVENT TRIGGER on_login_trigger;

3. Create another trigger, but don't commit yet:

  BEGIN;
  CREATE EVENT TRIGGER on_login_trigger ON login EXECUTE PROCEDURE 
on_login_proc();


4. Start a new session. This clears dathasloginevt.

5. Commit the transaction:

  COMMIT;

Now we have a trigger, but it doesn't fire, because dathasloginevt=false.


If two sessions create triggers concurrently, one of them will fail.

Steps:

1. In the first session, start a transaction and create a trigger:

  BEGIN;
  CREATE EVENT TRIGGER on_login_trigger1 ON login EXECUTE PROCEDURE 
on_login_proc();


2. In the second session, create another trigger (this query blocks):

  CREATE EVENT TRIGGER on_login_trigger2 ON login EXECUTE PROCEDURE 
on_login_proc();


3. Commit in the first session:

  COMMIT;

The second session fails:

  postgres=# CREATE EVENT TRIGGER on_login_trigger2 ON login EXECUTE 
PROCEDURE on_login_proc();

  ERROR:  tuple concurrently updated



What else bothers me is that login triggers execute in an environment 
under user control and one has to be very careful. The example trigger 
from the documentation


+DECLARE
+  hour integer = EXTRACT('hour' FROM current_time);
+  rec boolean;
+BEGIN
+  -- 1. Forbid logging in between 2AM and 4AM.
+  IF hour BETWEEN 2 AND 4 THEN
+    RAISE EXCEPTION 'Login forbidden';
+  END IF;


can be bypassed with PGOPTIONS='-c timezone=...'. Probably this is 
nothing new and concerns any SECURITY DEFINER function, but still...



Best regards,

--
Sergey Shinderuk    https://postgrespro.com/




Re: why can't a table be part of the same publication as its schema

2022-09-20 Thread Robert Haas
On Mon, Sep 19, 2022 at 11:03 PM Jonathan S. Katz  wrote:
> For #1 (allowing calls that have schema/table overlap...), there appears
> to be both a patch that allows this (reversing[8]), and a suggestion for
> dealing with a corner-case that is reasonable, i.e. disallowing adding
> schemas to a publication when specifying column-lists. Do we think we
> can have consensus on this prior to the RC1 freeze?

I am not sure whether we can or should rush a fix in that fast, but I
agree with this direction.

> For #2 ("ALL TABLES IN SCHEMA" syntax), this was heavily discussed on
> the original thread[1][3][4][5][7]. I thought Tom's proposal on the
> syntax[3] was reasonable as it "future proofs" for when we allow other
> schema-scoped objects to be published and give control over which ones
> can be published.

All right, well, I still don't like it and think it's confusing, but
perhaps I'm in the minority.

> I don't think we should change this behavior that's already in logical
> replication. While I understand the reasons why "GRANT ... ALL TABLES IN
> SCHEMA" has a different behavior (i.e. it's not applied to future
> objects) and do not advocate to change it, I have personally been
> affected where I thought a permission would be applied to all future
> objects, only to discover otherwise. I believe it's more intuitive to
> think that "ALL" applies to "everything, always."

Nah, there's room for multiple behaviors here. It's reasonable to want
to add all the tables currently in the schema to a publication (or
grant permissions on them) and it's reasonable to want to include all
current and future tables in the schema in a publication (or grant
permissions on them) too. The reason I don't like the ALL TABLES IN
SCHEMA syntax is that it sounds like the former, but actually is the
latter. Based on your link to the email from Tom, I understand now the
reason why it's like that, but it's still counterintuitive to me.

> For #3 (more superuser-only), in general I do agree that we shouldn't be
> adding more of these. However, we have in this release, and not just to
> this feature. ALTER SUBSCRIPTION ... SKIP[11] requires superuser. I
> think it's easier for us to "relax" privileges (e.g. w/predefined roles)
> than to make something "superuser-only" in the future, so I don't see
> this being a blocker for v15. The feature will continue to work for
> users even if we remove "superuser-only" in the future.

Yeah, this is clearly not a release blocker, I think.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: HOT chain validation in verify_heapam()

2022-09-20 Thread Robert Haas
On Tue, Sep 20, 2022 at 5:00 AM Himanshu Upadhyaya
 wrote:
> Please find it attached.

This patch still has no test cases. Just as we have test cases for the
existing corruption checks, we should have test cases for these new
corruption checks, showing cases where they actually fire.

I think I would be inclined to set lp_valid[x] = true in both the
redirected and non-redirected case, and then have the very first thing
that the second loop does be if (nextoffnum == 0 ||
!lp_valid[ctx.offnum]) continue. I think that would be more clear
about the intent to ignore line pointers that failed validation. Also,
if you did it that way, then you could catch the case of a redirected
line pointer pointing to another redirected line pointer, which is a
corruption condition that the current code does not appear to check.

+/*
+ * Validation via the predecessor array. 1) If the predecessor's
+ * xmin is aborted or in progress, the current tuples xmin should
+ * be aborted or in progress respectively. Also both xmin's must
+ * be equal. 2) If the predecessor's xmin is not frozen, then
+ * current tuple's shouldn't be either. 3) If the predecessor's
+ * xmin is equal to the current tuple's xmin, the current tuple's
+ * cmin should be greater than the predecessor's cmin. 4) If the
+ * current tuple is not HOT then its predecessor's tuple must not
+ * be HEAP_HOT_UPDATED. 5) If the current tuple is HOT then its
+ * predecessor's tuple must be HEAP_HOT_UPDATED.
+ */

This comment needs to be split up into pieces and the pieces need to
be moved closer to the tests to which they correspond.

+  psprintf("unfrozen tuple was
updated to produce a tuple at offset %u which is not frozen",

Shouldn't this say "which is frozen"?

+ * Not a corruption if current tuple is updated/deleted by a
+ * different transaction, means t_cid will point to cmax (that is
+ * command id of deleting transaction) and cid of predecessor not
+ * necessarily will be smaller than cid of current tuple. t_cid

I think that the next person who reads this code is likely to
understand that the CIDs of different transactions are numerically
unrelated. What's less obvious is that if the XID is the same, the
newer update must have a higher CID.

+ * can hold combo command id but we are not worrying here since
+ * combo command id of the next updated tuple (if present) must be
+ * greater than combo command id of the current tuple. So here we
+ * are not checking HEAP_COMBOCID flag and simply doing t_cid
+ * comparison.

I disapprove of ignoring the HEAP_COMBOCID flag. Emitting a message
claiming that the CID has a certain value when that's actually a combo
CID is misleading, so at least a different message wording is needed
in such cases. But it's also not clear to me that the newer update has
to have a higher combo CID, because combo CIDs can be reused. If you
have multiple cursors open in the same transaction, the updates can be
interleaved, and it seems to me that it might be possible for an older
CID to have created a certain combo CID after a newer CID, and then
both cursors could update the same page in succession and end up with
combo CIDs out of numerical order. Unless somebody can make a
convincing argument that this isn't possible, I think we should just
skip this check for cases where either tuple has a combo CID.

+if (TransactionIdEquals(pred_xmin, curr_xmin) &&
+(TransactionIdEquals(curr_xmin, curr_xmax) ||
+ !TransactionIdIsValid(curr_xmax)) && pred_cmin >= curr_cmin)

I don't understand the reason for the middle part of this condition --
TransactionIdEquals(curr_xmin, curr_xmax) ||
!TransactionIdIsValid(curr_xmax). I suppose the comment is meant to
explain this, but I still don't get it. If a tuple with XMIN 12345
CMIN 2 is updated to produce a tuple with XMIN 12345 CMIN 1, that's
corruption, regardless of what the XMAX of the second tuple may happen
to be.

+if (HeapTupleHeaderIsHeapOnly(curr_htup) &&
+!HeapTupleHeaderIsHotUpdated(pred_htup))

+if (!HeapTupleHeaderIsHeapOnly(curr_htup) &&
+HeapTupleHeaderIsHotUpdated(pred_htup))

I think it would be slightly clearer to write these tests the other
way around i.e. check the previous tuple's state first.

+if (!TransactionIdIsValid(curr_xmax) &&
HeapTupleHeaderIsHotUpdated(tuphdr))
+{
+report_corruption(ctx,
+  psprintf("tuple has been updated, but xmax is 0"));
+result = false;
+}

I guess this message needs to say "tuple has been HOT updated, but
xmax is 0" or something like that.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Summary function for pg_buffercache

2022-09-20 Thread Zhang Mingli
Hi,
On Sep 20, 2022, 20:49 +0800, Melih Mutlu , wrote:
> Hi Zhang,
>
> Those are two different locks.
> The locks that are taken in the patch are for buffer headers. This locks only 
> the current buffer and makes that particular buffer's info consistent within 
> itself.
>
> However, the lock mentioned in the doc is for buffer manager which would 
> prevent changes on any buffer if it's held.
> pg_buffercache_summary (and pg_buffercache_pages) does not hold buffer 
> manager lock. Therefore, consistency across all buffers is not guaranteed.
>
> For pg_buffercache_pages, self-consistent buffer information is useful since 
> it shows each buffer separately.
>
> For pg_buffercache_summary, even self-consistency may not matter much since 
> results are aggregated and we can't see individual buffer information.
> Consistency across all buffers is also not a concern since its purpose is to 
> give an overall idea about the state of buffers.
>
> I see that these two different locks in the same context can be confusing. I 
> hope it is a bit more clear now.
>
> Best,
> Melih
Thanks for your explanation, LGTM.


Re: Summary function for pg_buffercache

2022-09-20 Thread Melih Mutlu
Hi Zhang,

Those are two different locks.
The locks that are taken in the patch are for buffer headers. This locks
only the current buffer and makes that particular buffer's info consistent
within itself.

However, the lock mentioned in the doc is for buffer manager which would
prevent changes on any buffer if it's held.
pg_buffercache_summary (and pg_buffercache_pages) does not hold buffer
manager lock. Therefore, consistency across all buffers is not guaranteed.

For pg_buffercache_pages, self-consistent buffer information is useful
since it shows each buffer separately.

For pg_buffercache_summary, even self-consistency may not matter much since
results are aggregated and we can't see individual buffer information.
Consistency across all buffers is also not a concern since its purpose is
to give an overall idea about the state of buffers.

I see that these two different locks in the same context can be confusing.
I hope it is a bit more clear now.
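For context, a usage sketch (the function is the one added by the patch in
this thread; I am not showing its exact output columns here):

  -- one row of aggregated information about all shared buffers
  SELECT * FROM pg_buffercache_summary();

  -- versus one row per buffer from the existing view
  SELECT count(*) FROM pg_buffercache;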

Best,
Melih

>


Re: Summary function for pg_buffercache

2022-09-20 Thread Zhang Mingli
Hi,

Regards,
Zhang Mingli
On Sep 20, 2022, 20:43 +0800, Aleksander Alekseev , 
wrote:
>
> Correct, the procedure doesn't take the locks of the buffer manager.
> It does take the locks of every individual buffer.
Ah, now I get it, thanks.


Re: Summary function for pg_buffercache

2022-09-20 Thread Aleksander Alekseev
Hi Zhang,

> The doc says we don’t take lock during pg_buffercache_summary, but I see 
> locks in the v8 patch, Isn’t it?
>
> ```
> Similar to pg_buffercache_pages function
>  pg_buffercache_summary doesn't take buffer manager
>  locks [...]
> ```

Correct, the procedure doesn't take the locks of the buffer manager.
It does take the locks of every individual buffer.

I agree that the text is somewhat confusing, but it is consistent with
the current description of pg_buffercache [1]. I think this is a
problem worth addressing but it also seems to be out of scope of the
proposed patch.

[1]: https://www.postgresql.org/docs/current/pgbuffercache.html

-- 
Best regards,
Aleksander Alekseev




Re: Summary function for pg_buffercache

2022-09-20 Thread Zhang Mingli
Hi,

Correct me if I’m wrong.

The doc says we don’t take lock during pg_buffercache_summary, but I see locks 
in the v8 patch, Isn’t it?

```
Similar to pg_buffercache_pages function
 pg_buffercache_summary doesn't take buffer manager
 locks, thus the result is not consistent across all buffers. This is
 intentional. The purpose of this function is to provide a general idea about
 the state of shared buffers as fast as possible. Additionally,
 pg_buffercache_summary allocates much less memory.

```




Regards,
Zhang Mingli
On Sep 20, 2022, 20:10 +0800, Melih Mutlu , wrote:
> Hi,
>
> Seems like cfbot tests are passing now:
> https://cirrus-ci.com/build/4727923671302144
>
> Best,
> Melih
>
> > Melih Mutlu , 20 Eyl 2022 Sal, 14:00 tarihinde şunu 
> > yazdı:
> > > Aleksander Alekseev , 20 Eyl 2022 Sal, 13:57 
> > > tarihinde şunu yazdı:
> > > > > There was a missing empty line in pg_buffercache.out which made the
> > > > > tests fail. Here is a corrected v8 patch.
> > > >
> > > > I was just sending a corrected patch without the missing line.
> > > >
> > > > Thanks a lot for all these reviews and the corrected patch.
> > > >
> > > > Best,
> > > > Melih


Re: Add SPLIT PARTITION/MERGE PARTITIONS commands

2022-09-20 Thread Robert Haas
On Mon, Sep 19, 2022 at 4:42 PM Dmitry Koval  wrote:
> Thanks for comments and advice!
> I thought about this problem and discussed about it with colleagues.
> Unfortunately, I don't know of a good general solution.

Yeah, me neither.

> But for specific situation like this (certain partition is not changing)
> we can add CONCURRENTLY modifier.
> Our DDL query can be like
>
> ALTER TABLE...SPLIT PARTITION [CONCURRENTLY];
>
> With CONCURRENTLY modifier we can lock partitioned table in
> ShareUpdateExclusiveLock mode and split partition - in
> AccessExclusiveLock mode. So we don't lock partitioned table in
> AccessExclusiveLock mode and can modify other partitions during SPLIT
> operation (except split partition).
> If smb try to modify split partition, he will receive error "relation
> does not exist" at end of operation (because split partition will be drop).

I think that a built-in DDL command can't really assume that the user
won't modify anything. You'd have to take a ShareLock.

But you might be able to have a CONCURRENTLY variant of the command
that does the same kind of multi-transaction thing as, e.g., CREATE
INDEX CONCURRENTLY. You would probably have to be quite careful about
race conditions (e.g. you commit the first transaction and before you
start the second one, someone drops or detaches the partition you were
planning to merge or split). Might take some thought, but feels
possibly doable. I've never been excited enough about this kind of
thing to want to put a lot of energy into engineering it, because
doing it "manually" feels so much nicer to me, and doubly so given
that we now have ATTACH CONCURRENTLY and DETACH CONCURRENTLY, but it
does seem like a thing some people would probably use and value.
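For reference, the manual route looks roughly like this today (a sketch with
hypothetical names and ranges; note that DETACH ... CONCURRENTLY cannot run
inside a transaction block):

  ALTER TABLE orders DETACH PARTITION orders_2022 CONCURRENTLY;

  CREATE TABLE orders_2022_h1 (LIKE orders INCLUDING ALL);
  CREATE TABLE orders_2022_h2 (LIKE orders INCLUDING ALL);

  INSERT INTO orders_2022_h1 SELECT * FROM orders_2022 WHERE ordered_at <  '2022-07-01';
  INSERT INTO orders_2022_h2 SELECT * FROM orders_2022 WHERE ordered_at >= '2022-07-01';

  ALTER TABLE orders ATTACH PARTITION orders_2022_h1
      FOR VALUES FROM ('2022-01-01') TO ('2022-07-01');
  ALTER TABLE orders ATTACH PARTITION orders_2022_h2
      FOR VALUES FROM ('2022-07-01') TO ('2023-01-01');

  DROP TABLE orders_2022;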

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Add LSN along with offset to error messages reported for WAL file read/write/validate header failures

2022-09-20 Thread Bharath Rupireddy
On Tue, Sep 20, 2022 at 12:57 PM Alvaro Herrera  wrote:
>
> On 2022-Sep-19, Bharath Rupireddy wrote:
>
> > We have a bunch of messages [1] that have an offset, but not LSN in
> > the error message. Firstly, is there an easiest way to figure out LSN
> > from offset reported in the error messages? If not, is adding LSN to
> > these messages along with offset a good idea? Of course, we can't just
> > convert offset to LSN using XLogSegNoOffsetToRecPtr() and report, but
> > something meaningful like reporting the LSN of the page that we are
> > reading-in or writing-out etc.
>
> Maybe add errcontext() somewhere that reports the LSN would be
> appropriate.  For example, the page_read() callbacks have the LSN
> readily available, so the ones in backend could install the errcontext
> callback; or perhaps ReadPageInternal can do it #ifndef FRONTEND.  Not
> sure what is best of those options, but either of those sounds better
> than sticking the LSN in a lower-level routine that doesn't necessarily
> have the info already.

All of the error messages [1] have the LSN from which offset was
calculated, I think we can just append that to the error messages
(something like " offset %u, LSN %X/%X: %m") and not complicate
it. Thoughts?

[1]
errmsg("could not read from WAL segment %s, offset %u: %m",
errmsg("could not read from WAL segment %s, offset %u: %m",
errmsg("could not write to log file %s "
   "at offset %u, length %zu: %m",
errmsg("unexpected timeline ID %u in WAL segment %s, offset %u",
errmsg("could not read from WAL segment %s, offset %u: read %d of %zu",
pg_log_error("received write-ahead log record for offset %u with no file open",
"invalid magic number %04X in WAL segment %s, offset %u",
"invalid info bits %04X in WAL segment %s, offset %u",
"invalid info bits %04X in WAL segment %s, offset %u",
"unexpected pageaddr %X/%X in WAL segment %s, offset %u",
"out-of-sequence timeline ID %u (after %u) in WAL segment %s, offset %u",

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Summary function for pg_buffercache

2022-09-20 Thread Melih Mutlu
Hi,

Seems like cfbot tests are passing now:
https://cirrus-ci.com/build/4727923671302144

Best,
Melih

Melih Mutlu , 20 Eyl 2022 Sal, 14:00 tarihinde şunu
yazdı:

> Aleksander Alekseev , 20 Eyl 2022 Sal, 13:57
> tarihinde şunu yazdı:
>
>> There was a missing empty line in pg_buffercache.out which made the
>> tests fail. Here is a corrected v8 patch.
>>
>
> I was just sending a corrected patch without the missing line.
>
> Thanks a lot for all these reviews and the corrected patch.
>
> Best,
> Melih
>


Re: Add LSN along with offset to error messages reported for WAL file read/write/validate header failures

2022-09-20 Thread Bharath Rupireddy
On Tue, Sep 20, 2022 at 9:02 AM Nathan Bossart  wrote:
>
> On Mon, Sep 19, 2022 at 03:16:42PM -0700, Nathan Bossart wrote:
> > It seems like you want the opposite of pg_walfile_name_offset().  Perhaps
> > we could add a function like pg_walfile_offset_lsn() that accepts a WAL
> > file name and byte offset and returns the LSN.
>
> Like so...

Yeah, something like this will be handy for sure, but I'm not sure if
we want this to be in core. Let's hear from others.
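If it did go in, usage would presumably look like this (the name and signature
are only what was floated upthread; no such function exists today):

  -- map a WAL file name plus byte offset, as reported in the messages above,
  -- back to an LSN
  SELECT pg_walfile_offset_lsn('000000010000000000000001', 4096);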

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Support pg_attribute_aligned and noreturn in MSVC

2022-09-20 Thread James Coleman
On Mon, Sep 19, 2022 at 11:21 PM Michael Paquier  wrote:
>
> On Mon, Sep 19, 2022 at 08:51:37PM -0400, James Coleman wrote:
> > Yes, fixed.
>
> The CF bot is failing compilation on Windows:
> http://commitfest.cputube.org/james-coleman.html
> https://api.cirrus-ci.com/v1/task/5376566577332224/logs/build.log
>
> There is something going on with noreturn() after applying it to
> elog.h:
> 01:11:00.468] c:\cirrus\src\include\utils\elog.h(410,45): error C2085:
> 'ThrowErrorData': not in formal parameter list (compiling source file
> src/common/hashfn.c) [c:\cirrus\libpgcommon.vcxproj]
> [01:11:00.468] c:\cirrus\src\include\mb\pg_wchar.h(701,80): error
> C2085: 'pgwin32_message_to_UTF16': not in formal parameter list
> (compiling source file src/common/encnames.c)
> [c:\cirrus\libpgcommon.vcxproj]
> [01:11:00.468] c:\cirrus\src\include\utils\elog.h(411,54): error
> C2085: 'pg_re_throw': not in formal parameter list (compiling source
> file src/common/hashfn.c) [c:\cirrus\libpgcommon.vcxproj]
>
> align() seems to look fine, at least.  I'd be tempted to apply the
> align part first, as that's independently useful for itemptr.h.


I don't have access to a Windows machine for testing, but re-reading
the documentation it looks like the issue is that our noreturn macro
is used after the definition while the MSVC equivalent is used before.
I've removed that for now (and commented about it); it's not as
valuable anyway since it's mostly an indicator for code analysis
(human and machine).

James Coleman


v3-0001-Support-pg_attribute_aligned-in-MSVC.patch
Description: Binary data


Re: why can't a table be part of the same publication as its schema

2022-09-20 Thread Amit Kapila
On Tue, Sep 20, 2022 at 2:57 PM Alvaro Herrera  wrote:
>
> On 2022-Sep-20, Amit Kapila wrote:
>
> > On Mon, Sep 19, 2022 at 8:46 PM Alvaro Herrera  
> > wrote:
>
> > > This seems a pretty arbitrary restriction.  It feels like you're adding
> > > this restriction precisely so that you don't have to write the code to
> > > reject the ALTER .. SET SCHEMA if an incompatible configuration is
> > > detected.  But we already have such checks in several cases, so I don't
> > > see why this one does not seem a good idea.
> > >
> > I agree that we have such checks at other places as well and one
> > somewhat similar is in ATPrepChangePersistence().
> >
> > ATPrepChangePersistence()
> > {
> > ...
> > ...
> > /*
> >  * Check that the table is not part of any publication when changing to
> >  * UNLOGGED, as UNLOGGED tables can't be published.
> >  */
>
> Right, I think this is a sensible approach.
>
> > However, another angle to look at it is that we try to avoid adding
> > restrictions in other DDL commands for defined publications.
>
> Well, it makes sense to avoid restrictions wherever possible.  But here,
> the consequence is that you end up with a restriction in the publication
> definition that is not very sensible.  Imagine if you said "you can't
> add schema S because it contains an unlogged table".  It's absurd.
>
> Maybe this can be relaxed in a future release, but it's quite odd.
>

Yeah, we can relax it in a future release based on field experience;
or maybe we can keep the current restriction of not allowing a table
to be added when its schema is already part of the same publication,
and try to relax that later based on field experience as well.

> > The intention was to be in sync with FOR ALL TABLES.
>
> I guess we can change both (FOR ALL TABLES and IN SCHEMA) later.
>

That sounds reasonable.

-- 
With Regards,
Amit Kapila.




Re: Refactor backup related code (was: Is it correct to say, "invalid data in file \"%s\"", BACKUP_LABEL_FILE in do_pg_backup_stop?)

2022-09-20 Thread Bharath Rupireddy
On Tue, Sep 20, 2022 at 7:20 AM Michael Paquier  wrote:
>
> The main regression test suite should not include direct calls to
> pg_backup_start() or pg_backup_stop() as these depend on wal_level,
> and we spend a certain amount of resources in keeping the tests as
> portable as possible across such configurations, wal_level = minimal being
> one of them.  One portable example is in 001_stream_rep.pl.

Understood.

> > That's a good idea. I'm marking a flag if the label name overflows (>
> > MAXPGPATH), later using it in do_pg_backup_start() to error out. We
> > could've thrown error directly, but that changes the behaviour - right
> > now, first "
> > wal_level must be set to \"replica\" or \"logical\" at server start."
> > gets emitted and then label name overflow error - I don't want to do
> > that.
>
> -   if (strlen(backupidstr) > MAXPGPATH)
> +   if (state->name_overflowed == true)
> ereport(ERROR,
> (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
>  errmsg("backup label too long (max %d bytes)",
> It does not strike me as a huge issue to force a truncation of such
> backup label names.  1024 is large enough for basically all users,
> in my opinion.  Or you could just truncate in the internal logic, but
> still complain at MAXPGPATH - 1 as the last byte would be for the zero
> termination.  In short, there is no need to complicate things with
> name_overflowed.

We currently allow MAXPGPATH bytes of label name excluding null
termination. I don't want to change this behaviour. In the attached v9
patch, I'm truncating longer names to MAXPGPATH + 1 bytes in the
backup state (one extra byte to indicate that the name has
overflowed, and another extra byte for null termination).
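A minimal sketch of that scheme (member and variable names are
illustrative, not the exact v9 code):

    /* in the backup state: MAXPGPATH + 1 usable bytes plus the terminator */
    char        name[MAXPGPATH + 2];

    /* at pg_backup_start() time: truncate, don't error out yet */
    strlcpy(state->name, backupidstr, sizeof(state->name));

    /* later, in do_pg_backup_start(), after the wal_level check: */
    if (strlen(state->name) > MAXPGPATH)
        ereport(ERROR,
                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                 errmsg("backup label too long (max %d bytes)",
                        MAXPGPATH)));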

> +   StringInfo  backup_label;   /* backup_label file contents */
> +   StringInfo  tablespace_map; /* tablespace_map file contents */
> +   StringInfo  history_file;   /* history file contents */
> IMV, repeating a point I already made once upthread, BackupState
> should hold none of these.  Let's just generate the contents of these
> files in the contexts where they are needed, making BackupState
> something to rely on to build them in the code paths where they are
> necessary.  This is going to make the reasoning around the memory
> contexts where each one of them is stored much easier and reduce the
> chances of bugs in the long-term.

I've separated out these variables from the backup state, please see
the attached v9 patch.

> > I've also taken help of the error callback mechanism to clean up the
> > allocated memory in case of a failure. For do_pg_abort_backup() cases,
> > I think we don't need to clean the memory because that gets called on
> > proc exit (before_shmem_exit()).
>
> Memory could still bloat while the process running the SQL functions
> is running depending on the error code path, anyway.

I didn't get your point. Can you please elaborate? I think adding
error callbacks at the right places would free up the memory for us.
Please note that we are already causing memory leaks on HEAD today.
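To be concrete, the pattern I mean is the usual ErrorContextCallback
registration, with the callback releasing the allocations; roughly like
this (all names here are made up for the example, not the 0002 code):

    static void
    backup_state_error_callback(void *arg)
    {
        BackupState *state = (BackupState *) arg;

        pfree(state);           /* release what was allocated for the backup */
    }

    ...

    ErrorContextCallback errcallback;

    errcallback.callback = backup_state_error_callback;
    errcallback.arg = (void *) backup_state;
    errcallback.previous = error_context_stack;
    error_context_stack = &errcallback;

    /* ... work that can throw an ERROR ... */

    error_context_stack = errcallback.previous;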

I addressed the above review comments. I also fixed an incorrect comment
[1] that still sits before pg_backup_start() even after the removal of
exclusive backup.

I'm attaching v9 patch set herewith, 0001 - refactors the backup code
with backup state, 0002 - adds error callbacks to clean up the memory
allocated for backup variables. Please review them further.

[1]
 * Essentially what this does is to create a backup label file in $PGDATA,
 * where it will be archived as part of the backup dump.  The label file
 * contains the user-supplied label string (typically this would be used
 * to tell where the backup dump will be stored) and the starting time and
 * starting WAL location for the dump.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
From ab1e86ac7fb75a2d2219c7681ead40faf8c01446 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Tue, 20 Sep 2022 10:04:29 +
Subject: [PATCH v9] Refactor backup related code

Refactor backup related code, advantages of doing so are following:

1) backup state is more structured now - all in a single structure;
callers can create backup_label contents whenever required, either
during pg_backup_start or pg_backup_stop or in between.
2) no parsing of backup_label file lines anymore, and no error checking
for invalid parsing.
3) backup_label and history file contents have most things in common,
so they can now be created within a single function.
4) makes backup related code more extensible and readable.

This introduces new source files xlogbackup.c/.h for backup related
code and adds the new code there. The xlog.c file has already grown
to ~9000 LOC (as of this time). Eventually, we would want to move
all the backup related code from xlog.c, xlogfuncs.c and elsewhere
into it.

Author: Bharath Rupireddy
Reviewed-by: Michael Paquier
Reviewed-by: Fujii Masao
Discussion: ht

Re: binary version of pg_current_wal_insert_lsn and pg_walfile_name functions

2022-09-20 Thread Bharath Rupireddy
On Mon, Sep 19, 2022 at 8:19 PM Ashutosh Sharma  wrote:
>
> Hi All,
>
> Currently, we have pg_current_wal_insert_lsn and pg_walfile_name sql
> functions which gives us information about the next wal insert
> location and the WAL file that the next wal insert location belongs
> to. Can we have a binary version of these sql functions?

+1 for the idea in general.

As said, pg_waldump seems to be the right candidate. I think we want
the LSN of the last WAL record and its info, and the WAL file name,
given an input data directory, just the pg_wal directory, or any
directory where WAL files are located. For instance, one can use this
on an archive location containing archived WAL files or on a node
where pg_receivewal is receiving WAL files. Am I missing any other
use-cases?

pg_waldump currently can't understand compressed and partial files. I
think that we need to fix this as well.

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Proposal to use JSON for Postgres Parser format

2022-09-20 Thread Matthias van de Meent
On Tue, 20 Sept 2022 at 12:00, Alexander Korotkov  wrote:
> On Tue, Sep 20, 2022 at 7:48 AM Tom Lane  wrote:
> > Peter Geoghegan  writes:
> > > On Mon, Sep 19, 2022 at 8:39 PM Tom Lane  wrote:
> > >> Our existing format is certainly not great on those metrics, but
> > >> I do not see how "let's use JSON!" is a route to improvement.
> >
> > > The existing format was designed with developer convenience as a goal,
> > > though -- despite my complaints, and in spite of your objections.
> >
> > As Munro adduces nearby, it'd be a stretch to conclude that the current
> > format was designed with any Postgres-related goals in mind at all.
> > I think he's right that it's a variant of some Lisp-y dump format that's
> > probably far hoarier than even Berkeley Postgres.
> >
> > > If it didn't have to be easy (or even practical) for developers to
> > > directly work with the output format, then presumably the format used
> > > internally could be replaced with something lower level and faster. So
> > > it seems like the two goals (developer ergonomics and faster
> > > interchange format for users) might actually be complementary.
> >
> > I think the principal mistake in what we have now is that the storage
> > format is identical to the "developer friendly" text format (plus or
> > minus some whitespace).  First we need to separate those.  We could
> > have more than one equivalent text format perhaps, and I don't have
> > any strong objection to basing the text format (or one of them) on
> > JSON.
>
> +1 for considering storage format and text format separately.
>
> Let's consider what our criteria could be for the storage format.
>
> 1) Storage effectiveness (shorter is better) and
> serialization/deserialization effectiveness (faster is better). On
> this criterion, the custom binary format looks perfect.
> 2) Robustness in the case of corruption. It seems much easier to
> detect the data corruption and possibly make some partial manual
> recovery for textual format.
> 3) Standardness. It's better to use something known worldwide or at
> least used in other parts of PostgreSQL than something completely
> custom. From this perspective, JSON/JSONB is better than custom
> things.

Allow me to add: compressability

In the thread surrounding [0] there were complaints about the size of
catalogs, and specifically the template database. Significant parts of
that (688kB of 8080kB a fresh PG14 database) are in pg_rewrite, which
consists mostly of serialized Nodes. If we're going to replace our
current NodeToText infrastructure, we'd better know we can effectively
compress this data.

In that same thread, I also suggested that we could try to not emit a
Node's fields if they contain their default values while serializing;
such as the common `:location -1` or `:mynodefield <>`. Those fields
still take up space in the format, while conveying no interesting
information (the absense of that field in the struct definition would
convey the same). It would be useful if this new serialized format
would allow us to do similar tricks cheaply.
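As a strawman, the outfuncs.c side of that could be as simple as a
variant of the existing field macros that checks the default first (the
macro name below is invented for illustration):

    /* like outfuncs.c's WRITE_INT_FIELD, but skip the field when it holds its default */
    #define WRITE_INT_FIELD_SKIP_DEFAULT(fldname, defval) \
        do { \
            if (node->fldname != (defval)) \
                appendStringInfo(str, " :" CppAsString(fldname) " %d", \
                                 node->fldname); \
        } while (0)

    /* e.g. for parse locations: WRITE_INT_FIELD_SKIP_DEFAULT(location, -1); */

The read side would of course need to know the same defaults in order to
reconstruct the omitted fields.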

As for JSON vs JSONB for storage:
I'm fairly certain that JSONB is less compact than JSON (without
taking compression into the picture) due to the 4-byte guaranteed
overhead for each jsonb element; while for JSON that is only 2 bytes
for each (up to 3 when you consider separators, plus potential extra
overhead for escaped values that are unlikely to appear in our catalogs).
Some numbers can be stored more efficiently in JSONB, but only large
numbers and small fractions that we're unlikely to hit in system
views: a back-of-the-envelope calculation puts the cutoff point of
efficient storage between strings-of-decimals and Numeric at >10^12, <
-10^11, or very precise fractional values.

Kind regards,

Matthias van de Meent

[0] 
https://www.postgresql.org/message-id/CAEze2WgGexDM63dOvndLdAWwA6uSmSsc97jmrCuNmrF1JEDK7w%40mail.gmail.com




Re: pg_create_logical_replication_slot argument incongruency

2022-09-20 Thread Florin Irion
Thank you!

On Tue 20 Sep 2022, 12:29 Michael Paquier  wrote:

> On Tue, Sep 20, 2022 at 08:41:56AM +0200, Florin Irion wrote:
> > OK, patch only for the docs attached.
>
> Thanks, applied.
> --
> Michael
>


Re: Patch proposal: make use of regular expressions for the username in pg_hba.conf

2022-09-20 Thread Drouvot, Bertrand

Hi,

On 9/20/22 6:30 AM, Michael Paquier wrote:

On Tue, Sep 20, 2022 at 12:09:33AM -0400, Tom Lane wrote:

You have to assume that somebody (a) has a role or DB name starting
with slash, (b) has an explicit reference to that name in their
pg_hba.conf, (c) doesn't read the release notes, and (d) doesn't
notice that things are misbehaving until after some hacker manages
to break into their installation on the strength of the misbehaving
entry.  OK, I'll grant that the probability of (c) is depressingly
close to unity; but each of the other steps seems quite low probability.
All four of them happening in one installation is something I doubt
will happen.


It is the kind of things that could blow up as a CVE and some bad PR
for the project, so I cannot get excited about enforcing this new rule
in an authentication file (aka before a role is authenticated) while
we are talking about 3~4 code paths (?) that would need an extra check
to make sure that no instances have such object names.


I also have the feeling that having (a), (b) and (d) is low probability.

That said, if the user "/john" already exists and has an hba entry, then
this entry will still match with the patch. The problem is that all the
user names that contain "john" would also now match.


But things get worse if, say, /a is an existing user with an hba entry, as
that entry would match any user name that contains "a" with the patch.


I assume (maybe I should not) that if objects starting with / already
exist, there are very good reasons behind them. So I don't think that
preventing their creation in the DDL would help (quite the contrary for
the ones that really need them).


It looks to me like adding a GUC (off by default) to enable/disable
regexp usage in the hba file could be a fair compromise. It wouldn't block
the creation of any name starting with a / and wouldn't open more doors
(if such objects exist) by default.


Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Summary function for pg_buffercache

2022-09-20 Thread Melih Mutlu
Aleksander Alekseev  wrote on Tue, 20 Sep 2022 at 13:57:

> There was a missing empty line in pg_buffercache.out which made the
> tests fail. Here is a corrected v8 patch.
>

I was just sending a corrected patch without the missing line.

Thanks a lot for all these reviews and the corrected patch.

Best,
Melih


Re: Summary function for pg_buffercache

2022-09-20 Thread Aleksander Alekseev
Hi hackers,

> Here is a corrected patch v7. To me it seems to be in pretty good
> shape, unless cfbot and/or other hackers will report any issues.

There was a missing empty line in pg_buffercache.out which made the
tests fail. Here is a corrected v8 patch.

-- 
Best regards,
Aleksander Alekseev


v8-0001-Added-pg_buffercache_summary-function.patch
Description: Binary data


Re: cataloguing NOT NULL constraints

2022-09-20 Thread Alvaro Herrera
On 2022-Sep-19, Matthias van de Meent wrote:

> I'm not sure on the 'good' part of this alternative, but we could go
> with a single row-based IS NOT NULL to reduce such clutter, utilizing
> the `ROW() IS NOT NULL` requirement of a row only matching IS NOT NULL
> when all attributes are also IS NOT NULL:
> 
> Check constraints:
> "tab_notnull_check" CHECK (ROW(a, b, c, d, e) IS NOT NULL)

There's no way to mark this NOT VALID individually or validate it
afterwards, though.

> But the performance of repeated row-casting would probably not be as
> good as our current NULL checks

The NULL checks would still be mostly done by the attnotnull checks
internally, so there shouldn't be too much of a difference.

.. though I'm now wondering if there's additional overhead from checking
the constraint twice on each row: first the attnotnull bit, then the
CHECK itself.  Hmm.  That's probably quite bad.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/




Re: cataloguing NOT NULL constraints

2022-09-20 Thread Alvaro Herrera
On 2022-Sep-19, Isaac Morland wrote:

> I thought I saw some discussion about the SQL standard saying that there is
> a difference between putting NOT NULL in a column definition, and CHECK
> (column_name NOT NULL). So if we're going to take this seriously,

Was it Peter E.'s reply to this thread?

https://postgr.es/m/bac841ed-b86d-e3c2-030d-02a3db067...@enterprisedb.com

because if it wasn't there, then I would like to know what your source
is.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"Thou shalt not follow the NULL pointer, for chaos and madness await
thee at its end." (2nd Commandment for C programmers)




Re: cataloguing NOT NULL constraints

2022-09-20 Thread Alvaro Herrera
On 2022-Sep-19, Robert Haas wrote:

> On Wed, Aug 17, 2022 at 2:12 PM Alvaro Herrera  
> wrote:

> > 55489 16devel 1776237=# \d tab
> > Tabla «public.tab»
> >  Columna │  Tipo   │ Ordenamiento │ Nulable  │ Por omisión
> > ─┼─┼──┼──┼─
> >  a   │ integer │  │ not null │
> > Restricciones CHECK:
> > "tab_a_not_null" CHECK (a IS NOT NULL)
> 
> In a table with many columns, most of which are NOT NULL, this is
> going to produce a ton of clutter. I don't like that.
> 
> I'm not sure what a good alternative would be, though.

Perhaps that can be solved by displaying the constraint name in the
table:

 55489 16devel 1776237=# \d tab
 Tabla «public.tab»
  Columna │  Tipo   │ Ordenamiento │ Nulable│ Por omisión
 ─┼─┼──┼┼─
  a   │ integer │  │ tab_a_not_null │ 

(Perhaps we can change the column title also, not sure.)

-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/
"The Gord often wonders why people threaten never to come back after they've
been told never to return" (www.actsofgord.com)




Re: pg_create_logical_replication_slot argument incongruency

2022-09-20 Thread Michael Paquier
On Tue, Sep 20, 2022 at 08:41:56AM +0200, Florin Irion wrote:
> OK, patch only for the docs attached.

Thanks, applied.
--
Michael


signature.asc
Description: PGP signature

