Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

2022-07-12 Thread Michael Paquier
On Mon, Jul 11, 2022 at 02:26:46PM +0200, Matthias van de Meent wrote:
> Thanks for reviewing.

I think that v6 is over-engineered because there should be no need to
add a check in xlogreader.c as long as the origin of the problem is
blocked, no?  And the origin here is when the record is assembled.  At
least this is the cleanest solution for HEAD, though not in the
back-branches if we cared about doing something with records already
generated, and I am not sure that we need to care about anything
other than HEAD, TBH.  So it seems to me that there is no need to
create an XLogRecMaxLength which is close to a duplicate of
DecodeXLogRecordRequiredSpace().

@@ -519,7 +549,7 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
   XLogRecPtr *fpw_lsn, int *num_fpi, bool *topxid_included)

 {
XLogRecData *rdt;
-   uint32  total_len = 0;
+   uint64  total_len = 0;
This has no need to change.

My suggestion from upthread was close to what you proposed, but I had
in mind something simpler, along these lines:

+   /*
+    * Ensure that xlogreader.c can read the record.
+    */
+   if (unlikely(!AllocSizeIsValid(DecodeXLogRecordRequiredSpace(total_len))))
+       elog(ERROR, "too much WAL data");

This matches the amount of data allocated by the WAL reader when it
is possible to allocate an oversized record, which relates to the
handling of the circular buffer depending on whether the read is
blocking or not.

Of the two problems at hand, the parts where the APIs are changed
and made more robust with unsigned types, and where block data is
prevented from overflowing its 16-bit limit, are committable, so I'd
like to do that first (still need to check the performance with some
micro-benchmark on XLogRegisterBufData()).  The second part, blocking
the creation of the assembled record, is simpler, though
DecodeXLogRecordRequiredSpace() would now make the path a bit hotter;
we could inline it in the worst case?
--
Michael
From 2031f15fe90c94deb21d4e41b717ada9a8bdb520 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent 
Date: Mon, 20 Jun 2022 10:21:22 +0200
Subject: [PATCH v7] Add protections in xlog record APIs against large numbers
 and overflows.

Before this, it was possible for an extension to create malicious WAL records
that were too large to replay, or that would overflow the xl_tot_len field,
causing potential corruption in WAL record IO ops. Emitting invalid records
was also possible through pg_logical_emit_message(), which allowed emitting
arbitrary amounts of data up to 2GB, much more than the replay limit of 1GB.
---
 src/include/access/xloginsert.h |  4 ++--
 src/backend/access/transam/xloginsert.c | 30 -
 2 files changed, 27 insertions(+), 7 deletions(-)

diff --git a/src/include/access/xloginsert.h b/src/include/access/xloginsert.h
index c04f77b173..aed4643d1c 100644
--- a/src/include/access/xloginsert.h
+++ b/src/include/access/xloginsert.h
@@ -43,12 +43,12 @@ extern void XLogBeginInsert(void);
 extern void XLogSetRecordFlags(uint8 flags);
 extern XLogRecPtr XLogInsert(RmgrId rmid, uint8 info);
 extern void XLogEnsureRecordSpace(int max_block_id, int ndatas);
-extern void XLogRegisterData(char *data, int len);
+extern void XLogRegisterData(char *data, uint32 len);
 extern void XLogRegisterBuffer(uint8 block_id, Buffer buffer, uint8 flags);
 extern void XLogRegisterBlock(uint8 block_id, RelFileLocator *rlocator,
 			  ForkNumber forknum, BlockNumber blknum, char *page,
 			  uint8 flags);
-extern void XLogRegisterBufData(uint8 block_id, char *data, int len);
+extern void XLogRegisterBufData(uint8 block_id, char *data, uint32 len);
 extern void XLogResetInsertion(void);
 extern bool XLogCheckBufferNeedsBackup(Buffer buffer);
 
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index f3c29fa909..3788429b70 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -348,13 +348,13 @@ XLogRegisterBlock(uint8 block_id, RelFileLocator *rlocator, ForkNumber forknum,
  * XLogRecGetData().
  */
 void
-XLogRegisterData(char *data, int len)
+XLogRegisterData(char *data, uint32 len)
 {
 	XLogRecData *rdata;
 
 	Assert(begininsert_called);
 
-	if (num_rdatas >= max_rdatas)
+	if (unlikely(num_rdatas >= max_rdatas))
 		elog(ERROR, "too much WAL data");
 	rdata = &rdatas[num_rdatas++];
 
@@ -386,7 +386,7 @@ XLogRegisterData(char *data, int len)
  * limited)
  */
 void
-XLogRegisterBufData(uint8 block_id, char *data, int len)
+XLogRegisterBufData(uint8 block_id, char *data, uint32 len)
 {
 	registered_buffer *regbuf;
 	XLogRecData *rdata;
@@ -399,8 +399,16 @@ XLogRegisterBufData(uint8 block_id, char *data, int len)
 		elog(ERROR, "no block with id %d registered with WAL insertion",
 			 block_id);
 
-	if (num_rdatas >= max_rdatas)
+	/*
+	 * Check against max_rdatas and ensure we do not register more data per
+	 * buffer than can be handled by the physical data format; i.e. that
+	 * regbuf->rdata_len 

Re: Reducing the chunk header sizes on all memory context types

2022-07-12 Thread Andres Freund
Hi,

On 2022-07-12 10:42:07 -0700, Andres Freund wrote:
> On 2022-07-12 17:01:18 +1200, David Rowley wrote:
> > There is at least one. It might be major; to reduce the AllocSet chunk
> > header from 16 bytes down to 8 bytes I had to get rid of the freelist
> > pointer that was reusing the "aset" field in the chunk header struct.
> > This works now by storing that pointer in the actual palloc'd memory.
> > This could lead to pretty hard-to-trace bugs if we have any code that
> > accidentally writes to memory after pfree.
> 
> Can't we use the same trick for allocations in the freelist as we do for the
> header in a live allocation? I.e. split the 8 byte header into two and use
> part of it to point to the next element in the list using the offset from the
> start of the block, and part of it to indicate the size?

So that doesn't work because the members in the freelist can be in different
blocks and those can be further away from each other.


Perhaps that could still be made to work somehow: to point to a block we
don't actually need 64-bit pointers, since blocks are always at least of
some minimum size, assuming we can allocate them suitably aligned, and
chunks are always 8-byte aligned. Unfortunately that doesn't quite get us
far enough: assuming a 4kB minimum block size (larger than now, but
potentially sensible as a common OS page size), we still only get to
2^12 * 8 = 32kB.

It'd easily work if we made each context have an array of allocated
non-huge blocks, so that the blocks can be addressed with a small index.
The overhead of that could be reduced in the common case by embedding a
small constant-sized array in the AllocSet.  That might actually be worth
trying out.

Greetings,

Andres Freund




Re: Reducing the chunk header sizes on all memory context types

2022-07-12 Thread David Rowley
On Wed, 13 Jul 2022 at 05:42, Andres Freund  wrote:
> > There is at least one. It might be major; to reduce the AllocSet chunk
> > header from 16 bytes down to 8 bytes I had to get rid of the freelist
> > pointer that was reusing the "aset" field in the chunk header struct.
> > This works now by storing that pointer in the actual palloc'd memory.
> > This could lead to pretty hard-to-trace bugs if we have any code that
> > accidentally writes to memory after pfree.
>
> Can't we use the same trick for allocations in the freelist as we do for the
> header in a live allocation? I.e. split the 8 byte header into two and use
> part of it to point to the next element in the list using the offset from the
> start of the block, and part of it to indicate the size?

That can't work as the next freelist item might be on some other block.

David




Re: Reducing the chunk header sizes on all memory context types

2022-07-12 Thread David Rowley
On Wed, 13 Jul 2022 at 05:44, Andres Freund  wrote:
> On 2022-07-12 20:22:57 +0300, Yura Sokolov wrote:
> > I don't get, why "large chunk" needs additional fields for size and
> > offset.
> > Large allocation sizes are certainly rounded to the page size.
> > And allocations which don't fit in 1GB we could easily round to 1MB.
> > Then we could simply store `size>>20`.
> > It will limit MaxAllocHugeSize to `(1<<(30+20))-1`, i.e. 1PB.  Doubtfully
> > we will deal with such huge allocations in the near future.
>
> What would gain by doing something like this? The storage density loss of
> storing an exact size is smaller than what you propose here.

I do agree that the 16-byte additional header overhead for allocations
>= 1GB is not really worth troubling too much over.  However, if there
were some way to always have an 8-byte header, it would simplify some
of the code in places such as AllocSetFree().  For example,
(ALLOC_BLOCKHDRSZ + hdrsize + chunksize) could be simplified at
compile time if hdrsize were a known constant.

I did consider that in all cases where the allocations are above
allocChunkLimit, the chunk is put on a dedicated block and, in fact,
the block offset is always the same for those.  I wondered if we could
use the full 60 bits for the chunk size in those cases.  The reason I
didn't pursue that is:

#define MaxAllocHugeSize (SIZE_MAX / 2)

That's 63 bits, so 60 isn't enough.

Yeah, we likely could reduce that without upsetting anyone.  It feels
like it'll be a while before being unable to allocate a chunk of memory
larger than 1024 petabytes becomes an issue, although I do hope to grow
old enough to one day come back here and laugh at that.

David




Re: minor change for create_list_bounds()

2022-07-12 Thread David Rowley
On Thu, 30 Jun 2022 at 11:41, Nathan Bossart  wrote:
>
> On Tue, Mar 08, 2022 at 11:05:10AM -0800, Zhihong Yu wrote:
> > I was looking at commit db632fbca and noticed that,
> > in create_list_bounds(), if index is added to boundinfo->interleaved_parts
> > in the first if statement, there is no need to perform the second check
> > involving call to partition_bound_accepts_nulls().
>
> Given this change probably doesn't meaningfully impact performance or code
> clarity, I'm personally -1 for this patch.  Is there another motivation
> that I am missing?

While I agree that the gains from making this change are small (it just
amounts to saving a call to bms_add_member() when we've already found
the partition to be interleaved due to interleaved Datum values), I
just disagree with not doing anything about it. My reasons are:

1. This code is new to PG15. We have the opportunity now to make a
meaningful improvement and backpatch it.  When PG15 is out, the bar is
set significantly higher for fixing this type of thing due to having
to consider the additional cost of backpatching conflicts with other
future fixes in that area.
2. I think the code as I just pushed it is easier to understand than
what was there before.
3. I'd like to encourage people to look at and critique our newly
added code. Having a concern addressed seems like a good reward for
the work.

I've now pushed the patch along with some other minor adjustments in the area.

Thanks for the report/patch.

David




Re: PATCH: Add Table Access Method option to pgbench

2022-07-12 Thread Michel Pelletier
On Thu, 30 Jun 2022 at 18:09, Michael Paquier  wrote:

> On Fri, Jul 01, 2022 at 10:06:49AM +0900, Michael Paquier wrote:
> > And the conclusion back then is that one can already achieve this by
> > using PGOPTIONS:
> > PGOPTIONS='-c default_table_access_method=wuzza' pgbench [...]
> >
> > So there is no need to complicate more pgbench, particularly when it
> > comes to partitioned tables where USING is not supported.  Your patch
> > touches this area of the client code to bypass the backend error.
>
> Actually, it could be a good thing to mention that directly in the
> docs of pgbench.
>

I've attached a documentation patch that mentions and links to the
PGOPTIONS documentation per your suggestion.  I'll keep the other patch on
the back burner, perhaps in the future there will be demand for a command
line option as more TAMs are created.

Thanks,

-Michel


> --
> Michael
>
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 37fd80388c..e2d728e0c4 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -317,7 +317,7 @@ UPDATE pg_settings SET setting = reset_val WHERE name = 'configuration_parameter
 

 
-   
+   
 Parameter Interaction via the Shell
 
  
diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index 2acf55c2ac..f15825c293 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -1018,6 +1018,17 @@ pgbench  options  d
always, auto and
never.
   
+
+  
+   The environment variable PGOPTIONS specifies database
+   configuration options that are passed to PostgreSQL via the command line
+   (See ).  For example, a hypothetical
+   default Table Access Method for the tables that pgbench creates
+   called wuzza can be specified with:
+
+PGOPTIONS='-c default_table_access_method=wuzza'
+
+  
  
 
  


Re: Perform streaming logical transactions by background workers and parallel apply

2022-07-12 Thread Peter Smith
Below are my review comments for the v16* patch set:


v16-0001


1.0 

There are places (comments, docs, errmsgs, etc.) in the patch referring
to "parallel mode". I think every one of those references should be
found and renamed to "parallel streaming mode" or "streaming=parallel",
or at the very least make sure that "streaming" is in the same
sentence. IMO it's too vague just to say "parallel" without also
saying that the context is the "streaming" parameter.

I have commented on some of those examples below, but please search
everything anyway (including the docs) to catch the ones I haven't
explicitly mentioned.

==

1.1 src/backend/commands/subscriptioncmds.c

+defGetStreamingMode(DefElem *def)
+{
+ /*
+ * If no value given, assume "true" is meant.
+ */

Please fix this comment to be identical to the one in this pushed patch [1]

==

1.2 .../replication/logical/applybgworker.c - apply_bgworker_start

+ if (list_length(ApplyWorkersFreeList) > 0)
+ {
+ wstate = (ApplyBgworkerState *) llast(ApplyWorkersFreeList);
+ ApplyWorkersFreeList = list_delete_last(ApplyWorkersFreeList);
+ Assert(wstate->pstate->status == APPLY_BGWORKER_FINISHED);
+ }

The Assert that the entries in the free-list are FINISHED seems like
unnecessary checking. IIUC, the code already Asserts that entries are
FINISHED before allowing them into the free-list in the first place.

~~~

1.3 .../replication/logical/applybgworker.c - apply_bgworker_find

+ if (found)
+ {
+ char status = entry->wstate->pstate->status;
+
+ /* If any workers (or the postmaster) have died, we have failed. */
+ if (status == APPLY_BGWORKER_EXIT)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("background worker %u failed to apply transaction %u",
+ entry->wstate->pstate->n,
+ entry->wstate->pstate->stream_xid)));
+
+ Assert(status == APPLY_BGWORKER_BUSY);
+
+ return entry->wstate;
+ }

Why not remove that Assert but change the condition to be:

if (status != APPLY_BGWORKER_BUSY)
ereport(...)

==

1.4 src/backend/replication/logical/proto.c - logicalrep_write_stream_abort

@@ -1163,31 +1163,56 @@ logicalrep_read_stream_commit(StringInfo in,
LogicalRepCommitData *commit_data)
 /*
  * Write STREAM ABORT to the output stream. Note that xid and subxid will be
  * same for the top-level transaction abort.
+ *
+ * If write_abort_lsn is true, send the abort_lsn and abort_time fields.
+ * Otherwise not.
  */

"Otherwise not." -> ", otherwise don't."

~~~

1.5 src/backend/replication/logical/proto.c - logicalrep_read_stream_abort

+ *
+ * If read_abort_lsn is true, try to read the abort_lsn and abort_time fields.
+ * Otherwise not.
  */
 void
-logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
- TransactionId *subxid)
+logicalrep_read_stream_abort(StringInfo in,
+ LogicalRepStreamAbortData *abort_data,
+ bool read_abort_lsn)

"Otherwise not." -> ", otherwise don't."

==

1.6 src/backend/replication/logical/worker.c - file comment

+ * If streaming = parallel, We assign a new apply background worker (if
+ * available) as soon as the xact's first stream is received. The main apply

"We" -> "we" ... or maybe better just remove it completely.

~~~

1.7 src/backend/replication/logical/worker.c - apply_handle_stream_prepare

+ /*
+ * After sending the data to the apply background worker, wait for
+ * that worker to finish. This is necessary to maintain commit
+ * order which avoids failures due to transaction dependencies and
+ * deadlocks.
+ */
+ apply_bgworker_send_data(wstate, s->len, s->data);
+ apply_bgworker_wait_for(wstate, APPLY_BGWORKER_FINISHED);
+ apply_bgworker_free(wstate);

The comment should be changed as you had suggested [2], so that it
will be formatted the same way as a couple of other similar comments.

~~~

1.8 src/backend/replication/logical/worker.c - apply_handle_stream_abort

+ /* Check whether the publisher sends abort_lsn and abort_time. */
+ if (am_apply_bgworker())
+ read_abort_lsn = MyParallelState->server_version >= 16;

This is handling decisions about the read/write of protocol bytes. I
feel like it would be better to check the server *protocol* version
(not the server PostgreSQL version) to make this decision. E.g. this
code should use the new macro you introduced, so it ends up looking
much like how the pgoutput_stream_abort code is doing it.

~~~

1.9 src/backend/replication/logical/worker.c - store_flush_position

@@ -2636,6 +2999,10 @@ store_flush_position(XLogRecPtr remote_lsn)
 {
  FlushPosition *flushpos;

+ /* We only need to collect the LSN in main apply worker */
+ if (am_apply_bgworker())
+ return;
+

SUGGESTION
/* Skip if not the main apply worker */

==

1.10 src/backend/replication/pgoutput/pgoutput.c

@@ -1820,6 +1820,8 @@ pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
XLogRecPtr abort_lsn)
 {
  ReorderBufferTXN *toptxn;
+ bool write_abort_lsn = false;
+ PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;

 

Re: Documentation about PL transforms

2022-07-12 Thread Pavel Stehule
Hi

I am doing a review of this patch and I have two comments.

1. I am not sure that get_call_trftypes is a good name: the prefix
get_call is used when some runtime data is processed, while this
function just returns reformatted data from the system catalogue.
Maybe get_func_trftypes_list, or just replace the existing
get_func_trftypes (now that the list is an array, there should not be
performance issues). For consistency, the new function should be used
in plperl and plpython too. This function is probably not widely used,
so I don't see why we need two almost-equal functions in the code.

2.

+It also contains the OID of the intended procedural language and
whether
+that procedural language is declared as TRUSTED,
useful
+if a single inline handler is supporting more than one procedural
language.

I am not sure this sentence is in the correct place. Maybe it can be
mentioned separately, since generally handlers can be used by more than
one procedural language. But maybe I don't understand this sentence.

Regards

Pavel


Re: PG15 beta1 sort performance regression due to Generation context change

2022-07-12 Thread David Rowley
On Tue, 12 Jul 2022 at 17:15, David Rowley  wrote:
> So far only Robert has raised concerns with this regression for PG15
> (see [2]). Tom voted for leaving things as they are for PG15 in [3].
> John agrees, as quoted above. Does anyone else have any opinion?

Let me handle this slightly differently.  I've moved the open item for
this into the "won't fix" section.  If nobody shouts at me for that
then I'll let that end the debate.  Otherwise, we can consider the
argument when it arrives.

David




Re: remove_useless_groupby_columns is too enthusiastic

2022-07-12 Thread David Rowley
On Wed, 13 Jul 2022 at 05:31, Tom Lane  wrote:
> I tried the attached quick-hack patch that just prevents
> remove_useless_groupby_columns from removing anything that
> appears in ORDER BY.  That successfully fixes the complained-of
> case, and it doesn't change any existing regression test results.
> I'm not sure whether there are any cases that it makes significantly
> worse.

In addition to this, could we also do a pre-verification step to see
if the ORDER BY has anything in it that the GROUP BY does not?

I don't think there's any harm in removing the GROUP BY item if the
pre-check finds something extra in ORDER BY, since
preprocess_groupclause() is going to fail in that case anyway.

David




Re: making relfilenodes 56 bits

2022-07-12 Thread Dilip Kumar
On Tue, Jul 12, 2022 at 7:21 PM Robert Haas  wrote:
>

> In this version, I also removed the struct padding, changed the limit
> on the number of entries to a nice round 64, and made some comment
> updates. I considered trying to go further and actually make the file
> variable-size, so that we never again need to worry about the limit on
> the number of entries, but I don't actually think that's a good idea.
> It would require substantially more changes to the code in this file,
> and that means there's more risk of introducing bugs, and I don't see
> that there's much value anyway, because if we ever do hit the current
> limit, we can just raise the limit.
>
> If we were going to split up durable_rename(), the only intelligible
> split I can see would be to have a second version of the function, or
> a flag to the existing function, that caters to the situation where
> the old file is already known to have been fsync()'d.

The patch looks good except for one minor comment:

+ * corruption.  Since the file might be more tha none standard-size disk
+ * sector in size, we cannot rely on overwrite-in-place. Instead, we generate

typo "more tha none" -> "more than one"

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: proposal: Allocate work_mem From Pool

2022-07-12 Thread Joseph D Wagner

Before I try to answer that, I need to know how the scheduler works.



As I understand the term used, there is no scheduler inside Postgres
for user connections -- they're handled by the OS kernel.


Then, I'm probably using the wrong term. Right now, I have
max_worker_processes set to 16. What happens when query #17
wants some work done? What do you call the thing that handles
that? What is its algorithm for allocating work to the processes?
Or am I completely misunderstanding the role worker processes
play in execution?

Joseph Wagner




Re: POC: GROUP BY optimization

2022-07-12 Thread David Rowley
On Thu, 31 Mar 2022 at 12:19, Tomas Vondra
 wrote:
> Pushed, after going through the patch once more, running check-world
> under valgrind, and updating the commit message.

I'm just in this general area of the code again today and wondered
about the header comment for the preprocess_groupclause() function.

It says:

 * In principle it might be interesting to consider other orderings of the
 * GROUP BY elements, which could match the sort ordering of other
 * possible plans (eg an indexscan) and thereby reduce cost.  We don't
 * bother with that, though.  Hashed grouping will frequently win anyway.

I'd say this commit makes that paragraph mostly obsolete.  It's only
true now in the sense that we don't try orders that suit some index
that would provide pre-sorted results for a GroupAggregate path.  The
comment leads me to believe that we don't do anything at all to find a
better order, and that's not true now.

David




Re: "ERROR: latch already owned" on gharial

2022-07-12 Thread Sandeep Thakkar
Thanks Robert.

We are receiving alerts from buildfarm-admins about anole and gharial
not reporting. Who can help stop these? Thanks

On Wed, Jul 6, 2022 at 1:27 AM Robert Haas  wrote:

> On Sun, Jul 3, 2022 at 11:51 PM Thomas Munro 
> wrote:
> > On Wed, Jun 1, 2022 at 12:55 AM Robert Haas 
> wrote:
> > > OK, I have access to the box now. I guess I might as well leave the
> > > crontab jobs enabled until the next time this happens, since Thomas
> > > just took steps to improve the logging, but I do think these BF
> > > members are overdue to be killed off, and would like to do that as
> > > soon as it seems like a reasonable step to take.
> >
> > A couple of months later, there has been no repeat of that error.  I'd
> > happily forget about that and move on, if you want to decommission
> > these.
>
> I have commented out the BF stuff in crontab on that machine.
>
> --
> Robert Haas
> EDB: http://www.enterprisedb.com
>
>
>

-- 
Sandeep Thakkar


Re: Add --{no-,}bypassrls flags to createuser

2022-07-12 Thread Michael Paquier
On Thu, May 26, 2022 at 04:47:46PM +0900, Kyotaro Horiguchi wrote:
> FWIW, the "fancy" here causes me to think about something likely to
> cause syntax breakage of the query to be sent.
> 
> createuser -a 'user"1' -a 'user"2' 'user"3'
> createuser -v "2023-1-1'; DROP TABLE public.x; select '" hoge

That would mostly mean using spaces here, to make sure that quoting is
correctly applied.

> BUT, these should be prevented by the functions enumerated above. So,
> I don't think we need them.

Mostly.  For example, the test for --valid-until can use a timestamp
with spaces to validate the use of appendStringLiteralConn().  A
second thing is that --member was checked, but not --admin, so I have
renamed regress_user2 to "regress user2" to apply a maximum of
coverage, and applied the patch.

One thing that I found annoying is that this made the list of options
of createuser much harder to follow.  That's not something caused by
this patch, as many options have accumulated across the years, and
there was a kind of pattern where the connection options were listed
first, but I have cleaned that up while on it.  A second area where
this could be done is createdb, as it could be easily expanded if the
backend query gains support for more stuff, but that can happen when
it makes more sense.
--
Michael


signature.asc
Description: PGP signature


Re: enable/disable broken for statement triggers on partitioned tables

2022-07-12 Thread Amit Langote
Hi,

On Fri, Jul 8, 2022 at 3:44 AM Dmitry Koval  wrote:
> I've looked through the code and everything looks good.
> But there is one thing I doubt.
> Patch changes result of test:
> 
>
> create function trig_nothing() returns trigger language plpgsql
>as $$ begin return null; end $$;
> create table parent (a int) partition by list (a);
> create table child1 partition of parent for values in (1);
>
> create trigger tg after insert on parent
>for each row execute procedure trig_nothing();
> select tgrelid::regclass, tgname, tgenabled from pg_trigger
>where tgrelid in ('parent'::regclass, 'child1'::regclass)
>order by tgrelid::regclass::text;
> alter table only parent enable always trigger tg; -- no recursion
> select tgrelid::regclass, tgname, tgenabled from pg_trigger
>where tgrelid in ('parent'::regclass, 'child1'::regclass)
>order by tgrelid::regclass::text;
> alter table parent enable always trigger tg; -- recursion
> select tgrelid::regclass, tgname, tgenabled from pg_trigger
>where tgrelid in ('parent'::regclass, 'child1'::regclass)
>order by tgrelid::regclass::text;
>
> drop table parent, child1;
> drop function trig_nothing();
>
> 
> Results of vanilla + patch:
> 
> CREATE FUNCTION
> CREATE TABLE
> CREATE TABLE
> CREATE TRIGGER
>   tgrelid | tgname | tgenabled
> -++---
>   child1  | tg | O
>   parent  | tg | O
> (2 rows)
>
> ALTER TABLE
>   tgrelid | tgname | tgenabled
> -++---
>   child1  | tg | O
>   parent  | tg | A
> (2 rows)
>
> ALTER TABLE
>   tgrelid | tgname | tgenabled
> -++---
>   child1  | tg | O
>   parent  | tg | A
> (2 rows)
>
> DROP TABLE
> DROP FUNCTION
>
> 
> Results of vanilla:
> 
> CREATE FUNCTION
> CREATE TABLE
> CREATE TABLE
> CREATE TRIGGER
>   tgrelid | tgname | tgenabled
> -++---
>   child1  | tg | O
>   parent  | tg | O
> (2 rows)
>
> ALTER TABLE
>   tgrelid | tgname | tgenabled
> -++---
>   child1  | tg | O
>   parent  | tg | A
> (2 rows)
>
> ALTER TABLE
>   tgrelid | tgname | tgenabled
> -++---
>   child1  | tg | A
>   parent  | tg | A
> (2 rows)
>
> DROP TABLE
> DROP FUNCTION
> 
> The patch doesn't start recursion in case the 'tgenabled' flag of the
> parent table is not changed.
> Probably the vanilla result is more correct.

Thanks for the review and this test case.

I agree that the patch shouldn't have changed that behavior, so I've
fixed the patch so that EnableDisableTrigger() recurses even if the
parent trigger is unchanged.


--
Thanks, Amit Langote
EDB: http://www.enterprisedb.com


v2-0001-Fix-ENABLE-DISABLE-TRIGGER-to-handle-recursion-co.patch
Description: Binary data


Re: Some clean-up work in get_cheapest_group_keys_order()

2022-07-12 Thread David Rowley
On Wed, 13 Jul 2022 at 13:12, Tom Lane  wrote:
>
> David Rowley  writes:
> > On Wed, 13 Jul 2022 at 11:02, Tom Lane  wrote:
> >> Agreed, but I think there are other instances of that idiom that
> >> should be cleaned up while you're at it.
>
> > Agreed.  I imagine we should just do the remaining cleanup in master
> > only. Do you agree?
>
> No objection.

I've now pushed the original patch to 15 and master and also pushed a
cleanup commit to remove the list_truncate(list_copy instances from
master only.

Thanks for looking.

David




Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

2022-07-12 Thread Kyotaro Horiguchi
At Tue, 12 Jul 2022 19:18:22 -0700, Andres Freund  wrote in 
> Hi,
> 
> On 2022-07-13 11:00:07 +0900, Kyotaro Horiguchi wrote:
> > I imagined to use B_INVALID as a kind of "default" partition, which
> > accepts all unknown backend types.
> 
> There shouldn't be any unknown backend types. Something has gone wrong if we
> get far without a backend type set.
> 
> 
> > We can just ignore those values, but then we lose the clue to a malfunction
> > of the stats machinery. I thought of that backend type as the sentinel for
> > malfunctions.  Thus we can emit logs instead.
> > 
> > I feel that the stats machinery shouldn't stop the server if possible,
> > or rather, I think it is an overreaction to abort for invalid values
> > that can easily be coped with.
> 
> I strongly disagree. That just ends up with hard to find bugs.

I was not sure about the policy on that since, as Melanie (and I)
mentioned, GetBackendTypeDesc() gracefully treats invalid values.

Since both of you agree on this point, I'm fine with Assert()ing,
assuming that GetBackendTypeDesc() (and other places where the backend
type is handled) is modified to behave the same way.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

2022-07-12 Thread Andres Freund
Hi,

On 2022-07-13 11:00:07 +0900, Kyotaro Horiguchi wrote:
> I imagined to use B_INVALID as a kind of "default" partition, which
> accepts all unknown backend types.

There shouldn't be any unknown backend types. Something has gone wrong if we
get far without a backend type set.


> We can just ignore those values, but then we lose the clue to a malfunction
> of the stats machinery. I thought of that backend type as the sentinel for
> malfunctions.  Thus we can emit logs instead.
> 
> I feel that the stats machinery shouldn't stop the server if possible,
> or rather, I think it is an overreaction to abort for invalid values
> that can easily be coped with.

I strongly disagree. That just ends up with hard to find bugs.

Greetings,

Andres Freund




Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

2022-07-12 Thread Kyotaro Horiguchi
At Tue, 12 Jul 2022 12:19:06 -0400, Melanie Plageman 
 wrote in 
> > +
> >  >io_ops.stats[backend_type_get_idx(MyBackendType)];
> >
> > backend_type_get_idx(x) is actually (x - 1) plus an assertion on the
> > value range, and the only use-case is here. There's a reverse
> > function, also used in only one place.
> >
> > +   Datum   backend_type_desc =
> > +
> >  CStringGetTextDatum(GetBackendTypeDesc(idx_get_backend_type(i)));
> >
> > In this usage GetBackendTypeDesc() gracefully treats out-of-domain
> > values, but idx_get_backend_type outright kills the process for the
> > same input. This is inconsistent.
> >
> > My humble opinion on this is that we not define the two functions and
> > instead replace the calls to them with (x +/- 1).  In addition, I think
> > we should not abort() on invalid backend types.  In that sense, I
> > wonder if we could use the B_INVALIDth element for this purpose.
> >
> 
> I think that GetBackendTypeDesc() should probably also error out for an
> unknown value.
> 
> I would be open to not using the helper functions. I thought it would be
> less error-prone, but since it is limited to the code in
> pgstat_io_ops.c, it is probably okay. Let me think a bit more.
> 
> Could you explain more about what you mean about using B_INVALID
> BackendType?

I imagined using B_INVALID as a kind of "default" partition, which
accepts all unknown backend types. We could just ignore such values,
but then we would lose the clue that the stats machinery is
malfunctioning. I thought of that backend type as a sentinel for
malfunctions, so we could emit logs instead.

I feel that the stats machinery shouldn't stop the server if at all
possible; aborting on invalid values that can easily be coped with
seems like an overreaction.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: DropRelFileLocatorBuffers

2022-07-12 Thread Kyotaro Horiguchi
At Tue, 12 Jul 2022 10:30:20 -0400, Robert Haas  wrote 
in 
> On Tue, Jul 12, 2022 at 12:07 AM Dilip Kumar  wrote:
> > I think the naming used in your patch looks better to me. So +1 for the 
> > change.
> 
> Committed.

Thank you, Robert and Dilip.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Some clean-up work in get_cheapest_group_keys_order()

2022-07-12 Thread Tom Lane
David Rowley  writes:
> On Wed, 13 Jul 2022 at 11:02, Tom Lane  wrote:
>> Agreed, but I think there are other instances of that idiom that
>> should be cleaned up while you're at it.

> Agreed.  I imagine we should just do the remaining cleanup in master
> only. Do you agree?

No objection.

regards, tom lane




Re: test_oat_hooks bug (was: Re: pgsql: Add copy/equal support for XID lists)

2022-07-12 Thread Michael Paquier
On Tue, Jul 12, 2022 at 05:20:59PM +0200, Alvaro Herrera wrote:
> While looking for a place to host a test for XID lists support, I
> noticed a mistake in test_oat_hooks, fixed as per the attached.

Indeed.  Good catch.
--
Michael


signature.asc
Description: PGP signature


Re: Some clean-up work in get_cheapest_group_keys_order()

2022-07-12 Thread David Rowley
On Wed, 13 Jul 2022 at 11:02, Tom Lane  wrote:
>
> David Rowley  writes:
> > * I think list_truncate(list_copy(list), n) is a pretty bad way to
> > copy the first n elements of a list, especially when n is likely to be
> > 0 most of the time. I think we should just add a function called
> > list_copy_head(). We already have list_copy_tail().
>
> Agreed, but I think there are other instances of that idiom that
> should be cleaned up while you're at it.

Agreed.  I imagine we should just do the remaining cleanup in master
only. Do you agree?

David




Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

2022-07-12 Thread Masahiko Sawada
On Tue, Jul 12, 2022 at 7:59 PM Amit Kapila  wrote:
>
> On Tue, Jul 12, 2022 at 2:53 PM Masahiko Sawada  wrote:
> >
> > On Tue, Jul 12, 2022 at 5:58 PM shiy.f...@fujitsu.com
> >  wrote:
> > >
> > >
> > > It happened when executing the following code because it tried to free
> > > a NULL pointer (catchange_xip).
> > >
> > > /* be tidy */
> > > if (ondisk)
> > > pfree(ondisk);
> > > +   if (catchange_xip)
> > > +   pfree(catchange_xip);
> > >  }
> > >
> > > It seems to be related to the configure options. I could reproduce it
> > > when using `./configure --enable-debug`.
> > > But I couldn't reproduce with `./configure --enable-debug CFLAGS="-Og 
> > > -ggdb"`.
> >
> > Hmm, I could not reproduce this problem even when I use ./configure
> > --enable-debug. And it's weird, since we pfree catchange_xip only after
> > checking that it is not NULL:
> >
>
> Yeah, this looks weird to me as well, but one difference in running the
> tests could be the timing of WAL logging for XLOG_RUNNING_XACTS. That may
> change the timing of SnapBuildSerialize. The other thing we can try is
> checking the value of catchange_xcnt before the pfree.

Yeah, we can try that.

While reading the code, I realized that we try to pfree both ondisk
and catchange_xip also when we jumped to 'out:':

out:
ReorderBufferSetRestartPoint(builder->reorder,
 builder->last_serialized_snapshot);
/* be tidy */
if (ondisk)
pfree(ondisk);
if (catchange_xip)
pfree(catchange_xip);

But we use ondisk and catchange_xip only if we didn't jump to 'out:'.
If this problem is related to compiler optimization around the 'goto'
statement, moving the pfree calls before 'out:' might be worth trying.

>
> BTW, I think ReorderBufferGetCatalogChangesXacts should have an Assert
> to ensure rb->catchange_ntxns and xcnt are equal. We can probably then
> avoid having xcnt_p as an out parameter as the caller can use
> rb->catchange_ntxns instead.
>

Agreed.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: pg15b2: large objects lost on upgrade

2022-07-12 Thread Michael Paquier
On Tue, Jul 12, 2022 at 04:51:44PM -0400, Robert Haas wrote:
> I spent a bunch of time looking at this today and I have more sympathy
> for Justin's previous proposal now. I found it somewhat hacky that he
> was relying on the hard-coded value of LargeObjectRelationId and
> LargeObjectLOidPNIndexId, but I discovered that it's harder to do
> better than I had assumed. Suppose we don't want to compare against a
> hard-coded constant but against the value that is actually present
> before the dump overwrites the pg_class row's relfilenode. Well, we
> can't get that value from the database in question before restoring
> the dump, because restoring the dump either creates or recreates the
> database in all cases. The CREATE DATABASE command that will be part
> of the dump always specifies TEMPLATE template0, so if we want
> something other than a hard-coded constant, we need the
> pg_class.relfilenode values from template0 for pg_largeobject and
> pg_largeobject_loid_pn_index. But we can't connect to that database to
> query those values, because it has datallowconn = false. Oops.
> 
> I have a few more ideas to try here. It occurs to me that we could fix
> this more cleanly if we could get the dump itself to set the
> relfilenode for pg_largeobject to the desired value. Right now, it's
> just overwriting the relfilenode stored in the catalog without
> actually doing anything that would cause a change on disk. But if we
> could make it change the relfilenode in a more principled way that
> would actually cause an on-disk change, then the orphaned-file problem
> would be fixed, because we'd always be installing the new file over
> top of the old file. I'm going to investigate how hard it would be to
> make that work.

Thanks for all the details here.  This originally sounded like the new
cluster was keeping around some orphaned relation files with the old
LOs still stored in them.  But as that's just the freshly initdb'd
relfilenodes of pg_largeobject, it does not strike me as absolutely
critical to fix for v15, since orphaned relfilenodes are an existing
problem.  If we end up with a solution that is simple in design, I'd
be fine with sticking a fix in REL_15_STABLE, but playing with this
stable branch more than necessary may be risky after beta2.  In the
end, I would be fine with dropping the open item now that the main
issue has been fixed.
--
Michael


signature.asc
Description: PGP signature


Re: [PATCH] Log details for client certificate failures

2022-07-12 Thread Jacob Champion
On Mon, Jul 11, 2022 at 6:09 AM Peter Eisentraut
 wrote:
> I squashed those two together.  I also adjusted the error message a bit
> more for project style.  (We can put both lines into detail.)

Oh, okay. Log parsers don't have any issues with that?

> I had to read up on this "ex_data" API.  Interesting.  But I'm wondering
> a bit about how the life cycle of these objects is managed.  What
> happens if the allocated error string is deallocated before its
> containing object?  Or vice versa?

Yeah, I'm currently leaning heavily on the lack of any memory context
switches here. And I end up leaking out a pointer to the stale stack
of be_tls_open_server(), which is gross -- it works since there are no
other clients, but that could probably come back to bite us.

The ex_data API exposes optional callbacks for new/dup/free (I'm
currently setting those to NULL), so we can run custom code whenever
the SSL* is destroyed. If you'd rather the data have the same lifetime
as the SSL* object, we can switch to malloc/strdup/free (or even
OPENSSL_strdup() in later versions). But since we don't have any use
for the ex_data outside of this function, maybe we should just clear
it before we return, rather than carrying it around.

> How do we ensure we don't
> accidentally reuse the error string when the code runs again?  (I guess
> currently it can't?)

The ex_data is associated with the SSL*, not the global SSL_CTX*, so
that shouldn't be an issue. A new SSL* gets created at the start of
be_tls_open_server().

>  Maybe we should avoid this and just put the
> errdetail itself into a static variable that we unset once the message
> is printed?

If you're worried about the lifetime of the palloc'd data being too
short, does switching to a static variable help in that case?

--Jacob




Re: [PATCH] Log details for client certificate failures

2022-07-12 Thread Jacob Champion
On Sat, Jul 9, 2022 at 6:49 AM Graham Leggett  wrote:
> Please don’t invent another format, or try and truncate the data. This is a 
> huge headache when troubleshooting.

I hear you, and I agree that correlating these things across machines
is something we should be making easier. I'm just not convinced that
the particular format you've proposed, with a new set of rules for
quoting and escaping, needs to be part of this patch. (And I think
there are good reasons to truncate unverified cert data, so there'd
have to be clear benefits to offset the risk of opening it up.)

Searching Google for "issuer rdnSequence" comes up with mostly false
positives related to LDAP filtering and certificate dumps, and the
true positives seem to be mail threads that you've participated in. Do
many LDAP servers log certificate failures in this format by default?
(For that matter, does httpd?) The discussion at the time you added
this to httpd [1] seemed to be making the point that this was a niche
format, suited mostly for interaction with LDAP filters -- and Kaspar
additionally pointed out that it's not a canonical format, so all of
our implementations would have to have an ad hoc agreement to choose
exactly one encoding.

If you're using randomized serial numbers, you should be able to grep
for those by themselves and successfully match many different formats,
no? To me, that seems good enough for a first patch, considering we
don't currently log any of this information.

--Jacob

[1] https://lists.apache.org/thread/1665qc4mod7ppp58qk3bqc2l3wtl3lkn




Re: Some clean-up work in get_cheapest_group_keys_order()

2022-07-12 Thread Tom Lane
David Rowley  writes:
> * I think list_truncate(list_copy(list), n) is a pretty bad way to
> copy the first n elements of a list, especially when n is likely to be
> 0 most of the time. I think we should just add a function called
> list_copy_head(). We already have list_copy_tail().

Agreed, but I think there are other instances of that idiom that
should be cleaned up while you're at it.

> I think the first 3 are worth fixing in PG15 since all that code is
> new to that version. The 4th, I'm not so sure about.

I'd say keeping v15 and v16 in sync here is worth something.

regards, tom lane




Some clean-up work in get_cheapest_group_keys_order()

2022-07-12 Thread David Rowley
I was rebasing a patch which requires me to make some changes in
get_cheapest_group_keys_order().  I noticed a few things in there that
I think we could do a little better on:

* The code uses pfree() on a list and it should be using list_free()

* There's a manually coded for loop over a list which seems to be done
so we can skip the first n elements of the list.  for_each_from()
should be used for that.

* I think list_truncate(list_copy(list), n) is a pretty bad way to
copy the first n elements of a list, especially when n is likely to be
0 most of the time. I think we should just add a function called
list_copy_head(). We already have list_copy_tail().

* We could reduce some of the branching in the while loop and just set
cheapest_sort_cost to DBL_MAX to save having to check if we're doing
the first loop.

I think the first 3 are worth fixing in PG15 since all that code is
new to that version. The 4th, I'm not so sure about.

Does anyone else have any thoughts?

David
diff --git a/src/backend/nodes/list.c b/src/backend/nodes/list.c
index 9d8f4fd5c7..b969a52dd6 100644
--- a/src/backend/nodes/list.c
+++ b/src/backend/nodes/list.c
@@ -1584,6 +1584,27 @@ list_copy(const List *oldlist)
return newlist;
 }
 
+/*
+ * Return a shallow copy of the specified list containing only the first 'len'
+ * elements.  If oldlist is shorter than 'len' then we copy the entire list.
+ */
+List *
+list_copy_head(const List *oldlist, int len)
+{
+   List   *newlist;
+
+   len = Min(oldlist->length, len);
+
+   if (len <= 0)
+   return NIL;
+
+   newlist = new_list(oldlist->type, len);
+   memcpy(newlist->elements, oldlist->elements, len * sizeof(ListCell));
+
+   check_list_invariants(newlist);
+   return newlist;
+}
+
 /*
  * Return a shallow copy of the specified list, without the first N elements.
  */
diff --git a/src/backend/optimizer/path/pathkeys.c 
b/src/backend/optimizer/path/pathkeys.c
index 9775c4a722..849abbaaba 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -17,6 +17,8 @@
  */
 #include "postgres.h"
 
+#include 
+
 #include "miscadmin.h"
 #include "access/stratnum.h"
 #include "catalog/pg_opfamily.h"
@@ -591,7 +593,7 @@ get_cheapest_group_keys_order(PlannerInfo *root, double 
nrows,
 
ListCell   *cell;
PathkeyMutatorState mstate;
-   double  cheapest_sort_cost = -1.0;
+   double  cheapest_sort_cost = DBL_MAX;
 
int nFreeKeys;
int nToPermute;
@@ -620,23 +622,23 @@ get_cheapest_group_keys_order(PlannerInfo *root, double 
nrows,
nToPermute = 4;
if (nFreeKeys > nToPermute)
{
-   int i;
PathkeySortCost *costs = palloc(sizeof(PathkeySortCost) * 
nFreeKeys);
+   PathkeySortCost *cost = costs;
 
-   /* skip the pre-ordered pathkeys */
-   cell = list_nth_cell(*group_pathkeys, n_preordered);
-
-   /* estimate cost for sorting individual pathkeys */
-   for (i = 0; cell != NULL; i++, (cell = lnext(*group_pathkeys, 
cell)))
+   /*
+* Estimate cost for sorting individual pathkeys skipping the
+* pre-ordered pathkeys.
+*/
+   for_each_from(cell, *group_pathkeys, n_preordered)
{
-   List   *to_cost = list_make1(lfirst(cell));
-
-   Assert(i < nFreeKeys);
+   PathKey*pathkey = (PathKey *) lfirst(cell);
+   List   *to_cost = list_make1(pathkey);
 
-   costs[i].pathkey = lfirst(cell);
-   costs[i].cost = cost_sort_estimate(root, to_cost, 0, 
nrows);
+   cost->pathkey = pathkey;
+   cost->cost = cost_sort_estimate(root, to_cost, 0, 
nrows);
+   cost++;
 
-   pfree(to_cost);
+   list_free(to_cost);
}
 
/* sort the pathkeys by sort cost in ascending order */
@@ -646,9 +648,9 @@ get_cheapest_group_keys_order(PlannerInfo *root, double 
nrows,
 * Rebuild the list of pathkeys - first the preordered ones, 
then the
 * rest ordered by cost.
 */
-   new_group_pathkeys = list_truncate(list_copy(*group_pathkeys), 
n_preordered);
+   new_group_pathkeys = list_copy_head(*group_pathkeys, 
n_preordered);
 
-   for (i = 0; i < nFreeKeys; i++)
+   for (int i = 0; i < nFreeKeys; i++)
new_group_pathkeys = lappend(new_group_pathkeys, 
costs[i].pathkey);
 
pfree(costs);
@@ -689,7 +691,7 @@ get_cheapest_group_keys_order(PlannerInfo *root, double 
nrows,
 
cost = cost_sort_estimate(root, var_group_pathkeys, 
n_preordered, nrows);
 
- 

Re: Extending outfuncs support to utility statements

2022-07-12 Thread Tom Lane
Peter Eisentraut  writes:
> This is also needed to be able to store utility statements in (unquoted) 
> SQL function bodies.  I have some in-progress code for that that I need 
> to dust off.  IIRC, there are still some nontrivial issues to work 
> through on the reading side.  I don't have a problem with enabling the 
> outfuncs side in the meantime.

BTW, I experimented with trying to enable WRITE_READ_PARSE_PLAN_TREES
for utility statements, and found that the immediate problem is that
Constraint and a couple of other node types lack read functions
(they're the ones marked "custom_read_write, no_read" in parsenodes.h).
They have out functions, so writing the inverses seems like it's just
something nobody ever got around to.  Perhaps there are deeper problems
lurking behind that one, though.

regards, tom lane




Re: making relfilenodes 56 bits

2022-07-12 Thread Robert Haas
On Tue, Jul 12, 2022 at 6:02 PM Andres Freund  wrote:
> MAX_FORKNUM is way lower right now. And hardcoded. So this doesn't imply a new
> restriction. As we iterate over 0..MAX_FORKNUM in a bunch of places (with
> filesystem access each time), it's not feasible to make that number large.

Yeah. TBH, what I'd really like to do is kill the entire fork system
with fire and replace it with something more scalable, which would
maybe permit the sort of thing Hannu suggests here. With the current
system, forget it.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: making relfilenodes 56 bits

2022-07-12 Thread Andres Freund
Hi,

Please don't top quote - as mentioned a couple times recently.

On 2022-07-12 23:00:22 +0200, Hannu Krosing wrote:
> Re: staticAssertStmt(MAX_FORKNUM <= INT8_MAX);
> 
> Have you really thought through making the ForkNum 8-bit?

MAX_FORKNUM is way lower right now. And hardcoded. So this doesn't imply a new
restriction. As we iterate over 0..MAX_FORKNUM in a bunch of places (with
filesystem access each time), it's not feasible to make that number large.

Greetings,

Andres Freund




Re: Extending outfuncs support to utility statements

2022-07-12 Thread Tom Lane
I wrote:
> There might be enough node types that are raw-parse-tree-only,
> but not involved in utility statements, to make it worth
> continuing to suppress readfuncs support for them.  But I kinda
> doubt it.  I'll try to get some numbers later today.

Granting that we want write/read support for utility statements,
it seems that what we can save by suppressing raw-parse-tree-only
nodes is only about 10kB.  That's clearly not worth troubling over
in the grand scheme of things, so I suggest that we just open the
floodgates as attached.

regards, tom lane

diff --git a/src/backend/nodes/gen_node_support.pl b/src/backend/nodes/gen_node_support.pl
index 7694e04d0a..96af17516a 100644
--- a/src/backend/nodes/gen_node_support.pl
+++ b/src/backend/nodes/gen_node_support.pl
@@ -166,23 +166,6 @@ push @scalar_types, qw(EquivalenceClass* EquivalenceMember*);
 # currently not required.
 push @scalar_types, qw(QualCost);
 
-# XXX various things we are not publishing right now to stay level
-# with the manual system
-push @no_read_write,
-  qw(AccessPriv AlterTableCmd CreateOpClassItem FunctionParameter InferClause ObjectWithArgs OnConflictClause PartitionCmd RoleSpec VacuumRelation);
-push @no_read, qw(A_ArrayExpr A_Indices A_Indirection AlterStatsStmt
-  CollateClause ColumnDef ColumnRef CreateForeignTableStmt CreateStatsStmt
-  CreateStmt FuncCall ImportForeignSchemaStmt IndexElem IndexStmt
-  JsonAggConstructor JsonArgument JsonArrayAgg JsonArrayConstructor
-  JsonArrayQueryConstructor JsonCommon JsonFuncExpr JsonKeyValue
-  JsonObjectAgg JsonObjectConstructor JsonOutput JsonParseExpr JsonScalarExpr
-  JsonSerializeExpr JsonTable JsonTableColumn JsonTablePlan LockingClause
-  MultiAssignRef PLAssignStmt ParamRef PartitionElem PartitionSpec
-  PlaceHolderVar PublicationObjSpec PublicationTable RangeFunction
-  RangeSubselect RangeTableFunc RangeTableFuncCol RangeTableSample RawStmt
-  ResTarget ReturnStmt SelectStmt SortBy StatsElem TableLikeClause
-  TriggerTransition TypeCast TypeName WindowDef WithClause XmlSerialize);
-
 
 ## check that we have the expected number of files on the command line
 die "wrong number of input files, expected @all_input_files\n"
@@ -795,14 +778,6 @@ foreach my $n (@node_types)
 	next if elem $n, @nodetag_only;
 	next if elem $n, @no_read_write;
 
-	# XXX For now, skip all "Stmt"s except that ones that were there before.
-	if ($n =~ /Stmt$/)
-	{
-		my @keep =
-		  qw(AlterStatsStmt CreateForeignTableStmt CreateStatsStmt CreateStmt DeclareCursorStmt ImportForeignSchemaStmt IndexStmt NotifyStmt PlannedStmt PLAssignStmt RawStmt ReturnStmt SelectStmt SetOperationStmt);
-		next unless elem $n, @keep;
-	}
-
 	my $no_read = (elem $n, @no_read);
 
 	# output format starts with upper case node type name
@@ -827,13 +802,20 @@ _out${n}(StringInfo str, const $n *node)
 
 ";
 
-	print $rff "
+	if (!$no_read)
+	{
+		my $macro =
+		  (@{ $node_type_info{$n}->{fields} } > 0)
+		  ? 'READ_LOCALS'
+		  : 'READ_LOCALS_NO_FIELDS';
+		print $rff "
 static $n *
 _read${n}(void)
 {
-\tREAD_LOCALS($n);
+\t$macro($n);
 
-" unless $no_read;
+";
+	}
 
 	# print instructions for each field
 	foreach my $f (@{ $node_type_info{$n}->{fields} })
@@ -883,6 +865,7 @@ _read${n}(void)
 			print $rff "\tREAD_LOCATION_FIELD($f);\n" unless $no_read;
 		}
 		elsif ($t eq 'int'
+			|| $t eq 'int16'
 			|| $t eq 'int32'
 			|| $t eq 'AttrNumber'
 			|| $t eq 'StrategyNumber')
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 4d776e7b51..85ac0b5e21 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -415,79 +415,6 @@ _outExtensibleNode(StringInfo str, const ExtensibleNode *node)
 	methods->nodeOut(str, node);
 }
 
-static void
-_outQuery(StringInfo str, const Query *node)
-{
-	WRITE_NODE_TYPE("QUERY");
-
-	WRITE_ENUM_FIELD(commandType, CmdType);
-	WRITE_ENUM_FIELD(querySource, QuerySource);
-	/* we intentionally do not print the queryId field */
-	WRITE_BOOL_FIELD(canSetTag);
-
-	/*
-	 * Hack to work around missing outfuncs routines for a lot of the
-	 * utility-statement node types.  (The only one we actually *need* for
-	 * rules support is NotifyStmt.)  Someday we ought to support 'em all, but
-	 * for the meantime do this to avoid getting lots of warnings when running
-	 * with debug_print_parse on.
-	 */
-	if (node->utilityStmt)
-	{
-		switch (nodeTag(node->utilityStmt))
-		{
-			case T_CreateStmt:
-			case T_IndexStmt:
-			case T_NotifyStmt:
-			case T_DeclareCursorStmt:
-WRITE_NODE_FIELD(utilityStmt);
-break;
-			default:
-appendStringInfoString(str, " :utilityStmt ?");
-break;
-		}
-	}
-	else
-		appendStringInfoString(str, " :utilityStmt <>");
-
-	WRITE_INT_FIELD(resultRelation);
-	WRITE_BOOL_FIELD(hasAggs);
-	WRITE_BOOL_FIELD(hasWindowFuncs);
-	WRITE_BOOL_FIELD(hasTargetSRFs);
-	WRITE_BOOL_FIELD(hasSubLinks);
-	WRITE_BOOL_FIELD(hasDistinctOn);
-	WRITE_BOOL_FIELD(hasRecursive);
-	WRITE_BOOL_FIELD(hasModifyingCTE);
-	

Re: System catalog documentation chapter

2022-07-12 Thread Tom Lane
Bruce Momjian  writes:
> On Tue, Jul 12, 2022 at 08:56:36PM +0200, Peter Eisentraut wrote:
>> views.sgml is a pretty generic name for a chapter that just contains system
>> views.

> Yes, I struggled with that.  What made me choose "views" is that the
> current name was catalogs.sgml, not syscatalogs.sgml.  Is it acceptable
> to use catalogs.sgml and sysviews.sgml?

"catalogs" isn't too confusable with user-defined objects, so I think
that name is fine --- and anyway it has decades of history so changing
it seems unwise.

We seem to have been trending towards less-abbreviated .sgml file names
over time, so personally I'd go for system-views.sgml.

regards, tom lane




Re: Reducing Memory Consumption (aset and generation)

2022-07-12 Thread Ranier Vilela
On Tue, Jul 12, 2022 at 02:35, David Rowley 
wrote:

> On Mon, 11 Jul 2022 at 20:48, Matthias van de Meent
>  wrote:
> > > 2) v1-002-generation-reduces-memory-consumption.patch
> > > Reduces memory used by struct GenerationBlock by 8 bytes,
> >
> > That seems fairly straight-forward -- 8 bytes saved on each page isn't
> > a lot, but it's something.
>
> I think 002 is likely the only patch here that has some merit.
> However, it's hard to imagine any measurable performance gains from
> it.  I think the smallest generation block we have today is 8192
> bytes. Saving 8 bytes in that equates to a saving of 0.1% of memory.
> For an 8MB page, it's 1024 times less than that.
>

> I imagine Ranier has been working on this due the performance
> regression mentioned in [1].  I think it'll be much more worthwhile to
> aim to reduce the memory chunk overheads rather than the block
> overheads, as Ranier is doing here. I posted a patch in [2] which does
> that.  To make that work, I need to have the owning context in the
> block. The 001 and 003 patch seems to remove those here.
>
I saw the numbers at [2]; 17% is very impressive.
Since you need the owning context in the block, 001 and 003
are more of a hindrance than a help.

So, feel free to incorporate 002 into your patch if you wish.
The best thing to do here is to close this entry and withdraw it
from the commitfest.

regards,
Ranier Vilela


Re: System catalog documentation chapter

2022-07-12 Thread Bruce Momjian
On Tue, Jul 12, 2022 at 08:56:36PM +0200, Peter Eisentraut wrote:
> On 08.07.22 21:32, Bruce Momjian wrote:
> > On Fri, Jul  8, 2022 at 12:07:45PM -0400, Bruce Momjian wrote:
> > > On Fri, Jul  8, 2022 at 11:49:47AM -0400, Tom Lane wrote:
> > > > Bruce Momjian  writes:
> > > > > Agreed. I don't want to break links into the documentation in final
> > > > > released versions, so head and PG15 seem wise.
> > > > 
> > > > I would not expect this to change the doc URLs for any individual
> > > > catalogs or views --- if it does, I won't be happy.
> > > 
> > > Good point --- I thought the chapter was in the URL, but I now see it is
> > > just the section heading:
> > > 
> > >   https://www.postgresql.org/docs/devel/view-pg-available-extensions.html
> > > 
> > > so I guess we can backpatch this with no issues.
> > 
> > Attached is a patch to accomplish this.  Its output can be seen here:
> > 
> > https://momjian.us/tmp/pgsql/internals.html
> 
> views.sgml is a pretty generic name for a chapter that just contains system
> views.

Yes, I struggled with that.  What made me choose "views" is that the
current name was catalogs.sgml, not syscatalogs.sgml.  Is it acceptable
to use catalogs.sgml and sysviews.sgml?

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Indecision is a decision.  Inaction is an action.  Mark Batterson





Re: System catalog documentation chapter

2022-07-12 Thread Bruce Momjian
On Tue, Jul 12, 2022 at 08:56:01PM +0200, Peter Eisentraut wrote:
> On 08.07.22 18:07, Bruce Momjian wrote:
> > so I guess we can backpatch this with no issues.
> 
> It inserts a new chapter, which would renumber all other chapters. That's a
> pretty big change to backpatch.  I'm against that.

Okay, I can see the renumbering as being confusing so I will do PG 15
and head only.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Indecision is a decision.  Inaction is an action.  Mark Batterson





Re: should check interrupts in BuildRelationExtStatistics ?

2022-07-12 Thread Tom Lane
Zhihong Yu  writes:
> Looking at the files under src/backend/utils/sort/, looks like license
> header is missing from qsort_interruptible.c

[ shrug ... ] qsort.c and qsort_arg.c don't have that either.

regards, tom lane




Re: making relfilenodes 56 bits

2022-07-12 Thread Hannu Krosing
Re: staticAssertStmt(MAX_FORKNUM <= INT8_MAX);

Have you really thought through making the ForkNum 8-bit?

For example, this would limit columnar storage with each column
stored in its own fork (which I'd say is not entirely unreasonable)
to about ~250 columns.

And there can easily be other use cases where we do not want to limit
the number of forks so much.

Cheers
Hannu

On Tue, Jul 12, 2022 at 10:36 PM Robert Haas  wrote:
>
> On Tue, Jul 12, 2022 at 1:09 PM Andres Freund  wrote:
> > What does currently happen if we exceed that?
>
> elog
>
> > > diff --git a/src/include/utils/wait_event.h 
> > > b/src/include/utils/wait_event.h
> > > index b578e2ec75..5d3775ccde 100644
> > > --- a/src/include/utils/wait_event.h
> > > +++ b/src/include/utils/wait_event.h
> > > @@ -193,7 +193,7 @@ typedef enum
> > >   WAIT_EVENT_LOGICAL_REWRITE_TRUNCATE,
> > >   WAIT_EVENT_LOGICAL_REWRITE_WRITE,
> > >   WAIT_EVENT_RELATION_MAP_READ,
> > > - WAIT_EVENT_RELATION_MAP_SYNC,
> > > + WAIT_EVENT_RELATION_MAP_RENAME,
> >
> > Very minor nitpick: To me REPLACE would be a bit more accurate than RENAME,
> > since it includes fsync etc?
>
> Sure, I had it that way for a while and changed it at the last minute.
> I can change it back.
>
> --
> Robert Haas
> EDB: http://www.enterprisedb.com
>
>




Re: should check interrupts in BuildRelationExtStatistics ?

2022-07-12 Thread Zhihong Yu
On Tue, Jul 12, 2022 at 1:31 PM Tom Lane  wrote:

> I wrote:
> > Thomas Munro  writes:
> >> On Wed, Jul 6, 2022 at 11:37 AM Tom Lane  wrote:
> >>> qsort_interruptible
>
> >> +1
>
> > So here's a patch that does it that way.
>
> Hearing no comments, pushed.
>
> regards, tom lane
>
>

Hi,
Looking at the files under src/backend/utils/sort/, it looks like the
license header is missing from qsort_interruptible.c.

Please consider the attached patch, which adds the license header to qsort_interruptible.c.

Cheers


qsort-interruptible-copyright.patch
Description: Binary data


Re: pg15b2: large objects lost on upgrade

2022-07-12 Thread Robert Haas
On Mon, Jul 11, 2022 at 9:16 AM Robert Haas  wrote:
> I am not saying we shouldn't try to fix this up more thoroughly, just
> that I think you are overestimating the consequences.

I spent a bunch of time looking at this today and I have more sympathy
for Justin's previous proposal now. I found it somewhat hacky that he
was relying on the hard-coded value of LargeObjectRelationId and
LargeObjectLOidPNIndexId, but I discovered that it's harder to do
better than I had assumed. Suppose we don't want to compare against a
hard-coded constant but against the value that is actually present
before the dump overwrites the pg_class row's relfilenode. Well, we
can't get that value from the database in question before restoring
the dump, because restoring the dump either creates or recreates the
database in all cases. The CREATE DATABASE command that will be part
of the dump always specifies TEMPLATE template0, so if we want
something other than a hard-coded constant, we need the
pg_class.relfilenode values from template0 for pg_largeobject and
pg_largeobject_loid_pn_index. But we can't connect to that database to
query those values, because it has datallowconn = false. Oops.

I have a few more ideas to try here. It occurs to me that we could fix
this more cleanly if we could get the dump itself to set the
relfilenode for pg_largeobject to the desired value. Right now, it's
just overwriting the relfilenode stored in the catalog without
actually doing anything that would cause a change on disk. But if we
could make it change the relfilenode in a more principled way that
would actually cause an on-disk change, then the orphaned-file problem
would be fixed, because we'd always be installing the new file over
top of the old file. I'm going to investigate how hard it would be to
make that work.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: making relfilenodes 56 bits

2022-07-12 Thread Robert Haas
On Tue, Jul 12, 2022 at 1:09 PM Andres Freund  wrote:
> What does currently happen if we exceed that?

elog

> > diff --git a/src/include/utils/wait_event.h b/src/include/utils/wait_event.h
> > index b578e2ec75..5d3775ccde 100644
> > --- a/src/include/utils/wait_event.h
> > +++ b/src/include/utils/wait_event.h
> > @@ -193,7 +193,7 @@ typedef enum
> >   WAIT_EVENT_LOGICAL_REWRITE_TRUNCATE,
> >   WAIT_EVENT_LOGICAL_REWRITE_WRITE,
> >   WAIT_EVENT_RELATION_MAP_READ,
> > - WAIT_EVENT_RELATION_MAP_SYNC,
> > + WAIT_EVENT_RELATION_MAP_RENAME,
>
> Very minor nitpick: To me REPLACE would be a bit more accurate than RENAME,
> since it includes fsync etc?

Sure, I had it that way for a while and changed it at the last minute.
I can change it back.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: should check interrupts in BuildRelationExtStatistics ?

2022-07-12 Thread Tom Lane
I wrote:
> Thomas Munro  writes:
>> On Wed, Jul 6, 2022 at 11:37 AM Tom Lane  wrote:
>>> qsort_interruptible

>> +1

> So here's a patch that does it that way.

Hearing no comments, pushed.

regards, tom lane




Re: automatically generating node support functions

2022-07-12 Thread Tom Lane
Peter Eisentraut  writes:
> On 11.07.22 19:57, Tom Lane wrote:
>> So at this point I'm rather attracted to the idea of reverting to
>> a manually-maintained NodeTag enum.  We know how to avoid ABI
>> breakage with that, and it's not exactly the most painful part
>> of adding a new node type.

> One of the nicer features is that you now get to see the numbers 
> assigned to the enum tags, like

>  T_LockingClause = 91,
>  T_XmlSerialize = 92,
>  T_PartitionElem = 93,

> so that when you get an error like "unsupported node type: %d", you can 
> just look up what it is.

Yeah, I wasn't thrilled about reverting that either.  I think the
defenses I installed in eea9fa9b2 should be sufficient to deal
with the risk.

regards, tom lane




Re: Remove trailing newlines from pg_upgrade's messages

2022-07-12 Thread Tom Lane
Kyotaro Horiguchi  writes:
> FWIW, the following change makes sense to me according to the spec of
> validate_exec()...

> diff --git a/src/bin/pg_upgrade/exec.c b/src/bin/pg_upgrade/exec.c
> index fadeea12ca..3cff186213 100644
> --- a/src/bin/pg_upgrade/exec.c
> +++ b/src/bin/pg_upgrade/exec.c
> @@ -430,10 +430,10 @@ check_exec(const char *dir, const char *program, bool 
> check_version)
>   ret = validate_exec(path);
 
>   if (ret == -1)
> - pg_fatal("check for \"%s\" failed: not a regular file\n",
> + pg_fatal("check for \"%s\" failed: does not exist or inexecutable\n",
>path);
>   else if (ret == -2)
> - pg_fatal("check for \"%s\" failed: cannot execute (permission denied)\n",
> + pg_fatal("check for \"%s\" failed: cannot read (permission denied)\n",
>path);
 
>   snprintf(cmd, sizeof(cmd), "\"%s\" -V", path);

I initially did this, but then I wondered why validate_exec() was
making it so hard: why can't we just report the failure with %m?
It turns out to take only a couple extra lines of code to ensure
that something more-or-less appropriate is returned, so we don't
need to guess about it here.  Pushed that way.

regards, tom lane




Re: Remove trailing newlines from pg_upgrade's messages

2022-07-12 Thread Tom Lane
Peter Eisentraut  writes:
> On 14.06.22 20:57, Tom Lane wrote:
>> Hence, the patch below removes trailing newlines from all of
>> pg_upgrade's message strings, and teaches its logging infrastructure
>> to print them where appropriate.  As in logging.c, there's now an
>> Assert that no format string passed to pg_log() et al ends with
>> a newline.

> This patch looks okay to me.  I compared the output before and after in 
> a few scenarios and didn't see any problematic differences.

Thanks, pushed after rebasing and adjusting some recently-added messages.

> In this particular patch, the few empty lines that disappeared don't 
> bother me.  In general, however, I think we can just fprintf(stderr, 
> "\n") directly as necessary.

Hmm, if anyone wants to do that I think it'd be advisable to invent
"pg_log_blank_line()" or something like that, so as to preserve the
logging abstraction layer.  But it's moot unless anyone's interested
enough to send a patch for that.  I'm not.

(I think it *would* be a good idea to try to get rid of the leading
newlines that appear in some of the messages, as discussed upthread.
But I'm not going to trouble over that right now either.)

regards, tom lane




Re: WIN32 pg_import_system_collations

2022-07-12 Thread Juan José Santamaría Flecha
Please find attached a rebased version. I have split the patch into two
parts to make it easier to review: one with the code changes and the
other with the test.

Other than that, there are only minimal changes to the code from the
previous version, due to the update of _WIN32_WINNT and enabling the
test on Cirrus.

Regards,

Juan José Santamaría Flecha



v6-0001-WIN32-pg_import_system_collations.patch
Description: Binary data


v6-0002-Add-collate.windows.win1252-test.patch
Description: Binary data


Re: automatically generating node support functions

2022-07-12 Thread Peter Eisentraut

On 11.07.22 19:57, Tom Lane wrote:
> So at this point I'm rather attracted to the idea of reverting to
> a manually-maintained NodeTag enum.  We know how to avoid ABI
> breakage with that, and it's not exactly the most painful part
> of adding a new node type.

One of the nicer features is that you now get to see the numbers
assigned to the enum tags, like

T_LockingClause = 91,
T_XmlSerialize = 92,
T_PartitionElem = 93,

so that when you get an error like "unsupported node type: %d", you can
just look up what it is.






Re: Making CallContext and InlineCodeBlock less special-case-y

2022-07-12 Thread Peter Eisentraut

On 12.07.22 01:01, Tom Lane wrote:
> I wrote:
>> Peter Eisentraut  writes:
>>> On 10.07.22 01:50, Tom Lane wrote:
>>>> As committed, gen_node_support.pl excludes CallContext and InlineCodeBlock
>>>> from getting unneeded support functions via some very ad-hoc code.
>>>
>>> Couldn't we just enable those support functions?  I think they were just
>>> excluded because they didn't have any before and nobody bothered to make
>>> any.
>>
>> Well, we could I suppose, but that path leads to a lot of dead code in
>> backend/nodes/ --- obviously these two alone are negligible, but I want
>> a story other than "it's a hack" for execnodes.h and the other files
>> we exclude from generation of support code.
>
> Here's a proposed patch for this bit.  Again, whether these two
> node types have unnecessary support functions is not the point ---
> obviously we could afford to waste that much space.  Rather, what
> I'm after is to have a more explainable and flexible way of dealing
> with the file-level exclusions applied to a lot of other node types.
> This patch doesn't make any change in the script's output now, but
> it gives us flexibility for the future.

Yeah, looks reasonable.




Re: System catalog documentation chapter

2022-07-12 Thread Peter Eisentraut

On 08.07.22 21:32, Bruce Momjian wrote:
> On Fri, Jul  8, 2022 at 12:07:45PM -0400, Bruce Momjian wrote:
>> On Fri, Jul  8, 2022 at 11:49:47AM -0400, Tom Lane wrote:
>>> Bruce Momjian  writes:
>>>> Agreed. I don't want to break links into the documentation in final
>>>> released versions, so head and PG15 seem wise.
>>>
>>> I would not expect this to change the doc URLs for any individual
>>> catalogs or views --- if it does, I won't be happy.
>>
>> Good point --- I thought the chapter was in the URL, but I now see it is
>> just the section heading:
>>
>> https://www.postgresql.org/docs/devel/view-pg-available-extensions.html
>>
>> so I guess we can backpatch this with no issues.
>
> Attached is a patch to accomplish this.  Its output can be seen here:
>
> https://momjian.us/tmp/pgsql/internals.html

views.sgml is a pretty generic name for a chapter that just contains
system views.





Re: System catalog documentation chapter

2022-07-12 Thread Peter Eisentraut

On 08.07.22 18:07, Bruce Momjian wrote:
> so I guess we can backpatch this with no issues.

It inserts a new chapter, which would renumber all other chapters.
That's a pretty big change to backpatch.  I'm against that.





Re: First draft of the PG 15 release notes

2022-07-12 Thread Bruce Momjian
On Mon, Jul 11, 2022 at 11:31:32PM -0700, Noah Misch wrote:
> On Mon, Jul 11, 2022 at 12:39:57PM -0400, Bruce Momjian wrote:
> > I had trouble reading the sentences in the order you used so I
> > restructured it:
> > 
> > The new default is one of the secure schema usage patterns that
> > <xref linkend="ddl-schemas-patterns"/> has recommended since the security
> > release for CVE-2018-1058.  The change applies to newly-created
> > databases in existing clusters and for new clusters.  Upgrading a
> > cluster or restoring a database dump will preserve existing permissions.
> 
> I agree with the sentence order change.

Great.

> > For existing databases, especially those having multiple users, consider
> > issuing REVOKE to adopt this new default.  For new
> > databases having zero need to defend against insider threats, granting
> > USAGE permission on their public
> > schemas will yield the behavior of prior releases.
> 
> s/USAGE/CREATE/ in the last sentence.  Looks good with that change.

Ah, yes, of course.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Indecision is a decision.  Inaction is an action.  Mark Batterson





Re: Cleaning up historical portability baggage

2022-07-12 Thread Tom Lane
Andres Freund  writes:
> Redefining functions, be it by linking in something or by redefining function
> names via macros, is a mess. There's places where we then have to undefine
> some of these things to be able to include external headers etc. Some
> functions are only replaced in backends, others in frontend too. It makes it
> hard to know what exactly the assumed set of platform primitives is. Etc.

In the cases at hand, we aren't doing that, are we?  The replacement
function is only used on platforms that lack the relevant POSIX function,
so it's hard to argue that we're replacing anything.

regards, tom lane




Re: Reducing the chunk header sizes on all memory context types

2022-07-12 Thread Andres Freund
Hi,

On 2022-07-12 20:22:57 +0300, Yura Sokolov wrote:
> I don't get why a "large chunk" needs additional fields for size and
> offset.
> Large allocation sizes are certainly rounded to page size.
> And allocations which don't fit in 1GB we could easily round to 1MB.
> Then we could simply store `size>>20`.
> It will limit MaxAllocHugeSize to `(1<<(30+20))-1` - 1PB. Doubtfully we
> will deal with such huge allocations in the near future.

What would gain by doing something like this? The storage density loss of
storing an exact size is smaller than what you propose here.

Greetings,

Andres Freund




Re: Reducing the chunk header sizes on all memory context types

2022-07-12 Thread Andres Freund
Hi,

On 2022-07-12 17:01:18 +1200, David Rowley wrote:
> I've taken Andres' patch and made some quite significant changes to
> it. In the patch's current state, the sort performance regression in
> PG15 vs PG14 is fixed. The generation.c context chunk header has been
> reduced to 8 bytes from the previous 24 bytes as it is in master.
> aset.c context chunk header is now 8 bytes instead of 16 bytes.

I think this is *very* cool. But I might be biased :)


> There is at least one. It might be major; to reduce the AllocSet chunk
> header from 16 bytes down to 8 bytes I had to get rid of the freelist
> pointer that was reusing the "aset" field in the chunk header struct.
> This works now by storing that pointer in the actual palloc'd memory.
> This could lead to pretty hard-to-trace bugs if we have any code that
> accidentally writes to memory after pfree.

Can't we use the same trick for allocations in the freelist as we do for the
header in a live allocation? I.e. split the 8 byte header into two and use
part of it to point to the next element in the list using the offset from the
start of the block, and part of it to indicate the size?

Greetings,

Andres Freund




Re: Cleaning up historical portability baggage

2022-07-12 Thread Andres Freund
Hi,

On 2022-07-12 08:01:40 -0400, Robert Haas wrote:
> On Mon, Jul 11, 2022 at 9:11 PM Thomas Munro  wrote:
> > Hmm, but that's not what we're doing in general.  For example, on
> > Windows we're redirecting open() to a replacement function of our own,
> > we're not using "pg_open()" in our code.  That's not an example based
> > on AC_REPLACE_FUNCS, but there are plenty of those too.  Isn't this
> > quite well established?
> 
> Yes. I just don't care for it.
> 
> Sounds like I'm in the minority, though.

I agree with you, at least largely.

Redefining functions, be it by linking in something or by redefining function
names via macros, is a mess. There's places where we then have to undefine
some of these things to be able to include external headers etc. Some
functions are only replaced in backends, others in frontend too. It makes it
hard to know what exactly the assumed set of platform primitives is. Etc.

Greetings,

Andres Freund




remove_useless_groupby_columns is too enthusiastic

2022-07-12 Thread Tom Lane
I looked into the complaint at [1] about the planner being much
stupider when one side of a JOIN USING is referenced than the other
side.  It seemed to me that that shouldn't be happening, because
the relevant decisions are made on the basis of EquivalenceClasses
and both USING columns should be in the same EquivalenceClass.
I was right about that ... but the damage is being done far upstream
of any EquivalenceClass work.  It turns out that the core of the
issue is that the query looks like

SELECT ... t1 JOIN t2 USING (x)
  GROUP BY x, t2.othercol ORDER BY t2.othercol LIMIT n

In the "okay" case, we resolve "GROUP BY x" as GROUP BY t1.x.
Later on, we are able to realize that ordering by t1.x is
equivalent to ordering by t2.x (because same EquivalenceClass),
and that it's equally good to consider the GROUP BY items in
either order, and that ordering by t2.othercol, t2.x would
allow us to perform the grouping using a GroupAggregate on
data that's already sorted to meet the ORDER BY requirement.
Since there happens to be an index on t2.othercol, t2.x,
we can implement the query with no explicit sort, which wins
big thanks to the small LIMIT.

In the not-okay case, we resolve "GROUP BY x" as GROUP BY t2.x.
Then remove_useless_groupby_columns notices that x is t2's
primary key, so it figures that grouping by t2.othercol is
redundant and throws away that element of GROUP BY.  Now there
is no apparent connection between the GROUP BY and ORDER BY
lists, defeating the optimizations that would lead to a good
choice of plan --- in fact, we conclude early on that that
index's sort ordering is entirely useless to the query :-(

I tried the attached quick-hack patch that just prevents
remove_useless_groupby_columns from removing anything that
appears in ORDER BY.  That successfully fixes the complained-of
case, and it doesn't change any existing regression test results.
I'm not sure whether there are any cases that it makes significantly
worse.

(I also kind of wonder if the fundamental problem here is that
remove_useless_groupby_columns is being done at the wrong time,
and we ought to do it later when we have more semantic info.
But I'm not volunteering to rewrite it completely.)

Anyway, remove_useless_groupby_columns has been there since 9.6
and we've not previously seen reports of cases that it makes worse,
so this seems like a corner-case problem.  Hence I wouldn't risk
back-patching this change.  It seems worth considering for HEAD
though, so I'll stick it in the September CF.

regards, tom lane

[1] 
https://www.postgresql.org/message-id/17544-ebd06b00b8836a04%40postgresql.org

diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 06ad856eac..b2716abcc7 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -2733,12 +2733,15 @@ remove_useless_groupby_columns(PlannerInfo *root)
 
 			/*
 			 * New list must include non-Vars, outer Vars, and anything not
-			 * marked as surplus.
+			 * marked as surplus.  In addition, keep anything that appears in
+			 * the ORDER BY clause, because otherwise we may falsely make it
+			 * look like the GROUP BY and ORDER BY clauses are incompatible.
 			 */
 			if (!IsA(var, Var) ||
 var->varlevelsup > 0 ||
 !bms_is_member(var->varattno - FirstLowInvalidHeapAttributeNumber,
-			   surplusvars[var->varno]))
+			   surplusvars[var->varno]) ||
+list_member(parse->sortClause, sgc))
 new_groupby = lappend(new_groupby, sgc);
 		}
 


Re: Reducing the chunk header sizes on all memory context types

2022-07-12 Thread Yura Sokolov
Good day, David.

On Tue, 12/07/2022 at 17:01 +1200, David Rowley wrote:
> Over on [1], I highlighted that 40af10b57 (Use Generation memory
> contexts to store tuples in sorts) could cause some performance
> regressions for sorts when the size of the tuple is exactly a power of
> 2.  The reason for this is that the chunk header for generation.c
> contexts is 8 bytes larger (on 64-bit machines) than the aset.c chunk
> header. This means that we can store fewer tuples per work_mem during
> the sort and that results in more batches being required.
> 
> As I write this, this regression is still marked as an open item for
> PG15 in [2]. So I've been working on this to try to assist the
> discussion about if we need to do anything about that for PG15.
> 
> Over on [3], I mentioned that Andres came up with an idea and a
> prototype patch to reduce the chunk header size across the board by
> storing the context type in the 3 least significant bits in a uint64
> header.
> 
> I've taken Andres' patch and made some quite significant changes to
> it. In the patch's current state, the sort performance regression in
> PG15 vs PG14 is fixed. The generation.c context chunk header has been
> reduced to 8 bytes from the previous 24 bytes as it is in master.
> aset.c context chunk header is now 8 bytes instead of 16 bytes.
> 
> We can use this 8-byte chunk header by using the remaining 61-bits of
> the uint64 header to encode 2 30-bit values to store the chunk size
> and also the number of bytes we must subtract from the given pointer
> to find the block that the chunk is stored on.  Once we have the
> block, we can find the owning context by having a pointer back to the
> context from the block.  For memory allocations that are larger than
> what can be stored in 30 bits, the chunk header gets an additional two
> Size fields to store the chunk_size and the block offset.  We can tell
> the difference between the 2 chunk sizes by looking at the spare 1-bit
> in the uint64 portion of the header.

I don't get why a "large chunk" needs additional fields for size and
offset.
Large allocation sizes are certainly rounded to page size.
And allocations which don't fit in 1GB we could easily round to 1MB.
Then we could simply store `size>>20`.
It will limit MaxAllocHugeSize to `(1<<(30+20))-1` - 1PB. Doubtfully we
will deal with such huge allocations in the near future.

And to limit block offset, we just have to limit maxBlockSize to 1GB,
which is quite reasonable limitation.
Chunks that are larger than maxBlockSize goes to separate blocks anyway,
therefore they have small block offset.

> Aside from speeding up the sort case, this also seems to have a good
> positive performance impact on pgbench read-only workload with -M
> simple. I'm seeing about a 17% performance increase on my AMD
> Threadripper machine.
> 
> master + Reduced Memory Context Chunk Overhead
> drowley@amd3990x:~$ pgbench -S -T 60 -j 156 -c 156 -M simple postgres
> tps = 1876841.953096 (without initial connection time)
> tps = 1919605.408254 (without initial connection time)
> tps = 1891734.480141 (without initial connection time)
> 
> Master
> drowley@amd3990x:~$ pgbench -S -T 60 -j 156 -c 156 -M simple postgres
> tps = 1632248.236962 (without initial connection time)
> tps = 1615663.151604 (without initial connection time)
> tps = 1602004.146259 (without initial connection time)

Trick with 3bit context type is great.

> The attached .png file shows the same results for PG14 and PG15 as I
> showed in the blog [4] where I discovered the regression and adds the
> results from current master + the attached patch. See bars in orange.
> You can see that the regression at 64MB work_mem is fixed. Adding some
> tracing to the sort shows that we're now doing 671745 tuples per batch
> instead of 576845 tuples. This reduces the number of batches from 245
> down to 210.
> 
> Drawbacks:
> 
> There is at least one. It might be major; to reduce the AllocSet chunk
> header from 16 bytes down to 8 bytes I had to get rid of the freelist
> pointer that was reusing the "aset" field in the chunk header struct.
> This works now by storing that pointer in the actual palloc'd memory.
> This could lead to pretty hard-to-trace bugs if we have any code that
> accidentally writes to memory after pfree.  The slab.c context already
> does this, but that's far less commonly used.  If we decided this was
> unacceptable then it does not change anything for the generation.c
> context.  The chunk header will still be 8 bytes instead of 24 there.
> So the sort performance regression will still be fixed.

At least we can still mark the free-list pointer as VALGRIND_MAKE_MEM_NOACCESS
and do VALGRIND_MAKE_MEM_DEFINED on fetching from the free list, can't we?

> To improve this situation, we might be able to code it up so that
> MEMORY_CONTEXT_CHECKING builds add an additional freelist pointer to
> the header and also write it to the palloc'd memory, then verify
> they're set to the same thing when we reuse a chunk from the freelist.

Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

2022-07-12 Thread Andres Freund
Hi,

On 2022-07-11 22:22:28 -0400, Melanie Plageman wrote:
> Yes, per an off list suggestion by you, I have changed the tests to use a
> sum of writes. I've also added a test for IOPATH_LOCAL and fixed some of
> the missing calls to count IO Operations for IOPATH_LOCAL and
> IOPATH_STRATEGY.
> 
> I struggled to come up with a way to test writes for a particular
> type of backend are counted correctly since a dirty buffer could be
> written out by another type of backend before the target BackendType has
> a chance to write it out.

I guess temp file writes would be reliably done by one backend... Don't have a
good idea otherwise.


> I also struggled to come up with a way to test IO operations for
> background workers. I'm not sure of a way to deterministically have a
> background worker do a particular kind of IO in a test scenario.

I think it's perfectly fine to not test that - for it to be broken we'd have
to somehow screw up setting the backend type. Everything else is the same as
other types of backends anyway.

If you *do* want to test it, you probably could use
SET parallel_leader_participation = false;
SET force_parallel_mode = 'regress';
SELECT something_triggering_io;


> I'm not sure how to cause a strategy "extend" for testing.

COPY into a table should work. But might be unattractive due to the size of
the COPY ringbuffer.


> > Would be nice to have something testing that the ringbuffer stats stuff
> > does something sensible - that feels not entirely trivial.
> >
> >
> I've added a test to test that reused strategy buffers are counted as
> allocs. I would like to add a test which checks that if a buffer in the
> ring is pinned and thus not reused, that it is not counted as a strategy
> alloc, but I found it challenging without a way to pause vacuuming, pin
> a buffer, then resume vacuuming.

Yea, that's probably too hard to make reliable to be worth it.

Greetings,

Andres Freund




Re: making relfilenodes 56 bits

2022-07-12 Thread Andres Freund
Hi,

On 2022-07-12 09:51:12 -0400, Robert Haas wrote:
> On Mon, Jul 11, 2022 at 7:22 PM Andres Freund  wrote:
> > I guess I'm not enthused about duplicating the necessary knowledge in ever
> > more places. We've forgotten one of the magic incantations in the past, and
> > needing to find all the places that need to be patched is a bit bothersome.
> >
> > Perhaps we could add extract helpers out of durable_rename()?
> >
> > OTOH, I don't really see what we gain by keeping things out of the critical
> > section? It does seem good to have the temp-file creation/truncation and
> > write separately, but after that I don't think it's worth much to avoid a
> > PANIC. What legitimate issue does it avoid?
> 
> OK, so then I think we should just use durable_rename(). Here's a
> patch that does it that way. I briefly considered the idea of
> extracting helpers, but it doesn't seem worthwhile to me. There's not
> that much code in durable_rename() in the first place.

Cool.


> In this version, I also removed the struct padding, changed the limit
> on the number of entries to a nice round 64, and made some comment
> updates.

What does currently happen if we exceed that?

I wonder if we should just reference a new define generated by genbki.pl
documenting the number of relations that need to be tracked. Then we don't
need to maintain this manually going forward.


> I considered trying to go further and actually make the file
> variable-size, so that we never again need to worry about the limit on
> the number of entries, but I don't actually think that's a good idea.

Yea, I don't really see what we'd gain. For this stuff to change we need to
recompile anyway.


> If we were going to split up durable_rename(), the only intelligible
> split I can see would be to have a second version of the function, or
> a flag to the existing function, that caters to the situation where
> the old file is already known to have been fsync()'d.

I was thinking of something like durable_rename_prep() that'd fsync the
file/directories under their old names, and then durable_rename_exec() that
actually renames and then fsyncs.  But without a clear usecase...


> + /* Write new data to the file. */
> + pgstat_report_wait_start(WAIT_EVENT_RELATION_MAP_WRITE);
> + if (write(fd, newmap, sizeof(RelMapFile)) != sizeof(RelMapFile))
...
> + pgstat_report_wait_end();
> +

Not for this patch, but we eventually should move this sequence into a
wrapper. Perhaps combined with retry handling for short writes, the ENOSPC
stuff and an error message when the write fails. It's a bit insane how many
copies of this we have.


> diff --git a/src/include/utils/wait_event.h b/src/include/utils/wait_event.h
> index b578e2ec75..5d3775ccde 100644
> --- a/src/include/utils/wait_event.h
> +++ b/src/include/utils/wait_event.h
> @@ -193,7 +193,7 @@ typedef enum
>   WAIT_EVENT_LOGICAL_REWRITE_TRUNCATE,
>   WAIT_EVENT_LOGICAL_REWRITE_WRITE,
>   WAIT_EVENT_RELATION_MAP_READ,
> - WAIT_EVENT_RELATION_MAP_SYNC,
> + WAIT_EVENT_RELATION_MAP_RENAME,

Very minor nitpick: To me REPLACE would be a bit more accurate than RENAME,
since it includes fsync etc?

Greetings,

Andres Freund




Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

2022-07-12 Thread Andres Freund
Hi,

On 2022-07-12 12:19:06 -0400, Melanie Plageman wrote:
> > > I also realized that I am not differentiating between IOPATH_SHARED and
> > > IOPATH_STRATEGY for IOOP_FSYNC. But, given that we don't know what type
> > > of buffer we are fsync'ing by the time we call register_dirty_segment(),
> > > I'm not sure how we would fix this.
> >
> > I think a flush scarcely happens for strategy-loaded buffers.  If
> > that holds, IOOP_FSYNC would not make much sense for
> > IOPATH_STRATEGY.
> >
> 
> Why would it be less likely for a backend to do its own fsync when
> flushing a dirty strategy buffer than a regular dirty shared buffer?

We really just don't expect a backend to do many segment fsyncs at
all. Otherwise there's something wrong with the forwarding mechanism.

It'd be different if we tracked WAL fsyncs more granularly - which would be
quite interesting - but that's something for another day^Wpatch.


> > > > Wonder if it's worth making the lock specific to the backend type?
> > > >
> > >
> > > I've added another Lock into PgStat_IOPathOps so that each BackendType
> > > can be locked separately. But, I've also kept the lock in
> > > PgStatShared_BackendIOPathOps so that reset_all and snapshot could be
> > > done easily.
> >
> > Looks fine about the lock separation.
> >
> 
> Actually, I think it is not safe to use both of these locks. So for
> picking one method, it is probably better to go with the locks in
> PgStat_IOPathOps, it will be more efficient for flush (and not for
> fetching and resetting), so that is probably the way to go, right?

I think it's good to just use one kind of lock, and efficiency of snapshotting
/ resetting is nearly irrelevant. But I don't see why it's not safe to use
both kinds of locks?


> > Looks fine, but I think pgstat_flush_io_ops() need more comments like
> > other pgstat_flush_* functions.
> >
> > +   for (int i = 0; i < BACKEND_NUM_TYPES; i++)
> > +   stats_shmem->stats[i].stat_reset_timestamp = ts;
> >
> > I'm not sure we need a separate reset timestamp for each backend type
> > but SLRU counter does the same thing..
> >
> 
> Yes, I think for SLRU stats it is because you can reset individual SLRU
> stats. Also there is no wrapper data structure to put it in. I could
> keep it in PgStatShared_BackendIOPathOps since you have to reset all IO
> operation stats at once, but I am thinking of getting rid of
> PgStatShared_BackendIOPathOps since it is not needed if I only keep the
> locks in PgStat_IOPathOps and make the global shared value an array of
> PgStat_IOPathOps.

I'm strongly against introducing super granular reset timestamps. I think that
was a mistake for SLRU stats, but we can't fix that as easily.


> Currently, strategy allocs count only reuses of a strategy buffer (not
> initial shared buffers which are added to the ring).
> strategy writes count only the writing out of dirty buffers which are
> already in the ring and are being reused.

That seems right to me.


> Alternatively, we could also count as strategy allocs all those buffers
> which are added to the ring and count as strategy writes  all those
> shared buffers which are dirty when initially added to the ring.

I don't think that'd provide valuable information. The whole reason that
strategy writes are interesting is that they can lead to writing out data a
lot sooner than they would be written out without a strategy being used.


> Subject: [PATCH v24 2/3] Track IO operation statistics
> 
> Introduce "IOOp", an IO operation done by a backend, and "IOPath", the
> location or type of IO done by a backend. For example, the checkpointer
> may write a shared buffer out. This would be counted as an IOOp write on
> an IOPath IOPATH_SHARED by BackendType "checkpointer".

I'm still not 100% happy with IOPath - seems a bit too easy to confuse with
the file path. What about 'origin'?


> Each IOOp (alloc, fsync, extend, write) is counted per IOPath
> (direct, local, shared, or strategy) through a call to
> pgstat_count_io_op().

It seems we should track reads too - it's quite interesting to know whether
reads happened because of a strategy, for example. You do reference reads in a
later part of the commit message even :)


> The primary concern of these statistics is IO operations on data blocks
> during the course of normal database operations. IO done by, for
> example, the archiver or syslogger is not counted in these statistics.

We could extend this at a later stage, if we really want to. But I'm not sure
it's interesting or fully possible. E.g. the archiver's write are largely not
done by the archiver itself, but by a command (or module these days) it shells
out to.


> Note that this commit does not add code to increment IOPATH_DIRECT. A
> future patch adding wrappers for smgrwrite(), smgrextend(), and
> smgrimmedsync() would provide a good location to call
> pgstat_count_io_op() for unbuffered IO and avoid regressions for future
> users of these functions.

Hm. Perhaps we 

test_oat_hooks bug (was: Re: pgsql: Add copy/equal support for XID lists)

2022-07-12 Thread Alvaro Herrera
While looking for a place to host a test for XID lists support, I
noticed a mistake in test_oat_hooks, fixed as per the attached.

-- 
Álvaro Herrera         Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
From 7005187f58a2a8b0300eceaa8fae8f825d5525a9 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Tue, 12 Jul 2022 17:18:23 +0200
Subject: [PATCH] Fix flag tests in src/test/modules/test_oat_hooks

In what must have been a copy'n paste mistake, all the flag tests use
the same flag rather than a different flag each.  The bug is not
surprising, considering that it's dead code.
---
 src/test/modules/test_oat_hooks/test_oat_hooks.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/src/test/modules/test_oat_hooks/test_oat_hooks.c b/src/test/modules/test_oat_hooks/test_oat_hooks.c
index 1f40d632e0..900d597f5d 100644
--- a/src/test/modules/test_oat_hooks/test_oat_hooks.c
+++ b/src/test/modules/test_oat_hooks/test_oat_hooks.c
@@ -1795,15 +1795,15 @@ accesstype_arg_to_string(ObjectAccessType access, void *arg)
 return psprintf("%s%s%s%s%s%s",
 ((drop_arg->dropflags & PERFORM_DELETION_INTERNAL)
  ? "internal action," : ""),
-((drop_arg->dropflags & PERFORM_DELETION_INTERNAL)
+((drop_arg->dropflags & PERFORM_DELETION_CONCURRENTLY)
  ? "concurrent drop," : ""),
-((drop_arg->dropflags & PERFORM_DELETION_INTERNAL)
+((drop_arg->dropflags & PERFORM_DELETION_QUIETLY)
  ? "suppress notices," : ""),
-((drop_arg->dropflags & PERFORM_DELETION_INTERNAL)
+((drop_arg->dropflags & PERFORM_DELETION_SKIP_ORIGINAL)
  ? "keep original object," : ""),
-((drop_arg->dropflags & PERFORM_DELETION_INTERNAL)
+((drop_arg->dropflags & PERFORM_DELETION_SKIP_EXTENSIONS)
  ? "keep extensions," : ""),
-((drop_arg->dropflags & PERFORM_DELETION_INTERNAL)
+((drop_arg->dropflags & PERFORM_DELETION_CONCURRENT_LOCK)
  ? "normal concurrent drop," : ""));
 			}
 			break;
-- 
2.30.2



Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

2022-07-12 Thread Melanie Plageman
Thanks for the review!

On Tue, Jul 12, 2022 at 4:06 AM Kyotaro Horiguchi 
wrote:

> At Mon, 11 Jul 2022 22:22:28 -0400, Melanie Plageman <
> melanieplage...@gmail.com> wrote in
> > Hi,
> >
> > In the attached patch set, I've added in missing IO operations for
> > certain IO Paths as well as enumerating in the commit message which IO
> > Paths and IO Operations are not currently counted and or not possible.
> >
> > There is a TODO in HandleWalWriterInterrupts() about removing
> > pgstat_report_wal() since it is immediately before a proc_exit()
>
> Right. walwriter does that without needing the explicit call.
>

I have deleted it.


>
> > I was wondering if LocalBufferAlloc() should increment the counter or if
> > I should wait until GetLocalBufferStorage() to increment the counter.
>
> Depends on what "allocate" means. Different from shared buffers, local
> buffers are taken from the OS and then allocated to a page.  OS-allocated pages
> are restricted by num_temp_buffers, so I think what we're interested in
> is the count incremented by LocalBufferAlloc(). (And it is the parallel
> of alloc for shared buffers.)
>

I've left it in LocalBufferAlloc().


>
> > I also realized that I am not differentiating between IOPATH_SHARED and
> > IOPATH_STRATEGY for IOOP_FSYNC. But, given that we don't know what type
> > of buffer we are fsync'ing by the time we call register_dirty_segment(),
> > I'm not sure how we would fix this.
>
> I think there scarcely happens flush for strategy-loaded buffers.  If
> that is sensible, IOOP_FSYNC would not make much sense for
> IOPATH_STRATEGY.
>

Why would it be less likely for a backend to do its own fsync when
flushing a dirty strategy buffer than a regular dirty shared buffer?


>
> > > IMO it'd be a good idea to add pgstat_report_io_ops() to
> > > pgstat_report_vacuum()/analyze(), so that the stats for a longrunning
> > > autovac worker get updated more regularly.
> > >
> >
> > noted and fixed.
>
> > > How about moving the pgstat_report_io_ops() into
> > > pgstat_report_bgwriter(), pgstat_report_autovacuum() etc? Seems
> > > unnecessary to have multiple pgstat_* calls in these places.
> > >
> > >
> > >
> > noted and fixed.
>
> +* Also report IO Operations statistics
>
> I think that the function comment also should mention this.
>

I've added comments at the top of all these functions.


>
> > > Wonder if it's worth making the lock specific to the backend type?
> > >
> >
> > I've added another Lock into PgStat_IOPathOps so that each BackendType
> > can be locked separately. But, I've also kept the lock in
> > PgStatShared_BackendIOPathOps so that reset_all and snapshot could be
> > done easily.
>
> Looks fine about the lock separation.
>

Actually, I think it is not safe to use both of these locks. If we have
to pick one method, it is probably better to go with the locks in
PgStat_IOPathOps: that will be more efficient for flushing (though not for
fetching and resetting), so that is probably the way to go, right?


> By the way, in the following line:
>
> +
>  >io_ops.stats[backend_type_get_idx(MyBackendType)];
>
> backend_type_get_idx(x) is actually (x - 1) plus assertion on the
> value range. And the only use-case is here. There's a reverse
> function, which is also used in only one place.
>
> +   Datum   backend_type_desc =
> +
>  CStringGetTextDatum(GetBackendTypeDesc(idx_get_backend_type(i)));
>
> In this usage GetBackendTypeDesc() gracefully treats out-of-domain
> values but idx_get_backend_type keenly kills the process for the
> same. This is inconsistent.
>
> My humble opinion on this is that we don't define the two functions and
> replace the calls to them with (x +/- 1).  In addition to that, I think
> we should not abort() on invalid backend types.  In that sense, I
> wonder if we could use the B_INVALIDth element for this purpose.
>

I think that GetBackendTypeDesc() should probably also error out for an
unknown value.

I would be open to not using the helper functions. I thought it would be
less error-prone, but since it is limited to the code in
pgstat_io_ops.c, it is probably okay. Let me think a bit more.

Could you explain more about what you mean about using B_INVALID
BackendType?


>
> > > > + LWLockAcquire(&stats_shmem->lock, LW_SHARED);
> > > > + memcpy(&stats, reset_offset, sizeof(stats_shmem->stats));
> > > > + LWLockRelease(&stats_shmem->lock);
> > >
> > > Which then also means that you don't need the reset offset stuff. It's
> > > only there because with the changecount approach we can't take a lock
> to
> > > reset the stats (since there is no lock). With a lock you can just
> reset
> > > the shared state.
> > >
> >
> > Yes, I believe I have cleaned up all of this embarrassing mess. I use the
> > lock in PgStatShared_BackendIOPathOps for reset all and snapshot and the
> > locks in PgStat_IOPathOps for flush.
>
> Looks fine, but I think pgstat_flush_io_ops() need more comments like
> other pgstat_flush_* functions.
>
> +   for (int i = 0; i < BACKEND_NUM_TYPES; i++)

Re: making relfilenodes 56 bits

2022-07-12 Thread Dilip Kumar
On Mon, Jul 11, 2022 at 9:49 PM Robert Haas  wrote:
>

> It also makes me wonder why we're using macros rather than static
> inline functions in buf_internals.h. I wonder whether we could do
> something like this, for example, and keep InvalidForkNumber as -1:
>
> static inline ForkNumber
> BufTagGetForkNum(BufferTag *tagPtr)
> {
> int8 ret;
>
> StaticAssertStmt(MAX_FORKNUM <= INT8_MAX);
> ret = (int8) (tagPtr->relForkDetails[0] >> BUFFERTAG_RELNUMBER_BITS);
> return (ForkNumber) ret;
> }
>
> Even if we don't use that particular trick, I think we've generally
> been moving toward using static inline functions rather than macros,
> because it provides better type-safety and the code is often easier to
> read. Maybe we should also approach it that way here. Or even commit a
> preparatory patch replacing the existing macros with inline functions.
> Or maybe it's best to leave it alone, not sure.

I think it makes sense to convert the existing macros as well; I have
attached a patch for the same.
>
> > I had those changes in v7-0003; now I have merged them with 0002.  This has
> > an assert check while replaying the WAL for smgr create and smgr
> > truncate, and during the normal path, when allocating a new
> > relfilenumber, we assert that no file with that number already exists.
>
> I think a test-and-elog might be better. Most users won't be running
> assert-enabled builds, but this seems worth checking regardless.

IMHO the recovery-time asserts can be converted to elog, but the one we
do after each GetNewRelFileNumber() is better kept as an assert, since it
involves a file access and so could be costly.

> > I have done some performance tests, with very small values I can see a
> > lot of wait events for RelFileNumberGen but with bigger numbers like
> > 256 or 512 it is not really bad.  See results at the end of the
> > mail[1]
>
> It's a little hard to interpret these results because you don't say
> how often you were checking the wait events, or how long the
> operation took to complete. I suppose we can guess the relative time
> scale from the number of Activity events: if there were 190
> WalWriterMain events observed, then the time to complete the operation
> is probably 190 times how often you were checking the wait events, but
> was that every second or every half second or every tenth of a second?

I am executing it every 0.5 sec using the below script in psql:
\t
select wait_event_type, wait_event from pg_stat_activity where pid !=
pg_backend_pid()
\watch 0.5

And running the test for 60 sec:
./pgbench -c 32 -j 32 -T 60 -f create_script.sql -p 54321  postgres

$ cat create_script.sql
select create_table(100);

// function body 'create_table'
CREATE OR REPLACE FUNCTION create_table(count int) RETURNS void AS $$
DECLARE
  relname varchar;
  pid int;
  i   int;
BEGIN
  SELECT pg_backend_pid() INTO pid;
  relname := 'test_' || pid;
  FOR i IN 1..count LOOP
EXECUTE format('CREATE TABLE %s(a int)', relname);

EXECUTE format('DROP TABLE %s', relname);
  END LOOP;
END;
$$ LANGUAGE plpgsql;



> > I have done these changes during GetNewRelFileNumber(). This required
> > tracking the last logged record pointer as well, but I think this looks
> > clean.  With this I can see some reduction in the RelFileNumberGen wait
> > event[1]
>
> I find the code you wrote here a little bit magical. I believe it
> depends heavily on choosing to issue the new WAL record when we've
> exhausted exactly 50% of the available space. I suggest having two
> constants, one of which is the number of relfilenumber values per WAL
> record, and the other of which is the threshold for issuing a new WAL
> record. Maybe something like RFN_VALUES_PER_XLOG and
> RFN_NEW_XLOG_THRESHOLD, or something. And then work code that works
> correctly for any value of RFN_NEW_XLOG_THRESHOLD between 0 (don't log
> new RFNs until old allocation is completely exhausted) and
> RFN_VALUES_PER_XLOG - 1 (log new RFNs after using just 1 item from the
> previous allocation). That way, if in the future someone decides to
> change the constant values, they can do that and the code still works.

ok



-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From 553a6b185ee7270bf206387421e50b48c7a8b97b Mon Sep 17 00:00:00 2001
From: Dilip Kumar 
Date: Tue, 12 Jul 2022 17:10:04 +0530
Subject: [PATCH v1] Convert buf_internals.h macros to static inline functions

Readability-wise, inline functions are better than macros, and this
will also help us write cleaner, more readable code for the 64-bit
relfilenode work: as part of that patch we will have to do some complex
bitwise operations, and doing them inside inline functions will be cleaner.
---
 src/backend/storage/buffer/buf_init.c |   2 +-
 src/backend/storage/buffer/bufmgr.c   |  16 ++--
 src/backend/storage/buffer/localbuf.c |  12 +--
 src/include/storage/buf_internals.h   | 156 +-
 4 files changed, 111 insertions(+), 75 deletions(-)

diff --git 

Re: Introduce "log_connection_stages" setting.

2022-07-12 Thread Euler Taveira
On Tue, Jul 12, 2022, at 10:52 AM, Sergey Dudoladov wrote:
> The problem we face is excessive logging of connection information
> that clutters the logs and in corner cases with many short-lived
> connections leads to disk space exhaustion.
You are proposing a fine-grained control over connection stages reported when
log_connections = on. It seems useful if you're only interested in (a)
malicious access or (b) authorized access (for audit purposes).

> I would like to get feedback on the following idea:
> 
> Add the `log_connection_stages` setting of type “string” with possible
> values "received", "authorized", "authenticated", "disconnected", and
> "all", with "all" being the default.
> The setting would have effect only when `log_connections` is on.
> Example: log_connection_stages='authorized,disconnected'.
> That also implies there would be no need for a separate
> "log_disconnection" setting.
Your proposal will add more confusion to the already-confused logging-related
GUCs. If you are proposing to introduce fine-grained control, the first step
should be to merge log_connections and log_disconnections into a new GUC (?) and
deprecate them. (I wouldn't introduce a new GUC that depends on the state of
another GUC, as you proposed.) There are 3 stages: connect (received), authorized
(authenticated), disconnect. You can also add 'all', which mimics log_connections
/ log_disconnections enabled. Another question is whether we can reuse
log_connections for this improvement instead of a new GUC. In this case, you
would turn the boolean value into an enum value. Will it cause trouble while
upgrading to this new version? That is one of the reasons to create a new GUC. I
would suggest log_connection_messages, or log_connection (without the 's' at the
end -- though since it is too similar to the current GUC name, I'm afraid it is
not a good name).


--
Euler Taveira
EDB   https://www.enterprisedb.com/


Re: proposal: Allocate work_mem From Pool

2022-07-12 Thread Justin Pryzby
On Tue, Jul 12, 2022 at 03:55:39AM -0700, Joseph D Wagner wrote:
> Before I try to answer that, I need to know how the scheduler works.
> 
> Let's say there's a max of 8 worker process, and 12 queries trying to run.
> When does query #9 run? After the first of 1-8 completes, simple FIFO?
> Or something else?
> 
> Also, how long does a query hold a worker process?  All the way to
> completion?  Or does it perform some unit of work and rotate to
> another query?

I think what you're referring to as a worker process is what postgres refers to
as a "client backend" (and not a "parallel worker", even though that sounds
more similar to your phrase).

> P.S.  If there's a link to all this somewhere, please let me know.
> Parsing through years of email archives is not always user friendly or
> helpful.

Looking at historic communication is probably the easy part.
Here's some to start you out.
https://www.postgresql.org/message-id/flat/4d39869f4bdc42b3a43004e3685ac45d%40index.de

-- 
Justin




Re: DropRelFileLocatorBuffers

2022-07-12 Thread Robert Haas
On Tue, Jul 12, 2022 at 12:07 AM Dilip Kumar  wrote:
> I think the naming used in your patch looks better to me. So +1 for the 
> change.

Committed.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Cleaning up historical portability baggage

2022-07-12 Thread Tom Lane
Robert Haas  writes:
> On Mon, Jul 11, 2022 at 9:11 PM Thomas Munro  wrote:
>> Hmm, but that's not what we're doing in general.  For example, on
>> Windows we're redirecting open() to a replacement function of our own,
>> we're not using "pg_open()" in our code.  That's not an example based
>> on AC_REPLACE_FUNCS, but there are plenty of those too.  Isn't this
>> quite well established?

> Yes. I just don't care for it.
> Sounds like I'm in the minority, though.

I concur with your point that it's not great to use the standard name
for a function that doesn't have exactly the standard semantics.
But if it does, using a nonstandard name is not better.  It's just one
more thing that readers of our code have to learn about.

Note that "exactly" only needs to mean "satisfies all the promises
made by POSIX".  If some caller is depending on behavior details not
specified in the standard, that's the caller's bug not the wrapper
function's.  Otherwise, yeah, we couldn't ever be sure whether a
wrapper function is close enough.

regards, tom lane




Introduce "log_connection_stages" setting.

2022-07-12 Thread Sergey Dudoladov
Hello,

The problem we face is excessive logging of connection information
that clutters the logs and in corner cases with many short-lived
connections leads to disk space exhaustion.

Current connection log lines share significant parts of the
information - host, port, very close timestamps etc. - in the common
case of a successful connection:

2022-07-12 12:17:39.369
UTC,,,12875,"10.2.101.63:35616",62cd6663.324b,1,"",2022-07-12 12:17:39
UTC,,0,LOG,0,"connection received: host=10.2.101.63
port=35616","","not initialized"
2022-07-12 12:17:39.374
UTC,"some_user","postgres",12875,"10.2.101.63:35616",62cd6663.324b,2,"authentication",2022-07-12
12:17:39 UTC,18/4571,0,LOG,0,"connection authorized:
user=some_user database=postgres SSL enabled (protocol=, cipher=,
compression=)","","client backend"
2022-07-12 12:17:39.430
UTC,"some_user","postgres",12875,"10.2.101.63:35616",62cd6663.324b,3,"idle",2022-07-12
12:17:39 UTC,,0,LOG,0,"disconnection: session time: 0:00:00.060
user=some_user database=postgres host=10.2.101.63
port=35616","","client backend"

Removing some of the lines should not harm log-based investigations in
most cases, but will shrink the logs improving readability and space
consumption.

I would like to get feedback on the following idea:

Add the `log_connection_stages` setting of type “string” with possible
values "received", "authorized", "authenticated", "disconnected", and
"all", with "all" being the default.
The setting would have effect only when `log_connections` is on.
Example: log_connection_stages=’authorized,disconnected’.
That also implies there would be no need for a separate
"log_disconnection" setting.

For the sake of completeness, I have to mention that omitting 'received'
from `log_connection_stages` would leave the logs with no information about
connections that never complete initialization: for them, only the
"connection received" line is currently logged. The attachment
contains a log excerpt to clarify the situation I am talking about.

Regards,
Sergey.
2022-07-11 15:44:53.126 UTC,,,75,,62cbfa36.4b,3148,,2022-07-11 10:23:50 
UTC,,0,DEBUG,0,"forked new backend, pid=270675 
socket=12","","postmaster"
2022-07-11 15:44:53.127 
UTC,,,270675,"10.2.80.0:1094",62cc4575.42153,1,"",2022-07-11 15:44:53 
UTC,,0,LOG,0,"connection received: host=10.2.80.0 
port=1094","","not initialized"
2022-07-11 15:44:53.127 
UTC,,,270675,"10.2.80.0:1094",62cc4575.42153,2,"",2022-07-11 15:44:53 
UTC,,0,DEBUG,0,"shmem_exit(0): 0 before_shmem_exit callbacks to 
make","","not initialized"
2022-07-11 15:44:53.127 
UTC,,,270675,"10.2.80.0:1094",62cc4575.42153,3,"",2022-07-11 15:44:53 
UTC,,0,DEBUG,0,"shmem_exit(0): 0 on_shmem_exit callbacks to 
make","","not initialized"
2022-07-11 15:44:53.127 
UTC,,,270675,"10.2.80.0:1094",62cc4575.42153,4,"",2022-07-11 15:44:53 
UTC,,0,DEBUG,0,"proc_exit(0): 1 callbacks to make","","not 
initialized"
2022-07-11 15:44:53.127 
UTC,,,270675,"10.2.80.0:1094",62cc4575.42153,5,"",2022-07-11 15:44:53 
UTC,,0,DEBUG,0,"exit(0)","","not initialized"
2022-07-11 15:44:53.127 
UTC,,,270675,"10.2.80.0:1094",62cc4575.42153,6,"",2022-07-11 15:44:53 
UTC,,0,DEBUG,0,"shmem_exit(-1): 0 before_shmem_exit callbacks to 
make","","not initialized"
2022-07-11 15:44:53.127 
UTC,,,270675,"10.2.80.0:1094",62cc4575.42153,7,"",2022-07-11 15:44:53 
UTC,,0,DEBUG,0,"shmem_exit(-1): 0 on_shmem_exit callbacks to 
make","","not initialized"
2022-07-11 15:44:53.127 
UTC,,,270675,"10.2.80.0:1094",62cc4575.42153,8,"",2022-07-11 15:44:53 
UTC,,0,DEBUG,0,"proc_exit(-1): 0 callbacks to make","","not 
initialized"
2022-07-11 15:44:53.129 UTC,,,75,,62cbfa36.4b,3158,,2022-07-11 10:23:50 
UTC,,0,DEBUG,0,"reaping dead processes ","","postmaster"
2022-07-11 15:44:53.129 UTC,,,75,,62cbfa36.4b,3159,,2022-07-11 10:23:50 
UTC,,0,DEBUG,0,"server process (PID 270675) exited with exit code 
0","","postmaster"


Re: making relfilenodes 56 bits

2022-07-12 Thread Robert Haas
On Mon, Jul 11, 2022 at 7:22 PM Andres Freund  wrote:
> I guess I'm not enthused in duplicating the necessary knowledge in evermore
> places. We've forgotten one of the magic incantations in the past, and needing
> to find all the places that need to be patched is a bit bothersome.
>
> Perhaps we could add extract helpers out of durable_rename()?
>
> OTOH, I don't really see what we gain by keeping things out of the critical
> section? It does seem good to have the temp-file creation/truncation and write
> separately, but after that I don't think it's worth much to avoid a
> PANIC. What legitimate issue does it avoid?

OK, so then I think we should just use durable_rename(). Here's a
patch that does it that way. I briefly considered the idea of
extracting helpers, but it doesn't seem worthwhile to me. There's not
that much code in durable_rename() in the first place.

In this version, I also removed the struct padding, changed the limit
on the number of entries to a nice round 64, and made some comment
updates. I considered trying to go further and actually make the file
variable-size, so that we never again need to worry about the limit on
the number of entries, but I don't actually think that's a good idea.
It would require substantially more changes to the code in this file,
and that means there's more risk of introducing bugs, and I don't see
that there's much value anyway, because if we ever do hit the current
limit, we can just raise the limit.

If we were going to split up durable_rename(), the only intelligible
split I can see would be to have a second version of the function, or
a flag to the existing function, that caters to the situation where
the old file is already known to have been fsync()'d.

-- 
Robert Haas
EDB: http://www.enterprisedb.com


v2-0001-Remove-the-restriction-that-the-relmap-must-be-51.patch
Description: Binary data


Re: Making Vars outer-join aware

2022-07-12 Thread Tom Lane
Richard Guo  writes:
> Note that the evaluation of expression 'b.j + 1' now occurs below the
> outer join. Is this something we need to be concerned about?

It seems more formally correct to me, but perhaps somebody would
complain about possibly-useless expression evals.  We could likely
re-complicate the logic that inserts PHVs during pullup so that it
looks for Vars it can apply the nullingrels to.  Maybe there's an
opportunity to share code with flatten_join_alias_vars?

regards, tom lane




[PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher

2022-07-12 Thread Önder Kalacı
Hi hackers,


It is often not feasible to use `REPLICA IDENTITY FULL` on the
publication, because it leads to a full table scan per tuple change on
the subscription. This makes `REPLICA IDENTITY FULL` impracticable --
probably other than for some small number of use cases.

With this patch, I'm proposing the following change: if there is an
index on the subscriber, use the index as long as the planner
sub-modules pick any index over a sequential scan.

The majority of the logic on the subscriber side already exists in the
code. The subscriber is already capable of doing (unique) index scans.
With this patch, we are allowing the index to iterate over the tuples
fetched and only act when tuples are equal. Those familiar with this
part of the code will realize that the sequential scan code on the
subscriber already implements the `tuples_equal()` function. In short,
the changes on the subscriber mostly combine parts of the (unique)
index scan and sequential scan code.

The decision on whether to use an index (or which index) is mostly
derived from the planner infrastructure. The idea is that on the
subscriber we have all the columns. So, construct all the `Path`s with
restrictions on all columns, such as `col_1 = $1 AND col_2 = $2 ...
AND col_n = $N`. Then let the planner sub-module -- `create_index_paths()`
-- give us the relevant index `Path`s. On top of that, add the
sequential scan `Path` as well. Finally, pick the cheapest `Path` among
them.

From the performance point of view, there are a few things to note.
First, the patch aims not to change the behavior when a PRIMARY KEY or
UNIQUE INDEX is used. Second, when REPLICA IDENTITY is FULL on the
publisher and an index is used on the subscriber, the difference mostly
comes down to `index scan` vs `sequential scan`. That's why it is hard
to claim a specific amount of improvement: it mostly depends on the
data size, the index, and the data distribution.

Still, below I try to showcase the potential improvements using an
index on the subscriber
`pgbench_accounts(bid)`. With the index, the replication catches up
around ~5 seconds.
When the index is dropped, the replication takes around ~300 seconds.

// init source db
pgbench -i -s 100 -p 5432 postgres
psql -c "ALTER TABLE pgbench_accounts DROP CONSTRAINT
pgbench_accounts_pkey;" -p 5432 postgres
psql -c "CREATE INDEX i1 ON pgbench_accounts(aid);" -p 5432 postgres
psql -c "ALTER TABLE pgbench_accounts REPLICA IDENTITY FULL;" -p 5432 postgres
psql -c "CREATE PUBLICATION pub_test_1 FOR TABLE pgbench_accounts;" -p
5432 postgres

// init target db, drop existing primary key
pgbench -i -p 9700 postgres
psql -c "truncate pgbench_accounts;" -p 9700 postgres
psql -c "ALTER TABLE pgbench_accounts DROP CONSTRAINT
pgbench_accounts_pkey;" -p 9700 postgres
psql -c "CREATE SUBSCRIPTION sub_test_1 CONNECTION 'host=localhost
port=5432 user=onderkalaci dbname=postgres' PUBLICATION pub_test_1;"
-p 9700 postgres

// create one index, even on a low cardinality column
psql -c "CREATE INDEX i2 ON pgbench_accounts(bid);" -p 9700 postgres

// now, run some pgbench tests and observe replication
pgbench -t 500 -b tpcb-like -p 5432 postgres



What do hackers think about this change?


Thanks,

Onder Kalaci & Developing the Citus extension for PostgreSQL


0001_use_index_on_subs_when_pub_rep_ident_full.patch
Description: Binary data


Re: proposal: Allocate work_mem From Pool

2022-07-12 Thread John Naylor
On Tue, Jul 12, 2022 at 5:55 PM Joseph D Wagner 
wrote:
> Before I try to answer that, I need to know how the scheduler works.

As I understand the term used, there is no scheduler inside Postgres for
user connections -- they're handled by the OS kernel. That's probably why
it'd be a difficult project to be smart about memory -- step one might be
to invent a scheduler. (The autovacuum launcher and checkpointer, etc have
their own logic about when to do things, but while running they too are
just OS processes scheduled by the kernel.)

--
John Naylor
EDB: http://www.enterprisedb.com


Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication

2022-07-12 Thread Amit Kapila
On Fri, Jul 8, 2022 at 10:26 PM Melih Mutlu  wrote:
>
>> I think that won't work because each time on restart the slot won't be
>> fixed. Now, it is possible that we may drop the wrong slot if that
>> state of copying rel is SUBREL_STATE_DATASYNC. Also, it is possible
> >> that while creating a slot, we fail because a slot with the same name
> >> already exists, created by some other worker that has been restarted.
> >> Also, what about origin_name, won't that have similar
>> problems? Also, if the state is already SUBREL_STATE_FINISHEDCOPY, if
>> the slot is not the same as we have used in the previous run of a
>> particular worker, it may start WAL streaming from a different point
>> based on the slot's confirmed_flush_location.
>
>
> You're right Amit. In case of a failure, the tablesync phase of a relation may
> continue with a different worker and replication slot due to this change in
> naming.
> Seems like the same replication slot should be used from start to end for a 
> relation during tablesync. However, creating/dropping replication slots can 
> be a major overhead in some cases.
> It would be nice if these slots are somehow reused.
>
> To overcome this issue, I've been thinking about making some changes in my 
> patch.
> So far, my proposal would be as follows:
>
> Slot naming can be like pg__ instead of 
> pg__. This way each worker can use the same replication 
> slot during their lifetime.
> But if a worker is restarted, then it will switch to a new replication slot 
> since its pid has changed.
>

I think using worker_pid also has similar risks of mixing slots from
different workers because after restart same worker_pid could be
assigned to a totally different worker. Can we think of using a unique
64-bit number instead? This would be allocated when each worker starts
for the very first time, and after that we can refer to the catalog to
find it, as suggested in the idea below.

> pg_subscription_rel catalog can store replication slot name for each 
> non-ready relation. Then we can find the slot needed for that particular 
> relation to complete tablesync.
>

Yeah, this is worth investigating. However, instead of storing the
slot_name, we can store just the unique number (as suggested above).
We should use the same for the origin name as well.

> If a worker syncs a relation without any error, everything works well and 
> this new replication slot column from the catalog will not be needed.
> However if a worker is restarted due to a failure, the previous run of that 
> worker left its slot behind since it did not exit properly.
> And the restarted worker (with a different pid) will see that the relation is 
> actually in  SUBREL_STATE_FINISHEDCOPY and want to proceed for the catchup 
> step.
> Then the worker can look for that particular relation's replication slot from 
> pg_subscription_rel catalog (slot name should be there since relation state 
> is not ready). And tablesync can proceed with that slot.
>
> There might be some cases where some replication slots are left behind. An 
> example of such a case would be when the slot is removed from the
> pg_subscription_rel catalog and detached from any relation, but that slot
> actually couldn't be dropped for some reason. For such cases, a slot cleanup 
> logic is needed. This cleanup can also be done by tablesync workers.
> Whenever a tablesync worker is created, it can look for existing replication 
> slots that do not belong to any relation and any worker (slot name has pid 
> for that), and drop those slots if it finds any.
>

This sounds tricky. Why not first drop slot/origin and then detach it
from pg_subscription_rel? On restarts, it is possible that we may
error out after dropping the slot or origin but before updating the
catalog entry, but in that case we can ignore the missing slot/origin and
detach them from pg_subscription_rel. Also, if we use the unique
number as suggested above, I think even if we don't remove it after
the relation state is ready, it should be okay.

> What do you think about this new way of handling slots? Do you see any points 
> of concern?
>
> I'm currently working on adding this change into the patch. And would 
> appreciate any comment.
>

Thanks for making progress!

-- 
With Regards,
Amit Kapila.




Re: [PATCH] Compression dictionaries for JSONB

2022-07-12 Thread Aleksander Alekseev
Hi Nikita,

> Aleksander, please point me in the right direction if it was mentioned 
> before, I have a few questions:

Thanks for your feedback. These are good questions indeed.

> 1) It is not clear for me, how do you see the life cycle of such a 
> dictionary? If it is meant to keep growing without
> cleaning up/rebuilding, it could affect performance in an undesirable way, 
> along with keeping unused data without
> any means to get rid of them.
> 2) From (1) follows another question - I haven't seen any means for getting 
> rid of unused keys (or any other means
> for dictionary cleanup). How could it be done?

Good point. This was not a problem for ZSON since the dictionary size
was limited to 2**16 entries, the dictionary was immutable, and the
dictionaries had versions. For compression dictionaries we removed the
2**16 entries limit and also decided to get rid of versions. The idea
was that you could simply continue adding new entries, but no one
considered that the memory required to decompress a document would
then grow indefinitely.

Maybe we should return to the idea of limited dictionary size and
versions. Objections?

> 4) If one dictionary is used by several tables - I see future issues with 
> concurrent dictionary updates. This will for sure
> affect performance and can cause unpredictable behavior for queries.

You are right. Another reason to return to the idea of dictionary versions.

> Also, I agree with Simon Riggs, using OIDs from the general pool for 
> dictionary entries is a bad idea.

Yep, we agreed to stop using OIDs for this, however this was not
changed in the patch at this point. Please don't hesitate to join the
effort if you want to. I wouldn't mind taking a short break from this
patch.

> 3) Is the following scenario legal: by some means a dictionary does not 
> contain keys for some entries? What happens then?

No, we should either forbid removing dictionary entries or check that
no existing documents are using the entries being removed.

> If you have any questions on Pluggable TOAST don't hesitate to ask me and on 
> JSONB Toaster you can ask Nikita Glukhov.

Will do! Thanks for working on this and I'm looking forward to the
next version of the patch for the next round of review.

-- 
Best regards,
Aleksander Alekseev




Re: Cleaning up historical portability baggage

2022-07-12 Thread Robert Haas
On Mon, Jul 11, 2022 at 9:11 PM Thomas Munro  wrote:
> Hmm, but that's not what we're doing in general.  For example, on
> Windows we're redirecting open() to a replacement function of our own,
> we're not using "pg_open()" in our code.  That's not an example based
> on AC_REPLACE_FUNCS, but there are plenty of those too.  Isn't this
> quite well established?

Yes. I just don't care for it.

Sounds like I'm in the minority, though.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

2022-07-12 Thread Amit Kapila
On Tue, Jul 12, 2022 at 2:53 PM Masahiko Sawada  wrote:
>
> On Tue, Jul 12, 2022 at 5:58 PM shiy.f...@fujitsu.com
>  wrote:
> >
> >
> > It happened when executing the following code because it tried to free a 
> > NULL
> > pointer (catchange_xip).
> >
> > /* be tidy */
> > if (ondisk)
> > pfree(ondisk);
> > +   if (catchange_xip)
> > +   pfree(catchange_xip);
> >  }
> >
> > It seems to be related to configure option. I could reproduce it when using
> > `./configure --enable-debug`.
> > But I couldn't reproduce with `./configure --enable-debug CFLAGS="-Og 
> > -ggdb"`.
>
> Hmm, I could not reproduce this problem even if I use ./configure
> --enable-debug. And it's weird that we checked if catchange_xip is not
> null but we did pfree for it:
>

Yeah, this looks weird to me as well, but one difference in running
the tests could be the timing of the WAL log for XLOG_RUNNING_XACTS. That
may change the timing of SnapBuildSerialize. The other thing we can try is
checking the value of catchange_xcnt before the pfree.

BTW, I think ReorderBufferGetCatalogChangesXacts should have an Assert
to ensure rb->catchange_ntxns and xcnt are equal. We can probably then
avoid having xcnt_p as an out parameter as the caller can use
rb->catchange_ntxns instead.

-- 
With Regards,
Amit Kapila.




RE: proposal: Allocate work_mem From Pool

2022-07-12 Thread Joseph D Wagner
>> I think it would be better if work_mem was allocated from a pool
>> of  memory

> I think this has been proposed before, and the issue/objection
> with this idea is probably that query plans will be inconsistent,
> and end up being sub-optimal.

> work_mem is considered at planning time, but I think you only
> consider its application execution.  A query that was planned
>  with the configured work_mem but can't obtain the expected
> amount at execution time might perform poorly. Maybe it
> should be replanned with lower work_mem, but that would
> lose the arms-length relationship between the planner-executor.

> Should an expensive query wait a bit to try to get more
> work_mem? What do you do if 3 expensive queries are all
> waiting ?

Before I try to answer that, I need to know how the scheduler works.

Let's say there's a max of 8 worker processes, and 12 queries trying to run.
When does query #9 run? After the first of 1-8 completes, simple FIFO?
Or something else?

Also, how long does a query hold a worker process?  All the way to
completion?  Or does it perform some unit of work and rotate to
another query?

Joseph D Wagner

P.S.  If there's a link to all this somewhere, please let me know.
Parsing through years of email archives is not always user friendly or
helpful.





Re: POC: Lock updated tuples in tuple_update() and tuple_delete()

2022-07-12 Thread Alexander Korotkov
Hi Aleksander!

Thank you for your efforts reviewing this patch.

On Thu, Jul 7, 2022 at 12:43 PM Aleksander Alekseev
 wrote:
> > I'm going to need more time to meditate on the proposed changes and to 
> > figure out the performance impact.
>
> OK, turned out this patch is slightly more complicated than I
> initially thought, but I think I managed to get some vague
> understanding of what's going on.
>
> I tried to reproduce the case with concurrently updated tuples you
> described on the current `master` branch. I created a new table:
>
> ```
> CREATE TABLE phonebook(
>   "id" SERIAL PRIMARY KEY NOT NULL,
>   "name" NAME NOT NULL,
>   "phone" INT NOT NULL);
>
> INSERT INTO phonebook ("name", "phone")
> VALUES ('Alice', 123), ('Bob', 456), ('Charlie', 789);
> ```
>
> Then I opened two sessions and attached them with LLDB. I did:
>
> ```
> (lldb) b heapam_tuple_update
> (lldb) c
> ```
>
> ... in both cases because I wanted to see two calls (steps 2 and 4) to
> heapam_tuple_update() and check the return values.
>
> Then I did:
>
> ```
> session1 =# BEGIN;
> session2 =# BEGIN;
> session1 =# UPDATE phonebook SET name = 'Alex' WHERE name = 'Alice';
> ```
>
> This update succeeds and I see heapam_tuple_update() returning TM_Ok.
>
> ```
> session2 =# UPDATE phonebook SET name = 'Alfred' WHERE name = 'Alice';
> ```
>
> This update hangs on a lock.
>
> ```
> session1 =# COMMIT;
> ```
>
> Now session2 unfreezes and returns 'UPDATE 0'. table_tuple_update()
> was called once and returned TM_Updated. Also session2 sees an updated
> tuple now. So apparently the visibility check (step 3) didn't pass.

Yes.  But it's not exactly a visibility check.  Session2 re-evaluates
the WHERE condition on the most recent row version (bypassing the
snapshot). The WHERE condition is not true anymore, thus the row is not
updated.

> At this point I'm slightly confused. I don't see where a performance
> improvement is expected, considering that session2 gets blocked until
> session1 commits.
>
> Could you please walk me through here? Am I using the right test case
> or maybe you had another one in mind? Which steps do you consider
> expensive and expect to be mitigated by the patch?

This patch is not intended to change any high-level logic: the
transaction which updated the row still holds a lock on it until it
finishes. The positive performance impact I expect comes from doing the
work of the two calls tuple_update() and tuple_lock() in a single call
of tuple_update().  If we do this in one call, we can save some effort,
for instance by locking the same buffer once, not twice.
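To illustrate where the saving comes from, here is a toy model that is deliberately unrelated to the real heapam code: it just counts buffer lock acquisitions to show why folding the lock into the update call helps.

```c
#include <assert.h>

/* Toy buffer: only counts how many times it gets locked. */
typedef struct ToyBuffer
{
    int         nlocks;
    int         value;
} ToyBuffer;

static void toy_lock(ToyBuffer *b)   { b->nlocks++; }
static void toy_unlock(ToyBuffer *b) { (void) b; }

/* Separate tuple_update() + tuple_lock(): the buffer is locked twice. */
static void
update_then_lock(ToyBuffer *b, int v)
{
    toy_lock(b);  b->value = v;  toy_unlock(b);   /* the update */
    toy_lock(b);  /* ... lock the updated tuple */ toy_unlock(b);
}

/* Combined call: one acquisition covers both the update and the lock. */
static void
update_and_lock(ToyBuffer *b, int v)
{
    toy_lock(b);  b->value = v;  toy_unlock(b);
}
```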

--
Regards,
Alexander Korotkov




Re: [PATCH] Compression dictionaries for JSONB

2022-07-12 Thread Nikita Malakhov
Hi hackers!

Aleksander, please point me in the right direction if it was mentioned
before, I have a few questions:

1) It is not clear to me how you see the life cycle of such a
dictionary. If it is meant to keep growing without
cleaning up/rebuilding, it could affect performance in an undesirable way,
along with keeping unused data without
any means to get rid of them.
Also, I agree with Simon Riggs, using OIDs from the general pool for
dictionary entries is a bad idea.

2) From (1) follows another question - I haven't seen any means for getting
rid of unused keys (or any other means
for dictionary cleanup). How could it be done?

3) Is the following scenario legal: by some means a dictionary does not
contain keys for some entries? What happens then?

4) If one dictionary is used by several tables - I see future issues with
concurrent dictionary updates. This will for sure
affect performance and can cause unpredictable behavior for queries.

If you have any questions on Pluggable TOAST don't hesitate to ask me and
on JSONB Toaster you can ask Nikita Glukhov.

Thank you!

Regards,
Nikita Malakhov
Postgres Professional
https://postgrespro.ru/
On Mon, Jul 11, 2022 at 6:41 PM Aleksander Alekseev <
aleksan...@timescale.com> wrote:

> Hi hackers,
>
> > I invite anyone interested to join this effort as a co-author!
>
> Here is v5. Same as v4 but with a fixed compiler warning (thanks,
> cfbot). Sorry for the noise.
>
> --
> Best regards,
> Aleksander Alekseev
>


Re: CREATE TABLE ( .. STORAGE ..)

2022-07-12 Thread Aleksander Alekseev
Hi Peter,

> The "safety check: do not allow toasted storage modes unless column
> datatype is TOAST-aware" could be moved into GetAttributeStorage(), so
> it doesn't have to be repeated.  (Note that GetAttributeCompression()
> does similar checking.)

Good point. Fixed.

> ATExecSetStorage() currently doesn't do any such check, and your patch
> isn't adding one.  Is there a reason for that?

ATExecSetStorage() does this, but the check is a bit below [1]. In v7
I moved the check to GetAttributeStorage() as well.

[1] 
https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/commands/tablecmds.c#l8312

-- 
Best regards,
Aleksander Alekseev


v7-0001-Allow-specifying-STORAGE-attribute-for-a-new-tabl.patch
Description: Binary data


Re: Cleaning up historical portability baggage

2022-07-12 Thread Peter Eisentraut

On 12.07.22 03:10, Thomas Munro wrote:

AFAIK we generally only use pg_whatever() when there's a good reason,
such as an incompatibility, a complication or a different abstraction
that you want to highlight to a reader.  The reason here was
temporary: we couldn't implement standard pread/pwrite perfectly on
ancient HP-UX, but we *can* implement it on Windows, so the reason is
gone.

These particular pg_ prefixes have only been in our tree for a few
years and I was hoping to boot them out again before they stick, like
"Size".  I like using standard interfaces where possible for the very
basic stuff, to de-weird our stuff.


I agree.  That's been the established approach.




Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

2022-07-12 Thread Masahiko Sawada
On Tue, Jul 12, 2022 at 5:58 PM shiy.f...@fujitsu.com
 wrote:
>
> On Tue, Jul 12, 2022 8:49 AM Masahiko Sawada  wrote:
> >
> > I've attached an updated patch.
> >
>
> Hi,
>
> I met a segmentation fault in the test_decoding test after applying the patch
> for the master branch. I've attached the backtrace.

Thank you for testing the patch!

>
> It happened when executing the following code because it tried to free a NULL
> pointer (catchange_xip).
>
> /* be tidy */
> if (ondisk)
> pfree(ondisk);
> +   if (catchange_xip)
> +   pfree(catchange_xip);
>  }
>
> It seems to be related to configure option. I could reproduce it when using
> `./configure --enable-debug`.
> But I couldn't reproduce with `./configure --enable-debug CFLAGS="-Og -ggdb"`.

Hmm, I could not reproduce this problem even if I use ./configure
--enable-debug. And it's weird that we checked if catchange_xip is not
null but we did pfree for it:

#1  pfree (pointer=0x0) at mcxt.c:1177
#2  0x0078186b in SnapBuildSerialize (builder=0x1fd5e78,
lsn=25719712) at snapbuild.c:1792

Is it reproducible in your environment? If so, could you test it again
with the following changes?

diff --git a/src/backend/replication/logical/snapbuild.c
b/src/backend/replication/logical/snapbuild.c
index d015c06ced..a6e76e3781 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -1788,7 +1788,7 @@ out:
/* be tidy */
if (ondisk)
pfree(ondisk);
-   if (catchange_xip)
+   if (catchange_xip != NULL)
pfree(catchange_xip);
 }
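For context, the reason pfree(NULL) crashes at all (unlike free(3)) is that it has to read the chunk header stored immediately before the pointer to find the owning memory context. Here is a stripped-down illustration of that layout, not the real allocator:

```c
#include <assert.h>
#include <stdlib.h>

/* Stripped-down chunk header, just to show the dereference. */
typedef struct ChunkHeader
{
    void       *context;        /* stand-in for the owning MemoryContext */
} ChunkHeader;

static void *
chunk_alloc(size_t size)
{
    ChunkHeader *h = malloc(sizeof(ChunkHeader) + size);

    if (h == NULL)
        return NULL;
    h->context = (void *) h;    /* dummy owner */
    return h + 1;               /* caller sees the bytes after the header */
}

/* pfree-style: looks *before* the pointer, so passing NULL faults. */
static void
chunk_free(void *p)
{
    ChunkHeader *h = ((ChunkHeader *) p) - 1;   /* NULL => invalid read */

    free(h);
}
```

So a guard like `if (catchange_xip) pfree(catchange_xip);` is required whenever the pointer may never have been allocated.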

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




RE: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

2022-07-12 Thread shiy.f...@fujitsu.com
On Tue, Jul 12, 2022 8:49 AM Masahiko Sawada  wrote:
> 
> I've attached an updated patch.
> 

Hi,

I met a segmentation fault in the test_decoding test after applying the patch
for the master branch. I've attached the backtrace.

It happened when executing the following code because it tried to free a NULL
pointer (catchange_xip).

/* be tidy */
if (ondisk)
pfree(ondisk);
+   if (catchange_xip)
+   pfree(catchange_xip);
 }

It seems to be related to configure option. I could reproduce it when using
`./configure --enable-debug`.
But I couldn't reproduce with `./configure --enable-debug CFLAGS="-Og -ggdb"`.

Regards,
Shi yu
#0  0x00910333 in GetMemoryChunkContext (pointer=0x0) at 
../../../../src/include/utils/memutils.h:129
#1  pfree (pointer=0x0) at mcxt.c:1177
#2  0x0078186b in SnapBuildSerialize (builder=0x1fd5e78, lsn=25719712) 
at snapbuild.c:1792
#3  0x00782797 in SnapBuildProcessRunningXacts (builder=0x1fd5e78, 
lsn=25719712, running=0x27dfe50) at snapbuild.c:1199
#4  0x00774273 in standby_decode (ctx=0x1fc3e08, buf=0x7ffd8c95b5d0) at 
decode.c:346
#5  0x00773ab3 in LogicalDecodingProcessRecord 
(ctx=ctx@entry=0x1fc3e08, record=0x1fc41a0) at decode.c:119
#6  0x0077815d in pg_logical_slot_get_changes_guts (fcinfo=0x1fb5d88, 
confirm=, binary=) at logicalfuncs.c:271
#7  0x0067b63d in ExecMakeTableFunctionResult (setexpr=0x1fb42e0, 
econtext=0x1fb41b0, argContext=, expectedDesc=0x1fd7da8, 
randomAccess=false)
at execSRF.c:234
#8  0x0068b76f in FunctionNext (node=node@entry=0x1fb3fa0) at 
nodeFunctionscan.c:94
#9  0x0067be37 in ExecScanFetch (recheckMtd=0x68b450 , 
accessMtd=0x68b470 , node=0x1fb3fa0) at execScan.c:133
#10 ExecScan (node=0x1fb3fa0, accessMtd=0x68b470 , 
recheckMtd=0x68b450 ) at execScan.c:199
#11 0x0067344b in ExecProcNode (node=0x1fb3fa0) at 
../../../src/include/executor/executor.h:259
#12 ExecutePlan (execute_once=, dest=0x1fc02e8, 
direction=, numberTuples=0, sendTuples=, 
operation=CMD_SELECT,
use_parallel_mode=, planstate=0x1fb3fa0, estate=0x1fb3d78) 
at execMain.c:1636
#13 standard_ExecutorRun (queryDesc=0x1f9e178, direction=, 
count=0, execute_once=) at execMain.c:363
#14 0x007dda9e in PortalRunSelect (portal=0x1f51338, forward=, count=0, dest=) at pquery.c:924
#15 0x007decf7 in PortalRun (portal=portal@entry=0x1f51338, 
count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true,
run_once=run_once@entry=true, dest=dest@entry=0x1fc02e8, 
altdest=altdest@entry=0x1fc02e8, qc=0x7ffd8c95bbf0) at pquery.c:768
#16 0x007db4ff in exec_simple_query (
query_string=0x1ee29a8 "SELECT data FROM 
pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', 
'1', 'include-xids', '0');")
at postgres.c:1243
#17 0x007dc742 in PostgresMain (dbname=, 
username=) at postgres.c:4482
#18 0x0076132b in BackendRun (port=, port=) at postmaster.c:4503
#19 BackendStartup (port=) at postmaster.c:4231
#20 ServerLoop () at postmaster.c:1805
#21 0x007621fb in PostmasterMain (argc=argc@entry=3, 
argv=argv@entry=0x1edbf20) at postmaster.c:1477
#22 0x004f0f69 in main (argc=3, argv=0x1edbf20) at main.c:202


Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

2022-07-12 Thread Amit Kapila
On Tue, Jul 12, 2022 at 1:13 PM Masahiko Sawada  wrote:
>
> On Tue, Jul 12, 2022 at 3:25 PM Amit Kapila  wrote:
> >
> > On Tue, Jul 12, 2022 at 11:38 AM Masahiko Sawada  
> > wrote:
> > >
> > > On Tue, Jul 12, 2022 at 10:28 AM Masahiko Sawada  
> > > wrote:
> > > >
> > > >
> > > > I'm doing benchmark tests and will share the results.
> > > >
> > >
> > > I've done benchmark tests to measure the overhead introduced by doing
> > > bsearch() every time when decoding a commit record. I've simulated a
> > > very intensified situation where we decode 1M commit records while
> > > keeping builder->catchange.xip array but the overhead is negligible:
> > >
> > > HEAD: 584 ms
> > > Patched: 614 ms
> > >
> > > I've attached the benchmark script I used. With increasing
> > > LOG_SNAPSHOT_INTERVAL_MS to 9, the last decoding by
> > > pg_logical_slot_get_changes() decodes 1M commit records while keeping
> > > catalog modifying transactions.
> > >
> >
> > Thanks for the test. We should also see how it performs when (a) we
> > don't change LOG_SNAPSHOT_INTERVAL_MS,
>
> What point do you want to see in this test? I think the performance
> overhead depends on how many times we do bsearch() and how many
> transactions are in the list.
>

Right, I am not expecting any visible performance difference in this
case. This is to ensure that we are not incurring any overhead in the
more usual scenarios (or default cases). As per my understanding, the
purpose of increasing the value of LOG_SNAPSHOT_INTERVAL_MS is to
simulate a stress case for the changes made by the patch, and keeping
its value default will test the more usual scenarios.

> I increased this value to easily
> simulate the situation where we decode many commit records while
> keeping catalog modifying transactions. But even if we don't change
> this value, the result would not change if we don't change how many
> commit records we decode.
>
> > and (b) we have more DDL xacts
> > so that the array to search is somewhat bigger
>
> I've done the same performance tests while creating 64 catalog
> modifying transactions. The result is:
>
> HEAD: 595 ms
> Patched: 628 ms
>
> There was no big overhead.
>

Yeah, especially considering you have simulated a stress case for the patch.

-- 
With Regards,
Amit Kapila.




Re: Weird behaviour with binary copy, arrays and column count

2022-07-12 Thread James Vanns
That's a good tip! Can't believe I hadn't even thought of that! :/

Cheers

Jim

On Mon, 11 Jul 2022 at 21:59, Greg Stark  wrote:
>
> On Fri, 8 Jul 2022 at 13:09, James Vanns  wrote:
> >
> > It does seem to smell of an alignment, padding, buffer overrun, parsing 
> > kind of error.
>
> It does. I think you may need to bust out a debugger and see what
> array_recv is actually seeing...
>
> --
> greg



-- 
Jim Vanns
Principal Production Engineer
Industrial Light & Magic, London




Re: replacing role-level NOINHERIT with a grant-level option

2022-07-12 Thread tushar

On 7/11/22 11:01 PM, Robert Haas wrote:

Oops. Here is a rebased version of v3 which aims to fix this bug.

Thanks, Issue seems to be fixed with this patch.

--
regards,tushar
EnterpriseDB  https://www.enterprisedb.com/
The Enterprise PostgreSQL Company





Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

2022-07-12 Thread Kyotaro Horiguchi
At Mon, 11 Jul 2022 22:22:28 -0400, Melanie Plageman 
 wrote in 
> Hi,
> 
> In the attached patch set, I've added in missing IO operations for
> certain IO Paths as well as enumerating in the commit message which IO
> Paths and IO Operations are not currently counted and or not possible.
> 
> There is a TODO in HandleWalWriterInterrupts() about removing
> pgstat_report_wal() since it is immediately before a proc_exit()

Right. walwriter does that without needing the explicit call.

> I was wondering if LocalBufferAlloc() should increment the counter or if
> I should wait until GetLocalBufferStorage() to increment the counter.

It depends on what "allocate" means. Unlike shared buffers, local
buffers are taken from the OS and then allocated to a page.  OS-allocated
pages are restricted by num_temp_buffers, so I think what we're interested
in is the count incremented by LocalBufferAlloc(). (And it is the parallel
of an allocation for shared buffers.)

> I also realized that I am not differentiating between IOPATH_SHARED and
> IOPATH_STRATEGY for IOOP_FSYNC. But, given that we don't know what type
> of buffer we are fsync'ing by the time we call register_dirty_segment(),
> I'm not sure how we would fix this.

I think a flush scarcely happens for strategy-loaded buffers.  If
that is right, IOOP_FSYNC would not make much sense for
IOPATH_STRATEGY.

> On Wed, Jul 6, 2022 at 3:20 PM Andres Freund  wrote:
> 
> > On 2022-07-05 13:24:55 -0400, Melanie Plageman wrote:
> > > @@ -176,6 +176,8 @@ InitStandaloneProcess(const char *argv0)
> > >  {
> > >   Assert(!IsPostmasterEnvironment);
> > >
> > > + MyBackendType = B_STANDALONE_BACKEND;
> >
> > Hm. This is used for singleuser mode as well as bootstrap. Should we
> > split those? It's not like bootstrap mode really matters for stats, so
> > I'm inclined not to.
> >
> >
> I have no opinion currently.
> It depends on how commonly you think developers might want separate
> bootstrap and single user mode IO stats.

Regarding stats, I don't think separating them makes much sense.

> > > @@ -375,6 +376,8 @@ BootstrapModeMain(int argc, char *argv[], bool
> > check_only)
> > >* out the initial relation mapping files.
> > >*/
> > >   RelationMapFinishBootstrap();
> > > + // TODO: should this be done for bootstrap?
> > > + pgstat_report_io_ops();
> >
> > Hm. Not particularly useful, but also not harmful. But we don't need an
> > explicit call, because it'll be done at process exit too. At least I
> > think, it could be that it's different for bootstrap.
>
> I've removed this and other occurrences which were before proc_exit()
> (and thus redundant). (Though I did not explicitly check if it was
> different for bootstrap.)

pgstat_report_stat(true) is supposed to be called as needed via a
before_shmem_exit hook, so I think that's the right thing.

> > IMO it'd be a good idea to add pgstat_report_io_ops() to
> > pgstat_report_vacuum()/analyze(), so that the stats for a longrunning
> > autovac worker get updated more regularly.
> >
> 
> noted and fixed.

> > How about moving the pgstat_report_io_ops() into
> > pgstat_report_bgwriter(), pgstat_report_autovacuum() etc? Seems
> > unnecessary to have multiple pgstat_* calls in these places.
> >
> >
> >
> noted and fixed.

+* Also report IO Operations statistics

I think that the function comment also should mention this.

> > Wonder if it's worth making the lock specific to the backend type?
> >
> 
> I've added another Lock into PgStat_IOPathOps so that each BackendType
> can be locked separately. But, I've also kept the lock in
> PgStatShared_BackendIOPathOps so that reset_all and snapshot could be
> done easily.

Looks fine about the lock separation.
By the way, in the following line:

+   
>io_ops.stats[backend_type_get_idx(MyBackendType)];

backend_type_get_idx(x) is actually (x - 1) plus an assertion on the
value range, and the only use-case is here. There's a reverse
function as well, also used in only one place.

+   Datum   backend_type_desc =
+   
CStringGetTextDatum(GetBackendTypeDesc(idx_get_backend_type(i)));

In this usage, GetBackendTypeDesc() gracefully handles out-of-domain
values, but idx_get_backend_type() kills the process for the
same. This is inconsistent.

My humble opinion on this is that we should not define the two
functions and instead replace the calls to them with (x +/- 1).  In
addition to that, I think we should not abort() on invalid backend
types.  In that sense, I wonder if we could use the B_INVALIDth
element for this purpose.
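To be concrete, the shape I mean is something like the following (the enum values here are a made-up subset, not the real BackendType list):

```c
#include <assert.h>

/*
 * Hypothetical subset of backend types.  B_INVALID deliberately occupies
 * slot 0, so out-of-domain values have somewhere harmless to land and no
 * (x - 1) index mapping is needed.
 */
typedef enum BackendType
{
    B_INVALID = 0,
    B_BACKEND,
    B_AUTOVAC_WORKER,
    B_WAL_WRITER,
    BACKEND_NUM_TYPES           /* array size, including the invalid slot */
} BackendType;

static int  io_op_counts[BACKEND_NUM_TYPES];

/* No assertion-abort on bad input: invalid types fall into slot 0. */
static int *
io_counter_for(BackendType type)
{
    if ((int) type < 0 || type >= BACKEND_NUM_TYPES)
        type = B_INVALID;
    return &io_op_counts[type];
}
```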

> > > + LWLockAcquire(&stats_shmem->lock, LW_SHARED);
> > > + memcpy(&stats, reset_offset, sizeof(stats_shmem->stats));
> > > + LWLockRelease(&stats_shmem->lock);
> >
> > Which then also means that you don't need the reset offset stuff. It's
> > only there because with the changecount approach we can't take a lock to
> > reset the stats (since there is no lock). With a lock you can just reset
> > the shared state.
> >
> 
> Yes, I 

Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

2022-07-12 Thread Masahiko Sawada
On Tue, Jul 12, 2022 at 3:25 PM Amit Kapila  wrote:
>
> On Tue, Jul 12, 2022 at 11:38 AM Masahiko Sawada  
> wrote:
> >
> > On Tue, Jul 12, 2022 at 10:28 AM Masahiko Sawada  
> > wrote:
> > >
> > >
> > > I'm doing benchmark tests and will share the results.
> > >
> >
> > I've done benchmark tests to measure the overhead introduced by doing
> > bsearch() every time when decoding a commit record. I've simulated a
> > very intensified situation where we decode 1M commit records while
> > keeping builder->catchange.xip array but the overhead is negligible:
> >
> > HEAD: 584 ms
> > Patched: 614 ms
> >
> > I've attached the benchmark script I used. With increasing
> > LOG_SNAPSHOT_INTERVAL_MS to 9, the last decoding by
> > pg_logical_slot_get_changes() decodes 1M commit records while keeping
> > catalog modifying transactions.
> >
>
> Thanks for the test. We should also see how it performs when (a) we
> don't change LOG_SNAPSHOT_INTERVAL_MS,

What point do you want to see in this test? I think the performance
overhead depends on how many times we do bsearch() and how many
transactions are in the list. I increased this value to easily
simulate the situation where we decode many commit records while
keeping catalog modifying transactions. But even if we don't change
this value, the result would not change if we don't change how many
commit records we decode.

> and (b) we have more DDL xacts
> so that the array to search is somewhat bigger

I've done the same performance tests while creating 64 catalog
modifying transactions. The result is:

HEAD: 595 ms
Patched: 628 ms

There was no big overhead.
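For reference, the per-commit-record membership check being benchmarked boils down to a bsearch() over the sorted xid array, roughly as below. This is simplified: real TransactionId comparisons must be wraparound-aware, which a plain < is not.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint32_t TransactionId;

/* Simple ordering; real code would use wraparound-aware comparison. */
static int
xid_cmp(const void *a, const void *b)
{
    TransactionId x = *(const TransactionId *) a;
    TransactionId y = *(const TransactionId *) b;

    return (x < y) ? -1 : (x > y) ? 1 : 0;
}

/* Is xid one of the catalog-modifying transactions in the sorted array? */
static bool
xid_in_catchange_array(TransactionId xid,
                       const TransactionId *xip, size_t xcnt)
{
    return xcnt > 0 &&
        bsearch(&xid, xip, xcnt, sizeof(TransactionId), xid_cmp) != NULL;
}
```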

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: Support TRUNCATE triggers on foreign tables

2022-07-12 Thread Yugo NAGATA
On Tue, 12 Jul 2022 09:24:20 +0900
Fujii Masao  wrote:

> 
> 
> On 2022/07/08 17:13, Ian Lawrence Barwick wrote:
> >> If we want to add such prevention, we will need similar checks for
> >> INSERT/DELETE/UPDATE not only TRUNCATE. However, I think such fix is 
> >> independent
> >> from this and it can be proposed as another patch.
> > 
> > Ah OK, makes sense from that point of view. Thanks for the clarification!
> 
> So I pushed the v2 patch that Yugo-san posted. Thanks!

Thanks!


> 
> Regards,
> 
> -- 
> Fujii Masao
> Advanced Computing Technology Center
> Research and Development Headquarters
> NTT DATA CORPORATION


-- 
Yugo NAGATA 




Re: Making Vars outer-join aware

2022-07-12 Thread Richard Guo
On Mon, Jul 11, 2022 at 3:38 AM Tom Lane  wrote:

> Here's v2 of this patch series.  It's functionally identical to v1,
> but I've rebased it over the recent auto-node-support-generation
> changes, and also extracted a few separable bits in hopes of making
> the main planner patch smaller.  (It's still pretty durn large,
> unfortunately.)  Unlike the original submission, each step will
> compile on its own, though the intermediate states mostly don't
> pass all regression tests.


Noticed a different behavior from previous regarding PlaceHolderVar.
Take the query below as an example:

select a.i, ss.jj from a left join (select b.i, b.j + 1 as jj from b) ss
on a.i = ss.i;

Previously the expression 'b.j + 1' would not be wrapped in a
PlaceHolderVar, since it contains a Var of the subquery and
does not contain any non-strict constructs. Now with the patch, we
insert a PlaceHolderVar for it, in order to have a place to store
varnullingrels. So the plan for the above query becomes:

# explain (verbose, costs off) select a.i, ss.jj from a left join
(select b.i, b.j + 1 as jj from b) ss on a.i = ss.i;
                        QUERY PLAN
----------------------------------------------------------
 Hash Right Join
   Output: a.i, ((b.j + 1))
   Hash Cond: (b.i = a.i)
   ->  Seq Scan on public.b
         Output: b.i, (b.j + 1)
   ->  Hash
         Output: a.i
         ->  Seq Scan on public.a
               Output: a.i
(9 rows)

Note that the evaluation of expression 'b.j + 1' now occurs below the
outer join. Is this something we need to be concerned about?

Thanks
Richard


Re: [PATCH] Optimize json_lex_string by batching character copying

2022-07-12 Thread John Naylor
On Mon, Jul 11, 2022 at 11:07 PM Andres Freund  wrote:

> I wonder if we can add a somewhat more general function for scanning until
> some characters are found using SIMD? There's plenty other places that
could
> be useful.

In simple cases, we could possibly abstract the entire loop. With this
particular case, I imagine the most approachable way to write the loop
would be a bit more low-level:

while (p < end - VECTOR_WIDTH &&
   !vector_has_byte(p, '\\') &&
   !vector_has_byte(p, '"') &&
   vector_min_byte(p, 0x20))
p += VECTOR_WIDTH

I wonder if we'd lose a bit of efficiency here by not accumulating set bits
from the three conditions, but it's worth trying.
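For experimenting with the loop shape before wiring up real vector intrinsics, the helpers can be emulated with plain 64-bit SWAR tricks. Here is a sketch; note the sub-0x20 test works only because a byte is below 0x20 exactly when its top three bits are all clear:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define WORD_WIDTH sizeof(uint64_t)

/* Classic trick: does any byte of x equal zero? (exact, no false positives) */
static bool
word_has_zero(uint64_t x)
{
    return ((x - UINT64_C(0x0101010101010101)) & ~x &
            UINT64_C(0x8080808080808080)) != 0;
}

/* Does any byte of x equal c?  XOR with a broadcast of c, then test zero. */
static bool
word_has_byte(uint64_t x, unsigned char c)
{
    return word_has_zero(x ^ (UINT64_C(0x0101010101010101) * c));
}

/* Length of the leading run free of '\\', '"' and bytes below 0x20. */
static size_t
scan_plain(const char *s, size_t len)
{
    const char *p = s;
    const char *end = s + len;

    while (p + WORD_WIDTH <= end)
    {
        uint64_t    v;

        memcpy(&v, p, sizeof(v));
        /* masking with 0xE0 leaves a zero byte exactly where byte < 0x20 */
        if (word_has_byte(v, '\\') || word_has_byte(v, '"') ||
            word_has_zero(v & UINT64_C(0xE0E0E0E0E0E0E0E0)))
            break;
        p += WORD_WIDTH;
    }
    /* scalar tail, also handles the word that triggered the break */
    while (p < end && *p != '\\' && *p != '"' && (unsigned char) *p >= 0x20)
        p++;
    return (size_t) (p - s);
}
```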
-- 
John Naylor
EDB: http://www.enterprisedb.com


Re: First draft of the PG 15 release notes

2022-07-12 Thread Noah Misch
On Mon, Jul 11, 2022 at 12:39:57PM -0400, Bruce Momjian wrote:
> I had trouble reading the sentences in the order you used so I
> restructured it:
> 
>   The new default is one of the secure schema usage patterns thatlinkend="ddl-schemas-patterns"/> has recommended since the security
>   release for CVE-2018-1058.  The change applies to newly-created
>   databases in existing clusters and for new clusters.  Upgrading a
>   cluster or restoring a database dump will preserve existing permissions.

I agree with the sentence order change.

>   For existing databases, especially those having multiple users, consider
>   issuing REVOKE to adopt this new default.  For new
>   databases having zero need to defend against insider threats, granting
>   USAGE permission on their public
>   schemas will yield the behavior of prior releases.

s/USAGE/CREATE/ in the last sentence.  Looks good with that change.




Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

2022-07-12 Thread Amit Kapila
On Tue, Jul 12, 2022 at 11:38 AM Masahiko Sawada  wrote:
>
> On Tue, Jul 12, 2022 at 10:28 AM Masahiko Sawada  
> wrote:
> >
> >
> > I'm doing benchmark tests and will share the results.
> >
>
> I've done benchmark tests to measure the overhead introduced by doing
> bsearch() every time when decoding a commit record. I've simulated a
> very intensified situation where we decode 1M commit records while
> keeping builder->catchange.xip array but the overhead is negligible:
>
> HEAD: 584 ms
> Patched: 614 ms
>
> I've attached the benchmark script I used. With increasing
> LOG_SNAPSHOT_INTERVAL_MS to 9, the last decoding by
> pg_logical_slot_get_changes() decodes 1M commit records while keeping
> catalog modifying transactions.
>

Thanks for the test. We should also see how it performs when (a) we
don't change LOG_SNAPSHOT_INTERVAL_MS, and (b) we have more DDL xacts
so that the array to search is somewhat bigger

-- 
With Regards,
Amit Kapila.




Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

2022-07-12 Thread Masahiko Sawada
On Tue, Jul 12, 2022 at 10:28 AM Masahiko Sawada  wrote:
>
> On Tue, Jul 12, 2022 at 9:48 AM Masahiko Sawada  wrote:
> >
> > On Fri, Jul 8, 2022 at 8:20 PM Masahiko Sawada  
> > wrote:
> > >
> > > On Fri, Jul 8, 2022 at 5:59 PM Amit Kapila  
> > > wrote:
> > > >
> > > > On Fri, Jul 8, 2022 at 12:46 PM Masahiko Sawada  
> > > > wrote:
> > > > >
> > > > > On Fri, Jul 8, 2022 at 3:27 PM Amit Kapila  
> > > > > wrote:
> > > > > >
> > > > >
> > > > > > 1.
> > > > > > In ReorderBufferGetCatalogChangesXacts(), isn't it better to use the
> > > > > > list length of 'catchange_txns' to allocate xids array? If we can do
> > > > > > so, then we will save the need to repalloc as well.
> > > > >
> > > > > Since ReorderBufferGetCatalogChangesXacts() collects all ongoing
> > > > > catalog modifying transactions, the length of the array could be
> > > > > bigger than the one taken last time. We can start with the previous
> > > > > length but I think we cannot remove the need for repalloc.
> > > > >
> > > >
> > > > It is using the list "catchange_txns" to form xid array which
> > > > shouldn't change for the duration of
> > > > ReorderBufferGetCatalogChangesXacts(). Then the caller frees the xid
> > > > array after its use. Next time in
> > > > ReorderBufferGetCatalogChangesXacts(), the fresh allocation for xid
> > > > array happens, so not sure why repalloc would be required?
> > >
> > > Oops, I mistook catchange_txns for catchange->xcnt. You're right.
> > > Starting with the length of catchange_txns should be sufficient.
> > >
> >
> > I've attached an updated patch.
> >
> > While trying this idea, I noticed there is no API to get the length
> > of a dlist, as we discussed offlist. An alternative idea was to use a
> > List (T_XidList), but I'm not sure it's a great idea since deleting an
> > xid from the list is O(N), we would need to implement list_delete_xid,
> > and we would need to make sure the list nodes are allocated in the
> > reorder buffer context. So in the patch, I added a variable,
> > catchange_ntxns, to keep track of the length of the dlist. Please
> > review it.
> >
>
> I'm doing benchmark tests and will share the results.
>

I've done benchmark tests to measure the overhead introduced by doing
a bsearch() every time we decode a commit record. I've simulated a
very intense situation where we decode 1M commit records while
keeping the builder->catchange.xip array, but the overhead is negligible:

HEAD: 584 ms
Patched: 614 ms

I've attached the benchmark script I used. After increasing
LOG_SNAPSHOT_INTERVAL_MS to 9, the last decoding by
pg_logical_slot_get_changes() decodes 1M commit records while keeping
catalog modifying transactions.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/


bench.spec
Description: Binary data