Re: Oversight in reparameterize_path_by_child leading to executor crash

2023-08-08 Thread Richard Guo
On Mon, Aug 7, 2023 at 9:29 PM Ashutosh Bapat 
wrote:

> Thanks. The patch looks good overall. I like it because of its potential
> to reduce memory consumption in reparameterization. That's cake on top of
> cream :)
>

Thanks for reviewing this patch!


> Some comments here.
>
> + {
> + FLAT_COPY_PATH(new_path, path, Path);
> +
> + if (path->pathtype == T_SampleScan)
> + {
> + Index scan_relid = path->parent->relid;
> + RangeTblEntry *rte;
> +
> + /* it should be a base rel with a tablesample clause... */
> + Assert(scan_relid > 0);
> + rte = planner_rt_fetch(scan_relid, root);
> + Assert(rte->rtekind == RTE_RELATION);
> + Assert(rte->tablesample != NULL);
> +
> + ADJUST_CHILD_EXPRS(rte->tablesample, TableSampleClause *);
> + }
> + }
>
> This change makes this function usable only after the final plan has been
> selected. If some code accidently uses it earlier, it would corrupt
> rte->tablesample without getting detected easily. I think we should mention
> this in the function prologue and move it somewhere else. Other option is
> to a.
> leave rte->tablesample unchanged here with a comment as to why we aren't
> changing it b. reparameterize tablesample expression in
> create_samplescan_plan() if the path is parameterized by the parent. We
> call
> reparameterize_path_by_child() in create_nestloop_plan() as this patch does
> right now. I like that we are delaying reparameterization. It saves a
> bunch of
> memory. I haven't measured it but from my recent experiments I know it
> will be
> a lot.
>

I agree that we should mention in the function's comment that
reparameterize_path_by_child can only be used after the best path has
been selected because the RTE might be modified by this function.  I'm
not sure if it's a good idea to move the reparameterization of
tablesample to create_samplescan_plan().  That would make the
reparameterization work separated in different places.  And in the
future we may find more cases where we need modify RTEs or RELs for
reparameterization.  It seems better to me that we keep all the work in
one place.


>   * If the inner path is parameterized, it is parameterized by the
> - * topmost parent of the outer rel, not the outer rel itself.  Fix
> - * that.
> + * topmost parent of the outer rel, not the outer rel itself.  We will
>
> There's something wrong with the original sentence (which was probably
> written
> by me back then :)). I think it should have read "If the inner path is
> parameterized by the topmost parent of the outer rel instead of the outer
> rel
> itself, fix that.". If you agree, the new comment should change it
> accordingly.
>

Right.  Will do that.


> @@ -2430,6 +2430,16 @@ create_nestloop_path(PlannerInfo *root,
>  {
>   NestPath   *pathnode = makeNode(NestPath);
>   Relids inner_req_outer = PATH_REQ_OUTER(inner_path);
> + Relids outerrelids;
> +
> + /*
> + * Paths are parameterized by top-level parents, so run parameterization
> + * tests on the parent relids.
> + */
> + if (outer_path->parent->top_parent_relids)
> + outerrelids = outer_path->parent->top_parent_relids;
> + else
> + outerrelids = outer_path->parent->relids;
>
> This looks like an existing bug. Did partitionwise join paths ended up with
> extra restrict clauses without this fix? I am not able to see why this
> change
> is needed by rest of the changes in the patch?
>

This is not an existing bug.  Our current code (without this patch)
would always call create_nestloop_path() after the reparameterization
work has been done for the inner path.  So we can safely check against
outer rel (not its topmost parent) in create_nestloop_path().  But now
with this patch, the reparameterization is postponed until createplan.c,
so we have to check against the topmost parent of outer rel in
create_nestloop_path(), otherwise we'd have duplicate clauses in the
final plan.  For instance, without this change you'd get a plan like

regression=# explain (costs off)
select * from prt1 t1 left join lateral
(select t1.a as t1a, t2.* from prt1 t2) s on t1.a = s.a;
   QUERY PLAN
-
 Append
   ->  Nested Loop Left Join
 Join Filter: (t1_1.a = t2_1.a)
 ->  Seq Scan on prt1_p1 t1_1
 ->  Index Scan using iprt1_p1_a on prt1_p1 t2_1
   Index Cond: (a = t1_1.a)
   ->  Nested Loop Left Join
 Join Filter: (t1_2.a = t2_2.a)
 ->  Seq Scan on prt1_p2 t1_2
 ->  Index Scan using iprt1_p2_a on prt1_p2 t2_2
   Index Cond: (a = t1_2.a)
   ->  Nested Loop Left Join
 Join Filter: (t1_3.a = t2_3.a)
 ->  Seq Scan on prt1_p3 t1_3
 ->  Index Scan using iprt1_p3_a on prt1_p3 t2_3
   Index Cond: (a = t1_3.a)
(16 rows)

Note that the clauses in Join Filter are duplicate.


> Anyway, let's rename outerrelids to something else e.g.
>  outer_param_relids to reflect
> its true meaning.
>

Hmm, I'm not sure this is a good idea.  Here the 'outerrelids' is just

Re: cpluspluscheck vs ICU

2023-08-08 Thread Peter Eisentraut

On 08.08.23 01:35, Andres Freund wrote:

Hi,

On 2023-03-10 20:10:30 -0800, Andres Freund wrote:

On 2023-03-10 19:37:27 -0800, Andres Freund wrote:

I just hit this once more - and I figured out a fairly easy fix:

We just need a
   #ifndef U_DEFAULT_SHOW_DRAFT
   #define U_DEFAULT_SHOW_DRAFT 0
   #endif
before including unicode/ucol.h.

At first I was looking at
   #define U_SHOW_CPLUSPLUS_API 0
and
   #define U_HIDE_INTERNAL_API 1
which both work, but they are documented to be internal.


Err. Unfortunately only the U_SHOW_CPLUSPLUS_API approach actually works. The
others don't, not quite sure what I was doing earlier.

So it's either relying on a define marked as internal, or the below:


Alternatively we could emit U_DEFAULT_SHOW_DRAFT 0 into pg_config.h to avoid
that issue.


The only other thing I see is to do something like:
[ugly]
which seems mighty ugly.


The ICU docs talk about it like it's not really internal:
https://github.com/unicode-org/icu/blob/720e5741ccaa112c4faafffdedeb7459b66c5673/docs/processes/release/tasks/healthy-code.md#test-icu4c-headers

So I'm inclined to go with that solution.


This looks sensible to me.

Perhaps undef U_SHOW_CPLUSPLUS_API after including the headers, so that 
if extensions want to use the ICU C++ APIs, they are not tripped up by this?






Re: [PATCH] Add loongarch native checksum implementation.

2023-08-08 Thread YANG Xudong



On 2023/8/8 14:38, John Naylor wrote:


It seems that platforms capable of running Postgres 
only support 64 bit.


I think so.



 > So I guess using aligned memory access is necessary and I have updated
 > the comment in the code.

Okay, so it's not "necessary" in the sense that it's illegal, so I'm 
thinking we can just re-use the Arm comment language, as in 0002.


Yes. I think it is similar to Arm.


v4 0001 is the same as v3, but with a draft commit message. I will 
squash and commit this week, unless there is additional feedback.


Looks good to me. Thanks for the additional patch.




Re: Incorrect handling of OOM in WAL replay leading to data loss

2023-08-08 Thread Michael Paquier
On Wed, Aug 02, 2023 at 01:16:02PM +0900, Kyotaro Horiguchi wrote:
> I believe this approach is sufficient to determine whether the error
> is OOM or not. If total_len is currupted and has an excessively large
> value, it's highly unlikely that all subsequent pages for that length
> will be consistent.
> 
> Do you have any thoughts on this?

This could be more flexible IMO, and actually in some cases
errormsg_fatal may be eaten if using the WAL prefetcher as the error
message is reported with the caller of XLogPrefetcherReadRecord(), no?

Anything that has been discussed on this thread now involves a change
in XLogReaderState that induces an ABI breakage.  For HEAD, we are
likely going in this direction, but if we are going to bite the bullet
we'd better be a bit more aggressive with the integration and report
an error code side-by-side with the error message returned by
XLogPrefetcherReadRecord(), XLogReadRecord() and XLogNextRecord() so
as all of the callers can decide what they want to do on an invalid
record or just an OOM.

Attached is the idea of infrastructure I have in mind, as of 0001,
where this adds an error code to report_invalid_record().  For now
this includes three error codes appended to the error messages
generated that can be expanded if need be: no error, OOM and invalid
data.  The invalid data part may needs to be much more verbose, and
could be improved to make this stuff "less scary" as the other thread
proposes, but what I have here would be enough to trigger a different 
decision in the startup process if a record cannot be fetched on OOM
or if there's a different reason behind that.

0002 is an example of decision that can be taken in WAL replay if we
see an OOM, based on the error code received.  One argument is that we
may want to improve emode_for_corrupt_record() so as it reacts better
on OOM, upgrading the emode wanted, but this needs more analysis
depending on the code path involved.

0003 is my previous trick to inject an OOM failure at replay.  Reusing
the previous script, this would be enough to prevent an early redo
creating a loss of data.

Note that we have a few other things going in the tree.  As one
example, pg_walinspect would consider an OOM as the end of WAL.  Not
critical, still slightly incorrect as the end of WAL may not have been
reached yet so it can report some incorrect information depending on
what the WAL reader faces.  This could be improved with the additions
of 0001.

Thoughts or comments?
--
Michael
From acc80d3a6199dd7b28848579cb723e8af837a198 Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Tue, 8 Aug 2023 16:13:56 +0900
Subject: [PATCH v1 1/3] Add infrastructure to report error codes in WAL reader

This adds a field named errorcode to XLogReaderState, while the APIs in
charge of reading the next WAL records report an error code in parallel
of the error read.
---
 src/include/access/xlogprefetcher.h   |  3 +-
 src/include/access/xlogreader.h   | 16 +++-
 src/backend/access/transam/twophase.c |  3 +-
 src/backend/access/transam/xlogprefetcher.c   |  5 +-
 src/backend/access/transam/xlogreader.c   | 86 +++
 src/backend/access/transam/xlogrecovery.c |  3 +-
 src/backend/replication/logical/logical.c |  3 +-
 .../replication/logical/logicalfuncs.c|  3 +-
 src/backend/replication/slotfuncs.c   |  3 +-
 src/backend/replication/walsender.c   |  3 +-
 src/bin/pg_rewind/parsexlog.c |  9 +-
 src/bin/pg_waldump/pg_waldump.c   |  3 +-
 contrib/pg_walinspect/pg_walinspect.c |  3 +-
 13 files changed, 112 insertions(+), 31 deletions(-)

diff --git a/src/include/access/xlogprefetcher.h b/src/include/access/xlogprefetcher.h
index 7dd7f20ad0..7f80ed922f 100644
--- a/src/include/access/xlogprefetcher.h
+++ b/src/include/access/xlogprefetcher.h
@@ -48,7 +48,8 @@ extern void XLogPrefetcherBeginRead(XLogPrefetcher *prefetcher,
 	XLogRecPtr recPtr);
 
 extern XLogRecord *XLogPrefetcherReadRecord(XLogPrefetcher *prefetcher,
-			char **errmsg);
+			char **errmsg,
+			XLogReaderError *errorcode);
 
 extern void XLogPrefetcherComputeStats(XLogPrefetcher *prefetcher);
 
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index da32c7db77..24554de10f 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -58,6 +58,14 @@ typedef struct WALSegmentContext
 
 typedef struct XLogReaderState XLogReaderState;
 
+/* Values for XLogReaderState.reason */
+typedef enum XLogReaderError
+{
+	XLOG_READER_NONE = 0,
+	XLOG_READER_OOM,			/* out-of-memory */
+	XLOG_READER_INVALID_DATA,	/* record data */
+} XLogReaderError;
+
 /* Function type definitions for various xlogreader interactions */
 typedef int (*XLogPageReadCB) (XLogReaderState *xlogreader,
 			   XLogRecPtr targetPagePtr,
@@ -310,6 +318,8 @@ struct XLogReaderState
 	/* Buffer to hold error message */
 	char	   *errormsg_

Re: Doc limitation update proposal: include out-of-line OID usage per TOAST-ed columns

2023-08-08 Thread John Naylor
On Wed, Jul 5, 2023 at 9:45 PM Jakub Wartak 
wrote:

> [v3]

--- a/doc/src/sgml/limits.sgml
+++ b/doc/src/sgml/limits.sgml
@@ -10,6 +10,7 @@
   hard limits are reached.
  

+
  

@@ -92,11 +93,25 @@
  can be increased by recompiling
PostgreSQL
 

-   
-partition keys
-32
-can be increased by recompiling
PostgreSQL
-   
+
+ partition keys
+ 32
+ can be increased by recompiling
PostgreSQL
+

Ahem.

> > Also, perhaps the LO entries should be split into a separate patch.
Since they are a special case and documented elsewhere, it's not clear
their limits fit well here. Maybe they could.
>
> Well, but those are *limits* of the engine and they seem to be pretty
widely chosen especially in migration scenarios (because they are the only
ones allowed to store over 1GB). I think we should warn the dangers of
using pg_largeobjects.

I see no argument here against splitting into a separate patch for later.

> > Also the shared counter is the cause of the slowdown, but not the
reason for the numeric limit.
>
> Isn't it both? typedef Oid is unsigned int = 2^32, and according to
GetNewOidWithIndex() logic if we exhaust the whole OID space it will hang
indefinitely which has the same semantics as "being impossible"/permanent
hang (?)

Looking again, I'm thinking the OID type size is more relevant for the
first paragraph, and the shared/global aspect is more relevant for the
second.

The last issue is how to separate the notes at the bottom, since there are
now two topics.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: pg_upgrade fails with in-place tablespace

2023-08-08 Thread Michael Paquier
On Tue, Aug 08, 2023 at 12:35:38PM +0800, Rui Zhao wrote:
> I have only enabled allow_in_place_tablespaces to true during a binary 
> upgrade.

The pg_dump part should be OK to allow the restore of in-place
tablespaces, so I'll get that fixed first, but I think that we should
add a regression test as well.

I'll try to look at the binary upgrade path and pg_upgrade after this
is done.
--
Michael


signature.asc
Description: PGP signature


Re: Adding a LogicalRepWorker type field

2023-08-08 Thread Peter Smith
PSA v4 patches.

On Fri, Aug 4, 2023 at 5:45 PM Peter Smith  wrote:
>
> Thank you for the feedback received so far. Sorry, I have become busy
> lately with other work.
>
> IIUC there is a general +1 for the idea, but I still have some
> juggling necessary to make the functions/macros of patch 0001
> acceptable to everybody.
>
> The difficulties arise from:
> - no function overloading C
> - ideally, we prefer the same names for everything (e.g. instead of
> dual-set macros)
> - but launcher code calls need to pass param (other code always uses
> MyLogicalRepWorker)
> - would be nice (although no mandatory) to not have to always pass
> MyLogicalRepWorker as a param.
>
> Anyway, I will work towards finding some compromise on this next week.
> Currently, I feel the best choice is to go with what Bharath suggested
> in the first place -- just always pass the parameter (including
> passing MyLogicalRepWorker). I will think more about it...

v4-0001 uses only 3 simple inline functions. Callers always pass
parameters as Bharath had suggested.

>
> --
>
> Meanwhile, here are some replies to other feedback received:
>
> Ashutosh --  "May be we should create a union of type specific members" [1].
>
> Yes, I was wondering this myself, but I won't pursue this yet until
> getting over the hurdle of finding acceptable functions/macros for
> patch 0001. Hopefully, we can come back to this idea.
>

To be explored later.

> ~~
>
> Mellih -- "Isn't the apply worker type still decided by eliminating
> the other choices even with the patch?".
>
> AFAIK the only case of that now is the one-time check in the
> logicalrep_worker_launch() function. IMO, that is quite a different
> proposition to the HEAD macros that have to make that deduction
> 10,000s ot times in executing code. Actually, even the caller knows
> exactly the kind of worker it wants to launch so we can pass the
> LogicalRepWorkerType directly to logicalrep_worker_launch() and
> eliminate even this one-time-check. I can code it that way in the next
> patch version.
>

Now even the one-time checking to assign the worker type is removed.
The callers know the LogicalReWorkerType they want, so v4-0001 just
passes the type into logicalrep_worker_launch()

> ~~
>
> Barath -- "-1 for these names starting with prefix TYPE_, in fact LR_
> looked better."
>
> Hmmm. I'm not sure what is best. Of the options below I prefer
> "WORKER_TYPE_xxx", but I don't mind so long as there is a consensus.
>
> LR_APPLY_WORKER
> LR_PARALLEL_APPLY_WORKER
> LR_TABLESYNC_WORKER
>
> TYPE_APPLY_WORKERT
> TYPE_PARALLEL_APPLY_WORKER
> TYPE_TABLESYNC_WORKER
>
> WORKER_TYPE_APPLY
> WORKER_TYPE_PARALLEL_APPLY
> WORKER_TYPE_TABLESYNC
>
> APPLY_WORKER
> PARALLEL_APPLY_WORKER
> TABLESYNC_WORKER
>
> APPLY
> PARALLEL_APPLY
> TABLESYNC
>

For now, in v4-0001 these are called WORKERTYPE_xxx. Please do propose
better names if these are no good.

--
Kind Regards,
Peter Smith.
Fujitsu Australia


v4-0001-Add-LogicalRepWorkerType-enum.patch
Description: Binary data


v4-0002-Switch-on-worker-type.patch
Description: Binary data


Re: WIP: new system catalog pg_wait_event

2023-08-08 Thread Drouvot, Bertrand

Hi,

On 8/8/23 7:37 AM, Drouvot, Bertrand wrote:

Hi,

On 8/8/23 5:05 AM, Michael Paquier wrote:

On Tue, Aug 08, 2023 at 11:53:32AM +0900, Kyotaro Horiguchi wrote:

As I mentioned in another thread, I'm uncertain about our stance on
the class id of the wait event. If a class acts as a namespace, we
should include it in the view. Otherwise, if the class id is just an
attribute of the wait event, we should make the event name unique.


Including the class name in the view makes the most sense to me, FWIW,
as it could be also possible that one reuses an event name in the
existing in-core list, but for extensions.  That's of course not
something I would recommend.


Thanks Kyotaro-san and Michael for the feedback. I do agree and will
add the class name.



Please find attached v3 adding the wait event types.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.comFrom b54aa2dd33835a83bfe08a99338874200512fdf0 Mon Sep 17 00:00:00 2001
From: bdrouvotAWS 
Date: Sat, 5 Aug 2023 12:39:42 +
Subject: [PATCH v3] pg_wait_event

Adding a new system view, namely pg_wait_event, that describes the wait events.
---
 doc/src/sgml/system-views.sgml| 64 +++
 src/backend/catalog/system_views.sql  |  3 +
 src/backend/utils/activity/.gitignore |  1 +
 src/backend/utils/activity/Makefile   |  6 +-
 .../activity/generate-wait_event_types.pl | 77 +--
 src/backend/utils/activity/meson.build|  1 +
 src/backend/utils/activity/pg_wait_event.c| 41 ++
 src/include/catalog/pg_proc.dat   |  6 ++
 src/include/utils/meson.build |  4 +-
 src/test/regress/expected/rules.out   |  4 +
 src/test/regress/expected/sysviews.out| 11 +++
 src/test/regress/sql/sysviews.sql |  7 ++
 src/tools/msvc/clean.bat  |  1 +
 13 files changed, 200 insertions(+), 26 deletions(-)
  21.6% doc/src/sgml/
  56.7% src/backend/utils/activity/
   5.2% src/include/catalog/
   7.4% src/test/regress/expected/
   4.0% src/test/regress/sql/
   4.9% src/

diff --git a/doc/src/sgml/system-views.sgml b/doc/src/sgml/system-views.sgml
index 57b228076e..ed26c8326f 100644
--- a/doc/src/sgml/system-views.sgml
+++ b/doc/src/sgml/system-views.sgml
@@ -221,6 +221,11 @@
   views
  
 
+ 
+  pg_wait_event
+  wait events
+ 
+
 

   
@@ -4825,4 +4830,63 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx
   
  
 
+
+ 
+  pg_wait_event
+
+  
+   pg_wait_event
+  
+
+  
+   The view pg_wait_event provides description about 
the
+   wait events.
+  
+
+  
+   pg_wait_event Columns
+   
+
+ 
+  
+   Column Type
+  
+  
+   Description
+  
+ 
+
+
+
+ 
+  
+   wait_event_type text
+  
+  
+   Wait event type
+  
+ 
+
+ 
+  
+   wait_event_name text
+  
+  
+   Wait event name
+  
+ 
+
+ 
+  
+   description texte
+  
+  
+   Wait event description
+  
+ 
+
+   
+  
+ 
+
 
diff --git a/src/backend/catalog/system_views.sql 
b/src/backend/catalog/system_views.sql
index af65af6bdd..f86a4dd770 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1342,3 +1342,6 @@ CREATE VIEW pg_stat_subscription_stats AS
 ss.stats_reset
 FROM pg_subscription as s,
  pg_stat_get_subscription_stats(s.oid) as ss;
+
+CREATE VIEW pg_wait_event AS
+SELECT * FROM pg_get_wait_events() AS we;
diff --git a/src/backend/utils/activity/.gitignore 
b/src/backend/utils/activity/.gitignore
index d77079285b..ad089a0b63 100644
--- a/src/backend/utils/activity/.gitignore
+++ b/src/backend/utils/activity/.gitignore
@@ -1,2 +1,3 @@
 /pgstat_wait_event.c
 /wait_event_types.h
+/pg_wait_event_insert.c
diff --git a/src/backend/utils/activity/Makefile 
b/src/backend/utils/activity/Makefile
index f1117745d4..8595e6ea77 100644
--- a/src/backend/utils/activity/Makefile
+++ b/src/backend/utils/activity/Makefile
@@ -32,10 +32,14 @@ OBJS = \
pgstat_subscription.o \
pgstat_wal.o \
pgstat_xact.o \
+   pg_wait_event.o \
wait_event.o
 
 include $(top_srcdir)/src/backend/common.mk
 
+pg_wait_event.o: pg_wait_event_insert.c
+pg_wait_event_insert.c: wait_event_types.h
+
 wait_event.o: pgstat_wait_event.c
 pgstat_wait_event.c: wait_event_types.h
touch $@
@@ -44,4 +48,4 @@ wait_event_types.h: 
$(top_srcdir)/src/backend/utils/activity/wait_event_names.tx
$(PERL) $(srcdir)/generate-wait_event_types.pl --code $<
 
 maintainer-clean: clean
-   rm -f wait_event_types.h pgstat_wait_event.c
+   rm -f wait_event_types.h pgstat_wait_event.c pg_wait_event_insert.c
diff --git a/src/backend/utils/activity/generate-wait_event_types.pl 
b/src/backend/utils/activity/generate-wait_event_types.pl
index 56335e8730..c3f58acb4b 1006

Re: [PATCH] psql: Add tab-complete for optional view parameters

2023-08-08 Thread Dean Rasheed
On Mon, 7 Aug 2023 at 19:49, Christoph Heiss  wrote:
>
> On Fri, Jan 06, 2023 at 12:18:44PM +, Dean Rasheed wrote:
> > Hmm, I don't think we should be offering "check_option" as a tab
> > completion for CREATE VIEW at all, since that would encourage users to
> > use non-SQL-standard syntax, rather than CREATE VIEW ... WITH
> > [CASCADED|LOCAL] CHECK OPTION.
>
> Left that part in for now. I would argue that it is a well-documented
> combination and as such users would expect it to turn up in the
> tab-complete as well. OTOH not against removing it either, if there are
> others voicing the same opinion ..
>

On reflection, I think that's probably OK. I mean, I still don't like
the fact that it's offering to complete with non-SQL-standard syntax,
but that seems less bad than using an incomplete list of options, and
I don't really have any other better ideas.

Regards,
Dean




Re: Handle infinite recursion in logical replication setup

2023-08-08 Thread Peter Eisentraut

On 12.09.22 07:23, vignesh C wrote:

On Fri, 9 Sept 2022 at 11:12, Amit Kapila  wrote:


On Thu, Sep 8, 2022 at 9:32 AM vignesh C  wrote:



The attached patch has the changes to handle the same.



Pushed. I am not completely sure whether we want the remaining
documentation patch in this thread in its current form or by modifying
it. Johnathan has shown some interest in it. I feel you can start a
separate thread for it to see if there is any interest in the same and
close the CF entry for this work.


Thanks for pushing the patch. I have closed this entry in commitfest.
I will wait for some more time and see the response regarding the
documentation patch and then start a new thread if required.


This patch added the following error message:

errdetail_plural("Subscribed publication %s is subscribing to other 
publications.",

"Subscribed publications %s are subscribing to other publications.",
list_length(publist), pubnames->data),

But in PostgreSQL, a publication cannot subscribe to a publication, so 
this is not giving accurate information.  Apparently, what it is trying 
to say is that


The subscription that you are creating subscribes to publications that 
contain tables that are written to by other subscriptions.


Can we get to a more accurate wording like this?

There is also a translatability issue there, in the way the publication 
list is pasted into the message.


Is the list of affected publications really that interesting?  I wonder 
whether the list of affected tables might be more relevant?






Re: postgres_fdw: wrong results with self join + enable_nestloop off

2023-08-08 Thread Etsuro Fujita
Hi Richard,

On Mon, Jul 31, 2023 at 5:52 PM Richard Guo  wrote:
> On Fri, Jul 28, 2023 at 4:56 PM Etsuro Fujita  wrote:
>> here is a rebased version of the second patch, in which I modified the
>> ForeignPath and CustomPath cases in reparameterize_path_by_child() to
>> reflect the new members fdw_restrictinfo and custom_restrictinfo, for
>> safety, and tweaked a comment a bit.

> Hmm, it seems that ForeignPath for a foreign join does not support
> parameterized paths for now, as in postgresGetForeignJoinPaths() we have
> this check:
>
>   /*
>* This code does not work for joins with lateral references, since those
>* must have parameterized paths, which we don't generate yet.
>*/
>   if (!bms_is_empty(joinrel->lateral_relids))
>   return;
>
> And in create_foreign_join_path() we just set the path.param_info to
> NULL.
>
>   pathnode->path.param_info = NULL;   /* XXX see above */
>
> So I doubt that it's necessary to adjust fdw_restrictinfo in
> reparameterize_path_by_child, because it seems to me that
> fdw_restrictinfo must be empty there.  Maybe we can add an Assert there
> as below:
>
> -ADJUST_CHILD_ATTRS(fpath->fdw_restrictinfo);
> +
> +/*
> + * Parameterized foreign joins are not supported.  So this
> + * ForeignPath cannot be a foreign join and fdw_restrictinfo
> + * must be empty.
> + */
> +Assert(fpath->fdw_restrictinfo == NIL);
>
> That being said, it's also no harm to handle fdw_restrictinfo in
> reparameterize_path_by_child as the patch does.  So I'm OK we do that
> for safety.

Ok, but maybe my explanation was not good, so let me explain.  The
reason why I modified the code as such is to make the handling of
fdw_restrictinfo consistent with that of fdw_outerpath: we have the
code to reparameterize fdw_outerpath, which should be NULL though, as
we do not currently support parameterized foreign joins.

I modified the code a bit further to use an if-test to avoid a useless
function call, and added/tweaked comments and docs further.  Attached
is a new version of the patch.  I am planning to commit this, if there
are no objections.

Thanks!

Best regards,
Etsuro Fujita


0002-Allow-join-pushdown-even-if-pseudoconstant-quals-v4.patch
Description: Binary data


Re: Incorrect handling of OOM in WAL replay leading to data loss

2023-08-08 Thread Kyotaro Horiguchi
At Tue, 8 Aug 2023 16:29:49 +0900, Michael Paquier  wrote 
in 
> On Wed, Aug 02, 2023 at 01:16:02PM +0900, Kyotaro Horiguchi wrote:
> > I believe this approach is sufficient to determine whether the error
> > is OOM or not. If total_len is currupted and has an excessively large
> > value, it's highly unlikely that all subsequent pages for that length
> > will be consistent.
> > 
> > Do you have any thoughts on this?
> 
> This could be more flexible IMO, and actually in some cases
> errormsg_fatal may be eaten if using the WAL prefetcher as the error
> message is reported with the caller of XLogPrefetcherReadRecord(), no?

Right. The goal of my PoC was to detect OOM accurately or at least
sufficiently so. We need to separately pass the "error code" along
with the message to make it work with the prefethcer.  We could
enclose errormsg and errorcode in a struct.

> Anything that has been discussed on this thread now involves a change
> in XLogReaderState that induces an ABI breakage.  For HEAD, we are
> likely going in this direction, but if we are going to bite the bullet
> we'd better be a bit more aggressive with the integration and report
> an error code side-by-side with the error message returned by
> XLogPrefetcherReadRecord(), XLogReadRecord() and XLogNextRecord() so
> as all of the callers can decide what they want to do on an invalid
> record or just an OOM.

Sounds reasonable.

> Attached is the idea of infrastructure I have in mind, as of 0001,
> where this adds an error code to report_invalid_record().  For now
> this includes three error codes appended to the error messages
> generated that can be expanded if need be: no error, OOM and invalid
> data.  The invalid data part may needs to be much more verbose, and
> could be improved to make this stuff "less scary" as the other thread
> proposes, but what I have here would be enough to trigger a different 
> decision in the startup process if a record cannot be fetched on OOM
> or if there's a different reason behind that.

Agreed. This clarifies the basis for decisions at the upper layer
(ReadRecord) and adds flexibility.

> 0002 is an example of decision that can be taken in WAL replay if we
> see an OOM, based on the error code received.  One argument is that we
> may want to improve emode_for_corrupt_record() so as it reacts better
> on OOM, upgrading the emode wanted, but this needs more analysis
> depending on the code path involved.
> 
> 0003 is my previous trick to inject an OOM failure at replay.  Reusing
> the previous script, this would be enough to prevent an early redo
> creating a loss of data.
> 
> Note that we have a few other things going in the tree.  As one
> example, pg_walinspect would consider an OOM as the end of WAL.  Not
> critical, still slightly incorrect as the end of WAL may not have been
> reached yet so it can report some incorrect information depending on
> what the WAL reader faces.  This could be improved with the additions
> of 0001.
> 
> Thoughts or comments?

I like the overall direction. Though, I'm considering enclosing the
errormsg and errorcode in a struct.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Handle infinite recursion in logical replication setup

2023-08-08 Thread Amit Kapila
On Tue, Aug 8, 2023 at 1:50 PM Peter Eisentraut  wrote:
>
> On 12.09.22 07:23, vignesh C wrote:
> > On Fri, 9 Sept 2022 at 11:12, Amit Kapila  wrote:
> >>
> >> On Thu, Sep 8, 2022 at 9:32 AM vignesh C  wrote:
> >>>
> >>>
> >>> The attached patch has the changes to handle the same.
> >>>
> >>
> >> Pushed. I am not completely sure whether we want the remaining
> >> documentation patch in this thread in its current form or by modifying
> >> it. Johnathan has shown some interest in it. I feel you can start a
> >> separate thread for it to see if there is any interest in the same and
> >> close the CF entry for this work.
> >
> > Thanks for pushing the patch. I have closed this entry in commitfest.
> > I will wait for some more time and see the response regarding the
> > documentation patch and then start a new thread if required.
>
> This patch added the following error message:
>
> errdetail_plural("Subscribed publication %s is subscribing to other
> publications.",
> "Subscribed publications %s are subscribing to other publications.",
> list_length(publist), pubnames->data),
>
> But in PostgreSQL, a publication cannot subscribe to a publication, so
> this is not giving accurate information.  Apparently, what it is trying
> to say is that
>
> The subscription that you are creating subscribes to publications that
> contain tables that are written to by other subscriptions.
>
> Can we get to a more accurate wording like this?
>

+1 for changing the message as per your suggestion.

> There is also a translatability issue there, in the way the publication
> list is pasted into the message.
>
> Is the list of affected publications really that interesting?  I wonder
> whether the list of affected tables might be more relevant?
>

In that case, we need to specify both schema name and table name in
that case. I guess the list could be very long and not sure what to do
for schema publications ( Create Publication ... For Schema).

-- 
With Regards,
Amit Kapila.




Re: Cirrus-ci is lowering free CI cycles - what to do with cfbot, etc?

2023-08-08 Thread Heikki Linnakangas

On 08/08/2023 05:15, Andres Freund wrote:

IMO, for the individual user case it's important to use CI for "free", without
a whole lot of complexity. Which imo rules approaches like providing
$cloud_provider compute accounts, that's too much setup work.


+1


With the improvements detailed below, cirrus' free CI would last
about ~65 runs / month.

I think that's plenty.


For cfbot I hope we can find funding to pay for compute to use for CI.


+1


Potential paths forward for cfbot, in addition to the above:

- Pay for compute / ask the various cloud providers to grant us compute
   credits. At least some of the cloud providers can be used via cirrus-ci.

- Host (some) CI runners ourselves. Particularly with macos and windows, that
   could provide significant savings.

- Build our own system, using buildbot, jenkins or whatnot.


Opinions as to what to do?


The resources for running our own system isn't free either. I'm sure we 
can get sponsors for the cirrus-ci credits, or use donations.


I have been quite happy with Cirrus CI overall.


The attached series of patches:


All of this makes sense to me, although I don't use macos myself.


5) Move use of -Dsegsize_blocks=6 from macos to linux

Macos is expensive, -Dsegsize_blocks=6 slows things down. Alternatively we
could stop covering both meson and autoconf segsize_blocks. It does affect
runtime on linux as well.


Could we have a comment somewhere on why we use -Dsegsize_blocks on 
these particular CI runs? It seems pretty random. I guess the idea is to 
have one autoconf task and one meson task with that option, to check 
that the autoconf/meson option works?



6) Disable write cache flushes on windows

It's a bit ugly to do this without using the UI... Shaves off about 30s
from the tests.


A brief comment would be nice: "We don't care about persistence over 
hard crashes in the CI, so disable write cache flushes to speed it up."


--
Heikki Linnakangas
Neon (https://neon.tech)





Re: postgres_fdw: wrong results with self join + enable_nestloop off

2023-08-08 Thread Richard Guo
On Tue, Aug 8, 2023 at 4:40 PM Etsuro Fujita 
wrote:

> I modified the code a bit further to use an if-test to avoid a useless
> function call, and added/tweaked comments and docs further.  Attached
> is a new version of the patch.  I am planning to commit this, if there
> are no objections.


+1 to the v4 patch.  It looks good to me.

Thanks
Richard


Re: Fix last unitialized memory warning

2023-08-08 Thread Peter Eisentraut

On 19.07.23 19:15, Tristan Partin wrote:

On Sun Jul 9, 2023 at 2:23 AM CDT, Peter Eisentraut wrote:

On 06.07.23 15:41, Tristan Partin wrote:
> On Thu Jul 6, 2023 at 3:21 AM CDT, Peter Eisentraut wrote:
>> On 05.07.23 23:06, Tristan Partin wrote:
>>> Thanks for following up. My system is Fedora 38. I can confirm 
this is

>>> still happening on master.
>>>
>>> $ gcc --version
>>> gcc (GCC) 13.1.1 20230614 (Red Hat 13.1.1-4)
>>> Copyright (C) 2023 Free Software Foundation, Inc.
>>> This is free software; see the source for copying conditions.  
There is NO
>>> warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR 
PURPOSE.

>>> $ meson setup build --buildtype=release
>>
>> This buildtype turns on -O3 warnings.  We have usually opted against
>> chasing warnings in -O3 level because there are often some
>> false-positive uninitialized variable warnings with every new 
compiler.

>>
>> Note that we have set the default build type to debugoptimized, for 
that

>> reason.
> > Good to know, thanks.
> > Regarding the original patch, do you think it is good to be applied?

That patch looks reasonable.  But I can't actually reproduce the 
warning, even with gcc-13.  I do get the warning from plpgsql.  Can 
you show the warning you are seeing?


Here is the full warning that the original patch suppresses.


I was able to reproduce the warning now on Fedora.  I agree with the patch

-   PgBenchValue vargs[MAX_FARGS];
+   PgBenchValue vargs[MAX_FARGS] = { 0 };

I suggest to also do

 typedef enum
 {
-   PGBT_NO_VALUE,
+   PGBT_NO_VALUE = 0,

to make clear that the initialization value is meant to be invalid.

I also got the plpgsql warning that you showed earlier, but I couldn't 
think of a reasonable way to fix that.






Re: [PATCH] psql: Add tab-complete for optional view parameters

2023-08-08 Thread Christoph Heiss


On Tue, Aug 08, 2023 at 09:17:51AM +0100, Dean Rasheed wrote:
>
> On Mon, 7 Aug 2023 at 19:49, Christoph Heiss  wrote:
> >
> > On Fri, Jan 06, 2023 at 12:18:44PM +, Dean Rasheed wrote:
> > > Hmm, I don't think we should be offering "check_option" as a tab
> > > completion for CREATE VIEW at all, since that would encourage users to
> > > use non-SQL-standard syntax, rather than CREATE VIEW ... WITH
> > > [CASCADED|LOCAL] CHECK OPTION.
> >
> > Left that part in for now. I would argue that it is a well-documented
> > combination and as such users would expect it to turn up in the
> > tab-complete as well. OTOH not against removing it either, if there are
> > others voicing the same opinion ..
> >
>
> On reflection, I think that's probably OK. I mean, I still don't like
> the fact that it's offering to complete with non-SQL-standard syntax,
> but that seems less bad than using an incomplete list of options, and
> I don't really have any other better ideas.

My thought pretty much as well. While obviously far from ideal as you
say, it's the less bad case of these both. I would also guess that it is
not the only instance of non-SQL-standard syntax completion in psql ..

Thanks for weighing in once again.

Cheers,
Christoph




Re: Reducing memory consumed by RestrictInfo list translations in partitionwise join planning

2023-08-08 Thread Richard Guo
On Thu, Jul 27, 2023 at 10:06 PM Ashutosh Bapat <
ashutosh.bapat@gmail.com> wrote:

> 0002 - WIP patch to avoid repeated translations of RestrictInfo.
> The WIP patch avoids repeated translations by tracking the child for
> which a RestrictInfo is translated and reusing the same translation
> every time it is requested. In order to track the translations,
> RestrictInfo gets two new members.
> 1. parent_rinfo - In a child's RestrictInfo this points to the
> RestrictInfo applicable to the topmost parent in partition hierarchy.
> This is NULL in the topmost parent's RestrictInfo
> 2. child_rinfos - In a parent's RestrictInfo, this is a list that
> contains all the translated child RestrictInfos. In child
> RestrictInfos this is NULL.


I haven't looked into the details but with 0002 patch I came across a
crash while planning the query below.

regression=# set enable_partitionwise_join to on;
SET
regression=# EXPLAIN (COSTS OFF)
SELECT * FROM prt1 t1, prt2 t2
WHERE t1.a = t2.b AND t1.a < 450 AND t2.b > 250 AND t1.b = 0;
server closed the connection unexpectedly

Thanks
Richard


Suppress generating WAL records during the upgrade

2023-08-08 Thread Hayato Kuroda (Fujitsu)
Dear hackers,
(CC: Julien, Sawada-san, Amit)

This is a fork thread from "[PoC] pg_upgrade: allow to upgrade publisher node" 
[1].

# Background

[1] is the patch which allows to replicate logical replication slots from old 
to new node.
Followings describe the rough steps:

1. Boot old node as binary-upgrade mode
2. Check confirmed_lsn of all the slots, and confirm all WALs are replicated to 
downstream
3. Dump slot info to sql file
4. Stop old node
5. Boot new node as binary-upgrade mode
...

Here, step 2 was introduced for avoiding data loss. If there are some WAL 
records
ahead confirmed_lsn, such records would not be replicated anymore - it may be 
dangerous.

So in the current patch, pg_upgrade fails if other records than 
SHUTDOWN_CHECKPOINT
exits after any confirmed_flush_lsn.

# Problem

We found that following three records might be generated during the upgrade.

* RUNNING_XACT
* CHECKPOINT_ONLINE
* XLOG_FPI_FOR_HINT

RUNNING_XACT might be written by the background writer. Conditions for the 
generation are:

a. Elapsed 15 seconds since the last WAL creation or bootstraping of the 
process, and either of them:
b-1. The process had never create the RUNNING_XACT record, or
b-2. Some "important WALs" were created after the last RUNNING_XACT record


CHECKPOINT_ONLINE might be written by the checkpointer. Conditions for the 
generation are:

a. Elapsed checkpoint_timeout seconds since the last creation or bootstraping, 
and either of them:
b-1. The process had never create the CHECKPOINT_ONLINE record, or
b-2. Some "important WALs" were created after the last CHECKPOINT record


XLOG_FPI_FOR_HINT, which is raised by Sawada-san, might be generated by backend 
processes.
Conditions for the generation are:

a. Backend processes scanned any tuples (even if it was the system catalog), or 
either of them:
b-1. Data checksum was enabled, or
b-2. wal_log_hints was set to on

# Solution

I wanted to suppress generations of WALs during the upgrade, because of the "# 
Background".

Regarding the RUNNING_XACT and CHECKPOINT_ONLINE, it might be OK by removing the
condition b-1. The duration between bootstrap and initial 
{RUNNING_XACT|CHECKPOINT_ONLINE}
becomes longer, but I could not find impacts by it.

As for the XLOG_FPI_FOR_HINT, the simplest way I came up with is not to call
XLogSaveBufferForHint() during binary upgrade. Considerations may be not enough,
but I attached the patch for the fix. It passed CI on my repository.


Do you have any other considerations about it?
An approach, which adds "if (IsBinaryUpgare)" in XLogInsertAllowed(), was 
proposed in [2]. 
But I'm not sure it could really solve the issue - e.g., XLogInsertRecord() just
raised an ERROR if !XLogInsertAllowed().

[1]: https://commitfest.postgresql.org/44/4273/
[2]: 
https://www.postgresql.org/message-id/flat/20210121152357.s6eflhqyh4g5e6dv%40dalibo.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



0001-Suppress-generating-WAL-records-during-the-upgrade.patch
Description:  0001-Suppress-generating-WAL-records-during-the-upgrade.patch


Re: Avoid stack frame setup in performance critical routines using tail calls

2023-08-08 Thread John Naylor
On Wed, Jul 19, 2023 at 3:53 PM Andres Freund  wrote:
>
> Hi,
>
> David and I were chatting about this patch, in the context of his bump
> allocator patch.  Attached is a rebased version that is also split up
into two
> steps, and a bit more polished.

Here is a quick test -- something similar was used to measure the slab
improvements last cycle. With radix tree v37 0001-0011 from [1],

create extension bench_radix_tree;
select avg(load_ms) from generate_series(1,100) x(x), lateral (select *
from bench_load_random_int(100 * 1000 * (1+x-x))) a;

The backend was pinned and turbo off. Perf runs were separate from timed
runs. I included 0002 for completeness.

v37
 avg
-
 27.0400

  32.42%  postgres  bench_radix_tree.so  [.] rt_recursive_set
  21.60%  postgres  postgres [.] SlabAlloc
  11.06%  postgres  [unknown][k] 0x930018f7
  10.49%  postgres  bench_radix_tree.so  [.] rt_extend_down
   7.07%  postgres  postgres [.] MemoryContextAlloc
   4.83%  postgres  bench_radix_tree.so  [.] rt_node_insert_inner
   2.19%  postgres  bench_radix_tree.so  [.] rt_grow_node_48
   2.16%  postgres  bench_radix_tree.so  [.] rt_set.isra.0
   1.50%  postgres  bench_radix_tree.so  [.] MemoryContextAlloc@plt

v37 + palloc sibling calls
 avg
-
 26.0700

v37 + palloc sibling calls + opt aset
 avg
-
 26.0900

  33.78%  postgres  bench_radix_tree.so  [.] rt_recursive_set
  23.04%  postgres  postgres [.] SlabAlloc
  11.43%  postgres  [unknown][k] 0x930018f7
  11.05%  postgres  bench_radix_tree.so  [.] rt_extend_down
   5.52%  postgres  bench_radix_tree.so  [.] rt_node_insert_inner
   2.47%  postgres  bench_radix_tree.so  [.] rt_set.isra.0
   2.30%  postgres  bench_radix_tree.so  [.] rt_grow_node_48
   1.88%  postgres  postgres [.] MemoryContextAlloc
   1.44%  postgres  bench_radix_tree.so  [.] MemoryContextAlloc@plt

It's nice to see MemoryContextAlloc go down in the profile.

[1]
https://www.postgresql.org/message-id/cad21aoa3gs45dfmoyte-wm4fu+byzsypvcsmygglxwm40cg...@mail.gmail.com

--
John Naylor
EDB: http://www.enterprisedb.com


Re: How to build a new grammer for pg?

2023-08-08 Thread Hannu Krosing
I would look at how Babelfish DB did it when adding SQL Server compatibility

https://babelfishpg.org/ and https://github.com/babelfish-for-postgresql/

another source to inspect could be
https://github.com/IvorySQL/IvorySQL for "oracle compatible
PostgreSQL"

On Tue, Aug 1, 2023 at 10:07 PM Jonah H. Harris  wrote:
>
> On Tue, Aug 1, 2023 at 3:45 PM Andrew Dunstan  wrote:
>>
>> Or to enable some language other than SQL (QUEL anyone?)
>
>
> A few years ago, I got a minimal POSTQUEL working again to release as a patch 
> for April Fools' Day, which I never did. I should dig that up somewhere :)
>
> Anyway, as far as OP's original question regarding replacing the grammar, 
> there are a couple of such implementations floating around that have done 
> that. But, I actually think the pluggable parser patches were good examples 
> of how to integrate a replacement parser that generates the expected parse 
> tree nodes for anyone who wants to do their own custom parser. See Julien 
> Rouhaud's SQLOL in the "Hook for extensible parsing" thread and Jim 
> Mlodgenski's "Parser Hook" thread.
>
> --
> Jonah H. Harris
>




Re: Oversight in reparameterize_path_by_child leading to executor crash

2023-08-08 Thread Ashutosh Bapat
On Tue, Aug 8, 2023 at 12:44 PM Richard Guo  wrote:
>
>
> I agree that we should mention in the function's comment that
> reparameterize_path_by_child can only be used after the best path has
> been selected because the RTE might be modified by this function.  I'm
> not sure if it's a good idea to move the reparameterization of
> tablesample to create_samplescan_plan().  That would make the
> reparameterization work separated in different places.  And in the
> future we may find more cases where we need modify RTEs or RELs for
> reparameterization.  It seems better to me that we keep all the work in
> one place.

Let's see what a committer thinks. Let's leave it like that for now.

>
>
> This is not an existing bug.  Our current code (without this patch)
> would always call create_nestloop_path() after the reparameterization
> work has been done for the inner path.  So we can safely check against
> outer rel (not its topmost parent) in create_nestloop_path().  But now
> with this patch, the reparameterization is postponed until createplan.c,
> so we have to check against the topmost parent of outer rel in
> create_nestloop_path(), otherwise we'd have duplicate clauses in the
> final plan.

Hmm. I am worried about the impact this will have when the nestloop
path is created for two child relations. Your code will still pick the
top_parent_relids which seems odd. Have you tested that case?

> For instance, without this change you'd get a plan like
>
> regression=# explain (costs off)
> select * from prt1 t1 left join lateral
> (select t1.a as t1a, t2.* from prt1 t2) s on t1.a = s.a;
>QUERY PLAN
... snip ...
>
> Note that the clauses in Join Filter are duplicate.

Thanks for the example.

>
> Good question.  But I don't think calling this function at the beginning
> of reparameterize_path_by_child() can solve this problem.  Even if we do
> that, if we find that the path cannot be reparameterized in
> reparameterize_path_by_child(), it would be too late for us to take any
> measures.  So we need to make sure that such situation cannot happen.  I
> think we can emphasize this point in the comments of the two functions,
> and meanwhile add an Assert in reparameterize_path_by_child() that the
> path must be reparameterizable.

Works for me.

>
> Attaching v3 patch to address all the reviews above.

The patch looks good to me.

Please add this to the next commitfest.

-- 
Best Wishes,
Ashutosh Bapat




Re: Support to define custom wait events for extensions

2023-08-08 Thread Masahiro Ikeda

On 2023-08-08 10:05, Michael Paquier wrote:

On Tue, Aug 08, 2023 at 09:39:02AM +0900, Masahiro Ikeda wrote:

I am thinking a bit that we also need another hash where the key
is a custom string. For extensions that have no dependencies
with shared_preload_libraries, I think we need to avoid that
WaitEventExtensionNew() is called repeatedly and a new eventId
is issued each time.

So, is it better to have another hash where the key is
a custom string and uniqueness is identified by it to determine
if a new eventId should be issued?


Yeah, I was also considering if something like that is really
necessary, but I cannot stop worrying about adding more contention to
the hash table lookup each time an extention needs to retrieve an
event ID to use for WaitLatch() or such.  The results of the hash
table lookups could be cached in each backend, still it creates an
extra cost when combined with queries running in parallel on
pg_stat_activity that do the opposite lookup event_id -> event_name.

My suggestion adds more load to AddinShmemInitLock instead.
Hence, I was just thinking about relying on AddinShmemInitLock to
insert new entries in the hash table, only if its shmem state is not
found when calling ShmemInitStruct().  Or perhaps it is just OK to not
care about the impact event_name -> event_id lookup for fresh
connections, and just bite the bullet with two lookup keys instead of
relying on AddinShmemInitLock for the addition of new entries in the
hash table?  Hmm, perhaps you're right with your approach here, at the
end.


For the first idea, I agree that if a lot of new connections come in,
it is easy to leads many conflicts. The only solution I can think of
is to use connection pooling now.

IIUC, the second idea is based on the premise of allocating their shared
memory for each extension. In that case, the cons of the first idea can
be solved because the wait event infos are saved in their shared memory 
and

they don't need call WaitEventExtensionNew() anymore. Is that right?

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATIONFrom 0e3ccc6474bfcc51114d9363b7819b68f37fcad3 Mon Sep 17 00:00:00 2001
From: Masahiro Ikeda 
Date: Tue, 8 Aug 2023 19:24:32 +0900
Subject: [PATCH] Change to manage custom wait events in shared hash

Currently, names of the custom wait event must be
registered per backends. This patch relaxes the
constraints to store the wait events and their names
in the dynamic shared hashtable. So, all backends can
look up the wait event names on pg_stat_activity view
without additional processing by extensions.
---
 doc/src/sgml/monitoring.sgml  |   7 +-
 doc/src/sgml/xfunc.sgml   |  10 +-
 src/backend/storage/ipc/ipci.c|   6 +
 src/backend/storage/lmgr/lwlock.c |   2 +
 src/backend/utils/activity/wait_event.c   | 275 --
 src/backend/utils/init/postinit.c |   3 +
 src/include/storage/lwlock.h  |   2 +
 src/include/utils/wait_event.h|  17 +-
 .../modules/worker_spi/t/001_worker_spi.pl|  18 +-
 src/test/modules/worker_spi/worker_spi.c  |   6 +-
 src/tools/pgindent/typedefs.list  |   2 +-
 11 files changed, 216 insertions(+), 132 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index f4fc5d814f..19181832d7 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1121,10 +1121,9 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
  LWLock types
  to the list shown in  and
  . In some cases, the name
- assigned by an extension will not be available in all server processes;
- so an Extension or LWLock wait
- event might be reported as just
- extension rather than the
+ of LWLock assigned by an extension will not be
+ available in all server processes; so an wait event might be reported
+ as just extension rather than the
  extension-assigned name.
 

diff --git a/doc/src/sgml/xfunc.sgml b/doc/src/sgml/xfunc.sgml
index d6345a775b..7fec034db4 100644
--- a/doc/src/sgml/xfunc.sgml
+++ b/doc/src/sgml/xfunc.sgml
@@ -3470,17 +3470,13 @@ void RequestAddinShmemSpace(int size)
 
 
 
- shmem_startup_hook can allocate in shared memory
+ shmem_startup_hook can allocate in dynamic shared memory
  custom wait events by calling while holding the LWLock
  AddinShmemInitLock to avoid any race conditions:
 
-uint32 WaitEventExtensionNew(void)
-
- Next, each process needs to associate the wait event allocated previously
- to a user-facing custom string, which is something done by calling:
-
-void WaitEventExtensionRegisterName(uint32 wait_event_info, const char *wait_event_name)
+uint32 WaitEventExtensionNew(const char *wait_event_name)
 
+ The wait event is associated to a user-facing custom string.
  An example can be found in src/test/modules/worker_spi
  in the PostgreSQL source tree.
 

Re: Support to define custom wait events for extensions

2023-08-08 Thread Masahiro Ikeda

I accidentally attached a patch in one previous email.
But, you don't need to check it, sorry.
(v1-0001-Change-to-manage-custom-wait-events-in-shared-hash.patch)

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION




Re: CDC/ETL system on top of logical replication with pgoutput, custom client

2023-08-08 Thread Amit Kapila
On Mon, Aug 7, 2023 at 1:46 PM José Neves  wrote:
>
> Humm, that's... challenging. I faced some issues after "the fix" because I 
> had a couple of transactions with 25k updates, and I had to split it to be 
> able to push to our event messaging system, as our max message size is 10MB. 
> Relying on commit time would mean that all transaction operations will have 
> the same timestamp. If something goes wrong while my worker is pushing that 
> transaction data chunks, I will duplicate some data in the next run, so... 
> this wouldn't allow me to deal with data duplication.
> Is there any other way that you see to deal with it?
>
> Right now I only see an option, which is to store all processed LSNs on the 
> other side of the ETL. I'm trying to avoid that overhead.
>

Sorry, I don't understand your system enough to give you suggestions
but if you have any questions related to how logical replication work
then I might be able to help.

-- 
With Regards,
Amit Kapila.




Re: 2023-08-10 release announcement draft

2023-08-08 Thread Jonathan S. Katz

On 8/8/23 1:30 AM, Erik Rijkers wrote:

Op 8/8/23 om 03:15 schreef Jonathan S. Katz:


Please provide your feedback no later than August 10, 2023 0:00 AoE[1].


'You us'  should be
'You use'
    (2x)


It should actually be just "Use" -- but I've fixed both instances. Thanks!

Jonathan




OpenPGP_signature
Description: OpenPGP digital signature


Re: A failure in 031_recovery_conflict.pl on Debian/s390x

2023-08-08 Thread Christoph Berg
Re: Andres Freund
> Hm, that could just be a "harmless" race. Does it still happen if you apply
> the attached patch in addition?

Putting that patch on top of v8 made it pass 294 times before exiting
like this:

[08:52:34.134](0.032s) ok 1 - buffer pin conflict: cursor with conflicting pin 
established
Waiting for replication conn standby's replay_lsn to pass 0/343 on primary
done
timed out waiting for match: (?^:User was holding shared buffer pin for too 
long) at t/031_recovery_conflict.pl line 318.

But admittedly, this build machine is quite sluggish at times, likely
due to neighboring VMs on the same host. (Perhaps that even explains
the behavior from before this patch.) I'll still attach the logs since
I frankly can't judge what errors are acceptable here and which
aren't.

Christoph
2023-08-08 08:52:33.312 UTC [1734403] LOG:  starting PostgreSQL 17devel (Debian 
17~~devel-1) on s390x-ibm-linux-gnu, compiled by gcc (Debian 13.2.0-1) 13.2.0, 
64-bit
2023-08-08 08:52:33.312 UTC [1734403] LOG:  listening on Unix socket 
"/tmp/UNCayI_17k/.s.PGSQL.63827"
2023-08-08 08:52:33.318 UTC [1734406] LOG:  database system was shut down at 
2023-08-08 08:52:33 UTC
2023-08-08 08:52:33.350 UTC [1734403] LOG:  database system is ready to accept 
connections
2023-08-08 08:52:33.477 UTC [1734411] 031_recovery_conflict.pl LOG:  statement: 
CREATE TABLESPACE test_recovery_conflict_tblspc LOCATION ''
2023-08-08 08:52:33.484 UTC [1734413] 031_recovery_conflict.pl LOG:  received 
replication command: SHOW data_directory_mode
2023-08-08 08:52:33.484 UTC [1734413] 031_recovery_conflict.pl STATEMENT:  SHOW 
data_directory_mode
2023-08-08 08:52:33.485 UTC [1734413] 031_recovery_conflict.pl LOG:  received 
replication command: SHOW wal_segment_size
2023-08-08 08:52:33.485 UTC [1734413] 031_recovery_conflict.pl STATEMENT:  SHOW 
wal_segment_size
2023-08-08 08:52:33.485 UTC [1734413] 031_recovery_conflict.pl LOG:  received 
replication command: IDENTIFY_SYSTEM
2023-08-08 08:52:33.485 UTC [1734413] 031_recovery_conflict.pl STATEMENT:  
IDENTIFY_SYSTEM
2023-08-08 08:52:33.485 UTC [1734413] 031_recovery_conflict.pl LOG:  received 
replication command: BASE_BACKUP ( LABEL 'pg_basebackup base backup',  
PROGRESS,  CHECKPOINT 'fast',  WAIT 0,  MANIFEST 'yes',  TARGET 'client')
2023-08-08 08:52:33.485 UTC [1734413] 031_recovery_conflict.pl STATEMENT:  
BASE_BACKUP ( LABEL 'pg_basebackup base backup',  PROGRESS,  CHECKPOINT 'fast', 
 WAIT 0,  MANIFEST 'yes',  TARGET 'client')
2023-08-08 08:52:33.490 UTC [1734404] LOG:  checkpoint starting: immediate 
force wait
2023-08-08 08:52:33.500 UTC [1734404] LOG:  checkpoint complete: wrote 7 
buffers (5.5%); 0 WAL file(s) added, 0 removed, 1 recycled; write=0.001 s, 
sync=0.001 s, total=0.011 s; sync files=0, longest=0.000 s, average=0.000 s; 
distance=11350 kB, estimate=11350 kB; lsn=0/260, redo lsn=0/228
2023-08-08 08:52:33.507 UTC [1734414] 031_recovery_conflict.pl LOG:  received 
replication command: SHOW data_directory_mode
2023-08-08 08:52:33.507 UTC [1734414] 031_recovery_conflict.pl STATEMENT:  SHOW 
data_directory_mode
2023-08-08 08:52:33.508 UTC [1734414] 031_recovery_conflict.pl LOG:  received 
replication command: CREATE_REPLICATION_SLOT "pg_basebackup_1734414" TEMPORARY 
PHYSICAL ( RESERVE_WAL)
2023-08-08 08:52:33.508 UTC [1734414] 031_recovery_conflict.pl STATEMENT:  
CREATE_REPLICATION_SLOT "pg_basebackup_1734414" TEMPORARY PHYSICAL ( 
RESERVE_WAL)
2023-08-08 08:52:33.513 UTC [1734414] 031_recovery_conflict.pl LOG:  received 
replication command: IDENTIFY_SYSTEM
2023-08-08 08:52:33.513 UTC [1734414] 031_recovery_conflict.pl STATEMENT:  
IDENTIFY_SYSTEM
2023-08-08 08:52:33.518 UTC [1734414] 031_recovery_conflict.pl LOG:  received 
replication command: START_REPLICATION SLOT "pg_basebackup_1734414" 0/200 
TIMELINE 1
2023-08-08 08:52:33.518 UTC [1734414] 031_recovery_conflict.pl STATEMENT:  
START_REPLICATION SLOT "pg_basebackup_1734414" 0/200 TIMELINE 1
2023-08-08 08:52:33.596 UTC [1734413] 031_recovery_conflict.pl LOG:  temporary 
file: path "base/pgsql_tmp/pgsql_tmp1734413.0", size 137324
2023-08-08 08:52:33.596 UTC [1734413] 031_recovery_conflict.pl STATEMENT:  
BASE_BACKUP ( LABEL 'pg_basebackup base backup',  PROGRESS,  CHECKPOINT 'fast', 
 WAIT 0,  MANIFEST 'yes',  TARGET 'client')
2023-08-08 08:52:33.741 UTC [1734425] standby LOG:  received replication 
command: IDENTIFY_SYSTEM
2023-08-08 08:52:33.741 UTC [1734425] standby STATEMENT:  IDENTIFY_SYSTEM
2023-08-08 08:52:33.766 UTC [1734425] standby LOG:  received replication 
command: START_REPLICATION 0/300 TIMELINE 1
2023-08-08 08:52:33.766 UTC [1734425] standby STATEMENT:  START_REPLICATION 
0/300 TIMELINE 1
2023-08-08 08:52:33.823 UTC [1734427] 031_recovery_conflict.pl LOG:  statement: 
CREATE DATABASE test_db
2023-08-08 08:52:33.878 UTC [1734429] 031_recovery_conflict.pl LOG:  statement: 
CREATE TABLE test_recovery_conflict_table1(a int, b int);
2023-08-08 08:52:33.887 UTC [1734429] 031_recovery_co

Re: Reducing memory consumed by RestrictInfo list translations in partitionwise join planning

2023-08-08 Thread Ashutosh Bapat
On Tue, Aug 8, 2023 at 4:08 PM Richard Guo  wrote:
>
>
> I haven't looked into the details but with 0002 patch I came across a
> crash while planning the query below.
>
> regression=# set enable_partitionwise_join to on;
> SET
> regression=# EXPLAIN (COSTS OFF)
> SELECT * FROM prt1 t1, prt2 t2
> WHERE t1.a = t2.b AND t1.a < 450 AND t2.b > 250 AND t1.b = 0;
> server closed the connection unexpectedly

Thanks for the report. PFA the patches with this issue fixed. There is
another issue seen when running partition_join.sql "ERROR:  index key
does not match expected index column". I will investigate that issue.

Right now I am looking at the 4th point in my first email in this
thread. [1]. I am trying to figure whether that approach would work.
If that approach doesn't work what's the best to track the
translations and also figure out answers to 1, 2, 3 there. Please let
me know your opinions if any.

-- 
Best Wishes,
Ashutosh Bapat

[1] 
https://www.postgresql.org/message-id/CAExHW5s=bclmmq8n_bn6iu+pjau0ds3z_6dn6ile69esmsp...@mail.gmail.com
From 4abdfd676a31d885ccdbe1803c60c1df1d0c1df2 Mon Sep 17 00:00:00 2001
From: Ashutosh Bapat 
Date: Wed, 12 Jul 2023 14:34:14 +0530
Subject: [PATCH 1/2] Report memory used for planning a query in EXPLAIN
 ANALYZE

The memory used in the CurrentMemoryContext and its children is sampled
before and after calling pg_plan_query() from ExplainOneQuery(). The
difference in the two samples is reported as the memory consumed while
planning the query. This may not account for the memory allocated in
memory contexts which are not children of CurrentMemoryContext. These
contexts are usually other long lived contexts, e.g.
CacheMemoryContext, which are shared by all the queries run in a
session. The consumption in those can not be attributed only to a given
query and hence should not be reported any way.

The memory consumption is reported as "Planning Memory" property in
EXPLAIN ANALYZE output.

Ashutosh Bapat
---
 src/backend/commands/explain.c| 12 ++--
 src/backend/commands/prepare.c|  7 ++-
 src/backend/utils/mmgr/mcxt.c | 19 +++
 src/include/commands/explain.h|  3 ++-
 src/include/utils/memutils.h  |  1 +
 src/test/regress/expected/explain.out | 15 +++
 6 files changed, 49 insertions(+), 8 deletions(-)

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 8570b14f62..9f859949f0 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -397,16 +397,20 @@ ExplainOneQuery(Query *query, int cursorOptions,
 	planduration;
 		BufferUsage bufusage_start,
 	bufusage;
+		Size		mem_consumed;
 
 		if (es->buffers)
 			bufusage_start = pgBufferUsage;
 		INSTR_TIME_SET_CURRENT(planstart);
+		mem_consumed = MemoryContextMemUsed(CurrentMemoryContext);
 
 		/* plan the query */
 		plan = pg_plan_query(query, queryString, cursorOptions, params);
 
 		INSTR_TIME_SET_CURRENT(planduration);
 		INSTR_TIME_SUBTRACT(planduration, planstart);
+		mem_consumed = MemoryContextMemUsed(CurrentMemoryContext)
+		- mem_consumed;
 
 		/* calc differences of buffer counters. */
 		if (es->buffers)
@@ -417,7 +421,7 @@ ExplainOneQuery(Query *query, int cursorOptions,
 
 		/* run it (if needed) and produce output */
 		ExplainOnePlan(plan, into, es, queryString, params, queryEnv,
-	   &planduration, (es->buffers ? &bufusage : NULL));
+	   &planduration, (es->buffers ? &bufusage : NULL), &mem_consumed);
 	}
 }
 
@@ -527,7 +531,7 @@ void
 ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
 			   const char *queryString, ParamListInfo params,
 			   QueryEnvironment *queryEnv, const instr_time *planduration,
-			   const BufferUsage *bufusage)
+			   const BufferUsage *bufusage, const Size *mem_consumed)
 {
 	DestReceiver *dest;
 	QueryDesc  *queryDesc;
@@ -628,6 +632,10 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
 		double		plantime = INSTR_TIME_GET_DOUBLE(*planduration);
 
 		ExplainPropertyFloat("Planning Time", "ms", 1000.0 * plantime, 3, es);
+
+		if (mem_consumed)
+			ExplainPropertyUInteger("Planning Memory", "bytes",
+		(uint64) *mem_consumed, es);
 	}
 
 	/* Print info about runtime of triggers */
diff --git a/src/backend/commands/prepare.c b/src/backend/commands/prepare.c
index 18f70319fc..84544ce481 100644
--- a/src/backend/commands/prepare.c
+++ b/src/backend/commands/prepare.c
@@ -583,10 +583,12 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
 	instr_time	planduration;
 	BufferUsage bufusage_start,
 bufusage;
+	Size		mem_consumed;
 
 	if (es->buffers)
 		bufusage_start = pgBufferUsage;
 	INSTR_TIME_SET_CURRENT(planstart);
+	mem_consumed = MemoryContextMemUsed(CurrentMemoryContext);
 
 	/* Look it up in the hash table */
 	entry = FetchPreparedStatement(execstmt->name, true);
@@ -623,6 +625,8 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *

Re: initial pruning in parallel append

2023-08-08 Thread Robert Haas
On Tue, Aug 8, 2023 at 2:58 AM Amit Langote  wrote:
> > That doesn't seem like a big problem because there aren't many node
> > types that do pruning, right? I think we're just talking about Append
> > and MergeAppend, or something like that, right?
>
> MergeAppend can't be parallel-aware atm, so only Append.

Well, the question isn't whether it's parallel-aware, but whether
startup-time pruning happens there.

> So change the ordering of the following code in ParallelQueryMain():

Yeah, that would be a reasonable thing to do.

> Or we could consider something like the patch I mentioned in my 1st
> email.  The idea there was to pass the pruning result via a separate
> channel, not the DSM chunk linked into the PlanState tree.  To wit, on
> the leader side, ExecInitParallelPlan() puts the serialized
> List-of-Bitmapset into the shm_toc with a dedicated PARALLEL_KEY,
> alongside PlannedStmt, ParamListInfo, etc.  The List-of-Bitmpaset is
> initialized during the leader's ExecInitNode().  On the worker side,
> ExecParallelGetQueryDesc() reads the List-of-Bitmapset string and puts
> the resulting node into the QueryDesc, that ParallelQueryMain() then
> uses to do ExecutorStart() which copies the pointer to
> EState.es_part_prune_results.  ExecInitAppend() consults
> EState.es_part_prune_results and uses the Bitmapset from there, if
> present, instead of performing initial pruning.

This also seems reasonable.

> I'm assuming it's not
> too ugly if ExecInitAppend() uses IsParallelWorker() to decide whether
> it should be writing to EState.es_part_prune_results or reading from
> it -- the former if in the leader and the latter in a worker.

I don't think that's too ugly. I mean you have to have an if statement
someplace.

> If we
> are to go with this approach we will need to un-revert ec386948948c,
> which moved PartitionPruneInfo nodes out of Append/MergeAppend nodes
> to a List in PlannedStmt (copied into EState.es_part_prune_infos),
> such that es_part_prune_results mirrors es_part_prune_infos.

The comment for the revert (which was
5472743d9e8583638a897b47558066167cc14583) points to
https://www.postgresql.org/message-id/4191508.1674157...@sss.pgh.pa.us
as the reason, but it's not very clear to me why that email led to
this being reverted. In any event, I agree that if we go with your
idea to pass this via a separate PARALLEL_KEY, unreverting that patch
seems to make sense, because otherwise I think we don't have a fast
way to find the nodes that contain the state that we care about.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Use of additional index columns in rows filtering

2023-08-08 Thread Tomas Vondra



On 8/8/23 00:18, Peter Geoghegan wrote:
> On Mon, Aug 7, 2023 at 12:34 PM Tomas Vondra
>  wrote:
>> But then we call get_index_paths/build_index_path a little bit later,
>> and that decides to skip "lower SAOP" (which seems a bit strange,
>> because the column is "after" the equality, but meh). Anyway, at this
>> point we already decided what's a filter, ignoring the index clauses,
>> and not expecting any backsies.
> 
> I'm not surprised that it's due to the issue around "lower SAOP"
> clauses within get_index_paths/build_index_path. That whole approach
> seems rather ad-hoc to me. As you probably realize already, my own
> patch has to deal with lots of issues in the same area.
> 

Yeah. It's be much easier if the decision was done in one place, without
then changing it later.

>> The simples fix seems to be to add these skipped SAOP clauses as
>> filters. We know it can be evaluated on the index ...
> 
> Right. Obviously, my preferred solution to the problem at hand is to
> make everything into index quals via an approach like the one from my
> patch -- that works sensibly, no matter the length of the SAOP arrays.
> But even if you're willing to assume that that work will be in place
> for 17, there are still certain remaining gaps, that also seem
> important.
> 

Agreed.

> Even my patch cannot always make SAOP clauses into index quals. There
> are specific remaining gaps that I hope that your patch will still
> cover. The simplest example is a similar NOT IN() inequality, like
> this:
> 
> select
>   ctid, *
> from
>   tenk1
> where
>   thousand = 42
>   and
>   tenthous not in (1, 3, 42, 43, 44, 45, 46, 47, 48, 49, 50);
> 
> There is no way that my patch can handle this case. Where your patch
> seems to be unable to do better than master here, either -- just like
> with the "tenthous in ( )" variant. Once again, the inequality SAOP
> also ends up as table filter quals, not index filter quals.
> 

Are you sure? Because if I try with the 20230716 patch, I get this plan
(after disabling bitmapscan):

 QUERY PLAN
---
 Index Scan using tenk1_thous_tenthous on tenk1  (cost=0.31..44.54
rows=10 width=250)
   Index Cond: (thousand = 42)
   Index Filter: (tenthous <> ALL
('{1,3,42,43,44,45,46,47,48,49,50}'::integer[]))
   Filter: (tenthous <> ALL ('{1,3,42,43,44,45,46,47,48,49,50}'::integer[]))
(4 rows)

So the condition is recognized as index filter. Or did you mean
something different?

> It would also be nice if we found a way of doing this, while still
> reliably avoiding all visibility checks (just like "real index quals"
> will) -- since that should be safe in this specific case.
> 
> The MDAM paper describes a scheme for converting NOT IN() clauses into
> DNF single value predicates. But that's not going to happen for 17,
> and doesn't seem all that helpful with a query like this in any case.
> But it does suggest an argument in favor of visibility checks not
> being truly required for SAOP inequalities like this one (when they
> appear in index filters). I'm not sure if that idea is too particular
> to SAOP inequalities to be interesting -- just a suggestion.
> 

Not sure. A couple messages back I suggested that maybe there is a way
to check which expression would be safe to evaluate before checking the
visibility. This seems similar, although what you're suggesting really
applies to the "transformed" SAOP, and I'm not sure it can be extended
to the original SAOP.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Cirrus-ci is lowering free CI cycles - what to do with cfbot, etc?

2023-08-08 Thread Peter Eisentraut

On 08.08.23 04:15, Andres Freund wrote:

Potential paths forward for individual CI:

- migrate wholesale to another CI provider

- split CI tasks across different CI providers, rely on github et al
   displaying the CI status for different platforms

- give up


With the proposed optimizations, it seems you can still do a fair amount 
of work under the free plan.



Potential paths forward for cfbot, in addition to the above:

- Pay for compute / ask the various cloud providers to grant us compute
   credits. At least some of the cloud providers can be used via cirrus-ci.

- Host (some) CI runners ourselves. Particularly with macos and windows, that
   could provide significant savings.

- Build our own system, using buildbot, jenkins or whatnot.


I think we should use the "compute credits" plan from Cirrus CI.  It 
should be possible to estimate the costs for that.  Money is available, 
I think.






Re: Use of additional index columns in rows filtering

2023-08-08 Thread Tomas Vondra
On 8/8/23 06:21, Peter Geoghegan wrote:
> On Mon, Aug 7, 2023 at 3:18 PM Peter Geoghegan  wrote:
>> Even my patch cannot always make SAOP clauses into index quals. There
>> are specific remaining gaps that I hope that your patch will still
>> cover. The simplest example is a similar NOT IN() inequality, like
>> this:
>>
>> select
>>   ctid, *
>> from
>>   tenk1
>> where
>>   thousand = 42
>>   and
>>   tenthous not in (1, 3, 42, 43, 44, 45, 46, 47, 48, 49, 50);
>>
>> There is no way that my patch can handle this case. Where your patch
>> seems to be unable to do better than master here, either -- just like
>> with the "tenthous in ( )" variant. Once again, the inequality SAOP
>> also ends up as table filter quals, not index filter quals.
>>
>> It would also be nice if we found a way of doing this, while still
>> reliably avoiding all visibility checks (just like "real index quals"
>> will) -- since that should be safe in this specific case.
> 
> Actually, this isn't limited to SAOP inequalities. It appears as if
> *any* simple inequality has the same limitation. So, for example, the
> following query can only use table filters with the patch (never index
> filters):
> 
> select
>   ctid, *
> from
>   tenk1
> where
>   thousand = 42 and tenthous != 1;
> 
> This variant will use index filters, as expected (though with some
> risk of heap accesses when VM bits aren't set):
> 
> select
>   ctid, *
> from
>   tenk1
> where
>   thousand = 42 and tenthous is distinct from 1;
> 
> Offhand I suspect that it's a similar issue to the one you described for 
> SAOPs.
> 
> I see that get_op_btree_interpretation() will treat != as a kind of
> honorary member of an opfamily whose = operator has our != operator as
> its negator. Perhaps we should be finding a way to pass != quals into
> the index AM so that they become true index quals (obviously they
> would only be index filter predicates, never access predicates). That
> has the advantage of working in a way that's analogous to the way that
> index quals already avoid visibility checks.
> 

Are you sure you're using the right build? Because I get this plan:

 QUERY PLAN
---
 Index Scan using tenk1_thous_tenthous on tenk1  (cost=0.29..44.48
rows=10 width=250)
   Index Cond: (thousand = 42)
   Index Filter: (tenthous <> 1)
   Filter: (tenthous <> 1)
(4 rows)

Again, the inequality is clearly recognized as index filter.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: generic plans and "initial" pruning

2023-08-08 Thread Amit Langote
On Tue, Aug 8, 2023 at 12:36 AM Robert Haas  wrote:
> On Thu, Aug 3, 2023 at 4:37 AM Amit Langote  wrote:
> > Here's a patch set where the refactoring to move the ExecutorStart()
> > calls to be closer to GetCachedPlan() (for the call sites that use a
> > CachedPlan) is extracted into a separate patch, 0002.  Its commit
> > message notes an aspect of this refactoring that I feel a bit nervous
> > about -- needing to also move the CommandCounterIncrement() call from
> > the loop in PortalRunMulti() to PortalStart() which now does
> > ExecutorStart() for the PORTAL_MULTI_QUERY case.
>
> I spent some time today reviewing 0001. Here are a few thoughts and
> notes about things that I looked at.

Thanks for taking a look at this.

> First, I wondered whether it was really adequate for ExecEndPlan() to
> just loop over estate->es_plan_nodes and call it good. Put
> differently, is it possible that we could ever have more than one
> relevant EState, say for a subplan or an EPQ execution or something,
> so that this loop wouldn't cover everything? I found nothing to make
> me think that this is a real danger.

Check.

> Second, I wondered whether the ordering of cleanup operations could be
> an issue. Right now, a node can position cleanup code before, after,
> or both before and after recursing to child nodes, whereas with this
> design change, the cleanup code will always be run before recursing to
> child nodes.

Because a node is appended to es_planstate_nodes at the end of
ExecInitNode(), child nodes get added before their parent nodes.  So
the children are cleaned up first.

> Here, I think we have problems. Both ExecGather and
> ExecEndGatherMerge intentionally clean up the children before the
> parent, so that the child shutdown happens before
> ExecParallelCleanup(). Based on the comment and commit
> acf555bc53acb589b5a2827e65d655fa8c9adee0, this appears to be
> intentional, and you can sort of see why from looking at the stuff
> that happens in ExecParallelCleanup(). If the instrumentation data
> vanishes before the child nodes have a chance to clean things up,
> maybe EXPLAIN ANALYZE won't reflect that instrumentation any more. If
> the DSA vanishes, maybe we'll crash if we try to access it. If we
> actually reach DestroyParallelContext(), we're just going to start
> killing the workers. None of that sounds like what we want.
>
> The good news, of a sort, is that I think this might be the only case
> of this sort of problem. Most nodes recurse at the end, after doing
> all the cleanup, so the behavior won't change. Moreover, even if it
> did, most cleanup operations look pretty localized -- they affect only
> the node itself, and not its children. A somewhat interesting case is
> nodes associated with subplans. Right now, because of the coding of
> ExecEndPlan, nodes associated with subplans are all cleaned up at the
> very end, after everything that's not inside of a subplan. But with
> this change, they'd get cleaned up in the order of initialization,
> which actually seems more natural, as long as it doesn't break
> anything, which I think it probably won't, since as I mention in most
> cases node cleanup looks quite localized, i.e. it doesn't care whether
> it happens before or after the cleanup of other nodes.
>
> I think something will have to be done about the parallel query stuff,
> though. I'm not sure exactly what. It is a little weird that Gather
> and Gather Merge treat starting and killing workers as a purely
> "private matter" that they can decide to handle without the executor
> overall being very much aware of it. So maybe there's a way that some
> of the cleanup logic here could be hoisted up into the general
> executor machinery, that is, first end all the nodes, and then go
> back, and end all the parallelism using, maybe, another list inside of
> the estate. However, I think that the existence of ExecShutdownNode()
> is a complication here -- we need to make sure that we don't break
> either the case where that happen before overall plan shutdown, or the
> case where it doesn't.

Given that children are closed before parent, the order of operations
in ExecEndGather[Merge] is unchanged.

> Third, a couple of minor comments on details of how you actually made
> these changes in the patch set. Personally, I would remove all of the
> "is closed separately" comments that you added. I think it's a
> violation of the general coding principle that you should make the
> code look like it's always been that way. Sure, in the immediate
> future, people might wonder why you don't need to recurse, but 5 or 10
> years from now that's just going to be clutter. Second, in the cases
> where the ExecEndNode functions end up completely empty, I would
> suggest just removing the functions entirely and making the switch
> that dispatches on the node type have a switch case that lists all the
> nodes that don't need a callback here and say /* Nothing do for these
> node types */ break;. This will save a few

Re: Cirrus-ci is lowering free CI cycles - what to do with cfbot, etc?

2023-08-08 Thread Andres Freund
Hi,

On 2023-08-08 16:28:49 +0200, Peter Eisentraut wrote:
> On 08.08.23 04:15, Andres Freund wrote:
> > Potential paths forward for cfbot, in addition to the above:
> >
> > - Pay for compute / ask the various cloud providers to grant us compute
> >credits. At least some of the cloud providers can be used via cirrus-ci.
> >
> > - Host (some) CI runners ourselves. Particularly with macos and windows, 
> > that
> >could provide significant savings.
> >
> > - Build our own system, using buildbot, jenkins or whatnot.
>
> I think we should use the "compute credits" plan from Cirrus CI.  It should
> be possible to estimate the costs for that.  Money is available, I think.

Unfortunately just doing that seems like it would up considerably on the too
expensive side. Here are the stats for last months' cfbot runtimes (provided
by Thomas):

   task_name|sum
+
 FreeBSD - 13 - Meson   | 1017:56:09
 Windows - Server 2019, MinGW64 - Meson | 00:00:00
 SanityCheck| 76:48:41
 macOS - Ventura - Meson| 873:12:43
 Windows - Server 2019, VS 2019 - Meson & ninja | 1251:08:06
 Linux - Debian Bullseye - Autoconf | 830:17:26
 Linux - Debian Bullseye - Meson| 860:37:21
 CompilerWarnings   | 935:30:35
(8 rows)

If I did the math right, that's about 7000 credits (and 1 credit costs 1 USD).

task costs in credits
linux-sanity: 55.30
linux-autoconf: 598.04
linux-meson: 619.40
linux-compiler-warnings: 674.28
freebsd   : 732.24
windows   : 1201.09
macos : 3143.52


Now, those times are before optimizing test runtime. And besides optimizing
the tasks, we can also optimize not running tests for docs patches etc. And
optimize cfbot to schedule a bit better.

But still, the costs look not realistic to me.

If instead we were to use our own GCP account, it's a lot less. t2d-standard-4
instances, which are faster than what we use right now, cost $0.168984 / hour
as "normal" instances and $0.026764 as "spot" instances right now [1]. Windows
VMs are considerably more expensive due to licensing - 0.184$/h in addition.

Assuming spot instances, linux+freebsd tasks would cost ~100USD month (maybe
10-20% more in reality, due to a) spot instances getting terminated requiring
retries and b) disks).

Windows would be ~255 USD / month (same retries caveats).

Given the cost of macos, it seems like it'd be by far the most of affordable
to just buy 1-2 mac minis (2x ~660USD) and stick them in a shelf somewhere, as
persistent runners. Cirrus has builtin macos virtualization support - but can
only host two VMs on each mac, due to macos licensing restrictions. A single
mac mini would suffice to keep up with our unoptimized monthly runtime
(although there likely would be some overhead).

Greetings,

Andres Freund

[1] https://cloud.google.com/compute/all-pricing




Re: Fix last unitialized memory warning

2023-08-08 Thread Tristan Partin

On Tue Aug 8, 2023 at 5:20 AM CDT, Peter Eisentraut wrote:

On 19.07.23 19:15, Tristan Partin wrote:
> On Sun Jul 9, 2023 at 2:23 AM CDT, Peter Eisentraut wrote:
>> On 06.07.23 15:41, Tristan Partin wrote:
>> > On Thu Jul 6, 2023 at 3:21 AM CDT, Peter Eisentraut wrote:
>> >> On 05.07.23 23:06, Tristan Partin wrote:
>> >>> Thanks for following up. My system is Fedora 38. I can confirm 
>> this is

>> >>> still happening on master.
>> >>>
>> >>> $ gcc --version
>> >>> gcc (GCC) 13.1.1 20230614 (Red Hat 13.1.1-4)
>> >>> Copyright (C) 2023 Free Software Foundation, Inc.
>> >>> This is free software; see the source for copying conditions.  
>> There is NO
>> >>> warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR 
>> PURPOSE.

>> >>> $ meson setup build --buildtype=release
>> >>
>> >> This buildtype turns on -O3 warnings.  We have usually opted against
>> >> chasing warnings in -O3 level because there are often some
>> >> false-positive uninitialized variable warnings with every new 
>> compiler.

>> >>
>> >> Note that we have set the default build type to debugoptimized, for 
>> that

>> >> reason.
>> > > Good to know, thanks.
>> > > Regarding the original patch, do you think it is good to be applied?
>>
>> That patch looks reasonable.  But I can't actually reproduce the 
>> warning, even with gcc-13.  I do get the warning from plpgsql.  Can 
>> you show the warning you are seeing?
> 
> Here is the full warning that the original patch suppresses.


I was able to reproduce the warning now on Fedora.  I agree with the patch

-   PgBenchValue vargs[MAX_FARGS];
+   PgBenchValue vargs[MAX_FARGS] = { 0 };

I suggest to also do

  typedef enum
  {
-   PGBT_NO_VALUE,
+   PGBT_NO_VALUE = 0,

to make clear that the initialization value is meant to be invalid.

I also got the plpgsql warning that you showed earlier, but I couldn't 
think of a reasonable way to fix that.


Applied in v2.

--
Tristan Partin
Neon (https://neon.tech)
From 7e28d060871b19a8c09539e2da5e226b780431b9 Mon Sep 17 00:00:00 2001
From: Tristan Partin 
Date: Wed, 7 Jun 2023 09:21:59 -0500
Subject: [PATCH v2] Fix last remaining uninitialized memory warnings

gcc fails to properly analyze the code due to the loop stop condition
including `l != NULL`. Let's just help it out.
---
 src/bin/pgbench/pgbench.c | 2 +-
 src/bin/pgbench/pgbench.h | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 539c2795e2..2ba3e367c4 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -2239,7 +2239,7 @@ evalStandardFunc(CState *st,
 {
 	/* evaluate all function arguments */
 	int			nargs = 0;
-	PgBenchValue vargs[MAX_FARGS];
+	PgBenchValue vargs[MAX_FARGS] = { 0 };
 	PgBenchExprLink *l = args;
 	bool		has_null = false;
 
diff --git a/src/bin/pgbench/pgbench.h b/src/bin/pgbench/pgbench.h
index 957c9ca9da..f8efa4b868 100644
--- a/src/bin/pgbench/pgbench.h
+++ b/src/bin/pgbench/pgbench.h
@@ -33,7 +33,7 @@ union YYSTYPE;
  */
 typedef enum
 {
-	PGBT_NO_VALUE,
+	PGBT_NO_VALUE = 0,
 	PGBT_NULL,
 	PGBT_INT,
 	PGBT_DOUBLE,
-- 
Tristan Partin
Neon (https://neon.tech)



Re: Cirrus-ci is lowering free CI cycles - what to do with cfbot, etc?

2023-08-08 Thread Tristan Partin

On Mon Aug 7, 2023 at 9:15 PM CDT, Andres Freund wrote:

FWIW: with the patches applied, the "credit costs" in cirrus CI are roughly
like the following (depends on caching etc):

task costs in credits
linux-sanity: 0.01
linux-compiler-warnings: 0.05
linux-meson: 0.07
freebsd   : 0.08
linux-autoconf: 0.09
windows   : 0.18
macos : 0.28
total task runtime is 40.8
cost in credits is 0.76, monthly credits of 50 allow approx 66.10 runs/month


I am not in the loop on the autotools vs meson stuff. How much longer do 
we anticipate keeping autotools around? Seems like it could be a good 
opportunity to reduce some CI usage if autotools were finally dropped, 
but I know there are still outstanding tasks to complete.


Back of the napkin math says autotools is about 12% of the credit cost, 
though I haven't looked to see if linux-meson and linux-autotools are 
1:1.


--
Tristan Partin
Neon (https://neon.tech)




Re: Cirrus-ci is lowering free CI cycles - what to do with cfbot, etc?

2023-08-08 Thread Andres Freund
Hi,

On 2023-08-08 10:25:52 -0500, Tristan Partin wrote:
> On Mon Aug 7, 2023 at 9:15 PM CDT, Andres Freund wrote:
> > FWIW: with the patches applied, the "credit costs" in cirrus CI are roughly
> > like the following (depends on caching etc):
> > 
> > task costs in credits
> > linux-sanity: 0.01
> > linux-compiler-warnings: 0.05
> > linux-meson: 0.07
> > freebsd   : 0.08
> > linux-autoconf: 0.09
> > windows   : 0.18
> > macos : 0.28
> > total task runtime is 40.8
> > cost in credits is 0.76, monthly credits of 50 allow approx 66.10 runs/month
> 
> I am not in the loop on the autotools vs meson stuff. How much longer do we
> anticipate keeping autotools around?

I think it depends in what fashion. We've been talking about supporting
building out-of-tree modules with "pgxs" for at least a 5 year support
window. But the replacement isn't yet finished [1], so that clock hasn't yet
started ticking.


> Seems like it could be a good opportunity to reduce some CI usage if
> autotools were finally dropped, but I know there are still outstanding tasks
> to complete.
> 
> Back of the napkin math says autotools is about 12% of the credit cost,
> though I haven't looked to see if linux-meson and linux-autotools are 1:1.

The autoconf task is actually doing quite useful stuff right now, leaving the
use of configure aside, as it builds with address sanitizer. Without that it'd
be a lot faster. But we'd loose, imo quite important, coverage. The tests
would run a bit faster with meson, but it'd be overall a difference on the
margins.

Greetings,

Andres Freund

[1] https://github.com/anarazel/postgres/tree/meson-pkgconfig




Re: Cirrus-ci is lowering free CI cycles - what to do with cfbot, etc?

2023-08-08 Thread Tristan Partin

On Tue Aug 8, 2023 at 10:38 AM CDT, Andres Freund wrote:

Hi,

On 2023-08-08 10:25:52 -0500, Tristan Partin wrote:
> On Mon Aug 7, 2023 at 9:15 PM CDT, Andres Freund wrote:
> > FWIW: with the patches applied, the "credit costs" in cirrus CI are roughly
> > like the following (depends on caching etc):
> > 
> > task costs in credits

> >   linux-sanity: 0.01
> >   linux-compiler-warnings: 0.05
> >   linux-meson: 0.07
> >   freebsd   : 0.08
> >   linux-autoconf: 0.09
> >   windows   : 0.18
> >   macos : 0.28
> > total task runtime is 40.8
> > cost in credits is 0.76, monthly credits of 50 allow approx 66.10 runs/month
> 
> I am not in the loop on the autotools vs meson stuff. How much longer do we

> anticipate keeping autotools around?

I think it depends in what fashion. We've been talking about supporting
building out-of-tree modules with "pgxs" for at least a 5 year support
window. But the replacement isn't yet finished [1], so that clock hasn't yet
started ticking.


> Seems like it could be a good opportunity to reduce some CI usage if
> autotools were finally dropped, but I know there are still outstanding tasks
> to complete.
> 
> Back of the napkin math says autotools is about 12% of the credit cost,

> though I haven't looked to see if linux-meson and linux-autotools are 1:1.

The autoconf task is actually doing quite useful stuff right now, leaving the
use of configure aside, as it builds with address sanitizer. Without that it'd
be a lot faster. But we'd loose, imo quite important, coverage. The tests
would run a bit faster with meson, but it'd be overall a difference on the
margins.

[1] https://github.com/anarazel/postgres/tree/meson-pkgconfig


Makes sense. Please let me know if I can help you out in anyway for the 
v17 development cycle besides what we have already talked about.


--
Tristan Partin
Neon (https://neon.tech)




Re: generic plans and "initial" pruning

2023-08-08 Thread Robert Haas
On Tue, Aug 8, 2023 at 10:32 AM Amit Langote  wrote:
> But should ExecInitNode() subroutines return the partially initialized
> PlanState node or NULL on detecting invalidation?  If I'm
> understanding how you think this should be working correctly, I think
> you mean the former, because if it were the latter, ExecInitNode()
> would end up returning NULL at the top for the root and then there's
> nothing to pass to ExecEndNode(), so no way to clean up to begin with.
> In that case, I think we will need to adjust ExecEndNode() subroutines
> to add `if (node->ps.ps_ResultTupleSlot)` in the above code, for
> example.  That's something Tom had said he doesn't like very much [1].

Yeah, I understood Tom's goal as being "don't return partially
initialized nodes."

Personally, I'm not sure that's an important goal. In fact, I don't
even think it's a desirable one. It doesn't look difficult to audit
the end-node functions for cases where they'd fail if a particular
pointer were NULL instead of pointing to some real data, and just
fixing all such cases to have NULL-tests looks like purely mechanical
work that we are unlikely to get wrong. And at least some cases
wouldn't require any changes at all.

If we don't do that, the complexity doesn't go away. It just moves
someplace else. Presumably what we do in that case is have
ExecInitNode functions undo any initialization that they've already
done before returning NULL. There are basically two ways to do that.
Option one is to add code at the point where they return early to
clean up anything they've already initialized, but that code is likely
to substantially duplicate whatever the ExecEndNode function already
knows how to do, and it's very easy for logic like this to get broken
if somebody rearranges an ExecInitNode function down the road. Option
two is to rearrange the ExecInitNode functions now, to open relations
or recurse at the beginning, so that we discover the need to fail
before we initialize anything. That restricts our ability to further
rearrange the functions in future somewhat, but more importantly,
IMHO, it introduces more risk right now. Checking that the ExecEndNode
function will not fail if some pointers are randomly null is a lot
easier than checking that changing the order of operations in an
ExecInitNode function breaks nothing.

I'm not here to say that we can't do one of those things. But I think
adding null-tests to ExecEndNode functions looks like *far* less work
and *way* less risk.

There's a second issue here, too, which is when we abort ExecInitNode
partway through, how do we signal that? You're rightly pointing out
here that if we do that by returning NULL, then we don't do it by
returning a pointer to the partially initialized node that we just
created, which means that we either need to store those partially
initialized nodes in a separate data structure as you propose to do in
0001, or else we need to pick a different signalling convention. We
could change (a) ExecInitNode to have an additional argument, bool
*kaboom, or (b) we could make it return bool and return the node
pointer via a new additional argument, or (c) we could put a Boolean
flag into the estate and let the function signal failure by flipping
the value of the flag. If we do any of those things, then as far as I
can see 0001 is unnecessary. If we do none of them but also avoid
creating partially initialized nodes by one of the two techniques
mentioned two paragraphs prior, then 0001 is also unnecessary. If we
do none of them but do create partially initialized nodes, then we
need 0001.

So if this were a restaurant menu, then it might look like this:

Prix Fixe Menu (choose one from each)

First Course - How do we clean up after partial initialization?
(1) ExecInitNode functions produce partially initialized nodes
(2) ExecInitNode functions get refactored so that the stuff that can
cause early exit always happens first, so that no cleanup is ever
needed
(3) ExecInitNode functions do any required cleanup in situ

Second Course - How do we signal that initialization stopped early?
(A) Return NULL.
(B) Add a bool * out-parmeter to ExecInitNode.
(C) Add a Node * out-parameter to ExecInitNode and change the return
value to bool.
(D) Add a bool to the EState.
(E) Something else, maybe.

I think that we need 0001 if we choose specifically (1) and (A). My
gut feeling is that the least-invasive way to do this project is to
choose (1) and (D). My second choice would be (1) and (C), and my
third choice would be (1) and (A). If I can't have (1), I think I
prefer (2) over (3), but I also believe I prefer hiding in a deep hole
to either of them. Maybe I'm not seeing the whole picture correctly
here, but both (2) and (3) look awfully painful to me.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Use of additional index columns in rows filtering

2023-08-08 Thread Peter Geoghegan
On Tue, Aug 8, 2023 at 7:28 AM Tomas Vondra
 wrote:
> Are you sure? Because if I try with the 20230716 patch, I get this plan
> (after disabling bitmapscan):

> So the condition is recognized as index filter. Or did you mean
> something different?

No, I was just wrong about this (and about inequalities in general). I
now see why the planner preferred a bitmap scan, which makes sense.
Apologies.

> Not sure. A couple messages back I suggested that maybe there is a way
> to check which expression would be safe to evaluate before checking the
> visibility. This seems similar, although what you're suggesting really
> applies to the "transformed" SAOP, and I'm not sure it can be extended
> to the original SAOP.

The transformation doesn't necessarily have to happen in order for it
to be possible in principle (and correct). My point was that there are
a handful of important types of expressions (SAOPs among them, but
possibly also RowCompareExpr and IS NULL tests) that are "index quals
in spirit". These expressions therefore don't seem to need visibility
checks at all -- the index qual guarantees "apply transitively".

It's possible that an approach that focuses on leakproof quals won't
have any problems, and will be strictly better than "extending the
index qual guarantees to index-qual-like expressions". Really not sure
about that.

In any case I see a huge amount of value in differentiating between
cases that need visibility checks (if only via the VM) and those that
do not, ever. I'm speaking very generally here -- nothing to do with
my adversarial tenk1 test case. It's weird that index quals have such
a massive advantage over even simple index filters -- that feels
artificial. I suggest that you focus on that aspect, since it has the
potential to make what is already a compelling patch into a much more
compelling patch.

-- 
Peter Geoghegan




Re: Cirrus-ci is lowering free CI cycles - what to do with cfbot, etc?

2023-08-08 Thread Alvaro Herrera
On 2023-Aug-08, Andres Freund wrote:

> Given the cost of macos, it seems like it'd be by far the most of affordable
> to just buy 1-2 mac minis (2x ~660USD) and stick them in a shelf somewhere, as
> persistent runners. Cirrus has builtin macos virtualization support - but can
> only host two VMs on each mac, due to macos licensing restrictions. A single
> mac mini would suffice to keep up with our unoptimized monthly runtime
> (although there likely would be some overhead).

If using persistent workers is an option, maybe we should explore that.
I think we could move all or some of the Linux - Debian builds to
hardware that we already have in shelves (depending on how much compute
power is really needed.)

I think using other OSes is more difficult, mostly because I doubt we
want to deal with licenses; but even FreeBSD might not be a realistic
option, at least not in the short term.

Still,

>task_name|sum
> +
>  FreeBSD - 13 - Meson   | 1017:56:09
>  Windows - Server 2019, MinGW64 - Meson | 00:00:00
>  SanityCheck| 76:48:41
>  macOS - Ventura - Meson| 873:12:43
>  Windows - Server 2019, VS 2019 - Meson & ninja | 1251:08:06
>  Linux - Debian Bullseye - Autoconf | 830:17:26
>  Linux - Debian Bullseye - Meson| 860:37:21
>  CompilerWarnings   | 935:30:35
> (8 rows)
>

moving just Debian, that might alleviate 76+830+860+935 hours from the
Cirrus infra, which is ~46%.  Not bad.


(How come Windows - Meson reports allballs?)

-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/
"No tengo por qué estar de acuerdo con lo que pienso"
 (Carlos Caszeli)




Re: A failure in 031_recovery_conflict.pl on Debian/s390x

2023-08-08 Thread Thomas Munro
On Wed, Aug 9, 2023 at 2:01 AM Christoph Berg  wrote:
> Putting that patch on top of v8 made it pass 294 times before exiting
> like this:
>
> [08:52:34.134](0.032s) ok 1 - buffer pin conflict: cursor with conflicting 
> pin established
> Waiting for replication conn standby's replay_lsn to pass 0/343 on primary
> done
> timed out waiting for match: (?^:User was holding shared buffer pin for too 
> long) at t/031_recovery_conflict.pl line 318.

Can you reproduce that with logging like this added on top?

diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 5b663a2997..72f5274c95 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -629,6 +629,7 @@ SetStartupBufferPinWaitBufId(int bufid)
/* use volatile pointer to prevent code rearrangement */
volatile PROC_HDR *procglobal = ProcGlobal;

+   elog(LOG, "XXX SetStartupBufferPinWaitBufId(%d)", bufid);
procglobal->startupBufferPinWaitBufId = bufid;
 }

@@ -640,8 +641,11 @@ GetStartupBufferPinWaitBufId(void)
 {
/* use volatile pointer to prevent code rearrangement */
volatile PROC_HDR *procglobal = ProcGlobal;
+   int bufid;

-   return procglobal->startupBufferPinWaitBufId;
+   bufid = procglobal->startupBufferPinWaitBufId;
+   elog(LOG, "XXX GetStartupBufferPinWaitBufId() -> %d", bufid);
+   return bufid;
 }




Re: pgsql: Some refactoring to export json(b) conversion functions

2023-08-08 Thread Alvaro Herrera
On 2023-Jul-26, Amit Langote wrote:

> Some refactoring to export json(b) conversion functions
> 
> This is to export datum_to_json(), datum_to_jsonb(), and
> jsonb_from_cstring(), though the last one is exported as
> jsonb_from_text().

After this commit, Coverity started complaining that
datum_to_jsonb_internal() leaks the JsonLexContext here

 754   │ case JSONTYPE_CAST:
 755   │ case JSONTYPE_JSON:
 756   │ {
 757   │ /* parse the json right into the existing result 
object */
 758   │ JsonLexContext *lex;
 759   │ JsonSemAction sem;
 760   │ text   *json = DatumGetTextPP(val);
 761   │ 
 762   │ lex = makeJsonLexContext(json, true);
 763   │ 
 764   │ memset(&sem, 0, sizeof(sem));
 765   │ 
 766   │ sem.semstate = (void *) result;
 767   │ 
 768   │ sem.object_start = jsonb_in_object_start;
 769   │ sem.array_start = jsonb_in_array_start;
 770   │ sem.object_end = jsonb_in_object_end;
 771   │ sem.array_end = jsonb_in_array_end;
 772   │ sem.scalar = jsonb_in_scalar;
 773   │ sem.object_field_start = 
jsonb_in_object_field_start;
 774   │ 
 775   │ pg_parse_json_or_ereport(lex, &sem);
 776   │ }
 777   │ break;

Admittedly, our support code for this is not great, since we have no
clean way to free those resources.  Some places like json_object_keys
are freeing everything manually (even though in that particular case
it's unnecessary, since that one runs in a memcxt that's going to be
cleaned up shortly afterwards).

One idea that Tom floated was to allow the JsonLexContext to be
optionally stack-allocated.  That reduces palloc() traffic; but some
callers do need it to be palloc'ed.  Here's a patch that does it that
way, and adds a freeing routine that knows what to do in either case.
It may make sense to do some further analysis and remove useless free
calls.


It may make sense to change the structs that contain JsonLexContext *
so that they directly embed JsonLexContext instead.  That would further
reduce palloc'ing.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"Thou shalt not follow the NULL pointer, for chaos and madness await
thee at its end." (2nd Commandment for C programmers)
>From df8b71428481f718ac6bb60a3ca5969a6876da7e Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Thu, 3 Aug 2023 11:44:15 +0200
Subject: [PATCH] JsonLexContext allocation/free

---
 src/backend/utils/adt/json.c |  38 
 src/backend/utils/adt/jsonb.c|  13 +--
 src/backend/utils/adt/jsonfuncs.c| 106 +--
 src/bin/pg_verifybackup/parse_manifest.c |   4 +-
 src/common/jsonapi.c |  43 +++--
 src/include/common/jsonapi.h |  23 +++--
 src/include/utils/jsonfuncs.h|   2 +-
 7 files changed, 146 insertions(+), 83 deletions(-)

diff --git a/src/backend/utils/adt/json.c b/src/backend/utils/adt/json.c
index e405791f5d..27f9a51228 100644
--- a/src/backend/utils/adt/json.c
+++ b/src/backend/utils/adt/json.c
@@ -106,11 +106,11 @@ json_in(PG_FUNCTION_ARGS)
 {
 	char	   *json = PG_GETARG_CSTRING(0);
 	text	   *result = cstring_to_text(json);
-	JsonLexContext *lex;
+	JsonLexContext lex;
 
 	/* validate it */
-	lex = makeJsonLexContext(result, false);
-	if (!pg_parse_json_or_errsave(lex, &nullSemAction, fcinfo->context))
+	makeJsonLexContext(&lex, result, false);
+	if (!pg_parse_json_or_errsave(&lex, &nullSemAction, fcinfo->context))
 		PG_RETURN_NULL();
 
 	/* Internal representation is the same as text */
@@ -152,13 +152,13 @@ json_recv(PG_FUNCTION_ARGS)
 	StringInfo	buf = (StringInfo) PG_GETARG_POINTER(0);
 	char	   *str;
 	int			nbytes;
-	JsonLexContext *lex;
+	JsonLexContext lex;
 
 	str = pq_getmsgtext(buf, buf->len - buf->cursor, &nbytes);
 
 	/* Validate it. */
-	lex = makeJsonLexContextCstringLen(str, nbytes, GetDatabaseEncoding(), false);
-	pg_parse_json_or_ereport(lex, &nullSemAction);
+	makeJsonLexContextCstringLen(&lex, str, nbytes, GetDatabaseEncoding(), false);
+	pg_parse_json_or_ereport(&lex, &nullSemAction);
 
 	PG_RETURN_TEXT_P(cstring_to_text_with_len(str, nbytes));
 }
@@ -1625,14 +1625,16 @@ json_unique_object_field_start(void *_state, char *field, bool isnull)
 bool
 json_validate(text *json, bool check_unique_keys, bool throw_error)
 {
-	JsonLexContext *lex = makeJsonLexContext(json, check_unique_keys);
+	JsonLexContext lex;
 	JsonSemAction uniqueSemAction = {0};
 	JsonUniqueParsingState state;
 	JsonParseErrorType result;
 
+	makeJsonLexContext(&lex, json, check_unique_keys);
+
 	if (check_unique_keys)
 	{
-		state.lex = lex;
+		state.lex = &lex;
 		state.stack = NULL;
 		state.id_counter = 0;
 		state.unique = true;
@@ -164

Re: Use of additional index columns in rows filtering

2023-08-08 Thread Peter Geoghegan
On Mon, Aug 7, 2023 at 9:21 PM Peter Geoghegan  wrote:
> I see that get_op_btree_interpretation() will treat != as a kind of
> honorary member of an opfamily whose = operator has our != operator as
> its negator. Perhaps we should be finding a way to pass != quals into
> the index AM so that they become true index quals (obviously they
> would only be index filter predicates, never access predicates). That
> has the advantage of working in a way that's analogous to the way that
> index quals already avoid visibility checks.

The approach in your patch can only really work with index scans (and
index-only scans). So while it is more general than true index quals
in some ways, it's also less general in other ways: it cannot help
bitmap index scans.

While I accept that the inability of bitmap index scans to use index
filters in this way is, to some degree, a natural and inevitable
downside of bitmap index scans, that isn't always true. For example,
it doesn't seem to be the case with simple inequalities. Bitmap index
scans argue for making cases involving quals that are "index quals in
spirit" into actual index quals. Even if you can reliably avoid extra
heap accesses for plain index scans using expression evaluation, I
can't see that working for bitmap index scans.

More concretely, if we have an index on "tenk1 (four, two)", then we
miss out on the opportunity to eliminate heap accesses for a query
like this one:

select
  ctid, *
from
  tenk1
where
  four = 1 and two != 1;

This will get a bitmap index scan plan (that uses our composite
index), which makes sense overall. But the details beyond that make no
sense -- since we're using table filter quals for "two". It turns out
that the bitmap heap scan will access every single heap page in the
tenk1 table as a result, even though we could have literally avoided
all heap accesses had we been able to push down the != as an index
qual. This is a difference in "buffers hit" that is close to 2 orders
of magnitude.

I'd be okay with treating these cases as out of scope for this patch,
but we should probably agree on the parameters. The patch certainly
shouldn't make it any harder to fix cases such as this.

-- 
Peter Geoghegan




Re: Use of additional index columns in rows filtering

2023-08-08 Thread Tomas Vondra



On 8/8/23 19:43, Peter Geoghegan wrote:
> On Mon, Aug 7, 2023 at 9:21 PM Peter Geoghegan  wrote:
>> I see that get_op_btree_interpretation() will treat != as a kind of
>> honorary member of an opfamily whose = operator has our != operator as
>> its negator. Perhaps we should be finding a way to pass != quals into
>> the index AM so that they become true index quals (obviously they
>> would only be index filter predicates, never access predicates). That
>> has the advantage of working in a way that's analogous to the way that
>> index quals already avoid visibility checks.
> 
> The approach in your patch can only really work with index scans (and
> index-only scans). So while it is more general than true index quals
> in some ways, it's also less general in other ways: it cannot help
> bitmap index scans.
> 
> While I accept that the inability of bitmap index scans to use index
> filters in this way is, to some degree, a natural and inevitable
> downside of bitmap index scans, that isn't always true. For example,
> it doesn't seem to be the case with simple inequalities. Bitmap index
> scans argue for making cases involving quals that are "index quals in
> spirit" into actual index quals. Even if you can reliably avoid extra
> heap accesses for plain index scans using expression evaluation, I
> can't see that working for bitmap index scans.
> 
> More concretely, if we have an index on "tenk1 (four, two)", then we
> miss out on the opportunity to eliminate heap accesses for a query
> like this one:
> 
> select
>   ctid, *
> from
>   tenk1
> where
>   four = 1 and two != 1;
> 
> This will get a bitmap index scan plan (that uses our composite
> index), which makes sense overall. But the details beyond that make no
> sense -- since we're using table filter quals for "two". It turns out
> that the bitmap heap scan will access every single heap page in the
> tenk1 table as a result, even though we could have literally avoided
> all heap accesses had we been able to push down the != as an index
> qual. This is a difference in "buffers hit" that is close to 2 orders
> of magnitude.
> 
> I'd be okay with treating these cases as out of scope for this patch,
> but we should probably agree on the parameters. The patch certainly
> shouldn't make it any harder to fix cases such as this.
> 

I agree this patch shouldn't make it harder to improve these cases, but
TBH I don't quite see which part of the patch would do that? Which bit
are you objecting to? If we decide to change how match_clause_to_index()
deals with these cases, to recognize them as index quals, the patch will
be working just fine.

The only thing the patch does is it looks at clauses we decided not to
treat as index quals, and do maybe still evaluate them on index. And I
don't think I want to move these goalposts much further.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Use of additional index columns in rows filtering

2023-08-08 Thread Peter Geoghegan
On Tue, Aug 8, 2023 at 11:14 AM Tomas Vondra
 wrote:
> I agree this patch shouldn't make it harder to improve these cases, but
> TBH I don't quite see which part of the patch would do that? Which bit
> are you objecting to? If we decide to change how match_clause_to_index()
> deals with these cases, to recognize them as index quals, the patch will
> be working just fine.

Well, I also recently said that I think that it's important that you
figure out a way to reliably avoid visibility checks, in cases where
it can be avoided entirely -- since that can lead to huge savings in
heap accesses. You haven't done that yet, but you really should look
into it IMV.

Assuming that that happens, then it immediately gives index scans a
huge advantage over bitmap index scans. At that point it seems
important to describe (in high level terms) where it is that the
advantage is innate, and where it's just because we haven't done the
required work for bitmap index scans. I became confused on this point
myself yesterday. Admittedly I should have been able to figure it out
on my own -- but it is confusing.

> The only thing the patch does is it looks at clauses we decided not to
> treat as index quals, and do maybe still evaluate them on index. And I
> don't think I want to move these goalposts much further.

Avoiding the need for visibility checks completely (in at least a
subset of cases) was originally your idea. I'm not going to insist on
it, or anything like that. It just seems like something that'll add a
great deal of value over what you have already.

-- 
Peter Geoghegan




Re: Use of additional index columns in rows filtering

2023-08-08 Thread Peter Geoghegan
On Tue, Aug 8, 2023 at 11:36 AM Peter Geoghegan  wrote:
> Assuming that that happens, then it immediately gives index scans a
> huge advantage over bitmap index scans. At that point it seems
> important to describe (in high level terms) where it is that the
> advantage is innate, and where it's just because we haven't done the
> required work for bitmap index scans. I became confused on this point
> myself yesterday. Admittedly I should have been able to figure it out
> on my own -- but it is confusing.

I also have some doubts about the costing. That contributed to my confusion.

Take my " four = 1 and two != 1" example query, from earlier today. As
I said, that gets a bitmap index scan, which does a hugely excessive
amount of heap access. But once I force the planner to use an index
scan, then (as predicted) there are useful index filters -- filters
that can eliminate 100% of all heap accesses. Yet the planner still
thinks that the total cost of the bitmap scan plan is only 415.28,
versus 714.89 for the index scan plan. Perhaps that's just because
this is a tricky case, for whatever reason...but it's not obvious what
that reason really is.

You keep pointing out that your patch only makes isolated, local
changes to certain specific plans. While that is true, it's also true
that there will be fairly far removed consequences. Why shouldn't I
treat those things as in scope?

-- 
Peter Geoghegan




Re: should frontend tools use syncfs() ?

2023-08-08 Thread Nathan Bossart
I ran a couple of tests for pg_upgrade with 100k tables (created using the
script here [0]) in order to demonstrate the potential benefits of this
patch.

pg_upgrade --sync-method fsync
real5m50.072s
user0m10.606s
sys 0m40.298s

pg_upgrade --sync-method syncfs
real3m44.096s
user0m8.906s
sys 0m26.398s

pg_upgrade --no-sync
real3m27.697s
user0m9.056s
sys 0m26.605s

[0] https://postgr.es/m/3612876.1689443232%40sss.pgh.pa.us

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Use of additional index columns in rows filtering

2023-08-08 Thread Tomas Vondra



On 8/8/23 20:36, Peter Geoghegan wrote:
> On Tue, Aug 8, 2023 at 11:14 AM Tomas Vondra
>  wrote:
>> I agree this patch shouldn't make it harder to improve these cases, but
>> TBH I don't quite see which part of the patch would do that? Which bit
>> are you objecting to? If we decide to change how match_clause_to_index()
>> deals with these cases, to recognize them as index quals, the patch will
>> be working just fine.
> 
> Well, I also recently said that I think that it's important that you
> figure out a way to reliably avoid visibility checks, in cases where
> it can be avoided entirely -- since that can lead to huge savings in
> heap accesses. You haven't done that yet, but you really should look
> into it IMV.
> 
> Assuming that that happens, then it immediately gives index scans a
> huge advantage over bitmap index scans. At that point it seems
> important to describe (in high level terms) where it is that the
> advantage is innate, and where it's just because we haven't done the
> required work for bitmap index scans. I became confused on this point
> myself yesterday. Admittedly I should have been able to figure it out
> on my own -- but it is confusing.
> 

Yeah, I agree that might help a lot, particularly for tables that have a
significant fraction of not-all-visible pages.

>> The only thing the patch does is it looks at clauses we decided not to
>> treat as index quals, and do maybe still evaluate them on index. And I
>> don't think I want to move these goalposts much further.
> 
> Avoiding the need for visibility checks completely (in at least a
> subset of cases) was originally your idea. I'm not going to insist on
> it, or anything like that. It just seems like something that'll add a
> great deal of value over what you have already.
> 

Right, and I'm not against improving that, but I see it more like an
independent task. I don't think it needs (or should) to be part of this
patch - skipping visibility checks would apply to IOS, while this is
aimed only at plain index scans.

Furthermore, I don't have a very good idea how to do that (except maybe
for relying on the leakproof flag).

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Extract numeric [field] in JSONB more effectively

2023-08-08 Thread Chapman Flack

Hi,

Looking at the most recent patch, so far I have a minor
spelling point, and a question (which I have not personally
explored).

The minor spelling point, the word 'field' has been spelled
'filed' throughout this comment (just as in the email subject):

+   /*
+* Simplify cast(jsonb_object_filed(jsonb, filedName) as type)
+* to jsonb_object_field_type(jsonb, filedName, targetTypeOid);
+*/

The question: the simplification is currently being applied
when the underlying operation uses F_JSONB_OBJECT_FIELD.
Are there opportunities for a similar benefit if applied
over F_JSONB_ARRAY_ELEMENT and/or F_JSONB_EXTRACT_PATH?

Regards,
-Chap




Re: Configurable FP_LOCK_SLOTS_PER_BACKEND

2023-08-08 Thread Robert Haas
On Mon, Aug 7, 2023 at 6:05 PM Andres Freund  wrote:
> I think the biggest flaw of the locking scheme is that the LockHash locks
> protect two, somewhat independent, things:
> 1) the set of currently lockable objects, i.e. the entries in the hash table 
> [partition]
> 2) the state of all the locks [in a partition]
>
> It'd not be that hard to avoid the shared hashtable lookup in a number of
> cases, e.g. by keeping LOCALLOCK entries around for longer, as I suggest
> above.  But we can't, in general, avoid the lock on the partition anyway, as
> the each lock's state is also protected by the partition lock.

Yes, and that's a huge problem. The main selling point of the whole
fast-path mechanism is to ease the pressure on the lock manager
partition locks, and if we did something like what you described in
the previous email without changing the locking regimen, we'd bring
all of that contention back. I'm pretty sure that would suck.

> The amount of work to do a lookup in the shared hashtable and/or create a new
> entry therein, is quite bound.  But the work for acquiring a lock is much less
> so. We'll e.g. often have to iterate over the set of lock holders etc.
>
> I think we ought to investigate whether pushing down the locking for the "lock
> state" into the individual locks is worth it. That way the partitioned lock
> would just protect the hashtable.

I think this would still suck. Suppose you put an LWLock or slock_t in
each LOCK. If you now run a lot of select queries against the same
table (e.g. pgbench -S -c 64 -j 64), everyone is going to fight over
the lock counts for that table. Here again, the value of the fast-path
system is that it spreads out the contention in ways that approaches
like this can't do.

Or, hmm, maybe what you're really suggesting is pushing the state down
into each PROCLOCK rather than each LOCK. That would be more promising
if we could do it, because that is per-lock *and also per-backend*.
But you can't decide from looking at a single PROCLOCK whether a new
lock at some given lock mode is grantable or not, at least not with
the current PROCLOCK representation.

I think any workable solution here has to allow a backend to take a
weak relation lock without contending with other backends trying to
take the same weak relation lock (provided there are no strong
lockers). Maybe backends should be able to allocate PROCLOCKs and
record weak relation locks there without actually linking them up to
LOCK objects, or something like that. Anyone who wants a strong lock
must first go and find all of those objects for the LOCK they want and
connect them up to that LOCK.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Cirrus-ci is lowering free CI cycles - what to do with cfbot, etc?

2023-08-08 Thread Andres Freund
Hi,

On 2023-08-08 18:34:58 +0200, Alvaro Herrera wrote:
> On 2023-Aug-08, Andres Freund wrote:
> 
> > Given the cost of macos, it seems like it'd be by far the most of affordable
> > to just buy 1-2 mac minis (2x ~660USD) and stick them in a shelf somewhere, 
> > as
> > persistent runners. Cirrus has builtin macos virtualization support - but 
> > can
> > only host two VMs on each mac, due to macos licensing restrictions. A single
> > mac mini would suffice to keep up with our unoptimized monthly runtime
> > (although there likely would be some overhead).
> 
> If using persistent workers is an option, maybe we should explore that.
> I think we could move all or some of the Linux - Debian builds to
> hardware that we already have in shelves (depending on how much compute
> power is really needed.)

(76+830+860+935)/((365/12)*24) = 3.7

3.7 instances with 4 "vcores" are busy 100% of the time. So we'd need at least
~16 cpu threads - I think cirrus sometimes uses instances that disable HT, so
it'd perhaps be 16 cores actually.


> I think using other OSes is more difficult, mostly because I doubt we
> want to deal with licenses; but even FreeBSD might not be a realistic
> option, at least not in the short term.

They can be VMs, so that shouldn't be a big issue.

> >task_name|sum
> > +
> >  FreeBSD - 13 - Meson   | 1017:56:09
> >  Windows - Server 2019, MinGW64 - Meson | 00:00:00
> >  SanityCheck| 76:48:41
> >  macOS - Ventura - Meson| 873:12:43
> >  Windows - Server 2019, VS 2019 - Meson & ninja | 1251:08:06
> >  Linux - Debian Bullseye - Autoconf | 830:17:26
> >  Linux - Debian Bullseye - Meson| 860:37:21
> >  CompilerWarnings   | 935:30:35
> > (8 rows)
> >
> 
> moving just Debian, that might alleviate 76+830+860+935 hours from the
> Cirrus infra, which is ~46%.  Not bad.
> 
> 
> (How come Windows - Meson reports allballs?)

It's mingw64, which we've marked as "manual", because we didn't have the cpu
cycles to run it.

Greetings,

Andres Freund




Re: Use of additional index columns in rows filtering

2023-08-08 Thread Tomas Vondra
On 8/8/23 21:15, Peter Geoghegan wrote:
> On Tue, Aug 8, 2023 at 11:36 AM Peter Geoghegan  wrote:
>> Assuming that that happens, then it immediately gives index scans a
>> huge advantage over bitmap index scans. At that point it seems
>> important to describe (in high level terms) where it is that the
>> advantage is innate, and where it's just because we haven't done the
>> required work for bitmap index scans. I became confused on this point
>> myself yesterday. Admittedly I should have been able to figure it out
>> on my own -- but it is confusing.
> 
> I also have some doubts about the costing. That contributed to my confusion.
> 
> Take my " four = 1 and two != 1" example query, from earlier today. As
> I said, that gets a bitmap index scan, which does a hugely excessive
> amount of heap access. But once I force the planner to use an index
> scan, then (as predicted) there are useful index filters -- filters
> that can eliminate 100% of all heap accesses. Yet the planner still
> thinks that the total cost of the bitmap scan plan is only 415.28,
> versus 714.89 for the index scan plan. Perhaps that's just because
> this is a tricky case, for whatever reason...but it's not obvious what
> that reason really is.
> 

Right. I haven't checked how the costs are calculated in these cases,
but I'd bet it's a combination of having correlated conditions, and the
bitmap costing being fairly rough (with plenty of constants etc).

The correlation seems like an obvious culprit, considering the explain says

 Bitmap Heap Scan on public.tenk1  (cost=31.35..413.85 rows=1250
width=250) (actual time=2.698..2.703 rows=0 loops=1)

So we expect 1250 rows. If that was accurate, the index scan would have
to do 1250 heap fetches. It's just luck the index scan doesn't need to
do that. I don't this there's a chance to improve this costing - if the
inputs are this off, it can't do anything.

Also, I think this is related to the earlier discussion about maybe
costing it according to the worst case - i.e. as if we still needed
fetch the same number of heap tuples as before. Which will inevitably
lead to similar issues, with worse plans looking cheaper.

> You keep pointing out that your patch only makes isolated, local
> changes to certain specific plans. While that is true, it's also true
> that there will be fairly far removed consequences. Why shouldn't I
> treat those things as in scope?
> 

That is certainly true - I'm trying to keep the scope somewhat close to
the original goal. Obviously, there may be additional things the patch
really needs to consider, but I'm not sure this is one of those cases
(perhaps I just don't understand what the issue is - the example seems
like a run-of-the-mill case of poor estimate / costing).


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Use of additional index columns in rows filtering

2023-08-08 Thread Peter Geoghegan
On Tue, Aug 8, 2023 at 1:24 PM Tomas Vondra
 wrote:
> > Assuming that that happens, then it immediately gives index scans a
> > huge advantage over bitmap index scans. At that point it seems
> > important to describe (in high level terms) where it is that the
> > advantage is innate, and where it's just because we haven't done the
> > required work for bitmap index scans. I became confused on this point
> > myself yesterday. Admittedly I should have been able to figure it out
> > on my own -- but it is confusing.
> >
>
> Yeah, I agree that might help a lot, particularly for tables that have a
> significant fraction of not-all-visible pages.

It also has the potential to make the costing a lot easier in certain
important cases. Accurately deriving just how many heap accesses can
be avoided via the VM from the statistics that are available to the
planner is likely always going to be very difficult. Finding a way to
make that just not matter at all (in these important cases) can also
make it safe to bias the costing, such that the planner tends to favor
index scans (and index-only scans) over bitmap index scans that cannot
possibly eliminate any heap page accesses via an index filter qual.

> Right, and I'm not against improving that, but I see it more like an
> independent task. I don't think it needs (or should) to be part of this
> patch - skipping visibility checks would apply to IOS, while this is
> aimed only at plain index scans.

I'm certainly not going to insist on it. Worth considering if putting
it in scope could make certain aspects of this patch (like the
costing) easier, though.

I think that it wouldn't be terribly difficult to make simple
inequalities into true index quals. I think I'd like to have a go at
it myself. To some degree I'm trying to get a sense of how much that'd
help you.

-- 
Peter Geoghegan




Re: Use of additional index columns in rows filtering

2023-08-08 Thread Peter Geoghegan
On Tue, Aug 8, 2023 at 1:49 PM Tomas Vondra
 wrote:
> So we expect 1250 rows. If that was accurate, the index scan would have
> to do 1250 heap fetches. It's just luck the index scan doesn't need to
> do that. I don't this there's a chance to improve this costing - if the
> inputs are this off, it can't do anything.

Well, that depends. If we can find a way to make the bitmap index scan
capable of doing something like the same trick through other means, in
some other patch, then this particular problem (involving a simple
inequality) just goes away. There may be other cases that look a
little similar, with a more complicated expression, where it just
isn't reasonable to expect a bitmap index scan to compete. Ideally,
bitmap index scans will only be at a huge disadvantage when it just
makes sense, due to the particulars of the expression.

I'm not trying to make this your problem. I'm just trying to establish
the general nature of the problem.

> Also, I think this is related to the earlier discussion about maybe
> costing it according to the worst case - i.e. as if we still needed
> fetch the same number of heap tuples as before. Which will inevitably
> lead to similar issues, with worse plans looking cheaper.

Not in those cases where it just doesn't come up, because we can
totally avoid visibility checks. As I said, securing that guarantee
has the potential to make the costing a lot more reliable/easier to
implement.

> That is certainly true - I'm trying to keep the scope somewhat close to
> the original goal. Obviously, there may be additional things the patch
> really needs to consider, but I'm not sure this is one of those cases
> (perhaps I just don't understand what the issue is - the example seems
> like a run-of-the-mill case of poor estimate / costing).

I'm not trying to impose any particular interpretation here. It's
early in the cycle, and my questions are mostly exploratory. I'm still
trying to develop my own understanding of the trade-offs in this area.

-- 
Peter Geoghegan




Re: Configurable FP_LOCK_SLOTS_PER_BACKEND

2023-08-08 Thread Andres Freund
Hi,

On 2023-08-08 16:44:37 -0400, Robert Haas wrote:
> On Mon, Aug 7, 2023 at 6:05 PM Andres Freund  wrote:
> > I think the biggest flaw of the locking scheme is that the LockHash locks
> > protect two, somewhat independent, things:
> > 1) the set of currently lockable objects, i.e. the entries in the hash 
> > table [partition]
> > 2) the state of all the locks [in a partition]
> >
> > It'd not be that hard to avoid the shared hashtable lookup in a number of
> > cases, e.g. by keeping LOCALLOCK entries around for longer, as I suggest
> > above.  But we can't, in general, avoid the lock on the partition anyway, as
> > the each lock's state is also protected by the partition lock.
> 
> Yes, and that's a huge problem. The main selling point of the whole
> fast-path mechanism is to ease the pressure on the lock manager
> partition locks, and if we did something like what you described in
> the previous email without changing the locking regimen, we'd bring
> all of that contention back. I'm pretty sure that would suck.

Yea - I tried to outline how I think we could implement the fastpath locking
scheme in a less limited way in the earlier email, that I had quoted above
this bit.  Here I was pontificating on what we possibly should do in addition
to that. I think even if we had "unlimited" fastpath locking, there's still
enough pressure on the lock manager locks that it's worth improving the
overall locking scheme.

Greetings,

Andres Freund




Re: Support to define custom wait events for extensions

2023-08-08 Thread Michael Paquier
On Tue, Aug 08, 2023 at 08:40:53PM +0900, Masahiro Ikeda wrote:
> I accidentally attached a patch in one previous email.
> But, you don't need to check it, sorry.
> (v1-0001-Change-to-manage-custom-wait-events-in-shared-hash.patch)

Sure, no worries.  With that in place, the init function in worker_spi
can be removed.
--
Michael


signature.asc
Description: PGP signature


Re: Support to define custom wait events for extensions

2023-08-08 Thread Andres Freund
Hi,

On 2023-08-08 08:54:10 +0900, Michael Paquier wrote:
> - WaitEventExtensionShmemInit() should gain a dshash_create(), to make
> sure that the shared table is around, and we are going to have a
> reference to it in WaitEventExtensionCounterData, saved from
> dshash_get_hash_table_handle().

I'm not even sure it's worth using dshash here. Why don't we just create a
decently sized dynahash (say 128 enties) in shared memory? We overallocate
shared memory by enough that there's a lot of headroom for further entries, in
the rare cases they're needed.

> We are going to need a fixed size for these custom strings, but perhaps a
> hard limit of 256 characters for each entry of the hash table is more than
> enough for most users?

I'd just use NAMEDATALEN.

- Andres




Re: Support to define custom wait events for extensions

2023-08-08 Thread Michael Paquier
On Tue, Aug 08, 2023 at 03:59:54PM -0700, Andres Freund wrote:
> On 2023-08-08 08:54:10 +0900, Michael Paquier wrote:
>> - WaitEventExtensionShmemInit() should gain a dshash_create(), to make
>> sure that the shared table is around, and we are going to have a
>> reference to it in WaitEventExtensionCounterData, saved from
>> dshash_get_hash_table_handle().
> 
> I'm not even sure it's worth using dshash here. Why don't we just create a
> decently sized dynahash (say 128 enties) in shared memory? We overallocate
> shared memory by enough that there's a lot of headroom for further entries, in
> the rare cases they're needed.

The question here would be how many slots the most popular extensions
actually need, but that could always be sized up based on the
feedback.

>> We are going to need a fixed size for these custom strings, but perhaps a
>> hard limit of 256 characters for each entry of the hash table is more than
>> enough for most users?
> 
> I'd just use NAMEDATALEN.

Both suggestions WFM.
--
Michael


signature.asc
Description: PGP signature


Re: pgbench: allow to exit immediately when any client is aborted

2023-08-08 Thread Fabien COELHO



Hello Yugo-san,


There are cases where "goto done" is used where some error like
"invalid socket: ..." happens. I would like to make pgbench exit in
such cases, too, so I chose to handle errors below the "done:" label.
However, we can change it to call "exit" instead of "goo done" at each
place. Do you think this is better?


Good point.

Now I understand the "!= FINISHED", because indeed in these cases the done 
is reached with unfinished but not necessarily ABORTED clients, and the 
comment was somehow misleading.


On reflection, there should be only one exit() call, thus I'd say to keep 
the "goto done" as you did, but to move the checking loop *before* the 
disconnect_all, and the overall section comment could be something like 
"possibly abort if any client is not finished, meaning some error 
occured", which is consistent with the "!= FINISHED" condition.


--
Fabien.




Re: Support to define custom wait events for extensions

2023-08-08 Thread Andres Freund
Hi,

On 2023-08-09 08:03:29 +0900, Michael Paquier wrote:
> On Tue, Aug 08, 2023 at 03:59:54PM -0700, Andres Freund wrote:
> > On 2023-08-08 08:54:10 +0900, Michael Paquier wrote:
> >> - WaitEventExtensionShmemInit() should gain a dshash_create(), to make
> >> sure that the shared table is around, and we are going to have a
> >> reference to it in WaitEventExtensionCounterData, saved from
> >> dshash_get_hash_table_handle().
> > 
> > I'm not even sure it's worth using dshash here. Why don't we just create a
> > decently sized dynahash (say 128 enties) in shared memory? We overallocate
> > shared memory by enough that there's a lot of headroom for further entries, 
> > in
> > the rare cases they're needed.
> 
> The question here would be how many slots the most popular extensions
> actually need, but that could always be sized up based on the
> feedback.

On a default initdb (i.e. 128MB s_b), after explicitly disabling huge pages,
we over-allocate shared memory by by 1922304 bytes, according to
pg_shmem_allocations. We allow that memory to be used for things like shared
hashtables that grow beyond their initial size. So even if the hash table's
static size is too small, there's lots of room to grow, even on small systems.

Just because it's somewhat interesting: With huge pages available and not
disabled, we over-allocate by 3364096 bytes.

Greetings,

Andres Freund




Re: pg_upgrade fails with in-place tablespace

2023-08-08 Thread Michael Paquier
On Tue, Aug 08, 2023 at 04:32:45PM +0900, Michael Paquier wrote:
> The pg_dump part should be OK to allow the restore of in-place
> tablespaces, so I'll get that fixed first, but I think that we should
> add a regression test as well.

I have added a test for that in 002_pg_dump.pl, and applied the fix
for pg_dumpall.

> I'll try to look at the binary upgrade path and pg_upgrade after this
> is done.

+   /* No need to check in-place tablespace. */
snprintf(query, sizeof(query),
 "SELECT pg_catalog.pg_tablespace_location(oid) AS spclocation "
 "FROM  pg_catalog.pg_tablespace "
 "WHERE spcname != 'pg_default' AND "
-"  spcname != 'pg_global'");
+"  spcname != 'pg_global' AND "
+"  pg_catalog.pg_tablespace_location(oid) != 'pg_tblspc/' || 
oid");

This does not really explain the reason why in-place tablespaces need
to be skipped (in short they don't need a separate creation or check
like the others in create_script_for_old_cluster_deletion because they
are part of the data folder).  Anyway, the more I think about it, the
less excited I get about the need to support pg_upgrade with in-place
tablespaces, especially regarding the fact that the patch blindly
enforces allows_in_place_tablespaces, assuming that it is OK to do so.
So what about the case where one would want to be warned if these are
still laying around when doing upgrades?  And what's the actual use
case for supporting that?  There is something else that we could do
here: add a pre-run check to make pg_upgrade fail gracefully if we
find in-place tablespaces in the old cluster.

+# Test with in-place tablespace.
+$oldnode->append_conf('postgresql.conf', 'allow_in_place_tablespaces = on');
+$oldnode->reload;
+$oldnode->safe_psql('postgres', "CREATE TABLESPACE space_test LOCATION ''");

While on it, note that this breaks the case of cross-version upgrades
where the old cluster uses binaries where in-place tablespaces are not
supported.  So this requires an extra version check.
--
Michael


signature.asc
Description: PGP signature


Re: Cirrus-ci is lowering free CI cycles - what to do with cfbot, etc?

2023-08-08 Thread Andres Freund
Hi,

On 2023-08-08 11:58:25 +0300, Heikki Linnakangas wrote:
> On 08/08/2023 05:15, Andres Freund wrote:
> > With the improvements detailed below, cirrus' free CI would last
> > about ~65 runs / month.
>
> I think that's plenty.

Not so sure, I would regularly exceed it, I think. But it definitely will
suffice for more casual contributors.


> > Potential paths forward for cfbot, in addition to the above:
> > 
> > - Pay for compute / ask the various cloud providers to grant us compute
> >credits. At least some of the cloud providers can be used via cirrus-ci.
> > 
> > - Host (some) CI runners ourselves. Particularly with macos and windows, 
> > that
> >could provide significant savings.
> > 
> > - Build our own system, using buildbot, jenkins or whatnot.
> > 
> > 
> > Opinions as to what to do?
> 
> The resources for running our own system isn't free either. I'm sure we can
> get sponsors for the cirrus-ci credits, or use donations.

As outlined in my reply to Alvaro, just using credits likely is financially
not viable...


> > 5) Move use of -Dsegsize_blocks=6 from macos to linux
> > 
> > Macos is expensive, -Dsegsize_blocks=6 slows things down. Alternatively 
> > we
> > could stop covering both meson and autoconf segsize_blocks. It does 
> > affect
> > runtime on linux as well.
> 
> Could we have a comment somewhere on why we use -Dsegsize_blocks on these
> particular CI runs? It seems pretty random. I guess the idea is to have one
> autoconf task and one meson task with that option, to check that the
> autoconf/meson option works?

Hm, some of that was in the commit message, but I should have added it to
.cirrus.yml as well.

Normally, the "relation segment" code basically has no coverage in our tests,
because we (quite reasonably) don't generate tables large enough. We've had
plenty bugs that we didn't notice due the code not being exercised much. So it
seemed useful to add CI coverage, by making the segments very small.

I chose the tasks by looking at how long they took at the time, I
think. Adding them to to the slower ones.


> > 6) Disable write cache flushes on windows
> > 
> > It's a bit ugly to do this without using the UI... Shaves off about 30s
> > from the tests.
> 
> A brief comment would be nice: "We don't care about persistence over hard
> crashes in the CI, so disable write cache flushes to speed it up."

Turns out that patch doesn't work on its own anyway, at least not
reliably... I tested it by interactively logging into a windows vm and testing
it there. It doesn't actually seem to suffice when run in isolation, because
the relevant registry key doesn't yet exist. I haven't yet figured out the
magic incantations for adding the missing "intermediary", but I'm getting
there...

Greetings,

Andres Freund




Re: pgbench: allow to exit immediately when any client is aborted

2023-08-08 Thread Yugo NAGATA
On Wed, 9 Aug 2023 02:15:01 +0200 (CEST)
Fabien COELHO  wrote:

> 
> Hello Yugo-san,
> 
> > There are cases where "goto done" is used where some error like
> > "invalid socket: ..." happens. I would like to make pgbench exit in
> > such cases, too, so I chose to handle errors below the "done:" label.
> > However, we can change it to call "exit" instead of "goo done" at each
> > place. Do you think this is better?
> 
> Good point.
> 
> Now I understand the "!= FINISHED", because indeed in these cases the done 
> is reached with unfinished but not necessarily ABORTED clients, and the 
> comment was somehow misleading.
> 
> On reflection, there should be only one exit() call, thus I'd say to keep 
> the "goto done" as you did, but to move the checking loop *before* the 
> disconnect_all, and the overall section comment could be something like 
> "possibly abort if any client is not finished, meaning some error 
> occured", which is consistent with the "!= FINISHED" condition.

Thank you for your suggestion.
I'll fix as above and submit a updated patch soon.

Regards,
Yugo Nagata

-- 
Yugo NAGATA 




Re: Cirrus-ci is lowering free CI cycles - what to do with cfbot, etc?

2023-08-08 Thread Robert Treat
On Tue, Aug 8, 2023 at 9:26 PM Andres Freund  wrote:
>
> Hi,
>
> On 2023-08-08 11:58:25 +0300, Heikki Linnakangas wrote:
> > On 08/08/2023 05:15, Andres Freund wrote:
> > > With the improvements detailed below, cirrus' free CI would last
> > > about ~65 runs / month.
> >
> > I think that's plenty.
>
> Not so sure, I would regularly exceed it, I think. But it definitely will
> suffice for more casual contributors.
>
>
> > > Potential paths forward for cfbot, in addition to the above:
> > >
> > > - Pay for compute / ask the various cloud providers to grant us compute
> > >credits. At least some of the cloud providers can be used via 
> > > cirrus-ci.
> > >
> > > - Host (some) CI runners ourselves. Particularly with macos and windows, 
> > > that
> > >could provide significant savings.
> > >
> > > - Build our own system, using buildbot, jenkins or whatnot.
> > >
> > >
> > > Opinions as to what to do?
> >
> > The resources for running our own system isn't free either. I'm sure we can
> > get sponsors for the cirrus-ci credits, or use donations.
>
> As outlined in my reply to Alvaro, just using credits likely is financially
> not viable...
>
>

In case it's helpful, from an SPI oriented perspective, $7K/month is
probably an order of magnitude more than what we can sustain, so I
don't see a way to make that work without some kind of additional
magic that includes other non-profits and/or commercial companies
changing donation habits between now and September.

Purchasing a couple of mac-mini's (and/or similar gear) would be near
trivial though, just a matter of figuring out where/how to host it
(but I think infra can chime in on that if that's what get's decided).

The other likely option would be to seek out cloud credits from one of
the big three (or others); Amazon has continually said they would be
happy to donate more credits to us if we had a use, and I think some
of the other hosting providers have said similarly at times; so we'd
need to ask and hope it's not too bureaucratic.

Robert Treat
https://xzilla.net




Re: [PoC] pg_upgrade: allow to upgrade publisher node

2023-08-08 Thread Masahiko Sawada
On Mon, Aug 7, 2023 at 6:02 PM Amit Kapila  wrote:
>
> On Mon, Aug 7, 2023 at 2:02 PM Masahiko Sawada  wrote:
> >
> > On Mon, Aug 7, 2023 at 12:54 PM Amit Kapila  wrote:
> > >
> > > On Sun, Aug 6, 2023 at 6:02 PM Masahiko Sawada  
> > > wrote:
> > > >
> > > > IIUC the above query checks if the WAL record written at the slot's
> > > > confirmed_flush_lsn is a CHECKPOINT_SHUTDOWN, but there is no check if
> > > > this WAL record is the latest record.
> > > >
> > >
> > > Yeah, I also think there should be some way to ensure this. How about
> > > passing the number of records to read to this API? Actually, that will
> > > address my other concern as well where the current API can lead to
> > > reading an unbounded number of records if the confirmed_flush_lsn
> > > location is far behind the CHECKPOINT_SHUTDOWN. Do you have any better
> > > ideas to address it?
> >
> > It makes sense to me to limit the number of WAL records to read. But
> > as I mentioned below, if there is a chance of any WAL activity during
> > the upgrade, I'm not sure what limit to set.
> >
>
> In that case, we won't be able to pass the number of records. We need
> to check based on the type of records.
>
> >
> > > However, I am not
> > > sure if there can't be any additional WAL from checkpointer or
> > > bgwriter. Checkpointer has a code that ensures that if there is no
> > > important WAL activity then it would be skipped. Similarly, bgwriter
> > > also doesn't LOG xl_running_xacts unless there is an important
> > > activity.
> >
> > WAL records for hint bit updates could be generated even in upgrading mode?
> >
>
> Do you mean these records can be generated during reading catalog tables?

Yes.

>
> > > I feel if there is a chance of any WAL activity during the
> > > upgrade, we need to either change the check to ensure such WAL records
> > > are expected or document the same in some way.
> >
> > Yes, but how does it work with the above idea of limiting the number
> > of WAL records to read? If XLOG_FPI_FOR_HINT can still be generated in
> > the upgrade mode, we cannot predict how many such records are
> > generated after the latest CHECKPOINT_SHUTDOWN.
> >
>
> Right, as said earlier, in that case, we need to rely on the type of records.

Another idea would be that before starting the old cluster we check if
the slot's confirmed_flush_lsn in the slot state file matches the
latest checkpoint LSN got by pg_controlfile. We need another tool to
dump the slot state file, though.

>
> > I'm not really sure we should always perform the slot's
> > confirmed_flush_lsn check by default in the first place. With this
> > check, the upgrade won't be able to proceed if there is any logical
> > slot that is not used by logical replication (or something streaming
> > the changes using walsender), right? For example, if a user uses a
> > program that periodically consumes the changes from the logical slot,
> > the slot would not be able to pass the check even if the user executed
> > pg_logical_slot_get_changes() just before shutdown. The backend
> > process who consumes the changes is always terminated before the
> > shutdown checkpoint. On the other hand, I think there are cases where
> > the user can ensure that no meaningful WAL records are generated after
> > the last pg_logical_slot_get_changes(). I'm concerned that this check
> > might make upgrading such cases cumbersome unnecessarily.
> >
>
> You are right and I have mentioned the same case today in my response
> to Jonathan but do you have better ideas to deal with such slots than
> to give an ERROR?

It makes sense to me to give an ERROR for such slots but does it also
make sense to make the check optional?

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Oversight in reparameterize_path_by_child leading to executor crash

2023-08-08 Thread Richard Guo
On Tue, Aug 8, 2023 at 7:34 PM Ashutosh Bapat 
wrote:

> On Tue, Aug 8, 2023 at 12:44 PM Richard Guo 
> wrote:
> > This is not an existing bug.  Our current code (without this patch)
> > would always call create_nestloop_path() after the reparameterization
> > work has been done for the inner path.  So we can safely check against
> > outer rel (not its topmost parent) in create_nestloop_path().  But now
> > with this patch, the reparameterization is postponed until createplan.c,
> > so we have to check against the topmost parent of outer rel in
> > create_nestloop_path(), otherwise we'd have duplicate clauses in the
> > final plan.
>
> Hmm. I am worried about the impact this will have when the nestloop
> path is created for two child relations. Your code will still pick the
> top_parent_relids which seems odd. Have you tested that case?


Well, the way it's working is that for child rels all parameterized
paths arriving at try_nestloop_path (or try_partial_nestloop_path) are
parameterized by top parents at first.  Then our current code (without
this patch) applies reparameterize_path_by_child to the inner path
before calling create_nestloop_path().  So in create_nestloop_path() the
inner path is parameterized by child rel.  That's why we check against
outer rel itself not its top parent there.

With this patch, the reparameterize_path_by_child work is postponed
until createplan time, so in create_nestloop_path() the inner path is
still parameterized by top parent.  So we have to check against the top
parent of outer rel.

I think the query shown upthread is sufficient to verify this theory.

regression=# explain (costs off)
select * from prt1 t1 left join lateral
(select t1.a as t1a, t2.* from prt1 t2) s on t1.a = s.a;
   QUERY PLAN
-
 Append
   ->  Nested Loop Left Join
 ->  Seq Scan on prt1_p1 t1_1
 ->  Index Scan using iprt1_p1_a on prt1_p1 t2_1
   Index Cond: (a = t1_1.a)
   ->  Nested Loop Left Join
 ->  Seq Scan on prt1_p2 t1_2
 ->  Index Scan using iprt1_p2_a on prt1_p2 t2_2
   Index Cond: (a = t1_2.a)
   ->  Nested Loop Left Join
 ->  Seq Scan on prt1_p3 t1_3
 ->  Index Scan using iprt1_p3_a on prt1_p3 t2_3
   Index Cond: (a = t1_3.a)
(13 rows)


> Please add this to the next commitfest.


Done here https://commitfest.postgresql.org/44/4477/

Thanks
Richard


Re: Handle infinite recursion in logical replication setup

2023-08-08 Thread Peter Smith
On Tue, Aug 8, 2023 at 6:52 PM Amit Kapila  wrote:
>
> On Tue, Aug 8, 2023 at 1:50 PM Peter Eisentraut  wrote:
> >
> > This patch added the following error message:
> >
> > errdetail_plural("Subscribed publication %s is subscribing to other
> > publications.",
> > "Subscribed publications %s are subscribing to other publications.",
> > list_length(publist), pubnames->data),
> >
> > But in PostgreSQL, a publication cannot subscribe to a publication, so
> > this is not giving accurate information.  Apparently, what it is trying
> > to say is that
> >
> > The subscription that you are creating subscribes to publications that
> > contain tables that are written to by other subscriptions.
> >
> > Can we get to a more accurate wording like this?
> >
>
> +1 for changing the message as per your suggestion.
>

PSA a patch to change this message text. The message now has wording
similar to the suggestion.

> > There is also a translatability issue there, in the way the publication
> > list is pasted into the message.
> >

The name/list substitution is now done within parentheses, which AFAIK
will be enough to eliminate any translation ambiguities.

> > Is the list of affected publications really that interesting?  I wonder
> > whether the list of affected tables might be more relevant?
> >
>
> In that case, we need to specify both schema name and table name in
> that case. I guess the list could be very long and not sure what to do
> for schema publications ( Create Publication ... For Schema).

Right, IIUC that was the reason for not choosing to show the tables in
the original patch -- i.e. the list might easily be very long with
100s or 1000s of tables it, and so inappropriate to substitute in the
message. OTOH, showing only problematic publications is a short list
but it is still more useful than showing nothing (e.g. there other
publications of the subscription might be OK so those ones are not
listed)

--
Kind Regards,
Peter Smith.
Fujitsu Australia


v1-0001-Improved-message-for-create-subscription.patch
Description: Binary data


RE: [PATCH] Reuse Workers and Replication Slots during Logical Replication

2023-08-08 Thread Zhijie Hou (Fujitsu)
On Thursday, August 3, 2023 7:30 PM Melih Mutlu   wrote:

> Right. I attached the v26 as you asked. 

Thanks for posting the patches.
 
While reviewing the patch, I noticed one rare case that it's possible that there
are two table sync worker for the same table in the same time.

The patch relies on LogicalRepWorkerLock to prevent concurrent access, but the
apply worker will start a new worker after releasing the lock. So, at the 
point[1]
where the lock is released and the new table sync worker has not been started,
it seems possible that another old table sync worker will be reused for the
same table.

/* Now safe to release the LWLock */
LWLockRelease(LogicalRepWorkerLock);
*[1]
/*
 * If there are free sync worker slot(s), start 
a new sync
 * worker for the table.
 */
if (nsyncworkers < 
max_sync_workers_per_subscription)
...

logicalrep_worker_launch(MyLogicalRepWorker->dbid,

I can reproduce it by using gdb.

Steps:
1. set max_sync_workers_per_subscription to 1 and setup pub/sub which publishes
   two tables(table A and B).
2. when the table sync worker for the table A started, use gdb to block it
   before being reused for another table.
3. set max_sync_workers_per_subscription to 2 and use gdb to block the apply
   worker at the point after releasing the LogicalRepWorkerLock and before
   starting another table sync worker for table B.
4. release the blocked table sync worker, then we can see the table sync worker
   is also reused for table B.
5. release the apply worker, then we can see the apply worker will start
   another table sync worker for the same table(B).

I think it would be better to prevent this case from happening as this case
will give some unexpected ERROR or LOG. Note that I haven't checked if it would
cause worse problems like duplicate copy or others.

Best Regards,
Hou zj


Re: Cirrus-ci is lowering free CI cycles - what to do with cfbot, etc?

2023-08-08 Thread Andres Freund
Hi,

On 2023-08-08 22:29:50 -0400, Robert Treat wrote:
> In case it's helpful, from an SPI oriented perspective, $7K/month is
> probably an order of magnitude more than what we can sustain, so I
> don't see a way to make that work without some kind of additional
> magic that includes other non-profits and/or commercial companies
> changing donation habits between now and September.

Yea, I think that'd make no sense, even if we could afford it. I think the
patches I've written should drop it to 1/2 already. Thomas added some
throttling to push it down further.


> Purchasing a couple of mac-mini's (and/or similar gear) would be near
> trivial though, just a matter of figuring out where/how to host it
> (but I think infra can chime in on that if that's what get's decided).

Cool. Because of the limitation of running two VMs at a time on macos and the
comparatively low cost of mac minis, it seems they beat alternative models by
a fair bit.

Pginfra/sysadmin: ^


Based on being off by an order of magnitude, as you mention earlier, it seems
that:

1) reducing test runtime etc, as already in progress
2) getting 2 mac minis as runners
3) using ~350 USD / mo in GCP costs for windows, linux, freebsd (*)

Would be viable for a month or three? I hope we can get some cloud providers
to chip in for 3), but I'd like to have something in place that doesn't depend
on that.

Given the cost of macos VMs at AWS, the only of the big cloud providers to
have macos instances, I think we'd burn pointlessly quick through credits if
we used VMs for that.

(*) I think we should be able to get below that, but ...


> The other likely option would be to seek out cloud credits from one of
> the big three (or others); Amazon has continually said they would be
> happy to donate more credits to us if we had a use, and I think some
> of the other hosting providers have said similarly at times; so we'd
> need to ask and hope it's not too bureaucratic.

Yep.

I tried to start that progress within microsoft, fwiw.  Perhaps Joe and
Jonathan know how to start within AWS?  And perhaps Noah inside GCP?

It'd be the least work to get it up and running in GCP, as it's already
running there, but should be quite doable at the others as well.

Greetings,

Andres Freund




Re: 2023-08-10 release announcement draft

2023-08-08 Thread Robert Treat
On Mon, Aug 7, 2023 at 9:15 PM Jonathan S. Katz  wrote:
>
> Hi,
>
> Attached is the release announcement draft for the 2023-08-10 update
> release, which also includes the release of PostgreSQL 16 Beta 3.
>
> Please provide your feedback no later than August 10, 2023 0:00 AoE[1].
>
> Thanks,
>
> Jonathan
>
> [1] https://en.wikipedia.org/wiki/Anywhere_on_Earth

"Users who have skipped one or more update releases may need to run
additional, post-update steps; "

The comma should be removed.

"please see the release notes for earlier versions for details."

Use of 'for' twice is grammatically incorrect; I am partial to "please
see the release notes from earlier versions for details." but could
also see "please see the release notes for earlier versions to get
details."


Robert Treat
https://xzilla.net




pg15: reltuples stuck at -1 after pg_upgrade and VACUUM

2023-08-08 Thread Justin Pryzby
Since 3d351d916 (pg14), reltuples -1 means that the rel has never been
vacuumed nor analyzed.

But since 4496020e6d (backpatched to pg15), following pg_upgrade, vacuum
can leave reltuples=-1.

commit 4496020e6dfaffe8217e4d3f85567bb2b6927b45
Author: Peter Geoghegan 
Date:   Fri Aug 19 09:26:06 2022 -0700

Avoid reltuples distortion in very small tables.

$ /usr/pgsql-15/bin/initdb -N -D ./pg15.dat2
$ /usr/pgsql-15/bin/initdb -N -D ./pg15.dat3

$ /usr/pgsql-15/bin/postgres -c logging_collector=no -p 5678 -k /tmp -D 
./pg15.dat2& # old cluster, pre-upgrade
postgres=# CREATE TABLE t AS SELECT generate_series(1,);
postgres=# SELECT reltuples FROM pg_class WHERE oid='t'::regclass;
reltuples | -1
postgres=# VACUUM FREEZE t;
postgres=# SELECT reltuples FROM pg_class WHERE oid='t'::regclass;
reltuples | 

$ /usr/pgsql-15/bin/pg_upgrade -b /usr/pgsql-15/bin -d ./pg15.dat2 
-D./pg15.dat3 # -c logging_collector=no -p 5678 -k /tmp&

$ /usr/pgsql-15/bin/postgres -c logging_collector=no -p 5678 -k /tmp -D 
./pg15.dat3& # new cluster, post-upgrade
postgres=# VACUUM FREEZE VERBOSE t;
postgres=# SELECT reltuples FROM pg_class WHERE oid='t'::regclass;
reltuples | -1

The problem isn't that reltuples == -1 after the upgrade (which is
normal).  The issue is that if VACUUM skips all the pages, it can leave
reltuples -1.  My expectation is that after running "vacuum", no tables
are left in the "never have been vacuumed" state.

If the table was already frozen, then VACUUM (FREEZE) is inadequate to
fix it, and you need to use DISABLE_PAGE_SKIPPING.

-- 
Justin




Re: pg_upgrade fails with in-place tablespace

2023-08-08 Thread Rui Zhao
I have found the patch and upon review, I believe the following code can be 
improved.
+ /*
+ * In-place tablespaces use a relative path, and need to be dumped
+ * with an empty string as location.
+ */
+ if (is_absolute_path(spclocation))
+ appendStringLiteralConn(buf, spclocation, conn);
+ else
+ appendStringLiteralConn(buf, "", conn);
I believe that utilizing appendPQExpBufferStr(buf, "''"); would be better and 
more meaningful than using appendStringLiteralConn(buf, "", conn); in this 
scenario.
I apologize for this wrong usage. Please help me fix it.
I will try to respond to pg_upgrade after my deep dive.
--
Best regards,
Rui Zhao


[BackendXidGetPid] only access allProcs when xid matches

2023-08-08 Thread Junwang Zhao
In function `BackendXidGetPid`, when looping every proc's
TransactionId, there is no need to access its PGPROC since there
is shared memory access: `arrayP->pgprocnos[index]`.

Though the compiler can optimize this kind of inefficiency, I
believe we should ship with better code.


-- 
Regards
Junwang Zhao


0001-BackendXidGetPid-only-access-allProcs-when-xid-match.patch
Description: Binary data


Re: pg15: reltuples stuck at -1 after pg_upgrade and VACUUM

2023-08-08 Thread Peter Geoghegan
On Tue, Aug 8, 2023 at 8:43 PM Justin Pryzby  wrote:
> The problem isn't that reltuples == -1 after the upgrade (which is
> normal).  The issue is that if VACUUM skips all the pages, it can leave
> reltuples -1.  My expectation is that after running "vacuum", no tables
> are left in the "never have been vacuumed" state.

But -1 isn't the "never have been vacuumed" state, exactly. At best it
is the "never been vacuumed or analyzed" state.

-- 
Peter Geoghegan




Re: [PoC] pg_upgrade: allow to upgrade publisher node

2023-08-08 Thread Amit Kapila
On Wed, Aug 9, 2023 at 8:01 AM Masahiko Sawada  wrote:
>
> On Mon, Aug 7, 2023 at 6:02 PM Amit Kapila  wrote:
> >
> > On Mon, Aug 7, 2023 at 2:02 PM Masahiko Sawada  
> > wrote:
> > >
> > > WAL records for hint bit updates could be generated even in upgrading 
> > > mode?
> > >
> >
> > Do you mean these records can be generated during reading catalog tables?
>
> Yes.
>

BTW, Kuroda-San has verified and found that three types of records
(including XLOG_FPI_FOR_HINT) can be generated by the system during
the upgrade. See email [1].

> >
> > > > I feel if there is a chance of any WAL activity during the
> > > > upgrade, we need to either change the check to ensure such WAL records
> > > > are expected or document the same in some way.
> > >
> > > Yes, but how does it work with the above idea of limiting the number
> > > of WAL records to read? If XLOG_FPI_FOR_HINT can still be generated in
> > > the upgrade mode, we cannot predict how many such records are
> > > generated after the latest CHECKPOINT_SHUTDOWN.
> > >
> >
> > Right, as said earlier, in that case, we need to rely on the type of 
> > records.
>
> Another idea would be that before starting the old cluster we check if
> the slot's confirmed_flush_lsn in the slot state file matches the
> latest checkpoint LSN got by pg_controlfile. We need another tool to
> dump the slot state file, though.
>

I feel it would be a good idea to provide such a tool for users to
avoid getting errors during upgrade but I think the upgrade code still
needs to ensure that there are no WAL records between
confirm_flush_lsn and SHUTDOWN_CHECKPOINT than required. Or, do you
want to say that we don't do any verification check during the upgrade
and let the data loss happens if the user didn't ensure that by
running such a tool?

> >
> > > I'm not really sure we should always perform the slot's
> > > confirmed_flush_lsn check by default in the first place. With this
> > > check, the upgrade won't be able to proceed if there is any logical
> > > slot that is not used by logical replication (or something streaming
> > > the changes using walsender), right? For example, if a user uses a
> > > program that periodically consumes the changes from the logical slot,
> > > the slot would not be able to pass the check even if the user executed
> > > pg_logical_slot_get_changes() just before shutdown. The backend
> > > process who consumes the changes is always terminated before the
> > > shutdown checkpoint. On the other hand, I think there are cases where
> > > the user can ensure that no meaningful WAL records are generated after
> > > the last pg_logical_slot_get_changes(). I'm concerned that this check
> > > might make upgrading such cases cumbersome unnecessarily.
> > >
> >
> > You are right and I have mentioned the same case today in my response
> > to Jonathan but do you have better ideas to deal with such slots than
> > to give an ERROR?
>
> It makes sense to me to give an ERROR for such slots but does it also
> make sense to make the check optional?
>

We can do that if we think so. We have two ways to make this check
optional (a) have a switch like --include-logical-replication-slots as
the proposed patch has which means by default we won't try to upgrade
slots; (b) have a switch like --exclude-logical-replication-slots as
Jonathan proposed which means we will exclude slots only if specified
by user. Now, one thing to note is that we don't seem to have any
include/exclude switch in the upgrade which I think indicates users by
default prefer to upgrade everything. Now, even if we decide not to
give any switch initially but do it only if there is a user demand for
it then also users will have a way to proceed with an upgrade which is
by dropping such slots. Do you have any preference?

[1] - 
https://www.postgresql.org/message-id/TYAPR01MB58660273EACEFC5BF256B133F50DA%40TYAPR01MB5866.jpnprd01.prod.outlook.com

-- 
With Regards,
Amit Kapila.




Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication

2023-08-08 Thread vignesh C
Hi Melih,

Here is a patch to help in getting the execution at various phases
like: a) replication slot creation time, b) Wal reading c) Number of
WAL records read d) subscription relation state change etc
Couple of observation while we tested with this patch:
1) We noticed that the patch takes more time for finding the decoding
start point.
2) Another observation was that the number of XLOG records read for
identify the consistent point was significantly high with the v26_0001
patch.

HEAD
postgres=# select avg(counttime)/1000 "avgtime(ms)",
median(counttime)/1000 "median(ms)", min(counttime)/1000
"mintime(ms)", max(counttime)/1000 "maxtime(ms)", logtype from test
group by logtype;
  avgtime(ms)   |   median(ms)   | mintime(ms) |
maxtime(ms) | logtype
++-+-+--
 0.00579245283018867920 | 0.0020 |   0 |
1 | SNAPSHOT_BUILD
 1.2246811320754717 | 0.9855 |   0 |
   37 | LOGICAL_SLOT_CREATION
   171.0863283018867920 |   183.9120 |   0 |
  408 | FIND_DECODING_STARTPOINT
 2.0699433962264151 | 1.4380 |   1 |
   49 | INIT_DECODING_CONTEXT
(4 rows)

HEAD + v26-0001 patch
postgres=# select avg(counttime)/1000 "avgtime(ms)",
median(counttime)/1000 "median(ms)", min(counttime)/1000
"mintime(ms)", max(counttime)/1000 "maxtime(ms)", logtype from test
group by logtype;
  avgtime(ms)   |   median(ms)   | mintime(ms) |
maxtime(ms) | logtype
++-+-+--
 0.00588113207547169810 | 0.0050 |   0 |
0 | SNAPSHOT_BUILD
 1.1270962264150943 | 1.1000 |   0 |
2 | LOGICAL_SLOT_CREATION
   301.1745528301886790 |   410.4870 |   0 |
  427 | FIND_DECODING_STARTPOINT
 1.4814660377358491 | 1.4530 |   1 |
9 | INIT_DECODING_CONTEXT
(4 rows)

In the above FIND_DECODING_STARTPOINT is very much higher with V26-0001 patch.

HEAD
FIND_DECODING_XLOG_RECORD_COUNT
- average =  2762
- median = 3362

HEAD + reuse worker patch(v26_0001 patch)
Where FIND_DECODING_XLOG_RECORD_COUNT
- average =  4105
- median = 5345

Similarly Number of xlog records read is higher with v26_0001 patch.

Steps to calculate the timing:
-- first collect the necessary LOG from subscriber's log.
cat *.log | grep -E
'(LOGICAL_SLOT_CREATION|INIT_DECODING_CONTEXT|FIND_DECODING_STARTPOINT|SNAPSHOT_BUILD|FIND_DECODING_XLOG_RECORD_COUNT|LOGICAL_XLOG_READ|LOGICAL_DECODE_PROCESS_RECORD|LOGICAL_WAIT_TRANSACTION)'
> grep.dat

create table testv26(logtime varchar, pid varchar, level varchar,
space varchar, logtype varchar, counttime int);
-- then copy these datas into db table to count the avg number.
COPY testv26 FROM '/home/logs/grep.dat' DELIMITER ' ';

-- Finally, use the SQL to analyze the data:
select avg(counttime)/1000 "avgtime(ms)", logtype from testv26 group by logtype;

--- To get the number of xlog records read:
select avg(counttime) from testv26 where logtype
='FIND_DECODING_XLOG_RECORD_COUNT' and counttime != 1;

Thanks to Peter and Hou-san who helped in finding these out. We are
parallely analysing this, @Melih Mutlu  posting this information so
that it might help you too in analysing this issue.

Regards,
Vignesh
From b755cab38ff76e9f63304b2d8f344cb098ca6a33 Mon Sep 17 00:00:00 2001
From: Hou Zhijie 
Date: Fri, 4 Aug 2023 17:57:29 +0800
Subject: [PATCH v1 1/2] count state change time

---
 src/backend/replication/logical/tablesync.c | 28 +
 1 file changed, 28 insertions(+)

diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 651a775065..0d9298f7b3 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -123,6 +123,10 @@
 #include "utils/syscache.h"
 #include "utils/usercontext.h"
 
+static TimestampTz start = 0;
+static long		secs = 0;
+static int			microsecs = 0;
+
 static bool table_states_valid = false;
 static List *table_states_not_ready = NIL;
 static bool FetchTableStates(bool *started_tx);
@@ -338,6 +342,11 @@ process_syncing_tables_for_sync(XLogRecPtr current_lsn)
 		ReplicationSlotDropAtPubNode(LogRepWorkerWalRcvConn, syncslotname, false);
 
 		CommitTransactionCommand();
+
+		TimestampDifference(start, GetCurrentTimestamp(), &secs, µsecs);
+		elog(LOG, "SUBREL_STATE_SYNCDONE %d", ((int) secs * 100 + microsecs));
+		start = GetCurrentTimestamp();
+
 		pgstat_report_stat(false);
 
 		/*
@@ -1258,6 +1267,8 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
 	bool		must_use_password;
 	bool		run_as_owner;
 
+	start = GetCurrentTimestamp();
+
 	/* Check the state of the table synchronization. */
 	StartTransactionCommand();
 	relstate = GetSubscriptionRelState(MyLogicalRepWorker->subid,
@@

Re: Reducing memory consumed by RestrictInfo list translations in partitionwise join planning

2023-08-08 Thread David Rowley
On Fri, 28 Jul 2023 at 02:06, Ashutosh Bapat
 wrote:
> 0001 - to measure memory consumption during planning. This is the same
> one as attached to [1].

I see you're recording the difference in the CurrentMemoryContext of
palloc'd memory before and after planning.  That won't really alert us
to problems if the planner briefly allocates, say 12GBs of memory
during, say the join search then quickly pfree's it again.  unless
it's an oversized chunk, aset.c won't free() any memory until
MemoryContextReset(). Chunks just go onto a freelist for reuse later.
So at the end of planning, the context may still have that 12GBs
malloc'd, yet your new EXPLAIN ANALYZE property might end up just
reporting a tiny fraction of that.

I wonder if it would be more useful to just have ExplainOneQuery()
create a new memory context and switch to that and just report the
context's mem_allocated at the end.

It's also slightly annoying that these planner-related summary outputs
are linked to EXPLAIN ANALYZE. We could be showing them in EXPLAIN
without ANALYZE.  If we were to change that now, it might be a bit
annoying for the regression tests as we'd need to go and add SUMMARY
OFF in a load of places...

drowley@amd3990x:~/pg_src/src/test/regress/sql$ git grep -i "costs off" | wc -l
1592

hmm, that would cause a bit of churn... :-(

David




Re: Handle infinite recursion in logical replication setup

2023-08-08 Thread Amit Kapila
On Wed, Aug 9, 2023 at 8:20 AM Peter Smith  wrote:
>
> On Tue, Aug 8, 2023 at 6:52 PM Amit Kapila  wrote:
> >
> > On Tue, Aug 8, 2023 at 1:50 PM Peter Eisentraut  
> > wrote:
> > >
> > > This patch added the following error message:
> > >
> > > errdetail_plural("Subscribed publication %s is subscribing to other
> > > publications.",
> > > "Subscribed publications %s are subscribing to other publications.",
> > > list_length(publist), pubnames->data),
> > >
> > > But in PostgreSQL, a publication cannot subscribe to a publication, so
> > > this is not giving accurate information.  Apparently, what it is trying
> > > to say is that
> > >
> > > The subscription that you are creating subscribes to publications that
> > > contain tables that are written to by other subscriptions.
> > >
> > > Can we get to a more accurate wording like this?
> > >
> >
> > +1 for changing the message as per your suggestion.
> >
>
> PSA a patch to change this message text. The message now has wording
> similar to the suggestion.
>
> > > There is also a translatability issue there, in the way the publication
> > > list is pasted into the message.
> > >
>
> The name/list substitution is now done within parentheses, which AFAIK
> will be enough to eliminate any translation ambiguities.
>

A similar instance as below uses \"%s\" instead. So, isn't using the
same a better idea?
errdetail_plural("LDAP search for filter \"%s\" on server \"%s\"
returned %d entry.",
  "LDAP search for filter \"%s\" on server \"%s\" returned %d entries.",


-- 
With Regards,
Amit Kapila.




Re: pg_upgrade fails with in-place tablespace

2023-08-08 Thread Michael Paquier
On Wed, Aug 09, 2023 at 11:47:58AM +0800, Rui Zhao wrote:
> I believe that utilizing appendPQExpBufferStr(buf, "''"); would be better and 
> more meaningful than using appendStringLiteralConn(buf, "", conn); in this 
> scenario.
> I apologize for this wrong usage. Please help me fix it.
> I will try to respond to pg_upgrade after my deep dive.

That was my original suggestion in [1], and what you've added gives
exactly the same result.

[1]: https://www.postgresql.org/message-id/ZNGhF6iBpUKgNimI%40paquier.xyz
--
Michael


signature.asc
Description: PGP signature


Re: pgbench: allow to exit immediately when any client is aborted

2023-08-08 Thread Yugo NAGATA
On Wed, 9 Aug 2023 10:46:38 +0900
Yugo NAGATA  wrote:

> On Wed, 9 Aug 2023 02:15:01 +0200 (CEST)
> Fabien COELHO  wrote:
> 
> > 
> > Hello Yugo-san,
> > 
> > > There are cases where "goto done" is used where some error like
> > > "invalid socket: ..." happens. I would like to make pgbench exit in
> > > such cases, too, so I chose to handle errors below the "done:" label.
> > > However, we can change it to call "exit" instead of "goo done" at each
> > > place. Do you think this is better?
> > 
> > Good point.
> > 
> > Now I understand the "!= FINISHED", because indeed in these cases the done 
> > is reached with unfinished but not necessarily ABORTED clients, and the 
> > comment was somehow misleading.
> > 
> > On reflection, there should be only one exit() call, thus I'd say to keep 
> > the "goto done" as you did, but to move the checking loop *before* the 
> > disconnect_all, and the overall section comment could be something like 
> > "possibly abort if any client is not finished, meaning some error 
> > occured", which is consistent with the "!= FINISHED" condition.
> 
> Thank you for your suggestion.
> I'll fix as above and submit a updated patch soon.

I attached the updated patch v3 including changes above, a test,
and fix of the typo you pointed out.

Regards,
Yugo Nagata

> 
> Regards,
> Yugo Nagata
> 
> -- 
> Yugo NAGATA 
> 
> 


-- 
Yugo NAGATA 
diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index 6c5c8afa6d..83fe01c33f 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -787,6 +787,19 @@ pgbench  options  d
   
  
 
+ 
+  --exit-on-abort
+  
+   
+Exit immediately when any client is aborted due to some error. Without
+this option, even when a client is aborted, other clients could continue
+their run as specified by -t or -T option,
+and pgbench will print an incomplete results
+in this case.
+   
+  
+ 
+
  
   --log-prefix=prefix
   
@@ -985,7 +998,8 @@ pgbench  options  d
benchmark such as initial connection failures also exit with status 1.
Errors during the run such as database errors or problems in the script
will result in exit status 2. In the latter case,
-   pgbench will print partial results.
+   pgbench will print partial results if
+   --exit-on-abort option is not specified.
   
  
 
@@ -2801,14 +2815,17 @@ statement latencies in milliseconds, failures and retries:
  start a connection to the database server / the socket for connecting
  the client to the database server has become invalid). In such cases
  all clients of this thread stop while other threads continue to work.
+ However, --exit-on-abort is specified, all of the
+ threads stop immediately in this case.

  
  

  Direct client errors. They lead to immediate exit from
  pgbench with the corresponding error message
- only in the case of an internal pgbench
- error (which are supposed to never occur...). Otherwise in the worst
+ in the case of an internal pgbench
+ error (which are supposed to never occur...) or when
+ --exit-on-abort is specified . Otherwise in the worst
  case they only lead to the abortion of the failed client while other
  clients continue their run (but some client errors are handled without
  an abortion of the client and reported separately, see below). Later in
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 539c2795e2..eca55d0230 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -765,6 +765,8 @@ static int64 total_weight = 0;
 
 static bool verbose_errors = false; /* print verbose messages of all errors */
 
+static bool exit_on_abort = false;	/* exit when any client is aborted */
+
 /* Builtin test scripts */
 typedef struct BuiltinScript
 {
@@ -911,6 +913,7 @@ usage(void)
 		   "  -T, --time=NUM   duration of benchmark test in seconds\n"
 		   "  -v, --vacuum-all vacuum all four standard tables before tests\n"
 		   "  --aggregate-interval=NUM aggregate data over NUM seconds\n"
+		   "  --exit-on-abort  exit when any client is aborted\n"
 		   "  --failures-detailed  report the failures grouped by basic types\n"
 		   "  --log-prefix=PREFIX  prefix for transaction time log file\n"
 		   "   (default: \"pgbench_log\")\n"
@@ -6612,6 +6615,7 @@ main(int argc, char **argv)
 		{"failures-detailed", no_argument, NULL, 13},
 		{"max-tries", required_argument, NULL, 14},
 		{"verbose-errors", no_argument, NULL, 15},
+		{"exit-on-abort", no_argument, NULL, 16},
 		{NULL, 0, NULL, 0}
 	};
 
@@ -6945,6 +6949,10 @@ main(int argc, char **argv)
 benchmarking_option_set = true;
 verbose_errors = true;
 break;
+			case 16:			/* exit-on-abort */
+benchmarking_option_set = true;
+exit_on_a

Re: 2023-08-10 release announcement draft

2023-08-08 Thread Noah Misch
On Mon, Aug 07, 2023 at 10:03:44PM -0400, Jonathan S. Katz wrote:
> Fixes in PostgreSQL 16 Beta 3
> -
> 
> The following includes fixes included in PostgreSQL 16 Beta 3:

With both "includes" and "included" there, this line reads awkwardly to me.
I'd just delete the line, since the heading has the same information.

> * Add the `\drg` command to `psql` to display information about role grants.
> * Add timeline ID to filenames generated with `pg_waldump --save-fullpage`.
> * Fix crash after a deadlock occurs in a parallel `VACUUM` worker.
> 
> Please see the [release 
> notes](https://www.postgresql.org/docs/16/release-16.html)
> for a complete list of new and changed features:
> 
>   
> [https://www.postgresql.org/docs/16/release-16.html](https://www.postgresql.org/docs/16/release-16.html)




Re: pgbench: allow to exit immediately when any client is aborted

2023-08-08 Thread Yugo NAGATA
On Wed, 9 Aug 2023 13:59:21 +0900
Yugo NAGATA  wrote:

> On Wed, 9 Aug 2023 10:46:38 +0900
> Yugo NAGATA  wrote:
> 
> > On Wed, 9 Aug 2023 02:15:01 +0200 (CEST)
> > Fabien COELHO  wrote:
> > 
> > > 
> > > Hello Yugo-san,
> > > 
> > > > There are cases where "goto done" is used where some error like
> > > > "invalid socket: ..." happens. I would like to make pgbench exit in
> > > > such cases, too, so I chose to handle errors below the "done:" label.
> > > > However, we can change it to call "exit" instead of "goo done" at each
> > > > place. Do you think this is better?
> > > 
> > > Good point.
> > > 
> > > Now I understand the "!= FINISHED", because indeed in these cases the 
> > > done 
> > > is reached with unfinished but not necessarily ABORTED clients, and the 
> > > comment was somehow misleading.
> > > 
> > > On reflection, there should be only one exit() call, thus I'd say to keep 
> > > the "goto done" as you did, but to move the checking loop *before* the 
> > > disconnect_all, and the overall section comment could be something like 
> > > "possibly abort if any client is not finished, meaning some error 
> > > occured", which is consistent with the "!= FINISHED" condition.
> > 
> > Thank you for your suggestion.
> > I'll fix as above and submit a updated patch soon.
> 
> I attached the updated patch v3 including changes above, a test,
> and fix of the typo you pointed out.

I'm sorry but the test in the previous patch was incorrect. 
I attached the correct one.

Regards,
Yugo Nagata

> Regards,
> Yugo Nagata
> 
> > 
> > Regards,
> > Yugo Nagata
> > 
> > -- 
> > Yugo NAGATA 
> > 
> > 
> 
> 
> -- 
> Yugo NAGATA 


-- 
Yugo NAGATA 
diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index 6c5c8afa6d..83fe01c33f 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -787,6 +787,19 @@ pgbench  options  d
   
  
 
+ 
+  --exit-on-abort
+  
+   
+Exit immediately when any client is aborted due to some error. Without
+this option, even when a client is aborted, other clients could continue
+their run as specified by -t or -T option,
+and pgbench will print an incomplete results
+in this case.
+   
+  
+ 
+
  
   --log-prefix=prefix
   
@@ -985,7 +998,8 @@ pgbench  options  d
benchmark such as initial connection failures also exit with status 1.
Errors during the run such as database errors or problems in the script
will result in exit status 2. In the latter case,
-   pgbench will print partial results.
+   pgbench will print partial results if
+   --exit-on-abort option is not specified.
   
  
 
@@ -2801,14 +2815,17 @@ statement latencies in milliseconds, failures and retries:
  start a connection to the database server / the socket for connecting
  the client to the database server has become invalid). In such cases
  all clients of this thread stop while other threads continue to work.
+ However, --exit-on-abort is specified, all of the
+ threads stop immediately in this case.

  
  

  Direct client errors. They lead to immediate exit from
  pgbench with the corresponding error message
- only in the case of an internal pgbench
- error (which are supposed to never occur...). Otherwise in the worst
+ in the case of an internal pgbench
+ error (which are supposed to never occur...) or when
+ --exit-on-abort is specified . Otherwise in the worst
  case they only lead to the abortion of the failed client while other
  clients continue their run (but some client errors are handled without
  an abortion of the client and reported separately, see below). Later in
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 539c2795e2..eca55d0230 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -765,6 +765,8 @@ static int64 total_weight = 0;
 
 static bool verbose_errors = false; /* print verbose messages of all errors */
 
+static bool exit_on_abort = false;	/* exit when any client is aborted */
+
 /* Builtin test scripts */
 typedef struct BuiltinScript
 {
@@ -911,6 +913,7 @@ usage(void)
 		   "  -T, --time=NUM   duration of benchmark test in seconds\n"
 		   "  -v, --vacuum-all vacuum all four standard tables before tests\n"
 		   "  --aggregate-interval=NUM aggregate data over NUM seconds\n"
+		   "  --exit-on-abort  exit when any client is aborted\n"
 		   "  --failures-detailed  report the failures grouped by basic types\n"
 		   "  --log-prefix=PREFIX  prefix for transaction time log file\n"
 		   "   (default: \"pgbench_log\")\n"
@@ -6612,6 +6615,7 @@ main(int argc, char **argv)
 		{"failures-detailed", no_argument, NULL, 13},
 		{"max-tries", required_argument, NULL, 14},
 		{"verbose-errors", no_argument, NULL, 15},
+

Re: Handle infinite recursion in logical replication setup

2023-08-08 Thread Peter Smith
On Wed, Aug 9, 2023 at 2:44 PM Amit Kapila  wrote:
>
> On Wed, Aug 9, 2023 at 8:20 AM Peter Smith  wrote:
> >
> > On Tue, Aug 8, 2023 at 6:52 PM Amit Kapila  wrote:
> > >
> > > On Tue, Aug 8, 2023 at 1:50 PM Peter Eisentraut  
> > > wrote:
> > > >
> > > > This patch added the following error message:
> > > >
> > > > errdetail_plural("Subscribed publication %s is subscribing to other
> > > > publications.",
> > > > "Subscribed publications %s are subscribing to other publications.",
> > > > list_length(publist), pubnames->data),
> > > >
> > > > But in PostgreSQL, a publication cannot subscribe to a publication, so
> > > > this is not giving accurate information.  Apparently, what it is trying
> > > > to say is that
> > > >
> > > > The subscription that you are creating subscribes to publications that
> > > > contain tables that are written to by other subscriptions.
> > > >
> > > > Can we get to a more accurate wording like this?
> > > >
> > >
> > > +1 for changing the message as per your suggestion.
> > >
> >
> > PSA a patch to change this message text. The message now has wording
> > similar to the suggestion.
> >
> > > > There is also a translatability issue there, in the way the publication
> > > > list is pasted into the message.
> > > >
> >
> > The name/list substitution is now done within parentheses, which AFAIK
> > will be enough to eliminate any translation ambiguities.
> >
>
> A similar instance as below uses \"%s\" instead. So, isn't using the
> same a better idea?
> errdetail_plural("LDAP search for filter \"%s\" on server \"%s\"
> returned %d entry.",
>   "LDAP search for filter \"%s\" on server \"%s\" returned %d entries.",
>
>

Hmm. I don't see any similarity -- there the plural is for the %d, not
for a list of string items. And in our case the publication name (or
list of names) are already quoted by the function returning that list,
so quoting them again doesn't really make sense.

Example output using the patch look like this:

SINGLULAR
test_sub=# create subscription sub_test connection 'dbname=test_pub'
publication pub_all_at_pub with(origin=NONE);
WARNING:  subscription "sub_test" requested copy_data with origin =
NONE but might copy data that had a different origin
DETAIL:  The subscription that you are creating has a publication
("pub_all_at_pub") containing tables written to by other
subscriptions.
HINT:  Verify that initial data copied from the publisher tables did
not come from other origins.
NOTICE:  created replication slot "sub_test" on publisher
CREATE SUBSCRIPTION

PLURAL
test_sub=# create subscription sub_test3 connection 'dbname=test_pub'
publication pub_all_at_pub,pub_s1_at_pub,pub_s2_at_pub
with(origin=NONE);
WARNING:  subscription "sub_test3" requested copy_data with origin =
NONE but might copy data that had a different origin
DETAIL:  The subscription that you are creating has publications
("pub_s1_at_pub", "pub_all_at_pub") containing tables written to by
other subscriptions.
HINT:  Verify that initial data copied from the publisher tables did
not come from other origins.
NOTICE:  created replication slot "sub_test3" on publisher
CREATE SUBSCRIPTION

~

I thought above looked fine but if PeterE also does not like the ()
then the message can be rearranged slightly like below to put the
offending publications at the end of the message, like this:

DETAIL:  The subscription that you are creating has the following
publications containing tables written to by other subscriptions:
"pub_s1_at_pub", "pub_all_at_pub"

--
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Incorrect handling of OOM in WAL replay leading to data loss

2023-08-08 Thread Michael Paquier
On Tue, Aug 08, 2023 at 05:44:03PM +0900, Kyotaro Horiguchi wrote:
> I like the overall direction. Though, I'm considering enclosing the
> errormsg and errorcode in a struct.

Yes, this suggestion makes sense as it simplifies all the WAL routines
that need to report back a complete error state, and there are four of
them now:
XLogPrefetcherReadRecord()
XLogReadRecord()
XLogNextRecord()
DecodeXLogRecord()

I have spent more time on 0001, polishing it and fixing a few bugs
that I have found while reviewing the whole.  Most of them were
related to mistakes in resetting the error state when expected.  I
have also expanded DecodeXLogRecord() to use an error structure
instead of only an errmsg, giving more consistency.  The error state
now relies on two structures:
+typedef enum XLogReaderErrorCode
+{
+   XLOG_READER_NONE = 0,
+   XLOG_READER_OOM,/* out-of-memory */
+   XLOG_READER_INVALID_DATA,   /* record data */
+} XLogReaderErrorCode;
+typedef struct XLogReaderError
+{
+   /* Buffer to hold error message */
+   char   *message;
+   boolmessage_deferred;
+   /* Error code when filling *message */
+   XLogReaderErrorCode code;
+} XLogReaderError;

I'm kind of happy with this layer, now.

I have also spent some time on finding a more elegant solution for the
WAL replay, relying on the new facility from 0001.  And it happens
that it is easy enough to loop if facing an out-of-memory failure when
reading a record when we are in crash recovery, as the state is
actually close to what a standby does.  The trick is that we should
not change the state and avoid tracking a continuation record.  This
is done in 0002, making replay more robust.  With the addition of the
error injection tweak in 0003, I am able to finish recovery while the
startup process loops if under memory pressure.  As mentioned
previously, there are more code paths to consider, but that's a start
to fix the data loss problems.

Comments are welcome.
--
Michael
From 9da3751ef46be90cf7a5d4050c5cb281c33dbf70 Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Wed, 9 Aug 2023 13:47:30 +0900
Subject: [PATCH v2 1/3] Add infrastructure to report error codes in WAL reader

This commits moves the error state coming from WAL readers into a new
structure, that includes the existing pointer to the error message
buffer, but it also gains an error code that fed back to the callers of
the following routines:
XLogPrefetcherReadRecord()
XLogReadRecord()
XLogNextRecord()
DecodeXLogRecord()

This will help in improving the decisions to take during recovery
depending on the failure more reported.
---
 src/include/access/xlogprefetcher.h   |   2 +-
 src/include/access/xlogreader.h   |  33 +++-
 src/backend/access/transam/twophase.c |   8 +-
 src/backend/access/transam/xlog.c |   6 +-
 src/backend/access/transam/xlogprefetcher.c   |   4 +-
 src/backend/access/transam/xlogreader.c   | 167 --
 src/backend/access/transam/xlogrecovery.c |  14 +-
 src/backend/access/transam/xlogutils.c|   2 +-
 src/backend/replication/logical/logical.c |   9 +-
 .../replication/logical/logicalfuncs.c|   9 +-
 src/backend/replication/slotfuncs.c   |   8 +-
 src/backend/replication/walsender.c   |   8 +-
 src/bin/pg_rewind/parsexlog.c |  24 +--
 src/bin/pg_waldump/pg_waldump.c   |  10 +-
 contrib/pg_walinspect/pg_walinspect.c |  11 +-
 src/tools/pgindent/typedefs.list  |   2 +
 16 files changed, 200 insertions(+), 117 deletions(-)

diff --git a/src/include/access/xlogprefetcher.h b/src/include/access/xlogprefetcher.h
index 7dd7f20ad0..5563ad1a67 100644
--- a/src/include/access/xlogprefetcher.h
+++ b/src/include/access/xlogprefetcher.h
@@ -48,7 +48,7 @@ extern void XLogPrefetcherBeginRead(XLogPrefetcher *prefetcher,
 	XLogRecPtr recPtr);
 
 extern XLogRecord *XLogPrefetcherReadRecord(XLogPrefetcher *prefetcher,
-			char **errmsg);
+			XLogReaderError *errordata);
 
 extern void XLogPrefetcherComputeStats(XLogPrefetcher *prefetcher);
 
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index da32c7db77..2b57b5eb01 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -58,6 +58,25 @@ typedef struct WALSegmentContext
 
 typedef struct XLogReaderState XLogReaderState;
 
+/* Values for XLogReaderError.errorcode */
+typedef enum XLogReaderErrorCode
+{
+	XLOG_READER_NONE = 0,
+	XLOG_READER_OOM,			/* out-of-memory */
+	XLOG_READER_INVALID_DATA,	/* record data */
+} XLogReaderErrorCode;
+
+/* Error status generated by a WAL reader on failure */
+typedef struct XLogReaderError
+{
+	/* Buffer to hold error message */
+	char	   *message;
+	bool		message_deferred;
+	/* Error code when filling *message */
+	XLogReaderErrorCode code;
+} XLogReaderError;
+
+
 /* Function type definitions for various xlogreader interactions */
 typedef int (*XLogPageReadCB) (X

Re: Adding a LogicalRepWorker type field

2023-08-08 Thread Amit Kapila
On Tue, Aug 8, 2023 at 1:39 PM Peter Smith  wrote:
>
> On Fri, Aug 4, 2023 at 5:45 PM Peter Smith  wrote:
>
> v4-0001 uses only 3 simple inline functions. Callers always pass
> parameters as Bharath had suggested.
>

*
- Assert(am_leader_apply_worker());
+ Assert(is_leader_apply_worker(MyLogicalRepWorker));
...
- if (am_leader_apply_worker())
+ if (is_leader_apply_worker(MyLogicalRepWorker))

Passing everywhere MyLogicalRepWorker not only increased the code
change footprint but doesn't appear any better to me. Instead, let
am_parallel_apply_worker() keep calling
isParallelApplyWorker(MyLogicalRepWorker) as it is doing now. I feel
even if you or others feel that is a better idea, we can debate it
separately after the main patch is done because as far as I understand
that is not the core idea of this proposal.

* If you do the above then there won't be a need to change the
variable name is_parallel_apply_worker in logicalrep_worker_launch.

-- 
With Regards,
Amit Kapila.




Re: Handle infinite recursion in logical replication setup

2023-08-08 Thread Amit Kapila
On Wed, Aug 9, 2023 at 10:59 AM Peter Smith  wrote:
> >
> Example output using the patch look like this:
>
> SINGLULAR
> test_sub=# create subscription sub_test connection 'dbname=test_pub'
> publication pub_all_at_pub with(origin=NONE);
> WARNING:  subscription "sub_test" requested copy_data with origin =
> NONE but might copy data that had a different origin
> DETAIL:  The subscription that you are creating has a publication
> ("pub_all_at_pub") containing tables written to by other
> subscriptions.
> HINT:  Verify that initial data copied from the publisher tables did
> not come from other origins.
> NOTICE:  created replication slot "sub_test" on publisher
> CREATE SUBSCRIPTION
>
> PLURAL
> test_sub=# create subscription sub_test3 connection 'dbname=test_pub'
> publication pub_all_at_pub,pub_s1_at_pub,pub_s2_at_pub
> with(origin=NONE);
> WARNING:  subscription "sub_test3" requested copy_data with origin =
> NONE but might copy data that had a different origin
> DETAIL:  The subscription that you are creating has publications
> ("pub_s1_at_pub", "pub_all_at_pub") containing tables written to by
> other subscriptions.
> HINT:  Verify that initial data copied from the publisher tables did
> not come from other origins.
> NOTICE:  created replication slot "sub_test3" on publisher
> CREATE SUBSCRIPTION
>
> ~
>
> I thought above looked fine but if PeterE also does not like the ()
> then the message can be rearranged slightly like below to put the
> offending publications at the end of the message, like this:
>

Okay, I didn't know strings were already quoted but not sure if Round
Brackets () exactly will address the translatability concern.

> DETAIL:  The subscription that you are creating has the following
> publications containing tables written to by other subscriptions:
> "pub_s1_at_pub", "pub_all_at_pub"
>

Fair enough. Peter E., do let us know what you think makes sense here?

-- 
With Regards,
Amit Kapila.




Re: Adding a pg_servername() function

2023-08-08 Thread Laetitia Avrot
Dear all,

First, thank you so much for your interest in this patch. I didn't think I
would have so many answers.

I agree that the feature I'm suggesting could be done with a few tricks. I
meant to simplify the life of the user by providing a simple new feature.
(Also, I might have trust issues with DNS due to several past production
disasters.)

My question is very simple: Do you oppose having this feature in Postgres?

In other words, should I stop working on this?

Have a nice day,

Lætitia