Re: recovery modules

2023-01-10 Thread Michael Paquier
On Tue, Jan 03, 2023 at 11:05:38AM -0800, Nathan Bossart wrote:
> I noticed that cfbot's Windows tests are failing because the backslashes in
> the archive directory path are causing escaping problems.  Here is an
> attempt to fix that by converting all backslashes to forward slashes, which
> is what other tests (e.g., 025_stuck_on_old_timeline.pl) do.

+   GetOldestRestartPoint(&restartRedoPtr, &restartTli);
+   XLByteToSeg(restartRedoPtr, restartSegNo, wal_segment_size);
+   XLogFileName(lastRestartPointFname, restartTli, restartSegNo,
+wal_segment_size);
+
+   shell_archive_cleanup(lastRestartPointFname);

Hmm.  Is passing down the file name used as a cutoff point the best
interface for the modules?  Perhaps passing down the redo LSN and its
TLI would be a cleaner approach in terms of flexibility?  I agree with
letting the startup process enforce these numbers, as that can be easy
to mess up for plugin authors, leading to critical problems.  The same
code pattern is repeated twice, for the end-of-recovery callback and
the cleanup commands, when it comes to building the file name.  Not
critical, still not really nice.
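
For illustration, here is a minimal sketch of what a module-side callback
could look like if the startup process passed down the redo LSN and TLI
instead of a pre-built file name.  The typedef, function name, and overall
shape are hypothetical, not taken from the patch:

```
#include "postgres.h"

#include "access/xlog.h"
#include "access/xlog_internal.h"

/* Hypothetical callback signature: startup hands over the restart point. */
typedef void (*ArchiveCleanupCB) (XLogRecPtr lastRestartPointRedo,
								  TimeLineID lastRestartPointTLI);

static void
basic_wal_module_cleanup(XLogRecPtr lastRestartPointRedo,
						 TimeLineID lastRestartPointTLI)
{
	char		cutoff[MAXFNAMELEN];
	XLogSegNo	cutoffSegNo;

	/* A module can still derive the cutoff segment name if it needs one. */
	XLByteToSeg(lastRestartPointRedo, cutoffSegNo, wal_segment_size);
	XLogFileName(cutoff, lastRestartPointTLI, cutoffSegNo, wal_segment_size);

	/* ... remove anything archived before "cutoff" ... */
}
```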

 MODULES = basic_archive
-PGFILEDESC = "basic_archive - basic archive module"
+PGFILEDESC = "basic_archive - basic archive and recovery module"
"basic_archive" does not reflect what this module does.  Using one
library simplifies the whole configuration picture and the tests, so
perhaps something like basic_wal_module, or something like that, would
be better long-term?
--
Michael


signature.asc
Description: PGP signature


RE: [Proposal] Add foreign-server health checks infrastructure

2023-01-10 Thread Hayato Kuroda (Fujitsu)
Dear Ted,

Thank you for reviewing! PSA new version.

> +   /* quick exit if connection cache has been not initialized yet. */
>
> been not initialized -> not been initialized

Fixed.

> +   (errcode(ERRCODE_CONNECTION_FAILURE),
> +errmsg("could not connect to server \"%s\"",
>
> Currently each server which is not connected would log a warning.
> Is it better to concatenate names for such servers and log one line ? This 
> would be cleaner when there are multiple such servers.

Sounds good, fixed as you said. The following shows the case where two
disconnections are detected by postgres_fdw_verify_connection_states_all().

```
postgres=*# select postgres_fdw_verify_connection_states_all ();
WARNING:  could not connect to server "my_external_server2", 
"my_external_server"
DETAIL:  Socket close is detected.
HINT:  Please check the health of the server.
 postgres_fdw_verify_connection_states_all 
---
 f
(1 row)
```

Currently, the server names are concatenated without any uniqueness check. IIUC
a backend process cannot connect to the same foreign server using different
user mappings, so there is no possibility that the same name appears twice.
If the user mapping is altered in the transaction, the cache entry is
invalidated and will not be checked.


Best Regards,
Hayato Kuroda
FUJITSU LIMITED



v24-0001-Add-PQConncheck-and-PQCanConncheck-to-libpq.patch
Description:  v24-0001-Add-PQConncheck-and-PQCanConncheck-to-libpq.patch


v24-0002-postgres_fdw-add-postgres_fdw_verify_connection_.patch
Description:  v24-0002-postgres_fdw-add-postgres_fdw_verify_connection_.patch


v24-0003-add-test.patch
Description: v24-0003-add-test.patch


Re: Minimal logical decoding on standbys

2023-01-10 Thread Bharath Rupireddy
On Tue, Jan 10, 2023 at 2:03 PM Drouvot, Bertrand
 wrote:
>
> Please find attached, V37 taking care of:

Thanks. I started to digest the design specified in the commit message
and these patches. Here are some quick comments:

1. Does logical decoding on standby work without any issues if the
standby is set for cascading replication?

2. Do logical decoding output plugins work without any issues on the
standby with decoding enabled, say, when the slot is invalidated?

3. Is this feature still a 'minimal logical decoding on standby'?
Firstly, why is it 'minimal'?

4. What happens in case of failover to the standby that's already
decoding for its clients? Will the decoding work seamlessly? If not,
what are recommended things that users need to take care of
during/after failovers?

0002:
1.
-if (InvalidateObsoleteReplicationSlots(_logSegNo))
+InvalidateObsoleteOrConflictingLogicalReplicationSlots(_logSegNo,
, InvalidOid, NULL);

Isn't the function name too long and verbose? How about just
InvalidateLogicalReplicationSlots(), and let the function comment explain
what sorts of replication slots it invalidates?

2.
+errdetail("Logical decoding on standby requires wal_level to be at least logical on master"));

+ * master wal_level is set back to replica, so existing logical slots need to
invalidate such slots. Also do the same thing if wal_level on master

Can we use 'primary server' instead of 'master' like elsewhere? This
comment applies to the other patches too, if any.

3. Can we show a new status in pg_get_replication_slots's wal_status
for slots invalidated due to conflict, so that the user can monitor for
the new status and take necessary actions?

4. How will the user be notified when logical replication slots are
invalidated due to conflict with the primary server? How can they
(firstly, is there a way?) repair/restore such replication slots? Or
is recreating/re-establishing logical replication the only way out for
them? What can users do to avoid their logical replication slots
getting invalidated and running into these situations?
Recreating/re-establishing logical replication comes with a cost
(sometimes huge), as it involves building another instance, syncing
tables, etc. Isn't it a good idea to cover all these aspects in the
documentation, at least as to why we're doing this and why we can't
avoid it?

5.
@@ -1253,6 +1253,14 @@ StartLogicalReplication(StartReplicationCmd *cmd)

 ReplicationSlotAcquire(cmd->slotname, true);

+if (!TransactionIdIsValid(MyReplicationSlot->data.xmin)
+ && !TransactionIdIsValid(MyReplicationSlot->data.catalog_xmin))
+ereport(ERROR,
+(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("cannot read from logical replication slot \"%s\"",
+cmd->slotname),
+ errdetail("This slot has been invalidated because it
was conflicting with recovery.")));

Having the invalidation check in StartLogicalReplication() looks fine;
however, what happens if the slot gets invalidated while replication is
in progress? Do we need to error out in WalSndLoop() or
XLogSendLogical() too? Or is it already taken care of somewhere?

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: SQL/JSON revisited

2023-01-10 Thread John Naylor
On Wed, Jan 11, 2023 at 2:02 PM Elena Indrupskaya <
e.indrupsk...@postgrespro.ru> wrote:
>
> Sorry for upsetting your bot. :(

What I do in these cases is save the incremental patch as a .txt file --
that way people can read it, but the cf bot doesn't try to launch a CI run.
And if I forget that detail, well it's not a big deal, it happens sometimes.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: cutting down the TODO list thread

2023-01-10 Thread John Naylor
So, I had intended to spend some time on this at least three times a year.
I've clearly failed at that, but now is as good a time as any to pick it
back up again.

Over in [1], Tom opined:

> John Naylor  writes:
>
> > "WARNING for Developers: Unfortunately this list does not contain all
the
> > information necessary for someone to start coding a feature. Some of
these
> > items might have become unnecessary since they were added --- others
might
> > be desirable but the implementation might be unclear. When selecting
items
> > listed below, be prepared to first discuss the value of the feature. Do
not
> > assume that you can select one, code it and then expect it to be
committed.
> > "
>
> I think we could make that even stronger: there's basically nothing on
> the TODO list that isn't problematic in some way.  Otherwise it would
> have been done already.  The entries involve large amounts of work,
> or things that are subtler than they might appear, or cases where the
> desirable semantics aren't clear, or tradeoffs that there's not
> consensus about, or combinations of those.
>
> IME it's typically a lot more productive to approach things via
> "scratch your own itch".  If a problem is biting you directly, then
> at least you have some clear idea of what it is that needs to be fixed.
> You might have to work up to an understanding of how to fix it, but
> you have a clear goal.

I've come up with some revised language, including s/15/16/ and removing
the category of "[E]" (easier to implement), since it wouldn't be here if
it were actually easy:

--
WARNING for Developers: This list contains some known PostgreSQL bugs, some
feature requests, and some things we are not even sure we want. This is not
meant to be a resource for beginning developers to get ideas for things to
work on. 

All of these items are hard, and some are perhaps impossible. Some of these
items might have become unnecessary since they were added. Others might be
desirable but:

- a large amount of work is required
- the problems are subtler than they might appear
- the desirable semantics aren't clear
- there are tradeoffs that there's not consensus about
- some combinations of the above

If you really need a feature that is listed below, it will be worth reading
the linked email thread if there is one, since it will often show the
difficulties, or perhaps contain previous failed attempts to get a patch
committed. If after that you still want to work on it, be prepared to first
discuss the value of the feature. Do not assume that you can start coding
and expect it to be committed. Always discuss design on the Hackers list
before starting to code.

Over time, it may become clear that a TODO item has become outdated or
otherwise determined to be either too controversial or not worth the
development effort. Such items should be retired to the Not Worth Doing
page.

[D] marks changes that are done, and will appear in the PostgreSQL 16
release.
--

We could also revise the developer FAQ:
- remove phrase "Outstanding features are detailed in Todo."
- add a suggestion to check the Todo or Not_worth_doing pages to see if
the desired feature is undesirable or problematic
- rephrase "Working in isolation is not advisable because others might be
working on the same TODO item, or you might have misunderstood the TODO
item." so it doesn't mention 'TODO' at all.

[1] https://www.postgresql.org/message-id/415636.1673411259%40sss.pgh.pa.us

--
John Naylor
EDB: http://www.enterprisedb.com


Re: SQL/JSON revisited

2023-01-10 Thread Elena Indrupskaya

Tags in the patch follow the markup of the XMLTABLE function:

XMLTABLE (
    XMLNAMESPACES ( namespace_uri AS namespace_name , ... ),
    row_expression PASSING BY {REF|VALUE} document_expression BY {REF|VALUE}
    COLUMNS name { type PATH column_expression DEFAULT default_expression NOT NULL | NULL
                 | FOR ORDINALITY }
        , ...
) setof record

In the above, as well as in the signatures of SQL/JSON functions, there 
are no exact parameter names; otherwise, they should have been followed 
by the  tag, which is not the case. There are no parameter names 
in the functions' code either. Therefore,  tags seem more 
appropriate, according to the comment to commit 47046763c3.


Sorry for upsetting your bot. :(

--
Elena Indrupskaya
Lead Technical Writer
Postgres Professional http://www.postgrespro.com

On 2023-01-10 Tu 07:51, Elena Indrupskaya wrote:

Hi,

The Postgres Pro documentation team prepared another SQL/JSON
documentation patch (attached), to apply on top of
v1-0009-Documentation-for-SQL-JSON-features.patch.
The new patch:
- Fixes minor typos
- Does some rewording agreed with Nikita Glukhov
- Updates Docbook markup to make tags consistent across SQL/JSON
documentation and across func.sgml, and in particular, consistent with
the XMLTABLE function, which resembles SQL/JSON functions pretty much.


That's nice, but please don't post incremental patches like this. It
upsets the cfbot. (I wish there were a way to tell the cfbot to ignore
patches)

Also, I'm fairly certain that a good many of your changes are not
according to project style. The rule as I understand it is that
 is used for things that are parameters and  is
only used for things that are not parameters. (I'm not sure where that's
documented other than the comment on commit 47046763c3, but it's what I
attempted to do with the earlier doc tidy up.)


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com






Re: Improve tab completion for ALTER EXTENSION ADD/DROP

2023-01-10 Thread Michael Paquier
On Wed, Jan 11, 2023 at 12:10:33PM +0900, Kyotaro Horiguchi wrote:
> It suggests the *kinds* of objects that are part of the extension, but
> lists the objects of that kind regardless of dependency.  I read
> Michael suggested (and I agree) to restrict the objects (not kinds) to
> actually be a part of the extension. (And not for object kinds.)

Yeah, that's what I meant.  Now, if Vignesh does not want to extend
that, on second thought that's fine by me as well, as this involves
much more code for each DROP path depending on the object type
involved.

Adding the object names after DROP/ADD is useful on its own, and we
already have some completion once the object type is specified, so
simpler is perhaps just better here.
--
Michael


signature.asc
Description: PGP signature


Re: Avoiding "wrong tuple length" errors at the end of VACUUM on pg_database update (Backpatch of 947789f to v12 and v13)

2023-01-10 Thread Michael Paquier
On Tue, Jan 10, 2023 at 11:05:04AM -0800, Andres Freund wrote:
> What about a define that forces external toasting very aggressively for
> catalog tables, iff they have a toast table? I suspect doing so for
> non-catalog tables as well would trigger test changes. Running a buildfarm
> animal with that would at least make issues like this much easier to discover.

Hmm.  That could work.  I guess that you mean to do something like
that in SearchSysCacheCopy() when we build the tuple copy.  We have
access to the cacheId there, meaning that we know the catalog
involved.  Still, we would need a lookup of its pg_class entry to
check for a reltoastrelid, meaning an extra relation opened, which
would be fine under a specific #define, anyway..
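
As a very rough illustration of that shape (everything here, including the
#define and helper name, is hypothetical and not from any patch):

```
#include "postgres.h"

#include "access/table.h"
#include "storage/lockdefs.h"
#include "utils/rel.h"

/*
 * Hypothetical sketch: knowing the catalog's OID from the cache id, open
 * its relation to see whether a TOAST table exists before deciding to
 * force external storage on the copied tuple.
 */
#ifdef FORCE_CATALOG_TOASTING
static bool
catalog_has_toast_table(Oid catalogRelId)
{
	Relation	rel = table_open(catalogRelId, AccessShareLock);
	bool		has_toast = OidIsValid(rel->rd_rel->reltoastrelid);

	table_close(rel, AccessShareLock);
	return has_toast;
}
#endif
```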
--
Michael


signature.asc
Description: PGP signature


Re: Fix pg_publication_tables to exclude generated columns

2023-01-10 Thread Amit Kapila
On Wed, Jan 11, 2023 at 10:07 AM Tom Lane  wrote:
>
> Amit Kapila  writes:
> >> On Mon, Jan 9, 2023 11:06 PM Tom Lane  wrote:
> >>> We could just not fix it in the back branches.  I'd argue that this is
> >>> as much a definition change as a bug fix, so it doesn't really feel
> >>> like something to back-patch anyway.
>
> > So, if we don't backpatch then it could lead to an error when it
> > shouldn't have which is clearly a bug. I think we should backpatch
> > this unless Tom or others are against it.
>
> This isn't a hill that I'm ready to die on ... but do we have any field
> complaints about this?  If not, I still lean against a back-patch.
> I think there's a significant risk of breaking case A while fixing
> case B when we change this behavior, and that's something that's
> better done only in a major release.
>

Fair enough, but note that there is a somewhat related problem for
dropped columns [1] as well. While reviewing that, it occurred to me
that generated columns have a similar problem, which led to this
thread (it would have been better if there had been a mention of that
in the initial email). Now, as the symptoms are similar, I think we
shouldn't back-patch that one either; otherwise, it will appear to be
only partially fixed. What do you think?

[1] - 
https://www.postgresql.org/message-id/OSZPR01MB631087C65BA81E1FEE5A60D2FDF59%40OSZPR01MB6310.jpnprd01.prod.outlook.com

-- 
With Regards,
Amit Kapila.




Re: [PATCH] Improve ability to display optimizer analysis using OPTIMIZER_DEBUG

2023-01-10 Thread Ankit Kumar Pandey




On 11/01/23 09:57, Tom Lane wrote:
> IME it's typically a lot more productive to approach things via
> "scratch your own itch".  If a problem is biting you directly, then
> at least you have some clear idea of what it is that needs to be fixed.
> You might have to work up to an understanding of how to fix it, but
> you have a clear goal.

The question is, how should newcomers start contributing if they are not
coming with a problem of their own in hand?

The TODO list is possibly the first thing anyone willing to contribute is
going to read, and for a new contributor it is not easy to judge the
situation (whether a TODO item is easy for newcomers or rather involved).

One way to alleviate this issue is to mention the mailing list thread with
the discussion under each TODO item. This is done for most TODO items, but
sometimes pressing issues are left out.

That being said, I think this is part of the learning process, and it is
okay to come up with ideas and fail. Hackers can then point out issues in
the approach (if the discussion for the issue is not mentioned under the
TODO item).

--
Regards,
Ankit Kumar Pandey





Re: Avoiding "wrong tuple length" errors at the end of VACUUM on pg_database update (Backpatch of 947789f to v12 and v13)

2023-01-10 Thread Michael Paquier
On Tue, Jan 10, 2023 at 09:54:31AM -0800, Nathan Bossart wrote:
> +1

Okay, thanks.  Done this part as of c0ee694 and 72b6098, then.
--
Michael


signature.asc
Description: PGP signature


Re: typos

2023-01-10 Thread Michael Paquier
On Tue, Jan 10, 2023 at 01:55:56PM +0530, Amit Kapila wrote:
> I have not yet started, so please go ahead.

Okay, I have looked at that and fixed all the new things, including
the typo you have introduced.  0001~0004 have been left out, for the
same reasons as upthread.
--
Michael


signature.asc
Description: PGP signature


Re: Todo: Teach planner to evaluate multiple windows in the optimal order

2023-01-10 Thread Ankit Kumar Pandey

On 11/01/23 06:18, David Rowley wrote:
> Not sure if we should be trying to improve that in this patch. I just
> wanted to identify it as something else that perhaps could be done.

This could be within reach, but the original problem of hashagg removing
any gains from this remains.


e.g.

set enable_hashagg=0;

explain select distinct relkind, relname, count(*) over (partition by
relkind) from pg_Class;
                                      QUERY PLAN
---------------------------------------------------------------------------------------
 Unique  (cost=41.26..65.32 rows=412 width=73)
   ->  Incremental Sort  (cost=41.26..62.23 rows=412 width=73)
         Sort Key: relkind, relname, (count(*) OVER (?))
         Presorted Key: relkind
         ->  WindowAgg  (cost=36.01..43.22 rows=412 width=73)
               ->  Sort  (cost=36.01..37.04 rows=412 width=65)
                     Sort Key: relkind
                     ->  Seq Scan on pg_class  (cost=0.00..18.12 rows=412 width=65)
(8 rows)

reset enable_hashagg;
explain select distinct relkind, relname, count(*) over (partition by
relkind) from pg_Class;
                                   QUERY PLAN
------------------------------------------------------------------------------
 HashAggregate  (cost=46.31..50.43 rows=412 width=73)
   Group Key: relkind, relname, count(*) OVER (?)
   ->  WindowAgg  (cost=36.01..43.22 rows=412 width=73)
         ->  Sort  (cost=36.01..37.04 rows=412 width=65)
               Sort Key: relkind
               ->  Seq Scan on pg_class  (cost=0.00..18.12 rows=412 width=65)
(6 rows)

HashAgg has a better cost than Unique even with incremental sort (I tried
another case where we have more columns pushed down, but HashAgg still wins).

explain select distinct a, b, count(*) over (partition by a order by b) from
abcd;
                                      QUERY PLAN
--------------------------------------------------------------------------------------
 Unique  (cost=345712.12..400370.25 rows=1595 width=16)
   ->  Incremental Sort  (cost=345712.12..395456.14 rows=655214 width=16)
         Sort Key: a, b, (count(*) OVER (?))
         Presorted Key: a, b
         ->  WindowAgg  (cost=345686.08..358790.36 rows=655214 width=16)
               ->  Sort  (cost=345686.08..347324.11 rows=655214 width=8)
                     Sort Key: a, b
                     ->  Seq Scan on abcd  (cost=0.00..273427.14 rows=655214 width=8)

explain select distinct a, b, count(*) over (partition by a order by b) from
abcd;
                                      QUERY PLAN
--------------------------------------------------------------------------------------
 HashAggregate  (cost=363704.46..363720.41 rows=1595 width=16)
   Group Key: a, b, count(*) OVER (?)
   ->  WindowAgg  (cost=345686.08..358790.36 rows=655214 width=16)
         ->  Sort  (cost=345686.08..347324.11 rows=655214 width=8)
               Sort Key: a, b
               ->  Seq Scan on abcd  (cost=0.00..273427.14 rows=655214 width=8)
(6 rows)



> I'm not really all that sure the above query shape makes much sense in
> the real world. Would anyone ever want to use DISTINCT on some results
> containing WindowFuncs?

This could still have been good to have if there were no negative impact
and some benefit in a few cases, but as mentioned before, if hashagg
removes any sort (which happened due to the push-down), all gains will be
lost and we will probably be worse off than before.

--
Regards,
Ankit Kumar Pandey





Re: split TOAST support out of postgres.h

2023-01-10 Thread Noah Misch
On Tue, Jan 10, 2023 at 12:00:46PM -0500, Robert Haas wrote:
> On Tue, Jan 10, 2023 at 9:46 AM Tom Lane  wrote:
> > Now, there is a fair question whether splitting this code out of
> > postgres.h is worth any trouble at all.  TBH my initial reaction
> > had been "no".  But once we found that only 40-ish backend files
> > need to read this new header, I became a "yes" vote because it
> > seems clear that there will be a total-compilation-time benefit.

A time claim with no benchmarks is a red flag.  I've chosen to run one:

export CCACHE_DISABLE=1
change=d952373a987bad331c0e499463159dd142ced1ef
for commit in $change $change^; do
echo === git checkout $commit
git checkout $commit
for n in `seq 1 200`; do make -j20 clean; env time make -j20; done
done

Results:

commit  median mean count
d952373a987bad331c0e499463159dd142ced1ef 49.35 49.37 200
d952373a987bad331c0e499463159dd142ced1ef^ 49.33 49.36 200

That is to say, the patch made the build a bit slower, not faster.  That's
with GCC 4.8.5 (RHEL 7).  I likely should have interleaved the run types, but
in any case the speed win didn't show up.

> I wasn't totally sure about this, either, but I think on balance it's
> probably a smart thing to do. I always found it kind of weird that we
> put that stuff in postgres.h. It seems to privilege the TOAST
> mechanism to an undue degree; what's the argument, for example, that
> TOAST macros are more generally relevant than CHECK_FOR_INTERRUPTS()
> or LWLockAcquire or HeapTuple? It felt to me like we'd just decided
> that one subsystem gets to go into the main header file and everybody
> else just had to have their own headers.
> 
> One thing that's particularly awkward about that is that if you want
> to write some front-end code that knows about how varlenas are stored
> on disk, it was very awkward with the old structure. You're not
> supposed to include "postgres.h" into frontend code, but if the stuff
> you need is defined there then what else can you do? So I generally
> think that anything that's in postgres.h should have a strong claim to
> be not only widely-needed in the backend, but also never needed at all
> in any other executable.

If the patch had just made postgres.h include varatt.h, like it does elog.h,
I'd consider that change a nonnegative.  Grouping things is nice, even if it
makes compilation a bit slower.  That also covers your frontend use case.  How
about that?

I agree fixing any one extension is easy enough.  Thinking back to the
htup_details.h refactor, I found the aggregate pain unreasonable relative to
alleged benefits, even though each individual extension wasn't too bad.
SET_VARSIZE is used more (74 pgxn distributions) than htup_details.h (62).




Re: [DOCS] Stats views and functions not in order?

2023-01-10 Thread Peter Smith
On Wed, Jan 4, 2023 at 6:08 PM vignesh C  wrote:
>
> On Mon, 2 Jan 2023 at 13:47, Peter Eisentraut
>  wrote:
> >
> > On 08.12.22 03:30, Peter Smith wrote:
> > > PSA patches for v9*
> > >
> > > v9-0001 - Now the table rows are ordered per PeterE's suggestions [1]
> >
> > committed

Thanks for pushing.

> >
> > > v9-0002 - All the review comments from DavidJ [2] are addressed
> >
> > I'm not sure about this one.  It removes the "see [link] for details"
> > phrases and instead makes the view name a link.  I think this loses the
> > cue that there is more information elsewhere.  Otherwise, one could
> > think that, say, the entry about pg_stat_activity is the primary source
> > and the link just links to itself.  Also keep in mind that people use
> > media where links are not that apparent (PDF), so the presence of a link
> > by itself cannot be the only cue about the flow of the information.
>

PSA new patch for v10-0001

v9-0001 --> pushed, thanks!
v9-0002 --> I removed this based on the reject reason above
v9-0003 --> v10-0001

> I'm not sure if anything is pending for v9-0003, if there is something
> pending, please post an updated patch for the same.
>

Thanks for the reminder. PSA v10.

--
Kind Regards,
Peter Smith.
Fujitsu Australia


v10-0001-Add-Statistics-Views-section-and-refentry-for-ea.patch
Description: Binary data


low wal_retrieve_retry_interval causes missed signals on Windows

2023-01-10 Thread Nathan Bossart
I discussed this elsewhere [0], but I thought it deserved its own thread.

After setting wal_retrieve_retry_interval to 1ms in the tests, I noticed
that some of the archiving tests began consistently failing on Windows.  I
believe the problem is that WaitForWALToBecomeAvailable() depends on the
call to WaitLatch() for wal_retrieve_retry_interval to ensure that signals
are dispatched (i.e., pgwin32_dispatch_queued_signals()).  With a low retry
interval, WaitForWALToBecomeAvailable() might skip the call to WaitLatch(),
and the signals are never processed.

The attached patch fixes this by always calling WaitLatch(), even if
wal_retrieve_retry_interval milliseconds have already elapsed and the
timeout is 0.

[0] https://postgr.es/m/20221231235019.GA1223171%40nathanxps13

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From d08f01156ce1ce1290bad440da97852c2aa6c24b Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Sat, 31 Dec 2022 15:16:54 -0800
Subject: [PATCH v1 1/1] ensure signals are dispatched in startup process on
 Windows

---
 src/backend/access/transam/xlogrecovery.c | 36 ++-
 1 file changed, 22 insertions(+), 14 deletions(-)

diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index bc3c3eb3e7..b1a4c4ab6d 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -3466,6 +3466,8 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 		 */
 		if (lastSourceFailed)
 		{
+			long		wait_time;
+
 			/*
 			 * Don't allow any retry loops to occur during nonblocking
 			 * readahead.  Let the caller process everything that has been
@@ -3556,33 +3558,39 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 	 * and retry from the archive, but if it hasn't been long
 	 * since last attempt, sleep wal_retrieve_retry_interval
 	 * milliseconds to avoid busy-waiting.
+	 *
+	 * NB: Even if it's already been wal_retrieve_retry_interval
+	 * milliseconds since the last attempt, we still call
+	 * WaitLatch() with the timeout set to 0 to make sure any
+	 * queued signals are dispatched on Windows builds.
 	 */
 	now = GetCurrentTimestamp();
 	if (!TimestampDifferenceExceeds(last_fail_time, now,
 	wal_retrieve_retry_interval))
 	{
-		long		wait_time;
-
 		wait_time = wal_retrieve_retry_interval -
 			TimestampDifferenceMilliseconds(last_fail_time, now);
 
 		elog(LOG, "waiting for WAL to become available at %X/%X",
 			 LSN_FORMAT_ARGS(RecPtr));
+	}
+	else
+		wait_time = 0;
 
-		/* Do background tasks that might benefit us later. */
-		KnownAssignedTransactionIdsIdleMaintenance();
+	/* Do background tasks that might benefit us later. */
+	KnownAssignedTransactionIdsIdleMaintenance();
 
-		(void) WaitLatch(&XLogRecoveryCtl->recoveryWakeupLatch,
-		 WL_LATCH_SET | WL_TIMEOUT |
-		 WL_EXIT_ON_PM_DEATH,
-		 wait_time,
-		 WAIT_EVENT_RECOVERY_RETRIEVE_RETRY_INTERVAL);
-		ResetLatch(&XLogRecoveryCtl->recoveryWakeupLatch);
-		now = GetCurrentTimestamp();
+	(void) WaitLatch(&XLogRecoveryCtl->recoveryWakeupLatch,
+	 WL_LATCH_SET | WL_TIMEOUT |
+	 WL_EXIT_ON_PM_DEATH,
+	 wait_time,
+	 WAIT_EVENT_RECOVERY_RETRIEVE_RETRY_INTERVAL);
+	ResetLatch(&XLogRecoveryCtl->recoveryWakeupLatch);
+	now = GetCurrentTimestamp();
+
+	/* Handle interrupt signals of startup process */
+	HandleStartupProcInterrupts();
 
-		/* Handle interrupt signals of startup process */
-		HandleStartupProcInterrupts();
-	}
 	last_fail_time = now;
 	currentSource = XLOG_FROM_ARCHIVE;
 	break;
-- 
2.25.1



Re: Perform streaming logical transactions by background workers and parallel apply

2023-01-10 Thread Dilip Kumar
On Wed, Jan 11, 2023 at 9:34 AM houzj.f...@fujitsu.com
 wrote:
>

> > I was looking into 0001, IMHO the pid should continue to represent the main
> > apply worker. So the pid will always show the main apply worker which is
> > actually receiving all the changes for the subscription (in short working as
> > logical receiver) and if it is applying changes through a parallel worker 
> > then it
> > should put the parallel worker pid in a new column called 
> > 'parallel_worker_pid'
> > or 'parallel_apply_worker_pid' otherwise NULL.  Thoughts?
>
> Thanks for the comment.
>
> IIRC, you mean something like following, right ?
> (sorry if I misunderstood)
> --
> For parallel apply worker:
> 'pid' column shows the pid of the leader, new column parallel_worker_pid 
> shows its own pid
>
> For leader apply worker:
> 'pid' column shows its own pid, new column parallel_worker_pid shows 0
> --
>
> If so, I am not sure if the above is better, because it is changing the
> existing column's('pid') meaning, the 'pid' will no longer represent the pid 
> of
> the worker itself. Besides, it seems not consistent with what we have for
> parallel query workers in pg_stat_activity. What do you think ?

Actually, I always imagined the pid to be the process id of the worker
that is actually receiving the changes for the subscriber, so keeping
the pid to represent the leader makes more sense to me.  But as you
said, parallel query workers already follow the terminology used in
your patch, showing the pid of the worker itself, so I am fine with the
way you have it.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: MultiXact\SLRU buffers configuration

2023-01-10 Thread Dilip Kumar
On Mon, Jan 9, 2023 at 9:49 AM Andrey Borodin  wrote:
>
> On Tue, Jan 3, 2023 at 5:02 AM vignesh C  wrote:
> > does not apply on top of HEAD as in [1], please post a rebased patch:
> >
> Thanks! Here's the rebase.

I was looking into this patch; it seems like three different
optimizations are squeezed into a single patch: 1) dividing the buffer
space into banks to reduce the sequential search cost, 2) a GUC
parameter to scale the buffer sizes, and 3) some of the buffer size
defaults are modified compared to what is on HEAD.  I think these are
three patches which should be independently committable.

While looking into the first idea of dividing the buffer space into
banks, I see that it will speed up finding the buffers, but OTOH, while
searching for the victim buffer, it can actually hurt performance if
the SLRU pages that are frequently accessed are not evenly distributed
across the banks.  So imagine the case where we have some banks with a
lot of empty slots and other banks from which we frequently have to
evict pages in order to bring new pages in.
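
To make the concern concrete, here is a minimal sketch (not taken from the
patch; the constant and function names are made up) of how a bank-local
lookup and victim search behaves:

```
#include "postgres.h"

#include "access/slru.h"

#define SLRU_BANK_SIZE	8		/* hypothetical bank width */

/*
 * Sketch of the bank idea: a page maps to exactly one bank, and both the
 * lookup and the victim search stay within that bank.  Assumes num_slots
 * is a multiple of SLRU_BANK_SIZE.
 */
static int
slru_select_slot_in_bank(SlruCtl ctl, int pageno)
{
	SlruShared	shared = ctl->shared;
	int			nbanks = shared->num_slots / SLRU_BANK_SIZE;
	int			bankstart = (pageno % nbanks) * SLRU_BANK_SIZE;
	int			victim = bankstart;
	int			slotno;

	for (slotno = bankstart; slotno < bankstart + SLRU_BANK_SIZE; slotno++)
	{
		/* Page already cached in this bank?  Done. */
		if (shared->page_status[slotno] != SLRU_PAGE_EMPTY &&
			shared->page_number[slotno] == pageno)
			return slotno;
		/* Remember an empty slot, if any, as the preferred victim. */
		if (shared->page_status[slotno] == SLRU_PAGE_EMPTY)
			victim = slotno;
	}

	/*
	 * Otherwise a page must be evicted from this bank even if other banks
	 * still have idle slots, which is exactly the unevenness concern above.
	 */
	return victim;
}
```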

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: delay starting WAL receiver

2023-01-10 Thread Nathan Bossart
On Wed, Jan 11, 2023 at 05:20:38PM +1300, Thomas Munro wrote:
> Is the problem here that SIGCHLD is processed ...
> 
> PG_SETMASK(&UnBlockSig); <--- here?
> 
> selres = select(nSockets, &rmask, NULL, NULL, &timeout);
> 
> Meanwhile the SIGCHLD handler code says:
> 
>  * Was it the wal receiver?  If exit status is zero (normal) or one
>  * (FATAL exit), we assume everything is all right just like normal
>  * backends.  (If we need a new wal receiver, we'll start one at the
>  * next iteration of the postmaster's main loop.)
> 
> ... which is true, but that won't be reached for a while in this case
> if the timeout has already been set to 60s.  Your patch makes that
> 100ms, in that case, a time delay that by now attracts my attention
> like a red rag to a bull (I don't know why you didn't make it 0).

I think this is right.  At the very least, it seems consistent with my
observations.

> I'm not sure, but if I got that right, then I think the whole problem
> might automatically go away with CF #4032.  The SIGCHLD processing
> code will run not when signals are unblocked before select() (that is
> gone), but instead *after* the event loop wakes up with WL_LATCH_SET,
> and runs the handler code in the regular user context before dropping
> through to the rest of the main loop.

Yeah, with those patches, the problem goes away.  IIUC the key part is that
the postmaster's latch gets set when SIGCHLD is received, so even if
SIGUSR1 and SIGCHLD are processed out of order, WalReceiverPID gets cleared
and we try to restart it shortly thereafter.  I find this much easier to
reason about.

I'll go ahead and withdraw this patch from the commitfest.  Thanks for
chiming in.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Spinlock is missing when updating two_phase of ReplicationSlot

2023-01-10 Thread Michael Paquier
On Wed, Jan 11, 2023 at 11:07:05AM +0900, Masahiko Sawada wrote:
> I think we should acquire the spinlock when updating fields of the
> replication slot even by its owner. Otherwise readers could see
> inconsistent results. Looking at another place where we update
> two_phase_at, we acquire the spinlock:
> 
> SpinLockAcquire(&slot->mutex);
> slot->data.confirmed_flush = ctx->reader->EndRecPtr;
> if (slot->data.two_phase)
> slot->data.two_phase_at = ctx->reader->EndRecPtr;
> SpinLockRelease(&slot->mutex);
> 
> It seems to me an oversight of commit a8fd13cab0b. I've attached the
> small patch to fix it.

Looks right to me, the paths updating the data related to the slots
are careful about that, even when it comes to fetching a slot from
MyReplicationSlot.  I have been looking around the slot code to see if
there are other inconsistencies, and did not notice anything standing
out.  Will fix..
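
For reference, the shape of the fix under discussion is simply to take the
slot's spinlock around the update, along these lines (a sketch only; the
actual change is in the patch attached upthread):

```
#include "postgres.h"

#include "replication/slot.h"
#include "storage/spin.h"

/*
 * Sketch: even the owning backend takes the per-slot spinlock before
 * mutating fields that other backends may read concurrently.
 */
static void
mark_slot_two_phase(ReplicationSlot *slot, XLogRecPtr two_phase_at)
{
	SpinLockAcquire(&slot->mutex);
	slot->data.two_phase = true;
	slot->data.two_phase_at = two_phase_at;
	SpinLockRelease(&slot->mutex);
}
```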
--
Michael


signature.asc
Description: PGP signature


Re: [PATCH] Support using "all" for the db user in pg_ident.conf

2023-01-10 Thread Michael Paquier
On Wed, Dec 28, 2022 at 09:11:05AM +, Jelte Fennema wrote:
> That's totally fair, I attached a new iteration of this patchset where this
> refactoring and the new functionality are split up in two patches.

The confusion that 0001 is addressing is fair (cough, fc579e1, cough),
still I am wondering whether we could do a bit better to be more
consistent with the lines of the ident file, as of:
- renaming "pg_role" to "pg_user", as we use pg-username or
database-username when referring to it in the docs or the conf sample
file.
- renaming "systemuser" to "system_user_token" to outline that this is
not a simple string but an AuthToken with potentially a regexp?
- Changing the order of the elements in IdentLine to map with the
actual ident lines: usermap, systemuser and pg_user.

> I'm not sure if I'm understanding your concern correctly, but
> the system username will still be compared case sensitively if requested.
> The only thing this changes is that: before comparing the pg_role
> column to the requested pg_role case sensitively, it now checks if 
> the value of the pg_role column is lowercase "all". If that's the case, 
> then the pg_role column is not compared to the requested
> pg_role anymore, and instead access is granted.

Yeah, still my spider sense reacts to this proposal, and I think that
I know why..  How is your proposal different from the following entry
in pg_ident.conf?  As of:
mapname /^(.*)$ \1

This would allow everything, and use for the PG user the same user name
as the one provided by the client.  That's what your proposal with "all"
does, no?  The heart of the problem is that you still require PG roles
that have a 1:1 mapping to the user names provided by the clients.
--
Michael


signature.asc
Description: PGP signature


Re: Add proper planner support for ORDER BY / DISTINCT aggregates

2023-01-10 Thread David Rowley
On Wed, 11 Jan 2023 at 17:32, Tom Lane  wrote:
>
> David Rowley  writes:
> > I think whatever the fix is here, we should likely ensure that the
> > results are consistent regardless of which Aggrefs are the presorted
> > ones.  Perhaps the easiest way to do that, and to ensure the
> > volatile functions are called the same number of times, would just be
> > to never choose Aggrefs with volatile functions when doing
> > make_pathkeys_for_groupagg().
>
> There's existing logic in equivclass.c and other places that tries
> to draw very tight lines around what we'll assume about volatile
> sort expressions (pathkeys).  It sounds like there's someplace in
> this recent patch that didn't get that memo.

I'm not sure I did a good job of communicating my thoughts there. What
I mean is, having volatile functions in the aggregate's ORDER BY or
DISTINCT clause didn't seem very well behaved prior to the presorted
aggregates patch. If I go and fix the bug with the missing targetlist
items, then a query such as:

select string_agg(random()::text, ',' order by random()) from
generate_series(1,3);

should start putting the random() numbers in order where it didn't
prior to 1349d279. Perhaps users might be happy that those are in
order, however, if they then go and change the query to:

select sum(a order by a),string_agg(random()::text, ',' order by
random()) from generate_series(1,3);

then they might become unhappy again that their string_agg is not
ordered the way they specified because the planner opted to sort by
"a" rather than "random()" after the initial scan.

I'm wondering if 1349d279 should simply never have opted to presort
Aggrefs which contain volatile functions, so that the existing behaviour
of unordered output is always given and nobody is fooled into thinking
this works correctly only to be disappointed later when they add some
other aggregate to their query, or whether we should fix both.
Certainly, it seems much easier to do the former.
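
For what it's worth, one way to express the "former" option would be a
volatility test along these lines (schematic only; the helper name is made
up, and the real change would live inside make_pathkeys_for_groupagg()):

```
#include "postgres.h"

#include "nodes/primnodes.h"
#include "optimizer/optimizer.h"

/*
 * Schematic sketch: only consider an Aggref as a candidate for presorting
 * if nothing inside it is volatile, so the planner's choice can never
 * change how often volatile functions are evaluated.
 */
static bool
aggref_is_presortable(Aggref *aggref)
{
	return !contain_volatile_functions((Node *) aggref);
}
```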

David




Re: Add a new pg_walinspect function to extract FPIs from WAL records

2023-01-10 Thread Michael Paquier
On Tue, Jan 10, 2023 at 09:29:03AM +0100, Drouvot, Bertrand wrote:
> Thanks for updating the patch!
> 
> +-- Compare FPI from WAL record and page from table, they must be same
> 
> I think "must be the same" or "must be identical" sounds better (but not 100% 
> sure).
> 
> Except this nit, V4 looks good to me.

+postgres=# SELECT lsn, tablespace_oid, database_oid, relfile_number,
block_number, fork_name, length(fpi) > 0 as fpi_ok FROM
pg_get_wal_fpi_info('0/7418E60', '0/7518218');

This query in the docs is too long IMO.  Could you split that across
multiple lines for readability?

+  pg_get_wal_fpi_info(start_lsn pg_lsn,
+  end_lsn pg_lsn,
+  lsn OUT pg_lsn,
+  tablespace_oid OUT oid,
+  database_oid OUT oid,
+  relfile_number OUT oid,
+  block_number OUT int8,
+  fork_name OUT text,
+  fpi OUT bytea)
I am a bit surprised by this format, used to define the functions part
of the module in the docs, while we have examples that actually show
what's printed out.  I understand that this comes from the original
commit of the module, but the rendered docs are really hard to parse
as well, no?  FWIW, I think that this had better be fixed as well in
the docs of v15..  Showing a full set of attributes for the returned
record is fine by me, still if these are too long we could just use
\x.  For this one, I think that there is little point in showing 14
records, so I would stick with a style similar to pageinspect.

+CREATE FUNCTION pg_get_wal_fpi_info(IN start_lsn pg_lsn,
+IN end_lsn pg_lsn,
+OUT lsn pg_lsn,
+   OUT tablespace_oid oid,
Slight indentation issue here.

Using "relfile_number" would be a first, for what is defined in the
code and the docs as a filenode.

+SELECT pg_current_wal_lsn() AS wal_lsn4 \gset
+-- Get FPI from WAL record
+SELECT fpi AS page_from_wal FROM pg_get_wal_fpi_info(:'wal_lsn3', :'wal_lsn4')
+  WHERE relfile_number = :'sample_tbl_oid' \gset
I would be tempted to keep the checks run here minimal with only a
basic set of checks on the LSN, without the dependencies to
pageinspect (tuple_data_split and get_raw_page), which would be fine
enough to check the execution of the function.

FWIW, I am surprised by the design choice behind ValidateInputLSNs()
to allow data to be gathered until the end of WAL in some cases, but
to not allow it in others.  It is likely too late to come back to this
choice for the existing functions in v15 (or could we?), but couldn't it
be useful to make this new FPI function work at least with an insanely
high LSN value, to make sure that we fetch all the FPIs from a given
start position up to the end of WAL?  That looks like a pretty good
default behavior to me, rather than issuing an error when an LSN in
the future is given..  I am really wondering why we have
ValidateInputLSNs(till_end_of_wal=false) to begin with, while we could
just allow any LSN value in the future automatically, as we can know
the current insert or replay LSNs (depending on the recovery state).
--
Michael


signature.asc
Description: PGP signature


Re: Fix pg_publication_tables to exclude generated columns

2023-01-10 Thread Tom Lane
Amit Kapila  writes:
>> On Mon, Jan 9, 2023 11:06 PM Tom Lane  wrote:
>>> We could just not fix it in the back branches.  I'd argue that this is
>>> as much a definition change as a bug fix, so it doesn't really feel
>>> like something to back-patch anyway.

> So, if we don't backpatch then it could lead to an error when it
> shouldn't have which is clearly a bug. I think we should backpatch
> this unless Tom or others are against it.

This isn't a hill that I'm ready to die on ... but do we have any field
complaints about this?  If not, I still lean against a back-patch.
I think there's a significant risk of breaking case A while fixing
case B when we change this behavior, and that's something that's
better done only in a major release.

regards, tom lane




Re: Add proper planner support for ORDER BY / DISTINCT aggregates

2023-01-10 Thread Tom Lane
David Rowley  writes:
> I think whatever the fix is here, we should likely ensure that the
> results are consistent regardless of which Aggrefs are the presorted
> ones.  Perhaps the easiest way to do that, and to ensure the
> volatile functions are called the same number of times, would just be
> to never choose Aggrefs with volatile functions when doing
> make_pathkeys_for_groupagg().

There's existing logic in equivclass.c and other places that tries
to draw very tight lines around what we'll assume about volatile
sort expressions (pathkeys).  It sounds like there's someplace in
this recent patch that didn't get that memo.

regards, tom lane




Re: [PATCH] Improve ability to display optimizer analysis using OPTIMIZER_DEBUG

2023-01-10 Thread Tom Lane
John Naylor  writes:
> Note that the TODO list has accumulated some cruft over the years. Some
> time ago I started an effort to remove outdated/undesirable entries, and I
> should get back to that, but for the present, please take the warning at
> the top to heart:

> "WARNING for Developers: Unfortunately this list does not contain all the
> information necessary for someone to start coding a feature. Some of these
> items might have become unnecessary since they were added --- others might
> be desirable but the implementation might be unclear. When selecting items
> listed below, be prepared to first discuss the value of the feature. Do not
> assume that you can select one, code it and then expect it to be committed.
> "

I think we could make that even stronger: there's basically nothing on
the TODO list that isn't problematic in some way.  Otherwise it would
have been done already.  The entries involve large amounts of work,
or things that are subtler than they might appear, or cases where the
desirable semantics aren't clear, or tradeoffs that there's not
consensus about, or combinations of those.

IME it's typically a lot more productive to approach things via
"scratch your own itch".  If a problem is biting you directly, then
at least you have some clear idea of what it is that needs to be fixed.
You might have to work up to an understanding of how to fix it, but
you have a clear goal.

regards, tom lane




Re: delay starting WAL receiver

2023-01-10 Thread Thomas Munro
On Wed, Jan 11, 2023 at 5:20 PM Thomas Munro  wrote:
> (I don't know why you didn't make it 0)

(Oh, I see why it had to be non-zero to avoiding burning CPU, ignore that part.)




Re: Fix pg_publication_tables to exclude generated columns

2023-01-10 Thread Amit Kapila
On Tue, Jan 10, 2023 at 8:38 AM shiy.f...@fujitsu.com
 wrote:
>
> On Mon, Jan 9, 2023 11:06 PM Tom Lane  wrote:
> >
> > Amit Kapila  writes:
> > > On Mon, Jan 9, 2023 at 5:29 PM shiy.f...@fujitsu.com
> > >  wrote:
> > >> I think one way to fix it is to modify pg_publication_tables query to 
> > >> exclude
> > >> generated columns. But in this way, we need to bump catalog version when
> > fixing
> > >> it in back-branch. Another way is to modify function
> > >> pg_get_publication_tables()'s return value to contain all supported 
> > >> columns
> > if
> > >> no column list is specified, and we don't need to change system view.
> >
> > > That sounds like a reasonable approach to fix the issue.
> >
> > We could just not fix it in the back branches.  I'd argue that this is
> > as much a definition change as a bug fix, so it doesn't really feel
> > like something to back-patch anyway.
> >
>
> If this is not fixed in back-branch, in some cases we will get an error when
> creating/refreshing subscription because we query pg_publication_tables in
> column list check.
>
> e.g.
>
> -- publisher
> CREATE TABLE test_mix_4 (a int PRIMARY KEY, b int, c int, d int GENERATED 
> ALWAYS AS (a + 1) STORED);
> CREATE PUBLICATION pub_mix_7 FOR TABLE test_mix_4 (a, b, c);
> CREATE PUBLICATION pub_mix_8 FOR TABLE test_mix_4;
>
> -- subscriber
> CREATE TABLE test_mix_4 (a int PRIMARY KEY, b int, c int, d int);
>
> postgres=# CREATE SUBSCRIPTION sub1 CONNECTION 'port=5432' PUBLICATION 
> pub_mix_7, pub_mix_8;
> ERROR:  cannot use different column lists for table "public.test_mix_4" in 
> different publications
>
> I think it might be better to fix it in back-branch. And if we fix it by
> modifying pg_get_publication_tables(), we don't need to bump catalog version 
> in
> back-branch, I think this seems acceptable.
>

So, if we don't backpatch then it could lead to an error when it
shouldn't have, which is clearly a bug. I think we should backpatch
this unless Tom or others are against it.

-- 
With Regards,
Amit Kapila.




Re: Perform streaming logical transactions by background workers and parallel apply

2023-01-10 Thread Amit Kapila
On Wed, Jan 11, 2023 at 9:34 AM houzj.f...@fujitsu.com
 wrote:
>
> On Tuesday, January 10, 2023 7:48 PM Dilip Kumar  
> wrote:
> >
> > I was looking into 0001, IMHO the pid should continue to represent the main
> > apply worker. So the pid will always show the main apply worker which is
> > actually receiving all the changes for the subscription (in short working as
> > logical receiver) and if it is applying changes through a parallel worker 
> > then it
> > should put the parallel worker pid in a new column called 
> > 'parallel_worker_pid'
> > or 'parallel_apply_worker_pid' otherwise NULL.  Thoughts?
>
> Thanks for the comment.
>
> IIRC, you mean something like following, right ?
> (sorry if I misunderstood)
> --
> For parallel apply worker:
> 'pid' column shows the pid of the leader, new column parallel_worker_pid 
> shows its own pid
>
> For leader apply worker:
> 'pid' column shows its own pid, new column parallel_worker_pid shows 0
> --
>
> If so, I am not sure if the above is better, because it is changing the
> existing column's('pid') meaning, the 'pid' will no longer represent the pid 
> of
> the worker itself. Besides, it seems not consistent with what we have for
> parallel query workers in pg_stat_activity. What do you think ?
>

+1. I think it makes sense to keep it similar to pg_stat_activity.

+  
+   Process ID of the leader apply worker, if this process is a apply
+   parallel worker. NULL if this process is a leader apply worker or a
+   synchronization worker.

Can we change the above description to something like: "Process ID of
the leader apply worker, if this process is a parallel apply worker.
NULL if this process is a leader apply worker or does not participate
in parallel apply, or a synchronization worker."?

-- 
With Regards,
Amit Kapila.




Re: delay starting WAL receiver

2023-01-10 Thread Thomas Munro
On Wed, Jan 11, 2023 at 2:08 PM Nathan Bossart  wrote:
> I discussed this a bit in a different thread [0], but I thought it deserved
> its own thread.
>
> After setting wal_retrieve_retry_interval to 1ms in the tests, I noticed
> that the recovery tests consistently take much longer.  Upon further
> inspection, it looks like a similar race condition to the one described in
> e5d494d's commit message.  With some added debug logs, I see that all of
> the callers of MaybeStartWalReceiver() complete before SIGCHLD is
> processed, so ServerLoop() waits for a minute before starting the WAL
> receiver.
>
> The attached patch fixes this by adjusting DetermineSleepTime() to limit
> the sleep to at most 100ms when WalReceiverRequested is set, similar to how
> the sleep is limited when background workers must be restarted.

Is the problem here that SIGCHLD is processed ...

PG_SETMASK(&UnBlockSig); <--- here?

selres = select(nSockets, &rmask, NULL, NULL, &timeout);

Meanwhile the SIGCHLD handler code says:

 * Was it the wal receiver?  If exit status is zero (normal) or one
 * (FATAL exit), we assume everything is all right just like normal
 * backends.  (If we need a new wal receiver, we'll start one at the
 * next iteration of the postmaster's main loop.)

... which is true, but that won't be reached for a while in this case
if the timeout has already been set to 60s.  Your patch makes that
100ms, in that case, a time delay that by now attracts my attention
like a red rag to a bull (I don't know why you didn't make it 0).

I'm not sure, but if I got that right, then I think the whole problem
might automatically go away with CF #4032.  The SIGCHLD processing
code will run not when signals are unblocked before select() (that is
gone), but instead *after* the event loop wakes up with WL_LATCH_SET,
and runs the handler code in the regular user context before dropping
through to the rest of the main loop.




Re: Strengthen pg_waldump's --save-fullpage tests

2023-01-10 Thread Bharath Rupireddy
On Wed, Jan 11, 2023 at 6:32 AM Michael Paquier  wrote:
>
> On Tue, Jan 10, 2023 at 05:25:44PM +0100, Drouvot, Bertrand wrote:
> > I like the idea of comparing the full page (and not just the LSN) but
> > I'm not sure that adding the pageinspect dependency is a good thing.
> >
> > What about extracting the block directly from the relation file and
> > comparing it with the one extracted from the WAL? (We'd need to skip the
> > first 8 bytes to skip the LSN though).
>
> Byte-by-byte counting for the page hole?  The page checksum would
> matter as well, FWIW, as it is not set in WAL and a FPW logged in WAL
> means that the page got modified.  It means that it could have been
> flushed, updating its pd_lsn and its pd_checksum on the way.

Right. The LSN of the FPI from the WAL record and that of the page from
the table won't be the same; essentially, FPI LSN <= table page LSN.
Since the LSNs are different, so are the checksums. This is the reason
we have masking functions in common/bufmask.c and rm_mask functions
defined for some of the resource managers when verifying FPI consistency
in verifyBackupPageConsistency(). Note that pageinspect can give only
unmasked/raw page data, which means byte-by-byte comparison isn't
possible with pageinspect either, hence I was comparing only the rows
with tuple_data_split().

Therefore, reading bytes from the table page and comparing them
byte-by-byte with the FPI requires us to invent new masking functions in
the tests, which is simply a no-go IMO.
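
(For illustration only, and assuming the backend's masking support in
access/bufmask.h were available to the test code, the comparison would
have to look roughly like this sketch; it is not code from the thread or
the patch:)

```
#include "postgres.h"

#include "access/bufmask.h"
#include "storage/bufpage.h"

/*
 * Sketch: both images must have their LSN and checksum neutralized (plus
 * any rm-specific masking) before a memcmp() is meaningful.
 */
static bool
pages_match_after_masking(Page page_from_wal, Page page_from_table)
{
	mask_page_lsn_and_checksum(page_from_wal);
	mask_page_lsn_and_checksum(page_from_table);

	return memcmp(page_from_wal, page_from_table, BLCKSZ) == 0;
}
```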

As the concern here is to not establish a pageinspect dependency for
pg_waldump's tests, I'm fine with withdrawing this patch and being happy
with what we have today.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Add proper planner support for ORDER BY / DISTINCT aggregates

2023-01-10 Thread David Rowley
On Wed, 11 Jan 2023 at 15:46, Richard Guo  wrote:
> However the scan/join plan's
> tlist does not contain random(), which I think we need to fix.

I was wondering if that's true and considered that we don't want to
evaluate random() for the sort and then again when doing the aggregate
transitions, but I see that did not really work before 1349d279 either, per:

postgres=# set enable_presorted_aggregate=0;
SET
postgres=# select string_agg(random()::text, ',' order by random())
from generate_series(1,3);
string_agg
---
 0.8659110018246505,0.15612649559563474,0.2022878955613403
(1 row)

I'd have expected those random numbers to be concatenated in ascending order.

Running: select random() from generate_Series(1,3) order by random();
gives me the results in the order I'd have expected.

I think whatever the fix is here, we should likely ensure that the
results are consistent regardless of which Aggrefs are the presorted
ones.  Perhaps the easiest way to do that, and to ensure the
volatile functions are called the same number of times, would just be
to never choose Aggrefs with volatile functions when doing
make_pathkeys_for_groupagg().

David




RE: [PATCH] Support % wildcard in extension upgrade filenames

2023-01-10 Thread Regina Obe
> Sandro Santilli  writes:
> > On Mon, Jan 09, 2023 at 05:51:49PM -0500, Tom Lane wrote:
> >> ... you still need one script file for each supported upgrade step
> 
> > That's exactly the problem we're trying to solve here.
> > The include support is nice on itself, but won't solve our problem.
> 
> The script-file-per-upgrade-path aspect solves a problem that you have,
> whether you admit it or not; I think you simply aren't realizing that
> because you have not had to deal with the consequences of your proposed
> feature.  Namely that you won't have any control over what the backend
> will try to do in terms of upgrade paths.
> 

It would be nice if there were some way to apply at least a range match
to upgrades, or to explicitly state in the control file which paths
should be applied for an upgrade.  Ultimately, as Sandro mentioned, it's
the 1000-files issue we are trying to address.

The only way we can fix that in the current setup is to move to a
minor-version mode, which means we can never do micro updates.

> As an example, suppose that a database has foo 4.0 installed, and the DBA
> decides to try to downgrade to 3.0.  With the system as it stands, if
> you've provided foo--4.0--3.0.sql then the conversion will go through, and
> presumably it will work because you tested that that script does what it
> is intended to.  If you haven't provided any such downgrade script, then
> ALTER EXTENSION UPDATE will say "Sorry Dave, I'm afraid I can't do that"
> and no harm is done.
> 
In our case we've already addressed that in our script, because our
upgrades don't rely on what the extension model tells us the version is;
ultimately what our postgis..version() functions tell us is the true
determinant (one for the lib file and one for the script).
But you are right, that's a selfish way of thinking about it, because we
have internal plumbing to handle that issue already and other projects
probably don't.

What I was hoping for was to have a section in the control file to say
something like

Upgrade patterns: 3.0.%--4.0.0.sql, 2.0.%--4.0.0.sql

(and perhaps some logic to guarantee that no two patterns can match the
same path), so you wouldn't be able to have a set of patterns like

Upgrade patterns: %--4.0.0.sql, 3.0.%--4.0.0.sql, 2.0.%--4.0.0.sql

That would allow your proposed include mechanism to work, let us use it,
and still reduce our file footprint.

I'd even settle for absolute paths stated in the control file if we could
dispense with the need for a file per path to push you forward.

In that mode, your downgrade issue would never happen, even with the way
people normally create scripts now.

> So I really think this is a case of "be careful what you ask for, you
> might get it".  Even if PostGIS is willing to put in the amount of
> infrastructure legwork needed to make such a design bulletproof, I'm
> quite sure nobody else will manage to use such a thing successfully.
> I'd rather spend our development effort on a feature that has more than
> one use-case.
> 
>   regards, tom lane

I think you are underestimating the number of extensions that have this
issue and could benefit (agreed, not with the current incarnation of the patch).
It's not just PostGIS: pgRouting (56 files) and the h3 extension (37 files
in its last release, most of them do-nothing scripts except at the minor
update steps) have the same issue (and both pgRouting and h3 do have little
bitty script updates that follow the best-practice way of doing this).
For pgRouting alone there are 56 files for the last release, which could
easily be reduced to about 5 files if the paths could be explicitly stated
in the control file.

Yes all those extensions should dispense with their dreams of having micro
updates (I honestly wish they would).
 
Perhaps I'm a little obsessive, but it annoys me to see 1000s of files in my
extension folder, when I know even if I followed best practice I only need
like 5 files for each extension.

And as a packager to have to ship these files is even more annoying.

The reality is that for micros, no one ships new functions (or at least
shouldn't), so all functions just need to be replaced, perhaps with the
micro update file you propose.

Heck if we could even have the option to itemize our own upgrade file paths
explicitly in the control file, 

Like:

paths: 3.0.1--4.0.0 = 3--4.0.0.sql, 3.0.2--4.0.0 = 3--4.0.0.sql,
2.0.2--4.0.0 = 2--4.0.0.sql

I'd be happy, and PostgreSQL could do the math pretending the files it
thinks it needs are there.

So perhaps we could rework this patch to add something in the control file
that explicitly states the upgrade paths.

I think that would satisfy your concern? And I'd be eternally grateful. 

Thanks,
Regina





RE: Perform streaming logical transactions by background workers and parallel apply

2023-01-10 Thread houzj.f...@fujitsu.com
On Tuesday, January 10, 2023 7:48 PM Dilip Kumar  wrote:
> 
> On Tue, Jan 10, 2023 at 10:26 AM houzj.f...@fujitsu.com
>  wrote:
> >
> > On Monday, January 9, 2023 4:51 PM Amit Kapila 
> wrote:
> > >
> > > On Sun, Jan 8, 2023 at 11:32 AM houzj.f...@fujitsu.com
> > >  wrote:
> > > >
> > > > On Sunday, January 8, 2023 11:59 AM houzj.f...@fujitsu.com
> > >  wrote:
> > > > > Attach the updated patch set.
> > > >
> > > > Sorry, the commit message of 0001 was accidentally deleted, just
> > > > attach the same patch set again with commit message.
> > > >
> > >
> > > Pushed the first (0001) patch.
> >
> > Thanks for pushing, here are the remaining patches.
> > I reordered the patch number to put patches that are easier to commit
> > in the front of others.
> 
> I was looking into 0001, IMHO the pid should continue to represent the main
> apply worker. So the pid will always show the main apply worker which is
> actually receiving all the changes for the subscription (in short working as
> logical receiver) and if it is applying changes through a parallel worker 
> then it
> should put the parallel worker pid in a new column called 
> 'parallel_worker_pid'
> or 'parallel_apply_worker_pid' otherwise NULL.  Thoughts?

Thanks for the comment.

IIRC, you mean something like the following, right?
(sorry if I misunderstood)
--
For parallel apply worker:
'pid' column shows the pid of the leader, new column parallel_worker_pid shows 
its own pid

For leader apply worker:
'pid' column shows its own pid, new column parallel_worker_pid shows 0
--

If so, I am not sure the above is better, because it changes the meaning of
the existing 'pid' column: 'pid' would no longer represent the pid of the
worker itself. Besides, it doesn't seem consistent with what we have for
parallel query workers in pg_stat_activity. What do you think?
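
(For reference, the existing convention for parallel query workers is that
each worker row in pg_stat_activity keeps its own pid and points at the
leader separately, e.g.:

SELECT pid, leader_pid, backend_type
FROM pg_stat_activity
WHERE backend_type = 'parallel worker';

Following that precedent here would mean keeping 'pid' as each process's own
pid and exposing the leader in a separate column.)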

Best regards,
Hou zj




Re: releasing ParallelApplyTxnHash when pa_launch_parallel_worker returns NULL

2023-01-10 Thread Ted Yu
On Tue, Jan 10, 2023 at 7:55 PM houzj.f...@fujitsu.com <
houzj.f...@fujitsu.com> wrote:

> On Wednesday, January 11, 2023 10:21 AM Ted Yu 
> wrote:
> > /* First time through, initialize parallel apply worker state
> > hashtable. */
> > if (!ParallelApplyTxnHash)
> >
> > I think it would be better if `ParallelApplyTxnHash` is created by the
> > first successful parallel apply worker.
>
> Thanks for the suggestion. But I am not sure if it's worth changing the
> order here, because it will only optimize the case where the user enables
> parallel apply but never gets an available worker, which should be rare.
> And in such a case, it'd be better to increase the number of workers or
> disable the parallel mode.
>
> Best Regards,
> Hou zj
>

I think even though the chance is rare, we shouldn't leak the resource.

The `ParallelApplyTxnHash` shouldn't be created if not even a single parallel
apply worker can be launched.


Re: [PATCH] Improve ability to display optimizer analysis using OPTIMIZER_DEBUG

2023-01-10 Thread John Naylor
On Tue, Jan 3, 2023 at 1:59 PM Ankit Kumar Pandey 
wrote:
>
>
> On 03/01/23 08:38, David Rowley wrote:
> >
> > Do you actually have a need for this or are you just trying to tick
> > off some TODO items?
> >
> I would say the latter, but the reason I picked it up was more on the side
> of learning the optimizer better.

Note that the TODO list has accumulated some cruft over the years. Some
time ago I started an effort to remove outdated/undesirable entries, and I
should get back to that, but for the present, please take the warning at
the top to heart:

"WARNING for Developers: Unfortunately this list does not contain all the
information necessary for someone to start coding a feature. Some of these
items might have become unnecessary since they were added --- others might
be desirable but the implementation might be unclear. When selecting items
listed below, be prepared to first discuss the value of the feature. Do not
assume that you can select one, code it and then expect it to be committed.
"

--
John Naylor
EDB: http://www.enterprisedb.com


RE: releasing ParallelApplyTxnHash when pa_launch_parallel_worker returns NULL

2023-01-10 Thread houzj.f...@fujitsu.com
On Wednesday, January 11, 2023 10:21 AM Ted Yu  wrote:
> /* First time through, initialize parallel apply worker state 
> hashtable. */
> if (!ParallelApplyTxnHash)
> 
> I think it would be better if `ParallelApplyTxnHash` is created by the first
> successful parallel apply worker.

Thanks for the suggestion. But I am not sure if it's worth changing the
order here, because it will only optimize the case where the user enables
parallel apply but never gets an available worker, which should be rare. And
in such a case, it'd be better to increase the number of workers or disable
the parallel mode.

Best Regards,
Hou zj


Re: Can we let extensions change their dumped catalog schemas?

2023-01-10 Thread Tom Lane
Jacob Champion  writes:
> We'd like to be allowed to change the schema for a table that's been
> marked in the past with pg_extension_config_dump().

> Unless I'm missing something obvious (please, let it be that) there's no
> way to do this safely. Once you've marked an internal table as dumpable,
> its schema is effectively frozen if you want your dumps to work across
> versions, because otherwise you'll try to restore that "catalog" data
> into a table that has different columns. And while sometimes you can
> make that work, it doesn't in the general case.

I agree that's a problem, but it's not that we're arbitrarily prohibiting
something that would work.  What, concretely, do you think could be
done to improve the situation?

> 2) Provide a way to record the exact version of an extension in a dump.

Don't really see how that helps?  I also fear that it will break
a bunch of use-cases that work fine today, which are exactly the
ones for which we originally defined pg_dump as *not* committing
to a particular extension version.

It feels like what we really need here is some way to mutate the
old format of an extension config table into the new format.
Simple addition of new columns shouldn't be a problem (in fact,
I think that works already, or could easily be made to).  If you
want some ETL processing then it's harder :-(.  Could an ON INSERT
trigger on an old config table transpose converted data into a
newer config table?
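
A minimal sketch of that idea (the table and column names, and the
old-to-new transformation, are entirely invented here):

CREATE FUNCTION my_ext_config_v1_redirect() RETURNS trigger
LANGUAGE plpgsql AS $$
BEGIN
    -- transpose the old two-column row into the new config table's layout
    INSERT INTO my_ext_config_v2 (name, settings)
    VALUES (NEW.name, jsonb_build_object('value', NEW.value));
    RETURN NULL;  -- suppress the insert into the old-format table itself
END;
$$;

CREATE TRIGGER my_ext_config_v1_redirect
    BEFORE INSERT ON my_ext_config_v1
    FOR EACH ROW EXECUTE FUNCTION my_ext_config_v1_redirect();

The new extension version would still have to ship the old table (or a stub
of it) for the restored data to land on.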

Another point that ought to be made here is that pg_dump is not
the only outside consumer of extension config data.  You're likely
to break some applications if you change a config table too much.
That's not an argument that we shouldn't try to make pg_dump more
forgiving, but I'm not sure that we need to move heaven and earth.

regards, tom lane




Re: [PATCH] Simple code cleanup in tuplesort.c.

2023-01-10 Thread John Naylor
On Mon, Jan 9, 2023 at 7:29 PM Xing Guo  wrote:
>
> Thank you John. This is my first patch, I'll keep it in mind that
> adding a version number next time I sending the patch.

Welcome to the community! You may also consider reviewing a patch from the
current commitfest, since we can always use additional help there.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: ATTACH PARTITION seems to ignore column generation status

2023-01-10 Thread Amit Langote
On Wed, Jan 11, 2023 at 7:13 AM Tom Lane  wrote:
> I wrote:
> > Amit Langote  writes:
> >> Thanks for the patch.  It looks good, though I guess you said that we
> >> should also change the error message that CREATE TABLE ... PARTITION
> >> OF emits to match the other cases while we're here.  I've attached a
> >> delta patch.
>
> > Thanks.  I hadn't touched that issue because I wasn't entirely sure
> > which error message(s) you were unhappy with.  These changes look
> > fine offhand.
>
> So, after playing with that a bit ... removing the block in
> parse_utilcmd.c allows you to do
>
> regression=# CREATE TABLE gtest_parent (f1 date NOT NULL, f2 bigint, f3 
> bigint GENERATED ALWAYS AS (f2 * 2) STORED) PARTITION BY RANGE (f1);
> CREATE TABLE
> regression=# CREATE TABLE gtest_child PARTITION OF gtest_parent (
> regression(#f3 WITH OPTIONS GENERATED ALWAYS AS (f2 * 3) STORED
> regression(# ) FOR VALUES FROM ('2016-07-01') TO ('2016-08-01');
> CREATE TABLE
> regression=# \d gtest_child
>   Table "public.gtest_child"
>  Column |  Type  | Collation | Nullable |   Default
> ++---+--+-
>  f1 | date   |   | not null |
>  f2 | bigint |   |  |
>  f3 | bigint |   |  | generated always as (f2 * 3) stored
> Partition of: gtest_parent FOR VALUES FROM ('2016-07-01') TO ('2016-08-01')
>
> regression=# insert into gtest_parent values('2016-07-01', 42);
> INSERT 0 1
> regression=# table gtest_parent;
>  f1 | f2 | f3
> ++-
>  2016-07-01 | 42 | 126
> (1 row)
>
> That is, you can make a partition with a different generated expression
> than the parent has.  Do we really want to allow that?  I think it works
> as far as the backend is concerned, but it breaks pg_dump, which tries
> to dump this state of affairs as
>
> CREATE TABLE public.gtest_parent (
> f1 date NOT NULL,
> f2 bigint,
> f3 bigint GENERATED ALWAYS AS ((f2 * 2)) STORED
> )
> PARTITION BY RANGE (f1);
>
> CREATE TABLE public.gtest_child (
> f1 date NOT NULL,
> f2 bigint,
> f3 bigint GENERATED ALWAYS AS ((f2 * 3)) STORED
> );
>
> ALTER TABLE ONLY public.gtest_parent ATTACH PARTITION public.gtest_child FOR 
> VALUES FROM ('2016-07-01') TO ('2016-08-01');
>
> and that fails at reload because the ATTACH PARTITION code path
> checks for equivalence of the generation expressions.
>
> This different-generated-expression situation isn't really morally
> different from different ordinary DEFAULT expressions, which we
> do endeavor to support.

Ah, right, we are a bit more flexible in allowing that.  Though,
partition-specific defaults, unlike generated columns, are not
respected when inserting/updating via the parent:

create table partp (a int, b int generated always as (a+1) stored, c
int default 0) partition by list (a);
create table partc1 partition of partp (b generated always as (a+2)
stored, c default 1) for values in (1);
insert into partp values (1);
table partp;
 a | b | c
---+---+---
 1 | 3 | 0
(1 row)

create table partc2 partition of partp (b generated always as (a+2)
stored) for values in (2);
update partp set a = 2;
table partp;
 a | b | c
---+---+---
 2 | 4 | 0
(1 row)

> So maybe we should deem this a supported
> case and remove ATTACH PARTITION's insistence that the generation
> expressions match

I tend to agree now that we have 3f7836ff6.

> ... which I think would be a good thing anyway,
> because that check-for-same-reverse-compiled-expression business
> is pretty cheesy in itself.

Hmm, yeah, we usually transpose a parent's expression into one that
has a child's attribute numbers and vice versa when checking their
equivalence.

> AFAIK, 3f7836ff6 took care of the
> only problem that the backend would have with this, and pg_dump
> looks like it will work as long as the backend will take the
> ATTACH command.

Right.

> I also looked into making CREATE TABLE ... PARTITION OF reject
> this case; but that's much harder than it sounds, because what we
> have at the relevant point is a raw (unanalyzed) expression for
> the child's generation expression but a cooked one for the
> parent's, so there is no easy way to match the two.
>
> In short, it's seeming like the rule for both partitioning and
> traditional inheritance ought to be "a child column must have
> the same generated-or-not property as its parent, but their
> generation expressions need not be the same".  Thoughts?

Agreed.

I've updated your disallow-generated-child-columns-2.patch to do this,
and have also merged the delta patch that I had attached with my last
email, whose contents you seem to agree with.
--
Thanks, Amit Langote
EDB: http://www.enterprisedb.com


disallow-generated-child-columns-3.patch
Description: Binary data


Re: pgsql: Add new GUC createrole_self_grant.

2023-01-10 Thread Tom Lane
Robert Haas  writes:
> On Tue, Jan 10, 2023 at 9:40 PM Tom Lane  wrote:
>> The scenario I'm worried about could be closed, mostly, if we were willing
>> to invent an intermediate GUC privilege level "can be set interactively
>> but only by CREATEROLE holders" ("PGC_CRSET"?).

> Of course, if it's possible for a non-CREATEROLE user to set the value
> that a CREATEROLE user experiences, that'd be more of a problem --

That's exactly the case I'm worried about, and it's completely reachable
if a CREATEROLE user makes a SECURITY DEFINER function that executes
an affected GRANT and is callable by an unprivileged user.  Now, that
probably isn't terribly likely, and it's unclear that there'd be any
serious security consequence even if the GRANT did do something
different than the author of the SECURITY DEFINER function was expecting.
Nonetheless, I'm feeling itchy about this chain of assumptions.

> To answer your question directly, though, I don't know of any other
> setting where that would be a useful level.

Yeah, I didn't think of one either.  Also, even if we invented PGC_CRSET,
it can't stop one CREATEROLE user from attacking another one, assuming
that there is some interesting attack that can be constructed here.
I think the whole point of your recent patches is to not assume that
CREATEROLE users are mutually trusting, so that's bad.

Bottom line is that a GUC doesn't feel like the right mechanism to use.
What do you think about inventing a role property, or a grantable role
that controls this behavior?

regards, tom lane




Re: [PoC] Improve dead tuple storage for lazy vacuum

2023-01-10 Thread John Naylor
On Tue, Jan 10, 2023 at 7:08 PM Masahiko Sawada 
wrote:

> It looks no problem in terms of vacuum integration, although I've not
> fully tested yet. TID store uses the radix tree as the main storage,
> and with the template radix tree, the data types for shared and
> non-shared will be different. TID store can have an union for the
> radix tree and the structure would be like follows:

> /* Storage for Tids */
> union tree
> {
> local_radix_tree*local;
> shared_radix_tree   *shared;
> };

We could possibly go back to using a common data type for this, but with
unused fields in each setting, as before. We would have to be more careful
of things like the 32-bit crash from a few weeks ago.

> In the functions of TID store, we need to call either local or shared
> radix tree functions depending on whether TID store is shared or not.
> We need if-branch for each key-value pair insertion, but I think it
> would not be a big performance problem in TID store use cases, since
> vacuum is an I/O intensive operation in many cases.

Also, the branch will be easily predicted. That was still true in earlier
patches, but with many more branches and fatter code paths.

> Overall, I think
> there is no problem and I'll investigate it in depth.

Okay, great. If the separate-functions approach turns out to be ugly, we
can always go back to the branching approach for shared memory. I think
we'll want to keep this as a template overall, at least to allow different
value types and to ease adding variable-length keys if someone finds a need.

> Apart from that, I've been considering the lock support for shared
> radix tree. As we discussed before, the current usage (i.e, only
> parallel index vacuum) doesn't require locking support at all, so it
> would be enough to have a single lock for simplicity.

Right, that should be enough for PG16.

> If we want to
> use the shared radix tree for other use cases such as the parallel
> heap vacuum or the replacement of the hash table for shared buffers,
> we would need better lock support.

For future parallel pruning, I still think a global lock is "probably" fine
if the workers buffer in local arrays. Highly concurrent applications will
need additional work, of course.

> For example, if we want to support
> Optimistic Lock Coupling[1],

Interesting, from the same authors!

> we would need to change not only the node
> structure but also the logic. Which probably leads to widen the gap
> between the code for non-shared and shared radix tree. In this case,
> once we have a better radix tree optimized for shared case, perhaps we
> can replace the templated shared radix tree with it. I'd like to hear
> your opinion on this line.

I'm not in a position to speculate on how best to do scalable concurrency,
much less how it should coexist with the local implementation. It's
interesting that their "ROWEX" scheme gives up maintaining order in the
linear nodes.

> > One review point I'll mention: Somehow I didn't notice there is no use
> > for the "chunk" field in the rt_node type -- it's only set to zero and
> > copied when growing. What is the purpose? Removing it would allow the
> > smallest node to take up only 32 bytes with a fanout of 3, by eliminating
> > padding.
>
> Oh, I didn't notice that. The chunk field was originally used when
> redirecting the child pointer in the parent node from old to new
> (grown) node. When redirecting the pointer, since the corresponding
> chunk surely exists on the parent we can skip existence checks.
> Currently we use RT_NODE_UPDATE_INNER() for that (see
> RT_REPLACE_NODE()) but having a dedicated function to update the
> existing chunk and child pointer might improve the performance. Or
> reducing the node size by getting rid of the chunk field might be
> better.

I see. IIUC from a brief re-reading of the code, saving that chunk would
only save us from re-loading "parent->shift" from L1 cache and shifting the
key. The cycles spent doing that seem small compared to the rest of the
work involved in growing a node. Expressions like "if (idx < 0) return
false;" return to an asserts-only variable, so in production builds, I
would hope that branch gets elided (I haven't checked).

I'm quite keen on making the smallest node padding-free, (since we don't
yet have path compression or lazy path expansion), and this seems the way
to get there.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: pgsql: Add new GUC createrole_self_grant.

2023-01-10 Thread Robert Haas
On Tue, Jan 10, 2023 at 9:40 PM Tom Lane  wrote:
> Yeah.  I concur that a SUSET GUC isn't much fun for a non-superuser
> CREATEROLE holder who might wish to adjust the default behavior they get.
> I also concur that it seems a bit far-fetched that a CREATEROLE holder
> might create a SECURITY DEFINER function that would do something that
> would be affected by this setting.  Still, we have no field experience
> with how these mechanisms will actually be used, so I'm worried.

All right. I'm not that worried because I think any problems that crop
up probably won't be that bad, primarily due to the extremely
restricted set of circumstances in which the GUC operates -- but
that's a judgement call, and reasonable people can think differently.

> The scenario I'm worried about could be closed, mostly, if we were willing
> to invent an intermediate GUC privilege level "can be set interactively
> but only by CREATEROLE holders" ("PGC_CRSET"?).  But that's an awful lot
> of infrastructure to add for one GUC.  Are there any other GUCs where
> that'd be a more useful choice than any we have now?

I don't quite understand what that would do. If a non-CREATEROLE user
sets the GUC, absolutely nothing happens, because the code that is
controlled by the GUC cannot be reached without CREATEROLE privileges.

Of course, if it's possible for a non-CREATEROLE user to set the value
that a CREATEROLE user experiences, that'd be more of a problem --
though still insufficient to create a security vulnerability in and of
itself -- but if user A can change the GUC settings that user B
experiences, why screw around with this when you could just set
search_path?

To answer your question directly, though, I don't know of any other
setting where that would be a useful level. Up until this morning,
CREATEROLE was not usable for any serious purpose because we've been
shipping something that was broken by design for years, so it's
probably fortunate that not much depends on it.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: mprove tab completion for ALTER EXTENSION ADD/DROP

2023-01-10 Thread Kyotaro Horiguchi
At Mon, 2 Jan 2023 13:19:50 +0530, vignesh C  wrote in 
> On Mon, 5 Dec 2022 at 06:53, Michael Paquier  wrote:
> >
> > The DROP could be matched with the objects that are actually part of
> > the so-said extension?
> 
> The modified v2 version has the changes to handle the same. Sorry for
> the delay as I was working on another project.

It suggests the *kinds* of objects that are part of the extension, but
lists the objects of that kind regardless of dependency.  I read Michael
as suggesting (and I agree) that we restrict the objects (not the kinds)
to those actually part of the extension.
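
For reference, the objects that actually belong to an extension can be
pulled from the catalogs with a query along these lines ('postgis' is just
a placeholder extension name):

SELECT pg_describe_object(classid, objid, objsubid)
FROM pg_depend
WHERE refclassid = 'pg_extension'::regclass
  AND refobjid = (SELECT oid FROM pg_extension WHERE extname = 'postgis')
  AND deptype = 'e';

so the completion query could presumably filter on something like that.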

However, I'm not sure it is useful to restrict object kinds, since the
operator already knows what to drop. If you still want to do that, the
use of completion_dont_quote looks ugly since the function
(requote_identifier) assumes an identifier as input.  I haven't looked
closer, but maybe we need another way to do that.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Using WaitEventSet in the postmaster

2023-01-10 Thread Thomas Munro
On Sun, Jan 8, 2023 at 11:55 AM Andres Freund  wrote:
> On 2023-01-07 18:08:11 +1300, Thomas Munro wrote:
> > On Sat, Jan 7, 2023 at 12:25 PM Andres Freund  wrote:
> > > On 2023-01-07 11:08:36 +1300, Thomas Munro wrote:
> > > > 3.  Is it OK to clobber the shared pending flag for SIGQUIT, SIGTERM,
> > > > SIGINT?  If you send all of these extremely rapidly, it's
> > > > indeterminate which one will be seen by handle_shutdown_request().
> > >
> > > That doesn't seem optimal. I'm mostly worried that we can end up 
> > > downgrading a
> > > shutdown request.
> >
> > I was contemplating whether I needed to do some more push-ups to
> > prefer the first delivered signal (instead of the last), but you're
> > saying that it would be enough to prefer the fastest shutdown type, in
> > cases where more than one signal was handled between server loops.
> > WFM.
>
> I don't see any need for such an order requirement - in case of receiving a
> "less severe" shutdown request first, we'd process the more severe one soon
> after. There's nothing to be gained by trying to follow the order of the
> incoming signals.

Oh, I fully agree.  I was working through the realisation that I might
need to serialise the handlers to implement the priority logic
correctly (upgrades good, downgrades bad), but your suggestion
fast-forwards to the right answer and doesn't require blocking, so I
prefer it, and had already gone that way in v9.  In this version I've
added a comment to explain that the outcome is the same in the end,
and also fixed the flag clearing logic which was subtly wrong before.

> > > I wonder if it'd be good to have a _pm_ in the name.
> >
> > I dunno about this one, it's all static stuff in a file called
> > postmaster.c and one (now) already has pm in it (see below).
>
> I guess stuff like signal handlers and their state somehow seems more global
> to me than their C linkage type suggests. Hence the desire to be a bit more
> "namespaced" in their naming. I do find it somewhat annoying when reasonably
> important global variables aren't uniquely named when using a debugger...

Alright, renamed.

> A few more code review comments:
>
> DetermineSleepTime() still deals with struct timeval, which we maintain at
> some effort. Just to then convert it away from struct timeval in the
> WaitEventSetWait() call. That doesn't seem quite right, and is basically
> introduced in this patch.

I agree, but I was trying to minimise the patch: signals and events
stuff is a lot already.  I didn't want to touch DetermineSleepTime()'s
time logic in the same commit.  But here's a separate patch for that.

> I think ServerLoop still has an outdated comment:
>
>  *
>  * NB: Needs to be called with signals blocked

Fixed.

> > + /* Process work scheduled by signal handlers. 
> > */
>
> Very minor: It feels a tad off to say that the work was scheduled by signal
> handlers, it's either from other processes or by the OS. But ...

OK, now it's "requested via signal handlers".

> > +/*
> > + * Child processes use SIGUSR1 to send 'pmsignals'.  pg_ctl uses SIGUSR1 
> > to ask
> > + * postmaster to check for logrotate and promote files.
> > + */
>
> s/send/notify us of/, since the concrete "pmsignal" is actually transported
> outside of the "OS signal" level?

Fixed.

> LGTM.

Thanks.  Here's v10.  I'll wait a bit longer to see if anyone else has feedback.
From 01bd9b3ef6029d5e5d568b514874a23a167d826d Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Wed, 9 Nov 2022 22:59:58 +1300
Subject: [PATCH v10 1/2] Give the postmaster a WaitEventSet and a latch.

Switch to an architecture similar to regular backends, where signal
handlers just set flags instead of doing real work.

Changes:

 * Allow the postmaster to set up its own local latch.  For now, we don't
   want other backends setting a shared memory latch directly (but that
   could be made to work with more research on robustness).

 * The existing signal handlers are cut in two: a handle_pm_XXX part that
   sets a pending_pm_XXX variable and sets the local latch, and a
   process_pm_XXX part.

 * Signal handlers are now installed with the regular pqsignal()
   function rather than the special pqsignal_pm() function; the concerns
   about the portability of SA_RESTART vs select() are no longer
   relevant, as we are not using select() or relying on EINTR.

Reviewed-by: Andres Freund 
Discussion: https://postgr.es/m/CA%2BhUKG%2BZ-HpOj1JsO9eWUP%2Bar7npSVinsC_npxSy%2BjdOMsx%3DGg%40mail.gmail.com
---
 src/backend/libpq/pqcomm.c|   3 +-
 src/backend/libpq/pqsignal.c  |  40 ---
 src/backend/postmaster/fork_process.c |  18 +-
 src/backend/postmaster/postmaster.c   | 412 +++---
 src/backend/tcop/postgres.c   |   1 -
 src/backend/utils/init/miscinit.c |  13 +-
 src/include/libpq/pqsignal.h  |   3 -
 src/include/miscadmin.h   |   1 +
 8 files changed, 264 insertions(+), 227 deletions(-)

diff --git 

Re: Add SHELL_EXIT_CODE to psql

2023-01-10 Thread vignesh C
On Tue, 10 Jan 2023 at 00:06, Corey Huinker  wrote:
>
> On Mon, Jan 9, 2023 at 10:01 AM Maxim Orlov  wrote:
>>
>> Hi!
>>
>> In overall, I think we move in the right direction. But we could make code 
>> better, should we?
>>
>> +   /* Capture exit code for SHELL_EXIT_CODE */
>> +   close_exit_code = pclose(fd);
>> +   if (close_exit_code == -1)
>> +   {
>> +   pg_log_error("%s: %m", cmd);
>> +   error = true;
>> +   }
>> +   if (WIFEXITED(close_exit_code))
>> +   exit_code=WEXITSTATUS(close_exit_code);
>> +   else if(WIFSIGNALED(close_exit_code))
>> +   exit_code=WTERMSIG(close_exit_code);
>> +   else if(WIFSTOPPED(close_exit_code))
>> +   exit_code=WSTOPSIG(close_exit_code);
>> +   if (exit_code)
>> +   error = true;
>> I think, it's better to add spaces around middle if block. It will be easy 
>> to read.
>> Also, consider, adding spaces around assignment in this block.
>
>
> Noted and will implement.
>
>>
>> +   /*
>> +   snprintf(exit_code_buf, sizeof(exit_code_buf), "%d", 
>> WEXITSTATUS(exit_code));
>> +   */
>> Probably, this is not needed.
>
>
> Heh. Oops.
>
>>
>> > 1. pg_regress now creates an environment variable called PG_OS_TARGET
>> Maybe, we can use env "OS"? I do not know much about Windows, but I think 
>> this is kind of standard environment variable there.
>
>
> I chose a name that would avoid collisions with anything a user might
> potentially throw into their environment, so the fact that the var "OS" is
> fairly standard is a reason to avoid using it. Also, going with our own env
> var allows us to stay in perfect synchronization with the build's #ifdef
> WIN32 ... and whatever #ifdefs may come in the future for new OSes. If there
> is already an environment variable that does that for us, I would rather use
> that, but I haven't found it.
>
> The 0001 patch is unchanged from last time (aside from anything rebasing 
> might have done).

The patch does not apply on top of HEAD as in [1], please post a rebased patch:

=== Applying patches on top of PostgreSQL commit ID
3c6fc58209f24b959ee18f5d19ef96403d08f15c ===
=== applying patch
./v4-0002-Add-psql-variables-SHELL_ERROR-and-SHELL_EXIT_COD.patch
patching file doc/src/sgml/ref/psql-ref.sgml
Hunk #1 FAILED at 4264.
1 out of 1 hunk FAILED -- saving rejects to file
doc/src/sgml/ref/psql-ref.sgml.rej

[1] - http://cfbot.cputube.org/patch_41_4073.log

Regards,
Vignesh




Re: Allow tailoring of ICU locales with custom rules

2023-01-10 Thread vignesh C
On Thu, 5 Jan 2023 at 20:45, Peter Eisentraut
 wrote:
>
> Patch needed a rebase; no functionality changes.

The patch does not apply on top of HEAD as in [1], please post a rebased patch:

=== Applying patches on top of PostgreSQL commit ID
d952373a987bad331c0e499463159dd142ced1ef ===
=== applying patch
./v2-0001-Allow-tailoring-of-ICU-locales-with-custom-rules.patch
patching file doc/src/sgml/catalogs.sgml
patching file doc/src/sgml/ref/create_collation.sgml
patching file doc/src/sgml/ref/create_database.sgml
Hunk #1 FAILED at 192.
1 out of 1 hunk FAILED -- saving rejects to file
doc/src/sgml/ref/create_database.sgml.rej

[1] - http://cfbot.cputube.org/patch_41_4075.log

Regards,
Vignesh




Re: Add proper planner support for ORDER BY / DISTINCT aggregates

2023-01-10 Thread Richard Guo
On Tue, Jan 10, 2023 at 6:12 PM Dean Rasheed 
wrote:

> While doing some random testing, I noticed that the following is broken in
> HEAD:
>
> SELECT COUNT(DISTINCT random()) FROM generate_series(1,10);
>
> ERROR:  ORDER/GROUP BY expression not found in targetlist
>
> It appears to have been broken by 1349d279, though I haven't looked at
> the details.


Yeah, this is definitely broken.  For this query, we try to sort the
scan/join path by random() before performing the Aggregate, which is an
optimization implemented in 1349d2790b.  However the scan/join plan's
tlist does not contain random(), which I think we need to fix.

Thanks
Richard


Re: pgsql: Add new GUC createrole_self_grant.

2023-01-10 Thread Tom Lane
Robert Haas  writes:
> On Tue, Jan 10, 2023 at 8:47 PM Tom Lane  wrote:
>> [ squint ... ]  Are you sure it's not a security *hazard*, though?

> I think you have to squint pretty hard to find a security hazard here.

Maybe, but I'd be sad if somebody manages to find one after this is
out in the wild.

> That said, in my original design, this was controlled via a different
> mechanism which was superuser-only. I was informed that made no sense,
> so I changed it. Now here we are.

Yeah.  I concur that a SUSET GUC isn't much fun for a non-superuser
CREATEROLE holder who might wish to adjust the default behavior they get.
I also concur that it seems a bit far-fetched that a CREATEROLE holder
might create a SECURITY DEFINER function that would do something that
would be affected by this setting.  Still, we have no field experience
with how these mechanisms will actually be used, so I'm worried.

The scenario I'm worried about could be closed, mostly, if we were willing
to invent an intermediate GUC privilege level "can be set interactively
but only by CREATEROLE holders" ("PGC_CRSET"?).  But that's an awful lot
of infrastructure to add for one GUC.  Are there any other GUCs where
that'd be a more useful choice than any we have now?

regards, tom lane




Re: [PATCH] support tab-completion for single quote input with equal sign

2023-01-10 Thread Tom Lane
I wrote:
> I've spent some effort previously on getting tab-completion to deal
> sanely with single-quoted strings, but everything I've tried has
> crashed and burned :-(, mainly because it's not clear when to take
> the whole literal as one "word" and when not.

After a little further thought, a new idea occurred to me: maybe
we could push some of the problem down into the Matches functions.
Consider inventing a couple of new match primitives:

* MatchLiteral matches one or more parse tokens that form a single
complete, valid SQL literal string (either single-quoted or dollar
style).  Use it like

else if (Matches("CREATE", "SUBSCRIPTION", MatchAny, "CONNECTION", 
MatchLiteral))
 COMPLETE_WITH("PUBLICATION");

I think there's no question that most Matches calls that might subsume
a quoted literal would prefer to treat the literal as a single word,
and this'd let them do that correctly.

* MatchLiteralBegin matches the opening of a literal string (either '
or $...$).  Handwaving freely, we might do

else if (Matches("CREATE", "SUBSCRIPTION", MatchAny, "CONNECTION", 
MatchLiteralBegin))
 COMPLETE_WITH(List_of_connection_keywords);

This part of the idea still needs some thought, because it remains
unclear how we might offer completion for connection keywords
after the first one.

Implementing these primitives might be a little tricky too.
If memory serves, readline and libedit have different behaviors
around quote marks.  But at least it seems like a framework
that could solve a number of problems, if we can make it go.

regards, tom lane




Re: pgsql: Add new GUC createrole_self_grant.

2023-01-10 Thread Robert Haas
On Tue, Jan 10, 2023 at 8:47 PM Tom Lane  wrote:
> [ squint ... ]  Are you sure it's not a security *hazard*, though?

I think you have to squint pretty hard to find a security hazard here.
The effect of this GUC is to control the set of privileges that a
CREATEROLE user automatically grants to themselves. But granting
yourself privileges does not normally lead to any sort of security
privilege. It's not completely impossible, I believe. For example,
suppose that I, as a CREATEROLE user who is not a superuser, execute
"CREATE ROLE shifty". I then set up a schema that shifty can access
and I cannot. I then put that schema into my search_path despite the
fact that I haven't given myself access to it. Now, depending on the
value of this setting, I might either implicitly inherit shifty's
privileges, or I might not. So, if I was expecting that I wouldn't,
and I do, then I have now created a situation where if I do more dumb
things I could execute some shifty code that lets that shifty user
take over my account.

But, you know, if I'm that dumb, I could also hit myself in the head
with a hammer and the shifty guy could use the fact that I'm
unconscious to fish the sticky note out of my wallet where,
presumably, I keep my database password.

The bigger point here, I think, is that this GUC only controls default
privileges -- and we already have a system for default privileges that
allows any user to give away privileges on virtually any object that
they create to anyone. Nothing about that system is superuser-only.
This system is far more restricted in its scope. It only allows you to
give privileges to yourself, not anyone else, and only if you're a
CREATEROLE user who is not a SUPERUSER. It seems a bit crazy to say
that it's not a hazard for Alice to automatically grant every
permission in the book to Emil every time she creates a table or
schema or type or sequence or a function, but it is a hazard if Bob
can grant INHERIT and SET to himself on roles that he creates.
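
A rough illustration of the two mechanisms being compared (the role, schema,
and setting values below are invented, purely for illustration):

-- the existing default-privileges machinery, available to any user:
ALTER DEFAULT PRIVILEGES IN SCHEMA app GRANT ALL ON TABLES TO emil;

-- versus the much narrower behavior controlled by the new GUC, which only
-- matters for a non-superuser CREATEROLE role creating other roles:
SET createrole_self_grant = 'set, inherit';
CREATE ROLE bob;  -- the creator implicitly grants itself SET/INHERIT on bob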

That said, in my original design, this was controlled via a different
mechanism which was superuser-only. I was informed that made no sense,
so I changed it. Now here we are.

--
Robert Haas
EDB: http://www.enterprisedb.com




Re: releasing ParallelApplyTxnHash when pa_launch_parallel_worker returns NULL

2023-01-10 Thread Ted Yu
On Tue, Jan 10, 2023 at 6:13 PM houzj.f...@fujitsu.com <
houzj.f...@fujitsu.com> wrote:

> On Wednesday, January 11, 2023 1:25 AM Ted Yu  wrote:
>
> > I was reading src/backend/replication/logical/applyparallelworker.c .
> > In `pa_allocate_worker`, when pa_launch_parallel_worker returns NULL, I
> > think the `ParallelApplyTxnHash` should be released.
>
> Thanks for reporting.
>
> ParallelApplyTxnHash is used to cache the state of streaming transactions
> being applied. There could be multiple streaming transactions being applied
> in parallel and their information was already saved in ParallelApplyTxnHash,
> so we should not release them just because we don't have a worker available
> to handle a new transaction here.
>
> Best Regards,
> Hou zj
>
Hi,

/* First time through, initialize parallel apply worker state
hashtable. */
if (!ParallelApplyTxnHash)

I think it would be better if `ParallelApplyTxnHash` is created by the
first successful parallel apply worker.

Please take a look at the new patch and see if it makes sense.

Cheers


create-parallel-apply-txn-hash-after-the-first-worker.patch
Description: Binary data


Re: Handle infinite recursion in logical replication setup

2023-01-10 Thread Amit Kapila
On Tue, Jan 10, 2023 at 11:24 PM Jonathan S. Katz  wrote:
>
> On 1/10/23 10:17 AM, Amit Kapila wrote:
> > On Tue, Jan 10, 2023 at 8:13 AM Jonathan S. Katz  
> > wrote:
>
> > One can use local or higher
> > for reducing the latency for COMMIT when synchronous replication is
> > used in the publisher. Won't using 'local' while creating subscription
> > would suffice the need to consistently replicate the data? I mean it
> > is equivalent to somebody using levels greater than local in case of
> > physical replication. I think in the case of physical replication, we
> > won't wait for standby to replicate to another node before sending a
> > response, so why to wait in the case of logical replication? If this
> > understanding is correct, then probably it is sufficient to support
> > 'local' for a subscription.
>
> I think this is a good explanation. We should incorporate some version
> of this into the docs, and add some clarity around the use of
> `synchronous_commit` option in `CREATE SUBSCRIPTION` in particular with
> the origin feature. I can make an attempt at this.
>

Okay, thanks!

> Perhaps another question: based on this, should we disallow setting
> values of `synchronous_commit` greater than `local` when using
> "origin=none"?
>

I think it should be done irrespective of the value of origin because
it can create a deadlock in those cases as well. One idea, as you
suggest, is to block levels higher than 'local', and the other is to
make it behave like 'local' even if the level is higher, which would be
somewhat similar to our physical replication.
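
For context, the kind of setup being discussed is roughly the following
(server, connection string, and publication names are made up):

CREATE SUBSCRIPTION sub_node2
    CONNECTION 'host=node1 dbname=postgres'
    PUBLICATION pub_node1
    WITH (origin = none, synchronous_commit = 'local');

i.e. 'local' would remain allowed, while anything stronger would either be
rejected or silently treated as 'local', depending on which of the two
approaches we pick.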

> That might be too strict, but maybe we should warn around
> the risk of deadlock either in the logs or in the docs.
>

Yeah, we can mention it in docs as well. We can probably document as
part of the docs work for this thread but it would be better to start
a separate thread to fix this behavior by either of the above two
approaches.

-- 
With Regards,
Amit Kapila.




Re: [PATCH] Add overlaps geometric operators that ignore point overlaps

2023-01-10 Thread Kyotaro Horiguchi
Hello.

At Sun, 1 Jan 2023 01:13:24 +0530, Ankit Kumar Pandey  
wrote in 
> This is patch for todo item: Add overlaps geometric operators that
> ignore point overlaps
> 
> Issue:
> 
> SELECT circle '((0,0), 1)' && circle '((2,0),1) returns True
> 
> Expectation: In above case, both figures touch other but do not
> overlap (i.e. touching != overlap). Hence, it should return false.

This may be slightly off from the common definition in other geometric
processing systems, but it is the established behavior of PostgreSQL,
which should already have users.

About the behavior itself: since it seems to me that the words "touch"
and "overlap" have no rigorous mathematical definitions, it depends
on the definition. The following discussion would be mere word play..

If circle ((0,0),1) means a circumference, i.e. a set of points
described as "x^2 + y^2 = 1" (or it may be a disc containing the area
inside (<=) here) and "overlap" means "share at least a point", the
two circles are overlapping. This seems to be our current standpoint
and what is expressed in the doc.

If it meant the area exclusively inside the outline (i.e. x^2 + y^2 <
1), the two circles could be said to be touching but not overlapping.  Or,
if circle is defined as "(<)= 1" but "overlap" meant "share at least
an area", they could be said to be touching but not overlapping? (I'm not
sure about the border between a point and an area here, and the
distinction would be connected with the annoying EPSILON..)  The same
discussion holds for boxes or other shapes.
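
For instance, as far as I can tell boxes behave the same way today: sharing
only a corner point already counts as overlapping, e.g.

SELECT box '((0,0),(1,1))' && box '((1,1),(2,2))';  -- currently returns true

so a new "strict" overlap operator would presumably need variants for those
types as well.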

> Now, as per as discussion
> (https://www.postgresql.org/message-id/20100322175532.GG26428%40fetter.org)
> and corresponding change in docs,
> https://www.postgresql.org/docs/15/functions-geometry.html, it
> mentions
> 
> `Do these objects overlap? (One point in common makes this true.)
> `. Does this means current behavior is correct? Or do we still need
> the proposed change (if so, with proper updates in docs)?
> 
> If current behavior is correct, this todo item might need some update
> (unless I missed anything) otherwise any suggestion is welcomed.

I read the todo description as meaning we may want *another set* of
operators to do that, not to change the current behavior of the existing
operators.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




RE: releasing ParallelApplyTxnHash when pa_launch_parallel_worker returns NULL

2023-01-10 Thread houzj.f...@fujitsu.com
On Wednesday, January 11, 2023 1:25 AM Ted Yu  wrote:

> I was reading src/backend/replication/logical/applyparallelworker.c .
> In `pa_allocate_worker`, when pa_launch_parallel_worker returns NULL, I think 
> the `ParallelApplyTxnHash` should be released.

Thanks for reporting.

ParallelApplyTxnHash is used to cache the state of streaming transactions being
applied. There could be multiple streaming transactions being applied in
parallel and their information was already saved in ParallelApplyTxnHash, so
we should not release them just because we don't have a worker available to
handle a new transaction here.

Best Regards,
Hou zj


Spinlock is missing when updating two_phase of ReplicationSlot

2023-01-10 Thread Masahiko Sawada
Hi,

I realized that in CreateDecodingContext() function, we update both
slot->data.two_phase and two_phase_at without acquiring the spinlock:

/* Mark slot to allow two_phase decoding if not already marked */
if (ctx->twophase && !slot->data.two_phase)
{
slot->data.two_phase = true;
slot->data.two_phase_at = start_lsn;
ReplicationSlotMarkDirty();
ReplicationSlotSave();
SnapBuildSetTwoPhaseAt(ctx->snapshot_builder, start_lsn);
}

I think we should acquire the spinlock when updating fields of the
replication slot even by its owner. Otherwise readers could see
inconsistent results. Looking at another place where we update
two_phase_at, we acquire the spinlock:

SpinLockAcquire(&slot->mutex);
slot->data.confirmed_flush = ctx->reader->EndRecPtr;
if (slot->data.two_phase)
slot->data.two_phase_at = ctx->reader->EndRecPtr;
SpinLockRelease(&slot->mutex);

It seems to me an oversight of commit a8fd13cab0b. I've attached the
small patch to fix it.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


fix_spinlock.patch
Description: Binary data


Re: Fixing a couple of buglets in how VACUUM sets visibility map bits

2023-01-10 Thread Peter Geoghegan
On Tue, Jan 10, 2023 at 4:39 PM Peter Geoghegan  wrote:
> * Run VACUUM FREEZE. We need FREEZE in order to be able to hit the
> relevant visibilitymap_set() call site (the one that passes
> VISIBILITYMAP_ALL_FROZEN as its flags, without also passing
> VISIBILITYMAP_ALL_VISIBLE).
>
> Now all_visible_according_to_vm is set to true, but we don't have a
> lock/pin on the same heap page just yet.
>
> * Acquire several non-conflicting row locks on a row on the block in
> question, so that a new multi is allocated.

Forgot to mention that there needs to be a HOT update mixed in with
these SELECT ... FOR SHARE row lockers, too, which must abort once its
XID has been added to a multi. Obviously heap_lock_tuple() won't ever
unset VISIBILITYMAP_ALL_VISIBLE or PD_ALL_VISIBLE (it only ever clears
VISIBILITYMAP_ALL_FROZEN) -- so we need a heap_update() to clear all
of these status bits.

This enables the assertion to fail because:

* Pruning can get rid of the aborted successor heap-only tuple right
away, so it is not going to block us from setting the page all_visible
(that just leaves the original tuple to consider).

* The original tuple's xmax is a Multi, so it won't automatically be
ineligible for freezing because it's > OldestXmin in this scenario.

* FreezeMultiXactId() processing will completely remove xmax, without
caring too much about cutoffs like OldestXmin -- it only cares about
whether each individual XID needs to be kept or not.

(Granted, FreezeMultiXactId() will only remove xmax like this because
I deliberately removed its FRM_NOOP handling, but that is a very
delicate thing to rely on, especially from such a great distance. I
can't imagine that it doesn't fail on HEAD for any reason beyond sheer
luck.)

--
Peter Geoghegan




Re: Infinite Interval

2023-01-10 Thread Joseph Koshakow
On Sun, Jan 8, 2023 at 11:17 PM jian he  wrote:
>
>
>
> On Sun, Jan 8, 2023 at 4:22 AM Joseph Koshakow  wrote:
>>
>> On Sat, Jan 7, 2023 at 3:05 PM Joseph Koshakow  wrote:
>> >
>> > On Sat, Jan 7, 2023 at 3:04 PM Joseph Koshakow  wrote:
>> > >
>> > > I think this patch is just about ready for review, except for the
>> > > following two questions:
>> > >   1. Should finite checks on intervals only look at months or all three
>> > >   fields?
>> > >   2. Should we make the error messages for adding/subtracting infinite
>> > >   values more generic or leave them as is?
>> > >
>> > > My opinions are
>> > >   1. We should only look at months.
>> > >   2. We should make the errors more generic.
>> > >
>> > > Anyone else have any thoughts?
>>
>> Here's a patch with the more generic error messages.
>>
>> - Joe
>
>
> HI.
>
> I just found out another problem.
>
> select * from  generate_series(timestamp'-infinity', timestamp 'infinity', 
> interval 'infinity');
> ERROR:  timestamp out of range
>
> select * from  generate_series(timestamp'-infinity',timestamp 'infinity', 
> interval '-infinity'); --return following
>
>  generate_series
> -
> (0 rows)
>
>
> select * from generate_series(timestamp 'infinity',timestamp 'infinity', 
> interval 'infinity');
> --will run all the time.
>
> select * from  generate_series(timestamp 'infinity',timestamp 'infinity', 
> interval '-infinity');
> ERROR:  timestamp out of range
>
>  select * from  generate_series(timestamp'-infinity',timestamp'-infinity', 
> interval 'infinity');
> ERROR:  timestamp out of range
>
> select * from  generate_series(timestamp'-infinity',timestamp'-infinity', 
> interval '-infinity');
> --will run all the time.

Good catch, I didn't think to check non date/time functions.
Unfortunately, I think you may have opened Pandora's box. I went through
pg_proc.dat and found the following functions that all involve
intervals. We should probably investigate all of them and make sure
that they handle infinite intervals properly.

{ oid => '1026', descr => 'adjust timestamp to new time zone',
proname => 'timezone', prorettype => 'timestamp',
proargtypes => 'interval timestamptz', prosrc => 'timestamptz_izone' },

{ oid => '4133', descr => 'window RANGE support',
proname => 'in_range', prorettype => 'bool',
proargtypes => 'date date interval bool bool',
prosrc => 'in_range_date_interval' },

{ oid => '1305', descr => 'intervals overlap?',
proname => 'overlaps', prolang => 'sql', proisstrict => 'f',
provolatile => 's', prorettype => 'bool',
proargtypes => 'timestamptz interval timestamptz interval',
prosrc => 'see system_functions.sql' },
{ oid => '1306', descr => 'intervals overlap?',
proname => 'overlaps', prolang => 'sql', proisstrict => 'f',
provolatile => 's', prorettype => 'bool',
proargtypes => 'timestamptz timestamptz timestamptz interval',
prosrc => 'see system_functions.sql' },
{ oid => '1307', descr => 'intervals overlap?',
proname => 'overlaps', prolang => 'sql', proisstrict => 'f',
provolatile => 's', prorettype => 'bool',
proargtypes => 'timestamptz interval timestamptz timestamptz',
prosrc => 'see system_functions.sql' },

{ oid => '1308', descr => 'intervals overlap?',
proname => 'overlaps', proisstrict => 'f', prorettype => 'bool',
proargtypes => 'time time time time', prosrc => 'overlaps_time' },
{ oid => '1309', descr => 'intervals overlap?',
proname => 'overlaps', prolang => 'sql', proisstrict => 'f',
prorettype => 'bool', proargtypes => 'time interval time interval',
prosrc => 'see system_functions.sql' },
{ oid => '1310', descr => 'intervals overlap?',
proname => 'overlaps', prolang => 'sql', proisstrict => 'f',
prorettype => 'bool', proargtypes => 'time time time interval',
prosrc => 'see system_functions.sql' },
{ oid => '1311', descr => 'intervals overlap?',
proname => 'overlaps', prolang => 'sql', proisstrict => 'f',
prorettype => 'bool', proargtypes => 'time interval time time',
prosrc => 'see system_functions.sql' },

{ oid => '1386',
descr => 'date difference from today preserving months and years',
proname => 'age', prolang => 'sql', provolatile => 's',
prorettype => 'interval', proargtypes => 'timestamptz',
prosrc => 'see system_functions.sql' },

{ oid => '2042', descr => 'intervals overlap?',
proname => 'overlaps', prolang => 'sql', proisstrict => 'f',
prorettype => 'bool', proargtypes => 'timestamp interval timestamp interval',
prosrc => 'see system_functions.sql' },
{ oid => '2043', descr => 'intervals overlap?',
proname => 'overlaps', prolang => 'sql', proisstrict => 'f',
prorettype => 'bool', proargtypes => 'timestamp timestamp timestamp interval',
prosrc => 'see system_functions.sql' },
{ oid => '2044', descr => 'intervals overlap?',
proname => 'overlaps', prolang => 

Re: Allow +group in pg_ident.conf

2023-01-10 Thread Michael Paquier
On Tue, Jan 10, 2023 at 09:42:19AM -0500, Andrew Dunstan wrote:
> Ok, that sounds reasonable, but the cfbot doesn't like patches that
> depend on other patches in a different email. Maybe you can roll this up
> as an extra patch in your next version? It's pretty small.

This can go two ways if both of you agree, by sending an updated patch
on this thread based on the other one.  And actually, Jelte's patch
has less C code than this thread's proposal, even though it lacks
tests.
--
Michael


signature.asc
Description: PGP signature


delay starting WAL receiver

2023-01-10 Thread Nathan Bossart
I discussed this a bit in a different thread [0], but I thought it deserved
its own thread.

After setting wal_retrieve_retry_interval to 1ms in the tests, I noticed
that the recovery tests consistently take much longer.  Upon further
inspection, it looks like a similar race condition to the one described in
e5d494d's commit message.  With some added debug logs, I see that all of
the callers of MaybeStartWalReceiver() complete before SIGCHLD is
processed, so ServerLoop() waits for a minute before starting the WAL
receiver.

The attached patch fixes this by adjusting DetermineSleepTime() to limit
the sleep to at most 100ms when WalReceiverRequested is set, similar to how
the sleep is limited when background workers must be restarted.

[0] https://postgr.es/m/20221215224721.GA694065%40nathanxps13

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From 6f32238f119236322dfb16262a88c3a9c5141413 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Thu, 15 Dec 2022 14:11:43 -0800
Subject: [PATCH v1 1/1] handle race condition when restarting wal receivers

---
 src/backend/postmaster/postmaster.c | 28 ++--
 1 file changed, 22 insertions(+), 6 deletions(-)

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index eac3450774..9d6f58e0b3 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -1601,9 +1601,9 @@ checkControlFile(void)
  *
  * In normal conditions we wait at most one minute, to ensure that the other
  * background tasks handled by ServerLoop get done even when no requests are
- * arriving.  However, if there are background workers waiting to be started,
- * we don't actually sleep so that they are quickly serviced.  Other exception
- * cases are as shown in the code.
+ * arriving.  However, if there are background workers or a WAL receiver
+ * waiting to be started, we make sure they are quickly serviced.  Other
+ * exception cases are as shown in the code.
  */
 static void
 DetermineSleepTime(struct timeval *timeout)
@@ -1611,11 +1611,12 @@ DetermineSleepTime(struct timeval *timeout)
 	TimestampTz next_wakeup = 0;
 
 	/*
-	 * Normal case: either there are no background workers at all, or we're in
-	 * a shutdown sequence (during which we ignore bgworkers altogether).
+	 * Normal case: either there are no background workers and no WAL receiver
+	 * at all, or we're in a shutdown sequence (during which we ignore
+	 * bgworkers and the WAL receiver altogether).
 	 */
 	if (Shutdown > NoShutdown ||
-		(!StartWorkerNeeded && !HaveCrashedWorker))
+		(!StartWorkerNeeded && !HaveCrashedWorker && !WalReceiverRequested))
 	{
 		if (AbortStartTime != 0)
 		{
@@ -1674,6 +1675,21 @@ DetermineSleepTime(struct timeval *timeout)
 		}
 	}
 
+	/*
+	 * If WalReceiverRequested is set, we're probably waiting on a SIGCHLD to
+	 * arrive to clear WalReceiverPID before starting the new WAL receiver.  We
+	 * don't expect that to take long, so limit the sleep to 100ms so that we
+	 * start the new WAL receiver promptly.
+	 */
+	if (WalReceiverRequested)
+	{
+		TimestampTz walrcv_wakeup;
+
+		walrcv_wakeup = TimestampTzPlusMilliseconds(GetCurrentTimestamp(), 100);
+		if (next_wakeup == 0 || walrcv_wakeup < next_wakeup)
+			next_wakeup = walrcv_wakeup;
+	}
+
 	if (next_wakeup != 0)
 	{
 		long		secs;
-- 
2.25.1



Re: Strengthen pg_waldump's --save-fullpage tests

2023-01-10 Thread Michael Paquier
On Tue, Jan 10, 2023 at 05:25:44PM +0100, Drouvot, Bertrand wrote:
> I like the idea of comparing the full page (and not just the LSN) but
> I'm not sure that adding the pageinspect dependency is a good thing.
> 
> What about extracting the block directly from the relation file and
> comparing it with the one extracted from the WAL? (We'd need to skip the
> first 8 bytes to skip the LSN though).

Byte-by-byte counting for the page hole?  The page checksum would
matter as well, FWIW, as it is not set in WAL and a FPW logged in WAL
means that the page got modified.  It means that it could have been
flushed, updating its pd_lsn and its pd_checksum on the way.
--
Michael


signature.asc
Description: PGP signature


Re: [PATCH] support tab-completion for single quote input with equal sign

2023-01-10 Thread Tom Lane
torikoshia  writes:
> I updated the patch going along with the v3 direction.

I think this adds about as many failure modes as it removes,
if not more.

* The connection string doesn't necessarily end with "'"; it could
be a dollar-quoted string.

* If it is a dollar-quoted string, there could be "'" marks internal
to it, allowing PUBLICATION to be falsely offered when we're really
still within the connection string.

* The following WITH options could contain "'", allowing PUBLICATION
to be falsely offered within that clause.

I've spent some effort previously on getting tab-completion to deal
sanely with single-quoted strings, but everything I've tried has
crashed and burned :-(, mainly because it's not clear when to take
the whole literal as one "word" and when not.  A relevant example
here is that somebody might wish that we could tab-complete within
the connection string, e.g. that

CREATE SUBSCRIPTION sub CONNECTION 'db

would complete with "name=".  We have the info available from libpq
to do that, if only we could figure out when to apply it.  I think
we need some pretty fundamental design work to figure out what we
want to do in this area, and that in the meantime putting band-aids
on specific cases is probably not very productive.

regards, tom lane




Re: Todo: Teach planner to evaluate multiple windows in the optimal order

2023-01-10 Thread David Rowley
On Tue, 10 Jan 2023 at 18:36, Tom Lane  wrote:
>
> David Rowley  writes:
> > Ideally, our sort costing would just be better, but I think that
> > raises the bar a little too high to start thinking of making
> > improvements to that for this patch.
>
> It's trickier than it looks, cf f4c7c410e.  But if you just want
> to add a small correction based on number of columns being sorted
> by, that seems within reach.  See the comment for cost_sort though.
> Also, I suppose for incremental sorts we'd want to consider only
> the number of newly-sorted columns, but I'm not sure if that info
> is readily at hand either.

Yeah, I had exactly that in mind when I mentioned about setting the
bar higher. It seems like a worthy enough goal to improve the sort
costs separately from this work. I'm starting to consider if we might
need to revisit cost_sort() anyway. There's been quite a number of
performance improvements made to sort in the past few years and I
don't recall if anything has been done to check if the sort costs are
still realistic. I'm aware that it's a difficult problem as the number
of comparisons is highly dependent on the order of the input rows.

David




Re: Todo: Teach planner to evaluate multiple windows in the optimal order

2023-01-10 Thread David Rowley
On Wed, 11 Jan 2023 at 08:17, Ankit Kumar Pandey  wrote:
>
>
> > On 10/01/23 10:53, David Rowley wrote:
>
> > the total cost is the same for both of these, but the execution time
> > seems to vary quite a bit.
>
> This is really weird; I tried it different ways (to rule out any issues
> due to caching) and execution time varied in spite of having the same cost.
>
> > Maybe we should try and do this for DISTINCT queries if the
> > distinct_pathkeys match the orderby_pathkeys. That seems a little less
> > copout-ish. If the ORDER BY is the same as the DISTINCT then it seems
> > likely that the ORDER BY might opt to use the Unique path for DISTINCT
> > since it'll already have the correct pathkeys.
>
> > However, if the ORDER BY has fewer columns then it might be cheaper to
> > Hash Aggregate and then sort all over again, especially so when the
> > DISTINCT removes a large proportion of the rows.
>
> Aren't the ORDER BY pathkeys always fewer than the DISTINCT pathkeys?

Just thinking about this again, I remembered why I thought DISTINCT
was uninteresting to start with.  The problem is that if the query has
WindowFuncs and also has a DISTINCT clause, then the WindowFunc
results *must* be in the DISTINCT clause and, optionally also in the
ORDER BY clause.  There's no other place to write WindowFuncs IIRC.
Since we cannot pushdown the sort when the more strict version of the
pathkeys have WindowFuncs, then we must still perform the additional
sort if the planner chooses to do a non-hashed DISTINCT.  The aim of
this patch is to reduce the total number of sorts, and I don't think
that's possible in this case as you can't have WindowFuncs in the
ORDER BY when they're not in the DISTINCT clause:

postgres=# select distinct relname from pg_Class order by row_number()
over (order by oid);
ERROR:  for SELECT DISTINCT, ORDER BY expressions must appear in select list
LINE 1: select distinct relname from pg_Class order by row_number() ...

Another type of query which is suboptimal still is when there's a
DISTINCT and WindowClause but no ORDER BY. We'll reorder the DISTINCT
clause so that the leading columns of the ORDER BY come first in
transformDistinctClause(), but we've nothing to do the same for
WindowClauses.  It can't happen around when transformDistinctClause()
is called  as we've yet to decide the WindowClause evaluation order,
so if we were to try to make that better it would maybe have to do in
the upper planner somewhere. It's possible it's too late by that time
to adjust the DISTINCT clause.

Here's an example of it.

# set enable_hashagg=0;
# explain select distinct relname,relkind,count(*) over (partition by
relkind) from pg_Class;
                                 QUERY PLAN
----------------------------------------------------------------------------
 Unique  (cost=61.12..65.24 rows=412 width=73)
   ->  Sort  (cost=61.12..62.15 rows=412 width=73)
 Sort Key: relname, relkind, (count(*) OVER (?))
 ->  WindowAgg  (cost=36.01..43.22 rows=412 width=73)
   ->  Sort  (cost=36.01..37.04 rows=412 width=65)
 Sort Key: relkind
 ->  Seq Scan on pg_class  (cost=0.00..18.12
rows=412 width=65)
(7 rows)

We can simulate the optimisation by swapping the column order in the
targetlist. Note the planner can use Incremental Sort (at least since
3c6fc5820, from about 2 hours ago)

# explain select distinct relkind,relname,count(*) over (partition by
relkind) from pg_Class;
                                 QUERY PLAN
----------------------------------------------------------------------------
 Unique  (cost=41.26..65.32 rows=412 width=73)
   ->  Incremental Sort  (cost=41.26..62.23 rows=412 width=73)
 Sort Key: relkind, relname, (count(*) OVER (?))
 Presorted Key: relkind
 ->  WindowAgg  (cost=36.01..43.22 rows=412 width=73)
   ->  Sort  (cost=36.01..37.04 rows=412 width=65)
 Sort Key: relkind
 ->  Seq Scan on pg_class  (cost=0.00..18.12
rows=412 width=65)
(8 rows)

Not sure if we should be trying to improve that in this patch. I just
wanted to identify it as something else that perhaps could be done.
I'm not really all that sure the above query shape makes much sense in
the real world. Would anyone ever want to use DISTINCT on some results
containing WindowFuncs?

David




Re: Fixing a couple of buglets in how VACUUM sets visibility map bits

2023-01-10 Thread Peter Geoghegan
On Tue, Jan 10, 2023 at 12:08 PM Peter Geoghegan  wrote:
> Actually, FreezeMultiXactId() can fully remove an xmax that has some
> member XIDs >= OldestXmin, provided FRM_NOOP processing isn't
> possible, at least when no individual member is still running. Doesn't
> have to involve transaction aborts at all.
>
> Let me go try to break it that way...

Attached patch shows how this could break.

It adds an assertion that checks that the expected
PD_ALL_VISIBLE/VM_ALL_VISIBLE() conditions hold at the right point. It
also comments out FreezeMultiXactId()'s FRM_NOOP handling.

The FRM_NOOP case is really just an optimization, and shouldn't be
needed for correctness. This is amply demonstrated by running "meston
test" with the patch applied, which will pass without incident.

I can get the PD_ALL_VISIBLE assertion to fail by following this
procedure with the patch applied:

* Run a plain VACUUM to set all the pages from a table all-visible,
but not all-frozen.

* Set a breakpoint that will hit after all_visible_according_to_vm is
set to true, for an interesting blkno.

* Run VACUUM FREEZE. We need FREEZE in order to be able to hit the
relevant visibilitymap_set() call site (the one that passes
VISIBILITYMAP_ALL_FROZEN as its flags, without also passing
VISIBILITYMAP_ALL_VISIBLE).

Now all_visible_according_to_vm is set to true, but we don't have a
lock/pin on the same heap page just yet.

* Acquire several non-conflicting row locks on a row on the block in
question, so that a new multi is allocated.

* End every session whose XID is stored in our multi (commit/abort).

* Within GDB, continue from before -- observe assertion failure.
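
For reference, the SQL side of those steps is roughly this (a sketch; the
table and column names are made up, and the GDB breakpoint on
all_visible_according_to_vm is still required and omitted here):

-- setup: a plain VACUUM leaves the block all-visible but not all-frozen
VACUUM tab;
-- start VACUUM FREEZE and pause it at the breakpoint:
VACUUM (FREEZE) tab;
-- while it is paused, in two other sessions, take non-conflicting row locks
-- on a row of that block so a new MultiXactId is allocated:
--   BEGIN; SELECT * FROM tab WHERE id = 1 FOR KEY SHARE;
-- then COMMIT (or abort) every member session and continue in GDB.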

Obviously this scenario doesn't demonstrate the presence of a bug --
not quite. But it does prove that we rely on FRM_NOOP to not allow the
VM to become corrupt, which just doesn't make any sense, and can't
have been intended. At a minimum, it strongly suggests that the
current approach is very fragile.

-- 
Peter Geoghegan


0001-Debug-freeze-map-issue.patch
Description: Binary data


Re: Use windows VMs instead of windows containers on the CI

2023-01-10 Thread Thomas Munro
On Wed, Jan 11, 2023 at 8:20 AM Andres Freund  wrote:
> On 2023-01-10 09:22:12 -0600, Justin Pryzby wrote:
> > > There is more than 2x speed gain when VMs are used.
> >
> > One consideration is that if windows runs twice as fast, we'll suddenly
> > start using twice as many resources at cirrus/google/amazon - the
> > windows task has been throttling everything else.  Not sure if we should
> > to do anything beyond the limits that cfbot already uses.
>
> I'm not sure we would. cfbot has a time based limit for how often it tries to
> rebuild entries, and I think we were just about keeping up with that. In which
> case we shouldn't, on average, schedule more jobs than we currently
> do. Although peak "job throughput" would be higher.
>
> Thomas?

It currently tries to re-test each patch every 24 hours, but doesn't
achieve that.  It looks like it's currently re-testing every ~30
hours.  Justin's right, we'll consume more non-Windows resources if
Windows speeds up, but not 2x, more like 1.25x when cfbot's own
throttling kicks in.  Or I could change the cycle target to 36 or 48
hours, to spread the work out more.

Back-of-a-napkin maths:

 * there are currently 240 entries in a testable status
 * it takes ~0.5 hours to test (because that's the slow Windows time)
 * therefore it takes ~120 hours to test them all
 * but we can do 4 at a time, so that's ~30 hours to get through them
all and start again
 * that matches what we see:

cfbot=> select created - lag(created) over (order by created) from
branch where submission_id = 4068;
        ?column?
------------------------

 1 day 06:30:00.265047
 1 day 05:43:59.978949
 1 day 04:13:59.754048
 1 day 05:28:59.811916
 1 day 07:00:00.651655
(6 rows)

If, with this change, we can test in only ~0.25 hours, then we'll only
need 60 hours of Cirrus time to test them all.  With a target of
re-testing every 24 hours, it should now only have to run ~2.5 jobs at
all times.  Having free slots would be kind to Cirrus, and also lower
the latency when a new patch is posted (which currently has to wait
for a free slot before it can begin).  Great news.




Can we let extensions change their dumped catalog schemas?

2023-01-10 Thread Jacob Champion
Hi all,

I've been talking to other Timescale devs about a requested change to
pg_dump, and there's been quite a bit of back-and-forth to figure out
what, exactly, we want. Any mistakes here are mine, but I think we've
been able to distill it down to the following request:

We'd like to be allowed to change the schema for a table that's been
marked in the past with pg_extension_config_dump().

Unless I'm missing something obvious (please, let it be that) there's no
way to do this safely. Once you've marked an internal table as dumpable,
its schema is effectively frozen if you want your dumps to work across
versions, because otherwise you'll try to restore that "catalog" data
into a table that has different columns. And while sometimes you can
make that work, it doesn't in the general case.
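
For anyone not familiar with the mechanism, the pattern in question is just
this (a minimal sketch with made-up names, as it would appear in an extension
script):

CREATE TABLE my_ext_catalog (id bigint PRIMARY KEY, settings jsonb);
SELECT pg_catalog.pg_extension_config_dump('my_ext_catalog', '');

Once the table is marked this way, pg_dump emits its contents as data, so a
later extension version that changes my_ext_catalog's columns can no longer
cleanly restore a dump taken against the old definition.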

We (Timescale) do already change the schemas today, but we pay the
associated costs in that dump/restore doesn't work without manual
version bookkeeping and user fiddling -- and in the worst cases, it
appears to "work" across versions but leaves our catalog tables in an
inconsistent state. So the request is to come up with a way to support
this case.

Some options that have been proposed so far:

1) Don't ask for a new feature, and instead try to ensure infinite
backwards compatibility for those tables.

For extension authors who have already done this -- and have likely done
some heavy architectural lifting to make it work -- this is probably the
first thing that will come to mind, and it was the first thing I said,
too.

But the more I say it, the less acceptable it feels. Not even Postgres
is expected to maintain infinite catalog compatibility into the future.
We need to evolve our catalogs, too -- and we already provide the
standard update scripts to perform migrations of those tables, but a
dump/restore doesn't have any way to use them today.

2) Provide a way to record the exact version of an extension in a dump.

Brute-force, but pretty much guaranteed to fix the cross-version
problem, because the dump can't be accidentally restored to an extension
version with a different catalog schema. Users then manually ALTER
EXTENSION ... UPDATE (or we could even include that in the dump itself,
as the final action). Doing this by default would punish extensions that
don't have this problem, so it would have to be opt-in in some way.

It's also unnecessarily strict IMO -- even if we don't have a config
table change in a new version, we'll still require the old extension
version to be available alongside the new version during a restore.
Maybe a tweak on this idea would be to introduce a catversion for
extensions.
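
Concretely, an opt-in dump under option (2) might boil down to something like
this (a sketch only; whether the final UPDATE belongs in the dump itself is
part of the question):

CREATE EXTENSION foo VERSION '4.0';
-- ... restore of the dumped config-table data against the 4.0 schema ...
ALTER EXTENSION foo UPDATE;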

3) Provide a way to record the entire internal state of an extension in
a dump.

Every extension is already expected to handle the case where the
internal state is at version X but the installed extension is at version
X+N, and the update scripts we provide will perform the necessary
migrations. But there's no way to reproduce this case using
dump/restore, because dumping an extension omits its internals.

If a dump could instead include the entire internal state of an
extension, then we'd be guaranteed to reproduce the exact situation that
we already have to support for an in-place upgrade. After a restore, the
SQL is at version X, the installed extension is some equal or later
version, and all that remains is to run the update scripts, either
manually or within the dump itself.

Like (2), I think there's no way you'd all accept this cost for every
extension. It'd have to be opt-in.

--

Hopefully that makes a certain amount of sense. Does it seem like a
reasonable thing to ask?

I'm happy to clarify anything above, and if you know of an obvious
solution I'm missing, I would love to be corrected. :D

Thanks,
--Jacob




Re: [PATCH] Support % wildcard in extension upgrade filenames

2023-01-10 Thread Tom Lane
Sandro Santilli  writes:
> On Mon, Jan 09, 2023 at 05:51:49PM -0500, Tom Lane wrote:
>> ... you still need one script file for each supported upgrade step

> That's exactly the problem we're trying to solve here.
> The include support is nice on itself, but won't solve our problem.

The script-file-per-upgrade-path aspect solves a problem that you
have, whether you admit it or not; I think you simply aren't realizing
that because you have not had to deal with the consequences of
your proposed feature.  Namely that you won't have any control
over what the backend will try to do in terms of upgrade paths.

As an example, suppose that a database has foo 4.0 installed, and
the DBA decides to try to downgrade to 3.0.  With the system as it
stands, if you've provided foo--4.0--3.0.sql then the conversion
will go through, and presumably it will work because you tested that
that script does what it is intended to.  If you haven't provided
any such downgrade script, then ALTER EXTENSION UPDATE will say
"Sorry Dave, I'm afraid I can't do that" and no harm is done.

With the proposed % feature, if foo--%--3.0.sql exists then the
system will invoke it and expect the end result to be a valid
3.0 installation, whether or not the script actually has any
ability to do a downgrade.  Moreover, there isn't any very
good way to detect or prevent unsupported version transitions.
(I suppose you could add code to look at pg_extension.extversion,
but I'm not sure if that works: it looks to me like we update that
before we run the extension script.  Besides which, if you have
to add such code is that really better than having a number of
one-liner scripts implementing the same check declaratively?)

It gets worse though, because above I'm supposing that 4.0 at
least existed when this copy of foo--%--3.0.sql was made.
Suppose that somebody fat-fingered a package upgrade, such that
the extension fileset available to a database containing foo 4.0
now corresponds to foo 3.0, and there's no knowledge of 4.0 at all
in the extension scripts.  The DBA trustingly issues ALTER EXTENSION
UPDATE, which will conclude from foo.control that it should update to
3.0, and invoke foo--%--3.0.sql to do it.  Maybe the odds of success
are higher than zero, but not by much; almost certainly you are
going to end with an extension containing some leftover 4.0
objects, some 3.0 objects, and maybe some objects with properties
that don't exactly match either 3.0 or 4.0.  Even if that state
of affairs manages not to cause immediate problems, it'll surely
be a mess whenever somebody tries to re-update to 4.0 or later.

So I really think this is a case of "be careful what you ask
for, you might get it".  Even if PostGIS is willing to put in
the amount of infrastructure legwork needed to make such a
design bulletproof, I'm quite sure nobody else will manage
to use such a thing successfully.  I'd rather spend our
development effort on a feature that has more than one use-case.

regards, tom lane




Re: verbose mode for pg_input_error_message?

2023-01-10 Thread Nathan Bossart
On Wed, Jan 04, 2023 at 04:18:59PM -0500, Andrew Dunstan wrote:
> On 2023-01-02 Mo 10:44, Tom Lane wrote:
>> I don't think that just concatenating those strings would make for a
>> pleasant API.  More sensible, perhaps, to have a separate function
>> that returns a record.  Or we could redefine the existing function
>> that way, but I suspect that "just the primary error" will be a
>> principal use-case.
>>
>> Being able to get the SQLSTATE is likely to be interesting too.
> 
> OK, here's a patch along those lines.

My vote would be to redefine the existing pg_input_error_message() function
to return a record, but I recognize that this would inflate the patch quite
a bit due to all the existing uses in the tests.  If this is the only
argument against this approach, I'm happy to help with the patch.
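
For what it's worth, a record-returning variant might look like this to
callers (a sketch only; the function name and output columns here are
assumptions, not a settled API):

SELECT * FROM pg_input_error_info('1234.567', 'numeric(7,4)');
-- message | detail | hint | sql_error_code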

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




wal_compression = method:level

2023-01-10 Thread Justin Pryzby
Is it desirable to support specifying a level?

Maybe there's a concern about using high compression levels, but 
I'll start by asking if the feature is wanted at all.
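
For context, with the attached patch the GUC would be set like this (sketch):

SET wal_compression = 'zstd:7';   -- method plus optional level after a colon
SET wal_compression = 'lz4';      -- level omitted, default level used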

Previous discussion at: 20210614012412.ga31...@telsasoft.com
From cb30e17cf19fffa370a887d28d6d7e683d588b71 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Sun, 14 Mar 2021 17:12:07 -0500
Subject: [PATCH] Use GUC hooks to support compression:level..

..which is useful for zstd, but less so for lz4.

TODO:

windows, pglz, zlib case
---
 doc/src/sgml/config.sgml|   6 +-
 src/backend/access/transam/xlog.c   |  20 
 src/backend/access/transam/xloginsert.c |   6 +-
 src/backend/access/transam/xlogreader.c | 124 
 src/backend/utils/misc/guc_tables.c |  39 ++
 src/common/compression.c|   3 -
 src/include/access/xlog.h   |   6 +
 src/include/access/xlogreader.h |   2 +
 src/include/utils/guc_hooks.h   |   3 +
 src/test/regress/expected/compression.out   |  19 +++
 src/test/regress/expected/compression_1.out |  19 +++
 src/test/regress/sql/compression.sql|  10 ++
 12 files changed, 220 insertions(+), 37 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 77574e2d4ec..7b34ed630d5 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3161,7 +3161,7 @@ include_dir 'conf.d'
  
 
  
-  wal_compression (enum)
+  wal_compression (string)
   
wal_compression configuration parameter
   
@@ -3169,7 +3169,7 @@ include_dir 'conf.d'
   

 This parameter enables compression of WAL using the specified
-compression method.
+compression method and optional compression level.
 When enabled, the PostgreSQL
 server compresses full page images written to WAL when
  is on or during a base backup.
@@ -3179,6 +3179,8 @@ include_dir 'conf.d'
 was compiled with --with-lz4) and
 zstd (if PostgreSQL
 was compiled with --with-zstd).
+A compression level may optionally be specified, by appending the level
+number after a colon (:)
 The default value is off.
 Only superusers and users with the appropriate SET
 privilege can change this setting.
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 74bd1f5fbe2..e2d81129549 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -125,6 +125,8 @@ bool		EnableHotStandby = false;
 bool		fullPageWrites = true;
 bool		wal_log_hints = false;
 int			wal_compression = WAL_COMPRESSION_NONE;
+char		*wal_compression_string = "no"; /* Overwritten by GUC */
+int			wal_compression_level = 1;
 char	   *wal_consistency_checking_string = NULL;
 bool	   *wal_consistency_checking = NULL;
 bool		wal_init_zero = true;
@@ -8118,6 +8120,24 @@ assign_xlog_sync_method(int new_sync_method, void *extra)
 	}
 }
 
+bool
+check_wal_compression(char **newval, void **extra, GucSource source)
+{
+	int tmp;
+	if (get_compression_level(*newval, ) != -1)
+	if (get_compression_level(*newval, &tmp) != -1)
+
+	return false;
+}
+
+/* Parse the GUC into integers for wal_compression and wal_compression_level */
+void
+assign_wal_compression(const char *newval, void *extra)
+{
	wal_compression = get_compression_level(newval, &wal_compression_level);
+	Assert(wal_compression >= 0);
+}
+
 
 /*
  * Issue appropriate kind of fsync (if any) for an XLOG output file.
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 47b6d16eaef..93003744ed0 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -906,8 +906,8 @@ XLogCompressBackupBlock(char *page, uint16 hole_offset, uint16 hole_length,
 
 		case WAL_COMPRESSION_LZ4:
 #ifdef USE_LZ4
-			len = LZ4_compress_default(source, dest, orig_len,
-	   COMPRESS_BUFSIZE);
+			len = LZ4_compress_fast(source, dest, orig_len,
+	   COMPRESS_BUFSIZE, wal_compression_level);
 			if (len <= 0)
 len = -1;		/* failure */
 #else
@@ -918,7 +918,7 @@ XLogCompressBackupBlock(char *page, uint16 hole_offset, uint16 hole_length,
 		case WAL_COMPRESSION_ZSTD:
 #ifdef USE_ZSTD
 			len = ZSTD_compress(dest, COMPRESS_BUFSIZE, source, orig_len,
-ZSTD_CLEVEL_DEFAULT);
+wal_compression_level);
 			if (ZSTD_isError(len))
 len = -1;		/* failure */
 #else
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index a61b4a99219..728ca9f18c4 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -17,7 +17,9 @@
  */
 #include "postgres.h"
 
+#include 
 #include 
+
 #ifdef USE_LZ4
 #include 
 #endif
@@ -30,6 +32,7 @@
 #include "access/xlogreader.h"
 #include "access/xlogrecord.h"
 #include "catalog/pg_control.h"
+#include "common/compression.h"
 #include 

Re: [PATCH] Add `verify-system` sslmode to use system CA pool for server cert

2023-01-10 Thread Jacob Champion
On Mon, Jan 9, 2023 at 7:07 AM Jelte Fennema  wrote:
> I also took a closer look at the code, and the only comment I have is:
>
> > appendPQExpBuffer(&conn->errorMessage,
>
> These calls can all be replaced by the recently added libpq_append_conn_error

Argh, thanks for the catch. Fixed.

> Finally I tested this against a Postgres server I created on Azure and
> the new value works as expected. The only thing that I think would be
> good to change is the error message when sslmode=verify-full, and
> sslrootcert is not provided, but ~/.postgresql/root.crt is also not available.
> I think it would be good for the error to mention sslrootcert=system

Good idea. The wording I chose in v6 is

Either provide the file, use the system's trusted roots with
sslrootcert=system, or change sslmode to disable server certificate
verification.

What do you think?

Thanks!
--Jacob
1:  7618037c86 ! 1:  e7e2d43b18 libpq: add sslrootcert=system to use default CAs
@@ src/interfaces/libpq/fe-secure-openssl.c: initialize_SSL(PGconn *conn)
 +  {
 +  char   *err = SSLerrmessage(ERR_get_error());
 +
-+  appendPQExpBuffer(&conn->errorMessage,
-+                    libpq_gettext("could not load system root certificate paths: %s\n"),
-+                    err);
++  libpq_append_conn_error(conn, "could not load system root certificate paths: %s",
++                          err);
 +  SSLerrfree(err);
 +  SSL_CTX_free(SSL_context);
 +  return -1;
@@ src/interfaces/libpq/fe-secure-openssl.c: initialize_SSL(PGconn *conn)
{
X509_STORE *cvstore;
  
+@@ src/interfaces/libpq/fe-secure-openssl.c: initialize_SSL(PGconn *conn)
+*/
+   if (fnbuf[0] == '\0')
+       libpq_append_conn_error(conn, "could not get home directory to locate root certificate file\n"
+-                              "Either provide the file or change sslmode to disable server certificate verification.");
++                              "Either provide the file, use the system's trusted roots with sslrootcert=system, or change sslmode to disable server certificate verification.");
+   else
+       libpq_append_conn_error(conn, "root certificate file \"%s\" does not exist\n"
+-                              "Either provide the file or change sslmode to disable server certificate verification.", fnbuf);
++                              "Either provide the file, use the system's trusted roots with sslrootcert=system, or change sslmode to disable server certificate verification.", fnbuf);
+   SSL_CTX_free(SSL_context);
+   return -1;
+   }
 
  ## src/test/ssl/ssl/server-cn-only+server_ca.crt (new) ##
 @@
2:  c392e1796e ! 2:  e4d9731e1e libpq: force sslmode=verify-full for system CAs
@@ src/interfaces/libpq/fe-connect.c: connectOptions2(PGconn *conn)
 +  && strcmp(conn->sslrootcert, "system") == 0)
 +  {
 +  conn->status = CONNECTION_BAD;
-+  appendPQExpBuffer(&conn->errorMessage,
-+                    libpq_gettext("sslrootcert value \"%s\" invalid when SSL support is not compiled in\n"),
-+                    conn->sslrootcert);
++  libpq_append_conn_error(conn, "sslrootcert value \"%s\" invalid when SSL support is not compiled in",
++                          conn->sslrootcert);
 +  return false;
 +  }
 +#endif
@@ src/interfaces/libpq/fe-connect.c: connectOptions2(PGconn *conn)
 +  && strcmp(conn->sslmode, "verify-full") != 0)
 +  {
 +  conn->status = CONNECTION_BAD;
-+  appendPQExpBuffer(&conn->errorMessage,
-+                    libpq_gettext("weak sslmode \"%s\" may not be used with sslrootcert=system\n"),
-+                    conn->sslmode);
++  libpq_append_conn_error(conn, "weak sslmode \"%s\" may not be used with sslrootcert=system",
++                          conn->sslmode);
 +  return false;
 +  }
 +#endif
From e7e2d43b186a32a414783baa2ca55d88a5d7fe9e Mon Sep 17 00:00:00 2001
From: Jacob Champion 
Date: Mon, 24 Oct 2022 15:30:25 -0700
Subject: [PATCH v6 1/2] libpq: add sslrootcert=system to use default CAs

Based on a patch by Thomas Habets.

Note the workaround in .cirrus.yml for a strange interaction between
brew and the openssl@3 

Re: ATTACH PARTITION seems to ignore column generation status

2023-01-10 Thread Tom Lane
I wrote:
> Amit Langote  writes:
>> Thanks for the patch.  It looks good, though I guess you said that we
>> should also change the error message that CREATE TABLE ... PARTITION
>> OF emits to match the other cases while we're here.  I've attached a
>> delta patch.

> Thanks.  I hadn't touched that issue because I wasn't entirely sure
> which error message(s) you were unhappy with.  These changes look
> fine offhand.

So, after playing with that a bit ... removing the block in
parse_utilcmd.c allows you to do

regression=# CREATE TABLE gtest_parent (f1 date NOT NULL, f2 bigint, f3 bigint 
GENERATED ALWAYS AS (f2 * 2) STORED) PARTITION BY RANGE (f1);
CREATE TABLE
regression=# CREATE TABLE gtest_child PARTITION OF gtest_parent (
regression(#f3 WITH OPTIONS GENERATED ALWAYS AS (f2 * 3) STORED
regression(# ) FOR VALUES FROM ('2016-07-01') TO ('2016-08-01');
CREATE TABLE
regression=# \d gtest_child
  Table "public.gtest_child"
 Column |  Type  | Collation | Nullable |   Default   
++---+--+-
 f1 | date   |   | not null | 
 f2 | bigint |   |  | 
 f3 | bigint |   |  | generated always as (f2 * 3) stored
Partition of: gtest_parent FOR VALUES FROM ('2016-07-01') TO ('2016-08-01')

regression=# insert into gtest_parent values('2016-07-01', 42);
INSERT 0 1
regression=# table gtest_parent;
     f1     | f2 | f3  
------------+----+-----
 2016-07-01 | 42 | 126
(1 row)

That is, you can make a partition with a different generated expression
than the parent has.  Do we really want to allow that?  I think it works
as far as the backend is concerned, but it breaks pg_dump, which tries
to dump this state of affairs as

CREATE TABLE public.gtest_parent (
f1 date NOT NULL,
f2 bigint,
f3 bigint GENERATED ALWAYS AS ((f2 * 2)) STORED
)
PARTITION BY RANGE (f1);

CREATE TABLE public.gtest_child (
f1 date NOT NULL,
f2 bigint,
f3 bigint GENERATED ALWAYS AS ((f2 * 3)) STORED
);

ALTER TABLE ONLY public.gtest_parent ATTACH PARTITION public.gtest_child FOR 
VALUES FROM ('2016-07-01') TO ('2016-08-01');

and that fails at reload because the ATTACH PARTITION code path
checks for equivalence of the generation expressions.

This different-generated-expression situation isn't really morally
different from different ordinary DEFAULT expressions, which we
do endeavor to support.  So maybe we should deem this a supported
case and remove ATTACH PARTITION's insistence that the generation
expressions match ... which I think would be a good thing anyway,
because that check-for-same-reverse-compiled-expression business
is pretty cheesy in itself.  AFAIK, 3f7836ff6 took care of the
only problem that the backend would have with this, and pg_dump
looks like it will work as long as the backend will take the
ATTACH command.
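
For comparison, the already-supported plain-DEFAULT analogue looks like this
(sketch only; table names invented):

CREATE TABLE def_parent (f1 date NOT NULL, f2 bigint DEFAULT 2)
    PARTITION BY RANGE (f1);
CREATE TABLE def_child PARTITION OF def_parent (
    f2 WITH OPTIONS DEFAULT 3
) FOR VALUES FROM ('2016-07-01') TO ('2016-08-01');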

I also looked into making CREATE TABLE ... PARTITION OF reject
this case; but that's much harder than it sounds, because what we
have at the relevant point is a raw (unanalyzed) expression for
the child's generation expression but a cooked one for the
parent's, so there is no easy way to match the two.

In short, it's seeming like the rule for both partitioning and
traditional inheritance ought to be "a child column must have
the same generated-or-not property as its parent, but their
generation expressions need not be the same".  Thoughts?

regards, tom lane




Wasted Vacuum cycles when OldestXmin is not moving

2023-01-10 Thread sirisha chamarthi
Hi Hackers,

Vacuum is not able to clean up dead tuples when OldestXmin is not moving
(because of a long-running transaction or when hot_standby_feedback is
behind). Even though OldestXmin has not moved since the last time it checked,
vacuum keeps retrying every autovacuum_naptime and wastes CPU cycles and I/O
when pages are not in memory. Can we not bypass the dead tuple collection
and cleanup step until OldestXmin has advanced? The log below shows vacuum
running every minute.

2023-01-09 08:13:01.364 UTC [727219] LOG:  automatic vacuum of table
"postgres.public.t1": index scans: 0
pages: 0 removed, 6960 remain, 6960 scanned (100.00% of total)
tuples: 0 removed, 1572864 remain, 786432 are dead but not yet
removable
removable cutoff: 852, which was 2 XIDs old when operation ended
frozen: 0 pages from table (0.00% of total) had 0 tuples frozen
index scan not needed: 0 pages from table (0.00% of total) had 0
dead item identifiers removed
avg read rate: 0.000 MB/s, avg write rate: 0.000 MB/s
buffer usage: 13939 hits, 0 misses, 0 dirtied
WAL usage: 0 records, 0 full page images, 0 bytes
system usage: CPU: user: 0.15 s, system: 0.00 s, elapsed: 0.29 s
2023-01-09 08:14:01.363 UTC [727289] LOG:  automatic vacuum of table
"postgres.public.t1": index scans: 0
pages: 0 removed, 6960 remain, 6960 scanned (100.00% of total)
tuples: 0 removed, 1572864 remain, 786432 are dead but not yet
removable
removable cutoff: 852, which was 2 XIDs old when operation ended
frozen: 0 pages from table (0.00% of total) had 0 tuples frozen
index scan not needed: 0 pages from table (0.00% of total) had 0
dead item identifiers removed
avg read rate: 0.000 MB/s, avg write rate: 0.000 MB/s
buffer usage: 13939 hits, 0 misses, 0 dirtied
WAL usage: 0 records, 0 full page images, 0 bytes
system usage: CPU: user: 0.14 s, system: 0.00 s, elapsed: 0.29 s
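
For completeness, this state is easy to reproduce (a rough sketch; the table
and column names are made up):

-- session 1: hold back OldestXmin with an old snapshot
BEGIN ISOLATION LEVEL REPEATABLE READ;
SELECT count(*) FROM t1;
-- session 2: create dead tuples that stay "dead but not yet removable"
UPDATE t1 SET val = val + 1;
-- every autovacuum_naptime thereafter produces a report like the ones above,
-- scanning the table again without removing anything.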

Thanks,
Sirisha


Re: Allow DISTINCT to use Incremental Sort

2023-01-10 Thread David Rowley
On Tue, 10 Jan 2023 at 16:07, Richard Guo  wrote:
> Sorry I didn't make myself clear.  I mean currently on HEAD in planner.c
> from line 4847 to line 4857, we have the code to make sure we always use
> the more rigorous clause for explicit-sort case.  I think this code is
> not necessary, because we have already done that in line 4791 to 4796,
> for both DISTINCT ON and standard DISTINCT.

Thanks for explaining.  I'm unsure if that code ever did anything
useful, but I agree that it does nothing now.

>> I've attached an updated patch
>
>
> The patch looks good to me.

Thanks for having another look. I've now pushed the patch.

David




Re: allowing for control over SET ROLE

2023-01-10 Thread Jeff Davis
On Tue, 2023-01-10 at 11:45 -0500, Robert Haas wrote:
> So the risks, which in theory are all very similar, are in practice
> far greater in the PostgreSQL context, basically because our default
> setup is about 40 years behind the times in terms of implementing
> best
> practices.

I agree that huge improvements could be made with improvements to best
practices/defaults.

But there are some differences that are harder to fix that way. In
postgres, one can attach arbitrary code to pretty much anything, so you
need to trust everything you touch. There is no safe postgres
equivalent to grepping an untrusted file.
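
As a tiny sketch of what that means in practice (names invented): merely
selecting from another role's view runs that role's code with your own
privileges.

CREATE FUNCTION their_code() RETURNS text LANGUAGE plpgsql AS $$
BEGIN
    -- anything at all could run here, with the caller's privileges
    RETURN 'x';
END;
$$;
CREATE VIEW innocent_looking AS SELECT their_code() AS col;
-- SELECT * FROM innocent_looking;  -- runs their_code() as the caller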


> It might be best to repost some of these ideas on a new thread with a
> relevant subject line, but I agree that there's some potential here.
> Your first idea reminds me a lot of the proposal Tom made in
> https://www.postgresql.org/message-id/19327.1533748...@sss.pgh.pa.us
> -- except that his mechanism is more general, since you can say whose
> code you trust and whose code you don't trust. Noah had a competing
> version of that patch, too. But we never settled on an approach. I
> still think something like this would be a good idea, and the fact
> that you've apparently-independently come up with a similar notion
> just reinforces that.

Will do, thank you for the reference.


-- 
Jeff Davis
PostgreSQL Contributor Team - AWS






Re: [PATCH] Add `verify-system` sslmode to use system CA pool for server cert

2023-01-10 Thread Jacob Champion
On Mon, Jan 9, 2023 at 7:40 AM Andrew Dunstan  wrote:
> I'm confused. A client cert might not have a hostname at all, and isn't
> used to verify the connecting address, but to verify the username. It
> needs to have a CN/DN equal to the user name of the connection, or that
> maps to that name via pg_ident.conf.

Right. But I don't know anything about the security model for using a
publicly issued server certificate as a client certificate. So if you
tell me that your only requirement is that the hostname/CN matches an
entry in your ident file, and that you don't need to verify that the
certificate identifying example.org is actually coming from
example.org, or do any sort of online revocation processing to help
mitigate the risks from that, or even handle wildcards or SANs in the
cert -- fine, but I don't know the right questions to ask to review
that case for safety or correctness. It'd be better to ask someone who
is already comfortable with it.

From my perspective, ssl_ca_file=system sure *looks* like something
reasonable for me to choose as a DBA, but I'm willing to guess it's
not actually reasonable for 99% of people. (If you get your pg_ident
rule wrong, for example, the number of people who can attack you is
scoped by the certificates issued by your CA... which for 'system'
would be the entire world.) By contrast I would have no problem
recommending sslrootcert=system as a default: a certificate can still
be misissued, but a would-be attacker would still have to get you to
connect to them. That model and its risks are, I think, generally well
understood by the community.

--Jacob




Re: [PATCH] Support % wildcard in extension upgrade filenames

2023-01-10 Thread Sandro Santilli
On Mon, Jan 09, 2023 at 05:51:49PM -0500, Tom Lane wrote:

> Have you considered the idea of instead inventing a "\include" facility

[...]

> cases, you still need one script file for each supported upgrade step

That's exactly the problem we're trying to solve here.
The include support is nice on itself, but won't solve our problem.

--strk;




Re: Fixing a couple of buglets in how VACUUM sets visibility map bits

2023-01-10 Thread Peter Geoghegan
On Tue, Jan 10, 2023 at 12:19 PM Robert Haas  wrote:
> I don't understand what distinction you're making. It seems like
> hair-splitting to me. We should be able to reproduce problems like
> this reliably, at least with the aid of a debugger and some
> breakpoints, before we go changing the code.

So we can *never* change something defensively, on the basis of a
suspected or theoretical hazard, either in backbranches or just on
HEAD? Not under any circumstances, ever?

> The risk of being wrong
> is quite high because the code is subtle, and the consequences of
> being wrong are potentially very bad because the code is critical to
> data integrity. If the reproducer doesn't require a debugger or other
> extreme contortions, then we should consider reducing it to a TAP test
> that can be committed. If you agree with that, then I'm not sure what
> your last email was complaining about.

I was complaining about your prescribing conditions on proceeding with
a commit, based on an understanding of things that you yourself
acknowledged as incomplete. I cannot imagine how you read that as an
unwillingness to test the issue, especially given that I agreed to
work on that before you chimed in.

> > I have been unable to reproduce the problem, and think it's possible
> > that the issue cannot be triggered in practice. Though only through
> > sheer luck. Here's why that is:

> I guess I'm not very sure that this is sheer luck.

That's just my characterization. Other people can make up their own minds.

> For the purposes of clarifying my understanding, is this the code
> you're principally worried about?

> visibilitymap_set(vacrel->rel, blkno, buf, InvalidXLogRecPtr,
>   vmbuffer, InvalidTransactionId,
>   VISIBILITYMAP_ALL_FROZEN);

Obviously I meant this call site, since it's the only one that passes
VISIBILITYMAP_ALL_FROZEN as its flags, without also passing
VISIBILITYMAP_ALL_VISIBLE -- in vacuumlazy.c, and in general.

The other visibilitymap_set() callsite that you quoted is from the
second heap pass, where LP_DEAD items are vacuumed and become
LP_UNUSED items. That isn't buggy, but it is a silly approach, in that
it cares about what the visibility map says about the page being
all-visible, as if it might take a dissenting view that needs to be
taken into consideration (obviously we know what's going on with the
page because we just scanned it ourselves, and determined that it was
at least all-visible).

-- 
Peter Geoghegan




Re: BUG: Postgres 14 + vacuum_defer_cleanup_age + FOR UPDATE + UPDATE

2023-01-10 Thread Matthias van de Meent
On Tue, 10 Jan 2023 at 20:14, Andres Freund  wrote:
>
> Hi,
>
> On 2023-01-10 15:03:42 +0100, Matthias van de Meent wrote:
> > On Mon, 9 Jan 2023 at 20:34, Andres Freund  wrote:
> > > It's not too hard to fix in individual places, but I suspect that we'll
> > > introduce the bug in future places without some more fundamental 
> > > protection.
> > >
> > > Locally I fixed it by clamping vacuum_defer_cleanup_age to a reasonable 
> > > value
> > > in ComputeXidHorizons() and GetSnapshotData().
> >
> > I don't think that clamping the value with oldestXid (as seen in patch
> > 0001, in GetSnapshotData) is right.
>
> I agree that using oldestXid to clamp is problematic.
>
>
> > It would clamp the value relative to the oldest frozen xid of all
> > databases, which can be millions of transactions behind oldestXmin,
> > and thus severely skew the amount of transaction's changes you keep on
> > disk (that is, until oldestXid moves past 1000_000).
>
> What precisely do you mean with "skew" here? Do you just mean that it'd take a
> long time until vacuum_defer_cleanup_age takes effect? Somehow it sounds like
> you might mean more than that?

h->oldest_considered_running can be extremely old due to the global
nature of the value and the potential existence of a snapshot in
another database that started in parallel to a very old running
transaction.

Example: With vacuum_defer_cleanup_age set to 100, it is possible
that a snapshot in another database (thus another backend) would
result in a local intermediate status result of h->o_c_r = 20,
h->s_o_n = 20, h->d_o_n = 10030. The clamped offset would then be 20
(clamped using h->o_c_r), which updates h->data_oldest_nonremovable to
10010. The obvious result is that all but the last 20 transactions
from this database's data files are available for cleanup, which
contradicts with the intention of the vacuum_defer_cleanup_age GUC.

> I'm tempted to go with reinterpreting 64bit xids as signed. Except that it
> seems like a mighty invasive change to backpatch.

I'm not sure either. Protecting against underflow by halving the
effective valid value space is quite the intervention, but if it is
necessary to make this work in a performant manner, it would be worth
it. Maybe someone else with more experience can provide their opinion
here.

Kind regards,

Matthias van de Meent




Re: Add SHELL_EXIT_CODE to psql

2023-01-10 Thread Corey Huinker
On Tue, Jan 10, 2023 at 3:54 AM Maxim Orlov  wrote:

>
>
> On Mon, 9 Jan 2023 at 21:36, Corey Huinker 
> wrote:
>
>>
>> I chose a name that would avoid collisions with anything a user might
>> potentially throw into their environment, so the fact that the var "OS" is
>> fairly standard is a reason to avoid using it. Also, going with our own env var
>> allows us to stay in perfect synchronization with the build's #ifdef WIN32
>> ... and whatever #ifdefs may come in the future for new OSes. If there is
>> already an environment variable that does that for us, I would rather use
>> that, but I haven't found it.
>>
> Perhaps I didn't make myself clear. Your solution is perfectly adapted
> to our needs.
> But all Windows versions since 2000 already have an environment variable
> OS=Windows_NT. So, if the env var OS is defined and equal to Windows_NT, it
> has to be Windows.
> May we use it in our case? I don't insist, just asking.
>

Ah, that makes more sense. I don't have a strong opinion on which we should
use, and I would probably defer to someone who does windows (and possibly
cygwin) builds more often than I do.


Re: heapgettup refactoring

2023-01-10 Thread Melanie Plageman
On Thu, Jan 5, 2023 at 8:52 AM Peter Eisentraut
 wrote:
>
> Ok, let's look through these patches starting from the top then.
>
> v4-0001-Add-no-movement-scan-helper.patch
>
> This makes sense overall; there is clearly some duplicate code that can
> be unified.
>
> It appears that during your rebasing you have effectively reverted your
> earlier changes that have been committed as
> 8e1db29cdbbd218ab6ba53eea56624553c3bef8c.  You should undo that.

Thanks. I think I have addressed this.
I've attached a rebased v5.

> I don't understand the purpose of the noinline maker.  If it's not
> necessary to inline, we can just leave it off, but there is no need to
> outright prevent inlining AFAICT.
>

I have removed it.

> I don't know why you changed the if/else sequences.  Before, the
> sequence was effectively
>
> if (forward)
> {
>  ...
> }
> else if (backward)
> {
>  ...
> }
> else
> {
>  /* it's no movement */
> }
>
> Now it's changed to
>
> if (no movement)
> {
>  ...
>  return;
> }
>
> if (forward)
> {
>  ...
> }
> else
> {
>  Assert(backward);
>  ...
> }
>
> Sure, that's the same thing, but it looks less elegant to me.

In this commit, you could keep the original ordering of if statements. I
preferred no movement scan first because then backwards and forwards
scans' code is physically closer to the rest of the code without the
intrusion of the no movement scan code.

Ultimately, the refactor (in later patches) flips the ordering of if
statements at the top from
if (scan direction)
to
if (initial or continue)
and this isn't a very interesting distinction for no movement scans. By
dealing with no movement scan at the top, I didn't have to handle no
movement scans in the initial and continue branches in the new structure.

Also, I will note that patches 4-6 at least and perhaps 4-7 do not make
sense as separate commits. I separated them for ease of review. Each is
correct and passes tests but is not really an improvement without the
others.

- Melanie
From 9b51959c483812c30ca848b66a0e6e3a807ab03f Mon Sep 17 00:00:00 2001
From: Melanie Plageman 
Date: Mon, 9 Jan 2023 17:33:43 -0500
Subject: [PATCH v5 3/7] Add heapgettup_initial_block() helper

---
 src/backend/access/heap/heapam.c | 212 ---
 1 file changed, 82 insertions(+), 130 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index c80b547809..b9a1aac3ca 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -521,6 +521,57 @@ heapgettup_no_movement(HeapScanDesc scan)
 }
 
 
+static inline BlockNumber
+heapgettup_initial_block(HeapScanDesc scan, ScanDirection dir)
+{
+	Assert(!ScanDirectionIsNoMovement(dir));
+	Assert(!scan->rs_inited);
+
+	/* return null immediately if relation is empty */
+	if (scan->rs_nblocks == 0 || scan->rs_numblocks == 0)
+		return InvalidBlockNumber;
+
+	scan->rs_inited = true;
+
+	/* forward and serial */
+	if (ScanDirectionIsForward(dir) && scan->rs_base.rs_parallel == NULL)
+		return scan->rs_startblock;
+
+	/* forward and parallel */
+	if (ScanDirectionIsForward(dir))
+	{
+		table_block_parallelscan_startblock_init(scan->rs_base.rs_rd,
+ scan->rs_parallelworkerdata,
+ (ParallelBlockTableScanDesc) scan->rs_base.rs_parallel);
+
+		return table_block_parallelscan_nextpage(scan->rs_base.rs_rd,
+ scan->rs_parallelworkerdata,
+ (ParallelBlockTableScanDesc) scan->rs_base.rs_parallel);
+	}
+
+	/* backward parallel scan not supported */
+	Assert(scan->rs_base.rs_parallel == NULL);
+
+	/*
+	 * Disable reporting to syncscan logic in a backwards scan; it's not very
+	 * likely anyone else is doing the same thing at the same time, and much
+	 * more likely that we'll just bollix things for forward scanners.
+	 */
+	scan->rs_base.rs_flags &= ~SO_ALLOW_SYNC;
+
+	/*
+	 * Start from last page of the scan.  Ensure we take into account
+	 * rs_numblocks if it's been adjusted by heap_setscanlimits().
+	 */
+	if (scan->rs_numblocks != InvalidBlockNumber)
+		return (scan->rs_startblock + scan->rs_numblocks - 1) % scan->rs_nblocks;
+
+	if (scan->rs_startblock > 0)
+		return scan->rs_startblock - 1;
+
+	return scan->rs_nblocks - 1;
+}
+
 /* 
  *		heapgettup - fetch next heap tuple
  *
@@ -574,41 +625,16 @@ heapgettup(HeapScanDesc scan,
 	{
 		if (!scan->rs_inited)
 		{
-			/*
-			 * return null immediately if relation is empty
-			 */
-			if (scan->rs_nblocks == 0 || scan->rs_numblocks == 0)
+			block = heapgettup_initial_block(scan, dir);
+
+			if (block == InvalidBlockNumber)
 			{
 Assert(!BufferIsValid(scan->rs_cbuf));
 tuple->t_data = NULL;
 return;
 			}
-			if (scan->rs_base.rs_parallel != NULL)
-			{
-ParallelBlockTableScanDesc pbscan =
-(ParallelBlockTableScanDesc) scan->rs_base.rs_parallel;
-ParallelBlockTableScanWorker pbscanwork =
-scan->rs_parallelworkerdata;
-
-

Re: Fixing a couple of buglets in how VACUUM sets visibility map bits

2023-01-10 Thread Robert Haas
On Tue, Jan 10, 2023 at 2:48 PM Peter Geoghegan  wrote:
> What I actually said was that there is no reason to declare up front
> that the only circumstances under which a fix could be committed is
> when a clean repro is available. I never said that a test case has
> little or no value, and I certainly didn't assert that we definitely
> don't need a test case to proceed with a commit -- since I am not in
> the habit of presumptuously attaching conditions to such things well
> in advance.

I don't understand what distinction you're making. It seems like
hair-splitting to me. We should be able to reproduce problems like
this reliably, at least with the aid of a debugger and some
breakpoints, before we go changing the code. The risk of being wrong
is quite high because the code is subtle, and the consequences of
being wrong are potentially very bad because the code is critical to
data integrity. If the reproducer doesn't require a debugger or other
extreme contortions, then we should consider reducing it to a TAP test
that can be committed. If you agree with that, then I'm not sure what
your last email was complaining about. If you disagree, then I don't
know why.

> I have been unable to reproduce the problem, and think it's possible
> that the issue cannot be triggered in practice. Though only through
> sheer luck. Here's why that is:
>
> While pruning will remove aborted dead tuples, freezing will not
> remove the xmax of an aborted update unless the XID happens to be <
> OldestXmin. With my problem scenario, the page will be all_visible in
> prunestate, but not all_frozen -- so it dodges the relevant
> visibilitymap_set() call site.
>
> That just leaves inserts that abort, I think. An aborted insert will
> be totally undone by pruning, but that does still leave behind an
> LP_DEAD item that needs to be vacuumed in the second heap pass. This
> means that we can only set the page all-visible/all-frozen in the VM
> in the second heap pass, which also dodges the relevant
> visibilitymap_set() call site.

I guess I'm not very sure that this is sheer luck. It seems like we
could equally well suppose that the people who wrote the code
correctly understood the circumstances under which we needed to avoid
calling visibilitymap_set(), and wrote the code in a way that
accomplished that purpose. Maybe there's contrary evidence or maybe it
is actually broken somehow, but that's not currently clear to me.

For the purposes of clarifying my understanding, is this the code
you're principally worried about?

/*
 * If the all-visible page is all-frozen but not marked as such yet,
 * mark it as all-frozen.  Note that all_frozen is only valid if
 * all_visible is true, so we must check both prunestate fields.
 */
else if (all_visible_according_to_vm && prunestate.all_visible &&
 prunestate.all_frozen &&
         !VM_ALL_FROZEN(vacrel->rel, blkno, &vmbuffer))
{
/*
 * We can pass InvalidTransactionId as the cutoff XID here,
 * because setting the all-frozen bit doesn't cause recovery
 * conflicts.
 */
visibilitymap_set(vacrel->rel, blkno, buf, InvalidXLogRecPtr,
  vmbuffer, InvalidTransactionId,
  VISIBILITYMAP_ALL_FROZEN);
}

Or maybe this one?

if (PageIsAllVisible(page))
{
uint8   flags = 0;
uint8   vm_status = visibilitymap_get_status(vacrel->rel,
 blkno, vmbuffer);

/* Set the VM all-frozen bit to flag, if needed */
if ((vm_status & VISIBILITYMAP_ALL_VISIBLE) == 0)
flags |= VISIBILITYMAP_ALL_VISIBLE;
if ((vm_status & VISIBILITYMAP_ALL_FROZEN) == 0 && all_frozen)
flags |= VISIBILITYMAP_ALL_FROZEN;

Assert(BufferIsValid(*vmbuffer));
if (flags != 0)
visibilitymap_set(vacrel->rel, blkno, buffer, InvalidXLogRecPtr,
  *vmbuffer, visibility_cutoff_xid, flags);
}

These are the only two call sites in vacuumlazy.c where I can see
there being a theoretical risk of the kind of problem that you're
describing.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Fixing a couple of buglets in how VACUUM sets visibility map bits

2023-01-10 Thread Peter Geoghegan
On Tue, Jan 10, 2023 at 11:47 AM Peter Geoghegan  wrote:
> In summary, I think that there is currently no way that we can have
> the VM (or the PD_ALL_VISIBLE flag) concurrently unset, while leaving
> the page all_frozen. It can happen and leave the page all_visible, but
> not all_frozen, due to these very fine details. (Assuming I haven't
> missed another path to the problem with aborted Multis or something,
> but looks like I haven't.)

Actually, FreezeMultiXactId() can fully remove an xmax that has some
member XIDs >= OldestXmin, provided FRM_NOOP processing isn't
possible, at least when no individual member is still running. Doesn't
have to involve transaction aborts at all.

Let me go try to break it that way...

-- 
Peter Geoghegan




Re: logical decoding and replication of sequences, take 2

2023-01-10 Thread Robert Haas
On Tue, Jan 10, 2023 at 1:32 PM Tomas Vondra
 wrote:
> 0001 is a fix for the pre-existing issue in logicalmsg_decode,
> attempting to build a snapshot before getting into a consistent state.
> AFAICS this only affects assert-enabled builds and is otherwise
> harmless, because we are not actually using the snapshot (apply gets a
> valid snapshot from the transaction).
>
> This is mostly the fix I shared in November, except that I kept the call
> in decode.c (per comment from Andres). I haven't added any argument to
> SnapBuildProcessChange because we may need to backpatch this (and it
> didn't seem much simpler, IMHO).

I tend to associate transactional behavior with snapshots, so it looks
odd to see code that builds a snapshot only when the message is
non-transactional. I think that a more detailed comment spelling out
the reasoning would be useful here.

> This however brings me to the original question what's the purpose of
> this patch - and that's essentially keeping sequences up to date to make
> them usable after a failover. We can't generate values from the sequence
> on the subscriber, because it'd just get overwritten. And from this
> point of view, it's also fine that the sequence is slightly ahead,
> because that's what happens after crash recovery anyway. And we're not
> guaranteeing the sequences to be gap-less.

I agree that it's fine for the sequence to be slightly ahead, but I
think that it can't be too far ahead without causing problems. Suppose
for example that transaction #1 creates a sequence. Transaction #2
does nextval on the sequence a bunch of times and inserts rows into a
table using the sequence values as the PK. It's fine if the nextval
operations are replicated ahead of the commit of transaction #2 -- in
fact I'd say it's necessary for correctness -- but they can't precede
the commit of transaction #1, since then the sequence won't exist yet.
Likewise, if there's an ALTER SEQUENCE that creates a new relfilenode,
I think that needs to act as a barrier: non-transactional changes that
happened before that transaction must also be replicated before that
transaction is replicated, and those that happened after that
transaction is replicated must be replayed after that transaction is
replicated. Otherwise, at the very least, there will be states visible
on the standby that were never visible on the origin server, and maybe
we'll just straight up get the wrong answer. For instance:

1. nextval, setting last_value to 3
2. ALTER SEQUENCE, getting a new relfilenode, and also set last_value to 19
3. nextval, setting last_value to 20

If 3 happens before 2, the sequence ends up in the wrong state.

Maybe you've already got this and similar cases totally correctly
handled, I'm not sure, just throwing it out there.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Fixing a couple of buglets in how VACUUM sets visibility map bits

2023-01-10 Thread Peter Geoghegan
On Tue, Jan 10, 2023 at 10:50 AM Robert Haas  wrote:
> Look, I don't want to spend time arguing about what seem to me to be
> basic principles of good software engineering. When I don't put test
> cases into my patches, people complain at me and tell me that I'm a
> bad software engineer because I didn't include test cases. Your
> argument here seems to be that you're such a good software engineer
> that you don't need any test cases to know what the bug is or that
> you've fixed it correctly.

That's not what I said. This is a straw man.

What I actually said was that there is no reason to declare up front
that the only circumstances under which a fix could be committed is
when a clean repro is available. I never said that a test case has
little or no value, and I certainly didn't assert that we definitely
don't need a test case to proceed with a commit -- since I am not in
the habit of presumptuously attaching conditions to such things well
in advance.

> I don't particularly appreciate the implication that I either lack
> relevant expertise or don't actually take the time, either.

The implication was only that you didn't take the time. Clearly you
have the expertise. Obviously you're very far from stupid.

I have been unable to reproduce the problem, and think it's possible
that the issue cannot be triggered in practice. Though only through
sheer luck. Here's why that is:

While pruning will remove aborted dead tuples, freezing will not
remove the xmax of an aborted update unless the XID happens to be <
OldestXmin. With my problem scenario, the page will be all_visible in
prunestate, but not all_frozen -- so it dodges the relevant
visibilitymap_set() call site.

That just leaves inserts that abort, I think. An aborted insert will
be totally undone by pruning, but that does still leave behind an
LP_DEAD item that needs to be vacuumed in the second heap pass. This
means that we can only set the page all-visible/all-frozen in the VM
in the second heap pass, which also dodges the relevant
visibilitymap_set() call site.
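
To make that aborted-insert path concrete (a sketch only, not a repro of the
race itself -- per the above it still dodges the relevant call site):

CREATE TABLE t (a int);
CREATE INDEX ON t (a);      -- ensure the two-pass (index) vacuum strategy
INSERT INTO t SELECT generate_series(1, 100);
BEGIN;
INSERT INTO t VALUES (101);
ROLLBACK;           -- pruning undoes the insert but leaves an LP_DEAD item
VACUUM (FREEZE) t;  -- the VM bits can then only be set in the second heap pass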

In summary, I think that there is currently no way that we can have
the VM (or the PD_ALL_VISIBLE flag) concurrently unset, while leaving
the page all_frozen. It can happen and leave the page all_visible, but
not all_frozen, due to these very fine details. (Assuming I haven't
missed another path to the problem with aborted Multis or something,
but looks like I haven't.)

-- 
Peter Geoghegan




Re: Transparent column encryption

2023-01-10 Thread Mark Dilger



> On Jan 10, 2023, at 9:26 AM, Mark Dilger  wrote:
> 
>-- Cryptographically connected to the encrypted record
>patient_id  BIGINT NOT NULL,
>patient_ssn CHAR(11),
> 
>-- The encrypted record
>patient_record TEXT ENCRYPTED WITH (column_encryption_key = cek1,
>column_encryption_salt = (patient_id, 
> patient_ssn)),

As you mention upthread, tying columns together creates problems for statements 
that only operate on a subset of columns.  Giving schema designers a choice 
about tying the encrypted column to zero or more other columns lets them pick 
whatever works best for their security needs.

The example above would make a statement like "UPDATE patient_record SET 
patient_record = $1 \bind '{some json whatever}'" raise an exception at the 
libpq client level, but maybe that's what schema designers want it to do.  If 
not, they should omit the column_encryption_salt option in the create table 
statement; but if so, they should expect to have to specify the other columns 
as part of the update statement, possibly as part of the where clause, like

UPDATE patient_record
SET patient_record = $1
WHERE patient_id = 12345
  AND patient_ssn = '111-11-' 
\bind '{some json record}'

and have libpq get the salt column values from the where clause (which may 
be tricky to implement), or perhaps use some new bind syntax like

UPDATE patient_record
SET patient_record = ($1:$2,$3)   -- new, wonky syntax
WHERE patient_id = $2
  AND patient_ssn = $3 
\bind '{some json record}' 12345 '111-11-'

which would be error prone, since the sql statement could specify the 
($1:$2,$3) inconsistently with the where clause, or perhaps specify the "new" 
salt columns even when not changed, like

UPDATE patient_record
SET patient_record = $1, patient_id = 12345, patient_ssn = '111-11-'
WHERE patient_id = 12345
  AND patient_ssn = '111-11-'
\bind '{some json record}'

which looks kind of nuts at first glance, but is grammatically consistent with 
cases where one or both of patient_id and patient_ssn are also being 
changed, like

UPDATE patient_record
SET patient_record = $1, patient_id = 98765, patient_ssn = '999-99-'
WHERE patient_id = 12345
  AND patient_ssn = '111-11-'
\bind '{some json record}'

Or, of course, you can ignore these suggestions or punt them to some future 
patch that extends the current work, rather than trying to get it all done in 
the first column encryption commit.  But it seems useful to think about what 
future directions would be, to avoid coding ourselves into a corner, making 
such future work harder.
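
For contrast, the untied variant is just the quoted example minus the
proposed salt option (a sketch -- column_encryption_salt is the hypothetical
addition under discussion, not existing syntax):

CREATE TABLE patient_record (
    patient_id  BIGINT NOT NULL,
    patient_ssn CHAR(11),
    patient_record TEXT ENCRYPTED WITH (column_encryption_key = cek1)
);

With that definition, a single-column "UPDATE patient_record SET
patient_record = $1" keeps working, at the cost of not cryptographically
tying the ciphertext to the other columns.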

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: RFC: logical publication via inheritance root?

2023-01-10 Thread Jacob Champion
On Mon, Jan 9, 2023 at 12:41 AM Aleksander Alekseev
 wrote:
> I would like to point out that we shouldn't necessarily support
> multiple inheritance in all the possible cases, at least not in the
> first implementation. Supporting simple cases of inheritance would be
> already a valuable feature even if it will have certain limitations.

I agree. What I'm trying to avoid is the case where replication works
nicely for a table until someone performs an ALTER TABLE ... [NO]
INHERIT, and then Something Bad happens because we can't support the
new edge case. If every inheritance tree is automatically opted into
this new publication behavior, I think it'll be easier to hit that by
accident, making the whole thing feel brittle.

By contrast, if we have to opt specific tables into this feature by
marking them in the catalog, then not only will it be harder to hit by
accident (because we can document the requirements for the marker
function, and then it's up to the callers/extension authors/DBAs to
maintain those requirements), but we even have the chance to bail out
during an inheritance change if we see that the table is marked in
this way.

Two general pieces of progress to report:

1) I'm playing around with a marker in pg_inherits, where the inhseqno
is set to a sentinel value (0) for an inheritance relationship that
has been marked for logical publication. The intent is that the
pg_inherits helpers will prevent further inheritance relationships
when they see that marker, and reusing inhseqno means we can make use
of the existing index to do the lookups. An example:

=# CREATE TABLE root (a int);
=# CREATE TABLE root_p1 () INHERITS (root);
=# SELECT pg_set_logical_root('root_p1', 'root');

and then any data written to root_p1 gets replicated via root instead,
if publish_via_partition_root = true. If root_p1 is set up with extra
columns, they'll be omitted from replication.
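
Roughly how that gets exercised end to end (a sketch; pg_set_logical_root is
the proposed marker function from above, not existing API):

=# CREATE PUBLICATION pub FOR TABLE root
     WITH (publish_via_partition_root = true);
=# INSERT INTO root_p1 VALUES (1);
   -- decoded and published as a change to "root", extra columns omitted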

2) While this strategy works well for ongoing replication, it's not
enough to get the initial synchronization correct. The subscriber
still does a COPY of the root table directly, missing out on all the
logical descendant data. The publisher will have to tell the
subscriber about the relationship somehow, and older subscriber
versions won't understand how to use that (similar to how old
subscribers can't correctly handle row filters).

--Jacob




Re: Show various offset arrays for heap WAL records

2023-01-10 Thread Andres Freund
Hi,

On 2023-01-09 19:59:42 -0800, Peter Geoghegan wrote:
> On Mon, Jan 9, 2023 at 1:58 PM Andres Freund  wrote:
> > A couple times when investigating data corruption issues, the last time just
> > yesterday in [1], I needed to see the offsets affected by PRUNE and VACUUM
> > records. As that's probably not just me, I think we should make that change
> > in-tree.
> 
> I remember how useful this was when we were investigating that early
> bug in 14, that turned out to be in parallel VACUUM. So I'm all in
> favor of it.

Cool.


> > The attached patch adds details to XLOG_HEAP2_PRUNE, XLOG_HEAP2_VACUUM,
> > XLOG_HEAP2_FREEZE_PAGE.
> 
> I'm bound to end up doing the same in index access methods. Might make
> sense for the utility routines to live somewhere more centralized, at
> least when code reuse is likely. Practically every index AM has WAL
> records that include a sorted page offset number array, just like
> these ones. It's a very standard thing, obviously.

Hm, there doesn't seem to be a great location for them today. I guess we could
add something like src/include/access/rmgrdesc_utils.h? And put the
implementation in src/backend/access/rmgrdesc/rmgrdesc_utils.c? I first was
thinking of just rmgrdesc.[ch], but custom rmgrs added
src/bin/pg_waldump/rmgrdesc.[ch] ...


> > I chose to include infomask[2] for the different freeze plans mainly because
> > it looks odd to see different plans without a visible reason. But I'm not 
> > sure
> > that's the right choice.
> 
> I don't think that it is particularly necessary to do so in order for
> the output to make sense -- pg_waldump is inherently a tool for
> experts. What it comes down to for me is whether or not this
> information is sufficiently useful to display, and/or can be (or needs
> to be) controlled via some kind of verbosity knob.

It seemed useful enough to me, but I likely also stare more at this stuff than
most. Compared to the list of offsets it's not that much content.


> How hard would it be to invent a general mechanism to control the verbosity
> of what we'll show for each WAL record?

Nontrivial, I'm afraid. We don't pass any relevant parameters to rm_desc:
void(*rm_desc) (StringInfo buf, XLogReaderState *record);

so we'd need to patch all of them. That might be worth doing at some point,
but I don't want to tackle it right now.

Greetings,

Andres Freund




Re: Use windows VMs instead of windows containers on the CI

2023-01-10 Thread Andres Freund
Hi,

On 2023-01-10 09:22:12 -0600, Justin Pryzby wrote:
> > There is more than 2x speed gain when VMs are used.
> 
> One consideration is that if windows runs twice as fast, we'll suddenly
> start using twice as many resources at cirrus/google/amazon - the
> windows task has been throttling everything else.  Not sure if we should
> to do anything beyond the limits that cfbot already uses.

I'm not sure we would. cfbot has a time-based limit for how often it tries to
rebuild entries, and I think we were just about keeping up with that. In which
case we shouldn't, on average, schedule more jobs than we currently
do. Although peak "job throughput" would be higher.

Thomas?

Greetings,

Andres Freund




Re: psql: Add role's membership options to the \du+ command

2023-01-10 Thread Pavel Luzanov

Added the patch to the open commitfest:
https://commitfest.postgresql.org/42/4116/

Feel free to reject if it's not interesting.

--
Pavel Luzanov


Re: Todo: Teach planner to evaluate multiple windows in the optimal order

2023-01-10 Thread Ankit Kumar Pandey




On 10/01/23 10:53, David Rowley wrote:



the total cost is the same for both of these, but the execution time
seems to vary quite a bit.


This is really weird. I tried it in different ways (to rule out any issues
due to caching) and the execution time varied in spite of having the same
cost.


Maybe we should try and do this for DISTINCT queries if the
distinct_pathkeys match the orderby_pathkeys. That seems a little less
copout-ish. If the ORDER BY is the same as the DISTINCT then it seems
likely that the ORDER BY might opt to use the Unique path for DISTINCT
since it'll already have the correct pathkeys.



However, if the ORDER BY has fewer columns then it might be cheaper to Hash 
Aggregate and
then sort all over again, especially so when the DISTINCT removes a
large proportion of the rows.


Aren't the ORDER BY pathkeys always fewer than the DISTINCT pathkeys?

distinct pathkeys = order by pathkeys + window function pathkeys

Again, I get your point that it is okay to push down the ORDER BY clause if
DISTINCT is doing a unique sort. But the problem (at least from what I am
seeing) is that DISTINCT does not care about the pushed-down sort keys; it
picks hashagg or unique based on some other logic (mentioned below).


Consider the following (with the distinct-clause restriction removed):

if (parse->distinctClause)
{
    List   *distinct_pathkeys = make_pathkeys_for_sortclauses(root,
                                    parse->distinctClause,
                                    root->processed_tlist);

    if (!compare_pathkeys(distinct_pathkeys, orderby_pathkeys) == 1) // distinct keys > order by keys
        skip = true;    // this is used to skip the order-by pushdown
}

CASE #1:

explain (costs off) select distinct a,b, min(a) over (partition by a), sum (a) 
over (partition by a) from abcd order by a,b;
QUERY PLAN
---
 Sort
   Sort Key: a, b
   ->  HashAggregate
 Group Key: a, b, min(a) OVER (?), sum(a) OVER (?)
 ->  WindowAgg
   ->  Sort
 Sort Key: a, b
 ->  Seq Scan on abcd
(8 rows)

explain (costs off) select distinct a,b,c, min(a) over (partition by a), sum 
(a) over (partition by a) from abcd order by a,b,c;
  QUERY PLAN
--
 Sort
   Sort Key: a, b, c
   ->  HashAggregate
 Group Key: a, b, c, min(a) OVER (?), sum(a) OVER (?)
 ->  WindowAgg
   ->  Sort
 Sort Key: a, b, c
 ->  Seq Scan on abcd
(8 rows)

No matter how many columns are pushed down, it does hashagg.

On the other hand:

CASE #2:

EXPLAIN (costs off) SELECT DISTINCT depname, empno, min(salary) OVER (PARTITION 
BY depname) depminsalary,sum(salary) OVER (PARTITION BY depname) depsalary
FROM empsalary
ORDER BY depname, empno;
QUERY PLAN
--
 Unique
   ->  Sort
 Sort Key: depname, empno, (min(salary) OVER (?)), (sum(salary) OVER 
(?))
 ->  WindowAgg
   ->  Sort
 Sort Key: depname, empno
 ->  Seq Scan on empsalary
(7 rows)

EXPLAIN (costs off) SELECT DISTINCT depname, empno, enroll_date, min(salary) 
OVER (PARTITION BY depname) depminsalary,sum(salary) OVER (PARTITION BY 
depname) depsalary
FROM empsalary
ORDER BY depname, empno, enroll_date;
  QUERY PLAN
---
 Unique
   ->  Sort
 Sort Key: depname, empno, enroll_date, (min(salary) OVER (?)), 
(sum(salary) OVER (?))
 ->  WindowAgg
   ->  Sort
 Sort Key: depname, empno, enroll_date
 ->  Seq Scan on empsalary
(7 rows)

It keeps doing Unique.

In both of these cases, compare_pathkeys(distinct_pathkeys, orderby_pathkeys)
returns 1.

Looking a bit further, the planner is choosing things correctly: I could see
the cost of Unique being higher in the first case and lower in the second.

But the point is, if the sort for the ORDER BY is pushed down, shouldn't
there be some discount on the cost of the Unique sort (so that it has a
better chance of being favorable compared to HashAgg in certain cases)?

Again, the cost of the Unique node is taken as the cost of the Sort node
as-is, whereas for HashAgg a new cost is computed. If we do an incremental
sort here (for the Unique node), then since we have pushed down the ORDER BY
keys, the Unique cost could be reduced and our optimization could be made
worthwhile in the DISTINCT case (I assume this is what you intended here).


Eg:

EXPLAIN SELECT DISTINCT depname, empno, enroll_date, min(salary) OVER 
(PARTITION BY depname) depminsalary,sum(salary) OVER (PARTITION BY depname) 
depsalary
FROM empsalary
ORDER BY depname, empno, enroll_date;
  QUERY PLAN

Re: BUG: Postgres 14 + vacuum_defer_cleanup_age + FOR UPDATE + UPDATE

2023-01-10 Thread Andres Freund
Hi,

On 2023-01-10 15:03:42 +0100, Matthias van de Meent wrote:
> On Mon, 9 Jan 2023 at 20:34, Andres Freund  wrote:
> > On 2023-01-09 17:50:10 +0100, Matthias van de Meent wrote:
> > > Wouldn't it be enough to only fix the constructions in
> > > FullXidRelativeTo() and widen_snapshot_xid() (as attached, $topic does
> > > not occur with the patch), and (optionally) bump the first XID
> > > available for any cluster to (FirstNormalXid + 1) to retain the 'older
> > > than any running transaction' property?
> >
> > It's not too hard to fix in individual places, but I suspect that we'll
> > introduce the bug in future places without some more fundamental protection.
> >
> > Locally I fixed it by clamping vacuum_defer_cleanup_age to a reasonable 
> > value
> > in ComputeXidHorizons() and GetSnapshotData().
> 
> I don't think that clamping the value with oldestXid (as seen in patch
> 0001, in GetSnapshotData) is right.

I agree that using oldestXid to clamp is problematic.


> It would clamp the value relative to the oldest frozen xid of all
> databases, which can be millions of transactions behind oldestXmin,
> and thus severely skew the amount of transaction's changes you keep on
> disk (that is, until oldestXid moves past 1000_000).

What precisely do you mean by "skew" here? Do you just mean that it'd take a
long time until vacuum_defer_cleanup_age takes effect? Somehow it sounds like
you might mean more than that?


I'm tempted to go with reinterpreting 64bit xids as signed. Except that it
seems like a mighty invasive change to backpatch.


Greetings,

Andres Freund




Re: Avoiding "wrong tuple length" errors at the end of VACUUM on pg_database update (Backpatch of 947789f to v12 and v13)

2023-01-10 Thread Andres Freund
Hi,

On 2023-01-10 02:57:43 -0500, Tom Lane wrote:
> No objections to back-patching the fix, but I wonder if we can find
> some mechanical way to prevent this sort of error in future.

What about a define that forces external toasting very aggressively for
catalog tables, iff they have a toast table? I suspect doing so for
non-catalog tables as well would trigger test changes. Running a buildfarm
animal with that would at least make issues like this much easier to discover.

Greetings,

Andres Freund




Re: can while loop in ClockSweepTick function be kind of infinite loop in some cases?

2023-01-10 Thread Andres Freund
Hi,

On 2023-01-10 13:11:35 -0500, Robert Haas wrote:
> On Tue, Jan 10, 2023 at 12:40 PM Andres Freund  wrote:
> > > I think. `expected = originalVictim + 1;` line should be in while loop
> > > (before acquiring spin lock) so that, even in the case above, expected
> > > variable is incremented for each loop and CAS operation will be successful
> > > at some point.
> > > Is my understanding correct? If so, I will send PR for fixing this issue.
> >
> > Yes, I think your understanding might be correct. Interesting that this
> > apparently has never occurred.
>
> Doesn't pg_atomic_compare_exchange_u32 set expected if it fails?

Indeed, so there's no problem.

I wonder if we should make ->nextVictimBuffer a 64bit atomic. At the time the
changes went in we didn't (or rather, couldn't) rely on it, but these days we
could.  I think with a 64bit number we could get rid of ->completePasses and
just infer it from ->nextVictimBuffer/NBuffers, removing the need for the
spinlock.  It's very unlikely that a 64bit counter would ever wrap, and even if it did, it'd
just be a small inaccuracy in the assumed number of passes. OTOH, it's
doubtful the overflow handling / the spinlock matters performance wise.

Greetings,

Andres Freund




Re: Fixing a couple of buglets in how VACUUM sets visibility map bits

2023-01-10 Thread Robert Haas
On Mon, Jan 9, 2023 at 5:59 PM Peter Geoghegan  wrote:
> On Mon, Jan 9, 2023 at 12:51 PM Robert Haas  wrote:
> > I feel that you should at least have a reproducer for these problems
> > posted to the thread, and ideally a regression test, before committing
> > things. I think it's very hard to understand what the problems are
> > right now.
>
> Hard to understand relative to what, exactly? We're talking about a
> very subtle race condition here.
>
> I'll try to come up with a reproducer, but I *utterly* reject your
> assertion that it's a hard requirement, sight unseen. Why should those
> be the parameters of the discussion?
>
> For one thing I'm quite confident that I'm right, with or without a
> reproducer. And my argument isn't all that hard to follow, if you have
> relevant expertise, and actually take the time.

Look, I don't want to spend time arguing about what seem to me to be
basic principles of good software engineering. When I don't put test
cases into my patches, people complain at me and tell me that I'm a
bad software engineer because I didn't include test cases. Your
argument here seems to be that you're such a good software engineer
that you don't need any test cases to know what the bug is or that
you've fixed it correctly. That seems like a surprising argument, but
even if it's true, test cases can have considerable value to future
code authors, because it allows them to avoid reintroducing bugs that
have previously been fixed. In my opinion, it's not worth trying to
have automated test cases for absolutely every bug we fix, because
many of them would be really hard to develop and executing all of them
every time we do anything would be unduly time-consuming. But I can't
remember the last time before this that someone wanted to commit a
patch for a data corruption issue without even providing a test case
that other people can run manually. If you think that is or ought to
be standard practice, I can only say that I disagree.

I don't particularly appreciate the implication that I either lack
relevant expertise or don't actually take the time, either. I spent an
hour yesterday looking at your patches and didn't feel I was
very close to understanding 0002 in that time. I feel that if the
patches were better-written, with relevant comments and test cases and
really good commit messages and a lack of extraneous changes, I
believe I probably would have gotten a lot further in the same amount
of time. There is certainly an alternate explanation, which is that I
am stupid. I'm inclined to think that's not the correct explanation,
but most stupid people believe that they aren't, so that doesn't
really prove anything.

> But even this is
> unlikely to matter much. Even if I somehow turn out to have been
> completely wrong about the race condition, it is still self-evident
> that the problem of uselessly WAL logging non-changes to the VM
> exists. That doesn't require any concurrent access at all. It's a
> natural consequence of calling visibilitymap_set() with
> VISIBILITYMAP_ALL_FROZEN-only flags. You need only look at the code
> for 2 minutes to see it.

Apparently not, because I spent more time than that.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: releasing ParallelApplyTxnHash when pa_launch_parallel_worker returns NULL

2023-01-10 Thread Ted Yu
On Tue, Jan 10, 2023 at 9:43 AM Ted Yu  wrote:

>
>
> On Tue, Jan 10, 2023 at 9:26 AM Ted Yu  wrote:
>
>>
>>
>> On Tue, Jan 10, 2023 at 9:25 AM Ted Yu  wrote:
>>
>>> Hi,
>>> I was reading src/backend/replication/logical/applyparallelworker.c .
>>> In `pa_allocate_worker`, when pa_launch_parallel_worker returns NULL, I
>>> think the `ParallelApplyTxnHash` should be released.
>>>
>>> Please see the patch.
>>>
>>> Thanks
>>>
>> Here is the patch :-)
>>
>
> In `pa_process_spooled_messages_if_required`, the `pa_unlock_stream` call
> immediately follows `pa_lock_stream`.
> I assume the following is the intended sequence of calls. If this is the
> case, I can add it to the patch.
>
> Cheers
>
> diff --git a/src/backend/replication/logical/applyparallelworker.c
> b/src/backend/replication/logical/applyparallelworker.c
> index 2e5914d5d9..9879b3fff2 100644
> --- a/src/backend/replication/logical/applyparallelworker.c
> +++ b/src/backend/replication/logical/applyparallelworker.c
> @@ -684,9 +684,9 @@ pa_process_spooled_messages_if_required(void)
>  if (fileset_state == FS_SERIALIZE_IN_PROGRESS)
>  {
>  pa_lock_stream(MyParallelShared->xid, AccessShareLock);
> -pa_unlock_stream(MyParallelShared->xid, AccessShareLock);
>
>  fileset_state = pa_get_fileset_state();
> +pa_unlock_stream(MyParallelShared->xid, AccessShareLock);
>  }
>
>  /*
>
Looking closer at the comment above this code and other parts of the file,
it seems the order is intentional.

Please disregard my email about `pa_process_spooled_messages_if_required`.


Re: can while loop in ClockSweepTick function be kind of infinite loop in some cases?

2023-01-10 Thread Robert Haas
On Tue, Jan 10, 2023 at 12:40 PM Andres Freund  wrote:
> > I think. `expected = originalVictim + 1;` line should be in while loop
> > (before acquiring spin lock) so that, even in the case above, expected
> > variable is incremented for each loop and CAS operation will be successful
> > at some point.
> > Is my understanding correct? If so, I will send PR for fixing this issue.
>
> Yes, I think your understanding might be correct. Interesting that this
> apparently has never occurred.

Doesn't pg_atomic_compare_exchange_u32 set expected if it fails?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Handle infinite recursion in logical replication setup

2023-01-10 Thread Jonathan S. Katz

On 1/10/23 10:17 AM, Amit Kapila wrote:

On Tue, Jan 10, 2023 at 8:13 AM Jonathan S. Katz  wrote:



This consistently created the deadlock in my testing.

Discussing with Masahiko off-list, this is due to a deadlock from 4
processes: the walsenders on A and B, and the apply workers on A and B.
The walsenders are waiting for feedback from the apply workers, and the
apply workers are waiting for the walsenders to synchronize (I may be
oversimplifying).

He suggested I try the above example instead with `synchronous_commit`
set to `local`. In this case, I verified that there is no more deadlock,
but he informed me that we would not be able to use cascading
synchronous replication when "origin=none".



This has nothing to do with the origin feature. I mean this should
happen with origin=any or even in PG15 without using origin at all.
Am I missing something? One related point to note is that in physical
replication, cascading replication is asynchronous. See docs [1]
(Cascading replication is currently asynchronous)


This is not directly related to the origin feature, but the origin 
feature makes it apparent. There is a common use-case where people want 
to replicate data synchronously between two tables, and this is enabled 
by the "origin" feature.


To be clear, my bigger concern is that it's fairly easy for users to 
create a deadlock situation based on how they set "synchronous_commit" 
with the origin feature -- this is the main reason why I brought it up. 
I'm less concerned about the "cascading" case, though I want to try out 
sync rep between 3 instances to see what happens.
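
For reference, the deadlock-prone shape is when each publisher also lists
the subscription that connects to it in synchronous_standby_names, e.g. on
node A (a sketch, names illustrative; mirrored on node B):

ALTER SYSTEM SET synchronous_standby_names = 'sub_on_b_from_a';
SELECT pg_reload_conf();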



If we decide that this is a documentation issue, I'd suggest we improve
the guidance around using `synchronous_commit`[1] on the CREATE
SUBSCRIPTION page, as the GUC page[2] warns against using `local`:



Yeah, but on Create Subscription page, we have mentioned that it is
safe to use off for logical replication.


While I think you can infer that it's "safe" for synchronous 
replication, I don't think it's clear.


We say it's "safe to use `off` for logical replication", but provide a 
lengthy explanation around synchronous logical replication that 
concludes that it's "advantageous to set synchronous_commit to local or 
higher" but does not address safety. The first line in the explanation 
of the parameter links to the `synchronous_commit` GUC which 
specifically advises against using "local" for synchronous replication.


Additionally, because we say "local" or higher, we increase the risk of 
the aforementioned deadlock in HEAD with the origin feature.


I know I'm hammering on this point a bit, but it feels like this is 
relatively easy to misconfigure with the upcoming "origin" change (I did 
so myself from reading the devel docs) and we should ensure we guide our 
users appropriately.



One can use local or higher
for reducing the latency for COMMIT when synchronous replication is
used in the publisher. Won't using 'local' while creating the subscription
suffice to consistently replicate the data? I mean it
is equivalent to somebody using levels greater than local in case of
physical replication. I think in the case of physical replication, we
won't wait for standby to replicate to another node before sending a
response, so why wait in the case of logical replication? If this
understanding is correct, then probably it is sufficient to support
'local' for a subscription.


I think this is a good explanation. We should incorporate some version 
of this into the docs, and add some clarity around the use of 
`synchronous_commit` option in `CREATE SUBSCRIPTION` in particular with 
the origin feature. I can make an attempt at this.
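
As a strawman for that doc text, the recommended shape would be roughly the
following (names illustrative; the point is combining origin = none with a
subscription-level synchronous_commit of 'local'):

CREATE SUBSCRIPTION sub_on_b_from_a
    CONNECTION 'host=node_a dbname=postgres'
    PUBLICATION pub_a
    WITH (origin = none, copy_data = false, synchronous_commit = 'local');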


Perhaps another question: based on this, should we disallow setting 
values of `synchronous_commit` greater than `local` when using 
"origin=none"? That might be too strict, but maybe we should warn around 
the risk of deadlock either in the logs or in the docs.


Thanks,

Jonathan





Re: Avoiding "wrong tuple length" errors at the end of VACUUM on pg_database update (Backpatch of 947789f to v12 and v13)

2023-01-10 Thread Nathan Bossart
On Tue, Jan 10, 2023 at 02:57:43AM -0500, Tom Lane wrote:
> Michael Paquier  writes:
>> Any objections about getting 947789f applied to REL_13_STABLE and
>> REL_12_STABLE and see this issue completely gone from all the versions
>> supported?
> 
> No objections to back-patching the fix...

+1

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: fixing CREATEROLE

2023-01-10 Thread Robert Haas
On Thu, Jan 5, 2023 at 2:53 PM Robert Haas  wrote:
> On Tue, Jan 3, 2023 at 3:11 PM Robert Haas  wrote:
> > Committed and back-patched 0001 with fixes for the issues that you pointed 
> > out.
> >
> > Here's a trivial rebase of the rest of the patch set.
>
> I committed 0001 and 0002 after improving the commit messages a bit.
> Here's the remaining two patches back. I've done a bit more polishing
> of these as well, specifically in terms of fleshing out the regression
> tests. I'd like to move forward with these soon, if nobody's too
> vehemently opposed to that.

Done now.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



