Re: [PoC] pg_upgrade: allow to upgrade publisher node

2023-09-19 Thread Amit Kapila
On Wed, Sep 20, 2023 at 11:00 AM Hayato Kuroda (Fujitsu)
 wrote:
>
> Dear Amit,

+int
+count_old_cluster_logical_slots(void)
+{
+ int dbnum;
+ int slot_count = 0;
+
+ for (dbnum = 0; dbnum < old_cluster.dbarr.ndbs; dbnum++)
+ slot_count += old_cluster.dbarr.dbs[dbnum].slot_arr.nslots;
+
+ return slot_count;
+}

In this code, aren't we assuming that 'slot_arr.nslots' will be zero
for versions <=PG16? On my Windows machine, this value is not zero but
rather some uninitialized negative value which makes its caller try to
allocate some undefined memory and fail. I think you need to
initialize this in get_old_cluster_logical_slot_infos() for lower
versions.
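For illustration, a minimal sketch of that initialization (using pg_upgrade's
usual version macro; the "slots" pointer field name is assumed here, since only
"nslots" appears in the snippet above):

    /* illustrative sketch only, not the actual patch */
    if (GET_MAJOR_VERSION(old_cluster.major_version) <= 1600)
    {
        for (dbnum = 0; dbnum < old_cluster.dbarr.ndbs; dbnum++)
        {
            old_cluster.dbarr.dbs[dbnum].slot_arr.slots = NULL;  /* assumed field */
            old_cluster.dbarr.dbs[dbnum].slot_arr.nslots = 0;
        }
        return;
    }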

-- 
With Regards,
Amit Kapila.




Re: [PoC] pg_upgrade: allow to upgrade publisher node

2023-09-19 Thread Amit Kapila
On Wed, Sep 20, 2023 at 11:51 AM Dilip Kumar  wrote:
>
> On Wed, Sep 20, 2023 at 11:00 AM Hayato Kuroda (Fujitsu)
>  wrote:
> >
> > Dear Amit,
> >
> > Thank you for reviewing! PSA new version. In this version I ran pgindent 
> > again.
> >
>
> + /*
> + * There is a possibility that following records may be generated
> + * during the upgrade.
> + */
> + if (!CHECK_WAL_RECORD(rmid, info, RM_XLOG_ID, XLOG_CHECKPOINT_SHUTDOWN) &&
> + !CHECK_WAL_RECORD(rmid, info, RM_XLOG_ID, XLOG_CHECKPOINT_ONLINE) &&
> + !CHECK_WAL_RECORD(rmid, info, RM_XLOG_ID, XLOG_SWITCH) &&
> + !CHECK_WAL_RECORD(rmid, info, RM_XLOG_ID, XLOG_FPI_FOR_HINT) &&
> + !CHECK_WAL_RECORD(rmid, info, RM_XLOG_ID, XLOG_PARAMETER_CHANGE) &&
> + !CHECK_WAL_RECORD(rmid, info, RM_STANDBY_ID, XLOG_RUNNING_XACTS) &&
> + !CHECK_WAL_RECORD(rmid, info, RM_HEAP2_ID, XLOG_HEAP2_PRUNE))
> + is_valid = false;
> +
> + CHECK_FOR_INTERRUPTS();
>
> Just wondering why XLOG_HEAP2_VACUUM or other vacuum-related commands
> can not occur during the upgrade?
>

Because autovacuum is disabled during upgrade. See comment: "Use -b to
disable autovacuum" in start_postmaster().


-- 
With Regards,
Amit Kapila.




Re: CREATE FUNCTION ... SEARCH { DEFAULT | SYSTEM | SESSION }

2023-09-19 Thread Maciek Sakrejda
On Tue, Sep 19, 2023, 20:23 Maciek Sakrejda  wrote:

> I wonder if something like CURRENT (i.e., the search path at function
> creation time) might be a useful keyword addition. I can see some uses
> (more forgiving than SYSTEM but not as loose as SESSION), but I don't
> know if it would justify its presence.


I realize now this is exactly what SET search_path FROM CURRENT does. Sorry
for the noise.
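For reference, that existing syntax looks like this (function and table names
are made up):

    CREATE FUNCTION count_widgets() RETURNS bigint
    LANGUAGE sql
    SET search_path FROM CURRENT
    AS $$ SELECT count(*) FROM widgets $$;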

Regarding extensions installed in the public schema throwing a wrench in
the works, is that still a problem if the public schema is not writable? I
know that that's a new default, but it won't be forever.


Re: [PoC] pg_upgrade: allow to upgrade publisher node

2023-09-19 Thread Dilip Kumar
On Wed, Sep 20, 2023 at 11:00 AM Hayato Kuroda (Fujitsu)
 wrote:
>
> Dear Amit,
>
> Thank you for reviewing! PSA new version. In this version I ran pgindent 
> again.
>

+ /*
+ * There is a possibility that following records may be generated
+ * during the upgrade.
+ */
+ if (!CHECK_WAL_RECORD(rmid, info, RM_XLOG_ID, XLOG_CHECKPOINT_SHUTDOWN) &&
+ !CHECK_WAL_RECORD(rmid, info, RM_XLOG_ID, XLOG_CHECKPOINT_ONLINE) &&
+ !CHECK_WAL_RECORD(rmid, info, RM_XLOG_ID, XLOG_SWITCH) &&
+ !CHECK_WAL_RECORD(rmid, info, RM_XLOG_ID, XLOG_FPI_FOR_HINT) &&
+ !CHECK_WAL_RECORD(rmid, info, RM_XLOG_ID, XLOG_PARAMETER_CHANGE) &&
+ !CHECK_WAL_RECORD(rmid, info, RM_STANDBY_ID, XLOG_RUNNING_XACTS) &&
+ !CHECK_WAL_RECORD(rmid, info, RM_HEAP2_ID, XLOG_HEAP2_PRUNE))
+ is_valid = false;
+
+ CHECK_FOR_INTERRUPTS();

Just wondering why XLOG_HEAP2_VACUUM or other vacuum-related commands
can not occur during the upgrade?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Questioning an errcode and message in jsonb.c

2023-09-19 Thread Peter Eisentraut

On 18.09.23 18:55, Chapman Flack wrote:

It would make me happy if the message could be changed, and maybe
ERRCODE_INVALID_PARAMETER_VALUE also changed, perhaps to one of
the JSON-specific ones in the 2203x range.


What is an example of a statement or function call that causes this 
error?  Then we can look in the SQL standard for guidance.





Re: RFC: Logging plan of the running query

2023-09-19 Thread Lepikhov Andrei
On Tue, Sep 19, 2023, at 8:39 PM, torikoshia wrote:
> On 2023-09-15 15:21, Lepikhov Andrei wrote:
>> On Thu, Sep 7, 2023, at 1:09 PM, torikoshia wrote:
>> I have explored this patch and, by and large, agree with the code. But
>> some questions I want to discuss:
>> 1. Maybe add a hook to give a chance for extensions, like
>> pg_query_state, to do their job?
>
> Do you imagine adding a hook which enables adding custom interrupt codes 
> like below?
>
> https://github.com/postgrespro/pg_query_state/blob/master/patches/custom_signals_15.0.patch

No, I was thinking of a hook around this code that would allow us to rewrite
the pg_query_state extension, without additional core patches, on top of the
functionality provided by your patch. I mean, an extension could provide a
console UI, not only server logging, and obtain from the target backend as much
information in the explain as the instrumentation level of the current query
can give. It may make pg_query_state shorter and more stable.
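Purely for illustration (the hook name and signature are made up, not a
concrete proposal), something along these lines:

    typedef void (*LogQueryPlanInterrupt_hook_type) (const char *plan_text);
    extern PGDLLIMPORT LogQueryPlanInterrupt_hook_type LogQueryPlanInterrupt_hook;

    /* where the patch currently writes the plan text to the server log */
    if (LogQueryPlanInterrupt_hook)
        LogQueryPlanInterrupt_hook(es->str->data);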

>> 2. In this implementation, ProcessInterrupts does a lot of work during
>> the explain creation: memory allocations, pass through the plan,
>> systables locks, syscache access, etc. I guess it can change the
>> semantic meaning of 'safe place' where CHECK_FOR_INTERRUPTS can be
>> called - I can imagine a SELECT query, which would call a stored
>> procedure, which executes some DDL and acquires row exclusive lock at
>> the time of interruption. So, in my mind, it is too unpredictable to
>> make the explain in the place of interruption processing. Maybe it is
>> better to think about some hook at the end of ExecProcNode, where a
>> pending explain could be created?
>
> Yeah, I withdrew this patch once for that reason, but we are resuming 
> development in response to the results of a discussion by James and 
> others at this year's pgcon about the safety of this process in CFI:
>
> https://www.postgresql.org/message-id/CAAaqYe9euUZD8bkjXTVcD9e4n5c7kzHzcvuCJXt-xds9X4c7Fw%40mail.gmail.com

I can't follow the logic behind the decision reached at that conference. But my
concern relates to the specific case where ActiveQueryDesc points to the
top-level query and the interruption arrives during DDL wrapped in a stored
procedure. For example:
CREATE TABLE test();
CREATE OR REPLACE FUNCTION ddl() RETURNS void AS $$
BEGIN
  EXECUTE format('ALTER TABLE test ADD COLUMN x integer;');
  ...
END; $$ LANGUAGE plpgsql VOLATILE;
SELECT ddl(), ... FROM ...;

> BTW it seems pg_query_state also enables users to get running query 
> plans using CFI.
> Does pg_query_state do something for this safety concern?

No, and I'm looking for a solution that would allow pg_query_state to be 
rewritten as a clean extension, without core patches.

>> Also, I suggest minor code change with the diff in attachment.
>
> Thanks!
>
> This might be a biased opinion and objections are welcome, but I feel the 
> original patch is easier to read since PG_RETURN_BOOL(true/false) is 
> located near each case.
> Also, the existing function pg_log_backend_memory_contexts(), which does 
> the same thing, has the same form as the original patch.
I got it, thank you.

-- 
Regards,
Andrei Lepikhov




RE: [PoC] pg_upgrade: allow to upgrade publisher node

2023-09-19 Thread Hayato Kuroda (Fujitsu)
Dear Amit,

Thank you for reviewing! PSA new version. In this version I ran pgindent again.

> +#include "access/xlogdefs.h"
>  #include "common/relpath.h"
>  #include "libpq-fe.h"
> 
> The above include is not required. I have removed that and made a few
> cosmetic changes in the attached.

Yes, it is not needed anymore. It was originally introduced to use the
XLogRecPtr datatype, but that usage was removed in a recent version.

Moreover, my colleague Hou found several problems in v40. Here is a fixed
version. The bullets below list the issues found.

* Fixed to allow XLOG_SWITCH when reading records, including the initial one.
  An XLOG_SWITCH record may be inserted after the walsender exits. This occurs
  when archive_mode is set to on (or always).
* Fixed to set max_slot_wal_keep_size = -1 only when the cluster is PG17+.
  max_slot_wal_keep_size was introduced in PG13, so the previous patch could not
  upgrade from PG12 and prior.
  The setting is only needed for upgrading logical slots, so it should be set
  only on PG17 and later.
* Avoided calling binary_upgrade_validate_wal_records() when the slot is
  invalidated.
  The function raises an ERROR if the WAL record corresponding to the given LSN
  has already been removed.
  The output is like:

```
ERROR:  requested WAL segment pg_wal/00010001 has already been 
removed
```

  It is expected behavior, but we do not want to error out here, so the call is
  avoided. The upgrade still fails correctly if there are invalid slots.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



v41-0001-pg_upgrade-Allow-to-replicate-logical-replicatio.patch
Description:  v41-0001-pg_upgrade-Allow-to-replicate-logical-replicatio.patch


Re: pg_ctl start may return 0 even if the postmaster has been already started on Windows

2023-09-19 Thread Kyotaro Horiguchi
At Tue, 19 Sep 2023 13:48:55 +, "Hayato Kuroda (Fujitsu)" 
 wrote in 
> Dear Horiguchi-san,
> 
> > I added the thread to next CF entry, so let's see the how cfbot says.
> 
> At least there are several compiler warnings. E.g.,
> 
> * pgwin32_find_postmaster_pid() has "return;", but IIUC it should be "exit(1)"
> * When DWORD is printed, "%lx" should be used.
> * The variable "flags" seems not needed.

Yeah, I thought they had all been fixed, but you are right in
every respect.

> Here is a patch which suppresses warnings, whereas test would fail...
> You can use it if acceptable.

I was able to see the trouble in the CI environment, but not
locally. I'll delve deeper into this. Thank you for bringing it to my
attention.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Bug fix for psql's meta-command \ev

2023-09-19 Thread Ryoga Yoshida

On 2023-09-20 09:32, Michael Paquier wrote:

Actually there was a bit more to it in the presence of \e, that could
also get some unpredictible behaviors if some errors happen while
editing a query, which is something unlikely, still leads to strange
behaviors on failure injections.  I was considering first to move the
reset in do_edit(), but also we have the case of \e[v|f] where the
buffer has no edits so it felt a bit more natural to do that in the
upper layer like in this patch.


Indeed, similar behaviours can happen with the \e. The patch you 
committed looks good to me. Thank you.


Ryoga Yoshida




Re: Using Btree to Provide Sorting on Suffix Keys with LIMIT

2023-09-19 Thread Peter Geoghegan
On Fri, Jan 18, 2019 at 12:15 PM James Coleman  wrote:
> I'll attempt to describe a more real world scenario: suppose we have a
> schema like:
>
> users(id serial primary key)
> orders(id serial primary key, user_id integer, created_at timestamp)
>
> And wanted to find the most recent N orders for a specific group of
> users (e.g., in a report or search). Your query might look like:
>
> SELECT *
> FROM orders
> WHERE orders.user_id IN (1, 2, 3)
> ORDER BY orders.created_at DESC
> LIMIT 25
>
> Currently an index on orders(user_id, created_at) will be used for
> this query, but only to satisfy the scalar array op qual. Then all
> matching orders (say, years worth) will be fetched, a sort node will
> sort all of those results, and then a limit node will take the top N.
>
> Generalized the problem is something like "find the top N rows across
> a group of foreign keys" (though saying foreign keys probably is too
> specific).
>
> But under the scheme I'm proposing that same index would be able to
> provide both the filter and guarantee ordering as well.
>
> Does that more real-world-ish example help place the usefulness of this?

Yes. It didn't make much sense back in 2019, but I understand what you
meant now, I think. The latest version of my ScalarArrayOpExpr patch
(v2) can execute queries like this efficiently:

https://postgr.es/m/CAH2-WzkEyBU9UQM-5GWPcB=weshaukcjdvgfuqvhupuo-iy...@mail.gmail.com

Note that your example is similar to the test case from the latest
update on the thread. The test case from Benoit Tigeot, that appears
here:

https://gist.github.com/benoittgt/ab72dc4cfedea2a0c6a5ee809d16e04d?permalink_comment_id=4690491#gistcomment-4690491

You seemed to want to use an index that started with user_id/bar_fk.
But I think you'd have to have an index on "created_at DESC, user_id".
It could work the other way around with your suggested index, for a
query written to match -- "ORDER BY user_id, created_at DESC".

With an index on "created_at DESC, user_id", you'd be able to
efficiently execute your limit query. The index scan could only
terminate when it found (say) 25 matching tuples, so you might still
have to scan quite a few index pages. But, you wouldn't have to do
heap access to eliminate non-matches (with or without the VM being
set) -- you could eliminate all of those non-matches using true SAOP
index quals, that don't need to operate on known visible rows.

This is possible with the patch, despite the fact that the user_id
column is a low-order column (so this isn't one of the cases where
it's useful to "skip"). Avoiding heap hits just to eliminate
non-matching rows on user_id is what really matters here, though --
not skipping. It would be helpful if you could confirm this
understanding, though.
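Concretely, with the schema from your example, the shape I'm describing is
(illustrative only):

    CREATE INDEX ON orders (created_at DESC, user_id);

    SELECT *
    FROM orders
    WHERE orders.user_id IN (1, 2, 3)
    ORDER BY orders.created_at DESC
    LIMIT 25;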

Thanks
--
Peter Geoghegan




Re: Fix bug in VACUUM and ANALYZE docs

2023-09-19 Thread Michael Paquier
On Wed, Sep 20, 2023 at 09:43:15AM +0900, Shinya Kato wrote:
> Thanks for the patch.
> You're right. It looks good to me.

Right, it feels like there has been a lot of copy-paste in this area
of the docs.  Applied down to 16.
--
Michael


signature.asc
Description: PGP signature


Re: [PATCH] Clarify the behavior of the system when approaching XID wraparound

2023-09-19 Thread Peter Geoghegan
On Sun, May 14, 2023 at 6:06 PM Peter Geoghegan  wrote:
> BTW, Google cloud already just instruct their users to ignore the
> xidStopLimit HINT about single user mode:
>
> https://cloud.google.com/sql/docs/postgres/txid-wraparound

I read this just today, and was reminded of this thread:

https://cloud.google.com/blog/products/databases/alloydb-for-postgresql-under-the-hood-adaptive-autovacuum

It reads:

"1. Transaction ID wraparound: PostgreSQL transaction IDs or XIDs are
32-bit unsigned integers that are assigned to each transaction and
also get incremented. When they reach their maximum value, it would
wrap around to zero (similar to a ring buffer) and can lead to data
corruption."

-- 
Peter Geoghegan




Re: CREATE FUNCTION ... SEARCH { DEFAULT | SYSTEM | SESSION }

2023-09-19 Thread Maciek Sakrejda
On Tue, Sep 19, 2023 at 5:56 PM Jeff Davis  wrote:
>...
> On Tue, 2023-09-19 at 11:41 -0400, Robert Haas wrote:
> > That leads to a second idea, which is having it continue
> > to be a GUC but only affect directly-entered SQL, with all
> > indirectly-entered SQL either being stored as a node tree or having a
> > search_path property attached somewhere.
>
> That's not too far from the proposed patch and I'd certainly be
> interested to hear more and/or adapt my patch towards this idea.

As an interested bystander, that's the same thing I was thinking when
reading this. I reread your original e-mail, Jeff, and I still think
that.

I wonder if something like CURRENT (i.e., the search path at function
creation time) might be a useful keyword addition. I can see some uses
(more forgiving than SYSTEM but not as loose as SESSION), but I don't
know if it would justify its presence.

Thanks for working on this.

Thanks,
Maciek




Re: Disabling Heap-Only Tuples

2023-09-19 Thread Laurenz Albe
On Tue, 2023-09-19 at 12:52 -0400, Robert Haas wrote:
> On Tue, Sep 19, 2023 at 12:30 PM Alvaro Herrera  
> wrote:
> > I was thinking something vaguely like "a table size that's roughly what
> > an optimal autovacuuming schedule would leave the table at" assuming 0.2
> > vacuum_scale_factor.  You would determine the absolute minimum size for
> > the table given the current live tuples in the table, then add 20% to
> > account for a steady state of dead tuples and vacuumed space.  So it's
> > not 1.2x of the "current" table size at the time the local_update_limit
> > feature is installed, but 1.2x of the optimal table size.
> 
> Right, that would be great. And honestly if that's something we can
> figure out, then why does the parameter even need to be an integer
> instead of a Boolean? If the system knows the optimal table size, then
> the user can just say "try to compact this table" and need not say to
> what size. The 1.2 multiplier is probably situation dependent and
> maybe the multiplier should indeed be a configuration parameter, but
> we would be way better off if the absolute size didn't need to be.

I don't have high hopes for a reliable way to automatically determine
the target table size.  There are these queries floating around to estimate
table bloat, which are used by various monitoring systems.  I find that they
get it right a lot of the time, but sometimes they get it wrong.  Perhaps
we can do better than that, but I vastly prefer a setting that I can control
(even at the danger that I can misconfigure it) over an automatism that I
cannot control and that sometimes gets it wrong.

I like Alvaro's idea to automatically reset "local_update_limit" when the
table has shrunk enough.  Why not perform that task during vacuum truncation?
If vacuum truncation has taken place, check if the table size is no bigger
than "local_update_limit" * (1 + "autovacuum_vacuum_scale_factor"), and if
it is no bigger, reset "local_update_limit".  That way, we would not have
to worry about a lock, because vacuum truncation already has the table locked.
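A rough pseudo-code sketch of that check, with made-up names (not from any
actual patch):

    /* at the end of vacuum's truncation step, table still locked */
    if (did_truncate &&
        rel_pages <= local_update_limit * (1.0 + autovacuum_vacuum_scale_factor))
        reset_local_update_limit(rel);    /* clear the reloption */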

Yours,
Laurenz Albe




Re: remaining sql/json patches

2023-09-19 Thread Amit Langote
On Tue, Sep 19, 2023 at 9:00 PM Amit Langote  wrote:
> On Tue, Sep 19, 2023 at 7:37 PM jian he  wrote:
> >  ---
> > https://www.postgresql.org/docs/current/extend-type-system.html#EXTEND-TYPES-POLYMORPHIC
> > >>  When the return value of a function is declared as a polymorphic type, 
> > >> there must be at least one argument position that is also
> > >> polymorphic, and the actual data type(s) supplied for the polymorphic 
> > >> arguments determine the actual result type for that call.
> >
> > select json_query(jsonb'{"a":[{"a":[2,3]},{"a":[4,5]}]}','$.a[*].a?(@<=3)'
> > returning anyrange);
> > should fail. Now it returns NULL. Maybe we can validate it in
> > transformJsonFuncExpr?
> > ---
>
> I'm not sure whether we should make the parser complain about the
> weird types being specified in RETURNING.

Sleeping over this, maybe adding the following to
transformJsonOutput() does make sense?

+   if (get_typtype(ret->typid) == TYPTYPE_PSEUDO)
+       ereport(ERROR,
+               errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+               errmsg("returning pseudo-types is not supported in SQL/JSON functions"));
+

-- 
Thanks, Amit Langote
EDB: http://www.enterprisedb.com




Re: pg_rewind with cascade standby doesn't work well

2023-09-19 Thread Michael Paquier
On Wed, Sep 20, 2023 at 11:46:45AM +0900, Kuwamura Masaki wrote:
> I also found a bug (maybe). If we call `CreateRestartPoint()` during
> crash-recovery, the assertion fails at ComputeXidHorizon() in procarray.c.
> It's inherently orthogonal to the problem I already reported. So you can
> reproduce this at HEAD with this procedure.
> 
> 1. Start primary and standby server
> 2. Modify checkpoint_timeout to 1h on standby
> 3. Insert 10^10 records and concurrently run CHECKPOINT every second on
> primary
> 4. Do an immediate stop on both standby and primary at the end of the insert
> 5. Modify checkpoint_timeout to 30 on standby
> 6. Remove standby.signal on standby
> 7. Restart standby (it will start crash-recovery)
> 8. Assertion failure is raised shortly
> 
> I think this is because `TruncateSUBTRANS();`  in `CreateRestartPoint()` is
> called but `StartupSUBTRANS()` isn't called yet. In `StartupXLOG()`, we
> call
> `StartupSUBTRANS()` if `(ArchiveRecoveryRequested && EnableHotStandby)`.
> However, in `CreateRestartPoint()`, we call `TruncateSUBTRANS()` if
> `(EnableHotStandby)`. I guess the difference causes this bug: the latter
> can be called even during crash recovery, while the former isn't.
> The attached patch 0002 fixes it. I think we could discuss this bug in
> another thread if needed.

This is a known issue.  I guess that the same as this thread and this
CF entry:
https://commitfest.postgresql.org/44/4244/
https://www.postgresql.org/message-id/flat/zarvomifjze7f...@paquier.xyz
--
Michael


signature.asc
Description: PGP signature


Re: Disabling Heap-Only Tuples

2023-09-19 Thread Laurenz Albe
On Tue, 2023-09-19 at 14:50 -0400, Robert Haas wrote:
> But I know people will try to use it for instant compaction too, and
> there it's worth remembering why we removed old-style VACUUM FULL. The
> main problem is that it was mind-bogglingly slow. The other really bad
> problem is that it caused massive index bloat. I think any system
> that's based on moving around my tuples right now to make my table
> smaller right now is likely to have similar issues.

I had the same feeling that this is sort of bringing back old-style
VACUUM (FULL).  But I don't think that index bloat is a show stopper
these days, when we have REINDEX CONCURRENTLY, so I am not worried.

Yours,
Laurenz Albe




Re: pg_rewind with cascade standby doesn't work well

2023-09-19 Thread Kuwamura Masaki
>> IMO a test is needed that makes sure no one is going to break this in
>> the future.
>
> You definitely need more complex test scenarios for that.  If you can
> come up with new ways to make the TAP tests of pg_rewind mode modular
> in handling more complicated node setups, that would be a nice
> addition, for example.

I'm sorry for the lack of tests. For now, I started off with a simple test
that causes the problem I mentioned. The updated WIP patch 0001 includes
the new test for pg_rewind.
Also, I'm afraid I'm not sure what kind of tests I should write to cover
this fix. Would you mind giving me some advice?

> This patch is at least incorrect in its handling of crash recovery,
> because these two should *never* be set in this case as we want to
> replay up to the end of WAL.  For example, see xlog.c or the top of
> xlogrecovery.c about the assumptions behind these variables:
> /* crash recovery should always recover to the end of WAL */
> ControlFile->minRecoveryPoint = InvalidXLogRecPtr;
> ControlFile->minRecoveryPointTLI = 0;
>
> If an end-of-recovery record is replayed during crash recovery, these
> assumptions are plain broken.

That makes sense! I really appreciate your knowledgeable review.

> One thing that we could consider is to be more aggressive with
> restartpoints when replaying this record for a standby, see a few
> lines above the lines added by your patch, for example.  And we could
> potentially emulate a post-promotion restart point to get a refresh of
> the control file as it should, with the correct code paths involved in
> the updates of minRecoveryPoint when the checkpointer does the job.

I'm not confident, but did you mean we could create a restartpoint
(i.e., call `RequestCheckpoint()`) instead of what my old patch did?
Patch 0001 also reflects my understanding.

I also found a bug (maybe). If we call `CreateRestartPoint()` during
crash-recovery, the assertion fails at ComputeXidHorizon() in procarray.c.
It's inherently orthogonal to the problem I already reported. So you can
reproduce this at HEAD with this procedure.

1. Start primary and standby server
2. Modify checkpoint_timeout to 1h on standby
3. Insert 10^10 records and concurrently run CHECKPOINT every second on
primary
4. Do an immediate stop on both standby and primary at the end of the insert
5. Modify checkpoint_timeout to 30 on standby
6. Remove standby.signal on standby
7. Restart standby (it will start crash-recovery)
8. Assertion failure is raised shortly

I think this is because `TruncateSUBTRANS()` in `CreateRestartPoint()` is
called but `StartupSUBTRANS()` hasn't been called yet. In `StartupXLOG()`, we
call `StartupSUBTRANS()` if `(ArchiveRecoveryRequested && EnableHotStandby)`.
However, in `CreateRestartPoint()`, we call `TruncateSUBTRANS()` if
`(EnableHotStandby)`. I guess the difference causes this bug: the latter
can be called even during crash recovery, while the former isn't.
The attached patch 0002 fixes it. I think we could discuss this bug in
another thread if needed.
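For reference, the two call sites differ roughly like this (paraphrased from
the current sources):

    /* StartupXLOG(): subtrans is only set up for archive recovery + hot standby */
    if (ArchiveRecoveryRequested && EnableHotStandby)
        StartupSUBTRANS(oldestActiveXID);

    /* CreateRestartPoint(): gated only on hot standby, so it can also run
     * during crash recovery, before StartupSUBTRANS() has ever been called */
    if (EnableHotStandby)
        TruncateSUBTRANS(GetOldestTransactionIdConsideredRunning());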

Best regards!

Masaki Kuwamura


v2-0001-pg_rewind-Fix-bug-using-cascade-standby-as-source.patch
Description: Binary data


v1-0002-Fix-restartpoint-during-crash-recovery.patch
Description: Binary data


Re: Inefficiency in parallel pg_restore with many tables

2023-09-19 Thread Nathan Bossart
On Mon, Sep 18, 2023 at 02:22:32PM -0700, Nathan Bossart wrote:
> For now, I've committed 0001 and 0002.  I intend to commit the others soon.

I've committed the rest of the patches.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Move global variables of pgoutput to plugin private scope.

2023-09-19 Thread Peter Smith
Hi Hou-san.

Given there are some issues raised about the 0001 patch [1] I am
skipping that one until I see the replies.

Meanwhile, here are some review comments for the patches v1-0002 and v1-0003


v1-0002

==
Commit message

1.
The pgoutput module uses a global variable(publish_no_origin) to cache the
action for the origin filter. But we only initialize publish_no_origin when
user specifies the "origin" in the output paramters which means we could refer
to an uninitialized variable if user didn't specify the paramter.

~

1a.

typos
/variable(publish_no_origin)/variable (publish_no_origin)/
/paramters/parameters/
/paramter./parameter./

~

1b.
"...we could refer to an uninitialized variable"

I'm not sure what this means. Previously it was static, so it wouldn't
be "uninitialised"; it would be false. Perhaps there might be a stale
value from a previous pgoutput, but IIUC that's the point made by your
next paragraph ("Besides, we don't...")

~~~

2.
To improve it, the patch stores the map within the private data of the output
plugin so that it will get initialized and reset along with the output plugin
context.

2a.
/To improve it,/To fix this/

~

2b.
"stores the map"

What map? This might be a cut/paste error from the v1-0001 patch comment.


v1-0003

==
Commit message

1.
Missing patch comment.

==
src/backend/replication/pgoutput/pgoutput.c

2. maybe_send_schema

- if (in_streaming)
+ if (data->in_streaming)
  set_schema_sent_in_streamed_txn((PGOutputData *) ctx->output_plugin_private,
  relentry, topxid);
~

Since you added a new 'data' variable, you might as well make use of
it here instead of doing "(PGOutputData *) ctx->output_plugin_private"
again.

==
src/include/replication/pgoutput.h

3.
  MemoryContext cachectx; /* private memory context for cache data */

+ bool in_streaming;
+

Even though there was no comment previously when this was static, IMO
it is better to comment on all the structure fields where possible.
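For example, something like this (the comment wording is only a suggestion):

    MemoryContext cachectx;    /* private memory context for cache data */

    bool    in_streaming;    /* true while streaming a chunk of a transaction */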

--
[1] https://www.postgresql.org/message-id/ZQk1Ca_eFDTmBiZy%40paquier.xyz

Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Fix bug in VACUUM and ANALYZE docs

2023-09-19 Thread Shinya Kato

On 2023-09-19 17:59, Ryoga Yoshida wrote:

Hi,

Issue1:
VACUUM and ANALYZE docs explain that the parameter of
BUFFER_USAGE_LIMIT is optional as follows. But this is not true. The
argument, size, is required for BUFFER_USAGE_LIMIT. So the docs should
be fixed to address this issue.
BUFFER_USAGE_LIMIT [ size ]
https://www.postgresql.org/docs/devel/sql-vacuum.html
https://www.postgresql.org/docs/devel/sql-analyze.html

Issue2:
Sizes may also be specified as a string containing the numerical size
followed by any one of the following memory units: kB (kilobytes), MB
(megabytes), GB (gigabytes), or TB (terabytes).
VACUUM and ANALYZE docs explain that the argument of
BUFFER_USAGE_LIMIT accepts the units like kB (kilobytes), MB
(megabytes), GB (gigabytes), or TB (terabytes). But it also actually
accepts B (bytes) as a unit. So the docs should include "B (bytes)" as
a unit that the argument of BUFFER_USAGE_LIMIT can accept.
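For example (illustrative; the table name is made up), assuming bytes are
accepted as described above, both of these work:

    VACUUM (BUFFER_USAGE_LIMIT '512kB') mytab;
    VACUUM (BUFFER_USAGE_LIMIT '262144B') mytab;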

You can see the patch in the attached file.


Thanks for the patch.
You're right. It looks good to me.

--
Regards,
Shinya Kato
NTT DATA GROUP CORPORATION




Re: pg_upgrade and logical replication

2023-09-19 Thread Michael Paquier
On Tue, Sep 19, 2023 at 07:14:49PM +0530, vignesh C wrote:
> Here is a patch to set max_logical_replication_workers as 0 while the
> server is started to prevent the launcher from being started. Since
> this configuration is present from v10, no need for any version check.
> I have done upgrade tests for v10-master, v11-master, ... v16-master
> and found it to be working fine.

The project policy is to support pg_upgrade for 10 years, and 9.6 was
released in 2016:
https://www.postgresql.org/docs/9.6/release-9-6.html

>  snprintf(cmd, sizeof(cmd),
> - "\"%s/pg_ctl\" -w -l \"%s/%s\" -D \"%s\" -o \"-p %d -b%s %s%s\" 
> start",
> + "\"%s/pg_ctl\" -w -l \"%s/%s\" -D \"%s\" -o \"-p %d -b%s 
> %s%s%s\" start",
>   cluster->bindir,
>   log_opts.logdir,
>   SERVER_LOG_FILE, cluster->pgconfig, cluster->port,
>   (cluster == &new_cluster) ?
>   " -c synchronous_commit=off -c fsync=off -c 
> full_page_writes=off" : "",
> + " -c max_logical_replication_workers=0",
>   cluster->pgopts ? cluster->pgopts : "", socket_string);
>  
>  /*

And this code path is used to start postmaster instances for old and
new clusters.  So it seems to me that it is incorrect if this is not
conditional based on the cluster version.
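For illustration, something along these lines (untested; the version cutoff is
only an example):

    (GET_MAJOR_VERSION(cluster->major_version) >= 1000) ?
        " -c max_logical_replication_workers=0" : "",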
--
Michael


signature.asc
Description: PGP signature


Re: Bug fix for psql's meta-command \ev

2023-09-19 Thread Michael Paquier
On Tue, Sep 19, 2023 at 01:23:54PM +0300, Aleksander Alekseev wrote:
> You are right, I missed it. Your patch is correct while the original
> one is not quite so.

Actually there was a bit more to it in the presence of \e, that could
also get some unpredictable behaviors if some errors happen while
editing a query, which is unlikely, but still leads to strange
behaviors on failure injections.  I first considered moving the
reset into do_edit(), but we also have the case of \e[v|f] where the
buffer has no edits so it felt a bit more natural to do that in the
upper layer like in this patch.

Another aspect of all these code paths is the line number that can
optionally be given after an object name for \e[v|f] or a file name for
\e (in the latter case it is possible to have a line number without a
file name, as well).  Anyway, we only fill the query buffer after
validating all the options at hand.  So, while the status is set to
PSQL_CMD_ERROR, we'd still do a reset of the query buffer but nothing
got added to it yet.

I've also considered a backpatch for this change, but in the end
discarded that option, at least for now.  I don't think that anyone is
relying on the existing behavior of having the query buffer still
around on failure when \ev or \ef fail their object lookup, as the
contents are undefined in that case, but this change is not critical
enough to be backpatched in case somebody has actually been relying
on the previous behavior.  I'm OK to revisit this choice later on
depending on the feedback, though.
--
Michael


signature.asc
Description: PGP signature


Re: CREATE FUNCTION ... SEARCH { DEFAULT | SYSTEM | SESSION }

2023-09-19 Thread Jeff Davis
On Tue, 2023-09-19 at 11:41 -0400, Robert Haas wrote:
> I agree this is a mess, and that documenting the mess better would be
> good. But instead of saying not to do something, we need to say what
> will happen if you do the thing. I'm regularly annoyed when somebody
> reports that "I tried to do X and it didn't work," instead of saying
> what happened when they tried, and this situation is another form of
> the same thing. "If you do X, then Y will or can occur" is much
> better
> than "do not do X".

Good documentation includes some guidance. Sure, it should describe the
system behavior, but without anchoring it to some kind of expected use
case, it can be equally frustrating.

Allow me to pick on this example which came up in a recent thread:

"[password_required] Specifies whether connections to the publisher
made as a result of this subscription must use password authentication.
This setting is ignored when the subscription is owned by a superuser.
The default is true. Only superusers can set this value to false."
 -- https://www.postgresql.org/docs/16/sql-createsubscription.html

Only superusers can set it, and it's ignored for superusers. That does
a good job of describing the actual behavior, but is a bit puzzling.

I guess what the user is supposed to do is one of:
  1. Create a subscription as a superuser with the right connection
string (without a password) and password_required=false, then reassign
it to a non-superuser; or
  2. Create a subscription as a non-superuser member of
pg_create_subscription using a bogus connection string, then a
superuser can alter it to set password_required=false, then alter the
connection string; or
  3. Create a superuser, let the new superuser create a subscription
with password_required=false, and then remove their superuser status.

so why not just document one of those things as the expected thing to
do? Not a whole section or anything, but a sentence to suggest what
they should do or where else they should look.
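For instance, option (1) would amount to something like this (names are made
up):

    -- as a superuser: no password in the connection string
    CREATE SUBSCRIPTION sub1
        CONNECTION 'host=publisher.internal dbname=appdb'
        PUBLICATION pub1
        WITH (password_required = false);

    -- then hand it over to a non-superuser
    ALTER SUBSCRIPTION sub1 OWNER TO app_owner;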

I don't mean to set some major new standard in the documentation that
should apply to everything; but for the privilege system, even hackers
are having trouble keeping up (myself included). A bit of guidance
toward supported use cases helps a lot.

> I fear it will be hard to come up with something that is
> clear, that highlights the severity of the problems, and that does
> not
> veer off into useless vitriol against the status quo, but if we can
> get there, that would be good.

I hope what I'm saying is not useless vitriol. I am offering the best
solutions I see in a bad situation. And I believe I've uncovered some
emergent behaviors that are not well-understood even among prominent
hackers.

> That leads to a second idea, which is having it continue
> to be a GUC but only affect directly-entered SQL, with all
> indirectly-entered SQL either being stored as a node tree or having a
> search_path property attached somewhere.

That's not too far from the proposed patch and I'd certainly be
interested to hear more and/or adapt my patch towards this idea.

>  Or, as a third idea, suppose
> we leave it a GUC but start breaking semantics around where and how
> that GUC gets set, e.g. by changing CREATE FUNCTION to capture the
> prevailing search_path by default unless instructed otherwise.

How would one instruct otherwise?

> Personally I feel like we'd need pretty broad consensus for any of
> these kinds of changes

+1

>  because it would break a lot of stuff for a lot
> of people, but if we could get that then I think we could maybe
> emerge
> in a better spot once the pain of the compatibility break receded.

Are there ways we can soften this a bit? I know compatibility GUCs are
not to be added lightly, but perhaps one is justified here?

> Another option is something around sandboxing and/or function trust.
> The idea here is to not care too much about the search_path behavior
> itself, and instead focus on the consequences, namely what code is
> getting executed as which user and perhaps what kinds of operations
> it's performing.

I'm open to discussing that further, and it certainly may solve some
problems, but it does not seem to solve the fundamental problem with
search_path: that the caller can (intentionally or unintentionally)
cause a function to do unexpected things.

Sometimes an unexpected thing is not the kind of thing that would be
caught by a sandbox, e.g. just an unexpected function result. But if
that function is used in a constraint or expression index, that
unexpected result can lead to a violated constraint or a bad index
(that will later cause wrong results). The owner of the table might
reasonably consider that a privilege problem, if the user who causes
the trouble had only INSERT privileges.

> Are there any other categories of things we can do? More specific
> kinds of things we can do in each category? I don't really see an
> option other than (1) "change something in the system design so that
> people use search_path 

Re: dikkop seems unhappy because of openssl stuff (FreeBSD 14-BETA1)

2023-09-19 Thread Tom Lane
Tomas Vondra  writes:
> bsd@freebsd:~ $ tclsh8.6
> % clock scan "1/26/2010"
> time value too large/small to represent

In hopes of replicating this, I tried installing FreeBSD 14-BETA2
aarch64 on my Pi 3B.  This test case works fine:

$ tclsh8.6
% clock scan "1/26/2010"
1264482000

$ tclsh8.7
% clock scan "1/26/2010"
1264482000

and unsurprisingly, pltcl's regression tests pass.  I surmise
that something is broken in BETA1 that they fixed in BETA2.

plpython works too, with the python 3.9 package (and no older
python).

However, all is not peachy, because plperl doesn't work.
Trying to CREATE EXTENSION either plperl or plperlu leads
to a libperl panic:

pl_regression=# create extension plperl;
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Succeeded.

with this in the postmaster log:

panic: pthread_key_create failed

That message is certainly not ours, so it must be coming out of libperl.

Another thing that seemed strange is that ecpg's preproc.o takes
O(forever) to compile.  I killed the build after observing that the
compiler had gotten to 40 minutes of CPU time, and redid that step
with PROFILE=-O0, which allowed it to compile in 20 seconds or so.
(I also tried -O1, but gave up after a few minutes.)  This machine
can compile the main backend grammar in a minute or two, so there is
something very odd there.

I'm coming to the conclusion that 14-BETA is, well, beta grade.
I'll be interested to see if you get the same results when you
update to BETA2.

regards, tom lane




Re: [PATCH] Add native windows on arm64 support

2023-09-19 Thread Michael Paquier
On Tue, Sep 19, 2023 at 01:35:17PM +0100, Anthony Roberts wrote:
> Was there an explicit request for something there? I was under the
> impression that this was all just suggestion/theory at the moment.

Yes.  The addition of a buildfarm member registered into the community
facilities, so as it is possible to provide back to developers dynamic
feedback to the new configuration setup you would like to see
supported in PostgreSQL.  This has been mentioned a few times on this
thread, around these places:
https://www.postgresql.org/message-id/20220322103011.i6z2tuj4hle23wgx@jrouhaud
https://www.postgresql.org/message-id/dbd80715-e56b-40eb-84aa-ace70198e...@yesql.se
https://www.postgresql.org/message-id/6d1e2719-fa49-42a5-a6ff-0b0072bf6...@yesql.se
--
Michael


signature.asc
Description: PGP signature


Re: [PATCH] Add inline comments to the pg_hba_file_rules view

2023-09-19 Thread Jim Jones

Hi Michael

On 16.09.23 06:18, Michael Paquier wrote:

That was the idea. I forgot about strpos(), but if you do that, do we
actually need a function in core to achieve that? 
I guess it depends who you ask :) I personally think it would be a good 
addition to the view, as it would provide a more comprehensive look into 
the hba file. Yes, the fact that it could possibly be written in SQL 
sounds silly, but it's IMHO still relevant to have it by default.
There are a few fancy cases with the SQL function you have sent, 
actually.. strpos() would grep the first '#' character, ignoring 
quoted areas.


Yes, you're totally right. I didn't take into account any token 
surrounded by double quotes containing #.


v3 attached addresses this issue.

From the following hba:

 host db jim 192.168.10.1/32 md5 # foo
 host db jim 192.168.10.2/32 md5 #bar
 host db jim 192.168.10.3/32 md5 # #foo#
 host "a#db" "a#user" 192.168.10.4/32 md5 # fancy #hba entry

We can get these records from the view:

 SELECT type, database, user_name, address, comment
 FROM pg_hba_file_rules
 WHERE address ~~ '192.168.10.%';

 type | database | user_name |   address    |     comment
------+----------+-----------+--------------+------------------
 host | {db}     | {jim}     | 192.168.10.1 | foo
 host | {db}     | {jim}     | 192.168.10.2 | bar
 host | {db}     | {jim}     | 192.168.10.3 | #foo#
 host | {a#db}   | {a#user}  | 192.168.10.4 | fancy #hba entry


I am still struggling to find a way to enable this function in a separate 
path without having to read the conf file multiple times or writing too 
much redundant code. How many other conf files do you think would profit 
from this feature?


Jim
From 47f55bab0a8e8af286e6be2f40d218f25a5066c9 Mon Sep 17 00:00:00 2001
From: Jim Jones 
Date: Wed, 20 Sep 2023 00:04:05 +0200
Subject: [PATCH v3] Add inline comments to the pg_hba_file_rules view

This patch proposes the column "comment" to the pg_hba_file_rules view.
It basically parses the inline comment (if any) of a valid pg_hba.conf
entry and displays it in the new column. The patch slightly changes the
test 004_file_inclusion.pl to accommodate the new column and the hba comments.
The view's documentation at system-views.sgml is extended accordingly.

The function GetInlineComment() was added to conffiles.c to avoid adding more
complexity directly to hba.c. It also enables this feature to be used by other
configuration files, if necessary.
---
 doc/src/sgml/system-views.sgml|  9 +++
 src/backend/utils/adt/hbafuncs.c  | 11 ++-
 src/backend/utils/misc/conffiles.c| 41 ++
 src/include/catalog/pg_proc.dat   |  6 +-
 src/include/libpq/hba.h   |  1 +
 src/include/utils/conffiles.h |  1 +
 .../authentication/t/004_file_inclusion.pl| 74 +--
 src/test/regress/expected/rules.out   |  3 +-
 8 files changed, 135 insertions(+), 11 deletions(-)

diff --git a/doc/src/sgml/system-views.sgml b/doc/src/sgml/system-views.sgml
index 2b35c2f91b..68f9857de0 100644
--- a/doc/src/sgml/system-views.sgml
+++ b/doc/src/sgml/system-views.sgml
@@ -1090,6 +1090,15 @@
   
  
 
+
+  
+   comment text
+  
+  
+   Text after the first # comment character in the end of a valid pg_hba.conf entry, if any
+  
+ 
+
  
   
error text
diff --git a/src/backend/utils/adt/hbafuncs.c b/src/backend/utils/adt/hbafuncs.c
index 73d3ad1dad..929678e97e 100644
--- a/src/backend/utils/adt/hbafuncs.c
+++ b/src/backend/utils/adt/hbafuncs.c
@@ -22,6 +22,7 @@
 #include "utils/array.h"
 #include "utils/builtins.h"
 #include "utils/guc.h"
+#include "utils/conffiles.h"
 
 
 static ArrayType *get_hba_options(HbaLine *hba);
@@ -159,7 +160,7 @@ get_hba_options(HbaLine *hba)
 }
 
 /* Number of columns in pg_hba_file_rules view */
-#define NUM_PG_HBA_FILE_RULES_ATTS	 11
+#define NUM_PG_HBA_FILE_RULES_ATTS	 12
 
 /*
  * fill_hba_line
@@ -191,6 +192,7 @@ fill_hba_line(Tuplestorestate *tuple_store, TupleDesc tupdesc,
 	const char *addrstr;
 	const char *maskstr;
 	ArrayType  *options;
+	char   *comment;
 
 	Assert(tupdesc->natts == NUM_PG_HBA_FILE_RULES_ATTS);
 
@@ -346,6 +348,13 @@ fill_hba_line(Tuplestorestate *tuple_store, TupleDesc tupdesc,
 			values[index++] = PointerGetDatum(options);
 		else
 			nulls[index++] = true;
+
+		/* comments */
+		comment = GetInlineComment(hba->rawline);
+		if(comment)
+			values[index++] = CStringGetTextDatum(comment);
+		else
+			nulls[index++] = true;
 	}
 	else
 	{
diff --git a/src/backend/utils/misc/conffiles.c b/src/backend/utils/misc/conffiles.c
index 376a5c885b..feda49bf31 100644
--- a/src/backend/utils/misc/conffiles.c
+++ b/src/backend/utils/misc/conffiles.c
@@ -25,6 +25,47 @@
 #include "storage/fd.h"
 #include "utils/conffiles.h"
 
+/*
+ * GetInlineComment
+ *
+ * This function returns comments of a given config file line,
+ * if there is any. A comment is any text after the # ch

Re: Disabling Heap-Only Tuples

2023-09-19 Thread Robert Haas
On Tue, Sep 19, 2023 at 2:20 PM Matthias van de Meent
 wrote:
> Mostly agreed, but I think there's a pitfall here. You seem to assume
> we have a perfect oracle that knows the optimal data size, but we
> already know that our estimates can be significantly off. I don't
> quite trust the statistics enough to do any calculations based on the
> number of tuples in the relation. That also ignores the fact that we
> don't actually have any good information about the average size of the
> tuples in the table. So with current statistics, any automated "this
> is how large the table should be" decisions would result in an
> automated footgun, instead of the current patch's where the user has
> to decide to configure it to an explicit value.

I'm not assuming that there's an oracle here. I'm hoping that there's
some way that we can construct one. If we can't, then I think we're
asking the user to figure out a value that we don't have any idea how
to compute ourselves. And I think that kind of thing is usually a bad
idea. It's reasonable to ask the user for input when they know
something relevant that we can't know, like how large they think their
database will get, or what hardware they're using. But it's not
reasonable to essentially hope that the user is smarter than we are.
That's leaving our job half-undone and forcing the user into coping
with the result. And note that the value we need here is largely about
the present, not the future. The question is "how small can the table
be practically made right now?". And there is no reason at all to
suppose that the user is better-placed to answer that question than
the database itself.

> But about that: I'm not sure what the "footgun" is that you've
> mentioned recently?
> The issue with excessive bloat (when the local_update_limit is set too
> small and fillfactor is low) was fixed in the latest patch nearly
> three weeks ago, so the only remaining issue with misconfiguration is
> slower updates. Sure, that's not great, but in my opinion not a
> "footgun": performance returns immediately after resetting
> local_update_limit, and no space was lost.

That does seem like a very good change, but I'm not convinced that it
solves the whole problem. I would agree with your argument if the only
downside of enabling the feature were searching the FSM, failing to
find a suitable free page, and falling back to a HOT update. Such a
thing might be slow, but it won't cause any bloat, and as you say, if
the feature doesn't do what you want, don't use it. But I think the
feature can still cause bloat.

If we're using this feature on a reasonably heavily-updated table,
then sometimes when we check whether any low-numbered pages have free
space, it will turn out that one of them does. This will happen even
if local_update_limit is set far too low, because the table is
heavily-updated, and sometimes that means tuples are moving around,
leaving holes. So when there is a hole, i.e. just by luck we happen to
find some space on a low-numbered page, we'll suffer the cost of a
non-HOT update to move that tuple to an earlier page of the relation.
However, there's a good chance that the next time we update that
tuple, the page will have become completely full, because everybody's
furiously trying to jam as many tuples as possible into those
low-numbered pages, so now the tuple will have to bounce to some
higher-numbered page.

So I think what will happen if the local update limit is set too low,
and the table is actually being updated a lot, is that we'll just
uselessly do a bunch of HOT updates on high-numbered pages as non-HOT,
which will fill up low-numbered pages turning even potentially HOT
updates on those pages to non-HOT as well. Doing a bunch of updates
that could have been HOT as non-HOT can for sure cause index bloat. It
could maybe also cause table bloat, because if we'd done the updates
as HOT, we would have been able to recover the line pointers via
HOT-pruning, but since we turned them into non-HOT updates, we have to
wait for vacuum, which is comparatively much less frequent.

I'm not quite sure how bad this residual problem is. It's certainly a
lot better if a failed attempt to move a tuple earlier can turn into a
normal HOT update instead of a non-HOT update. But I don't think it
completely eliminates the problem of useless tuple movement either.

As Andres points out, I think rightly, we should really be thinking
about ways to guide this behavior other than a page number. As you
point out, there's no guarantee that we can know the right page
number. If we can, cool. But there are other approaches too. He
mentions looking at how full the FSM is, which seems like an
interesting idea although surely we don't want every backend
repeatedly iterating over the FSM to recompute statistics. I wonder if
there are other good ideas we haven't thought of yet. Certainly, if
you found that you were frequently being forced to move tuples to
higher-numbered pages for lack of space anywhere

Re: Extensible storage manager API - SMGR hook Redux

2023-09-19 Thread Tristan Partin

Found these warnings while compiling while only 0001 is applied.

[1166/2337] Compiling C object src/backend/postgres_lib.a.p/storage_smgr_md.c.o
../src/backend/storage/smgr/md.c: In function ‘mdexists’:
../src/backend/storage/smgr/md.c:224:28: warning: passing argument 1 of 
‘mdopenfork’ from incompatible pointer type [-Wincompatible-pointer-types]
 224 | return (mdopenfork(reln, forknum, EXTENSION_RETURN_NULL) != 
NULL);
 |^~~~
 ||
 |SMgrRelation {aka SMgrRelationData *}
../src/backend/storage/smgr/md.c:167:43: note: expected ‘MdSMgrRelation’ {aka 
‘MdSMgrRelationData *’} but argument is of type ‘SMgrRelation’ {aka 
‘SMgrRelationData *’}
 167 | static MdfdVec *mdopenfork(MdSMgrRelation reln, ForkNumber forknum, int 
behavior);
 |~~~^~~~
../src/backend/storage/smgr/md.c: In function ‘mdcreate’:
../src/backend/storage/smgr/md.c:287:40: warning: passing argument 1 of 
‘register_dirty_segment’ from incompatible pointer type 
[-Wincompatible-pointer-types]
 287 | register_dirty_segment(reln, forknum, mdfd);
 |^~~~
 ||
 |SMgrRelation {aka 
SMgrRelationData *}
../src/backend/storage/smgr/md.c:168:51: note: expected ‘MdSMgrRelation’ {aka 
‘MdSMgrRelationData *’} but argument is of type ‘SMgrRelation’ {aka 
‘SMgrRelationData *’}
 168 | static void register_dirty_segment(MdSMgrRelation reln, ForkNumber 
forknum,

Here is a diff to be applied to 0001 which fixes the warnings that get 
generated when compiling. I did see that one of the warnings gets fixed in 
0002 (the mdexists() one). I am assuming that change was just missed 
while rebasing the patchset or something. I did not see a fix for 
mdcreate() in 0002, however.

--
Tristan Partin
Neon (https://neon.tech)
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index f3e4768160..fdc9f62fdf 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -213,6 +213,8 @@ mdinit(void)
 bool
 mdexists(SMgrRelation reln, ForkNumber forknum)
 {
+	MdSMgrRelation mdreln = (MdSMgrRelation) reln;
+
 	/*
 	 * Close it first, to ensure that we notice if the fork has been unlinked
 	 * since we opened it.  As an optimization, we can skip that in recovery,
@@ -221,7 +223,7 @@ mdexists(SMgrRelation reln, ForkNumber forknum)
 	if (!InRecovery)
 		mdclose(reln, forknum);
 
-	return (mdopenfork(reln, forknum, EXTENSION_RETURN_NULL) != NULL);
+	return (mdopenfork(mdreln, forknum, EXTENSION_RETURN_NULL) != NULL);
 }
 
 /*
@@ -284,7 +286,7 @@ mdcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo)
 	mdfd->mdfd_segno = 0;
 
 	if (!SmgrIsTemp(reln))
-		register_dirty_segment(reln, forknum, mdfd);
+		register_dirty_segment(mdreln, forknum, mdfd);
 }
 
 /*


Re: Improving btree performance through specializing by key shape, take 2

2023-09-19 Thread Peter Geoghegan
On Tue, Sep 19, 2023 at 6:28 AM Matthias van de Meent
 wrote:
> > To be clear, page deletion does what I described here (it does an
> > in-place update of the downlink to the deleted page, so the same pivot
> > tuple now points to its right sibling, which is our page of concern),
> > in addition to fully removing the original pivot tuple whose downlink
> > originally pointed to our page of concern. This is why page deletion
> > makes the key space "move to the right", very much like a page split
> > would.
>
> I am still aware of this issue, and I think we've discussed it in
> detail earlier. I think it does not really impact this patchset. Sure,
> I can't use dynamic prefix compression to its full potential, but I
> still do get serious performance benefits:

Then why have you linked whatever the first patch does with the high
key to dynamic prefix compression in the first place? Your commit
message makes it sound like it's a way to get around the race
condition that affects dynamic prefix compression, but as far as I can
tell it has nothing whatsoever to do with that race condition.

Questions:

1. Why shouldn't the high key thing be treated as an unrelated piece of work?

I guess it's possible that it really should be structured that way,
but even then it's your responsibility to make it clear why that is.
As things stand, this presentation is very confusing.

2. Separately, why should dynamic prefix compression be tied to the
specialization work? I also see no principled reason why it should be
tied to the other two things.

I didn't mind this sort of structure so much back when this work was
very clearly exploratory -- I've certainly structured work in this
area that way myself, in the past. But if you want this patch set to
ever go beyond being an exploratory patch set, something has to
change. I don't have time to do a comprehensive (or even a fairly
cursory) analysis of which parts of the patch are helping, and which
are marginal or even add no value.

> > You'd have
> > to compare the lower bound separator key from the parent (which might
> > itself be the page-level low key for the parent) to the page low key.
> > That's not a serious suggestion; I'm just pointing out that you need
> > to be able to compare like with like for a canary condition like this
> > one, and AFAICT there is no lightweight practical way of doing that
> > that is 100% robust.
>
> True, if we had consistent LOWKEYs on pages, that'd make this job much
> easier: the prefix could indeed be carried over in full. But that's
> not currently the case for the nbtree code, and this is the next best
> thing, as it also has the benefit of working with all currently
> supported physical formats of btree indexes.

I went over the low key thing again because I had to struggle to
understand what your high key optimization had to do with dynamic
prefix compression. I'm still struggling. I think that your commit
message very much led me astray. Quoting it here:

"""
Although this limits the overall applicability of the
performance improvement, it still allows for a nice performance
improvement in most cases where initial columns have many
duplicate values and a compare function that is not cheap.

As an exception to the above rule, most of the time a pages'
highkey is equal to the right seperator on the parent page due to
how btree splits are done. By storing this right seperator from
the parent page and then validating that the highkey of the child
page contains the exact same data, we can restore the right prefix
bound without having to call the relatively expensive _bt_compare.
"""

You're directly tying the high key optimization to the dynamic prefix
compression optimization. But why?

I have long understood that you gave up on the idea of keeping the
bounds across levels of the tree (which does make sense to me), but
yesterday the issue became totally muddled by this high key business.
That's why I rehashed the earlier discussion, which I had previously
understood to be settled.

-- 
Peter Geoghegan




Re: dikkop seems unhappy because of openssl stuff (FreeBSD 14-BETA1)

2023-09-19 Thread Tomas Vondra



On 9/19/23 18:45, Tom Lane wrote:
> Tomas Vondra  writes:
>> I have no experience with tcl, but I tried this in the two tclsh
>> versions installed no the system (8.6 and 8.7):
> 
>> bsd@freebsd:~ $ tclsh8.7
>> % clock scan "1/26/2010"
>> time value too large/small to represent
> 
>> bsd@freebsd:~ $ tclsh8.6
>> % clock scan "1/26/2010"
>> time value too large/small to represent
> 
>> AFAIK this is what the tcl_date_week(2010,1,26) translates to.
> 
> Oh, interesting.  On my FreeBSD 13.1 arm64 system, it works:
> 
> $ tclsh8.6
> % clock scan "1/26/2010"
> 1264482000
> 
> I am now suspicious that there's some locale effect that we have
> not observed before (though why not?).  What is the result of
> the "locale" command on your box?  Mine gives
> 
> $ locale
> LANG=C.UTF-8
> LC_CTYPE="C.UTF-8"
> LC_COLLATE="C.UTF-8"
> LC_TIME="C.UTF-8"
> LC_NUMERIC="C.UTF-8"
> LC_MONETARY="C.UTF-8"
> LC_MESSAGES="C.UTF-8"
> LC_ALL=
> 

bsd@freebsd:~ $ locale
LANG=C.UTF-8
LC_CTYPE="C.UTF-8"
LC_COLLATE="C.UTF-8"
LC_TIME="C.UTF-8"
LC_NUMERIC="C.UTF-8"
LC_MONETARY="C.UTF-8"
LC_MESSAGES="C.UTF-8"
LC_ALL=

bsd@freebsd:~ $ tclsh8.6
% clock scan "1/26/2010"
time value too large/small to represent

However, I wonder if there's something wrong with tcl itself,
considering this:

% clock format 1360558800 -format %D
02/11/2013
% clock scan 02/11/2013 -format %D
time value too large/small to represent

That's a bit strange - it seems tcl can format a timestamp, but then
can't read it back in for some reason ...


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Disabling Heap-Only Tuples

2023-09-19 Thread Andres Freund
Hi,

On 2023-09-19 19:33:22 +0200, Matthias van de Meent wrote:
> On Tue, 19 Sept 2023 at 18:56, Andres Freund  wrote:
> >
> > Hi,
> >
> > On 2023-09-19 18:30:44 +0200, Alvaro Herrera wrote:
> > > This makes me think that maybe the logic needs to be a little more
> > > complex to avoid the problem you describe: if an UPDATE is prevented
> > > from being HOT because of this setting, but then it goes and consults
> > > FSM and it gives the update a higher block number than the tuple's
> > > current block (or it fails to give a block number at all so it is forced
> > > to extend the relation), then the update should give up on that strategy
> > > and use a HOT update after all.  (I have not read the actual patch;
> > > maybe it already does this?  It sounds kinda obvious.)
> >
> > Yea, a setting like what's discussed here seems, uh, not particularly useful
> > for achieving the goal of compacting tables.  I don't think guiding this
> > through SQL makes a lot of sense. For decent compaction you'd want to scan 
> > the
> > table backwards, and move rows from the end to earlier, but stop once
> > everything is filled up. You can somewhat do that from SQL, but it's going 
> > to
> > be awkward and slow.  I doubt you even want to use the normal UPDATE WAL
> > logging.
>
> We can't move tuples around (or, not that I know of) without using a
> transaction ID to control the visibility of the two locations of that
> tuple.

Correct, otherwise you'd end up with broken visibility in scans (seeing the
same tuple twice or never).


> Doing table compaction would thus likely require using transactions to move
> these tuples around.

Yes - but I don't think that has to be a problem. I'd expect something like
this to use multiple transactions internally. Possibly optimizing xid usage by
checking if other transactions are currently waiting on the xid and committing
if that's the case. Processing a single page should be quite fast, so the
maximum delay on other sessions is quite small.


> Using a single backend and bulk operations, it'll still lock each tuple that
> is being moved, and that can be noticed by user DML queries. I'd rather make
> the user's queries move the data around than this long-duration, locking
> background operation.

I doubt that works well enough in practice. It's very common to have tuples
that aren't updated after some point. So you then end up with needing tooling
that triggers UPDATEs for tuples at the end of the relation.


> > I think having explicit compaction support in VACUUM or somewhere similar
> > would make sense, but I don't think the proposed GUC is a useful stepping
> > stone.
>
> The point of this GUC is that the compaction can happen organically in
> the user's UPDATE workflow, so that there is no long locking operation
> going on (as you would see with VACUUM FULL / CLUSTER / pg_repack).

It certainly shouldn't use an AEL. I think we could even get away without an
SUE (it's basically just UPDATEs after all), but whether it's worth doing that
I'm not sure.


> > > > But without any kind of auto-tuning, in my opinion, it's a fairly poor
> > > > feature. Sure, some people will get use out of it, if they're
> > > > sufficiently knowledgeable and sufficiently determined. But I think
> > > > for most people in most situations, it will be a struggle.
> >
> > Indeed. I think it'd often just explode table and index sizes, because HOT
> > pruning won't be able to make usable space in pages anymore (due to dead
> > items).
>
> You seem to misunderstand the latest patch. It explicitly only blocks
> local updates if the update can then move the new tuple to an earlier
> page. If that is not possible, then it'll insert locally (assuming
> that is still possible) and HOT can then still apply.

I indeed apparently had looked at the wrong patch. But I still don't think
this is a useful way of controlling this.  I guess it could be a small part of
something larger, but you are going to need something that actively updates
tuples at the end of the table, otherwise it's very unlikely in practice that
you'll ever be able to shrink the table.


Leaving aside what process "moves" tuples, I doubt that controlling "moving"
via the table size is useful. Controlling via the amount free space in the FSM
would make more sense. If there's no known free space in the FSM, this
approach can't compact. Using the table size to control also means that the
value needs to be updated with the growth of the table. Whereas controlling
moving via a percentage of free space in the FSM would allow the same setting
to be used even for a growing (or shrinking) table.

Greetings,

Andres Freund




Re: Disabling Heap-Only Tuples

2023-09-19 Thread Robert Haas
On Tue, Sep 19, 2023 at 12:56 PM Andres Freund  wrote:
> Yea, a setting like what's discussed here seems, uh, not particularly useful
> for achieving the goal of compacting tables.  I don't think guiding this
> through SQL makes a lot of sense. For decent compaction you'd want to scan the
> table backwards, and move rows from the end to earlier, but stop once
> everything is filled up. You can somewhat do that from SQL, but it's going to
> be awkward and slow.  I doubt you even want to use the normal UPDATE WAL
> logging.
>
> I think having explicit compaction support in VACUUM or somewhere similar
> would make sense, but I don't think the proposed GUC is a useful stepping
> stone.

I think there's a difference between wanting to compact instantly and
wanting to compact over time. I think that this kind of thing is
reasonably well-suited to the latter, if we can engineer away the
cases where it backfires.

But I know people will try to use it for instant compaction too, and
there it's worth remembering why we removed old-style VACUUM FULL. The
main problem is that it was mind-bogglingly slow. The other really bad
problem is that it caused massive index bloat. I think any system
that's based on moving around my tuples right now to make my table
smaller right now is likely to have similar issues.

In the case where you're trying to compact gradually, I think there
are potentially serious issues with index bloat, but only potentially.
It seems like there are reasonable cases where it's fine.
Specifically, if you have relatively few indexes per table, relatively
few long-running transactions, and all tuples get updated on a
semi-regular basis, I'm thinking that you're more likely to win than
lose.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Disabling Heap-Only Tuples

2023-09-19 Thread Matthias van de Meent
On Tue, 19 Sept 2023 at 18:52, Robert Haas  wrote:
>
> On Tue, Sep 19, 2023 at 12:30 PM Alvaro Herrera  
> wrote:
> > I was thinking something vaguely like "a table size that's roughly what
> > an optimal autovacuuming schedule would leave the table at" assuming 0.2
> > vacuum_scale_factor.  You would determine the absolute minimum size for
> > the table given the current live tuples in the table, then add 20% to
> > account for a steady state of dead tuples and vacuumed space.  So it's
> > not 1.2x of the "current" table size at the time the local_update_limit
> > feature is installed, but 1.2x of the optimal table size.
>
> Right, that would be great. And honestly if that's something we can
> figure out, then why does the parameter even need to be an integer
> instead of a Boolean? If the system knows the optimal table size, then
> the user can just say "try to compact this table" and need not say to
> what size. The 1.2 multiplier is probably situation dependent and
> maybe the multiplier should indeed be a configuration parameter, but
> we would be way better off if the absolute size didn't need to be.

Mostly agreed, but I think there's a pitfall here. You seem to assume
we have a perfect oracle that knows the optimal data size, but we
already know that our estimates can be significantly off. I don't
quite trust the statistics enough to do any calculations based on the
number of tuples in the relation. That also ignores the fact that we
don't actually have any good information about the average size of the
tuples in the table. So with current statistics, any automated "this
is how large the table should be" decisions would result in an
automated footgun, instead of the current patch's where the user has
to decide to configure it to an explicit value.

But about that: I'm not sure what the "footgun" is that you've
mentioned recently?
The issue with excessive bloat (when the local_update_limit is set too
small and fillfactor is low) was fixed in the latest patch nearly
three weeks ago, so the only remaining issue with misconfiguration is
slower updates. Sure, that's not great, but in my opinion not a
"footgun": performance returns immediately after resetting
local_update_limit, and no space was lost.
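
For concreteness, assuming the usual reloption syntax, turning the
limit on and later clearing it would look roughly like this (the block
number is purely illustrative, not a recommendation):

ALTER TABLE mytable SET (local_update_limit = 50000);
-- ... normal UPDATE traffic gradually moves rows to earlier pages,
-- and VACUUM can then truncate the emptied tail ...
ALTER TABLE mytable RESET (local_update_limit);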

> > This makes me think that maybe the logic needs to be a little more
> > complex to avoid the problem you describe: if an UPDATE is prevented
> > from being HOT because of this setting, but then it goes and consults
> > FSM and it gives the update a higher block number than the tuple's
> > current block (or it fails to give a block number at all so it is forced
> > to extend the relation), then the update should give up on that strategy
> > and use a HOT update after all.  (I have not read the actual patch;
> > maybe it already does this?  It sounds kinda obvious.)
>
> +1 to all of that. Anything we can do to reduce the chance of the
> parameter doing the opposite of what it's intended to do is, IMHO,
> really, really valuable. If you're in the situation where you really
> need something like this, you're probably having a pretty bad day
> already.

Yes, it does that with the latest patch, from not quite 3 weeks ago.

> Just to be more clear about my position, I don't think that having
> some kind of a feature along these lines is a bad idea.

Thanks for clarifying.

> I do think
> that this is one of those cases where the perfect is the enemy of the
> good, and we can fall into the trap of saying that since we can't do
> the perfect thing let's not do anything at all. At the same time, just
> because we need to do something doesn't mean we should do exactly the
> first thing that anybody thought up, or that we shouldn't try as hard
> as we can to mitigate the downsides. If we add something like this I
> bet it will get a lot of use. Even a minor improvement to the design
> that removes one pitfall of many could turn out to help a lot of
> people.

100% agreed.

> > Having to set AEL is not nice for sure, but wouldn't
> > ShareUpdateExclusiveLock be sufficient?  We have a bunch of reloptions
> > for which that is sufficient.
>
> Hmm, yeah, I think you're right.

Updating the reloption after relation truncation implies having the
same lock as relation truncation, i.e. AEL (if the vacuum docs are to
be believed). So the AEL is not required for updating the storage
option (that would require SUEL), but for the block truncation
operation.

Kind regards,

Matthias van de Meent
Neon (http://neon.tech)




Re: GenBKI emits useless open;close for catalogs without rows

2023-09-19 Thread Heikki Linnakangas

On 18/09/2023 17:50, Matthias van de Meent wrote:

(initdb takes about 73ms locally with syncing disabled)


That's impressive. It takes about 600 ms on my laptop. Of which about 
140 ms goes into processing the BKI file. And that's with "initdb 
--no-sync" option.



Various methods of reducing the size of postgres.bki were applied, as
detailed in the patch's commit message. I believe the current output
is still quite human readable.


Overall this does not seem very worthwhile to me.

One thing caught my eye though: We currently have an "open" command 
after every "create". Except for bootstrap relations; creating a 
bootstrap relation opens it implicitly. That seems like a weird 
inconsistency. If we make "create" to always open the relation, we can 
both make it more consistent and save a few lines. That's maybe worth 
doing, per the attached. It removes the "open" command altogether, as 
it's not needed anymore.


Looking at "perf" profile of initdb, I also noticed that a small but 
measurable amount of time is spent in the "isatty(0)" call in do_end(). 
Does anyone care about doing bootstrap mode interactively? We could 
probably remove that.


--
Heikki Linnakangas
Neon (https://neon.tech)
diff --git a/src/backend/bootstrap/bootparse.y b/src/backend/bootstrap/bootparse.y
index 81a1b7bfec..043ad9325f 100644
--- a/src/backend/bootstrap/bootparse.y
+++ b/src/backend/bootstrap/bootparse.y
@@ -99,7 +99,7 @@ static int num_columns_read = 0;
 /* NULLVAL is a reserved keyword */
 %token NULLVAL
 /* All the rest are unreserved, and should be handled in boot_ident! */
-%token  OPEN XCLOSE XCREATE INSERT_TUPLE
+%token  XCLOSE XCREATE INSERT_TUPLE
 %token  XDECLARE INDEX ON USING XBUILD INDICES UNIQUE XTOAST
 %token  OBJ_ID XBOOTSTRAP XSHARED_RELATION XROWTYPE_OID
 %token  XFORCE XNOT XNULL
@@ -119,8 +119,7 @@ Boot_Queries:
 		;
 
 Boot_Query :
-		  Boot_OpenStmt
-		| Boot_CloseStmt
+		  Boot_CloseStmt
 		| Boot_CreateStmt
 		| Boot_InsertStmt
 		| Boot_DeclareIndexStmt
@@ -129,20 +128,11 @@ Boot_Query :
 		| Boot_BuildIndsStmt
 		;
 
-Boot_OpenStmt:
-		  OPEN boot_ident
-{
-	do_start();
-	boot_openrel($2);
-	do_end();
-}
-		;
-
 Boot_CloseStmt:
 		  XCLOSE boot_ident
 {
 	do_start();
-	closerel($2);
+	boot_closerel($2);
 	do_end();
 }
 		;
@@ -185,17 +175,17 @@ Boot_CreateStmt:
 	 */
 	mapped_relation = ($4 || shared_relation);
 
+	if (boot_reldesc)
+	{
+		elog(DEBUG4, "create: warning, open relation exists, closing first");
+		boot_closerel(NULL);
+	}
+
 	if ($4)
 	{
 		TransactionId relfrozenxid;
 		MultiXactId relminmxid;
 
-		if (boot_reldesc)
-		{
-			elog(DEBUG4, "create bootstrap: warning, open relation exists, closing first");
-			closerel(NULL);
-		}
-
 		boot_reldesc = heap_create($2,
    PG_CATALOG_NAMESPACE,
    shared_relation ? GLOBALTABLESPACE_OID : 0,
@@ -239,6 +229,7 @@ Boot_CreateStmt:
 	  InvalidOid,
 	  NULL);
 		elog(DEBUG4, "relation created with OID %u", id);
+		boot_openrel(id);
 	}
 	do_end();
 }
@@ -466,7 +457,6 @@ boot_column_val:
 
 boot_ident:
 		  ID			{ $$ = $1; }
-		| OPEN			{ $$ = pstrdup($1); }
 		| XCLOSE		{ $$ = pstrdup($1); }
 		| XCREATE		{ $$ = pstrdup($1); }
 		| INSERT_TUPLE	{ $$ = pstrdup($1); }
diff --git a/src/backend/bootstrap/bootscanner.l b/src/backend/bootstrap/bootscanner.l
index 6a9d4193f2..efb9e36090 100644
--- a/src/backend/bootstrap/bootscanner.l
+++ b/src/backend/bootstrap/bootscanner.l
@@ -71,8 +71,6 @@ sid		\'([^']|\'\')*\'
 
 %%
 
-open			{ boot_yylval.kw = "open"; return OPEN; }
-
 close			{ boot_yylval.kw = "close"; return XCLOSE; }
 
 create			{ boot_yylval.kw = "create"; return XCREATE; }
diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index 5810f8825e..2fc01f9d4d 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -409,13 +409,10 @@ bootstrap_signals(void)
  * 
  */
 void
-boot_openrel(char *relname)
+boot_openrel(Oid id)
 {
 	int			i;
 
-	if (strlen(relname) >= NAMEDATALEN)
-		relname[NAMEDATALEN - 1] = '\0';
-
 	/*
 	 * pg_type must be filled before any OPEN command is executed, hence we
 	 * can now populate Typ if we haven't yet.
@@ -424,12 +421,12 @@ boot_openrel(char *relname)
 		populate_typ_list();
 
 	if (boot_reldesc != NULL)
-		closerel(NULL);
+		boot_closerel(NULL);
 
-	elog(DEBUG4, "open relation %s, attrsize %d",
-		 relname, (int) ATTRIBUTE_FIXED_PART_SIZE);
+	elog(DEBUG4, "open relation %u, attrsize %d",
+		 id, (int) ATTRIBUTE_FIXED_PART_SIZE);
 
-	boot_reldesc = table_openrv(makeRangeVar(NULL, relname, -1), NoLock);
+	boot_reldesc = table_open(id, NoLock);
 	numattr = RelationGetNumberOfAttributes(boot_reldesc);
 	for (i = 0; i < numattr; i++)
 	{
@@ -450,11 +447,11 @@ boot_openrel(char *relname)
 }
 
 /* 
- *		closerel
+ *		boot_closerel
  * -

Re: Projection pushdown to index access method

2023-09-19 Thread Tom Lane
Robert Haas  writes:
> On Tue, Sep 19, 2023 at 12:35 PM Chris Cleveland
>  wrote:
>> I'm working on an index access method. I have a function which can appear in 
>> a projection list which should be evaluated by the access method itself. 
>> Example:
>> ...
>> How do I get the system to pull the value from the index instead of trying 
>> to calculate it?

> I don't see how you can do this in general, because there's no
> guarantee that the plan will be an Index Scan or Index Only Scan
> instead of a Seq Scan or Bitmap Heap/Index Scan.

Yeah.  There is some adjacent functionality for indexed expressions,
which maybe you could use, but it has a lot of shortcomings yet.
For example:

regression=# create or replace function f(x int) returns int as $$begin return 
x+1; end$$ language plpgsql strict immutable cost 1000;
CREATE FUNCTION
regression=# create table mytable (id int, x int);
CREATE TABLE
regression=# create index on mytable(x, f(x));
CREATE INDEX
regression=# set enable_seqscan TO 0;
SET
regression=# set enable_bitmapscan TO 0;
SET
regression=# explain verbose select f(x) from mytable;
                                            QUERY PLAN
---------------------------------------------------------------------------------------------------
 Index Only Scan using mytable_x_f_idx on public.mytable  (cost=0.15..5728.06 rows=2260 width=4)
   Output: (f(x))
(2 rows)

If you examine the plan tree closely you can confirm that it is pulling
f(x) from the index rather than recomputing it.  So maybe you could get
somewhere by pretending that my_special_function(body) is an indexed
expression.  However, there are a couple of big gotchas, which this
example illustrates:

1. The index has to also provide x (or for you, "body") or else the
planner fails to detect that an IOS is applicable.  This comes back
to the point Robert made about the planner preferring to think about
pulling individual Vars from tables: we don't believe the index is
usable in an IOS unless it provides all the Vars the query needs from
that table.  This wouldn't be hard to fix exactly; the problem is to
fix it without spending exponential amounts of planning time in
check_index_only.  We'd have to detect that all uses of "x" appear in
the context "f(x)" in order to realize that we don't need to be able
to fetch "x" itself.

2. Costing doesn't account for the fact that we've avoided runtime
computation of f(), thus the IOS plan may not be preferred over
other plan shapes, which is why I had to force it above.  Again,
this is pretty closely tied to the fact that we don't recognize
until very late in the game that we can get f(x) from the index.

3. This only works for an index-only scan, not regular index scans.
There's some early discussion happening about unifying IOS and
regular scans a bit more, which perhaps would allow relaxing that
(and maybe even solve issue #1?).  But it's a long way off yet.

If my_special_function() is supposed to always be applied to an
indexed column, then issue #1 would fortuitously not be a problem
for you.  But #2 is a pain, and #3 might be a deal-breaker for you.

regards, tom lane




Re: Disabling Heap-Only Tuples

2023-09-19 Thread Matthias van de Meent
On Tue, 19 Sept 2023 at 18:56, Andres Freund  wrote:
>
> Hi,
>
> On 2023-09-19 18:30:44 +0200, Alvaro Herrera wrote:
> > This makes me think that maybe the logic needs to be a little more
> > complex to avoid the problem you describe: if an UPDATE is prevented
> > from being HOT because of this setting, but then it goes and consults
> > FSM and it gives the update a higher block number than the tuple's
> > current block (or it fails to give a block number at all so it is forced
> > to extend the relation), then the update should give up on that strategy
> > and use a HOT update after all.  (I have not read the actual patch;
> > maybe it already does this?  It sounds kinda obvious.)
>
> Yea, a setting like what's discussed here seems, uh, not particularly useful
> for achieving the goal of compacting tables.  I don't think guiding this
> through SQL makes a lot of sense. For decent compaction you'd want to scan the
> table backwards, and move rows from the end to earlier, but stop once
> everything is filled up. You can somewhat do that from SQL, but it's going to
> be awkward and slow.  I doubt you even want to use the normal UPDATE WAL
> logging.

We can't move tuples around (or, not that I know of) without using a
transaction ID to control the visibility of the two locations of that
tuple. Doing table compaction would thus likely require using
transactions to move these tuples around. Using a single backend and
bulk operations, it'll still lock each tuple that is being moved, and
that can be noticed by user DML queries. I'd rather make the user's
queries move the data around than this long-duration, locking
background operation.

> I think having explicit compaction support in VACUUM or somewhere similar
> would make sense, but I don't think the proposed GUC is a useful stepping
> stone.

The point of this GUC is that the compaction can happen organically in
the user's UPDATE workflow, so that there is no long locking operation
going on (as you would see with VACUUM FULL / CLUSTER / pg_repack).

> > > But without any kind of auto-tuning, in my opinion, it's a fairly poor
> > > feature. Sure, some people will get use out of it, if they're
> > > sufficiently knowledgeable and sufficiently determined. But I think
> > > for most people in most situations, it will be a struggle.
>
> Indeed. I think it'd often just explode table and index sizes, because HOT
> pruning won't be able to make usable space in pages anymore (due to dead
> items).

You seem to misunderstand the latest patch. It explicitly only blocks
local updates if the update can then move the new tuple to an earlier
page. If that is not possible, then it'll insert locally (assuming
that is still possible) and HOT can then still apply.

And yes, moving tuples to earlier pages will indeed increase index
bloat, because it does create dead tuples where previously we could've
applied HOT. But we do have VACUUM and REINDEX CONCURRENTLY to clean
that up without serious long-duration stop-the-world actions, while
the other builtin cleanup methods don't.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




Re: Projection pushdown to index access method

2023-09-19 Thread Robert Haas
On Tue, Sep 19, 2023 at 12:35 PM Chris Cleveland
 wrote:
> I'm working on an index access method. I have a function which can appear in 
> a projection list which should be evaluated by the access method itself. 
> Example:
>
> SELECT title, my_special_function(body)
> FROM books
> WHERE book_id <===> 42;
>
> "<===>" is the operator that invokes the access method. The value returned by 
> my_special_function() gets calculated during the index scan, and depends on  
> information that exists only in the index.
>
> How do I get the system to pull the value from the index instead of trying to 
> calculate it?

I don't see how you can do this in general, because there's no
guarantee that the plan will be an Index Scan or Index Only Scan
instead of a Seq Scan or Bitmap Heap/Index Scan.

> So far, I have created a CustomScan and set it using set_rel_pathlist_hook. 
> The hook function gives us a PlannerInfo, RelOptInfo, Index, and 
> RangeTblEntry. So far as I can tell, only RelOptInfo.reltarget.exprs gives us 
> any info on the SELECT expressions, but unfortunately, the exprs are Var 
> nodes that contain the (title, body) columns from above, and do not say 
> anything about my_special_function().

So what does the EXPLAIN plan look like?

I'm not quite sure what's happening here, but the planner likes to
make plans that just fetch attributes from all the relations being
joined (here, there's just one) and then perform the calculation of
any expressions at the very end, as the final step, or at least as the
final step at that subquery level. And if it plans to ask your custom
scan for title, body, and book_id and then compute
my_special_function(body) after the fact, the thing you want to happen
is not going to happen. If the planner can be induced to ask your
custom scan for my_special_function(body), then I *think* you should
be able to arrange to get that value any way you like and just return
it. But I don't quite know how to induce the planner to do that -- and
especially if this query involved more than one table, because of the
planner's tendency to postpone expression evaluation until after joins
are done.

> I know this is possible, because the docs for PathTarget say this:
>
> PathTarget
>  *
>  * This struct contains what we need to know during planning about the
>  * targetlist (output columns) that a Path will compute.  Each RelOptInfo
>  * includes a default PathTarget, which its individual Paths may simply
>  * reference.  However, in some cases a Path may compute outputs different
>  * from other Paths, and in that case we make a custom PathTarget for it.
>  * For example, an indexscan might return index expressions that would
>  * otherwise need to be explicitly calculated.

It's just worth keeping in mind that the planner and the executor are
very tightly bound together here. This may be one of those cases
getting the executor to do what you want is the easy part, and getting
the planner to produce a plan that tells it to do that is the hard
part.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Disabling Heap-Only Tuples

2023-09-19 Thread Andres Freund
Hi,

On 2023-09-19 18:30:44 +0200, Alvaro Herrera wrote:
> This makes me think that maybe the logic needs to be a little more
> complex to avoid the problem you describe: if an UPDATE is prevented
> from being HOT because of this setting, but then it goes and consults
> FSM and it gives the update a higher block number than the tuple's
> current block (or it fails to give a block number at all so it is forced
> to extend the relation), then the update should give up on that strategy
> and use a HOT update after all.  (I have not read the actual patch;
> maybe it already does this?  It sounds kinda obvious.)

Yea, a setting like what's discussed here seems, uh, not particularly useful
for achieving the goal of compacting tables.  I don't think guiding this
through SQL makes a lot of sense. For decent compaction you'd want to scan the
table backwards, and move rows from the end to earlier, but stop once
everything is filled up. You can somewhat do that from SQL, but it's going to
be awkward and slow.  I doubt you even want to use the normal UPDATE WAL
logging.
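
To illustrate the awkwardness, the SQL-level version amounts to
something like the sketch below. It needs the ctid comparison
operators (available in recent releases), pays full UPDATE WAL volume
for every moved row, and unless something suppresses same-page (HOT)
placement the new version may simply stay where it was:

UPDATE mytable
SET some_col = some_col             -- hypothetical column, no-op change
WHERE ctid >= '(90000,1)'::tid;     -- rows stored beyond block 90000
VACUUM mytable;                     -- then try to truncate the freed tail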

I think having explicit compaction support in VACUUM or somewhere similar
would make sense, but I don't think the proposed GUC is a useful stepping
stone.


> > But without any kind of auto-tuning, in my opinion, it's a fairly poor
> > feature. Sure, some people will get use out of it, if they're
> > sufficiently knowledgeable and sufficiently determined. But I think
> > for most people in most situations, it will be a struggle.

Indeed. I think it'd often just explode table and index sizes, because HOT
pruning won't be able to make usable space in pages anymore (due to dead
items).

Greetings,

Andres Freund




Re: Disabling Heap-Only Tuples

2023-09-19 Thread Robert Haas
On Tue, Sep 19, 2023 at 12:30 PM Alvaro Herrera  wrote:
> I was thinking something vaguely like "a table size that's roughly what
> an optimal autovacuuming schedule would leave the table at" assuming 0.2
> vacuum_scale_factor.  You would determine the absolute minimum size for
> the table given the current live tuples in the table, then add 20% to
> account for a steady state of dead tuples and vacuumed space.  So it's
> not 1.2x of the "current" table size at the time the local_update_limit
> feature is installed, but 1.2x of the optimal table size.

Right, that would be great. And honestly if that's something we can
figure out, then why does the parameter even need to be an integer
instead of a Boolean? If the system knows the optimal table size, then
the user can just say "try to compact this table" and need not say to
what size. The 1.2 multiplier is probably situation dependent and
maybe the multiplier should indeed be a configuration parameter, but
we would be way better off if the absolute size didn't need to be.

> This makes me think that maybe the logic needs to be a little more
> complex to avoid the problem you describe: if an UPDATE is prevented
> from being HOT because of this setting, but then it goes and consults
> FSM and it gives the update a higher block number than the tuple's
> current block (or it fails to give a block number at all so it is forced
> to extend the relation), then the update should give up on that strategy
> and use a HOT update after all.  (I have not read the actual patch;
> maybe it already does this?  It sounds kinda obvious.)

+1 to all of that. Anything we can do to reduce the chance of the
parameter doing the opposite of what it's intended to do is, IMHO,
really, really valuable. If you're in the situation where you really
need something like this, you're probably having a pretty bad day
already.

Just to be more clear about my position, I don't think that having
some kind of a feature along these lines is a bad idea. I do think
that this is one of those cases where the perfect is the enemy of the
good, and we can fall into the trap of saying that since we can't do
the perfect thing let's not do anything at all. At the same time, just
because we need to do something doesn't mean we should do exactly the
first thing that anybody thought up, or that we shouldn't try as hard
as we can to mitigate the downsides. If we add something like this I
bet it will get a lot of use. Even a minor improvement to the design
that removes one pitfall of many could turn out to help a lot of
people. If we could get to the point where most people have a positive
user experience without too much effort, this could turn out to be one
of the most impactful features in years.

> Having to set AEL is not nice for sure, but wouldn't
> ShareUpdateExclusiveLock be sufficient?  We have a bunch of reloptions
> for which that is sufficient.

Hmm, yeah, I think you're right.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: dikkop seems unhappy because of openssl stuff (FreeBSD 14-BETA1)

2023-09-19 Thread Tom Lane
Tomas Vondra  writes:
> I have no experience with tcl, but I tried this in the two tclsh
> versions installed on the system (8.6 and 8.7):

> bsd@freebsd:~ $ tclsh8.7
> % clock scan "1/26/2010"
> time value too large/small to represent

> bsd@freebsd:~ $ tclsh8.6
> % clock scan "1/26/2010"
> time value too large/small to represent

> AFAIK this is what the tcl_date_week(2010,1,26) translates to.

Oh, interesting.  On my FreeBSD 13.1 arm64 system, it works:

$ tclsh8.6
% clock scan "1/26/2010"
1264482000

I am now suspicious that there's some locale effect that we have
not observed before (though why not?).  What is the result of
the "locale" command on your box?  Mine gives

$ locale
LANG=C.UTF-8
LC_CTYPE="C.UTF-8"
LC_COLLATE="C.UTF-8"
LC_TIME="C.UTF-8"
LC_NUMERIC="C.UTF-8"
LC_MONETARY="C.UTF-8"
LC_MESSAGES="C.UTF-8"
LC_ALL=


regards, tom lane




Re: Disabling Heap-Only Tuples

2023-09-19 Thread Alvaro Herrera
On 2023-Sep-19, Robert Haas wrote:

> On Tue, Sep 19, 2023 at 6:26 AM Alvaro Herrera  
> wrote:
> > Second, I think we should make it auto-reset.  That is, have the user
> > set some value; later, when some condition triggers (say, the table size
> > is 1.2x the limit value you configured), then the local_update_limit is
> > automatically removed from the table options.  From that point onwards,
> > the table is operated normally.
> 
> That's an interesting idea. It would require taking AEL on the table.
> And also, what do you mean by 1.2x the limit value? Is that supposed
> to be a >= condition or a <= condition? It can't really be a >=
> condition, but you wouldn't set it in the first place unless the table
> were significantly bigger than it could be. But if it's a <= condition
> it doesn't really protect you from hosing yourself. You just have to
> insert a bit more data before enough of the bloat gets removed, and
> now the table just bloats infinitely and probably rather quickly. The
> correct value of the setting depends on the amount of real data
> (non-bloat) in the table, not the actual table size.

I was thinking something vaguely like "a table size that's roughly what
an optimal autovacuuming schedule would leave the table at" assuming 0.2
vacuum_scale_factor.  You would determine the absolute minimum size for
the table given the current live tuples in the table, then add 20% to
account for a steady state of dead tuples and vacuumed space.  So it's
not 1.2x of the "current" table size at the time the local_update_limit
feature is installed, but 1.2x of the optimal table size.
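
Just to sketch it (this is not part of the patch): with reasonably
fresh stats, a back-of-the-envelope version of that target could be
derived from the live/dead tuple counts, ignoring per-page overhead
and any already-reclaimed free space:

SELECT relname,
       pg_relation_size(relid) AS current_bytes,
       ceil(pg_relation_size(relid)
            * n_live_tup::numeric / nullif(n_live_tup + n_dead_tup, 0)
            * 1.2) AS rough_target_bytes
FROM pg_stat_user_tables
WHERE relname = 'mytable';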

This makes me think that maybe the logic needs to be a little more
complex to avoid the problem you describe: if an UPDATE is prevented
from being HOT because of this setting, but then it goes and consults
FSM and it gives the update a higher block number than the tuple's
current block (or it fails to give a block number at all so it is forced
to extend the relation), then the update should give up on that strategy
and use a HOT update after all.  (I have not read the actual patch;
maybe it already does this?  It sounds kinda obvious.)


Having to set AEL is not nice for sure, but wouldn't
ShareUpdateExclusiveLock be sufficient?  We have a bunch of reloptions
for which that is sufficient.


> But without any kind of auto-tuning, in my opinion, it's a fairly poor
> feature. Sure, some people will get use out of it, if they're
> sufficiently knowledgeable and sufficiently determined. But I think
> for most people in most situations, it will be a struggle.
> 
> -- 
> Robert Haas
> EDB: http://www.enterprisedb.com


-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"Tiene valor aquel que admite que es un cobarde" (Fernandel)




Re: Fixing tab-complete for dollar-names

2023-09-19 Thread Heikki Linnakangas

On 27/06/2023 02:47, Vik Fearing wrote:

On 6/26/23 22:10, Mikhail Gribkov wrote:

Hi hackers,

As not much preliminary interest seem to be here, I'm sending the patch to
the upcoming commitfest


I have added myself as reviewer.  I already had taken a look at it, and
it seemed okay, but I have not yet searched for corner cases.


LGTM, pushed.

I concur it's surprising that no one's noticed or at least not bothered 
to fix this before. But I also couldn't find any cases where this would 
cause trouble.


--
Heikki Linnakangas
Neon (https://neon.tech)





Re: dikkop seems unhappy because of openssl stuff (FreeBSD 14-BETA1)

2023-09-19 Thread Tomas Vondra



On 9/18/23 20:52, Tom Lane wrote:
> Tomas Vondra  writes:
>> it seems dikkop is unhappy again, this time because of some OpenSSL
>> stuff. I'm not sure it's our problem - it might be issues with the other
>> packages, or maybe something FreeBSD specific, not sure.
>> ...
>> Both 11 and 12 failed with a weird openssl segfaults in plpython tests,
>> see [2] and [3]. And 13 is stuck in some openssl stuff in plpython
>> tests, with 100% CPU usage (for ~30h now):
> 
> Even weirder, its latest REL_11 run got past that, and instead failed
> in pltcl [1].  I suppose in an hour or two we'll know if v12 also
> changed behavior.
> 

Oh, yeah. Sorry for not mentioning this yesterday ...

I tried removing the openssl-1.1.1v and installed 3.1 instead, which
apparently allowed it to pass the plpython tests. I guess it's due to
some sort of confusion with the openssl-3.0 included in FreeBSD base
(which I didn't realize is there).

> The pltcl test case that is failing is annotated
> 
> -- Test usage of Tcl's "clock" command.  In recent Tcl versions this
> -- command fails without working "unknown" support, so it's a good canary
> -- for initialization problems.
> 
> which is mighty suggestive, but I'm not sure what to look at exactly.
> Perhaps apply "ldd" or local equivalent to those languages' .so files
> and see if they link to the same versions of indirectly-required
> libraries as Postgres is linking to?
> 
>   regards, tom lane
> 

I have no experience with tcl, but I tried this in the two tclsh
versions installed on the system (8.6 and 8.7):

bsd@freebsd:~ $ tclsh8.7
% clock scan "1/26/2010"
time value too large/small to represent

bsd@freebsd:~ $ tclsh8.6
% clock scan "1/26/2010"
time value too large/small to represent

AFAIK this is what the tcl_date_week(2010,1,26) translates to.

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: dikkop seems unhappy because of openssl stuff (FreeBSD 14-BETA1)

2023-09-19 Thread Tomas Vondra
On 9/19/23 04:25, Thomas Munro wrote:
> On Tue, Sep 19, 2023 at 2:04 PM Michael Paquier  wrote:
>> On Mon, Sep 18, 2023 at 03:11:27PM +0200, Tomas Vondra wrote:
>>> Both 11 and 12 failed with a weird openssl segfaults in plpython tests,
>>> see [2] and [3]. And 13 is stuck in some openssl stuff in plpython
>>> tests, with 100% CPU usage (for ~30h now):
>>>
>>> #0  0x850e86c0 in OPENSSL_sk_insert ()
>>> from /usr/local/lib/libcrypto.so.11
>>> #1  0x850a5848 in CRYPTO_set_ex_data ()
>>> from /usr/local/lib/libcrypto.so.11
>>> ...
>>>
>>> Full backtrace attached. I'm not sure what could possibly be causing
>>> this, except maybe something in FreeBSD? Or maybe there's some confusion
>>> about libraries? No idea.
>>
>> FWIW, I've seen such corrupted and time-sensitive stacks in the past
>> in the plpython tests in builds when python linked to a SSL library
>> different than what's linked with the backend.  So that smells like a
>> packaging issue to me.
> 
> Could it be confusion due to the presence of OpenSSL 3.0 in the
> FreeBSD base system (/usr/include, /usr/lib) combined with the
> presence of OpenSSL 1.1.1 installed with "pkg install openssl"
> (/usr/local/include, /usr/local/lib)?  Tomas, does it help if you "pkg
> remove openssl"?

Oh! That might be it - I didn't realize FreeBSD already has openssl 3.0
already included in the base system, so perhaps installing 1.1.1v leads
to some serious confusion ...

After some off-list discussion with Alvaro I tried removing the 1.1.1v
and installed the openssl31 package, which apparently resolved this (at
which point it ran into the unrelated tcl issue).

Still, this confusion seems rather unexpected, and I'm not sure if
having both 3.0 (from base) and 3.1 (from package) could lead to the
same confusion / crashes. Not sure if it's "our" problem ...

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Disabling Heap-Only Tuples

2023-09-19 Thread Robert Haas
On Tue, Sep 19, 2023 at 6:26 AM Alvaro Herrera  wrote:
> Second, I think we should make it auto-reset.  That is, have the user
> set some value; later, when some condition triggers (say, the table size
> is 1.2x the limit value you configured), then the local_update_limit is
> automatically removed from the table options.  From that point onwards,
> the table is operated normally.

That's an interesting idea. It would require taking AEL on the table.
And also, what do you mean by 1.2x the limit value? Is that supposed
to be a >= condition or a <= condition? It can't really be a >=
condition, but you wouldn't set it in the first place unless the table
were significantly bigger than it could be. But if it's a <= condition
it doesn't really protect you from hosing yourself. You just have to
insert a bit more data before enough of the bloat gets removed, and
now the table just bloats infinitely and probably rather quickly. The
correct value of the setting depends on the amount of real data
(non-bloat) in the table, not the actual table size.

> The point here is that third-party tools such as pg_repack or pg_squeeze
> exist, which work in a way we don't like, yet we offer no alternative.
> This proposal is a mechanism that essentially replaces those tools with
> a simple in-core feature, without having to include the tool itself in
> core.

I agree that it would be nice to have something in core that can be
used to help with this problem, but this feature isn't the same thing
as pg_repack or pg_squeeze, either. In some ways, it's better, because
it can shrink the table without rewriting it, which is very desirable.
But in other ways, it's worse, and the fact that it seems like it can
backfire spectacularly if you set the wrong value seems like one big
way that it is a lot worse. If there is a way that we can make this a
mode that you activate for a table, and the system calculates and
updates the threshold, I think that would actually be a pretty good
feature. It would be tricky to use it to recover from acute
emergencies, because it doesn't actually do anything until updates
happen, but you could use it for that in a pinch. And even without
that it would be useful if you have a table that is sometimes very
large and sometimes very small and you want to get the space back from
the OS when it is in the small phase of its lifecycle.

But without any kind of auto-tuning, in my opinion, it's a fairly poor
feature. Sure, some people will get use out of it, if they're
sufficiently knowledgeable and sufficiently determined. But I think
for most people in most situations, it will be a struggle.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Checks in RegisterBackgroundWorker.()

2023-09-19 Thread Heikki Linnakangas

On 25/08/2023 00:00, Thomas Munro wrote:

On Fri, Aug 25, 2023 at 3:15 AM Heikki Linnakangas  wrote:

In summary, RegisterBackgroundWorker() is doing some questionable and
useless work, when a shared preload library is loaded to a backend
process in EXEC_BACKEND mode.


Yeah.  When I was working on 7389aad6 ("Use WaitEventSet API for
postmaster's event loop."), I also tried to move all of the
postmaster's state variables into PostmasterContext (since the only
reason for that scope was the signal handler code that is now gone),
and I hit a variant of this design problem.  I wonder if that would be
unblocked by this...

https://www.postgresql.org/message-id/CA+hUKGKH_RPAo=ngpfhkj--565al1qivpugdwt1_pmjehy+...@mail.gmail.com


A-ha, yes I believe this patch will unblock that. 
RegisterBackgroundWorker() has no legit reason to access 
BackgroundWorkerList in child processes, and with these patches, it no 
longer does.


--
Heikki Linnakangas
Neon (https://neon.tech)





Re: CREATE FUNCTION ... SEARCH { DEFAULT | SYSTEM | SESSION }

2023-09-19 Thread Robert Haas
On Mon, Sep 18, 2023 at 4:51 PM Jeff Davis  wrote:
> I don't want to make an argument of the form "the status quo is really
> bad, and therefore my proposal is good". That line of argument is
> suspect for good reason.

+1.

> But if my proposal isn't good enough, and we don't have a clear
> alternative, we need to think seriously about how much we've
> collectively over-promised and under-delivered on the concept of
> privilege separation.
>
> Absent a better idea, we need to figure out a way to un-promise what we
> can't do and somehow guide users towards safe practices. For instance,
> don't grant the INSERT or UPDATE privilege if the table uses functions
> in index expressions or constraints. Also don't touch any table unless
> the onwer has SET ROLE privileges on your role already, or the
> operation is part of a special carve out (logical replication or a
> maintenance command). And don't use the predefined role
> pg_write_all_data, because that's unsafe for most imaginable use cases.

I agree this is a mess, and that documenting the mess better would be
good. But instead of saying not to do something, we need to say what
will happen if you do the thing. I'm regularly annoyed when somebody
reports that "I tried to do X and it didn't work," instead of saying
what happened when they tried, and this situation is another form of
the same thing. "If you do X, then Y will or can occur" is much better
than "do not do X". And I think better documentation of this area
would be useful regardless of any other improvements that we may or
may not make. Indeed, really good documentation of this area might
facilitate making further improvements by highlighting some of the
problems so that they can more easily be understood by a wider
audience. I fear it will be hard to come up with something that is
clear, that highlights the severity of the problems, and that does not
veer off into useless vitriol against the status quo, but if we can
get there, that would be good.

But, leaving that to one side, what technical options do we have on
the table, supposing that we want to do something that is useful but
not this exact thing?

I think one option is to somehow change the behavior around
search_path but in a different way than you've proposed. The most
radical option would be to make it not be a GUC any more. I think the
backward-compatibility implications of that would likely be
unpalatable to many, and the details of what we'd actually do are also
not clear, at least to me. For a function, I think there is a
reasonable argument that you just make it a function property, like
IMMUTABLE, as you said before. But for code that goes directly into
the session, where's the search_path supposed to come from? It's got
to be configured somewhere, and somehow that somewhere feels a lot
like a GUC. That leads to a second idea, which is having it continue
to be a GUC but only affect directly-entered SQL, with all
indirectly-entered SQL either being stored as a node tree or having a
search_path property attached somewhere. Or, as a third idea, suppose
we leave it a GUC but start breaking semantics around where and how
that GUC gets set, e.g. by changing CREATE FUNCTION to capture the
prevailing search_path by default unless instructed otherwise.
Personally I feel like we'd need pretty broad consensus for any of
these kinds of changes because it would break a lot of stuff for a lot
of people, but if we could get that then I think we could maybe emerge
in a better spot once the pain of the compatibility break receded.
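
For what it's worth, capturing at creation time is already available
as an explicit opt-in today; the question with that third idea is
whether something equivalent should become the default:

CREATE FUNCTION my_func()            -- hypothetical example function
    RETURNS int
    LANGUAGE sql
    SET search_path FROM CURRENT     -- snapshots the creator's search_path
    AS $$ SELECT 1 $$;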

Another option is something around sandboxing and/or function trust.
The idea here is to not care too much about the search_path behavior
itself, and instead focus on the consequences, namely what code is
getting executed as which user and perhaps what kinds of operations
it's performing. To me, this seems like a possibly easier answer way
forward at least in the short to medium term, because I think it will
break fewer things for fewer people, and if somebody doesn't like the
new behavior they can just say "well I trust everyone completely" and
it all goes back to the way it was. That said, I think there are
problems with my previous proposals on the other thread so I believe
some adjustments would be needed there, and then there's the problem
of actually implementing anything. I'll try to respond to your
comments on that thread soon.

Are there any other categories of things we can do? More specific
kinds of things we can do in each category? I don't really see an
option other than (1) "change something in the system design so that
people use search_path wrongly less often" or (2) "make it so that it
doesn't matter as much if people using the wrong search_path" but
maybe I'm missing a clever idea.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Relation bulk write facility

2023-09-19 Thread Heikki Linnakangas
Several places bypass the buffer manager and use direct smgrextend() 
calls to populate a new relation: Index AM build methods, rewriteheap.c 
and RelationCopyStorage(). There's a fair amount of duplicated code to 
WAL-log the pages, calculate checksums, call smgrextend(), and finally 
call smgrimmedsync() if needed. The duplication is tedious and 
error-prone. For example, if we want to optimize by WAL-logging multiple 
pages in one record, that needs to be implemented in each AM separately. 
Currently only sorted GiST index build does that but it would be equally 
beneficial in all of those places.


And I believe we got the smgrimmedsync() logic slightly wrong in a 
number of places [1]. And it's not great for latency, we could let the 
checkpointer do the fsyncing lazily, like Robert mentioned in the same 
thread.


The attached patch centralizes that pattern to a new bulk writing 
facility, and changes all those AMs to use it. The facility buffers 32 
pages, WAL-logs them in one record, and calculates checksums. You could 
imagine a lot of further optimizations, like writing those 32 pages in 
one vectored pvwrite() call [2], and not skipping the buffer manager 
when the relation is small. But the scope of this initial version is 
mostly to refactor the existing code.


One new optimization included here is to let the checkpointer do the 
fsyncing if possible. That gives a big speedup when e.g. restoring a 
schema-only dump with lots of relations.


[1] 
https://www.postgresql.org/message-id/58effc10-c160-b4a6-4eb7-384e95e6f9e3%40iki.fi


[2] 
https://www.postgresql.org/message-id/CA+hUKGJkOiOCa+mag4BF+zHo7qo=o9CFheB8=g6ut5tum2g...@mail.gmail.com


--
Heikki Linnakangas
Neon (https://neon.tech)

From 33f5cafc1512e8a004df2d506da71dfbbef5c60d Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Tue, 19 Sep 2023 18:09:34 +0300
Subject: [PATCH v1 1/1] Introduce a new bulk loading facility.

The new facility makes it easier to optimize bulk loading, as the
logic for buffering, WAL-logging, and syncing the relation only needs
to be implemented once. It's also less error-prone: We have had a
number of bugs in how a relation is fsync'd - or not - at the end of a
bulk loading operation. By centralizing that logic to one place, we
only need to write it correctly once.

The new facility is faster for small relations: Instead of calling
smgrimmedsync(), we register the fsync to happen at next checkpoint,
which avoids the fsync latency. That can make a big difference if you
are e.g. restoring a schema-only dump with lots of relations.

It is also slightly more efficient with large relations, as the WAL
logging is performed multiple pages at a time. That avoids some WAL
header overhead. The sorted GiST index build did that already, this
moves the buffering to the new facility.

The changes to pageinspect GiST test needs an explanation: Before this
patch, the sorted GiST index build set the LSN on every page to the
special GistBuildLSN value, not the LSN of the WAL record, even though
they were WAL-logged. There was no particular need for it, it just
happened naturally when we wrote out the pages before WAL-logging
them. Now we WAL-log the pages first, like in B-tree build, so the
pages are stamped with the record's real LSN. When the build is not
WAL-logged, we still use GistBuildLSN. To make the test output
predictable, use an unlogged index.
---
 src/backend/access/gist/gistbuild.c   | 111 ++---
 src/backend/access/heap/rewriteheap.c |  71 ++
 src/backend/access/nbtree/nbtree.c|  29 +--
 src/backend/access/nbtree/nbtsort.c   | 102 ++--
 src/backend/access/spgist/spginsert.c |  49 ++--
 src/backend/catalog/storage.c |  38 +--
 src/backend/storage/smgr/Makefile |   1 +
 src/backend/storage/smgr/bulk_write.c | 334 ++
 src/backend/storage/smgr/md.c |  43 
 src/backend/storage/smgr/meson.build  |   1 +
 src/backend/storage/smgr/smgr.c   |  31 +++
 src/include/storage/bulk_write.h  |  28 +++
 src/include/storage/md.h  |   1 +
 src/include/storage/smgr.h|   1 +
 14 files changed, 535 insertions(+), 305 deletions(-)
 create mode 100644 src/backend/storage/smgr/bulk_write.c
 create mode 100644 src/include/storage/bulk_write.h

diff --git a/src/backend/access/gist/gistbuild.c b/src/backend/access/gist/gistbuild.c
index 5e0c1447f92..950b6046ac5 100644
--- a/src/backend/access/gist/gistbuild.c
+++ b/src/backend/access/gist/gistbuild.c
@@ -43,7 +43,8 @@
 #include "miscadmin.h"
 #include "optimizer/optimizer.h"
 #include "storage/bufmgr.h"
-#include "storage/smgr.h"
+#include "storage/bulk_write.h"
+
 #include "utils/memutils.h"
 #include "utils/rel.h"
 #include "utils/tuplesort.h"
@@ -106,11 +107,8 @@ typedef struct
 	Tuplesortstate *sortstate;	/* state data for tuplesort.c */
 
 	BlockNumber pages_allocated;
-	BlockNumber pages_written;
 
-	int			ready_num_pages;
-	BlockNumber ready_blknos[XLR_MAX_BLOCK_ID];
-	Page		ready_pages[XLR_

Projection pushdown to index access method

2023-09-19 Thread Chris Cleveland
I'm working on an index access method. I have a function which can appear
in a projection list which should be evaluated by the access method itself.
Example:

SELECT title, my_special_function(body)
FROM books
WHERE book_id <===> 42;

"<===>" is the operator that invokes the access method. The value returned
by my_special_function() gets calculated during the index scan, and depends
on  information that exists only in the index.

How do I get the system to pull the value from the index instead of trying
to calculate it?

So far, I have created a CustomScan and set it using set_rel_pathlist_hook.
The hook function gives us a PlannerInfo, RelOptInfo, Index, and
RangeTblEntry. So far as I can tell, only RelOptInfo.reltarget.exprs gives
us any info on the SELECT expressions, but unfortunately, the exprs are Var
nodes that contain the (title, body) columns from above, and do not say
anything about my_special_function().

Where do I find the actual final projection exprs?

Am I using the right hook?

Is there any example code out there on how to do this?

I know this is possible, because the docs for PathTarget say this:

PathTarget
 *
 * This struct contains what we need to know during planning about the
 * targetlist (output columns) that a Path will compute.  Each RelOptInfo
 * includes a default PathTarget, which its individual Paths may simply
 * reference.  However, in some cases a Path may compute outputs different
 * from other Paths, and in that case we make a custom PathTarget for it.
 *
 * For example, an indexscan might return index expressions that would
 * otherwise need to be explicitly calculated.

-- 
Chris Cleveland
312-339-2677 mobile


Re: pg_resetwal tests, logging, and docs update

2023-09-19 Thread Peter Eisentraut

On 13.09.23 16:36, Aleksander Alekseev wrote:

All in all the patch looks OK but I have a couple of nitpicks.

```
+   working on a data directory in an unclean shutdown state or with a corrupt
+   control file.
```

```
+   After running this command on a data directory with corrupted WAL or a
+   corrupt control file,
```

I'm not a native English speaker but shouldn't it be "corruptED control file"?


fixed



```
+  Force pg_resetwal to proceed even in situations where
+  it could be dangerous,
```

"where" is probably fine but wouldn't "WHEN it could be dangerous" be better?


Hmm, I think I like "where" better.

Attached is an updated patch set where I have split the changes into 
smaller pieces.  The last two patches still have some open questions 
about what certain constants mean etc.  The other patches should be settled.
From 0922727410313e37c1ffaad4f6fbf90f9990923b Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Tue, 19 Sep 2023 15:54:37 +0200
Subject: [PATCH v2 1/7] pg_resetwal: Update an obsolete comment

---
 src/bin/pg_resetwal/pg_resetwal.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/src/bin/pg_resetwal/pg_resetwal.c 
b/src/bin/pg_resetwal/pg_resetwal.c
index 25ecdaaa15..b7885e34f3 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -6,8 +6,7 @@
  *
  * The theory of operation is fairly simple:
  *   1. Read the existing pg_control (which will include the last
- *  checkpoint record).  If it is an old format then update to
- *  current format.
+ *  checkpoint record).
  *   2. If pg_control is corrupt, attempt to intuit reasonable values,
  *  by scanning the old xlog if necessary.
  *   3. Modify pg_control to reflect a "shutdown" state with a checkpoint

base-commit: bf094372d14bf00f1c04f70f6ec5790f8b2c8801
-- 
2.42.0

From 8fee9806bfa94fc999e730d3fb55946c656e6c64 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Tue, 19 Sep 2023 15:58:49 +0200
Subject: [PATCH v2 2/7] pg_resetwal: Improve error with wrong/missing data
 directory

Run chdir() before permission check to get a less confusing error
message if the specified data directory does not exist.
---
 src/bin/pg_resetwal/pg_resetwal.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/src/bin/pg_resetwal/pg_resetwal.c 
b/src/bin/pg_resetwal/pg_resetwal.c
index b7885e34f3..e344c9284d 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -345,6 +345,10 @@ main(int argc, char *argv[])
 
get_restricted_token();
 
+   if (chdir(DataDir) < 0)
+   pg_fatal("could not change directory to \"%s\": %m",
+DataDir);
+
/* Set mask based on PGDATA permissions */
if (!GetDataDirectoryCreatePerm(DataDir))
pg_fatal("could not read permissions of directory \"%s\": %m",
@@ -352,10 +356,6 @@ main(int argc, char *argv[])
 
umask(pg_mode_mask);
 
-   if (chdir(DataDir) < 0)
-   pg_fatal("could not change directory to \"%s\": %m",
-DataDir);
-
/* Check that data directory matches our server version */
CheckDataVersion();
 
-- 
2.42.0

From 241dc252f314ee32ebf853f236a17029ec477d50 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Tue, 19 Sep 2023 16:02:35 +0200
Subject: [PATCH v2 3/7] pg_resetwal: Regroup --help output

Put the options to modify the control values into a separate group.
This matches the outline of the man page.
---
 src/bin/pg_resetwal/pg_resetwal.c | 19 ---
 1 file changed, 12 insertions(+), 7 deletions(-)

diff --git a/src/bin/pg_resetwal/pg_resetwal.c 
b/src/bin/pg_resetwal/pg_resetwal.c
index e344c9284d..12e0869251 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -1128,24 +1128,29 @@ static void
 usage(void)
 {
printf(_("%s resets the PostgreSQL write-ahead log.\n\n"), progname);
-   printf(_("Usage:\n  %s [OPTION]... DATADIR\n\n"), progname);
-   printf(_("Options:\n"));
+   printf(_("Usage:\n"));
+   printf(_("  %s [OPTION]... DATADIR\n"), progname);
+
+   printf(_("\nOptions:\n"));
+   printf(_(" [-D, --pgdata=]DATADIR  data directory\n"));
+   printf(_("  -f, --forceforce update to be done\n"));
+   printf(_("  -n, --dry-run  no update, just show what would be 
done\n"));
+   printf(_("  -V, --version  output version information, then 
exit\n"));
+   printf(_("  -?, --help show this help, then exit\n"));
+
+   printf(_("\nOptions to override control file values:\n"));
printf(_("  -c, --commit-timestamp-ids=XID,XID\n"
 "   set oldest and 
newest transactions bearing\n"
 "   commit timestamp 
(zero means no ch

Re: Add pg_basetype() function to obtain a DOMAIN base type

2023-09-19 Thread Steve Chavez
Just to give a data point for the need of this function:

https://dba.stackexchange.com/questions/231879/how-to-get-the-basetype-of-a-domain-in-pg-type

This is also a common use case for services/extensions that require
postgres metadata for their correct functioning, like postgREST or
pg_graphql.

Here's a query for getting domain base types, taken from the postgREST
codebase:
https://github.com/PostgREST/postgrest/blob/531a183b44b36614224fda432335cdaa356b4a0a/src/PostgREST/SchemaCache.hs#L342-L364

So having `pg_basetype` would be really helpful in those cases.
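
For reference, the wrapper itself can be tiny, since getBaseType() in
lsyscache.h already does the recursion; a rough sketch follows (the actual
patch would add a built-in catalog entry in pg_proc.dat rather than a
loadable function, so treat this as illustrative only):

#include "postgres.h"
#include "fmgr.h"
#include "utils/lsyscache.h"

PG_FUNCTION_INFO_V1(pg_basetype);

Datum
pg_basetype(PG_FUNCTION_ARGS)
{
    Oid         typid = PG_GETARG_OID(0);

    /* getBaseType() follows the domain chain down to the base type */
    PG_RETURN_OID(getBaseType(typid));
}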

Looking forward to hearing any feedback. Or if this would be a bad idea.

Best regards,
Steve Chavez

On Sat, 9 Sept 2023 at 01:17, Steve Chavez  wrote:

> Hello hackers,
>
> Currently obtaining the base type of a domain involves a somewhat long
> recursive query. Consider:
>
> ```
> create domain mytext as text;
> create domain mytext_child_1 as mytext;
> create domain mytext_child_2 as mytext_child_1;
> ```
>
> To get `mytext_child_2` base type we can do:
>
> ```
> WITH RECURSIVE
> recurse AS (
>   SELECT
> oid,
> typbasetype,
> COALESCE(NULLIF(typbasetype, 0), oid) AS base
>   FROM pg_type
>   UNION
>   SELECT
> t.oid,
> b.typbasetype,
> COALESCE(NULLIF(b.typbasetype, 0), b.oid) AS base
>   FROM recurse t
>   JOIN pg_type b ON t.typbasetype = b.oid
> )
> SELECT
>   oid::regtype,
>   base::regtype
> FROM recurse
> WHERE typbasetype = 0 and oid = 'mytext_child_2'::regtype;
>
>   oid   | base
> +--
>  mytext_child_2 | text
> ```
>
> Core has the `getBaseType` function, which already gets a domain base type
> recursively.
>
> I've attached a patch that exposes a `pg_basetype` SQL function that uses
> `getBaseType`, so the long query above just becomes:
>
> ```
> select pg_basetype('mytext_child_2'::regtype);
>  pg_basetype
> -
>  text
> (1 row)
> ```
>
> Tests and docs are added.
>
> Best regards,
> Steve Chavez
>


Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

2023-09-19 Thread torikoshia

On 2023-09-15 19:02, Damir Belyalov wrote:

Since the v5 patch no longer applied, I updated the patch.


Thank you for updating the patch. I made a little review of it and
corrected some formatting.



Thanks for your review and update!
I don't have objections to the modifications of the code and comments.
Although the v7 patch doesn't include a commit message, I think keeping
one is good for reviewers.



- COPY with a datatype error that can't be handled as a soft error


I didn't know the proper way to test this, but I found that the data type
widget's input function widget_in() is defined in regress.c to raise a hard
error, so the attached patch adds a test using it.


This test seems a bit weird because of the "widget" type. The hard error
is thrown by the previous test with missing data. It would also be
interesting to list all the cases in which a hard error can be thrown.


Although a missing-data error is a hard error, the suggestion from Andres
was to add a `datatype` error test:



- COPY with a datatype error that can't be handled as a soft error


As described in widghet_in(), widget is intentionally left emitting hard 
error for testing purpose:


  * Note: DON'T convert this error to "soft" style (errsave/ereturn).  We
  * want this data type to stay permanently in the hard-error world so that
  * it can be used for testing that such cases still work reasonably.


From this point of view, I think this is an intended way of using widget.
OTOH, widget is declared in create_type.sql and I'm not sure it's OK to
use it in another test, copy2.sql.
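
For anyone wondering what the "soft" style looks like that widget_in()
deliberately avoids, here is a minimal sketch of the errsave/ereturn
pattern used by soft-error-capable input functions since v16 (the type
name and parsing logic are made up for illustration):

#include "postgres.h"
#include "fmgr.h"
#include <errno.h>

Datum
myscalar_in(PG_FUNCTION_ARGS)
{
    char       *str = PG_GETARG_CSTRING(0);
    Node       *escontext = fcinfo->context;
    char       *endptr;
    long        val;

    errno = 0;
    val = strtol(str, &endptr, 10);
    if (errno != 0 || *endptr != '\0')
        ereturn(escontext, (Datum) 0,
                errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
                errmsg("invalid input syntax for type %s: \"%s\"",
                       "myscalar", str));

    PG_RETURN_INT32((int32) val);
}

Because widget_in() raises a plain ereport(ERROR) instead, the soft-error
machinery cannot catch it, which is what makes it useful for this test.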


--
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation




RE: pg_ctl start may return 0 even if the postmaster has been already started on Windows

2023-09-19 Thread Hayato Kuroda (Fujitsu)
Dear Horiguchi-san,

> I added the thread to the next CF entry, so let's see what cfbot says.

At least there are several compiler warnings. E.g.,

* pgwin32_find_postmaster_pid() has "return;", but IIUC it should be "exit(1)"
* When DWORD is printed, "%lx" should be used.
* The variable "flags" seems not needed.

Here is a patch which suppresses the warnings, though the test would still fail...
You can use it if acceptable.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



pg_ctl_waits_for_postmasters_pid_on_windows_3.patch
Description: pg_ctl_waits_for_postmasters_pid_on_windows_3.patch


Re: pg_upgrade and logical replication

2023-09-19 Thread vignesh C
On Tue, 19 Sept 2023 at 11:49, Michael Paquier  wrote:
>
> On Fri, Sep 15, 2023 at 04:51:57PM +0530, vignesh C wrote:
> > Another approach to solve this as suggested by one of my colleague
> > Hou-san would be to set max_logical_replication_workers = 0 while
> > upgrading. I will evaluate this and update the next version of patch
> > accordingly.
>
> In the context of an upgrade, any node started is isolated with its
> own port and a custom unix domain directory with connections allowed
> only through this one.
>
> Saying that, I don't see why forcing max_logical_replication_workers
> to be 0 would be necessarily a bad thing to prevent unnecessary
> activity on the backend.  This should be a separate patch built on
> top of the main one, IMO.

Here is a patch to set max_logical_replication_workers to 0 while the
server is started, to prevent the launcher from being started. Since this
configuration parameter has been present since v10, no version check is needed.
I have done upgrade tests for v10-master, v11-master, ... v16-master
and found it to be working fine.

Regards,
Vignesh
From f8ce16694ac4324113d948482818047f5457f924 Mon Sep 17 00:00:00 2001
From: Vignesh C 
Date: Tue, 19 Sep 2023 17:41:07 +0530
Subject: [PATCH v1] Don't allow the launcher process to be started while
 upgrading.

We don't want the launcher to run while upgrading because it might start an
apply worker which would start receiving changes from the publisher and
update the old cluster before the upgrade is completed.
---
 src/bin/pg_upgrade/server.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/src/bin/pg_upgrade/server.c b/src/bin/pg_upgrade/server.c
index 0bc3d2806b..fa728d2b79 100644
--- a/src/bin/pg_upgrade/server.c
+++ b/src/bin/pg_upgrade/server.c
@@ -233,15 +233,16 @@ start_postmaster(ClusterInfo *cluster, bool report_and_exit_on_error)
 	 * Turn off durability requirements to improve object creation speed, and
 	 * we only modify the new cluster, so only use it there.  If there is a
 	 * crash, the new cluster has to be recreated anyway.  fsync=off is a big
-	 * win on ext4.
+	 * win on ext4. max_logical_replication_workers=0 to disable launcher.
 	 */
 	snprintf(cmd, sizeof(cmd),
-			 "\"%s/pg_ctl\" -w -l \"%s/%s\" -D \"%s\" -o \"-p %d -b%s %s%s\" start",
+			 "\"%s/pg_ctl\" -w -l \"%s/%s\" -D \"%s\" -o \"-p %d -b%s %s%s%s\" start",
 			 cluster->bindir,
 			 log_opts.logdir,
 			 SERVER_LOG_FILE, cluster->pgconfig, cluster->port,
 			 (cluster == &new_cluster) ?
 			 " -c synchronous_commit=off -c fsync=off -c full_page_writes=off" : "",
+			 " -c max_logical_replication_workers=0",
 			 cluster->pgopts ? cluster->pgopts : "", socket_string);
 
 	/*
-- 
2.34.1



Re: RFC: Logging plan of the running query

2023-09-19 Thread torikoshia

On 2023-09-15 15:21, Lepikhov Andrei wrote:

On Thu, Sep 7, 2023, at 1:09 PM, torikoshia wrote:

On 2023-09-06 11:17, James Coleman wrote:
It seems that we can know what queries were running from the stack
traces you shared.
As described above, I suspect a lock which was acquired prior to
ProcessLogQueryPlanInterrupt() caused the issue.
I will try a little more to see if I can devise a way to create the same
situation.

Hi,
I have explored this patch and, by and large, agree with the code. But
there are some questions I want to discuss:
1. Maybe add a hook to give a chance for extensions, like
pg_query_state, to do their job?


Do you imagine adding a hook which enables adding custom interrupt codes 
like below?


https://github.com/postgrespro/pg_query_state/blob/master/patches/custom_signals_15.0.patch

If so, that would be possible, but this patch doesn't require that
functionality and I feel it'd be better done in an independent patch.




2. In this implementation, ProcessInterrupts does a lot of work during
the explain creation: memory allocations, pass through the plan,
systables locks, syscache access, etc. I guess it can change the
semantic meaning of 'safe place' where CHECK_FOR_INTERRUPTS can be
called - I can imagine a SELECT query, which would call a stored
procedure, which executes some DDL and acquires row exclusive lock at
the time of interruption. So, in my mind, it is too unpredictable to
make the explain in the place of interruption processing. Maybe it is
better to think about some hook at the end of ExecProcNode, where a
pending explain could be created?


Yeah, I withdrew this patch once for that reason, but we are resuming 
development in response to the results of a discussion by James and 
others at this year's pgcon about the safety of this process in CFI:


https://www.postgresql.org/message-id/CAAaqYe9euUZD8bkjXTVcD9e4n5c7kzHzcvuCJXt-xds9X4c7Fw%40mail.gmail.com

BTW it seems pg_query_state also enables users to get running query 
plans using CFI.

Does pg_query_state do something for this safety concern?
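
For comparison, the existing pg_log_backend_memory_contexts() machinery
follows the usual two-step pattern: the ProcSignal handler only sets a
flag, and the work happens later at a point the backend chooses. A
stripped-down sketch of that pattern is below (names are illustrative
rather than necessarily the patch's):

#include "postgres.h"
#include "miscadmin.h"

static volatile sig_atomic_t LogQueryPlanPending = false;

/* called from procsignal_sigusr1_handler(); must stay signal-safe */
void
HandleLogQueryPlanInterrupt(void)
{
    InterruptPending = true;
    LogQueryPlanPending = true;
    /* the latch is set by procsignal_sigusr1_handler() itself */
}

/* called later, e.g. from ProcessInterrupts() or from an executor hook */
void
ProcessLogQueryPlanInterrupt(void)
{
    LogQueryPlanPending = false;

    /*
     * Look up the currently running query's QueryDesc (bookkeeping not
     * shown) and, if there is one, build the EXPLAIN text and log it here.
     */
}

The debate above is precisely about whether the second half is safe to run
from ProcessInterrupts(); the sketch only shows where the two halves live.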


Also, I suggest minor code change with the diff in attachment.


Thanks!

This might be a biased opinion and objections are welcome, but I feel the
original patch is easier to read since PG_RETURN_BOOL(true/false) is
located near each case.
Also, the existing function pg_log_backend_memory_contexts(), which does
the same thing, has the same form as the original patch.


--
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation




Re: Improving btree performance through specializing by key shape, take 2

2023-09-19 Thread Matthias van de Meent
On Tue, 19 Sept 2023 at 03:56, Peter Geoghegan  wrote:
>
> On Mon, Sep 18, 2023 at 6:29 PM Peter Geoghegan  wrote:
> > I also have significant doubts about your scheme for avoiding
> > invalidating the bounds of the page based on its high key matching the
> > parent's separator. The subtle dynamic prefix compression race
> > condition that I was worried about was one caused by page deletion.
> > But page deletion doesn't change the high key at all (it does that for
> > the deleted page, but that's hardly relevant). So how could checking
> > the high key possibly help?
>
> To be clear, page deletion does what I described here (it does an
> in-place update of the downlink to the deleted page, so the same pivot
> tuple now points to its right sibling, which is our page of concern),
> in addition to fully removing the original pivot tuple whose downlink
> originally pointed to our page of concern. This is why page deletion
> makes the key space "move to the right", very much like a page split
> would.

I am still aware of this issue, and I think we've discussed it in
detail earlier. I think it does not really impact this patchset. Sure,
I can't use dynamic prefix compression to its full potential, but I
still do get serious performance benefits:

FULL KEY _bt_compare calls:
'Optimal' full-tree DPT: average O(3)
Paged DPT (this patch):  average O(2 * height)
... without HK opt:  average O(3 * height)
Current: O(log2(n))

Single-attribute compares:
'Optimal' full-tree DPT: O(log(N))
Paged DPT (this patch):  O(log(N))
Current: 0 (or, O(log(N) * natts))

So, in effect, this patch moves most compare operations to the level
of only one or two full key compare operations per page (on average).

I use "on average": on a sorted array with values ranging from
potentially minus infinity to positive infinity, it takes on average 3
compares before a binary search can determine the bounds of the
keyspace it has still to search. If one side's bounds is already
known, it takes on average 2 compare operations before these bounds
are known.
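
To make that bookkeeping concrete, here is a toy, self-contained version of
the page-level idea (it is not the nbtree code; the comparator callback and
names are invented for illustration). The key's shared prefix with the items
just below and just above the remaining range bounds how many leading
attributes every item inside the range must share with the key, so each
full-key comparison can start at that attribute instead of attribute 0:

typedef int (*att_cmp_fn) (const void *key, const void *tup, int att);

static int
prefix_binsrch(const void *key, const void **tups, int ntups,
               int natts, att_cmp_fn cmp)
{
    int     lo = 0;
    int     hi = ntups;
    int     lo_prefix = 0;  /* attrs shared with the item below the range */
    int     hi_prefix = 0;  /* attrs shared with the item above the range */

    while (lo < hi)
    {
        int     mid = lo + (hi - lo) / 2;
        int     att = (lo_prefix < hi_prefix) ? lo_prefix : hi_prefix;
        int     c = 0;

        for (; att < natts; att++)
        {
            c = cmp(key, tups[mid], att);
            if (c != 0)
                break;
        }

        if (c > 0)
        {
            lo = mid + 1;
            lo_prefix = att;    /* key equals tups[mid] on attrs 0..att-1 */
        }
        else
        {
            hi = mid;
            hi_prefix = att;
        }
    }
    return lo;                  /* first offset whose tuple is >= key */
}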

> IMV it would be better if it made the key space "move to the left"
> instead, which would make page deletion close to the exact opposite of
> a page split -- that's what the Lanin & Shasha paper does (sort of).
> If you have this symmetry, then things like dynamic prefix compression
> are a lot simpler.
>
> ISTM that the only way that a scheme like yours could work, assuming
> that making page deletion closer to Lanin & Shasha is not going to
> happen, is something even more invasive than that: it might work if
> you had a page low key (not just a high key) on every page.

Note that the "dynamic prefix compression" is currently only active on
the page level.

True, the patch does carry over _bt_compare's prefix result for the
high key on the child page, but we do that only if the highkey is
actually an exact copy of the right separator on the parent page. This
carry-over opportunity is extremely likely to happen, because the high
key generated in _bt_split is then later inserted on the parent page.
The only case where it could differ is in concurrent page deletions.
That is thus a case of betting a few cycles to commonly save many
cycles (memcmp vs. a full-key _bt_compare).

Again, we do not actually skip a prefix on the compare call of the
P_HIGHKEY tuple, nor for the compares of the midpoints unless we've
found a tuple on the page that compares as smaller than the search
key.

> You'd have
> to compare the lower bound separator key from the parent (which might
> itself be the page-level low key for the parent) to the page low key.
> That's not a serious suggestion; I'm just pointing out that you need
> to be able to compare like with like for a canary condition like this
> one, and AFAICT there is no lightweight practical way of doing that
> that is 100% robust.

True, if we had consistent LOWKEYs on pages, that'd make this job much
easier: the prefix could indeed be carried over in full. But that's
not currently the case for the nbtree code, and this is the next best
thing, as it also has the benefit of working with all currently
supported physical formats of btree indexes.

Kind regards,

Matthias van de Meent




Re: tablecmds.c/MergeAttributes() cleanup

2023-09-19 Thread Peter Eisentraut

On 29.08.23 13:20, Alvaro Herrera wrote:

On 2023-Aug-29, Peter Eisentraut wrote:

@@ -3278,13 +3261,16 @@ MergeAttributes(List *schema, List *supers, char 
relpersistence,
   *
   * constraints is a list of CookedConstraint structs for previous constraints.
   *
- * Returns true if merged (constraint is a duplicate), or false if it's
- * got a so-far-unique name, or throws error if conflict.
+ * If the constraint is a duplicate, then the existing constraint's
+ * inheritance count is updated.  If the constraint doesn't match or conflict
+ * with an existing one, a new constraint is appended to the list.  If there
+ * is a conflict (same name but different expression), throw an error.


This wording confused me:

"If the constraint doesn't match or conflict with an existing one, a new
constraint is appended to the list."

I first read it as "doesn't match or conflicts with ..." (i.e., the
negation only applied to the first verb, not both) which would have been
surprising (== broken) behavior.

I think it's clearer if you say "doesn't match nor conflict", but I'm
not sure if this is grammatically correct.


Here is an updated version of this patch set.  I resolved some conflicts 
and addressed this comment of yours.  I also dropped the one patch with 
some catalog header edits that people didn't seem to particularly like.


The patches that are now 0001 through 0004 were previously reviewed and 
just held for the not-null constraint patches, I think, so I'll commit 
them in a few days if there are no objections.


Patches 0005 through 0007 are also ready in my opinion, but they haven't 
really been reviewed, so this would be something for reviewers to focus 
on.  (0005 and 0007 are trivial, but they go to together with 0006.)


The remaining 0008 and 0009 were still under discussion and contemplation.

From 28e4dbba35fc3162c13f5896551921896cf30d1c Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Wed, 12 Jul 2023 16:12:34 +0200
Subject: [PATCH v3 1/9] Clean up MergeAttributesIntoExisting()

Make variable naming clearer and more consistent.  Move some variables
to smaller scope.  Remove some unnecessary intermediate variables.
Try to save some vertical space.

Apply analogous changes to nearby MergeConstraintsIntoExisting() and
RemoveInheritance() for consistency.
---
 src/backend/commands/tablecmds.c | 123 ---
 1 file changed, 48 insertions(+), 75 deletions(-)

diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 8a2c671b66..ff001f5ceb 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -15695,53 +15695,39 @@ static void
 MergeAttributesIntoExisting(Relation child_rel, Relation parent_rel)
 {
Relationattrrel;
-   AttrNumber  parent_attno;
-   int parent_natts;
-   TupleDesc   tupleDesc;
-   HeapTuple   tuple;
-   boolchild_is_partition = false;
+   TupleDesc   parent_desc;
 
attrrel = table_open(AttributeRelationId, RowExclusiveLock);
+   parent_desc = RelationGetDescr(parent_rel);
 
-   tupleDesc = RelationGetDescr(parent_rel);
-   parent_natts = tupleDesc->natts;
-
-   /* If parent_rel is a partitioned table, child_rel must be a partition 
*/
-   if (parent_rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
-   child_is_partition = true;
-
-   for (parent_attno = 1; parent_attno <= parent_natts; parent_attno++)
+   for (AttrNumber parent_attno = 1; parent_attno <= parent_desc->natts; 
parent_attno++)
{
-   Form_pg_attribute attribute = TupleDescAttr(tupleDesc,
-   
parent_attno - 1);
-   char   *attributeName = NameStr(attribute->attname);
+   Form_pg_attribute parent_att = TupleDescAttr(parent_desc, 
parent_attno - 1);
+   char   *parent_attname = NameStr(parent_att->attname);
+   HeapTuple   tuple;
 
/* Ignore dropped columns in the parent. */
-   if (attribute->attisdropped)
+   if (parent_att->attisdropped)
continue;
 
/* Find same column in child (matching on column name). */
-   tuple = SearchSysCacheCopyAttName(RelationGetRelid(child_rel),
-   
  attributeName);
+   tuple = SearchSysCacheCopyAttName(RelationGetRelid(child_rel), 
parent_attname);
if (HeapTupleIsValid(tuple))
{
-   /* Check they are same type, typmod, and collation */
-   Form_pg_attribute childatt = (Form_pg_attribute) 
GETSTRUCT(tuple);
+   Form_pg_attribute child_att = (Form_pg_attribute) 
GETSTRUCT(tuple);
 
-   if (attribute->atttyp

Re: [PATCH] Add native windows on arm64 support

2023-09-19 Thread Anthony Roberts

Hi,

This was covered earlier in the thread - I have taken this on in Niyas' 
stead.


Was there an explicit request for something there? I was under the 
impression that this was all just suggestion/theory at the moment.


Thanks,
Anthony

On 19/09/2023 09:33, Peter Eisentraut wrote:

On 14.09.23 11:39, Daniel Gustafsson wrote:
On 13 Sep 2023, at 21:12, Peter Eisentraut  
wrote:


On 31.08.23 06:44, Tom Lane wrote:

I agree.  I'm really uncomfortable with claiming support for
Windows-on-ARM if we don't have a buildfarm member testing it.
For other platforms that have a track record of multiple
hardware support, it might not be a stretch ... but Windows was
so resolutely Intel-only for so long that "it works on ARM" is
a proposition that I won't trust without hard evidence. There
are too many bits of that system that might not have gotten the
word yet, or at least not gotten sufficient testing.
My vote for this is we don't commit without a buildfarm member.


I think we can have a multi-tiered approach, where we can commit 
support but consider it experimental until we have buildfarm coverage.


If it's experimental it should probably be behind an opt-in flag in
autoconf/meson, or be reverted by the time REL_17_STABLE branches unless
coverage has materialized by then.


The author's email is bouncing now, due to job change, so it's 
unlikely we will see any progress on this anymore.  I am setting it to 
returned with feedback.







Re: remaining sql/json patches

2023-09-19 Thread Erik Rijkers

On 9/19/23 at 13:56, Amit Langote wrote:

On Tue, Sep 19, 2023 at 7:18 PM Alvaro Herrera  wrote:

0001: I wonder why you used Node for the ErrorSaveContext pointer
instead of the specific struct you want.  I propose the attached, for
some extra type-safety.  Or did you have a reason to do it that way?


No reason other than that most other headers use Node.  I agree that
making an exception for this patch might be better, so I've
incorporated your patch into 0001.

I've also updated the query functions patch (0003) to address the
crashing bug reported by Erik.  Essentially, I made the coercion step
of JSON_QUERY always use json_populate_type() when WITH WRAPPER is
used.  You might get funny errors with ERROR ON ERROR for many types
when used in RETURNING, but at least there should no longer be any
crashes.



Indeed, with v16 those crashes are gone.

Some lesser evil: gcc 13.2.0 gave some warnings, slightly different in 
assert vs non-assert build.


--- assert build:

-- [2023.09.19 14:06:35 json_table2/0] make core: make --quiet -j 4
In file included from ../../../src/include/postgres.h:45,
 from parse_expr.c:16:
In function ‘transformJsonFuncExpr’,
inlined from ‘transformExprRecurse’ at parse_expr.c:374:13:
parse_expr.c:4355:22: warning: ‘jsexpr’ may be used uninitialized 
[-Wmaybe-uninitialized]

 4355 | Assert(jsexpr->formatted_expr);
../../../src/include/c.h:864:23: note: in definition of macro ‘Assert’
  864 | if (!(condition)) \
  |   ^
parse_expr.c: In function ‘transformExprRecurse’:
parse_expr.c:4212:21: note: ‘jsexpr’ was declared here
 4212 | JsonExpr   *jsexpr;
  | ^~


--- non-assert build:

-- [2023.09.19 14:11:03 json_table2/1] make core: make --quiet -j 4
In function ‘transformJsonFuncExpr’,
inlined from ‘transformExprRecurse’ at parse_expr.c:374:13:
parse_expr.c:4356:28: warning: ‘jsexpr’ may be used uninitialized 
[-Wmaybe-uninitialized]

 4356 | if (exprType(jsexpr->formatted_expr) != JSONBOID)
  |  ~~^~~~
parse_expr.c: In function ‘transformExprRecurse’:
parse_expr.c:4212:21: note: ‘jsexpr’ was declared here
 4212 | JsonExpr   *jsexpr;
  | ^~


Thank you,

Erik





Re: [PoC] pg_upgrade: allow to upgrade publisher node

2023-09-19 Thread Amit Kapila
On Tue, Sep 19, 2023 at 11:47 AM Hayato Kuroda (Fujitsu)
 wrote:
>
> Dear Amit,
>
> Thank you for reviewing! PSA new version!
>

*
+#include "access/xlogdefs.h"
 #include "common/relpath.h"
 #include "libpq-fe.h"

The above include is not required. I have removed that and made a few
cosmetic changes in the attached.

-- 
With Regards,
Amit Kapila.


v40-0001-pg_upgrade-Allow-to-replicate-logical-replicatio.patch
Description: Binary data


Re: Add last_commit_lsn to pg_stat_database

2023-09-19 Thread Aleksander Alekseev
Hi,

> [...]
> As I was thinking about how to improve things, I realized that this
> information (since it's for monitoring anyway) fits more naturally
> into the stats system. I'd originally thought of exposing it in
> pg_stat_wal, but that's per-cluster rather than per-database (indeed,
> this is a flaw I hadn't considered in the original patch), so I think
> pg_stat_database is the correct location.
>
> I've attached a patch to track the latest commit's LSN in pg_stat_database.

Thanks for the patch. It was marked as "Needs Review" so I decided to
take a brief look.

All in all the code seems to be fine but I have a couple of nitpicks:

- If you are modifying pg_stat_database the corresponding changes to
the documentation are needed.
- pg_stat_get_db_last_commit_lsn() has the same description as
pg_stat_get_db_xact_commit() which is confusing.
- pg_stat_get_db_last_commit_lsn() is marked as STABLE which is
probably correct. But I would appreciate a second opinion on this.
- Wouldn't it be appropriate to add a test or two?
- `if (!XLogRecPtrIsInvalid(commit_lsn))` - I suggest adding
XLogRecPtrIsValid() macro for better readability
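
To spell out that last suggestion, the proposed macro is just a
positive-sense twin of the existing one in access/xlogdefs.h, so the
patch's test reads without the double negative:

/* existing, in src/include/access/xlogdefs.h */
#define InvalidXLogRecPtr	0
#define XLogRecPtrIsInvalid(r)	((r) == InvalidXLogRecPtr)

/* suggested counterpart */
#define XLogRecPtrIsValid(r)	((r) != InvalidXLogRecPtr)

/* usage in the patch's commit path */
if (XLogRecPtrIsValid(commit_lsn))
{
    /* record commit_lsn in this database's stats entry */
}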

-- 
Best regards,
Aleksander Alekseev




Re: remaining sql/json patches

2023-09-19 Thread Amit Langote
On Tue, Sep 19, 2023 at 7:37 PM jian he  wrote:
> On Mon, Sep 18, 2023 at 7:51 PM Erik Rijkers  wrote:
> >
> > and FYI: None of these crashes occur when I leave off the 'WITH WRAPPER'
> > clause.
> >
> > >
> > > Erik
> > >
>
> If WITH WRAPPER is specified, then the default behavior is KEEP QUOTES, so
> jexpr->omit_quotes will be false, which makes val_string NULL.
> In ExecEvalJsonExprCoercion: InputFunctionCallSafe, val_string is
> NULL and flinfo->fn_strict is true, so it will return *op->resvalue =
> (Datum) 0, but at the same time *op->resnull is still false!
>
> If WITH WRAPPER is not specified, then JsonPathQuery will return NULL
> (because after applying the path_expression, multiple SQL/JSON items
> cannot be returned).
>
> select json_query(jsonb'{"a":[{"a":3},{"a":[4,5]}]}','$.a[*].a?(@<=3)'
>  returning int4range);
> also makes the server crash, because the default is KEEP QUOTES, so in
> ExecEvalJsonExprCoercion jexpr->omit_quotes will be false and
> val_string will be NULL again, as mentioned above.

That's right.

> another funny case:
> create domain domain_int4range int4range;
> select json_query(jsonb'{"a":[{"a":[2,3]},{"a":[4,5]}]}','$.a[*].a?(@<=3)'
>  returning domain_int4range with wrapper);
>
> should I expect it to return  [2,4)  ?

This is what you'll get with v16 that I just posted.

>  ---
> https://www.postgresql.org/docs/current/extend-type-system.html#EXTEND-TYPES-POLYMORPHIC
> >>  When the return value of a function is declared as a polymorphic type, 
> >> there must be at least one argument position that is also
> >> polymorphic, and the actual data type(s) supplied for the polymorphic 
> >> arguments determine the actual result type for that call.
>
> select json_query(jsonb'{"a":[{"a":[2,3]},{"a":[4,5]}]}','$.a[*].a?(@<=3)'
> returning anyrange);
> should fail. Now it returns NULL. Maybe we can validate it in
> transformJsonFuncExpr?
> ---

I'm not sure whether we should make the parser complain about the
weird types being specified in RETURNING.  The NULL you get in the
above example is because of the following error:

select json_query(jsonb'{"a":[{"a":[2,3]},{"a":[4,5]}]}','$.a[*].a?(@<=3)'
returning anyrange error on error);
ERROR:  JSON path expression in JSON_QUERY should return singleton
item without wrapper
HINT:  Use WITH WRAPPER clause to wrap SQL/JSON item sequence into array.

-- 
Thanks, Amit Langote
EDB: http://www.enterprisedb.com




Re: Standardize type of variable when extending Buffers

2023-09-19 Thread Ranier Vilela
On Tue, Sep 19, 2023 at 05:07, Peter Eisentraut
wrote:

> On 10.07.23 13:08, Ranier Vilela wrote:
> >
> > On Mon, Jul 10, 2023 at 03:27, Kyotaro Horiguchi
> > wrote:
> >
> > At Fri, 7 Jul 2023 11:29:16 -0700, Gurjeet Singh  > > wrote in
> >  > On Fri, Jul 7, 2023 at 6:12 AM Ranier Vilela  > > wrote:
> >  > >
> >  > > Hi,
> >  > >
> >  > > This has already been discussed in [1].
> >  > > But I thought it best to start a new thread.
> >  > >
> >  > > The commit 31966b1 introduced the infrastructure to extend
> >  > > buffers.
> >  > > But the patch mixed types with int and uint32.
> >  > > The correct type of the variable counter is uint32.
> >  > >
> >  > > Fix by standardizing the int type to uint32.
> >  > >
> >  > > patch attached.
> >  >
> >  > LGTM.
> >
> > LGTM, too.
> >
> > Thanks Gurjeet and Kyotaro, for taking a look.
>
> committed
>
Thank you Peter.

best regards,
Ranier Vilela


Re: Infinite Interval

2023-09-19 Thread Dean Rasheed
On Sat, 16 Sept 2023 at 01:00, jian he  wrote:
>
> I refactor the avg(interval), sum(interval), so moving aggregate,
> plain aggregate both work with +inf/-inf.
> no performance degradation, in fact, some performance gains.
>

I haven't reviewed this part in any detail yet, but I can confirm that
there are some impressive performance improvements for avg(). However,
for me, sum() seems to be consistently a few percent slower with this
patch.

The introduction of an internal transition state struct seems like a
promising approach, but I think there is more to be gained by
eliminating per-row pallocs, and IntervalAggState's MemoryContext
(interval addition, unlike numeric addition, doesn't require memory
allocation, right?).

Also, this needs to include serialization and deserialization
functions, otherwise these aggregates will no longer be able to use
parallel workers. That makes a big difference to queryE, if the size
of the test data is scaled up.
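
For concreteness, the serial function would follow the same
bytea-via-pqformat pattern as the existing numeric aggregate serializers.
A rough sketch is below; the IntervalAggState fields (N, sumX and the
infinity counters) are my guesses from this thread rather than the actual
patch, and the matching deserialize function is omitted:

#include "postgres.h"
#include "fmgr.h"
#include "libpq/pqformat.h"
#include "datatype/timestamp.h"

typedef struct IntervalAggState
{
    int64       N;          /* count of processed intervals */
    Interval    sumX;       /* running sum of finite inputs */
    int64       pInfcount;  /* count of +infinity inputs */
    int64       nInfcount;  /* count of -infinity inputs */
} IntervalAggState;         /* fields guessed from this thread */

Datum
interval_avg_serialize(PG_FUNCTION_ARGS)
{
    IntervalAggState *state = (IntervalAggState *) PG_GETARG_POINTER(0);
    StringInfoData buf;

    pq_begintypsend(&buf);
    pq_sendint64(&buf, state->N);
    pq_sendint64(&buf, state->sumX.time);   /* Interval: time/day/month */
    pq_sendint32(&buf, state->sumX.day);
    pq_sendint32(&buf, state->sumX.month);
    pq_sendint64(&buf, state->pInfcount);
    pq_sendint64(&buf, state->nInfcount);

    PG_RETURN_BYTEA_P(pq_endtypsend(&buf));
}

The aggregates' pg_aggregate entries would then need aggserialfn and
aggdeserialfn pointing at these, so parallel workers can ship the state
around as bytea.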

This comment:

+   int64   N;  /* count of processed numbers */

should be "count of processed intervals".

Regards,
Dean




Re: CHECK Constraint Deferrable

2023-09-19 Thread Dean Rasheed
On Fri, 15 Sept 2023 at 08:00, vignesh C  wrote:
>
> On Thu, 14 Sept 2023 at 15:33, Himanshu Upadhyaya
>  wrote:
> >
> > On Thu, Sep 14, 2023 at 9:57 AM vignesh C  wrote:
> >>
> >> postgres=*#  SET CONSTRAINTS tbl_chk_1 DEFERRED;
> >> SET CONSTRAINTS
> >> postgres=*# INSERT INTO tbl values (1);
> >> ERROR:  new row for relation "tbl_1" violates check constraint "tbl_chk_1"
> >> DETAIL:  Failing row contains (1).
> >>
> > I don't think it's a problem; in the second case there are two DEFERRABLE
> > CHECK constraints and you are marking one as DEFERRED but the other one will be
> > INITIALLY IMMEDIATE.
>
> I think we should be able to defer one constraint like in the case of
> foreign key constraint
>

Agreed. It should be possible to have a mix of deferred and immediate
constraint checks. In the example, the tbl_chk_1 is set deferred, but
it fails immediately, which is clearly not right.

I would say that it's reasonable to limit the scope of this patch to
table constraints only, and leave domain constraints to a possible
follow-up patch.

A few other review comments:

1. The following produces a WARNING (possibly the same issue already reported):

CREATE TABLE foo (a int, b int);
ALTER TABLE foo ADD CONSTRAINT a_check CHECK (a > 0);
ALTER TABLE foo ADD CONSTRAINT b_check CHECK (b > 0) DEFERRABLE;

WARNING:  unexpected pg_constraint record found for relation "foo"

2. I think that equalTupleDescs() should compare the new fields, when
comparing the 2 sets of check constraints.

3. The constraint exclusion code in the planner should ignore
deferrable check constraints (see get_relation_constraints() in
src/backend/optimizer/util/plancat.c), otherwise it might incorrectly
exclude a relation on the basis of a constraint that is temporarily
violated, and return incorrect query results. For example:

CREATE TABLE foo (a int);
CREATE TABLE foo_c1 () INHERITS (foo);
CREATE TABLE foo_c2 () INHERITS (foo);
ALTER TABLE foo_c2 ADD CONSTRAINT cc CHECK (a != 5) INITIALLY DEFERRED;

BEGIN;
INSERT INTO foo_c2 VALUES (5);
SET LOCAL constraint_exclusion TO off;
SELECT * FROM foo WHERE a = 5;
SET LOCAL constraint_exclusion TO on;
SELECT * FROM foo WHERE a = 5;
ROLLBACK;

4. The code in MergeWithExistingConstraint() should prevent inherited
constraints being merged if their deferrable properties don't match
(as MergeConstraintsIntoExisting() does, since
constraints_equivalent() tests the deferrable fields). I.e., the
following should fail to merge the constraints, since they don't
match:

DROP TABLE IF EXISTS p,c;

CREATE TABLE p (a int, b int);
ALTER TABLE p ADD CONSTRAINT c1 CHECK (a > 0) DEFERRABLE;
ALTER TABLE p ADD CONSTRAINT c2 CHECK (b > 0);

CREATE TABLE c () INHERITS (p);
ALTER TABLE c ADD CONSTRAINT c1 CHECK (a > 0);
ALTER TABLE c ADD CONSTRAINT c2 CHECK (b > 0) DEFERRABLE;

I.e., that should produce an error, as happens if c is made to inherit
p *after* the constraints have been added.

5. Instead of just adding the new fields to the end of the ConstrCheck
struct, and to the end of lists of function parameters like
StoreRelCheck(), and other related code, it would be more logical to
put them immediately before the valid/invalid entries, to match the
order of constraint properties in pg_constraint, and functions like
CreateConstraintEntry().

Regards,
Dean




Re: Should we use MemSet or {0} for struct initialization?

2023-09-19 Thread Richard Guo
On Tue, Sep 19, 2023 at 5:37 PM Peter Eisentraut 
wrote:

> On 31.08.23 10:32, Richard Guo wrote:
> > While working on a bug in expandRecordVariable() I noticed that in the
> > switch statement for case RTE_SUBQUERY we initialize struct ParseState
> > with {0} while for case RTE_CTE we do that with MemSet.  I understand
> > that there is nothing wrong with this, just cannot get away with the
> > inconsistency inside the same function (sorry for the nitpicking).
> >
> > Do we have a preference for how to initialize structures?  From 9fd45870
> > it seems that we prefer to {0}.  So here is a trivial patch doing that.
> > And with a rough scan the MemSet calls in pg_stat_get_backend_subxact()
> > can also be replaced with {0}, so include that in the patch too.
>
> The first part (parse_target.c) was already addressed by e0e492e5a9.  I
> have applied the second part (pgstatfuncs.c).


Thanks for pushing this.

Thanks
Richard


Re: remaining sql/json patches

2023-09-19 Thread jian he
On Mon, Sep 18, 2023 at 7:51 PM Erik Rijkers  wrote:
>
> and FYI: None of these crashes occur when I leave off the 'WITH WRAPPER'
> clause.
>
> >
> > Erik
> >

If WITH WRAPPER is specified, then the default behavior is KEEP QUOTES, so
jexpr->omit_quotes will be false, which makes val_string NULL.
In ExecEvalJsonExprCoercion: InputFunctionCallSafe, val_string is
NULL and flinfo->fn_strict is true, so it will return *op->resvalue =
(Datum) 0, but at the same time *op->resnull is still false!

If WITH WRAPPER is not specified, then JsonPathQuery will return NULL
(because after applying the path_expression, multiple SQL/JSON items
cannot be returned).

select json_query(jsonb'{"a":[{"a":3},{"a":[4,5]}]}','$.a[*].a?(@<=3)'
 returning int4range);
also makes the server crash, because the default is KEEP QUOTES, so in
ExecEvalJsonExprCoercion jexpr->omit_quotes will be false and
val_string will be NULL again, as mentioned above.

another funny case:
create domain domain_int4range int4range;
select json_query(jsonb'{"a":[{"a":[2,3]},{"a":[4,5]}]}','$.a[*].a?(@<=3)'
 returning domain_int4range with wrapper);

should I expect it to return  [2,4)  ?
 ---
https://www.postgresql.org/docs/current/extend-type-system.html#EXTEND-TYPES-POLYMORPHIC
>>  When the return value of a function is declared as a polymorphic type, 
>> there must be at least one argument position that is also
>> polymorphic, and the actual data type(s) supplied for the polymorphic 
>> arguments determine the actual result type for that call.

select json_query(jsonb'{"a":[{"a":[2,3]},{"a":[4,5]}]}','$.a[*].a?(@<=3)'
returning anyrange);
should fail. Now it returns NULL. Maybe we can validate it in
transformJsonFuncExpr?
---




Re: pg16: XX000: could not find pathkey item to sort

2023-09-19 Thread Richard Guo
On Mon, Sep 18, 2023 at 10:02 PM Justin Pryzby  wrote:

> This fails since 1349d2790b
>
> commit 1349d2790bf48a4de072931c722f39337e72055e
> Author: David Rowley 
> Date:   Tue Aug 2 23:11:45 2022 +1200
>
> Improve performance of ORDER BY / DISTINCT aggregates
>
> ts=# CREATE TABLE t (a int, b text) PARTITION BY RANGE (a);
> ts=# CREATE TABLE td PARTITION OF t DEFAULT;
> ts=# INSERT INTO t SELECT 1 AS a, '' AS b;
> ts=# SET enable_partitionwise_aggregate=on;
> ts=# explain SELECT a, COUNT(DISTINCT b) FROM t GROUP BY a;
> ERROR:  XX000: could not find pathkey item to sort
> LOCATION:  prepare_sort_from_pathkeys, createplan.c:6235


Thanks for the report!  I've looked at it a little bit.  In function
adjust_group_pathkeys_for_groupagg we add the pathkeys in ordered
aggregates to root->group_pathkeys.  But if the newly added pathkeys do
not have EC members that match the targetlist or can be computed from
the targetlist, prepare_sort_from_pathkeys has a problem computing
sort column info for the newly added pathkeys.  In the given example, the
pathkey representing 'b' cannot match or be computed from the current
targetlist, so prepare_sort_from_pathkeys emits the error.

My first thought about the fix is that we artificially add resjunk
target entries to parse->targetList for the ordered aggregates'
arguments that are ORDER BY expressions, as attached.  While this can
fix the given query, it would cause an Assert failure for the query in
sql/triggers.sql.

-- inserts only
insert into my_table values (1, 'AAA'), (2, 'BBB')
  on conflict (a) do
  update set b = my_table.b || ':' || excluded.b;

I haven't looked into how that happens.

Any thoughts?

Thanks
Richard


v1-0001-Include-ordered-aggregates-arguments-in-targetList.patch
Description: Binary data


Re: Disabling Heap-Only Tuples

2023-09-19 Thread Alvaro Herrera
On 2023-Sep-18, Robert Haas wrote:

> On Tue, Sep 5, 2023 at 11:15 PM Laurenz Albe  wrote:
> > I don't think that is a good comparison.  While most people probably
> > never need to touch "local_update_limit", "work_mem" is something everybody
> > has to consider.
> >
> > And it is not so hard to tune: the setting would be the desired table
> > size, and you could use pgstattuple to find a good value.
> 
> What I suspect would happen, though, is that you'd end up tuning the
> value over and over. You'd set it to some value and after some number
> of vacuums maybe you'd realize that you could save even more disk
> space if you reduced it a bit further or maybe your data set would
> grow a bit and you'd have to increase it a little (or a lot). And if
> you didn't keep adjusting it then maybe something quite bad would
> happen to your database.

As I understand it, the setting being proposed is useful as an emergency
for removing excessive bloat -- a substitute for VACUUM FULL when you
don't want to lock the table for long.  Trying to use it as a permanent
gadget is going to be misguided.  So my first thought is that we should
tell people to use it that way: if you're not in the irrecoverable-space
situation, just do not use this.  Then we don't have to worry about
people misusing it the way you imagine.

Second, I think we should make it auto-reset.  That is, have the user
set some value; later, when some condition triggers (say, the table size
is 1.2x the limit value you configured), then the local_update_limit is
automatically removed from the table options.  From that point onwards,
the table is operated normally.

This removes the other concern that makes the system behaves
suboptimally because some DBA in the past decade left this set for no
good reason: if you run into an emergency, then you activate the
emergency escape hatch, and it will close on its own as soon as the
emergency is over.

This also dissuades people from using it for these other things you
describe.  It just won't work.


The point here is that third-party tools such as pg_repack or pg_squeeze
exist, which work in a way we don't like, yet we offer no alternative.
This proposal is a mechanism that essentially replaces those tools with
a simple in-core feature, without having to include the tool itself in
core.

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
Thou shalt check the array bounds of all strings (indeed, all arrays), for
surely where thou typest "foo" someone someday shall type
"supercalifragilisticexpialidocious" (5th Commandment for C programmers)




Re: Bug fix for psql's meta-command \ev

2023-09-19 Thread Aleksander Alekseev
Hi Michael,

> The patch looks incorrect to me.  In case you've not noticed, we'd
> still have the same problem if do_edit() fails [...]

You are right, I missed it. Your patch is correct while the original
one is not quite so.

-- 
Best regards,
Aleksander Alekseev




Re: remaining sql/json patches

2023-09-19 Thread Alvaro Herrera
0001: I wonder why you used Node for the ErrorSaveContext pointer
instead of the specific struct you want.  I propose the attached, for
some extra type-safety.  Or did you have a reason to do it that way?

-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/
"How amazing is that? I call it a night and come back to find that a bug has
been identified and patched while I sleep."(Robert Davidson)
   http://archives.postgresql.org/pgsql-sql/2006-03/msg00378.php
>From 4cd888cd7abbfb156bad26bc94658be3e286bf1f Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Tue, 19 Sep 2023 12:18:03 +0200
Subject: [PATCH] 0001 fixup: use struct ErrorSaveContext, not Node

---
 src/backend/jit/llvm/llvmjit.c   | 4 ++--
 src/backend/jit/llvm/llvmjit_expr.c  | 2 +-
 src/backend/jit/llvm/llvmjit_types.c | 2 +-
 src/include/executor/execExpr.h  | 3 ++-
 src/include/jit/llvmjit.h| 2 +-
 src/include/nodes/execnodes.h| 3 ++-
 6 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/src/backend/jit/llvm/llvmjit.c b/src/backend/jit/llvm/llvmjit.c
index ef6c8361b4..431d4511c5 100644
--- a/src/backend/jit/llvm/llvmjit.c
+++ b/src/backend/jit/llvm/llvmjit.c
@@ -85,7 +85,7 @@ LLVMTypeRef StructExprState;
 LLVMTypeRef StructAggState;
 LLVMTypeRef StructAggStatePerGroupData;
 LLVMTypeRef StructAggStatePerTransData;
-LLVMTypeRef StructNode;
+LLVMTypeRef StructErrorSaveContext;
 
 LLVMValueRef AttributeTemplate;
 
@@ -1035,7 +1035,7 @@ llvm_create_types(void)
StructAggState = llvm_pg_var_type("StructAggState");
StructAggStatePerGroupData = 
llvm_pg_var_type("StructAggStatePerGroupData");
StructAggStatePerTransData = 
llvm_pg_var_type("StructAggStatePerTransData");
-   StructNode = llvm_pg_var_type("StructNode");
+   StructErrorSaveContext = llvm_pg_var_type("StructErrorSaveContext");
 
AttributeTemplate = LLVMGetNamedFunction(llvm_types_module, 
"AttributeTemplate");
 }
diff --git a/src/backend/jit/llvm/llvmjit_expr.c 
b/src/backend/jit/llvm/llvmjit_expr.c
index 20c79dda9f..59e20ebcbd 100644
--- a/src/backend/jit/llvm/llvmjit_expr.c
+++ b/src/backend/jit/llvm/llvmjit_expr.c
@@ -1361,7 +1361,7 @@ llvm_compile_expr(ExprState *state)
v_params[2] = 
l_int32_const(ioparam);
v_params[3] = l_int32_const(-1);
v_params[4] = 
l_ptr_const(op->d.iocoerce.escontext,
-   
  l_ptr(StructNode));
+   
  l_ptr(StructErrorSaveContext));
/*
 * InputFunctionCallSafe() will 
writes directly into
 * *op->resvalue.
diff --git a/src/backend/jit/llvm/llvmjit_types.c 
b/src/backend/jit/llvm/llvmjit_types.c
index 36f526b374..e1e9625038 100644
--- a/src/backend/jit/llvm/llvmjit_types.c
+++ b/src/backend/jit/llvm/llvmjit_types.c
@@ -67,7 +67,7 @@ TupleTableSlot StructTupleTableSlot;
 HeapTupleTableSlot StructHeapTupleTableSlot;
 MinimalTupleTableSlot StructMinimalTupleTableSlot;
 TupleDescData StructTupleDescData;
-Node StructNode;
+ErrorSaveContext StructErrorSaveContext;
 
 
 /*
diff --git a/src/include/executor/execExpr.h b/src/include/executor/execExpr.h
index 25d3c02dba..8dd92f8fc0 100644
--- a/src/include/executor/execExpr.h
+++ b/src/include/executor/execExpr.h
@@ -16,6 +16,7 @@
 
 #include "executor/nodeAgg.h"
 #include "nodes/execnodes.h"
+#include "nodes/miscnodes.h"
 
 /* forward references to avoid circularity */
 struct ExprEvalStep;
@@ -417,7 +418,7 @@ typedef struct ExprEvalStep
/* lookup and call info for result type's input 
function */
FmgrInfo   *finfo_in;
Oid typioparam;
-   Node   *escontext;
+   ErrorSaveContext *escontext;
}   iocoerce;
 
/* for EEOP_SQLVALUEFUNCTION */
diff --git a/src/include/jit/llvmjit.h b/src/include/jit/llvmjit.h
index e8cdc96b2b..c22ba3787f 100644
--- a/src/include/jit/llvmjit.h
+++ b/src/include/jit/llvmjit.h
@@ -79,7 +79,7 @@ extern PGDLLIMPORT LLVMTypeRef StructExprState;
 extern PGDLLIMPORT LLVMTypeRef StructAggState;
 extern PGDLLIMPORT LLVMTypeRef StructAggStatePerTransData;
 extern PGDLLIMPORT LLVMTypeRef StructAggStatePerGroupData;
-extern PGDLLIMPORT LLVMTypeRef StructNode;
+extern PGDLLIMPORT LLVMTypeRef StructErrorSaveContext;
 
 extern PGDLLIMPORT LLVMValueRef AttributeTemplate;
 
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 57e4ca1c63..cc4ed9279c 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/inc

Re: Move bki file pre-processing from initdb to bootstrap

2023-09-19 Thread Peter Eisentraut

On 01.09.23 10:01, Krishnakumar R wrote:

This patch moves the pre-processing for tokens in the bki file from
initdb to bootstrap. With these changes the bki file will only be
opened once in bootstrap and parsing will be done by the bootstrap
parser.


I did some rough performance tests on this.  I get about a 10% 
improvement on initdb run time, so this appears to have merit.


I wonder whether we can reduce the number of symbols that we need this 
treatment for.


For example, for NAMEDATALEN, SIZEOF_POINTER, ALIGNOF_POINTER, 
FLOAT8PASSBYVAL, these are known at build time, so we could have 
genbki.pl substitute them at build time.


The locale-related symbols (ENCODING, LC_COLLATE, etc.), I wonder 
whether we can eliminate the need for them.  Right now, these are only 
used in the bki entry for the template1 database.  How about initdb 
creates template0 first, with hardcoded default encoding, collation, 
etc., and then creates template1 from that, using the normal CREATE 
DATABASE command with the appropriate options.  Or initdb could just run 
an UPDATE on pg_database to put the right settings in place.


I don't like this part so much, because it adds like 4 more places each 
of these variables is mentioned, which increases the mental and testing 
overhead for dealing with these features (which are an area of active 
development).


In general, it would be good if this could be factored a bit more so 
each variable doesn't have to be hardcoded in so many places.



Some more detailed comments on the code:

+   boot_yylval.str = pstrdup(yytext);
+   sprintf(boot_yylval.str, "%d", NAMEDATALEN);

This is weird.  You are first assigning the text and then overwriting it 
with the numeric value?


You can also use boot_yylval.ival for storing numbers.
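
In other words, something along these lines would be plainer inside the
flex rule (sketch only; the second variant would also need a numeric token
in the bootstrap grammar, which may not be worth adding):

/* (a) still return a string token, but build it in one step */
boot_yylval.str = psprintf("%d", NAMEDATALEN);
return ID;

/* (b) or carry the number directly, if the grammar grows a numeric token */
boot_yylval.ival = NAMEDATALEN;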

+   if (bootp_null(ebootp, ebootp->username)) return 
NULLVAL;


Add proper line breaks in the code.

+bool bootp_null(extra_bootstrap_params *e, char *s)

Add a comment what this function is supposed to do.

This function could be static.

+   while ((flag = getopt(argc, argv, "B:c:d:D:Fi:kr:X:-:")) != -1)

You should use an option letter that isn't already in use in one of the 
other modes of "postgres".  We try to keep those consistent.


New options should be added to the --help output (usage() in main.c).

+   elog(INFO, "Open bki file %s\n", bki_file);
+   boot_yyin = fopen(bki_file, "r");

Why is this needed?  It already reads the bki file from stdin?

+   printfPQExpBuffer(&cmd, "\"%s\" --boot -X %d %s %s %s %s -i 
%s=%s,%s=%s,%s=%s,"

+ "%s=%s,%s=%s,%s=%s,%s=%s,%s=%c",
+ backend_exec,
+ wal_segment_size_mb * (1024 * 1024),
+ boot_options, extra_options,
+ data_checksums ? "-k" : "",
+ debug ? "-d 5" : "",

This appears to undo some of the changes done in cccdbc5d95.

+#define BOOT_LC_COLLATE "lc_collate"
+#define BOOT_LC_CTYPE "lc_ctype"
+#define BOOT_ICU_LOCALE "icu_locale"

etc.  This doesn't look particularly useful.  You can just use the 
strings directly.






Re: Improvements in pg_dump/pg_restore toc format and performances

2023-09-19 Thread Pierre Ducroquet
On Monday, September 18, 2023 11:54:42 PM CEST Nathan Bossart wrote:
> On Mon, Sep 18, 2023 at 02:52:47PM -0700, Nathan Bossart wrote:
> > On Thu, Jul 27, 2023 at 10:51:11AM +0200, Pierre Ducroquet wrote:
> >> I ended up writing several patches that shaved some time for pg_restore
> >> -l,
> >> and reduced the toc.dat size.
> > 
> > I've only just started taking a look at these patches, and I intend to do
> > a
> > more thorough review in the hopefully-not-too-distant future.
> 
> Since cfbot is failing on some pg_upgrade and pg_dump tests, I've set this
> to waiting-on-author.

The attached updated patches fix this regression; I'm sorry I missed that.

>From bafc4a3775d732895aa919de21e6f512cc726f49 Mon Sep 17 00:00:00 2001
From: Pierre Ducroquet 
Date: Wed, 26 Jul 2023 21:31:33 +0200
Subject: [PATCH 1/4] drop has oids field instead of having static values

---
 src/bin/pg_dump/pg_backup_archiver.c | 3 +--
 src/bin/pg_dump/pg_backup_archiver.h | 3 ++-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c
index 4d83381d84..f4c782d63d 100644
--- a/src/bin/pg_dump/pg_backup_archiver.c
+++ b/src/bin/pg_dump/pg_backup_archiver.c
@@ -2510,7 +2510,6 @@ WriteToc(ArchiveHandle *AH)
 		WriteStr(AH, te->tablespace);
 		WriteStr(AH, te->tableam);
 		WriteStr(AH, te->owner);
-		WriteStr(AH, "false");
 
 		/* Dump list of dependencies */
 		for (i = 0; i < te->nDeps; i++)
@@ -2618,7 +2617,7 @@ ReadToc(ArchiveHandle *AH)
 		is_supported = true;
 		if (AH->version < K_VERS_1_9)
 			is_supported = false;
-		else
+		else if (AH->version < K_VERS_1_16)
 		{
 			tmp = ReadStr(AH);
 
diff --git a/src/bin/pg_dump/pg_backup_archiver.h b/src/bin/pg_dump/pg_backup_archiver.h
index b07673933d..b78e326f73 100644
--- a/src/bin/pg_dump/pg_backup_archiver.h
+++ b/src/bin/pg_dump/pg_backup_archiver.h
@@ -68,10 +68,11 @@
 #define K_VERS_1_15 MAKE_ARCHIVE_VERSION(1, 15, 0)	/* add
 	 * compression_algorithm
 	 * in header */
+#define K_VERS_1_16 MAKE_ARCHIVE_VERSION(1, 16, 0)	/* optimize dump format */
 
 /* Current archive version number (the format we can output) */
 #define K_VERS_MAJOR 1
-#define K_VERS_MINOR 15
+#define K_VERS_MINOR 16
 #define K_VERS_REV 0
 #define K_VERS_SELF MAKE_ARCHIVE_VERSION(K_VERS_MAJOR, K_VERS_MINOR, K_VERS_REV)
 
-- 
2.42.0

>From e8dbb51713daefa1596ce63ea6990d2fd1c4e27f Mon Sep 17 00:00:00 2001
From: Pierre Ducroquet 
Date: Wed, 26 Jul 2023 22:23:28 +0200
Subject: [PATCH 2/4] convert sscanf to strtoul

---
 src/bin/pg_dump/pg_backup_archiver.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c
index f4c782d63d..f9efb2badf 100644
--- a/src/bin/pg_dump/pg_backup_archiver.c
+++ b/src/bin/pg_dump/pg_backup_archiver.c
@@ -2556,13 +2556,13 @@ ReadToc(ArchiveHandle *AH)
 		if (AH->version >= K_VERS_1_8)
 		{
 			tmp = ReadStr(AH);
-			sscanf(tmp, "%u", &te->catalogId.tableoid);
+			te->catalogId.tableoid = strtoul(tmp, NULL, 10);
 			free(tmp);
 		}
 		else
 			te->catalogId.tableoid = InvalidOid;
 		tmp = ReadStr(AH);
-		sscanf(tmp, "%u", &te->catalogId.oid);
+		te->catalogId.oid = strtoul(tmp, NULL, 10);
 		free(tmp);
 
 		te->tag = ReadStr(AH);
@@ -2646,7 +2646,7 @@ ReadToc(ArchiveHandle *AH)
 	depSize *= 2;
 	deps = (DumpId *) pg_realloc(deps, sizeof(DumpId) * depSize);
 }
-sscanf(tmp, "%d", &deps[depIdx]);
+deps[depIdx] = strtoul(tmp, NULL, 10);
 free(tmp);
 depIdx++;
 			}
-- 
2.42.0

>From eb02760df97ab8e95287662a88ac3439fbc0a4fa Mon Sep 17 00:00:00 2001
From: Pierre Ducroquet 
Date: Thu, 27 Jul 2023 10:04:51 +0200
Subject: [PATCH 3/4] store oids as integer instead of string

---
 src/bin/pg_dump/pg_backup_archiver.c | 63 ++--
 1 file changed, 42 insertions(+), 21 deletions(-)

diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c
index f9efb2badf..feacc22701 100644
--- a/src/bin/pg_dump/pg_backup_archiver.c
+++ b/src/bin/pg_dump/pg_backup_archiver.c
@@ -2470,7 +2470,6 @@ void
 WriteToc(ArchiveHandle *AH)
 {
 	TocEntry   *te;
-	char		workbuf[32];
 	int			tocCount;
 	int			i;
 
@@ -2494,11 +2493,8 @@ WriteToc(ArchiveHandle *AH)
 		WriteInt(AH, te->dumpId);
 		WriteInt(AH, te->dataDumper ? 1 : 0);
 
-		/* OID is recorded as a string for historical reasons */
-		sprintf(workbuf, "%u", te->catalogId.tableoid);
-		WriteStr(AH, workbuf);
-		sprintf(workbuf, "%u", te->catalogId.oid);
-		WriteStr(AH, workbuf);
+		WriteInt(AH, te->catalogId.tableoid);
+		WriteInt(AH, te->catalogId.oid);
 
 		WriteStr(AH, te->tag);
 		WriteStr(AH, te->desc);
@@ -2514,10 +2510,9 @@ WriteToc(ArchiveHandle *AH)
 		/* Dump list of dependencies */
 		for (i = 0; i < te->nDeps; i++)
 		{
-			sprintf(workbuf, "%d", te->dependencies[i]);
-			WriteStr(AH, workbuf);
+			WriteInt(AH, te->dependencies[i]);
 		}
-		WriteStr(AH, NULL);		/* Terminate List */
+		

Re: Should we use MemSet or {0} for struct initialization?

2023-09-19 Thread Peter Eisentraut

On 31.08.23 10:32, Richard Guo wrote:

While working on a bug in expandRecordVariable() I noticed that in the
switch statement for case RTE_SUBQUERY we initialize struct ParseState
with {0} while for case RTE_CTE we do that with MemSet.  I understand
that there is nothing wrong with this, just cannot get away with the
inconsistency inside the same function (sorry for the nitpicking).

Do we have a preference for how to initialize structures?  From 9fd45870
it seems that we prefer to {0}.  So here is a trivial patch doing that.
And with a rough scan the MemSet calls in pg_stat_get_backend_subxact()
can also be replaced with {0}, so include that in the patch too.


The first part (parse_target.c) was already addressed by e0e492e5a9.  I 
have applied the second part (pgstatfuncs.c).
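
For anyone skimming the thread, the two idioms side by side (ParseState is
just a convenient example struct):

ParseState	pstate = {0};				/* structure initializer, now preferred */

ParseState	pstate2;
MemSet(&pstate2, 0, sizeof(pstate2));	/* older style being replaced */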






Re: Improve the log message output of basic_archive when basic_archive.archive_directory parameter is not set

2023-09-19 Thread bt23nguyent

On 2023-09-15 23:38, Nathan Bossart wrote:

On Fri, Sep 15, 2023 at 02:48:55PM +0200, Daniel Gustafsson wrote:
On 15 Sep 2023, at 12:49, Alvaro Herrera  
wrote:


On 2023-Sep-15, Daniel Gustafsson wrote:


-basic_archive_configured(ArchiveModuleState *state)
+basic_archive_configured(ArchiveModuleState *state, const char 
**errmsg)


The variable name errmsg implies that it will contain the errmsg() 
data when it
in fact is used for errhint() data, so it should be named 
accordingly.


I have no objection to allowing this callback to provide additional
information, but IMHO this should use errdetail() instead of errhint(). 
 In

the provided patch, the new message explains how the module is not
configured.  It doesn't hint at how to fix it (although presumably one
could figure that out pretty easily).

It's probably better to define the interface as ArchiveCheckConfiguredCB
functions returning an allocated string in the passed pointer which the
caller is responsible for freeing.


That does seem more flexible.

Also note that this callback is documented in archive-modules.sgml, so
that needs to be updated as well.  This also means you can't backpatch
this change, or you risk breaking external software that implements this
interface.


Absolutely, this is master only for v17.


+1


Thank you for all of your comments!

They are all really constructive and I totally agree with the points you 
brought up.

I have updated the patch accordingly.

Please let me know if you have any further suggestions for improvement.


Best regards,
Tung Nguyen
diff --git a/contrib/basic_archive/basic_archive.c b/contrib/basic_archive/basic_archive.c
index 4d78c31859..7b8673f338 100644
--- a/contrib/basic_archive/basic_archive.c
+++ b/contrib/basic_archive/basic_archive.c
@@ -48,7 +48,7 @@ typedef struct BasicArchiveData
 static char *archive_directory = NULL;
 
 static void basic_archive_startup(ArchiveModuleState *state);
-static bool basic_archive_configured(ArchiveModuleState *state);
+static bool basic_archive_configured(ArchiveModuleState *state, char **logdetail);
 static bool basic_archive_file(ArchiveModuleState *state, const char *file, const char *path);
 static void basic_archive_file_internal(const char *file, const char *path);
 static bool check_archive_directory(char **newval, void **extra, GucSource source);
@@ -159,9 +159,15 @@ check_archive_directory(char **newval, void **extra, GucSource source)
  * Checks that archive_directory is not blank.
  */
 static bool
-basic_archive_configured(ArchiveModuleState *state)
+basic_archive_configured(ArchiveModuleState *state, char **logdetail)
 {
-	return archive_directory != NULL && archive_directory[0] != '\0';
+	if (archive_directory == NULL || archive_directory[0] == '\0')
+	{
+		*logdetail = pstrdup("WAL archiving failed because basic_archive.archive_directory is not set");
+		return false;
+	}
+	else
+		return true;
 }
 
 /*
diff --git a/doc/src/sgml/archive-modules.sgml b/doc/src/sgml/archive-modules.sgml
index 7064307d9e..f9f6177844 100644
--- a/doc/src/sgml/archive-modules.sgml
+++ b/doc/src/sgml/archive-modules.sgml
@@ -101,7 +101,7 @@ typedef void (*ArchiveStartupCB) (ArchiveModuleState *state);
 assumes the module is configured.
 
 
-typedef bool (*ArchiveCheckConfiguredCB) (ArchiveModuleState *state);
+typedef bool (*ArchiveCheckConfiguredCB) (ArchiveModuleState *state, char **logdetail);
 
 
 If true is returned, the server will proceed with
@@ -112,7 +112,9 @@ typedef bool (*ArchiveCheckConfiguredCB) (ArchiveModuleState *state);
 WARNING:  archive_mode enabled, yet archiving is not configured
 
 In the latter case, the server will periodically call this function, and
-archiving will proceed only when it returns true.
+archiving will proceed only when it returns true. The
+archiver may also emit a detail message to the server log explaining why the
+module is not configured, if the archive module provides one.

   
 
diff --git a/src/backend/archive/shell_archive.c b/src/backend/archive/shell_archive.c
index 157ca4c751..c957a18ee7 100644
--- a/src/backend/archive/shell_archive.c
+++ b/src/backend/archive/shell_archive.c
@@ -23,7 +23,7 @@
 #include "common/percentrepl.h"
 #include "pgstat.h"
 
-static bool shell_archive_configured(ArchiveModuleState *state);
+static bool shell_archive_configured(ArchiveModuleState *state, char **logdetail);
 static bool shell_archive_file(ArchiveModuleState *state,
 			   const char *file,
 			   const char *path);
@@ -43,7 +43,7 @@ shell_archive_init(void)
 }
 
 static bool
-shell_archive_configured(ArchiveModuleState *state)
+shell_archive_configured(ArchiveModuleState *state, char **logdetail)
 {
 	return XLogArchiveCommand[0] != '\0';
 }
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 46af349564..2fd1d03b09 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -390,7 +390,8 @@ pgar

Re: Sync scan & regression tests

2023-09-19 Thread Heikki Linnakangas

On 19/09/2023 01:57, Andres Freund wrote:

On 2023-09-18 13:49:24 +0300, Heikki Linnakangas wrote:

d) Copy fewer rows to the table in the test. If we copy only 6 rows, for
example, the table will have only two pages, regardless of shared_buffers.

I'm leaning towards d). The whole test is a little fragile, it will also
fail with a non-default block size, for example. But c) seems like a simple
fix and wouldn't look too out of place in the test.


Hm, what do you mean with the last sentence? Oh, is the test you're
referencing the relation-extension logic?


Sorry, I said "c) seems like a simple fix ...", but I meant "d) seems 
like a simple fix ..."


I meant the attached.

--
Heikki Linnakangas
Neon (https://neon.tech)
diff --git a/contrib/pg_visibility/expected/pg_visibility.out b/contrib/pg_visibility/expected/pg_visibility.out
index 9de54db2a2..09fa5933a3 100644
--- a/contrib/pg_visibility/expected/pg_visibility.out
+++ b/contrib/pg_visibility/expected/pg_visibility.out
@@ -217,8 +217,7 @@ select * from pg_visibility_map('copyfreeze');
 ---+-+
  0 | t   | t
  1 | t   | t
- 2 | t   | t
-(3 rows)
+(2 rows)
 
 select * from pg_check_frozen('copyfreeze');
  t_ctid 
diff --git a/contrib/pg_visibility/sql/pg_visibility.sql b/contrib/pg_visibility/sql/pg_visibility.sql
index ff3538f996..5af06ec5b7 100644
--- a/contrib/pg_visibility/sql/pg_visibility.sql
+++ b/contrib/pg_visibility/sql/pg_visibility.sql
@@ -108,12 +108,6 @@ copy copyfreeze from stdin freeze;
 4	'4'
 5	'5'
 6	'6'
-7	'7'
-8	'8'
-9	'9'
-10	'10'
-11	'11'
-12	'12'
 \.
 commit;
 select * from pg_visibility_map('copyfreeze');


Fix bug in VACUUM and ANALYZE docs

2023-09-19 Thread Ryoga Yoshida

Hi,

Issue1:
The VACUUM and ANALYZE docs present the parameter of BUFFER_USAGE_LIMIT
as optional, as shown below. But this is not true: the argument, size, is
required for BUFFER_USAGE_LIMIT. So the docs should be fixed.

BUFFER_USAGE_LIMIT [ size ]
https://www.postgresql.org/docs/devel/sql-vacuum.html
https://www.postgresql.org/docs/devel/sql-analyze.html

Issue2:
Sizes may also be specified as a string containing the numerical size 
followed by any one of the following memory units: kB (kilobytes), MB 
(megabytes), GB (gigabytes), or TB (terabytes).
The VACUUM and ANALYZE docs say that the argument of BUFFER_USAGE_LIMIT
accepts the units kB (kilobytes), MB (megabytes), GB (gigabytes),
or TB (terabytes). But it actually also accepts B (bytes) as a unit. So
the docs should list "B (bytes)" among the units that the argument of
BUFFER_USAGE_LIMIT can accept.


You can see the patch in the attached file.

Ryoga Yoshida

diff --git a/doc/src/sgml/ref/analyze.sgml b/doc/src/sgml/ref/analyze.sgml
index 1fba089265..bc973bdd1e 100644
--- a/doc/src/sgml/ref/analyze.sgml
+++ b/doc/src/sgml/ref/analyze.sgml
@@ -27,7 +27,7 @@ ANALYZE [ ( option [, ...] ) ] [ boolean ]
 SKIP_LOCKED [ boolean ]
-BUFFER_USAGE_LIMIT [ size ]
+BUFFER_USAGE_LIMIT size
 
 and table_and_columns is:
 
@@ -128,7 +128,7 @@ ANALYZE [ ( option [, ...] ) ] [ 
   Specifies an amount of memory in kilobytes.  Sizes may also be specified
   as a string containing the numerical size followed by any one of the
-  following memory units: kB (kilobytes),
+  following memory units: B (bytes), kB (kilobytes),
   MB (megabytes), GB (gigabytes), or
   TB (terabytes).
  
diff --git a/doc/src/sgml/ref/vacuum.sgml b/doc/src/sgml/ref/vacuum.sgml
index 90cde70c07..f2e7c0bbde 100644
--- a/doc/src/sgml/ref/vacuum.sgml
+++ b/doc/src/sgml/ref/vacuum.sgml
@@ -38,7 +38,7 @@ VACUUM [ ( option [, ...] ) ] [ integer
 SKIP_DATABASE_STATS [ boolean ]
 ONLY_DATABASE_STATS [ boolean ]
-BUFFER_USAGE_LIMIT [ size ]
+BUFFER_USAGE_LIMIT size
 
 and table_and_columns is:
 
@@ -389,7 +389,7 @@ VACUUM [ ( option [, ...] ) ] [ 
   Specifies an amount of memory in kilobytes.  Sizes may also be specified
   as a string containing the numerical size followed by any one of the
-  following memory units: kB (kilobytes),
+  following memory units: B (bytes), kB (kilobytes),
   MB (megabytes), GB (gigabytes), or
   TB (terabytes).
  


Re: Fix GIST readme on LSN vs NSN

2023-09-19 Thread Heikki Linnakangas

On 18/09/2023 14:53, Daniel Gustafsson wrote:

On 18 Sep 2023, at 13:09, Heikki Linnakangas  wrote:



I propose the attached patch to reword the sentence a little more.


LGTM, +1


Committed, thanks!

--
Heikki Linnakangas
Neon (https://neon.tech)





Re: [PATCH] Add native windows on arm64 support

2023-09-19 Thread Peter Eisentraut

On 14.09.23 11:39, Daniel Gustafsson wrote:

On 13 Sep 2023, at 21:12, Peter Eisentraut  wrote:

On 31.08.23 06:44, Tom Lane wrote:

I agree.  I'm really uncomfortable with claiming support for
Windows-on-ARM if we don't have a buildfarm member testing it.
For other platforms that have a track record of multiple
hardware support, it might not be a stretch ... but Windows was
so resolutely Intel-only for so long that "it works on ARM" is
a proposition that I won't trust without hard evidence.  There
are too many bits of that system that might not have gotten the
word yet, or at least not gotten sufficient testing.
My vote for this is we don't commit without a buildfarm member.


I think we can have a multi-tiered approach, where we can commit support but 
consider it experimental until we have buildfarm coverage.


If it's experimental it should probably be behind an opt-in flag in
autoconf/meson, or be reverted by the time REL_17_STABLE branches unless
coverage has materialized by then.


The author's email is bouncing now due to a job change, so it's unlikely
we will see any more progress on this.  I am setting it to returned
with feedback.






Re: pgbnech: allow to cancel queries during benchmark

2023-09-19 Thread Yugo NAGATA
On Wed, 6 Sep 2023 20:13:34 +0900
Yugo NAGATA  wrote:
 
> I attached the updated patch v3. The changes since the previous
> patch includes the following;
> 
> I removed the unnecessary condition (&& false) that you
> pointed out in [1].
> 
> The test was rewritten by using IPC::Run signal() and integrated
> to "001_pgbench_with_server.pl". This test is skipped on Windows
> because SIGINT causes to terminate the test itself as discussed
> in [2] about query cancellation test in psql.
> 
> I added some comments to describe how query cancellation is
> handled as I explained in [1].
> 
> Also, I found the previous patch didn't work on Windows so fixed it.
> On non-Windows system, a thread waiting a response of long query can
> be interrupted by SIGINT, but on Windows, threads do not return from
> waiting until queries they are running are cancelled. This is because,
> when the signal is received, the system just creates a new thread to
> execute the callback function specified by setup_cancel_handler, and
> other thread continue to run[3]. Therefore, the queries have to be
> cancelled in the callback function.
> 
> [1] 
> https://www.postgresql.org/message-id/a58388ac-5411-4760-ea46-71324d8324cb%40mines-paristech.fr
> [2] 
> https://www.postgresql.org/message-id/20230906004524.2fd6ee049f8a6c6f2690b99c%40sraoss.co.jp
> [3] https://learn.microsoft.com/en-us/windows/console/handlerroutine

I found that the --disable-thread-safety option was removed in 68a4b58eca0329.
So, I removed the code involving ENABLE_THREAD_SAFETY from the patch.

Also, I wrote a commit log draft.

Attached is the updated version, v4.

Regards,
Yugo Nagata


-- 
Yugo NAGATA 
>From 10578e5cf02b1a0f7f0988bc7a877dcbeb5448b6 Mon Sep 17 00:00:00 2001
From: Yugo Nagata 
Date: Mon, 24 Jul 2023 21:53:28 +0900
Subject: [PATCH v4] Allow pgbench to cancel queries during benchmark

Previously, Ctrl+C during benchmark killed pgbench immediately,
but queries running at that time were not cancelled. The commit
fixes this so that cancel requests are sent for all connections
before pgbench exits.

In thread #0, setup_cancel_handler is called before the benchmark
so that CancelRequested is set when SIGINT is sent. When SIGINT
is sent during the benchmark, on non-Windows, thread #0 will be
interrupted, return from I/O wait, and send cancel requests to
all connections. After the queries are cancelled, the other threads
will also be interrupted and pgbench will exit. On Windows, cancel
requests are sent in the callback function specified by
setup_cancel_handler.
---
 src/bin/pgbench/pgbench.c| 89 
 src/bin/pgbench/t/001_pgbench_with_server.pl | 42 +
 2 files changed, 131 insertions(+)

diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 713e8a06bb..5adf099b76 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -596,6 +596,7 @@ typedef enum
 typedef struct
 {
 	PGconn	   *con;			/* connection handle to DB */
+	PGcancel   *cancel;			/* query cancel */
 	int			id;/* client No. */
 	ConnectionStateEnum state;	/* state machine's current state. */
 	ConditionalStack cstack;	/* enclosing conditionals state */
@@ -638,6 +639,8 @@ typedef struct
  * here */
 } CState;
 
+CState	*client_states;		/* status of all clients */
+
 /*
  * Thread state
  */
@@ -837,6 +840,10 @@ static void add_socket_to_set(socket_set *sa, int fd, int idx);
 static int	wait_on_socket_set(socket_set *sa, int64 usecs);
 static bool socket_has_input(socket_set *sa, int fd, int idx);
 
+#ifdef WIN32
+static void pgbench_cancel_callback(void);
+#endif
+
 /* callback used to build rows for COPY during data loading */
 typedef void (*initRowMethod) (PQExpBufferData *sql, int64 curr);
 
@@ -3639,6 +3646,7 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 		st->state = CSTATE_ABORTED;
 		break;
 	}
+	st->cancel = PQgetCancel(st->con);
 
 	/* reset now after connection */
 	now = pg_time_now();
@@ -4670,6 +4678,18 @@ disconnect_all(CState *state, int length)
 		finishCon(&state[i]);
 }
 
+/* send cancel requests to all connections */
+static void
+cancel_all()
+{
+	for (int i = 0; i < nclients; i++)
+	{
+		char errbuf[1];
+		if (client_states[i].cancel != NULL)
+			(void) PQcancel(client_states[i].cancel, errbuf, sizeof(errbuf));
+	}
+}
+
 /*
  * Remove old pgbench tables, if any exist
  */
@@ -7146,6 +7166,9 @@ main(int argc, char **argv)
 		}
 	}
 
+	/* enable threads to access the status of all clients */
+	client_states = state;
+
 	/* other CState initializations */
 	for (i = 0; i < nclients; i++)
 	{
@@ -7358,6 +7381,37 @@ threadRun(void *arg)
 	StatsData	last,
 aggs;
 
+	/*
+	 * Query cancellation is handled only in thread #0.
+	 *
+	 * On Windows, a callback function is set in which query cancel requests
+	 * are sent to all benchmark queries running in the backend.
+	 *
+	 * On non-Windows, no callback function is set. When SIGINT is
+	 * received, C
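
As a rough sketch of the Windows-side handling described above, the callback
declared as pgbench_cancel_callback in the patch could, for example, simply
forward to the cancel_all() helper the patch adds; this is illustrative only
and not necessarily the exact code in the attachment:

#ifdef WIN32
/*
 * Illustrative sketch: on Windows, SIGINT runs this callback in a
 * separate thread while the other threads keep waiting on their
 * queries, so the cancel requests have to be sent from here.
 */
static void
pgbench_cancel_callback(void)
{
	cancel_all();
}
#endif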

Re: Improvements in pg_dump/pg_restore toc format and performances

2023-09-19 Thread Pierre Ducroquet
On Monday, September 18, 2023 11:54:42 PM CEST Nathan Bossart wrote:
> On Mon, Sep 18, 2023 at 02:52:47PM -0700, Nathan Bossart wrote:
> > On Thu, Jul 27, 2023 at 10:51:11AM +0200, Pierre Ducroquet wrote:
> >> I ended up writing several patches that shaved some time for pg_restore
> >> -l,
> >> and reduced the toc.dat size.
> > 
> > I've only just started taking a look at these patches, and I intend to do
> > a
> > more thorough review in the hopefully-not-too-distant future.
> 
> Since cfbot is failing on some pg_upgrade and pg_dump tests, I've set this
> to waiting-on-author.

FYI, the failures are related to patch 0004. I said it was dirty, but that was
apparently an understatement. Patches 0001 to 0003 don't exhibit any
regression.






Re: [PoC] Reducing planning time when tables have many partitions

2023-09-19 Thread Andrey Lepikhov

On 25/8/2023 14:39, Yuya Watari wrote:

Hello,

On Wed, Aug 9, 2023 at 8:54 PM David Rowley  wrote:

I think the best way to move this forward is to explore not putting
partitioned table partitions in EMs and instead see if we can
translate to top-level parent before lookups.  This might just be too
complex to translate the Exprs all the time and it may add overhead
unless we can quickly determine somehow that we don't need to attempt
to translate the Expr when the given Expr is already from the
top-level parent. If that can't be made to work, then maybe that shows
the current patch has merit.


Based on your suggestion, I have experimented with not putting child
EquivalenceMembers in an EquivalenceClass. I have attached a new
patch, v20, to this email. The following is a summary of v20.
While working on self-join removal in the nearby thread [1], I ran into a
problem that provides an additional argument for working in this new
direction rather than the previous ones.
By indexing positions in the list of equivalence members, we make some
optimizations like join elimination more complicated - they may need to
remove some clauses and equivalence class members.
To change the lists of derived clauses or ec_members, we would have to go
through all the index lists and fix them, which is a non-trivial operation.


[1] 
https://www.postgresql.org/message-id/flat/64486b0b-0404-e39e-322d-0801154901f3%40postgrespro.ru


--
regards,
Andrey Lepikhov
Postgres Professional





Re: Standardize type of variable when extending Buffers

2023-09-19 Thread Peter Eisentraut

On 10.07.23 13:08, Ranier Vilela wrote:


On Mon, Jul 10, 2023 at 03:27, Kyotaro Horiguchi
<horikyota@gmail.com> wrote:


At Fri, 7 Jul 2023 11:29:16 -0700, Gurjeet Singh <gurj...@singh.im> wrote in
 > On Fri, Jul 7, 2023 at 6:12 AM Ranier Vilela <ranier...@gmail.com> wrote:
 > >
 > > Hi,
 > >
 > > This has already been discussed in [1].
 > > But I thought it best to start a new thread.
 > >
 > > The commit 31966b1 introduced the infrastructure to extend
 > > buffers.
 > > But the patch mixed types with int and uint32.
 > > The correct type of the variable counter is uint32.
 > >
 > > Fix by standardizing the int type to uint32.
 > >
 > > patch attached.
 >
 > LGTM.

LGTM, too.

Thanks Gurjeet and Kyotaro, for taking a look.


committed





Re: Fix error handling in be_tls_open_server()

2023-09-19 Thread Sergey Shinderuk

On 19.09.2023 03:54, Michael Paquier wrote:

One doubt that I have is if we shouldn't let X509_NAME_print_ex() be
as it is now, and not force a failure on the bio if this calls fails.



If malloc fails inside X509_NAME_print_ex, then we will be left with an
empty port->peer_dn. Here is a gdb session showing this:


(gdb) b X509_NAME_print_ex
Breakpoint 1 at 0x7f539f6c0cf0
(gdb) c
Continuing.

Breakpoint 1, 0x7f539f6c0cf0 in X509_NAME_print_ex () from 
/lib/x86_64-linux-gnu/libcrypto.so.3

(gdb) bt
#0  0x7f539f6c0cf0 in X509_NAME_print_ex () from 
/lib/x86_64-linux-gnu/libcrypto.so.3
#1  0x56026d2fbe8d in be_tls_open_server 
(port=port@entry=0x56026ed5d730) at be-secure-openssl.c:635
#2  0x56026d2eefa5 in secure_open_server 
(port=port@entry=0x56026ed5d730) at be-secure.c:118
#3  0x56026d3dc412 in ProcessStartupPacket 
(port=port@entry=0x56026ed5d730, ssl_done=ssl_done@entry=false, 
gss_done=gss_done@entry=false) at postmaster.c:2065
#4  0x56026d3dcd8e in BackendInitialize 
(port=port@entry=0x56026ed5d730) at postmaster.c:4377
#5  0x56026d3def6a in BackendStartup 
(port=port@entry=0x56026ed5d730) at postmaster.c:4155

#6  0x56026d3df115 in ServerLoop () at postmaster.c:1781
#7  0x56026d3e0645 in PostmasterMain (argc=argc@entry=20, 
argv=argv@entry=0x56026ec5a0d0) at postmaster.c:1465

#8  0x56026d2fcd7c in main (argc=20, argv=0x56026ec5a0d0) at main.c:198
(gdb) b malloc
Breakpoint 2 at 0x7f539eca5120: file ./malloc/malloc.c, line 3287.
(gdb) c
Continuing.

Breakpoint 2, __GI___libc_malloc (bytes=4) at ./malloc/malloc.c:3287
3287./malloc/malloc.c: No such file or directory.
(gdb) bt
#0  __GI___libc_malloc (bytes=4) at ./malloc/malloc.c:3287
#1  0x7f539f6f6e09 in BUF_MEM_grow_clean () from 
/lib/x86_64-linux-gnu/libcrypto.so.3

#2  0x7f539f6e4fb8 in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.3
#3  0x7f539f6d22fb in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.3
#4  0x7f539f6d5c06 in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.3
#5  0x7f539f6d5d37 in BIO_write () from 
/lib/x86_64-linux-gnu/libcrypto.so.3

#6  0x7f539f6bdb41 in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.3
#7  0x7f539f6c0b7d in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.3
#8  0x56026d2fbe8d in be_tls_open_server 
(port=port@entry=0x56026ed5d730) at be-secure-openssl.c:635
#9  0x56026d2eefa5 in secure_open_server 
(port=port@entry=0x56026ed5d730) at be-secure.c:118
#10 0x56026d3dc412 in ProcessStartupPacket 
(port=port@entry=0x56026ed5d730, ssl_done=ssl_done@entry=false, 
gss_done=gss_done@entry=false) at postmaster.c:2065
#11 0x56026d3dcd8e in BackendInitialize 
(port=port@entry=0x56026ed5d730) at postmaster.c:4377
#12 0x56026d3def6a in BackendStartup 
(port=port@entry=0x56026ed5d730) at postmaster.c:4155

#13 0x56026d3df115 in ServerLoop () at postmaster.c:1781
#14 0x56026d3e0645 in PostmasterMain (argc=argc@entry=20, 
argv=argv@entry=0x56026ec5a0d0) at postmaster.c:1465

#15 0x56026d2fcd7c in main (argc=20, argv=0x56026ec5a0d0) at main.c:198
(gdb) return 0
Make __GI___libc_malloc return now? (y or n) y
#0  0x7f539f6f6e09 in BUF_MEM_grow_clean () from 
/lib/x86_64-linux-gnu/libcrypto.so.3

(gdb) delete
Delete all breakpoints? (y or n) y
(gdb) c
Continuing.

And in the server log we see:

DEBUG:  SSL connection from DN:"" CN:"ssltestuser"

While in the normal case we get:

DEBUG:  SSL connection from DN:"CN=ssltestuser" CN:"ssltestuser"\

Probably we shouldn't ignore the error from X509_NAME_print_ex?
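
For what it's worth, here is a minimal sketch of checking the result rather
than ignoring it; the function name, memory handling and error reporting are
simplified for illustration, and this is not the exact fix being proposed:

#include <stdlib.h>
#include <string.h>
#include <openssl/bio.h>
#include <openssl/buffer.h>
#include <openssl/x509.h>

/*
 * Format a certificate's subject DN via a memory BIO, treating a failure
 * of X509_NAME_print_ex() as an error instead of silently producing an
 * empty DN.  Returns a malloc'd string, or NULL on failure.
 */
static char *
format_peer_dn(X509 *peer)
{
	X509_NAME  *x509name = X509_get_subject_name(peer);
	BIO		   *bio = BIO_new(BIO_s_mem());
	BUF_MEM    *bio_buf = NULL;
	char	   *result = NULL;

	if (bio == NULL)
		return NULL;

	/* With XN_FLAG_RFC2253 (XN_FLAG_COMPAT not set), -1 signals an error. */
	if (X509_NAME_print_ex(bio, x509name, 0, XN_FLAG_RFC2253) == -1 ||
		BIO_get_mem_ptr(bio, &bio_buf) <= 0)
	{
		BIO_free(bio);
		return NULL;
	}

	result = malloc(bio_buf->length + 1);
	if (result != NULL)
	{
		memcpy(result, bio_buf->data, bio_buf->length);
		result[bio_buf->length] = '\0';
	}
	BIO_free(bio);
	return result;
}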


--
Sergey Shinderuk    https://postgrespro.com/





Re: Questioning an errcode and message in jsonb.c

2023-09-19 Thread Andy Fan
Hi,

 Thanks for raising this issue in a more public way:)

On Tue, Sep 19, 2023 at 12:55 AM Chapman Flack 
wrote:

>
> It would make me happy if the message could be changed, and maybe
> ERRCODE_INVALID_PARAMETER_VALUE also changed, perhaps to one of
> the JSON-specific ones in the 2203x range.
>

I'd agree with this.


> By the same token, the message and the errcode are established
> current behavior, so there can be sound arguments against changing
> them (even though that means weird logic in rewriting the expression).
>

This is not a technical issue; I'd really like to see what some
more experienced people say about this.  I think just documenting the
incompatible behavior is an option as well.

-- 
Best Regards
Andy Fan


Re: Bug fix for psql's meta-command \ev

2023-09-19 Thread Ryoga Yoshida

On 2023-09-19 15:29, Ryoga Yoshida wrote:

You can see attached file.


I didn't notice that Michael attached the patch file. Just ignore my 
file. I apologize for the inconvenience.


Ryoga Yoshida




Re: Better help output for pgbench -I init_steps

2023-09-19 Thread Peter Eisentraut

On 12.07.23 15:41, Alvaro Herrera wrote:

On 2023-Jul-12, Gurjeet Singh wrote:


The init-steps are severely under-documented in pgbench --help output.
I think at least a pointer to the the pgbench docs should be mentioned
in the pgbench --help output; an average user may not rush to read the
code to find the explanation, but a hint to where to find more details
about what the letters in --init-steps mean, would save them a lot of
time.


Agreed.

I would do it the way `pg_waldump --rmgr=list` or `psql
--help=variables` are handled: they give you a list of what is
supported.  You don't have to put the list in the main --help output.


I think I prefer variant 2.  Currently we only have 8 steps, so it might 
be overkill to separate them out into a different option.





Re: information_schema and not-null constraints

2023-09-19 Thread Peter Eisentraut

On 14.09.23 10:20, Peter Eisentraut wrote:

On 06.09.23 19:52, Alvaro Herrera wrote:
+    SELECT current_database()::information_schema.sql_identifier AS constraint_catalog,
+   rs.nspname::information_schema.sql_identifier AS constraint_schema,
+   con.conname::information_schema.sql_identifier AS constraint_name,
+   format('CHECK (%s IS NOT NULL)', at.attname)::information_schema.character_data AS check_clause


Small correction here: This should be

pg_catalog.format('%s IS NOT NULL', at.attname)::information_schema.character_data AS check_clause


That is, the word "CHECK" and the parentheses should not be part of the
produced value.


Slightly related, so let's just tack it on here:

While testing this, I noticed that the way the check_clause of regular 
check constraints is computed appears to be suboptimal.  It currently does


CAST(substring(pg_get_constraintdef(con.oid) from 7) AS character_data)

which ends up with an extra set of parentheses, which is ignorable, but 
it also leaves in suffixes like "NOT VALID", which don't belong in 
that column.  Earlier in this thread I had contemplated a fix for the 
first issue, but that wouldn't address the second issue.  I think we can 
fix this quite simply by using pg_get_expr() instead.  I don't know why 
it wasn't done like that to begin with, maybe it was just a (my?) 
mistake.  See attached patch.
From 4483d1f5c6c6aac047f12a36cc4a9e69d4e912f6 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Tue, 19 Sep 2023 08:46:47 +0200
Subject: [PATCH] Simplify information schema check constraint deparsing

The computation of the column
information_schema.check_constraints.check_clause used
pg_get_constraintdef() plus some string manipulation to get the check
clause back out.  This ended up with an extra pair of parentheses,
which is only an aesthetic problem, but also with suffixes like "NOT
VALID", which don't belong in that column.  We can fix both of these
VALID", which don't belong into that column.  We can fix both of these
problems and simplify the code by just using pg_get_expr() instead.
---
 src/backend/catalog/information_schema.sql | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/src/backend/catalog/information_schema.sql b/src/backend/catalog/information_schema.sql
index 7f7de91cc2..10b34c3c5b 100644
--- a/src/backend/catalog/information_schema.sql
+++ b/src/backend/catalog/information_schema.sql
@@ -435,8 +435,7 @@ CREATE VIEW check_constraints AS
 SELECT CAST(current_database() AS sql_identifier) AS constraint_catalog,
CAST(rs.nspname AS sql_identifier) AS constraint_schema,
CAST(con.conname AS sql_identifier) AS constraint_name,
-   CAST(substring(pg_get_constraintdef(con.oid) from 7) AS character_data)
- AS check_clause
+   CAST(pg_get_expr(con.conbin, coalesce(c.oid, 0)) AS character_data) AS check_clause
 FROM pg_constraint con
LEFT OUTER JOIN pg_namespace rs ON (rs.oid = con.connamespace)
LEFT OUTER JOIN pg_class c ON (c.oid = con.conrelid)
-- 
2.42.0