Re: [PoC] pg_upgrade: allow to upgrade publisher node

2023-10-26 Thread Amit Kapila
On Fri, Oct 27, 2023 at 11:24 AM Bharath Rupireddy
 wrote:
>
> On Fri, Oct 27, 2023 at 10:10 AM Hayato Kuroda (Fujitsu)
>  wrote:
> >
> > Here is a patch implementing the rename to 003_logical_slots. Also, I got a
> > comment off-list, which is included as well.
> >
> > ```
> > -# Setup a pg_upgrade command. This will be used anywhere.
> > +# Setup a common pg_upgrade command to be used by all the test cases
> > ```
>
> The patch LGTM.
>

Thanks, I'll push it in some time.

-- 
With Regards,
Amit Kapila.




RE: pg_upgrade's object listing

2023-10-26 Thread Zhijie Hou (Fujitsu)
On Friday, October 27, 2023 1:21 PM Kyotaro Horiguchi  
wrote:
> 
> Hello.
> 
> I found the following message recently introduced in pg_upgrade:
> 
> > pg_log(PG_VERBOSE, "slot_name: \"%s\", plugin: \"%s\",
> two_phase: %s",
> >slot_info->slotname,
> >slot_info->plugin,
> >slot_info->two_phase ? "true" : "false");
> 
> If the labels correspond to the struct member names, the first label ought to
> be "slotname". If not, all labels of this type, including those adjacent,
> should have a more natural spelling.
> 
> What do you think about this?

Thanks for reporting. But I am not sure that renaming it to slotname (or
anything else) would be an improvement. I don't think we have a rule that the
output must match the struct field names, and existing messages don't follow
one either [1]. So, the current message looks OK to me.

[1]
pg_log(PG_VERBOSE, "relname: \"%s.%s\", reloid: %u, reltblspace: \"%s\"",
       rel_arr->rels[relnum].nspname,
       rel_arr->rels[relnum].relname,
       rel_arr->rels[relnum].reloid,
       rel_arr->rels[relnum].tablespace);

Best Regards,
Hou zj




Re: [PoC] pg_upgrade: allow to upgrade publisher node

2023-10-26 Thread Bharath Rupireddy
On Fri, Oct 27, 2023 at 10:10 AM Hayato Kuroda (Fujitsu)
 wrote:
>
> Here is a patch implementing the rename to 003_logical_slots. Also, I got a
> comment off-list, which is included as well.
>
> ```
> -# Setup a pg_upgrade command. This will be used anywhere.
> +# Setup a common pg_upgrade command to be used by all the test cases
> ```

The patch LGTM.

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: [PoC] pg_upgrade: allow to upgrade publisher node

2023-10-26 Thread Bharath Rupireddy
On Fri, Oct 27, 2023 at 11:09 AM Amit Kapila  wrote:
>
> On Fri, Oct 27, 2023 at 10:43 AM Michael Paquier  wrote:
> >
> > -"invalid_logical_replication_slots.txt");
> > +"invalid_logical_slots.txt");
> >
> > Or you could do something even shorter, with "invalid_slots.txt".
> >
>
> I also thought of it but if we want to keep it that way, we should
> slightly adjust the messages like: "The slot \"%s\" is invalid" to
> include slot_type. This will contain only logical slots, so the
> current one probably seems okay.

+1 for invalid_logical_slots.txt as the file name (which fixes the Windows
path name issue) and for keeping the contents as-is: "The slot \"%s\" is
invalid\n" and "The slot \"%s\" has not consumed the WAL yet\n".

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




RE: [PoC] pg_upgrade: allow to upgrade publisher node

2023-10-26 Thread Hayato Kuroda (Fujitsu)
Dear Michael,

> Or you could do something even shorter, with "invalid_slots.txt".

I think the current one seems better, because we only support logical
replication slots for now. We can extend it as you suggest once we support
physical slots as well. Also, the proposed length is sufficient for
fairywren [1].

[1]: 
https://www.postgresql.org/message-id/CALj2ACVc-WSx_fvfynt-G3j8rjhNTMZ8DHu2wiKgCEiV9EO86g%40mail.gmail.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED





Re: [PoC] pg_upgrade: allow to upgrade publisher node

2023-10-26 Thread Amit Kapila
On Fri, Oct 27, 2023 at 10:43 AM Michael Paquier  wrote:
>
> On Fri, Oct 27, 2023 at 04:40:43AM +, Hayato Kuroda (Fujitsu) wrote:
> > Yeah, Bharath has already reported this; I agree that the reason is [1].
> >
> > ```
> > In the Windows API (with some exceptions discussed in the following 
> > paragraphs),
> > the maximum length for a path is MAX_PATH, which is defined as 260 
> > characters.
> > ```
>
> -"invalid_logical_replication_slots.txt");
> +"invalid_logical_slots.txt");
>
> Or you could do something even shorter, with "invalid_slots.txt".
>

I also thought of it but if we want to keep it that way, we should
slightly adjust the messages like: "The slot \"%s\" is invalid" to
include slot_type. This will contain only logical slots, so the
current one probably seems okay.


--
With Regards,
Amit Kapila.




Re: A recent message added to pg_upgrade

2023-10-26 Thread Kyotaro Horiguchi
At Fri, 27 Oct 2023 09:51:43 +0530, Amit Kapila  wrote 
in 
> On Fri, Oct 27, 2023 at 9:37 AM Peter Smith  wrote:
> > IIUC the only possible way to reach this error (according to the
> > comment preceding it) is by the user overriding the GUC value (which
> > was defaulted -1) on the command line.
> >
> 
> Yeah, this is my understanding as well.
> 
> > + /*
> > + * The logical replication slots shouldn't be invalidated as
> > + * max_slot_wal_keep_size GUC is set to -1 during the upgrade.
> > + *
> > + * The following is just a sanity check.
> > + */
> >
> > Given that, I felt a more relevant msg/hint might be like:
> >
> > errmsg("\"max_slot_wal_keep_size\" must be set to -1 during the upgrade"),
> > errhint("Do not override \"max_slot_wal_keep_size\" using command line
> > options."));
> >
> 
> But OTOH, we don't have a value of user-passed options to ensure that.
> So, how about a slightly different message: "This can be caused by
> overriding \"max_slot_wal_keep_size\" using command line options." or
> something along those lines? I see a somewhat similar message in the
> existing code (errhint("This can be caused ...")).

The suggested error message reads to me like one coming from the GUC
mechanism. While I don't have the wider picture of the feature, might we
consider rejecting the parameter setting itself? With that modification,
this message could become an elog() call.

I believe it's somewhat inconsiderate to point out what shouldn't have
been done only after a problem has occurred.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center


Re: Introduce a new view for checkpointer related stats

2023-10-26 Thread Bharath Rupireddy
On Fri, Oct 27, 2023 at 10:32 AM Michael Paquier  wrote:
>
> On Fri, Oct 27, 2023 at 10:23:34AM +0530, Bharath Rupireddy wrote:
> > A possible way is to move existing pgstat_count_slru_flush in
> > SimpleLruWriteAll closer to pg_fsync and WAIT_EVENT_SLRU_SYNC in
> > SlruPhysicalWritePage, remove WAIT_EVENT_SLRU_FLUSH_SYNC completely,
> > use WAIT_EVENT_SLRU_SYNC in SlruSyncFileTag and count the flushes in
> > SlruSyncFileTag. This aggregated way is much simpler IMV.
> >
> > Another possible way is to have separate stat variables for each of
> > the SLRU flushes WAIT_EVENT_SLRU_SYNC and WAIT_EVENT_SLRU_FLUSH_SYNC
> > and expose them separately in pg_stat_slru. I don't like this
> > approach.
>
> This touches an area covered by a different patch, registered in this
> commit fest as well:
> https://www.postgresql.org/message-id/camm1awb18ept0whjrjg+-nyhnouxet6zuw0pnyyae+nezpv...@mail.gmail.com
>
> So perhaps we'd better move the discussion there.  The patch posted
> there is going to need a rebase anyway once the split with
> pg_stat_checkpointer is introduced.

Yeah, I'll re-look at the SLRU stuff and the other thread next week.

> > finalized. I'm posting the pg_stat_checkpointer patch as v10-0001.
>
> Thanks.  That seems OK.  I don't have the wits to risk my weekend on
> buildfarm failures if any, so that will have to wait a bit.

Hm, okay :)

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: A recent message added to pg_upgrade

2023-10-26 Thread Bharath Rupireddy
On Fri, Oct 27, 2023 at 9:52 AM Amit Kapila  wrote:
>
> > errmsg("\"max_slot_wal_keep_size\" must be set to -1 during the upgrade"),
> > errhint("Do not override \"max_slot_wal_keep_size\" using command line
> > options."));
> >
>
> But OTOH, we don't have a value of user-passed options to ensure that.
> So, how about a slightly different message: "This can be caused by
> overriding \"max_slot_wal_keep_size\" using command line options." or
> something along those lines? I see a somewhat similar message in the
> existing code (errhint("This can be caused ...")).

I get it. I think having an errdetail explaining the possible cause of
the error would help here, something like:

errmsg("cannot invalidate replication slots when in binary upgrade mode"),
errdetail("This can be caused by overriding \"max_slot_wal_keep_size\" using command line options."),
errhint("Do not override \"max_slot_wal_keep_size\"; keep it set to -1."));

Thoughts?

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




pg_upgrade's object listing

2023-10-26 Thread Kyotaro Horiguchi
Hello.

I found the following message recently introduced in pg_upgrade:

>   pg_log(PG_VERBOSE, "slot_name: \"%s\", plugin: \"%s\", 
> two_phase: %s",
>  slot_info->slotname,
>  slot_info->plugin,
>  slot_info->two_phase ? "true" : "false");

If the labels correspond to the struct member names, the first label
ought to be "slotname". If not, all labels of this type, including
those adjacent, should have a more natural spelling.

What do you think about this?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [PoC] pg_upgrade: allow to upgrade publisher node

2023-10-26 Thread Michael Paquier
On Fri, Oct 27, 2023 at 04:40:43AM +, Hayato Kuroda (Fujitsu) wrote:
> Yeah, Bharath has already reported this; I agree that the reason is [1].
> 
> ```
> In the Windows API (with some exceptions discussed in the following 
> paragraphs),
> the maximum length for a path is MAX_PATH, which is defined as 260 characters.
> ```

-"invalid_logical_replication_slots.txt");
+"invalid_logical_slots.txt");

Or you could do something even shorter, with "invalid_slots.txt".
--
Michael




Re: maybe a type_sanity.sql bug

2023-10-26 Thread Michael Paquier
On Fri, Oct 27, 2023 at 11:45:44AM +0800, jian he wrote:
> The test seems to assume that the following SQL query should return zero rows,
> but it does not. I don't know much about the "relreplident" column.

This is not about relreplident here, that refers to a relation's
replica identity.

> test1=# SELECT c1.oid, c1.relname, relkind, relpersistence, relreplident
> FROM pg_class as c1
> WHERE relkind NOT IN ('r', 'i', 'S', 't', 'v', 'm', 'c', 'f', 'p') OR
> relpersistence NOT IN ('p', 'u', 't') OR
> relreplident NOT IN ('d', 'n', 'f', 'i');
>  oid | relname | relkind | relpersistence | relreplident
> -----+---------+---------+----------------+--------------
> (0 rows)

The problem is about relkind, as 'I' refers to a partitioned index.
That is a legal value in pg_class.relkind, but we forgot to list it in
this test.
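
The fix would simply add 'I' to the list of accepted relkinds; a sketch of the
corrected query (the committed change to type_sanity.sql may differ):

SELECT c1.oid, c1.relname, relkind, relpersistence, relreplident
FROM pg_class AS c1
WHERE relkind NOT IN ('r', 'i', 'S', 't', 'v', 'm', 'c', 'f', 'p', 'I') OR
      relpersistence NOT IN ('p', 'u', 't') OR
      relreplident NOT IN ('d', 'n', 'f', 'i');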
--
Michael




Re: Introduce a new view for checkpointer related stats

2023-10-26 Thread Michael Paquier
On Fri, Oct 27, 2023 at 10:23:34AM +0530, Bharath Rupireddy wrote:
> A possible way is to move existing pgstat_count_slru_flush in
> SimpleLruWriteAll closer to pg_fsync and WAIT_EVENT_SLRU_SYNC in
> SlruPhysicalWritePage, remove WAIT_EVENT_SLRU_FLUSH_SYNC completely,
> use WAIT_EVENT_SLRU_SYNC in SlruSyncFileTag and count the flushes in
> SlruSyncFileTag. This aggregated way is much simpler IMV.
> 
> Another possible way is to have separate stat variables for each of
> the SLRU flushes WAIT_EVENT_SLRU_SYNC and WAIT_EVENT_SLRU_FLUSH_SYNC
> and expose them separately in pg_stat_slru. I don't like this
> approach.

This touches an area covered by a different patch, registered in this
commit fest as well:
https://www.postgresql.org/message-id/camm1awb18ept0whjrjg+-nyhnouxet6zuw0pnyyae+nezpv...@mail.gmail.com

So perhaps we'd better move the discussion there.  The patch posted
there is going to need a rebase anyway once the split with
pg_stat_checkpointer is introduced.

>> The error message generated for an incorrect target needs to be
>> updated in pg_stat_reset_shared():
>> =# select pg_stat_reset_shared('checkpointe');
>> ERROR:  22023: unrecognized reset target: "checkpointe"
>> HINT:  Target must be "archiver", "bgwriter", "io", "recovery_prefetch", or 
>> "wal".
> 
> +1. Changed in the attached v10-0001. FWIW, having a test case in
> stats.sql emitting this error message and hint would have helped here.
> If that's okay, I can add one.
> 
> PS: I'll park the SLRU flush related patch aside until the approach is
> finalized. I'm posting the pg_stat_checkpointer patch as v10-0001.

Thanks.  That seems OK.  I don't have the wits to risk my weekend on
buildfarm failures if any, so that will have to wait a bit.
--
Michael




Re: A recent message added to pg_upgrade

2023-10-26 Thread Bharath Rupireddy
On Fri, Oct 27, 2023 at 9:36 AM Amit Kapila  wrote:
>
> On Fri, Oct 27, 2023 at 8:52 AM Bharath Rupireddy
>  wrote:
> >
> > On Fri, Oct 27, 2023 at 8:28 AM Kyotaro Horiguchi:
> > The above errhint LGTM. How about a slightly different errmsg, like
> > the following?
> >
> > +errmsg("cannot invalidate replication slots when
> > in binary upgrade mode"),
> > +errhint("Set \"max_slot_wal_keep_size\" to -1 to
> > avoid invalidation."));
> >
> > " when in binary upgrade mode" is being used in many places.
> >
>
> By this time the slot may already be invalidated, so how about:
> "replication slot was invalidated when in binary upgrade mode"?

At this error spot, the slot is invalidated in memory, but the invalidated
state is not yet persisted to disk; that happens somewhat later:

else
{
/*
 * We hold the slot now and have already invalidated it; flush it
 * to ensure that state persists.
 *

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Introduce a new view for checkpointer related stats

2023-10-26 Thread Bharath Rupireddy
On Fri, Oct 27, 2023 at 8:03 AM Michael Paquier  wrote:
>
> Hmm.  As per the existing call of pgstat_count_slru_flush() in
> SimpleLruWriteAll(), routine called SimpleLruFlush() until ~13 and
> dee663f78439, an incrementation of 1 for slru_stats_idx refers to all
> the flushes for all the dirty data of this SLRU in a single pass.
>
> This addition actually means that we would now increment the counter
> for each sync request, changing its meaning.  Perhaps there is an
> argument for changing the meaning of this parameter to be based on the
> number of flush requests completed, but if that were to happen it
> would be better to remove pgstat_count_slru_flush() in
> SimpleLruWriteAll(), I guess, or just aggregate this counter once,
> which would be cheaper.

Right. Interestingly, there are two SLRU flush-related wait events:
WAIT_EVENT_SLRU_SYNC ("Waiting for SLRU data to reach durable storage
following a page write") and WAIT_EVENT_SLRU_FLUSH_SYNC ("Waiting for
SLRU data to reach durable storage during a checkpoint or database
shutdown"). And we're counting the SLRU flushes at both of these
places into one single stat variable. These two events look confusing,
and the counts may be more useful if shown in an aggregated way.

A possible way is to move existing pgstat_count_slru_flush in
SimpleLruWriteAll closer to pg_fsync and WAIT_EVENT_SLRU_SYNC in
SlruPhysicalWritePage, remove WAIT_EVENT_SLRU_FLUSH_SYNC completely,
use WAIT_EVENT_SLRU_SYNC in SlruSyncFileTag and count the flushes in
SlruSyncFileTag. This aggregated way is much simpler IMV.

Another possible way is to have separate stat variables for each of
the SLRU flushes WAIT_EVENT_SLRU_SYNC and WAIT_EVENT_SLRU_FLUSH_SYNC
and expose them separately in pg_stat_slru. I don't like this
approach.
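
FWIW, whichever way we end up counting them, the aggregated counter is what
surfaces as the flushes column of pg_stat_slru, so it can be eyeballed with a
simple query, e.g.:

SELECT name, flushes, truncates, stats_reset FROM pg_stat_slru ORDER BY name;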

> v9-0003 is mostly a mechanical change, and it passes the eye test.

Thanks. Indeed it contains mechanical changes.

> I have spotted two nits.
>
> +CREATE VIEW pg_stat_checkpointer AS
> +SELECT
> +pg_stat_get_checkpointer_num_timed() AS num_timed,
> +pg_stat_get_checkpointer_num_requested() AS num_requested,
> +pg_stat_get_checkpointer_write_time() AS write_time,
> +pg_stat_get_checkpointer_sync_time() AS sync_time,
> +pg_stat_get_checkpointer_buffers_written() AS buffers_written,
> +pg_stat_get_checkpointer_stat_reset_time() AS stats_reset;
>
> Okay with this layer.  I am wondering if others have opinions to
> share about these names for the attributes of pg_stat_checkpointer,
> though.

I think these column names are a good fit in this context: we can't
just call them timed/requested/write/sync, and prefixing them with
"checkpoint" would make the column names unnecessarily long. FWIW, I see
some existing user-exposed field names starting with num_* (num_nonnulls,
num_nulls, num_lwlocks, num_transactions).
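
For completeness, a quick usage sketch against the proposed view (assuming the
v10-0001 definition quoted above):

SELECT num_timed, num_requested, write_time, sync_time,
       buffers_written, stats_reset
FROM pg_stat_checkpointer;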

> -   single row, containing global data for the cluster.
> +   single row, containing data about the bgwriter of the cluster.
>
> "bgwriter" is used in one place of the docs, so perhaps "background
> writer" is better here?

+1. Changed in the attached v10-0001.

> The error message generated for an incorrect target needs to be
> updated in pg_stat_reset_shared():
> =# select pg_stat_reset_shared('checkpointe');
> ERROR:  22023: unrecognized reset target: "checkpointe"
> HINT:  Target must be "archiver", "bgwriter", "io", "recovery_prefetch", or 
> "wal".

+1. Changed in the attached v10-0001. FWIW, having a test case in
stats.sql emitting this error message and hint would have helped here.
If that's okay, I can add one.
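
For illustration, such a test could be as simple as the following sketch:

-- invalid reset target: the error should list all valid targets in the hint
SELECT pg_stat_reset_shared('invalid_target');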

PS: I'll park the SLRU flush related patch aside until the approach is
finalized. I'm posting the pg_stat_checkpointer patch as v10-0001.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
From d64200c575bf8c9f54bdefad95f84dff5a8381eb Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Fri, 27 Oct 2023 04:42:55 +
Subject: [PATCH v10] Introduce a new view for checkpointer related stats

pg_stat_bgwriter view currently reports checkpointer stats as
well. It has been that way so far because historically
checkpointer was part of bgwriter until the commits 806a2ae and
bf405ba separated them out way back in PG 9.2.

It is time for us to separate checkpointer stats to its own view
called pg_stat_checkpointer.

Bump catalog version.

Author: Bharath Rupireddy
Reviewed-by: Bertrand Drouvot, Andres Freund
Discussion: https://www.postgresql.org/message-id/CALj2ACVxX2ii%3D66RypXRweZe2EsBRiPMj0aHfRfHUeXJcC7kHg%40mail.gmail.com
---
 doc/src/sgml/monitoring.sgml  | 98 ++-
 src/backend/access/transam/xlog.c |  4 +-
 src/backend/catalog/system_views.sql  | 14 ++-
 src/backend/postmaster/checkpointer.c |  6 +-
 src/backend/storage/buffer/bufmgr.c   |  2 +-
 .../utils/activity/pgstat_checkpointer.c  | 21 ++--
 src/backend/utils/adt/pgstatfuncs.c   | 35 +++
 src/include/catalog/pg_proc.dat  

Re: race condition in pg_class

2023-10-26 Thread Noah Misch
On Wed, Oct 25, 2023 at 01:39:41PM +0300, Smolkin Grigory wrote:
> We are running PG13.10 and recently we have encountered what appears to be
> a bug due to some race condition between ALTER TABLE ... ADD CONSTRAINT and
> some other catalog-writer, possibly ANALYZE.
> The problem is that after successfully creating index on relation (which
> previosly didnt have any indexes), its pg_class.relhasindex remains set to
> "false", which is illegal, I think.
> Index was built using the following statement:
> ALTER TABLE "example" ADD constraint "example_pkey" PRIMARY KEY (id);

This is going to be a problem with any operation that does a transactional
pg_class update without taking a lock that conflicts with ShareLock.  GRANT
doesn't lock the table at all, so I can reproduce this in v17 as follows:

== session 1
create table t (c int);
begin;
grant select on t to public;

== session 2
alter table t add primary key (c);

== back in session 1
commit;


We'll likely need to change how we maintain relhasindex or perhaps take a lock
in GRANT.
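
As an aside, clusters that have already been bitten can be spotted with a
catalog query along these lines (a sketch, assuming a stale relhasindex is the
only symptom):

SELECT c.oid::regclass AS table_name
FROM pg_class c
WHERE NOT c.relhasindex
  AND EXISTS (SELECT 1 FROM pg_index i WHERE i.indrelid = c.oid);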

> Looking into the WAL via waldump gave us the following picture (full
> waldump output is attached):

> 1202295045 - create index statement
> 1202298790 and 1202298791 are some other concurrent operations,
> unfortunately I wasn't able to determine what they are

Can you explore that as follows?

- PITR to just before the COMMIT record.
- Save all rows of pg_class.
- PITR to just after the COMMIT record.
- Save all rows of pg_class.
- Diff the two sets of saved rows.

Which columns changed?  The evidence you've shown would be consistent with a
transaction doing GRANT or REVOKE on dozens of tables.  If the changed column
is something other than relacl, that would be great to know.
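
For the save-and-diff step, a rough sketch with psql (file names are just
placeholders):

\copy (SELECT * FROM pg_class ORDER BY oid) TO 'pg_class_before_commit.txt'
-- then, after the second PITR to just past the COMMIT record:
\copy (SELECT * FROM pg_class ORDER BY oid) TO 'pg_class_after_commit.txt'
-- and compare the two files with any diff tool.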

On the off-chance it's relevant, what extensions do you have (\dx in psql)?




RE: [PoC] pg_upgrade: allow to upgrade publisher node

2023-10-26 Thread Hayato Kuroda (Fujitsu)
Dear Bharath, Amit, Peter,

Thank you for the discussion! A patch is available in [1].

> > > > +1 for
> s/003_upgrade_logical_replication_slots.pl/003_upgrade_logical_slots.pl
> > > > and s/invalid_logical_replication_slots.txt/invalid_logical_slots.txt.
> >
> > +1. The proposed file name sounds reasonable.
> >
> > Agreed. So, how about 003_upgrade_logical_slots.pl or simply
> > 003_upgrade_slots.pl?
> >
> > Why not simply oldpub/sub/newpub or old_pub/sub/new_pub?
> 
> +1 for invalid_logical_slots.txt, 003_upgrade_logical_slots.pl and
> oldpub/sub/newpub. With these changes, the path name is brought down
> to ~220 chars. These names look good to me as long as other things in the
> path name aren't dynamic and don't push it past the MAX_PATH limit (260 chars).
> 
> C:/tools/nmsys64/home/pgrunner/bf/root/HEAD/pgsql.build/testrun/pg_upgra
> de/003_upgrade_logical_slots/data/t_003_upgrade_logical_slots_newpub_data/
> pgdata/pg_upgrade_output.d/20231026T112558.309/invalid_logical_slots.txt

Renamed to invalid_logical_slots.txt, 003_logical_slots.pl, and
oldpub/sub/newpub.
Regarding the test filename, some client apps (e.g., pg_ctl) do not use a
prefix, and some others (e.g., pg_dump) do. Either way seems acceptable,
so I chose to drop the prefix.

```
$ ls pg_ctl/t/
001_start_stop.pl  002_status.pl  003_promote.pl  004_logrotate.pl

$ ls pg_dump/t/
001_basic.pl  002_pg_dump.pl  003_pg_dump_with_server.pl  
004_pg_dump_parallel.pl  010_dump_connstr.pl
```

[1]: 
https://www.postgresql.org/message-id/TYCPR01MB5870A6A8FBB23554EDE8F5F3F5DCA%40TYCPR01MB5870.jpnprd01.prod.outlook.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



RE: [PoC] pg_upgrade: allow to upgrade publisher node

2023-10-26 Thread Hayato Kuroda (Fujitsu)
Dear Hou,

> The BF animal fairywren[1] failed when testing
> 003_upgrade_logical_replication_slots.pl.

Good catch!

> 
> The reason could be that the length of this path (262) exceeds the Windows
> path limit (260 IIRC). If so, I recall we fixed similar things before
> (e213de8e7) by reducing the path somehow.

Yeah, Bharath has already reported this; I agree that the reason is [1].

```
In the Windows API (with some exceptions discussed in the following paragraphs),
the maximum length for a path is MAX_PATH, which is defined as 260 characters.
```

> In this case, I think one approach is to reduce the file and test names to
> xxx_logical_slots instead of xxx_logical_replication_slots. But we will
> analyze more and share a fix soon.
>

Here is a patch implementing the rename to 003_logical_slots. Also, I got a
comment off-list, which is included as well.

```
-# Setup a pg_upgrade command. This will be used anywhere.
+# Setup a common pg_upgrade command to be used by all the test cases
```

[1]: 
https://learn.microsoft.com/en-us/windows/win32/fileio/maximum-file-path-limitation?tabs=registry

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



0001-Shorten-some-files.patch
Description: 0001-Shorten-some-files.patch


pg_dump not dumping the run_as_owner setting from version 16?

2023-10-26 Thread Philip Warner
Hi,

As far as I can tell, pg_dump does not dump the `run_as_owner` setting for a
subscription.

Should it? Should I submit a patch? It seems pretty trivial to fix if anyone 
else is working on it.
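
For reference, the setting in question is visible in the catalog, so it is easy
to see what a dump would need to reproduce (a quick check, assuming v16's
column name is subrunasowner):

SELECT subname, subrunasowner FROM pg_subscription;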




Re: A recent message added to pg_upgrade

2023-10-26 Thread Amit Kapila
On Fri, Oct 27, 2023 at 9:37 AM Peter Smith  wrote:
>
> On Fri, Oct 27, 2023 at 1:58 PM Kyotaro Horiguchi
>  wrote:
> >
> > Hello.
> >
> > Some messages recently introduced by commit 29d0a77fa6 seem to deviate
> > slightly from our standards.
> >
> > +   if (*invalidated && SlotIsLogical(s) && IsBinaryUpgrade)
> > +   {
> > +       ereport(ERROR,
> > +               errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> > +               errmsg("replication slots must not be invalidated during the upgrade"),
> > +               errhint("\"max_slot_wal_keep_size\" must be set to -1 during the upgrade"));
> >
> > The message for errhint is not a complete sentence. And errmsg is not
> > in telegraph style.  The first attached makes minimum changes.
> >
> > However, if allowed, I'd like to propose an alternative set of
> > messages as follows:
> >
> > +               errmsg("replication slot is invalidated during upgrade"),
> > +               errhint("Set \"max_slot_wal_keep_size\" to -1 to avoid invalidation."));
> >
> > The second attached does this.
> >
> > What do you think about those?
> >
>
> IIUC the only possible way to reach this error (according to the
> comment preceding it) is by the user overriding the GUC value (which
> was defaulted -1) on the command line.
>

Yeah, this is my understanding as well.

> + /*
> + * The logical replication slots shouldn't be invalidated as
> + * max_slot_wal_keep_size GUC is set to -1 during the upgrade.
> + *
> + * The following is just a sanity check.
> + */
>
> Given that, I felt a more relevant msg/hint might be like:
>
> errmsg("\"max_slot_wal_keep_size\" must be set to -1 during the upgrade"),
> errhint("Do not override \"max_slot_wal_keep_size\" using command line
> options."));
>

But OTOH, we don't have a value of user-passed options to ensure that.
So, how about a slightly different message: "This can be caused by
overriding \"max_slot_wal_keep_size\" using command line options." or
something along those lines? I see a somewhat similar message in the
existing code (errhint("This can be caused ...")).

-- 
With Regards,
Amit Kapila.




Re: A recent message added to pg_upgrade

2023-10-26 Thread Peter Smith
On Fri, Oct 27, 2023 at 1:58 PM Kyotaro Horiguchi
 wrote:
>
> Hello.
>
> Some messages recently introduced by commit 29d0a77fa6 seem to deviate
> slightly from our standards.
>
> +   if (*invalidated && SlotIsLogical(s) && IsBinaryUpgrade)
> +   {
> +       ereport(ERROR,
> +               errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> +               errmsg("replication slots must not be invalidated during the upgrade"),
> +               errhint("\"max_slot_wal_keep_size\" must be set to -1 during the upgrade"));
>
> The message for errhint is not a complete sentence. And errmsg is not
> in telegraph style.  The first attached makes minimum changes.
>
> However, if allowed, I'd like to propose an alternative set of
> messages as follows:
>
> +               errmsg("replication slot is invalidated during upgrade"),
> +               errhint("Set \"max_slot_wal_keep_size\" to -1 to avoid invalidation."));
>
> The second attached does this.
>
> What do you think about those?
>

IIUC the only possible way to reach this error (according to the
comment preceding it) is by the user overriding the GUC value (which
was defaulted -1) on the command line.

+ /*
+ * The logical replication slots shouldn't be invalidated as
+ * max_slot_wal_keep_size GUC is set to -1 during the upgrade.
+ *
+ * The following is just a sanity check.
+ */

Given that, I felt a more relevant msg/hint might be like:

errmsg("\"max_slot_wal_keep_size\" must be set to -1 during the upgrade"),
errhint("Do not override \"max_slot_wal_keep_size\" using command line
options."));

==
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: A recent message added to pg_upgrade

2023-10-26 Thread Amit Kapila
On Fri, Oct 27, 2023 at 8:52 AM Bharath Rupireddy
 wrote:
>
> On Fri, Oct 27, 2023 at 8:28 AM Kyotaro Horiguchi:
> The above errhint LGTM. How about a slightly different errmsg, like
> the following?
>
> +errmsg("cannot invalidate replication slots when
> in binary upgrade mode"),
> +errhint("Set \"max_slot_wal_keep_size\" to -1 to
> avoid invalidation."));
>
> " when in binary upgrade mode" is being used in many places.
>

By this time the slot may already be invalidated, so how about:
"replication slot was invalidated when in binary upgrade mode"?

-- 
With Regards,
Amit Kapila.




maybe a type_sanity.sql bug

2023-10-26 Thread jian he
Hi.
The test seems to assume that the following SQL query should return zero rows,
but it does not. I don't know much about the "relreplident" column.

https://git.postgresql.org/cgit/postgresql.git/tree/src/test/regress/expected/type_sanity.out#n499
demo: https://dbfiddle.uk/QFM88S2e

test1=# \dt
Did not find any relations.
test1=# SELECT c1.oid, c1.relname, relkind, relpersistence, relreplident
FROM pg_class as c1
WHERE relkind NOT IN ('r', 'i', 'S', 't', 'v', 'm', 'c', 'f', 'p') OR
relpersistence NOT IN ('p', 'u', 't') OR
relreplident NOT IN ('d', 'n', 'f', 'i');
 oid | relname | relkind | relpersistence | relreplident
-----+---------+---------+----------------+--------------
(0 rows)

test1=# CREATE TABLE test_partition (
id int4range,
valid_at daterange,
name text,
CONSTRAINT test_partition_uq1 UNIQUE (id, valid_at )
) PARTITION BY LIST (id);
CREATE TABLE
test1=# SELECT c1.oid, c1.relname, relkind, relpersistence, relreplident
FROM pg_class as c1
WHERE relkind NOT IN ('r', 'i', 'S', 't', 'v', 'm', 'c', 'f', 'p') OR
relpersistence NOT IN ('p', 'u', 't') OR
relreplident NOT IN ('d', 'n', 'f', 'i');
   oid   |      relname       | relkind | relpersistence | relreplident
---------+--------------------+---------+----------------+--------------
 1034304 | test_partition_uq1 | I       | p              | n
(1 row)

test1=# select version();
                               version
---------------------------------------------------------------------
 PostgreSQL 16beta1 on x86_64-linux, compiled by gcc-11.3.0, 64-bit
(1 row)




Re: A recent message added to pg_upgrade

2023-10-26 Thread Bharath Rupireddy
On Fri, Oct 27, 2023 at 8:28 AM Kyotaro Horiguchi
 wrote:
>
> Hello.
>
> Some messages recently introduced by commit 29d0a77fa6 seem to deviate
> slightly from our standards.
>
> +   if (*invalidated && SlotIsLogical(s) && IsBinaryUpgrade)
> +   {
> +       ereport(ERROR,
> +               errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> +               errmsg("replication slots must not be invalidated during the upgrade"),
> +               errhint("\"max_slot_wal_keep_size\" must be set to -1 during the upgrade"));
>
> The message for errhint is not a complete sentence.

Yeah, the hint message should have ended with a period -
https://www.postgresql.org/docs/current/error-style-guide.html#ERROR-STYLE-GUIDE-GRAMMAR-PUNCTUATION.

> The second attached does this.
>
> What do you think about those?

+errmsg("replication slot is invalidated during upgrade"),
+errhint("Set \"max_slot_wal_keep_size\" to -1 to
avoid invalidation."));
 }

The above errhint LGTM. How about a slightly different errmsg, like
the following?

+errmsg("cannot invalidate replication slots when
in binary upgrade mode"),
+errhint("Set \"max_slot_wal_keep_size\" to -1 to
avoid invalidation."));

" when in binary upgrade mode" is being used in many places.

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Use virtual tuple slot for Unique node

2023-10-26 Thread David Rowley
On Wed, 25 Oct 2023 at 22:48, Ashutosh Bapat
 wrote:
> We may save the size of data in VirtualTupleTableSlot, thus avoiding
> the first loop. I assume that when allocating
> VirtualTupleTableSlot->data, we always know what size we are
> allocating so it should be just a matter of saving it in
> VirtualTupleTableSlot->size. This should avoid the first loop in
> tts_virtual_materialize() and give some speed up. We will need a loop
> to repoint non-byval Datums. I imagine that the pointers to non-byval
> Datums can be computed as dest_slot->tts_values[natts] =
> dest_vslot->data + (src_slot->tts_values[natts] - src_vslot->data).
> This would work as long as all the non-byval datums in the source slot
> are all stored flattened in source slot's data. I am assuming that
> that would be true in a materialized virtual slot. The byval datums
> are copied as is. I think, this will avoid multiple memcpy calls, one
> per non-byval attribute and hence some speedup. I may be wrong though.

hmm, do you think it's common enough that we copy an already
materialised virtual slot?

I tried adding the following code to tts_virtual_copyslot and didn't
see the NOTICE message when running each of your test queries. "make
check" also worked without anything failing after adjusting
nodeUnique.c to always use a TTSOpsVirtual slot.

+   if (srcslot->tts_ops == &TTSOpsVirtual && TTS_SHOULDFREE(srcslot))
+   elog(NOTICE, "We copied a materialized virtual slot!");

I did get a failure in postgres_fdw's tests:

   server loopback options (table_name 'tab_batch_sharded_p1_remote');
 insert into tab_batch_sharded select * from tab_batch_local;
+NOTICE:  We copied a materialized virtual slot!
+NOTICE:  We copied a materialized virtual slot!

so I think it's probably not that common that we'd gain from that optimisation.

What might buy us a bit more would be to get rid of the for loop
inside tts_virtual_copyslot() and copy the guts of
tts_virtual_materialize() into tts_virtual_copyslot() and set the
dstslot tts_isnull and tts_values arrays in the same loop that we use
to calculate the size.

I tried that in the attached patch and then tested it alongside the
patch that changes the slot type.

master = 74604a37f
1 = [1]
2 = optimize_tts_virtual_copyslot.patch

Using the script from [2] and the setup from [3] but reduced to 10k
tuples instead of 1 million.

Times are the average query time in milliseconds for a 60-second pgbench run.

query    master    master+1  master+1+2   m/m+1     m/m+1+2
Q1        2.616     2.082      1.903     125.65%    137.47%
Q2        9.479    10.593     10.361      89.48%     91.49%
Q3       10.329    10.357     10.627      99.73%     97.20%
Q4        7.272     7.306      6.941      99.53%    104.77%
Q5        7.597     7.043      6.645     107.87%    114.33%
Q6      162.177   161.037    162.807     100.71%     99.61%
Q7       59.288    59.43      61.294      99.76%     96.73%
                                         103.25%    105.94%

I only expect Q2 and Q3 to gain from this. Q1 shouldn't improve but
did, so the results might not be stable enough to warrant making any
decisions from.

I was uncertain if the old behaviour of when srcslot contains fewer
attributes than dstslot was intended or not.  What happens there is
that we'd leave the additional old dstslot tts_values in place and
only overwrite up to srcslot->natts but then we'd go on and
materialize all the dstslot attributes. I think this might not be
needed as we do dstslot->tts_nvalid = srcdesc->natts. I suspect we may
be ok just to materialize the srcslot attributes and ignore the
previous additional dstslot attributes.  Changing it to that would
make the attached patch more simple.

David

[1] 
https://www.postgresql.org/message-id/attachment/151110/use_subnode_slot_type_for_nodeunique.patch
[2] https://www.postgresql.org/message-id/attachment/151342/uniquebench.sh.txt
[3] 
https://www.postgresql.org/message-id/CAExHW5uhTMdkk26oJg9f2ZVufbi5J4Lquj79MdSO%2BipnGJ_muw%40mail.gmail.com
diff --git a/src/backend/executor/execTuples.c 
b/src/backend/executor/execTuples.c
index 2c2712ceac..4f8c6c24c0 100644
--- a/src/backend/executor/execTuples.c
+++ b/src/backend/executor/execTuples.c
@@ -251,7 +251,10 @@ tts_virtual_materialize(TupleTableSlot *slot)
 static void
 tts_virtual_copyslot(TupleTableSlot *dstslot, TupleTableSlot *srcslot)
 {
-   TupleDesc   srcdesc = srcslot->tts_tupleDescriptor;
+   TupleDesc srcdesc = srcslot->tts_tupleDescriptor;
+   TupleDesc dstdesc = dstslot->tts_tupleDescriptor;
+   Size        sz = 0;
+   char   *data;
 
Assert(srcdesc->natts <= dstslot->tts_tupleDescriptor->natts);
 
@@ -259,17 +262,105 @@ tts_virtual_copyslot(TupleTableSlot *dstslot, 
TupleTableSlot *srcslot)
 
slot_getallattrs(srcslot);
 
-   for (int natt = 0; natt < srcdesc->natts; natt++)
+   /* make sure 

Re: [PoC] pg_upgrade: allow to upgrade publisher node

2023-10-26 Thread Bharath Rupireddy
On Fri, Oct 27, 2023 at 8:06 AM Amit Kapila  wrote:
>
> > > +1 for 
> > > s/003_upgrade_logical_replication_slots.pl/003_upgrade_logical_slots.pl
> > > and s/invalid_logical_replication_slots.txt/invalid_logical_slots.txt.
>
> +1. The proposed file name sounds reasonable.
>
> Agreed. So, how about 003_upgrade_logical_slots.pl or simply
> 003_upgrade_slots.pl?
>
> Why not simply oldpub/sub/newpub or old_pub/sub/new_pub?

+1 for invalid_logical_slots.txt, 003_upgrade_logical_slots.pl and
oldpub/sub/newpub. With these changes, the path name is brought down
to ~220 chars. These names look good to me as long as other things in the
path name aren't dynamic and don't push it past the MAX_PATH limit (260 chars).

C:/tools/nmsys64/home/pgrunner/bf/root/HEAD/pgsql.build/testrun/pg_upgrade/003_upgrade_logical_slots/data/t_003_upgrade_logical_slots_newpub_data/pgdata/pg_upgrade_output.d/20231026T112558.309/invalid_logical_slots.txt

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Partial aggregates pushdown

2023-10-26 Thread Bruce Momjian
On Fri, Oct 27, 2023 at 02:44:42AM +, 
fujii.y...@df.mitsubishielectric.co.jp wrote:
> Hi Momjian.
> 
> Thank you for your improvement.
> As a matter of detail, I think that the areas marked below are erroneous.
> 
> --
> +   Pushdown causes aggregate function cals to send partial aggregate
>  ^
> +   function calls to the remote server. If the partial aggregate
> +   function doesn't doesn't exist on the remote server, it causes
>  ^^^
> --

Agreed.  Do you want to fix that in your version?  I don't have any more
improvements to make.

> > What else needs to be done before committers start to review
> > this?
> There are no others. May I make a new version of v31 with your
> suggested improvements for the committer's review?

Yes, please.  I think the updated docs will help people understand how
the patch works.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Re: Document parameter count limit

2023-10-26 Thread Bruce Momjian
On Thu, Oct 26, 2023 at 04:17:19PM -0700, David G. Johnston wrote:
> On Thu, Oct 26, 2023 at 4:13 PM David G. Johnston 
> wrote:
> 
> On Thu, Oct 26, 2023 at 4:08 PM Tom Lane  wrote:
> 
> Bruce Momjian  writes:
> > Ah, I was confused.  I documented both in the attached patch.
> 
> The function one should have the same annotation as some others:
> 
>      can be increased by recompiling PostgreSQL
> 
> 
> 
> I'd like to see a comment on the parameter count one too.
> 
> "Alternatives include using a temporary table or passing them in as a
> single array parameter."
> 
> About the only time this is likely to come up is with many parameters of
> the same type and meaning, pointing that out with the array option seems
> excessively wordy for the comment area.
> 
> Needs a comma: 65,535
> 
> Kinda think both should be tacked on to the end of the table.  I'd also 
> put
> function arguments first so it appears under the compile time partition
> keys limit.
> 
> 
> 
> Cleanups for consistency:
> 
> Move "identifier length" after "partition keys" (before the new "function
> arguments")
> 
> Add commas to: 1,600 and 1,664 and 8,192

Okay, I made all the suggested changes in ordering and adding commas,
plus the text about the ability to change function arguments via
recompiling.

I didn't put commas in 8192 since that is a power-of-two and kind of a
magic number used in many places.

I am not sure where to put text about using arrays to handle many
function arguments.  I just don't see it fitting in the table, or the
paragraph below the table.

Patch attached.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.
diff --git a/doc/src/sgml/limits.sgml b/doc/src/sgml/limits.sgml
index d5b2b627dd..c549447013 100644
--- a/doc/src/sgml/limits.sgml
+++ b/doc/src/sgml/limits.sgml
@@ -57,14 +57,14 @@
 
 
  columns per table
- 1600
+ 1,600
  further limited by tuple size fitting on a single page; see note
  below
 
 
 
  columns in a result set
- 1664
+ 1,664
  
 
 
@@ -74,12 +74,6 @@
  
 
 
-
- identifier length
- 63 bytes
- can be increased by recompiling PostgreSQL
-
-
 
  indexes per table
  unlimited
@@ -92,11 +86,29 @@
  can be increased by recompiling PostgreSQL
 
 
-   
-partition keys
-32
-can be increased by recompiling PostgreSQL
-   
+
+ partition keys
+ 32
+ can be increased by recompiling PostgreSQL
+
+
+
+ identifier length
+ 63 bytes
+ can be increased by recompiling PostgreSQL
+
+
+
+ function arguments
+ 100
+ can be increased by recompiling PostgreSQL
+
+
+
+ query parameters
+ 65,535
+ 
+

   
  
@@ -104,9 +116,9 @@
  
   The maximum number of columns for a table is further reduced as the tuple
   being stored must fit in a single 8192-byte heap page.  For example,
-  excluding the tuple header, a tuple made up of 1600 int columns
+  excluding the tuple header, a tuple made up of 1,600 int columns
   would consume 6400 bytes and could be stored in a heap page, but a tuple of
-  1600 bigint columns would consume 12800 bytes and would
+  1,600 bigint columns would consume 12800 bytes and would
   therefore not fit inside a heap page.
   Variable-length fields of
   types such as text, varchar, and char


A recent message added to pg_upgrade

2023-10-26 Thread Kyotaro Horiguchi
Hello.

Some messages recently introduced by commit 29d0a77fa6 seem to deviate
slightly from our standards.

+   if (*invalidated && SlotIsLogical(s) && IsBinaryUpgrade)
+   {
+       ereport(ERROR,
+               errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+               errmsg("replication slots must not be invalidated during the upgrade"),
+               errhint("\"max_slot_wal_keep_size\" must be set to -1 during the upgrade"));

The message for errhint is not a complete sentence. And errmsg is not
in telegraph style.  The first attached makes minimum changes.

However, if allowed, I'd like to propose an alternative set of
messages as follows:

+               errmsg("replication slot is invalidated during upgrade"),
+               errhint("Set \"max_slot_wal_keep_size\" to -1 to avoid invalidation."));

The second attached does this.

What do you think about those?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 99823df3c7..15e85a23d4 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -1433,8 +1433,8 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlotInvalidationCause cause,
 		{
 			ereport(ERROR,
 	errcode(ERRCODE_INVALID_PARAMETER_VALUE),
-	errmsg("replication slots must not be invalidated during the upgrade"),
-	errhint("\"max_slot_wal_keep_size\" must be set to -1 during the upgrade"));
+	errmsg("replication slots must not be invalidated during upgrade"),
+	errhint("\"max_slot_wal_keep_size\" must be set to -1 during the upgrade."));
 		}
 
 		if (active_pid != 0)
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 99823df3c7..d561bd9be1 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -1433,8 +1433,8 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlotInvalidationCause cause,
 		{
 			ereport(ERROR,
 	errcode(ERRCODE_INVALID_PARAMETER_VALUE),
-	errmsg("replication slots must not be invalidated during the upgrade"),
-	errhint("\"max_slot_wal_keep_size\" must be set to -1 during the upgrade"));
+	errmsg("replication slot is invalidated during upgrade"),
+	errhint("Set \"max_slot_wal_keep_size\" to -1 to avoid invalidation."));
 		}
 
 		if (active_pid != 0)


RE: Partial aggregates pushdown

2023-10-26 Thread fujii.y...@df.mitsubishielectric.co.jp
Hi Momjian.

Thank you for your improvement.
As a matter of detail, I think that the areas marked below are erroneous.

--
+   Pushdown causes aggregate function cals to send partial aggregate
 ^
+   function calls to the remote server. If the partial aggregate
+   function doesn't doesn't exist on the remote server, it causes
 ^^^
--

> What else needs to be done before committers start to review
> this?
There are no others. May I make a new version of v31 with your
suggested improvements for the committer's review?

Sincerely yours,
Yuuki Fujii

--
Yuuki Fujii
Information Technology R&D Center, Mitsubishi Electric Corporation




Re: [PoC] pg_upgrade: allow to upgrade publisher node

2023-10-26 Thread Amit Kapila
On Fri, Oct 27, 2023 at 3:28 AM Peter Smith  wrote:
>
> On Fri, Oct 27, 2023 at 2:26 AM Bharath Rupireddy
>  wrote:
> >
> > On Thu, Oct 26, 2023 at 8:11 PM Zhijie Hou (Fujitsu)
> >  wrote:
> > >
> > > The BF animal fairywren[1] failed when testing
> > > 003_upgrade_logical_replication_slots.pl.
> > >
> > > From the log, I can see pg_upgrade failed to open the
> > > invalid_logical_replication_slots.txt:
> > >
> > > # Checking for valid logical replication slots
> > > # could not open file 
> > > "C:/tools/nmsys64/home/pgrunner/bf/root/HEAD/pgsql.build/testrun/pg_upgrade/003_upgrade_logical_replication_slots/data/t_003_upgrade_logical_replication_slots_new_publisher_data/pgdata/pg_upgrade_output.d/20231026T112558.309/invalid_logical_replication_slots.txt":
> > >  No such file or directory
> > > # Failure, exiting
> > >
> > > The reason could be that the length of this path (262) exceeds the Windows
> > > path limit (260 IIRC). If so, I recall we fixed similar things before
> > > (e213de8e7) by reducing the path somehow.
> >
> > Nice catch. Windows docs say that the file/directory path name can't
> > exceed MAX_PATH, which is defined as 260 characters. However, one must
> > opt-in to enable longer path names -
> > https://learn.microsoft.com/en-us/windows/win32/fileio/maximum-file-path-limitation?tabs=registry
> > and 
> > https://learn.microsoft.com/en-us/windows/win32/fileio/maximum-file-path-limitation?tabs=registry#enable-long-paths-in-windows-10-version-1607-and-later.
> >
> > > In this case, I think one approach is to reduce the file and test names to
> > > xxx_logical_slots instead of xxx_logical_replication_slots. But we will
> > > analyze more and share a fix soon.
> > >
> > > [1] 
> > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren=2023-10-26%2009%3A04%3A54
> >
> > +1 for 
> > s/003_upgrade_logical_replication_slots.pl/003_upgrade_logical_slots.pl
> > and s/invalid_logical_replication_slots.txt/invalid_logical_slots.txt.

+1. The proposed file name sounds reasonable.

> > In fact, we've used "logical slots" instead of "logical replication
> > slots" in the docs to be generic. By looking at the generated
> > directory path name, I think we can use shorter node names - instead
> > of old_publisher, new_publisher, subscriber - either use node1 (for
> > old publisher), node2 (for subscriber), node3 (for new publisher) or
> > use alpha (for old publisher), bravo (for subscriber), charlie (for
> > new publisher) or such shorter names. We don't have to be that
> > descriptive and long in node names, one can look at the test file to
> > know which one is what.
> >
>
> Some more ideas for shortening the filename:
>
> 1. "003_upgrade_logical_replication_slots.pl" -- IMO the word
> "upgrade" is redundant in that filename (earlier patches never had
> this). The test file lives under "pg_upgrade/t" so I felt that
> upgrading is already implied.
>

Agreed. So, how about 003_upgrade_logical_slots.pl or simply
003_upgrade_slots.pl?

> 2. If the node names will be shortened they should still retain *some*
> meaning if possible:
> old_publisher/subscriber/new_publisher --> node1/node2/node3 (means
> nothing without studying the tests)
> old_publisher/subscriber/new_publisher --> alpha/bravo/charlie (means
> nothing without studying the tests)
> How about:
> old_publisher/subscriber/new_publisher --> node_p1/node_s/node_p2
> or similar...
>

Why not simply oldpub/sub/newpub or old_pub/sub/new_pub?

-- 
With Regards,
Amit Kapila.




Re: Introduce a new view for checkpointer related stats

2023-10-26 Thread Michael Paquier
On Thu, Oct 26, 2023 at 10:55:00PM +0530, Bharath Rupireddy wrote:
> On Thu, Oct 26, 2023 at 7:30 AM Michael Paquier  wrote:
>> Why is that in 0002?  Isn't that something we should treat as a bug
>> fix of its own, even backpatching it to make sure that the flush
>> requests for individual commit_ts, multixact and clog files are
>> counted in the stats?
> 
> +1. I included the fix in a separate patch 0002 here.

Hmm.  As per the existing call of pgstat_count_slru_flush() in
SimpleLruWriteAll(), routine called SimpleLruFlush() until ~13 and
dee663f78439, an incrementation of 1 for slru_stats_idx refers to all
the flushes for all the dirty data of this SLRU in a single pass.
This addition actually means that we would now increment the counter
for each sync request, changing its meaning.  Perhaps there is an
argument for changing the meaning of this parameter to be based on the
number of flush requests completed, but if that were to happen it
would be better to remove pgstat_count_slru_flush() in
SimpleLruWriteAll(), I guess, or just aggregate this counter once,
which would be cheaper.

>> Saying that, I am OK with moving ahead with 0001 and 0002 to remove
>> buffers_backend_fsync and buffers_backend from pg_stat_bgwriter, but
>> it is better to merge them into a single commit.  It helps with 0003
>> and this would encourage the use of pg_stat_io that does a better job
>> overall with more field granularity.
> 
> I merged v8 0001 and 0002 into one single patch, PSA v9-0001.

v9-0001 is OK, so I have applied it.

v9-0003 is mostly a mechanical change, and it passes the eye test.  I
have spotted two nits.

+CREATE VIEW pg_stat_checkpointer AS
+SELECT
+pg_stat_get_checkpointer_num_timed() AS num_timed,
+pg_stat_get_checkpointer_num_requested() AS num_requested,
+pg_stat_get_checkpointer_write_time() AS write_time,
+pg_stat_get_checkpointer_sync_time() AS sync_time,
+pg_stat_get_checkpointer_buffers_written() AS buffers_written,
+pg_stat_get_checkpointer_stat_reset_time() AS stats_reset;

Okay with this layer.  I am wondering if others have opinions to
share about these names for the attributes of pg_stat_checkpointer,
though.

-   single row, containing global data for the cluster.
+   single row, containing data about the bgwriter of the cluster.

"bgwriter" is used in one place of the docs, so perhaps "background
writer" is better here?

The error message generated for an incorrect target needs to be
updated in pg_stat_reset_shared():
=# select pg_stat_reset_shared('checkpointe');
ERROR:  22023: unrecognized reset target: "checkpointe"
HINT:  Target must be "archiver", "bgwriter", "io", "recovery_prefetch", or 
"wal".
--
Michael




Re: Is this a problem in GenericXLogFinish()?

2023-10-26 Thread Michael Paquier
On Thu, Oct 26, 2023 at 09:40:09AM -0400, Robert Haas wrote:
> Because of this, it is possible for bucketbuf, prevbuf, and wbuf to be
> the same (your first scenario) but the second scenario you mention
> (nextbuf  == wbuf) should be impossible.

Okay..

> It seems to me that maybe we shouldn't even be registering wbuf or
> doing anything at all to it if there are no tuples that need moving.
> That would also require some adjustment of the redo routine.

Hmm.  So my question is: do we need the cleanup lock on the write
buffer even if there are no tuples, and even if primary bucket and the
write bucket are the same?  I'd like to think that what you say is OK,
still I am not completely sure after reading the lock assumptions in
the hash README or 6d46f4783efe.  A simpler thing would be to mark
buffer 1 with REGBUF_NO_CHANGE when the primary and write buffers are
the same if we expect the lock to always be taken, I guess..

I've noticed that the replay paths for XLOG_HASH_MOVE_PAGE_CONTENTS
and XLOG_HASH_SQUEEZE_PAGE are similar in their page handling (some
copy-paste?).  A MOVE record should never have zero tuples, yet the
replay path assumes this is possible, so it could be simplified.
--
Michael




Re: Does UCS_BASIC have the right CTYPE?

2023-10-26 Thread Jeff Davis
On Thu, 2023-10-26 at 17:32 -0400, Tom Lane wrote:
> For starters, C locale should certainly act different from others.

Agreed. ctype of "C" is 100% stable (as implemented in Postgres with
special ASCII-only semantics) and simple.
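
For example, the ASCII-only semantics are easy to see with non-ASCII input (a
sketch, assuming a libc locale such as en_US.utf8 is available for
comparison):

SELECT upper('résumé' COLLATE "C") AS c_ctype,
       upper('résumé' COLLATE "en_US.utf8") AS libc_ctype;
-- c_ctype: RéSUMé          libc_ctype: RÉSUMÉ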

I'm looking for a way to offer a new middle ground between plain "C"
and buying into all of the problems with collation providers and
localization. We don't need to remove functionality to do so.

Providing Unicode ctype behavior doesn't look very hard. Collations
could select it either with a special name or by using the "builtin"
provider I proposed earlier. If the behavior does change with a new
Unicode version it would be easier to see and less likely to affect on-
disk structures than a collation change.

Regards,
Jeff Davis





Re: Document parameter count limit

2023-10-26 Thread Tom Lane
"David G. Johnston"  writes:
> Cleanups for consistency:

> Move "identifier length" after "partition keys" (before the new "function
> arguments")

Yeah, the existing ordering of this table seems quite random.
That would help some, by separating items having to do with
database/table size from SQL-query-related limits.

regards, tom lane




Re: Document parameter count limit

2023-10-26 Thread David G. Johnston
On Thu, Oct 26, 2023 at 4:13 PM David G. Johnston <
david.g.johns...@gmail.com> wrote:

> On Thu, Oct 26, 2023 at 4:08 PM Tom Lane  wrote:
>
>> Bruce Momjian  writes:
>> > Ah, I was confused.  I documented both in the attached patch.
>>
>> The function one should have the same annotation as some others:
>>
>>  can be increased by recompiling
>> PostgreSQL
>>
>>
> I'd like to see a comment on the parameter count one too.
>
> "Alternatives include using a temporary table or passing them in as a
> single array parameter."
>
> About the only time this is likely to come up is with many parameters of
> the same type and meaning, pointing that out with the array option seems
> excessively wordy for the comment area.
>
> Needs a comma: 65,535
>
> Kinda think both should be tacked on to the end of the table.  I'd also
> put function arguments first so it appears under the compile time partition
> keys limit.
>
>
Cleanups for consistency:

Move "identifier length" after "partition keys" (before the new "function
arguments")

Add commas to: 1,600 and 1,664 and 8,192

David J.


Re: Document parameter count limit

2023-10-26 Thread David G. Johnston
On Thu, Oct 26, 2023 at 4:08 PM Tom Lane  wrote:

> Bruce Momjian  writes:
> > Ah, I was confused.  I documented both in the attached patch.
>
> The function one should have the same annotation as some others:
>
>  can be increased by recompiling
> PostgreSQL
>
>
I'd like to see a comment on the parameter count one too.

"Alternatives include using a temporary table or passing them in as a
single array parameter."

About the only time this is likely to come up is with many parameters of
the same type and meaning, pointing that out with the array option seems
excessively wordy for the comment area.
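
For example, rather than binding tens of thousands of individual parameters, a
client can bind one array parameter (a sketch; the table and column names are
made up):

-- one query parameter regardless of how many ids are passed
SELECT * FROM orders WHERE id = ANY($1::int[]);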

Needs a comma: 65,535

Kinda think both should be tacked on to the end of the table.  I'd also put
function arguments first so it appears under the compile time partition
keys limit.

David J.


Re: Document parameter count limit

2023-10-26 Thread Tom Lane
Bruce Momjian  writes:
> Ah, I was confused.  I documented both in the attached patch.

The function one should have the same annotation as some others:

 can be increased by recompiling 
PostgreSQL

regards, tom lane




Re: Document aggregate functions better w.r.t. ORDER BY

2023-10-26 Thread David G. Johnston
On Thu, Oct 26, 2023 at 4:03 PM Bruce Momjian  wrote:

>
> Sure, done in the attached patch.
>
>
WFM.  Thank You!

David J.


Re: Document aggregate functions better w.r.t. ORDER BY

2023-10-26 Thread Bruce Momjian
On Thu, Oct 26, 2023 at 03:44:14PM -0700, David G. Johnston wrote:
> On Thu, Oct 26, 2023 at 3:36 PM Bruce Momjian  wrote:
> 
> No sneaking.  ;-)  It would be bad to document this unevenly because it
> sets expectations in other parts of the system if we don't mention it.
> 
> 
> Agreed.
> 
> Last suggestion, remove the first jsonb_agg example that lacks an order by.
> 
> +WITH vals (k, v) AS ( VALUES ('key0','1'), ('key1','3'), ('key1','2') )
> +SELECT jsonb_object_agg(k, v) FROM vals;
> +      jsonb_object_agg
> +
> + {"key0": "1", "key1": "2"}
> +
> 
> We shouldn't write an example that relies on the rows being evaluated 1-2-3
> without specifying an order by clause.

Sure, done in the attached patch.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 7c3e940afe..3b49e63987 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -20273,6 +20273,13 @@ SELECT NULLIF(value, '(none)') ...
aggregation.
   
 
+  
+   While all aggregates below accept an optional
+   ORDER BY clause (as outlined in ), the clause has only been added to
+   aggregates whose output is affected by ordering.
+  
+

 General-Purpose Aggregate Functions
 
@@ -20310,7 +20317,7 @@ SELECT NULLIF(value, '(none)') ...
 
  array_agg
 
-array_agg ( anynonarray )
+array_agg ( anynonarray ORDER BY input_sort_columns )
 anyarray


@@ -20321,7 +20328,7 @@ SELECT NULLIF(value, '(none)') ...
 
   

-array_agg ( anyarray )
+array_agg ( anyarray ORDER BY input_sort_columns )
 anyarray


@@ -20526,14 +20533,14 @@ SELECT NULLIF(value, '(none)') ...
 
  json_agg
 
-json_agg ( anyelement )
+json_agg ( anyelement ORDER BY input_sort_columns )
 json


 
  jsonb_agg
 
-jsonb_agg ( anyelement )
+jsonb_agg ( anyelement ORDER BY input_sort_columns )
 jsonb


@@ -20573,7 +20580,8 @@ SELECT NULLIF(value, '(none)') ...
 
 json_object_agg ( key
  "any", value
- "any" )
+ "any"
+ ORDER BY input_sort_columns )
 json


@@ -20582,7 +20590,8 @@ SELECT NULLIF(value, '(none)') ...
 
 jsonb_object_agg ( key
  "any", value
- "any" )
+ "any"
+ ORDER BY input_sort_columns )
 jsonb


@@ -20819,7 +20828,8 @@ SELECT NULLIF(value, '(none)') ...


 string_agg ( value
- bytea, delimiter bytea )
+ bytea, delimiter bytea
+ ORDER BY input_sort_columns )
 bytea


@@ -20877,7 +20887,7 @@ SELECT NULLIF(value, '(none)') ...
 
  xmlagg
 
-xmlagg ( xml )
+xmlagg ( xml ORDER BY input_sort_columns )
 xml


diff --git a/doc/src/sgml/syntax.sgml b/doc/src/sgml/syntax.sgml
index 3ba844057f..80d71d362f 100644
--- a/doc/src/sgml/syntax.sgml
+++ b/doc/src/sgml/syntax.sgml
@@ -1647,7 +1647,19 @@ sqrt(2)
 are always just expressions and cannot be output-column names or numbers.
 For example:
 
-SELECT array_agg(a ORDER BY b DESC) FROM table;
+WITH vals (v) AS ( VALUES (1),(3),(4),(3),(2) )
+SELECT array_agg(v ORDER BY v DESC) FROM vals;
+  array_agg
+-
+ {4,3,3,2,1}
+
+Since jsonb only keeps the last matching key, ordering
+of its keys can be significant:
+WITH vals (k, v) AS ( VALUES ('key0','1'), ('key1','3'), ('key1','2') )
+SELECT jsonb_object_agg(k, v ORDER BY v) FROM vals;
+  jsonb_object_agg
+
+ {"key0": "1", "key1": "3"}
 

 
@@ -1668,20 +1680,19 @@ SELECT string_agg(a ORDER BY a, ',') FROM table;  -- incorrect

 

-If DISTINCT is specified in addition to an
-order_by_clause, then all the ORDER BY
-expressions must match regular arguments of the aggregate; that is,
-you cannot sort on an expression that is not included in the
-DISTINCT list.
+If DISTINCT is specified with an
+order_by_clause, ORDER
+BY expressions can only reference columns in the
+DISTINCT list.  For example:
+
+WITH vals (v) AS ( VALUES (1),(3),(4),(3),(2) )
+SELECT array_agg(DISTINCT v ORDER BY v DESC) FROM vals;
+ array_agg
+---
+ {4,3,2,1}
+

 
-   
-
- The ability to specify both DISTINCT and ORDER BY
- in an aggregate function is a PostgreSQL extension.
-
-   
-

 Placing ORDER BY within the aggregate's regular argument
 list, as described so far, is used when ordering the input rows for


Re: Document parameter count limit

2023-10-26 Thread Bruce Momjian
On Thu, Oct 26, 2023 at 06:56:40PM -0400, Tom Lane wrote:
> Bruce Momjian  writes:
> > Here is a patch to add this.
> 
> "function arguments" seems like a completely wrong description
> (and if we do want to document that limit, it's 100).
> 
> "query parameters" would work, perhaps.

Ah, I was confused.  I documented both in the attached patch.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.
diff --git a/doc/src/sgml/limits.sgml b/doc/src/sgml/limits.sgml
index d5b2b627dd..4bdc569ac5 100644
--- a/doc/src/sgml/limits.sgml
+++ b/doc/src/sgml/limits.sgml
@@ -80,6 +80,18 @@
  can be increased by recompiling PostgreSQL
 
 
+
+ query parameters
+ 65535
+ 
+
+
+
+ function arguments
+ 100
+ 
+
+
 
  indexes per table
  unlimited


Re: Document parameter count limit

2023-10-26 Thread David G. Johnston
On Thu, Oct 26, 2023 at 3:51 PM Bruce Momjian  wrote:

> On Wed, Nov 23, 2022 at 02:33:27PM -0600, Justin Pryzby wrote:
> > On Wed, Nov 23, 2022 at 12:35:59PM -0700, David G. Johnston wrote:
> > > On Wed, Nov 23, 2022 at 11:47 AM Tom Lane  wrote:
> > >
> > > > Bruce Momjian  writes:
> > > > > Does this come up enough to document it?  I assume the error
> message the
> > > > > user receives is clear.
> > > >
> > > > Looks like you get
> > > >
> > > > if (nParams < 0 || nParams > PQ_QUERY_PARAM_MAX_LIMIT)
> > > > {
> > > > libpq_append_conn_error(conn, "number of parameters must be
> between 0 and %d",
> > > >PQ_QUERY_PARAM_MAX_LIMIT);
> > > > return 0;
> > > > }
> > > >
> > > > which seems clear enough.
> > > >
> > > > I think the concern here is that somebody who's not aware that a
> limit
> > > > exists might write an application that thinks it can send lots of
> > > > parameters, and then have it fall over in production.  Now, I've got
> > > > doubts that an entry in the limits.sgml table will do much to prevent
> > > > that scenario.  But perhaps offering the advice to use an array
> parameter
> > > > will be worthwhile even after-the-fact.
> >
> > Yes, that's what happens :)
> >
> > I hit that error after increasing the number of VALUES(),() a loader
> > used in a prepared statement (and that was with our non-wide tables).
> >
> > +1 to document the limit along with the other limits.
>
> Here is a patch to add this.
>
>
We aren't talking about "function arguments" though...is there something
wrong with the term "parameters per query"?

I suggest we take this opportunity to decide how to handle values > 999 in
terms of separators.  The existing page is inconsistent.  I would prefer
adding the needed commas.

David J.


Re: Document parameter count limit

2023-10-26 Thread Tom Lane
Bruce Momjian  writes:
> Here is a patch to add this.

"function arguments" seems like a completely wrong description
(and if we do want to document that limit, it's 100).

"query parameters" would work, perhaps.

regards, tom lane




Re: Document parameter count limit

2023-10-26 Thread Bruce Momjian
On Wed, Nov 23, 2022 at 02:33:27PM -0600, Justin Pryzby wrote:
> On Wed, Nov 23, 2022 at 12:35:59PM -0700, David G. Johnston wrote:
> > On Wed, Nov 23, 2022 at 11:47 AM Tom Lane  wrote:
> > 
> > > Bruce Momjian  writes:
> > > > Does this come up enough to document it?  I assume the error message the
> > > > user receives is clear.
> > >
> > > Looks like you get
> > >
> > > if (nParams < 0 || nParams > PQ_QUERY_PARAM_MAX_LIMIT)
> > > {
> > > libpq_append_conn_error(conn, "number of parameters must be 
> > > between 0 and %d",
> > >PQ_QUERY_PARAM_MAX_LIMIT);
> > > return 0;
> > > }
> > >
> > > which seems clear enough.
> > >
> > > I think the concern here is that somebody who's not aware that a limit
> > > exists might write an application that thinks it can send lots of
> > > parameters, and then have it fall over in production.  Now, I've got
> > > doubts that an entry in the limits.sgml table will do much to prevent
> > > that scenario.  But perhaps offering the advice to use an array parameter
> > > will be worthwhile even after-the-fact.
> 
> Yes, that's what happens :)
> 
> I hit that error after increasing the number of VALUES(),() a loader
> used in a prepared statement (and that was with our non-wide tables).
> 
> +1 to document the limit along with the other limits.

Here is a patch to add this.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.
diff --git a/doc/src/sgml/limits.sgml b/doc/src/sgml/limits.sgml
index d5b2b627dd..de6bf7697f 100644
--- a/doc/src/sgml/limits.sgml
+++ b/doc/src/sgml/limits.sgml
@@ -80,6 +80,12 @@
  can be increased by recompiling PostgreSQL
 
 
+
+ function arguments
+ 65535
+ 
+
+
 
  indexes per table
  unlimited


Re: Does UCS_BASIC have the right CTYPE?

2023-10-26 Thread Jeff Davis
On Thu, 2023-10-26 at 23:22 +0200, Daniel Verite wrote:
> Neither does Unicode, which is why the ICU functions like u_isupper()
> or u_toupper() don't take a locale argument.

u_strToUpper() accepts a locale argument:
https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/ustring_8h.html#aa64fbd4ad23af84d01c931d7cfa25f89

See also the part about tailorings here:
https://www.unicode.org/versions/Unicode15.1.0/ch03.pdf#G33992

Regards,
Jeff Davis





Re: Document aggregate functions better w.r.t. ORDER BY

2023-10-26 Thread David G. Johnston
On Thu, Oct 26, 2023 at 3:36 PM Bruce Momjian  wrote:

> No sneaking.  ;-)  It would be bad to document this unevenly because it
> sets expectations in other parts of the system if we don't mention it.
>

Agreed.

Last suggestion, remove the first jsonb_agg example that lacks an order by.

+WITH vals (k, v) AS ( VALUES ('key0','1'), ('key1','3'), ('key1','2') )
+SELECT jsonb_object_agg(k, v) FROM vals;
+  jsonb_object_agg
+
+ {"key0": "1", "key1": "2"}
+

We shouldn't write an example that relies on the rows being evaluated 1-2-3
without specifying an order by clause.

David J.


Re: Document aggregate functions better w.r.t. ORDER BY

2023-10-26 Thread Bruce Momjian
On Thu, Oct 26, 2023 at 03:09:26PM -0700, David G. Johnston wrote:
> On Thu, Oct 26, 2023 at 2:56 PM Bruce Momjian  wrote:
> 
> On Wed, Oct 25, 2023 at 10:34:10PM -0700, David G. Johnston wrote:
> > I would reword the existing note to be something like:
> >
> > The SQL Standard defines specific aggregates and their properties,
> including
> > which of DISTINCT and/or ORDER BY is allowed.  Due to the extensible
> nature of
> > PostgreSQL it accepts either or both clauses for any aggregate.
> 
> Uh, is this something in my patch or somewhere else?  I don't think
> PostgreSQL extensible is an example of syntax flexibility.
> 
> 
> https://www.postgresql.org/docs/current/sql-expressions.html#SYNTAX-AGGREGATES
> 
> Note
> The ability to specify both DISTINCT and ORDER BY in an aggregate function is 
> a
> PostgreSQL extension.
> 
> I am pointing out that the first sentence of the existing note above seems to
> be factually incorrect.  I tried to make it correct - while explaining why we
> differ.  Though in truth I'd probably rather just remove the note.

Agreed, removed, patch attached.  This is just too complex to specify.

> > We get enough complaints regarding "apparent ordering" that I would like
> to
> > add:
> >
> > As a reminder, while some DISTINCT processing algorithms produce sorted
> output
> > as a side-effect, only by specifying ORDER BY is the output order
> guaranteed.
> 
> Well, we need to create a new email thread for this and look at all the
> areas is applies to since this is a much larger issue.
> 
> I was hoping to sneak this one in regardless of the bigger picture issues,
> since this specific combination is guaranteed to output ordered presently.

No sneaking.  ;-)  It would be bad to document this unevenly because it
sets expectations in other parts of the system if we don't mention it.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 7c3e940afe..3b49e63987 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -20273,6 +20273,13 @@ SELECT NULLIF(value, '(none)') ...
aggregation.
   
 
+  
+   While all aggregates below accept an optional
+   ORDER BY clause (as outlined in ), the clause has only been added to
+   aggregates whose output is affected by ordering.
+  
+

 General-Purpose Aggregate Functions
 
@@ -20310,7 +20317,7 @@ SELECT NULLIF(value, '(none)') ...
 
  array_agg
 
-array_agg ( anynonarray )
+array_agg ( anynonarray ORDER BY input_sort_columns )
 anyarray


@@ -20321,7 +20328,7 @@ SELECT NULLIF(value, '(none)') ...
 
   

-array_agg ( anyarray )
+array_agg ( anyarray ORDER BY input_sort_columns )
 anyarray


@@ -20526,14 +20533,14 @@ SELECT NULLIF(value, '(none)') ...
 
  json_agg
 
-json_agg ( anyelement )
+json_agg ( anyelement ORDER BY input_sort_columns )
 json


 
  jsonb_agg
 
-jsonb_agg ( anyelement )
+jsonb_agg ( anyelement ORDER BY input_sort_columns )
 jsonb


@@ -20573,7 +20580,8 @@ SELECT NULLIF(value, '(none)') ...
 
 json_object_agg ( key
  "any", value
- "any" )
+ "any"
+ ORDER BY input_sort_columns )
 json


@@ -20582,7 +20590,8 @@ SELECT NULLIF(value, '(none)') ...
 
 jsonb_object_agg ( key
  "any", value
- "any" )
+ "any"
+ ORDER BY input_sort_columns )
 jsonb


@@ -20819,7 +20828,8 @@ SELECT NULLIF(value, '(none)') ...


 string_agg ( value
- bytea, delimiter bytea )
+ bytea, delimiter bytea
+ ORDER BY input_sort_columns )
 bytea


@@ -20877,7 +20887,7 @@ SELECT NULLIF(value, '(none)') ...
 
  xmlagg
 
-xmlagg ( xml )
+xmlagg ( xml ORDER BY input_sort_columns )
 xml


diff --git a/doc/src/sgml/syntax.sgml b/doc/src/sgml/syntax.sgml
index 3ba844057f..ec089fac06 100644
--- a/doc/src/sgml/syntax.sgml
+++ b/doc/src/sgml/syntax.sgml
@@ -1647,7 +1647,26 @@ sqrt(2)
 are always just expressions and cannot be output-column names or numbers.
 For example:
 
-SELECT array_agg(a ORDER BY b DESC) FROM table;
+WITH vals (v) AS ( VALUES (1),(3),(4),(3),(2) )
+SELECT array_agg(v ORDER BY v DESC) FROM vals;
+  array_agg
+-
+ {4,3,3,2,1}
+
+Since jsonb only keeps the last matching key, ordering
+of its keys can be significant:
+
+WITH vals (k, v) AS ( VALUES ('key0','1'), ('key1','3'), ('key1','2') )
+SELECT jsonb_object_agg(k, v) FROM 

Re: Improve WALRead() to suck data directly from WAL buffers when possible

2023-10-26 Thread Bharath Rupireddy
On Wed, Oct 25, 2023 at 5:45 AM Jeff Davis  wrote:
>
> Comments:

Thanks for reviewing.

> * It would be good to document that this is partially an optimization
> (read from memory first) and partially an API difference that allows
> reading unflushed data. For instance, walsender may benefit
> performance-wise (and perhaps later with the ability to read unflushed
> data) whereas pg_walinspect benefits primarily from reading unflushed
> data.

Commit message has these things covered in detail. However, I think
adding some info in the code comments is a good idea and done around
the WALRead() function in the attached v13 patch set.
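
For context, a hedged sketch of the kind of pg_walinspect query that stands
to benefit once reads can be served from WAL buffers (the start LSN is
arbitrary and the extension must be installed):

-- CREATE EXTENSION pg_walinspect;
SELECT start_lsn, resource_manager, record_type, record_length
FROM pg_get_wal_records_info('0/1000000', pg_current_wal_insert_lsn())
LIMIT 5;
-- with the patch, records up to the insert position could be read from
-- WAL buffers rather than only from already-flushed WAL.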

> * Shouldn't there be a new method in XLogReaderRoutine (e.g.
> read_unflushed_data), rather than having logic in WALRead()? The
> callers can define the method if it makes sense (and that would be a
> good place to document why); or leave it NULL if not.

I've designed the new function XLogReadFromBuffers to read from WAL
buffers in such a way that one can easily embed it in page_read
callbacks if it makes sense. Almost all the available backend
page_read callbacks read_local_xlog_page_no_wait,
read_local_xlog_page, logical_read_xlog_page except XLogPageRead
(which is used for recovery when WAL buffers aren't used at all) have
one thing in common, that is, WALRead(). Therefore, it seemed a
natural choice for me to call XLogReadFromBuffers. In other words, I'd
say it's the responsibility of page_read callback implementers to
decide if they want to read from WAL buffers or not and hence I don't
think we need a separate XLogReaderRoutine.

If someone wants to read unflushed WAL, the typical way to implement
it is to write a new page_read callback
read_local_unflushed_xlog_page/logical_read_unflushed_xlog_page or
similar without WALRead() but just the new function
XLogReadFromBuffers to read from WAL buffers and return.

> * I'm not clear on the "partial hit" case. Wouldn't that mean you found
> the earliest byte in the buffers, but not the latest byte requested? Is
> that possible, and if so under what circumstances? I added an
> "Assert(nread == 0 || nread == count)" in WALRead() after calling
> XLogReadFromBuffers(), and it wasn't hit.
>
> * If the partial hit case is important, wouldn't XLogReadFromBuffers()
> fill in the end of the buffer rather than the beginning?

Partial hit was possible when the requested WAL pages are read one
page at a time from WAL buffers with WALBufMappingLock
acquisition-release for each page as the requested page can be
replaced by the time the lock is released and reacquired. This was the
case up until the v6 patch -
https://www.postgresql.org/message-id/CALj2ACWTNneq2EjMDyUeWF-BnwpewuhiNEfjo9bxLwFU9iPF0w%40mail.gmail.com.
Now the approach has been changed to read multiple pages at once
under a single WALBufMappingLock acquisition-release.
We can either keep the partial hit handling (just to not miss
anything) or turn the following partial hit case into an error or an
Assert(false). I prefer to keep the partial hit handling as-is, just
in case:
+else if (count > nread)
+{
+/*
+ * Buffer partial hit, so reset the state to count the read bytes
+ * and continue.
+ */
+buf += nread;
+startptr += nread;
+count -= nread;
+}

> * Other functions that use xlblocks, e.g. GetXLogBuffer(), use more
> effort to avoid acquiring WALBufMappingLock. Perhaps you can avoid it,
> too? One idea is to check that XLogCtl->xlblocks[idx] is equal to
> expectedEndPtr both before and after the memcpy(), with appropriate
> barriers. That could mitigate concerns expressed by Kyotaro Horiguchi
> and Masahiko Sawada.

Yes, I proposed that idea in another thread -
https://www.postgresql.org/message-id/CALj2ACVFSirOFiABrNVAA6JtPHvA9iu%2Bwp%3DqkM9pdLZ5mwLaFg%40mail.gmail.com.
If that looks okay, I can add it to the next version of this patch
set.

> * Are you sure that reducing the number of calls to memcpy() is a win?
> I would expect that to be true only if the memcpy()s are tiny, but here
> they are around XLOG_BLCKSZ. I believe this was done based on a comment
> from Nathan Bossart, but I didn't really follow why that's important.
> Also, if we try to use one memcpy for all of the data, it might not
> interact well with my idea above to avoid taking the lock.

Up until the v6 patch -
https://www.postgresql.org/message-id/CALj2ACWTNneq2EjMDyUeWF-BnwpewuhiNEfjo9bxLwFU9iPF0w%40mail.gmail.com,
the requested WAL was being read one page at a time from WAL buffers
into the output buffer with one memcpy call for each page. Now the
approach has been changed to read multiple pages at once under a single
WALBufMappingLock acquisition-release with a comparatively smaller
number of memcpy calls. I honestly haven't seen any difference between the
two approaches -
https://www.postgresql.org/message-id/CALj2ACUpQGiwQTzmoSMOFk5%3DWiJc06FcYpxzBX0SEej4ProRzg%40mail.gmail.com.

The new approach 

Re: Document aggregate functions better w.r.t. ORDER BY

2023-10-26 Thread David G. Johnston
On Thu, Oct 26, 2023 at 2:56 PM Bruce Momjian  wrote:

> On Wed, Oct 25, 2023 at 10:34:10PM -0700, David G. Johnston wrote:
> > I would reword the existing note to be something like:
> >
> > The SQL Standard defines specific aggregates and their properties,
> including
> > which of DISTINCT and/or ORDER BY is allowed.  Due to the extensible
> nature of
> > PostgreSQL it accepts either or both clauses for any aggregate.
>
> Uh, is this something in my patch or somewhere else?  I don't think
> PostgreSQL being extensible is an example of syntax flexibility.
>

https://www.postgresql.org/docs/current/sql-expressions.html#SYNTAX-AGGREGATES

Note
The ability to specify both DISTINCT and ORDER BY in an aggregate function
is a PostgreSQL extension.

I am pointing out that the first sentence of the existing note above seems
to be factually incorrect.  I tried to make it correct - while explaining
why we differ.  Though in truth I'd probably rather just remove the note.

> We get enough complaints regarding "apparent ordering" that I would like
> to
> > add:
> >
> > As a reminder, while some DISTINCT processing algorithms produce sorted
> output
> > as a side-effect, only by specifying ORDER BY is the output order
> guaranteed.
>
> Well, we need to create a new email thread for this and look at all the
> areas is applies to since this is a much larger issue.
>
>
I was hoping to sneak this one in regardless of the bigger picture issues,
since this specific combination is guaranteed to output ordered presently.

David J.


Re: [PoC] pg_upgrade: allow to upgrade publisher node

2023-10-26 Thread Peter Smith
On Fri, Oct 27, 2023 at 2:26 AM Bharath Rupireddy
 wrote:
>
> On Thu, Oct 26, 2023 at 8:11 PM Zhijie Hou (Fujitsu)
>  wrote:
> >
> > The BF animal fairywren[1] failed when testing
> > 003_upgrade_logical_replication_slots.pl.
> >
> > From the log, I can see pg_upgrade failed to open the
> > invalid_logical_replication_slots.txt:
> >
> > # Checking for valid logical replication slots
> > # could not open file 
> > "C:/tools/nmsys64/home/pgrunner/bf/root/HEAD/pgsql.build/testrun/pg_upgrade/003_upgrade_logical_replication_slots/data/t_003_upgrade_logical_replication_slots_new_publisher_data/pgdata/pg_upgrade_output.d/20231026T112558.309/invalid_logical_replication_slots.txt":
> >  No such file or directory
> > # Failure, exiting
> >
> > The reason could be that the length of this path (262) exceeds the Windows path
> > limit(260 IIRC). If so, I recall we fixed similar things before (e213de8e7) 
> > by
> > reducing the path somehow.
>
> Nice catch. Windows docs say that the file/directory path name can't
> exceed MAX_PATH, which is defined as 260 characters. However, one must
> opt-in to enable longer path names -
> https://learn.microsoft.com/en-us/windows/win32/fileio/maximum-file-path-limitation?tabs=registry
> and 
> https://learn.microsoft.com/en-us/windows/win32/fileio/maximum-file-path-limitation?tabs=registry#enable-long-paths-in-windows-10-version-1607-and-later.
>
> > In this case, I think one approach is to reduce the file and testname to
> > xxx_logical_slots instead of xxx_logical_replication_slots. But we will 
> > analyze more
> > and share a fix soon.
> >
> > [1] 
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren=2023-10-26%2009%3A04%3A54
>
> +1 for s/003_upgrade_logical_replication_slots.pl/003_upgrade_logical_slots.pl
> and s/invalid_logical_replication_slots.txt/invalid_logical_slots.txt.
> In fact, we've used "logical slots" instead of "logical replication
> slots" in the docs to be generic. By looking at the generated
> directory path name, I think we can use shorter node names - instead
> of old_publisher, new_publisher, subscriber - either use node1 (for
> old publisher), node2 (for subscriber), node3 (for new publisher) or
> use alpha (for old publisher), bravo (for subscriber), charlie (for
> new publisher) or such shorter names. We don't have to be that
> descriptive and long in node names, one can look at the test file to
> know which one is what.
>

Some more ideas for shortening the filename:

1. "003_upgrade_logical_replication_slots.pl" -- IMO the word
"upgrade" is redundant in that filename (earlier patches never had
this). The test file lives under "pg_upgrade/t" so I felt that
upgrading is already implied.

2. If the node names are going to be shortened, they should still retain *some*
meaning if possible:
old_publisher/subscriber/new_publisher --> node1/node2/node3 (means
nothing without studying the tests)
old_publisher/subscriber/new_publisher --> alpha/bravo/charlie (means
nothing without studying the tests)
How about:
old_publisher/subscriber/new_publisher --> node_p1/node_s/node_p2
or similar...

==
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Document aggregate functions better w.r.t. ORDER BY

2023-10-26 Thread Bruce Momjian
On Wed, Oct 25, 2023 at 10:34:10PM -0700, David G. Johnston wrote:
> I would reword the existing note to be something like:
> 
> The SQL Standard defines specific aggregates and their properties, including
> which of DISTINCT and/or ORDER BY is allowed.  Due to the extensible nature of
> PostgreSQL it accepts either or both clauses for any aggregate.

Uh, is this something in my patch or somewhere else?  I don't think
PostgreSQL being extensible is an example of syntax flexibility.

> From the most recent patch:
> 
>     
> -    If DISTINCT is specified in addition to an
> -    order_by_clause, then all the ORDER 
> BY
> 
> -    expressions must match regular arguments of the aggregate; that is,
> -    you cannot sort on an expression that is not included in the
> -    DISTINCT list.
> +    If DISTINCT is specified with an
> +    order_by_clause, ORDER
> +    BY expressions can only reference columns in the
> +    DISTINCT list.  For example:
> +
> +WITH vals (v1, v2) AS ( VALUES (1,'Z'),(3,'D'),(4,'R'),(3,'A'),(2,'T') )
> +SELECT array_agg(DISTINCT v2 ORDER BY v2 DESC) FROM vals;
> +  array_agg
> +-
> + {Z,T,R,D,A}
> +
> 
> The change to a two-column vals was mostly to try and find corner-cases that
> might need to be addressed.  If we don't intend to show the error case of
> DISTINCT v1 ORDER BY v2 then we should go back to the original example and 
> just
> add ORDER BY v DESC.  I'm fine with not using string_agg here.
> 
> +    For example:
> +
> +WITH vals (v) AS ( VALUES (1),(3),(4),(3),(2) )
> +SELECT array_agg(DISTINCT v ORDER BY v DESC) FROM vals;
> + array_agg
> +---
> + {4,3,2,1}
> +

Okay, good, switched in the attached patch.

> We get enough complaints regarding "apparent ordering" that I would like to
> add:
> 
> As a reminder, while some DISTINCT processing algorithms produce sorted output
> as a side-effect, only by specifying ORDER BY is the output order guaranteed.

Well, we need to create a new email thread for this and look at all the
areas is applies to since this is a much larger issue.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 7c3e940afe..3b49e63987 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -20273,6 +20273,13 @@ SELECT NULLIF(value, '(none)') ...
aggregation.
   
 
+  
+   While all aggregates below accept an optional
+   ORDER BY clause (as outlined in ), the clause has only been added to
+   aggregates whose output is affected by ordering.
+  
+

 General-Purpose Aggregate Functions
 
@@ -20310,7 +20317,7 @@ SELECT NULLIF(value, '(none)') ...
 
  array_agg
 
-array_agg ( anynonarray )
+array_agg ( anynonarray ORDER BY input_sort_columns )
 anyarray


@@ -20321,7 +20328,7 @@ SELECT NULLIF(value, '(none)') ...
 
   

-array_agg ( anyarray )
+array_agg ( anyarray ORDER BY input_sort_columns )
 anyarray


@@ -20526,14 +20533,14 @@ SELECT NULLIF(value, '(none)') ...
 
  json_agg
 
-json_agg ( anyelement )
+json_agg ( anyelement ORDER BY input_sort_columns )
 json


 
  jsonb_agg
 
-jsonb_agg ( anyelement )
+jsonb_agg ( anyelement ORDER BY input_sort_columns )
 jsonb


@@ -20573,7 +20580,8 @@ SELECT NULLIF(value, '(none)') ...
 
 json_object_agg ( key
  "any", value
- "any" )
+ "any"
+ ORDER BY input_sort_columns )
 json


@@ -20582,7 +20590,8 @@ SELECT NULLIF(value, '(none)') ...
 
 jsonb_object_agg ( key
  "any", value
- "any" )
+ "any"
+ ORDER BY input_sort_columns )
 jsonb


@@ -20819,7 +20828,8 @@ SELECT NULLIF(value, '(none)') ...


 string_agg ( value
- bytea, delimiter bytea )
+ bytea, delimiter bytea
+ ORDER BY input_sort_columns )
 bytea


@@ -20877,7 +20887,7 @@ SELECT NULLIF(value, '(none)') ...
 
  xmlagg
 
-xmlagg ( xml )
+xmlagg ( xml ORDER BY input_sort_columns )
 xml


diff --git a/doc/src/sgml/syntax.sgml b/doc/src/sgml/syntax.sgml
index 3ba844057f..92336fb929 100644
--- a/doc/src/sgml/syntax.sgml
+++ b/doc/src/sgml/syntax.sgml
@@ -1647,7 +1647,26 @@ sqrt(2)
 are always just expressions and cannot be output-column names or numbers.
 For example:
 
-SELECT array_agg(a ORDER BY b DESC) FROM table;
+WITH vals (v) AS ( VALUES (1),(3),(4),(3),(2) )
+SELECT array_agg(v ORDER BY v DESC) FROM vals;
+  array_agg
+-
+ {4,3,3,2,1}
+
+Since jsonb only keeps the last matching key, ordering
+of its 

Re: Making aggregate deserialization (and WAL receive) functions slightly faster

2023-10-26 Thread David Rowley
On Thu, 26 Oct 2023 at 17:00, David Rowley  wrote:
> Thanks for looking at this again. I fixed up each of those and pushed
> the result, mentioning the incompatibility in the commit message.
>
> Now that that's done, I've attached a patch which makes use of the new
> initReadOnlyStringInfo initializer function for the original case
> mentioned when I opened this thread. I don't think there are any
> remaining objections to this, but I'll let it sit for a bit to see.

I've just pushed the deserial function optimisation patch.

I was just looking at a few other places where we might want to make
use of initReadOnlyStringInfo.

* parallel.c in HandleParallelMessages():

Drilling into HandleParallelMessage(), I see the PqMsg_BackendKeyData
case just reads a fixed number of bytes.  In some of the other
"switch" cases, I see calls pq_getmsgrawstring() either directly or
indirectly.  I see the counterpart to pq_getmsgrawstring() is
pq_sendstring() which always appends the NUL char to the StringInfo,
so I don't think not NUL terminating the received bytes is a problem
as cstrings seem to be sent with the NUL terminator.

This case just seems to handle ERROR/NOTICE messages coming from
parallel workers. Not tuples themselves. It may not be that
interesting a case to speed up.

* applyparallelworker.c in HandleParallelApplyMessages():

Drilling into HandleParallelApplyMessage(), I don't see anything there
that needs the input StringInfo to be NUL terminated.

* worker.c in apply_spooled_messages():

Drilling into apply_dispatch() and going through each of the cases, I
see logicalrep_read_tuple() pallocs a new buffer and ensures it's
always NUL terminated which will be required in LOGICALREP_COLUMN_TEXT
mode. (There seems to be further optimisation opportunities there
where we could not do the palloc when in LOGICALREP_COLUMN_BINARY mode
and just point value's buffer directly to the correct portion of the
input StringInfo's buffer).

* walreceiver.c in XLogWalRcvProcessMsg():

Nothing there seems to require the incoming_message StringInfo to have
a NUL terminator.  I imagine this one is the most worthwhile to do out
of the 4.  I've not tested to see if there are any performance
improvements.

Does anyone see any reason why we can't do the attached?

David
diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
index 194a1207be..f3708005d2 100644
--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -1087,10 +1087,8 @@ HandleParallelMessages(void)
 		{
 			StringInfoData msg;
 
-			initStringInfo(&msg);
-			appendBinaryStringInfo(&msg, data, nbytes);
+			initReadOnlyStringInfo(&msg, data, nbytes);
 			HandleParallelMessage(pcxt, i, &msg);
-			pfree(msg.data);
 		}
 		else
 			ereport(ERROR,
diff --git a/src/backend/replication/logical/applyparallelworker.c b/src/backend/replication/logical/applyparallelworker.c
index 9b37736f8e..fcc33bda10 100644
--- a/src/backend/replication/logical/applyparallelworker.c
+++ b/src/backend/replication/logical/applyparallelworker.c
@@ -1117,10 +1117,8 @@ HandleParallelApplyMessages(void)
 		{
 			StringInfoData msg;
 
-			initStringInfo(&msg);
-			appendBinaryStringInfo(&msg, data, nbytes);
+			initReadOnlyStringInfo(&msg, data, nbytes);
 			HandleParallelApplyMessage(&msg);
-			pfree(msg.data);
 		}
 		else
 			ereport(ERROR,
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index ba67eb156f..52a9f136ab 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -2019,7 +2019,6 @@ void
 apply_spooled_messages(FileSet *stream_fileset, TransactionId xid,
 					   XLogRecPtr lsn)
 {
-	StringInfoData s2;
 	int			nchanges;
 	char		path[MAXPGPATH];
 	char	   *buffer = NULL;
@@ -2057,7 +2056,6 @@ apply_spooled_messages(FileSet *stream_fileset, TransactionId xid,
 	CurrentResourceOwner = oldowner;
 
 	buffer = palloc(BLCKSZ);
-	initStringInfo(&s2);
 
 	MemoryContextSwitchTo(oldcxt);
 
@@ -2079,6 +2077,7 @@ apply_spooled_messages(FileSet *stream_fileset, TransactionId xid,
 	nchanges = 0;
 	while (true)
 	{
+		StringInfoData s2;
 		size_t		nbytes;
 		int			len;
 
@@ -2104,9 +2103,8 @@ apply_spooled_messages(FileSet *stream_fileset, TransactionId xid,
 
   

Re: Does UCS_BASIC have the right CTYPE?

2023-10-26 Thread Tom Lane
"Daniel Verite"  writes:
> To me the question of what we should put in pg_collation.collctype
> for the "ucs_basic" collation leads to another question which is:
> why do we even consider collctype in the first place?

For starters, C locale should certainly act different from others.

I'm not sold that arguing from Unicode's behavior to other encodings
makes sense, either.  Unicode can get away with defining that there's
only one case-folding rule because they have the luxury of inventing
new code points when the "same" glyph should act differently according
to different languages' rules.  Encodings with a small number of code
points don't have that luxury.  In particular see the mess around dotted
and dotless I/J in Turkish vs. everywhere else.

regards, tom lane




Re: [17] Special search_path names "!pg_temp" and "!pg_catalog"

2023-10-26 Thread Nathan Bossart
On Fri, Aug 18, 2023 at 02:44:31PM -0700, Jeff Davis wrote:
> +SET search_path = admin, "!pg_temp";

I think it's unfortunate that these new identifiers must be quoted.  I
wonder if we could call these something like "no_pg_temp".  *shrug*

> +  * Add any implicitly-searched namespaces to the list unless the markers
> +  * "!pg_catalog" or "!pg_temp" are present.  Note these go on the front,
> +  * not the back; also notice that we do not check USAGE permissions for
> +  * these.
>*/
> - if (!list_member_oid(oidlist, PG_CATALOG_NAMESPACE))
> + if (implicit_pg_catalog &&
> + !list_member_oid(oidlist, PG_CATALOG_NAMESPACE))
>   oidlist = lcons_oid(PG_CATALOG_NAMESPACE, oidlist);
>  
> - if (OidIsValid(myTempNamespace) &&
> + if (implicit_pg_temp &&
> + OidIsValid(myTempNamespace) &&
>   !list_member_oid(oidlist, myTempNamespace))
>   oidlist = lcons_oid(myTempNamespace, oidlist);

Should we disallow including both !pg_temp and pg_temp at the same time?  I
worry that could be a source of confusion.  IIUC your patches effectively
ignore !pg_temp if pg_temp is explicitly listed elsewhere in the list.
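
For illustration, the kind of setting at issue ("!pg_temp" is the syntax
proposed by these patches, not an existing feature; "admin" is just an
example schema):

SET search_path = pg_temp, admin, "!pg_temp";
-- as I read the patches, the explicit pg_temp entry wins here and the
-- "!pg_temp" marker is effectively ignored.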

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Add recovery to pg_control and remove backup_label

2023-10-26 Thread David G. Johnston
On Thu, Oct 26, 2023 at 2:02 PM David Steele  wrote:

> Hackers,
>
> This was originally proposed in [1] but that thread went through a
> number of different proposals so it seems better to start anew.
>
> The basic idea here is to simplify and harden recovery by getting rid of
> backup_label and storing recovery information directly in pg_control.
> Instead of backup software copying pg_control from PGDATA, it stores an
> updated version that is returned from pg_backup_stop(). I believe this
> is better for the following reasons:
>
> * The user can no longer remove backup_label and get what looks like a
> successful restore (while almost certainly causing corruption). If
> pg_control is removed the cluster will not start. The user may try
> pg_resetwal, but I think that tool makes it pretty clear that corruption
> will result from its use. We could also modify pg_resetwal to complain
> if recovery info is present in pg_control.
>
> * We don't need to worry about backup software seeing a torn copy of
> pg_control, since Postgres can safely read it out of memory and provide
> a valid copy via pg_backup_stop(). This solves [2] without needing to
> write pg_control via a temp file, which may affect performance on a
> standby. Unfortunately, this solution cannot be back patched.
>

Are we planning on dealing with torn writes in the back branches in some
way or are we just throwing in the towel and saying the old method is too
error-prone to exist/retain and therefore the goal of the v17 changes is to
not only provide a better way but also to ensure the old way no longer
works?  It seems sufficient to change the output signature of
pg_backup_stop to accomplish that goal though I am pondering whether an
explicit check and error for seeing the backup_label file would be
warranted.

If we are going to solve the torn writes problem completely then while I
agree the new way is superior, implementing it doesn't have to mean
existing tools built to produce backup_label and rely upon the pg_control
in the data directory need to be forcibly broken.


> I know that outputting pg_control as bytea is going to be a bit
> controversial. Software that uses psql to run pg_backup_stop()
> could use encode() to get pg_control as text and then decode it later.
> Alternately, we could update ReadControlFile() to recognize a
> base64-encoded pg_control file. I'm not sure dealing with binary data is
> that much of a problem, though, and if the backup software gets it wrong
> then recovery will fail on an invalid pg_control file.
>

Can we not figure out some way to place the relevant files onto the server
somewhere so that a simple "cp" command would work?  Have pg_backup_stop
return paths instead of contents, those paths being "$TEMP_DIR"//pg_control.conf (and tablespace_map)

David J.


Re: Does UCS_BASIC have the right CTYPE?

2023-10-26 Thread Daniel Verite
Peter Eisentraut wrote:

> > That seems to suggest the standard answer should be 'Á' regardless of
> > any COLLATE clause (though I could be misreading). I'm a bit confused
> > by that... what's the standard-compatible way to specify the locale for
> > UPPER()/LOWER()? If there is none, then it makes sense that Postgres
> > overloads the COLLATE clause for that purpose so that users can use a
> > different locale if they want.
> 
> The standard doesn't have the notion of locale-dependent case conversion.

Neither does Unicode, which is why the ICU functions like u_isupper()
or u_toupper() don't take a locale argument.

With libc, isupper_l() and the other ctype functions need a locale
argument, but given a locale's value of
"language[_territory][.codeset]", in theory only the codeset part is
actually useful.

To me the question of what we should put in pg_collation.collctype
for the "ucs_basic" collation leads to another question which is:
why do we even consider collctype in the first place?

Within a database, there's only one "codeset", which corresponds
to pg_database.encoding, and there's a value in pg_database.lc_ctype
that is normally compatible with that encoding.
ISTM that UPPER(string COLLATE "whatever") should always give
the same result as UPPER(string COLLATE pg_catalog.default). And
likewise all functions that depend on character categories could
basically ignore the COLLATE specification, given that our
database-wide properties are sufficient to characterize the strings
within.
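
As a concrete form of that claim, a sketch of the kind of pair that should
not be able to differ (assuming a UTF-8 database; the input string is
arbitrary):

SELECT upper('árvíztűrő' COLLATE "ucs_basic") AS with_ucs_basic,
       upper('árvíztűrő' COLLATE pg_catalog."default") AS with_default;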


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite




Re: [PATCH] Tracking statements entry timestamp in pg_stat_statements

2023-10-26 Thread Alena Rybakina

On 25.10.2023 18:35, Andrei Zubkov wrote:

Hi Alena,

On Wed, 2023-10-25 at 16:25 +0300, Alena Rybakina wrote:

  Hi! Thank you for your work on the subject.
1. I didn't understand why we first try to find pgssEntry with a
false top_level value, and later find it with a true top_level value.

In case of pg_stat_statements the top_level field is the bool field of
the pgssHashKey struct used as the key for pgss_hash hashtable. When we
are performing a reset only userid, dbid and queryid values are
provided. We need to reset both top-level and non-top level entries. We
have only one way to get them all from a hashtable - search for entries
having top_level=true and with top_level=false in their keys.
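
As an illustration, a hedged sketch of the two variants involved (the OIDs
and queryid are hypothetical):

-- the same statement can have two entries differing only in toplevel:
SELECT queryid, toplevel, calls, min_exec_time, max_exec_time
FROM pg_stat_statements
WHERE queryid = 1234567890;
-- a reset keyed by (userid, dbid, queryid) therefore has to visit both:
SELECT pg_stat_statements_reset(10, 16384, 1234567890);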

Thank you for the explanation, I got it.

2. And honestly, I think you should change
  "Remove the entry if it exists, starting with the non-top-level
entry." to
"Remove the entry or reset min/max time statistic information and
the timestamp if it exists, starting with the non-top-level entry."
And the same with the comment below:
"Also reset the top-level entry if it exists."
  "Also remove the entry or reset min/max time statistic information
and the timestamp if it exists."

There are four such places actually - every place where the
SINGLE_ENTRY_RESET macro is used. The reset logic performed is
described a bit in that macro's definition. It seems quite redundant to
repeat this description four times. But I've noted that the "remove"
wording should be replaced by "reset".

I've attached v24 with commits updated.

Yes, I agree with the changes.

--
Regards,
Alena Rybakina


Re: POC, WIP: OR-clause support for indexes

2023-10-26 Thread Peter Geoghegan
On Thu, Oct 26, 2023 at 12:59 PM Robert Haas  wrote:
> Alexander's example seems to show that it's not that simple. If I'm
> reading his example correctly, with things like aid = 1, the
> transformation usually wins even if the number of things in the OR
> expression is large, but with things like aid + 1 * bid = 1, the
> transformation seems to lose at least with larger numbers of items. So
> it's not JUST the number of OR elements but also what they contain,
> unless I'm misunderstanding his point.

Alexander said "Generally, I don't see why ANY could be executed
slower than the equivalent OR clause". I understood that this was his
way of expressing the following idea:

"In principle, there is no reason to expect execution of ANY() to be
slower than execution of an equivalent OR clause (except for
noise-level differences). While it might not actually look that way
for every single type of plan you can imagine right now, that doesn't
argue for making a cost-based decision. It actually argues for fixing
the underlying issue, which can't possibly be due to some kind of
fundamental advantage enjoyed by expression evaluation with ORs".

This is also what I think of all this.

Alexander's partial index example had this quality to it. Obviously,
the planner *could* be taught to do the right thing with such a case,
with a little more work. The fact that it doesn't right now is
definitely a problem, and should probably be treated as a blocker for
this patch. But that doesn't really argue against the general idea
behind the patch -- it just argues for fixing that one problem.

There may also be a separate problem that comes from the added planner
cycles required to do the transformation -- particularly in extreme or
adversarial cases. We should worry about that, too. But, again, it
doesn't change the basic fact, which is that having a
standard/normalized representation of OR lists/DNF transformation is
extremely useful in general, and *shouldn't* result in any real
slowdowns at execution time if done well.

> To me, this sort of output suggests that perhaps the transformation is
> being done in the wrong place. I expect that we have to decide whether
> to convert from OR to = ANY(...) at a very early stage of the planner,
> before we have any idea what the selected path will ultimately be. But
> this output suggests that we want the answer to depend on which kind
> of path is going to be faster, which would seem to argue for doing
> this sort of transformation as part of path generation for only those
> paths that will benefit from it, rather than during earlier phases of
> expression processing/simplification.

I don't think that that's the right direction. They're semantically
equivalent things. But a SAOP-based plan can be fundamentally better,
since SAOPs enable passing down useful context to index AMs (at least
nbtree). And because we can use a hash table for SAOP expression
evaluation. It's a higher level, standardized, well optimized way of
expressing exactly the same concept.

I can come up with a case that'll be orders of magnitude more
efficient with this patch, despite the transformation process only
affecting a small OR list of 3 or 5 elements -- a 100x reduction in
heap page accesses is quite possible. This is particularly likely to
come up if you assume that the nbtree patch that I'm currently working
on is also available. In general, I think that we totally over-rely on
bitmap index scans, especially BitmapOrs.
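
To sketch the kind of case I mean (a hypothetical table t with an index on
column a; plan shapes are indicative, not guaranteed):

EXPLAIN (COSTS OFF) SELECT * FROM t WHERE a = 1 OR a = 2 OR a = 3;
-- today this often ends up as a BitmapOr of several index scans
EXPLAIN (COSTS OFF) SELECT * FROM t WHERE a = ANY ('{1,2,3}');
-- the SAOP form can be a single index scan, which is the shape the
-- transformation (plus the nbtree work mentioned above) aims for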

-- 
Peter Geoghegan




Add recovery to pg_control and remove backup_label

2023-10-26 Thread David Steele

Hackers,

This was originally proposed in [1] but that thread went through a 
number of different proposals so it seems better to start anew.


The basic idea here is to simplify and harden recovery by getting rid of 
backup_label and storing recovery information directly in pg_control. 
Instead of backup software copying pg_control from PGDATA, it stores an 
updated version that is returned from pg_backup_stop(). I believe this 
is better for the following reasons:


* The user can no longer remove backup_label and get what looks like a 
successful restore (while almost certainly causing corruption). If 
pg_control is removed the cluster will not start. The user may try 
pg_resetwal, but I think that tool makes it pretty clear that corruption 
will result from its use. We could also modify pg_resetwal to complain 
if recovery info is present in pg_control.


* We don't need to worry about backup software seeing a torn copy of 
pg_control, since Postgres can safely read it out of memory and provide 
a valid copy via pg_backup_stop(). This solves [2] without needing to 
write pg_control via a temp file, which may affect performance on a 
standby. Unfortunately, this solution cannot be back patched.


* For backup from standby, we no longer need to instruct the backup 
software to copy pg_control last. In fact the backup software should not 
copy pg_control from PGDATA at all.


Since backup_label is now gone, the fields that used to be in 
backup_label are now provided as columns returned from pg_backup_start() 
and pg_backup_stop() and the backup history file is still written to the 
archive. For pg_basebackup we would have the option of writing the 
fields into the JSON manifest, storing them to a file (e.g. 
backup.info), or just ignoring them. None of the fields are required for 
recovery but backup software may be very interested in them.


I updated pg_rewind but I'm not very confident in the tests. When I 
removed backup_label processing, but before I updated pg_rewind to write 
recovery info into pg_control, all the rewind tests passed.


This patch highlights the fact that we still have no tests for the 
low-level backup method. I modified pgBackRest to work with this patch 
and the entire test suite ran without any issues, but in-core tests 
would be good to have. I'm planning to work on those myself as a 
separate patch.


This patch would also make the proposal in [3] obsolete since there is 
no need to rename backup_label if it is gone.


I know that outputting pg_control as bytea is going to be a bit 
controversial. Software that uses psql to run pg_backup_stop() 
could use encode() to get pg_control as text and then decode it later. 
Alternately, we could update ReadControlFile() to recognize a 
base64-encoded pg_control file. I'm not sure dealing with binary data is 
that much of a problem, though, and if the backup software gets it wrong 
then recovery will fail on an invalid pg_control file.
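
A minimal sketch of that encode/decode round trip (the pg_control output
column follows this proposal and is hypothetical; writing the decoded bytes
into the backup's global/pg_control is up to the backup software):

-- at backup time, from psql:
SELECT encode(pg_control, 'base64')
FROM pg_backup_stop(wait_for_archive => true);
-- later, turn the saved text back into bytes:
SELECT decode('...saved base64 text...', 'base64');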


Lastly, I think there are improvements to be made in recovery that go 
beyond this patch. I originally set out to load the recovery info into 
*just* the existing fields in pg_control but it required so many changes 
to recovery that I decided it was too dangerous to do all in one patch. 
This patch very much takes the "backup_label in pg_control" approach, 
though I reused fields where possible. The added fields, e.g. 
backupRecoveryRequested, also allow us to keep the user experience 
pretty much the same in terms of messages and errors.


Thoughts?

Regards,
-David

[1] 
https://postgresql.org/message-id/1330cb48-4e47-03ca-f2fb-b144b49514d8%40pgmasters.net
[2] 
https://postgresql.org/message-id/20221123014224.xisi44byq3cf5psi%40awork3.anarazel.de
[3] 
https://postgresql.org/message-id/eb3d1aae-1a75-bcd3-692a-38729423168f%40pgmasters.netdiff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index 8cb24d6ae54..6be8fb902c5 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -935,19 +935,20 @@ SELECT * FROM pg_backup_stop(wait_for_archive => true);
  ready to archive.
 
 
- pg_backup_stop will return one row with three
- values. The second of these fields should be written to a file named
- backup_label in the root directory of the backup. The
- third field should be written to a file named
- tablespace_map unless the field is empty. These 
files are
+ pg_backup_stop returns the
+ pg_control file, which must be stored in the
+ global directory of the backup. It also returns the
+ tablespace_map file, which should be written in the
+ root directory of the backup unless the field is empty. These files are
  vital to the backup working and must be written byte for byte without
- modification, which may require opening the file in binary mode.
+ modification, which will require opening the file in binary mode.
 


 
  Once the WAL segment files active during the backup are archived, you 

Re: RFC: Pluggable TOAST

2023-10-26 Thread Nikita Malakhov
Hi!

Matthias, thank you for your patience and explanation. I wish I had had it
much earlier; it would have saved a lot of time.
You've asked a lot of good questions, and the answers we have for some of
them do not seem very satisfactory; you also pointed out some topics that
had not been mentioned before. I will have to rethink our approach to the
TOAST enhancements accordingly.

Thanks a lot!

--
Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company
https://postgrespro.ru/


Recovering from detoast-related catcache invalidations

2023-10-26 Thread Tom Lane
In bug #18163 [1], Alexander proved the misgivings I had in [2]
about catcache detoasting being possibly unsafe:

>> BTW, while nosing around I found what seems like a very nasty related
>> bug.  Suppose that a catalog tuple being loaded into syscache contains
>> some toasted fields.  CatalogCacheCreateEntry will flatten the tuple,
>> involving fetches from toast tables that will certainly cause
>> AcceptInvalidationMessages calls.  What if one of those should have
>> invalidated this tuple?  We will not notice, because it's not in
>> the hashtable yet.  When we do add it, we will mark it not-dead,
>> meaning that the stale entry looks fine and could persist for a long
>> while.

Attached is a POC patch for fixing this.  The idea is that if we get
an invalidation while trying to detoast a catalog tuple, we should
go around and re-read the tuple a second time to get an up-to-date
version (or, possibly, discover that it no longer exists).  In the
case of SearchCatCacheList, we have to drop and reload the entire
catcache list, but fortunately not a lot of new code is needed.

The detection of "get an invalidation" could be refined: what I did
here is to check for any advance of SharedInvalidMessageCounter,
which clearly will have a significant number of false positives.
However, the only way I see to make that a lot better is to
temporarily create a placeholder catcache entry (probably a negative
one) with the same keys, and then see if it got marked dead.
This seems a little expensive, plus I'm afraid that it'd be actively
wrong in the recursive-lookup cases that the existing comment in
SearchCatCacheMiss is talking about (that is, the catcache entry
might mislead any recursive lookup that happens).

Moreover, if we did do something like that then the new code
paths would be essentially untested.  As the patch stands, the
reload path seems to get taken 10 to 20 times during a
"make installcheck-parallel" run of the core regression tests
(out of about 150 times that catcache detoasting is required).
Probably all of those are false-positive cases, but at least
they're exercising the logic.

So I'm inclined to leave it like this, but perhaps somebody
else will have a different opinion.

(BTW, there's a fair amount of existing catcache.c code that
will need to be indented another tab stop, but in the interests
of keeping the patch legible I didn't do that yet.)

Comments?

regards, tom lane

[1] 
https://www.postgresql.org/message-id/18163-859bad19a43edcf6%40postgresql.org
[2] https://www.postgresql.org/message-id/1389919.1697144487%40sss.pgh.pa.us

diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index 1aacb736c2..8255162a72 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -1319,6 +1319,7 @@ SearchCatCacheMiss(CatCache *cache,
 	SysScanDesc scandesc;
 	HeapTuple	ntp;
 	CatCTup*ct;
+	bool		stale;
 	Datum		arguments[CATCACHE_MAXKEYS];
 
 	/* Initialize local parameter array */
@@ -1327,16 +1328,6 @@ SearchCatCacheMiss(CatCache *cache,
 	arguments[2] = v3;
 	arguments[3] = v4;
 
-	/*
-	 * Ok, need to make a lookup in the relation, copy the scankey and fill
-	 * out any per-call fields.
-	 */
-	memcpy(cur_skey, cache->cc_skey, sizeof(ScanKeyData) * nkeys);
-	cur_skey[0].sk_argument = v1;
-	cur_skey[1].sk_argument = v2;
-	cur_skey[2].sk_argument = v3;
-	cur_skey[3].sk_argument = v4;
-
 	/*
 	 * Tuple was not found in cache, so we have to try to retrieve it directly
 	 * from the relation.  If found, we will add it to the cache; if not
@@ -1351,9 +1342,28 @@ SearchCatCacheMiss(CatCache *cache,
 	 * will eventually age out of the cache, so there's no functional problem.
 	 * This case is rare enough that it's not worth expending extra cycles to
 	 * detect.
+	 *
+	 * Another case, which we *must* handle, is that the tuple could become
+	 * outdated during CatalogCacheCreateEntry's attempt to detoast it (since
+	 * AcceptInvalidationMessages can run during TOAST table access).  We do
+	 * not want to return already-stale catcache entries, so we loop around
+	 * and do the table scan again if that happens.
 	 */
 	relation = table_open(cache->cc_reloid, AccessShareLock);
 
+	do
+	{
+		/*
+		 * Ok, need to make a lookup in the relation, copy the scankey and
+		 * fill out any per-call fields.  (We must re-do this when retrying,
+		 * because systable_beginscan scribbles on the scankey.)
+		 */
+		memcpy(cur_skey, cache->cc_skey, sizeof(ScanKeyData) * nkeys);
+		cur_skey[0].sk_argument = v1;
+		cur_skey[1].sk_argument = v2;
+		cur_skey[2].sk_argument = v3;
+		cur_skey[3].sk_argument = v4;
+
 	scandesc = systable_beginscan(relation,
   cache->cc_indexoid,
   IndexScanOK(cache, cur_skey),
@@ -1362,12 +1372,19 @@ SearchCatCacheMiss(CatCache *cache,
   cur_skey);
 
 	ct = NULL;
+	stale = false;
 
 	while (HeapTupleIsValid(ntp = systable_getnext(scandesc)))
 	{
 		ct = 

Re: POC, WIP: OR-clause support for indexes

2023-10-26 Thread Alena Rybakina

On 26.10.2023 22:58, Robert Haas wrote:

On Thu, Oct 26, 2023 at 3:47 PM Alena Rybakina
 wrote:

With small amounts of "OR" elements, the cost of orexpr is lower than with 
"ANY", on the contrary, higher.

Alexander's example seems to show that it's not that simple. If I'm
reading his example correctly, with things like aid = 1, the
transformation usually wins even if the number of things in the OR
expression is large, but with things like aid + 1 * bid = 1, the
transformation seems to lose at least with larger numbers of items. So
it's not JUST the number of OR elements but also what they contain,
unless I'm misunderstanding his point.
Yes, I agree: with Alexander's example this option will not help, and 
here I need to look inside the Expr itself. But I noticed that such a 
complex non-constant expression is always an OpExpr node, whereas if 
the non-constant part contains only one variable, it is a Var node. 
We could add a constraint so that we transform only expressions on 
simple variables, like x=1 or x=2 or x=3, and ignore expressions like 
x*1+y=1 or x*2+y=2.
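
A minimal sketch of that restriction (table and column names are
hypothetical):

-- simple-variable quals would be transformed:
SELECT * FROM t WHERE x = 1 OR x = 2 OR x = 3;
--   ... into:
SELECT * FROM t WHERE x = ANY (ARRAY[1, 2, 3]);
-- quals like x*1 + y = 1 OR x*2 + y = 2 would be left as-is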


Then again, this would not handle the case where the non-constant part is 
the same across all the expressions. For example, we could transform x*1+y=1 
or x*1+y=2 ... into x*1+y = ANY('{1,2,...}'). But I think that is not so 
critical, because such cases are rare.



Index Scan using pg_class_oid_index on pg_class  (cost=0.27..2859.42 rows=414 
width=68) (actual time=1.504..34.183 rows=260 loops=1)
Index Cond: (oid = ANY (ARRAY['1'::oid, '2'::oid, '3'::oid, '4'::oid, 
'5'::oid, '6'::oid, '7'::oid,

Bitmap Heap Scan on pg_class  (cost=43835.00..54202.14 rows=414 width=68) 
(actual time=39.958..41.293 rows=260 loops=1)
Recheck Cond: ((oid = '1'::oid) OR (oid = '2'::oid) OR (oid = '3'::oid) OR 
(oid = '4'::oid) OR (oid =

I think we could see which estimate is lower: if the plan with the expressions
converted to ANY is cheaper, work with it further, otherwise work with the
original "OR" expressions. But we still need to perform the conversion to find
out its cost.

To me, this sort of output suggests that perhaps the transformation is
being done in the wrong place. I expect that we have to decide whether
to convert from OR to = ANY(...) at a very early stage of the planner,
before we have any idea what the selected path will ultimately be. But
this output suggests that we want the answer to depend on which kind
of path is going to be faster, which would seem to argue for doing
this sort of transformation as part of path generation for only those
paths that will benefit from it, rather than during earlier phases of
expression processing/simplification.

I'm not sure I have the full picture here, though, so I might have
this all wrong.

This would be the ideal option, and to be honest, I also like doing the
conversion at an early stage because there are no problems with selectivity
or with updating links if we change the structure of the relation's
RestrictInfo.


But in terms of calculating whether the transformed or the original form is
better, I think this solution might be complicated, since we would need to
identify not only the cases in which the transformation wins in principle, but
also the kinds of data it works best with. There is a risk of missing some
cases, and we might need our own evaluation model. Right now it is hard for me
to come up with something simple.


The cost-based option seems simpler and clearer to me, but yes, it is
difficult to decide at which point the conversion should be done to get the
most accurate estimate.



--
Regards,
Alena Rybakina





Re: Atomic ops for unlogged LSN

2023-10-26 Thread Nathan Bossart
On Thu, Oct 26, 2023 at 03:00:58PM +, John Morris wrote:
> Keeping things up to date.  Here is a rebased patch with no changes from 
> previous one.

This patch looks a little different than the last version I see posted [0].
That last version of the patch (which appears to be just about committable)
still applies for me, too.

[0] 
https://postgr.es/m/BYAPR13MB2677ED1797C81779D17B414CA03EA%40BYAPR13MB2677.namprd13.prod.outlook.com

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: CRC32C Parallel Computation Optimization on ARM

2023-10-26 Thread Nathan Bossart
On Thu, Oct 26, 2023 at 08:53:31AM +, Xiang Gao wrote:
> On  Tue,  24 Oct,  2023 20:45:39PM -0500, Nathan Bossart wrote:
>>I tried this.  pg_waldump on 2 million ~8kB records took around 8.1 seconds
>>without the patch and around 7.4 seconds with it (an 8% improvement).
>>pg_waldump on 1 million ~16kB records took around 3.2 seconds without the
>>patch and around 2.4 seconds with it (a 25% improvement).
> 
> Could you please provide details on how to generate these 8kB size or 16kB 
> size data? Thanks!

I did something like

do $$
begin
for i in 1..100
loop
perform pg_logical_emit_message(false, 'test', 
repeat('0123456789', 800));
end loop;
end;
$$;
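
For the ~16kB case, presumably the same loop with a larger repeat count does
the trick (a sketch along the same lines, with the loop count adjusted to
taste):

do $$
begin
for i in 1..100
loop
perform pg_logical_emit_message(false, 'test',
repeat('0123456789', 1600));  -- ~16kB payload per record
end loop;
end;
$$;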

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: POC, WIP: OR-clause support for indexes

2023-10-26 Thread Robert Haas
On Thu, Oct 26, 2023 at 3:47 PM Alena Rybakina
 wrote:
> With a small number of "OR" elements, the cost of the OR expression is lower
> than with "ANY"; with a larger number, it is higher.

Alexander's example seems to show that it's not that simple. If I'm
reading his example correctly, with things like aid = 1, the
transformation usually wins even if the number of things in the OR
expression is large, but with things like aid + 1 * bid = 1, the
transformation seems to lose at least with larger numbers of items. So
it's not JUST the number of OR elements but also what they contain,
unless I'm misunderstanding his point.

> Index Scan using pg_class_oid_index on pg_class  (cost=0.27..2859.42 rows=414 
> width=68) (actual time=1.504..34.183 rows=260 loops=1)
>Index Cond: (oid = ANY (ARRAY['1'::oid, '2'::oid, '3'::oid, '4'::oid, 
> '5'::oid, '6'::oid, '7'::oid,
>
> Bitmap Heap Scan on pg_class  (cost=43835.00..54202.14 rows=414 width=68) 
> (actual time=39.958..41.293 rows=260 loops=1)
>Recheck Cond: ((oid = '1'::oid) OR (oid = '2'::oid) OR (oid = '3'::oid) OR 
> (oid = '4'::oid) OR (oid =
>
> I think we could see which estimate is lower: if the plan with the expressions
> converted to ANY is cheaper, work with it further, otherwise work with the
> original "OR" expressions. But we still need to perform the conversion to find
> out its cost.

To me, this sort of output suggests that perhaps the transformation is
being done in the wrong place. I expect that we have to decide whether
to convert from OR to = ANY(...) at a very early stage of the planner,
before we have any idea what the selected path will ultimately be. But
this output suggests that we want the answer to depend on which kind
of path is going to be faster, which would seem to argue for doing
this sort of transformation as part of path generation for only those
paths that will benefit from it, rather than during earlier phases of
expression processing/simplification.

I'm not sure I have the full picture here, though, so I might have
this all wrong.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: visibility of open cursors in pg_stat_activity

2023-10-26 Thread Robert Haas
On Thu, Oct 26, 2023 at 1:41 PM Andres Freund  wrote:
> Does it really matter on that level for the user whether a snapshot exists
> because of repeatable read or because of a cursor?  If users don't understand
> backend_xmin - likely largely true - then the consequences of holding a
> snapshot open because of repeatable read (or even just catalog snapshots!) is
> as severe as an open cursor.

Sure it matters. How is the user supposed to know what they need to go
fix? If there's application code that says BEGIN TRANSACTION
SERIALIZABLE, that's a different thing to look for than if there's
application code that fails to close a cursor somewhere.

> Given snapshots held for other reasons, I think we should expose them
> similarly, if we do something for cursors. Otherwise people might start to
> worry only about idle-txn-with-cursors and not the equally harmful
> idle-txn-with-snapshot.
>
> Maybe something roughly like
>   idle in transaction [with {snapshot|cursor|locks}]
> ?

Well every transaction is going to have a lock on its own VXID, if
nothing else. And in almost all interesting cases, more than that.

The point for me is that if you're using cursors, "idle in
transaction" is misleading in a way that it isn't if you have a
snapshot due to serializability or something. Consider two users. Each
begins a transaction, then each runs a query that returns a large
number of rows, considerably in excess of what will fit in the network
buffer. Each user then reads half of the rows and then goes into the
tank to process the data they have received thus far. User A does this
by sending the query using the simple query protocol and reading the
results one at a time using single-row mode. User B does this by
sending the query using the extended query protocol and fetches the
rows in batches by sending successive Execute messages each with a
non-zero row count. When user A goes into the tank, their session is
shown as active. When user B goes into the tank, their session is
shown as idle-in-transaction. But these situations are actually very
similar to each other. In both cases, execution is suspended because
the client is thinking.

The case of holding a snapshot because of repeatable read or
serializable isolation is, in my view, different. In that case, while
it's true that the session is holding onto resources that might cause
some problems for other things happening on the system, saying that
the session is idle in transaction is still accurate. The problems are
caused by transaction-lifespan resources. But in the case where there
are active cursors, the backend is actually in the middle of executing
a query, or maybe many of them, but at least one. Sure, at the exact
moment that we see the status as "idle in transaction", it isn't
actively trying to run any of them, but that feels like a pedantic
argument. If you put a pot of water on the stove to boil and wait for
it to heat up, are you actively cooking or are you idle? As here, I
think the answer is "something in between."

> I still would like a view that shows what's holding back the horizon on a
> system wide basis. Something like a view with the following columns and one
> row for each database

Seems like it's just the same information we already have in
pg_stat_activity, pg_prepared_xacts, and pg_replslots. Maybe
reformatting is useful but it doesn't seem groundbreaking. It would be
groundbreaking if we could surface information that's not visible now,
like the names and associated queries of cursors in sessions not our
own. But that would be much more expensive to expose.
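
For what it's worth, a rough version of that overview can already be stitched
together from those existing views (a sketch only, not a substitute for the
proposed view, and only covering the horizon holders visible today):

SELECT 'backend' AS holder, datname, pid::text AS name,
       backend_xmin AS xmin, age(backend_xmin) AS xmin_age
  FROM pg_stat_activity
 WHERE backend_xmin IS NOT NULL
UNION ALL
SELECT 'prepared xact', database, gid, transaction, age(transaction)
  FROM pg_prepared_xacts
UNION ALL
SELECT 'replication slot', database, slot_name::text,
       coalesce(xmin, catalog_xmin),
       age(coalesce(xmin, catalog_xmin))
  FROM pg_replication_slots
 WHERE xmin IS NOT NULL OR catalog_xmin IS NOT NULL
 ORDER BY xmin_age DESC;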

> I recently mused in some other thread that I really would like to have an
> approximate xid->timestamp mapping, so that we could assign an age to these in
> a unit that makes sense to humans.  Particularly snapshots / xmin can be very
> confusing in that regard because a relatively recent transaction can hold back
> the overall horizon further than the time the transaction started, if some old
> transaction was still running at the time.
>
> Perhaps we could add at least timestamps to these in some other
> way. E.g. recording a timestamp whenever a transaction is prepared, a slot is
> released... Likely recording one whenever a snapshot is acquired would be too
> expensive tho - but we could use state_change as an approximation?

I'm not saying this is a terrible idea or anything, but in my
experience the problem isn't usually that people don't understand that
old XIDs are old -- it's that they don't know where to find the XIDs
in the first place.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: POC, WIP: OR-clause support for indexes

2023-10-26 Thread Alena Rybakina

Hi! Thank you for your feedback!

On 25.10.2023 22:54, Robert Haas wrote:

On Sat, Oct 14, 2023 at 6:37 PM Alexander Korotkov  wrote:

Regarding the GUC parameter, I don't see we need a limit.  It's not
yet clear whether a small number or a large number of OR clauses are
more favorable for transformation.  I propose to have just a boolean
enable_or_transformation GUC.

That's a poor solution. So is the GUC patch currently has
(or_transform_limit). What you need is a heuristic that figures out
fairly reliably whether the transformation is going to be better or
worse. Or else, do the whole thing in some other way that is always
same-or-better.

In general, adding GUCs makes sense when the user knows something that
we can't know. For example, shared_buffers makes some sense because,
even if we discovered how much memory the machine has, we can't know
how much of it the user wants to devote to PostgreSQL as opposed to
anything else. And track_io_timing makes sense because we can't know
whether the user wants to pay the price of gathering that additional
data. But GUCs are a poor way of handling cases where the real problem
is that we don't know what code to write. In this case, some queries
will be better with enable_or_transformation=on, and some will be
better with enable_or_transformation=off. Since we don't know which
will work out better, we make the user figure it out and set the GUC,
possibly differently for each query. That's terrible. It's the query
optimizer's whole job to figure out which transformations will speed
up the query. It shouldn't turn around and punt the decision back to
the user.

Notice that superficially-similar GUCs like enable_seqscan aren't
really the same thing at all. That's just for developer testing and
debugging. Nobody expects that you have to adjust that GUC on a
production system - ever.


I noticed that the costs of the expressions differ, which could help assess
when the conversion is worth keeping and when it is not.


With a small number of "OR" elements, the cost of the OR expression is lower
than with "ANY"; with a larger number, it is higher.


postgres=# SET or_transform_limit = 500;
EXPLAIN (analyze)
SELECT oid,relname FROM pg_class
WHERE
  oid = 13779 AND (oid = 2 OR oid = 4 OR oid = 5)
;
SET
  QUERY PLAN
--
 Index Scan using pg_class_oid_index on pg_class  (*cost=0.27..8.30* 
rows=1 width=68) (actual time=0.105..0.106 rows=0 loops=1)

   Index Cond: (oid = '13779'::oid)
   Filter: ((oid = '2'::oid) OR (oid = '4'::oid) OR (oid = '5'::oid))
 Planning Time: 0.323 ms
 Execution Time: 0.160 ms

(5 rows)

postgres=# SET or_transform_limit = 0;
EXPLAIN (analyze)
SELECT oid,relname FROM pg_class
WHERE
  oid = 13779 AND (oid = 2 OR oid = 4 OR oid = 5)
;
SET
  QUERY PLAN
---
 Index Scan using pg_class_oid_index on pg_class  (*cost=0.27..16.86* 
rows=1 width=68) (actual time=0.160..0.161 rows=0 loops=1)
   Index Cond: ((oid = ANY (ARRAY['2'::oid, '4'::oid, '5'::oid])) AND 
(oid = '13779'::oid))

 Planning Time: 4.515 ms
 Execution Time: 0.313 ms
(4 rows)


Index Scan using pg_class_oid_index on pg_class  (*cost=0.27..2859.42* 
rows=414 width=68) (actual time=1.504..34.183 rows=260 loops=1)
   Index Cond: (oid = ANY (ARRAY['1'::oid, '2'::oid, '3'::oid, 
'4'::oid, '5'::oid, '6'::oid, '7'::oid,


Bitmap Heap Scan on pg_class  (*cost=43835.00..54202.14* rows=414 
width=68) (actual time=39.958..41.293 rows=260 loops=1)
   Recheck Cond: ((oid = '1'::oid) OR (oid = '2'::oid) OR (oid = 
'3'::oid) OR (oid = '4'::oid) OR (oid =


I think we could see which estimate is lower: if the plan with the expressions
converted to ANY is cheaper, work with it further, otherwise work with the
original "OR" expressions. But we still need to perform the conversion to
find out its cost.


In addition, I will definitely have to postpone the transformation of
"OR" to "ANY" until the stage of creating indexes (?), or maybe a little
earlier, so that I only have to count the cost of the transformed
expression.


Re: Partial aggregates pushdown

2023-10-26 Thread Bruce Momjian
On Thu, Oct 26, 2023 at 11:11:09AM +, 
fujii.y...@df.mitsubishielectric.co.jp wrote:
> >and checks if the remote server version is older than the local 
> > server version.
> >If so,
> >postgres_fdw
> > -->assumes that for each built-in aggregate function, the partial 
> > aggregate function is not defined
> > -->on the remote server unless the partial aggregate function and the 
> > aggregate
> > -->function match.
> >Otherwise postgres_fdw assumes that for each 
> > built-in aggregate function,
> >the partial aggregate function is defined on the remote server.
> >The default is false.
> >   
> >  
> > 
> > 
> > What does that marked sentence mean?  What is match?  Are one or both of 
> > these remote?  It sounds like you are
> > checking the local aggregate against the remote partial aggregate, but I 
> > don't see any code that does this in the patch.
> This sentence means that
> "If the partial aggregate function has the same OID as the aggregate function,
> then postgres_fdw assumes that for each built-in aggregate function, the 
> partial aggregate function is not defined
>  on the remote server."
> "Match" means that the partial aggregate function has the same OID as the 
> aggregate function in local server.
> But, in v30, there is no code which checks the partial aggregate function has 
> the same OID as the aggregate function in local server.
> So I modified the code of is_builtin_aggpartialfunc_shippable().
> Also, I modified wording postgres-fdw.sgml.

Yes, that is what I needed.  Attached is a modification of your v31
patch (the most recent) that mostly improves the documentation and
comments.  What else needs to be done before committers start to review
this?

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.


fdw_partial.diff.gz
Description: application/gzip


Re: Does UCS_BASIC have the right CTYPE?

2023-10-26 Thread Jeff Davis
On Thu, 2023-10-26 at 09:21 -0700, Jeff Davis wrote:
> Our initcap() is not defined in the standard, and we document that it
> only differentiates between alphanumeric and non-alphanumeric
> characters, so we could get that behavior pretty easily as well. If
> we
> wanted to do it the Unicode way instead, we can follow the
> toTitlecase() part of the Default Case Algorithm, which is based on
> word breaks and would require another lookup table for that.

Correction: the rules for word breaks are fairly complex, so it would
not be worth it to try to replicate that just to support initcap(). We
could just use the simple, existing, and documented rules for initcap()
which only differentiate between alphanumeric and not. Anyone who wants
the more sophisticated rules can just use an ICU collation with
initcap().

The point stands that it would be pretty simple to have a collation
that handles upper() and lower() in a standards-compliant way without
relying on libc or ICU. Unfortunately it's too late to call that
collation UCS_BASIC, but it would still be useful.
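
To make the current contrast concrete (a quick sketch; the second query
assumes an ICU-enabled build providing the en-x-icu collation):

SELECT upper('á' COLLATE ucs_basic);    -- 'á' today: ucs_basic ties ctype to C
SELECT upper('á' COLLATE "en-x-icu");   -- 'Á': full Unicode case mapping via ICU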

Regards,
Jeff Davis





Re: visibility of open cursors in pg_stat_activity

2023-10-26 Thread Andres Freund
Hi,

On 2023-10-26 11:47:32 -0400, Robert Haas wrote:
> I've seen situations a few times now where somebody has sessions that
> are "idle in transaction" for a long time but they feel like it should
> be harmless because the transaction has no XID. However, the fact that
> the transaction is idle doesn't mean it isn't running a query, because
> there could be a cursor from which some but not all results were
> fetched. That query is suspended, but still holds a snapshot and thus
> still holds back xmin. You can see this from pg_stat_activity because
> backend_xmin will be set, but I've found that this is easily missed
> and sometimes confusing even when noticed. People don't necessarily
> understand how it's possible to have a snapshot if the session is
> idle. And even if somebody has great understanding of system
> internals, pg_stat_activity doesn't distinguish between a session that
> holds a snapshot because (a) the transaction was started with
> repeatable read or serializable and it has already executed at least
> one command that acquired a snapshot or alternatively (b) the
> transaction has opened some cursors which it has not closed. (Is there
> a (c)? As far as I know, it has to be one of those two things.)

Does it really matter on that level for the user whether a snapshot exists
because of repeatable read or because of a cursor?  If users don't understand
backend_xmin - likely largely true - then the consequences of holding a
snapshot open because of repeatable read (or even just catalog snapshots!) is
as severe as an open cursor.


> So I think it would be useful to improve the pg_stat_activity output
> in some way. For instance, the output could say "idle in transaction
> (with open cursors)" or something like that.

Given snapshots held for other reasons, I think we should expose them
similarly, if we do something for cursors. Otherwise people might start to
worry only about idle-txn-with-cursors and not the equally harmful
idle-txn-with-snapshot.

Maybe something roughly like
  idle in transaction [with {snapshot|cursor|locks}]
?


> Or we could add a whole new column that specifically gives a count of how
> many cursors the session has open, or how many active cursors, or something
> like that.  I'm not exactly clear on the terminology here.

Portals are very weirdly underdocumented and surprisingly complicated :/


> It seems like the thing we internally called a portal is basically a cursor,
> except there's also an unnamed portal that gets used when you run a query
> without using a cursor.

I think you can also basically use an unnamed portal as a cursor with the
extended protocol. The only thing is that there can only be one of them.

The interesting distinction likely is whether we have cursors that are not
active.


> But I think it would be nice to do something, because the current
> situation seems like it's more confusing than it needs to be.

I think it'd be nice to make idle-in-txn a bit more informative. Not sure
though how much that helps most users, it's still quite granular information.

I still would like a view that shows what's holding back the horizon on a
system wide basis. Something like a view with the following columns and one
row for each database

  datname
  horizon
  horizon_cause = {xid, snapshot, prepared_xact, replication_connection, ...}
  xid_horizon
  xid_horizon_pid
  snapshot_horizon
  snapshot_horizon_pid
  prepared_xact_horizon
  prepared_xact_horizon_id
  replication_connection_horizon
  replication_connection_horizon_pid
  physical_slot_horizon
  physical_slot_horizon_pid
  physical_slot_horizon_name
  logical_slot_horizon
  logical_slot_horizon_pid
  logical_slot_horizon_name

Perhaps with one additional row with a NULL datname showing the system wide
horizons (one database could have the oldest xid_horizon and another the
oldest logical_slot_horizon, so it's not a simple order by).


I recently mused in some other thread that I really would like to have an
approximate xid->timestamp mapping, so that we could assign an age to these in
a unit that makes sense to humans.  Particularly snapshots / xmin can be very
confusing in that regard because a relatively recent transaction can hold back
the overall horizon further than the time the transaction started, if some old
transaction was still running at the time.


Perhaps we could add at least timestamps to these in some other
way. E.g. recording a timestamp whenever a transaction is prepared, a slot is
released... Likely recording one whenever a snapshot is acquired would be too
expensive tho - but we could use state_change as an approximation?

Greetings,

Andres Freund




Re: Introduce a new view for checkpointer related stats

2023-10-26 Thread Bharath Rupireddy
On Thu, Oct 26, 2023 at 7:30 AM Michael Paquier  wrote:
>
> I was looking at this patch, and got a few comments.

Thanks.

> The view for the bgwriter does not do that.  I'd suggest to use
> functions that are named as pg_stat_get_checkpoint_$att with shorter
> $atts.  It is true that "timed" is a bit confusing, because it refers
> to a number of checkpoints, and that can be read as a time value (?).
> So how about num_timed?  And for the others num_requested and
> buffers_written?

+1. PSA v9-0003.
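
With those names, a monitoring query against the new view would look something
like this (a sketch; the view name pg_stat_checkpointer is an assumption here,
not something settled in this thread):

SELECT num_timed,        -- checkpoints started because of checkpoint_timeout
       num_requested,    -- checkpoints explicitly requested
       buffers_written   -- buffers written by the checkpointer
  FROM pg_stat_checkpointer;  -- assumed view name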

> + * Unlike the checkpoint fields, reqquests related fields are protected by
>
> s/reqquests/requests/.

Fixed.

>  SlruSyncFileTag(SlruCtl ctl, const FileTag *ftag, char *path)
>  {
> +   SlruShared  shared = ctl->shared;
> int fd;
> int save_errno;
> int result;
>
> +   /* update the stats counter of flushes */
> +   pgstat_count_slru_flush(shared->slru_stats_idx);
>
> Why is that in 0002?  Isn't that something we should treat as a bug
> fix of its own, even backpatching it to make sure that the flush
> requests for individual commit_ts, multixact and clog files are
> counted in the stats?

+1. I included the fix in a separate patch 0002 here.

> Saying that, I am OK with moving ahead with 0001 and 0002 to remove
> buffers_backend_fsync and buffers_backend from pg_stat_bgwriter, but
> it is better to merge them into a single commit.  It helps with 0003
> and this would encourage the use of pg_stat_io that does a better job
> overall with more field granularity.

I merged v8 0001 and 0002 into one single patch, PSA v9-0001.

PSA v9 patch set.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com


v9-0001-Do-not-count-backend-writes-and-fsycns-in-checkpo.patch
Description: Binary data


v9-0002-Count-SLRU-page-flushes-during-checkpoint-or-data.patch
Description: Binary data


v9-0003-Introduce-a-new-view-for-checkpointer-related-sta.patch
Description: Binary data


Re: CRC32C Parallel Computation Optimization on ARM

2023-10-26 Thread Nathan Bossart
On Thu, Oct 26, 2023 at 07:28:35AM +, Xiang Gao wrote:
> On Wed,  25 Oct,  2023 at 10:43:25 -0500, Nathan Bossart wrote:
>>+# Use ARM VMULL if available and ARM CRC32C intrinsic is avaliable too.
>>+if test x"$USE_ARMV8_VMULL" = x"" && (test x"$USE_ARMV8_CRC32C" = x"1" || 
>>test  x"$USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK" = x"1"); then
>>+  if test x"$pgac_armv8_vmull_intrinsics" = x"yes"; then
>>+USE_ARMV8_VMULL=1
>>+  fi
>>+fi
> 
>>Hm.  I wonder if we need to switch to a runtime check in some cases.  For
>>example, what happens if the ARMv8 intrinsics used today are found with the
>>default compiler flags, but vmull_p64() is only available if
>>-march=armv8-a+crypto is added?  It looks like the precedent is to use a
>>runtime check if we need extra CFLAGS to produce code that uses the
>>intrinsics.
> 
> We consider that a runtime check needs to be done in any scenario.
> Here we only confirm that the compilation can be successful.
> A runtime check will be done when choosing which algorithm.
> You can think of us as merging USE_ARMV8_VMULL and 
> USE_ARMV8_VMULL_WITH_RUNTIME_CHECK into USE_ARMV8_VMULL.

Oh.  Looking again, I see that we are using a runtime check for ARM in all
cases with this patch.  If so, maybe we should just remove
USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK in a prerequisite patch (and have
USE_ARMV8_CRC32C always do the runtime check).  I suspect there are other
opportunities to simplify things, too.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Does UCS_BASIC have the right CTYPE?

2023-10-26 Thread Jeff Davis
On Thu, 2023-10-26 at 16:49 +0200, Peter Eisentraut wrote:
> On 25.10.23 20:32, Jeff Davis wrote:
> > But what should the result of UPPER('á' COLLATE UCS_BASIC) be? In
> > Postgres, the answer is 'á', but intuitively, one could reasonably
> > expect the answer to be 'Á'.
> 
> I think that's right.  But what would you put into ctype to make that
> happen?

It looks like using Unicode files to implement
upper()/lower()/initcap() behavior would not be very hard. The Default
Case Algorithm only needs a simple mapping for toUppercase() and
toLowercase().

Our initcap() is not defined in the standard, and we document that it
only differentiates between alphanumeric and non-alphanumeric
characters, so we could get that behavior pretty easily as well. If we
wanted to do it the Unicode way instead, we can follow the
toTitlecase() part of the Default Case Algorithm, which is based on
word breaks and would require another lookup table for that.

I've already posted a patch that includes a lookup table for the
General Category, so that could be used for the rest of the ctype
behavior.

Doing ctype based on built-in Unicode tables would be a good use case
for the "builtin" provider that I had previously proposed.

> The standard doesn't have the notion of locale-dependent case
> conversion.

That explains it. Interesting.

Regards,
Jeff Davis





visibility of open cursors in pg_stat_activity

2023-10-26 Thread Robert Haas
Hi,

I've seen situations a few times now where somebody has sessions that
are "idle in transaction" for a long time but they feel like it should
be harmless because the transaction has no XID. However, the fact that
the transaction is idle doesn't mean it isn't running a query, because
there could be a cursor from which some but not all results were
fetched. That query is suspended, but still holds a snapshot and thus
still holds back xmin. You can see this from pg_stat_activity because
backend_xmin will be set, but I've found that this is easily missed
and sometimes confusing even when noticed. People don't necessarily
understand how it's possible to have a snapshot if the session is
idle. And even if somebody has great understanding of system
internals, pg_stat_activity doesn't distinguish between a session that
holds a snapshot because (a) the transaction was started with
repeatable read or serializable and it has already executed at least
one command that acquired a snapshot or alternatively (b) the
transaction has opened some cursors which it has not closed. (Is there
a (c)? As far as I know, it has to be one of those two things.)
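
Today, about the best a user can do is a query along these lines (a rough
sketch), and even then the output doesn't say why the snapshot is held:

SELECT pid, datname, usename, state,
       backend_xmin, age(backend_xmin) AS xmin_age,
       now() - state_change AS idle_for
  FROM pg_stat_activity
 WHERE state = 'idle in transaction'
   AND backend_xmin IS NOT NULL
 ORDER BY age(backend_xmin) DESC;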

So I think it would be useful to improve the pg_stat_activity output
in some way. For instance, the output could say "idle in transaction
(with open cursors)" or something like that. Or we could add a whole
new column that specifically gives a count of how many cursors the
session has open, or how many active cursors, or something like that.
I'm not exactly clear on the terminology here. It seems like the thing
we internally called a portal is basically a cursor, except there's
also an unnamed portal that gets used when you run a query without
using a cursor. And I think the cursors that could potentially hold
snapshots are the ones that are labelled PORTAL_READY. I think we
can't have a PORTAL_ACTIVE portal if we're idle, and that
PORTAL_{NEW,DEFINED,DONE,FAILED} portals are not capable of holding
any resources and thus not relevant. But I'm not 100% positive on
that, and I'm not exactly sure what terminology the user facing
reporting should use.

But I think it would be nice to do something, because the current
situation seems like it's more confusing than it needs to be.

Thoughts?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: [PoC] pg_upgrade: allow to upgrade publisher node

2023-10-26 Thread Bharath Rupireddy
On Thu, Oct 26, 2023 at 8:11 PM Zhijie Hou (Fujitsu)
 wrote:
>
> The BF animal fairywren[1] failed when testing
> 003_upgrade_logical_replication_slots.pl.
>
> From the log, I can see pg_upgrade failed to open the
> invalid_logical_replication_slots.txt:
>
> # Checking for valid logical replication slots
> # could not open file 
> "C:/tools/nmsys64/home/pgrunner/bf/root/HEAD/pgsql.build/testrun/pg_upgrade/003_upgrade_logical_replication_slots/data/t_003_upgrade_logical_replication_slots_new_publisher_data/pgdata/pg_upgrade_output.d/20231026T112558.309/invalid_logical_replication_slots.txt":
>  No such file or directory
> # Failure, exiting
>
> The reason could be the length of this path(262) exceed the windows path
> limit(260 IIRC). If so, I recall we fixed similar things before (e213de8e7) by
> reducing the path somehow.

Nice catch. Windows docs say that the file/directory path name can't
exceed MAX_PATH, which is defined as 260 characters. However, one must
opt-in to enable longer path names -
https://learn.microsoft.com/en-us/windows/win32/fileio/maximum-file-path-limitation?tabs=registry
and 
https://learn.microsoft.com/en-us/windows/win32/fileio/maximum-file-path-limitation?tabs=registry#enable-long-paths-in-windows-10-version-1607-and-later.

> In this case, I think one approach is to reduce the file and testname to
> xxx_logical_slots instead of xxx_logical_replication_slots. But we will 
> analyze more
> and share fix soon.
>
> [1] 
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren=2023-10-26%2009%3A04%3A54

+1 for s/003_upgrade_logical_replication_slots.pl/003_upgrade_logical_slots.pl
and s/invalid_logical_replication_slots.txt/invalid_logical_slots.txt.
In fact, we've used "logical slots" instead of "logical replication
slots" in the docs to be generic. By looking at the generated
directory path name, I think we can use shorter node names - instead
of old_publisher, new_publisher, subscriber - either use node1 (for
old publisher), node2 (for subscriber), node3 (for new publisher) or
use alpha (for old publisher), bravo (for subscriber), charlie (for
new publisher) or such shorter names. We don't have to be that
descriptive and long in node names, one can look at the test file to
know which one is what.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Atomic ops for unlogged LSN

2023-10-26 Thread John Morris
Keeping things up to date.  Here is a rebased patch with no changes from 
previous one.


  *   John Morris


atomic-lsn.patch
Description: atomic-lsn.patch


Re: Does UCS_BASIC have the right CTYPE?

2023-10-26 Thread Peter Eisentraut

On 25.10.23 20:32, Jeff Davis wrote:

But what should the result of UPPER('á' COLLATE UCS_BASIC) be? In
Postgres, the answer is 'á', but intuitively, one could reasonably
expect the answer to be 'Á'.


I think that's right.  But what would you put into ctype to make that 
happen?



That seems to suggest the standard answer should be 'Á' regardless of
any COLLATE clause (though I could be misreading). I'm a bit confused
by that... what's the standard-compatible way to specify the locale for
UPPER()/LOWER()? If there is none, then it makes sense that Postgres
overloads the COLLATE clause for that purpose so that users can use a
different locale if they want.


The standard doesn't have the notion of locale-dependent case conversion.





RE: [PoC] pg_upgrade: allow to upgrade publisher node

2023-10-26 Thread Zhijie Hou (Fujitsu)

Hi,

The BF animal fairywren[1] failed when testing
003_upgrade_logical_replication_slots.pl.

From the log, I can see pg_upgrade failed to open the
invalid_logical_replication_slots.txt:

# Checking for valid logical replication slots  
# could not open file 
"C:/tools/nmsys64/home/pgrunner/bf/root/HEAD/pgsql.build/testrun/pg_upgrade/003_upgrade_logical_replication_slots/data/t_003_upgrade_logical_replication_slots_new_publisher_data/pgdata/pg_upgrade_output.d/20231026T112558.309/invalid_logical_replication_slots.txt":
 No such file or directory
# Failure, exiting

The reason could be that the length of this path (262) exceeds the Windows path
limit (260 IIRC). If so, I recall we fixed similar things before (e213de8e7) by
reducing the path somehow.

In this case, I think one approach is to shorten the file name and test name to
xxx_logical_slots instead of xxx_logical_replication_slots. But we will analyze
more and share a fix soon.

[1] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren=2023-10-26%2009%3A04%3A54

Best Regards,
Hou zj


Re: Guiding principle for dropping LLVM versions?

2023-10-26 Thread Devrim Gündüz
Hi,

On Thu, 2023-10-19 at 08:13 +1300, Thomas Munro wrote:
> If we used Debian as a yardstick, PostgreSQL 17 wouldn't need anything
> older than LLVM 14 AFAICS.  Who else do we need to ask? 

LLVM 15 is the minimum one for the platforms that I build the packages
on. So LLVM >= 14 is great for HEAD.

Regards,
-- 
Devrim Gündüz
Open Source Solution Architect, PostgreSQL Major Contributor
Twitter: @DevrimGunduz , @DevrimGunduzTR




Re: RFC: Pluggable TOAST

2023-10-26 Thread Matthias van de Meent
On Thu, 26 Oct 2023 at 15:18, Aleksander Alekseev
 wrote:
>
> Hi,
>
> > And the goal of *THIS* topic is to gather a picture on how the community 
> > sees
> > improvements in TOAST mechanics if it doesn't want it the way we proposed
> > before, to understand which way to go with JSON advanced storage and other
> > enhancements we already have. Previous topic was not of any help here.
>
> Publish your code under an appropriate license first so that 1. anyone
> can test/benchmark it and 2. merge it to the PostgreSQL core if
> necessary.
>
> Or better consider participating in the [1] discussion where we
> reached a consensus on RFC and are working on improving TOAST for JSON
> and other types. We try to be mindful of use cases you named before
> like 64-bit TOAST pointers but we still could use your input.

I feel that the no. 2 proposal is significantly different from the
discussion over at [1] in that it concerns changes in the interface
between types and toast, as opposed to the no. 1 proposal's (and [1]'s)
changes that stay mostly inside the current TOAST APIs and abstractions.

The "Compression dictionaries for JSONB" thread that you linked went
the way of "store and use compression dictionaries for TOAST
compression algorithms", which is at a lower level than one of the
other ideas, which was to "allow JSONB to use a dictionary of common
values to dictionary-encode some of the contained entries". Naive
compression of the Datum's bytes makes the compressed datum
unparseable without decompression, even when dictionaries are used to
decrease the compressed size, while a type's own compression
dictionary substitutions could allow it to maintain its structure and
would thus allow for a lower memory and storage footprint of the
column's datums during query processing.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




Re: Is this a problem in GenericXLogFinish()?

2023-10-26 Thread Robert Haas
On Thu, Oct 26, 2023 at 3:31 AM Michael Paquier  wrote:
> Hmm.  Looking at hash_xlog_squeeze_page(), it looks like wbuf is
> expected to always be registered.  So, you're right here: it should be
> OK to be less aggressive with setting the page LSN on wbuf if ntups is
> 0, but there's more to it?  For example, it is possible that
> bucketbuf, prevbuf and wbuf refer to the same buffer, causing
> is_prim_bucket_same_wrt and is_prev_bucket_same_wrt to be both true.
> Particularly, if nextbuf and wbuf are the same buffers, we finish by
> registering twice the same buffer with two different IDs to perform
> the tuple insertion and the opaque area updates in two steps.  And
> that's..  Err..  not really necessary?  I mean, as far as I read this
> code you could just reuse the buffer registered at ID 1 and update its
> opaque area if is_prim_bucket_same_wrt is true?

Read the comments for _hash_squeezebucket. Particularly, note this part:

 *  Try to squeeze the tuples onto pages occurring earlier in the
 *  bucket chain in an attempt to free overflow pages. When we start
 *  the "squeezing", the page from which we start taking tuples (the
 *  "read" page) is the last bucket in the bucket chain and the page
 *  onto which we start squeezing tuples (the "write" page) is the
 *  first page in the bucket chain.  The read page works backward and
 *  the write page works forward; the procedure terminates when the
 *  read page and write page are the same page.

Because of this, it is possible for bucketbuf, prevbuf, and wbuf to be
the same (your first scenario) but the second scenario you mention
(nextbuf  == wbuf) should be impossible.

It seems to me that maybe we shouldn't even be registering wbuf or
doing anything at all to it if there are no tuples that need moving.
That would also require some adjustment of the redo routine.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: RFC: Pluggable TOAST

2023-10-26 Thread Matthias van de Meent
On Tue, 24 Oct 2023 at 22:38, Nikita Malakhov  wrote:
>
> Hi hackers!
>
> We need community feedback on previously discussed topic [1].
> There are some long-live issues in Postgres related to the TOAST mechanics, 
> like [2].
> Some time ago we already proposed a set of patches with an API allowing to 
> plug in
> different TOAST implementations into a live database. The patch set 
> introduced a lot
> of code and was quite crude in some places, so after several implementations 
> we decided
> to try to implement it in the production environment for further check-up.
>
> The main idea behind pluggable TOAST is to make it possible to easily plug in
> and use different implementations of large-value storage, preserving the
> existing mechanics to keep backward compatibility, and to provide an easy,
> Postgres-way method of giving users alternative mechanics for storing large
> column values more effectively - we already have custom and very effective
> (up to tens and even hundreds of times faster) TOAST implementations for
> bytea and JSONb data types.
>
> As we see it - Pluggable TOAST proposes
> 1) changes in TOAST pointer itself, extending it to store custom data - 
> current limitations
> of TOAST pointer were discussed in [1] and [4];
> 2) API which allows calls of custom TOAST implementations for certain table 
> columns and
> (a topic for discussion) certain datatypes.
>
> Custom TOAST could also be used in less trivial ways - for example, limited
> columnar storage could be easily implemented and plugged in without the heavy
> core modifications needed to implement Pluggable Storage (Table Access
> Methods), preserving the existing data and database structure, and still
> allowing upgrades, replication and so on.
>
> Any thoughts and proposals are welcome.

TLDR of my thoughts below:
1. I don't see much value in the "Pluggable TOAST" as proposed in [0],
where toasters are both decoupled from the type but also strongly
bound to the type with tagged vtables.
2. I do think we should allow *types* to provide their own toast
slicing implementation (not just "one blob, compressed then sliced"),
so that structured types don't have to read MBs of data to access only
a few of the structure's bytes. As this would be a different way of
storing the data, that would likely use a different tag for the
varatt_1b_e struct to differentiate the two stored formats.
3. I do think that attributes shouldn't be required to be stored
either on disk or in a single palloc-ed area of memory. It is very
expensive to copy such large chunks of memory; jsonb is one such
example. If the type is composite, allow it to be allocated in
multiple regions. This would require a new varatt_1b_e tag to discern
that the Datum isn't necessarily located in a single memory context,
but with good memory context management that should be fine.
4. I do think that TOAST needs improvements to allow differential
updates, not just full rewrites of the value. I believe this would
likely be enabled through solutions for (2) and (3), even if it might
already be possible without implementing new vartag options.

My thoughts:

In my view, the main job of TOAST is:
- To make sure a row with large attributes can still fit on a page by
reducing the size of the representation of attributes in the row
- To allow us to efficiently handle variable-length attribute values
- To reduce the overhead of moving large values through query execution

This is currently implemented through tagged values that contain
exactly one canonical representation of the type (be it inline, inline
compressed, or out of line with or without compression).

Our current implementation assumes that users of the attribute will
always use either the decompressed canonical representation, or don't
care about the representation at all (except decompression of only
prefixes, which is a special case), but this is clearly not the case:
Composite values like ROW types clearly benefit from careful
partitioning and subdivision of values into self-contained compressed
chunks: We don't TOAST a table's rows, but do TOAST per attribute.
JSONB could also benefit if it could create its own on-disk format of
a value: benchmarks of the "Pluggable Toaster" patch have shown that
JSONB operation performance improved significantly with custom toaster
infrastructure.

So, if composite types (like JSONB, ROW and ARRAY) would be able to
manually slice their values and create their own representation of
that toasted value, then that would probably benefit the system by
allowing some data to be stored in a more accessible manner than
"everything inline, compressed, or out-of-line, detoast (a prefix of)
all data, or none of it, no partial detoasting".


Now, returning to the table-level TOAST task of making sure the
tuple's data fits on the page, compressing & out-of-line-ing the data
until it fits:

Things that it currently does: varlena values are compressed and
out-of-lined with generic compression algorithms and a naive
slice-and-dice 

Re: trying again to get incremental backup

2023-10-26 Thread Robert Haas
On Thu, Oct 26, 2023 at 6:59 AM Andrew Dunstan  wrote:
> Because we won't be removing the RD parser.

Ah, OK.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: remaining sql/json patches

2023-10-26 Thread Nikita Malakhov
Hi,

Agreed on the latter; that must not be part of it, for sure.
I will think about how to make this part correct.

-- 
Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company
https://postgrespro.ru/


Re: RFC: Pluggable TOAST

2023-10-26 Thread Aleksander Alekseev
Hi,

> And the goal of *THIS* topic is to gather a picture on how the community sees
> improvements in TOAST mechanics if it doesn't want it the way we proposed
> before, to understand which way to go with JSON advanced storage and other
> enhancements we already have. Previous topic was not of any help here.

Publish your code under an appropriate license first so that 1. anyone
can test/benchmark it and 2. merge it to the PostgreSQL core if
necessary.

Or better consider participating in the [1] discussion where we
reached a consensus on RFC and are working on improving TOAST for JSON
and other types. We try to be mindful of use cases you named before
like 64-bit TOAST pointers but we still could use your input.

You know all this.

[1]: 
https://www.postgresql.org/message-id/flat/CAJ7c6TOtAB0z1UrksvGTStNE-herK-43bj22%3D5xVBg7S4vr5rQ%40mail.gmail.com
-- 
Best regards,
Aleksander Alekseev




Re: RFC: Pluggable TOAST

2023-10-26 Thread Nikita Malakhov
Hi,

I meant discussion preceding the patch set - there was none.

And the goal of *THIS* topic is to gather a picture of how the community sees
improvements in TOAST mechanics, if it doesn't want them the way we proposed
before, so we understand which way to go with JSON advanced storage and other
enhancements we already have. The previous topic was of no help here.

--
Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company
https://postgrespro.ru/


Re: remaining sql/json patches

2023-10-26 Thread Amit Langote
Hi,

On Thu, Oct 26, 2023 at 9:20 PM Nikita Malakhov  wrote:
>
> Hi,
>
> The main goal was to correctly process invalid queries (as in examples above).
> I'm not sure this could be done in type input functions. I thought that some
> coercions could be checked before evaluating expressions for saving reasons.

I assume by "invalid" you mean queries specifying types in RETURNING
that don't support soft-error handling in their input function.
Adding a check makes sense but its implementation should include a
type cache interface to check whether a given type has error-safe
input handling, possibly as a separate patch.  IOW, the SQL/JSON patch
shouldn't really make a list of types to report as unsupported.

-- 
Thanks, Amit Langote
EDB: http://www.enterprisedb.com




Re: Synchronizing slots from primary to standby

2023-10-26 Thread Amit Kapila
On Thu, Oct 26, 2023 at 12:38 PM Hayato Kuroda (Fujitsu)
 wrote:
>
> > PFA v25 patch set. The changes are:
>
> Thanks for making the patch! It seems that there are lots of comments, so
> I can put some high-level comments for 0001.
> Sorry if there are duplicated comments.
>
> 1.
> The patch does not seem to consider the case where the failover option of the
> replication slot and that of the subscription differ. Currently the slot's
> option will be overwritten by the subscription's one.
>
> Actually, I'm not sure which specification is better. Regarding two_phase,
> 2PC will be decoded only when both settings are true. Should we follow that?
>
> 2.
> Currently ctx->failover is set only in pgoutput_startup(), but I am not sure
> that is OK.
> Can we change the parameter in CreateDecodingContext() or similar functions?
>
> Because IIUC it means that only slots which use pgoutput can wait. Other
> output plugins must understand the change and set the failover flag as well -
> I feel that is not good. E.g., you might forget to enable the parameter in
> test_decoding.
>
> Regarding the two_phase parameter, setting it at the plugin layer is good
> because it directly affects the output. As for failover, it is not related to
> the content, so all slots should be able to enable it.
>

Both of your points seem valid to me. However, I think they should be
addressed once we make option 'failover' behave similar to the '2PC'
option as per discussion [1].

[1] - 
https://www.postgresql.org/message-id/b099ebc2-68fd-4c08-87ce-65fc4cb24121%40gmail.com

-- 
With Regards,
Amit Kapila.




Re: remaining sql/json patches

2023-10-26 Thread Nikita Malakhov
Hi,

The main goal was to correctly process invalid queries (as in the examples
above). I'm not sure this could be done in type input functions. I thought
that some coercions could be checked before evaluating expressions, for
cost-saving reasons.

--
Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company
https://postgrespro.ru/


Re: Synchronizing slots from primary to standby

2023-10-26 Thread Amit Kapila
On Thu, Oct 26, 2023 at 5:38 PM Drouvot, Bertrand
 wrote:
>
> On 10/26/23 10:40 AM, Amit Kapila wrote:
> > On Wed, Oct 25, 2023 at 8:49 PM Drouvot, Bertrand
> >  wrote:
> >>
> >
> > Good point, I think we should enhance the WalSndWait() logic to
> > address this case.
>
> Agree. I think it would need to take care of the new CV and probably
> provide a way for the caller to detect it stopped waiting due to the socket
> (I don't think it can find out currently).
>
> > Additionally, I think we should ensure that
> > WalSndWaitForWal() shouldn't wait twice once for wal_flush and a
> > second time for wal to be replayed by physical standby. It should be
> > okay to just wait for Wal to be replayed by physical standby when
> > applicable, otherwise, just wait for Wal to flush as we are doing now.
> > Does that make sense?
>
> Yeah, I think so. What about moving WalSndWaitForStandbyConfirmation()
> outside of WalSndWaitForWal() and call one or the other in 
> logical_read_xlog_page()?
>

I think we need to somehow integrate the logic of both functions. Let
us see what the patch author has to say about this.

-- 
With Regards,
Amit Kapila.




Re: Synchronizing slots from primary to standby

2023-10-26 Thread Drouvot, Bertrand

Hi,

On 10/26/23 10:40 AM, Amit Kapila wrote:

On Wed, Oct 25, 2023 at 8:49 PM Drouvot, Bertrand
 wrote:




Good point, I think we should enhance the WalSndWait() logic to
address this case. 


Agree. I think it would need to take care of the new CV and probably
provide a way for the caller to detect it stopped waiting due to the socket
(I don't think it can find out currently).


Additionally, I think we should ensure that
WalSndWaitForWal() shouldn't wait twice once for wal_flush and a
second time for wal to be replayed by physical standby. It should be
okay to just wait for Wal to be replayed by physical standby when
applicable, otherwise, just wait for Wal to flush as we are doing now.
Does that make sense?


Yeah, I think so. What about moving WalSndWaitForStandbyConfirmation()
outside of WalSndWaitForWal() and call one or the other in 
logical_read_xlog_page()?

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: RFC: Pluggable TOAST

2023-10-26 Thread Aleksander Alekseev
Hi,

> Aleksander, previous discussion was not a discussion actually, we proposed
> a set of big and complex core changes without any discussion preceding it.
> That was not very good approach although the overall idea behind the patch
> set is very progressive and is ready to solve some old and painful issues in 
> Postgres.

Not true.

There *was* a discussion and you are aware of all the problems that
were pointed out. Most importantly [1][2]. Also you followed the
thread [3] and are well aware that we want to implement TOAST
improvements in PostgreSQL core.

Despite all this you are still insisting on the extendable design as
if starting a new thread every year or so will change something.

[1]: 
https://www.postgresql.org/message-id/20230205223313.4dwhlddzg6uhaztg%40alap3.anarazel.de
[2]: 
https://www.postgresql.org/message-id/CAJ7c6TOsHtGkup8AVnLTGGt-%2B7EzE2j-cFGr12U37pzGEsU6Fg%40mail.gmail.com
[3]: 
https://www.postgresql.org/message-id/flat/CAJ7c6TOtAB0z1UrksvGTStNE-herK-43bj22%3D5xVBg7S4vr5rQ%40mail.gmail.com
-- 
Best regards,
Aleksander Alekseev




Re: speed up a logical replica setup

2023-10-26 Thread Ashutosh Bapat
On Mon, Oct 23, 2023 at 9:34 AM Euler Taveira  wrote:

>
> It is still a WIP but I would like to share it and get some feedback.
>
>

I have started reviewing the patch. I have just read through all the
code. It's well documented and clear. Next I will review the design in
detail. Here are a couple of minor comments
1.
+tests += {
+ 'name': 'pg_subscriber',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'tap': {
+ 'tests': [
+ 't/001_basic.pl',

COMMENT
Shouldn't we include 002_standby.pl?

2. CreateReplicationSlotLSN is not used anywhere. Instead I see
create_logical_replication_slot() in pg_subscriber.c. Which of these
two do you intend to use in the end?

-- 
Best Wishes,
Ashutosh Bapat




Re: RFC: Pluggable TOAST

2023-10-26 Thread Nikita Malakhov
Hi,

Aleksander, the previous discussion was not really a discussion; we proposed
a set of big and complex core changes without any discussion preceding them.
That was not a very good approach, although the overall idea behind the patch
set is very progressive and is ready to solve some old and painful issues
in Postgres.

Also, the introduction of SQL/JSON will further boost the usage of JSON in
databases, so our improvements in JSON storage and performance would be very
useful. These improvements depend on Pluggable TOAST: without an API that
allows easily plugging in different TOAST implementations, they require heavy
core modifications and are very unlikely to be accepted. Not to mention that
such changes require upgrades, restarts and so on.

Pluggable TOAST allows using advanced storage techniques on top of the default
Postgres database engine, instead of implementing the complex Pluggable
Storage API, and allows plugging these advanced techniques in on the fly -
without even restarting the server, which is crucial for production systems.

The discussion on extending the TOAST pointer showed some interest in this
topic, so I hope this feature will draw some attention given how widely
large JSON objects are used.

--
Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company
https://postgrespro.ru/


Re: Add trailing commas to enum definitions

2023-10-26 Thread Peter Eisentraut

On 23.10.23 22:34, Nathan Bossart wrote:

On Mon, Oct 23, 2023 at 05:55:32PM +0800, Junwang Zhao wrote:

On Mon, Oct 23, 2023 at 2:37 PM Peter Eisentraut  wrote:

Since C99, there can be a trailing comma after the last value in an enum


C99 allows us to do this doesn't mean we must do this, this is not
inconsistent IMHO, and this will pollute the git log messages, people
may *git blame* the file and see the reason for the introduction of the
line.


I suspect that your concerns about git-blame could be resolved by adding
this commit to .git-blame-ignore-revs.  From a long-term perspective, I
think standardizing on the trailing comma style will actually improve
git-blame because patches won't need to add a comma to the previous line
when adding a value.


Committed that way.





Re: trying again to get incremental backup

2023-10-26 Thread Andrew Dunstan



On 2023-10-25 We 15:19, Robert Haas wrote:

On Wed, Oct 25, 2023 at 3:17 PM Andrew Dunstan  wrote:

OK, I'll go with that. It will actually be a bit less invasive than the
patch I posted.

Why's that?



Because we won't be removing the RD parser.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: [PATCH] Tracking statements entry timestamp in pg_stat_statements

2023-10-26 Thread Andrei Zubkov
On Thu, 2023-10-26 at 15:49 +0700, Andrei Lepikhov wrote:
> It is the gist of my question. If needed, You can remove the record
> by 
> (userid, dbOid, queryId). As I understand, this extension is usually 
> used by an administrator. Who can reset these parameters except you
> and 
> the DBMS?
This extension is used by the administrator, but indirectly, through some
kind of sampling solution that provides information about how statistics
change over time. The only kind of statistics unavailable to sampling
solutions without a periodic reset is the min/max statistics. This patch
provides a way to reset those statistics without evicting the entry.
Suppose the DBA uses several sampling solutions. Every such solution can
perform its own resets of the min/max statistics, so other sampling
solutions need a way to detect such resets to avoid undetected
interference. Timestamping the min/max reset can be used for that purpose.

-- 
regards, Andrei Zubkov
Postgres Professional




Re: race condition in pg_class

2023-10-26 Thread Smolkin Grigory
> ALTER TABLE ADD CONSTRAINT would certainly have taken
> AccessExclusiveLock on the "example" table, which should be sufficient
> to prevent anything else from touching its pg_class row.  The only
> mechanism I can think of that might bypass that is a manual UPDATE on
> pg_class, which would just manipulate the row as a row without concern
> for associated relation-level locks.  Any chance that somebody was
> doing something like that?

No chance. Our infrastructure doesn't do that, and users simply don't have the
privileges to mess with pg_catalog.
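
A quick way to look for relations showing this symptom is a consistency check
along these lines (a sketch):

SELECT c.oid::regclass AS rel,
       c.relhasindex,
       count(i.indexrelid) AS indexes
  FROM pg_class c
  LEFT JOIN pg_index i ON i.indrelid = c.oid
 WHERE c.relkind = 'r'
 GROUP BY c.oid, c.relhasindex
HAVING NOT c.relhasindex AND count(i.indexrelid) > 0;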

Wed, Oct 25, 2023 at 21:06, Tom Lane :

> Smolkin Grigory  writes:
> > We are running PG13.10 and recently we have encountered what appears to
> be
> > a bug due to some race condition between ALTER TABLE ... ADD CONSTRAINT
> and
> > some other catalog-writer, possibly ANALYZE.
> > The problem is that after successfully creating index on relation (which
> > previosly didnt have any indexes), its pg_class.relhasindex remains set
> to
> > "false", which is illegal, I think.
> > Index was built using the following statement:
> > ALTER TABLE "example" ADD constraint "example_pkey" PRIMARY KEY (id);
>
> ALTER TABLE ADD CONSTRAINT would certainly have taken
> AccessExclusiveLock on the "example" table, which should be sufficient
> to prevent anything else from touching its pg_class row.  The only
> mechanism I can think of that might bypass that is a manual UPDATE on
> pg_class, which would just manipulate the row as a row without concern
> for associated relation-level locks.  Any chance that somebody was
> doing something like that?
>
> regards, tom lane
>


Re: CRC32C Parallel Computation Optimization on ARM

2023-10-26 Thread Bharath Rupireddy
On Thu, Oct 26, 2023 at 2:23 PM Xiang Gao  wrote:
>
> On  Tue,  24 Oct,  2023 20:45:39PM -0500, Nathan Bossart wrote:
> >I tried this.  pg_waldump on 2 million ~8kB records took around 8.1 seconds
> >without the patch and around 7.4 seconds with it (an 8% improvement).
> >pg_waldump on 1 million ~16kB records took around 3.2 seconds without the
> >patch and around 2.4 seconds with it (a 25% improvement).
>
> Could you please provide details on how to generate these 8kB size or 16kB 
> size data? Thanks!

Here's a script that I use to generate WAL records of various sizes,
change it to taste if useful:

for m in 16 64 256 1024 4096 8192 16384
do
echo "Start of run with WAL size $m bytes at:"
date
echo "SELECT pg_logical_emit_message(true, 'mymessage',
repeat('d', $m));" >> $JUMBO/scripts/dumbo$m.sql
for c in 1 2 4 8 16 32 64 128 256 512 768 1024 2048 4096
do
  $PGWORKSPACE/pgbench -n postgres -c$c -j$c -T60 \
    -f $JUMBO/scripts/dumbo$m.sql > $JUMBO/results/dumbo$m:$c.out
done
echo "End of run with WAL size $m bytes at:"
date
echo "\n"
done

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




RE: CRC32C Parallel Computation Optimization on ARM

2023-10-26 Thread Xiang Gao
On  Tue,  24 Oct,  2023 20:45:39PM -0500, Nathan Bossart wrote:
>I tried this.  pg_waldump on 2 million ~8kB records took around 8.1 seconds
>without the patch and around 7.4 seconds with it (an 8% improvement).
>pg_waldump on 1 million ~16kB records took around 3.2 seconds without the
>patch and around 2.4 seconds with it (a 25% improvement).

Could you please provide details on how to generate these 8kB size or 16kB size 
data? Thanks!





