Re: Avoid orphaned objects dependencies, take 3

2024-07-01 Thread Bertrand Drouvot
Hi,

On Mon, Jul 01, 2024 at 09:39:17AM +, Bertrand Drouvot wrote:
> Hi,
> 
> On Wed, Jun 26, 2024 at 10:24:41AM +, Bertrand Drouvot wrote:
> > Hi,
> > 
> > On Fri, Jun 21, 2024 at 01:22:43PM +, Bertrand Drouvot wrote:
> > > Another thought for the RelationRelationId class case: we could check if
> > > there is a lock first and if there is no lock then acquire one. That way
> > > that would ensure the relation is always locked (so no "risk" anymore),
> > > but OTOH it may add "unnecessary" locking (see 2. mentioned previously).
> > 
> > Please find attached v12 implementing this idea for the RelationRelationId
> > class case. As mentioned, it may add unnecessary locking for 2. but I think
> > that's worth it to ensure that we are always on the safe side of things.
> > This idea is implemented in LockNotPinnedObjectById().
> 
> Please find attached v13, mandatory rebase due to 0cecc908e97. In passing,
> make use of CheckRelationOidLockedByMe() added in 0cecc908e97.

Please find attached v14, mandatory rebase due to 65b71dec2d5.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
>From 31767e9eb290909943d03408d9f8e827cf17e0bd Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot 
Date: Fri, 29 Mar 2024 15:43:26 +
Subject: [PATCH v14] Avoid orphaned objects dependencies

It's currently possible to create orphaned objects dependencies, for example:

Scenario 1:

session 1: begin; drop schema schem;
session 2: create a function in the schema schem
session 1: commit;

With the above, the function created in session 2 would be linked to a
non-existing schema.
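
For reference, scenario 1 maps to a sequence like this (a sketch; the schema
and function names are illustrative):

-- setup
CREATE SCHEMA schem;

-- session 1
BEGIN;
DROP SCHEMA schem;

-- session 2 (does not block on an unpatched server)
CREATE FUNCTION schem.f() RETURNS int LANGUAGE sql AS 'SELECT 1';

-- session 1
COMMIT;
-- schem.f is now linked to a dropped schema; with the patch, session 2 blocks
-- on the new lock and reports an error once session 1 commits.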

Scenario 2:

session 1: begin; create a function in the schema schem
session 2: drop schema schem;
session 1: commit;

With the above, the function created in session 1 would be linked to a
non-existing schema.

To avoid those scenarios, a new lock (that conflicts with a lock taken by DROP)
is taken before the dependencies are recorded. With this in place, the DROP
SCHEMA in scenario 2 would block.

Also, after the new lock attempt, the patch checks that the object still exists:
with this in place, session 2 in scenario 1 would block and would report an
error once session 1 commits (that would not be the case should session 1 abort
the transaction).

If the object is dropped before the new lock attempt is triggered, then the patch
would also report an error (but with fewer details).

The patch takes into account any type of object except the ones that are pinned
(they are not droppable because the system requires them).

A special case is made for objects that belong to the RelationRelationId class.
For those, we should be in one of the two following cases, which would already
prevent the relation from being dropped:

1. The relation is already locked (could be an existing relation or a relation
that we are creating).

2. The relation is protected indirectly (i.e., an index protected by a lock on
its table, a table protected by a lock on a function that depends on the
table, ...).

To avoid any risk for the RelationRelationId class case, we acquire a lock if
there is none. That may add unnecessary locking for case 2, but that's worth it.

The patch adds a few tests for some dependency cases (that would currently produce
orphaned objects):

- schema and function (as the above scenarios)
- alter a dependency (function and schema)
- function and arg type
- function and return type
- function and function
- domain and domain
- table and type
- server and foreign data wrapper
---
 src/backend/catalog/aclchk.c  |   1 +
 src/backend/catalog/dependency.c  | 121 +++-
 src/backend/catalog/heap.c|   7 +
 src/backend/catalog/index.c   |  26 
 src/backend/catalog/objectaddress.c   |  57 
 src/backend/catalog/pg_aggregate.c|   9 ++
 src/backend/catalog/pg_attrdef.c  |   1 +
 src/backend/catalog/pg_cast.c |   5 +
 src/backend/catalog/pg_collation.c|   1 +
 src/backend/catalog/pg_constraint.c   |  26 
 src/backend/catalog/pg_conversion.c   |   2 +
 src/backend/catalog/pg_depend.c   |  40 +-
 src/backend/catalog/pg_operator.c |  19 +++
 src/backend/catalog/pg_proc.c |  17 ++-
 src/backend/catalog/pg_publication.c  |   7 +
 src/backend/catalog/pg_range.c|   6 +
 src/backend/catalog/pg_type.c |  39 ++
 src/backend/catalog/toasting.c|   1 +
 src/backend/commands/alter.c  |   4 +
 src/backend/commands/amcmds.c |   1 +
 src/backend/commands/cluster.c|   7 +
 src/backend/commands/event_trigger.c  |   1 +
 src/backend/commands/extension.c  |   5 +
 src/backend/commands/foreigncmds.c|   7 +
 

Re: Pgoutput not capturing the generated columns

2024-07-01 Thread Peter Smith
Hi Shubham,

As you can see, most of my recent review comments for patch 0001 are
only cosmetic nitpicks. But there is still one long-unanswered design
question from a month ago [1, #G.2].

A lot of the patch code in pgoutput.c, proto.c, and logicalproto.h
is related to introducing and passing new 'include_generated_columns'
function parameters everywhere. These same functions are also always
passing a "Bitmapset *columns" representing the publication column list.

My question was about whether we can make use of the existing BMS
parameter instead of introducing all the new API parameters.

The idea might go something like this:

* If the 'include_generated_columns' option is specified as true and no
column list was already specified, then perhaps relentry->columns
can be used as a "dummy" column list that has everything, including
all the generated columns.

* By doing this:
 -- you may be able to avoid passing the extra
'include_generated_columns' parameter everywhere
 -- you may be able to avoid checking for generated columns deeper in
the code (since it is already checked up-front when building the
column list BMS)

~~

I'm not saying this design idea is guaranteed to work, but it might be
worth considering, because if it does work then there is potential to
make the current 0001 patch significantly shorter.

==
[1] 
https://www.postgresql.org/message-id/CAHut%2BPsuJfcaeg6zst%3D6PE5uyJv_UxVRHU3ck7W2aHb1uQYKng%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia




Re: LogwrtResult contended spinlock

2024-07-01 Thread Alexander Lakhin

01.07.2024 20:04, Alvaro Herrera wrote:

OK, so it's
#if ALIGNOF_LONG_LONG_INT >= 8

Alexander, can you please confirm whether this works for you?


Yes, the v4 patch makes `meson test --suite setup` pass in an x86 environment.
And a complete `meson test` run fails only on pg_amcheck/004_verify_heapam
(I think that's a separate issue; I just want to note what else is broken
on that platform).

Thank you!

Best regards,
Alexander




Re: Speed up JSON escape processing with SIMD plus other optimisations

2024-07-01 Thread David Rowley
I've attached a rebased set of patches.  The previous set no longer applied.

David


v3-0001-Add-len-parameter-to-escape_json.patch
Description: Binary data


v3-0002-Use-SIMD-processing-for-escape_json.patch
Description: Binary data


v3-0003-Special-case-text-type-conversion-in-datum_to_jso.patch
Description: Binary data


Re: pg_createsubscriber: drop pre-existing subscriptions from the converted node

2024-07-01 Thread Shlok Kyal
On Mon, 1 Jul 2024 at 11:44, Hayato Kuroda (Fujitsu)
 wrote:
>
> Dear Amit,
>
> Thanks for giving comments! PSA new version.
>
> > Thanks, this is a better approach. I have changed a few comments and
> > made some other cosmetic changes. See attached.
>
> I checked your attached and LGTM. Based on that, I added some changes
> like below:
>
> - Made dbname be escaped while listing pre-existing subscriptions
>   The previous version could not pass tests due to recent commits.
> - Skipped dropping subscriptions in dry_run mode
>   I found the issue while porting the test to 040_pg_createsubscriber.pl.
> - Added info-level output to follow other drop_XXX functions
>
> > BTW, why have you created a separate test file for this test? I think
> > we should add a new test to one of the existing tests in
> > 040_pg_createsubscriber.
>
> I had separated a test file just for confirmation purposes and planned to
> merge it. The new patch set did that.
>
> > You can create a dummy subscription on node_p
> > and do a test similar to what we are doing in "# Create failover slot
> > to test its removal".
>
> Your approach looks better than mine. I followed the approach.

Hi Kuroda-san,

I tested the patches on Linux and Windows, and I confirm that they
successfully fix the issue [1].

[1]: 
https://www.postgresql.org/message-id/CANhcyEWvimA1-f6hSrA%3D9qkfR5SonFb56b36M%2B%2BvT%3DLiFj%3D76g%40mail.gmail.com

Thanks and Regards,
Shlok Kyal




Re: Pgoutput not capturing the generated columns

2024-07-01 Thread Peter Smith
On Mon, Jul 1, 2024 at 8:38 PM Shubham Khanna
 wrote:
>
>...
> > 8.
> > + else if (strcmp(elem->defname, "include-generated-columns") == 0)
> > + {
> > + if (elem->arg == NULL)
> > + data->include_generated_columns = true;
> >
> > Is there any way to test that "elem->arg == NULL" in the
> > generated.sql? OTOH, if it is not possible to get here then is the
> > code even needed?
> >
>
> Currently I could not find a case where the 'include_generated_columns'
> option is specified without any value, but I was hesitant to remove this
> from here as the other options mentioned follow the same rules. Thoughts?
>

If you do manage to find a scenario for this then I think a test for
it would be good. But, I agree that the code seems OK because now I
see it is the same pattern as similar nearby code.

~~~

Thanks for the updated patch. Here are some review comments for patch v13-0001.

==
.../expected/generated_columns.out

nitpicks (see generated_columns.sql)

==
.../test_decoding/sql/generated_columns.sql

nitpick - use plural /column/columns/
nitpick - use consistent wording in the comments
nitpick - IMO it is better to INSERT different values for each of the tests

==
doc/src/sgml/protocol.sgml

nitpick - I noticed that none of the other boolean parameters on this
page mention a default, so maybe here we should do the same and omit
that information.

~~~

1.
- 
-  Next, the following message part appears for each column included in
-  the publication (except generated columns):
- 
-

In a previous review [1 comment #11] I wrote that you can't just
remove this paragraph because AFAIK it is still meaningful. A minimal
change might be to just remove the "(except generated columns)" part.
Alternatively, you could give a more detailed explanation mentioning
the include_generated_columns protocol parameter.

I provided some updated text for this paragraph in my NITPICKS top-up
patch. Please have a look at that for ideas.

==
src/backend/commands/subscriptioncmds.c

It looks like pg_indent needs to be run on this file.

==
src/include/catalog/pg_subscription.h

nitpick - comment /publish/Publish/ for consistency

==
src/include/replication/walreceiver.h

nitpick - comment /publish/Publish/ for consistency

==
src/test/regress/expected/subscription.out

nitpicks - (see subscription.sql)

==
src/test/regress/sql/subscription.sql

nitpick - combine the invalid option combinations test with all the
others (no special comment needed)
nitpick - rename subscription as 'regress_testsub2' same as all its peers.

==
src/test/subscription/t/011_generated.pl

nitpick - add/remove blank lines

==
src/test/subscription/t/031_column_list.pl

nitpick - rewording for a comment. This issue was not strictly caused
by this patch, but since you are modifying the same comment we can fix
this in passing.

==
99.
Please also see the attached top-up patch for all those nitpicks
identified above.

==
[1] v11-0001 review
https://www.postgresql.org/message-id/CAHut%2BPv45gB4cV%2BSSs6730Kb8urQyqjdZ9PBVgmpwqCycr1Ybg%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia
diff --git a/contrib/test_decoding/expected/generated_columns.out 
b/contrib/test_decoding/expected/generated_columns.out
index 4c3d6dd..f3b26aa 100644
--- a/contrib/test_decoding/expected/generated_columns.out
+++ b/contrib/test_decoding/expected/generated_columns.out
@@ -1,4 +1,4 @@
--- test decoding of generated column
+-- test decoding of generated columns
 SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 
'test_decoding');
  ?column? 
 --
@@ -7,7 +7,7 @@ SELECT 'init' FROM 
pg_create_logical_replication_slot('regression_slot', 'test_d
 
 -- column b' is a generated column
 CREATE TABLE gencoltable (a int, b int GENERATED ALWAYS AS (a * 2) STORED);
--- when 'include-generated-columns' is not set the generated column 'b' will 
be replicated
+-- when 'include-generated-columns' is not set the generated column 'b' values 
will be replicated
 INSERT INTO gencoltable (a) VALUES (1), (2), (3);
 SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 
'include-xids', '0', 'skip-empty-xacts', '1');
 data 
@@ -20,26 +20,26 @@ SELECT data FROM 
pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'inc
 (5 rows)
 
 -- when 'include-generated-columns' = '1' the generated column 'b' values will 
be replicated
-INSERT INTO gencoltable (a) VALUES (1), (2), (3);
+INSERT INTO gencoltable (a) VALUES (4), (5), (6);
 SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 
'include-xids', '0', 'skip-empty-xacts', '1', 'include-generated-columns', '1');
-data 
--
+ data 

Re: Conflict Detection and Resolution

2024-07-01 Thread shveta malik
On Mon, Jul 1, 2024 at 11:47 AM Masahiko Sawada  wrote:
>
> Hi,
>
> On Thu, May 23, 2024 at 3:37 PM shveta malik  wrote:
> >
> > DELETE
> > 
> > Conflict Type:
> > 
> > delete_missing: An incoming delete is trying to delete a row on a
> > target node which does not exist.
>
> IIUC the 'delete_missing' conflict doesn't cover the case where an
> incoming delete message is trying to delete a row that has already
> been updated locally or by another node. I think in update/delete
> conflict situations, we need to resolve the conflicts based on commit
> timestamps like we do for update/update and insert/update conflicts.
>
> For example, suppose there are two nodes, node-A and node-B, set up with
> bi-directional replication, and suppose further that both have the row
> with id = 1. Consider the following sequence:
>
> 09:00:00  DELETE ... WHERE id = 1 on node-A.
> 09:00:05  UPDATE ... WHERE id = 1 on node-B.
> 09:00:10  node-A received the update message from node-B.
> 09:00:15  node-B received the delete message from node-A.
>
> At 09:00:10 on node-A, an update_deleted conflict is generated since
> the row on node-A is already deleted locally. Suppose that we use
> 'apply_or_skip' resolution for this conflict, we convert the update
> message into an insertion, so node-A now has the row with id = 1. At
> 09:00:15 on node-B, the incoming delete message is applied and deletes
> the row with id = 1, even though the row has already been modified
> locally. The node-A and node-B are now inconsistent. This
> inconsistency can be avoided by using 'skip' resolution for the
> 'update_deleted' conflict on node-A, and 'skip' resolution is the
> default method for that actually. However, if we handle it as
> 'update_missing', the 'apply_or_skip' resolution is used by default.
>
> IIUC with the proposed architecture, DELETE always takes precedence
> over UPDATE since both 'update_deleted' and 'update_missing' don't use
> commit timestamps to resolve the conflicts. As long as that is true, I
> think there is no use case for 'apply_or_skip' and 'apply_or_error'
> resolutions in update/delete conflict cases. In short, I think we need
> something like 'delete_differ' conflict type as well.

Thanks for the feedback. Sure, we can have 'delete_differ'.

> FYI PGD and
> Oracle GoldenGate seem to have this conflict type[1][2].
>
> The 'delete_differ' conflict type would have at least
> 'latest_timestamp_wins' resolution. With the timestamp based
> resolution method, we would deal with update/delete conflicts as
> follows:
>
> 09:00:00: DELETE ... WHERE id = 1 on node-A.
> 09:00:05: UPDATE ... WHERE id = 1 on node-B.
> - the updated row doesn't have the origin since it's a local change.
> 09:00:10: node-A received the update message from node-B.
> - the incoming update message has the origin of node-B whereas the
> local row is already removed locally.
> - 'update_deleted' conflict is generated.
> - do the insert of the new row instead, because the commit
> timestamp of UPDATE is newer than DELETE's one.

So, are you suggesting we support latest_timestamp_wins for the
'update_deleted' case? And shall 'latest_timestamp_wins' be the default
then, instead of 'skip'? In some cases the complete row cannot be
constructed, and then the 'insertion' might not be possible even if the
timestamp of the 'update' is the latest. Shall we skip or error out under
the latest_timestamp_wins config then?

Even if we support 'latest_timestamp_wins' as default, we can still
have 'apply_or_skip' and 'apply_or_error' as other options for
'update_deleted' case. Or do you suggest getting rid of these options
completely?

> 09:00:15: node-B received the delete message from node-A.
> - the incoming delete message has the origin of node-B whereas the
> (updated) row doesn't have the origin.
> - 'update_differ' conflict is generated.

Here, do you mean 'delete_differ' conflict is generated?

thanks
Shveta




RE: Improve EXPLAIN output for multicolumn B-Tree Index

2024-07-01 Thread Masahiro.Ikeda
> > I think the better choice would be adding an IndexAmRoutine->amexplain
> > support function, which would get called in e.g. explain.c's
> > ExplainIndexScanDetails to populate a new "Index Scan Details" (name
> > to be bikeshed) subsection of explain plans. This would certainly be
> > possible, as the essentials for outputting things to EXPLAIN are
> > readily available in the explain.h header.
> 
> Yes, that's one of my concerns. I agree that adding IndexAmRoutine->amexplain
> is better because we can support several use cases.
> 
> Although I'm not confident that adding only IndexAmRoutine->amexplain is
> enough now, I'll make a PoC patch to confirm it.

I attached the patch adding an IndexAmRoutine->amexplain.

This patch changes the following:
* Add a new index AM function "amexplain_function()", called in ExplainNode().
   Although I tried to add it in ExplainIndexScanDetails(), I think that's not
   the proper place to show quals. So, amexplain_function() is called after
   show_scanqual() in the patch.
* Add "amexplain_function" for the B-Tree index AM and show "Non Key Filter"
   if VERBOSE is specified.
   To avoid confusion with INCLUDE-d columns and the non-index-column "Filter",
   I've decided to output it only with the VERBOSE option. However, I'm not sure
   if this is the appropriate solution. It might be a good idea to include words
   like 'b-tree' to make it clear that it's output specific to the b-tree index.

-- Example dataset
CREATE TABLE test (id1 int, id2 int, id3 int, value varchar(32));
CREATE INDEX test_idx ON test(id1, id2, id3);  -- multicolumn B-Tree index
INSERT INTO test (SELECT i % 2, i, i, 'hello' FROM generate_series(1,100) 
s(i));
ANALYZE;

-- The output is the same as without this patch if it can search efficiently
=# EXPLAIN (VERBOSE, ANALYZE, BUFFERS, MEMORY, SERIALIZE) SELECT id3 FROM test 
WHERE id1 = 1 AND id2 = 101;
QUERY PLAN
---
 Index Only Scan using test_idx on public.test  (cost=0.42..4.44 rows=1 
width=4) (actual time=0.058..0.060 rows=1 loops=1)
   Output: id3
   Index Cond: ((test.id1 = 1) AND (test.id2 = 101))
   Heap Fetches: 0
   Buffers: shared hit=4
 Planning:
   Memory: used=14kB  allocated=16kB
 Planning Time: 0.166 ms
 Serialization: time=0.009 ms  output=1kB  format=text
 Execution Time: 0.095 ms
(10 rows)

-- "Non Key Filter" will be displayed if it will scan index tuples and filter 
them
=# EXPLAIN (VERBOSE, ANALYZE, BUFFERS, MEMORY, SERIALIZE) SELECT id3 FROM test 
WHERE id1 = 1 AND id3 = 101;
   QUERY PLAN

 Index Only Scan using test_idx on public.test  (cost=0.42..12724.10 rows=1 
width=4) (actual time=0.055..69.446 rows=1 loops=1)
   Output: id3
   Index Cond: ((test.id1 = 1) AND (test.id3 = 101))
   Heap Fetches: 0
   Non Key Filter: (test.id3 = 101)
   Buffers: shared hit=1920
 Planning:
   Memory: used=14kB  allocated=16kB
 Planning Time: 0.113 ms
 Serialization: time=0.004 ms  output=1kB  format=text
 Execution Time: 69.491 ms
(11 rows)

Although I plan to support "Rows Removed by Non Key Filtered" (or "Skip Scan
Filtered"), I'd like to know whether the current direction is good. One of my
concerns is that there might be a better way to extract quals for boundary
conditions in btexplain().

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION


v2-0001-Support-Non-Key-Filter-for-multicolumn-B-Tree-Ind.patch
Description:  v2-0001-Support-Non-Key-Filter-for-multicolumn-B-Tree-Ind.patch


Re: Use pgstat_kind_infos to read fixed shared stats structs

2024-07-01 Thread Michael Paquier
On Mon, Jul 01, 2024 at 02:48:19PM +0900, Michael Paquier wrote:
> So, how about trying to remove the dependency to the fixed shared
> stats structures in PgStat_ShmemControl and PgStat_Snapshot?  I'd like
> to think that these should be replaced with an area allocated in
> shared memory and TopMemoryContext respectively, with PgStat_Snapshot
> and PgStat_ShmemControl pointing to these areas, with an allocated
> size based on the information aggregated from the KindInfo Array.  We
> could also store the offset of the fixed areas in two extra arrays,
> one for each of the two structures, indexed by KindInfo and of size
> PGSTAT_NUM_KINDS.

I have been poking at this area, and found a solution that should
work.  The details will be posted on the pluggable stats thread with a
rebased patch and these bits on top of the pluggable APIs:
https://www.postgresql.org/message-id/Zmqm9j5EO0I4W8dx%40paquier.xyz

So let's move the talking there.
--
Michael


signature.asc
Description: PGP signature


Re: Should we document how column DEFAULT expressions work?

2024-07-01 Thread David Rowley
On Tue, 2 Jul 2024 at 02:43, Tom Lane  wrote:
> I'd be more excited about this discussion if I didn't think that
> the chances of removing 'now'::timestamp are exactly zero.  You
> can't just delete useful decades-old features, whether there's
> a better way or not.

Do you have any thoughts on rejecting trailing punctuation with the
timestamp special values?

For me, I have mixed feelings about it. I think it would be good to
break things for people who are doing this, getting the wrong behaviour,
and haven't noticed yet. However, there could be a set of people doing
this with the values embedded in statements that are parsed directly
before execution, who just happen to get the right behaviour.  It might
be better not to upset the latter set of people needlessly.  Perhaps the
former set of people don't exist, since the behaviour is quite different
and it seems quite obviously wrong.

David




Re: ALTER TABLE SET ACCESS METHOD on partitioned tables

2024-07-01 Thread Tom Lane
Michael Paquier  writes:
> On Mon, Jul 01, 2024 at 08:40:17PM -0400, Tom Lane wrote:
>> While this won't actually fail against a v10 or v11 server, it won't
>> show anything useful either (because relam is zero for heaps in
>> pre-v12 versions).  Perhaps there should be a check to only add the
>> extra column if server >= v12?

> I've thought about that and would be OK to restrict things with this
> suggestion if you'd prefer it.  However, I could not decide in favor
> of it as using a psql \dP+ >= v18 gives the possibility to show the
> AMs of partitioned indexes, as these are also part listed in \dP.  So
> I've found that useful in itself.

Ah, I'd forgotten that partitioned indexes are shown too.
Never mind then.

regards, tom lane




CREATE OR REPLACE MATERIALIZED VIEW

2024-07-01 Thread Erik Wienhold
I'd like to add CREATE OR REPLACE MATERIALIZED VIEW with the attached
patches.

Patch 0001 adds CREATE OR REPLACE MATERIALIZED VIEW similar to CREATE OR
REPLACE VIEW.  It also includes regression tests and changes to docs.

Patch 0002 deprecates CREATE MATERIALIZED VIEW IF NOT EXISTS because it
no longer seems necessary with patch 0001.  Tom Lane commented[1] about
the general dislike of IF NOT EXISTS, to which I agree, but maybe this
was meant only in response to adding new commands.  Anyway, my idea is
to deprecate that usage in PG18 and eventually remove it in PG19, if
there's consensus for it.  We can drop that clause without violating any
standard because matviews are a Postgres extension.  I'm not married to
the idea, just want to put it on the table for discussion.

Motivation
--

At $JOB we use materialized views for caching a couple of expensive
views.  But every now and then those views have to be changed, e.g., new
logic, new columns, etc.  The matviews have to be dropped and re-created
to include new columns.  (Just changing the underlying view logic
without adding new columns is trivial because the matviews are just thin
wrappers that just have to be refreshed.)
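
For illustration, the "thin wrapper" pattern described above looks roughly like
this (a sketch; the table, view, and column names are made up):

CREATE VIEW expensive_v AS          -- the actual (expensive) logic lives here
SELECT a, count(*) AS n FROM big_table GROUP BY a;

CREATE MATERIALIZED VIEW expensive_mv AS
SELECT * FROM expensive_v;          -- the matview is just a thin wrapper

-- Changing only the view's logic (same columns) then needs nothing more than:
CREATE OR REPLACE VIEW expensive_v AS
SELECT a, count(*) AS n FROM big_table WHERE b > 0 GROUP BY a;
REFRESH MATERIALIZED VIEW expensive_mv;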

We also have several views that depend on those matviews.  The views
must also be dropped in order to re-create the matviews.  We've already
automated this with two procedures that stash and re-create dependent
view definitions.

Native support for replacing matviews would simplify our setup and it
would make CREATE MATERIALIZED VIEW more complete when compared to
CREATE VIEW.

I searched the lists for previous discussions on this topic but couldn't
find any.  So, I don't know if this was ever tried, but rejected for
some reason.  I've found slides[2] from 2013 (when matviews landed in
9.3) which have OR REPLACE on the roadmap:

> Materialised Views roadmap
>
> * CREATE **OR REPLACE** MATERIALIZED VIEW
>   * Just an oversight that it wasn't added
>  [...]

Replacing Matviews
--

With patch 0001, a matview can be replaced without having to drop it and
its dependent objects.  In our use case it is no longer necessary to
define the actual query in a separate view.  Replacing a matview works
analogous to CREATE OR REPLACE VIEW:

* the new query may change SELECT list expressions of existing columns
* new columns can be added to the end of the SELECT list
* existing columns cannot be renamed
* the data type of existing columns cannot be changed

In addition to that, CREATE OR REPLACE MATERIALIZED VIEW also replaces
access method, tablespace, and storage parameters if specified.  The
clause WITH [NO] DATA works as expected: it either populates the matview
or leaves it in an unscannable state.

It is an error to specify both OR REPLACE and IF NOT EXISTS.

Example
---

postgres=# CREATE MATERIALIZED VIEW test AS SELECT 1 AS a;
SELECT 1
postgres=# SELECT * FROM test;
 a
---
 1
(1 row)

postgres=# CREATE OR REPLACE MATERIALIZED VIEW test AS SELECT 2 AS a, 3 AS b;
CREATE MATERIALIZED VIEW
postgres=# SELECT * FROM test;
 a | b
---+---
 2 | 3
(1 row)
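
Conversely, a replacement that violates the rules listed above should be
rejected; a sketch based on those rules (the exact error message depends on
the patch):

postgres=# CREATE OR REPLACE MATERIALIZED VIEW test AS SELECT 'two'::text AS a, 3 AS b;
-- expected to fail: the data type of existing column "a" cannot be changed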

Implementation Details
--

Patch 0001 extends create_ctas_internal in order to adapt an existing
matview to the new tuple descriptor, access method, tablespace, and
storage parameters.  This logic is mostly based on DefineViewRelation.
This also reuses checkViewColumns, but adds an is_matview argument in order
to tell whether we want error messages for a matview (true) or a view (false).
I'm not sure if that flag is the correct way to do that, or if I should
just create a separate function for matviews with the same logic.
Do we even need to distinguish between view and matview in those error
messages?

The patch also adds tab completion in psql for CREATE OR REPLACE
MATERIALIZED VIEW.

[1] https://www.postgresql.org/message-id/226806.1693430777%40sss.pgh.pa.us
[2] 
https://wiki.postgresql.org/images/a/ad/Materialised_views_now_and_the_future-pgconfeu_2013.pdf#page=23

-- 
Erik
>From b9f96d4a8e806389bf33a96be6db3a57bccb48cf Mon Sep 17 00:00:00 2001
From: Erik Wienhold 
Date: Tue, 21 May 2024 18:35:47 +0200
Subject: [PATCH v1 1/2] Add CREATE OR REPLACE MATERIALIZED VIEW

---
 .../sgml/ref/create_materialized_view.sgml|  15 +-
 src/backend/commands/createas.c   | 207 ++
 src/backend/commands/tablecmds.c  |   8 +-
 src/backend/commands/view.c   | 106 ++---
 src/backend/parser/gram.y |  15 ++
 src/bin/psql/tab-complete.c   |   2 +-
 src/include/commands/view.h   |   3 +
 src/include/nodes/parsenodes.h|   2 +-
 src/include/nodes/primnodes.h |   1 +
 src/test/regress/expected/matview.out | 191 
 src/test/regress/sql/matview.sql  | 108 +
 11 files changed, 574 insertions(+), 84 deletions(-)

diff --git a/doc/src/sgml/ref/create_materialized_view.sgml 

Re: Built-in CTYPE provider

2024-07-01 Thread Jeff Davis
On Mon, 2024-07-01 at 16:03 -0700, Noah Misch wrote:
> I agree the v17 code is fine.  Today, a user can (with difficulty)
> choose
> dependency libraries so regexp_matches() is IMMUTABLE, as marked.  I
> don't
> want $SUBJECT to be the ctype that, at some post-v17 version, can't
> achieve
> that with unpatched PostgreSQL.

We aren't forcing anyone to use the builtin "C.UTF-8" locale. Anyone
can still use the builtin "C" locale (which never changes), or another
provider if they can sort out the difficulties (and live with the
consequences) of pinning the dependencies to a specific version.
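
For context, either builtin locale can be selected explicitly. A sketch,
assuming the v17 syntax (check the CREATE DATABASE and CREATE COLLATION docs
for the exact option spellings):

CREATE DATABASE db_cutf8 TEMPLATE template0 ENCODING 'UTF8'
    LOCALE_PROVIDER builtin BUILTIN_LOCALE 'C.UTF-8';
CREATE DATABASE db_c TEMPLATE template0
    LOCALE_PROVIDER builtin BUILTIN_LOCALE 'C';
-- or per column, using the predefined collation for the builtin C.UTF-8 locale:
CREATE TABLE t (s text COLLATE pg_c_utf8);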

>   Let's change the documentation to say this
> provider uses a particular snapshot of Unicode data, taken around
> PostgreSQL
> 17.  We plan never to change that data, so IMMUTABLE functions can
> rely on the
> data.

We can discuss this in the context of version 18 or the next time we
plan to update Unicode. I don't think we should make such a promise in
version 17.

>   If we provide a newer Unicode data set in the future, we'll provide
> it
> in such a way that DDL must elect the new data.  How well would that
> suit your
> vision for this feature?

Thomas tried tracking collation versions along with individual objects,
and it had to be reverted (ec48314708).

It fits my vision to do something like that as a way of tightening
things up.

But there are some open design questions we need to settle, along with
a lot of work. So I don't think we should pre-emptively block all
Unicode updates waiting for it.

>   An alternative would be to make pg_upgrade reject
> operating on a cluster that contains use of $SUBJECT.

That wouldn't help anyone.

Regards,
Jeff Davis





Re: ALTER TABLE SET ACCESS METHOD on partitioned tables

2024-07-01 Thread Michael Paquier
On Mon, Jul 01, 2024 at 08:40:17PM -0400, Tom Lane wrote:
> While this won't actually fail against a v10 or v11 server, it won't
> show anything useful either (because relam is zero for heaps in
> pre-v12 versions).  Perhaps there should be a check to only add the
> extra column if server >= v12?

I've thought about that and would be OK with restricting things per this
suggestion if you'd prefer it.  However, I could not decide in favor
of it, as using psql's \dP+ from >= v18 gives the possibility to show the
AMs of partitioned indexes, as these are also listed as part of \dP.  So
I've found that useful in itself.
--
Michael


signature.asc
Description: PGP signature


Re: Cleaning up perl code

2024-07-01 Thread Michael Paquier
On Fri, May 24, 2024 at 02:09:49PM +0900, Michael Paquier wrote:
> For now, I have staged for commit the attached, that handles most of
> the changes from Alexander (msvc could go for more cleanup?).

This one has been applied as of 0c1aca461481 now that v18 is
open.

> I'll look at the changes from Dagfinn after that, including if perlcritic
> could be changed.  I'll handle the first part when v18 opens up, as
> that's cosmetic.

I'm still biased about the second set of changes proposed here,
though.  ProhibitUnusedVariables would have benefits when writing perl
code in terms of clarity because we would avoid useless stuff, but it
seems to me that we should put more efforts into the unification of
the errcodes parsing paths first to have a cleaner long-term picture.

It's not directly the fault of this proposal that we have the same
parsing rules spread across three PL languages, so perhaps what's
proposed is fine as-is, in the end.

Any thoughts or comments from others more familiar with
ProhibitUnusedVariables?
--
Michael


signature.asc
Description: PGP signature


Re: Make tuple deformation faster

2024-07-01 Thread Andy Fan
David Rowley  writes:

> You can see the branch predictor has done a *much* better job in the
> patched code vs master with about 10x fewer misses.  This should have
> helped contribute to the "insn per cycle" increase.  4.29 is quite
> good for postgres. I often see that around 0.5. According to [1]
> (relating to Zen4), "We get a ridiculous 12 NOPs per cycle out of the
> micro-op cache". I'm unsure how micro-ops translate to "insn per
> cycle" that's shown in perf stat. I thought 4-5 was about the maximum
> pipeline size from today's era of CPUs. Maybe someone else can explain
> better than I can. In more simple terms, generally, the higher the
> "insn per cycle", the better. Also, the lower all of the idle and
> branch miss percentages are that's generally also better. However,
> you'll notice that the patched version has more front and backend
> stalls. I assume this is due to performing more instructions per cycle
> from improved branch prediction causing memory and instruction stalls
> to occur more frequently, effectively (I think) it's just hitting the
> next bottleneck(s) - memory and instruction decoding. At least, modern
> CPUs should be able to out-pace RAM in many workloads, so perhaps it's
> not that surprising that "backend cycles idle" has gone up due to such
> a large increase in instructions per cycle due to improved branch
> prediction.

Thanks for the answer; that's just another area that deserves exploring.

> It would be nice to see this tested on some modern Intel CPU. A 13th
> series or 14th series, for example, or even any intel from the past 5
> years would be better than nothing.

I have two kinds of CPUs:

a) Intel Xeon Processor (Icelake) on my ECS
b) Intel(R) Core(TM) i5-8259U CPU @ 2.30GHz on a Mac.

My ECS reports "<not supported>" for branch-misses, probably because it
runs in virtualization software, and the Mac doesn't support perf yet :(

-- 
Best Regards
Andy Fan





Re: ALTER TABLE SET ACCESS METHOD on partitioned tables

2024-07-01 Thread Tom Lane
Michael Paquier  writes:
> On Thu, Jun 06, 2024 at 09:43:45AM +0900, Michael Paquier wrote:
>> Not sure that this is a must-have.  It is nice to have, but extra
>> information is a new feature IMO.  Any extra opinions?

> Hearing nothing, I've applied that on HEAD now that v18 is open.

While this won't actually fail against a v10 or v11 server, it won't
show anything useful either (because relam is zero for heaps in
pre-v12 versions).  Perhaps there should be a check to only add the
extra column if server >= v12?

regards, tom lane




Re: Avoid incomplete copy string (src/backend/access/transam/xlog.c)

2024-07-01 Thread Michael Paquier
On Mon, Jul 01, 2024 at 09:19:59PM +0200, Daniel Gustafsson wrote:
>> The bit I don't understand about this discussion is what will happen
>> with users that currently have exactly 1024 chars in backup names today.
>> With this change, we'll be truncating their names to 1023 chars instead.
>> Why would they feel that such change is welcome?
> 
> That's precisely what I was getting at.  Maybe it makes sense to change, maybe
> not, but that's not for this patch to decide as that's a different discussion
> from using safe string copying API's.

Yep.  Agreed to keep backward-compatibility here, even if I suspect
there is close to nobody relying on backup label names of this size.
--
Michael


signature.asc
Description: PGP signature


Re: ALTER TABLE SET ACCESS METHOD on partitioned tables

2024-07-01 Thread Michael Paquier
On Thu, Jun 06, 2024 at 09:43:45AM +0900, Michael Paquier wrote:
> Not sure that this is a must-have.  It is nice to have, but extra
> information is a new feature IMO.  Any extra opinions?

Hearing nothing, I've applied that on HEAD now that v18 is open.
--
Michael


signature.asc
Description: PGP signature


Re: Make tuple deformation faster

2024-07-01 Thread David Rowley
On Mon, 1 Jul 2024 at 23:42, Matthias van de Meent
 wrote:
>
> On Mon, 1 Jul 2024 at 12:49, David Rowley  wrote:
> >
> > On Mon, 1 Jul 2024 at 22:07, Matthias van de Meent
> >  wrote:
> > > One thing I'm slightly concerned about is that this allocates another
> > > 8 bytes for each attribute in the tuple descriptor. While that's not a
> > > lot when compared with the ->attrs array, it's still quite a lot when
> > > we might not care at all about this data; e.g. in temporary tuple
> > > descriptors during execution, in intermediate planner nodes.
> >
> > I've not done it in the patch, but one way to get some of that back is
> > to ditch pg_attribute.attcacheoff. There's no need for it after this
> > patch.  That's only 4 out of 8 bytes, however.
>
> FormData_pg_attribute as a C struct has 4-byte alignment; AFAIK it
> doesn't have any fields that require 8-byte alignment? Only on 64-bit
> systems we align the tuples on pages with 8-byte alignment, but
> in-memory arrays of the struct wouldn't have to deal with that, AFAIK.

Yeah, 4-byte alignment.  The "out of 8 bytes" I was talking about is
sizeof(TupleDescDeformAttr), which I believe is the same "another 8
bytes" you had mentioned. What I meant was that deleting attcacheoff
only reduces FormData_pg_attribute by 4 bytes per column and adding
TupleDescDeformAttr adds 8 per column, so we still use 4 more bytes
per column with the patch.

I really doubt the 4 bytes of extra memory is a big concern here. It
would be more concerning for a patch that wanted to do something like
change NAMEDATALEN to 128, but I think the main concern with that
would be even slower tuple deforming. Additional memory would also be
concerning, but I doubt that's more important than the issue of making
all queries slower due to slower tuple deformation, which is what such
a patch would result in.

> > I think in most cases
> > due to FormData_pg_attribute being so huge, the aset.c power-of-2
> > roundup behaviour will be unlikely to cross a power-of-2 boundary.
>
> ASet isn't the only allocator, but default enough for this to make sense, yes.

It's the only allocator we use for allocating TupleDescs, so other
types and their behaviour are not relevant here.

> I'm not sure we have a pg_type entry that supports numeric
> attalign values without increasing the size of the field, as the one
> single-byte integer-like type (char) is always used as a printable
> character, and implied to always be printable through its usage in
> e.g. nodeToString infrastructure.
>
> I'd love to have a better option, but we don't seem to have one yet.

Yeah, "select typname from pg_type where typalign = 'c' and typlen = 1"
returns just bool and char.

I'm happy to keep going with this version of the patch unless someone
points out some good reason that we should go with the alternative
instead.

David




Re: Parallel CREATE INDEX for GIN indexes

2024-07-01 Thread Andy Fan
Tomas Vondra  writes:

Hi Tomas,

I am in the middle of an incomplete review, but I probably don't have much
time for this because of my internal tasks. So I'm just sharing what I
did and the not-so-good-result patch.

What I tried to do is:
1) remove all the "sort" effort for the state->bs_sort_state tuples since
its input comes from state->bs_worker_state which is sorted already.

2) remove *partial* "sort" operations between accum.rbtree and
state->bs_worker_state, because the tuples in accum.rbtree are sorted
already.

Both of them can depend on the same API changes. 

1.
struct Tuplesortstate
{
..
+ bool input_presorted; /* the input tuples are presorted */
+ bool new_tapes;       /* write the in-memory tuples into a new 'run' */
}

and the user can set it during tuplesort_begin_xx(..., presorted=true);


2. In tuplesort_puttuple, if memory is full but presorted is
true, we can (a) avoid the sort, and (b) reuse the existing 'runs'
to reduce the effort of 'mergeruns', unless new_tapes is set to
true. Once it switches to a new tape, set state->new_tapes to false
and wait for 3) to change it back to true.

3. tuplesort_dumptuples(..);  // dump the in-memory tuples and set
new_tapes=true to signal that *this batch of input is presorted and
done; the next batch is presorted only within its own batch*.

In the gin-parallel-build case, for case 1) we can just use

for tuple in bs_worker_sort:
tuplesort_putgintuple(state->bs_sortstate, ..);
tuplesort_dumptuples(..);

In the end we get: a) only 1 run in the worker, so that the leader has
fewer runs to merge in mergeruns; b) reduced sorting both in
perform_sort_tuplesort and in sortstate_puttuple_common.

For case 2) we can have:

   for tuple in RBTree.tuples:
  tuplesort_puttuples(tuple);
  // this may cause a dumptuples internally when the memory is full,
  // but that is OK.
tuplesort_dumptuples(..);

This way we can just remove the "sort" from sortstate_puttuple_common, but it
probably increases the number of 'runs' in the sortstate, which will increase
the effort of mergeruns at the end.

But the test result is not good; maybe the 'sort' is not a key factor
here. I did miss the perf step before doing this, or maybe my test data
is too small.

Here is the patch I used for the above activity, and I used the
following SQL to test.

CREATE TABLE t(a int[], b numeric[]);

-- generate 1000 * 1000 rows.
insert into t select i, n
from normal_rand_array(1000, 90, 1::int4, 1::int4) i,
normal_rand_array(1000, 90, '1.00233234'::numeric, '8.239241989134'::numeric) n;

alter table t set (parallel_workers=4);
set debug_parallel_query to on;
set max_parallel_maintenance_workers to 4;

create index on t using gin(a);
create index on t using gin(b);

I found normal_rand_array handy to use in this case, and I
registered it at https://commitfest.postgresql.org/48/5061/.

Besides the above, I didn't find anything wrong in the current
patch, and the above can be categorized as "future improvement",
even if it is worth doing.

-- 
Best Regards
Andy Fan

>From 48c2e03fd854c8f88f781adc944f37b004db0721 Mon Sep 17 00:00:00 2001
From: Andy Fan 
Date: Sat, 8 Jun 2024 13:21:08 +0800
Subject: [PATCH v20240702 1/3] Add function normal_rand_array function to
 contrib/tablefunc.

It can produce arrays of numbers with a controllable array length and
duplicated elements in these arrays.
---
 contrib/tablefunc/Makefile|   2 +-
 contrib/tablefunc/expected/tablefunc.out  |  26 
 contrib/tablefunc/sql/tablefunc.sql   |  10 ++
 contrib/tablefunc/tablefunc--1.0--1.1.sql |   7 ++
 contrib/tablefunc/tablefunc.c | 140 ++
 contrib/tablefunc/tablefunc.control   |   2 +-
 doc/src/sgml/tablefunc.sgml   |  10 ++
 src/backend/utils/adt/arrayfuncs.c|   7 ++
 8 files changed, 202 insertions(+), 2 deletions(-)
 create mode 100644 contrib/tablefunc/tablefunc--1.0--1.1.sql

diff --git a/contrib/tablefunc/Makefile b/contrib/tablefunc/Makefile
index 191a3a1d38..f0c67308fd 100644
--- a/contrib/tablefunc/Makefile
+++ b/contrib/tablefunc/Makefile
@@ -3,7 +3,7 @@
 MODULES = tablefunc
 
 EXTENSION = tablefunc
-DATA = tablefunc--1.0.sql
+DATA = tablefunc--1.0.sql tablefunc--1.0--1.1.sql
 PGFILEDESC = "tablefunc - various functions that return tables"
 
 REGRESS = tablefunc
diff --git a/contrib/tablefunc/expected/tablefunc.out b/contrib/tablefunc/expected/tablefunc.out
index ddece79029..9f0cbbfbbe 100644
--- a/contrib/tablefunc/expected/tablefunc.out
+++ b/contrib/tablefunc/expected/tablefunc.out
@@ -12,6 +12,32 @@ SELECT avg(normal_rand)::int, count(*) FROM normal_rand(100, 250, 0.2);
 -- negative number of tuples
 SELECT avg(normal_rand)::int, count(*) FROM normal_rand(-1, 250, 0.2);
 ERROR:  number of rows cannot be negative
+SELECT count(*), avg(COALESCE(array_length(i, 1), 0)) FROM normal_rand_array(10, 3, 1.23::numeric, 8::numeric) as i;
+ count |avg 
+---+
+10 | 3.
+(1 row)
+

Re: Relation bulk write facility

2024-07-01 Thread Noah Misch
On Tue, Jul 02, 2024 at 12:53:05AM +0300, Heikki Linnakangas wrote:
> On 01/07/2024 23:52, Noah Misch wrote:
> > Commit 8af2565 wrote:
> > > --- /dev/null
> > > +++ b/src/backend/storage/smgr/bulk_write.c
> > 
> > > +/*
> > > + * Finish bulk write operation.
> > > + *
> > > + * This WAL-logs and flushes any remaining pending writes to disk, and 
> > > fsyncs
> > > + * the relation if needed.
> > > + */
> > > +void
> > > +smgr_bulk_finish(BulkWriteState *bulkstate)
> > > +{
> > > + /* WAL-log and flush any remaining pages */
> > > + smgr_bulk_flush(bulkstate);
> > > +
> > > + /*
> > > +  * When we wrote out the pages, we passed skipFsync=true to avoid the
> > > +  * overhead of registering all the writes with the checkpointer.  
> > > Register
> > > +  * the whole relation now.
> > > +  *
> > > +  * There is one hole in that idea: If a checkpoint occurred while we 
> > > were
> > > +  * writing the pages, it already missed fsyncing the pages we had 
> > > written
> > > +  * before the checkpoint started.  A crash later on would replay the WAL
> > > +  * starting from the checkpoint, therefore it wouldn't replay our 
> > > earlier
> > > +  * WAL records.  So if a checkpoint started after the bulk write, fsync
> > > +  * the files now.
> > > +  */
> > > + if (!SmgrIsTemp(bulkstate->smgr))
> > > + {
> > 
> > Shouldn't this be "if (bulkstate->use_wal)"?  The GetRedoRecPtr()-based
> > decision is irrelevant to the !wal case.  Either we don't need fsync at all
> > (TEMP or UNLOGGED) or smgrDoPendingSyncs() will do it (wal_level=minimal).
> 
> The point of GetRedoRecPtr() is to detect if a checkpoint has started
> concurrently. It works for that purpose whether or not the bulk load is
> WAL-logged. It is not compared with the LSNs of WAL records written by the
> bulk load.

I think the significance of start_RedoRecPtr is it preceding all records
needed to recreate the bulk write.  If start_RedoRecPtr==GetRedoRecPtr() and
we crash after commit, we're indifferent to whether the rel gets synced at a
checkpoint before that crash or rebuilt from WAL after that crash.  If
start_RedoRecPtr!=GetRedoRecPtr(), some WAL of the bulk write is already
deleted, so only smgrimmedsync() suffices.  Overall, while it is not compared
with LSNs in WAL records, it's significant only to the extent that such a WAL
record exists.  What am I missing?

> Unlogged tables do need to be fsync'd. The scenario is:
> 
> 1. Bulk load an unlogged table.
> 2. Shut down Postgres cleanly
> 3. Pull power plug from server, and restart.
> 
> We talked about this earlier in the "Unlogged relation copy is not fsync'd"
> thread [1]. I had already forgotten about that; that bug actually still
> exists in back branches, and we should fix it..
> 
> [1] 
> https://www.postgresql.org/message-id/flat/65e94fc8-ce1d-dd02-3be3-fda0fe8f2965%40iki.fi

Ah, that's right.  I agree this code suffices for unlogged.  As a further
optimization, it would be valid to ignore GetRedoRecPtr() for unlogged and
always call smgrregistersync().  (For any rel, smgrimmedsync() improves on
smgrregistersync() only if we fail to reach the shutdown checkpoint.  Without
a shutdown checkpoint, unlogged rels get reset anyway.)

> > I don't see any functional problem, but this likely arranges for an
> > unnecessary sync when a checkpoint starts between mdcreate() and
> > here.  (The mdcreate() sync may also be unnecessary, but that's
> > longstanding.)
> Hmm, yes we might do two fsyncs() with wal_level=minimal, unnecessarily. It
> seems hard to eliminate the redundancy. smgr_bulk_finish() could skip the
> fsync, if it knew that smgrDoPendingSyncs() will do it later. However,
> smgrDoPendingSyncs() might also decide to WAL-log the relation instead of
> fsyncing it, and in that case we do still need the fsync.

We do not need the fsync in the "WAL-log the relation instead" case; see
https://postgr.es/m/20230921062210.ga110...@rfd.leadboat.com

So maybe like this:

  if (use_wal) /* includes init forks */
current logic;
  else if (unlogged)
smgrregistersync;
  /* else temp || (permanent && wal_level=minimal): nothing to do */

> Fortunately, fsync() on a file that's already flushed to disk is pretty
> cheap.

Yep.  I'm more concerned about future readers wondering why the function is
using LSNs to decide what to do about data that doesn't appear in WAL.  A
comment could be another way to fix that, though.




Re: Built-in CTYPE provider

2024-07-01 Thread Noah Misch
On Mon, Jul 01, 2024 at 12:24:15PM -0700, Jeff Davis wrote:
> On Sat, 2024-06-29 at 15:08 -0700, Noah Misch wrote:
> > lower(), initcap(), upper(), and regexp_matches() are
> > PROVOLATILE_IMMUTABLE.
> > Until now, we've delegated that responsibility to the user.  The user
> > is
> > supposed to somehow never update libc or ICU in a way that changes
> > outcomes
> > from these functions.
> 
> To me, "delegated" connotes a clear and organized transfer of
> responsibility to the right person to solve it. In that sense, I
> disagree that we've delegated it.

Good point.

> >   Now that postgresql.org is taking that responsibility
> > for builtin C.UTF-8, how should we govern it?  I think the above text
> > and [1]
> > convey that we'll update the Unicode data between major versions,
> > making
> > functions like lower() effectively STABLE.  Is that right?
> 
> Marking them STABLE is not a viable option, that would break a lot of
> valid use cases, e.g. an index on LOWER().

I agree.

> I don't think we need code changes for 17. Some documentation changes
> might be helpful, though. Should we have a note around LOWER()/UPPER()
> that users should REINDEX any dependent indexes when the provider is
> updated?

I agree the v17 code is fine.  Today, a user can (with difficulty) choose
dependency libraries so regexp_matches() is IMMUTABLE, as marked.  I don't
want $SUBJECT to be the ctype that, at some post-v17 version, can't achieve
that with unpatched PostgreSQL.  Let's change the documentation to say this
provider uses a particular snapshot of Unicode data, taken around PostgreSQL
17.  We plan never to change that data, so IMMUTABLE functions can rely on the
data.  If we provide a newer Unicode data set in the future, we'll provide it
in such a way that DDL must elect the new data.  How well would that suit your
vision for this feature?  An alternative would be to make pg_upgrade reject
operating on a cluster that contains use of $SUBJECT.




Re: Optimize numeric multiplication for one and two base-NBASE digit multiplicands.

2024-07-01 Thread Dean Rasheed
On Mon, 1 Jul 2024 at 20:56, Joel Jacobson  wrote:
>
> Below is a more realistic benchmark
>
> CREATE TABLE bench_mul_var (num1 numeric, num2 numeric);
>
> INSERT INTO bench_mul_var (num1, num2)
> SELECT random(0::numeric,1e8::numeric), random(0::numeric,1e8::numeric) FROM 
> generate_series(1,1e8);
>
> SELECT SUM(num1*num2) FROM bench_mul_var;

I had a play with this, and came up with a slightly different way of
doing it that works for var2 of any size, as long as var1 is just 1 or
2 digits.

Repeating your benchmark where both numbers have up to 2 NBASE-digits,
this new approach was slightly faster:

SELECT SUM(num1*num2) FROM bench_mul_var; -- HEAD
Time: 4762.990 ms (00:04.763)
Time: 4332.166 ms (00:04.332)
Time: 4276.211 ms (00:04.276)
Time: 4247.321 ms (00:04.247)
Time: 4166.738 ms (00:04.167)

SELECT SUM(num1*num2) FROM bench_mul_var; -- v2 patch
Time: 4398.812 ms (00:04.399)
Time: 3672.668 ms (00:03.673)
Time: 3650.227 ms (00:03.650)
Time: 3611.420 ms (00:03.611)
Time: 3534.218 ms (00:03.534)

SELECT SUM(num1*num2) FROM bench_mul_var; -- this patch
Time: 3350.596 ms (00:03.351)
Time: 3336.224 ms (00:03.336)
Time: 3335.599 ms (00:03.336)
Time: 3336.990 ms (00:03.337)
Time: 3351.453 ms (00:03.351)

(This was on an older Intel Core i9-9900K, so I'm not sure why all the
timings are faster. What compiler settings are you using?)

The approach taken in this patch only uses 32-bit integers, so in
theory it could be extended to work for var1ndigits = 3, 4, or even
more, but the code would get increasingly complex, and I ran out of
steam at 2 digits. It might be worth trying though.

Regards,
Dean
diff --git a/src/backend/utils/adt/numeric.c b/src/backend/utils/adt/numeric.c
new file mode 100644
index 5510a20..adbfd5c
--- a/src/backend/utils/adt/numeric.c
+++ b/src/backend/utils/adt/numeric.c
@@ -8748,6 +8748,74 @@ mul_var(const NumericVar *var1, const Nu
}
 
/*
+* Simplified fast-path computation, if var1 has just one or two digits.
+* This is significantly faster, since it avoids allocating a separate
+* digit array, making multiple passes over var2, and having separate
+* carry-propagation passes.
+*/
+   if (var1ndigits <= 2)
+   {
+   NumericDigit *res_buf;
+
+   /* Allocate result digit array */
+   res_buf = digitbuf_alloc(res_ndigits);
+   res_buf[0] = 0; /* spare digit for 
later rounding */
+   res_digits = res_buf + 1;
+
+   /*
+* Compute the result digits directly, in one pass, propagating 
the
+* carry up as we go.
+*/
+   switch (var1ndigits)
+   {
+   case 1:
+   carry = 0;
+   for (i = res_ndigits - 3; i >= 0; i--)
+   {
+   newdig = (int) var1digits[0] * 
var2digits[i] + carry;
+   res_digits[i + 1] = (NumericDigit) 
(newdig % NBASE);
+   carry = newdig / NBASE;
+   }
+   res_digits[0] = (NumericDigit) carry;
+   break;
+
+   case 2:
+   newdig = (int) var1digits[1] * 
var2digits[res_ndigits - 4];
+   res_digits[res_ndigits - 2] = (NumericDigit) 
(newdig % NBASE);
+   carry = newdig / NBASE;
+
+   for (i = res_ndigits - 4; i > 0; i--)
+   {
+   newdig = (int) var1digits[0] * 
var2digits[i] +
+   (int) var1digits[1] * 
var2digits[i - 1] + carry;
+   res_digits[i + 1] = (NumericDigit) 
(newdig % NBASE);
+   carry = newdig / NBASE;
+   }
+
+   newdig = (int) var1digits[0] * var2digits[0] + 
carry;
+   res_digits[1] = (NumericDigit) (newdig % NBASE);
+   res_digits[0] = (NumericDigit) (newdig / NBASE);
+   break;
+   }
+
+   /* Store the product in result (minus extra rounding digit) */
+   digitbuf_free(result->buf);
+   result->ndigits = res_ndigits - 1;
+   result->buf = res_buf;
+   result->digits = res_digits;
+   result->weight = res_weight - 1;
+   result->sign = res_sign;
+
+   /* Round to target rscale (and set result->dscale) */
+   round_var(result, rscale);
+
+   /* Strip leading and trailing zeroes */
+   strip_var(result);
+
+   return;
+   }
+
+   /*

Re: Relation bulk write facility

2024-07-01 Thread Heikki Linnakangas

Thanks for poking at this!

On 01/07/2024 23:52, Noah Misch wrote:

Commit 8af2565 wrote:

--- /dev/null
+++ b/src/backend/storage/smgr/bulk_write.c



+/*
+ * Finish bulk write operation.
+ *
+ * This WAL-logs and flushes any remaining pending writes to disk, and fsyncs
+ * the relation if needed.
+ */
+void
+smgr_bulk_finish(BulkWriteState *bulkstate)
+{
+   /* WAL-log and flush any remaining pages */
+   smgr_bulk_flush(bulkstate);
+
+   /*
+* When we wrote out the pages, we passed skipFsync=true to avoid the
+* overhead of registering all the writes with the checkpointer.  
Register
+* the whole relation now.
+*
+* There is one hole in that idea: If a checkpoint occurred while we 
were
+* writing the pages, it already missed fsyncing the pages we had 
written
+* before the checkpoint started.  A crash later on would replay the WAL
+* starting from the checkpoint, therefore it wouldn't replay our 
earlier
+* WAL records.  So if a checkpoint started after the bulk write, fsync
+* the files now.
+*/
+   if (!SmgrIsTemp(bulkstate->smgr))
+   {


Shouldn't this be "if (bulkstate->use_wal)"?  The GetRedoRecPtr()-based
decision is irrelevant to the !wal case.  Either we don't need fsync at all
(TEMP or UNLOGGED) or smgrDoPendingSyncs() will do it (wal_level=minimal).


The point of GetRedoRecPtr() is to detect if a checkpoint has started 
concurrently. It works for that purpose whether or not the bulk load is 
WAL-logged. It is not compared with the LSNs of WAL records written by 
the bulk load.


Unlogged tables do need to be fsync'd. The scenario is:

1. Bulk load an unlogged table.
2. Shut down Postgres cleanly
3. Pull power plug from server, and restart.

We talked about this earlier in the "Unlogged relation copy is not 
fsync'd" thread [1]. I had already forgotten about that; that bug 
actually still exists in back branches, and we should fix it..


[1] 
https://www.postgresql.org/message-id/flat/65e94fc8-ce1d-dd02-3be3-fda0fe8f2965%40iki.fi



I don't see any functional problem, but this likely arranges for an
unnecessary sync when a checkpoint starts between mdcreate() and
here.  (The mdcreate() sync may also be unnecessary, but that's
longstanding.)
Hmm, yes we might do two fsyncs() with wal_level=minimal, unnecessarily. 
It seems hard to eliminate the redundancy. smgr_bulk_finish() could skip 
the fsync, if it knew that smgrDoPendingSyncs() will do it later. 
However, smgrDoPendingSyncs() might also decide to WAL-log the relation 
instead of fsyncing it, and in that case we do still need the fsync.


Fortunately, fsync() on a file that's already flushed to disk is pretty 
cheap.


--
Heikki Linnakangas
Neon (https://neon.tech)





Re: Relation bulk write facility

2024-07-01 Thread Noah Misch
On Fri, Feb 23, 2024 at 04:27:34PM +0200, Heikki Linnakangas wrote:
> Committed this. Thanks everyone!

Commit 8af2565 wrote:
> --- /dev/null
> +++ b/src/backend/storage/smgr/bulk_write.c

> +/*
> + * Finish bulk write operation.
> + *
> + * This WAL-logs and flushes any remaining pending writes to disk, and fsyncs
> + * the relation if needed.
> + */
> +void
> +smgr_bulk_finish(BulkWriteState *bulkstate)
> +{
> + /* WAL-log and flush any remaining pages */
> + smgr_bulk_flush(bulkstate);
> +
> + /*
> +  * When we wrote out the pages, we passed skipFsync=true to avoid the
> +  * overhead of registering all the writes with the checkpointer.  Register
> +  * the whole relation now.
> +  *
> +  * There is one hole in that idea: If a checkpoint occurred while we were
> +  * writing the pages, it already missed fsyncing the pages we had written
> +  * before the checkpoint started.  A crash later on would replay the WAL
> +  * starting from the checkpoint, therefore it wouldn't replay our earlier
> +  * WAL records.  So if a checkpoint started after the bulk write, fsync
> +  * the files now.
> +  */
> + if (!SmgrIsTemp(bulkstate->smgr))
> + {

Shouldn't this be "if (bulkstate->use_wal)"?  The GetRedoRecPtr()-based
decision is irrelevant to the !wal case.  Either we don't need fsync at all
(TEMP or UNLOGGED) or smgrDoPendingSyncs() will do it (wal_level=minimal).  I
don't see any functional problem, but this likely arranges for an unnecessary
sync when a checkpoint starts between mdcreate() and here.  (The mdcreate()
sync may also be unnecessary, but that's longstanding.)

> + /*
> +  * Prevent a checkpoint from starting between the GetRedoRecPtr() and
> +  * smgrregistersync() calls.
> +  */
> + Assert((MyProc->delayChkptFlags & DELAY_CHKPT_START) == 0);
> + MyProc->delayChkptFlags |= DELAY_CHKPT_START;
> +
> + if (bulkstate->start_RedoRecPtr != GetRedoRecPtr())
> + {
> + /*
> +  * A checkpoint occurred and it didn't know about our writes, so
> +  * fsync() the relation ourselves.
> +  */
> + MyProc->delayChkptFlags &= ~DELAY_CHKPT_START;
> + smgrimmedsync(bulkstate->smgr, bulkstate->forknum);
> + elog(DEBUG1, "flushed relation because a checkpoint occurred concurrently");
> + }
> + else
> + {
> + smgrregistersync(bulkstate->smgr, bulkstate->forknum);
> + MyProc->delayChkptFlags &= ~DELAY_CHKPT_START;
> + }
> + }
> +}

This is an elegant optimization.




Re: WIP: parallel GiST index builds

2024-07-01 Thread Tomas Vondra
Hi,

I've done a number of experiments with the GiST parallel builds, both
with the sorted and unsorted cases, so let me share some of the results
and conclusions from that.

In the first post I did some benchmarks using btree_gist, but that
seemed not very realistic - there certainly are much more widely used
GiST indexes in the GIS world. So this time I used OpenStreetMap, loaded
using osm2pgsql, with two dataset sizes:

- small - "north america" (121GB without indexes)
- large - "planet" (688GB without indexes)

And then I created indexes using either gist_geometry_ops_2d (with sort)
or gist_geometry_ops_nd (no sorting).
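
For anyone wanting to reproduce something similar, the builds were of this
shape. The table/column names below follow the usual osm2pgsql schema and the
worker count is just an example, not copied from my scripts:

```sql
-- number of parallel workers requested for the build
SET max_parallel_maintenance_workers = 4;

-- sorted build (the 2D opclass provides sortsupport)
CREATE INDEX ON planet_osm_line USING gist (way gist_geometry_ops_2d);

-- unsorted build (no sortsupport), which is the case the patch parallelizes
CREATE INDEX ON planet_osm_line USING gist (way gist_geometry_ops_nd);
```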


On 6/7/24 19:41, Tomas Vondra wrote:
> Hi,
> 
> After looking into parallel builds for BRIN and GIN indexes, I was
> wondering if there's a way to do parallel builds for GiST too. I knew
> next to nothing about how GiST works, but I gave it a shot and here's
> what I have - the attached patch allows parallel GiST builds for the
> "unsorted" case (i.e. when the opclass does not include sortsupport),
> and does not support buffered builds.
> 
> 
> unsorted builds only
> 
> 
> Addressing only the unsorted case may seem a bit weird, but I did it
> this way for two reasons - parallel sort is a solved problem, and adding
> this to the patch seems quite straightforward. It's what btree does, for
> example. But I also was not very sure how common this is - we do have
> sort for points, but I have no idea if the PostGIS indexes define
> sorting etc. My guess was "no" but I've been told that's no longer true,
> so I guess sorted builds are more widely applicable than I thought.


For sorted builds, I made the claim that parallelizing sorted builds is
"solved problem" because we can use a parallel tuplesort. I was thinking
that maybe it'd be better to do that in the initial patch, and only then
introduce the more complex stuff in the unsorted case, so I gave this a
try, and it turned to be rather pointless.

Yes, parallel tuplesort does improve the duration, but it's not a very
significant improvement - maybe 10% or so. Most of the build time is
spent in gist_indexsortbuild(), so this is the part that would need to
be parallelized for any substantial improvement. Only then is it useful
to improve the tuplesort, I think.

And parallelizing gist_indexsortbuild() is not trivial - most of the
time is spent in gist_indexsortbuild_levelstate_flush() / gistSplit(),
so ISTM a successful parallel implementation would need to divide this
work between multiple workers. I don't have a clear idea how, though.

I do have a PoC/WIP patch doing the parallel tuplesort in my github
branch at [1] (and then also some ugly experiments on top of that), but
I'm not going to attach it here because of the reasons I just explained.
It'd be just a pointless distraction.

> In any case, I'm not in a rush to parallelize sorted builds. It can be
> added later, as an improvement, IMHO. In fact, it's a well isolated part
> of the patch, which might make it a good choice for someone looking for
> an idea for their first patch ...
> 

I still think this assessment is correct - it's fine to not parallelize
sorted builds. It can be improved in the future, or even not at all.

> 
> buffered builds
> ---
> 
> The lack of support for buffered builds is a very different thing. The
> basic idea is that we don't push the index entries all the way to the
> leaf pages right away, but accumulate them in buffers half-way through.
> This combines writes and reduces random I/O, which is nice.
> 
> Unfortunately, the way it's implemented does not work with parallel
> builds at all - all the state is in private memory, and it assumes the
> worker is the only possible backend that can split the page (at which
> point the buffers need to be split too, etc.). But for parallel builds
> this is obviously not true.
> 
> I'm not saying parallel builds can't do similar buffering, but it
> requires moving the buffers into shared memory, and introducing locking
> to coordinate accesses to the buffers. (Or perhaps it might be enough to
> only "notify" the workers about page splits, with buffers still in
> private memory?). Anyway, it seems far too complicated for v1.
> 
> In fact, I'm not sure the buffering is entirely necessary - maybe the
> increase in amount of RAM makes this less of an issue? If the index can
> fit into shared buffers (or at least page cache), maybe the amount of
> extra I/O is not that bad? I'm sure there may be cases really affected
> by this, but maybe it's OK to tell people to disable parallel builds in
> those cases?
> 

For unsorted builds, here's the results from one of the machines for
duration of CREATE INDEX with the requested number of workers (0 means
serial build) for different tables in the OSM databases:

  db type   size (MB)  | 0  1  2  3  4
  -|--
  small  line4889  |   811429294 

Re: Commitfest manager for July 2024

2024-07-01 Thread Kirill Reshke
> Postgres 17
18




Re: Avoid incomplete copy string (src/backend/access/transam/xlog.c)

2024-07-01 Thread Daniel Gustafsson
> On 1 Jul 2024, at 21:58, Ranier Vilela  wrote:

> We only have to replace it with strlcpy.

Thanks, I'll have a look at applying this tomorrow morning.

--
Daniel Gustafsson





Re: optimizing pg_upgrade's once-in-each-database steps

2024-07-01 Thread Nathan Bossart
On Mon, Jul 01, 2024 at 03:58:16PM -0400, Robert Haas wrote:
> On Mon, Jul 1, 2024 at 3:47 PM Nathan Bossart  
> wrote:
>> 0001 introduces a new API for registering callbacks and running them in
>> parallel on all databases in the cluster.  This new system manages a set of
>> "slots" that follow a simple state machine to asynchronously establish a
>> connection and run the queries.  It uses system() to wait for these
>> asynchronous tasks to complete.  Users of this API only need to provide two
>> callbacks: one to return the query that should be run on each database and
>> another to process the results of that query.  If multiple queries are
>> required for each database, users can provide multiple sets of callbacks.
> 
> I do really like the idea of using asynchronous communication here. It
> should be significantly cheaper than using multiple processes or
> threads, and maybe less code, too. But I'm confused about why you
> would need or want to use system() to wait for asynchronous tasks to
> complete. Wouldn't it be something like select()?

Whoops, I meant to type "select()" there.  Sorry for the confusion.

-- 
nathan




Re: optimizing pg_upgrade's once-in-each-database steps

2024-07-01 Thread Robert Haas
On Mon, Jul 1, 2024 at 3:47 PM Nathan Bossart  wrote:
> 0001 introduces a new API for registering callbacks and running them in
> parallel on all databases in the cluster.  This new system manages a set of
> "slots" that follow a simple state machine to asynchronously establish a
> connection and run the queries.  It uses system() to wait for these
> asynchronous tasks to complete.  Users of this API only need to provide two
> callbacks: one to return the query that should be run on each database and
> another to process the results of that query.  If multiple queries are
> required for each database, users can provide multiple sets of callbacks.

I do really like the idea of using asynchronous communication here. It
should be significantly cheaper than using multiple processes or
threads, and maybe less code, too. But I'm confused about why you
would need or want to use system() to wait for asynchronous tasks to
complete. Wouldn't it be something like select()?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Avoid incomplete copy string (src/backend/access/transam/xlog.c)

2024-07-01 Thread Ranier Vilela
Em seg., 1 de jul. de 2024 às 16:20, Daniel Gustafsson 
escreveu:

> > On 1 Jul 2024, at 21:15, Alvaro Herrera  wrote:
> > On 2024-Jul-01, Ranier Vilela wrote:
>
> >>> -   charname[MAXPGPATH + 1];
> >>> +   charname[MAXPGPATH];/* backup label name */
> >>>
> >>> With the introduced use of strlcpy, why do we need to change this
> field?
> >>>
> >> The part about being the only reference in the entire code that uses
> >> MAXPGPATH + 1.
> >
> > The bit I don't understand about this discussion is what will happen
> > with users that currently have exactly 1024 chars in backup names today.
> > With this change, we'll be truncating their names to 1023 chars instead.
> > Why would they feel that such change is welcome?
>
> That's precisely what I was getting at.  Maybe it makes sense to change,
> maybe
> not, but that's not for this patch to decide as that's a different
> discussion
> from using safe string copying API's.
>
Ok, this is not material for the proposal and can be considered unrelated.

We only have to replace it with strlcpy.

v7 attached.

best regards,
Ranier Vilela


v7-avoid-incomplete-copy-string-do_pg_backup_start.patch
Description: Binary data


Re: Optimize numeric multiplication for one and two base-NBASE digit multiplicands.

2024-07-01 Thread Joel Jacobson
On Mon, Jul 1, 2024, at 15:14, Joel Jacobson wrote:
> * 0001-Optimize-mul_var-for-var2ndigits-4.patch

Found a typo, fixed in new version.

The int128 version is still slower though,
I wonder if there is something that can be done to speed it up further.

Below is a more realistic benchmark than just microbenchmarking mul_var(),
not testing the int128 version, but the code for up to 2 NBASE-digits:

```
CREATE TABLE bench_mul_var (num1 numeric, num2 numeric);

INSERT INTO bench_mul_var (num1, num2)
SELECT random(0::numeric,1e8::numeric), random(0::numeric,1e8::numeric) FROM 
generate_series(1,1e8);

SELECT SUM(num1*num2) FROM bench_mul_var;
Time: 8331.953 ms (00:08.332)
Time: 7415.241 ms (00:07.415)
Time: 7298.296 ms (00:07.298)
Time: 7314.754 ms (00:07.315)
Time: 7289.560 ms (00:07.290)

SELECT SUM(numeric_mul_patched(num1,num2)) FROM bench_mul_var;
Time: 6403.426 ms (00:06.403)
Time: 6401.797 ms (00:06.402)
Time: 6366.136 ms (00:06.366)
Time: 6376.049 ms (00:06.376)
Time: 6317.282 ms (00:06.317)
```

Benchmarked on a Intel Core i9-14900K machine.

/Joel

v2-0001-Optimize-mul_var-for-var2ndigits-4.patch
Description: Binary data


Re: Avoid incomplete copy string (src/backend/access/transam/xlog.c)

2024-07-01 Thread Ranier Vilela
Em seg., 1 de jul. de 2024 às 16:15, Alvaro Herrera 
escreveu:

> On 2024-Jul-01, Ranier Vilela wrote:
>
> > > -   charname[MAXPGPATH + 1];
> > > +   charname[MAXPGPATH];/* backup label name */
> > >
> > > With the introduced use of strlcpy, why do we need to change this
> field?
> > >
> > The part about being the only reference in the entire code that uses
> > MAXPGPATH + 1.
>
> The bit I don't understand about this discussion is what will happen
> with users that currently have exactly 1024 chars in backup names today.
> With this change, we'll be truncating their names to 1023 chars instead.
> Why would they feel that such change is welcome?
>
Yes of course, I understand.
This can be a problem for users.

best regards,
Ranier Vilela


Re: optimizing pg_upgrade's once-in-each-database steps

2024-07-01 Thread Nathan Bossart
I figured I'd post what I have so far since this thread hasn't been updated
in a while.  The attached patches are still "proof-of-concept grade," but
they are at least moving in the right direction (IMHO).  The variable
naming is still not great, and they are woefully undercommented, among
other things.

0001 introduces a new API for registering callbacks and running them in
parallel on all databases in the cluster.  This new system manages a set of
"slots" that follow a simple state machine to asynchronously establish a
connection and run the queries.  It uses system() to wait for these
asynchronous tasks to complete.  Users of this API only need to provide two
callbacks: one to return the query that should be run on each database and
another to process the results of that query.  If multiple queries are
required for each database, users can provide multiple sets of callbacks.
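
To make the select()-based idea concrete, here is a freestanding sketch (my
naming, not the patch's API) that drives a single libpq connection
asynchronously: establish the connection, send one query, and wait on the
socket instead of blocking. The patch runs many such "slots", one per
database, through the same kind of state machine.

```c
/* Build with: cc sketch.c -lpq (plus -I/-L for your libpq install). */
#include <stdio.h>
#include <sys/select.h>
#include <libpq-fe.h>

/* Block until the connection's socket is readable (or writable). */
static void
wait_on_socket(PGconn *conn, int for_write)
{
	fd_set		set;
	int			sock = PQsocket(conn);

	FD_ZERO(&set);
	FD_SET(sock, &set);
	select(sock + 1, for_write ? NULL : &set, for_write ? &set : NULL, NULL, NULL);
}

int
main(void)
{
	PGconn	   *conn = PQconnectStart("dbname=postgres");
	PostgresPollingStatusType pstat;
	PGresult   *res;

	if (conn == NULL || PQstatus(conn) == CONNECTION_BAD)
	{
		fprintf(stderr, "connection start failed\n");
		return 1;
	}

	/* "connecting" state: poll until the connection is established */
	while ((pstat = PQconnectPoll(conn)) != PGRES_POLLING_OK)
	{
		if (pstat == PGRES_POLLING_FAILED)
		{
			fprintf(stderr, "%s", PQerrorMessage(conn));
			return 1;
		}
		wait_on_socket(conn, pstat == PGRES_POLLING_WRITING);
	}

	/* "running" state: send the query, then wait for the result */
	if (!PQsendQuery(conn, "SELECT count(*) FROM pg_class"))
	{
		fprintf(stderr, "%s", PQerrorMessage(conn));
		return 1;
	}
	while (PQisBusy(conn))
	{
		wait_on_socket(conn, 0);
		if (!PQconsumeInput(conn))
		{
			fprintf(stderr, "%s", PQerrorMessage(conn));
			return 1;
		}
	}

	/* process the result, as the second callback would */
	while ((res = PQgetResult(conn)) != NULL)
	{
		if (PQresultStatus(res) == PGRES_TUPLES_OK)
			printf("pg_class has %s rows\n", PQgetvalue(res, 0, 0));
		PQclear(res);
	}

	PQfinish(conn);
	return 0;
}
```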

The other patches change several of the existing tasks to use this new API.
With these patches applied, I see the following differences in the output
of 'pg_upgrade | ts -i' for a cluster with 1k empty databases:

WITHOUT PATCH

00:00:19 Checking database user is the install user
ok
00:00:02 Checking for subscription state   
ok
00:00:06 Adding ".old" suffix to old global/pg_control 
ok
00:00:04 Checking for extension updates
ok

WITH PATCHES (--jobs 1)

00:00:10 Checking database user is the install user
ok
00:00:02 Checking for subscription state   
ok
00:00:07 Adding ".old" suffix to old global/pg_control 
ok
00:00:05 Checking for extension updates
ok

WITH PATCHES (--jobs 4)

00:00:06 Checking database user is the install user
ok
00:00:00 Checking for subscription state   
ok
00:00:02 Adding ".old" suffix to old global/pg_control 
ok
00:00:01 Checking for extension updates
ok

Note that the "Checking database user is the install user" time also
includes the call to get_db_rel_and_slot_infos() on the old cluster as well
as the call to get_loadable_libraries() on the old cluster.  I believe the
improvement with the patches with just one job is due to the consolidation
of the queries into one database connection (presently,
get_db_rel_and_slot_infos() creates 3 connections per database for some
upgrades).  Similarly, the "Adding \".old\" suffix to old
global/pg_control" time includes the call to get_db_rel_and_slot_infos() on
the new cluster.

There are several remaining places where we could use this new API to speed
up upgrades.  For example, I haven't attempted to use it for the data type
checks yet, and that tends to eat up a sizable chunk of time when there are
many databases.

On Thu, May 16, 2024 at 08:24:08PM -0500, Nathan Bossart wrote:
> On Thu, May 16, 2024 at 05:09:55PM -0700, Jeff Davis wrote:
>> Also, did you consider connecting once to each database and running
>> many queries? Most of those seem like just checks.
> 
> This was the idea behind 347758b.  It may be possible to do more along
> these lines.  IMO parallelizing will still be useful even if we do combine
> more of the steps.

My current thinking is that any possible further consolidation should
happen as part of a follow-up effort to parallelization.  I'm cautiously
optimistic that the parallelization work will make the consolidation easier
since it moves things to rigidly-defined callback functions.

A separate piece of off-list feedback from Michael Paquier is that this new
parallel system might be something we can teach the ParallelSlot code used
by bin/scripts/ to do.  I've yet to look too deeply into this, but I
suspect that it will be difficult to combine the two.  For example, the
ParallelSlot system doesn't seem well-suited for the kind of
run-once-in-each-database tasks required by pg_upgrade, and the error
handling is probably a little different, too.  However, it's still worth a
closer look, and I'm interested in folks' opinions on the subject.

-- 
nathan
>From d7683a095d4d2c1574005eb41504a5be256d6480 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Fri, 28 Jun 2024 11:02:44 -0500
Subject: [PATCH v2 1/6] introduce framework for parallelizing pg_upgrade tasks

---
 src/bin/pg_upgrade/Makefile  |   1 +
 src/bin/pg_upgrade/async.c   | 323 +++
 src/bin/pg_upgrade/meson.build   |   1 +
 src/bin/pg_upgrade/pg_upgrade.h  |  16 ++
 src/tools/pgindent/typedefs.list |   4 +
 5 files changed, 345 insertions(+)
 create mode 100644 src/bin/pg_upgrade/async.c

diff --git a/src/bin/pg_upgrade/Makefile b/src/bin/pg_upgrade/Makefile
index bde91e2beb..3bc4f5d740 100644
--- a/src/bin/pg_upgrade/Makefile
+++ 

Re: Built-in CTYPE provider

2024-07-01 Thread Jeff Davis
On Sat, 2024-06-29 at 15:08 -0700, Noah Misch wrote:
> lower(), initcap(), upper(), and regexp_matches() are
> PROVOLATILE_IMMUTABLE.
> Until now, we've delegated that responsibility to the user.  The user
> is
> supposed to somehow never update libc or ICU in a way that changes
> outcomes
> from these functions.

To me, "delegated" connotes a clear and organized transfer of
responsibility to the right person to solve it. In that sense, I
disagree that we've delegated it.

What's happened here is an evolution of various choices that seemed
reasonable at the time. Unfortunately, the consequences are hard
for us to manage and even harder for users to manage themselves.

>   Now that postgresql.org is taking that responsibility
> for builtin C.UTF-8, how should we govern it?  I think the above text
> and [1]
> convey that we'll update the Unicode data between major versions,
> making
> functions like lower() effectively STABLE.  Is that right?

Marking them STABLE is not a viable option; that would break a lot of
valid use cases, e.g. an index on LOWER().

Unicode already has its own governance, including a stability policy
that includes case mapping:

https://www.unicode.org/policies/stability_policy.html#Case_Pair

Granted, that policy does not guarantee that the results will never
change. In particular, the results can change if using unassigned code
points that are later assigned to Cased characters.

That's not terribly common though; for instance, there are zero changes
in uppercase/lowercase behavior between Unicode 14.0 (2021) and 15.1
(current) -- even for code points that were unassigned in 14.0 and
later assigned. I checked this by modifying case_test.c to look at
unassigned code points as well.

There's a greater chance that character properties can change (e.g.
whether a character is "alphabetic" or not) in new releases of Unicode.
Such properties can affect regex character classifications, and in some
cases the results of initcap (because it uses the "alphanumeric"
classification to determine word boundaries).

I don't think we need code changes for 17. Some documentation changes
might be helpful, though. Should we have a note around LOWER()/UPPER()
that users should REINDEX any dependent indexes when the provider is
updated?
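
To make that note concrete, a minimal sketch of the kind of dependent object
involved and the corresponding maintenance step (the object names here are
made up for illustration):

```sql
-- An expression index whose contents depend on the provider's case mapping:
CREATE TABLE users (email text);
CREATE INDEX users_email_lower_idx ON users (lower(email));

-- After an upgrade that changes the provider's Unicode version,
-- rebuild any such indexes:
REINDEX INDEX users_email_lower_idx;
```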

> (This thread had some discussion[2] that datcollversion/collversion
> won't
> necessarily change when a major versions changes lower() behavior.)

datcollversion/collversion track the version of the collation
specifically (text ordering only), not the ctype (character semantics).
When using the libc provider, get_collation_actual_version() completely
ignores the ctype.

It would be interesting to consider tracking the versions separately,
though.

Regards,
Jeff Davis





Re: Avoid incomplete copy string (src/backend/access/transam/xlog.c)

2024-07-01 Thread Daniel Gustafsson
> On 1 Jul 2024, at 21:15, Alvaro Herrera  wrote:
> On 2024-Jul-01, Ranier Vilela wrote:

>>> -   charname[MAXPGPATH + 1];
>>> +   charname[MAXPGPATH];/* backup label name */
>>> 
>>> With the introduced use of strlcpy, why do we need to change this field?
>>> 
>> The part about being the only reference in the entire code that uses
>> MAXPGPATH + 1.
> 
> The bit I don't understand about this discussion is what will happen
> with users that currently have exactly 1024 chars in backup names today.
> With this change, we'll be truncating their names to 1023 chars instead.
> Why would they feel that such change is welcome?

That's precisely what I was getting at.  Maybe it makes sense to change, maybe
not, but that's not for this patch to decide as that's a different discussion
from using safe string copying API's.

--
Daniel Gustafsson





Re: Avoid incomplete copy string (src/backend/access/transam/xlog.c)

2024-07-01 Thread Alvaro Herrera
On 2024-Jul-01, Ranier Vilela wrote:

> > -   charname[MAXPGPATH + 1];
> > +   charname[MAXPGPATH];/* backup label name */
> >
> > With the introduced use of strlcpy, why do we need to change this field?
> >
> The part about being the only reference in the entire code that uses
> MAXPGPATH + 1.

The bit I don't understand about this discussion is what will happen
with users that currently have exactly 1024 chars in backup names today.
With this change, we'll be truncating their names to 1023 chars instead.
Why would they feel that such change is welcome?

-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/




Re: LogwrtResult contended spinlock

2024-07-01 Thread Alvaro Herrera
Hello,

Thanks for your attention here.

On 2024-Jul-01, Andres Freund wrote:

> On 2024-07-01 11:10:24 +0200, Alvaro Herrera wrote:
> > In the meantime I noticed that pg_attribute_aligned() is not supported
> > in every platform/compiler, so for safety sake I think it's better to go
> > with what we do for PGAlignedBlock: use a union with a double member.
> > That should be 8-byte aligned on x86 as well, unless I misunderstand.
> 
> If a platform wants to support 8 byte atomics, it better provides a way to
> make variables 8 bytes aligned. We already rely on that, actually. See use of
> pg_attribute_aligned in e.g. src/include/port/atomics/generic-msvc.h

Well, pg_atomic_uint64 is a struct. Passing pointers to it is fine,
which is what non-platform-specific code does; but because the
declaration of the type is in each platform-specific file, it might not
work to use it directly in generic code.  I didn't actually try, but it
seems a bit of a layering violation.  (I didn't find any place where
the struct is used that way.)

If that works, then I think we could simply declare currval as a
pg_atomic_uint64 and it'd be prettier.

> > +   /*
> > +* On 32-bit machines, declaring a bare uint64 variable doesn't promise
> > +* the alignment we need, so coerce the compiler this way.
> > +*/
> > +   union
> > +   {
> > +   uint64  u64;
> > +   double  force_align_d;
> > +   }   currval;
> 
> I wonder if we should just relax the alignment requirement for currval. It's
> crucial that the pointer is atomically aligned (atomic ops across pages are
> either forbidden or extremely slow), but it's far from obvious that it's
> crucial for comparator value to be aligned.

I'm pretty sure the Microsoft docs I linked to are saying it must be
aligned.

> >  #ifndef PG_HAVE_ATOMIC_U64_SIMULATION
> > AssertPointerAlignment(ptr, 8);
> >  #endif
> 
> What's the point of this assert, btw? This stuff is already asserted in lower
> level routines, so it just seems redundant to have it here?

There are in some of them, but not in pg_atomic_compare_exchange_u64_impl.


> > -   return Max(target_, currval);
> > +   return Max(target_, currval.u64);
> 
> What does the Max() actually achieve here? Shouldn't it be impossible to reach
> this with  currval < target_?

When two processes are hitting the cmpxchg concurrently, we need to
return the highest value that was written, even if it was the other
process that did it.  The assertions in the calling site are quickly
broken if we don't do this.  I admit this aspect took me by surprise.
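
To spell that out, here is a freestanding analog of the loop using GCC/Clang
atomic builtins (a sketch, not the real pg_atomic code). Note how a failed
compare-exchange reloads the other process's larger value, which is what the
caller must get back:

```c
#include <stdint.h>
#include <stdio.h>

/* Advance *ptr to target if target is larger; return the final value. */
uint64_t
monotonic_advance(uint64_t *ptr, uint64_t target)
{
	uint64_t	curr = __atomic_load_n(ptr, __ATOMIC_SEQ_CST);

	while (curr < target)
	{
		/* On failure, curr is refreshed with the value another process wrote. */
		if (__atomic_compare_exchange_n(ptr, &curr, target, false,
										__ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST))
			return target;
	}

	/*
	 * Another process advanced *ptr past our target; return its value
	 * (the equivalent of Max(target_, currval)), so the caller's
	 * "never goes backwards" assumption holds.
	 */
	return curr;
}

int
main(void)
{
	uint64_t	shared = 100;

	printf("%llu\n", (unsigned long long) monotonic_advance(&shared, 200));	/* 200 */
	printf("%llu\n", (unsigned long long) monotonic_advance(&shared, 150));	/* still 200 */
	return 0;
}
```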

> And why does target_ end in an underscore?

Heh, you tell me -- I just copied the style of the functions just above.

-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/
"No nos atrevemos a muchas cosas porque son difíciles,
pero son difíciles porque no nos atrevemos a hacerlas" (Séneca)




Re: Remove last traces of HPPA support

2024-07-01 Thread Tom Lane
Noah Misch  writes:
> On Sat, Oct 21, 2023 at 02:18:19AM -0400, Tom Lane wrote:
>> Andres Freund  writes:
>>> IMO a single person looking at HPPA code for a few minutes is a cost that 
>>> more
>>> than outweighs the potential benefits of continuing "supporting" this dead
>>> arch. Even code that doesn't need to change has costs, particularly if it's
>>> intermingled with actually important code (which spinlocks certainly are).

>> Yup, that.  It's not zero cost to carry this stuff.

> +1 for dropping it.

Done at commit edadeb0710.

regards, tom lane




Commitfest manager for July 2024

2024-07-01 Thread Andrey M. Borodin
Hello everyone!

As Michael noted in e26810d01d4 [0] hacking for Postgres 17 has begun.
I’ve skimmed through hackers@ list. And it seems that so far role of the 
commitfest manager is vacant.

Is anyone up for doing this community work? Make things moving :)


Best regards, Andrey Borodin.

[0] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=e26810d01d4
[1] https://commitfest.postgresql.org/48/



Re: Avoid incomplete copy string (src/backend/access/transam/xlog.c)

2024-07-01 Thread Ranier Vilela
Em seg., 1 de jul. de 2024 às 14:35, Ranier Vilela 
escreveu:

> Em seg., 1 de jul. de 2024 às 06:20, Daniel Gustafsson 
> escreveu:
>
>> > On 27 Jun 2024, at 13:50, Ranier Vilela  wrote:
>>
>> > Now with file patch really attached.
>>
>> -   if (strlen(backupidstr) > MAXPGPATH)
>> +   if (strlcpy(state->name, backupidstr, sizeof(state->name)) >=
>> sizeof(state->name))
>> ereport(ERROR,
>>
>> Stylistic nit perhaps, I would keep the strlen check here and just
>> replace the
>> memcpy with strlcpy.  Using strlen in the error message check makes the
>> code
>> more readable.
>>
> This is not performance-critical code, so I see no problem using strlen,
> for the sake of readability.
>
>
>>
>> -   charname[MAXPGPATH + 1];
>> +   charname[MAXPGPATH];/* backup label name */
>>
>> With the introduced use of strlcpy, why do we need to change this field?
>>
> The part about being the only reference in the entire code that uses
> MAXPGPATH + 1.
> MAXPGPATH is defined as 1024, so MAXPGPATH +1 is 1025.
> I think this hurts the calculation of the array index,
> preventing power two optimization.
>
> Another argument is that all other paths have a 1023 size limit,
> I don't see why the backup label would have to be different.
>
> New version patch attached.
>
Sorry for v5, I forgot to update the patch and it was an error.

best regards,
Ranier Vilela


v6-avoid-incomplete-copy-string-do_pg_backup_start.patch
Description: Binary data


Re: Avoid incomplete copy string (src/backend/access/transam/xlog.c)

2024-07-01 Thread Ranier Vilela
Em seg., 1 de jul. de 2024 às 06:20, Daniel Gustafsson 
escreveu:

> > On 27 Jun 2024, at 13:50, Ranier Vilela  wrote:
>
> > Now with file patch really attached.
>
> -   if (strlen(backupidstr) > MAXPGPATH)
> +   if (strlcpy(state->name, backupidstr, sizeof(state->name)) >=
> sizeof(state->name))
> ereport(ERROR,
>
> Stylistic nit perhaps, I would keep the strlen check here and just replace
> the
> memcpy with strlcpy.  Using strlen in the error message check makes the
> code
> more readable.
>
This is not performance-critical code, so I see no problem using strlen,
for the sake of readability.


>
> -   charname[MAXPGPATH + 1];
> +   charname[MAXPGPATH];/* backup label name */
>
> With the introduced use of strlcpy, why do we need to change this field?
>
The part about being the only reference in the entire code that uses
MAXPGPATH + 1.
MAXPGPATH is defined as 1024, so MAXPGPATH + 1 is 1025.
I think this hurts the calculation of the array index,
preventing a power-of-two optimization.

Another argument is that all other paths have a 1023 size limit,
I don't see why the backup label would have to be different.

New version patch attached.

best regards,
Ranier Vilela
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index d36272ab4f..bfb500aa54 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8742,9 +8742,9 @@ do_pg_backup_start(const char *backupidstr, bool fast, List **tablespaces,
 		ereport(ERROR,
 (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
  errmsg("backup label too long (max %d bytes)",
-		MAXPGPATH)));
+		MAXPGPATH - 1)));
 
-	memcpy(state->name, backupidstr, strlen(backupidstr));
+	(void) strlcpy(state->name, backupidstr, MAXPGPATH);
 
 	/*
 	 * Mark backup active in shared memory.  We must do full-page WAL writes
diff --git a/src/include/access/xlogbackup.h b/src/include/access/xlogbackup.h
index c30d4a5991..a52345850e 100644
--- a/src/include/access/xlogbackup.h
+++ b/src/include/access/xlogbackup.h
@@ -21,8 +21,7 @@
 typedef struct BackupState
 {
 	/* Fields saved at backup start */
-	/* Backup label name one extra byte for null-termination */
-	char		name[MAXPGPATH + 1];
+	char		name[MAXPGPATH];/* backup label name */
 	XLogRecPtr	startpoint;		/* backup start WAL location */
 	TimeLineID	starttli;		/* backup start TLI */
 	XLogRecPtr	checkpointloc;	/* last checkpoint location */


Re: LogwrtResult contended spinlock

2024-07-01 Thread Andres Freund
Hi,

On 2024-07-01 11:10:24 +0200, Alvaro Herrera wrote:
> In the meantime I noticed that pg_attribute_aligned() is not supported
> in every platform/compiler, so for safety sake I think it's better to go
> with what we do for PGAlignedBlock: use a union with a double member.
> That should be 8-byte aligned on x86 as well, unless I misunderstand.

If a platform wants to support 8 byte atomics, it better provides a way to
make variables 8 bytes aligned. We already rely on that, actually. See use of
pg_attribute_aligned in e.g. src/include/port/atomics/generic-msvc.h



> From 9d240e90f87bf8b53bd5d92b623e33419db64717 Mon Sep 17 00:00:00 2001
> From: Alvaro Herrera 
> Date: Mon, 1 Jul 2024 10:41:06 +0200
> Subject: [PATCH v2] Fix alignment of variable in
>  pg_atomic_monotonic_advance_u64
> 
> Reported-by: Alexander Lakhin 
> Discussion: https://postgr.es/m/36796438-a718-cf9b-2071-b2c1b947c...@gmail.com
> ---
>  src/include/port/atomics.h | 24 
>  1 file changed, 16 insertions(+), 8 deletions(-)
> 
> diff --git a/src/include/port/atomics.h b/src/include/port/atomics.h
> index 78987f3154..964732e660 100644
> --- a/src/include/port/atomics.h
> +++ b/src/include/port/atomics.h
> @@ -580,30 +580,38 @@ pg_atomic_sub_fetch_u64(volatile pg_atomic_uint64 *ptr, 
> int64 sub_)
>  static inline uint64
>  pg_atomic_monotonic_advance_u64(volatile pg_atomic_uint64 *ptr, uint64 
> target_)
>  {
> - uint64  currval;
> + /*
> +  * On 32-bit machines, declaring a bare uint64 variable doesn't promise
> +  * the alignment we need, so coerce the compiler this way.
> +  */
> + union
> + {
> + uint64  u64;
> + double  force_align_d;
> + }   currval;

I wonder if we should just relax the alignment requirement for currval. It's
crucial that the pointer is atomically aligned (atomic ops across pages are
either forbidden or extremely slow), but it's far from obvious that it's
crucial for comparator value to be aligned.


>  #ifndef PG_HAVE_ATOMIC_U64_SIMULATION
>   AssertPointerAlignment(ptr, 8);
>  #endif

What's the point of this assert, btw? This stuff is already asserted in lower
level routines, so it just seems redundant to have it here?


> - currval = pg_atomic_read_u64_impl(ptr);
> - if (currval >= target_)
> + currval.u64 = pg_atomic_read_u64_impl(ptr);
> + if (currval.u64 >= target_)
>   {
>   pg_memory_barrier();
> - return currval;
> + return currval.u64;
>   }
>  
>  #ifndef PG_HAVE_ATOMIC_U64_SIMULATION
> - AssertPointerAlignment(&currval, 8);
> + AssertPointerAlignment(&currval.u64, 8);
>  #endif
>  
> - while (currval < target_)
> + while (currval.u64 < target_)
>   {
> - if (pg_atomic_compare_exchange_u64_impl(ptr, , target_))
> + if (pg_atomic_compare_exchange_u64_impl(ptr, , 
> target_))
>   break;
>   }
>  
> - return Max(target_, currval);
> + return Max(target_, currval.u64);

What does the Max() actually achieve here? Shouldn't it be impossible to reach
this with  currval < target_?

And why does target_ end in an underscore?

Greetings,

Andres Freund




Re: call for applications: mentoring program for code contributors

2024-07-01 Thread Robert Haas
On Thu, Jun 20, 2024 at 1:12 PM Robert Haas  wrote:
> I'm working to start a mentoring program where code contributors can
> be mentored by current committers. Applications are now open:
> https://forms.gle/dgjmdxtHYXCSg6aB7

Applications are now closed. Initially, I had imagined just keeping
this form more or less indefinitely, but that looks clearly
impractical at this point, so what I'm going to do instead is create a
new form at some future point TBD and repeat this process, taking into
account what needs we have at that time. Part of the reason it seems
impractical to keep the form open is because a significant percentage
of applications are from people who have posted a total of zero (0)
emails to pgsql-hackers, and I don't want to waste my time or that of
other committers by replying to such inquiries one by one. Hence, the
form is closed for now, but with the intention of having a new one at
some point when the time seems opportune. That will also give people
who did not find a match this time an opportunity to resubmit if
they're still interested.

Matching is largely complete at this point. I expect to send emails to
all applicants letting them know what happened with their application
soon, hopefully tomorrow (my time). In preparation for that, allow me
to say that I'm very pleased with the number of acceptances that I
anticipate being able to extend. Some committers ended up deciding to
take two mentees, which is really great. More details on that soon.
Nonetheless, I am sure that those who did not find a mentor for one
reason or another will be disappointed. I hope that no one will be so
disappointed that they give up on hacking on PostgreSQL. Remember, if
you didn't get matched to a mentor, you're no worse off than you were
before, and your work on PostgreSQL is no less valuable than it was
before! I am also hoping to start up something to provide some more
limited support to people who didn't match to a mentor, and I'll tell
you more about that when and if I have more to say.

Thanks,

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: LogwrtResult contended spinlock

2024-07-01 Thread Alvaro Herrera
On 2024-Jul-01, Tom Lane wrote:

> Alvaro Herrera  writes:
> > Maybe we can do something like this,
> 
> > +#if MAXIMUM_ALIGNOF >= 8
> > uint64  currval;
> 
> This should probably be testing the alignment of int64 specifically,
> rather than assuming that MAXIMUM_ALIGNOF applies to it.  At least
> historically, there have been platforms where MAXIMUM_ALIGNOF is
> determined by float quantities and integer quantities are different.

OK, so it's
#if ALIGNOF_LONG_LONG_INT >= 8

Alexander, can you please confirm whether this works for you?

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"La virtud es el justo medio entre dos defectos" (Aristóteles)
>From 65e9e813859fec436805b1ada7be9339f8ecd63e Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Mon, 1 Jul 2024 16:44:22 +0200
Subject: [PATCH v4] Fix alignment in pg_atomic_monotonic_advance_u64

Unsurprisingly, it's possible for Windows x86 to have 8-byte atomics but
4-byte-aligned long long ints.  This breaks an assertion in recently
introduced pg_atomic_monotonic_advance_u64 with commit f3ff7bf83bce.
Try to fix that blindly by adding pg_attribute_aligned(8).  Unfortunately
not all compilers support that, so it needs to be protected by #ifdef;
fortunately not all compilers _need_ it.  Hopefully the combinations we
now support cover all interesting cases.

Reported-by: Alexander Lakhin 
Reviewed-by: Tom Lane 
Discussion: https://postgr.es/m/36796438-a718-cf9b-2071-b2c1b947c...@gmail.com
---
 src/include/port/atomics.h | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/src/include/port/atomics.h b/src/include/port/atomics.h
index 78987f3154..eeb47dee30 100644
--- a/src/include/port/atomics.h
+++ b/src/include/port/atomics.h
@@ -580,7 +580,21 @@ pg_atomic_sub_fetch_u64(volatile pg_atomic_uint64 *ptr, int64 sub_)
 static inline uint64
 pg_atomic_monotonic_advance_u64(volatile pg_atomic_uint64 *ptr, uint64 target_)
 {
+	/*
+	 * When using actual (not simulated) atomics, the target variable for
+	 * pg_atomic_compare_exchange_u64 must have suitable alignment.  This
+	 * occurs naturally on platforms where long long int is 8-byte aligned. On
+	 * others, we must explicitly coerce the compiler into applying suitable
+	 * alignment, or abort the compile if we cannot.  When using simulated
+	 * atomics, the alignment doesn't matter.
+	 */
+#if ALIGNOF_LONG_LONG_INT >= 8 || defined(PG_HAVE_ATOMIC_U64_SIMULATION)
 	uint64		currval;
+#elif defined(pg_attribute_aligned)
+	pg_attribute_aligned(8) uint64 currval;
+#else
+#error "Must have pg_attribute_aligned or simulated atomics on this platform"
+#endif
 
 #ifndef PG_HAVE_ATOMIC_U64_SIMULATION
 	AssertPointerAlignment(ptr, 8);
-- 
2.39.2



Re: pg_sequence_last_value() for unlogged sequences on standbys

2024-07-01 Thread Nathan Bossart
On Thu, May 16, 2024 at 08:33:35PM -0500, Nathan Bossart wrote:
> Here is a rebased version of 0002, which I intend to commit once v18
> development begins.

Committed.

-- 
nathan




Re: Backporting BackgroundPsql

2024-07-01 Thread Heikki Linnakangas

On 01/07/2024 17:11, Pavan Deolasee wrote:

Hi Daniel,

On Mon, Jul 1, 2024 at 1:09 PM Daniel Gustafsson wrote:

 > On 29 Jun 2024, at 06:38, Pavan Deolasee wrote:

 > Don't we need to add install and uninstall rules for the new module,
 > like we did in
 > https://git.postgresql.org/pg/commitdiff/a4c17c86176cfa712f541b81b2a026ae054b275e
 > and https://git.postgresql.org/pg/commitdiff/7039c7cff6736780c3bbb41a90a6dfea0f581ad2 ?

That's correct, we should backport those as well.

Thanks for confirming. Attaching patches for PG15 and PG14, but this
will need backporting all the way up to PG12.


Thanks! Pushed to v12-v15.

--
Heikki Linnakangas
Neon (https://neon.tech)





Re: Restart pg_usleep when interrupted

2024-07-01 Thread Sami Imseih

> 
>> Therefore, rather than "improving" pg_usleep (and uglifying its API),
>> the right answer is to fix parallel vacuum leaders to not depend on
>> pg_usleep in the first place.  A better idea might be to use
>> pg_sleep() or equivalent code.
> 
> Yes, that is a good idea to explore and it will not require introducing
> an awkward new API. I will look into using something similar to
> pg_sleep.


Looking through the history of the sleep in vacuum_delay_point, commit
720de00af49 replaced WaitLatch with pg_usleep to allow for microsecond
sleep precision [1]. 

Thomas has proposed a WaitLatchUs implementation in [2], but I have not
yet tried it. 

So I see there are 2 possible options here to deal with the interrupt of a 
parallel vacuum leader when a message is sent by a parallel vacuum worker. 

Option 1/ something like my initial proposal which is
to create a function similar to pg_usleep that is able to deal with
interrupts in a sleep. This could be a function scoped only to vacuum.c,
so it can only be used for vacuum delay purposes.
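
The core of such a helper is just a sleep that restarts with the remaining
time when it is interrupted. A minimal sketch in plain C (not the actual
vacuum.c code, and omitting the interrupt handling and delay accounting a
real helper would need):

```c
#include <errno.h>
#include <time.h>

/*
 * Sleep for the requested number of microseconds, resuming with the
 * remaining time if the sleep is interrupted by a signal.
 */
void
sleep_full_usec(long microsec)
{
	struct timespec req;
	struct timespec rem;

	req.tv_sec = microsec / 1000000L;
	req.tv_nsec = (microsec % 1000000L) * 1000;

	while (nanosleep(&req, &rem) == -1 && errno == EINTR)
		req = rem;				/* restart with whatever time is left */
}

int
main(void)
{
	sleep_full_usec(100000);	/* sleep 100ms, surviving signal interruptions */
	return 0;
}
```
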

—— 
Option 2/ to explore the WaitLatchUs implementation by
Thomas, which will give us both a latch-based implementation
for the sleep and microsecond precision.

It is worth mentioning that if we do end up using WaitLatch(Us) inside
vacuum_delay_point, it will need to set only WL_TIMEOUT and 
WL_EXIT_ON_PM_DEATH.

i.e.
(void) WaitLatch(MyLatch, WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                 msec,
                 WAIT_EVENT_VACUUM_DELAY);

This way it is not interrupted by a WL_LATCH_SET when a message
is sent by a parallel worker.

——

Ultimately, I think option 2 may be worth a closer look, as it is a cleaner
and safer approach that can also detect postmaster death.


Thoughts?


[1] 
https://postgr.es/m/CAAKRu_b-q0hXCBUCAATh0Z4Zi6UkiC0k2DFgoD3nC-r3SkR3tg%40mail.gmail.com
[2] 
https://www.postgresql.org/message-id/CA%2BhUKGKVbJE59JkwnUj5XMY%2B-rzcTFciV9vVC7i%3DLUfWPds8Xw%40mail.gmail.com
 

Re: Wrong security context for deferred triggers?

2024-07-01 Thread Laurenz Albe
On Sat, 2024-06-22 at 17:50 -0400, Joseph Koshakow wrote:
> On Mon, Jun 10, 2024 at 1:00 PM Laurenz Albe  wrote:
> > Like you, I was surprised by the current behavior.  There is a design
> > principle that PostgreSQL tries to follow, called the "Principle of
> > least astonishment".  Things should behave like a moderately skilled
> > user would expect them to.  In my opinion, the current behavior violates
> > that principle.  Tomas seems to agree with that point of view.
> 
> I worry that both approaches violate this principle in different ways.
> For example consider the following sequence of events:
> 
>     SET ROLE r1;
>     BEGIN;
>     SET CONSTRAINTS ALL DEFERRED;
>     INSERT INTO ...;
>     SET ROLE r2;
>     SET search_path = '...';
>     COMMIT;
> 
> I think that it would be reasonable to expect that the triggers execute
> with r2 and not r1, since the triggers were explicitly deferred and the
> role was explicitly set. It would likely be surprising that the search
> path was updated for the trigger but not the role. With your proposed
> approach it would be impossible for someone to trigger a trigger with
> one role and execute it with another, if that's a desirable feature.

I definitely see your point, although GUC settings and the current
security context are something different.

It would definitely not be viable to put all GUC values in the trigger
state.

So if you say "all or nothing", it would be nothing, and the patch should
be rejected.

> > I didn't find this strange behavior myself: it was one of our customers
> > who uses security definer functions for data modifications and has
> > problems with the current behavior, and I am trying to improve the
> > situation on their behalf.
> 
> Would it be possible to share more details about this use case? For
> example, What are their current problems? Are they not able to set
> constraints to immediate? Or can they update the trigger function
> itself be a security definer function? That might help illuminate why
> the current behavior is wrong.

I asked them for a statement, and they were nice enough to write up
https://postgr.es/m/e89e8dd9-7143-4db8-ac19-b2951cb0c0da%40gmail.com

They have a workaround, so the patch is not absolutely necessary for them.

> 
> I also took a look at the code. It doesn't apply cleanly to master, so
> I took the liberty of rebasing and attaching it.
> 
> > + /*
> > + * The role could have been dropped since the trigger was queued.
> > + * In that case, give up and error out.
> > + */
> > + pfree(GetUserNameFromId(evtshared->ats_rolid, false));
> 
> It feels a bit wasteful to allocate and copy the role name when we
> never actually use it. Is it possible to check that the role exists
> without copying the name?

If that is a concern (and I can envision it to be), I can improve that.
One option is to copy the guts of GetUserNameFromId(), and another
is to factor out the common parts into a new function.

I'd wait until we have a decision whether we want the patch or not
before I make the effort, if that's ok.

> Everything else looked good, and the code does what it says it will.

Thanks for the review!

Yours,
Laurenz Albe




Windows perl/tcl requirement documentation

2024-07-01 Thread Andrew Dunstan


Our docs currently state this regarding the perl requirement for 
building on Windows:



ActiveState Perl

   ActiveState Perl is required to run the build generation scripts.
   MinGW or Cygwin Perl will not work. It must also be present in the
   PATH. Binaries can be downloaded from https://www.activestate.com
    (Note: version 5.14 or later is
   required, the free Standard Distribution is sufficient).


This really hasn't been a true requirement for quite some time. I 
stopped using AS perl quite a few years ago due to possible licensing 
issues, and have been building with MSVC using Strawberry Perl ever 
since. Andres recently complained that Strawberry was somewhat out of 
date, but that's no longer really the case - it's on 5.38.2, which is 
the latest in that series, and perl 5.40.0 was only releases a few weeks 
ago.


We recommend AS Tcl/Tk, which I have not used, but which I am wary of 
for the same reasons. There is a BSD-licensed, up-to-date Windows binary 
distribution called Magicsplat which I haven't tried but which might 
well be suitable for our purposes.


I suggest we remove these references to AS and replace them with 
references to Windows distros that are more appropriate.



cheers


andrew

--
Andrew Dunstan
EDB:https://www.enterprisedb.com


Re: add --no-sync to pg_upgrade's calls to pg_dump and pg_dumpall

2024-07-01 Thread Nathan Bossart
Committed.

-- 
nathan




Re: small pg_dump code cleanup

2024-07-01 Thread Daniel Gustafsson
> On 11 Jun 2024, at 22:30, Nathan Bossart  wrote:

> At the moment, I'm inclined to commit v1 once v18 development opens up.  We
> can consider any additional adjustments separately.

Patch LGTM and the tests pass, +1 on pushing this version.

--
Daniel Gustafsson





Re: gamma() and lgamma() functions

2024-07-01 Thread Alvaro Herrera
On 2024-Jul-01, Stepan Neretin wrote:

> I have one notice:
> ERROR:  value out of range: overflow. I think we need to add information
> about the available ranges in the error message

I think this is a project of its own.  The error comes from calling
float_overflow_error(), which is a generic routine used in several
functions which have different overflow conditions.  It's not clear to
me that throwing errors listing the specific range for each case is
worth the effort.

-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/
"Escucha y olvidarás; ve y recordarás; haz y entenderás" (Confucio)




jsonpath: Inconsistency of timestamp_tz() Output

2024-07-01 Thread David E. Wheeler
Hackers,

There’s an odd difference in the behavior of timestamp_tz() outputs. Running 
with America/New_York as my TZ, it looks fine for a full timestamptz, identical 
to how casting the types directly works:

david=# set time zone 'America/New_York';
SET

david=# select '2024-08-15 12:34:56-04'::timestamptz;
  timestamptz   

 2024-08-15 12:34:56-04
(1 row)

david=# select jsonb_path_query_tz('"2024-08-15 12:34:56-04"', 
'$.timestamp_tz()');
 jsonb_path_query_tz 
-
 "2024-08-15T12:34:56-04:00"

Both show the time in America/New_York, which is great. But when casting from a 
date, the behavior differs. Casting directly:

david=# select '2024-08-15'::date::timestamptz;
  timestamptz   

 2024-08-15 00:00:00-04

It stringifies to the current zone setting again, as expected. But look at the 
output from a path query:

david=# select jsonb_path_query_tz('"2023-08-15"', '$.timestamp_tz()');
 jsonb_path_query_tz 
-
 "2023-08-15T04:00:00+00:00"

It’s using UTC for the display output! Shouldn’t it be using America/New_York?

Note that I’m comparing a cast from date to timestamptz because that’s how the 
jsonpath parsing works[1]: it ultimately uses 
date2timestamptz_opt_overflow()[2] to make the conversion, which appears to set 
the offset from the time zone GUC, so I’m not sure where it’s converted to UTC 
before stringifying.

Maybe an argument is missing from the stringification path?

FWIW, explicitly calling the string() jsonpath method does produce a result in 
the current TZ:

david=# select jsonb_path_query_tz('"2023-08-15"', '$.timestamp_tz().string()');
   jsonb_path_query_tz
--
 "2023-08-15 00:00:00-04"

That bit uses timestamptz_out to format the output, but JSONB has its own 
stringification[4] (called here[5]), and I can’t tell what might be different 
between a timestamptz cast from a date and one not cast from a date.

Note the same divergence in behavior occurs when the source value is a 
timestamp, too. Compare:

david=# select '2024-08-15 12:34:56'::timestamp::timestamptz;
  timestamptz   

 2024-08-15 12:34:56-04
(1 row)

david=# select jsonb_path_query_tz('"2023-08-15 12:34:56"', '$.timestamp_tz()');
 jsonb_path_query_tz 
-
 "2023-08-15T16:34:56+00:00"
(1 row)

Anyway, should the output of timestamptz JSONB values be made more consistent? 
I’m happy to make a patch to do so, but could use a hand figuring out where the 
behavior varies.

Best,

David

[1]: 
https://github.com/postgres/postgres/blob/3497c87/src/backend/utils/adt/jsonpath_exec.c#L2708-L2718
[2]: 
https://github.com/postgres/postgres/blob/3497c87/src/backend/utils/adt/date.c#L613-L698
[3]: 
https://github.com/postgres/postgres/blob/3fb59e789dd9f21610101d1ec106ad58095e24f3/src/backend/utils/adt/jsonpath_exec.c#L1650-L1653
[4]: 
https://github.com/postgres/postgres/blob/3fb59e789dd9f21610101d1ec106ad58095e24f3/src/backend/utils/adt/json.c#L369-L407
[5]: 
https://github.com/postgres/postgres/blob/3fb59e7/src/backend/utils/adt/jsonb.c#L743-L748





Re: LogwrtResult contended spinlock

2024-07-01 Thread Tom Lane
Alvaro Herrera  writes:
> Maybe we can do something like this,

> +#if MAXIMUM_ALIGNOF >= 8
>   uint64  currval;

This should probably be testing the alignment of int64 specifically,
rather than assuming that MAXIMUM_ALIGNOF applies to it.  At least
historically, there have been platforms where MAXIMUM_ALIGNOF is
determined by float quantities and integer quantities are different.

regards, tom lane




Re: LogwrtResult contended spinlock

2024-07-01 Thread Alvaro Herrera
Maybe we can do something like this,

diff --git a/src/include/port/atomics.h b/src/include/port/atomics.h
index 78987f3154..f6fa90bad8 100644
--- a/src/include/port/atomics.h
+++ b/src/include/port/atomics.h
@@ -580,7 +580,20 @@ pg_atomic_sub_fetch_u64(volatile pg_atomic_uint64 *ptr, 
int64 sub_)
 static inline uint64
 pg_atomic_monotonic_advance_u64(volatile pg_atomic_uint64 *ptr, uint64 target_)
 {
+   /*
+* When using actual (not simulated) atomics, the target variable for
+* pg_atomic_compare_exchange_u64 must have suitable alignment, which
+* is acquired naturally on most platforms, but not on 32-bit ones;
+* persuade the compiler in that case, but fail if we
+* cannot.
+*/
+#if MAXIMUM_ALIGNOF >= 8
uint64  currval;
+#elif defined(pg_attribute_aligned) && !defined(PG_HAVE_ATOMIC_U64_SIMULATION)
+   pg_attribute_aligned(8) uint64  currval;
+#else
+#error "Must have pg_attribute aligned or simulated atomics"
+#endif
 
 #ifndef PG_HAVE_ATOMIC_U64_SIMULATION
AssertPointerAlignment(ptr, 8);


-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"Just treat us the way you want to be treated + some extra allowance
 for ignorance."(Michael Brusser)




RE: speed up a logical replica setup

2024-07-01 Thread Hayato Kuroda (Fujitsu)
Dear Tom,

> I have a different but possibly-related complaint: why is
> 040_pg_createsubscriber.pl so miserably slow?  On my machine it
> runs for a bit over 19 seconds, which seems completely out of line
> (for comparison, 010_pg_basebackup.pl takes 6 seconds, and the
> other test scripts in this directory take much less).  It looks
> like most of the blame falls on this step:
> 
> [12:47:22.292](14.534s) ok 28 - run pg_createsubscriber on node S
> 
> AFAICS the amount of data being replicated is completely trivial,
> so that it doesn't make any sense for this to take so long --- and
> if it does, that suggests that this tool will be impossibly slow
> for production use.  But I suspect there is a logic flaw causing
> this.

I analyzed the issue. My elog() debugging showed that wait_for_end_recovery()
wasted some time. This was caused by the recovery target not being satisfied promptly.

We are setting recovery_target_lsn by the return value of 
pg_create_logical_replication_slot(),
which returns the end of the RUNNING_XACT record. If we use the returned value 
as
recovery_target_lsn as-is, however, we must wait for additional WAL generation
because the parameter requires that the replicated WAL overtake a certain point.
On my env, the function waited until the bgwriter emitted the 
XLOG_RUNNING_XACTS record.

One simple solution is to add an additional WAL record at the end of the 
publisher
setup. IIUC, an arbitrary WAL insertion can reduce the waiting time. The 
attached
patch inserts a small XLOG_LOGICAL_MESSAGE record, which could reduce much 
execution
time on my environment.

```
BEFORE
(13.751s) ok 30 - run pg_createsubscriber on node S
AFTER
(0.749s) ok 30 - run pg_createsubscriber on node S
```
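
Just to illustrate the idea from SQL (the attached diff does the equivalent
with a C-level XLOG_LOGICAL_MESSAGE insert at the end of the publisher setup;
the prefix string below is arbitrary):

```sql
-- Emit a tiny, non-transactional WAL record so the LSN returned by
-- pg_create_logical_replication_slot() can be overtaken immediately,
-- instead of waiting for the next XLOG_RUNNING_XACTS from the bgwriter.
SELECT pg_logical_emit_message(false, 'pg_createsubscriber', '');
```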

However, even after the modification, the reported failure [1] could not be 
resolved on my env.

What do you think?

[1]: 
https://www.postgresql.org/message-id/0dffca12-bf17-4a7a-334d-225569de5e6e%40gmail.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 


emit_dummy_message.diff
Description: emit_dummy_message.diff


Re: Should we document how column DEFAULT expressions work?

2024-07-01 Thread Tom Lane
Peter Eisentraut  writes:
> On 01.07.24 01:54, David Rowley wrote:
>> I think there are valid reasons to use the special timestamp input
>> values.  One that I can think of is for use with partition pruning. If
>> you have a time-range partitioned table and want the planner to prune
>> the partitions rather than the executor, you could use
>> 'now'::timestamp in your queries to allow the planner to prune.

> Yeah, but is that a good user interface?  Or is that just something that 
> happens to work now with the pieces that happened to be there, rather 
> than a really designed interface?

That's not a very useful argument to make.  What percentage of the
SQL language as a whole is legacy cruft that we'd do differently if
we could?  I think the answer is depressingly high.  Adding more
special-purpose features to the ones already there doesn't move
that needle in a desirable direction.

I'd be more excited about this discussion if I didn't think that
the chances of removing 'now'::timestamp are exactly zero.  You
can't just delete useful decades-old features, whether there's
a better way or not.

regards, tom lane




Re: Surround CheckRelation[Oid]LockedByMe() with USE_ASSERT_CHECKING

2024-07-01 Thread Bertrand Drouvot
Hi,

On Mon, Jul 01, 2024 at 10:21:35AM -0400, Tom Lane wrote:
> Michael Paquier  writes:
> > On Mon, Jul 01, 2024 at 06:42:46AM +, Bertrand Drouvot wrote:
> >> I think it would make sense to declare / define those functions only for
> >> assert enabled build: please find attached a tiny patch doing so.
> 
> > Not convinced that's a good idea.  What about out-of-core code that
> > may use these routines for runtime checks in non-assert builds?
> 
> Yeah.  Also, I believe it's possible for an extension that's been
> built with assertions enabled to be used with a core server that
> wasn't.  This is why, for example, ExceptionalCondition() is not
> ifdef'd away in a non-assert build.  Even if you think there's
> no use for CheckRelation[Oid]LockedByMe except in assertions,
> it'd still be plenty reasonable for an extension to call them
> in assertions.

Yeah good point, thanks for the feedback! I've withdrawn the CF entry.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: LogwrtResult contended spinlock

2024-07-01 Thread Alvaro Herrera
On 2024-Jul-01, Tom Lane wrote:

> Alvaro Herrera  writes:
> >> because the failed assertion is:
> >> #ifndef PG_HAVE_ATOMIC_U64_SIMULATION
> >>     AssertPointerAlignment(, 8);
> >> #endif
> 
> Perhaps this assertion is what is wrong?  If the platform has no
> native 8-byte alignment requirement, why do we think that atomics
> need it?

Oh, that's a good question.  TBH I just copied the assertion from the
other routines for 64-bit variables in the same file.  But I think
that's correct.  We're gating the assertion on _not_ having emulation,
which must mean we have native atomics; on MSVC, if I read the #ifdef
maze correctly, that's implemented using _InterlockedCompareExchange,
whose docs state:

: The variables for this function must be aligned on a 64-bit boundary;
: otherwise, this function will behave unpredictably on multiprocessor x86
: systems and any non-x86 systems. See _aligned_malloc.
https://learn.microsoft.com/en-us/windows/win32/api/winnt/nf-winnt-interlockedcompareexchange64

So I think the assertion is correct.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"[PostgreSQL] is a great group; in my opinion it is THE best open source
development communities in existence anywhere."(Lamar Owen)




Re: gamma() and lgamma() functions

2024-07-01 Thread Stepan Neretin
On Mon, Jul 1, 2024 at 5:33 PM Dean Rasheed 
wrote:

> Attached is a patch adding support for the gamma and log-gamma
> functions. See, for example:
>
> https://en.wikipedia.org/wiki/Gamma_function
>
> I think these are very useful general-purpose mathematical functions.
> They're part of C99, and they're commonly included in other
> mathematical libraries, such as the python math module, so I think
> it's useful to make them available from SQL.
>
> The error-handling for these functions seems to be a little trickier
> than most, so that might need further discussion.
>
> Regards,
> Dean
>

I tried to review the patch without applying it. It looks good to me, but I have
one note: on overflow these functions report only "ERROR:  value out of range:
overflow". I think we need to add information about the supported input range to
the error message.
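
For illustration, this is the failure mode I mean (a sketch, assuming the patch as
posted; tgamma() overflows a double for inputs somewhere above 171):

```
SELECT gamma(171);  -- fine, roughly 7.26e306
SELECT gamma(172);  -- ERROR:  value out of range: overflow
                    -- no hint about which input range is accepted
```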


Re: Should we document how column DEFAULT expressions work?

2024-07-01 Thread Peter Eisentraut

On 01.07.24 01:54, David Rowley wrote:

On Thu, 27 Jun 2024 at 23:57, Peter Eisentraut  wrote:

Maybe we should really be thinking about deprecating these special
values and steering users more urgently toward more robust alternatives.

Imagine if 'random' were a valid input value for numeric types.


I think there are valid reasons to use the special timestamp input
values.  One that I can think of is for use with partition pruning. If
you have a time-range partitioned table and want the planner to prune
the partitions rather than the executor, you could use
'now'::timestamp in your queries to allow the planner to prune.


Yeah, but is that a good user interface?  Or is that just something that 
happens to work now with the pieces that happened to be there, rather 
than a really designed interface?


Hypothetically, what would need to be done to make this work with now() 
or current_timestamp or something similar?  Do we need a new stability 
level that somehow encompasses this behavior, so that the function call 
can be evaluated at planning time?



That
works providing that you never use that in combination with PREPARE
and never put the query with the WHERE clause inside a VIEW.


And this kind of thing obviously makes this interface even worse.





Re: gamma() and lgamma() functions

2024-07-01 Thread Daniel Gustafsson
> On 1 Jul 2024, at 16:20, Stepan Neretin  wrote:

> The patch file seems broken.
> git apply gamma-and-lgamma.patch
> error: git apply: bad git-diff - expected /dev/null in line 2

It's a plain patch file, if you apply it with patch and not git it will work 
fine:

$ patch -p1 < gamma-and-lgamma.patch
patching file 'doc/src/sgml/func.sgml'
patching file 'src/backend/utils/adt/float.c'
patching file 'src/include/catalog/pg_proc.dat'
patching file 'src/test/regress/expected/float8.out'
patching file 'src/test/regress/sql/float8.sql'

--
Daniel Gustafsson





Re: Surround CheckRelation[Oid]LockedByMe() with USE_ASSERT_CHECKING

2024-07-01 Thread Tom Lane
Michael Paquier  writes:
> On Mon, Jul 01, 2024 at 06:42:46AM +, Bertrand Drouvot wrote:
>> I think it would make sense to declare / define those functions only for
>> assert enabled build: please find attached a tiny patch doing so.

> Not convinced that's a good idea.  What about out-of-core code that
> may use these routines for runtime checks in non-assert builds?

Yeah.  Also, I believe it's possible for an extension that's been
built with assertions enabled to be used with a core server that
wasn't.  This is why, for example, ExceptionalCondition() is not
ifdef'd away in a non-assert build.  Even if you think there's
no use for CheckRelation[Oid]LockedByMe except in assertions,
it'd still be plenty reasonable for an extension to call them
in assertions.

regards, tom lane




Re: gamma() and lgamma() functions

2024-07-01 Thread Stepan Neretin
On Mon, Jul 1, 2024 at 5:33 PM Dean Rasheed 
wrote:

> Attached is a patch adding support for the gamma and log-gamma
> functions. See, for example:
>
> https://en.wikipedia.org/wiki/Gamma_function
>
> I think these are very useful general-purpose mathematical functions.
> They're part of C99, and they're commonly included in other
> mathematical libraries, such as the python math module, so I think
> it's useful to make them available from SQL.
>
> The error-handling for these functions seems to be a little trickier
> than most, so that might need further discussion.
>
> Regards,
> Dean
>

Hi! The patch file seems broken.
git apply gamma-and-lgamma.patch
error: git apply: bad git-diff - expected /dev/null in line 2
Best regards, Stepan Neretin.


Re: Should we move the resowner field from JitContext to LLVMJitContext?

2024-07-01 Thread Daniel Gustafsson
> On 5 Jun 2024, at 10:19, Andreas Karlsson  wrote:

> When Heikki made the resource owners extensible in commit 
> b8bff07daa85c837a2747b4d35cd5a27e73fb7b2 the API for JIT plugins changed when 
> ResourceOwnerForgetJIT() was moved from the generic JIT code to the LLVM 
> specific JIT code so now the resowner field of the context is only used by 
> the code of the LLVM plugin.
> 
> Maybe a bit late in the release cycle but should we make the resowner field 
> specific to the LLVM code too now that we already are breaking the API? I 
> personally do not like having a LLVM JIT specific field in the common struct. 
> Code is easier to understand if things are local. Granted most JIT engines 
> will likely need similar infrastructure but just providing the struct field 
> and nothing else does not seem very helpful.

I'm inclined to agree, given that the code for handling the resowner is private
to the LLVM implementation it makes sense for the resowner to be as well.  A
future JIT implementation will likely need a ResourceOwner, but it might just
as well need two or none.

--
Daniel Gustafsson





Re: Backporting BackgroundPsql

2024-07-01 Thread Pavan Deolasee
Hi Daniel,

On Mon, Jul 1, 2024 at 1:09 PM Daniel Gustafsson  wrote:

> > On 29 Jun 2024, at 06:38, Pavan Deolasee 
> wrote:
>
> > Don't we need to add install and uninstall rules for the new module,
> like we did in
> https://git.postgresql.org/pg/commitdiff/a4c17c86176cfa712f541b81b2a026ae054b275e
> and
> https://git.postgresql.org/pg/commitdiff/7039c7cff6736780c3bbb41a90a6dfea0f581ad2
> ?
>
> That's correct, we should backport those as well.
>

Thanks for confirming. Attaching patches for PG15 and PG14, but this will
need backporting all the way back to PG12.

Thanks,
Pavan


0001-Fix-missing-installation-uninstallation-rules-for-Ba-PG14.patch
Description: Binary data


0001-Fix-missing-installation-uninstallation-rules-for-Ba-PG15.patch
Description: Binary data


Re: LogwrtResult contended spinlock

2024-07-01 Thread Tom Lane
Alvaro Herrera  writes:
>> because the failed assertion is:
>> #ifndef PG_HAVE_ATOMIC_U64_SIMULATION
>>     AssertPointerAlignment(, 8);
>> #endif

Perhaps this assertion is what is wrong?  If the platform has no
native 8-byte alignment requirement, why do we think that atomics
need it?

regards, tom lane




Re: [PATCH] Handle SK_SEARCHNULL and SK_SEARCHNOTNULL in HeapKeyTest

2024-07-01 Thread Aleksander Alekseev
Hi,

> This patch changes the HeapKeyTest macro to add handling for SK_SEARCHNULL
> and SK_SEARCHNOTNULL. While currently no core code uses these ScanKey flags,
> it would be useful for extensions if they were supported, so extensions
> don't have to implement handling for them themselves.

As I recall, previously it was argued that changes like this should
have some use within the core [1].

Can you think of any such use?

[1]: https://commitfest.postgresql.org/42/4180/

-- 
Best regards,
Aleksander Alekseev




Re: Wrong security context for deferred triggers?

2024-07-01 Thread Bennie Swart

Hi,

Allow me to provide some background on how we came across this.

(This is my first time posting to a pgsql list so hopefully I've got 
everything set up correctly.)


We have a db with a big legacy section that we're in the process of 
modernizing. To compensate for some of the shortcomings we have designed 
a layer of writable views to better represent the underlying data and 
make working with it more convenient. The following should illustrate 
what we're doing:


    -- this is the schema containing the view layer.
    create schema api;
    -- and this user is granted access to the api, but not the rest of the legacy db.
    create role apiuser;
    grant usage on schema api to apiuser;

    -- some dummy objects in the legacy db - poorly laid out and poorly named.
    create schema legacy;
    create table legacy.stock_base (
        id serial primary key
      , lbl text not null unique
      , num int not null
      -- etc
    );
    create table legacy.stock_extra (
        id int not null unique references legacy.stock_base (id)
      , man text
      -- etc
    );

    -- create the stock view which better names and logically groups the data.
    create view api.stock as
      select sb.id
           , sb.lbl as description
           , sb.num as quantity
           , se.man as manufacturer
        from legacy.stock_base sb
          left join legacy.stock_extra se using (id);

    -- make it writable so it is easier to work with. use security definer
    -- to allow access to legacy sections.
    create function api.stock_cud() returns trigger language plpgsql security definer as $$
    begin
        -- insert/update legacy.stock_base and legacy.stock_extra depending on
        -- trigger action, modified fields, etc.
        assert tg_op = 'INSERT'; -- assume insert for example's sake.
        insert into legacy.stock_base (lbl, num)
            values (new.description, new.quantity) returning id into new.id;
        insert into legacy.stock_extra (id, man)
            values (new.id, new.manufacturer);
        return new;
    end;
    $$;
    create trigger stock_cud
        instead of insert or update or delete on api.stock
        for each row execute function api.stock_cud();

    -- grant the apiuser permission to work with the view.
    grant insert, update, delete on api.stock to apiuser;

    -- insert as superuser - works as expected.
    insert into api.stock (description, quantity, manufacturer)
        values ('item1', 10, 'manufacturer1');

    -- insert as apiuser - works as expected.
    set role apiuser;
    insert into api.stock (description, quantity, manufacturer)
        values ('item2', 10, 'manufacturer2');


In some cases there are constraint triggers on the underlying tables to 
validate certain states. It is, however, possible for a state to be 
temporarily invalid between statements, so long as it is valid at tx 
commit. For this reason the triggers are deferred by default. Consider 
the following example:


    reset role;
    create function legacy.stock_check_state() returns trigger language plpgsql as $$
    begin
        -- do some queries to check the state of stock based on modified rows
        -- and error if invalid.
        raise notice 'current_user %', current_user;
        -- dummy validation code.
        perform * from legacy.stock_base sb
            left join legacy.stock_extra se using (id)
            where sb.id = new.id;
        return new;
    end;
    $$;
    create constraint trigger stock_check_state
        after insert or update or delete on legacy.stock_base
        deferrable initially deferred
        for each row execute function legacy.stock_check_state();

Then repeat the inserts:

    -- insert as superuser - works as expected.
    reset role;
    insert into api.stock (description, quantity, manufacturer)
        values ('item3', 10, 'manufacturer3'); -- NOTICE:  current_user postgres

    -- insert as apiuser - fails.
    set role apiuser;
    insert into api.stock (description, quantity, manufacturer)
        values ('item4', 10, 'manufacturer4'); -- NOTICE:  current_user apiuser

    -- insert as apiuser, not deferred - works as expected.
    begin;
    set constraints all immediate;
    insert into api.stock (description, quantity, manufacturer)
        values ('item4', 10, 'manufacturer4'); -- NOTICE:  current_user postgres
    commit;

The insert as apiuser fails when the constraint trigger is deferred, but 
works as expected when it is immediate.


Hopefully this helps to better paint the picture. Our workaround 
currently is to just make `legacy.stock_check_state()` security definer 
as well. As I told Laurenz, we're not looking to advocate for any 
specific outcome here. We noticed this strange behaviour and thought it 
to be a bug that should be fixed - whatever "fixed" ends up meaning.
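
For completeness, the workaround amounts to something like this (a sketch; whether
that is an acceptable long-term answer is exactly the open question):

    -- run the constraint trigger function with its owner's privileges too, so the
    -- deferred invocation no longer depends on who current_user is at commit time.
    alter function legacy.stock_check_state() security definer;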


Regards,

Bennie Swart


On 2024/06/26 17:53, Laurenz Albe wrote:

On Wed, 2024-06-26 at 07:38 -0700, David G. Johnston wrote:

We have a few choices then:
1. Status quo + documentation backpatch
2. Change v18 narrowly + 

Re: Should we document how column DEFAULT expressions work?

2024-07-01 Thread James Coleman
On Sun, Jun 30, 2024 at 8:16 PM David G. Johnston
 wrote:
>
> On Sun, Jun 30, 2024 at 4:55 PM David Rowley  wrote:
>>
>>
>> I'd like to know what led someone down the path of doing something
>> like DEFAULT 'now()'::timestamp in a CREATE TABLE. Could it be a
>> faulty migration tool that created these and people copy them thinking
>> it's a legitimate syntax?
>>
>
> My thought process on this used to be:  Provide a text string of the 
> expression that is then stored within the catalog and eval'd during runtime.  
> If the only thing you are providing is a single literal and not some compound 
> expression it isn't that obvious that you are supposed to provide an unquoted 
> expression - which feels like it should be immediately evaluated - versus 
> something that is a constant.  Kinda like dynamic SQL.

I have a similar story to tell: I've honestly never thought about it
deeply until I started this thread, but just through experimentation a
few things were obvious:

- now() as a function call gives you the current timestamp in a query
- now() as a function call in a DDL DEFAULT clause sets that as a
default function call
- Quoting that function call (using the function call syntax is the
natural thing to try, I think, if you've already done the first two)
-- because some examples online show quoting it -- gives you DDL time
evaluation.

So I suspect -- though I've been doing this for so long I couldn't
tell you for certain -- that I largely intuited the behavior by
observation.

And similarly to David J. I'd then assumed -- but never had a need to
test it -- that this was generalized.

I think DDL is also different conceptually from SQL/DML here in a kind
of insidious way: the "bare" function call in DEFAULT is *not*
executed as part of the query for DDL like it is with other queries.
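
A small sketch of those three observations (assuming a scratch database; the quoted
form here uses the special 'now' input value rather than the function call):

```
SELECT now();                                           -- current timestamp at query time

CREATE TABLE t_call (ts timestamptz DEFAULT now());     -- default stored as a function
                                                        -- call, evaluated at each INSERT

CREATE TABLE t_quoted (ts timestamptz DEFAULT 'now'::timestamptz);
                                                        -- the literal is coerced while the
                                                        -- DDL is parsed, so every row gets
                                                        -- the table-creation timestamp
```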

Hope this helps explain things.

James Coleman




Re: Reg: Alternate way of hashing database role passwords

2024-07-01 Thread Daniel Gustafsson
> On 26 Jun 2024, at 18:59, Robert Haas  wrote:

> However, it seems like SCRAM is designed so
> that different hash functions can be substituted into it, so what I'm
> hoping is that we can keep SCRAM and just replace SCRAM-SHA-256 with
> SCRAM-WHATEVER when SHA-256 starts to look too weak.

Correct, SCRAM is an authentication method which can use different hashing
algorithms.  There are current drafts for SCRAM-SHA-512 and SCRAM-SHA3-512, but they
are some way away from being standardized.  If they become standards at some
point it would be reasonable to extend our support, but until then there is no
evidence that what we have is insecure, AFAIK.

https://datatracker.ietf.org/doc/html/draft-melnikov-scram-sha-512
https://datatracker.ietf.org/doc/html/draft-melnikov-scram-sha3-512

> What I find a bit surprising about Anbazhagan's question is that he
> asks about PBKDF2, which seems to be part of SCRAM already.

In scram_SaltedPassword() we perform PBKDF2 with HMAC as the pseudorandom
function.
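
For anyone following along, the salt and iteration count fed into that PBKDF2 step,
together with the keys derived from its output, are what end up in the stored
verifier, which can be inspected directly (a sketch; superuser only, and the salt
and key values will of course differ per role):

```
-- rolpassword has the form SCRAM-SHA-256$<iterations>:<salt>$<StoredKey>:<ServerKey>
SELECT rolname, rolpassword
FROM pg_authid
WHERE rolpassword LIKE 'SCRAM-SHA-256%';
```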

--
Daniel Gustafsson





Re: Optimize numeric multiplication for one and two base-NBASE digit multiplicands.

2024-07-01 Thread Joel Jacobson
On Mon, Jul 1, 2024, at 15:11, Joel Jacobson wrote:
> Not really sure why. Maybe the code I tried can be optimized further:

If anyone want experiment with the int128 version,
here is a patch that adds a separate numeric_mul_patched() function,
so it's easier to benchmark against the unmodified numeric_mul().

/Joel

0001-Optimize-mul_var-for-var2ndigits-4.patch
Description: Binary data


Re: PATCH: Add query for operators unusable with RLS to documentation

2024-07-01 Thread Yugo NAGATA
On Sun, 23 Jun 2024 19:14:09 +0900
Yugo NAGATA  wrote:

> and Operator Families"? Additionally, is it useful to add LEAKPROOF 
> information
> to the result of psql \dAo(+) meta-comand, or a function that can check given 
> index
> or operator is leakproof or not?

I wrote a patch to implement the proposal above and submitted it in the new
thread [1].

[1] 
https://www.postgresql.org/message-id/20240701220817.483f9b645b95611f8b1f65da%40sranhm.sraoss.co.jp

Regards,
Yugo Nagata


> Regards,
> Yugo Nagata
> 
> > Thanks,
> > Josh
> > 
> > [1] 
> > https://www.postgresql.org/message-id/CAGrP7a2t%2BJbeuxpQY%2BRSvNe4fr3%2B%3D%3DUmyimwV0GCD%2BwcrSSb%3Dw%40mail.gmail.com
> 
> 
> -- 
> Yugo NAGATA 
> 
> 


-- 
Yugo NAGATA 




Re: Optimize numeric multiplication for one and two base-NBASE digit multiplicands.

2024-07-01 Thread Joel Jacobson
On Mon, Jul 1, 2024, at 14:25, Dagfinn Ilmari Mannsåker wrote:
> div_var() also has an optimisation for 3- and 4-digit operands under
> HAVE_INT128 (added in commit 0aa38db56bf), would that make sense in
> mul_var() too?

I considered it, but it only gives a marginal speed-up on Intel Core i9-14900K,
and is actually slower on Apple M3 Max.
Not really sure why. Maybe the code I tried can be optimized further:

```
#ifdef HAVE_INT128
	/*
	 * If var1 and var2 are up to four digits, their product will fit in
	 * an int128 and can be computed directly, which is significantly faster.
	 */
	if (var2ndigits <= 4)
	{
		int128		product = 0;

		switch (var1ndigits)
		{
			case 1:
				product = var1digits[0];
				break;
			case 2:
				product = var1digits[0] * NBASE + var1digits[1];
				break;
			case 3:
				product = ((int128) var1digits[0] * NBASE + var1digits[1])
					* NBASE + var1digits[2];
				break;
			case 4:
				product = (((int128) var1digits[0] * NBASE + var1digits[1])
					* NBASE + var1digits[2]) * NBASE + var1digits[3];
				break;
		}

		switch (var2ndigits)
		{
			case 1:
				product *= var2digits[0];
				break;
			case 2:
				product *= var2digits[0] * NBASE + var2digits[1];
				break;
			case 3:
				product *= ((int128) var2digits[0] * NBASE + var2digits[1])
					* NBASE + var2digits[2];
				break;
			case 4:
				product *= (((int128) var2digits[0] * NBASE + var2digits[1])
					* NBASE + var2digits[2]) * NBASE + var2digits[3];
				break;
		}

		alloc_var(result, res_ndigits);
		res_digits = result->digits;
		for (i = res_ndigits - 1; i >= 0; i--)
		{
			res_digits[i] = product % NBASE;
			product /= NBASE;
		}
		Assert(product == 0);

		/*
		 * Finally, round the result to the requested precision.
		 */
		result->weight = res_weight;
		result->sign = res_sign;

		/* Round to target rscale (and set result->dscale) */
		round_var(result, rscale);

		/* Strip leading and trailing zeroes */
		strip_var(result);

		return;
	}
#endif
```

Benchmark 1, testing 2 ndigits * 2 ndigits:

SELECT
   timeit.pretty_time(total_time_a / 1e6 / executions,3) AS execution_time_a,
   timeit.pretty_time(total_time_b / 1e6 / executions,3) AS execution_time_b,
   total_time_a::numeric/total_time_b AS performance_ratio
FROM timeit.cmp(
   'numeric_mul',
   'numeric_mul_patched',
   input_values := ARRAY[
  '',
  ''
   ],
   min_time := 100,
   timeout := '10 s'
);

Apple M3 Max:

 execution_time_a | execution_time_b | performance_ratio
--+--+
 32.2 ns  | 20.5 ns  | 1.5700112246809388
(1 row)

Intel Core i9-14900K:

 execution_time_a | execution_time_b | performance_ratio
--+--+
 30.2 ns  | 21.4 ns  | 1.4113042510107371
(1 row)

So 57% and 41% faster.

Benchmark 2, testing 4 ndigits * 4 ndigits:

SELECT
   timeit.pretty_time(total_time_a / 1e6 / executions,3) AS execution_time_a,
   timeit.pretty_time(total_time_b / 1e6 / executions,3) AS execution_time_b,
   total_time_a::numeric/total_time_b AS performance_ratio
FROM timeit.cmp(
   'numeric_mul',
   'numeric_mul_patched',
   input_values := ARRAY[
  '',
  ''
   ],
   min_time := 100,
   timeout := '10 s'
);

Apple M3 Max:

 execution_time_a | execution_time_b |   performance_ratio
--+--+
 41.9 ns  | 51.3 ns  | 0.81733655797170943614
(1 row)

Intel Core i9-14900K:

 execution_time_a | execution_time_b | performance_ratio
--+--+
 40 ns| 38 ns| 1.0515610914706320
(1 row)

So 18% slower on Apple M3 Max and just 5% faster on Intel Core 

psql: Add leakproof field to \dAo+ meta-command results

2024-07-01 Thread Yugo NAGATA
Hi,

I would like to propose to add a new field to psql's \dAo+ meta-command
to show whether the underlying function of an operator is leak-proof.

This idea is inspired by [1], which claims that some indexes use non-LEAKPROOF
functions under their associated operators and, as a result, cannot be selected
for queries with security_barrier views or row-level security policies.
The original proposal was to add a query over the system catalogs for looking up
non-leakproof operators to the documentation, but I thought it would be more useful
to improve the \dAo output rather than putting such a query in the docs.

The attached patch adds the field to \dAo+ and also a description that
explains the relation between indexes and security quals, referencing the
\dAo+ meta-command.
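
For comparison, the kind of catalog query the original proposal had in mind looks
roughly like this (a sketch, not the exact query from [1]):

```
-- List operators attached to operator families whose underlying function
-- is not marked LEAKPROOF.
SELECT am.amname, opf.opfname, o.oprname, p.proname
FROM pg_amop ao
JOIN pg_opfamily opf ON opf.oid = ao.amopfamily
JOIN pg_am am ON am.oid = opf.opfmethod
JOIN pg_operator o ON o.oid = ao.amopopr
JOIN pg_proc p ON p.oid = o.oprcode
WHERE NOT p.proleakproof
ORDER BY am.amname, opf.opfname, o.oprname;
```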

[1] 
https://www.postgresql.org/message-id/raw/5af3bf0c-5e0c-4128-81dc-084c5258b1af%40code406.com

Regards,
Yugo Nagata

-- 
Yugo NAGATA 
>From 3417c4cce46ec068464b7069428e7f4a9a2cd07d Mon Sep 17 00:00:00 2001
From: Yugo Nagata 
Date: Mon, 1 Jul 2024 16:16:39 +0900
Subject: [PATCH] psql: Add leakproof field to \dAo+ meta-command results

This adds a field that shows whether the underlying function of an
operator associated with operator families is leak-proof.
It is useful for checking an index can be used with security_barrier
views or row-level security policies when the query's WHERE
clause contains an operator which is associated with the index.
---
 doc/src/sgml/ref/psql-ref.sgml |  3 ++-
 doc/src/sgml/rules.sgml| 10 ++
 src/bin/psql/describe.c| 17 +
 3 files changed, 25 insertions(+), 5 deletions(-)

diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml
index 830306ea1e..d59afa7524 100644
--- a/doc/src/sgml/ref/psql-ref.sgml
+++ b/doc/src/sgml/ref/psql-ref.sgml
@@ -1362,7 +1362,8 @@ INSERT INTO tbl1 VALUES ($1, $2) \bind 'first value' 'second value' \g
 is specified, only members of operator families whose names match that
 pattern are listed.
 If + is appended to the command name, each operator
-is listed with its sort operator family (if it is an ordering operator).
+is listed with its sort operator family (if it is an ordering operator),
+and whether it is leak-proof.
 
 
   
diff --git a/doc/src/sgml/rules.sgml b/doc/src/sgml/rules.sgml
index 7a928bd7b9..5e17031ee9 100644
--- a/doc/src/sgml/rules.sgml
+++ b/doc/src/sgml/rules.sgml
@@ -2167,6 +2167,16 @@ CREATE VIEW phone_number WITH (security_barrier) AS
 view's row filters.
 
 
+
+For example, an index scan can not be selected for queries with
+security_barrier views or row-level security policies if an
+operator used in the WHERE clause is associated with the
+operator family of the index, but its underlying function is not marked
+LEAKPROOF. The  program's
+\dAo+ meta-command is useful for listing the operators
+with associated operator families and whether it is leak-proof.
+
+
 
 It is important to understand that even a view created with the
 security_barrier option is intended to be secure only
diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index f67bf0b892..243f099017 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -6872,7 +6872,7 @@ listOpFamilyOperators(const char *access_method_pattern,
 	printQueryOpt myopt = pset.popt;
 	bool		have_where = false;
 
-	static const bool translate_columns[] = {false, false, false, false, false, false};
+	static const bool translate_columns[] = {false, false, false, false, false, false, false};
 
 	initPQExpBuffer();
 
@@ -6900,8 +6900,15 @@ listOpFamilyOperators(const char *access_method_pattern,
 
 	if (verbose)
 		appendPQExpBuffer(,
-		  ", ofs.opfname AS \"%s\"\n",
-		  gettext_noop("Sort opfamily"));
+		  ", ofs.opfname AS \"%s\"\n,"
+		  "  CASE\n"
+		  "   WHEN p.proleakproof THEN '%s'\n"
+		  "   ELSE '%s'\n"
+		  " END AS \"%s\"\n",
+		  gettext_noop("Sort opfamily"),
+		  gettext_noop("yes"),
+		  gettext_noop("no"),
+		  gettext_noop("Leak-proof"));
 	appendPQExpBufferStr(,
 		 "FROM pg_catalog.pg_amop o\n"
 		 "  LEFT JOIN pg_catalog.pg_opfamily of ON of.oid = o.amopfamily\n"
@@ -6909,7 +6916,9 @@ listOpFamilyOperators(const char *access_method_pattern,
 		 "  LEFT JOIN pg_catalog.pg_namespace nsf ON of.opfnamespace = nsf.oid\n");
 	if (verbose)
 		appendPQExpBufferStr(,
-			 "  LEFT JOIN pg_catalog.pg_opfamily ofs ON ofs.oid = o.amopsortfamily\n");
+			 "  LEFT JOIN pg_catalog.pg_opfamily ofs ON ofs.oid = o.amopsortfamily\n"
+			 "  LEFT JOIN pg_catalog.pg_operator op ON op.oid = o.amopopr\n"
+			 "  LEFT JOIN pg_catalog.pg_proc p ON p.oid = op.oprcode\n");
 
 	if (access_method_pattern)
 	{
-- 
2.25.1



Re: [PATCH] Fix docs to use canonical links

2024-07-01 Thread Daniel Gustafsson
> On 1 Jul 2024, at 13:09, Joel Jacobson  wrote:
> 
> On Mon, Jul 1, 2024, at 09:35, Daniel Gustafsson wrote:
>> Avoiding redirects is generally a good thing, not everyone is on lightning 
>> fast
>> internet.  Wikipedia is however not doing any 30X redirects so it's not 
>> really
>> an issue for those links, it's all 200 requests.
> 
> Yes, I noticed that too when observing the HTTPS traffic, so no issue there,
> except that it's a bit annoying that the address bar suddenly changes.

Right, I was unclear, I'm not advocating against changing.  It won't move the
needle compared to 30X redirects but it also won't hurt.

> However, I think David J had another good argument:
> 
> "If we are making wikipedia our authority we might as well use their standard 
> for naming."

It's a moving target, but so is most if not all links.

--
Daniel Gustafsson





RE: New standby_slot_names GUC in PG 17

2024-07-01 Thread Zhijie Hou (Fujitsu)
On Monday, July 1, 2024 6:45 PM Amit Kapila  wrote:
> 
> On Thu, Jun 27, 2024 at 7:14 AM Masahiko Sawada
>  wrote:
> >
> > On Wed, Jun 26, 2024 at 6:15 PM Zhijie Hou (Fujitsu)
> >  wrote:
> >
> > Thank you for updating the patch. The v2 patch looks good to me.
> >
> 
> Pushed.

Thanks! I am attaching another patch to modify the release note as discussed.

Best Regards,
Hou zj


0001-add-recent-renaming-commit-to-the-release-note.patch
Description: 0001-add-recent-renaming-commit-to-the-release-note.patch


Re: Optimize numeric multiplication for one and two base-NBASE digit multiplicands.

2024-07-01 Thread Dagfinn Ilmari Mannsåker
"Joel Jacobson"  writes:

> Hello hackers,
>
> Attached patch introduces an optimization of mul_var() in numeric.c,
> targeting cases where the multiplicands consist of only one or two
> base-NBASE digits. Such small multiplicands can fit into an int64 and
> thus be computed directly, resulting in a significant performance
> improvement, between 26% - 34% benchmarked on Intel Core i9-14900K.
>
> This optimization is similar to commit d1b307eef2, that also targeted
> one and two base-NBASE digit operands, but optimized div_var().

div_var() also has an optimisation for 3- and 4-digit operands under
HAVE_INT128 (added in commit 0aa38db56bf), would that make sense in
mul_var() too?

> Regards,
> Joel

- ilmari




Re: pg_createsubscriber: drop pre-existing subscriptions from the converted node

2024-07-01 Thread Amit Kapila
On Mon, Jul 1, 2024 at 11:44 AM Hayato Kuroda (Fujitsu)
 wrote:
>
> > You can create a dummy subscription on node_p
> > and do a test similar to what we are doing in "# Create failover slot
> > to test its removal".
>
> Your approach looks better than mine. I followed the approach.
>

LGTM.

-- 
With Regards,
Amit Kapila.




Re: Make tuple deformation faster

2024-07-01 Thread Matthias van de Meent
On Mon, 1 Jul 2024 at 12:49, David Rowley  wrote:
>
> On Mon, 1 Jul 2024 at 22:07, Matthias van de Meent
>  wrote:
> > One thing I'm slightly concerned about is that this allocates another
> > 8 bytes for each attribute in the tuple descriptor. While that's not a
> > lot when compared with the ->attrs array, it's still quite a lot when
> > we might not care at all about this data; e.g. in temporary tuple
> > descriptors during execution, in intermediate planner nodes.
>
> I've not done it in the patch, but one way to get some of that back is
> to ditch pg_attribute.attcacheoff. There's no need for it after this
> patch.  That's only 4 out of 8 bytes, however.

FormData_pg_attribute as a C struct has 4-byte alignment; AFAIK it
doesn't have any fields that require 8-byte alignment? Only on 64-bit
systems do we align the tuples on pages with 8-byte alignment, but
in-memory arrays of the struct wouldn't have to deal with that, AFAIK.

> I think in most cases
> due to FormData_pg_attribute being so huge, the aset.c power-of-2
> roundup behaviour will be unlikely to cross a power-of-2 boundary.

ASet isn't the only allocator, but default enough for this to make sense, yes.

> The following demonstrates which column counts that actually makes a
> difference with:
>
> select c as n_cols,old_bytes, new_bytes from (select c,24+104*c as
> old_bytes, 32+100*c+8*c as new_bytes from generate_series(1,1024) c)
> where position('1' in old_bytes::bit(32)::text) != position('1' in
> new_bytes::bit(32)::text);
>
> That returns just 46 column counts out of 1024 where we cross a power
> of 2 boundaries with the patched code that we didn't cross in master.
> Of course, larger pallocs will result in a malloc() directly, so
> perhaps that's not a good measure.  At least for smaller column counts
> it should be mainly the same amount of memory used. There are only 6
> rows in there for column counts below 100. I think if we were worried
> about memory there are likely 100 other things we could do to reclaim
> some. It would only take some shuffling of fields in RelationData. I
> count 50 bytes of holes in that struct out of the 488 bytes. There are
> probably a few that could be moved without upsetting the
> struct-field-order-lords too much.

I'd love for RelationData to be split into IndexRelation,
TableRelation, ForeignTableRelation, etc., as there's a lot of wastage
caused by exclusive fields, too.

> > > 0004: Adjusts the attalign to change it from char to uint8.  See below.
> > >
> > > The 0004 patch changes the TupleDescDeformAttr.attalign to a uint8
> > > rather than a char containing 'c', 's', 'i' or 'd'. This allows much
> > > more simple code in the att_align_nominal() macro. What's in master is
> > > quite a complex expression to evaluate every time we deform a column
> > > as it must translate: 'c' -> 1, 's' -> 2, 'i' -> 4, 'd' -> 8. If we
> > > just store that numeric value in the struct that macro can become a
> > > simple TYPEALIGN() so the operation becomes simple bit masking rather
> > > than a poorly branch predictable series of compare and jump.
> >
> > +1, that's something I'd missed in my patches, and is probably the
> > largest contributor to the speedup.
>
> I think so too and I did consider if we should try and do that to
> pg_attribute, renaming the column to attalignby. I started but didn't
> finish a patch for that.

I'm not sure we have a pg_type entry that that supports numeric
attalign values without increasing the size of the field, as the one
single-byte integer-like type (char) is always used as a printable
character, and implied to always be printable through its usage in
e.g. nodeToString infrastructure.

I'd love to have a better option, but we don't seem to have one yet.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




Re: Reuse child_relids in try_partitionwise_join was Re: Assert failure on bms_equal(child_joinrel->relids, child_joinrelids)

2024-07-01 Thread David Christensen
On Jul 1, 2024, at 12:56 AM, Ashutosh Bapat  wrote:
On Wed, Jun 12, 2024 at 4:22 AM David Christensen  wrote:
On Mon, Jun 10, 2024 at 8:15 AM Robert Haas  wrote:
>
> On Mon, Jun 10, 2024 at 3:09 AM Ashutosh Bapat
>  wrote:
> > This is just one instance of measurements. If I run the experiment multiple times the results and the patterns will vary. Usually I have found planning time to vary within 5% for regular tables and within 9% for partitioned tables with a large number of partitions. Below are measurements with the experiment repeated multiple times. For a given number of partitioned tables (each with 1000 partitions) being joined, planning time is measured 10 consecutive times. For this set of 10 runs we calculate average and standard deviation of planning time. Such 10 sets are sampled. This means planning time is sampled 100 times in total with and without patch respectively. Measurements with master and patched are reported in the attached excel sheet.
>
> Well, this is fine then I guess, but if the original results weren't
> stable enough for people to draw conclusions from, then it's better
> not to post them, and instead do this work to get results that are
> stable before posting.

Just doing a quick code review of the structure and the caller, I'd
agree that this is properly hoisting the invariant, so don't see that
it should contribute to any performance regressions.  To the extent
that it's called multiple times I can see that it's an improvement,
with minimal code shuffling it seems like a sensible change (even in
the single-caller case).

In short +1 from me.

Hi David,

Do you think it needs more review or we can change it to "ready for committer"?

I think it's ready from my standpoint.

David

Re: [PATCH] Fix docs to use canonical links

2024-07-01 Thread Joel Jacobson
On Mon, Jul 1, 2024, at 09:35, Daniel Gustafsson wrote:
> Avoiding redirects is generally a good thing, not everyone is on lightning fast
> internet.  Wikipedia is however not doing any 30X redirects so it's not really
> an issue for those links, it's all 200 requests.

Yes, I noticed that too when observing the HTTPS traffic, so no issue there,
except that it's a bit annoying that the address bar suddenly changes.

However, I think David J had another good argument:

"If we are making wikipedia our authority we might as well use their standard 
for naming."

/Joel




Re: Virtual generated columns

2024-07-01 Thread Peter Eisentraut

On 28.06.24 02:00, jian he wrote:

the test structure you made (generated_stored.sql,
generated_virtual.sql) looks ok to me.
But do we need to reset the search_path at the end of
generated_stored.sql and generated_virtual.sql?


No, the session ends at the end of the test file, so we don't need to 
reset session state.



+ /*
+ * TODO: This could be done, but it would need a different implementation:
+ * no rewriting, but still need to recheck any constraints.
+ */
+ if (attTup->attgenerated == ATTRIBUTE_GENERATED_VIRTUAL)
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("ALTER TABLE / SET EXPRESSION is not supported for virtual
generated columns"),
+ errdetail("Column \"%s\" of relation \"%s\" is a virtual generated column.",
+   colName, RelationGetRelationName(rel;

minor typo, should be
+ errmsg("ALTER TABLE SET EXPRESSION is not supported for virtual
generated columns"),


This style "ALTER TABLE / something else" is also used for other error 
messages related to ALTER TABLE subcommands, so I am using the same here.



insert/update/delete/merge returning have problems:
CREATE TABLE t2 (
a int ,
b int GENERATED ALWAYS AS (a * 2),
d int default 22);
insert into t2(a) select g from generate_series(1,10) g;

insert into t2 select 100 returning *, (select t2.b), t2.b = t2.a * 2;
update t2 set a = 12 returning *, (select t2.b), t2.b = t2.a * 2;
update t2 set a = 12 returning *, (select (select t2.b)), t2.b = t2.a * 2;
delete from t2 where t2.b = t2.a * 2 returning *, 1,((select t2.b));

currently, for all of these queries, the error message is "unexpected virtual
generated column reference".
Do we expect the above queries to work?


Yes, this is a bug.  I'm looking into it.


issue with merge:
CREATE TABLE t0 (a int PRIMARY KEY, b int GENERATED ALWAYS AS (a * 2) VIRTUAL);
insert into t0(a) select g from generate_series(1,10) g;
MERGE INTO t0 t USING t0 AS s ON 2 * t.a = s.b WHEN MATCHED THEN
DELETE returning *;

the above query returns zero rows, but for stored generated columns it
will return 10 rows.

In transformMergeStmt(ParseState *pstate, MergeStmt *stmt), adding
`qry->hasGeneratedVirtual = pstate->p_hasGeneratedVirtual;`
before
`assign_query_collations(pstate, qry);`
solves the problem.


Good catch.  Will fix.

Thanks for this review.  I will work on fixing the issues above and come 
back with a new patch set.






Re: Virtual generated columns

2024-07-01 Thread Peter Eisentraut

On 17.06.24 21:31, Tomasz Rybak wrote:

v1-0001-Rename-regress-test-generated-to-generated_stored.patch:
no objections here, makes sense as preparation for future changes

v1-0002-Put-generated_stored-test-objects-in-a-schema.patch:
also no objections.
OTOH other tests (like publication.out, rowsecurity.out, stats_ext.out,
tablespace.out) are creating schemas and later dropping them - so here
it might also make sense to drop schema at the end of testing.


The existing tests for generated columns don't drop what they create at 
the end, which can be useful for pg_upgrade testing for example.  So 
unless there are specific reasons to change it, I would leave that as is.


Other tests might have other reasons.  For example, publications or row 
security might interfere with many other tests.



v1-0003-Remove-useless-initializations.patch:
All other cases (I checked directory src/backend/utils/cache)
calling MemoryContextAllocZero only initialize fields when they
are non-zero, so removing partial initialization with false brings
consistency to the code.

v1-0004-Remove-useless-code.patch:
Patch removes filling in of constraints from function
BuildDescForRelation. This function is only called from file
view.c and tablecmds.c (twice). In none of those cases
result->constr is used, so proposed change makes sense.
While I do not know code well, so might be wrong here,
I would apply this patch.


I have committed these two now.




Re: Make tuple deformation faster

2024-07-01 Thread David Rowley
On Mon, 1 Jul 2024 at 22:07, Matthias van de Meent
 wrote:
> Cool, that's similar to, but even better than, my patch from 2021 over at [0].

I'll have a read of that. Thanks for pointing it out.

> One thing I'm slightly concerned about is that this allocates another
> 8 bytes for each attribute in the tuple descriptor. While that's not a
> lot when compared with the ->attrs array, it's still quite a lot when
> we might not care at all about this data; e.g. in temporary tuple
> descriptors during execution, in intermediate planner nodes.

I've not done it in the patch, but one way to get some of that back is
to ditch pg_attribute.attcacheoff. There's no need for it after this
patch.  That's only 4 out of 8 bytes, however. I think in most cases
due to FormData_pg_attribute being so huge, the aset.c power-of-2
roundup behaviour will be unlikely to cross a power-of-2 boundary.

The following demonstrates which column counts that actually makes a
difference with:

select c as n_cols,old_bytes, new_bytes from (select c,24+104*c as
old_bytes, 32+100*c+8*c as new_bytes from generate_series(1,1024) c)
where position('1' in old_bytes::bit(32)::text) != position('1' in
new_bytes::bit(32)::text);

That returns just 46 column counts out of 1024 where we cross a power
of 2 boundaries with the patched code that we didn't cross in master.
Of course, larger pallocs will result in a malloc() directly, so
perhaps that's not a good measure.  At least for smaller column counts
it should be mainly the same amount of memory used. There are only 6
rows in there for column counts below 100. I think if we were worried
about memory there are likely 100 other things we could do to reclaim
some. It would only take some shuffling of fields in RelationData. I
count 50 bytes of holes in that struct out of the 488 bytes. There are
probably a few that could be moved without upsetting the
struct-field-order-lords too much.

> Did you test for performance gains (or losses) with an out-of-line
> TupleDescDeformAttr array? One benefit from this would be that we
> could reuse the deform array for suffix truncated TupleDescs, reuse of
> which currently would require temporarily updating TupleDesc->natts
> with a smaller value; but with out-of-line ->attrs and ->deform_attrs,
> we could reuse these arrays between TupleDescs if one is shorter than
> the other, but has otherwise fully matching attributes. I know that
> btree split code would benefit from this, as it wouldn't have to
> construct a full new TupleDesc when it creates a suffix-truncated
> tuple during page splits.

No, but it sounds easy to test as patch 0002 moves that out of line
and does nothing else.

> > 0004: Adjusts the attalign to change it from char to uint8.  See below.
> >
> > The 0004 patch changes the TupleDescDeformAttr.attalign to a uint8
> > rather than a char containing 'c', 's', 'i' or 'd'. This allows much
> > more simple code in the att_align_nominal() macro. What's in master is
> > quite a complex expression to evaluate every time we deform a column
> > > as it must translate: 'c' -> 1, 's' -> 2, 'i' -> 4, 'd' -> 8. If we
> > just store that numeric value in the struct that macro can become a
> > simple TYPEALIGN() so the operation becomes simple bit masking rather
> > than a poorly branch predictable series of compare and jump.
>
> +1, that's something I'd missed in my patches, and is probably the
> largest contributor to the speedup.

I think so too and I did consider if we should try and do that to
pg_attribute, renaming the column to attalignby. I started but didn't
finish a patch for that.

> > I'll stick this in the July CF. It would be good to get some feedback
> > on the idea and feedback on whether more work on this is worthwhile.
>
> Do you plan to remove the ->attcacheoff catalog field from the
> FormData_pg_attribute, now that (with your patch) it isn't used
> anymore as a placeholder field for fast (de)forming of tuples?

Yes, I plan to do that once I get more confidence I'm on to a winner here.

Thanks for having a look at this.

David




Re: pgsql: Add more SQL/JSON constructor functions

2024-07-01 Thread Amit Langote
On Sun, Jun 30, 2024 at 3:56 AM Tom Lane  wrote:
> Alvaro Herrera  writes:
> >> +/*
> >> + * For domains, consider the base type's typmod to decide whether to 
> >> setup
> >> + * an implicit or explicit cast.
> >> + */
> >> +if (get_typtype(returning->typid) == TYPTYPE_DOMAIN)
> >> +(void) getBaseTypeAndTypmod(returning->typid, );
>
> > TBH I'm not super clear on why we decide on explicit or implicit cast
> > based on presence of a typmod.  Why isn't it better to always use an
> > implicit one?
>
> Hmm ... there are a bunch of existing places that seem to have similar
> logic, but they are all in new-ish SQL/JSON functionality, and I would
> not be surprised if they are all wrong.  parse_coerce.c is quite
> opinionated about what a domain's typtypmod means (see comments in
> coerce_type() for instance); see also the logic in coerce_to_domain:
>
>  * If the domain applies a typmod to its base type, build the appropriate
>  * coercion step.  Mark it implicit for display purposes, because we don't
>  * want it shown separately by ruleutils.c; but the isExplicit flag passed
>  * to the conversion function depends on the manner in which the domain
>  * coercion is invoked, so that the semantics of implicit and explicit
>  * coercion differ.  (Is that really the behavior we want?)
>
> I don't think that this SQL/JSON behavior quite matches that.

The reason I decided to go for the implicit cast only when there is a
typmod is that the behavior with COERCION_EXPLICIT is only problematic
when there's a typmod because of this code in
build_coercion_expression:

if (nargs == 3)
{
/* Pass it a boolean isExplicit parameter, too */
cons = makeConst(BOOLOID,
 -1,
 InvalidOid,
 sizeof(bool),
 BoolGetDatum(ccontext == COERCION_EXPLICIT),
 false,
 true);

args = lappend(args, cons);
}

Yeah, we could have fixed that by always using COERCION_IMPLICIT for
SQL/JSON but, as Jian said, we don't have a bunch of casts that these
SQL/JSON functions need, which is why I guess we ended up with
COERCION_EXPLICIT here in the first place.

One option I hadn't tried was using COERCION_ASSIGNMENT instead, which
seems to give coerceJsonFuncExpr() the casts it needs with the
behavior it wants, so how about applying the attached?
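
For context, the kind of query whose behavior hinges on this choice is roughly the
following (a sketch; whether the coercion to the domain's typmod truncates or raises
an error is precisely what the chosen coercion context decides):

```
CREATE DOMAIN short_text AS varchar(3);

-- RETURNING asks for a domain over a typmod-bearing base type.
SELECT JSON_VALUE(jsonb '"abcdef"', '$' RETURNING short_text);
```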

-- 
Thanks, Amit Langote


v6-0001-SQL-JSON-Rethink-c2d93c3802b.patch
Description: Binary data


Re: New standby_slot_names GUC in PG 17

2024-07-01 Thread Amit Kapila
On Thu, Jun 27, 2024 at 7:14 AM Masahiko Sawada  wrote:
>
> On Wed, Jun 26, 2024 at 6:15 PM Zhijie Hou (Fujitsu)
>  wrote:
>
> Thank you for updating the patch. The v2 patch looks good to me.
>

Pushed.

-- 
With Regards,
Amit Kapila.




gamma() and lgamma() functions

2024-07-01 Thread Dean Rasheed
Attached is a patch adding support for the gamma and log-gamma
functions. See, for example:

https://en.wikipedia.org/wiki/Gamma_function

I think these are very useful general-purpose mathematical functions.
They're part of C99, and they're commonly included in other
mathematical libraries, such as the python math module, so I think
it's useful to make them available from SQL.

The error-handling for these functions seems to be a little trickier
than most, so that might need further discussion.

Regards,
Dean
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
new file mode 100644
index 5a16910..6464a8b
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -1399,6 +1399,27 @@ SELECT NOT(ROW(table.*) IS NOT NULL) FRO
   

 
+ gamma
+
+gamma ( double precision )
+double precision
+   
+   
+Gamma function
+   
+   
+gamma(0.5)
+1.772453850905516
+   
+   
+gamma(6)
+120
+   
+  
+
+  
+   
+
  gcd
 
 gcd ( numeric_type, numeric_type )
@@ -1436,6 +1457,23 @@ SELECT NOT(ROW(table.*) IS NOT NULL) FRO

   
 
+  
+   
+
+ lgamma
+
+lgamma ( double precision )
+double precision
+   
+   
+Natural logarithm of the absolute value of the gamma function
+   
+   
+lgamma(1000)
+5905.220423209181
+   
+  
+
   

 
diff --git a/src/backend/utils/adt/float.c b/src/backend/utils/adt/float.c
new file mode 100644
index cbbb8ae..e43aa42
--- a/src/backend/utils/adt/float.c
+++ b/src/backend/utils/adt/float.c
@@ -2779,6 +2779,91 @@ derfc(PG_FUNCTION_ARGS)
 }
 
 
+/* == GAMMA FUNCTIONS == */
+
+
+/*
+ *		dgamma			- returns the gamma function of arg1
+ */
+Datum
+dgamma(PG_FUNCTION_ARGS)
+{
+	float8		arg1 = PG_GETARG_FLOAT8(0);
+	float8		result;
+
+	/*
+	 * Per the POSIX spec, return NaN if the input is NaN, +/-infinity if the
+	 * input is +/-0, NaN if the input is -infinity, and infinity if the input
+	 * is infinity.
+	 */
+	if (isnan(arg1))
+		PG_RETURN_FLOAT8(get_float8_nan());
+	if (arg1 == 0)
+	{
+		if (signbit(arg1))
+			PG_RETURN_FLOAT8(-get_float8_infinity());
+		else
+			PG_RETURN_FLOAT8(get_float8_infinity());
+	}
+	if (isinf(arg1))
+	{
+		if (arg1 < 0)
+			PG_RETURN_FLOAT8(get_float8_nan());
+		else
+			PG_RETURN_FLOAT8(get_float8_infinity());
+	}
+
+	/*
+	 * Note that the POSIX/C99 gamma function is called "tgamma", not "gamma".
+	 *
+	 * For negative integer inputs, it may raise a domain error or a pole
+	 * error, or return NaN or +/-infinity (the POSIX spec indicates that this
+	 * may change in future implementations).  Since we can't reliably detect
+	 * integer inputs above a certain size, and we can't distinguish this from
+	 * any other overflow error, just treat them all the same.
+	 */
+	errno = 0;
+	result = tgamma(arg1);
+	if (errno != 0 || isinf(result) || isnan(result))
+		float_overflow_error();
+
+	PG_RETURN_FLOAT8(result);
+}
+
+
+/*
+ *		dlgamma			- natural logarithm of absolute value of gamma of arg1
+ */
+Datum
+dlgamma(PG_FUNCTION_ARGS)
+{
+	float8		arg1 = PG_GETARG_FLOAT8(0);
+	float8		result;
+
+	/* Per the POSIX spec, return NaN if the input is NaN */
+	if (isnan(arg1))
+		PG_RETURN_FLOAT8(get_float8_nan());
+
+	errno = 0;
+	result = lgamma(arg1);
+
+	/*
+	 * If an ERANGE error occurs, it means there is an overflow.  This can
+	 * happen if the input is 0, a negative integer, -infinity, or infinity.
+	 * In each of those cases, the correct result is infinity.  However, it
+	 * can also happen in other cases where the correct result is finite, but
+	 * too large to be represented (e.g., if the input is larger than around
+	 * 2.5e305).  There seems to be no way to distinguish those cases, so we
+	 * just return infinity for them too.  That's not ideal, but in practice,
+	 * lgamma() is much less likely to overflow than tgamma(), so it seems OK.
+	 */
+	if (unlikely(errno == ERANGE))
+		result = get_float8_infinity();
+
+	PG_RETURN_FLOAT8(result);
+}
+
+
 
 /*
  *		=
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
new file mode 100644
index 6a5476d..acaf4fb
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -3484,6 +3484,13 @@
   proname => 'erfc', prorettype => 'float8', proargtypes => 'float8',
   prosrc => 'derfc' },
 
+{ oid => '8702', descr => 'gamma function',
+  proname => 'gamma', prorettype => 'float8', proargtypes => 'float8',
+  prosrc => 'dgamma' },
+{ oid => '8703', descr => 'natural logarithm of absolute value of gamma function',
+  proname => 'lgamma', prorettype => 'float8', proargtypes => 'float8',
+  prosrc => 'dlgamma' },
+
 { oid => '1618',
   proname => 'interval_mul', prorettype => 'interval',
   proargtypes => 'interval float8', prosrc => 'interval_mul' },
diff --git 

Alias of VALUES RTE in explain plan

2024-07-01 Thread Ashutosh Bapat
Hi All,
While reviewing Richard's patch for grouping sets, I stumbled upon
following explain output

explain (costs off)
select distinct on (a, b) a, b
from (values (1, 1), (2, 2)) as t (a, b) where a = b
group by grouping sets((a, b), (a))
order by a, b;
   QUERY PLAN

 Unique
   ->  Sort
 Sort Key: "*VALUES*".column1, "*VALUES*".column2
 ->  HashAggregate
   Hash Key: "*VALUES*".column1, "*VALUES*".column2
   Hash Key: "*VALUES*".column1
   ->  Values Scan on "*VALUES*"
 Filter: (column1 = column2)
(8 rows)

There are no columns VALUES.column1 and VALUES.column2 in the query. The aliases t.a
and t.b do not appear anywhere in the explain output. I think the explain
output should look like
explain (costs off)
select distinct on (a, b) a, b
from (values (1, 1), (2, 2)) as t (a, b) where a = b
group by grouping sets((a, b), (a))
order by a, b;
   QUERY PLAN

 Unique
   ->  Sort
 Sort Key: t.a, t.b
 ->  HashAggregate
   Hash Key: t.a, t.b
   Hash Key: t.a
   ->  Values Scan on "*VALUES*" t
 Filter: (a = b)
(8 rows)

I didn't get time to figure out the reason behind this, nor the history.
But I thought I would report it nonetheless.

-- 
Best Wishes,
Ashutosh Bapat


Re: Make tuple deformation faster

2024-07-01 Thread David Rowley
On Mon, 1 Jul 2024 at 21:17, Andy Fan  wrote:
> Yet another wonderful optimization! I just want to know how you
> found this optimization (CPU cache hit) case and decided it was worth some
> time, because before we invest our time to optimize something, it is
> better to know that we can get some measurable improvement after our time
> is spent. As for this case, 30% is really a huge number even if it is an
> artificial case.
>
> Another case is when Andrew introduced NullableDatum 5 years ago and said using
> it in TupleTableSlot could be CPU cache friendly. I can follow that, but
> how much can it improve in an ideal case? Is it possible to forecast it
> somehow? I ask it here because both cases are optimizing for the CPU cache.

Have a look at:

perf stat --pid=

On my AMD Zen4 machine running the 16 extra column test from the
script in my last email, I see:

$ echo master && perf stat --pid=389510 sleep 10
master

 Performance counter stats for process id '389510':

   9990.65 msec task-clock:u  #0.999 CPUs utilized
 0  context-switches:u#0.000 /sec
 0  cpu-migrations:u  #0.000 /sec
 0  page-faults:u #0.000 /sec
   49407204156  cycles:u  #4.945 GHz
  18529494  stalled-cycles-frontend:u #0.04% frontend
cycles idle
   8505168  stalled-cycles-backend:u  #0.02% backend cycles idle
  165442142326  instructions:u#3.35  insn per cycle
  #0.00  stalled
cycles per insn
   39409877343  branches:u#3.945 G/sec
 146350275  branch-misses:u   #0.37% of all branches

  10.001012132 seconds time elapsed

$ echo patched && perf stat --pid=380216 sleep 10
patched

 Performance counter stats for process id '380216':

   9989.14 msec task-clock:u  #0.998 CPUs utilized
 0  context-switches:u#0.000 /sec
 0  cpu-migrations:u  #0.000 /sec
 0  page-faults:u #0.000 /sec
   49781280456  cycles:u  #4.984 GHz
  22922276  stalled-cycles-frontend:u #0.05% frontend
cycles idle
  24259785  stalled-cycles-backend:u  #0.05% backend cycles idle
  213688149862  instructions:u#4.29  insn per cycle
  #0.00  stalled
cycles per insn
   44147675129  branches:u#4.420 G/sec
  14282567  branch-misses:u   #0.03% of all branches

  10.005034271 seconds time elapsed

You can see the branch predictor has done a *much* better job in the
patched code vs master with about 10x fewer misses.  This should have
helped contribute to the "insn per cycle" increase.  4.29 is quite
good for postgres. I often see that around 0.5. According to [1]
(relating to Zen4), "We get a ridiculous 12 NOPs per cycle out of the
micro-op cache". I'm unsure how micro-ops translate to "insn per
cycle" that's shown in perf stat. I thought 4-5 was about the maximum
pipeline size from today's era of CPUs. Maybe someone else can explain
better than I can. In more simple terms, generally, the higher the
"insn per cycle", the better. Also, the lower all of the idle and
branch miss percentages are that's generally also better. However,
you'll notice that the patched version has more front and backend
stalls. I assume this is due to performing more instructions per cycle
from improved branch prediction causing memory and instruction stalls
to occur more frequently, effectively (I think) it's just hitting the
next bottleneck(s) - memory and instruction decoding. At least, modern
CPUs should be able to out-pace RAM in many workloads, so perhaps it's
not that surprising that "backend cycles idle" has gone up due to such
a large increase in instructions per cycle due to improved branch
prediction.

It would be nice to see this tested on some modern Intel CPU. A 13th
series or 14th series, for example, or even any intel from the past 5
years would be better than nothing.

David

[1] 
https://chipsandcheese.com/2022/11/05/amds-zen-4-part-1-frontend-and-execution-engine/




Re: Make tuple deformation faster

2024-07-01 Thread Matthias van de Meent
On Mon, 1 Jul 2024 at 10:56, David Rowley  wrote:
>
> Currently, TupleDescData contains the descriptor's attributes in a
> variable length array of FormData_pg_attribute allocated within the
> same allocation as the TupleDescData. According to my IDE,
> sizeof(FormData_pg_attribute) == 104 bytes. It's that large mainly due
> to attname being 64 bytes. The TupleDescData.attrs[] array could end
> up quite large on tables with many columns and that could result in
> poor CPU cache hit ratios when deforming tuples.
>
> Instead, we could make TupleDescData contain an out-of-line pointer to
> the array of FormData_pg_attribute and have a much more compact
> inlined array of some other struct that much more densely contains the
> fields required for tuple deformation. attname and many of the other
> fields are not required to deform a tuple.

+1

> I've attached a patch series which does this.
>
> 0001: Just fixes up some missing usages of TupleDescAttr(). (mostly
> missed by me, apparently :-( )
> 0002: Adjusts the TupleDescData.attrs array to make it out of line. I
> wanted to make sure nothing weird happened by doing this before doing
> the bulk of the other changes to add the new struct.
> 0003: Adds a very compact 8-byte struct named TupleDescDeformAttr,
> which can be used for tuple deformation. 8 columns fit on a single
> 64-byte cacheline rather than spanning 13 cachelines.

Cool, that's similar to, but even better than, my patch from 2021 over at [0].
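
To make sure I read 0002/0003 right, here is roughly the shape I have
in mind; the field names and widths below are my own guesses for
illustration, not necessarily what your patch actually uses:

/*
 * Sketch only (guessed fields, using the usual c.h typedefs): a compact
 * per-attribute struct carrying just what tuple deformation needs,
 * sized to 8 bytes so that 8 of them share one 64-byte cacheline.
 */
typedef struct TupleDescDeformAttr
{
    int32    attcacheoff;  /* cached offset into the tuple, or -1 */
    int16    attlen;       /* fixed length, or -1/-2 for varlena/cstring */
    bool     attbyval;     /* pass-by-value? */
    uint8    attalign;     /* alignment in bytes, per your 0004 */
} TupleDescDeformAttr;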

One thing I'm slightly concerned about is that this allocates another
8 bytes for each attribute in the tuple descriptor. While that's not a
lot when compared with the ->attrs array, it's still quite a lot when
we might not care at all about this data; e.g. in temporary tuple
descriptors during execution, in intermediate planner nodes.

Did you test for performance gains (or losses) with an out-of-line
TupleDescDeformAttr array? One benefit from this would be that we
could reuse the deform array for suffix truncated TupleDescs, reuse of
which currently would require temporarily updating TupleDesc->natts
with a smaller value; but with out-of-line ->attrs and ->deform_attrs,
we could reuse these arrays between TupleDescs if one is shorter than
the other, but has otherwise fully matching attributes. I know that
btree split code would benefit from this, as it wouldn't have to
construct a full new TupleDesc when it creates a suffix-truncated
tuple during page splits.

> 0004: Adjusts the attalign to change it from char to uint8.  See below.
>
> The 0004 patch changes the TupleDescDeformAttr.attalign to a uint8
> rather than a char containing 'c', 's', 'i' or 'd'. This allows much
> simpler code in the att_align_nominal() macro. What's in master is
> quite a complex expression to evaluate every time we deform a column,
> as it must translate: 'c' -> 1, 's' -> 2, 'i' -> 4, 'd' -> 8. If we
> just store that numeric value in the struct, that macro can become a
> simple TYPEALIGN(), so the operation becomes simple bit masking rather
> than a poorly branch-predictable series of compares and jumps.

+1, that's something I'd missed in my patches, and is probably the
largest contributor to the speedup.
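
To spell out why, for anyone who hasn't looked at tupmacs.h recently,
the difference is roughly the following (reconstructed from memory, so
treat it as a sketch rather than the exact patch; the TYPALIGN_* and
*ALIGN macros are the existing ones from pg_type.h and c.h):

/* master: chained comparisons on the 'c'/'s'/'i'/'d' alignment char */
#define att_align_nominal_master(cur_offset, attalign) \
( \
    ((attalign) == TYPALIGN_INT) ? INTALIGN(cur_offset) : \
     (((attalign) == TYPALIGN_CHAR) ? (uintptr_t) (cur_offset) : \
      (((attalign) == TYPALIGN_DOUBLE) ? DOUBLEALIGN(cur_offset) : \
       SHORTALIGN(cur_offset))) \
)

/* with attalign stored as 1/2/4/8, the same thing is plain bit masking */
#define att_align_nominal_patched(cur_offset, attalign) \
    TYPEALIGN((attalign), (cur_offset))

The second form is a couple of arithmetic instructions with no
data-dependent branch at all.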

> I'll stick this in the July CF. It would be good to get some feedback
> on the idea and feedback on whether more work on this is worthwhile.

Do you plan to remove the ->attcacheoff catalog field from the
FormData_pg_attribute, now that (with your patch) it isn't used
anymore as a placeholder field for fast (de)forming of tuples?

Kind regards,

Matthias van de Meent

[0] 
https://www.postgresql.org/message-id/CAEze2Wh8-metSryZX_Ubj-uv6kb%2B2YnzHAejmEdubjhmGusBAg%40mail.gmail.com




Re: cfbot update: Using GitHub for patch review

2024-07-01 Thread Ashutosh Bapat
On Sat, Jun 29, 2024 at 2:12 PM Jelte Fennema-Nio  wrote:

> On Sat, 29 Jun 2024 at 01:13, Thomas Munro  wrote:
> >
> > On Sat, Jun 29, 2024 at 1:10 AM Ashutosh Bapat
> >  wrote:
> > > I need to sign in to github to add my review comments. So those who do
> > > not have a github account cannot use it for review. But I don't think
> > > that can be fixed. We need a way to know who left review comments.
> >
> > I don't think Jelte was talking about moving review discussion to
> > Github, just providing a link to *view* the patches there.
>
> Totally correct. And I realize now I should have called that out
> explicitly in the initial email.
>

While I personally would like to see that one day, getting a consensus and
changing the process for the whole community is a lot of effort. I didn't
think (or mean) that we would move our review process to Github with this
change. Sorry if it sounded like that.


>
> While I personally would love to be able to read & write comments on a
> Github PR, integrating that with the mailing list in a way that the
> community is happy with as a whole is no small task (both technically
> and politically).
>

It is not a small amount of work, I agree. But it may be a way forward.
Those who want to use PRs for review can do so as long as the reviews are
visible on the mailing list. Many of us already draft our review emails
similarly to how they would look in a PR. If the PR system can send that
email on the reviewer's behalf (as if it's sent by the reviewer), it will
integrate well with the current process. People will learn, get used to it
and eventually move to PR-based reviews.


>
> So (for now) I took the easy way out and sidestepped all those
> difficulties, by making the github branches of the cfbot (which we
> already had) a bit more user friendly as a way to access patches in a
> read-only way.
>

+1.


-- 
Best Wishes,
Ashutosh Bapat


Re: Conflict Detection and Resolution

2024-07-01 Thread Amit Kapila
On Mon, Jul 1, 2024 at 1:35 PM Masahiko Sawada  wrote:
>
> Setting resolvers at table-level and subscription-level sounds good to
> me. DDLs for setting resolvers at subscription-level would need the
> subscription name to be specified?
>

Yes, it should be part of the ALTER/CREATE SUBSCRIPTION command. One
idea could be to have syntax as follows:

ALTER SUBSCRIPTION name SET CONFLICT RESOLVER 'conflict_resolver' FOR
'conflict_type';
ALTER SUBSCRIPTION name RESET CONFLICT RESOLVER FOR 'conflict_type';

CREATE SUBSCRIPTION subscription_name CONNECTION 'conninfo'
PUBLICATION publication_name [, ...] CONFLICT RESOLVER
'conflict_resolver' FOR 'conflict_type';

> And another question is: does a
> table-level resolver setting take precedence over all subscription-level
> resolver settings in the database?
>

Yes.

-- 
With Regards,
Amit Kapila.




Re: Conflict Detection and Resolution

2024-07-01 Thread Amit Kapila
On Mon, Jul 1, 2024 at 11:47 AM Masahiko Sawada  wrote:
>
> On Thu, May 23, 2024 at 3:37 PM shveta malik  wrote:
> >
> > DELETE
> > 
> > Conflict Type:
> > 
> > delete_missing: An incoming delete is trying to delete a row on a
> > target node which does not exist.
>
> IIUC the 'delete_missing' conflict doesn't cover the case where an
> incoming delete message is trying to delete a row that has already
> been updated locally or by another node. I think in update/delete
> conflict situations, we need to resolve the conflicts based on commit
> timestamps like we do for update/update and insert/update conflicts.
>
> For example, suppose there are two nodes, node-A and node-B, set up with
> bi-directional replication, and suppose further that both have the row
> with id = 1. Consider the following sequence:
>
> 09:00:00  DELETE ... WHERE id = 1 on node-A.
> 09:00:05  UPDATE ... WHERE id = 1 on node-B.
> 09:00:10  node-A received the update message from node-B.
> 09:00:15  node-B received the delete message from node-A.
>
> At 09:00:10 on node-A, an update_deleted conflict is generated since
> the row on node-A is already deleted locally. Suppose that we use
> 'apply_or_skip' resolution for this conflict, we convert the update
> message into an insertion, so node-A now has the row with id = 1. At
> 09:00:15 on node-B, the incoming delete message is applied and deletes
> the row with id = 1, even though the row has already been modified
> locally. The node-A and node-B are now inconsistent. This
> inconsistency can be avoided by using 'skip' resolution for the
> 'update_deleted' conflict on node-A, and 'skip' resolution is the
> default method for that actually. However, if we handle it as
> 'update_missing', the 'apply_or_skip' resolution is used by default.
>
> IIUC with the proposed architecture, DELETE always takes precedence
> over UPDATE since both 'update_deleted' and 'update_missing' don't use
> commit timestamps to resolve the conflicts. As long as that is true, I
> think there is no use case for 'apply_or_skip' and 'apply_or_error'
> resolutions in update/delete conflict cases. In short, I think we need
> something like 'delete_differ' conflict type as well. FYI PGD and
> Oracle GoldenGate seem to have this conflict type[1][2].
>

Your explanation makes sense to me and I agree that we should
implement 'delete_differ' conflict type.

> The 'delete_differ' conflict type would have at least
> 'latest_timestamp_wins' resolution. With the timestamp based
> resolution method, we would deal with update/delete conflicts as
> follows:
>
> 09:00:00: DELETE ... WHERE id = 1 on node-A.
> 09:00:05: UPDATE ... WHERE id = 1 on node-B.
> - the updated row doesn't have the origin since it's a local change.
> 09:00:10: node-A received the update message from node-B.
> - the incoming update message has the origin of node-B whereas the
> local row is already removed locally.
> - 'update_deleted' conflict is generated.
>

FYI, as of now, we don't have a reliable way to detect
'update_deleted' type of conflicts but we had some discussion about
the same [1].

> - do the insert of the new row instead, because the commit
> timestamp of UPDATE is newer than DELETE's one.
> 09:00:15: node-B received the delete message from node-A.
> - the incoming delete message has the origin of node-B whereas the
> (updated) row doesn't have the origin.
> - 'update_differ' conflict is generated.
> - discard DELETE, because the commit timestamp of UPDATE is newer
> than DELETE's one.
>
> As a result, both nodes have the new version row.
>

Right, it seems to me that we should implement 'latest_timestamp_wins' if
we want consistency in such cases.

[1] - 
https://www.postgresql.org/message-id/CAA4eK1Lj-PWrP789KnKxZydisHajd38rSihWXO8MVBLDwxG1Kg%40mail.gmail.com

-- 
With Regards,
Amit Kapila.




Re: Converting README documentation to Markdown

2024-07-01 Thread Daniel Gustafsson
> On 28 Jun 2024, at 20:40, Peter Eisentraut  wrote:

> If we're going to do it, then I expect that the files are marked up correctly 
> at all times.

I agree with that. I don't think it will be a terribly high bar though, since
we were pretty much already writing markdown.  We already have pandoc in the
meson toolchain; adding a target to check syntax should be doable.

> Conversely, what's the best that could happen?

One of the main goals of this work was to make sure the documentation renders
nicely on platforms which potential new contributors consider part of the
fabric of writing code.  We might not be on Github (and I'm not advocating that
we should) but any new contributor we want to attract is pretty likely to be
using it.  The best that can happen is that new contributors find the postgres
code more approachable and get excited about contributing to postgres.

--
Daniel Gustafsson





Re: Surround CheckRelation[Oid]LockedByMe() with USE_ASSERT_CHECKING

2024-07-01 Thread Bertrand Drouvot
Hi,

On Mon, Jul 01, 2024 at 05:01:46PM +0900, Michael Paquier wrote:
> On Mon, Jul 01, 2024 at 06:42:46AM +, Bertrand Drouvot wrote:
> > While working on a rebase for [1] due to 0cecc908e97, I noticed that
> > CheckRelationLockedByMe() and CheckRelationOidLockedByMe() are used only in
> > assertions.
> > 
> > I think it would make sense to declare / define those functions only for
> > assert enabled build: please find attached a tiny patch doing so.
> > 
> > Thoughts?
> 
> Not convinced that's a good idea.  What about out-of-core code that
> may use these routines for runtime checks in non-assert builds?

Thanks for the feedback.

Yeah, that could be an issue for CheckRelationLockedByMe()
(CheckRelationOidLockedByMe() is too recent to be a concern).

Having said that, 1. out-of-core code could want to use
CheckRelationOidLockedByMe() (probably if it was already using
CheckRelationLockedByMe()), and 2. I just submitted a rebase for [1] in
which I thought that using CheckRelationOidLockedByMe() would be a good
idea.

So I think that we can get rid of this proposal.

[1]: 
https://www.postgresql.org/message-id/ZoJ5RVtMziIa3TQp%40ip-10-97-1-34.eu-west-3.compute.internal

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Avoid orphaned objects dependencies, take 3

2024-07-01 Thread Bertrand Drouvot
Hi,

On Wed, Jun 26, 2024 at 10:24:41AM +, Bertrand Drouvot wrote:
> Hi,
> 
> On Fri, Jun 21, 2024 at 01:22:43PM +, Bertrand Drouvot wrote:
> > Another thought for the RelationRelationId class case: we could check if
> > there is a lock first and if there is no lock then acquire one. That way
> > that would ensure the relation is always locked (so no "risk" anymore),
> > but OTOH it may add "unnecessary" locking (see 2. mentioned previously).
> 
> Please find attached v12 implementing this idea for the RelationRelationId
> class case. As mentioned, it may add unnecessary locking for 2. but I think
> that's worth it to ensure that we are always on the safe side of things.
> This idea is implemented in LockNotPinnedObjectById().

Please find attached v13, mandatory rebase due to 0cecc908e97. In passing, make
use of CheckRelationOidLockedByMe() added in 0cecc908e97.
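
To help review, the idea for the RelationRelationId class boils down to
something like the following (a simplified sketch only, not the exact
code in the patch; the lock mode is just for illustration and I assume
CheckRelationOidLockedByMe() keeps a signature mirroring
CheckRelationLockedByMe()):

#include "postgres.h"
#include "storage/lmgr.h"
#include "storage/lockdefs.h"

/*
 * Simplified sketch: lock the relation only if this backend does not
 * already hold a lock on it, so a concurrent DROP is blocked without
 * taking redundant locks in the common case.
 */
static void
LockNotPinnedRelation(Oid relid)
{
    if (!CheckRelationOidLockedByMe(relid, AccessShareLock, true))
        LockRelationOid(relid, AccessShareLock);
}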

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
>From 461ae5d2fa037b2b9fd285c2a328c8bc785eaec8 Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot 
Date: Fri, 29 Mar 2024 15:43:26 +
Subject: [PATCH v13] Avoid orphaned objects dependencies

It's currently possible to create orphaned objects dependencies, for example:

Scenario 1:

session 1: begin; drop schema schem;
session 2: create a function in the schema schem
session 1: commit;

With the above, the function created in session 2 would be linked to a non
existing schema.

Scenario 2:

session 1: begin; create a function in the schema schem
session 2: drop schema schem;
session 1: commit;

With the above, the function created in session 1 would be linked to a non
existing schema.

To avoid those scenarios, a new lock (that conflicts with a lock taken by DROP)
has been put in place before the dependencies are being recorded. With this in
place, the drop schema in scenario 2 would be locked.

Also, after the new lock attempt, the patch checks that the object still exists:
with this in place session 2 in scenario 1 would be locked and would report an
error once session 1 commits (that would not be the case should session 1 abort
the transaction).

If the object is dropped before the new lock attempt is triggered then the patch
would also report an error (but with fewer details).

The patch takes into account any type of object except the ones that are pinned
(they are not droppable because the system requires it).

A special case is done for objects that belong to the RelationRelationId class.
For those, we should be in one of the two following cases that would already
prevent the relation from being dropped:

1. The relation is already locked (could be an existing relation or a relation
that we are creating).

2. The relation is protected indirectly (i.e. an index protected by a lock on
its table, a table protected by a lock on a function that depends on the table, ...)

To avoid any risks for the RelationRelationId class case, we acquire a lock if
there is none. That may add unnecessary locking for 2. but that's worth it.

The patch adds a few tests for some dependency cases (that would currently produce
orphaned objects):

- schema and function (as the above scenarios)
- alter a dependency (function and schema)
- function and arg type
- function and return type
- function and function
- domain and domain
- table and type
- server and foreign data wrapper
---
 src/backend/catalog/aclchk.c  |   1 +
 src/backend/catalog/dependency.c  | 121 +++-
 src/backend/catalog/heap.c|   7 +
 src/backend/catalog/index.c   |  26 
 src/backend/catalog/objectaddress.c   |  57 
 src/backend/catalog/pg_aggregate.c|   9 ++
 src/backend/catalog/pg_attrdef.c  |   1 +
 src/backend/catalog/pg_cast.c |   5 +
 src/backend/catalog/pg_collation.c|   1 +
 src/backend/catalog/pg_constraint.c   |  26 
 src/backend/catalog/pg_conversion.c   |   2 +
 src/backend/catalog/pg_depend.c   |  40 +-
 src/backend/catalog/pg_operator.c |  19 +++
 src/backend/catalog/pg_proc.c |  17 ++-
 src/backend/catalog/pg_publication.c  |   7 +
 src/backend/catalog/pg_range.c|   6 +
 src/backend/catalog/pg_type.c |  39 ++
 src/backend/catalog/toasting.c|   1 +
 src/backend/commands/alter.c  |   4 +
 src/backend/commands/amcmds.c |   1 +
 src/backend/commands/cluster.c|   7 +
 src/backend/commands/event_trigger.c  |   1 +
 src/backend/commands/extension.c  |   5 +
 src/backend/commands/foreigncmds.c|   7 +
 src/backend/commands/functioncmds.c   |   6 +
 src/backend/commands/indexcmds.c  |   2 +
 src/backend/commands/opclasscmds.c|  18 +++
 src/backend/commands/operatorcmds.c   |  

Re: Avoid incomplete copy string (src/backend/access/transam/xlog.c)

2024-07-01 Thread Daniel Gustafsson
> On 27 Jun 2024, at 13:50, Ranier Vilela  wrote:

> Now with file patch really attached.

-   if (strlen(backupidstr) > MAXPGPATH)
+   if (strlcpy(state->name, backupidstr, sizeof(state->name)) >= sizeof(state->name))
ereport(ERROR,

Stylistic nit perhaps, I would keep the strlen check here and just replace the
memcpy with strlcpy.  Using strlen in the error message check makes the code
more readable.
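
In other words, something along these lines (the surrounding code is
quoted from memory, so details may differ from what's in xlog.c today):

	if (strlen(backupidstr) > MAXPGPATH)
		ereport(ERROR,
				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
				 errmsg("backup label too long (max %d bytes)",
						MAXPGPATH)));

	/* the check above keeps the error readable, strlcpy just does the copy */
	strlcpy(state->name, backupidstr, sizeof(state->name));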


-   char        name[MAXPGPATH + 1];
+   char        name[MAXPGPATH];        /* backup label name */

With the introduced use of strlcpy, why do we need to change this field?

--
Daniel Gustafsson




