Re: Improve eviction algorithm in ReorderBuffer

2024-03-28 Thread vignesh C
On Tue, 26 Mar 2024 at 10:05, Masahiko Sawada  wrote:
>
> On Thu, Mar 14, 2024 at 12:02 PM Masahiko Sawada  
> wrote:
> >
> >
> > I've attached new version patches.
>
> Since the previous patch conflicts with the current HEAD, I've
> attached the rebased patches.

Thanks for the updated patch.
One comment:
I felt we could also mention in the commit message the improvement where
we update the memory accounting info at the transaction level instead of
at the per-change level, which is done in ReorderBufferCleanupTXN,
ReorderBufferTruncateTXN, and ReorderBufferSerializeTXN:
@@ -1527,7 +1573,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb,
ReorderBufferTXN *txn)
/* Check we're not mixing changes from different
transactions. */
Assert(change->txn == txn);

-   ReorderBufferReturnChange(rb, change, true);
+   ReorderBufferReturnChange(rb, change, false);
}

/*
@@ -1586,8 +1632,13 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb,
ReorderBufferTXN *txn)
if (rbtxn_is_serialized(txn))
ReorderBufferRestoreCleanup(rb, txn);

+   /* Update the memory counter */
+   ReorderBufferChangeMemoryUpdate(rb, NULL, txn, false, txn->size);
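The difference between per-change and per-transaction accounting can be pictured with a minimal standalone sketch (the type and function names below are invented for illustration, not ReorderBuffer's actual API): the old code subtracted each change's size from the counters as the change was returned, while with upd_mem = false the cleanup path subtracts the transaction's accumulated size once, after the loop.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified model: the buffer tracks total memory; each txn tracks its own. */
typedef struct Txn { size_t size; } Txn;
typedef struct Buffer { size_t total; } Buffer;

/* Returning one change; with update_mem set, counters are touched per change. */
static void return_change(Buffer *rb, Txn *txn, size_t change_size, int update_mem)
{
    if (update_mem)
    {
        rb->total -= change_size;
        txn->size -= change_size;
    }
    /* ... free the change itself ... */
}

/* New approach: pass update_mem = false, then adjust the counter once. */
static void cleanup_txn(Buffer *rb, Txn *txn, const size_t *changes, int nchanges)
{
    for (int i = 0; i < nchanges; i++)
        return_change(rb, txn, changes[i], 0);
    rb->total -= txn->size;     /* single counter update for the whole txn */
    txn->size = 0;
}
```

The observable state is the same afterwards; only the number of counter updates drops from O(changes) to O(1) per transaction.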

Regards,
Vignesh




Re: type cache cleanup improvements

2024-03-28 Thread Жарков Роман

> Rest of patches, rebased.

Hello,
I read and tested only the first patch so far.  Creation of temp
tables and rollback in your script work 3-4 times faster with
0001-type-cache.patch on my Windows laptop.

In the patch I found a copy of the comment "If it's domain over
composite, reset flags...".  Can we move the reset-flags operation
and its comment into the invalidateCompositeTypeCacheEntry()
function?  This would simplify the TypeCacheRelCallback() function,
but adds two more IF statements when we need to clean up a cache
entry for a specific relation (diff attached).
--
Roman Zharkov


mapRelType-v2.patch
Description: Binary data


Re: PSQL Should \sv & \ev work with materialized views?

2024-03-28 Thread Erik Wienhold
On 2024-03-29 04:27 +0100, Isaac Morland wrote:
> On Thu, 28 Mar 2024 at 20:38, Erik Wienhold  wrote:
> 
> 
> > Of course the problem with using DROP and CREATE is that indexes and
> > privileges (anything else?) must also be restored.  I haven't bothered
> > with that yet.
> >
> 
> Not just those — also anything that depends on the matview, such as views
> and other matviews.

Right.  But you'd run into the same issue for a regular view if you use
\ev and add  DROP VIEW myview CASCADE , which may be necessary if you
want to change column names and/or types.  Likewise, you'd have to
manually change  DROP MATERIALIZED VIEW  and add the CASCADE option to
also drop dependent objects.

I think implementing  CREATE OR REPLACE MATERIALIZED VIEW  has more
value.  But the semantics have to be defined first.  I guess it has to
behave like  CREATE OR REPLACE VIEW  in that it only allows changing the
query without altering column names and types.

We could also implement \sv so that it only prints  CREATE MATERIALIZED
VIEW  and change \ev to not work with matviews.  Both commands use
get_create_object_cmd to populate the query buffer, so you get \ev for
free when changing \sv.

-- 
Erik




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-28 Thread Amit Kapila
On Wed, Mar 27, 2024 at 9:00 PM Bharath Rupireddy
 wrote:
>
>
> Thanks. I'm attaching v29 patches. 0001 managing inactive_since on the
> standby for sync slots.
>

Commit message states: "why we can't just update inactive_since for
synced slots on the standby with the value received from remote slot
on the primary. This is consistent with any other slot parameter i.e.
all of them are synced from the primary."

The inactive_since is not consistent with the other slot parameters
that we copy.  We don't perform any operation on those other
parameters, say two_phase, that could change their values.  However, we
do acquire the slot, advance the slot (as per a recent discussion [1]),
and release it.  Since these operations can impact inactive_since, it
seems to me that inactive_since is not the same as the other
parameters: it can have a different value than on the primary.  And why
would anyone want to know the value of inactive_since from the primary
after the standby is promoted?  Now, the other concern is that calling
GetCurrentTimestamp() could be costly when the values for the slot are
not going to be updated; but if that happens, we can optimize such that
before acquiring the slot we do some minimal pre-checks to determine
whether we need to update the slot or not.
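The pre-check idea can be sketched as follows (a minimal standalone model with invented names; not the actual slot-sync code): compare the remote slot's positions against the local ones first, and only pay for the acquire/timestamp path when something actually moved.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

typedef struct SlotInfo
{
    XLogRecPtr confirmed_lsn;
    XLogRecPtr restart_lsn;
} SlotInfo;

static int timestamp_calls = 0;

/* Stands in for the (comparatively costly) GetCurrentTimestamp(). */
static int64_t get_current_timestamp(void)
{
    return ++timestamp_calls;
}

/* Only acquire and timestamp the slot when the remote slot has advanced. */
static int maybe_sync_slot(SlotInfo *local, const SlotInfo *remote)
{
    if (remote->confirmed_lsn <= local->confirmed_lsn &&
        remote->restart_lsn <= local->restart_lsn)
        return 0;                   /* pre-check: nothing to update, skip */
    /* ... acquire slot ... */
    *local = *remote;
    (void) get_current_timestamp(); /* inactive_since bookkeeping */
    /* ... release slot ... */
    return 1;
}
```

When the remote slot has not moved, the timestamp function is never called at all.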

[1] - 
https://www.postgresql.org/message-id/OS0PR01MB571615D35F486080616CA841943A2%40OS0PR01MB5716.jpnprd01.prod.outlook.com

-- 
With Regards,
Amit Kapila.




RE: Synchronizing slots from primary to standby

2024-03-28 Thread Hayato Kuroda (Fujitsu)
Dear Hou,

Thanks for updating the patch! Here is a comment for it.

```
+/*
+ * By advancing the restart_lsn, confirmed_lsn, and xmin using
+ * fast-forward logical decoding, we can verify whether a consistent
+ * snapshot can be built. This process also involves saving necessary
+ * snapshots to disk during decoding, ensuring that logical decoding
+ * efficiently reaches a consistent point at the restart_lsn without
+ * the potential loss of data during snapshot creation.
+ */
+pg_logical_replication_slot_advance(remote_slot->confirmed_lsn,
+found_consistent_point);
+ReplicationSlotsComputeRequiredLSN();
+updated_lsn = true;
```

You added them like pg_replication_slot_advance() does, but that function
also calls ReplicationSlotsComputeRequiredXmin(false) at that point.
According to the related commit b48df81 and the discussion [1], I know it
is needed only for physical slots, but it would be more consistent to call
ReplicationSlotsComputeRequiredXmin() here as well, per [2]:

```
This may be a waste if no advancing is done, but it could also be an
advantage to enforce a recalculation of the thresholds for each
function call.  And that's more consistent with the slot copy, drop
and creation.
```

What do you think?

[1]: 
https://www.postgresql.org/message-id/20200609171904.kpltxxvjzislidks%40alap3.anarazel.de
[2]: https://www.postgresql.org/message-id/20200616072727.GA2361%40paquier.xyz

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



Re: Synchronizing slots from primary to standby

2024-03-28 Thread shveta malik
On Fri, Mar 29, 2024 at 6:36 AM Zhijie Hou (Fujitsu)
 wrote:
>
> Attached a new version patch which fixes an uninitialized variable issue
> and adds some comments.  Also, temporarily enabled DEBUG2 for the 040
> tap-test so that we can analyze the possible CFbot failures easily.

As suggested by Amit in [1], for the fix being discussed where we need
to advance the synced slot on the standby, we need to skip the dbid check
in fast_forward mode in CreateDecodingContext().  We tried a few tests to
make sure that there is no table access done during fast-forward mode:

1) Initially we tried avoiding database-id check in
CreateDecodingContext() only when called by
pg_logical_replication_slot_advance(). 'make check-world' passed on
HEAD for the same.

2) But the more generic solution was to skip the database check if
"fast_forward" is true. It was tried and 'make check-world' passed on
HEAD for that as well.

3) Another thing tried by Hou-San was to run pgbench after skipping the
db check in the fast-forward logical decoding case.
pgbench was run to generate some changes, and then the logical slot was
advanced to the latest position in another database.  A LOG was added
in relation_open to catch table access.  It was found that there was no
table access in fast-forward logical decoding, i.e., no LOGs for
table-open were generated during the test.  Steps are given at [2].
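The generic fix in (2) amounts to a condition of roughly this shape (a simplified standalone sketch, not the actual CreateDecodingContext() code): a logical slot is normally tied to one database, but in fast-forward mode no catalog of that database is accessed, so the check can be bypassed.

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned int Oid;
#define InvalidOid ((Oid) 0)

/*
 * Return true when decoding may proceed.  Logical slots normally require
 * the current database to match the slot's database, but fast-forward
 * decoding performs no table access, so the check is skipped there.
 */
static bool database_check_passes(Oid slot_db, Oid my_db, bool fast_forward)
{
    if (fast_forward)
        return true;        /* generic fix: skip the db check entirely */
    return slot_db != InvalidOid && slot_db == my_db;
}
```

With fast_forward = false the behaviour is unchanged; only the fast-forward path becomes database-agnostic.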

[1]: 
https://www.postgresql.org/message-id/CAA4eK1KMiKangJa4NH_K1oFc87Y01n3rnpuwYagT59Y%3DADW8Dw%40mail.gmail.com

[2]:
--
1. apply the DEBUG patch (attached as .txt) which will log the
relation open and table cache access.

2. create a slot:
SELECT 'init' FROM pg_create_logical_replication_slot('logicalslot',
'test_decoding', false, false, true);

3. run pgbench to generate some data.
pgbench -i postgres
pgbench --aggregate-interval=5 --time=5 --client=10 --log --rate=1000
--latency-limit=10 --failures-detailed --max-tries=10 postgres

4. start a fresh session in a different db and advance the slot to the
latest position. There should be no relation open or CatCache log
between the LOG "starting logical decoding for slot .." and LOG
"decoding over".
SELECT pg_replication_slot_advance('logicalslot', pg_current_wal_lsn());
--

thanks
Shveta
From 5386894faa14c0de9854e0eee9679f8eea775f65 Mon Sep 17 00:00:00 2001
From: Hou Zhijie 
Date: Fri, 29 Mar 2024 11:46:36 +0800
Subject: [PATCH] debug log

---
 src/backend/access/common/relation.c | 2 ++
 src/backend/replication/slotfuncs.c  | 1 +
 src/backend/utils/cache/catcache.c   | 1 +
 3 files changed, 4 insertions(+)

diff --git a/src/backend/access/common/relation.c 
b/src/backend/access/common/relation.c
index d8a313a2c9..40718fc47e 100644
--- a/src/backend/access/common/relation.c
+++ b/src/backend/access/common/relation.c
@@ -50,6 +50,7 @@ relation_open(Oid relationId, LOCKMODE lockmode)
 
Assert(lockmode >= NoLock && lockmode < MAX_LOCKMODES);
 
+   elog(LOG, "relation_open");
/* Get the lock before trying to open the relcache entry */
if (lockmode != NoLock)
LockRelationOid(relationId, lockmode);
@@ -91,6 +92,7 @@ try_relation_open(Oid relationId, LOCKMODE lockmode)
 
Assert(lockmode >= NoLock && lockmode < MAX_LOCKMODES);
 
+   elog(LOG, "try_relation_open");
/* Get the lock first */
if (lockmode != NoLock)
LockRelationOid(relationId, lockmode);
diff --git a/src/backend/replication/slotfuncs.c 
b/src/backend/replication/slotfuncs.c
index ef5081784c..564b36fc45 100644
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -608,6 +608,7 @@ pg_logical_replication_slot_advance(XLogRecPtr moveto,
 
/* free context, call shutdown callback */
FreeDecodingContext(ctx);
+   elog(LOG, "decoding over");
 
InvalidateSystemCaches();
}
diff --git a/src/backend/utils/cache/catcache.c 
b/src/backend/utils/cache/catcache.c
index 569f51cb33..e19c586697 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -1328,6 +1328,7 @@ SearchCatCacheInternal(CatCache *cache,
 
Assert(cache->cc_nkeys == nkeys);
 
+   elog(LOG, "SearchCatCacheInternal");
/*
 * one-time startup overhead for each cache
 */
-- 
2.31.1



Re: DOCS: add helpful partitioning links

2024-03-28 Thread Ashutosh Bapat
On Thu, Mar 28, 2024 at 11:22 PM Alvaro Herrera 
wrote:

> On 2024-Mar-28, Ashutosh Bapat wrote:
>
> > LGTM.
> >
> > The commitfest entry is marked as RFC already.
> >
> > Thanks for taking care of the comments.
>
> Thanks for reviewing.  I noticed a typo "seperate", fixed here.


Thanks for catching it.


> Also, I
> noticed that Robert added an empty line which looks in the source like
> he's breaking the paragraph -- but because he didn't add a closing </para>
> and an opening <para> to the next one, there's no actual new paragraph
> in the HTML output.
>

> My first instinct was to add those.  However, upon reading the text, I
> noticed that the previous paragraph ends without offering an example,
> and then we attach the example to the paragraph that talks about CREATE
> TABLE LIKE showing both techniques, which seemed a bit odd.  So instead
> I joined both paragraphs back together.  I'm unsure which one looks
> better.  Which one do you vote for?
>

"CREATE TABLE ... LIKE" is mentioned in a separate paragraph in HEAD as
well. The confused me too but I didn't find any reason. Robert just made
that explicit by adding a blank line. I thought that was ok. But it makes
sense to not have a separate paragraph in the source code too. Thanks for
fixing it. I think the intention of the current code as well as the patch
is to have a single paragraph in HTML output, same as "no-extra-para"
output.

-- 
Best Wishes,
Ashutosh Bapat


Re: AIX support

2024-03-28 Thread Tom Lane
Thomas Munro  writes:
> Oh, sorry, I had missed the part where newer compilers fix the issue
> too.  Old out-of-support versions of AIX running old compilers, what
> fun.

Indeed.  One of the topics that needs investigation if you want to
pursue this is which AIX system and compiler versions still deserve
support, and which of the AIX hacks we had been carrying still need
to be there based on that analysis.  For context, we've been pruning
support for extinct-in-the-wild OS versions pretty aggressively
over the past couple of years, and I'd expect to apply the same
standard to AIX.

regards, tom lane




Re: PSQL Should \sv & \ev work with materialized views?

2024-03-28 Thread Isaac Morland
On Thu, 28 Mar 2024 at 20:38, Erik Wienhold  wrote:


> Of course the problem with using DROP and CREATE is that indexes and
> privileges (anything else?) must also be restored.  I haven't bothered
> with that yet.
>

Not just those — also anything that depends on the matview, such as views
and other matviews.


Re: remaining sql/json patches

2024-03-28 Thread jian he
On Thu, Mar 28, 2024 at 1:23 PM Amit Langote  wrote:
>
> On Wed, Mar 27, 2024 at 1:34 PM Amit Langote  wrote:
> > On Wed, Mar 27, 2024 at 12:42 PM jian he  
> > wrote:
> > > hi.
> > > I don't fully understand all the code in json_table patch.
> > > maybe we can split it into several patches,
> >
> > I'm working on exactly that atm.
> >
> > > like:
> > > * no nested json_table_column.
> > > * nested json_table_column, with PLAN DEFAULT
> > > * nested json_table_column, with PLAN ( json_table_plan )
> >
> > Yes, I think it will end up something like this.  I'll try to post the
> > breakdown tomorrow.
>
> Here's patch 1 for the time being that implements barebones
> JSON_TABLE(), that is, without NESTED paths/columns and PLAN clause.
> I've tried to shape the interfaces so that those features can be added
> in future commits without significant rewrite of the code that
> implements barebones JSON_TABLE() functionality.  I'll know whether
> that's really the case when I rebase the full patch over it.
>
> I'm still reading and polishing it and would be happy to get feedback
> and testing.
>

+static void
+JsonValueListClear(JsonValueList *jvl)
+{
+ jvl->singleton = NULL;
+ jvl->list = NULL;
+}
jvl->list is a List pointer; shouldn't we set it with "jvl->list = NIL" instead?
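For reference, NIL in PostgreSQL's pg_list.h is just a typed NULL, so the two spellings are equivalent at runtime and the question is purely one of idiom; a minimal standalone illustration (the struct layout is simplified, not jsonpath_exec.c's actual definitions):

```c
#include <assert.h>
#include <stddef.h>

/* Mirrors the definition in src/include/nodes/pg_list.h */
typedef struct List { int length; } List;
#define NIL ((List *) NULL)

/* Simplified stand-in for the JsonValueList in the patch. */
typedef struct JsonValueList
{
    void *singleton;
    List *list;
} JsonValueList;

static void json_value_list_clear(JsonValueList *jvl)
{
    jvl->singleton = NULL;
    jvl->list = NIL;    /* same bits as NULL, but the idiomatic empty list */
}
```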

+ if (jperIsError(res))
+ {
+ /* EMPTY ON ERROR case */
+ Assert(!planstate->plan->errorOnError);
+ JsonValueListClear(&planstate->found);
+ }
I am not sure the comment is right.
`SELECT * FROM JSON_TABLE(jsonb'"1.23"', 'strict $.a' COLUMNS (js2 int
PATH '$') );`
will execute the jperIsError branch.
So will:
SELECT * FROM JSON_TABLE(jsonb'"1.23"', 'strict $.a' COLUMNS (js2 int
PATH '$') default '1' on error);

I think it means that when applying path_expression, if the top-level
ON ERROR behavior is not ERROR, then the `if (jperIsError(res))` branch
may be executed.



--- a/src/include/utils/jsonpath.h
+++ b/src/include/utils/jsonpath.h
@@ -15,6 +15,7 @@
 #define JSONPATH_H

 #include "fmgr.h"
+#include "executor/tablefunc.h"
 #include "nodes/pg_list.h"
 #include "nodes/primnodes.h"
 #include "utils/jsonb.h"

should be:
+#include "executor/tablefunc.h"
 #include "fmgr.h"


+
+JSON_TABLE (
+    context_item, path_expression [ AS json_path_name ]
+        [ PASSING { value AS varname } [, ...] ]
+    COLUMNS ( json_table_column [, ...] )
+    [ { ERROR | EMPTY } ON ERROR ]
+)
The top level (i.e., not in the COLUMNS clause) also allows
NULL ON ERROR.

SELECT JSON_VALUE(jsonb'"1.23"', 'strict $.a' null on error);
returns one value (a NULL).
SELECT * FROM JSON_TABLE(jsonb'"1.23"', 'strict $.a' COLUMNS (js2 int
PATH '$') NULL on ERROR);
returns zero rows.
Is this what we expect?


The main changes are in jsonpath_exec.c, parse_expr.c, and
parse_jsontable.c; overall the test coverage seems pretty good.
I added some tests to improve the coverage.


v46-0001-improve-regress-coverage-test-based-on-v46.no-cfbot
Description: Binary data


Re: AIX support

2024-03-28 Thread Thomas Munro
On Fri, Mar 29, 2024 at 4:00 PM Thomas Munro  wrote:
> On Fri, Mar 29, 2024 at 3:48 PM Noah Misch  wrote:
> > The thread Alvaro and Tom cited contains an analysis.  It's a compiler bug.
> > You can get past the compiler bug by upgrading your compiler; both ibm-clang
> > 17.1.1.2 and gcc 13.2.0 are free from the bug.
>
> For the specific issue that triggered that, I strongly suspect that it
> would go away if we just used smgrzeroextend() instead of smgrextend()
> using that variable with the alignment requirement, since, as far as I
> can tell from build farm clues, the otherwise similar function-local
> static variable used by the former (ie one that the linker must still
> control the location of AFAIK?) seems to work fine.

Oh, sorry, I had missed the part where newer compilers fix the issue
too.  Old out-of-support versions of AIX running old compilers, what
fun.




Re: AIX support

2024-03-28 Thread Tom Lane
Noah Misch  writes:
> On Thu, Mar 28, 2024 at 11:09:43AM +, Sriram RK wrote:
>> Also, would like to know if we can access the buildfarm(power machines) to 
>> run any of the specific tests to hit this assert.

> https://portal.cfarm.net/users/new/ is the form to request access.  It lists
> the eligibility criteria.

There might be some confusion here about what system we are talking
about.  The Postgres buildfarm is described at
https://buildfarm.postgresql.org/index.html
but it consists of a large number of individual machines run by
individual owners.  There would not be a lot of point in adding a
new AIX machine to the Postgres buildfarm right now, because it
would surely fail to build HEAD.  What Noah is referencing is
the GCC compile farm, which happens to include some AIX machines.
The existing AIX entries in the Postgres buildfarm are run (by Noah)
on those GCC compile farm machines, which really the GCC crowd have
been *very* forgiving about letting us abuse like that.  If you have
your own AIX hardware there's not a lot of reason that you should
need to access the GCC farm.

What you do need to do to reproduce the described problems is
check out the Postgres git tree and rewind to just before
commit 0b16bb877, where we deleted AIX support.  Any attempt
to restore AIX support would have to start with reverting that
commit (and perhaps the followup f0827b443).

regards, tom lane




Re: CREATE TABLE creates a composite type corresponding to the table row, which is and is not there

2024-03-28 Thread Erik Wienhold
On 2024-03-29 02:42 +0100, David G. Johnston wrote:
> For consideration for the doc portion.  The existing wording is too
> imprecise for my liking and just tacking on "expects...create type" is
> jarring.
> 
> """
> Creates a typed table, which takes its structure from an existing (name
> optionally schema-qualified) stand-alone composite type (i.e., one created
> using CREATE TYPE), though it still produces a new composite type as well.
> The table will have a dependency on the referenced type such that cascaded
> alter and drop actions on the type will propagate to the table.
> 
> A typed table always has the same column names and data types as the type
> it is derived from, and you cannot specify additional columns.  But the
> CREATE TABLE command can add defaults and constraints to the table, as well
> as specify storage parameters.
> """

Thanks, that sounds better.  I incorporated that with some minor edits
in the attached v3.

> We do use the term "stand-alone composite" in create type so I'm inclined
> to use it instead of "composite created with CREATE TYPE"; especially in
> the error messages; I'm a bit more willing to add the cross-reference to
> create type in the user docs.

Okay, changed in v3 as well.  I used "created with CREATE TYPE" in the
error message because I thought it's clearer to the user.  But I see no
reason for not using "stand-alone" here as well if it's the established
term.

-- 
Erik
>From e9a79e6d5e063098eed4f834b18d576089b38499 Mon Sep 17 00:00:00 2001
From: Erik Wienhold 
Date: Fri, 8 Mar 2024 04:21:56 +0100
Subject: [PATCH v3] Document that typed tables rely on CREATE TYPE

CREATE TABLE OF requires a stand-alone composite type.  Clarify that in
the error message.  Also reword the docs to better explain the
connection between created table and stand-alone composite type.

Reworded docs provided by David G. Johnston.
---
 doc/src/sgml/ref/create_table.sgml| 18 +-
 src/backend/commands/tablecmds.c  |  2 +-
 src/test/regress/expected/typed_table.out |  2 +-
 3 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/doc/src/sgml/ref/create_table.sgml 
b/doc/src/sgml/ref/create_table.sgml
index dfb7822985..5c8c1edaed 100644
--- a/doc/src/sgml/ref/create_table.sgml
+++ b/doc/src/sgml/ref/create_table.sgml
@@ -249,19 +249,19 @@ WITH ( MODULUS numeric_literal, REM
 OF type_name
 
  
-  Creates a typed table, which takes its
-  structure from the specified composite type (name optionally
-  schema-qualified).  A typed table is tied to its type; for
-  example the table will be dropped if the type is dropped
-  (with DROP TYPE ... CASCADE).
+  Creates a typed table, which takes its structure
+  from an existing (name optionally schema-qualified) stand-alone composite
+  type (i.e. created using ) though it
+  still produces a new composite type as well.  The table will have
+  a dependency on the referenced type such that cascaded alter and drop
+  actions on the type will propagate to the table.
  
 
  
-  When a typed table is created, then the data types of the
-  columns are determined by the underlying composite type and are
-  not specified by the CREATE TABLE command.
+  A typed table always has the same column names and data types as the
+  type it is derived from, and you cannot specify additional columns.
   But the CREATE TABLE command can add defaults
-  and constraints to the table and can specify storage parameters.
+  and constraints to the table, as well as specify storage parameters.
  
 

diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 6741e721ae..8e9dbe4bee 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -6990,7 +6990,7 @@ check_of_type(HeapTuple typetuple)
if (!typeOk)
ereport(ERROR,
(errcode(ERRCODE_WRONG_OBJECT_TYPE),
-errmsg("type %s is not a composite type",
-errmsg("type %s is not a composite type",
+errmsg("type %s is not a stand-alone composite type",
format_type_be(typ->oid))));
 }
 
diff --git a/src/test/regress/expected/typed_table.out 
b/src/test/regress/expected/typed_table.out
index 2e47ecbcf5..745fbde811 100644
--- a/src/test/regress/expected/typed_table.out
+++ b/src/test/regress/expected/typed_table.out
@@ -89,7 +89,7 @@ drop cascades to function get_all_persons()
 drop cascades to table persons2
 drop cascades to table persons3
 CREATE TABLE persons5 OF stuff; -- only CREATE TYPE AS types may be used
-ERROR:  type stuff is not a composite type
+ERROR:  type stuff is not a stand-alone composite type
 DROP TABLE stuff;
 -- implicit casting
 CREATE TYPE person_type AS (id int, name text);
-- 
2.44.0



Re: AIX support

2024-03-28 Thread Thomas Munro
On Fri, Mar 29, 2024 at 3:48 PM Noah Misch  wrote:
> On Thu, Mar 28, 2024 at 11:09:43AM +, Sriram RK wrote:
> > We are setting up the build environment and trying to build the source and 
> > also trying to analyze the assert from the Aix point of view.
>
> The thread Alvaro and Tom cited contains an analysis.  It's a compiler bug.
> You can get past the compiler bug by upgrading your compiler; both ibm-clang
> 17.1.1.2 and gcc 13.2.0 are free from the bug.

For the specific issue that triggered that, I strongly suspect that it
would go away if we just used smgrzeroextend() instead of smgrextend()
using that variable with the alignment requirement, since, as far as I
can tell from build farm clues, the otherwise similar function-local
static variable used by the former (ie one that the linker must still
control the location of AFAIK?) seems to work fine.

But we didn't remove AIX because of that, it was just the straw that
broke the camel's back.




Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

2024-03-28 Thread torikoshia

On 2024-03-28 21:54, Masahiko Sawada wrote:
On Thu, Mar 28, 2024 at 9:38 PM torikoshia  
wrote:


On 2024-03-28 10:20, Masahiko Sawada wrote:
> Hi,
>
> On Thu, Jan 18, 2024 at 5:33 PM Masahiko Sawada 
> wrote:
>>
>> On Thu, Jan 18, 2024 at 4:59 PM Alexander Korotkov
>>  wrote:
>> >
>> > On Thu, Jan 18, 2024 at 4:16 AM torikoshia  
wrote:
>> > > On 2024-01-18 10:10, jian he wrote:
>> > > > On Thu, Jan 18, 2024 at 8:57 AM Masahiko Sawada 
>> > > > wrote:
>> > > >> On Thu, Jan 18, 2024 at 6:38 AM Tom Lane  wrote:
>> > > >> > Kyotaro-san's suggestion isn't bad, though I might shorten it to
>> > > >> > error_action {error|ignore|log} (or perhaps "stop" instead of 
"error")?
>> > > >> > You will need a separate parameter anyway to specify the destination
>> > > >> > of "log", unless "none" became an illegal table name when I wasn't
>> > > >> > looking.  I don't buy that one parameter that has some special 
values
>> > > >> > while other values could be names will be a good design.  Moreover,
>> > > >> > what if we want to support (say) log-to-file along with 
log-to-table?
>> > > >> > Trying to distinguish a file name from a table name without any 
other
>> > > >> > context seems impossible.
>> > > >>
>> > > >> I've been thinking we can add more values to this option to log errors
>> > > >> not only to the server logs but also to the error table (not sure
>> > > >> details but I imagined an error table is created for each table on
>> > > >> error), without an additional option for the destination name. The
>> > > >> values would be like error_action {error|ignore|save-logs|save-table}.
>> > > >>
>> > > >
>> > > > another idea:
>> > > > on_error {error|ignore|other_future_option}
>> > > > if not specified then by default ERROR.
>> > > > You can also specify ERROR or IGNORE for now.
>> > > >
>> > > > I agree, the parameter "error_action" is better than "location".
>> > >
>> > > I'm not sure whether error_action or on_error is better, but either way
>> > > "error_action error" and "on_error error" seems a bit odd to me.
>> > > I feel "stop" is better for both cases as Tom suggested.
>> >
>> > OK.  What about this?
>> > on_error {stop|ignore|other_future_option}
>> > where other_future_option might be compound like "file 'copy.log'" or
>> > "table 'copy_log'".
>>
>> +1
>>
>
> I realized that ON_ERROR syntax synoposis in the documentation is not
> correct. The option doesn't require the value to be quoted and the
> value can be omitted. The attached patch fixes it.
>
> Regards,

Thanks!

Attached patch fixes the doc, but I'm wondering perhaps it might be
better to modify the codes to prohibit abbreviation of the value.

When seeing a query which abbreviates the ON_ERROR value, I feel it's
not obvious what happens, compared to other options which tolerate
abbreviation of the value, such as FREEZE or HEADER.

   COPY t1 FROM stdin WITH (ON_ERROR);

What do you think?


Indeed. Looking at the options of other commands such as VACUUM and
EXPLAIN, I can see that we can omit a boolean value, but non-boolean
parameters require a value. The HEADER option is not a pure boolean
parameter, but we can omit its value; that seems to be for backward
compatibility, as it used to be a boolean parameter. I agree that the
above example would confuse users.
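The boolean/non-boolean distinction can be sketched with a simplified standalone model of DefElem handling (invented helper names; not the actual defGetBoolean()/defGetString() code): boolean options treat a missing value as true, while a non-boolean option like ON_ERROR has no sensible implicit value and so, with the patch, must raise an error.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Simplified model of an option as parsed by the grammar. */
typedef struct DefElem
{
    const char *defname;
    const char *arg;        /* NULL when the value was omitted */
} DefElem;

/* Boolean options: a missing value means "true" (like FREEZE, HEADER). */
static bool def_get_boolean(const DefElem *def)
{
    if (def->arg == NULL)
        return true;
    return strcmp(def->arg, "true") == 0 || strcmp(def->arg, "on") == 0;
}

/* Non-boolean options: a missing value is now an error. */
static const char *def_get_string(const DefElem *def, bool *error)
{
    if (def->arg == NULL)
    {
        *error = true;      /* PostgreSQL would ereport(ERROR, ...) here */
        return NULL;
    }
    *error = false;
    return def->arg;
}
```

So `WITH (FREEZE)` maps to FREEZE true, while `WITH (ON_ERROR)` is rejected instead of being silently treated as the default.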

Regards,


Thanks for your comment!

Attached is a patch which modifies the code to prohibit omission of the
value.

I was a little unsure about adding a regression test for this, but I
have not added one, since the other COPY options don't test omission of
their values.


--
Regards,

--
Atsushi Torikoshi
NTT DATA Group CorporationFrom 1b4bec3c2223246ec59ffb9eb7de2f1de27315f7 Mon Sep 17 00:00:00 2001
From: Atsushi Torikoshi 
Date: Fri, 29 Mar 2024 11:36:12 +0900
Subject: [PATCH v1] Disallow ON_ERROR option without value

Currently the ON_ERROR option of COPY allows omitting its value,
but the syntax synopsis in the documentation requires it.

Since non-boolean parameters usually require a value and it's not
obvious what happens when the value of ON_ERROR is omitted, this
patch disallows ON_ERROR without a value.
---
 src/backend/commands/copy.c | 9 +
 1 file changed, 1 insertion(+), 8 deletions(-)

diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 28cf8b040a..2719bf28b7 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -392,7 +392,7 @@ defGetCopyHeaderChoice(DefElem *def, bool is_from)
 static CopyOnErrorChoice
 defGetCopyOnErrorChoice(DefElem *def, ParseState *pstate, bool is_from)
 {
-	char	   *sval;
+	char	   *sval = defGetString(def);
 
 	if (!is_from)
 		ereport(ERROR,
@@ -400,16 +400,9 @@ defGetCopyOnErrorChoice(DefElem *def, ParseState *pstate, bool is_from)
  errmsg("COPY ON_ERROR cannot be used with COPY TO"),
  parser_errposition(pstate, def->location)));
 
-	/*
-	 * If no parameter value given, assume the default value.
-	 */
-	if (def->arg == NULL)
-		return COPY_ON_ERROR_STOP;
-
 	/*
 	 * Allow "stop", or 

Re: AIX support

2024-03-28 Thread Noah Misch
On Thu, Mar 28, 2024 at 11:09:43AM +, Sriram RK wrote:
> We are setting up the build environment and trying to build the source and 
> also trying to analyze the assert from the Aix point of view.

The thread Alvaro and Tom cited contains an analysis.  It's a compiler bug.
You can get past the compiler bug by upgrading your compiler; both ibm-clang
17.1.1.2 and gcc 13.2.0 are free from the bug.

> Also, would like to know if we can access the buildfarm(power machines) to 
> run any of the specific tests to hit this assert.

https://portal.cfarm.net/users/new/ is the form to request access.  It lists
the eligibility criteria.




Re: CREATE TABLE creates a composite type corresponding to the table row, which is and is not there

2024-03-28 Thread David G. Johnston
On Thu, Mar 7, 2024 at 9:29 PM Erik Wienhold  wrote:

> I wrote:
> > The attached v2 is a simpler patch that instead modifies the existing
> > error message.
>
> Forgot to attach v2.
>
>
For consideration for the doc portion.  The existing wording is too
imprecise for my liking and just tacking on "expects...create type" is
jarring.

"""
Creates a typed table, which takes its structure from an existing (name
optionally schema-qualified) stand-alone composite type (i.e., one created
using CREATE TYPE), though it still produces a new composite type as well.
The table will have a dependency on the referenced type such that cascaded
alter and drop actions on the type will propagate to the table.

A typed table always has the same column names and data types as the type
it is derived from, and you cannot specify additional columns.  But the
CREATE TABLE command can add defaults and constraints to the table, as well
as specify storage parameters.
"""

We do use the term "stand-alone composite" in create type so I'm inclined
to use it instead of "composite created with CREATE TYPE"; especially in
the error messages; I'm a bit more willing to add the cross-reference to
create type in the user docs.

David J.


Re: BitmapHeapScan streaming read user and prelim refactoring

2024-03-28 Thread Thomas Munro
On Fri, Mar 29, 2024 at 10:43 AM Tomas Vondra
 wrote:
> I think there's some sort of bug, triggering this assert in heapam
>
>   Assert(BufferGetBlockNumber(hscan->rs_cbuf) == tbmres->blockno);

Thanks for the repro.  I can't seem to reproduce it (still trying) but
I assume this is with Melanie's v11 patch set which had
v11-0016-v10-Read-Stream-API.patch.

Would you mind removing that commit and instead applying the v13
stream_read.c patches[1]?  v10 stream_read.c was a little confused
about random I/O combining, which I fixed with a small adjustment to
the conditions for the "if" statement right at the end of
read_stream_look_ahead().  Sorry about that.  The fixed version, with
eic=4, with your test query using WHERE a < a, ends its scan with:

...
posix_fadvise(32,0x28aee000,0x4000,POSIX_FADV_WILLNEED) = 0 (0x0)
pread(32,"\0\0\0\0@4\M-5:\0\0\^D\0\M-x\^A"...,40960,0x28acc000) = 40960 (0xa000)
posix_fadvise(32,0x28af4000,0x4000,POSIX_FADV_WILLNEED) = 0 (0x0)
pread(32,"\0\0\0\0\^XC\M-6:\0\0\^D\0\M-x"...,32768,0x28ad8000) = 32768 (0x8000)
posix_fadvise(32,0x28afc000,0x4000,POSIX_FADV_WILLNEED) = 0 (0x0)
pread(32,"\0\0\0\0\M-XQ\M-7:\0\0\^D\0\M-x"...,24576,0x28ae4000) = 24576 (0x6000)
posix_fadvise(32,0x28b02000,0x8000,POSIX_FADV_WILLNEED) = 0 (0x0)
pread(32,"\0\0\0\0\M^@3\M-8:\0\0\^D\0\M-x"...,16384,0x28aee000) = 16384 (0x4000)
pread(32,"\0\0\0\0\M-`\M-:\M-8:\0\0\^D\0"...,16384,0x28af4000) = 16384 (0x4000)
pread(32,"\0\0\0\0po\M-9:\0\0\^D\0\M-x\^A"...,16384,0x28afc000) = 16384 (0x4000)
pread(32,"\0\0\0\0\M-P\M-v\M-9:\0\0\^D\0"...,32768,0x28b02000) = 32768 (0x8000)

In other words it's able to coalesce, but v10 was a bit b0rked in that
respect and wouldn't do as well at that.  Then if you set
io_combine_limit = 1, it looks more like master, eg lots of little
reads, but not as many fadvises as master because of sequential
access:

...
posix_fadvise(32,0x28af4000,0x2000,POSIX_FADV_WILLNEED) = 0 (0x0) -+
pread(32,...,8192,0x28ae8000) = 8192 (0x2000)  |
pread(32,...,8192,0x28aee000) = 8192 (0x2000)  |
posix_fadvise(32,0x28afc000,0x2000,POSIX_FADV_WILLNEED) = 0 (0x0) ---+
pread(32,...,8192,0x28af) = 8192 (0x2000)  | |
pread(32,...,8192,0x28af4000) = 8192 (0x2000) <+ |
posix_fadvise(32,0x28b02000,0x2000,POSIX_FADV_WILLNEED) = 0 (0x0) -+
pread(32,...,8192,0x28af6000) = 8192 (0x2000)| |
pread(32,...,8192,0x28afc000) = 8192 (0x2000) <--+ |
pread(32,...,8192,0x28afe000) = 8192 (0x2000)  }-- no advice   |
pread(32,...,8192,0x28b02000) = 8192 (0x2000) <+
pread(32,...,8192,0x28b04000) = 8192 (0x2000)  }
pread(32,...,8192,0x28b06000) = 8192 (0x2000)  }-- no advice
pread(32,...,8192,0x28b08000) = 8192 (0x2000)  }

It becomes slightly less eager to start I/Os as soon as
io_combine_limit > 1: once it has hit max_ios, if the average number of
blocks it can combine is bigger than 4, an arbitrary number from:

max_pinned_buffers = Max(max_ios * 4, io_combine_limit);

then it can run out of look-ahead window before it can reach
max_ios (aka eic), so that's a kind of arbitrary/bogus I/O depth
constraint, which is another way of saying what I was saying earlier:
maybe it just needs more distance.  So let's see the average combined
I/O length in your test query... for me it works out to 27,169 bytes.
But I think there must be times when it runs out of window due to
clustering.  So you could also try increasing that 4->8 to see what
happens to performance.
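To make the combining step concrete, here is a toy model of it (plain Python for illustration only; this is not the read_stream.c logic, just the general shape of coalescing consecutive block numbers up to io_combine_limit):

```python
def combine_reads(blocks, io_combine_limit):
    """Merge runs of consecutive block numbers into (start_block, nblocks)
    reads, capping each combined read at io_combine_limit blocks."""
    reads = []
    for b in blocks:
        if (reads
                and b == reads[-1][0] + reads[-1][1]
                and reads[-1][1] < io_combine_limit):
            start, n = reads[-1]
            reads[-1] = (start, n + 1)  # extend the pending read
        else:
            reads.append((b, 1))        # start a new read
    return reads

# Clustered blocks coalesce into a few large reads...
print(combine_reads([10, 11, 12, 13, 20, 21], io_combine_limit=4))
# ...while io_combine_limit = 1 degenerates to one pread per block.
print(combine_reads([10, 11, 12, 13, 20, 21], io_combine_limit=1))
```

With io_combine_limit = 1 every block becomes its own read, matching the many small preads in the trace above.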

[1] https://www.postgresql.org/message-id/CA%2BhUKG%2B5UofvseJWv6YqKmuc_%3Drguc7VqKcNEG1eawKh3MzHXQ%40mail.gmail.com




RE: Synchronizing slots from primary to standby

2024-03-28 Thread Zhijie Hou (Fujitsu)
On Thursday, March 28, 2024 10:02 PM Zhijie Hou (Fujitsu) 
 wrote:
> 
> On Thursday, March 28, 2024 7:32 PM Amit Kapila 
> wrote:
> >
> > On Thu, Mar 28, 2024 at 10:08 AM Zhijie Hou (Fujitsu)
> > 
> > wrote:
> > >
> > > When analyzing one BF error[1], we find an issue of slotsync: Since
> > > we don't perform logical decoding for the synced slots when syncing
> > > the lsn/xmin of slot, no logical snapshots will be serialized to
> > > disk. So, when user starts to use these synced slots after
> > > promotion, it needs to re-build the consistent snapshot from the
> > > restart_lsn if the
> > > WAL(xl_running_xacts) at restart_lsn position indicates that there
> > > are running transactions. This however could cause the data that
> > > before the
> > consistent point to be missed[2].
> > >
> > > This issue doesn't exist on the primary because the snapshot at
> > > restart_lsn should have been serialized to disk
> > > (SnapBuildProcessRunningXacts -> SnapBuildSerialize), so even if the
> > > logical decoding restarts, it can find consistent snapshot
> > > immediately at
> > restart_lsn.
> > >
> > > To fix this, we could use the fast forward logical decoding to
> > > advance the synced slot's lsn/xmin when syncing these values instead
> > > of directly updating the slot's info. This way, the snapshot will be
> > > serialized to
> > disk when decoding.
> > > If we could not reach to the consistent point at the remote
> > > restart_lsn, the slot is marked as temp and will be persisted once
> > > it reaches the consistent point. I am still analyzing the fix and
> > > will share once
> > ready.
> > >
> >
> > Yes, we can use this but one thing to note is that
> > CreateDecodingContext() will expect that the slot's database and the
> > current database are the same. I think the reason for that is we need to check
> > system tables of the current database while decoding and sending data
> > to the output_plugin which won't be a requirement for the fast_forward
> > case. So, we need to skip that check in fast_forward mode.
> 
> Agreed.
> 
> >
> > Next, I was thinking about the case of the first time updating the
> > restart and confirmed_flush LSN while syncing the slots. I think we
> > can keep the current logic as it is based on the following analysis.
> >
> > For each logical slot, cases possible on the primary:
> > 1. The restart_lsn doesn't have a serialized snapshot and hasn't yet
> > reached the consistent point.
> > 2. The restart_lsn doesn't have a serialized snapshot but has reached
> > a consistent point.
> > 3. The restart_lsn has a serialized snapshot which means it has
> > reached a consistent point as well.
> >
> > Considering we keep the logic to reserve initial WAL positions the
> > same as the current (Reserve WAL for the currently active local slot
> > using the specified WAL location (restart_lsn). If the given WAL
> > location has been removed, reserve WAL using the oldest existing WAL
> > segment.), I could think of the below
> > scenarios:
> > A. For 1, we shouldn't sync the slot as it still wouldn't have been
> > marked persistent on the primary.
> > B. For 2, we would sync the slot
> >B1. If remote_restart_lsn >= local_resart_lsn, then advance the
> > slot by calling pg_logical_replication_slot_advance().
> >B11. If we reach consistent point, then it should be okay
> > because after promotion as well we should reach consistent point.
> > B111. But again is it possible that there is some xact
> > that comes before consistent_point on primary and the same is after
> > consistent_point on standby? This shouldn't matter as we will start
> > decoding transactions after confirmed_flush_lsn which would be the same on
> primary and standby.
> >B22. If we haven't reached consistent_point, then we won't mark
> > the slot as persistent, and at the next sync we will do the same till
> > it reaches consistent_point. At that time, the situation will be similar to 
> > B11.
> >B2. If remote_restart_lsn < local_restart_lsn, then we will wait
> > for the next sync cycle and keep the slot as temporary. Once in the
> > next or some consecutive sync cycle, we reach the condition
> > remote_restart_lsn >= local_restart_lsn, we will proceed to advance
> > the slot and we should have the same behavior as B1.
> > C. For 3, we would sync the slot, but the behavior should be the same as B.
> >
> > Thoughts?
> 
> Looks reasonable to me.
> 
> Here is the patch based on above lines.
> I am also testing and verifying the patch locally.

Attached is a new version of the patch, which fixes an uninitialized
variable issue and adds some comments. It also temporarily enables DEBUG2
for the 040 tap-test so that we can analyze possible CFbot failures easily.
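As an aside, the case analysis quoted above reduces to a small decision table. A sketch of it (illustrative Python only; the function and argument names are invented for this summary and are not slotsync.c code):

```python
def sync_step(remote_restart_lsn, local_restart_lsn, remote_is_persistent):
    """Illustrative decision table for syncing one logical slot
    (cases A, B1, B2, C from the quoted analysis)."""
    # Case A: the slot on the primary hasn't reached a consistent point
    # yet, so it isn't persistent there -- don't sync it at all.
    if not remote_is_persistent:
        return "skip"
    # Case B2: our local WAL reservation is ahead of the remote slot;
    # keep the synced slot temporary and retry on a later sync cycle.
    if remote_restart_lsn < local_restart_lsn:
        return "wait"
    # Cases B1 and C: advance the local slot with fast-forward decoding
    # (pg_logical_replication_slot_advance) up to the remote position.
    return "advance"
```

Cases B11/B111/B22 then fall out of repeating the "advance" step on later sync cycles until the consistent point is reached.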

Best Regards,
Hou zj


v2-0001-advance-the-restart_lsn-of-synced-slots-using-log.patch
Description:  v2-0001-advance-the-restart_lsn-of-synced-slots-using-log.patch


Re: PSQL Should \sv & \ev work with materialized views?

2024-03-28 Thread Erik Wienhold
I wrote:
> On 2023-05-15 06:32 +0200, Kirk Wolak wrote:
> > Personally I would appreciate it if \sv actually showed you the DDL.
> > Oftentimes I will \ev something to review it, with syntax highlighting.
> 
> +1.  I was just reviewing some matviews and was surprised that psql
> lacks commands to show their definitions.
> 
> But I think that it should be separate commands \sm and \em because we
> already have commands \dm and \dv that distinguish between matviews and
> views.

Separate commands are not necessary because \ev and \sv already have a
(disabled) provision in get_create_object_cmd for when CREATE OR REPLACE
MATERIALIZED VIEW is available.  So I guess both commands should also
apply to matview.  The attached patch replaces that provision with a
transaction that drops and creates the matview.  This uses meta command
\; to put multiple statements into the query buffer without prematurely
sending those statements to the server.

Demo:

=> DROP MATERIALIZED VIEW IF EXISTS test;
DROP MATERIALIZED VIEW
=> CREATE MATERIALIZED VIEW test AS SELECT s FROM generate_series(1, 10) s;
SELECT 10
=> \sv test
BEGIN \;
DROP MATERIALIZED VIEW public.test \;
CREATE MATERIALIZED VIEW public.test AS
 SELECT s
   FROM generate_series(1, 10) s(s)
 WITH DATA \;
COMMIT
=>

And \ev test works as well.

Of course the problem with using DROP and CREATE is that indexes and
privileges (anything else?) must also be restored.  I haven't bothered
with that yet.

-- 
Erik
>From efb5e37d90b668011307b602655f28455d700635 Mon Sep 17 00:00:00 2001
From: Erik Wienhold 
Date: Fri, 29 Mar 2024 01:08:35 +0100
Subject: [PATCH v1] psql: \ev and \sv for matviews

CREATE OR REPLACE is not available for materialized views so DROP and
CREATE them inside a transaction.  Use meta command \; to compose the
query buffer without sending it to the server.

TODO: Re-create indexes and privileges which are currently lost by
  relying on DROP and CREATE.
---
 src/bin/psql/command.c | 29 -
 1 file changed, 20 insertions(+), 9 deletions(-)

diff --git a/src/bin/psql/command.c b/src/bin/psql/command.c
index 9b0fa041f7..f40c1d7f99 100644
--- a/src/bin/psql/command.c
+++ b/src/bin/psql/command.c
@@ -5575,19 +5575,22 @@ get_create_object_cmd(EditableObjectType obj_type, Oid oid,
 					char	   *reloptions = PQgetvalue(res, 0, 4);
 					char	   *checkoption = PQgetvalue(res, 0, 5);
 
-					/*
-					 * If the backend ever supports CREATE OR REPLACE
-					 * MATERIALIZED VIEW, allow that here; but as of today it
-					 * does not, so editing a matview definition in this way
-					 * is impossible.
-					 */
 					switch (relkind[0])
 					{
-#ifdef NOT_USED
 						case RELKIND_MATVIEW:
-							appendPQExpBufferStr(buf, "CREATE OR REPLACE MATERIALIZED VIEW ");
+							/*
+							 * Allow editing a matview via separate DROP and
+							 * CREATE statement inside a transaction.  Use meta
+							 * command \; to write more than one statement to
+							 * the query buffer without sending it to the server.
+							 */
+							appendPQExpBufferStr(buf, "BEGIN \\;\n");
+							appendPQExpBufferStr(buf, "DROP MATERIALIZED VIEW ");
+							appendPQExpBuffer(buf, "%s.", fmtId(nspname));
+							appendPQExpBufferStr(buf, fmtId(relname));
+							appendPQExpBufferStr(buf, " \\;\n");
+							appendPQExpBufferStr(buf, "CREATE MATERIALIZED VIEW ");
 							break;
-#endif
 						case RELKIND_VIEW:
 							appendPQExpBufferStr(buf, "CREATE OR REPLACE VIEW ");
 							break;
@@ -5625,6 +5628,14 @@ get_create_object_cmd(EditableObjectType obj_type, Oid oid,
 						if (checkoption && checkoption[0] != '\0')
Re: [HACKERS] make async slave to wait for lsn to be replayed

2024-03-28 Thread Euler Taveira
On Thu, Mar 28, 2024, at 9:39 AM, Alexander Korotkov wrote:
> Fixed along with other issues spotted by Alexander Lakhin.

[I didn't read the whole thread. I'm sorry if I missed something ...]

You renamed the function in a previous version but let me suggest another one:
pg_wal_replay_wait. It uses the same pattern as the other recovery control
functions [1]. I think "for" doesn't add much for the function name and "lsn" is
used in functions that return an LSN (that's not the case here).

postgres=# \df pg_wal_replay*
List of functions
-[ RECORD 1 ]---+-
Schema  | pg_catalog
Name| pg_wal_replay_pause
Result data type| void
Argument data types | 
Type| func
-[ RECORD 2 ]---+-
Schema  | pg_catalog
Name| pg_wal_replay_resume
Result data type| void
Argument data types | 
Type| func

Regarding the arguments, I think the timeout should be bigint. There is at least
another function that implements a timeout that uses bigint. 

postgres=# \df pg_terminate_backend
List of functions
-[ RECORD 1 ]---+--
Schema  | pg_catalog
Name| pg_terminate_backend
Result data type| boolean
Argument data types | pid integer, timeout bigint DEFAULT 0
Type| func

I also suggest that the timeout unit should be milliseconds; hence, using
bigint is perfectly fine for the timeout argument.

+   
+Throws an ERROR if the target lsn was not replayed
+on standby within given timeout.  Parameter timeout
+is the time in seconds to wait for the target_lsn
+replay. When timeout value equals to zero no
+timeout is applied.
+   


[1] https://www.postgresql.org/docs/current/functions-admin.html#FUNCTIONS-RECOVERY-CONTROL


--
Euler Taveira
EDB   https://www.enterprisedb.com/


Re: [HACKERS] make async slave to wait for lsn to be replayed

2024-03-28 Thread Alexander Korotkov
On Fri, Mar 22, 2024 at 3:45 PM Kartyshov Ivan
 wrote:
> On 2024-03-20 12:11, Alexander Korotkov wrote:
> > On Wed, Mar 20, 2024 at 12:34 AM Kartyshov Ivan
> >  wrote:
> >> > 4.2 With an unreasonably high future LSN, BEGIN command waits
> >> > unboundedly, shouldn't we check if the specified LSN is more than
> >> > pg_last_wal_receive_lsn() error out?
> >
> > I think limiting wait lsn by current received lsn would destroy the
> > whole value of this feature.  The value is to wait till given LSN is
> > replayed, whether it's already received or not.
>
> Ok sounds reasonable, I`ll rollback the changes.
>
> > But I don't see a problem here. On the replica, it's out of our
> > control to check which lsn is good and which is not.  We can't check
> > whether the lsn, which is in future for the replica, is already issued
> > by primary.
> >
> > For the case of wrong lsn, which could cause potentially infinite
> > wait, there is the timeout and the manual query cancel.
>
> Fully agree with this take.
>
> >> > 4.3 With an unreasonably high wait time, BEGIN command waits
> >> > unboundedly, shouldn't we restrict the wait time to some max
> > value,
> >> > say a day or so?
> >> > SELECT pg_last_wal_receive_lsn() + 1 AS future_receive_lsn \gset
> >> > BEGIN AFTER :'future_receive_lsn' WITHIN 10;
> >>
> >> Good idea, I set it to 1 day. But this limit we should discuss.
> >
> > Do you think that specifying timeout in milliseconds is suitable?  I
> > would prefer to switch to seconds (with ability to specify fraction of
> > second).  This was expressed before by Alexander Lakhin.
>
> It sounds like an interesting idea. Please review the result.
>
> >> > https://github.com/macdice/redo-bench or similar tools?
> >
> > Ivan, could you do this?
>
> Yes, test redo-bench/crash-recovery.sh
> This patch on master
> 91.327, 1.973
> 105.907, 3.338
> 98.412, 4.579
> 95.818, 4.19
>
> REL_13-STABLE
> 116.645, 3.005
> 113.212, 2.568
> 117.644, 3.183
> 111.411, 2.782
>
> master
> 124.712, 2.047
> 117.012, 1.736
> 116.328, 2.035
> 115.662, 1.797
>
> Strange behavior, patched version is faster than REL_13-STABLE and
> master.

I've run this test on my machine with v13 of the path.

patched
53.663, 0.466
53.884, 0.402
54.102, 0.441

master
55.216, 0.441
54.52, 0.464
51.479, 0.438

It seems that difference is less than variance.
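Quick arithmetic on the numbers above (Python's statistics module) backs that up: the gap between the means is an order of magnitude smaller than the spread of the master runs alone.

```python
import statistics

patched = [53.663, 53.884, 54.102]
master = [55.216, 54.52, 51.479]

diff = abs(statistics.mean(patched) - statistics.mean(master))
spread = statistics.stdev(master)

# mean difference is ~0.14s, while the master stddev alone is ~1.99s
print(f"mean diff = {diff:.3f}s, master stdev = {spread:.3f}s")
```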

--
Regards,
Alexander Korotkov




Re: PSQL Should \sv & \ev work with materialized views?

2024-03-28 Thread Erik Wienhold
On 2023-05-15 06:32 +0200, Kirk Wolak wrote:
> Personally I would appreciate it if \sv actually showed you the DDL.
> Oftentimes I will \ev something to review it, with syntax highlighting.

+1.  I was just reviewing some matviews and was surprised that psql
lacks commands to show their definitions.

But I think that it should be separate commands \sm and \em because we
already have commands \dm and \dv that distinguish between matviews and
views.

> This should not be that difficult.  Just looking for feedback.
> Admittedly \e is questionable, because you cannot really apply the changes.
> ALTHOUGH, I would consider that I could
> BEGIN;
> DROP MATERIALIZED VIEW ...;
> CREATE MATERIALIZED VIEW ...;
> 
> Which I had to do to change the WITH DATA so it creates with data when we
> reload our objects.

I think this could even be handled by optional modifiers, e.g. \em emits
CREATE MATERIALIZED VIEW ... WITH NO DATA and \emD emits WITH DATA.
Although I wouldn't mind manually changing WITH NO DATA to WITH DATA.

-- 
Erik




RE: Popcount optimization using AVX512

2024-03-28 Thread Amonson, Paul D
> -Original Message-
> From: Amonson, Paul D 
> Sent: Thursday, March 28, 2024 3:03 PM
> To: Nathan Bossart 
> ...
> I will review the new patch to see if there is anything that jumps out at me.

I see in the meson.build you added the new file twice?

@@ -7,6 +7,7 @@ pgport_sources = [
   'noblock.c',
   'path.c',
   'pg_bitutils.c',
+  'pg_popcount_avx512.c',
   'pg_strong_random.c',
   'pgcheckdir.c',
   'pgmkdirp.c',
@@ -84,6 +85,7 @@ replace_funcs_pos = [
   ['pg_crc32c_sse42', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK', 'crc'],
   ['pg_crc32c_sse42_choose', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'],
   ['pg_crc32c_sb8', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'],
+  ['pg_popcount_avx512', 'USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 'avx512_popcnt'],

I was putting the file with special flags ONLY in the second section and all 
seemed to work. :)

Everything else seems good to me.

Thanks,
Paul





Re: Popcount optimization using AVX512

2024-03-28 Thread Alvaro Herrera
On 2024-Mar-28, Amonson, Paul D wrote:

> > -Original Message-
> > From: Nathan Bossart 
> > Sent: Thursday, March 28, 2024 2:39 PM
> > To: Amonson, Paul D 
> > 
> > * The latest patch set from Paul Amonson appeared to support MSVC in the
> >   meson build, but not the autoconf one.  I don't have much expertise here,
> >   so the v14 patch doesn't have any autoconf/meson support for MSVC, which
> >   I thought might be okay for now.  IIUC we assume that 64-bit/MSVC builds
> >   can always compile the x86_64 popcount code, but I don't know whether
> >   that's safe for AVX512.
> 
> I also do not know how to integrate MSVC+Autoconf, the CI uses
> MSVC+Meson+Ninja so I stuck with that.

We don't do MSVC via autoconf/Make.  We used to have a special build
framework for MSVC which parsed Makefiles to produce "solution" files,
but it was removed as soon as Meson was mature enough to build.  See
commit 1301c80b2167.  If it builds with Meson, you're good.

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"[PostgreSQL] is a great group; in my opinion it is THE best open source
development communities in existence anywhere."(Lamar Owen)




Re: incorrect results and different plan with 2 very similar queries

2024-03-28 Thread Tomas Vondra



On 3/27/24 23:10, Dave Cramer wrote:
> Dave Cramer
> 
> 
> On Wed, 27 Mar 2024 at 17:57, David Rowley  wrote:
> 
>> On Thu, 28 Mar 2024 at 10:33, Dave Cramer  wrote:
>>> There is a report on the pgjdbc github: JDBC Driver shows erratic
>>> behavior when filtering on CURRENT_DATE · pgjdbc/pgjdbc · Discussion
>>> #3184 (github.com)
>>>
>>> Here are the plans.
>>>
>>> JDBC - Nested Loop (incorrect result)
>>>
>>> Index Cond: (mutation >= ((CURRENT_DATE -
>> '1971-12-31'::date) - 28))
>>
>>> JDBC - Hash Right (correct result)
>>>
>>> Recheck Cond: (mutation >= ((CURRENT_DATE -
>> '1971-12-31'::date) - 29))
>>
>> I don't see any version details or queries, but going by the
>> conditions above, the queries don't appear to be the same, so
>> different results aren't too surprising and not a demonstration that
>> there's any sort of bug.
>>
> 
> Sorry, you are correct. Version is 12.14. Here is the query
> 
> SELECT
>   p.partseqno_i
> , p.partno
> , p.partmatch
> , pfe.average_price
> , pfe.sales_price
> , pfe.purch_price
> , pfe.average_price_2
> , pfe.avg_repair_cost
> , pfe.average_price_func
> , pfe.fsv
> , pfe.fsv_func
> , p.status
> 
> FROM part p
> LEFT JOIN part_fa_entity pfe ON (p.partno = pfe.partno)
> WHERE 1=1
> AND (p.mutation >= (CURRENT_DATE - '1971-12-31'::date)-27) ORDER BY p.partno
> 

I guess the confusing bit is that the report does not claim that those
queries are expected to produce the same result, but that the parameter
value affects which plan gets selected, and one of those plans produces
incorrect result.

I think the simplest explanation might be that one of the indexes on
part_fa_entity is corrupted and fails to lookup some rows by partno.
That would explain why the plan with seqscan works fine, but nestloop
with index scan is missing these rows.

They might try a couple things:

1) set enable_nestloop=off, see if the results get correct

2) try bt_index_check on i_39773, might notice some corruption

3) rebuild the index

If it's not this, they'll need to build a reproducer. It's really
difficult to deduce what's going on just from query plans for different
parameter values.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




RE: Popcount optimization using AVX512

2024-03-28 Thread Amonson, Paul D
> -Original Message-
> From: Nathan Bossart 
> Sent: Thursday, March 28, 2024 2:39 PM
> To: Amonson, Paul D 
> 
> * The latest patch set from Paul Amonson appeared to support MSVC in the
>   meson build, but not the autoconf one.  I don't have much expertise here,
>   so the v14 patch doesn't have any autoconf/meson support for MSVC, which
>   I thought might be okay for now.  IIUC we assume that 64-bit/MSVC builds
>   can always compile the x86_64 popcount code, but I don't know whether
>   that's safe for AVX512.

I also do not know how to integrate MSVC+Autoconf, the CI uses MSVC+Meson+Ninja 
so I stuck with that.
 
> * I think we need to verify there isn't a huge performance regression for
>   smaller arrays.  IIUC those will still require an AVX512 instruction or
>   two as well as a function call, which might add some noticeable overhead.

Not considering your changes, I had already tested small buffers. At less than 
512 bytes there was no measurable regression (there was one extra condition 
check) and for 512+ bytes it moved from no regression to some gains between 512 
and 4096 bytes. Assuming you introduced no extra function calls, it should be 
the same.

> I forgot to mention that I also want to understand whether we can actually 
> assume availability of XGETBV when CPUID says we support AVX512:

You cannot assume that, as there are edge cases where AVX-512 was found on one
system at compile time but is not actually enabled by the kernel on a second
system at runtime, despite the CPU actually having the hardware feature.

I will review the new patch to see if there is anything that jumps out at me.

Thanks,
Paul





Re: Popcount optimization using AVX512

2024-03-28 Thread Nathan Bossart
On Thu, Mar 28, 2024 at 04:38:54PM -0500, Nathan Bossart wrote:
> Here is a v14 of the patch that I think is beginning to approach something
> committable.  Besides general review and testing, there are two things that
> I'd like to bring up:
> 
> * The latest patch set from Paul Amonson appeared to support MSVC in the
>   meson build, but not the autoconf one.  I don't have much expertise here,
>   so the v14 patch doesn't have any autoconf/meson support for MSVC, which
>   I thought might be okay for now.  IIUC we assume that 64-bit/MSVC builds
>   can always compile the x86_64 popcount code, but I don't know whether
>   that's safe for AVX512.
> 
> * I think we need to verify there isn't a huge performance regression for
>   smaller arrays.  IIUC those will still require an AVX512 instruction or
>   two as well as a function call, which might add some noticeable overhead.

I forgot to mention that I also want to understand whether we can actually
assume availability of XGETBV when CPUID says we support AVX512:

> + /*
> +  * We also need to check that the OS has enabled support for 
> the ZMM
> +  * registers.
> +  */
> +#ifdef _MSC_VER
> + return (_xgetbv(0) & 0xe0) != 0;
> +#else
> + uint64  xcr = 0;
> + uint32  high;
> + uint32  low;
> +
> +__asm__ __volatile__(" xgetbv\n":"=a"(low), "=d"(high):"c"(xcr));
> + return (low & 0xe0) != 0;
> +#endif
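For reference, the 0xe0 mask in the quoted code covers XCR0 bits 5 through 7, the three ZMM-related state components (per Intel's SDM). A small decoding sketch (illustrative Python; note the quoted code tests (low & 0xe0) != 0, i.e. any of the three bits, whereas requiring all three as below is the stricter reading):

```python
XCR0_OPMASK    = 1 << 5  # k0-k7 opmask registers
XCR0_ZMM_HI256 = 1 << 6  # upper 256 bits of ZMM0-ZMM15
XCR0_HI16_ZMM  = 1 << 7  # ZMM16-ZMM31

ZMM_MASK = XCR0_OPMASK | XCR0_ZMM_HI256 | XCR0_HI16_ZMM
assert ZMM_MASK == 0xe0

def os_enabled_zmm(xcr0: int) -> bool:
    """True only if the OS has enabled all three ZMM state components."""
    return (xcr0 & ZMM_MASK) == ZMM_MASK
```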

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: pg_upgrade --copy-file-range

2024-03-28 Thread Tomas Vondra



On 3/28/24 21:45, Robert Haas wrote:
> On Tue, Mar 26, 2024 at 2:09 PM Tomas Vondra
>  wrote:
>> The patch I shared a couple minutes ago should fix this, effectively
>> restoring the original debug behavior. I liked the approach with calling
>> strategy_implementation a bit more, I wonder if it'd be better to go
>> back to that for the "accelerated" copy methods, somehow.
> 
> Somehow I don't see this patch?
> 

It's here:

https://www.postgresql.org/message-id/90866c27-265a-4adb-89d0-18c8dd22bc19%40enterprisedb.com

I did change the subject to reflect that it's no longer about
pg_upgrade, maybe that breaks the threading for you somehow?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: BitmapHeapScan streaming read user and prelim refactoring

2024-03-28 Thread Tomas Vondra


On 3/27/24 20:37, Melanie Plageman wrote:
> On Mon, Mar 25, 2024 at 12:07:09PM -0400, Melanie Plageman wrote:
>> On Sun, Mar 24, 2024 at 06:37:20PM -0400, Melanie Plageman wrote:
>>> On Sun, Mar 24, 2024 at 5:59 PM Tomas Vondra
>>>  wrote:

 BTW when you say "up to 'Make table_scan_bitmap_next_block() async
 friendly'" do you mean including that patch, or that this is the first
 patch that is not one of the independently useful patches.
>>>
>>> I think the code is easier to understand with "Make
>>> table_scan_bitmap_next_block() async friendly". Prior to that commit,
>>> table_scan_bitmap_next_block() could return false even when the bitmap
>>> has more blocks and expects the caller to handle this and invoke it
>>> again. I think that interface is very confusing. The downside of the
>>> code in that state is that the code for prefetching is still in the
>>> BitmapHeapNext() code and the code for getting the current block is in
>>> the heap AM-specific code. I took a stab at fixing this in v9's 0013,
>>> but the outcome wasn't very attractive.
>>>
>>> What I will do tomorrow is reorder and group the commits such that all
>>> of the commits that are useful independent of streaming read are first
>>> (I think 0014 and 0015 are independently valuable but they are on top
>>> of some things that are only useful to streaming read because they are
>>> more recently requested changes). I think I can actually do a bit of
>>> simplification in terms of how many commits there are and what is in
>>> each. Just to be clear, v9 is still reviewable. I am just going to go
>>> back and change what is included in each commit.
>>
>> So, attached v10 does not include the new version of streaming read API.
>> I focused instead on the refactoring patches commit regrouping I
>> mentioned here.
> 
> Attached v11 has the updated Read Stream API Thomas sent this morning
> [1]. No other changes.
> 

I think there's some sort of bug, triggering this assert in heapam

  Assert(BufferGetBlockNumber(hscan->rs_cbuf) == tbmres->blockno);

I haven't looked for the root cause, and it's not exactly deterministic,
but try this:

  create table t (a int, b text);

  insert into t select 1 * random(), md5(i::text)
    from generate_series(1,1000) s(i);

  create index on t (a);

  explain analyze select * from t where a = 200;
  explain analyze select * from t where a < 200;

and then vary the condition a bit (different values, inequalities,
etc.). For me it hits the assert in a couple tries.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

0x79ff48a73d37 in epoll_wait () from /lib64/libc.so.6
Missing separate debuginfos, use: dnf debuginfo-install glibc-2.37-18.fc38.x86_64 libgcc-13.2.1-4.fc38.x86_64 libicu-72.1-2.fc38.x86_64 libstdc++-13.2.1-4.fc38.x86_64 zlib-1.2.13-3.fc38.x86_64
(gdb) c
Continuing.

Program received signal SIGABRT, Aborted.
0x79ff489ef884 in __pthread_kill_implementation () from /lib64/libc.so.6
(gdb) bt
#0  0x79ff489ef884 in __pthread_kill_implementation () from /lib64/libc.so.6
#1  0x79ff4899eafe in raise () from /lib64/libc.so.6
#2  0x79ff4898787f in abort () from /lib64/libc.so.6
#3  0x009ba563 in ExceptionalCondition (conditionName=conditionName@entry=0xa209e0 "BufferGetBlockNumber(hscan->rs_cbuf) == tbmres->blockno", fileName=fileName@entry=0xa205b4 "heapam_handler.c", lineNumber=lineNumber@entry=2221) at assert.c:66
#4  0x0057759f in heapam_scan_bitmap_next_block (exact_pages=<optimized out>, lossy_pages=<optimized out>, recheck=<optimized out>, scan=0x1e73548) at heapam_handler.c:2221
#5  heapam_scan_bitmap_next_tuple (scan=0x1e73548, slot=<optimized out>, recheck=<optimized out>, lossy_pages=<optimized out>, exact_pages=<optimized out>) at heapam_handler.c:2359
#6  0x006f58bb in table_scan_bitmap_next_tuple (exact_pages=<optimized out>, lossy_pages=<optimized out>, recheck=<optimized out>, slot=<optimized out>, scan=<optimized out>) at ../../../src/include/access/tableam.h:2022
#7  BitmapHeapNext (node=0x1e71fc8) at nodeBitmapHeapscan.c:202
#8  0x006e46b8 in ExecProcNodeInstr (node=0x1e71fc8) at execProcnode.c:480
#9  0x006dd6fa in ExecProcNode (node=0x1e71fc8) at ../../../src/include/executor/executor.h:274
#10 ExecutePlan (execute_once=<optimized out>, dest=0xb755e0, direction=<optimized out>, numberTuples=0, sendTuples=true, operation=CMD_SELECT, use_parallel_mode=<optimized out>, planstate=0x1e71fc8, estate=0x1e71d80) at execMain.c:1644
#11 standard_ExecutorRun (queryDesc=0x1e6c500, direction=<optimized out>, count=0, execute_once=<optimized out>) at execMain.c:363
#12 0x0067af84 in ExplainOnePlan (plannedstmt=plannedstmt@entry=0x1e6c3f0, into=into@entry=0x0, es=es@entry=0x1db3138, queryString=queryString@entry=0x1d87f60 "explain analyze select * from t where a = 200;", params=params@entry=0x0, queryEnv=queryEnv@entry=0x0, planduration=0x7ffcbe111ac8, bufusage=0x0, mem_counters=0x0) at explain.c:645
#13 0x0067b417 in standard_ExplainOneQuery (query=<optimized out>, cursorOptions=2048, into=0x0, es=0x1db3138, queryString=0x1d87f60 "explain analyze select * from t where a = 200;", params=0x0,

Re: Popcount optimization using AVX512

2024-03-28 Thread Nathan Bossart
Here is a v14 of the patch that I think is beginning to approach something
committable.  Besides general review and testing, there are two things that
I'd like to bring up:

* The latest patch set from Paul Amonson appeared to support MSVC in the
  meson build, but not the autoconf one.  I don't have much expertise here,
  so the v14 patch doesn't have any autoconf/meson support for MSVC, which
  I thought might be okay for now.  IIUC we assume that 64-bit/MSVC builds
  can always compile the x86_64 popcount code, but I don't know whether
  that's safe for AVX512.

* I think we need to verify there isn't a huge performance regression for
  smaller arrays.  IIUC those will still require an AVX512 instruction or
  two as well as a function call, which might add some noticeable overhead.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From 9b5725e36aa8cff7caeb8683e11cd09bd5bda745 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Wed, 27 Mar 2024 16:39:24 -0500
Subject: [PATCH v14 1/1] AVX512 popcount support

---
 config/c-compiler.m4   |  34 +++
 configure  | 165 +
 configure.ac   |  34 +++
 meson.build|  59 
 src/Makefile.global.in |   1 +
 src/include/pg_config.h.in |   9 ++
 src/include/port/pg_bitutils.h |  20 
 src/makefiles/meson.build  |   1 +
 src/port/Makefile  |   6 ++
 src/port/meson.build   |   6 +-
 src/port/pg_bitutils.c |  56 ---
 src/port/pg_popcount_avx512.c  |  98 
 12 files changed, 451 insertions(+), 38 deletions(-)
 create mode 100644 src/port/pg_popcount_avx512.c

diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 3268a780bb..f881e7ec28 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -694,3 +694,37 @@ if test x"$Ac_cachevar" = x"yes"; then
 fi
 undefine([Ac_cachevar])dnl
 ])# PGAC_LOONGARCH_CRC32C_INTRINSICS
+
+# PGAC_AVX512_POPCNT_INTRINSICS
+# -
+# Check if the compiler supports the AVX512 POPCNT instructions using the
+# _mm512_setzero_si512, _mm512_loadu_si512, _mm512_popcnt_epi64,
+# _mm512_add_epi64, and _mm512_reduce_add_epi64 intrinsic functions.
+#
+# An optional compiler flag can be passed as argument
+# (e.g., -mavx512vpopcntdq).  If the intrinsics are supported, sets
+# pgac_avx512_popcnt_intrinsics and CFLAGS_AVX512_POPCNT.
+AC_DEFUN([PGAC_AVX512_POPCNT_INTRINSICS],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx512_popcnt_intrinsics_$1])])dnl
+AC_CACHE_CHECK([for _mm512_popcnt_epi64 with CFLAGS=$1], [Ac_cachevar],
+[pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS $1"
+AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>],
+  [const char buf@<:@sizeof(__m512i)@:>@;
+   PG_INT64_TYPE popcnt = 0;
+   __m512i accum = _mm512_setzero_si512();
+   const __m512i val = _mm512_loadu_si512((const __m512i *) buf);
+   const __m512i cnt = _mm512_popcnt_epi64(val);
+   accum = _mm512_add_epi64(accum, cnt);
+   popcnt = _mm512_reduce_add_epi64(accum);
+   /* return computed value, to prevent the above being optimized away */
+   return popcnt == 0;])],
+  [Ac_cachevar=yes],
+  [Ac_cachevar=no])
+CFLAGS="$pgac_save_CFLAGS"])
+if test x"$Ac_cachevar" = x"yes"; then
+  CFLAGS_AVX512_POPCNT="$1"
+  pgac_avx512_popcnt_intrinsics=yes
+fi
+undefine([Ac_cachevar])dnl
+])# PGAC_AVX512_POPCNT_INTRINSICS
diff --git a/configure b/configure
index 36feeafbb2..189264b86e 100755
--- a/configure
+++ b/configure
@@ -647,6 +647,7 @@ MSGFMT_FLAGS
 MSGFMT
 PG_CRC32C_OBJS
 CFLAGS_CRC
+CFLAGS_AVX512_POPCNT
 LIBOBJS
 OPENSSL
 ZSTD
@@ -17404,6 +17405,41 @@ $as_echo "#define HAVE__GET_CPUID 1" >>confdefs.h
 
 fi
 
+# Check for x86 cpuid_count instruction
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __get_cpuid_count" >&5
+$as_echo_n "checking for __get_cpuid_count... " >&6; }
+if ${pgac_cv__get_cpuid_count+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+#include <cpuid.h>
+int
+main ()
+{
+unsigned int exx[4] = {0, 0, 0, 0};
+  __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  pgac_cv__get_cpuid_count="yes"
+else
+  pgac_cv__get_cpuid_count="no"
+fi
+rm -f core conftest.err conftest.$ac_objext \
+conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__get_cpuid_count" >&5
+$as_echo "$pgac_cv__get_cpuid_count" >&6; }
+if test x"$pgac_cv__get_cpuid_count" = x"yes"; then
+
+$as_echo "#define HAVE__GET_CPUID_COUNT 1" >>confdefs.h
+
+fi
+
 { $as_echo "$as_me:${as_lineno-$LINENO}: checking for __cpuid" >&5
 $as_echo_n "checking for __cpuid... " >&6; }
 if ${pgac_cv__cpuid+:} false; then :
@@ -17438,6 +17474,135 @@ $as_echo "#define HAVE__CPUID 1" >>confdefs.h
 
 fi
 
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __cpuidex" >&5
+$as_echo_n "checking for 

Re: BitmapHeapScan streaming read user and prelim refactoring

2024-03-28 Thread Thomas Munro
On Fri, Mar 29, 2024 at 7:01 AM Tomas Vondra
 wrote:
> On 3/28/24 06:20, Thomas Munro wrote:
> > With the unexplained but apparently somewhat systematic regression
> > patterns on certain tests and settings, I wonder if they might be due
> > to read_stream.c trying to form larger reads, making it a bit lazier.
> > It tries to see what the next block will be before issuing the
> > fadvise.  I think that means that with small I/O concurrency settings,
> > there might be contrived access patterns where it loses, and needs
> > effective_io_concurrency to be set one notch higher to keep up, or
> > something like that.
>
> Yes, I think we've speculated this might be the root cause before, but
> IIRC we didn't manage to verify it actually is the problem.

Another factor could be the bug in master that allows it to get out of
sync -- can it allow *more* concurrency than it intended to?  Or fewer
hints, but somehow that goes faster because of the
stepping-on-kernel-toes problem?

> > One way to test that idea would be to run the
> > tests with io_combine_limit = 1 (meaning 1 block).  It issues advise
> > eagerly when io_combine_limit is reached, so I suppose it should be
> > exactly as eager as master.  The only difference then should be that
> > it automatically suppresses sequential fadvise calls.
>
> Sure, I'll give that a try. What are some good values to test? Perhaps
> 32 and 1, i.e. the default and "no coalescing"?

Thanks!  Yeah.  The default is actually 16, computed backwards from
128kB.  (Explanation: POSIX requires 16 as minimum IOV_MAX, ie number
of vectors acceptable to writev/readv and related functions, though
actual acceptable number is usually much higher, and it also seems to
be a conservative acceptable number for hardware scatter/gather lists
in various protocols, ie if doing direct I/O, the transfer won't be
chopped up into more than one physical I/O command because the disk
and DMA engine can handle it as a single I/O in theory at least.
Actual limit on random SSDs might be more like 33, a weird number but
that's what I'm seeing; Mr Axboe wrote a nice short article[1] to get
some starting points for terminology on that topic on Linux.  Also,
just anecdotally, returns seem to diminish after that with huge
transfers of buffered I/O so it seems like an OK number if you have to
pick one; but IDK, YMMV, subject for future research as direct I/O
grows in relevance, hence GUC.)

> If this turns out to be the problem, does that mean we would consider
> using a more conservative default value? Is there some "auto tuning" we
> could do? For example, could we reduce the value combine limit if we
> start not finding buffers in memory, or something like that?

Hmm, not sure... I like that number for seq scans.  I also like
auto-tuning.  But it seems to me that *if* the problem is that we're
not allowing ourselves as many concurrent I/Os as master BHS because
we're waiting to see if the next block is consecutive, that might
indicate that the distance needs to be higher so that we can have a
better chance to see the 'edge' (the non-contiguous next block) and
start the I/O, not that the io_combine_limit needs to be lower.  But I
could be way off, given the fuzziness on this problem so far...

> Anyway, doesn't the combine limit work against the idea that
> effective_io_concurrency is "prefetch distance"? With eic=32 I'd expect
> we issue prefetch 32 pages ahead, i.e. if we prefetch page X, we should
> then process 32 pages before we actually need X (and we expect the page
> to already be in memory, thanks to the gap). But with the combine limit
> set to 32, is this still true?

Hmm.  It's different.  (1) Master BHS has prefetch_maximum, which is
indeed directly taken from the eic setting, while read_stream.c is
prepared to look much ahead further than that (potentially as far as
max_pinned_buffers) if it's been useful recently, to find
opportunities to coalesce and start I/O.  (2) Master BHS has
prefetch_target to control the look-ahead window, which starts small
and ramps up until it hits prefetch_maximum, while read_stream.c has
distance which goes up and down according to a more complex algorithm
described at the top.

> I've tried going through read_stream_* to determine how this will
> behave, but read_stream_look_ahead/read_stream_start_pending_read does
> not make this very clear. I'll have to experiment with some tracing.

I'm going to try to set up something like your experiment here too,
and figure out some way to visualise or trace what's going on...

The differences come from (1) multi-block I/Os, requiring two separate
numbers: how many blocks ahead we're looking, and how many I/Os are
running, and (2) being more aggressive about trying to reach the
desired I/O level.  Let me try to describe the approach again.
"distance" is the size of a window that we're searching for
opportunities to start I/Os.  read_stream_look_ahead() will keep
looking ahead until we already have max_ios I/Os running, or 

Re: Add pg_basetype() function to obtain a DOMAIN base type

2024-03-28 Thread Tom Lane
jian he  writes:
> trying to do it this way.
> not sure the following error message is expected.

> SELECT pg_basetype(-1);
> ERROR:  cache lookup failed for type 4294967295

Yeah, that's not really OK.  You could say it's fine for bogus input,
but we've found over the years that it's better for catalog inspection
functions like this to be forgiving of bad input.  Otherwise,
your query can blow up in unexpected ways due to race conditions
(ie somebody just dropped the type you are interested in).

A fairly common solution to that is to return NULL for bad input,
but in this case we could just have it return the OID unchanged.

Either way though, we can't use getBaseType as-is.  We could imagine
extending that function to support a "noerror"-like flag, but I
believe it's already a hot-spot and I'd rather not complicate it
further.  So what I suggest doing is just duplicating the code;
there's not very much of it.

I did a little polishing of the docs and test cases too, ending
with the v7 attached.  I think this is about ready to go unless
there are objections to the definition.

Not sure what I think about your 0002 proposal to extend \dD
with this.  Aside from the server-version-compatibility problem,
I think it's about 90% redundant because \dD already shows
the immediate base type.  The new column would only be
different in the case of nested domains, which I think are
not common.  \dD's output is already pretty wide, so on the
whole I'm inclined to leave it alone.

regards, tom lane

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 93b0bc2bc6..b3687b3645 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -25129,6 +25129,29 @@ SELECT pg_type_is_visible('myschema.widget'::regtype);

   
 
+  
+   
+
+ pg_basetype
+
+pg_basetype ( regtype )
+regtype
+   
+   
+Returns the OID of the base type of a domain identified by its
+type OID.  If the argument is not the OID of a domain type,
+returns the argument as-is.  If there's a chain of domain
+dependencies, it will recurse until finding the base type.
+   
+   
+Assuming CREATE DOMAIN mytext AS text:
+   
+   
+pg_basetype('mytext'::regtype)
+text
+   
+  
+
   

 
diff --git a/src/backend/utils/adt/misc.c b/src/backend/utils/adt/misc.c
index d4a92d0b3f..d2b4ba8a72 100644
--- a/src/backend/utils/adt/misc.c
+++ b/src/backend/utils/adt/misc.c
@@ -44,6 +44,7 @@
 #include "utils/fmgroids.h"
 #include "utils/lsyscache.h"
 #include "utils/ruleutils.h"
+#include "utils/syscache.h"
 #include "utils/timestamp.h"
 
 
@@ -566,6 +567,48 @@ pg_typeof(PG_FUNCTION_ARGS)
 }
 
 
+/*
+ * Return the base type of the argument.
+ *		If the given type is a domain, return its base type;
+ *		otherwise return the type's own OID.
+ *
+ * This is a SQL-callable version of getBaseType().  Unlike that function,
+ * we don't want to fail for a bogus type OID; this is helpful to keep race
+ * conditions from turning into query failures when scanning the catalogs.
+ * Hence we need our own implementation.
+ */
+Datum
+pg_basetype(PG_FUNCTION_ARGS)
+{
+	Oid			typid = PG_GETARG_OID(0);
+
+	/*
+	 * We loop to find the bottom base type in a stack of domains.
+	 */
+	for (;;)
+	{
+		HeapTuple	tup;
+		Form_pg_type typTup;
+
+		tup = SearchSysCache1(TYPEOID, ObjectIdGetDatum(typid));
+		if (!HeapTupleIsValid(tup))
			break;				/* return the bogus OID as-is */
+		typTup = (Form_pg_type) GETSTRUCT(tup);
+		if (typTup->typtype != TYPTYPE_DOMAIN)
+		{
+			/* Not a domain, so done */
+			ReleaseSysCache(tup);
+			break;
+		}
+
+		typid = typTup->typbasetype;
+		ReleaseSysCache(tup);
+	}
+
+	PG_RETURN_OID(typid);
+}
+
+
 /*
  * Implementation of the COLLATE FOR expression; returns the collation
  * of the argument.
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 07023ee61d..134e3b22fd 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -3889,6 +3889,9 @@
 { oid => '1619', descr => 'type of the argument',
   proname => 'pg_typeof', proisstrict => 'f', provolatile => 's',
   prorettype => 'regtype', proargtypes => 'any', prosrc => 'pg_typeof' },
+{ oid => '8312', descr => 'base type of a domain type',
+  proname => 'pg_basetype', provolatile => 's', prorettype => 'regtype',
+  proargtypes => 'regtype', prosrc => 'pg_basetype' },
 { oid => '3162',
   descr => 'collation of the argument; implementation of the COLLATION FOR expression',
   proname => 'pg_collation_for', proisstrict => 'f', provolatile => 's',
diff --git a/src/test/regress/expected/domain.out b/src/test/regress/expected/domain.out
index dc58793e3f..71d9f1952c 100644
--- a/src/test/regress/expected/domain.out
+++ b/src/test/regress/expected/domain.out
@@ -1292,3 +1292,28 @@ SELECT * FROM information_schema.check_constraints
  regression | 

Re: pg_upgrade --copy-file-range

2024-03-28 Thread Robert Haas
On Tue, Mar 26, 2024 at 2:09 PM Tomas Vondra
 wrote:
> The patch I shared a couple minutes ago should fix this, effectively
> restoring the original debug behavior. I liked the approach with calling
> strategy_implementation a bit more, I wonder if it'd be better to go
> back to that for the "accelerated" copy methods, somehow.

Somehow I don't see this patch?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Possibility to disable `ALTER SYSTEM`

2024-03-28 Thread Bruce Momjian
On Thu, Mar 28, 2024 at 01:23:36PM +0100, Jelte Fennema-Nio wrote:
> +   
> +Turning this setting off is intended for environments where the
> +configuration of PostgreSQL is managed by
> +some external tool.
> +In such environments, a well intentioned superuser might
> +mistakenly use ALTER SYSTEM
> +to change the configuration instead of using the external tool.
> +This might result in unintended behavior, such as the external tool
> +discarding the change at some later point in time when it updates the

"discarding" -> "overwriting" ?

> +  
> +   ALTER SYSTEM can be disabled by setting
> +to off, but 
> this
> +   is no security mechanism (as explained in detail in the documentation for

"is no" -> "is not a"

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Re: Properly pathify the union planner

2024-03-28 Thread Tom Lane
I wrote:
> David Rowley  writes:
>> Maybe something with "Parameters" in the name?

> SubqueryParameters might be OK.  Or SubqueryPlannerExtra?
> Since this is a bespoke struct that will probably only ever
> be used with subquery_planner, naming it after that function
> seems like a good idea.

On third thought, I'm not at all convinced that we even want to
invent this struct as compared to just adding another parameter
to subquery_planner.  The problem with a struct is what happens
the next time we need to add a parameter?  If we add yet another
function parameter, we can count on the compiler to complain
about call sites that didn't get the memo.  Adding a field
within an existing struct provokes no such warning, leading
to situations with uninitialized fields that might accidentally
work during testing, but fail the minute they get to the field.

If you do want to go this direction, a minimum safety requirement
would be to have an ironclad rule that callers memset the whole
struct to zero before filling it, so that any not-set fields
will at least have predictable values.  But I don't see the
point really.

regards, tom lane




Re: Flushing large data immediately in pqcomm

2024-03-28 Thread Melih Mutlu
On Wed, Mar 27, 2024 at 18:54 Robert Haas  wrote:

> On Wed, Mar 27, 2024 at 7:39 AM David Rowley  wrote:
> > Robert, I understand you'd like a bit more from this patch. I'm
> > wondering if you planning on blocking another committer from going
> > ahead with this? Or if you have a reason why the current state of the
> > patch is not a meaningful enough improvement that would justify
> > possibly not getting any improvements in this area for PG17?
>
> So, I think that the first version of the patch, when it got a big
> chunk of data, would just flush whatever was already in the buffer and
> then send the rest without copying.


Correct.

The current version, as I
> understand it, only does that if the buffer is empty; otherwise, it
> copies data as much data as it can into the partially-filled buffer.


Yes, currently it should fill and flush the buffer first, if it’s not
already empty. Only then it sends the rest without copying.

Thanks,
Melih


Re: Flushing large data immediately in pqcomm

2024-03-28 Thread Melih Mutlu
On Wed, Mar 27, 2024 at 14:39 David Rowley  wrote:

> On Fri, 22 Mar 2024 at 12:46, Melih Mutlu  wrote:
> > I did all of the above changes and it seems like those resolved the
> regression issue.
>
> Thanks for adjusting the patch.   The numbers do look better, but on
> looking at your test.sh script from [1], I see:
>
> meson setup --buildtype debug -Dcassert=true
> --prefix="$DESTDIR/usr/local/pgsql" $DESTDIR && \
>
> can you confirm if the test was done in debug with casserts on?   If
> so, it would be much better to have asserts off and have
> -Dbuildtype=release.


Yes, previous numbers were with --buildtype debug -Dcassert=true. I can
share new numbers with release build and asserts off soon.

Thanks,
Melih


Re: Properly pathify the union planner

2024-03-28 Thread Tom Lane
David Rowley  writes:
> The problem is informing the UNION child query about what it is.  I
> thought I could do root->parent_root->parse->setOperations for a UNION
> child to know what it is, but that breaks for a query such as:

Yeah, having grouping_planner poke into the parent level
doesn't seem like a great idea here.  I continue to not like
the name "PlannerContext" but I agree passing down the setop
explicitly is the way to go.

>> Perhaps "SubqueryContext" or the like would be better?  It
>> still has the conflict with memory contexts though.

> Maybe something with "Parameters" in the name?

SubqueryParameters might be OK.  Or SubqueryPlannerExtra?
Since this is a bespoke struct that will probably only ever
be used with subquery_planner, naming it after that function
seems like a good idea.  (And, given that fact and the fact
that it's not a Node, I'm not sure it belongs in pathnodes.h.
We could just declare it in planner.h.)

Some minor comments now that I've looked at 66c0185a3 a little:

* Near the head of grouping_planner is this bit:

if (parse->setOperations)
{
/*
 * If there's a top-level ORDER BY, assume we have to fetch all the
 * tuples.  This might be too simplistic given all the hackery below
 * to possibly avoid the sort; but the odds of accurate estimates here
 * are pretty low anyway.  XXX try to get rid of this in favor of
 * letting plan_set_operations generate both fast-start and
 * cheapest-total paths.
 */
if (parse->sortClause)
root->tuple_fraction = 0.0;

I'm pretty sure this comment is mine, but it's old enough that I don't
recall exactly what I had in mind.  Still, it seems like your patch
has addressed precisely the issue of generating fast-start plans for
setops.  Should we now remove this reset of tuple_fraction?

* generate_setop_child_grouplist does this:

   /* assign a tleSortGroupRef, or reuse the existing one */
   sgc->tleSortGroupRef = assignSortGroupRef(tle, targetlist);
   tle->ressortgroupref = sgc->tleSortGroupRef;

That last line is redundant and confusing.  It is not this code's
charter to change ressortgroupref.

regards, tom lane




Re: Possibility to disable `ALTER SYSTEM`

2024-03-28 Thread Bruce Momjian
On Thu, Mar 28, 2024 at 02:43:38PM -0400, Robert Haas wrote:
> How would you like to proceed from here? I think that in addressing
> all of the comments given in the last few days, the documentation has
> gotten modestly worse. I think it was crisp and clear before, and now
> it feels a little ... over-edited. But if you're happy with the latest
> version, we can go with that. Or, do you need more time to review?

I am fine with moving ahead.  I thought my later emails explaining we
have to be careful communicated that.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




RE: Psql meta-command conninfo+

2024-03-28 Thread Maiquel Grassi
Hi Sami!

(v21)

First and foremost, thank you very much for the review.


> 1/ I looked through other psql-meta commands and the “+” adds details but
> does not change output format. In this patch, conninfo and conninfo+
> have completely different output. The former is a string with all the details
> and the latter is a table. Should conninfo just be a table with the minimal info
> and additional details can be displayed with "+" appended?

The initial and central idea was always to keep the metacommand
"\conninfo" in its original state, that is, to preserve the string as it is.
The idea of "\conninfo+" is to expand this to include more information.
If I change the "\conninfo" command and transform it into a table,
I would have to remove all the translation part (files) that has already
been implemented in the past. I believe that keeping the string and
its translations is a good idea and there is no great reason to change that.

> 2/ GSSAPI details should show the full details, such as "principal",
> "encrypted" and "credentials_delegated".

In this new patch, I added these columns. I handled the 'server versions'
for cases where the 'credentials_delegated' column is not in the view.
It turned out well. Thank you for the idea.

> 3/ A very nice to have is "Application Name", in the case one
> sets the application_name within a connection or in the connection string.

I did the same here. I liked your idea and added this column in the SQL.

Look below to see how it turned out after it's finished.

Example 1:

[postgres@localhost ~]$ /home/pgsql-17devel/bin/psql -x -p 5432

psql (17devel, server 15.6)
Type "help" for help.

postgres=# \conninfo
You are connected to database "postgres" as user "postgres" via socket in 
"/tmp" at port "5432".
postgres=# \conninfo+
Current Connection Information
-[ RECORD 1 ]+-
Database | postgres
Authenticated User   | postgres
System User  |
Current User | postgres
Session User | postgres
Application Name | psql
Backend PID  | 19439
Server Address   |
Server Port  | 5432
Client Address   |
Client Port  |
Socket Directory | /tmp
Host |
SSL Connection   | f
SSL Protocol |
SSL Cipher   |
SSL Compression  |
GSSAPI Authenticated | f
GSSAPI Principal |
GSSAPI Encrypted | f
GSSAPI Credentials Delegated |

Example 2:

[postgres@localhost ~]$ /home/pgsql-17devel/bin/psql -x -p 5000
psql (17devel)
Type "help" for help.

postgres=# \conninfo
You are connected to database "postgres" as user "postgres" via socket in 
"/tmp" at port "5000".
postgres=# \conninfo+
Current Connection Information
-[ RECORD 1 ]+-
Database | postgres
Authenticated User   | postgres
System User  |
Current User | postgres
Session User | postgres
Application Name | psql
Backend PID  | 19468
Server Address   |
Server Port  | 5000
Client Address   |
Client Port  |
Socket Directory | /tmp
Host |
SSL Connection   | f
SSL Protocol |
SSL Cipher   |
SSL Compression  |
GSSAPI Authenticated | f
GSSAPI Principal |
GSSAPI Encrypted | f
GSSAPI Credentials Delegated | f

Regards,
Maiquel Grassi.


v21-0001-psql-meta-command-conninfo-plus.patch
Description: v21-0001-psql-meta-command-conninfo-plus.patch


Re: Possibility to disable `ALTER SYSTEM`

2024-03-28 Thread Robert Haas
On Thu, Mar 28, 2024 at 1:46 PM Bruce Momjian  wrote:
> The concern about this patch is not its contents but because it is our
> first attempt at putting limits on the superuser for an external tool.
> If done improperly, this could open a flood of problems, including CVE
> and user confusion, which would reflect badly on the project.
>
> I think the email discussion has expressed those concerns clearly, and
> it is only recently that we have gotten to a stage where we are ready to
> add this, and doing this near the closing of the last commitfest can be
> a valid concern.  I do agree with your analysis of other patches in the
> commitfest, but I just don't see them stretching our boundaries like
> this patch.

I do understand the concern, and I'm not saying that you're wrong to
have it at some level, but I do sincerely think it's excessive. I
don't think this is even close to being the scariest patch in this
release, or even in this CommitFest. I also agree that doing things
near the end of the last CommitFest isn't great, because even if your
patch is fantastic, people start to think maybe you're only committing
it to beat the deadline, and then the conversation can get unpleasant.
However, I don't think that's really what is happening here. If this
patch gets bounced out of this release, it won't be in any better
shape a year from now than it is right now. It can't be, because the
code is completely trivial; and the documentation has already been
extensively wordsmithed. Surely we don't need another whole release
cycle to polish three paragraphs of documentation. I think it has to
be right to get this done while we're all thinking about it and the
issue is fresh in everybody's mind.

How would you like to proceed from here? I think that in addressing
all of the comments given in the last few days, the documentation has
gotten modestly worse. I think it was crisp and clear before, and now
it feels a little ... over-edited. But if you're happy with the latest
version, we can go with that. Or, do you need more time to review?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: elog/ereport VS misleading backtrace_function function address

2024-03-28 Thread Tom Lane
Jakub Wartak  writes:
> While chasing some other bug I've learned that backtrace_functions
> might be misleading with top elog/ereport() address.

That was understood from the beginning: this type of backtrace is
inherently pretty imprecise, and I doubt there is much that can
be done to make it better.  IIRC the fundamental problem is it only
looks at global symbols, so static functions inherently defeat it.
It was argued that this is better than nothing, which is true, but
you have to take the info with a mountain of salt.

I recall speculating about whether we could somehow invoke gdb
to get a more comprehensive and accurate backtrace, but I don't
really have a concrete idea how that could be made to work.

regards, tom lane




Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

2024-03-28 Thread Tom Lane
Alvaro Herrera  writes:
> On 2024-Mar-28, Tom Lane wrote:
>> If the test fails both when the machine is too slow and when it's
>> too fast, then there's zero hope of making it stable and we should
>> just remove it.

> It doesn't fail when it's too fast -- it's just that it doesn't cover
> the case we want to cover.

That's hardly better, because then you think you have test
coverage but maybe you don't.

Could we make this test bulletproof by using an injection point?
If not, I remain of the opinion that we're better off without it.

regards, tom lane




Re: BitmapHeapScan streaming read user and prelim refactoring

2024-03-28 Thread Tomas Vondra
On 3/28/24 06:20, Thomas Munro wrote:
> With the unexplained but apparently somewhat systematic regression
> patterns on certain tests and settings, I wonder if they might be due
> to read_stream.c trying to form larger reads, making it a bit lazier.
> It tries to see what the next block will be before issuing the
> fadvise.  I think that means that with small I/O concurrency settings,
> there might be contrived access patterns where it loses, and needs
> effective_io_concurrency to be set one notch higher to keep up, or
> something like that.

Yes, I think we've speculated this might be the root cause before, but
IIRC we didn't manage to verify it actually is the problem.

FWIW I don't think the tests use synthetic data, but I don't think it's
particularly contrived.

> One way to test that idea would be to run the
> tests with io_combine_limit = 1 (meaning 1 block).  It issues advise
> eagerly when io_combine_limit is reached, so I suppose it should be
> exactly as eager as master.  The only difference then should be that
> it automatically suppresses sequential fadvise calls.

Sure, I'll give that a try. What are some good values to test? Perhaps
32 and 1, i.e. the default and "no coalescing"?

If this turns out to be the problem, does that mean we would consider
using a more conservative default value? Is there some "auto tuning" we
could do? For example, could we reduce the value combine limit if we
start not finding buffers in memory, or something like that?

I recognize this may not be possible with buffered I/O, due to not
having any insight into page cache. And maybe it's misguided anyway,
because how would we know if the right response is to increase or reduce
the combine limit?

Anyway, doesn't the combine limit work against the idea that
effective_io_concurrency is "prefetch distance"? With eic=32 I'd expect
we issue prefetch 32 pages ahead, i.e. if we prefetch page X, we should
then process 32 pages before we actually need X (and we expect the page
to already be in memory, thanks to the gap). But with the combine limit
set to 32, is this still true?

I've tried going through read_stream_* to determine how this will
behave, but read_stream_look_ahead/read_stream_start_pending_read does
not make this very clear. I'll have to experiment with some tracing.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

2024-03-28 Thread Alvaro Herrera
On 2024-Mar-28, Tom Lane wrote:

> Jelte Fennema-Nio  writes:
> 
> > I think we don't really want to make the timeout too short. Otherwise
> > the query might get cancelled before we push any query down to the
> > FDW. I guess that means that for some slow machines even 10ms is not
> > enough to make the test do the intended purpose. I'd keep it at 10ms,
> > which seems long enough for normal systems, while still being pretty
> > short.
> 
> If the test fails both when the machine is too slow and when it's
> too fast, then there's zero hope of making it stable and we should
> just remove it.

It doesn't fail when it's too fast -- it's just that it doesn't cover
the case we want to cover.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"Listen and you will forget; see and you will remember; do and you will understand" (Confucius)




Re: Possibility to disable `ALTER SYSTEM`

2024-03-28 Thread Bruce Momjian
On Thu, Mar 28, 2024 at 08:38:24AM -0400, Robert Haas wrote:
> Let's please, please stop pretending like this patch is somehow
> deserving of special scrutiny. There's barely even anything to
> scrutinize. It's literally if (!variable) ereport(...) plus some
> boilerplate and docs. I entirely agree that, because of the risk of
> someone filing a bogus CVE, the docs do need to be carefully worded.
> But, I'm going to be honest: I feel completely confident in my ability
> to review a patch well enough to know whether the documentation for a
> single test-and-ereport has been done up to project standard. It
> saddens and frustrates me that you don't seem to agree.

The concern about this patch is not its contents but because it is our
first attempt at putting limits on the superuser for an external tool. 
If done improperly, this could open a flood of problems, including CVE
and user confusion, which would reflect badly on the project.

I think the email discussion has expressed those concerns clearly, and
it is only recently that we have gotten to a stage where we are ready to
add this, and doing this near the closing of the last commitfest can be
a valid concern.  I do agree with your analysis of other patches in the
commitfest, but I just don't see them stretching our boundaries like
this patch.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

2024-03-28 Thread Tom Lane
Jelte Fennema-Nio  writes:
> On Thu, 28 Mar 2024 at 17:43, Alvaro Herrera  wrote:
>> Hah, you're right, I can reproduce with a smaller timeout, and using SET
>> LOCAL works as a fix.  If we're doing that, why not reduce the timeout
>> to 1ms?  We don't need to wait extra 9ms ...

> I think we don't really want to make the timeout too short. Otherwise
> the query might get cancelled before we push any query down to the
> FDW. I guess that means that for some slow machines even 10ms is not
> enough to make the test do the intended purpose. I'd keep it at 10ms,
> which seems long enough for normal systems, while still being pretty
> short.

If the test fails both when the machine is too slow and when it's
too fast, then there's zero hope of making it stable and we should
just remove it.

regards, tom lane




Re: To what extent should tests rely on VACUUM ANALYZE?

2024-03-28 Thread Tom Lane
Tomas Vondra  writes:
> Yeah. I think it's good to design the data/queries in such a way that
> the behavior does not flip due to minor noise like in this case.

+1

> But I'm a bit confused - how come the estimates do change at all? The
> analyze simply fetches 30k rows, and tenk only has 10k of them. So we
> should have *exact* numbers, and it should be exactly the same for all
> the analyze runs. So how come it changes like this?

It's plausible that the VACUUM ANALYZE done by test_setup fails
ConditionalLockBufferForCleanup() sometimes because of concurrent
activity like checkpointer writes.  I'm not quite sure how we
get from that to the observed symptom though.  Maybe the
VACUUM needs DISABLE_PAGE_SKIPPING?

regards, tom lane




Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

2024-03-28 Thread Jelte Fennema-Nio
On Thu, 28 Mar 2024 at 17:43, Alvaro Herrera  wrote:
> Hah, you're right, I can reproduce with a smaller timeout, and using SET
> LOCAL works as a fix.  If we're doing that, why not reduce the timeout
> to 1ms?  We don't need to wait extra 9ms ...

I think we don't really want to make the timeout too short. Otherwise
the query might get cancelled before we push any query down to the
FDW. I guess that means that for some slow machines even 10ms is not
enough to make the test do the intended purpose. I'd keep it at 10ms,
which seems long enough for normal systems, while still being pretty
short.




Re: remaining sql/json patches

2024-03-28 Thread Alvaro Herrera
On 2024-Mar-28, Amit Langote wrote:

> Here's patch 1 for the time being that implements barebones
> JSON_TABLE(), that is, without NESTED paths/columns and PLAN clause.
> I've tried to shape the interfaces so that those features can be added
> in future commits without significant rewrite of the code that
> implements barebones JSON_TABLE() functionality.  I'll know whether
> that's really the case when I rebase the full patch over it.

I think this barebones patch looks much closer to something that can be
committed for pg17, given the current commitfest timeline.  Maybe we
should just slip NESTED and PLAN to pg18 to focus current efforts into
getting the basic functionality in 17.  When I looked at the JSON_TABLE
patch last month, it appeared far too large to be reviewable in
reasonable time.  The fact that this split now exists gives me hope that
we can get at least the first part of it.

(A note that PLAN seems to correspond to separate features T824+T838, so
leaving that one out would still let us claim T821 "Basic SQL/JSON query
operators" ... however, the NESTED clause does not appear to be a
separate SQL feature; in particular it does not appear to correspond to
T827, though I may be reading the standard wrong.  So if we don't have
NESTED, apparently we could not claim to support T821.)

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"La fuerza no está en los medios físicos
sino que reside en una voluntad indomable" (Gandhi)




Re: To what extent should tests rely on VACUUM ANALYZE?

2024-03-28 Thread Tomas Vondra
On 3/28/24 16:00, Alexander Lakhin wrote:
> ...
>
> Using the trick Thomas proposed in [1] (see my modification attached), I
> could reproduce the failure easily on my workstation with no specific
> conditions:
> 2024-03-28 14:05:13.792 UTC client backend[2358012]
> pg_regress/test_setup LOG:  !!!ConditionalLockBufferForCleanup()
> returning false
> 2024-03-28 14:05:13.792 UTC client backend[2358012]
> pg_regress/test_setup CONTEXT:  while scanning block 29 of relation
> "public.tenk2"
> 2024-03-28 14:05:13.792 UTC client backend[2358012]
> pg_regress/test_setup STATEMENT:  VACUUM ANALYZE tenk2;
> ...
>   relname | relpages | reltuples | autovacuum_count | autoanalyze_count
>  ---------+----------+-----------+------------------+-------------------
> - tenk2   |      345 |     10000 |                0 |                 0
> + tenk2   |      345 |      9996 |                0 |                 0
>  (1 row)
> 
> So it looks to me like a possible cause of the failure, and I wonder
> whether checks for query plans should be immune to such changes or results
> of VACUUM ANALYZE should be 100% stable?
> 

Yeah. I think it's good to design the data/queries in such a way that
the behavior does not flip due to minor noise like in this case.

But I'm a bit confused - how come the estimates do change at all? The
analyze simply fetches 30k rows, and tenk only has 10k of them. So we
should have *exact* numbers, and it should be exactly the same for all
the analyze runs. So how come it changes like this?

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

2024-03-28 Thread Alvaro Herrera
On 2024-Mar-28, Jelte Fennema-Nio wrote:

> Ugh that's annoying, the RESET is timing out too I guess.

Hah, you're right, I can reproduce with a smaller timeout, and using SET
LOCAL works as a fix.  If we're doing that, why not reduce the timeout
to 1ms?  We don't need to wait extra 9ms ...

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
“Cuando no hay humildad las personas se degradan” (A. Christie)




Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

2024-03-28 Thread Jelte Fennema-Nio
On Thu, 28 Mar 2024 at 17:34, Alvaro Herrera  wrote:
>
> Eh, kestrel has also failed[1], apparently every query after the large
> JOIN that this commit added as test fails with a statement timeout error.
>
> [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kestrel&dt=2024-03-28%2016%3A01%3A14

Ugh that's annoying, the RESET is timing out too I guess. That can
hopefully be easily fixed by changing the new test to:

BEGIN;
SET LOCAL statement_timeout = '10ms';
select count(*) from ft1 CROSS JOIN ft2 CROSS JOIN ft4 CROSS JOIN ft5;
-- this takes very long
ROLLBACK;




Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

2024-03-28 Thread Alvaro Herrera
Eh, kestrel has also failed[1], apparently every query after the large
JOIN that this commit added as test fails with a statement timeout error.

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kestrel&dt=2024-03-28%2016%3A01%3A14

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"No deja de ser humillante para una persona de ingenio saber
que no hay tonto que no le pueda enseñar algo." (Jean B. Say)




Re: [PATCH] plpython function causes server panic

2024-03-28 Thread Robert Haas
On Thu, Mar 28, 2024 at 12:01 PM Tom Lane  wrote:
> > Well, there's the abort case, too, which I think is almost equally 
> > important.
>
> True, but in the abort case there probably *are* resources to be
> cleaned up, so I'm not seeing that the fast-path idea helps.
> Although maybe the idea of batching multiple cleanups would?

Yes, I think we should be trying to optimize for the case where the
(sub)transaction being cleaned up holds a small but non-zero number of
resources. I think if we just optimize the case where it's exactly
zero, there will be enough cases where the optimization doesn't apply
that we'll feel like we haven't really solved the problem. Whether the
specific idea of trying to batch the cleanups could be made to help
enough to matter, I'm not quite sure. Another idea I had at one point
was to have some kind of bitmask where each bit tells you whether or
not one particular resource type might be held, so that
{Commit,Abort}{Sub,}Transaction would end up doing a bunch of stuff
like if (ResourcesNeedingCleanup & MIGHT_HOLD_THINGY)
AtEO(Sub)Xact_Thingy(...). But I wasn't sure that would really move
the needle, either. This seems to be one of those annoying cases where
the problem is much more obvious than the solution.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: [PATCH] plpython function causes server panic

2024-03-28 Thread Tom Lane
Robert Haas  writes:
> On Thu, Mar 28, 2024 at 11:50 AM Tom Lane  wrote:
>> Yeah, I was thinking about that too.  The normal case is that you
>> don't hold any releasable resources except locks when arriving at
>> CommitSubTransaction --- if you do, it's a bug and we're going to
>> print leak warnings.  Seems like maybe it'd be worth trying to
>> have a fast path for that case.

> Well, there's the abort case, too, which I think is almost equally important.

True, but in the abort case there probably *are* resources to be
cleaned up, so I'm not seeing that the fast-path idea helps.
Although maybe the idea of batching multiple cleanups would?

regards, tom lane




Re: [PATCH] plpython function causes server panic

2024-03-28 Thread Robert Haas
On Thu, Mar 28, 2024 at 11:50 AM Tom Lane  wrote:
> Yeah, I was thinking about that too.  The normal case is that you
> don't hold any releasable resources except locks when arriving at
> CommitSubTransaction --- if you do, it's a bug and we're going to
> print leak warnings.  Seems like maybe it'd be worth trying to
> have a fast path for that case.

Well, there's the abort case, too, which I think is almost equally important.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: ALTER TABLE SET ACCESS METHOD on partitioned tables

2024-03-28 Thread Alvaro Herrera
On 2024-Mar-26, Justin Pryzby wrote:

> Looks right.  That's how I originally wrote it, except for the
> "stmt->accessMethod != NULL" case.
> 
> I prefered my way - the grammar should refuse to set stmt->accessMethod
> for inappropriate relkinds.  And you could assert that.

Hmm, I didn't like this at first sight, because it looked convoluted and
baroque, but I compared both versions for a while and I ended up liking
yours more than mine, so I adopted it.

> I also prefered to set "accessMethodId = InvalidOid" once, rather than
> twice.

Grumble.  I don't like initialization at declare time, so far from the
code that depends on the value.  But the alternative would have been to
assign right where this blocks starts, an additional line.  I pushed it
like you had it.

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"This is what I like so much about PostgreSQL.  Most of the surprises
are of the "oh wow!  That's cool" Not the "oh shit!" kind.  :)"
Scott Marlowe, http://archives.postgresql.org/pgsql-admin/2008-10/msg00152.php




Re: [PATCH] plpython function causes server panic

2024-03-28 Thread Tom Lane
Robert Haas  writes:
> Hmm, I wonder if that's actually where the cycles are going. There's
> an awful lot of separate function calls inside CommitSubTransaction(),
> and in the common case, each one of them has to individually decide
> that it doesn't need to do anything. Sure, they're all fast, but if
> you have enough of them, it's still going to add up, at least a bit.
> In that sense, the resource owner mechanism seems like it should, or
> at least could, be better.

Yeah, I was thinking about that too.  The normal case is that you
don't hold any releasable resources except locks when arriving at
CommitSubTransaction --- if you do, it's a bug and we're going to
print leak warnings.  Seems like maybe it'd be worth trying to
have a fast path for that case.  (Also, given that we probably
do need to release locks right away, this point invalidates my
earlier idea of postponing the work.)

> But I haven't done any benchmarking of this area in a long time.

Ditto.

regards, tom lane




Re: [PATCH] plpython function causes server panic

2024-03-28 Thread Robert Haas
On Thu, Mar 28, 2024 at 10:59 AM Tom Lane  wrote:
> Yeah.  The whole ResourceOwner mechanism is not exactly lightweight,
> but it's hard to argue that we don't need it.  I wonder whether we
> could get anywhere by deeming that a "small enough" subtransaction
> doesn't need to have its resources cleaned up instantly, and
> instead re-use its ResourceOwner to accumulate resources of the
> next subtransaction, and the next, until there's enough to be
> worth cleaning up.

Hmm, I wonder if that's actually where the cycles are going. There's
an awful lot of separate function calls inside CommitSubTransaction(),
and in the common case, each one of them has to individually decide
that it doesn't need to do anything. Sure, they're all fast, but if
you have enough of them, it's still going to add up, at least a bit.
In that sense, the resource owner mechanism seems like it should, or
at least could, be better. I'm not sure this is quite the way it works
now, but if you had one single list/array/thingamabob that listed all
of the resources that needed releasing, that should in theory be
better when there's a lot of kinds of resources that you COULD hold
but only a small number of kinds of resources that you actually do
hold -- and it also shouldn't be any worse if it turns out that you
hold a whole lot of resources of many different types.

But I haven't done any benchmarking of this area in a long time.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Table AM Interface Enhancements

2024-03-28 Thread Japin Li


On Thu, 28 Mar 2024 at 21:26, Alexander Korotkov  wrote:
> Hi Pavel!
>
> Revised patchset is attached.
>
> On Thu, Mar 28, 2024 at 3:12 PM Pavel Borisov  wrote:
>> The other extensibility that seems quite clear and uncontroversial to me is 
>> 0006.
>>
>> It simply shifts the decision on whether tuple inserts should invoke inserts 
>> to the related indices to the table am level. It doesn't change the current 
>> heap insert behavior so it's safe for the existing heap access method. But 
>> new table access methods could redefine this (only for tables created with 
>> these am's) and make index inserts independently of ExecInsertIndexTuples 
>> inside their own implementations of tuple_insert/tuple_multi_insert methods.
>>
>> I'd propose changing the comment:
>>
>> 1405  * This function sets `*insert_indexes` to true if expects caller to 
>> return
>> 1406  * the relevant index tuples.  If `*insert_indexes` is set to false, 
>> then
>> 1407  * this function cares about indexes itself.
>>
>> in the following way
>>
>> Tableam implementation of tuple_insert should set `*insert_indexes` to true
>> if it expects the caller to insert the relevant index tuples (as in heap
>>  implementation). It should set `*insert_indexes` to false if it cares
>> about index inserts itself and doesn't want the caller to do index inserts.
>
> Changed as you proposed.
>
>> Maybe, a commit message is also better to reformulate to describe better who 
>> should do what.
>
> Done.
>
> Also, I removed extra includes in 0001 as you proposed and edited the
> commit message in 0002.
>
>> I think, with rebase and correction in the comments/commit message patch 
>> 0006 is ready to be committed.
>
> I'm going to push 0001, 0002 and 0006 if no objections.

Thanks for updating the patches.  Here are some minor suggestions.

0003

+static inline TupleTableSlot *
+heapam_tuple_insert_with_arbiter(ResultRelInfo *resultRelInfo,

I'm not entirely certain whether the "inline" keyword has any effect.

0004

+static bytea *
+heapam_indexoptions(amoptions_function amoptions, char relkind,
+   Datum reloptions, bool validate)
+{
+   return index_reloptions(amoptions, reloptions, validate);
+}

Could you please explain why we are not verifying the relkind like
heapam_reloptions()?


-   case RELKIND_VIEW:
case RELKIND_MATVIEW:
+   case RELKIND_VIEW:
case RELKIND_PARTITIONED_TABLE:

I think this change is unnecessary.

+    {
+        Form_pg_class classForm;
+        HeapTuple     classTup;
+
+        /* fetch the relation's relcache entry */
+        if (relation->rd_index->indrelid >= FirstNormalObjectId)
+        {
+            classTup = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(relation->rd_index->indrelid));
+            classForm = (Form_pg_class) GETSTRUCT(classTup);
+            if (classForm->relam >= FirstNormalObjectId)
+                tableam = GetTableAmRoutineByAmOid(classForm->relam);
+            else
+                tableam = GetHeapamTableAmRoutine();
+            heap_freetuple(classTup);
+        }
+        else
+        {
+            tableam = GetHeapamTableAmRoutine();
+        }
+        amoptsfn = relation->rd_indam->amoptions;
+    }

- We can reduce the indentation by moving the classForm and classTup into
  the if branch.
- Perhaps we could remove the braces of the else branch to keep the code
  style consistent.

--
Regards,
Japin Li




Re: Combine Prune and Freeze records emitted by vacuum

2024-03-28 Thread Melanie Plageman
On Thu, Mar 28, 2024 at 4:49 AM Heikki Linnakangas  wrote:
>
> On 28/03/2024 03:53, Melanie Plageman wrote:
> > On Thu, Mar 28, 2024 at 01:04:04AM +0200, Heikki Linnakangas wrote:
> >> One change with this is that live_tuples and many of the other fields are
> >> now again updated, even if the caller doesn't need them. It was hard to 
> >> skip
> >> them in a way that would save any cycles, with the other refactorings.
> >
> > I am worried we are writing checks that are going to have to come out of
> > SELECT queries' bank accounts, but I'll do some benchmarking when we're
> > all done with major refactoring.
>
> Sounds good, thanks.

Actually, after having looked at it again, there really are only a few
more increments of various counters, the memory consumed by revisit,
and the additional setting of offsets in marked. I think a few
carefully constructed queries which do on-access pruning could test
the impact of this (as opposed to a bigger benchmarking endeavor).

I also wonder if there would be any actual impact of marking the
various record_*() routines inline.

> >> @@ -728,10 +832,12 @@ htsv_get_valid_status(int status)
> >>* DEAD, our visibility test is just too coarse to detect it.
> >>*
> >>* In general, pruning must never leave behind a DEAD tuple that still 
> >> has
> >> - * tuple storage.  VACUUM isn't prepared to deal with that case.  That's 
> >> why
> >> + * tuple storage.  VACUUM isn't prepared to deal with that case (FIXME: 
> >> it no longer cares, right?).
> >> + * That's why
> >>* VACUUM prunes the same heap page a second time (without dropping its 
> >> lock
> >>* in the interim) when it sees a newly DEAD tuple that we initially saw 
> >> as
> >> - * in-progress.  Retrying pruning like this can only happen when an 
> >> inserting
> >> + * in-progress (FIXME: Really? Does it still do that?).
> >
> > Yea, I think all that is no longer true. I missed this comment back
> > then.
>
> Committed a patch to remove it.
>
> >> @@ -981,7 +1069,18 @@ heap_prune_chain(Buffer buffer, OffsetNumber 
> >> rootoffnum,
> >>   * RECENTLY_DEAD tuples just in case there's a DEAD one after 
> >> them;
> >>   * but we can't advance past anything else.  We have to make 
> >> sure that
> >>   * we don't miss any DEAD tuples, since DEAD tuples that 
> >> still have
> >> - * tuple storage after pruning will confuse VACUUM.
> >> + * tuple storage after pruning will confuse VACUUM (FIXME: 
> >> not anymore
> >> + * I think?).
> >
> > Meaning, it won't confuse vacuum anymore or there won't be DEAD tuples
> > with storage after pruning anymore?
>
> I meant that it won't confuse VACUUM anymore. lazy_scan_prune() doesn't
> loop through the items on the page checking their visibility anymore.
>
> Hmm, one confusion remains though: In the 2nd phase of vacuum, we remove
> all the dead line pointers that we have now removed from the indexes.
> When we do that, we assume them all to be dead line pointers, without
> storage, rather than normal tuples that happen to be HEAPTUPLE_DEAD. So
> it's important that if pruning would leave behind HEAPTUPLE_DEAD tuples,
> they are not included in 'deadoffsets'.

It seems like master only adds items it is marking LP_DEAD to
deadoffsets. And things marked LP_DEAD have lp_len set to 0.

> In any case, let's just make sure that pruning doesn't leave
> HEAPTUPLE_DEAD tuples. There's no reason it should.

Maybe worth adding an assert to

static void
heap_prune_record_unchanged_lp_dead(ItemId itemid, PruneState
*prstate, OffsetNumber offnum)
{
...
Assert(!ItemIdHasStorage(itemid));
prstate->deadoffsets[prstate->lpdead_items++] = offnum;
}

By the way, I wasn't quite sure about the purpose of
heap_prune_record_unchanged_lp_dead(). It is called in
heap_prune_chain() in a place where we didn't add things to
deadoffsets before, no?

/*
 * Likewise, a dead line pointer can't be part of the chain. (We
 * already eliminated the case of dead root tuple outside this
 * function.)
 */
if (ItemIdIsDead(lp))
{
/*
 * If the caller set PRUNE_DO_MARK_UNUSED_NOW, we can set dead
 * line pointers LP_UNUSED now.
 */
if (unlikely(prstate->actions & PRUNE_DO_MARK_UNUSED_NOW))
heap_prune_record_unused(prstate, offnum, false);
else
heap_prune_record_unchanged_lp_dead(lp, prstate, offnum);
break;
}

> >> @@ -1224,10 +1327,21 @@ heap_prune_record_live_or_recently_dead(Page page, 
> >> PruneState *prstate, OffsetNu
> >>   * ensure the math works out. The assumption that the transaction will
> >>   * commit and update the counters after we report is a bit shaky; but 
> >> it
> >>   * is what acquire_sample_rows() does, so we do the same to be 
> >> consistent.
> >> + *
> >> + * HEAPTUPLE_DEAD are handled by the other 

To what extent should tests rely on VACUUM ANALYZE?

2024-03-28 Thread Alexander Lakhin

Hello hackers,

When running multiple 027_stream_regress.pl test instances in parallel
(and with aggressive autovacuum) on a rather slow machine, I encountered
test failures due to the subselect test instability just as the following
failures on buildfarm:
1) https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=grassquit&dt=2024-03-27%2010%3A16%3A12

--- /home/bf/bf-build/grassquit/HEAD/pgsql/src/test/regress/expected/subselect.out 2024-03-19 22:20:34.435867114 +
+++ /home/bf/bf-build/grassquit/HEAD/pgsql.build/testrun/recovery/027_stream_regress/data/results/subselect.out 2024-03-27 10:28:38.185776605 +

@@ -2067,16 +2067,16 @@
                 QUERY PLAN
 --------------------------------------------
  Hash Join
-   Hash Cond: (c.odd = b.odd)
+   Hash Cond: (c.hundred = a.hundred)
    ->  Hash Join
-         Hash Cond: (a.hundred = c.hundred)
-         ->  Seq Scan on tenk1 a
+         Hash Cond: (b.odd = c.odd)
+         ->  Seq Scan on tenk2 b
          ->  Hash
                ->  HashAggregate
                      Group Key: c.odd, c.hundred
                      ->  Seq Scan on tenk2 c
    ->  Hash
-         ->  Seq Scan on tenk2 b
+         ->  Seq Scan on tenk1 a
 (11 rows)

2) https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mylodon&dt=2024-03-27%2009%3A49%3A38

(That query was added recently (by 9f1337639 from 2023-02-15) and the
failure evidently depends on timing, so the number of failures I
could find on buildfarm is moderate for now.)

With the subselect test modified as in attached, I could see what makes
the plan change:
-         ->  Seq Scan on public.tenk2 c  (cost=0.00..445.00 rows=10000 width=8)
+         ->  Seq Scan on public.tenk2 c  (cost=0.00..444.95 rows=9995 width=8)

  relname | relpages | reltuples | autovacuum_count | autoanalyze_count
 ---------+----------+-----------+------------------+-------------------
- tenk2   |      345 |     10000 |                0 |                 0
+ tenk2   |      345 |      9995 |                0 |                 0

Using the trick Thomas proposed in [1] (see my modification attached), I
could reproduce the failure easily on my workstation with no specific
conditions:
2024-03-28 14:05:13.792 UTC client backend[2358012] pg_regress/test_setup LOG:  !!!ConditionalLockBufferForCleanup() returning false
2024-03-28 14:05:13.792 UTC client backend[2358012] pg_regress/test_setup CONTEXT:  while scanning block 29 of relation "public.tenk2"

2024-03-28 14:05:13.792 UTC client backend[2358012] pg_regress/test_setup STATEMENT:  VACUUM ANALYZE tenk2;
...
  relname | relpages | reltuples | autovacuum_count | autoanalyze_count
 ---------+----------+-----------+------------------+-------------------
- tenk2   |      345 |     10000 |                0 |                 0
+ tenk2   |      345 |      9996 |                0 |                 0
 (1 row)

So it looks to me like a possible cause of the failure, and I wonder
whether checks for query plans should be immune to such changes or results
of VACUUM ANALYZE should be 100% stable?

[1] https://www.postgresql.org/message-id/CA%2BhUKGKYNHmL_DhmVRiidHv6YLAL8jViifwwn2ABY__Y3BCphg%40mail.gmail.com

Best regards,
Alexanderdiff --git a/src/test/regress/expected/subselect.out b/src/test/regress/expected/subselect.out
index 29b11f11aa..6a7bb6b7a9 100644
--- a/src/test/regress/expected/subselect.out
+++ b/src/test/regress/expected/subselect.out
@@ -2079,6 +2079,32 @@ ON A.hundred in (SELECT c.hundred FROM tenk2 C WHERE c.odd = b.odd);
  ->  Seq Scan on tenk2 b
 (11 rows)
 
+explain (verbose)
+SELECT * FROM tenk1 A INNER JOIN tenk2 B
+ON A.hundred in (SELECT c.hundred FROM tenk2 C WHERE c.odd = b.odd);
+                                 QUERY PLAN
+------------------------------------------------------------------------------
+ Hash Join  (cost=1087.50..13845.00 rows=100 width=488)
+   Output: a.unique1, a.unique2, a.two, a.four, a.ten, a.twenty, a.hundred, a.thousand, a.twothousand, a.fivethous, a.tenthous, a.odd, a.even, a.stringu1, a.stringu2, a.string4, b.unique1, b.unique2, b.two, b.four, b.ten, b.twenty, b.hundred, b.thousand, b.twothousand, b.fivethous, b.tenthous, b.odd, b.even, b.stringu1, b.stringu2, b.string4
+   Hash Cond: (c.odd = b.odd)
+   ->  Hash Join  (cost=517.50..2000.00 rows=1 width=248)
+ Output: a.unique1, a.unique2, a.two, a.four, a.ten, a.twenty, 

Re: [PATCH] plpython function causes server panic

2024-03-28 Thread Tom Lane
Robert Haas  writes:
> I sort of assumed you were going to commit the patch as you had it.

OK, I will move ahead on that.

> I actually
> really wish we could find some way of making subtransactions
> significantly lighter-wait, because I think the cost of spinning up
> and tearing down a trivial subtransaction is a real performance
> problem, but fixing that is probably a pretty hard problem whether
> this patch gets committed or not.

Yeah.  The whole ResourceOwner mechanism is not exactly lightweight,
but it's hard to argue that we don't need it.  I wonder whether we
could get anywhere by deeming that a "small enough" subtransaction
doesn't need to have its resources cleaned up instantly, and
instead re-use its ResourceOwner to accumulate resources of the
next subtransaction, and the next, until there's enough to be
worth cleaning up.

Having said that, it's hard to see any regime under which tied-up
parallel workers wouldn't count as a resource worth releasing ASAP.
I started this mail with the idea of suggesting that parallel contexts
ought to become a ResourceOwner-managed resource, but maybe that
wouldn't be an improvement after all.

regards, tom lane




Re: [PATCH] plpython function causes server panic

2024-03-28 Thread Robert Haas
On Wed, Mar 27, 2024 at 5:28 PM Tom Lane  wrote:
> After mulling it over for awhile, I still think the extra checking
> is appropriate, especially since this patch is enlarging the set of
> things that can happen in parallel mode.  How do you want to proceed?

I sort of assumed you were going to commit the patch as you had it.
I'm not a huge fan of that, but I don't think that's it's catastrophe,
either. It pains me a bit to add CPU cycles that I consider
unnecessary to a very frequently taken code path, but as you say, it's
not a lot of CPU cycles, so maybe nobody will ever notice. I actually
really wish we could find some way of making subtransactions
significantly lighter-weight, because I think the cost of spinning up
and tearing down a trivial subtransaction is a real performance
problem, but fixing that is probably a pretty hard problem whether
this patch gets committed or not.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




RE: Synchronizing slots from primary to standby

2024-03-28 Thread Zhijie Hou (Fujitsu)
On Thursday, March 28, 2024 7:32 PM Amit Kapila  wrote:
> 
> On Thu, Mar 28, 2024 at 10:08 AM Zhijie Hou (Fujitsu) 
> wrote:
> >
> > When analyzing one BF error[1], we find an issue of slotsync: Since we
> > don't perform logical decoding for the synced slots when syncing the
> > lsn/xmin of slot, no logical snapshots will be serialized to disk. So,
> > when user starts to use these synced slots after promotion, it needs
> > to re-build the consistent snapshot from the restart_lsn if the
> > WAL(xl_running_xacts) at restart_lsn position indicates that there are
> > running transactions. This however could cause the data that before the
> consistent point to be missed[2].
> >
> > This issue doesn't exist on the primary because the snapshot at
> > restart_lsn should have been serialized to disk
> > (SnapBuildProcessRunningXacts -> SnapBuildSerialize), so even if the
> > logical decoding restarts, it can find consistent snapshot immediately at
> restart_lsn.
> >
> > To fix this, we could use the fast forward logical decoding to advance
> > the synced slot's lsn/xmin when syncing these values instead of
> > directly updating the slot's info. This way, the snapshot will be 
> > serialized to
> disk when decoding.
> > If we could not reach to the consistent point at the remote
> > restart_lsn, the slot is marked as temp and will be persisted once it
> > reaches the consistent point. I am still analyzing the fix and will share 
> > once
> ready.
> >
> 
> Yes, we can use this but one thing to note is that
> CreateDecodingContext() will expect that the slot's and current database are
> the same. I think the reason for that is we need to check system tables of the
> current database while decoding and sending data to the output_plugin which
> won't be a requirement for the fast_forward case. So, we need to skip that
> check in fast_forward mode.

Agreed.

> 
> Next, I was thinking about the case of the first time updating the restart and
> confirmed_flush LSN while syncing the slots. I think we can keep the current
> logic as it is based on the following analysis.
> 
> For each logical slot, cases possible on the primary:
> 1. The restart_lsn doesn't have a serialized snapshot and hasn't yet reached 
> the
> consistent point.
> 2. The restart_lsn doesn't have a serialized snapshot but has reached a
> consistent point.
> 3. The restart_lsn has a serialized snapshot which means it has reached a
> consistent point as well.
> 
> Considering we keep the logic to reserve initial WAL positions the same as the
> current (Reserve WAL for the currently active local slot using the specified 
> WAL
> location (restart_lsn). If the given WAL location has been removed, reserve
> WAL using the oldest existing WAL segment.), I could think of the below
> scenarios:
> A. For 1, we shouldn't sync the slot as it still wouldn't have been marked
> persistent on the primary.
> B. For 2, we would sync the slot
>B1. If remote_restart_lsn >= local_restart_lsn, then advance the slot by 
> calling
> pg_logical_replication_slot_advance().
>B11. If we reach consistent point, then it should be okay because after
> promotion as well we should reach consistent point.
> B111. But again is it possible that there is some xact that comes
> before consistent_point on primary and the same is after consistent_point on
> standby? This shouldn't matter as we will start decoding transactions after
> confirmed_flush_lsn which would be the same on primary and standby.
>B12. If we haven't reached consistent_point, then we won't mark the 
> slot
> as persistent, and at the next sync we will do the same till it reaches
> consistent_point. At that time, the situation will be similar to B11.
>B2. If remote_restart_lsn < local_restart_lsn, then we will wait for the 
> next
> sync cycle and keep the slot as temporary. Once in the next or some
> consecutive sync cycle, we reach the condition remote_restart_lsn >=
> local_restart_lsn, we will proceed to advance the slot and we should have the
> same behavior as B1.
> C. For 3, we would sync the slot, but the behavior should be the same as B.
> 
> Thoughts?

Looks reasonable to me.

Here is the patch based on above lines.
I am also testing and verifying the patch locally.
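As a side note, the case analysis quoted above can be sketched roughly as
follows (a hypothetical simplification in Python; the names and structure
are illustrative only, not the actual patch):

```python
def sync_slot(remote, local, reached_consistent_point):
    """Rough sketch of the slot-sync decision described above.

    `remote` and `local` are dicts carrying 'restart_lsn' (and, for the
    remote slot, a 'persistent' flag); these names are illustrative,
    not the real ReplicationSlot fields.
    """
    # Case A: slot not yet marked persistent on the primary -- don't sync.
    if not remote.get("persistent"):
        return "skip"
    # Case B2: the local slot is ahead of the remote restart_lsn; keep the
    # slot temporary and retry on a later sync cycle.
    if remote["restart_lsn"] < local["restart_lsn"]:
        return "wait"
    # Case B1: advance the local slot (cf.
    # pg_logical_replication_slot_advance()).
    if reached_consistent_point:
        return "advance_and_persist"    # B11
    return "advance_keep_temporary"     # B22 (retry until consistent)
```

Case C behaves the same as B here, since having a serialized snapshot
only implies the consistent point was already reached.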

Best Regards,
Hou zj


0001-advance-the-restart_lsn-of-synced-slots-using-logica.patch
Description:  0001-advance-the-restart_lsn-of-synced-slots-using-logica.patch


Re: POC, WIP: OR-clause support for indexes

2024-03-28 Thread Alexander Korotkov
On Tue, Mar 19, 2024 at 7:17 AM Andrei Lepikhov
 wrote:
> On 14/3/2024 16:31, Alexander Korotkov wrote:
> > On Wed, Mar 13, 2024 at 2:16 PM Andrei Lepikhov
> > As you can see this case is not related to partial indexes.  Just no
> > index selective for the whole query.  However, splitting scan by the OR
> > qual lets use a combination of two selective indexes.
> I have rewritten the 0002-* patch according to your concern. A candidate
> and some thoughts are attached.
> As I see, we have a problem here: expanding each array and trying to
> apply an element to each index can result in a lengthy planning stage.
> Also, an index scan with the SAOP may potentially be more effective than
> with the list of OR clauses.
> Originally, the transformation's purpose was to reduce a query's
> complexity and the number of optimization ways to speed up planning and
> (sometimes) execution. Here, we reduce planning complexity only in the
> case of an array size larger than MAX_SAOP_ARRAY_SIZE.
> Maybe we can fall back to the previous version of the second patch,
> keeping in mind that someone who wants to get maximum profit from the
> BitmapOr scan of multiple indexes can just disable this optimization,
> enabling deep search of the most optimal scanning way?
> As a compromise solution, I propose adding one more option to the
> previous version: if an element doesn't fit any partial index, try to
> cover it with a plain index.
> In this case, we still do not guarantee the most optimal fit of elements
> to the set of indexes, but we speed up planning. Does that make sense?

Thank you for your research Andrei.  Now things get more clear on the
advantages and disadvantages of this transformation.

The current patch has a boolean guc enable_or_transformation.
However, when we have just a few ORs to be transformed, then we
should get less performance gain from the transformation and higher
chances to lose a good bitmap scan plan from that.  When there is a
huge list of ORs to be transformed, then the performance gain is
greater and it is less likely we could lose a good bitmap scan plan.

What do you think about introducing a GUC threshold value: the minimum
size of the OR list needed to trigger the OR-to-ANY transformation?
Something like min_list_or_transformation.
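The threshold check itself would be trivial; as an illustration (the GUC
name and the ANY-rewrite shown are hypothetical, not actual planner code):

```python
def maybe_transform_or_clauses(or_args, min_list_size):
    """Sketch of the proposed thresholding: only rewrite
    `x = a OR x = b OR ...` into `x = ANY (ARRAY[...])` when the OR
    list is long enough for the transformation to pay off.
    Hypothetical simplification, not the actual patch."""
    if len(or_args) < min_list_size:
        # Keep the ORs: with few arms, a BitmapOr plan over multiple
        # indexes may still be the better choice.
        return None
    return "x = ANY (ARRAY[{}])".format(", ".join(map(str, or_args)))
```

With a setting of, say, 3, short OR lists would be left alone while long
ones would be collapsed into a single SAOP.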

--
Regards,
Alexander Korotkov




Re: Table AM Interface Enhancements

2024-03-28 Thread Pavel Borisov
Hi, Alexander!

The other extensibility improvement that seems quite clear and
uncontroversial to me is 0006.

It simply shifts the decision on whether tuple inserts should invoke
inserts into the related indexes down to the table AM level. It doesn't
change the current heap insert behavior, so it's safe for the existing heap
access method. But new table access methods could redefine this (only for
tables created with these AMs) and make index inserts independent
of ExecInsertIndexTuples inside their own implementations of the
tuple_insert/tuple_multi_insert methods.
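The contract being described can be modeled roughly like this (illustrative
sketch only; the real API is the C TableAmRoutine, and these class and
method names are invented for the example):

```python
def tuple_insert(table_am, rel, tup):
    """Sketch of the proposed contract: the table AM stores the tuple
    and tells the caller (the executor) whether it still needs to
    insert the corresponding index tuples."""
    table_am.store(rel, tup)
    # heap-like AMs return True -> executor calls ExecInsertIndexTuples;
    # AMs that manage indexes internally return False.
    return not table_am.handles_indexes_itself

class HeapLikeAM:
    handles_indexes_itself = False  # executor does index inserts
    def store(self, rel, tup):
        rel.append(tup)

class IndexManagingAM:
    handles_indexes_itself = True   # AM does index inserts internally
    def store(self, rel, tup):
        rel.append(tup)
```

This keeps the existing executor path intact for heap while letting a new
AM opt out of caller-driven index inserts.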

I'd propose changing the comment:

1405  * This function sets `*insert_indexes` to true if expects caller to
return
1406  * the relevant index tuples.  If `*insert_indexes` is set to false,
then
1407  * this function cares about indexes itself.

in the following way

Tableam implementation of tuple_insert should set `*insert_indexes` to true
if it expects the caller to insert the relevant index tuples (as in heap
 implementation). It should set `*insert_indexes` to false if it cares
about index inserts itself and doesn't want the caller to do index inserts.

Maybe the commit message is also worth reformulating to describe better
who should do what.

I think, with rebase and correction in the comments/commit message patch
0006 is ready to be committed.

Regards,
Pavel Borisov.


Re: Add new error_action COPY ON_ERROR "log"

2024-03-28 Thread Masahiko Sawada
On Thu, Mar 28, 2024 at 5:28 PM Bharath Rupireddy
 wrote:
>
> On Thu, Mar 28, 2024 at 1:43 PM torikoshia  wrote:
> >
> > Attached patch fixes the doc,
>
> May I know which patch you are referring to? And, what do you mean by
> "fixes the doc"?
>
> > but I'm wondering perhaps it might be
> > better to modify the code to prohibit abbreviation of the value.
>
> Please help me understand the meaning here.
>
> > When seeing the query which abbreviates ON_ERROR value, I feel it's not
> > obvious what happens compared to other options which tolerate
> > abbreviation of the value such as FREEZE or HEADER.
> >
> >COPY t1 FROM stdin WITH (ON_ERROR);
> >
> > What do you think?
>
> So, do you mean to prohibit ON_ERROR being specified without any value
> like in COPY t1 FROM stdin WITH (ON_ERROR);? If yes, I think all the
> other options do allow that [1].
>
> Even if we were to do something like this, shall we discuss this separately?
>
> Having said that, what do you think of the v13 patch posted upthread?
>

This topic accidentally jumped into this thread, but it made me think
that the same might be true for the LOG_VERBOSITY option. That is,
since the LOG_VERBOSITY option is an enum parameter, it might make
more sense to require the value, instead of making the value optional.
For example, the following command could not be obvious for users:

COPY test FROM stdin (ON_ERROR ignore, LOG_VERBOSITY);

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

2024-03-28 Thread Masahiko Sawada
On Thu, Mar 28, 2024 at 9:38 PM torikoshia  wrote:
>
> On 2024-03-28 10:20, Masahiko Sawada wrote:
> > Hi,
> >
> > On Thu, Jan 18, 2024 at 5:33 PM Masahiko Sawada 
> > wrote:
> >>
> >> On Thu, Jan 18, 2024 at 4:59 PM Alexander Korotkov
> >>  wrote:
> >> >
> >> > On Thu, Jan 18, 2024 at 4:16 AM torikoshia  
> >> > wrote:
> >> > > On 2024-01-18 10:10, jian he wrote:
> >> > > > On Thu, Jan 18, 2024 at 8:57 AM Masahiko Sawada 
> >> > > > 
> >> > > > wrote:
> >> > > >> On Thu, Jan 18, 2024 at 6:38 AM Tom Lane  wrote:
> >> > > >> > Kyotaro-san's suggestion isn't bad, though I might shorten it to
> >> > > >> > error_action {error|ignore|log} (or perhaps "stop" instead of 
> >> > > >> > "error")?
> >> > > >> > You will need a separate parameter anyway to specify the 
> >> > > >> > destination
> >> > > >> > of "log", unless "none" became an illegal table name when I wasn't
> >> > > >> > looking.  I don't buy that one parameter that has some special 
> >> > > >> > values
> >> > > >> > while other values could be names will be a good design.  
> >> > > >> > Moreover,
> >> > > >> > what if we want to support (say) log-to-file along with 
> >> > > >> > log-to-table?
> >> > > >> > Trying to distinguish a file name from a table name without any 
> >> > > >> > other
> >> > > >> > context seems impossible.
> >> > > >>
> >> > > >> I've been thinking we can add more values to this option to log 
> >> > > >> errors
> >> > > >> not only to the server logs but also to the error table (not sure
> >> > > >> details but I imagined an error table is created for each table on
> >> > > >> error), without an additional option for the destination name. The
> >> > > >> values would be like error_action 
> >> > > >> {error|ignore|save-logs|save-table}.
> >> > > >>
> >> > > >
> >> > > > another idea:
> >> > > > on_error {error|ignore|other_future_option}
> >> > > > if not specified then by default ERROR.
> >> > > > You can also specify ERROR or IGNORE for now.
> >> > > >
> >> > > > I agree, the parameter "error_action" is better than "location".
> >> > >
> >> > > I'm not sure whether error_action or on_error is better, but either way
> >> > > "error_action error" and "on_error error" seems a bit odd to me.
> >> > > I feel "stop" is better for both cases as Tom suggested.
> >> >
> >> > OK.  What about this?
> >> > on_error {stop|ignore|other_future_option}
> >> > where other_future_option might be compound like "file 'copy.log'" or
> >> > "table 'copy_log'".
> >>
> >> +1
> >>
> >
> > I realized that ON_ERROR syntax synopsis in the documentation is not
> > correct. The option doesn't require the value to be quoted and the
> > value can be omitted. The attached patch fixes it.
> >
> > Regards,
>
> Thanks!
>
> Attached patch fixes the doc, but I'm wondering perhaps it might be
> better to modify the code to prohibit abbreviation of the value.
>
> When seeing the query which abbreviates ON_ERROR value, I feel it's not
> obvious what happens compared to other options which tolerate
> abbreviation of the value such as FREEZE or HEADER.
>
>COPY t1 FROM stdin WITH (ON_ERROR);
>
> What do you think?

Indeed. Looking at options of other commands such as VACUUM and
EXPLAIN, I can see that we can omit a boolean value, but non-boolean
parameters require its value. The HEADER option is not a pure boolean
parameter but we can omit the value. It seems to be for backward
compatibility; it used to be a boolean parameter. I agree that the
above example would confuse users.
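A toy model of the distinction makes it concrete (illustrative only -- this
is not PostgreSQL's actual defGetString()/defGetBoolean() option code, and
the rule for enum options is the proposed behavior, not the current one):

```python
def parse_copy_option(name, value, option_kinds):
    """Toy model of the rule discussed above: boolean options may omit
    their value (defaulting to true), while enum options such as
    ON_ERROR or LOG_VERBOSITY would be required to spell one out."""
    kind = option_kinds[name.lower()]
    if value is None:
        if kind == "bool":
            return True  # e.g. WITH (FREEZE) means FREEZE true
        raise ValueError("{} requires a value".format(name))
    return value

# Hypothetical classification of a few COPY options.
kinds = {"freeze": "bool", "header": "bool",
         "on_error": "enum", "log_verbosity": "enum"}
```

Under this model, WITH (FREEZE) keeps working, while WITH (ON_ERROR) or
WITH (LOG_VERBOSITY) would raise an error instead of silently picking a
default.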

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: [HACKERS] make async slave to wait for lsn to be replayed

2024-03-28 Thread Alexander Korotkov
On Thu, Mar 28, 2024 at 9:37 AM Thomas Munro  wrote:
>
> > v12
>
> Hi all,
>
> I didn't review the patch but one thing jumped out: I don't think it's
> OK to hold a spinlock while (1) looping over an array of backends and
> (2) making system calls (SetLatch()).

Good catch, thank you.

Fixed along with other issues spotted by Alexander Lakhin.

--
Regards,
Alexander Korotkov


v13-0001-Implement-pg_wait_for_wal_replay_lsn-stored-proc.patch
Description: Binary data


Re: Possibility to disable `ALTER SYSTEM`

2024-03-28 Thread Robert Haas
On Wed, Mar 27, 2024 at 6:24 PM Bruce Momjian  wrote:
> Please ignore my complaints, and my apologies.
>
> As far as the GUC change, let's just be careful since we have a bad
> history of pushing things near the end that we regret.  I am not saying
> that would be this feature, but let's be careful.

Even if what I had pushed was the patch itself, so what? This patch
has been sitting around, largely unchanged, for six months. There has
been plenty of time for wordsmithing the documentation, yet nobody got
interested in doing it until I expressed interest in committing the
patch. Meanwhile, there are over 100 other patches that no committer
is paying attention to right now, some of which could probably really
benefit from some wordsmithing of the documentation. It drives me
insane that this is the patch everyone is getting worked up about.
This is a 27-line code change that does something many people want,
and we're acting like the future of the project depends on it. Both I
and others have committed thousands of lines of new code over the last
few months that could easily be full of bugs that will eat your data
without nearly the scrutiny that this patch is getting.

To be honest, I had every intention of pushing the main patch right
after I pushed that preliminary patch, but I stopped because I saw you
had emailed the thread. I'm pretty sure that I would have missed the
fact that the documentation hadn't been properly updated for the fact
that the sense of the GUC had been inverted. That would have been
embarrassing, and I would have had to push a follow-up commit to fix
that. But no real harm would have been done, except that somebody
surely would have seized on my mistake as proof that this patch wasn't
ready to be committed and that I was being irresponsible and
inconsiderate by pushing forward with it, which is a garbage argument.
Committers make mistakes like that all the time, every week, even
every day. It doesn't mean that they're bad committers, and it doesn't
mean that the patches suck. Some of the patches that get committed do
suck, but it's not because there are a few words wrong in the
documentation.

Let's please, please stop pretending like this patch is somehow
deserving of special scrutiny. There's barely even anything to
scrutinize. It's literally if (!variable) ereport(...) plus some
boilerplate and docs. I entirely agree that, because of the risk of
someone filing a bogus CVE, the docs do need to be carefully worded.
But, I'm going to be honest: I feel completely confident in my ability
to review a patch well enough to know whether the documentation for a
single test-and-ereport has been done up to project standard. It
saddens and frustrates me that you don't seem to agree.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

2024-03-28 Thread torikoshia

On 2024-03-28 10:20, Masahiko Sawada wrote:

Hi,

On Thu, Jan 18, 2024 at 5:33 PM Masahiko Sawada  
wrote:


On Thu, Jan 18, 2024 at 4:59 PM Alexander Korotkov 
 wrote:

>
> On Thu, Jan 18, 2024 at 4:16 AM torikoshia  wrote:
> > On 2024-01-18 10:10, jian he wrote:
> > > On Thu, Jan 18, 2024 at 8:57 AM Masahiko Sawada 
> > > wrote:
> > >> On Thu, Jan 18, 2024 at 6:38 AM Tom Lane  wrote:
> > >> > Kyotaro-san's suggestion isn't bad, though I might shorten it to
> > >> > error_action {error|ignore|log} (or perhaps "stop" instead of "error")?
> > >> > You will need a separate parameter anyway to specify the destination
> > >> > of "log", unless "none" became an illegal table name when I wasn't
> > >> > looking.  I don't buy that one parameter that has some special values
> > >> > while other values could be names will be a good design.  Moreover,
> > >> > what if we want to support (say) log-to-file along with log-to-table?
> > >> > Trying to distinguish a file name from a table name without any other
> > >> > context seems impossible.
> > >>
> > >> I've been thinking we can add more values to this option to log errors
> > >> not only to the server logs but also to the error table (not sure
> > >> details but I imagined an error table is created for each table on
> > >> error), without an additional option for the destination name. The
> > >> values would be like error_action {error|ignore|save-logs|save-table}.
> > >>
> > >
> > > another idea:
> > > on_error {error|ignore|other_future_option}
> > > if not specified then by default ERROR.
> > > You can also specify ERROR or IGNORE for now.
> > >
> > > I agree, the parameter "error_action" is better than "location".
> >
> > I'm not sure whether error_action or on_error is better, but either way
> > "error_action error" and "on_error error" seems a bit odd to me.
> > I feel "stop" is better for both cases as Tom suggested.
>
> OK.  What about this?
> on_error {stop|ignore|other_future_option}
> where other_future_option might be compound like "file 'copy.log'" or
> "table 'copy_log'".

+1



I realized that ON_ERROR syntax synopsis in the documentation is not
correct. The option doesn't require the value to be quoted and the
value can be omitted. The attached patch fixes it.

Regards,


Thanks!

Attached patch fixes the doc, but I'm wondering perhaps it might be 
better to modify the code to prohibit abbreviation of the value.


When seeing the query which abbreviates ON_ERROR value, I feel it's not 
obvious what happens compared to other options which tolerate 
abbreviation of the value such as FREEZE or HEADER.


  COPY t1 FROM stdin WITH (ON_ERROR);

What do you think?

--
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation




Re: Experiments with Postgres and SSL

2024-03-28 Thread Heikki Linnakangas

On 28/03/2024 13:15, Matthias van de Meent wrote:

On Tue, 5 Mar 2024 at 15:08, Heikki Linnakangas  wrote:


I hope I didn't joggle your elbow reviewing this, Jacob, but I spent
some time rebase and fix various little things:


With the recent changes to backend startup committed by you, this
patchset has gotten major apply failures.

Could you provide a new version of the patchset so that it can be
reviewed in the context of current HEAD?


Here you are.

--
Heikki Linnakangas
Neon (https://neon.tech)
From 83484696e470ab130bcd3038f0e28d494065071a Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Tue, 5 Mar 2024 15:40:30 +0200
Subject: [PATCH v9 1/6] Move Kerberos module

So that we can reuse it in new tests.
---
 src/test/kerberos/t/001_auth.pl   | 174 ++--
 src/test/perl/PostgreSQL/Test/Kerberos.pm | 229 ++
 2 files changed, 240 insertions(+), 163 deletions(-)
 create mode 100644 src/test/perl/PostgreSQL/Test/Kerberos.pm

diff --git a/src/test/kerberos/t/001_auth.pl b/src/test/kerberos/t/001_auth.pl
index e51e87d0a2e..d4f1ec58092 100644
--- a/src/test/kerberos/t/001_auth.pl
+++ b/src/test/kerberos/t/001_auth.pl
@@ -21,6 +21,7 @@ use strict;
 use warnings FATAL => 'all';
 use PostgreSQL::Test::Utils;
 use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Kerberos;
 use Test::More;
 use Time::HiRes qw(usleep);
 
@@ -34,177 +35,27 @@ elsif (!$ENV{PG_TEST_EXTRA} || $ENV{PG_TEST_EXTRA} !~ /\bkerberos\b/)
 	  'Potentially unsafe test GSSAPI/Kerberos not enabled in PG_TEST_EXTRA';
 }
 
-my ($krb5_bin_dir, $krb5_sbin_dir);
-
-if ($^O eq 'darwin' && -d "/opt/homebrew")
-{
-	# typical paths for Homebrew on ARM
-	$krb5_bin_dir = '/opt/homebrew/opt/krb5/bin';
-	$krb5_sbin_dir = '/opt/homebrew/opt/krb5/sbin';
-}
-elsif ($^O eq 'darwin')
-{
-	# typical paths for Homebrew on Intel
-	$krb5_bin_dir = '/usr/local/opt/krb5/bin';
-	$krb5_sbin_dir = '/usr/local/opt/krb5/sbin';
-}
-elsif ($^O eq 'freebsd')
-{
-	$krb5_bin_dir = '/usr/local/bin';
-	$krb5_sbin_dir = '/usr/local/sbin';
-}
-elsif ($^O eq 'linux')
-{
-	$krb5_sbin_dir = '/usr/sbin';
-}
-
-my $krb5_config = 'krb5-config';
-my $kinit = 'kinit';
-my $klist = 'klist';
-my $kdb5_util = 'kdb5_util';
-my $kadmin_local = 'kadmin.local';
-my $krb5kdc = 'krb5kdc';
-
-if ($krb5_bin_dir && -d $krb5_bin_dir)
-{
-	$krb5_config = $krb5_bin_dir . '/' . $krb5_config;
-	$kinit = $krb5_bin_dir . '/' . $kinit;
-	$klist = $krb5_bin_dir . '/' . $klist;
-}
-if ($krb5_sbin_dir && -d $krb5_sbin_dir)
-{
-	$kdb5_util = $krb5_sbin_dir . '/' . $kdb5_util;
-	$kadmin_local = $krb5_sbin_dir . '/' . $kadmin_local;
-	$krb5kdc = $krb5_sbin_dir . '/' . $krb5kdc;
-}
-
-my $host = 'auth-test-localhost.postgresql.example.com';
-my $hostaddr = '127.0.0.1';
-my $realm = 'EXAMPLE.COM';
-
-my $krb5_conf = "${PostgreSQL::Test::Utils::tmp_check}/krb5.conf";
-my $kdc_conf = "${PostgreSQL::Test::Utils::tmp_check}/kdc.conf";
-my $krb5_cache = "${PostgreSQL::Test::Utils::tmp_check}/krb5cc";
-my $krb5_log = "${PostgreSQL::Test::Utils::log_path}/krb5libs.log";
-my $kdc_log = "${PostgreSQL::Test::Utils::log_path}/krb5kdc.log";
-my $kdc_port = PostgreSQL::Test::Cluster::get_free_port();
-my $kdc_datadir = "${PostgreSQL::Test::Utils::tmp_check}/krb5kdc";
-my $kdc_pidfile = "${PostgreSQL::Test::Utils::tmp_check}/krb5kdc.pid";
-my $keytab = "${PostgreSQL::Test::Utils::tmp_check}/krb5.keytab";
-
 my $pgpass = "${PostgreSQL::Test::Utils::tmp_check}/.pgpass";
 
 my $dbname = 'postgres';
 my $username = 'test1';
 my $application = '001_auth.pl';
 
-note "setting up Kerberos";
-
-my ($stdout, $krb5_version);
-run_log [ $krb5_config, '--version' ], '>', \$stdout
-  or BAIL_OUT("could not execute krb5-config");
-BAIL_OUT("Heimdal is not supported") if $stdout =~ m/heimdal/;
-$stdout =~ m/Kerberos 5 release ([0-9]+\.[0-9]+)/
-  or BAIL_OUT("could not get Kerberos version");
-$krb5_version = $1;
-
 # Construct a pgpass file to make sure we don't use it
 append_to_file($pgpass, '*:*:*:*:abc123');
 
 chmod 0600, $pgpass or die $!;
 
-# Build the krb5.conf to use.
-#
-# Explicitly specify the default (test) realm and the KDC for
-# that realm to avoid the Kerberos library trying to look up
-# that information in DNS, and also because we're using a
-# non-standard KDC port.
-#
-# Also explicitly disable DNS lookups since this isn't really
-# our domain and we shouldn't be causing random DNS requests
-# to be sent out (not to mention that broken DNS environments
-# can cause the tests to take an extra long time and timeout).
-#
-# Reverse DNS is explicitly disabled to avoid any issue with a
-# captive portal or other cases where the reverse DNS succeeds
-# and the Kerberos library uses that as the canonical name of
-# the host and then tries to acquire a cross-realm ticket.
-append_to_file(
-	$krb5_conf,
-	qq![logging]
-default = FILE:$krb5_log
-kdc = FILE:$kdc_log
-
-[libdefaults]
-dns_lookup_realm = false
-dns_lookup_kdc = false
-default_realm = $realm
-forwardable = false
-rdns = 

Re: Synchronizing slots from primary to standby

2024-03-28 Thread Bertrand Drouvot
Hi,

On Thu, Mar 28, 2024 at 05:05:35PM +0530, Amit Kapila wrote:
> On Thu, Mar 28, 2024 at 3:34 PM Bertrand Drouvot
>  wrote:
> >
> > On Thu, Mar 28, 2024 at 04:38:19AM +, Zhijie Hou (Fujitsu) wrote:
> >
> > > To fix this, we could use the fast forward logical decoding to advance 
> > > the synced
> > > slot's lsn/xmin when syncing these values instead of directly updating the
> > > slot's info. This way, the snapshot will be serialized to disk when 
> > > decoding.
> > > If we could not reach to the consistent point at the remote restart_lsn, 
> > > the
> > > slot is marked as temp and will be persisted once it reaches the 
> > > consistent
> > > point. I am still analyzing the fix and will share once ready.
> >
> > Thanks! I'm wondering about the performance impact (even in fast_forward 
> > mode),
> > might be worth to keep an eye on it.
> >
> 
> True, we can consider performance but correctness should be a
> priority,

Yeah of course.

> and can we think of a better way to fix this issue?

I'll keep you posted if there is one that I can think of.

> > Should we create a 17 open item [1]?
> >
> > [1]: https://wiki.postgresql.org/wiki/PostgreSQL_17_Open_Items
> >
> 
> Yes, we can do that.

done.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Add new error_action COPY ON_ERROR "log"

2024-03-28 Thread torikoshia

On 2024-03-28 17:27, Bharath Rupireddy wrote:
On Thu, Mar 28, 2024 at 1:43 PM torikoshia  
wrote:


Attached patch fixes the doc,


May I know which patch you are referring to? And, what do you mean by
"fixes the doc"?


Ugh, I replied to the wrong thread.
Sorry for making you confused and please ignore it.


but I'm wondering perhaps it might be
better to modify the code to prohibit abbreviation of the value.


Please help me understand the meaning here.

When seeing the query which abbreviates ON_ERROR value, I feel it's 
not

obvious what happens compared to other options which tolerate
abbreviation of the value such as FREEZE or HEADER.

   COPY t1 FROM stdin WITH (ON_ERROR);

What do you think?


So, do you mean to prohibit ON_ERROR being specified without any value
like in COPY t1 FROM stdin WITH (ON_ERROR);? If yes, I think all the
other options do allow that [1].

Even if we were to do something like this, shall we discuss this 
separately?


Having said that, what do you think of the v13 patch posted upthread?

[1]
postgres=# COPY t1 FROM stdin WITH (
DEFAULT ESCAPE  FORCE_QUOTE HEADER  QUOTE
DELIMITER   FORCE_NOT_NULL  FORMAT  NULL
ENCODINGFORCE_NULL  FREEZE  ON_ERROR

postgres=# COPY t1 FROM stdin WITH ( QUOTE );
ERROR:  relation "t1" does not exist
postgres=# COPY t1 FROM stdin WITH ( DEFAULT );
ERROR:  relation "t1" does not exist
postgres=# COPY t1 FROM stdin WITH ( ENCODING );
ERROR:  relation "t1" does not exist



--
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation




Re: Various small doc improvements; plpgsql, schemas, permissions, oidvector

2024-03-28 Thread Robert Treat
On Thu, Mar 28, 2024 at 8:16 AM Andrey M. Borodin  wrote:
> > On 1 Dec 2023, at 19:00, Karl O. Pinc  wrote:
> >
> >  I won't be able to submit
> > new patches based on reviews for 2 weeks.
>
> Hi everyone!
>
> Is there any work going on? Is anyone interested in moving this 
> forward?
>
> Thanks!
>

Hey Andrey,

I spoke with Karl briefly on this and he is working on getting an
updated patch together. The work now involves incorporating feedback
and some rebasing, but hopefully we will see something in the next few
days.

Robert Treat
https://xzilla.net




Re: Possibility to disable `ALTER SYSTEM`

2024-03-28 Thread Jelte Fennema-Nio
On Thu, 28 Mar 2024 at 12:57, Robert Haas  wrote:
> I disagree with a lot of these changes. I think the old version was
> mostly better. But I can live with a lot of it if it makes other
> people happy.

I'd have been fine with many of the previous versions of the docs too.
(I'm not a native english speaker though, so that might be part of it)

> However:

Attached is a patch with your last bit of feedback addressed.


v12-0001-Add-allow_alter_system-GUC.patch
Description: Binary data


Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

2024-03-28 Thread Alvaro Herrera
On 2024-Mar-28, Alvaro Herrera wrote:

> Undefined symbols for architecture arm64:
>   "_libintl_gettext", referenced from:
>   _libpqsrv_cancel in dblink.o
>   _libpqsrv_cancel in dblink.o
> ld: symbol(s) not found for architecture arm64
> clang: error: linker command failed with exit code 1 (use -v to see 
> invocation)
> make[1]: *** [dblink.dylib] Error 1
> make: *** [all-dblink-recurse] Error 2

I just removed the _() from the new function.  There's not much point in
wasting more time on this, given that contrib doesn't have translation
support anyway, and we're not using this in libpqwalreceiver.

-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/
"Crear es tan difícil como ser libre" (Elsa Triolet)




Re: Use virtual tuple slot for Unique node

2024-03-28 Thread Andrey M. Borodin



> On 29 Oct 2023, at 21:30, Denis Smirnov  wrote:
> 
> I have taken a look at this discussion, at the code and I am confused how we 
> choose tuple table slot (TTS) type in PG. 

After offline discussion with Denis, we decided to withdraw this patch from CF 
for now. If anyone is willing to revive working on this, please register a new 
entry in next commitfest.
Thanks!


Best regards, Andrey Borodin.



Re: Various small doc improvements; plpgsql, schemas, permissions, oidvector

2024-03-28 Thread Andrey M. Borodin



> On 1 Dec 2023, at 19:00, Karl O. Pinc  wrote:
> 
>  I won't be able to submit
> new patches based on reviews for 2 weeks.

Hi everyone!

Is there any work going on? Is anyone interested in moving this forward?

Thanks!


Best regards, Andrey Borodin.



Re: Doc limitation update proposal: include out-of-line OID usage per TOAST-ed columns

2024-03-28 Thread Andrey M. Borodin



> On 8 Aug 2023, at 12:31, John Naylor  wrote:
> 
> > > Also the shared counter is the cause of the slowdown, but not the reason 
> > > for the numeric limit.
> >
> > Isn't it both? typedef Oid is unsigned int = 2^32, and according to 
> > GetNewOidWithIndex() logic if we exhaust the whole OID space it will hang 
> > indefinitely which has the same semantics as "being impossible"/permanent 
> > hang (?)
> 
> Looking again, I'm thinking the OID type size is more relevant for the first 
> paragraph, and the shared/global aspect is more relevant for the second.
> 
> The last issue is how to separate the notes at the bottom, since there are 
> now two topics.

Jakub, do you have plans to address this feedback? Is the CF entry still 
relevant?

Thanks!


Best regards, Andrey Borodin.



Re: Can't find not null constraint, but \d+ shows that

2024-03-28 Thread Tender Wang
Alvaro Herrera wrote on Thursday, March 28, 2024 at 17:18:

> On 2024-Mar-28, Tender Wang wrote:
>
> >  RemoveConstraintById() should think recurse(e.g. partition table)? I'm
> not
> > sure now.
> >  If we should think process recurse in RemoveConstraintById(),  the
> > function will look complicated than before.
>
> No -- this function handles just a single constraint, as identified by
> OID.  The recursion is handled by upper layers, which can be either
> dependency.c or tablecmds.c.  I think the problem you found is caused by
> the fact that I worked with the tablecmds.c recursion and neglected the
> one in dependency.c.
>

Indeed.

create table skip_wal_skip_rewrite_index (c varchar(10) primary key);
alter table skip_wal_skip_rewrite_index alter c type varchar(20);

The above SQL needs attnotnull to be true when re-adding the index, but
RemoveConstraintById() can hardly recognize this scenario, as far as I know.
We should re-set attnotnull to true before re-adding the index. I added a new
AT_PASS in the attached patch.
Any thoughts?
--
Tender Wang
OpenPie:  https://en.openpie.com/


v4-0001-Fix-pg_attribute-attnotnull-not-reset-when-droppe.patch
Description: Binary data


Re: Support logical replication of DDLs

2024-03-28 Thread Amit Kapila
On Thu, Mar 28, 2024 at 5:31 PM Andrey M. Borodin  wrote:
>
> This thread is registered on CF [0] but is not active since 2023. Is anyone 
> working on this? I understand that this is a nice feature. Should we move it 
> to next CF or withdraw CF entry?
>

At this stage, we should close it as either Returned with Feedback or
Withdrawn, as this requires a lot of work.

-- 
With Regards,
Amit Kapila.




elog/ereport VS misleading backtrace_function function address

2024-03-28 Thread Jakub Wartak
Hi -hackers,

While chasing some other bug I've learned that backtrace_functions
might be misleading with top elog/ereport() address.

Reproducer:

# using Tom's reproducer on master:
wget 
https://www.postgresql.org/message-id/attachment/112394/ri-collation-bug-example.sql
echo '' >> ri-collation-bug-example.sql
echo '\errverbose' >> ri-collation-bug-example.sql
-- based on grepping the source code locations many others could go in here:
psql -p 5437 -c "alter system set backtrace_functions =
'RI_FKey_cascade_del,get_collation_isdeterministic';"
psql -p 5437 -c "select pg_reload_conf();"
dropdb -p 5437 test1 ; createdb -p 5437 test1 ; psql -p 5437 -d test1
-f ri-collation-bug-example.sql

gives (note "get_collation_isdeterministic"):
psql:ri-collation-bug-example.sql:42: ERROR:  cache lookup failed
for collation 0
psql:ri-collation-bug-example.sql:44: error: ERROR:  XX000: cache
lookup failed for collation 0
LOCATION:  get_collation_isdeterministic, lsyscache.c:1088

and in log note the 0x14c6bb:
2024-03-27 14:39:13.065 CET [12899] postgres@test1 ERROR:  cache
lookup failed for collation 0
2024-03-27 14:39:13.065 CET [12899] postgres@test1 BACKTRACE:
postgres: 16/main: postgres test1 [local] DELETE(+0x14c6bb)
[0x5565c5a9c6bb]
postgres: 16/main: postgres test1 [local]
DELETE(RI_FKey_cascade_del+0x323) [0x5565c5ec0973]
postgres: 16/main: postgres test1 [local] DELETE(+0x2d401f)
[0x5565c5c2401f]
postgres: 16/main: postgres test1 [local] DELETE(+0x2d5a04)
[0x5565c5c25a04]
postgres: 16/main: postgres test1 [local]
DELETE(AfterTriggerEndQuery+0x78) [0x5565c5c2a918]
[..]
2024-03-27 14:39:13.065 CET [12899] postgres@test1 STATEMENT:  delete
from revisions where id='5c617ce7-688d-4bea-9d66-c0f0ebc635da';

based on \errverbose it is OK (error matches the code, thanks to
Alvaro for this hint):

 9   bool
 8   get_collation_isdeterministic(Oid colloid)
 7   {
 6   │   HeapTuple>  tp;
 5   │   Form_pg_collation colltup;
 4   │   bool>   >   result;
 3   │
 2   │   tp = SearchSysCache1(COLLOID, ObjectIdGetDatum(colloid));
 1   │   if (!HeapTupleIsValid(tp))
  1088   │   │   elog(ERROR, "cache lookup failed for collation %u", colloid);
[..]


but the backtrace address 0x14c6bb (!) resolves
differently, which seems highly misleading (note the
"get_language_name.cold"):
$ addr2line -e /usr/lib/postgresql/16/bin/postgres -a -f 0x14c6bb
0x0014c6bb
get_language_name.cold
./build/src/backend/utils/cache/./build/../src/backend/utils/cache/lsyscache.c:1181

When I disassemble get_collation_isdeterministic() (and I know the
name from \errverbose):

Dump of assembler code for function get_collation_isdeterministic:
Address range 0x5c7680 to 0x5c76c1:
   0x005c7680 <+0>: push   %rbp
[..]
   0x005c7694 <+20>:call   0x5d63b0 
   0x005c7699 <+25>:test   %rax,%rax
   0x005c769c <+28>:je 0x14c686

   0x005c76a2 <+34>:mov%rax,%rdi
[..]
   0x005c76bf <+63>:leave
   0x005c76c0 <+64>:ret
Address range 0x14c686 to 0x14c6bb:
   0x0014c686 <-4698106>:   xor%esi,%esi
   0x0014c688 <-4698104>:   mov$0x15,%edi
   0x0014c68d <-4698099>:   call   0x14ec86 
   0x0014c692 <-4698094>:   mov%r12d,%esi
   0x0014c695 <-4698091>:   lea0x5028dc(%rip),%rdi
   # 0x64ef78
   0x0014c69c <-4698084>:   xor%eax,%eax
   0x0014c69e <-4698082>:   call   0x5df320 
   0x0014c6a3 <-4698077>:   lea0x6311a6(%rip),%rdx
   # 0x77d850 <__func__.34>
   0x0014c6aa <-4698070>:   mov$0x440,%esi
   0x0014c6af <-4698065>:   lea0x630c8a(%rip),%rdi
   # 0x77d340
   0x0014c6b6 <-4698058>:   call   0x5df100 
   << NOTE the exact 0x14c6bb is even missing here(!)

notice the interesting thing here: according to GDB,
get_collation_isdeterministic() is at 0x5c7680 with a jump into the cold
range 0x14c686 up to (but not including) 0x14c6bb, without any stack
setup for that .cold part. But backtrace() captured the return address
just past the elog/ereport cold call, 0x14c6bb, so if I take that exact
address from the backtrace_functions output as it is
("DELETE(+0x14c6bb)"), it also shows the WRONG function (just like
addr2line):

(gdb) disassemble 0x14c6bb
Dump of assembler code for function get_language_name:
Address range 0x5c7780 to 0x5c77ee:
[..]
   0x005c77ed <+109>:   ret
Address range 0x14c6bb to 0x14c6f0:
   0x0014c6bb <-4698309>:   xor%esi,%esi
[..]
   0x0014c6eb <-4698261>:   call   0x5df100 
End of assembler dump.

so this is wrong (we are failing in "get_collation_isdeterministic",
not in "get_language_name").
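The mis-resolution can be reproduced without addr2line at all: the address in
the backtrace is a *return* address, i.e. the byte just past the end of
get_collation_isdeterministic's .cold range, so a naive containment lookup
lands in the next cold fragment. A sketch with nm-style ranges taken from the
disassembly above (the resolve helper is illustrative, not real tooling):

```shell
#!/usr/bin/env bash
# Given "start size name" lines (as from `nm -S --defined-only`), print the
# symbol whose [start, start+size) range contains the target address.
resolve() {
  local target=$(( $1 )) start size name s z
  while read -r start size name; do
    s=$(( 16#$start )); z=$(( 16#$size ))
    if (( target >= s && target < s + z )); then
      echo "$name"; return
    fi
  done
}

# Ranges matching the dumps above: the first cold fragment ends exactly
# where the second begins, at 0x14c6bb (0x14c686 + 0x35 = 0x14c6bb).
nm_sample='0014c686 00000035 get_collation_isdeterministic.cold
0014c6bb 00000035 get_language_name.cold
005c7680 00000041 get_collation_isdeterministic'

# The raw return address resolves to the WRONG fragment...
resolve 0x14c6bb <<<"$nm_sample"               # -> get_language_name.cold
# ...backing up one byte (into the call instruction) fixes it.
resolve $(( 0x14c6bb - 1 )) <<<"$nm_sample"    # -> get_collation_isdeterministic.cold
```

The same off-by-one applies when feeding backtrace addresses to addr2line,
which is why subtracting 1 from return addresses is the usual workaround.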

I was very confused at the beginning with the main question being: why
are we compiling elog/ereport() so that it is incompatible with
backtrace? When looking for it I've found two threads [1] by David
that were 

Re: Support logical replication of DDLs

2024-03-28 Thread Andrey M. Borodin



> On 18 Jul 2023, at 12:09, Zhijie Hou (Fujitsu)  wrote:
> 
> Here is the POC patch(0004) for the second approach

Hi everyone!

This thread is registered on CF [0] but has been inactive since 2023. Is anyone 
working on this? I understand that this is a nice feature. Should we move it to 
the next CF or withdraw the CF entry?

Thanks!


[0] https://commitfest.postgresql.org/47/3595/



Re: Possibility to disable `ALTER SYSTEM`

2024-03-28 Thread Robert Haas
On Thu, Mar 28, 2024 at 5:42 AM Jelte Fennema-Nio  wrote:
> On Thu, 28 Mar 2024 at 10:24, Jelte Fennema-Nio  wrote:
> > I addressed them all I think. Mostly the small changes that were
> > suggested, but I rewrote the sentence with "might be discarded". And I
> > added references to the new GUC in both places suggested by David.
>
> Changed the error hint to use "external tool" too. And removed a
> duplicate header definition of AllowAlterSystem (I moved it to guc.h
> so it was together with other definitions a few patches ago, but
> apparently forgot to remove it from guc_tables.h)

I disagree with a lot of these changes. I think the old version was
mostly better. But I can live with a lot of it if it makes other
people happy. However:

+Which might result in unintended behavior, such as the external tool
+discarding the change at some later point in time when it updates the
+configuration.

This is not OK from a grammatical point of view. You can't just start
a sentence with "which" like this. You could replace "Which" with
"This", though.

+ if (!AllowAlterSystem)
+ {
+
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("ALTER SYSTEM is not allowed in this environment"),
+ errhint("Global configuration changes should be made using an
external tool, not by using ALTER SYSTEM.")));
+ }

The extra blank line should go. The brackets should go. And I think
the errhint should go, too, because the errhint implies that we know
why the user chose to set allow_alter_system=false. There's no reason
for this message to be opinionated about that.
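For illustration, with all three of those changes applied the check would
reduce to something like this (untested fragment, merely restating the review
comments above):

```c
	if (!AllowAlterSystem)
		ereport(ERROR,
				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
				 errmsg("ALTER SYSTEM is not allowed in this environment")));
```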

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

2024-03-28 Thread Alvaro Herrera
Hm, indri failed:

ccache gcc -Wall -Wmissing-prototypes -Wpointer-arith 
-Wdeclaration-after-statement -Werror=vla -Werror=unguarded-availability-new 
-Wendif-labels -Wmissing-format-attribute -Wcast-function-type 
-Wformat-security -fno-strict-aliasing -fwrapv 
-Wno-unused-command-line-argument -Wno-compound-token-split-by-macro -g -O2 
-fno-common -Werror  -fvisibility=hidden -bundle -o dblink.dylib  dblink.o 
-L../../src/port -L../../src/common -L../../src/interfaces/libpq -lpq -isysroot 
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX14.4.sdk
  -L/opt/local/libexec/llvm-15/lib -L/opt/local/lib -L/opt/local/lib 
-L/opt/local/lib  -L/opt/local/lib -Wl,-dead_strip_dylibs  -Werror  
-fvisibility=hidden -bundle_loader ../../src/backend/postgres

Undefined symbols for architecture arm64:
  "_libintl_gettext", referenced from:
  _libpqsrv_cancel in dblink.o
  _libpqsrv_cancel in dblink.o
ld: symbol(s) not found for architecture arm64
clang: error: linker command failed with exit code 1 (use -v to see invocation)
make[1]: *** [dblink.dylib] Error 1
make: *** [all-dblink-recurse] Error 2

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/




Re: Synchronizing slots from primary to standby

2024-03-28 Thread Amit Kapila
On Thu, Mar 28, 2024 at 3:34 PM Bertrand Drouvot
 wrote:
>
> On Thu, Mar 28, 2024 at 04:38:19AM +, Zhijie Hou (Fujitsu) wrote:
>
> > To fix this, we could use the fast forward logical decoding to advance the 
> > synced
> > slot's lsn/xmin when syncing these values instead of directly updating the
> > slot's info. This way, the snapshot will be serialized to disk when 
> > decoding.
> > If we could not reach to the consistent point at the remote restart_lsn, the
> > slot is marked as temp and will be persisted once it reaches the consistent
> > point. I am still analyzing the fix and will share once ready.
>
> Thanks! I'm wondering about the performance impact (even in fast_forward 
> mode),
> might be worth to keep an eye on it.
>

True, we can consider performance, but correctness should be the
priority. Also, can we think of a better way to fix this issue?

> Should we create a 17 open item [1]?
>
> [1]: https://wiki.postgresql.org/wiki/PostgreSQL_17_Open_Items
>

Yes, we can do that.

-- 
With Regards,
Amit Kapila.




Re: Vectored I/O in bulk_write.c

2024-03-28 Thread Thomas Munro
Then I would make the trivial change to respect the new
io_combine_limit GUC that I'm gearing up to commit in another thread.
As attached.
From 7993cede8939cad9172867ccc690a44ea25d1ad6 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Fri, 29 Mar 2024 00:22:53 +1300
Subject: [PATCH] fixup: respect io_combine_limit in bulk_write.c

---
 src/backend/storage/smgr/bulk_write.c | 12 +++-
 1 file changed, 3 insertions(+), 9 deletions(-)

diff --git a/src/backend/storage/smgr/bulk_write.c b/src/backend/storage/smgr/bulk_write.c
index 848c3054f5..612a9a23b3 100644
--- a/src/backend/storage/smgr/bulk_write.c
+++ b/src/backend/storage/smgr/bulk_write.c
@@ -45,12 +45,6 @@
 
 #define MAX_PENDING_WRITES XLR_MAX_BLOCK_ID
 
-/*
- * How many blocks to send to smgrwritev() at a time.  Arbitrary value for
- * now.
- */
-#define MAX_BLOCKS_PER_WRITE ((128 * 1024) / BLCKSZ)
-
 static const PGIOAlignedBlock zero_buffer = {{0}}; /* worth BLCKSZ */
 
 typedef struct PendingWrite
@@ -232,7 +226,7 @@ smgr_bulk_flush(BulkWriteState *bulkstate)
 	for (int i = 0; i < npending; i++)
 	{
 		Page		page;
-		const void *pages[MAX_BLOCKS_PER_WRITE];
+		const void *pages[MAX_IO_COMBINE_LIMIT];
 		BlockNumber blkno;
 		int			nblocks;
 		int			max_nblocks;
@@ -266,14 +260,14 @@ smgr_bulk_flush(BulkWriteState *bulkstate)
 			 * We're overwriting.  Clamp at the existing size, because we
 			 * can't mix writing and extending in a single operation.
 			 */
-			max_nblocks = Min(lengthof(pages),
+			max_nblocks = Min(io_combine_limit,
 							  bulkstate->pages_written - blkno);
 		}
 		else
 		{
 			/* We're extending. */
 			Assert(blkno == bulkstate->pages_written);
-			max_nblocks = lengthof(pages);
+			max_nblocks = io_combine_limit;
 		}
 
 		/* Find as many consecutive blocks as we can. */
 
/* Find as many consecutive blocks as we can. */
-- 
2.39.3 (Apple Git-146)



Re: Synchronizing slots from primary to standby

2024-03-28 Thread Amit Kapila
On Thu, Mar 28, 2024 at 10:08 AM Zhijie Hou (Fujitsu)
 wrote:
>
> When analyzing one BF error[1], we find an issue of slotsync: Since we don't
> perform logical decoding for the synced slots when syncing the lsn/xmin of
> slot, no logical snapshots will be serialized to disk. So, when user starts to
> use these synced slots after promotion, it needs to re-build the consistent
> snapshot from the restart_lsn if the WAL(xl_running_xacts) at restart_lsn
> position indicates that there are running transactions. This however could
> cause the data that before the consistent point to be missed[2].
>
> This issue doesn't exist on the primary because the snapshot at restart_lsn
> should have been serialized to disk (SnapBuildProcessRunningXacts ->
> SnapBuildSerialize), so even if the logical decoding restarts, it can find
> consistent snapshot immediately at restart_lsn.
>
> To fix this, we could use the fast forward logical decoding to advance the 
> synced
> slot's lsn/xmin when syncing these values instead of directly updating the
> slot's info. This way, the snapshot will be serialized to disk when decoding.
> If we could not reach to the consistent point at the remote restart_lsn, the
> slot is marked as temp and will be persisted once it reaches the consistent
> point. I am still analyzing the fix and will share once ready.
>

Yes, we can use this, but one thing to note is that
CreateDecodingContext() expects the slot's database and the current
database to be the same. I think the reason for that is that we need to
check system tables of the current database while decoding and sending
data to the output_plugin, which won't be a requirement for the
fast_forward case. So, we need to skip that check in fast_forward mode.

Next, I was thinking about the case of the first time updating the
restart and confirmed_flush LSN while syncing the slots. I think we
can keep the current logic as it is based on the following analysis.

For each logical slot, cases possible on the primary:
1. The restart_lsn doesn't have a serialized snapshot and hasn't yet
reached the consistent point.
2. The restart_lsn doesn't have a serialized snapshot but has reached
a consistent point.
3. The restart_lsn has a serialized snapshot which means it has
reached a consistent point as well.

Considering we keep the logic to reserve initial WAL positions the
same as the current (Reserve WAL for the currently active local slot
using the specified WAL location (restart_lsn). If the given WAL
location has been removed, reserve WAL using the oldest existing WAL
segment.), I could think of the below scenarios:
A. For 1, we shouldn't sync the slot as it still wouldn't have been
marked persistent on the primary.
B. For 2, we would sync the slot
   B1. If remote_restart_lsn >= local_restart_lsn, then advance the
slot by calling pg_logical_replication_slot_advance().
   B11. If we reach consistent point, then it should be okay
because after promotion as well we should reach consistent point.
B111. But again is it possible that there is some xact
that comes before consistent_point on primary and the same is after
consistent_point on standby? This shouldn't matter as we will start
decoding transactions after confirmed_flush_lsn which would be the
same on primary and standby.
   B22. If we haven't reached consistent_point, then we won't mark
the slot as persistent, and at the next sync we will do the same till
it reaches consistent_point. At that time, the situation will be
similar to B11.
   B2. If remote_restart_lsn < local_restart_lsn, then we will wait
for the next sync cycle and keep the slot as temporary. Once in the
next or some consecutive sync cycle, we reach the condition
remote_restart_lsn >= local_restart_lsn, we will proceed to advance
the slot and we should have the same behavior as B1.
C. For 3, we would sync the slot, but the behavior should be the same as B.

Thoughts?

-- 
With Regards,
Amit Kapila.




Re: Fix some resources leaks (src/bin/pg_basebackup/pg_createsubscriber.c)

2024-03-28 Thread Ranier Vilela
Em qua., 27 de mar. de 2024 às 23:08, Euler Taveira 
escreveu:

> On Wed, Mar 27, 2024, at 8:50 PM, Ranier Vilela wrote:
>
> Coverity has some reports in the new code
> pg_createsubscriber.c
> I think that Coverity is right.
>
>
> It depends on your "right" definition.
>
I do not think so.

If your program execution is ephemeral
> and the leak is just before exiting, do you think it's worth it?
>
I think it's worth it.
Even an ephemeral execution can allocate resources and, especially on
Windows, block those resources for other programs. On a highly loaded
server with very few free resources, releasing them as quickly as
possible makes a difference.


> 1.
> CID 1542682: (#1 of 1): Resource leak (RESOURCE_LEAK)
> leaked_storage: Variable buf going out of scope leaks the storage it
> points to.
>
>
> It will exit in the next instruction.
>
Yes, by exit() call function.


>
> 2.
> CID 1542704: (#1 of 1): Resource leak (RESOURCE_LEAK)
> leaked_storage: Variable conn going out of scope leaks the storage it
> points to.
>
>
> The connect_database function whose exit_on_error is false is used in 2
> routines:
>
Calling connect_database with false, per example:
conn = connect_database(dbinfo[i].pubconninfo, false);

If the connection status != CONNECTION_OK, the function returns NULL,
but does not free the connection struct and its associated memory.

In a loop over thousands of specified databases, this would quickly
run into problems, especially on Windows.


> * cleanup_objects_atexit: that's about to exit;
> * drop_primary_replication_slot: that will execute a few routines before
> exiting.
>
> 3.
> CID 1542691: (#1 of 1): Resource leak (RESOURCE_LEAK)
> leaked_storage: Variable str going out of scope leaks the storage it
> points to.
>
>
> It will exit in the next instruction.
>
Yes, by exit() call function.

About exit() function:

deallocating-memory-when-using-exit1-c

"exit *does not* call the destructors of any stack based objects so if
those objects have allocated any memory internally then yes that memory
will be leaked. "

reference/cstdlib/exit/ 
"Note that objects with automatic storage are not destroyed by calling exit
(C++)."

I think that relying on the exit function for cleanup is extremely
fragile, especially on Windows.


> Having said that, applying this patch is just a matter of style.
>
IMO, a better and more correct style.

best regards,
Ranier Vilela


Re: Streaming I/O, vectored I/O (WIP)

2024-03-28 Thread Thomas Munro
On Fri, Mar 29, 2024 at 12:06 AM Thomas Munro  wrote:
> Small bug fix: the condition in the final test at the end of
> read_stream_look_ahead() wasn't quite right.  In general when looking
> ahead, we don't need to start a read just because the pending read
> would bring us up to stream->distance if submitted now (we'd prefer to
> build it all the way up to size io_combine_limit if we can), but if
> that condition is met AND we have nothing pinned yet, then there is no
> chance for the read to grow bigger by a pinned buffer being consumed.
> Fixed, comment updated.

Oops, I sent the wrong/unfixed version.  This version has the fix
described above.


v13-0001-Provide-vectored-variant-of-ReadBuffer.patch
Description: Binary data


v13-0002-Provide-API-for-streaming-relation-data.patch
Description: Binary data


v13-0003-Use-streaming-I-O-in-pg_prewarm.patch
Description: Binary data


Re: Experiments with Postgres and SSL

2024-03-28 Thread Matthias van de Meent
On Tue, 5 Mar 2024 at 15:08, Heikki Linnakangas  wrote:
>
> I hope I didn't joggle your elbow reviewing this, Jacob, but I spent
> some time rebase and fix various little things:

With the recent changes to backend startup committed by you, this
patchset has gotten major apply failures.

Could you provide a new version of the patchset so that it can be
reviewed in the context of current HEAD?

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




Re: AIX support

2024-03-28 Thread Sriram RK
Hi Team,



We are setting up the build environment, trying to build the source, and also 
analyzing the assert from the AIX point of view.

Also, we would like to know if we can access the buildfarm (Power machines) to 
run any of the specific tests to hit this assert.

Thanks & regards,
Sriram.

  > From: Sriram RK 
  > Date: Thursday, 21 March 2024 at 10:05 PM
  > To: Tom Lane t...@sss.pgh.pa.us, Alvaro Herrera 

  > Cc: pgsql-hack...@postgresql.org 

  > Subject: Re: AIX support
  > Thanks, Tom and Alvaro, for the info.
  > We shall look into to details and get back.



Re: Streaming I/O, vectored I/O (WIP)

2024-03-28 Thread Thomas Munro
Small bug fix: the condition in the final test at the end of
read_stream_look_ahead() wasn't quite right.  In general when looking
ahead, we don't need to start a read just because the pending read
would bring us up to stream->distance if submitted now (we'd prefer to
build it all the way up to size io_combine_limit if we can), but if
that condition is met AND we have nothing pinned yet, then there is no
chance for the read to grow bigger by a pinned buffer being consumed.
Fixed, comment updated.


v12-0001-Provide-vectored-variant-of-ReadBuffer.patch
Description: Binary data


v12-0002-Provide-API-for-streaming-relation-data.patch
Description: Binary data


v12-0003-Use-streaming-I-O-in-pg_prewarm.patch
Description: Binary data


Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

2024-03-28 Thread Alvaro Herrera
On 2024-Mar-28, Jelte Fennema-Nio wrote:

> Your changes look good, apart from the prverror stuff indeed. If you
> remove the prverror stuff again I think this is ready to commit.

Great, thanks for looking.  Pushed now, I'll be closing the commitfest
entry shortly.

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"Always assume the user will do much worse than the stupidest thing
you can imagine."(Julien PUYDT)




Re: Fix parallel vacuum buffer usage reporting

2024-03-28 Thread Anthonin Bonnefoy
Hi,

Thanks for the review.

On Thu, Mar 28, 2024 at 4:07 AM Masahiko Sawada 
wrote:

> As for the proposed patch, the following part should handle the temp
> tables too:
>

True, I've missed the local blocks. I've updated the patch:
- read_rate and write_rate now include local block usage
- I've added a specific output for reporting local blocks instead of
combining them in the same output.

Regards,
Anthonin


v2-0001-Fix-parallel-vacuum-buffer-usage-reporting.patch
Description: Binary data


Re: Synchronizing slots from primary to standby

2024-03-28 Thread Bertrand Drouvot
Hi,

On Thu, Mar 28, 2024 at 04:38:19AM +, Zhijie Hou (Fujitsu) wrote:
> Hi,
> 
> When analyzing one BF error[1], we find an issue of slotsync: Since we don't
> perform logical decoding for the synced slots when syncing the lsn/xmin of
> slot, no logical snapshots will be serialized to disk. So, when user starts to
> use these synced slots after promotion, it needs to re-build the consistent
> snapshot from the restart_lsn if the WAL(xl_running_xacts) at restart_lsn
> position indicates that there are running transactions. This however could
> cause the data that before the consistent point to be missed[2].

I see, nice catch and explanation, thanks!

> This issue doesn't exist on the primary because the snapshot at restart_lsn
> should have been serialized to disk (SnapBuildProcessRunningXacts ->
> SnapBuildSerialize), so even if the logical decoding restarts, it can find
> consistent snapshot immediately at restart_lsn.

Right.

> To fix this, we could use the fast forward logical decoding to advance the 
> synced
> slot's lsn/xmin when syncing these values instead of directly updating the
> slot's info. This way, the snapshot will be serialized to disk when decoding.
> If we could not reach to the consistent point at the remote restart_lsn, the
> slot is marked as temp and will be persisted once it reaches the consistent
> point. I am still analyzing the fix and will share once ready.

Thanks! I'm wondering about the performance impact (even in fast_forward mode),
might be worth to keep an eye on it.

Should we create a 17 open item [1]?

[1]: https://wiki.postgresql.org/wiki/PostgreSQL_17_Open_Items

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



