Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-19 Thread Bharath Rupireddy
On Mon, Mar 18, 2024 at 3:42 PM Bertrand Drouvot
 wrote:
>
> Hi,
>
> Looking at 0001:

Thanks for reviewing.

> 1 ===
>
> +   True if this logical slot conflicted with recovery (and so is now
> +   invalidated). When this column is true, check
>
> Worth to add back the physical slot mention "Always NULL for physical slots."?

Will change.

> 2 ===
>
> @@ -1023,9 +1023,10 @@ CREATE VIEW pg_replication_slots AS
>  L.wal_status,
>  L.safe_wal_size,
>  L.two_phase,
> -L.conflict_reason,
> +L.conflicting,
>  L.failover,
> -L.synced
> +L.synced,
> +L.invalidation_reason
>
> What about making invalidation_reason close to conflict_reason?

Not required, I think. One can pick the required columns in the SELECT
clause anyway.

> 3 ===
>
> - * Maps a conflict reason for a replication slot to
> + * Maps a invalidation reason for a replication slot to
>
> s/a invalidation/an invalidation/?

Will change.

> 4 ===
>
> While at it, shouldn't we also rename "conflict" to say "invalidation_cause" 
> in
> InvalidatePossiblyObsoleteSlot()?

That's in line with our understanding of conflict vs. invalidation,
and keeps the function generic. Will change.

> 5 ===
>
> + * rows_removed and wal_level_insufficient are only two reasons
>
> s/are only two/are the only two/?

Will change.

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Refactoring backend fork+exec code

2024-03-19 Thread Tom Lane
Heikki Linnakangas  writes:
> Committed, with some final cosmetic cleanups. Thanks everyone!

A couple of buildfarm animals don't like these tests:

Assert(child_type >= 0 && child_type < lengthof(child_process_kinds));

for example

 ayu   | 2024-03-19 13:08:05 | launch_backend.c:211:39: warning: comparison of constant 16 with expression of type 'BackendType' (aka 'enum BackendType') is always true [-Wtautological-constant-out-of-range-compare]
 ayu   | 2024-03-19 13:08:05 | launch_backend.c:233:39: warning: comparison of constant 16 with expression of type 'BackendType' (aka 'enum BackendType') is always true [-Wtautological-constant-out-of-range-compare]

I'm not real sure why it's moaning about the "<" check but not the
">= 0" check, which ought to be equally tautological given the
assumption that the input is a valid member of the enum.  But
in any case, exactly how much value do these assertions carry?
If you're intent on keeping them, perhaps casting child_type to
int here would suppress the warnings.  But personally I think
I'd lose the Asserts.
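
For reference, the cast variant would look something like this (illustration
only, not committed code); casting the enum to int keeps the range check
while silencing the tautological-compare warning:

    Assert((int) child_type >= 0 &&
           (int) child_type < lengthof(child_process_kinds));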

regards, tom lane




Re: Have pg_basebackup write "dbname" in "primary_conninfo"?

2024-03-19 Thread vignesh C
On Tue, 19 Mar 2024 at 17:18, Hayato Kuroda (Fujitsu)
 wrote:
>
> Dear Sawada-san,
>
> Thanks for giving comments!
>
> > This behavior makes sense to me. But do we want to handle the case of
> > using environment variables too?
>
> Yeah, v5 does not consider which libpq parameters are specified by environment
> variables. Such a variable should be used when the dbname is not expressly
> written in the connection string.
> Such a path was added in the v6 patch. If the dbname is not determined after
> parsing the connection string, we call PQconndefaults() to get settings from
> environment variables and service files [1], then search for the dbname again.
> An example is shown below.
>
> ```
> PGPORT=5431 PGUSER=kuroda PGDATABASE=postgres pg_basebackup -D data_N2 -R -v
> ->
> primary_conninfo = 'user=kuroda ... port=5431 ... dbname=postgres ... '
> ```
>
> > IIUC,
> >
> > pg_basebackup -D tmp -d "user=masahiko dbname=test_db"
> >
> > is equivalent to:
> >
> > PGDATABASE="user=masahiko dbname=test_db" pg_basebackup -D tmp
>
> The case won't work. I think you assumed that expanded_dbname like
> PQconnectdbParams() [2] can be used for environment variables, but that is not
> correct - it won't be parsed as a connection string again.
>
> In the libpq layer, connection parameters are parsed in
> PQconnectStartParams()->conninfo_array_parse().
> When expand_dbname is specified, the entry "dbname" is checked first and its
> value parsed. This is done at fe-connect.c:5846.
>
> The environment variables are checked and parsed in conninfo_add_defaults(),
> which is called from conninfo_array_parse(). However, that happens at
> fe-connect.c:5956 - the expand_dbname processing has already been done by that
> time. This means there is no chance that PGDATABASE is parsed in the expanded
> style.
>
> For example, if the pg_basebackup runs like below:
>
> PGDATABASE="user=kuroda dbname=postgres" pg_basebackup -D data_N2 -R -v
>
> The primary_conninfo written in the file will be:
>
> primary_conninfo = 'user=hayato ... dbname=''user=kuroda dbname=postgres'''

Thanks for the patch.
Here are the results of various tests with the patch, specifying the
connection string, environment variables, service file, and connection
URIs:
case 1:
pg_basebackup -D test10 -p 5431 -X s -P  -R -d "user=vignesh dbname=db1"
primary_conninfo will have dbname=db1

case 2:
pg_basebackup -D test10 -p 5431 -X s -P  -R -d "user=vignesh dbname=replication"
primary_conninfo will have dbname=replication

case 3:
pg_basebackup -D test10 -p 5431 -X s -P  -R  -U vignesh
primary_conninfo will not have dbname

case 4:
pg_basebackup -D test10 -p 5431 -X s -P  -R
primary_conninfo will not have dbname

case 5:
pg_basebackup -D test10 -p 5431 -X s -P  -R -d "user=vignesh"
primary_conninfo will not have dbname

case 6:
pg_basebackup -D test10 -p 5431 -X s -P  -R  -d ""
primary_conninfo will not have dbname

--- Testing through PGDATABASE environment variable
case 7:
export PGDATABASE="user=postgres dbname=test"
./pg_basebackup -D test10 -p 5431 -X s -P  -R
primary_conninfo will have dbname=''user=postgres dbname=test'' like below:
primary_conninfo = 'user=vignesh passfile=''/home/vignesh/.pgpass''
channel_binding=prefer port=5431 sslmode=prefer sslcompression=0
sslcertmode=allow sslsni=1 ssl_min_protocol_version=TLSv1.2
gssencmode=disable krbsrvname=postgres gssdelegation=0
target_session_attrs=any load_balance_hosts=disable
dbname=''user=postgres dbname=test'''

case 8:
export PGDATABASE=db1
./pg_basebackup -D test10 -p 5431 -X s -P  -R
primary_conninfo will have dbname=db1

--- Testing through pg_service
case 9:
Create .pg_service.conf with the following info:
[conn1]
dbname=db2

export PGSERVICE=conn1

./pg_basebackup -D test10 -p 5431 -X s -P  -R
primary_conninfo will have dbname=db2

case 10:
Create .pg_service.conf with the following info, i.e. there is no
database specified:
[conn1]

./pg_basebackup -D test10 -p 5431 -X s -P  -R
primary_conninfo will not have dbname

--- Testing through Connection URIs
case 11:
./pg_basebackup -D test10  -X s -P  -R -d "postgresql://localhost:5431"
primary_conninfo will not have dbname

case 12:
./pg_basebackup -D test10  -p 5431 -X s -P  -R -d
"postgresql://localhost/db3:5431"
primary_conninfo will have dbname=''db3:5431'' like below:
primary_conninfo = 'user=vignesh passfile=''/home/vignesh/.pgpass''
channel_binding=prefer host=localhost port=5431 sslmode=prefer
sslcompression=0 sslcertmode=allow sslsni=1
ssl_min_protocol_version=TLSv1.2 gssencmode=disable
krbsrvname=postgres gssdelegation=0 target_session_attrs=any
load_balance_hosts=disable dbname=''db3:5431'''

case 13:
./pg_basebackup -D test10  -p 5431 -X s -P  -R -d "postgresql://localhost/db3"
primary_conninfo will have dbname=db3

case 14:
./pg_basebackup -D test10   -X s -P  -R -d "postgresql://localhost:5431/db3"
primary_conninfo will have dbname=db3

case 15:
./pg_basebackup -D test10   -X s -P  -R -d
"postgresql://localhost:5431/db4,127.0.0.1:5431/db5"

Re: Table AM Interface Enhancements

2024-03-19 Thread Pavel Borisov
Hi, Alexander!

For 0007:

Code inside

+heapam_reloptions(char relkind, Datum reloptions, bool validate)
+{
+   if (relkind == RELKIND_RELATION ||
+   relkind == RELKIND_TOASTVALUE ||
+   relkind == RELKIND_MATVIEW)
+   return heap_reloptions(relkind, reloptions, validate);
+
+   return NULL;

looks redundant to what is done inside heap_reloptions(). Was this on
purpose? Is it possible to leave only "return heap_reloptions()"?

This looks like a duplicate:
src/include/access/reloptions.h:extern bytea
*index_reloptions(amoptions_function amoptions, Datum reloptions,
src/include/access/tableam.h:extern bytea
*index_reloptions(amoptions_function amoptions, Datum reloptions,

Otherwise the patch looks good and does what it's proposed to do.

Regards,
Pavel Borisov.


Re: Popcount optimization using AVX512

2024-03-19 Thread David Rowley
On Wed, 20 Mar 2024 at 11:56, Amonson, Paul D  wrote:
> Changed in this patch set.

Thanks for rebasing.

I don't think there's any need to mention Intel in each of the
following comments:

+# Check for Intel AVX512 intrinsics to do POPCNT calculations.

+# Newer Intel processors can use AVX-512 POPCNT Capabilities (01/30/2024)

AMD's Zen4 also has AVX512, so it's misleading to indicate it's an
Intel only instruction.  Also, writing the date isn't necessary as we
have "git blame"

David




Re: remaining sql/json patches

2024-03-19 Thread jian he
On Tue, Mar 19, 2024 at 6:46 PM Amit Langote  wrote:
>
> I intend to commit 0001+0002 after a bit more polishing.
>

V43 is far more intuitive! thanks!

if (isnull ||
(exprType(expr) == JSONBOID &&
btype == default_behavior))
coerce = true;
else
coerced_expr =
coerce_to_target_type(pstate, expr, exprType(expr),
  returning->typid, returning->typmod,
  COERCION_EXPLICIT, COERCE_EXPLICIT_CAST,
  exprLocation((Node *) behavior));

Obviously, there are cases where "coerce" is false and "coerced_expr"
is not null, so I think the name of the bool "coerce" variable is not
very intuitive. Maybe we can add some comments or change it to a better
name.


JsonPathVariableEvalContext
JsonPathVarCallback
JsonItemType
JsonExprPostEvalState
These should be removed from src/tools/pgindent/typedefs.list.


+/*
+ * Performs JsonPath{Exists|Query|Value}() for a given context_item and
+ * jsonpath.
+ *
+ * Result is set in *op->resvalue and *op->resnull.  Return value is the
+ * step address to be performed next.
+ *
+ * On return, JsonExprPostEvalState is populated with the following details:
+ * - error.value: true if an error occurred during JsonPath evaluation
+ * - empty.value: true if JsonPath{Query|Value}() found no matching item
+ *
+ * No return if the ON ERROR/EMPTY behavior is ERROR.
+ */
+int
+ExecEvalJsonExprPath(ExprState *state, ExprEvalStep *op,
+ ExprContext *econtext)

" No return if the ON ERROR/EMPTY behavior is ERROR."  is wrong?
counter example:
SELECT JSON_QUERY(jsonb '{"a":[12,2]}', '$.a' RETURNING int4RANGE omit
quotes error on error);
also "JsonExprPostEvalState" does not exist any more.
overall feel like ExecEvalJsonExprPath comments need to be rephrased.




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-19 Thread Amit Kapila
On Wed, Mar 20, 2024 at 12:49 AM Bharath Rupireddy
 wrote:
>
>
> Following are some open points:
>
> 1. Where to do inactive_timeout invalidation exactly if not the checkpointer.
>

I have suggested doing it at the time of CheckpointReplicationSlots(),
and Bertrand suggested doing it whenever we resume using the slot. I
think we should follow both suggestions.

> 2. Where to do XID age invalidation exactly if not the checkpointer.
> 3. How to go about recomputing XID horizons based on max_slot_xid_age.
> Does the slot's horizon's need to be adjusted in ComputeXidHorizons()?
>

I suggest postponing the patch for XID-based invalidation to a later
discussion.

> 4. New invalidation mechanisms interaction with slot sync feature.
>

Yeah, this is important. My initial thoughts are that synced slots
shouldn't be invalidated on the standby due to timeout.

> 5. Review comments on 0001 from Bertrand.
>
> Please see the attached v12 patches.
>

Thanks for quickly updating the patches.

-- 
With Regards,
Amit Kapila.




Re: PostgreSQL 17 Release Management Team & Feature Freeze

2024-03-19 Thread Amit Kapila
On Mon, Mar 18, 2024 at 12:39 PM Hayato Kuroda (Fujitsu)
 wrote:
>
> > If you think that this is OK, and as far as I can see this looks OK on
> > the thread, then this open item should be moved under "resolved before
> > 17beta1", mentioning the commit involved in the fix.  Please see [1]
> > for examples.
>
> OK, I understood that I could wait for your check. Thanks.
>

I don't think there is a need to wait here. The issue being tracked
was fixed, so I have updated the open items page accordingly.

-- 
With Regards,
Amit Kapila.




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-19 Thread Amit Kapila
On Tue, Mar 19, 2024 at 6:12 PM Bertrand Drouvot
 wrote:
>
> On Tue, Mar 19, 2024 at 04:20:35PM +0530, Amit Kapila wrote:
> > On Tue, Mar 19, 2024 at 3:11 PM Bertrand Drouvot
> >  wrote:
> > >
> > > On Tue, Mar 19, 2024 at 10:56:25AM +0530, Amit Kapila wrote:
> > > > On Mon, Mar 18, 2024 at 8:19 PM Bertrand Drouvot
> > > >  wrote:
> > > > > Agree. While it makes sense to invalidate slots for wal removal in
> > > > > CreateCheckPoint() (because this is the place where wal is removed), 
> > > > > I 'm not
> > > > > sure this is the right place for the 2 new cases.
> > > > >
> > > > > Let's focus on the timeout one as proposed above (as probably the 
> > > > > simplest one):
> > > > > as this one is purely related to time and activity what about to 
> > > > > invalidate them
> > > > > when?:
> > > > >
> > > > > - their usage resume
> > > > > - in pg_get_replication_slots()
> > > > >
> > > > > The idea is to invalidate the slot when one resumes activity on it or 
> > > > > wants to
> > > > > get information about it (and among other things wants to know if the 
> > > > > slot is
> > > > > valid or not).
> > > > >
> > > >
> > > > Trying to invalidate at those two places makes sense to me but we
> > > > still need to cover the cases where it takes very long to resume the
> > > > slot activity and the dangling slot cases where the activity is never
> > > > resumed.
> > >
> > > I understand it's better to have the slot reflect its real status
> > > internally, but is it a real issue if that's not the case until the
> > > activity on it is resumed?
> > > (just asking, not saying we should not)
> > >
> >
> > Sorry, I didn't understand your point. Can you try to explain by example?
>
> Sorry if that was not clear, let me try to rephrase it: what issue do you
> see if the invalidation of such a slot occurs only when its usage resumes or
> when pg_get_replication_slots() is triggered? I understand that this could
> lead to the slot not being invalidated (maybe forever), but is that an issue
> for an inactive slot?
>

It has the risk of preventing WAL and row removal. I think this is the
primary reason we are planning to have such a parameter in the first
place. So, we should have some way to invalidate the slot even when the
walsender/backend process doesn't use it again.

-- 
With Regards,
Amit Kapila.




Re: pg16: XX000: could not find pathkey item to sort

2024-03-19 Thread David Rowley
On Mon, 18 Mar 2024 at 18:50, Ashutosh Bapat
 wrote:
> If the problem you speculate about is different from this one, I am not able to
> see it. It might help to give an example query or explain more.

I looked at this again and I might have been wrong about there being a
problem.  I set a breakpoint in create_gather_merge_path() and
adjusted the startup and total cost to 1 when I saw the pathkeys
containing {a,b}.  It turns out this is the non-partitionwise
aggregate path, and of course, the targetlist there does contain the
"b" column, so it's fine in that case that the pathkeys are {a,b}.   I
had previously thought that this was for the partition-wise aggregate
plan, in which case the targetlist would contain a, sum(b order by b),
of which there's no single value of "b" that we can legally sort by.

Here's the full plan.

postgres=# explain verbose SELECT a, sum(b order by b) FROM t GROUP BY
a ORDER BY a;
QUERY PLAN
---
 GroupAggregate  (cost=1.00..25.60 rows=200 width=12)
   Output: t.a, sum(t.b ORDER BY t.b)
   Group Key: t.a
   ->  Gather Merge  (cost=1.00..1.00 rows=4520 width=8)
 Output: t.a, t.b
 Workers Planned: 2
 ->  Sort  (cost=158.36..163.07 rows=1882 width=8)
   Output: t.a, t.b
   Sort Key: t.a, t.b
   ->  Parallel Append  (cost=0.00..56.00 rows=1882 width=8)
 ->  Parallel Seq Scan on public.tp1 t_1
(cost=0.00..23.29 rows=1329 width=8)
   Output: t_1.a, t_1.b
 ->  Parallel Seq Scan on public.td t_2
(cost=0.00..23.29 rows=1329 width=8)
   Output: t_2.a, t_2.b
(14 rows)

David




Re: Avoiding inadvertent debugging mode for pgbench

2024-03-19 Thread Greg Sabino Mullane
Rebased version attached (v2), with another sentence in the sgml to explain
the optional use of -d

Cheers,
Greg


pgbench.dash.d.or.not.dash.d.v2.patch
Description: Binary data


Re: sslinfo extension - add notbefore and notafter timestamps

2024-03-19 Thread Cary Huang
Thank you for your review again. 

 > ...but I think Timestamp[Tz]s are stored as microseconds, so we're off
 > by a factor of a million. This still works because later we cast to
 > double and pass it back through float8_timestamptz, which converts it:

In my test, if I made ASN1_TIME_to_timestamptz return microseconds, the
float8_timestamptz() function raised a "timestamp out of range" error, as
this function treats the input as seconds.

> But anyone who ends up inspecting the value of
 > st_sslstatus->ssl_not_before directly is going to find an incorrect
 > timestamp. I think it'd be clearer to store microseconds to begin
 > with, and then just use TimestampTzGetDatum rather than the
 > DirectFunctionCall1 we have now. 

I have also tried TimestampTzGetDatum with ASN1_TIME_to_timestamptz
outputting microseconds. No error is raised, but pg_stat_ssl displays
incorrect results. The timestamps are far too large because the extra factor
of a million is applied in the timestamp computation as well.

The comments for TimestampTz say:

 * Timestamps, as well as the h/m/s fields of intervals, are stored as
 * int64 values with units of microseconds.  (Once upon a time they were
 * double values with units of seconds.)

but it seems to me that many of the timestamp-related functions still treat
timestamp or timestamptz as "double values with units of seconds".
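
For reference, a minimal sketch of the microseconds-first approach being
discussed (the helper name and the assumption that the certificate time is
already available as a time_t are mine, not the patch's):

#include <time.h>

#include "postgres.h"
#include "pgtime.h"
#include "utils/timestamp.h"

/*
 * Sketch: convert a certificate validity time, already obtained as a time_t,
 * into a TimestampTz (int64 microseconds since 2000-01-01) and return it as
 * a Datum directly, skipping the seconds-based float8_timestamptz() path.
 */
static Datum
cert_time_to_datum(time_t t)
{
    /* time_t_to_timestamptz() shifts the epoch and scales to microseconds */
    TimestampTz ts = time_t_to_timestamptz((pg_time_t) t);

    return TimestampTzGetDatum(ts);
}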

Best regards

Cary Huang
-
HighGo Software Inc. (Canada)
cary.hu...@highgo.ca
www.highgo.ca





Re: Partial aggregates pushdown

2024-03-19 Thread Bruce Momjian
On Tue, Mar 19, 2024 at 05:29:07PM -0400, Tom Lane wrote:
> I'd like to vociferously protest both of those decisions.
> 
> "No version check by default" means "unsafe by default", which is not
> project style in general and is especially not so for postgres_fdw.
> We have tried very hard for years to ensure that postgres_fdw will
> work with a wide range of remote server versions, and generally been
> extremely conservative about what we think will work (example:
> collations); but this patch seems ready to throw that principle away.
> 
> Also, surely "remoteversion < PG_VERSION_NUM" is backwards.  What
> this would mean is that nobody can ever change a partial aggregate's
> implementation, because that would break queries issued from older
> servers (that couldn't know about the change) to newer ones.

Well it is the origin server that is issuing the PUSHDOWN syntax, so an
older origin server should be able to push to a newer remote server.

> Realistically, I think it's fairly unsafe to try aggregate pushdown
> to anything but the same PG major version; otherwise, you're buying
> into knowing which aggregates have partial support in which versions,
> as well as predicting the future about incompatible state changes.

Yes, incompatible state changes would be a problem in a setup with an
older origin server and a newer remote server.

If we require matching versions, we must accept that upgrades will
require more downtime.

> Even that isn't bulletproof --- e.g, maybe somebody wasn't careful
> about endianness-independence of the serialized partial state, making
> it unsafe to ship --- so there had better be a switch whereby the user
> can disable it.

Makes sense.  I was also wondering how a user would know whether the
pushdown is happening, or not.

> Maybe we could define a three-way setting:
> 
> * default: push down partial aggs only to same major PG version
> * disable: don't push down, period
> * force: push down regardless of remote version

What would be the default?  If it is the first one, it requires a
remote version check on first use in the session.

> With the "force" setting, it's the user's responsibility not to
> issue any remote-able aggregation that would be unsafe to push
> down.  This is still a pretty crude tool: I can foresee people
> wanting to have per-aggregate control over things, especially
> extension-supplied aggregates.  But it'd do for starters.

We have the postgres_fdw "extensions" option to control pushdown of
functions belonging to extensions.

> I'm not super thrilled by the fact that the patch contains zero
> user-facing documentation, even though it's created new SQL syntax,
> not to mention a new postgres_fdw option.  I assume this means that
> nobody thinks it's anywhere near ready to commit.

Previous versions of the patch had docs since I know I worked on
improving them.  I am not sure what happened to them.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




RE: Popcount optimization using AVX512

2024-03-19 Thread Amonson, Paul D
> -Original Message-
> From: Nathan Bossart 
>
> Committed.  Thanks for the suggestion and for reviewing!
> 
> Paul, I suspect your patches will need to be rebased after commit cc4826d.
> Would you mind doing so?

Changed in this patch set.

* Rebased.
* Direct *slow* calls via macros, as shown in the example patch.
* Changed the "choose" filename to be platform-specific, as suggested.
* Falls back to intermediate "Fast" methods if AVX512 is not available at
runtime.
* Used inline where it makes sense; remember that using "extern" negates "inline".
* Fixed comment issues pointed out in review.

I tested building with and without TRY_POPCOUNT_FAST, for both configure and 
meson build systems, and ran in CI.

Thanks,
Paul



v10-0001-Refactor-inlining-and-direct-calls-for-_slow-functio.patch
Description: v10-0001-Refactor-inlining-and-direct-calls-for-_slow-functio.patch


v10-0002-Refactor-Seperated-slow-fast-and-choose-functionalit.patch
Description: v10-0002-Refactor-Seperated-slow-fast-and-choose-functionalit.patch


v10-0003-Feature-Add-POPCNT512-accelerated-functionality-for-.patch
Description: v10-0003-Feature-Add-POPCNT512-accelerated-functionality-for-.patch


Re: Possibility to disable `ALTER SYSTEM`

2024-03-19 Thread Tom Lane
Andrew Dunstan  writes:
> On Tue, Mar 19, 2024 at 2:28 PM Magnus Hagander  wrote:
>> Windows has had full ACL support since 1993. The  easiest way to do
>> what you're doing here is to just set a DENY permission on the
>> postgres operating system user.

> Yeah. See <
> https://learn.microsoft.com/en-us/windows-server/administration/windows-commands/icacls>
> for example.

Cool.  Maybe somebody should take a fresh look at the places where
we're assuming Windows has nothing comparable to Unix permissions
(for example, checking for world readability of ssl_key_file).
It's off-topic for this thread though.

regards, tom lane




Re: [HACKERS] make async slave to wait for lsn to be replayed

2024-03-19 Thread Kartyshov Ivan

Bharath Rupireddy, thank you for your review.
But here are some points.

On 2024-03-16 10:02, Bharath Rupireddy wrote:

4.1 With invalid LSN succeeds, shouldn't it error out? Or at least,
add a fast path/quick exit to WaitForLSN()?
BEGIN AFTER '0/0';


In PostgreSQL, '0/0' is a valid pg_lsn, but it is always already reached.


4.2 With an unreasonably high future LSN, BEGIN command waits
unboundedly, shouldn't we check if the specified LSN is more than
pg_last_wal_receive_lsn() error out?
BEGIN AFTER '0/';
SELECT pg_last_wal_receive_lsn() + 1 AS future_receive_lsn \gset
BEGIN AFTER :'future_receive_lsn';


This case will give an ERROR, because '0/' + 1 is an invalid pg_lsn.


4.3 With an unreasonably high wait time, BEGIN command waits
unboundedly, shouldn't we restrict the wait time to some max value,
say a day or so?
SELECT pg_last_wal_receive_lsn() + 1 AS future_receive_lsn \gset
BEGIN AFTER :'future_receive_lsn' WITHIN 10;


Good idea; I set it to 1 day. But we should discuss this limit.


6.
+/* Set all latches in shared memory to signal that new LSN has been 
replayed */

+void
+WaitLSNSetLatches(XLogRecPtr curLSN)
+{

I see this patch is waking up all the waiters in the recovery path
after applying every WAL record, which IMO is a hot path. Is the
impact of this change on recovery measured, perhaps using
https://github.com/macdice/redo-bench or similar tools?

7. In continuation to comment #6, why not use Conditional Variables
instead of proc latches to sleep and wait for all the waiters in
WaitLSNSetLatches?


Waiters are stored in an array sorted by LSN. This helps us wake only
the PIDs whose LSN has already been replayed, and saves us from scanning
the whole array. So it's not such a hot path.
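
To illustrate the idea (the names and struct layout below are assumptions
for the sketch, not the patch itself): because the array is kept sorted, the
wake-up loop can stop at the first waiter whose LSN has not been replayed yet.

#include "postgres.h"
#include "access/xlogdefs.h"
#include "storage/latch.h"
#include "storage/lwlock.h"

typedef struct WaitLSNWaiter
{
    XLogRecPtr  lsn;            /* LSN the backend is waiting for */
    Latch      *latch;          /* that backend's process latch */
} WaitLSNWaiter;

/* shared array, kept sorted by lsn; protected by a hypothetical lock */
extern WaitLSNWaiter *waiters;
extern int  numWaiters;
extern LWLock *WaitLSNLock;

void
WaitLSNSetLatches(XLogRecPtr curLSN)
{
    LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
    for (int i = 0; i < numWaiters; i++)
    {
        if (waiters[i].lsn > curLSN)
            break;              /* later entries have not been replayed yet */
        SetLatch(waiters[i].latch);
    }
    LWLockRelease(WaitLSNLock);
}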

Added some fixes:

1) Made the waiting timeout simpler (as in pg_terminate_backend()).
2) Removed the 1-minute wait, because INTERRUPTS don't arrive for a
long time; changed it to 0.5 seconds.
3) Added more tests.
4) Added and expanded sections in the documentation.
5) Added a default value for the timeout:
pg_wait_lsn(trg_lsn pg_lsn, delay int8 DEFAULT 0)
For example, pg_wait_lsn('0/31B1B60') is equal to pg_wait_lsn('0/31B1B60', 0).
6) A large timeout is now restricted to 1 day (86400000 ms):
CALL pg_wait_lsn('0/34FB5A1',100);
WARNING:  Timeout for pg_wait_lsn() restricted to 1 day

--
Ivan Kartyshov
Postgres Professional: www.postgrespro.com

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 5b225ccf4f..106bca9b73 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -27861,6 +27861,19 @@ postgres=# SELECT '0/0'::pg_lsn + pd.segment_number * ps.setting::int + :offset
 extension.

   
+
+  
+   
+
+ pg_wait_lsn
+
+pg_wait_lsn (trg_lsn pg_lsn, delay int8 DEFAULT 0)
+   
+   
+If timeout <= 0 then timeout is off.
+Returns ERROR if target wait_lsn was not replayed.
+   
+  
  
 

diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 29c5bec084..0b783dc733 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -43,6 +43,7 @@
 #include "backup/basebackup.h"
 #include "catalog/pg_control.h"
 #include "commands/tablespace.h"
+#include "commands/waitlsn.h"
 #include "common/file_utils.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -1828,6 +1829,14 @@ PerformWalRecovery(void)
 break;
 			}
 
+			/*
+			 * If we replayed an LSN that someone was waiting for, set latches
+			 * in shared memory array to notify the waiter.
+			 */
+			if (waitLSN &&
+				(XLogRecoveryCtl->lastReplayedEndRecPtr >= pg_atomic_read_u64(&waitLSN->minLSN)))
+				WaitLSNSetLatches(XLogRecoveryCtl->lastReplayedEndRecPtr);
+
 			/* Else, try to fetch the next WAL record */
 			record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
 		} while (record != NULL);
diff --git a/src/backend/catalog/system_functions.sql b/src/backend/catalog/system_functions.sql
index fe2bb50f46..ff82dffac0 100644
--- a/src/backend/catalog/system_functions.sql
+++ b/src/backend/catalog/system_functions.sql
@@ -414,6 +414,9 @@ CREATE OR REPLACE FUNCTION
   json_populate_recordset(base anyelement, from_json json, use_json_as_text boolean DEFAULT false)
   RETURNS SETOF anyelement LANGUAGE internal STABLE ROWS 100  AS 'json_populate_recordset' PARALLEL SAFE;
 
+CREATE OR REPLACE PROCEDURE pg_wait_lsn(trg_lsn pg_lsn, delay int8 DEFAULT 0)
+  LANGUAGE internal AS 'pg_wait_lsn';
+
 CREATE OR REPLACE FUNCTION pg_logical_slot_get_changes(
 IN slot_name name, IN upto_lsn pg_lsn, IN upto_nchanges int, VARIADIC options text[] DEFAULT '{}',
 OUT lsn pg_lsn, OUT xid xid, OUT data text)
diff --git a/src/backend/commands/Makefile b/src/backend/commands/Makefile
index 48f7348f91..cede90c3b9 100644
--- a/src/backend/commands/Makefile
+++ b/src/backend/commands/Makefile
@@ -61,6 +61,7 @@ OBJS = \
 	vacuum.o \
 	vacuumparallel.o \
 	variable.o \
-	view.o
+	view.o \
+	waitlsn.o
 
 include 

Re: Possibility to disable `ALTER SYSTEM`

2024-03-19 Thread Andrew Dunstan
On Tue, Mar 19, 2024 at 2:28 PM Magnus Hagander  wrote:

> On Tue, Mar 19, 2024 at 3:52 PM Tom Lane  wrote:
> >
> > Heikki Linnakangas  writes:
> > > Perhaps we could make that even better with a GUC though. I propose a
> > > GUC called 'configuration_managed_externally = true / false". If you
> set
> > > it to true, we prevent ALTER SYSTEM and make the error message more
> > > definitive:
> >
> > > postgres=# ALTER SYSTEM SET wal_level TO minimal;
> > > ERROR:  configuration is managed externally
> >
> > > As a bonus, if that GUC is set, we could even check at server startup
> > > that all the configuration files are not writable by the postgres user,
> > > and print a warning or refuse to start up if they are.
> >
> > I like this idea.  The "bonus" is not optional though, because
> > setting the files' ownership/permissions is the only way to be
> > sure that the prohibition is even a little bit bulletproof.
> >
> > One small issue: how do we make that work on Windows?  Have recent
> > versions grown anything that looks like real file permissions?
>
> Windows has had full ACL support since 1993. The  easiest way to do
> what you're doing here is to just set a DENY permission on the
> postgres operating system user.
>
>
>
>


Yeah. See <
https://learn.microsoft.com/en-us/windows-server/administration/windows-commands/icacls>
for example.

cheers

andrew


Re: Improving EXPLAIN's display of SubPlan nodes

2024-03-19 Thread Tom Lane
Dean Rasheed  writes:
> Fair enough. I have no further comments.

Pushed then.  Thanks for reviewing!  I gave you credit as co-author.

regards, tom lane




Re: Improving EXPLAIN's display of SubPlan nodes

2024-03-19 Thread Dean Rasheed
On Tue, 19 Mar 2024 at 21:40, Tom Lane  wrote:
>
> I'm inclined to leave that alone.  The actual source sub-SELECT
> could only have had one column, so no real confusion is possible.
> Yeah, there's a resjunk grouping column visible in the plan as well,
> but those exist in many other queries, and we've not gotten questions
> about them.
>

Fair enough. I have no further comments.

Regards,
Dean




Re: Improving EXPLAIN's display of SubPlan nodes

2024-03-19 Thread Tom Lane
Dean Rasheed  writes:
> One final case that could possibly be improved is this one from 
> aggregates.out:

> explain (verbose, costs off)
> select array(select sum(x+y) s
> from generate_series(1,3) y group by y order by s)
>   from generate_series(1,3) x;
> QUERY PLAN
> ---
>  Function Scan on pg_catalog.generate_series x
>Output: ARRAY(SubPlan 1)
>Function Call: generate_series(1, 3)
>SubPlan 1
>  ->  Sort
>Output: (sum((x.x + y.y))), y.y
>Sort Key: (sum((x.x + y.y)))
>->  HashAggregate
>  Output: sum((x.x + y.y)), y.y
>  Group Key: y.y
>  ->  Function Scan on pg_catalog.generate_series y
>Output: y.y
>Function Call: generate_series(1, 3)

> ARRAY operates on a SELECT with a single targetlist item, but in this
> case it looks like the subplan output has 2 columns, which might
> confuse people.

I'm inclined to leave that alone.  The actual source sub-SELECT
could only have had one column, so no real confusion is possible.
Yeah, there's a resjunk grouping column visible in the plan as well,
but those exist in many other queries, and we've not gotten questions
about them.

(Perhaps some documentation about junk columns needs to be added?
I'm not eager to write it though.)

I had actually had a similar thought about sticking ".col1" onto
EXPR_SUBLINK cases, but I concluded it was mostly pedantry.
Nobody's likely to get confused.

regards, tom lane




Re: documentation structure

2024-03-19 Thread Andrew Dunstan
On Mon, Mar 18, 2024 at 10:12 AM Robert Haas  wrote:

> I was looking at the documentation index this morning[1], and I can't
> help feeling like there are some parts of it that are over-emphasized
> and some parts that are under-emphasized. I'm not sure what we can do
> about this exactly, but I thought it worth writing an email and seeing
> what other people think.
>
> The two sections of the documentation that seem really
> under-emphasized to me are the GUC documentation and the SQL
> reference. The GUC documentation is all buried under "20. Server
> Configuration" and the SQL command reference is under "I. SQL
> commands". For reasons that I don't understand, all chapters except
> for those in "VI. Reference" are numbered, but the chapters in that
> section have Roman numerals instead.
>
> I don't know what other people's experience is, but for me, wanting to
> know what a command does or what a setting does is extremely common.
> Therefore, I think these chapters are disproportionately important and
> should be emphasized more. In the case of the GUC reference, one idea
> I have is to split up "III. Server Administration". My proposal is
> that we divide it into three sections. The first would be called "III.
> Server Installation" and would cover chapters 16 (installation from
> binaries) through 19 (server setup and operation). The second would be
> called "IV. Server Configuration" -- so every section that's currently
> a subsection of "server configuration" would become a top-level
> chapter. The third division would be "V. Server Administration," and
> would cover the current chapters 21-33. This is probably far from
> perfect, but it seems like a relatively simple change and better than
> what we have now.
>
> I don't know what to do about "I. SQL commands". It's obviously
> impractical to promote that to a top-level section, because it's got a
> zillion sub-pages which I don't think we want in the top-level
> documentation index. But having it as one of several unnumbered
> chapters interposed between 51 and 52 doesn't seem great either.
>
> The stuff that I think is over-emphasized is as follows: (a) chapters
> 1-3, the tutorial; (b) chapters 4-6, which are essentially a
> continuation of the tutorial, and not at all similar to chapters 8-11
> which are chock-full of detailed technical information; (c) chapters
> 43-46, one per procedural language; perhaps these could just be
> demoted to sub-sections of chapter 42 on procedural languages; (d)
> chapters 47 (server programming interface), 50 (replication progress
> tracking), and 51 (archive modules), all of which are important to
> document but none of which seem important enough to put them in the
> top-level documentation index; and (e) large parts of section "VII.
> Internals," which again contain tons of stuff of very marginal
> interest. The first ~4 chapters of the internals section seem like
> they might be mainstream enough to justify the level of prominence
> that we give them, but the rest has got to be of interest to a tiny
> minority of readers.
>
> I think it might be possible to consolidate the internals section by
> grouping a bunch of existing entries together by category. Basically,
> after the first few chapters, you've got stuff that is of interest to
> C programmers writing core or extension code; and you've got
> explainers on things like GEQO and index op-classes and support
> functions which might be of interest even to non-programmers. I think
> for example that we don't need separate top-level chapters on writing
> procedural language handlers, FDWs, tablesample methods, custom scan
> providers, table access methods, index access methods, and WAL
> resource managers. Some or all of those could be grouped under a
> single chapter, perhaps, e.g. Using PostgreSQL Extensibility
> Interfaces.
>
> Thoughts? I realize that this topic is HIGHLY prone to ENDLESS
> bikeshedding, and it's inevitable that not everybody is going to
> agree. But I hope we can agree that it's completely silly that it's
> vastly easier to find the documentation about the backup manifest
> format than it is to find the documentation on CREATE TABLE or
> shared_buffers, and if we can agree on that, then perhaps we can agree
> on some way to make things better.
>
>
>
+many for improving the index.

My own pet docs peeve is a purely editorial one: func.sgml is a 30k line
beast, and I think there's a good case for splitting out at least the
larger chunks of it.

cheers

andrew


Re: Partial aggregates pushdown

2024-03-19 Thread Tom Lane
Bruce Momjian  writes:
> The current patch has:

>   if ((OidIsValid(aggform->aggfinalfn) ||
>   (aggform->aggtranstype == INTERNALOID)) &&
>   fpinfo->check_partial_aggregate_support)
>   {
>   if (fpinfo->remoteversion == 0)
>   {
>   PGconn *conn = GetConnection(fpinfo->user, false, 
> NULL);

>   fpinfo->remoteversion = PQserverVersion(conn);
>   }

>   if (fpinfo->remoteversion < PG_VERSION_NUM)
>   partial_agg_ok = false;
>   }

> It uses check_partial_aggregate_support, which defaults to false,
> meaning partial aggregates will be pushed down with no version check by
> default.  If set to true, pushdown will happen if the remote server is
> the same version or newer, which seems acceptable to me.

I'd like to vociferously protest both of those decisions.

"No version check by default" means "unsafe by default", which is not
project style in general and is especially not so for postgres_fdw.
We have tried very hard for years to ensure that postgres_fdw will
work with a wide range of remote server versions, and generally been
extremely conservative about what we think will work (example:
collations); but this patch seems ready to throw that principle away.

Also, surely "remoteversion < PG_VERSION_NUM" is backwards.  What
this would mean is that nobody can ever change a partial aggregate's
implementation, because that would break queries issued from older
servers (that couldn't know about the change) to newer ones.

Realistically, I think it's fairly unsafe to try aggregate pushdown
to anything but the same PG major version; otherwise, you're buying
into knowing which aggregates have partial support in which versions,
as well as predicting the future about incompatible state changes.
Even that isn't bulletproof --- e.g, maybe somebody wasn't careful
about endianness-independence of the serialized partial state, making
it unsafe to ship --- so there had better be a switch whereby the user
can disable it.

Maybe we could define a three-way setting:

* default: push down partial aggs only to same major PG version
* disable: don't push down, period
* force: push down regardless of remote version

With the "force" setting, it's the user's responsibility not to
issue any remote-able aggregation that would be unsafe to push
down.  This is still a pretty crude tool: I can foresee people
wanting to have per-aggregate control over things, especially
extension-supplied aggregates.  But it'd do for starters.
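
Just to illustrate the "default" branch (a sketch reusing fpinfo->remoteversion
and partial_agg_ok from the quoted patch; the option plumbing is omitted):

    /* Push down partial aggregates only when the remote major version matches ours. */
    if (fpinfo->remoteversion / 10000 != PG_VERSION_NUM / 10000)
        partial_agg_ok = false;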

I'm not super thrilled by the fact that the patch contains zero
user-facing documentation, even though it's created new SQL syntax,
not to mention a new postgres_fdw option.  I assume this means that
nobody thinks it's anywhere near ready to commit.

regards, tom lane




Re: Improving EXPLAIN's display of SubPlan nodes

2024-03-19 Thread Dean Rasheed
On Tue, 19 Mar 2024 at 16:42, Tom Lane  wrote:
>
> Here's a hopefully-final version that makes that adjustment and
> tweaks a couple of comments.
>

This looks very good to me.

One final case that could possibly be improved is this one from aggregates.out:

explain (verbose, costs off)
select array(select sum(x+y) s
from generate_series(1,3) y group by y order by s)
  from generate_series(1,3) x;
QUERY PLAN
---
 Function Scan on pg_catalog.generate_series x
   Output: ARRAY(SubPlan 1)
   Function Call: generate_series(1, 3)
   SubPlan 1
 ->  Sort
   Output: (sum((x.x + y.y))), y.y
   Sort Key: (sum((x.x + y.y)))
   ->  HashAggregate
 Output: sum((x.x + y.y)), y.y
 Group Key: y.y
 ->  Function Scan on pg_catalog.generate_series y
   Output: y.y
   Function Call: generate_series(1, 3)

ARRAY operates on a SELECT with a single targetlist item, but in this
case it looks like the subplan output has 2 columns, which might
confuse people.

I wonder if we should output "ARRAY((SubPlan 1).col1)" to make it
clearer. Since ARRAY_SUBLINK is a special case, which always collects
the first column's values, we could just always output "col1" for
ARRAY.

Regards,
Dean




Re: Possibility to disable `ALTER SYSTEM`

2024-03-19 Thread walther

Greg Sabino Mullane:
On Tue, Mar 19, 2024 at 12:05 PM Tom Lane wrote:


If you aren't willing to build a solution that blocks off mods
using COPY TO FILE/PROGRAM and other readily-available-to-superusers
tools (plpythonu for instance), I think you shouldn't bother asking
for a feature at all.  Just trust your superusers.


There is a huge gap between using a well-documented standard tool like 
ALTER SYSTEM and going out of your way to modify the configuration files 
through trickery. I think we need to only solve the former as in "hey, 
please don't do that because your changes will be overwritten"


Recap: The requested feature is not supposed to be a security feature. 
It is supposed to prevent the admin from accidentally doing the wrong 
thing - but not from willfully doing the same through different means.


This very much sounds like a "warning" - how about turning the feature 
into one?


Have a GUC warn_on_alter_system = "", which allows the 
kubernetes operator to set it to something like "hey, please don't do 
that because your changes will be overwritten. Use xyz operator instead.".


This will hardly be taken as a security feature by anyone, but should 
essentially achieve what is asked for.


A more sophisticated way would be to make that GUC throw an error, but 
have a syntax for ALTER SYSTEM to override this - i.e. similar to a 
--force flag.


Best,

Wolfgang




Re: BitmapHeapScan streaming read user and prelim refactoring

2024-03-19 Thread Tomas Vondra
On 3/18/24 16:55, Tomas Vondra wrote:
>
> ...
> 
> OK, I've restarted the tests for only 0012 and 0014 patches, and I'll
> wait for these to complete - I don't want to be looking for patterns
> until we have enough data to smooth this out.
> 
>

I now have results for 1M and 10M runs on the two builds (0012 and
0014), attached is a chart for relative performance plotting

  (0014 timing) / (0012 timing)

for "optimal' runs that would pick bitmapscan on their own. There's
nothing special about the config - I reduced the random_page_cost to
1.5-2.0 to reflect both machines have flash storage, etc.

Overall, the chart is pretty consistent with what I shared on Sunday.
Most of the results are fine (0014 is close to 0012 or faster), but
there's a bunch of cases that are much slower. Interestingly enough,
almost all of them are on the i5 machine, almost none of the xeon. My
guess is this is about the SSD type (SATA vs. NVMe).

Attached is a table of the ~50 worst regressions (by the metric above), and
it's interesting that the worst regressions are with eic=0 and eic=1.

I decided to look at the first case (eic=0), and the timings are quite
stable - there are three runs for each build, with timings close to the
average (see below the table).

Attached is a script that reproduces this on both machines, but the
difference is much more significant on i5 (~5x) compared to xeon (~2x).

I haven't investigated what exactly is happening and why, hopefully the
script will allow you to reproduce this independently. I plan to take a
look, but I don't know when I'll have time for this.

FWIW if the script does not reproduce this on your machines, I might be
able to give you access to the i5 machine. Let me know.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

create table test_table (a bigint, b bigint, c text) with (fillfactor = 25);
insert into test_table select a, b, c from (select a, b, c, 
generate_series(1,24) from (select a, b, c from (select (416 * random())::int 
a, i b, md5(random()::text) c from generate_series(1, 100/24) s(i)) foo) 
bar) baz;
create index on test_table (a);
vacuum analyze;
checkpoint;


-- 0012
set effective_io_concurrency = 0;
set work_mem = '4MB';
explain analyze select * from test_table where (a >= 53) AND (a <= 264);

QUERY PLAN  
   
---
 Bitmap Heap Scan on test_table  (cost=6030.91..55240.07 rows=502877 width=49) 
(actual time=21.821..406.664 rows=508440 loops=1)
   Recheck Cond: ((a >= 53) AND (a <= 264))
   Heap Blocks: exact=21185
   ->  Bitmap Index Scan on test_table_a_idx  (cost=0.00..5905.20 rows=502877 
width=0) (actual time=18.599..18.599 rows=508440 loops=1)
 Index Cond: ((a >= 53) AND (a <= 264))
 Planning Time: 5.235 ms
 Execution Time: 421.185 ms
(7 rows)


-- 0014
set effective_io_concurrency = 0;
set work_mem = '4MB';
explain analyze select * from test_table where (a >= 53) AND (a <= 264);

QUERY PLAN  
   
---
 Bitmap Heap Scan on test_table  (cost=6030.91..55240.07 rows=502877 width=49) 
(actual time=21.894..2697.074 rows=508440 loops=1)
   Recheck Cond: ((a >= 53) AND (a <= 264))
   Heap Blocks: exact=21185
   ->  Bitmap Index Scan on test_table_a_idx  (cost=0.00..5905.20 rows=502877 
width=0) (actual time=18.714..18.715 rows=508440 loops=1)
 Index Cond: ((a >= 53) AND (a <= 264))
 Planning Time: 4.943 ms
 Execution Time: 2730.409 ms
(7 rows)
with data as (
  select machine, build, rows, dataset, workers, wm, eic, matches, caching, 
count(*), round(avg(timing),2) as timing
from results where optimal = 'bitmapscan'
   group by 1, 2, 3, 4, 5, 6, 7, 8, 9
)
select d1.*, d2.timing as timing_0014, round(d2.timing / d1.timing,2) AS change
from data d1
join data d2 on ((d1.machine, d1.rows, d1.dataset, d1.workers, d1.wm, d1.eic, 
d1.matches, d1.caching) = (d2.machine, d2.rows, d2.dataset, d2.workers, d2.wm, 
d2.eic, d2.matches, d2.caching))
where d1.build = 'patched-0012'
  and d2.build = 'patched-0014'
order by d2.timing / d1.timing desc;

 machine |build |   rows   |dataset| workers |  wm   | eic | 
matches |  caching  | count |  timing  | timing_0014 | change 
-+--+--+---+-+---+-+-+---+---+--+-+
 i5  | patched-0012 |  100 | uniform_pages |   0 |  4096 |   0 |
 211 | uncached  | 3 |   449.78 | 2650.20 |   5.89
 i5  | patched-0012 |  100 | 

Re: Partial aggregates pushdown

2024-03-19 Thread Bruce Momjian
On Sat, Mar 16, 2024 at 02:28:50AM +, 
fujii.y...@df.mitsubishielectric.co.jp wrote:
> Hi. Mr.Pyhalov.
>
> > From: Alexander Pyhalov  Sent: Wednesday,
> > February 28, 2024 10:43 PM
> > > 1. Transmitting state value safely between machines
> > >> From: Robert Haas  Sent: Wednesday,
> > >> December 6, 2023 10:25 PM the problems around transmitting
> > >> serialized bytea blobs between machines that can't be assumed to
> > >> fully trust each other will need to be addressed in some way,
> > >> which seems like it will require a good deal of design work,
> > >> forming some kind of consensus, and then implementation work to
> > >> follow.
> > > I have considered methods for safely transmitting state values
> > > between different machines.  I have taken into account the version
> > > policy of PostgreSQL (5 years of support) and the major version
> > > release cycle over the past 10 years (1 year), and as a result, I
> > > have made the assumption that transmission is allowed only when
> > > the difference between the local version and the remote version
> > > is 5 or less.  I believe that by adding new components, "export
> > > function" and "import function", to the aggregate functions, and
> > > further introducing a new SQL keyword to the query syntax of
> > > aggregate expressions, we can address this issue.
> >
> > I honestly think that achieving cross-version compatibility in
> > this way puts a significant burden on developers. Can we instead
> > always use the more or less universal export and import function
> > to fix possible issues with binary representations on different
> > architectures and just refuse to push down partial aggregates on
> > server version mismatch? At least at the first step?
>
> Thank you for your comment. I agree with your point that the proposed
> method would impose a significant burden on developers. In order
> to ensure cross-version compatibility, it is necessary to impose
> constraints on the format of the state values exchanged between
> servers, which would indeed burden developers. As you mentioned, I
> think that it is realistic to allow partial aggregation pushdown only
> when coordinating between the same versions in the first step.

The current patch has:

  if ((OidIsValid(aggform->aggfinalfn) ||
  (aggform->aggtranstype == INTERNALOID)) &&
  fpinfo->check_partial_aggregate_support)
  {
  if (fpinfo->remoteversion == 0)
  {
  PGconn *conn = GetConnection(fpinfo->user, false, 
NULL);

  fpinfo->remoteversion = PQserverVersion(conn);
  }

  if (fpinfo->remoteversion < PG_VERSION_NUM)
  partial_agg_ok = false;
  }

It uses check_partial_aggregate_support, which defaults to false,
meaning partial aggregates will be pushed down with no version check by
default.  If set to true, pushdown will happen if the remote server is
the same version or newer, which seems acceptable to me.

FYI, the patch is much smaller now.  :-)

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Re: Avoiding inadvertent debugging mode for pgbench

2024-03-19 Thread Nathan Bossart
On Tue, Mar 19, 2024 at 01:47:53PM -0500, Nathan Bossart wrote:
> On Mon, Mar 11, 2024 at 04:59:36PM +0100, Alvaro Herrera wrote:
>> All in all, I support the original patch.
> 
> I'll commit this in a few days if there are no objections.

Actually, I just took a look at the patch and it appears to need a rebase
as well as additional documentation updates for the new -d/--dbname option.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Vectored I/O in bulk_write.c

2024-03-19 Thread Thomas Munro
On Sun, Mar 17, 2024 at 8:10 AM Andres Freund  wrote:
> I don't think zeroextend on the one hand and a normal write or extend on the
> other hand are really the same operation. In the former case the content is
> hard-coded; in the latter it's caller-provided. Sure, we can deal with that
> by special-casing NULL content - but at that point, what's the benefit of
> combining the two operations?  I'm not dead-set against this, just not really
> convinced that it's a good idea to combine the operations.

I liked some things about that, but I'm happy to drop that part.
Here's a version that leaves smgrzeroextend() alone, and I also gave
up on moving errors and assertions up into the smgr layer for now to
minimise the change.  So to summarise, this gives us smgrwritev(...,
flags), where flags can include _SKIP_FSYNC and _EXTEND.  This way we
don't have to invent smgrextendv().  The old single block smgrextend()
still exists as a static inline wrapper.
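
Roughly, the wrapper shape would be something like this (a sketch only; the
flag names and exact signature here are assumptions, not the patch itself):

#include "storage/smgr.h"

static inline void
smgrextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
           const void *buffer, bool skipFsync)
{
    /* single-block extend expressed as a one-element vectored write */
    smgrwritev(reln, forknum, blocknum, &buffer, 1,
               SMGR_WRITE_EXTEND | (skipFsync ? SMGR_WRITE_SKIP_FSYNC : 0));
}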


v6-0001-Use-smgrwritev-for-both-overwriting-and-extending.patch
Description: Binary data


v6-0002-Use-vectored-I-O-for-bulk-writes.patch
Description: Binary data


v6-0003-Improve-bulk_write.c-memory-management.patch
Description: Binary data


Re: Popcount optimization using AVX512

2024-03-19 Thread Nathan Bossart
On Tue, Mar 19, 2024 at 12:30:50PM +1300, David Rowley wrote:
> Looks good.

Committed.  Thanks for the suggestion and for reviewing!

Paul, I suspect your patches will need to be rebased after commit cc4826d.
Would you mind doing so?

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Why is parula failing?

2024-03-19 Thread Tom Lane
For the last few days, buildfarm member parula has been intermittently
failing the partition_prune regression test, due to unexpected plan
changes [1][2][3][4].  The symptoms can be reproduced exactly by
inserting a "vacuum" of one or another of the partitions of table
"ab", so we can presume that the underlying cause is an autovacuum run
against one of those tables.  But why is that happening?  None of
those tables receive any insertions during the test, so I don't
understand why autovacuum would trigger on them.

I suppose we could attach "autovacuum=off" settings to these tables,
but it doesn't seem to me that that should be necessary.  These test
cases are several years old and haven't given trouble before.
Moreover, if that's necessary then there are a lot of other regression
tests that would presumably need the same treatment.

I'm also baffled why this wasn't happening before.  I scraped the
buildfarm logs for 3 months back and confirmed my impression that
this is a new failure mode.  But one of these four runs is on
REL_14_STABLE, eliminating the theory that the cause is a recent
HEAD-only change.

Any ideas?

regards, tom lane

[1] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=parula=2024-03-19%2016%3A09%3A02
[2] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=parula=2024-03-18%2011%3A13%3A02
[3] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=parula=2024-03-14%2011%3A40%3A02
[4] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=parula=2024-03-14%2019%3A00%3A02




Re: Proposal to include --exclude-extension Flag in pg_dump

2024-03-19 Thread Ayush Vatsa
> I ran it through pgindent to fix some whitespace issues and added
> another test for the filter option, based on the test case you added.

Thank you for addressing those whitespace issues and adding more tests. I
appreciate your attention to detail and will certainly be more vigilant in
the future.


> I'm marking this ready-for-commit (which I'll probably do myself in a
> day or two, unless anyone else claims it first).

Thank you very much, this marks my first contribution to the open-source
community, and I'm enthusiastic about making further meaningful
contributions to PostgreSQL in the future.


> LGTM too from a read through.  I did notice a few mistakes in the --filter
> documentation portion for other keywords but that's unrelated to this
patch,
> will fix them once this is in to avoid conflicts.

Thanks Daniel for your review. It's gratifying to see that my patch not
only introduced the intended feature but also brought other minor mistakes
to light.

Regards
Ayush Vatsa
Amazon Web services (AWS)


On Tue, 19 Mar 2024 at 17:23, Daniel Gustafsson  wrote:

> > On 19 Mar 2024, at 12:19, Dean Rasheed  wrote:
>
> > I'm marking this ready-for-commit (which I'll probably do myself in a
> > day or two, unless anyone else claims it first).
>
> LGTM too from a read through.  I did notice a few mistakes in the --filter
> documentation portion for other keywords but that's unrelated to this
> patch,
> will fix them once this is in to avoid conflicts.
>
> --
> Daniel Gustafsson
>
>


Re: Oversight in reparameterize_path_by_child leading to executor crash

2024-03-19 Thread Tom Lane
Richard Guo  writes:
> Here is the patch for HEAD.  I simply re-posted v10.  Nothing has
> changed.

I got back to this finally, and pushed it with some minor cosmetic
adjustments.

regards, tom lane




Re: cleanup patches for incremental backup

2024-03-19 Thread Nathan Bossart
On Thu, Mar 14, 2024 at 08:52:55PM -0500, Nathan Bossart wrote:
> Subject: [PATCH v1 1/2] Revert "Temporary patch to help debug pg_walsummary
>  test failures."

> Subject: [PATCH v1 2/2] Fix possible overflow in MaybeRemoveOldWalSummaries().

Assuming there are no objections or feedback, I plan to commit these two
patches within the next couple of days.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Avoiding inadvertent debugging mode for pgbench

2024-03-19 Thread Nathan Bossart
On Mon, Mar 11, 2024 at 04:59:36PM +0100, Alvaro Herrera wrote:
> All in all, I support the original patch.

I'll commit this in a few days if there are no objections.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Possibility to disable `ALTER SYSTEM`

2024-03-19 Thread Daniel Gustafsson
> On 19 Mar 2024, at 15:51, Tom Lane  wrote:
> 
> Heikki Linnakangas  writes:
>> Perhaps we could make that even better with a GUC though. I propose a 
>> GUC called 'configuration_managed_externally = true / false". If you set 
>> it to true, we prevent ALTER SYSTEM and make the error message more 
>> definitive:
> 
>> postgres=# ALTER SYSTEM SET wal_level TO minimal;
>> ERROR:  configuration is managed externally
> 
>> As a bonus, if that GUC is set, we could even check at server startup 
>> that all the configuration files are not writable by the postgres user, 
>> and print a warning or refuse to start up if they are.
> 
> I like this idea.  The "bonus" is not optional though, because
> setting the files' ownership/permissions is the only way to be
> sure that the prohibition is even a little bit bulletproof.

Agreed, assuming we can solve the below..

> One small issue: how do we make that work on Windows?  Have recent
> versions grown anything that looks like real file permissions?

--
Daniel Gustafsson





Re: Possibility to disable `ALTER SYSTEM`

2024-03-19 Thread Magnus Hagander
On Tue, Mar 19, 2024 at 3:52 PM Tom Lane  wrote:
>
> Heikki Linnakangas  writes:
> > Perhaps we could make that even better with a GUC though. I propose a
> > GUC called 'configuration_managed_externally = true / false". If you set
> > it to true, we prevent ALTER SYSTEM and make the error message more
> > definitive:
>
> > postgres=# ALTER SYSTEM SET wal_level TO minimal;
> > ERROR:  configuration is managed externally
>
> > As a bonus, if that GUC is set, we could even check at server startup
> > that all the configuration files are not writable by the postgres user,
> > and print a warning or refuse to start up if they are.
>
> I like this idea.  The "bonus" is not optional though, because
> setting the files' ownership/permissions is the only way to be
> sure that the prohibition is even a little bit bulletproof.
>
> One small issue: how do we make that work on Windows?  Have recent
> versions grown anything that looks like real file permissions?

Windows has had full ACL support since 1993. The easiest way to do
what you're doing here is to just set a DENY permission on the
postgres operating system user.


//Magnus




Re: Possibility to disable `ALTER SYSTEM`

2024-03-19 Thread Daniel Gustafsson
> On 19 Mar 2024, at 17:53, Jelte Fennema-Nio  wrote:
> 
> On Tue, 19 Mar 2024 at 17:05, Tom Lane  wrote:
>> I've said this repeatedly: it's not enough.  The only reason we need
>> any feature whatsoever is that somebody doesn't trust their database
>> superusers to not try to modify the configuration.
> 
> And as everyone else on this thread has said: It is enough. Because
> the point is not security, the point is hinting to a superuser that a
> workflow they know from other systems (or an ALTER SYSTEM command they
> copied from the internet) is not the intended way to modify their
> server configuration on the system they are currently working on.

Well.  Protection against superusers randomly copying ALTER SYSTEM commands
from the internet actually does turn this into a security feature =)

--
Daniel Gustafsson





Re: [HACKERS] make async slave to wait for lsn to be replayed

2024-03-19 Thread Kartyshov Ivan

Intro
==
The main purpose of the feature is to achieve
read-your-writes-consistency, while using async replica for reads and
primary for writes. In that case lsn of last modification is stored
inside application. We cannot store this lsn inside database, since
reads are distributed across all replicas and primary.


Procedure style implementation
==
https://www.postgresql.org/message-id/27171.1586439221%40sss.pgh.pa.us
https://www.postgresql.org/message-id/20210121.173009.235021120161403875.horikyota.ntt%40gmail.com

CALL pg_wait_lsn(‘LSN’, timeout);

Examples
==

standby postgresql.conf:
 recovery_min_apply_delay = 10s

primary:
 CREATE TABLE tbl AS SELECT generate_series(1,10) AS a;
 INSERT INTO tbl VALUES (generate_series(11, 20));
 SELECT pg_current_wal_lsn();

standby:
 CALL pg_wait_lsn('0/3002AE8', 1);
 BEGIN;
 SELECT * FROM tbl; // read fresh insertions
 COMMIT;

Fixed and ready to review.

--
Ivan Kartyshov
Postgres Professional: www.postgrespro.comdiff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 5b225ccf4f..cb1c175c62 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -27861,6 +27861,19 @@ postgres=# SELECT '0/0'::pg_lsn + pd.segment_number * ps.setting::int + :offset
 extension.

   
+
+  
+   
+
+ pg_wait_lsn
+
+pg_wait_lsn (wait_lsn pg_lsn, timeout bigint)
+   
+   
+If timeout <= 0 then timeout is off.
+Returns ERROR if target wait_lsn was not replayed.
+   
+  
  
 

diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 29c5bec084..0b783dc733 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -43,6 +43,7 @@
 #include "backup/basebackup.h"
 #include "catalog/pg_control.h"
 #include "commands/tablespace.h"
+#include "commands/waitlsn.h"
 #include "common/file_utils.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -1828,6 +1829,14 @@ PerformWalRecovery(void)
 break;
 			}
 
+			/*
+			 * If we replayed an LSN that someone was waiting for, set latches
+			 * in shared memory array to notify the waiter.
+			 */
+			if (waitLSN &&
+				(XLogRecoveryCtl->lastReplayedEndRecPtr >= pg_atomic_read_u64(&waitLSN->minLSN)))
+				WaitLSNSetLatches(XLogRecoveryCtl->lastReplayedEndRecPtr);
+
 			/* Else, try to fetch the next WAL record */
 			record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
 		} while (record != NULL);
diff --git a/src/backend/commands/Makefile b/src/backend/commands/Makefile
index 48f7348f91..cede90c3b9 100644
--- a/src/backend/commands/Makefile
+++ b/src/backend/commands/Makefile
@@ -61,6 +61,7 @@ OBJS = \
 	vacuum.o \
 	vacuumparallel.o \
 	variable.o \
-	view.o
+	view.o \
+	waitlsn.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/commands/meson.build b/src/backend/commands/meson.build
index 6dd00a4abd..7549be5dc3 100644
--- a/src/backend/commands/meson.build
+++ b/src/backend/commands/meson.build
@@ -50,4 +50,5 @@ backend_sources += files(
   'vacuumparallel.c',
   'variable.c',
   'view.c',
+  'waitlsn.c',
 )
diff --git a/src/backend/commands/waitlsn.c b/src/backend/commands/waitlsn.c
new file mode 100644
index 00..f9b701c946
--- /dev/null
+++ b/src/backend/commands/waitlsn.c
@@ -0,0 +1,309 @@
+/*-
+ *
+ * waitlsn.c
+ *	  Implements waiting for the given LSN, which is used in
+ *	  CALL pg_wait_lsn(wait_lsn pg_lsn, timeout int).
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/commands/waitlsn.c
+ *
+ *-
+ */
+
+#include "postgres.h"
+
+#include 
+#include 
+
+#include "pgstat.h"
+#include "fmgr.h"
+#include "access/transam.h"
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "access/xlogdefs.h"
+#include "access/xlogrecovery.h"
+#include "catalog/pg_type.h"
+#include "commands/waitlsn.h"
+#include "executor/spi.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "storage/ipc.h"
+#include "storage/latch.h"
+#include "storage/pmsignal.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+#include "storage/sinvaladt.h"
+#include "utils/builtins.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+#include "utils/timestamp.h"
+#include "utils/fmgrprotos.h"
+
+/* Add to / delete from shared memory array */
+static void addLSNWaiter(XLogRecPtr lsn);
+static void deleteLSNWaiter(void);
+
+struct WaitLSNState *waitLSN = NULL;
+static volatile sig_atomic_t haveShmemItem = false;
+
+/*
+ * Report the amount of shared memory space needed for WaitLSNState
+ */
+Size
+WaitLSNShmemSize(void)
+{
+	Size		size;
+
+	size = offsetof(WaitLSNState, procInfos);
+	size = 

Re: Possibility to disable `ALTER SYSTEM`

2024-03-19 Thread Greg Sabino Mullane
On Tue, Mar 19, 2024 at 12:05 PM Tom Lane  wrote:

> If you aren't willing to build a solution that blocks off mods
> using COPY TO FILE/PROGRAM and other readily-available-to-superusers
> tools (plpythonu for instance), I think you shouldn't bother asking
> for a feature at all.  Just trust your superusers.
>

There is a huge gap between using a well-documented standard tool like
ALTER SYSTEM and going out of your way to modify the configuration files
through trickery. I think we need to only solve the former as in "hey,
please don't do that because your changes will be overwritten"

Cheers,
Greg


Re: Possibility to disable `ALTER SYSTEM`

2024-03-19 Thread Jelte Fennema-Nio
On Tue, 19 Mar 2024 at 17:05, Tom Lane  wrote:
> I've said this repeatedly: it's not enough.  The only reason we need
> any feature whatsoever is that somebody doesn't trust their database
> superusers to not try to modify the configuration.

And as everyone else on this thread has said: It is enough. Because
the point is not security, the point is hinting to a superuser that a
workflow they know from other systems (or an ALTER SYSTEM command they
copied from the internet) is not the intended way to modify their
server configuration on the system they are currently working on.

I feel like the docs and error message in the current active patch are
very clear on that. If you think they are not clear, feel free to
suggest what could clarify the intent of this feature. But at this
point, it's really starting to seem to me like you're willingly trying
to interpret this feature as a thing that it is not (i.e. a security
feature).




Re: Reducing output size of nodeToString

2024-03-19 Thread Matthias van de Meent
On Tue, 19 Mar 2024 at 17:13, Peter Eisentraut  wrote:
>
> On 11.03.24 21:52, Matthias van de Meent wrote:
> >> * v7-0003-gen_node_support.pl-Mark-location-fields-as-type-.patch
> >>
> >> This looks sensible, but maybe making Location a global type is a bit
> >> much?  Maybe something more specific like ParseLocation, or ParseLoc, to
> >> keep it under 12 characters.
> > I've gone with ParseLoc in the attached v8 patchset.
>
> I have committed this one.

Thanks!

> I moved the typedef to nodes/nodes.h, where we already had similar
> typdefs (Cardinality, etc.).  The fields stmt_location and stmt_len in
> PlannedStmt were not converted, so I fixed that.  Also, between you
> writing your patch and now, at least one new node type was added, so I
> fixed that one up, too.

Good points, thank you for fixing that.

> (I diffed the generated node support functions
> to check.)  Hopefully, future hackers will apply the new type when
> appropriate.

Are you also planning on committing some of the other patches later,
or should I rebase the set to keep CFBot happy?

-Matthias




Re: Improving EXPLAIN's display of SubPlan nodes

2024-03-19 Thread Tom Lane
I wrote:
> I won't update the patch right now, but "(rescan SubPlan N)"
> seems like a winner to me.

Here's a hopefully-final version that makes that adjustment and
tweaks a couple of comments.

regards, tom lane

diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 58a603ac56..acbbf3b56c 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -3062,10 +3062,10 @@ select exists(select 1 from pg_enum), sum(c1) from ft1;
 QUERY PLAN
 --
  Foreign Scan
-   Output: $0, (sum(ft1.c1))
+   Output: (InitPlan 1).col1, (sum(ft1.c1))
Relations: Aggregate on (public.ft1)
Remote SQL: SELECT sum("C 1") FROM "S 1"."T 1"
-   InitPlan 1 (returns $0)
+   InitPlan 1
  ->  Seq Scan on pg_catalog.pg_enum
 (6 rows)
 
@@ -3080,8 +3080,8 @@ select exists(select 1 from pg_enum), sum(c1) from ft1 group by 1;
 QUERY PLAN 
 ---
  GroupAggregate
-   Output: $0, sum(ft1.c1)
-   InitPlan 1 (returns $0)
+   Output: (InitPlan 1).col1, sum(ft1.c1)
+   InitPlan 1
  ->  Seq Scan on pg_catalog.pg_enum
->  Foreign Scan on public.ft1
  Output: ft1.c1
@@ -3305,10 +3305,10 @@ select sum(c1) filter (where (c1 / c1) * random() <= 1) from ft1 group by c2 ord
 
 explain (verbose, costs off)
 select sum(c2) filter (where c2 in (select c2 from ft1 where c2 < 5)) from ft1;
-QUERY PLAN 

+  QUERY PLAN   
+---
  Aggregate
-   Output: sum(ft1.c2) FILTER (WHERE (hashed SubPlan 1))
+   Output: sum(ft1.c2) FILTER (WHERE (ANY (ft1.c2 = (hashed SubPlan 1).col1)))
->  Foreign Scan on public.ft1
  Output: ft1.c2
  Remote SQL: SELECT c2 FROM "S 1"."T 1"
@@ -6171,9 +6171,9 @@ UPDATE ft2 AS target SET (c2, c7) = (
  Update on public.ft2 target
Remote SQL: UPDATE "S 1"."T 1" SET c2 = $2, c7 = $3 WHERE ctid = $1
->  Foreign Scan on public.ft2 target
- Output: $1, $2, (SubPlan 1 (returns $1,$2)), target.ctid, target.*
+ Output: (SubPlan 1).col1, (SubPlan 1).col2, (rescan SubPlan 1), target.ctid, target.*
  Remote SQL: SELECT "C 1", c2, c3, c4, c5, c6, c7, c8, ctid FROM "S 1"."T 1" WHERE (("C 1" > 1100)) FOR UPDATE
- SubPlan 1 (returns $1,$2)
+ SubPlan 1
->  Foreign Scan on public.ft2 src
  Output: (src.c2 * 10), src.c7
  Remote SQL: SELECT c2, c7 FROM "S 1"."T 1" WHERE (($1::integer = "C 1"))
@@ -11685,9 +11685,9 @@ SELECT * FROM local_tbl t1 LEFT JOIN (SELECT *, (SELECT count(*) FROM async_pt W
QUERY PLAN   
 
  Nested Loop Left Join
-   Output: t1.a, t1.b, t1.c, async_pt.a, async_pt.b, async_pt.c, ($0)
+   Output: t1.a, t1.b, t1.c, async_pt.a, async_pt.b, async_pt.c, ((InitPlan 1).col1)
Join Filter: (t1.a = async_pt.a)
-   InitPlan 1 (returns $0)
+   InitPlan 1
  ->  Aggregate
Output: count(*)
->  Append
@@ -11699,10 +11699,10 @@ SELECT * FROM local_tbl t1 LEFT JOIN (SELECT *, (SELECT count(*) FROM async_pt W
  Output: t1.a, t1.b, t1.c
->  Append
  ->  Async Foreign Scan on public.async_p1 async_pt_1
-   Output: async_pt_1.a, async_pt_1.b, async_pt_1.c, $0
+   Output: async_pt_1.a, async_pt_1.b, async_pt_1.c, (InitPlan 1).col1
Remote SQL: SELECT a, b, c FROM public.base_tbl1 WHERE ((a < 3000))
  ->  Async Foreign Scan on public.async_p2 async_pt_2
-   Output: async_pt_2.a, async_pt_2.b, async_pt_2.c, $0
+   Output: async_pt_2.a, async_pt_2.b, async_pt_2.c, (InitPlan 1).col1
Remote SQL: SELECT a, b, c FROM public.base_tbl2 WHERE ((a < 3000))
 (20 rows)
 
@@ -11713,7 +11713,7 @@ SELECT * FROM local_tbl t1 LEFT JOIN (SELECT *, (SELECT count(*) FROM async_pt W
  Nested Loop Left Join (actual rows=1 loops=1)
Join Filter: (t1.a = async_pt.a)
Rows Removed by Join Filter: 399
-   InitPlan 1 (returns $0)
+   InitPlan 1
  ->  Aggregate (actual rows=1 loops=1)
->  Append (actual rows=400 loops=1)
  ->  Async Foreign Scan on async_p1 async_pt_4 (actual rows=200 loops=1)
@@ -11936,11 +11936,11 @@ CREATE FOREIGN TABLE foreign_tbl2 () INHERITS (foreign_tbl)
   SERVER loopback OPTIONS (table_name 'base_tbl');
 EXPLAIN (VERBOSE, COSTS OFF)
 SELECT a FROM base_tbl WHERE (a, random() > 0) IN (SELECT a, random() > 0 FROM foreign_tbl);

Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-19 Thread Masahiko Sawada
On Tue, Mar 19, 2024 at 6:40 PM John Naylor  wrote:
>
> On Tue, Mar 19, 2024 at 10:24 AM Masahiko Sawada  
> wrote:
> >
> > On Tue, Mar 19, 2024 at 8:35 AM John Naylor  wrote:
> > >
> > > On Mon, Mar 18, 2024 at 11:12 AM Masahiko Sawada  
> > > wrote:
> > > >
> > > > On Sun, Mar 17, 2024 at 11:46 AM John Naylor  
> > > > wrote:
>
> > > It might also be worth reducing the number of blocks in the random
> > > test -- multiple runs will have different offsets anyway.
> >
> > Yes. If we reduce the number of blocks from 1000 to 100, the
> > regression test took on my environment:
> >
> > 1000 blocks : 516 ms
> > 100 blocks   : 228 ms
>
> Sounds good.
>
> > Removed some unnecessary variables in 0002 patch.
>
> Looks good.
>
> > So the MaxBlocktableEntrySize calculation would be as follows?
> >
> > #define MaxBlocktableEntrySize \
> > offsetof(BlocktableEntry, words) + \
> > (sizeof(bitmapword) * \
> > WORDS_PER_PAGE(Min(MaxOffsetNumber, \
> >BITS_PER_BITMAPWORD * PG_INT8_MAX - 1
> >
> > I've made this change in the 0003 patch.
>
> This is okay, but one side effect is that we have both an assert and
> an elog, for different limits. I think we'll need a separate #define
> to help. But for now, I don't want to hold up tidstore further with
> this because I believe almost everything else in v74 is in pretty good
> shape. I'll save this for later as a part of the optimization I
> proposed.
>
> Remaining things I noticed:
>
> +#define RT_PREFIX local_rt
> +#define RT_PREFIX shared_rt
>
> Prefixes for simplehash, for example, don't have "sh" -- maybe 
> "local/shared_ts"
>
> + /* MemoryContext where the radix tree uses */
>
> s/where/that/
>
> +/*
> + * Lock support functions.
> + *
> + * We can use the radix tree's lock for shared TidStore as the data we
> + * need to protect is only the shared radix tree.
> + */
> +void
> +TidStoreLockExclusive(TidStore *ts)
>
> Talking about multiple things, so maybe a blank line after the comment.
>
> With those, I think you can go ahead and squash all the tidstore
> patches except for 0003 and commit it.
>
> > While reviewing the vacuum patch, I realized that we always pass
> > LWTRANCHE_SHARED_TIDSTORE to RT_CREATE(), and the wait event related
> > to the tidstore is therefore always the same. I think it would be
> > better to make the caller of TidStoreCreate() specify the tranch_id
> > and pass it to RT_CREATE(). That way, the caller can specify their own
> > wait event for tidstore. The 0008 patch tried this idea. dshash.c does
> > the same idea.
>
> Sounds reasonable. I'll just note that src/include/storage/lwlock.h
> still has an entry for LWTRANCHE_SHARED_TIDSTORE.

Thank you. I've incorporated all the comments above. I've attached the
latest patches, and am going to push them (one by one) after
self-review again.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


v75-0001-Add-TIDStore-to-store-sets-of-TIDs-ItemPointerDa.patch
Description: Binary data


v75-0002-Use-TidStore-for-dead-tuple-TIDs-storage-during-.patch
Description: Binary data


Re: add AVX2 support to simd.h

2024-03-19 Thread Nathan Bossart
On Tue, Mar 19, 2024 at 04:53:04PM +0700, John Naylor wrote:
> On Tue, Mar 19, 2024 at 10:16 AM Nathan Bossart
>  wrote:
>> 0002 does the opposite of this.  That is, after we've completed as many
>> blocks as possible, we move the iterator variable back to "end -
>> block_size" and do one final iteration to cover all the remaining elements.
> 
> Sounds similar in principle, but it looks really complicated. I don't
> think the additional loops and branches are a good way to go, either
> for readability or for branch prediction. My sketch has one branch for
> which loop to do, and then performs only one loop. Let's do the
> simplest thing that could work. (I think we might need a helper
> function to do the block, but the rest should be easy)

I tried to trim some of the branches, and came up with the attached patch.
I don't think this is exactly what you were suggesting, but I think it's
relatively close.  My testing showed decent benefits from using 2 vectors
when there aren't enough elements for 4, so I've tried to keep that part
intact.  This changes pg_lfind32() to something like:

if not many elements
process one by one

while enough elements for 4 registers remain
process with 4 registers

if no elements remain
return false

if more than 2-registers-worth of elements remain
do one iteration with 2 registers

do another iteration on last 2-registers-worth of elements

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
diff --git a/src/include/port/pg_lfind.h b/src/include/port/pg_lfind.h
index b8dfa66eef..d154b61555 100644
--- a/src/include/port/pg_lfind.h
+++ b/src/include/port/pg_lfind.h
@@ -80,6 +80,34 @@ pg_lfind8_le(uint8 key, uint8 *base, uint32 nelem)
 	return false;
 }
 
+static inline bool
+lfind32_2reg_helper(const Vector32 keys, uint32 *base)
+{
+	const uint32 nelem_per_vector = sizeof(Vector32) / sizeof(uint32);
+	Vector32	vals1,
+vals2,
+result1,
+result2,
+result;
+
+	/* load the next block into 2 registers */
+	vector32_load(&vals1, base);
+	vector32_load(&vals2, &base[nelem_per_vector]);
+
+	/* compare each value to the key */
+	result1 = vector32_eq(keys, vals1);
+	result2 = vector32_eq(keys, vals2);
+
+	/* combine the results into a single variable */
+	result = vector32_or(result1, result2);
+
+	/* see if there was a match */
+	if (vector32_is_highbit_set(result))
+		return true;
+
+	return false;
+}
+
 /*
  * pg_lfind32
  *
@@ -100,7 +128,8 @@ pg_lfind32(uint32 key, uint32 *base, uint32 nelem)
 	 */
 	const Vector32 keys = vector32_broadcast(key);	/* load copies of key */
 	const uint32 nelem_per_vector = sizeof(Vector32) / sizeof(uint32);
-	const uint32 nelem_per_iteration = 4 * nelem_per_vector;
+	uint32 nelem_per_iteration = 4 * nelem_per_vector;
+	uint32 remaining;
 
 	/* round down to multiple of elements per iteration */
 	const uint32 tail_idx = nelem & ~(nelem_per_iteration - 1);
@@ -119,6 +148,9 @@ pg_lfind32(uint32 key, uint32 *base, uint32 nelem)
 	}
 #endif
 
+	if (nelem < nelem_per_vector * 2)
+		goto slow_path;
+
 	for (i = 0; i < tail_idx; i += nelem_per_iteration)
 	{
 		Vector32	vals1,
@@ -157,8 +189,21 @@ pg_lfind32(uint32 key, uint32 *base, uint32 nelem)
 			return true;
 		}
 	}
+
+	nelem_per_iteration = 2 * nelem_per_vector;
+	remaining = nelem - i;
+
+	if (remaining == 0)
+		return false;
+
+	if (remaining > nelem_per_iteration &&
+		lfind32_2reg_helper(keys, &base[i]))
+		return true;
+
+	return lfind32_2reg_helper(keys, &base[nelem - nelem_per_iteration]);
 #endif			/* ! USE_NO_SIMD */
 
+slow_path:
 	/* Process the remaining elements one at a time. */
 	for (; i < nelem; i++)
 	{


Re: Read data from Postgres table pages

2024-03-19 Thread Sushrut Shivaswamy
>
>
> lol, thanks for the inputs Alexander :)!


Re: speed up a logical replica setup

2024-03-19 Thread Peter Eisentraut

On 19.03.24 16:24, vignesh C wrote:

The problem with this failure is that the standby has already been
promoted and we will have to re-create the physical replica again.

If you are not planning to have the checks for name length, this could
alternatively be fixed by including database id also while querying
pg_subscription like below in set_replication_progress function:
appendPQExpBuffer(str,
"SELECT oid FROM pg_catalog.pg_subscription \n"
"WHERE subname = '%s' AND subdbid = (SELECT oid FROM
pg_catalog.pg_database WHERE datname = '%s')",
dbinfo->subname,
dbinfo->dbname);


Yes, this is more correct anyway, because subscription names are 
per-database, not global.  So you should be able to make 
pg_createsubscriber use the same subscription name for each database.







Re: Reducing output size of nodeToString

2024-03-19 Thread Peter Eisentraut

On 11.03.24 21:52, Matthias van de Meent wrote:

* v7-0003-gen_node_support.pl-Mark-location-fields-as-type-.patch

This looks sensible, but maybe making Location a global type is a bit
much?  Maybe something more specific like ParseLocation, or ParseLoc, to
keep it under 12 characters.

I've gone with ParseLoc in the attached v8 patchset.


I have committed this one.

I moved the typedef to nodes/nodes.h, where we already had similar 
typdefs (Cardinality, etc.).  The fields stmt_location and stmt_len in 
PlannedStmt were not converted, so I fixed that.  Also, between you 
writing your patch and now, at least one new node type was added, so I 
fixed that one up, too.  (I diffed the generated node support functions 
to check.)  Hopefully, future hackers will apply the new type when 
appropriate.






Re: Possibility to disable `ALTER SYSTEM`

2024-03-19 Thread Tom Lane
Jelte Fennema-Nio  writes:
> On Tue, 19 Mar 2024 at 15:52, Tom Lane  wrote:
>> I like this idea.  The "bonus" is not optional though, because
>> setting the files' ownership/permissions is the only way to be
>> sure that the prohibition is even a little bit bulletproof.

> I don't agree with this. The only "normal" way of modifying
> postgresql.auto.conf from within postgres is using ALTER SYSTEM, so
> simply disabling ALTER SYSTEM seems enough to me.

I've said this repeatedly: it's not enough.  The only reason we need
any feature whatsoever is that somebody doesn't trust their database
superusers to not try to modify the configuration.  Given that
requirement, merely disabling ALTER SYSTEM isn't a solution, it's a
fig leaf that might fool incompetent auditors but no more.

If you aren't willing to build a solution that blocks off mods
using COPY TO FILE/PROGRAM and other readily-available-to-superusers
tools (plpythonu for instance), I think you shouldn't bother asking
for a feature at all.  Just trust your superusers.
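
For concreteness, one such readily-available path (purely illustrative, the
data directory path is an example):

COPY (SELECT 'wal_level = minimal')
  TO PROGRAM 'cat >> /var/lib/postgresql/data/postgresql.auto.conf';

Blocking only the ALTER SYSTEM command leaves that door wide open.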

regards, tom lane




Re: DRAFT: Pass sk_attno to consistent function

2024-03-19 Thread Michał Kłeczek
Hi All,

Since it looks like there is not much interest in the patch I will try to 
provide some background to explain why I think it is needed.

We are in the process of migration from an old db platform to PostgreSQL.
Our database is around 10TB big and contains around 10 billion financial 
transactions in a single table.
Each transaction is assigned to an account (column acc_number).

We have partitioned the table BY HASH (acc_number).

A client can query transactions belonging to his accounts using several 
criteria - among them is text search.
Queries are of type TOP N (ie ORDER BY … LIMIT ).

The list of accounts that we are querying is provided as a parameter to the 
query.

We have decided to use a single Gist index supporting all queries (reasons 
described in [1]).

There are several problems with Gist usage (but I still think we have no other 
choice) but the most important is
that we cannot use SAOP in our queries - since Gist does not support it the 
planner decides to perform Bitmap Scan
which in turn does not support ORDER BY … LIMIT well because it requires a Sort.

So when we use “= ANY (array of account numbers) … ORDER BY ...” the plan 
requires reading all records meeting
the search criteria and then sorting them.

As a workaround we have to perform LATERAL joins:

unnest(list of account numbers) AS a(n) LATERAL JOIN (SELECT * FROM … WHERE 
account_number = a.n ORDER BY … LIMIT …) ORDER BY … LIMIT …
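
Spelled out (the table name and the ORDER BY column below are only
illustrative), the workaround looks roughly like this:

SELECT t.*
FROM unnest(ARRAY['ACC-1', 'ACC-2', 'ACC-3']) AS a(n)
CROSS JOIN LATERAL (
    SELECT *
    FROM transactions
    WHERE acc_number = a.n
    ORDER BY booking_date DESC
    LIMIT 100
) AS t
ORDER BY t.booking_date DESC
LIMIT 100;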

It is still bad because it requires multiple scans of the same partition if 
account number hashes are the same.

What we really need is for Gist to support “= ANY (…)”.
As the GiST index is extensible in terms of the queries it supports, it is quite 
easy to implement an operator class/family with an operator:

||= (text, text[])

that has semantics the same as “= ANY (…)”

With this operator we can write our queries like:

account_number ||= [list of account numbers] AND
account_number = ANY ([list of account numbers]) — redundant for partition 
pruning as it does not understand ||=
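
For concreteness (names again illustrative), such a query might look like:

SELECT *
FROM transactions
WHERE acc_number ||= ARRAY['ACC-1', 'ACC-2', 'ACC-3']       -- drives the Gist index scan
  AND acc_number = ANY (ARRAY['ACC-1', 'ACC-2', 'ACC-3'])   -- redundant, only for partition pruning
ORDER BY booking_date DESC
LIMIT 100;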

and have optimal plans:

Limit
— Merge Append
—— Index scan of relevant partitions

The problem is that now each partition scan is for the same list of accounts.
The “consistent” function cannot assume anything about the contents of the table, so 
it has to check all elements of the list
and that in turn causes reading unnecessary index pages for accounts not in 
this partition.

What we need is a way for the “consistent” function to be able to pre-process 
the input query array and remove elements
that are not relevant for this scan. To do that it is necessary to have enough 
information to read necessary metadata from the catalog.

The proposed patch addresses this need and seems (to me) largely 
uncontroversial as it does not break any existing extensions.

Attached is a patch with consolidated changes (I am pretty new to managing 
patches so the previous two were partial and not something shareable, I am afraid).

—
Michal


0001-Pass-key-sk_attno-to-consistent-function-in-gistinde.patch
Description: Binary data


Re: Possibility to disable `ALTER SYSTEM`

2024-03-19 Thread Jelte Fennema-Nio
On Tue, 19 Mar 2024 at 15:52, Tom Lane  wrote:
> I like this idea.  The "bonus" is not optional though, because
> setting the files' ownership/permissions is the only way to be
> sure that the prohibition is even a little bit bulletproof.

I don't agree with this. The only "normal" way of modifying
postgresql.auto.conf from within postgres is using ALTER SYSTEM, so
simply disabling ALTER SYSTEM seems enough to me.

> Another question is whether this should be one-size-fits-all for
> all the configuration files.  I can imagine situations where
> you'd like to lock down postgresql[.auto].conf but not pg_hba.conf.
> But maybe that can wait for somebody to show up with a use-case.

Afaik there's no way to modify pg_hba.conf from within postgres, only
read it. (except for COPY TO FILE/PROGRAM etc) So, I don't think we
need to worry about this now.

On Tue, 19 Mar 2024 at 15:52, Tom Lane  wrote:
>
> Heikki Linnakangas  writes:
> > Perhaps we could make that even better with a GUC though. I propose a
> > GUC called 'configuration_managed_externally = true / false". If you set
> > it to true, we prevent ALTER SYSTEM and make the error message more
> > definitive:
>
> > postgres=# ALTER SYSTEM SET wal_level TO minimal;
> > ERROR:  configuration is managed externally
>
> > As a bonus, if that GUC is set, we could even check at server startup
> > that all the configuration files are not writable by the postgres user,
> > and print a warning or refuse to start up if they are.
>
> I like this idea.  The "bonus" is not optional though, because
> setting the files' ownership/permissions is the only way to be
> sure that the prohibition is even a little bit bulletproof.
>
> One small issue: how do we make that work on Windows?  Have recent
> versions grown anything that looks like real file permissions?
>
> Another question is whether this should be one-size-fits-all for
> all the configuration files.  I can imagine situations where
> you'd like to lock down postgresql[.auto].conf but not pg_hba.conf.
> But maybe that can wait for somebody to show up with a use-case.
>
> regards, tom lane




Re: speed up a logical replica setup

2024-03-19 Thread vignesh C
On Mon, 18 Mar 2024 at 16:36, Peter Eisentraut  wrote:
>
> On 18.03.24 08:18, vignesh C wrote:
> > 1) Maximum size of the object name is 64, we can have a check so that
> > we don't specify more than the maximum allowed length:
> > + case 3:
> > + if (!simple_string_list_member(_names, optarg))
> > + {
> > + simple_string_list_append(_names, optarg);
> > + num_replslots++;
> > + }
> > + else
> > + {
> > + pg_log_error("duplicate replication slot \"%s\"", optarg);
> > + exit(1);
> > + }
> > + break;
> >
> > If we allow something like this:
> >   ./pg_createsubscriber -U postgres -D data_N2/ -P "port=5431
> > user=postgres"  -p 5432 -s /home/vignesh/postgres/inst/bin/ -d db1 -d
> > db2 -d db3 
> > --replication-slot="testtesttesttesttesttesttesttesttesttesttesttesttesttesttesttes1"
> > --replication-slot="testtesttesttesttesttesttesttesttesttesttesttesttesttesttesttes2"
> > --replication-slot="testtesttesttesttesttesttesttesttesttesttesttesttesttesttesttes3"
> > In this case creation of replication slot will fail:
> > pg_createsubscriber: error: could not create replication slot
> > "testtesttesttesttesttesttesttesttesttesttesttesttesttesttesttes" on
> > database "db2": ERROR:  replication slot
> > "testtesttesttesttesttesttesttesttesttesttesttesttesttesttesttes"
> > already exists
>
> I think this is fine.  The server can check whether the names it is
> given are of the right size.  We don't need to check it again in the client.
>
> > 2) Similarly here too:
> > + case 4:
> > + if (!simple_string_list_member(_names, optarg))
> > + {
> > + simple_string_list_append(_names, optarg);
> > + num_subs++;
> > + }
> > + else
> > + {
> > + pg_log_error("duplicate subscription \"%s\"", optarg);
> > + exit(1);
> > + }
> > + break;
> >
> > If we allow something like this:
> > ./pg_createsubscriber -U postgres -D data_N2/ -P "port=5431
> > user=postgres"  -p 5432 -s /home/vignesh/postgres/inst/bin/ -d db1 -d
> > db2 -d db3 
> > --subscription=testtesttesttesttesttesttesttesttesttesttesttesttesttesttesttes1
> > --subscription=testtesttesttesttesttesttesttesttesttesttesttesttesttesttesttes2
> > --subscription=testtesttesttesttesttesttesttesttesttesttesttesttesttesttesttes3
> >
> > Subscriptions will be created with the same name and later there will
> > be a problem when setting replication progress as there will be
> > multiple subscriptions with the same name.
>
> Could you clarify this problem?

In this case the specified subscription name is longer than the allowed
length, so it will be truncated and the subscriptions for db1 and db2
will end up with the same name, like below:
db2=# select subname, subdbid from pg_subscription;
 subname | subdbid
-+-
 testtesttesttesttesttesttesttesttesttesttesttesttesttesttesttes |   16384
 testtesttesttesttesttesttesttesttesttesttesttesttesttesttesttes |   16385

Now we try to set the replication origin of the subscriptions to a
consistent lsn from the following:
+set_replication_progress(PGconn *conn, struct LogicalRepInfo *dbinfo,
const char *lsn)
+{
+   PQExpBuffer str = createPQExpBuffer();
+   PGresult   *res;
+   Oid suboid;
+   char		originname[NAMEDATALEN];
+   char		lsnstr[17 + 1]; /* MAXPG_LSNLEN = 17 */
+
+   Assert(conn != NULL);
+
+   appendPQExpBuffer(str,
+ "SELECT oid FROM
pg_catalog.pg_subscription "
+ "WHERE subname = '%s'",
+ dbinfo->subname);
+
+   res = PQexec(conn, str->data);
+   if (PQresultStatus(res) != PGRES_TUPLES_OK)
+   {
+   pg_log_error("could not obtain subscription OID: %s",
+PQresultErrorMessage(res));
+   disconnect_database(conn, true);
+   }
+
+   if (PQntuples(res) != 1 && !dry_run)
+   {
+   pg_log_error("could not obtain subscription OID: got
%d rows, expected %d rows",
+PQntuples(res), 1);
+   disconnect_database(conn, true);
+   }

Since the subscription name is truncated, we will have multiple
records returned for the above query which results in failure with:
pg_createsubscriber: error: could not obtain subscription OID: got 2
rows, expected 1 rows
pg_createsubscriber: warning: pg_createsubscriber failed after the end
of recovery
pg_createsubscriber: hint: The target server cannot be used as a
physical replica anymore.
pg_createsubscriber: hint: You must recreate the physical replica
before continuing.

The problem with this failure is that the standby has already been
promoted and we will have to re-create the physical replica again.

If you are not planning to have the checks for name length, this could
alternatively be fixed by including database id also while querying

Re: pg_upgrade --copy-file-range

2024-03-19 Thread Tomas Vondra
Hi,

I took a quick look at the remaining part adding copy_file_range to
pg_combinebackup. The patch no longer applies, so I had to rebase it.
Most of the issues were trivial, but I had to fix a couple missing
prototypes - I added them to copy_file.h/c, mostly.

0001 is the minimal rebase + those fixes

0002 has a couple review comments in copy_file, and it also undoes a lot
of unnecessary formatting changes (already pointed out by Peter a couple
days ago).

A couple review comments:

1) AFAIK opt_errinfo() returns a pointer to the local "buf" variable.

2) I wonder if we even need opt_errinfo(). I'm not sure it actually
makes anything simpler.

3) I think it'd be nice to make CopyFileMethod more consistent with
transferMode in pg_upgrade.h (I mean, it seems wise to make the naming
more consistent, it's probably not worth unifying this somehow).

4) I wonder how we came up with copying the files by 50 blocks, but I
now realize it's been like this since before this patch. I only noticed
because the patch adds a comment before buffer_size calculation.

5) I dislike the renaming of copy_file_blocks to pg_copyfile. The new
name is way more generic / less descriptive - it's clear it copies the
file block by block (well, in chunks). pg_copyfile is pretty vague.

6) This leaves behind copy_file_copyfile, which is now unused.

7) The patch reworks how combinebackup deals with alternative copy
implementations - instead of setting strategy_implementation and calling
that, the decisions now happen in pg_copyfile_offload with a lot of
conditions / ifdef / defined ... I find it pretty hard to understand and
reason about. I liked the strategy_implementation approach, as it forces
us to keep each method in a separate function.

Perhaps there's a reason why that doesn't work for copy_file_range? But
in that case this needs much clearer comments.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL CompanyFrom 39f42eee4c6f50d106672afe108294ee59082500 Mon Sep 17 00:00:00 2001
From: Tomas Vondra 
Date: Tue, 19 Mar 2024 15:34:18 +0100
Subject: [PATCH v20240319 2/2] review and cleanup

---
 src/bin/pg_combinebackup/copy_file.c|   3 +
 src/bin/pg_combinebackup/copy_file.h|   1 +
 src/bin/pg_combinebackup/pg_combinebackup.c | 197 +++-
 src/bin/pg_combinebackup/reconstruct.c  | 105 ++-
 src/bin/pg_combinebackup/reconstruct.h  |  19 +-
 5 files changed, 190 insertions(+), 135 deletions(-)

diff --git a/src/bin/pg_combinebackup/copy_file.c b/src/bin/pg_combinebackup/copy_file.c
index 16e26b4f573..f45670dd47c 100644
--- a/src/bin/pg_combinebackup/copy_file.c
+++ b/src/bin/pg_combinebackup/copy_file.c
@@ -77,6 +77,8 @@ opt_errinfo(const char *addon_errmsg)
 		return "";
 
 	strcpy(buf, " ");
+
+	/* XXX isn't this broken? this returns pointer to local variable */
 	return strncat(buf, addon_errmsg, sizeof(buf) - 2);
 }
 
@@ -93,6 +95,7 @@ pg_copyfile(const char *src, const char *dest, const char *addon_errmsg,
 	int			dest_fd;
 	uint8	   *buffer;
 
+	/* XXX where does the 50 blocks come from? larger/smaller? */
 	/* copy in fairly large chunks for best efficiency */
 	const int	buffer_size = 50 * BLCKSZ;
 
diff --git a/src/bin/pg_combinebackup/copy_file.h b/src/bin/pg_combinebackup/copy_file.h
index 2797a340055..f4d0ac47d0e 100644
--- a/src/bin/pg_combinebackup/copy_file.h
+++ b/src/bin/pg_combinebackup/copy_file.h
@@ -15,6 +15,7 @@
 #include "common/checksum_helper.h"
 #include "common/file_utils.h"
 
+/* XXX do we even want this? how does pg_upgrade to this? */
 typedef enum CopyFileMethod
 {
 	PG_COPYFILE_FALLBACK = 0x1,
diff --git a/src/bin/pg_combinebackup/pg_combinebackup.c b/src/bin/pg_combinebackup/pg_combinebackup.c
index 1455360d81c..8fa7827c563 100644
--- a/src/bin/pg_combinebackup/pg_combinebackup.c
+++ b/src/bin/pg_combinebackup/pg_combinebackup.c
@@ -99,10 +99,15 @@ static void cleanup_directories_atexit(void);
 static void create_output_directory(char *dirname, cb_options *opt);
 static void help(const char *progname);
 static bool parse_oid(char *s, Oid *result);
-static void process_directory_recursively(
-		  Oid tsoid, char *input_directory, char *output_directory,
-		  char *relative_path, int n_prior_backups, char **prior_backup_dirs,
-		  manifest_data **manifests, manifest_writer *mwriter, cb_options *opt);
+static void process_directory_recursively(Oid tsoid,
+		  char *input_directory,
+		  char *output_directory,
+		  char *relative_path,
+		  int n_prior_backups,
+		  char **prior_backup_dirs,
+		  manifest_data **manifests,
+		  manifest_writer *mwriter,
+		  cb_options *opt);
 static int	read_pg_version_file(char *directory);
 static void remember_to_cleanup_directory(char *target_path, bool rmtopdir);
 static void reset_directory_cleanup_list(void);
@@ -156,8 +161,8 @@ main(int argc, char *argv[])
 	opt.copy_method = 0;
 
 	/* process 

Re: POC: Lock updated tuples in tuple_update() and tuple_delete()

2024-03-19 Thread Alexander Korotkov
Hi, Pavel!

On Tue, Nov 28, 2023 at 11:00 AM Pavel Borisov  wrote:
> > You're designing new APIs, days before the feature freeze.
> On Wed, 5 Apr 2023 at 06:54, Michael Paquier  wrote:
> >
> > On Tue, Apr 04, 2023 at 01:25:46AM +0300, Alexander Korotkov wrote:
> > > Pavel, thank you for you review, revisions and rebase.
> > > We'll reconsider this once v17 is branched.
>
> I've looked through patches v16 once more and think they're good
> enough, and previous issues are all addressed. I see that there is
> nothing that blocks it from being committed except the last iteration
> was days before v16 feature freeze.
>
> Recently in another thread [1] Alexander posted a new version of
> patches v16 (as 0001 and 0002) In 0001 only indenation, comments, and
> commit messages changed from v16 in this thread. In 0002 new test
> eval-plan-qual-2 was integrated into the existing eval-plan-qual test.
> For maintaining the most recent versions in this thread I'm attaching
> them under v17. I suppose that we can commit these patches to v17 if
> there are no objections or additional reviews.
>
> [1] 
> https://www.postgresql.org/message-id/flat/CAPpHfdurb9ycV8udYqM%3Do0sPS66PJ4RCBM1g-bBpvzUfogY0EA%40mail.gmail.com

The new revision of patches is attached.

It has updated commit messages, new comments, and some variables were
renamed to be more consistent with surroundings.

I also think that all the design issues spoken before are resolved.
It would be nice to hear from Andres about this.

I'll continue rechecking these patches myself.

--
Regards,
Alexander Korotkov


v18-0001-Allow-locking-updated-tuples-in-tuple_update-and.patch
Description: Binary data


v18-0002-Add-EvalPlanQual-delete-returning-isolation-test.patch
Description: Binary data


Re: Read data from Postgres table pages

2024-03-19 Thread Alexander Korotkov
On Tue, Mar 19, 2024 at 4:48 PM Sushrut Shivaswamy
 wrote:
>
> If we query the DB directly, is it possible to know which new rows have been 
> added since the last query?
> Is there a change pump that can be latched onto?

Please, check this.
https://www.postgresql.org/docs/current/logicaldecoding.html

> I’m assuming the page data structs are encapsulated in specific headers which 
> can be used to list / read pages.
> Why would Postgres need to be stopped to read the data? The read / query path 
> in Postgres would also be reading these pages when the instance is running?

I think this would be a good point to start studying.
https://www.interdb.jp/
The information there should be more than enough to forget this idea forever :)

--
Regards,
Alexander Korotkov




Re: Streaming I/O, vectored I/O (WIP)

2024-03-19 Thread Heikki Linnakangas

Some quick comments:

On 12/03/2024 15:02, Thomas Munro wrote:

src/backend/storage/aio/streaming_read.c
src/include/storage/streaming_read.h


Standard file header comments missing.

It would be nice to have a comment at the top of streaming_read.c, 
explaining at a high level how the circular buffer, lookahead and all 
that works. Maybe even some diagrams.


For example, what is head and what is tail? Before reading the code, I 
assumed that 'head' was the next block range to return in 
pg_streaming_read_buffer_get_next(). But I think it's actually the other 
way round?



/*
 * Create a new streaming read object that can be used to perform the
 * equivalent of a series of ReadBuffer() calls for one fork of one relation.
 * Internally, it generates larger vectored reads where possible by looking
 * ahead.
 */
PgStreamingRead *
pg_streaming_read_buffer_alloc(int flags,
   void *pgsr_private,
   size_t 
per_buffer_data_size,
   BufferAccessStrategy 
strategy,
   
BufferManagerRelation bmr,
   ForkNumber forknum,
   
PgStreamingReadBufferCB next_block_cb)


I'm not a fan of the name, especially the 'alloc' part. Yeah, most of 
the work it does is memory allocation. But I'd suggest something like 
'pg_streaming_read_begin' instead.


Do we really need the pg_ prefix in these?


Buffer
pg_streaming_read_buffer_get_next(PgStreamingRead *pgsr, void **per_buffer_data)


Maybe 'pg_streaming_read_next_buffer' or just 'pg_streaming_read_next', 
for a shorter name.





/*
 * pgsr->ranges is a circular buffer.  When it is empty, head == tail.
 * When it is full, there is an empty element between head and tail.  
Head
 * can also be empty (nblocks == 0), therefore we need two extra 
elements
 * for non-occupied ranges, on top of max_pinned_buffers to allow for 
the
 * maxmimum possible number of occupied ranges of the smallest possible
 * size of one.
 */
size = max_pinned_buffers + 2;


I didn't understand this explanation for why it's + 2.


/*
 * Skip the initial ramp-up phase if the caller says we're going to be
 * reading the whole relation.  This way we start out doing full-sized
 * reads.
 */
if (flags & PGSR_FLAG_FULL)
pgsr->distance = Min(MAX_BUFFERS_PER_TRANSFER, 
pgsr->max_pinned_buffers);
else
pgsr->distance = 1;


Should this be "Max(MAX_BUFFERS_PER_TRANSFER, 
pgsr->max_pinned_buffers)"? max_pinned_buffers cannot be smaller than 
MAX_BUFFERS_PER_TRANSFER though, given how it's initialized earlier. So 
perhaps just 'pgsr->distance = pgsr->max_pinned_buffers' ?


--
Heikki Linnakangas
Neon (https://neon.tech)




Re: Possibility to disable `ALTER SYSTEM`

2024-03-19 Thread Tom Lane
Heikki Linnakangas  writes:
> Perhaps we could make that even better with a GUC though. I propose a 
> GUC called 'configuration_managed_externally = true / false". If you set 
> it to true, we prevent ALTER SYSTEM and make the error message more 
> definitive:

> postgres=# ALTER SYSTEM SET wal_level TO minimal;
> ERROR:  configuration is managed externally

> As a bonus, if that GUC is set, we could even check at server startup 
> that all the configuration files are not writable by the postgres user, 
> and print a warning or refuse to start up if they are.

I like this idea.  The "bonus" is not optional though, because
setting the files' ownership/permissions is the only way to be
sure that the prohibition is even a little bit bulletproof.

One small issue: how do we make that work on Windows?  Have recent
versions grown anything that looks like real file permissions?

Another question is whether this should be one-size-fits-all for
all the configuration files.  I can imagine situations where
you'd like to lock down postgresql[.auto].conf but not pg_hba.conf.
But maybe that can wait for somebody to show up with a use-case.

regards, tom lane




Re: Read data from Postgres table pages

2024-03-19 Thread Sushrut Shivaswamy
If we query the DB directly, is it possible to know which new rows have been 
added since the last query?
Is there a change pump that can be latched onto?

I’m assuming the page data structs are encapsulated in specific headers which 
can be used to list / read pages.
Why would Postgres need to be stopped to read the data? The read / query path 
in Postgres would also be reading these pages when the instance is running?



Re: Add SPLIT PARTITION/MERGE PARTITIONS commands

2024-03-19 Thread Dmitry Koval

Hi!


The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed


Thanks for the info!
I was unable to reproduce the problem and I wanted to ask for 
clarification. But your message was ahead of my question.


--
With best regards,
Dmitry Koval

Postgres Professional: http://postgrespro.com




Re: Read data from Postgres table pages

2024-03-19 Thread Alexander Korotkov
On Tue, Mar 19, 2024 at 4:35 PM Sushrut Shivaswamy
 wrote:
> The binary I'm trying to create should automatically be able to read data 
> from a postgres instance without users having to
> run commands for backup / pg_dump etc.
> Having access to the appropriate source headers would allow me to read the 
> data.

Please, avoid the top-posting.
https://en.wikipedia.org/wiki/Posting_style#Top-posting

If you're looking to have a separate binary, why can't your binary
just *connect* to the postgres database and query the data?  This is
what pg_dump does, you can just do the same directly.  pg_dump doesn't
access the raw data.
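
For example (the table name is illustrative), a plain client connection can
already stream a table out without touching the data files:

COPY (SELECT * FROM my_table) TO STDOUT WITH (FORMAT csv, HEADER);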

Trying to read raw postgres data from the separate binary looks flat
wrong for your purposes.  First, you would have to replicate pretty
much postgres internals inside. Second, you can read the consistent
data only when postgres is stopped or didn't do any modifications
since the last checkpoint.

--
Regards,
Alexander Korotkov




Re: Possibility to disable `ALTER SYSTEM`

2024-03-19 Thread Joe Conway

On 3/19/24 07:49, Andrew Dunstan wrote:



On Tue, Mar 19, 2024 at 5:26 AM Heikki Linnakangas  wrote:


I want to remind everyone of this from Gabriele's first message that
started this thread:

 > At the moment, a possible workaround is that `ALTER SYSTEM` can
be blocked
 > by making the postgresql.auto.conf read only, but the returned
message is
 > misleading and that’s certainly bad user experience (which is very
 > important in a cloud native environment):
 >
 >
 > ```
 > postgres=# ALTER SYSTEM SET wal_level TO minimal;
 > ERROR:  could not open file "postgresql.auto.conf": Permission denied
 > ```

I think making the config file read-only is a fine solution. If you
don't want postgres to mess with the config files, forbid it with the
permission system.

Problems with pg_rewind, pg_basebackup were mentioned with that
approach. I think if you want the config files to be managed outside
PostgreSQL, by kubernetes, patroni or whatever, it would be good for
them to be read-only to the postgres user anyway, even if we had a
mechanism to disable ALTER SYSTEM. So it would be good to fix the
problems with those tools anyway.

The error message is not great, I agree with that. Can we improve it?
Maybe just add a HINT like this:

postgres=# ALTER SYSTEM SET wal_level TO minimal;
ERROR:  could not open file "postgresql.auto.conf" for writing:
Permission denied
HINT:  Configuration might be managed outside PostgreSQL


Perhaps we could make that even better with a GUC though. I propose a
GUC called 'configuration_managed_externally = true / false". If you
set
it to true, we prevent ALTER SYSTEM and make the error message more
definitive:

postgres=# ALTER SYSTEM SET wal_level TO minimal;
ERROR:  configuration is managed externally

As a bonus, if that GUC is set, we could even check at server startup
that all the configuration files are not writable by the postgres user,
and print a warning or refuse to start up if they are.

(Another way to read this proposal is to rename the GUC that's been
discussed in this thread to 'configuration_managed_externally'. That
makes it look less like a security feature, and describes the intended
use case.)




I agree with pretty much all of this.



+1 me too.

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Read data from Postgres table pages

2024-03-19 Thread Sushrut Shivaswamy
The binary I'm trying to create should automatically be able to read data
from a postgres instance without users having to
run commands for backup / pg_dump etc.
Having access to the appropriate source headers would allow me to read the
data.

On Tue, Mar 19, 2024 at 8:03 PM Sushrut Shivaswamy <
sushrut.shivasw...@gmail.com> wrote:

> I'd like to read individual rows from the pages as they are updated and
> stream them to a server to create a copy of the data.
> The data will be rewritten to columnar format for analytics queries.
>
> On Tue, Mar 19, 2024 at 7:58 PM Alexander Korotkov 
> wrote:
>
>> Hi
>>
>> On Tue, Mar 19, 2024 at 4:23 PM Sushrut Shivaswamy
>>  wrote:
>> > I'm trying to build a postgres export tool that reads data from table
>> pages and exports it to an S3 bucket. I'd like to avoid manual commands
>> like pg_dump, I need access to the raw data.
>> >
>> > Can you please point me to the postgres source header / cc files that
>> encapsulate this functionality?
>> >  - List all pages for a table
>> > - Read a given page for a table
>> >
>> > Any pointers to the relevant source code would be appreciated.
>>
>> Why do you need to work on the source code level?
>> Please, check this about having a binary  copy of the database on the
>> filesystem level.
>> https://www.postgresql.org/docs/current/backup-file.html
>>
>> --
>> Regards,
>> Alexander Korotkov
>>
>


Re: Read data from Postgres table pages

2024-03-19 Thread Sushrut Shivaswamy
I'd like to read individual rows from the pages as they are updated and
stream them to a server to create a copy of the data.
The data will be rewritten to columnar format for analytics queries.

On Tue, Mar 19, 2024 at 7:58 PM Alexander Korotkov 
wrote:

> Hi
>
> On Tue, Mar 19, 2024 at 4:23 PM Sushrut Shivaswamy
>  wrote:
> > I'm trying to build a postgres export tool that reads data from table
> pages and exports it to an S3 bucket. I'd like to avoid manual commands
> like pg_dump, I need access to the raw data.
> >
> > Can you please point me to the postgres source header / cc files that
> encapsulate this functionality?
> >  - List all pages for a table
> > - Read a given page for a table
> >
> > Any pointers to the relevant source code would be appreciated.
>
> Why do you need to work on the source code level?
> Please, check this about having a binary  copy of the database on the
> filesystem level.
> https://www.postgresql.org/docs/current/backup-file.html
>
> --
> Regards,
> Alexander Korotkov
>


Re: Add SPLIT PARTITION/MERGE PARTITIONS commands

2024-03-19 Thread stephane tachoires
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:   not tested
Spec compliant:   not tested
Documentation:not tested

Sorry, tests passed when applying all patches.
I planned to check without optimisation first.

The new status of this patch is: Needs review


Re: Read data from Postgres table pages

2024-03-19 Thread Alexander Korotkov
Hi

On Tue, Mar 19, 2024 at 4:23 PM Sushrut Shivaswamy
 wrote:
> I'm trying to build a postgres export tool that reads data from table pages 
> and exports it to an S3 bucket. I'd like to avoid manual commands like 
> pg_dump, I need access to the raw data.
>
> Can you please point me to the postgres source header / cc files that 
> encapsulate this functionality?
>  - List all pages for a table
> - Read a given page for a table
>
> Any pointers to the relevant source code would be appreciated.

Why do you need to work on the source code level?
Please, check this about having a binary  copy of the database on the
filesystem level.
https://www.postgresql.org/docs/current/backup-file.html

--
Regards,
Alexander Korotkov




Re: Q: Escapes in jsonpath Idents

2024-03-19 Thread David E. Wheeler
On Mar 17, 2024, at 20:09, Erik Wienhold  wrote:
> 
> On 2024-03-17 20:50 +0100, David E. Wheeler wrote:
>> On Mar 17, 2024, at 15:12, Erik Wienhold  wrote:
>>> So I think it makes sense to reword the entire backslash part of the
>>> paragraph and remove references to JSON entirely.  The attached patch
>>> does that and also formats the backslash escapes as a bulleted list for
>>> readability.
>> 
>> Ah, it’s JavaScript format, not JSON! This does clarify things quite
>> nicely, thank you. Happy to add my review once it’s in a commit fest.
> 
> Thanks.  https://commitfest.postgresql.org/48/4899/

Applies cleanly, `make -C doc/src/sgml check` runs without error. Doc 
improvement welcome and much clearer than before.

> I had the same reasoning while writing my first reply but scrapped that
> part because I found it obvious:  That jsonpath is parsed before calling
> jsonb_path_exists() and therefore the parser has no context about any
> variables, which might not even be hardcoded but may result from a
> query.

Right, there’s a chicken/egg problem.

> Unfortunately, I don't have access to that part of the SQL spec.  So I
> don't know how the jsonpath grammar is specified.

Seems quite logical; I think it should be documented, but I’d also be 
interested to know what the 2016 and 2023 standards say, exactly.

> Also checked git log src/backend/utils/adt/jsonpath_scan.l for some
> insights but haven't found any yet.

Everybody’s taking shortcuts relative to the standard, AFAICT. For example, 
jsonpath_scan.l matches unquoted identifiers with these two regular expressions:

{other}+
\/\*
\\.

Plus the backslash escapes. {other} is defined as:

/* "other" means anything that's not special, blank, or '\' or '"' */
other [^\?\%\$\.\[\]\{\}\(\)\|\&\!\=\<\>\@\#\,\*:\-\+\/\\\" \t\n\r\f]

Which is way more liberal than the ECMA standard[1], by my reading, but the 
MSDN[2] description is quite succinct (thanks for the links!):

> In JavaScript, identifiers are commonly made of alphanumeric characters, 
> underscores (_), and dollar signs ($). Identifiers are not allowed to start 
> with numbers. However, JavaScript identifiers are not only limited to ASCII — 
> many Unicode code points are allowed as well. Namely, any character in the 
> ID_Start category can start an identifier, while any character in the 
> ID_Continue category can appear after the first character.


ID_Start[3] and ID_Continue[4] point to the Unicode standard code lists, 
neither of which references Emoji. Sure enough, in Safari:

> x = {"": true}
< {: true}
> x.
< SyntaxError: Invalid character '\ud83c’

But in Postgres jsonpath:

david=# select '$.'::jsonpath;
 jsonpath 
--
 $.""

If the MSDN references to ID_Start and ID_Continue are correct, then the 
Postgres path parser is being overly-liberal. Maybe that’s totally fine? Not 
sure what should be documented and what’s not worth it.

Aside: I’m only digging into these details because I’m busy porting the path 
parser, so trying to figure out where to be compatible and where not to. So far 
I’m rejecting '$' (but allowing '\$' and '\u0024') but taking advantage of the 
unicode support in Go to specifically validate against ID_Start and ID_Continue.
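
For what it's worth, a couple of quick checks of the escape handling discussed
above (just a sketch; output formatting may differ between versions):

select '$."\u0024foo"'::jsonpath;  -- \u escape inside a quoted identifier
select '$.\u0024foo'::jsonpath;    -- also accepted in an unquoted identifier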

Best,

David

[1] https://262.ecma-international.org/#sec-identifier-names
[2] 
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Lexical_grammar#identifiers
[3] 
https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5Cp%7BID_Start%7D
[4] 
https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5Cp%7BID_Continue%7D











Re: Table AM Interface Enhancements

2024-03-19 Thread Japin Li


On Tue, 19 Mar 2024 at 21:05, Alexander Korotkov  wrote:
> Hi, Pavel!
>
> On Tue, Mar 19, 2024 at 11:34 AM Pavel Borisov  wrote:
>> On Tue, 19 Mar 2024 at 03:34, Alexander Korotkov  
>> wrote:
>>>
>>> On Sun, Mar 3, 2024 at 1:50 PM Alexander Korotkov  
>>> wrote:
>>> > On Mon, Nov 27, 2023 at 10:18 PM Mark Dilger
>>> >  wrote:
>>> > >
>>> > > > On Nov 25, 2023, at 9:47 AM, Alexander Korotkov 
>>> > > >  wrote:
>>> > > >
>>> > > >> Should the patch at least document which parts of the EState are 
>>> > > >> expected to be in which states, and which parts should be viewed as 
>>> > > >> undefined?  If the implementors of table AMs rely on any/all aspects 
>>> > > >> of EState, doesn't that prevent future changes to how that structure 
>>> > > >> is used?
>>> > > >
>>> > > > New tuple tuple_insert_with_arbiter() table AM API method needs EState
>>> > > > argument to call executor functions: ExecCheckIndexConstraints(),
>>> > > > ExecUpdateLockMode(), and ExecInsertIndexTuples().  I think we
>>> > > > probably need to invent some opaque way to call this executor function
>>> > > > without revealing EState to table AM.  Do you think this could work?
>>> > >
>>> > > We're clearly not accessing all of the EState, just some specific 
>>> > > fields, such as es_per_tuple_exprcontext.  I think you could at least 
>>> > > refactor to pass the minimum amount of state information through the 
>>> > > table AM API.
>>> >
>>> > Yes, the table AM doesn't need the full EState, just the ability to do
>>> > specific manipulation with tuples.  I'll refactor the patch to make a
>>> > better isolation for this.
>>>
>>> Please find the revised patchset attached.  The changes are following:
>>> 1. Patchset is rebased to the current master.
>>> 2. Patchset is reordered.  I tried to put less debatable patches to the top.
>>> 3. tuple_is_current() method is moved from the Table AM API to the
>>> slot as proposed by Matthias van de Meent.
>>> 4. Assert added to the table_free_rd_amcache() as proposed by Pavel Borisov.
>>
>>
>> Patches 0001-0002 are unchanged compared to the last version in thread [1]. 
>> In my opinion, it's still ready to be committed, which was not done for time 
>> were too close to feature freeze one year ago.
>>
>> 0003 - Assert added from previous version. I still have a strong opinion 
>> that allowing multi-chunked data structures instead of single chunks is 
>> completely safe and makes natural process of Postgres improvement that is 
>> self-justified. The patch is simple enough and ready to be pushed.
>>
>> 0004  (previously 0007) - Have not changed, and there is consensus that this 
>> is reasonable. I've re-checked the current code. Looks safe considering 
>> returning a different slot, which I doubted before. So consider this patch 
>> also ready.
>>
>> 0005 (previously 0004) - Unused argument in the is_current_xact_tuple() 
>> signature is removed. Also comparing to v1 the code shifted from tableam 
>> methods to TTS's level.
>>
>> I'd propose to remove  Assert(!TTS_EMPTY(slot)) for 
>> tts_minimal_is_current_xact_tuple() and  tts_virtual_is_current_xact_tuple() 
>> as these are only error reporting functions that don't use slot actually.
>>
>> Comment similar to:
>> +/*
>> + * VirtualTupleTableSlots never have a storage tuples.  We generally
>> + * shouldn't get here, but provide a user-friendly message if we do.
>> + */
>> also applies to tts_minimal_is_current_xact_tuple()
>>
>> I'd propose changes for clarity of this comment:
>> %s/a storage tuples/storage tuples/g
>> %s/generally//g
>>
>> Otherwise patch 0005 also looks good to me.
>>
>> I'm planning to review the remaining patches. Meanwhile think pushing what 
>> is now ready and uncontroversial is a good intention.
>> Thank you for the work done on this patchset!
>
> Thank you, Pavel!
>
> Regarding 0005, I did apply "a storage tuples" grammar fix.  Regarding
> the rest of the things, I'd like to keep methods
> tts_*_is_current_xact_tuple() to be similar to nearby
> tts_*_getsysattr().  This is why I'm keeping the rest unchanged.  I
> think we could refactor that later, but together with
> tts_*_getsysattr() methods.
>
> I'm going to push 0003, 0004 and 0005 if there are no objections.
>
> And I'll update 0001 and 0002 in their dedicated thread.
>

When I try to test the patch on Ubuntu 22.04 with GCC 11.4.0, there are some
warnings, as follows:

/home/japin/Codes/postgres/build/../src/backend/access/heap/heapam_handler.c: In function ‘heapam_acquire_sample_rows’:
/home/japin/Codes/postgres/build/../src/backend/access/heap/heapam_handler.c:1603:28: warning: implicit declaration of function ‘get_tablespace_maintenance_io_concurrency’ [-Wimplicit-function-declaration]
 1603 |         prefetch_maximum = get_tablespace_maintenance_io_concurrency(onerel->rd_rel->reltablespace);
      |                            ^
/home/japin/Codes/postgres/build/../src/backend/access/heap/heapam_handler.c:1757:30:

Read data from Postgres table pages

2024-03-19 Thread Sushrut Shivaswamy
Hey,

I'm trying to build a postgres export tool that reads data from table pages
and exports it to an S3 bucket. I'd like to avoid manual commands like
pg_dump, I need access to the raw data.

Can you please point me to the postgres source header / cc files that
encapsulate this functionality?
 - List all pages for a table
- Read a given page for a table

Any pointers to the relevant source code would be appreciated.

Thanks,
Sushrut


Re: documentation structure

2024-03-19 Thread Tom Lane
Daniel Gustafsson  writes:
> It's actually not very odd, the reference section is using  
> elements
> and we had missed the arabic numerals setting on those.  The attached fixes
> that for me.  That being said, we've had roman numerals for the reference
> section since forever (all the way down to the 7.2 docs online has it) so 
> maybe
> it was intentional?

I'm quite sure it *was* intentional.  Maybe it was a bad idea, but
it's not that way simply because nobody thought about it.

regards, tom lane




Re: Reducing connection overhead in pg_upgrade compat check phase

2024-03-19 Thread Daniel Gustafsson
> On 19 Mar 2024, at 08:07, Peter Eisentraut  wrote:
> 
> On 18.03.24 13:11, Daniel Gustafsson wrote:
>> Attached is a fresh rebase with only minor cosmetic touch-ups which I would
>> like to go ahead with during this CF.
>> Peter: does this address the comments you had on translation and code
>> duplication?
> 
> Yes, this looks good.

Thanks for review! I took another look at this and pushed it.

--
Daniel Gustafsson





Re: minor tweak to catalogs.sgml pg_class.reltablespace

2024-03-19 Thread Tom Lane
Alvaro Herrera  writes:
> While reviewing the patch for SET ACCESS METHOD[1] I noticed that
> pg_class.relam is not documented fully for partitioned tables, so I
> proposed the attached.

The bit about "(Not meaningful if the relation has no on-disk file.)"
is not correct, and now it's adjacent to text that contradicts it.
Maybe more like

   The tablespace in which this relation is stored.
   If zero, the database's default tablespace is implied.
   Not meaningful if the relation has no on-disk file,
   except for partitioned tables, where this is the tablespace
   in which partitions will be created when one is not
   specified in the creation command.

regards, tom lane




Re: Possibility to disable `ALTER SYSTEM`

2024-03-19 Thread Jelte Fennema-Nio
On Mon, 18 Mar 2024 at 18:27, Robert Haas  wrote:
> I think for now we
> should just file this under "Other platforms and clients," which only
> has one existing setting. If the number of settings of this type
> grows, we can split it out.

Done. I also included a patch to rename COMPAT_OPTIONS_CLIENTS to
COMPAT_OPTIONS_OTHER, since that enum variant naming doesn't match the
new intent of the section.

On Tue, 19 Mar 2024 at 10:26, Heikki Linnakangas  wrote:
> (Another way to read this proposal is to rename the GUC that's been
> discussed in this thread to 'configuration_managed_externally'. That
> makes it look less like a security feature, and describes the intended
> use case.)

I like this idea of naming the GUC in such a way. I swapped the words
a bit and went for externally_managed_configuration, since order
matches other GUCs e.g. standard_conforming_strings. But if you feel
strongly about the ordering of the words, I'm happy to change it back.

For the errorcode I now went for ERRCODE_FEATURE_NOT_SUPPORTED, which
seemed most fitting.

On Tue, 19 Mar 2024 at 10:26, Heikki Linnakangas  wrote:
>
> I want to remind everyone of this from Gabriele's first message that
> started this thread:
>
> > At the moment, a possible workaround is that `ALTER SYSTEM` can be blocked
> > by making the postgresql.auto.conf read only, but the returned message is
> > misleading and that’s certainly bad user experience (which is very
> > important in a cloud native environment):
> >
> >
> > ```
> > postgres=# ALTER SYSTEM SET wal_level TO minimal;
> > ERROR:  could not open file "postgresql.auto.conf": Permission denied
> > ```
>
> I think making the config file read-only is a fine solution. If you
> don't want postgres to mess with the config files, forbid it with the
> permission system.
>
> Problems with pg_rewind, pg_basebackup were mentioned with that
> approach. I think if you want the config files to be managed outside
> PostgreSQL, by kubernetes, patroni or whatever, it would be good for
> them to be read-only to the postgres user anyway, even if we had a
> mechanism to disable ALTER SYSTEM. So it would be good to fix the
> problems with those tools anyway.
>
> The error message is not great, I agree with that. Can we improve it?
> Maybe just add a HINT like this:
>
> postgres=# ALTER SYSTEM SET wal_level TO minimal;
> ERROR:  could not open file "postgresql.auto.conf" for writing:
> Permission denied
> HINT:  Configuration might be managed outside PostgreSQL
>
>
> Perhaps we could make that even better with a GUC though. I propose a
> GUC called 'configuration_managed_externally = true / false". If you set
> it to true, we prevent ALTER SYSTEM and make the error message more
> definitive:
>
> postgres=# ALTER SYSTEM SET wal_level TO minimal;
> ERROR:  configuration is managed externally
>
> As a bonus, if that GUC is set, we could even check at server startup
> that all the configuration files are not writable by the postgres user,
> and print a warning or refuse to start up if they are.
>
> (Another way to read this proposal is to rename the GUC that's been
> discussed in this thread to 'configuration_managed_externally'. That
> makes it look less like a security feature, and describes the intended
> use case.)
>
> --
> Heikki Linnakangas
> Neon (https://neon.tech)
>


v6-0002-Add-externally_managed_configuration-GUC.patch
Description: Binary data


v6-0001-Rename-COMPAT_OPTIONS_CLIENT-to-COMPAT_OPTIONS_OT.patch
Description: Binary data


Re: DOCS: add helpful partitioning links

2024-03-19 Thread Robert Treat
On Tue, Mar 19, 2024 at 3:08 AM Ashutosh Bapat
 wrote:
>
> Hi Robert,
>
>
> On Mon, Mar 18, 2024 at 10:52 PM Robert Treat  wrote:
>>
>> On Thu, Mar 14, 2024 at 12:15 PM Ashutosh Bapat
>>  wrote:
>> >
>> > Hi Robert,
>> >
>> > On Thu, Mar 7, 2024 at 10:49 PM Robert Treat  wrote:
>> >>
>> >> This patch adds a link to the "attach partition" command section
>> >> (similar to the detach partition link above it) as well as a link to
>> >> "create table like" as both commands contain additional information
>> >> that users should review beyond what is laid out in this section.
>> >> There's also a couple of wordsmiths in nearby areas to improve
>> >> readability.
>> >
>> >
>> > Thanks.
>> >
>> > The patch gives error when building html
>> >
>> > ddl.sgml:4300: element link: validity error : No declaration for attribute 
>> > linked of element link
>> >  CREATE TABLE ... 
>> > LIKE> >   ^
>> > ddl.sgml:4300: element link: validity error : Element link does not carry 
>> > attribute linkend
>> > nked="sql-createtable-parms-like">CREATE TABLE ... 
>> > LIKE> >
>> > ^
>> > make[1]: *** [Makefile:72: postgres-full.xml] Error 4
>> > make[1]: *** Deleting file 'postgres-full.xml'
>> > make[1]: Leaving directory 
>> > '/home/ashutosh/work/units/pg_review/coderoot/pg/doc/src/sgml'
>> > make: *** [Makefile:8: html] Error 2
>> >
>> > I have fixed in the attached.
>> >
>>
>> Doh! Thanks!
>>
>> >
>> > - As an alternative, it is sometimes more convenient to create the
>> > - new table outside the partition structure, and attach it as a
>> > + As an alternative, it is sometimes more convenient to create a
>> > + new table outside of the partition structure, and attach it as a
>> >
>> > it uses article "the" for "new table" since it's referring to the 
>> > partition mentioned in the earlier example. I don't think using "a" is 
>> > correct.
>> >
>>
>> I think this section has a general problem of mingling the terms table
>> and partition in a way they can be confusing.
>>
>> In this case, you have to infer that the term "the new table" is
>> referring not to the only table mentioned in the previous paragraph
>> (the partitioned table), but actually to the partition mentioned in
>> the previous paragraph. For long term postgres folks the idea that
>> partitions are tables isn't a big deal, but that isn't how folks
>> coming from other databases see things. So I lean towards "a new
>> table" because we are specifically talking about an alternative to the
>> above paragraph... ie. don't make a new partition, make a new table.
>> And tbh I think that wording (create a new table and attach it as a
>> partition) is still better than the wording in your patch, because the
>> "new partition" you are creating isn't a partition until it is
>> attached; it is just a new table.
>>
>> > "outside" seems better than "outside of". See 
>> > https://english.stackexchange.com/questions/9700/outside-or-outside-of. 
>> > But I think the meaning of the sentence will be more clear if we rephrase 
>> > it as in the attached patch.
>> >
>>
>> This didn't really clarify anything for me, as the discussion in that
>> link seems to be around the usage of the term wrt physical location,
>> and it is much less clear about the context of a logical construct.
>> Granted, your patch removes that, though I think now I'd lean toward
>> using the phrase "separate from".
>>
>> > - convenient, as not only will the existing partitions become indexed, 
>> > but
>> > - also any partitions that are created in the future will.  One 
>> > limitation is
>> > + convenient as not only will the existing partitions become indexed, 
>> > but
>> > + any partitions created in the future will as well.  One limitation is
>> >
>> > I am finding the current construct hard to read. The comma is misplaced as 
>> > you have pointed out. The pair of commas break the "not only" ... "but 
>> > also" construct. I have tried to simplify the sentence in the attached. 
>> > Please review.
>> >
>> > - the partitioned table; such an index is marked invalid, and the 
>> > partitions
>> > - do not get the index applied automatically.  The indexes on 
>> > partitions can
>> > - be created individually using CONCURRENTLY, and 
>> > then
>> > + the partitioned table; such an index is marked invalid and the 
>> > partitions
>> > + do not get the index applied automatically.  The partition indexes 
>> > can
>> >
>> > "indexes on partition" is clearer than "partition index". Fixed in the 
>> > attached patch.
>> >
>> > Please review.
>>
>> The language around all this is certainly tricky (like, what is a
>> partitioned index vs parent index?), and one thing I'd certainly try
>> to avoid is using any words like "inherited" which is also overloaded
>> in this context. In any case, I took in all the above and had a stab
>> at a 

Re: Add SPLIT PARTITION/MERGE PARTITIONS commands

2024-03-19 Thread stephane tachoires
The following review has been posted through the commitfest application:
make installcheck-world:  tested, failed
Implements feature:   not tested
Spec compliant:   not tested
Documentation:not tested

Hi,
I have failing tap test after patches apply:

ok 201   + partition_merge  2635 ms
not ok 202   + partition_split  5719 ms

@@ -805,6 +805,7 @@
   (PARTITION salesmans2_3 FOR VALUES FROM (2) TO (3),
PARTITION salesmans3_4 FOR VALUES FROM (3) TO (4),
PARTITION salesmans4_5 FOR VALUES FROM (4) TO (5));
+ERROR:  no owned sequence found
 INSERT INTO salesmans (salesman_name) VALUES ('May');
 INSERT INTO salesmans (salesman_name) VALUES ('Ford');
 SELECT * FROM salesmans1_2;
@@ -814,23 +815,17 @@
 (1 row)

 SELECT * FROM salesmans2_3;
- salesman_id | salesman_name 
--+---
-   2 | Ivanov
-(1 row)
-
+ERROR:  relation "salesmans2_3" does not exist
+LINE 1: SELECT * FROM salesmans2_3;

The new status of this patch is: Waiting on Author


Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-19 Thread Bertrand Drouvot
Hi,

On Tue, Mar 19, 2024 at 04:20:35PM +0530, Amit Kapila wrote:
> On Tue, Mar 19, 2024 at 3:11 PM Bertrand Drouvot
>  wrote:
> >
> > On Tue, Mar 19, 2024 at 10:56:25AM +0530, Amit Kapila wrote:
> > > On Mon, Mar 18, 2024 at 8:19 PM Bertrand Drouvot
> > >  wrote:
> > > > Agree. While it makes sense to invalidate slots for wal removal in
> > > > CreateCheckPoint() (because this is the place where wal is removed), I 
> > > > 'm not
> > > > sure this is the right place for the 2 new cases.
> > > >
> > > > Let's focus on the timeout one as proposed above (as probably the 
> > > > simplest one):
> > > > as this one is purely related to time and activity what about to 
> > > > invalidate them
> > > > when?:
> > > >
> > > > - their usage resume
> > > > - in pg_get_replication_slots()
> > > >
> > > > The idea is to invalidate the slot when one resumes activity on it or 
> > > > wants to
> > > > get information about it (and among other things wants to know if the 
> > > > slot is
> > > > valid or not).
> > > >
> > >
> > > Trying to invalidate at those two places makes sense to me but we
> > > still need to cover the cases where it takes very long to resume the
> > > slot activity and the dangling slot cases where the activity is never
> > > resumed.
> >
> > I understand it's better to have the slot reflecting its real status
> > internally, but is it a real issue if that's not the case until the activity
> > on it is resumed?
> > (just asking, not saying we should not)
> >
> 
> Sorry, I didn't understand your point. Can you try to explain by example?

Sorry if that was not clear, let me try to rephrase it first: what issue do you
see if the invalidation of such a slot occurs only when its usage resumes or
when pg_get_replication_slots() is triggered? I understand that this could lead
to the slot not being invalidated (maybe forever), but is that an issue for an
inactive slot?
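
(For readers of the archive: pg_replication_slots is simply a view over
pg_get_replication_slots(), so the second trigger point above corresponds to any
ordinary monitoring query, such as the sketch below.)

SELECT slot_name, slot_type, active
FROM pg_replication_slots;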

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Built-in CTYPE provider

2024-03-19 Thread Peter Eisentraut

* v25-0001-Address-more-review-comments-on-commit-2d819a08a.patch

This was committed.

* v25-0002-Support-C.UTF-8-locale-in-the-new-builtin-collat.patch

Looks ok.

* v25-0003-Inline-basic-UTF-8-functions.patch

ok

* v25-0004-Use-version-for-builtin-collations.patch

Not sure about the version format "1.0", which implies some sort of 
major/minor or component-based system.  I would just use "1".


* v25-0005-Add-unicode_strtitle-for-Unicode-Default-Case-Co.patch
* v25-0006-Support-Unicode-full-case-mapping-and-conversion.patch
* v25-0007-Support-PG_UNICODE_FAST-locale-in-the-builtin-co.patch

0005 and 0006 don't contain any test cases.  So I guess they are really 
only usable via 0007.  Is that understanding correct?


Btw., tested initcap() on Oracle:

select initcap('džudo') from dual;

(which uses the precomposed U+01F3) and the result is

DŽudo

(with the precomposed uppercase character).  So that matches the 
behavior proposed in your 0002 patch.


Are there any test cases that illustrate the word boundary changes in 
patch 0005?  It might be useful to test those against Oracle as well.






Re: BitmapHeapScan streaming read user and prelim refactoring

2024-03-19 Thread Heikki Linnakangas

On 18/03/2024 17:19, Melanie Plageman wrote:

I've attached v7 rebased over this commit.


Thanks!


v7-0001-BitmapHeapScan-begin-scan-after-bitmap-creation.patch


If we delayed table_beginscan_bm() call further, after starting the TBM 
iterator, we could skip it altogether when the iterator is empty.


That's a further improvement, doesn't need to be part of this patch set. 
Just caught my eye while reading this.



v7-0003-Push-BitmapHeapScan-skip-fetch-optimization-into-.patch


I suggest to avoid the double negative with SO_CAN_SKIP_FETCH, and call 
the flag e.g. SO_NEED_TUPLE.



As yet another preliminary patch before the streaming read API, it would 
be nice to move the prefetching code to heapam.c too.


What's the point of having separate table_scan_bitmap_next_block() and 
table_scan_bitmap_next_tuple() functions anymore? The AM owns the TBM 
iterator now. The executor node updates the lossy/exact page counts, but 
that's the only per-page thing it does now.



/*
 * If this is the first scan of the underlying table, create the table
 * scan descriptor and begin the scan.
 */
if (!scan)
{
    uint32      extra_flags = 0;

    /*
     * We can potentially skip fetching heap pages if we do not need
     * any columns of the table, either for checking non-indexable
     * quals or for returning data.  This test is a bit simplistic, as
     * it checks the stronger condition that there's no qual or return
     * tlist at all. But in most cases it's probably not worth working
     * harder than that.
     */
    if (node->ss.ps.plan->qual == NIL &&
        node->ss.ps.plan->targetlist == NIL)
        extra_flags |= SO_CAN_SKIP_FETCH;

    scan = node->ss.ss_currentScanDesc =
        table_beginscan_bm(node->ss.ss_currentRelation,
                           node->ss.ps.state->es_snapshot,
                           0,
                           NULL,
                           extra_flags);
}

scan->tbmiterator = tbmiterator;
scan->shared_tbmiterator = shared_tbmiterator;


How about passing the iterator as an argument to table_beginscan_bm()? 
You'd then need some other function to change the iterator on rescan, 
though. Not sure what exactly to do here, but it feels that this part of 
the API is not fully thought out. Needs comments at least, to explain 
who sets tbmiterator / shared_tbmiterator and when. For comparison, for 
a TID scan there's a separate scan_set_tidrange() table AM function. 
Maybe follow that example and introduce scan_set_tbm_iterator().


It's a bit awkward to have separate tbmiterator and shared_tbmiterator 
fields. Could regular and shared iterators be merged, or wrapped under a 
common interface?


--
Heikki Linnakangas
Neon (https://neon.tech)




Re: [HACKERS] make async slave to wait for lsn to be replayed

2024-03-19 Thread Alexander Korotkov
On Tue, Mar 19, 2024 at 1:51 PM Amit Kapila  wrote:
> On Mon, Mar 18, 2024 at 3:24 PM Alexander Korotkov  
> wrote:
> >
> > On Mon, Mar 18, 2024 at 5:17 AM Amit Kapila  wrote:
> > > > 1. First, check that it was called with non-atomic context (that is,
> > > > it's not called within a transaction). Trigger error if called with
> > > > atomic context.
> > > > 2. Release a snapshot to be able to wait without risk of WAL replay
> > > > stuck.  Procedure is still called within the snapshot.  It's a bit of
> > > > a hack to release a snapshot, but Vacuum statements already do so.
> > > >
> > >
> > > Can you please provide a bit more details with some example what is
> > > the existing problem with functions and how using procedures will
> > > resolve it? How will this address the implicit transaction case
> > > or do we have any other workaround for those cases?
> >
> > Please check [1] and [2] for the explanation of the problem with functions.
> >
> > Also, please find a draft patch implementing the procedure.  The issue with 
> > the snapshot is addressed with the following lines.
> >
> > We first ensure we're in a non-atomic context, then pop an active snapshot 
> > (tricky, but ExecuteVacuum() does the same).  Then we should have no active 
> > snapshot and it's safe to wait for lsn replay.
> >
> > if (context->atomic)
> > ereport(ERROR,
> > (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> >  errmsg("pg_wait_lsn() must be only called in non-atomic 
> > context")));
> >
> > if (ActiveSnapshotSet())
> > PopActiveSnapshot();
> > Assert(!ActiveSnapshotSet());
> >
> > The function call could be added either before the BEGIN statement or 
> > before the implicit transaction.
> >
> > CALL pg_wait_lsn('my_lsn', my_timeout); BEGIN;
> > CALL pg_wait_lsn('my_lsn', my_timeout); SELECT ...;
> >
>
> I haven't thought in detail about whether there are any other problems
> with this idea but sounds like it should solve the problems you shared
> with a function call approach. BTW, if the application has to anyway
> know the LSN till where replica needs to wait, why can't they simply
> monitor the pg_last_wal_replay_lsn() value?

Amit, thank you for your feedback.

Yes, the application can monitor pg_last_wal_replay_lsn() value,
that's our state of the art solution.  But that's rather inconvenient
and takes extra latency and network traffic. And it can't be wrapped
into a server-side function in procedural language for the reasons we
can't implement it as a built-in function.
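
For readers following the thread, the polling workaround looks roughly like the
sketch below (the LSN literal is a placeholder captured from the primary):

-- on the primary: capture the LSN the application needs the standby to reach
SELECT pg_current_wal_lsn();
-- on the standby: repeat until it returns true
SELECT pg_last_wal_replay_lsn() >= '0/3000060'::pg_lsn AS caught_up;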

--
Regards,
Alexander Korotkov




Re: speed up a logical replica setup

2024-03-19 Thread Peter Eisentraut

On 19.03.24 12:26, Shlok Kyal wrote:

I'm attaching a new version (v30) that adds:

* 3 new options (--publication, --subscription, --replication-slot) to assign
   names to the objects. The --database option used to ignore duplicate names,
   however, since these new options rely on the number of database options to
   match the number of object name options, it is forbidden from now on. The
   duplication is also forbidden for the object names to avoid errors earlier.
* rewrite the paragraph related to unusable target server after
   pg_createsubscriber fails.
* Vignesh reported an issue [1] related to reaching the recovery stop point
   before the consistent state is reached. I proposed a simple patch that fixes
   the issue.

[1] 
https://www.postgresql.org/message-id/CALDaNm3VMOi0GugGvhk3motghaFRKSWMCSE2t3YX1e%2BMttToxg%40mail.gmail.com



I have added a top-up patch v30-0003. The issue in [1] still exists in
the v30 patch. I was not able to come up with an approach to handle it
in the code, so I have added it to the documentation in the warning
section. Thoughts?


Seems acceptable to me.  pg_createsubscriber will probably always have 
some restrictions and unsupported edge cases like that.  We can't 
support everything, so documenting is ok.






Re: speed up a logical replica setup

2024-03-19 Thread Peter Eisentraut

On 19.03.24 08:05, Amit Kapila wrote:

On Mon, Mar 18, 2024 at 7:22 PM Peter Eisentraut  wrote:


In check_subscriber(): All these permissions checks seem problematic
to me.  We shouldn't reimplement our own copy of the server's
permission checks.  The server can check the permissions.  And if the
permission checking in the server ever changes, then we have
inconsistencies to take care of.  Also, the error messages "permission
denied" are inappropriate, because we are not doing the actual thing.
Maybe we want to do a dry-run for the benefit of the user, but then we
should do the actual thing, like try to create a replication slot, or
whatever.  But I would rather just remove all this, it seems too
problematic.



If we remove all the checks then there is a possibility that we can
fail later while creating the actual subscription. For example, if
there are not sufficient max_replication_slots, then it is bound to
fail in the later steps which would be a costlier affair because by
that time the standby would have been promoted and the user won't have
any way to move forward but to re-create standby and then use this
tool again. I think here the patch tries to mimic pg_upgrade style
checks where we do some pre-checks.


I think checking for required parameter settings is fine.  My concern is 
with the code before that, that does 
pg_has_role/has_database_privilege/has_function_privilege.






Re: Proposal to include --exclude-extension Flag in pg_dump

2024-03-19 Thread Daniel Gustafsson
> On 19 Mar 2024, at 12:19, Dean Rasheed  wrote:

> I'm marking this ready-for-commit (which I'll probably do myself in a
> day or two, unless anyone else claims it first).

LGTM too from a read through.  I did notice a few mistakes in the --filter
documentation portion for other keywords but that's unrelated to this patch,
will fix them once this is in to avoid conflicts.

--
Daniel Gustafsson





Re: [HACKERS] make async slave to wait for lsn to be replayed

2024-03-19 Thread Amit Kapila
On Mon, Mar 18, 2024 at 3:24 PM Alexander Korotkov  wrote:
>
> On Mon, Mar 18, 2024 at 5:17 AM Amit Kapila  wrote:
> > > 1. First, check that it was called with non-atomic context (that is,
> > > it's not called within a transaction). Trigger error if called with
> > > atomic context.
> > > 2. Release a snapshot to be able to wait without risk of WAL replay
> > > stuck.  Procedure is still called within the snapshot.  It's a bit of
> > > a hack to release a snapshot, but Vacuum statements already do so.
> > >
> >
> > Can you please provide a bit more details with some example what is
> > the existing problem with functions and how using procedures will
> > resolve it? How will this address the implicit transaction case
> > or do we have any other workaround for those cases?
>
> Please check [1] and [2] for the explanation of the problem with functions.
>
> Also, please find a draft patch implementing the procedure.  The issue with 
> the snapshot is addressed with the following lines.
>
> We first ensure we're in a non-atomic context, then pop an active snapshot 
> (tricky, but ExecuteVacuum() does the same).  Then we should have no active 
> snapshot and it's safe to wait for lsn replay.
>
> if (context->atomic)
> ereport(ERROR,
> (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
>  errmsg("pg_wait_lsn() must be only called in non-atomic 
> context")));
>
> if (ActiveSnapshotSet())
> PopActiveSnapshot();
> Assert(!ActiveSnapshotSet());
>
> The function call could be added either before the BEGIN statement or before 
> the implicit transaction.
>
> CALL pg_wait_lsn('my_lsn', my_timeout); BEGIN;
> CALL pg_wait_lsn('my_lsn', my_timeout); SELECT ...;
>

I haven't thought in detail about whether there are any other problems
with this idea but sounds like it should solve the problems you shared
with a function call approach. BTW, if the application has to anyway
know the LSN till where replica needs to wait, why can't they simply
monitor the pg_last_wal_replay_lsn() value?

-- 
With Regards,
Amit Kapila.




Re: Possibility to disable `ALTER SYSTEM`

2024-03-19 Thread Andrew Dunstan
On Tue, Mar 19, 2024 at 5:26 AM Heikki Linnakangas  wrote:

> I want to remind everyone of this from Gabriele's first message that
> started this thread:
>
> > At the moment, a possible workaround is that `ALTER SYSTEM` can be
> blocked
> > by making the postgresql.auto.conf read only, but the returned message is
> > misleading and that’s certainly bad user experience (which is very
> > important in a cloud native environment):
> >
> >
> > ```
> > postgres=# ALTER SYSTEM SET wal_level TO minimal;
> > ERROR:  could not open file "postgresql.auto.conf": Permission denied
> > ```
>
> I think making the config file read-only is a fine solution. If you
> don't want postgres to mess with the config files, forbid it with the
> permission system.
>
> Problems with pg_rewind, pg_basebackup were mentioned with that
> approach. I think if you want the config files to be managed outside
> PostgreSQL, by kubernetes, patroni or whatever, it would be good for
> them to be read-only to the postgres user anyway, even if we had a
> mechanism to disable ALTER SYSTEM. So it would be good to fix the
> problems with those tools anyway.
>
> The error message is not great, I agree with that. Can we improve it?
> Maybe just add a HINT like this:
>
> postgres=# ALTER SYSTEM SET wal_level TO minimal;
> ERROR:  could not open file "postgresql.auto.conf" for writing:
> Permission denied
> HINT:  Configuration might be managed outside PostgreSQL
>
>
> Perhaps we could make that even better with a GUC though. I propose a
> GUC called 'configuration_managed_externally = true / false". If you set
> it to true, we prevent ALTER SYSTEM and make the error message more
> definitive:
>
> postgres=# ALTER SYSTEM SET wal_level TO minimal;
> ERROR:  configuration is managed externally
>
> As a bonus, if that GUC is set, we could even check at server startup
> that all the configuration files are not writable by the postgres user,
> and print a warning or refuse to start up if they are.
>
> (Another way to read this proposal is to rename the GUC that's been
> discussed in this thread to 'configuration_managed_externally'. That
> makes it look less like a security feature, and describes the intended
> use case.)
>
>
>

I agree with pretty much all of this.

cheers

andrew


RE: Have pg_basebackup write "dbname" in "primary_conninfo"?

2024-03-19 Thread Hayato Kuroda (Fujitsu)
Dear Sawada-san,

Thanks for giving comments!

> This behavior makes sense to me. But do we want to handle the case of
> using environment variables too? 

Yeah, v5 does not consider which libpq parameters are specified by environment
variables. Such a variable should be used when the dbname is not expressly 
written
in the connection string.
Such a path was added in the v6 patch. If the dbname is not determined after
parsing the connection string, we call PQconndefaults() to get settings from
environment variables and service files [1], then start to search dbname again.
Below shows an example.

```
PGPORT=5431 PGUSER=kuroda PGDATABASE=postgres pg_basebackup -D data_N2 -R -v
->
primary_conninfo = 'user=kuroda ... port=5431 ... dbname=postgres ... '
```

> IIUC,
>
> pg_basebackup -D tmp -d "user=masahiko dbname=test_db"
>
> is equivalent to:
>
> PGDATABASE="user=masahiko dbname=test_db" pg_basebackup -D tmp

That case won't work. I think you assumed that expanded_dbname, as in
PQconnectdbParams() [2], can be used for environment variables, but that is not
correct - the value won't be parsed as a connection string again.

In the libpq layer, connection parameters are parsed in 
PQconnectStartParams()->conninfo_array_parse().
When expand_dbname is specified, the entry "dbname" is firstly checked and
parsed its value. They are done at fe-connect.c:5846.

The environment variables are checked and parsed in conninfo_add_defaults(), 
which
is called from conninfo_array_parse(). However, it is done at fe-connect.c:5956 
- the expand_dbname processing has already been done by that time. This means
there is no chance that PGDATABASE is parsed in the expanded style.

For example, if the pg_basebackup runs like below:

PGDATABASE="user=kuroda dbname=postgres" pg_basebackup -D data_N2 -R -v

The primary_conninfo written in the file will be:

primary_conninfo = 'user=hayato ... dbname=''user=kuroda dbname=postgres'''

[1]: https://www.postgresql.org/docs/devel/libpq-pgservice.html
[2]: 
https://www.postgresql.org/docs/devel/libpq-connect.html#LIBPQ-PQCONNECTDBPARAMS

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



pg_basebackup-write-dbname.v0006.patch
Description: pg_basebackup-write-dbname.v0006.patch


Re: speed up a logical replica setup

2024-03-19 Thread Shlok Kyal
Hi,

> I'm attaching a new version (v30) that adds:
>
> * 3 new options (--publication, --subscription, --replication-slot) to assign
>   names to the objects. The --database option used to ignore duplicate names,
>   however, since these new options rely on the number of database options to
>   match the number of object name options, it is forbidden from now on. The
>   duplication is also forbidden for the object names to avoid errors earlier.
> * rewrite the paragraph related to unusable target server after
>   pg_createsubscriber fails.
> * Vignesh reported an issue [1] related to reaching the recovery stop point
>   before the consistent state is reached. I proposed a simple patch that fixes
>   the issue.
>
> [1] 
> https://www.postgresql.org/message-id/CALDaNm3VMOi0GugGvhk3motghaFRKSWMCSE2t3YX1e%2BMttToxg%40mail.gmail.com
>

I have added a top-up patch v30-0003. The issue in [1] still exists in
the v30 patch. I was not able to come up with an approach to handle it
in the code, so I have added it to the documentation in the warning
section. Thoughts?
I am not changing the version as I have not made any changes in
v30-0001 and v30-0002.

[1]: 
https://www.postgresql.org/message-id/cahv8rj+5mzk9jt+7ecogjzfm5czvdccd5jo1_rcx0bteypb...@mail.gmail.com



Thanks and regards,
Shlok Kyal


v30-0001-pg_createsubscriber-creates-a-new-logical-replic.patch
Description: Binary data


v30-0002-Stop-the-target-server-earlier.patch
Description: Binary data


v30-0003-Document-a-limitation-of-pg_createsubscriber.patch
Description: Binary data


Re: Proposal to include --exclude-extension Flag in pg_dump

2024-03-19 Thread Dean Rasheed
On Sat, 16 Mar 2024 at 17:36, Ayush Vatsa  wrote:
>
> Attached is the complete patch with all the required code changes.
> Looking forward to your review and feedback.
>

This looks good to me. I tested it and everything worked as expected.

I ran it through pgindent to fix some whitespace issues and added
another test for the filter option, based on the test case you added.

I'm marking this ready-for-commit (which I'll probably do myself in a
day or two, unless anyone else claims it first).

Regards,
Dean
From f757ebe748ab47d1e1ab40b343af2a43a9183287 Mon Sep 17 00:00:00 2001
From: Ayush Vatsa 
Date: Mon, 25 Dec 2023 14:46:05 +0530
Subject: [PATCH v3] Add support for --exclude-extension in pg_dump

When specified, extensions matching the given pattern are excluded
in dumps.
---
 doc/src/sgml/ref/pg_dump.sgml   | 34 ++--
 src/bin/pg_dump/pg_dump.c   | 33 +++-
 src/test/modules/test_pg_dump/t/001_base.pl | 88 ++---
 3 files changed, 139 insertions(+), 16 deletions(-)

diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml
index 0caf56e0e0..8edf03a03d 100644
--- a/doc/src/sgml/ref/pg_dump.sgml
+++ b/doc/src/sgml/ref/pg_dump.sgml
@@ -256,6 +256,27 @@ PostgreSQL documentation
   
  
 
+ 
+  --exclude-extension=pattern
+  
+   
+Do not dump any extensions matching pattern.  The pattern is
+interpreted according to the same rules as for -e.
+--exclude-extension can be given more than once to exclude extensions
+matching any of several patterns.
+   
+
+   
+When both -e and --exclude-extension are given, the behavior
+is to dump just the extensions that match at least one -e
+switch but no --exclude-extension switches.  If --exclude-extension
+appears without -e, then extensions matching --exclude-extension are
+excluded from what is otherwise a normal dump.
+   
+  
+ 
+
  
   -E encoding
   --encoding=encoding
@@ -848,10 +869,11 @@ PostgreSQL documentation
 --exclude-table-and-children or
 -T for tables,
 -n/--schema for schemas,
---include-foreign-data for data on foreign servers and
+--include-foreign-data for data on foreign servers,
 --exclude-table-data,
---exclude-table-data-and-children for table data,
--e/--extension for extensions.
+--exclude-table-data-and-children for table data, and
+-e/--extension or
+--exclude-extension for extensions.
 To read from STDIN, use - as the
 filename.  The --filter option can be specified in
 conjunction with the above listed options for including or excluding
@@ -874,8 +896,7 @@ PostgreSQL documentation
  
   
extension: extensions, works like the
-   --extension option. This keyword can only be
-   used with the include keyword.
+   -e/--extension option.
   
  
  
@@ -1278,7 +1299,8 @@ PostgreSQL documentation


 This option has no effect
-on -N/--exclude-schema,
+on --exclude-extension,
+-N/--exclude-schema,
 -T/--exclude-table,
 or --exclude-table-data.  An exclude pattern failing
 to match any objects is not considered an error.
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index a5149ca823..3ab7c6676a 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -136,6 +136,9 @@ static SimpleOidList foreign_servers_include_oids = {NULL, NULL};
 static SimpleStringList extension_include_patterns = {NULL, NULL};
 static SimpleOidList extension_include_oids = {NULL, NULL};
 
+static SimpleStringList extension_exclude_patterns = {NULL, NULL};
+static SimpleOidList extension_exclude_oids = {NULL, NULL};
+
 static const CatalogId nilCatalogId = {0, 0};
 
 /* override for standard extra_float_digits setting */
@@ -437,6 +440,7 @@ main(int argc, char **argv)
 		{"exclude-table-data-and-children", required_argument, NULL, 14},
 		{"sync-method", required_argument, NULL, 15},
 		{"filter", required_argument, NULL, 16},
+		{"exclude-extension", required_argument, NULL, 17},
 
 		{NULL, 0, NULL, 0}
 	};
@@ -672,6 +676,11 @@ main(int argc, char **argv)
 read_dump_filters(optarg, );
 break;
 
+			case 17:			/* exclude extension(s) */
+simple_string_list_append(_exclude_patterns,
+		  optarg);
+break;
+
 			default:
 /* getopt_long already emitted a complaint */
 pg_log_error_hint("Try \"%s --help\" for more information.", progname);
@@ -890,6 +899,10 @@ main(int argc, char **argv)
 		if (extension_include_oids.head == NULL)
 			pg_fatal("no matching extensions were found");
 	}
+	expand_extension_name_patterns(fout, _exclude_patterns,
+   _exclude_oids,
+   false);
+	/* non-matching exclusion patterns aren't an error */
 
 	/*
 	 * Dumping LOs 

Re: SQL:2011 application time

2024-03-19 Thread Peter Eisentraut

On 16.03.24 22:37, Paul A Jungwirth wrote:

Here is a new patch series addressing the last few feedback emails
from Peter & Jian He. It mostly focuses on the FKs patch, trying to
get it really ready to commit,


I have committed the test changes (range and date format etc.).

The FOREIGN KEY patch looks okay to me now.  Maybe check if any of the 
subsequent comments from jian should be applied.



  > I'm not sure how else to do it. The issue is that `range_agg` returns
  > a multirange, so the result
  > type doesn't match the inputs. But other types will likely have the
  > same problem: to combine boxes
  > you may need a multibox. To combine mdranges you may need a
  > multimdrange.

Can we just hardcode the use of range_agg for this release?  Might be
easier.  I don't see all this generality being useful in the near future.


Okay, I've hard-coded range_agg in the main patch and separated the
support for multirange/etc in the next two patches. But there isn't
much code there (mostly tests and docs). Since we can't hard-code the
*operators*, most of the infrastructure is already there not to
hard-code the aggregate function. Supporting multiranges is already a
nice improvement. E.g. it should cut down on disk usage when a record
gets updated frequently. Supporting arbitrary types also seems very
powerful, and we already do that for PKs.
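
As a small self-contained illustration of that type mismatch (the values are
made up):

SELECT range_agg(valid_at) AS covered
FROM (VALUES (daterange('2018-01-01', '2018-02-01')),
             (daterange('2018-02-01', '2018-03-15'))) AS t(valid_at);
-- returns {[2018-01-01,2018-03-15)}, a datemultirange rather than a daterange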


I think we could also handle multiranges in a hardcoded way?  Ranges and 
multiranges are hardcoded concepts anyway.  It's just when we move to 
arbitrary types supporting containment, then it gets a bit more complicated.


What would a patch that adds just multiranges on the FK side, but 
without the full pluggable gist support, look like?



I don't see any drawbacks from supporting inferred REFERENCES with
temporal tables, so my vote is to break from the standard here, and
*not* apply that follow-up patch. Should I add some docs about that?
Also skipping the patch will cause some annoying merge conflicts, so
let me know if that's what you choose and I'll handle them right away.


I agree we can allow this.





Re: Recent 027_streaming_regress.pl hangs

2024-03-19 Thread Alexander Lakhin

14.03.2024 23:56, Tom Lane wrote:

Thomas Munro  writes:

On Fri, Mar 15, 2024 at 7:00 AM Alexander Lakhin  wrote:

Could it be that the timeout (360 sec?) is just not enough for the test
under the current (changed due to switch to meson) conditions?

But you're right that under meson the test takes a lot longer, I guess
due to increased concurrency:

What it seems to be is highly variable.  Looking at calliphoridae's
last half dozen successful runs, I see reported times for
027_stream_regress anywhere from 183 to 704 seconds.  I wonder what
else is happening on that machine.  Also, this is probably not
helping anything:

'extra_config' => {
   ...
   'fsync = on'

I would suggest turning that off and raising wait_timeout a good
deal, and then we'll see if calliphoridae gets any more stable.


I could reproduce similar failures with
PG_TEST_EXTRA=wal_consistency_checking
only, running 5 tests in parallel on a slowed-down VM, so that test
duration increased to ~1900 seconds, but perhaps that buildfarm machine
has a different bottleneck (I/O?) or its concurrent workload is not as
uniform as in my experiments.

Meanwhile, I've analyzed failed test logs from buildfarm and calculated
the percentage of WAL replayed before timeout.
For instance, one of the failures:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae=2024-03-18%2022%3A36%3A40
standby_1.log:
2024-03-18 22:38:22.743 UTC [2010896][walreceiver][:0] LOG:  started streaming 
WAL from primary at 0/300 on timeline 1
...
2024-03-18 22:50:02.439 UTC [2004203][checkpointer][:0] LOG: recovery restart 
point at 0/E00E030
2024-03-18 22:50:02.439 UTC [2004203][checkpointer][:0] DETAIL: Last completed transaction was at log time 2024-03-18 
22:41:26.647756+00.
2024-03-18 22:50:12.841 UTC [2010896][walreceiver][:0] FATAL:  could not receive data from WAL stream: server closed the 
connection unexpectedly


primary.log:
2024-03-18 22:38:23.754 UTC [2012240][client backend][3/3:0] LOG: statement: 
GRANT ALL ON SCHEMA public TO public;
# ^^^ One of the first records produced by `make check`
...
2024-03-18 22:41:26.647 UTC [2174047][client backend][10/752:0] LOG:  statement: ALTER VIEW my_property_secure SET 
(security_barrier=false);

# ^^^ A record near the last completed transaction on standby
...
2024-03-18 22:44:13.226 UTC [2305844][client backend][22/3784:0] LOG:  
statement: DROP TABLESPACE regress_tblspace_renamed;
# ^^^ One of the last records produced by `make check`

\set t0 '22:38:23.754' \set t1 '22:44:13.226' \set tf '22:41:26.647756'
select extract(epoch from (:'tf'::time - :'t0'::time)) / extract(epoch from 
(:'t1'::time - :'t0'::time));
~52%

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae=2024-03-18%2018%3A58%3A58
~48%

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae=2024-03-18%2016%3A41%3A13
~43%

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae=2024-03-18%2015%3A47%3A09
~36%

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae=2024-03-15%2011%3A24%3A38
~87%

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink=2024-03-17%2021%3A55%3A41
~36%

So it still looks like a performance-related issue to me. And yes,
fsync = off -> on greatly increases (~3x) the overall test duration in
my environment.

Best regards,
Alexander




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-19 Thread Amit Kapila
On Tue, Mar 19, 2024 at 3:11 PM Bertrand Drouvot
 wrote:
>
> On Tue, Mar 19, 2024 at 10:56:25AM +0530, Amit Kapila wrote:
> > On Mon, Mar 18, 2024 at 8:19 PM Bertrand Drouvot
> >  wrote:
> > > Agree. While it makes sense to invalidate slots for wal removal in
> > > CreateCheckPoint() (because this is the place where wal is removed), I 'm 
> > > not
> > > sure this is the right place for the 2 new cases.
> > >
> > > Let's focus on the timeout one as proposed above (as probably the 
> > > simplest one):
> > > as this one is purely related to time and activity what about to 
> > > invalidate them
> > > when?:
> > >
> > > - their usage resume
> > > - in pg_get_replication_slots()
> > >
> > > The idea is to invalidate the slot when one resumes activity on it or 
> > > wants to
> > > get information about it (and among other things wants to know if the 
> > > slot is
> > > valid or not).
> > >
> >
> > Trying to invalidate at those two places makes sense to me but we
> > still need to cover the cases where it takes very long to resume the
> > slot activity and the dangling slot cases where the activity is never
> > resumed.
>
> I understand it's better to have the slot reflecting its real status
> internally, but is it a real issue if that's not the case until the activity
> on it is resumed?
> (just asking, not saying we should not)
>

Sorry, I didn't understand your point. Can you try to explain by example?

-- 
With Regards,
Amit Kapila.




minor tweak to catalogs.sgml pg_class.reltablespace

2024-03-19 Thread Alvaro Herrera
While reviewing the patch for SET ACCESS METHOD[1] I noticed that
pg_class.relam is not documented fully for partitioned tables, so I
proposed the attached.  Also, I remove a comment that merely repeats
what was already said a few lines above.

This is intended for backpatch to 12.

[1] https://postgr.es/m/202403191013.w2kr7wqlamqz@alvherre.pgsql
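
To make the proposed wording concrete, a quick illustration (the table and
tablespace names are made up):

CREATE TABLE parted (id int, logdate date)
    PARTITION BY RANGE (logdate) TABLESPACE regress_tblspace;
-- reltablespace is recorded even though the partitioned table has no storage;
-- partitions created without an explicit TABLESPACE clause will be placed there
SELECT relname, reltablespace <> 0 AS has_tablespace
FROM pg_class WHERE relname = 'parted';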

-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/
>From fe4d53f33c688ab3ff132964bf42ba5fc8cdaa6e Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Tue, 19 Mar 2024 09:49:46 +0100
Subject: [PATCH] Review for wording on tablespaces on partitioned tables

Remove a redundant comment, and document pg_class.relam in
catalogs.sgml.

After commits a36c84c3e4a9, 87259588d0ab and others.

Discussion: https://postgr.es/m/202403191013.w2kr7wqlamqz@alvherre.pgsql
---
 doc/src/sgml/catalogs.sgml   | 8 +---
 src/backend/commands/tablecmds.c | 4 
 2 files changed, 5 insertions(+), 7 deletions(-)

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 2f091ad09d..183669272b 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -2011,9 +2011,11 @@ SCRAM-SHA-256$iteration count:
(references pg_tablespace.oid)
   
   
-   The tablespace in which this relation is stored.  If zero,
-   the database's default tablespace is implied.  (Not meaningful
-   if the relation has no on-disk file.)
+   The tablespace in which this relation is stored; for partitioned
+   tables, the tablespace in which partitions are created
+   when one is not specified in the creation command.
+   If zero, the database's default tablespace is implied.
+   (Not meaningful if the relation has no on-disk file.)
   
  
 
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 3ed0618b4e..6c0c899210 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -807,10 +807,6 @@ DefineRelation(CreateStmt *stmt, char relkind, Oid ownerId,
 	}
 	else if (stmt->partbound)
 	{
-		/*
-		 * For partitions, when no other tablespace is specified, we default
-		 * the tablespace to the parent partitioned table's.
-		 */
 		Assert(list_length(inheritOids) == 1);
 		tablespaceId = get_rel_tablespace(linitial_oid(inheritOids));
 	}
-- 
2.39.2



Re: ALTER TABLE SET ACCESS METHOD on partitioned tables

2024-03-19 Thread Alvaro Herrera
On 2024-Mar-19, Alvaro Herrera wrote:

> 0001 is Michaël's patch, 0002 are my proposed changes.

Doh, I sent the wrong set of attachments.  But I see no reason to post
again: what I attached as 0001 is what I wrote was going to be 0002,
Michaël's patch is already in archives, and the CI tests with both
applied on current master are running here:
https://cirrus-ci.com/build/6404370015715328

Michaël, I'll leave this for you to push ...

Thanks!

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/




Re: hot updates and fillfactor

2024-03-19 Thread Aleksander Alekseev
Hi Fabrice,

> I do not understand why the hot_updates value is not 0 for pg_database. Given 
> that reloptions is empty for this table, that means it has a default fillfactor 
> of 100%

Maybe I didn't entirely understand your question, but why would you
assume they are somehow related?

According to the documentation [1][2]:

pg_class.reloptions:
  Access-method-specific options, as “keyword=value” strings

pg_stat_all_tables.n_tup_hot_upd:
  Number of rows HOT updated. These are updates where no successor
versions are required in indexes.

The value of n_tup_hot_upd is not zero because there are tuples that
were HOT-updated. That's it. You can read more about HOT here [3].

[1]: https://www.postgresql.org/docs/current/catalog-pg-class.html
[2]: https://www.postgresql.org/docs/current/monitoring-stats.html
[3]: https://www.postgresql.org/docs/current/storage-hot.html
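
For example, the counters in question can be read directly; this shows how many
updates on pg_database were HOT, regardless of what (if anything) is stored in
pg_class.reloptions:

SELECT relname, n_tup_upd, n_tup_hot_upd
FROM pg_stat_all_tables
WHERE relname = 'pg_database';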

-- 
Best regards,
Aleksander Alekseev




Re: ALTER TABLE SET ACCESS METHOD on partitioned tables

2024-03-19 Thread Alvaro Herrera
On 2024-Mar-08, Michael Paquier wrote:

> I have spent more time reviewing the whole and the tests (I didn't see
> much value in testing the DEFAULT clause twice for the partitioned
> table case and there is a test in d61a6cad6418), tweaked a few
> comments and the documentation, did an indentation and a commit
> message draft.
> 
> How does that look to you?  The test coverage and the semantics do
> what we want them to do, so that looks rather reasonable here.  A
> second or even third pair of eyes would not hurt.

I gave this a look.  I found some of the comments a bit confusing or
overly long, so I propose to reword them.  I also propose a small doc
change (during writing which I noticed that the docs for tablespace had
been neglected and one comment too many; patch to be committed
separately soon).  I ended up also moving code in tablecmds.c so that
all the AT*SetAccessMethod* routines appear together rather than mixed
with the ones for tablespaces, and removing one CCI that seems
unnecessary, at the bottom of ATExecSetAccessMethodNoStorage.

0001 is Michaël's patch, 0002 are my proposed changes.

-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/
"Hay que recordar que la existencia en el cosmos, y particularmente la
elaboración de civilizaciones dentro de él no son, por desgracia,
nada idílicas" (Ijon Tichy)
>From 8b4408ac6cc46af2257e9c613d95a0eb38238a8e Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Tue, 19 Mar 2024 09:51:58 +0100
Subject: [PATCH] comment updates, remove one CCI

---
 doc/src/sgml/catalogs.sgml   |  17 ++-
 src/backend/commands/tablecmds.c | 223 ++-
 2 files changed, 114 insertions(+), 126 deletions(-)

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index c86ff19754..476a7b70dd 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -1989,9 +1989,12 @@ SCRAM-SHA-256$iteration 
count:
   
   
If this is a table or an index, the access method used (heap,
-   B-tree, hash, etc.); otherwise zero. Zero occurs for sequences,
-   as well as relations without storage, such as views. Partitioned
-   tables may use a non-zero value.
+   B-tree, hash, etc.); for partitioned tables, the access method
+   for newly created partitions, when one is not specified in the
+   creation command, or zero to make creation fall back to
+   default_table_access_method.
+   Zero for sequences, as well as other relations without
+   storage, such as views.
   
  
 
@@ -2012,9 +2015,11 @@ SCRAM-SHA-256$iteration 
count:
(references pg_tablespace.oid)
   
   
-   The tablespace in which this relation is stored.  If zero,
-   the database's default tablespace is implied.  (Not meaningful
-   if the relation has no on-disk file.)
+   The tablespace in which this relation is stored; for partitioned
+   tables, the tablespace in which partitions are created
+   when one is not specified in the creation command.
+   If zero, the database's default tablespace is implied.
+   (Not meaningful if the relation has no on-disk file.)
   
  
 
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 54c58d1764..9e36aae425 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -950,27 +950,20 @@ DefineRelation(CreateStmt *stmt, char relkind, Oid ownerId,
 	}
 
 	/*
-	 * If the statement hasn't specified an access method, but we're defining
-	 * a type of relation that needs one, use the default.
+	 * Select access method to use: an explicitly indicated one, or (in the
+	 * case of a partitioned table) the parent's, if it has one.
 	 */
 	if (stmt->accessMethod != NULL)
 		accessMethodId = get_table_am_oid(stmt->accessMethod, false);
 	else if (stmt->partbound)
 	{
-		/*
-		 * For partitions, if no access method is specified, use the AM of the
-		 * parent table.
-		 */
 		Assert(list_length(inheritOids) == 1);
 		accessMethodId = get_rel_relam(linitial_oid(inheritOids));
 	}
 	else
 		accessMethodId = InvalidOid;
 
-	/*
-	 * Still nothing?  Use the default.  Partitioned tables default to
-	 * InvalidOid without an access method specified.
-	 */
+	/* still nothing? use the default */
 	if (RELKIND_HAS_TABLE_AM(relkind) && !OidIsValid(accessMethodId))
 		accessMethodId = get_table_am_oid(default_table_access_method, false);
 
@@ -5403,7 +5396,6 @@ ATExecCmd(List **wqueue, AlteredTableInfo *tab,
 			/* nothing to do here, oid columns don't exist anymore */
 			break;
 		case AT_SetAccessMethod:	/* SET ACCESS METHOD */
-			/* handled

Re: Catalog domain not-null constraints

2024-03-19 Thread jian he
create domain connotnull integer;
create table domconnotnulltest
( col1 connotnull
, col2 connotnull
);
alter domain connotnull add not null value;
---
the above query does not work in pg16.
ERROR: syntax error at or near "not".

after applying the patch, now this works.
this new syntax needs to be added to the synopsis in alter_domain.sgml, and also
needs an explanatory varlistentry?

+ /*
+ * Store the constraint in pg_constraint
+ */
+ ccoid =
+ CreateConstraintEntry(constr->conname, /* Constraint Name */
+  domainNamespace, /* namespace */
+  CONSTRAINT_NOTNULL, /* Constraint Type */
+  false, /* Is Deferrable */
+  false, /* Is Deferred */
+  !constr->skip_validation, /* Is Validated */
+  InvalidOid, /* no parent constraint */
+  InvalidOid, /* not a relation constraint */
+  NULL,
+  0,
+  0,
+  domainOid, /* domain constraint */
+  InvalidOid, /* no associated index */
+  InvalidOid, /* Foreign key fields */
+  NULL,
+  NULL,
+  NULL,
+  NULL,
+  0,
+  ' ',
+  ' ',
+  NULL,
+  0,
+  ' ',
+  NULL, /* not an exclusion constraint */
+  NULL,
+  NULL,
+  true, /* is local */
+  0, /* inhcount */
+  false, /* connoinherit */
+  false, /* conwithoutoverlaps */
+  false); /* is_internal */

/* conwithoutoverlaps */
should be
/* conperiod */




Re: add AVX2 support to simd.h

2024-03-19 Thread John Naylor
On Tue, Mar 19, 2024 at 10:16 AM Nathan Bossart
 wrote:
>
> On Tue, Mar 19, 2024 at 10:03:36AM +0700, John Naylor wrote:
> > I took a brief look, and 0001 isn't quite what I had in mind. I can't
> > quite tell what it's doing with the additional branches and "goto
> > retry", but I meant something pretty simple:
>
> Do you mean 0002?  0001 just adds a 2-register loop for remaining elements
> once we've exhausted what can be processed with the 4-register loop.

Sorry, I was looking at v2 at the time.

> > - if short, do one element at a time and return
>
> 0002 does this.

That part looks fine.

> > - if long, do one block unconditionally, then round the start pointer
> > up so that "end - start" is an exact multiple of blocks, and loop over
> > them
>
> 0002 does the opposite of this.  That is, after we've completed as many
> blocks as possible, we move the iterator variable back to "end -
> block_size" and do one final iteration to cover all the remaining elements.

Sounds similar in principle, but it looks really complicated. I don't
think the additional loops and branches are a good way to go, either
for readability or for branch prediction. My sketch has one branch for
which loop to do, and then performs only one loop. Let's do the
simplest thing that could work. (I think we might need a helper
function to do the block, but the rest should be easy)
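
(Purely for illustration, not the patch itself: a rough C sketch of that
control flow, with invented names, a fixed stand-in block size, and a scalar
helper where the real code would use the simd.h vector operations.)

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define BLOCK_SIZE 16			/* stand-in for one vector register's worth */

/* Stand-in for a vectorized check of one BLOCK_SIZE-byte block. */
static inline bool
block_has_byte(const uint8_t *p, uint8_t target)
{
	for (size_t i = 0; i < BLOCK_SIZE; i++)
		if (p[i] == target)
			return true;
	return false;
}

static bool
buf_has_byte(const uint8_t *base, size_t len, uint8_t target)
{
	if (len < BLOCK_SIZE)
	{
		/* short input: one element at a time, then return */
		for (size_t i = 0; i < len; i++)
			if (base[i] == target)
				return true;
		return false;
	}

	/* long input: do the first block unconditionally */
	if (block_has_byte(base, target))
		return true;

	/*
	 * Round the start up so that the remaining length is an exact multiple
	 * of BLOCK_SIZE.  The bytes skipped here were already covered by the
	 * unconditional first block, so nothing is missed; at worst a few bytes
	 * are checked twice.
	 */
	size_t		offset = len % BLOCK_SIZE;

	if (offset == 0)
		offset = BLOCK_SIZE;
	for (const uint8_t *p = base + offset; p < base + len; p += BLOCK_SIZE)
		if (block_has_byte(p, target))
			return true;
	return false;
}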




Re: Inconsistent printf placeholders

2024-03-19 Thread Peter Eisentraut

On 15.03.24 08:20, Kyotaro Horiguchi wrote:

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
@@ -1369,8 +1369,8 @@ ReadTwoPhaseFile(TransactionId xid, bool missing_ok)
 					 errmsg("could not read file \"%s\": %m", path)));
 		else
 			ereport(ERROR,
-					(errmsg("could not read file \"%s\": read %d of %lld",
-							path, r, (long long int) stat.st_size)));
+					(errmsg("could not read file \"%s\": read %zd of %zu",
+							path, r, (Size) stat.st_size)));
 	}
 
 	pgstat_report_wait_end();


This might be worse, because stat.st_size is of type off_t, which could 
be smaller than Size/size_t.



diff --git a/src/backend/libpq/be-secure-gssapi.c b/src/backend/libpq/be-secure-gssapi.c
index bc04e78abb..68645b4519 100644
--- a/src/backend/libpq/be-secure-gssapi.c
+++ b/src/backend/libpq/be-secure-gssapi.c
@@ -572,9 +572,9 @@ secure_open_gssapi(Port *port)
 		if (input.length > PQ_GSS_RECV_BUFFER_SIZE)
 		{
 			ereport(COMMERROR,
-					(errmsg("oversize GSSAPI packet sent by the client (%zu > %d)",
+					(errmsg("oversize GSSAPI packet sent by the client (%zu > %zu)",
 							(size_t) input.length,
-							PQ_GSS_RECV_BUFFER_SIZE)));
+							(size_t) PQ_GSS_RECV_BUFFER_SIZE)));
 			return -1;
 		}
  


Might be better to add that cast to the definition of 
PQ_GSS_RECV_BUFFER_SIZE instead, so that all code can benefit.
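
For example, something along these lines (just a sketch; the numeric value is
a stand-in for illustration, not taken from the tree):

#define PQ_GSS_RECV_BUFFER_SIZE ((size_t) 16384)	/* cast lives in the macro */

With the cast in the definition, call sites could use %zu directly without
repeating the cast.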



diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index 7474f5bd67..baa76280b9 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -312,11 +312,6 @@ timeline_history:
 				{
 					TimeLineHistoryCmd *cmd;
 
-					if ($2 <= 0)
-						ereport(ERROR,
-								(errcode(ERRCODE_SYNTAX_ERROR),
-								 errmsg("invalid timeline %u", $2)));
-
 					cmd = makeNode(TimeLineHistoryCmd);
 					cmd->timeline = $2;
 
@@ -352,13 +347,7 @@ opt_slot:
 
 opt_timeline:
 			K_TIMELINE UCONST
-				{
-					if ($2 <= 0)
-						ereport(ERROR,
-								(errcode(ERRCODE_SYNTAX_ERROR),
-								 errmsg("invalid timeline %u", $2)));
-					$$ = $2;
-				}
+				{ $$ = $2; }
 			| /* EMPTY */	{ $$ = 0; }
 			;
  


I don't think this is correct.  It loses the check for == 0.


diff --git a/src/backend/tsearch/to_tsany.c b/src/backend/tsearch/to_tsany.c
index 88cba58cba..9d21178107 100644
--- a/src/backend/tsearch/to_tsany.c
+++ b/src/backend/tsearch/to_tsany.c
@@ -191,7 +191,8 @@ make_tsvector(ParsedText *prs)
 	if (lenstr > MAXSTRPOS)
 		ereport(ERROR,
 				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
-				 errmsg("string is too long for tsvector (%d bytes, max %d bytes)", lenstr, MAXSTRPOS)));
+				 /* cast values to avoid extra translatable messages */
+				 errmsg("string is too long for tsvector (%ld bytes, max %ld bytes)", (long) lenstr, (long) MAXSTRPOS)));
 
 	totallen = CALCDATASIZE(prs->curwords, lenstr);
 	in = (TSVector) palloc0(totallen);


I think it would be better instead to change the message in tsvectorin() 
to *not* use long.  The size of long is unportable, so I would rather 
avoid using it at all.



diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index 8d28dd42ce..5de490b569 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -3217,8 +3217,9 @@ byteaGetByte(PG_FUNCTION_ARGS)
if (n < 0 || n >= len)
ereport(ERROR,
(errcode(ERRCODE_ARRAY_SUBSCRIPT_ERROR),
-errmsg("index %d out of valid range, 0..%d",
-   n, len - 1)));
+/* cast 

Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-19 Thread Bertrand Drouvot
Hi,

On Tue, Mar 19, 2024 at 10:56:25AM +0530, Amit Kapila wrote:
> On Mon, Mar 18, 2024 at 8:19 PM Bertrand Drouvot
>  wrote:
> > Agree. While it makes sense to invalidate slots for wal removal in
> > CreateCheckPoint() (because this is the place where wal is removed), I 'm 
> > not
> > sure this is the right place for the 2 new cases.
> >
> > Let's focus on the timeout one as proposed above (as probably the simplest 
> > one):
> > as this one is purely related to time and activity what about to invalidate 
> > them
> > when?:
> >
> > - their usage resume
> > - in pg_get_replication_slots()
> >
> > The idea is to invalidate the slot when one resumes activity on it or wants 
> > to
> > get information about it (and among other things wants to know if the slot 
> > is
> > valid or not).
> >
> 
> Trying to invalidate at those two places makes sense to me but we
> still need to cover the cases where it takes very long to resume the
> slot activity and the dangling slot cases where the activity is never
> resumed.

I understand it's better to have the slot reflect its real status internally,
but is it a real issue if that's not the case until the activity on it is
resumed?
(just asking, not saying we should not)

> How about apart from the above two places, trying to
> invalidate in CheckPointReplicationSlots() where we are traversing all
> the slots?

I think that's a good place, but there is still a window of time (that could also
be "large" depending on the activity and the checkpoint frequency) during which
the slot is not known to be invalid internally. But yeah, at least we know that
we'll mark it as invalid at some point...

BTW:

if (am_walsender)
{
+   if (slot->data.persistency == RS_PERSISTENT)
+   {
+   SpinLockAcquire(&slot->mutex);
+   slot->data.last_inactive_at = GetCurrentTimestamp();
+   slot->data.inactive_count++;
+   SpinLockRelease(&slot->mutex);

I'm also feeling the same concern as Shveta mentioned in [1]: that a "normal"
backend using pg_logical_slot_get_changes() or friends would not set the
last_inactive_at.

[1]: https://www.postgresql.org/message-id/CAJpy0uD64X%3D2ENmbHaRiWTKeQawr-rbGoy_GdhQQLVXzUSKTMg%40mail.gmail.com

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-19 Thread John Naylor
On Tue, Mar 19, 2024 at 10:24 AM Masahiko Sawada  wrote:
>
> On Tue, Mar 19, 2024 at 8:35 AM John Naylor  wrote:
> >
> > On Mon, Mar 18, 2024 at 11:12 AM Masahiko Sawada  
> > wrote:
> > >
> > > On Sun, Mar 17, 2024 at 11:46 AM John Naylor  
> > > wrote:

> > It might also be worth reducing the number of blocks in the random
> > test -- multiple runs will have different offsets anyway.
>
> Yes. If we reduce the number of blocks from 1000 to 100, the
> regression test took on my environment:
>
> 1000 blocks : 516 ms
> 100 blocks   : 228 ms

Sounds good.

> Removed some unnecessary variables in 0002 patch.

Looks good.

> So the MaxBlocktableEntrySize calculation would be as follows?
>
> #define MaxBlocktableEntrySize \
> offsetof(BlocktableEntry, words) + \
> (sizeof(bitmapword) * \
> WORDS_PER_PAGE(Min(MaxOffsetNumber, \
>                    BITS_PER_BITMAPWORD * PG_INT8_MAX - 1)))
>
> I've made this change in the 0003 patch.

This is okay, but one side effect is that we have both an assert and
an elog, for different limits. I think we'll need a separate #define
to help. But for now, I don't want to hold up tidstore further with
this because I believe almost everything else in v74 is in pretty good
shape. I'll save this for later as a part of the optimization I
proposed.

Remaining things I noticed:

+#define RT_PREFIX local_rt
+#define RT_PREFIX shared_rt

Prefixes for simplehash, for example, don't have "sh" -- maybe "local/shared_ts"

+ /* MemoryContext where the radix tree uses */

s/where/that/

+/*
+ * Lock support functions.
+ *
+ * We can use the radix tree's lock for shared TidStore as the data we
+ * need to protect is only the shared radix tree.
+ */
+void
+TidStoreLockExclusive(TidStore *ts)

Talking about multiple things, so maybe a blank line after the comment.

With those, I think you can go ahead and squash all the tidstore
patches except for 0003 and commit it.

> While reviewing the vacuum patch, I realized that we always pass
> LWTRANCHE_SHARED_TIDSTORE to RT_CREATE(), and the wait event related
> to the tidstore is therefore always the same. I think it would be
> better to make the caller of TidStoreCreate() specify the tranche_id
> and pass it to RT_CREATE(). That way, the caller can specify their own
> wait event for tidstore. The 0008 patch tried this idea. dshash.c does
> the same idea.

Sounds reasonable. I'll just note that src/include/storage/lwlock.h
still has an entry for LWTRANCHE_SHARED_TIDSTORE.




Re: What is a typical precision of gettimeofday()?

2024-03-19 Thread Aleksander Alekseev
Hi,

cc: Andrey

> Over in the thread discussing the addition of UUIDv7 support [0], there
> is some uncertainty about what timestamp precision one can expect from
> gettimeofday().
>
> UUIDv7 uses milliseconds since Unix epoch, but can optionally use up to
> 12 additional bits of timestamp precision (see [1]), but it can also
> just use a counter instead of the extra precision.  The current patch
> uses the counter method "because of portability concerns" (source code
> comment).
>
> I feel that we don't actually have any information about this
> portability concern.  Does anyone know what precision we can expect from
> gettimeofday()?  Can we expect the full microsecond precision usually?

Specifically in the UUIDv7 application, the goal is not necessarily to
generate time-precise UUIDs but rather to do our best to get *unique*
UUIDs. As I understand it, this is the actual reason why the patch needs
counters.

As Linux man page puts it:

"""
The time returned by gettimeofday() is affected by discontinuous jumps
in the system  time  (e.g.,  if  the  system  administrator  manually
changes the system time).
"""

On top of that MacOS man page says:

"""
The resolution of the system clock is hardware dependent, and the time
may be updated continuously or in ``ticks.''
"""

On Windows our gettimeofday() implementation is a wrapper for
GetSystemTimePreciseAsFileTime(). The corresponding MSDN page [1] is
somewhat laconic.

Considering the number of environments PostgreSQL can run in (OS +
hardware + virtualization technologies) and the fact that
hardware/software changes, I doubt that it's realistic to expect any
particular guarantees from gettimeofday() in the general case.
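
As a quick way to see what a particular platform actually delivers, one could
run a small probe along these lines (illustrative only, not part of any patch):

#include <stdio.h>
#include <sys/time.h>

/*
 * Call gettimeofday() in a tight loop and report the smallest nonzero
 * difference observed, as a rough indication of the effective resolution
 * on this platform.
 */
int
main(void)
{
	struct timeval prev,
				cur;
	long		min_delta_us = -1;

	gettimeofday(&prev, NULL);
	for (int i = 0; i < 1000000; i++)
	{
		long		delta;

		gettimeofday(&cur, NULL);
		delta = (cur.tv_sec - prev.tv_sec) * 1000000L +
			(cur.tv_usec - prev.tv_usec);
		if (delta > 0 && (min_delta_us < 0 || delta < min_delta_us))
			min_delta_us = delta;
		prev = cur;
	}
	printf("smallest nonzero step: %ld us\n", min_delta_us);
	return 0;
}

Of course, a single number like that says nothing about clock jumps or about
other platforms, which is rather the point above.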

[1]: https://learn.microsoft.com/en-us/windows/win32/api/sysinfoapi/nf-sysinfoapi-getsystemtimepreciseasfiletime


--
Best regards,
Aleksander Alekseev




Re: Table AM Interface Enhancements

2024-03-19 Thread Pavel Borisov
Hi, Alexander!

On Tue, 19 Mar 2024 at 03:34, Alexander Korotkov 
wrote:

> On Sun, Mar 3, 2024 at 1:50 PM Alexander Korotkov 
> wrote:
> > On Mon, Nov 27, 2023 at 10:18 PM Mark Dilger
> >  wrote:
> > >
> > > > On Nov 25, 2023, at 9:47 AM, Alexander Korotkov <
> aekorot...@gmail.com> wrote:
> > > >
> > > >> Should the patch at least document which parts of the EState are
> expected to be in which states, and which parts should be viewed as
> undefined?  If the implementors of table AMs rely on any/all aspects of
> EState, doesn't that prevent future changes to how that structure is used?
> > > >
> > > > New tuple tuple_insert_with_arbiter() table AM API method needs
> EState
> > > > argument to call executor functions: ExecCheckIndexConstraints(),
> > > > ExecUpdateLockMode(), and ExecInsertIndexTuples().  I think we
> > > > probably need to invent some opaque way to call this executor
> function
> > > > without revealing EState to table AM.  Do you think this could work?
> > >
> > > We're clearly not accessing all of the EState, just some specific
> fields, such as es_per_tuple_exprcontext.  I think you could at least
> refactor to pass the minimum amount of state information through the table
> AM API.
> >
> > Yes, the table AM doesn't need the full EState, just the ability to do
> > specific manipulation with tuples.  I'll refactor the patch to make a
> > better isolation for this.
>
> Please find the revised patchset attached.  The changes are following:
> 1. Patchset is rebased to the current master.
> 2. Patchset is reordered.  I tried to put less debatable patches to the
> top.
> 3. tuple_is_current() method is moved from the Table AM API to the
> slot as proposed by Matthias van de Meent.
> 4. Assert added to the table_free_rd_amcache() as proposed by Pavel
> Borisov.
>

Patches 0001-0002 are unchanged compared to the last version in thread [1].
In my opinion, they are still ready to be committed, which was not done
only because it was too close to feature freeze one year ago.

0003 - Assert added since the previous version. I still have a strong opinion
that allowing multi-chunked data structures instead of single chunks is
completely safe and is a natural, self-justified step in improving Postgres.
The patch is simple enough and ready to be pushed.

0004 (previously 0007) - Has not changed, and there is consensus that
this is reasonable. I've re-checked the current code; it looks safe
with respect to returning a different slot, which I doubted before. So consider
this patch also ready.

0005 (previously 0004) - Unused argument in the is_current_xact_tuple()
signature is removed. Also, compared to v1, the code has shifted from the
tableam methods to the TTS level.

I'd propose to remove Assert(!TTS_EMPTY(slot))
from tts_minimal_is_current_xact_tuple()
and tts_virtual_is_current_xact_tuple(), as these are only error-reporting
functions that don't actually use the slot.

Comment similar to:
+/*
+ * VirtualTupleTableSlots never have a storage tuples.  We generally
+ * shouldn't get here, but provide a user-friendly message if we do.
+ */
also applies to tts_minimal_is_current_xact_tuple()

I'd propose changes for clarity of this comment:
%s/a storage tuples/storage tuples/g
%s/generally//g
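
That is, after those two substitutions the comment would read roughly:

/*
 * VirtualTupleTableSlots never have storage tuples.  We
 * shouldn't get here, but provide a user-friendly message if we do.
 */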

Otherwise patch 0005 also looks good to me.

I'm planning to review the remaining patches. Meanwhile, I think pushing what
is now ready and uncontroversial is a good idea.
Thank you for the work done on this patchset!

Regards,
Pavel Borisov,
Supabase.






[1]. https://www.postgresql.org/message-id/CAPpHfdua-YFw3XTprfutzGp28xXLigFtzNbuFY8yPhqeq6X5kg%40mail.gmail.com


Re: doc issues in event-trigger-matrix.html

2024-03-19 Thread Daniel Gustafsson
> On 19 Mar 2024, at 02:14, Michael Paquier  wrote:
> 
> On Tue, Mar 19, 2024 at 08:00:00AM +0800, jian he wrote:
>> I think the "X" and "-" mean in this matrix [0] is not very intuitive.
>> mainly because "X" tends to mean negative things in most cases.
>> we can write a sentence saying "X"  means this, "-" means that.
>> 
>> or maybe Check mark [1] and Cross mark [2]  are more universal.
>> and we can use these marks.
>> 
>> "Only for local objects"
>> is there any reference explaining "local objects"?
>> I think local object means objects that only affect one single database?

That's a bigger problem than the table representation: we never define what
"local object" means anywhere in the EVT docs.  EVs are global for a database,
but not a cluster, so I assume what this means is that EVs for non-DDL commands
like COMMENT can only fire for a specific relation they are attached to, and not
database-wide?

> It is true that in Japan the cross mark refers to a negation, and
> that's the opposite in France: I would put a cross on a table in the
> case where something is supported.  I've never seen anybody complain
> about the format of these tables, FWIW, but if these were to be
> changed, the update should happen across the board for all the tables
> and not only one.

AFAICT we only have one other table with "X" denoting support, the "Conflicting
Lock Modes" table under Concurrency Control chapter, and there we simply leave
the "not supported" column empty instead of using a dash.  Maybe the simple fix
here is to make these tables consistent by removing the dash from the event
trigger firing matrix?

As a sidenote, the table should gain a sentence explaining why the login column
is missing to avoid confusion.

--
Daniel Gustafsson




