Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-04-02 Thread Bharath Rupireddy
On Wed, Apr 3, 2024 at 8:38 AM shveta malik  wrote:
>
> > > Or a simple solution is that the slotsync worker updates
> > > inactive_since as it does for non-synced slots, and disables
> > > timeout-based slot invalidation for synced slots.
>
> I like this idea better, it takes care of such a case too when the
> user is relying on sync-function rather than worker and does not want
> to get the slots invalidated in between 2 sync function calls.

Please find the attached v31 patches implementing the above idea:

- synced slots get their own inactive_since just like any other slot
- synced slots don't get invalidated due to inactive timeout, because
such slots are not considered active at all as they don't perform logical
decoding (they will perform it in fast_forward mode to fix the other data
loss issue, but they don't generate changes, so they can't be called
*active* slots)
- a synced slot's inactive_since is set to the current timestamp after the
standby gets promoted, to help interpret inactive_since correctly, just
like any other slot.
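
For instance, after promotion this can be sanity-checked on the (former)
standby with something like the following (a rough check, not part of the
patch; it assumes the synced and inactive_since columns of
pg_replication_slots):

    SELECT slot_name, synced, inactive_since
    FROM pg_replication_slots
    WHERE synced;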

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
From fd2cb48726dd4e1932f8809dfb36e0fe9f96 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Wed, 3 Apr 2024 05:03:22 +
Subject: [PATCH v31 1/2] Allow synced slots to have their own inactive_since.

---
 doc/src/sgml/system-views.sgml|  7 +++
 src/backend/replication/logical/slotsync.c| 44 +
 src/backend/replication/slot.c| 23 +++--
 src/test/perl/PostgreSQL/Test/Cluster.pm  | 34 +
 src/test/recovery/t/019_replslot_limit.pl | 26 +-
 .../t/040_standby_failover_slots_sync.pl  | 49 +++
 6 files changed, 144 insertions(+), 39 deletions(-)

diff --git a/doc/src/sgml/system-views.sgml b/doc/src/sgml/system-views.sgml
index 3c8dca8ca3..b64274a1fb 100644
--- a/doc/src/sgml/system-views.sgml
+++ b/doc/src/sgml/system-views.sgml
@@ -2530,6 +2530,13 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx
   
 The time since the slot has become inactive.
 NULL if the slot is currently being used.
+Note that for slots on the standby that are being synced from a
+primary server (whose synced field is
+true), the
+inactive_since value will get updated
+after every synchronization (see
+)
+from the corresponding remote slot on the primary.
   
  
 
diff --git a/src/backend/replication/logical/slotsync.c b/src/backend/replication/logical/slotsync.c
index 30480960c5..4050bd40f8 100644
--- a/src/backend/replication/logical/slotsync.c
+++ b/src/backend/replication/logical/slotsync.c
@@ -140,6 +140,7 @@ typedef struct RemoteSlot
 } RemoteSlot;
 
 static void slotsync_failure_callback(int code, Datum arg);
+static void update_synced_slots_inactive_since(void);
 
 /*
  * If necessary, update the local synced slot's metadata based on the data
@@ -1296,6 +1297,46 @@ ReplSlotSyncWorkerMain(char *startup_data, size_t startup_data_len)
 	Assert(false);
 }
 
+/*
+ * Update the inactive_since property for synced slots.
+ */
+static void
+update_synced_slots_inactive_since(void)
+{
+	TimestampTz now = 0;
+
+	LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
+
+	for (int i = 0; i < max_replication_slots; i++)
+	{
+		ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i];
+
+		/* Check if it is a synchronized slot */
+		if (s->in_use && s->data.synced)
+		{
+			Assert(SlotIsLogical(s));
+
+			/*
+			 * We get the current time beforehand and only once to avoid
+			 * system calls overhead while holding the lock.
+			 */
+			if (now == 0)
+now = GetCurrentTimestamp();
+
+			/*
+			 * Set the time since the slot has become inactive after shutting
+			 * down slot sync machinery. This helps correctly interpret the
+			 * time if the standby gets promoted without a restart.
+			 */
+			SpinLockAcquire(&s->mutex);
+			s->inactive_since = now;
+			SpinLockRelease(&s->mutex);
+		}
+	}
+
+	LWLockRelease(ReplicationSlotControlLock);
+}
+
 /*
  * Shut down the slot sync worker.
  */
@@ -1309,6 +1350,7 @@ ShutDownSlotSync(void)
 	if (SlotSyncCtx->pid == InvalidPid)
 	{
 		SpinLockRelease(&SlotSyncCtx->mutex);
+		update_synced_slots_inactive_since();
 		return;
 	}
 	SpinLockRelease(&SlotSyncCtx->mutex);
@@ -1341,6 +1383,8 @@ ShutDownSlotSync(void)
 	}
 
  	SpinLockRelease(&SlotSyncCtx->mutex);
+
+	update_synced_slots_inactive_since();
 }
 
 /*
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index d778c0b921..5549ca9640 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -42,6 +42,7 @@
 #include "access/transam.h"
 #include "access/xlog_internal.h"
 #include "access/xlogrecovery.h"
+#include "access/xlogutils.h"
 #include "common/file_utils.h"
 #include "common/string.h"
 #include "miscadmin.h"
@@ -690,13 +691,10 @@ ReplicationSlotRelease(void)
 	}
 
 	/*
-	 * Set the last inactive time 

Re: Synchronizing slots from primary to standby

2024-04-02 Thread Amit Kapila
On Wed, Apr 3, 2024 at 9:36 AM Bharath Rupireddy
 wrote:
>
> On Wed, Apr 3, 2024 at 9:04 AM Amit Kapila  wrote:
> >
> > > I'd just rename LogicalSlotAdvanceAndCheckSnapState(XLogRecPtr
> > > moveto, bool *found_consistent_snapshot) to
> > > pg_logical_replication_slot_advance(XLogRecPtr moveto, bool
> > > *found_consistent_snapshot) and use it. If others don't like this, I'd
> > > at least turn pg_logical_replication_slot_advance(XLogRecPtr moveto) a
> > > static inline function.
> > >
> > Yeah, we can do that but it is not a performance-sensitive routine so
> > don't know if it is worth it.
>
> Okay for what the patch has right now. No more bikeshedding from me on this.
>
> > > > The slotsync worker needs to advance the slots from different databases 
> > > > in
> > > > fast_forward. So, we need to skip this check in fast_forward mode. The 
> > > > analysis can
> > > > be found in [3].
> > > -if (slot->data.database != MyDatabaseId)
> > > +/*
> > > + * We need to access the system tables during decoding to build the
> > > + * logical changes unless we are in fast_forward mode where no 
> > > changes are
> > > + * generated.
> > > + */
> > > +if (slot->data.database != MyDatabaseId && !fast_forward)
> > >  ereport(ERROR,
> > >
> > > It's not clear from the comment that we need it for a slotsync worker
> > > to advance the slots from different databases. Can this be put into
> > > the comment? Also, specify in the comment, why this is safe?
> > >
> > It is not specific to slot sync worker but specific to fast_forward
> > mode. There is already a comment "We need to access the system tables
> > during decoding to build the logical changes unless we are in
> > fast_forward mode where no changes are generated." telling why it is
> > safe. The point is we need database access to access system tables
> > while generating the logical changes and in fast-forward mode, we
> > don't generate logical changes so this check is not required. Do let
> > me if you have a different understanding or if my understanding is
> > incorrect.
>
Understood. Thanks. Just curious, why isn't this a problem for the existing
fast_forward mode callers pg_replication_slot_advance and
LogicalReplicationSlotHasPendingWal?
>

We call those after connecting to the database, and the slot also
belongs to that database, whereas during synchronization of slots on the
standby, the slots could be from different databases.

> I quickly looked at v8, and have a nit, rest all looks good.
>
> +if (DecodingContextReady(ctx) && found_consistent_snapshot)
> +*found_consistent_snapshot = true;
>
> Can the found_consistent_snapshot be checked first to help avoid the
> function call DecodingContextReady() for pg_replication_slot_advance
> callers?
>

Okay, changed. Additionally, I have updated the comments and commit
message. I'll push this patch after some more testing.

-- 
With Regards,
Amit Kapila.


v9-0001-Ensure-that-the-sync-slots-reach-a-consistent-sta.patch
Description: Binary data


Re: SQL:2011 application time

2024-04-02 Thread Paul Jungwirth

On 3/24/24 00:38, Peter Eisentraut wrote:> I have committed the patches
v33-0001-Add-temporal-FOREIGN-KEYs.patch and v33-0002-Support-multiranges-in-temporal-FKs.patch 
(together).


Hi Hackers,

I found some problems with temporal primary keys and the idea of uniqueness, especially around the 
indisunique column. Here are some small fixes and a proposal for a larger fix, which I think we need 
but I'd like some feedback on.


The first patch fixes problems with ON CONFLICT DO NOTHING/UPDATE.

DO NOTHING fails because it doesn't expect a non-btree unique index. It's fine to make it accept a 
temporal PRIMARY KEY/UNIQUE index though (i.e. an index with both indisunique and indisexclusion).
This is no different than an exclusion constraint. So I skip BuildSpeculativeIndexInfo for WITHOUT 
OVERLAPS indexes. (Incidentally, AFAICT ii_UniqueOps is never used, only ii_UniqueProcs. Right?)


We should still forbid temporally-unique indexes for ON CONFLICT DO UPDATE, since there may be more 
than one row that conflicts. Ideally we would find and update all the conflicting rows, but we don't 
now, and we don't want to update just one:


postgres=# create table t (id int4range, valid_at daterange, name text, constraint tpk primary 
key (id, valid_at without overlaps));

CREATE TABLE
postgres=# insert into t values ('[1,2)', '[2000-01-01,2001-01-01)', 'a'), ('[1,2)', 
'[2001-01-01,2002-01-01)', 'b');

INSERT 0 2
postgres=# insert into t values ('[1,2)', '[2000-01-01,2002-01-01)', 'c') on conflict (id, 
valid_at) do update set name = excluded.name;

INSERT 0 1
postgres=# select * from t;
  id   |valid_at | name
---+-+--
 [1,2) | [2001-01-01,2002-01-01) | b
 [1,2) | [2000-01-01,2001-01-01) | c
(2 rows)

So I also added code to prevent that. This is just preserving the old behavior for exclusion 
constraints, which was bypassed because of indisunique. All this is in the first patch.


That got me thinking about indisunique and where else it could cause problems. Perhaps there are 
other places that assume only b-trees are unique. I couldn't find anywhere that just gives an error 
like ON CONFLICT, but I can imagine more subtle problems.


A temporal PRIMARY KEY or UNIQUE constraint is unique in at least three ways: It is *metaphorically* 
unique: the conceit is that the scalar part is unique at every moment in time. You may have id 5 in 
your table more than once, as long as the records' application times don't overlap.


And it is *officially* unique: the standard calls these constraints unique. I think it is correct 
for us to report them as unique in pg_index.


But is it *literally* unique? Well two identical keys, e.g. (5, '[Jan24,Mar24)') and (5, 
'[Jan24,Mar24)'), do have overlapping ranges, so the second is excluded. Normally a temporal unique 
index is *more* restrictive than a standard one, since it forbids other values too (e.g. (5, 
'[Jan24,Feb24)')). But sadly there is one exception: the ranges in these keys do not overlap: (5, 
'empty'), (5, 'empty'). With ranges/multiranges, `'empty' && x` is false for all x. You can add that 
key as many times as you like, despite a PK/UQ constraint:


postgres=# insert into t values
('[1,2)', 'empty', 'foo'),
('[1,2)', 'empty', 'bar');
INSERT 0 2
postgres=# select * from t;
  id   | valid_at | name
---+--+--
 [1,2) | empty| foo
 [1,2) | empty| bar
(2 rows)

Cases like this shouldn't actually happen for temporal tables, since empty is not a meaningful 
value. An UPDATE/DELETE FOR PORTION OF would never cause an empty. But we should still make sure 
they don't cause problems.
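
To see why the constraint can't reject those rows, note that the overlaps
operator returns false whenever either side is empty (a quick illustration,
assuming ordinary range semantics):

    SELECT 'empty'::daterange && daterange('2000-01-01', '2001-01-01');  -- false
    SELECT 'empty'::daterange && 'empty'::daterange;                     -- false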


One place we should avoid temporally-unique indexes is REPLICA IDENTITY. Fortunately we already do 
that, but patch 2 adds a test to keep it that way.
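
The test presumably just verifies that such an index keeps getting rejected
there, along the lines of (a sketch, reusing the table from above; the exact
error message isn't shown):

    ALTER TABLE t REPLICA IDENTITY USING INDEX tpk;  -- expected to fail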


Uniqueness is an important property to the planner, too.

We consider indisunique often for estimates, where it needn't be 100% true. Even if there are 
nullable columns or a non-indimmediate index, it still gives useful stats. Duplicates from 'empty' 
shouldn't cause any new problems there.


In proof code we must be more careful. Patch 3 updates relation_has_unique_index_ext and 
rel_supports_distinctness to disqualify WITHOUT OVERLAPS indexes. Maybe that's more cautious than 
needed, but better safe than sorry. This patch has no new test though. I had trouble writing SQL 
that was wrong before its change. I'd be happy for help here!


Another problem is GROUP BY and functional dependencies. This is wrong:

postgres=# create table a (id int4range, valid_at daterange, name text, constraint apk primary 
key (id, valid_at without overlaps));

CREATE TABLE
postgres=# insert into a values ('[1,2)', 'empty', 'foo'), ('[1,2)', 
'empty', 'bar');
INSERT 0 2
postgres=# select * from a group by id, valid_at;
  id   | valid_at | name
---+--+--
 [1,2) | 

Re: BUG: deadlock between autovacuum worker and client backend during removal of orphan temp tables with sequences

2024-04-02 Thread Michael Zhilin
Thanks to all for review, testing and commit!!! 


On 2 April 2024 22:04:54 GMT+03:00, Tom Lane  wrote:
>Akshat Jaimini  writes:
>> The code seems to implement the feature and has good and explanatory 
>> comments associated with it.
>> I believe we can go ahead with committing patch although I would request 
>> some senior contributors to also take a look at this patch since I am 
>> relatively new to patch reviews.
>
>Looks like a good catch and a reasonable fix.  Pushed after rewriting
>the comments a bit.
>
>As far as this goes:
>
>> I ran make installcheck-world after applying the patch and recompiling it. 
>> It did fail for a particular test but from the logs it seems to be unrelated 
>> to this particular patch since it fails for the following:
>
>> ==
>> select error_trap_test();
>> -  error_trap_test  
>> 
>> - division_by_zero detected
>> -(1 row)
>> -
>> +ERROR:  cannot start subtransactions during a parallel operation
>
>... that's the test case from 0075d7894, and the failure is what
>I'd expect from a backend older than that.  Maybe you forgot to
>recompile/reinstall after updating past that commit?
>
>   regards, tom lane


Re: Synchronizing slots from primary to standby

2024-04-02 Thread Bharath Rupireddy
On Wed, Apr 3, 2024 at 9:04 AM Amit Kapila  wrote:
>
> > I'd just rename LogicalSlotAdvanceAndCheckSnapState(XLogRecPtr
> > moveto, bool *found_consistent_snapshot) to
> > pg_logical_replication_slot_advance(XLogRecPtr moveto, bool
> > *found_consistent_snapshot) and use it. If others don't like this, I'd
> > at least turn pg_logical_replication_slot_advance(XLogRecPtr moveto) a
> > static inline function.
> >
> Yeah, we can do that but it is not a performance-sensitive routine so
> don't know if it is worth it.

Okay for what the patch has right now. No more bikeshedding from me on this.

> > > The slotsync worker needs to advance the slots from different databases in
> > > fast_forward. So, we need to skip this check in fast_forward mode. The 
> > > analysis can
> > > be found in [3].
> > -if (slot->data.database != MyDatabaseId)
> > +/*
> > + * We need to access the system tables during decoding to build the
> > + * logical changes unless we are in fast_forward mode where no changes 
> > are
> > + * generated.
> > + */
> > +if (slot->data.database != MyDatabaseId && !fast_forward)
> >  ereport(ERROR,
> >
> > It's not clear from the comment that we need it for a slotsync worker
> > to advance the slots from different databases. Can this be put into
> > the comment? Also, specify in the comment, why this is safe?
> >
> It is not specific to slot sync worker but specific to fast_forward
> mode. There is already a comment "We need to access the system tables
> during decoding to build the logical changes unless we are in
> fast_forward mode where no changes are generated." telling why it is
> safe. The point is we need database access to access system tables
> while generating the logical changes and in fast-forward mode, we
> don't generate logical changes so this check is not required. Do let
> me if you have a different understanding or if my understanding is
> incorrect.

Understood. Thanks. Just curious, why isn't this a problem for the existing
fast_forward mode callers pg_replication_slot_advance and
LogicalReplicationSlotHasPendingWal?

I quickly looked at v8, and have a nit, rest all looks good.

+if (DecodingContextReady(ctx) && found_consistent_snapshot)
+*found_consistent_snapshot = true;

Can the found_consistent_snapshot be checked first to help avoid the
function call DecodingContextReady() for pg_replication_slot_advance
callers?

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: archive modules loose ends

2024-04-02 Thread Nathan Bossart
On Tue, Mar 26, 2024 at 02:14:14PM -0500, Nathan Bossart wrote:
> On Mon, Jan 15, 2024 at 08:50:25AM -0600, Nathan Bossart wrote:
>> Thanks for reviewing.  I've marked this as ready-for-committer, and I'm
>> hoping to commit it in the near future.
> 
> This one probably ought to go into v17, but I wanted to do one last call
> for feedback prior to committing.

Committed.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Synchronizing slots from primary to standby

2024-04-02 Thread Amit Kapila
On Tue, Apr 2, 2024 at 7:42 PM Bharath Rupireddy
 wrote:
>
> On Tue, Apr 2, 2024 at 7:25 PM Zhijie Hou (Fujitsu)
>  wrote:
> >
> > > 1. Can we just remove pg_logical_replication_slot_advance and use
> > > LogicalSlotAdvanceAndCheckSnapState instead? If worried about the
> > > function naming, LogicalSlotAdvanceAndCheckSnapState can be renamed to
> > > pg_logical_replication_slot_advance?
> > >
> > > + * Advance our logical replication slot forward. See
> > > + * LogicalSlotAdvanceAndCheckSnapState for details.
> > >   */
> > >  static XLogRecPtr
> > >  pg_logical_replication_slot_advance(XLogRecPtr moveto)
> > >  {
> >
> > It was commented[1] that it's not appropriate for the
> > pg_logical_replication_slot_advance to have an out parameter
> > 'ready_for_decoding' which looks bit awkward as the functionality doesn't 
> > match
> > the name, and is also not consistent with the style of
> > pg_physical_replication_slot_advance(). So, we added a new function.
>
> I disagree here. A new function just for a parameter is not that great
> IMHO.
>

It is not for the parameter but primarily for the functionality it
provides. The additional functionality of whether we reached a
consistent point while advancing the slot doesn't sound to suit the
current function. Also, we want to keep the signature similar to the
existing function pg_physical_replication_slot_advance().

>
> I'd just rename LogicalSlotAdvanceAndCheckSnapState(XLogRecPtr
> moveto, bool *found_consistent_snapshot) to
> pg_logical_replication_slot_advance(XLogRecPtr moveto, bool
> *found_consistent_snapshot) and use it. If others don't like this, I'd
> at least turn pg_logical_replication_slot_advance(XLogRecPtr moveto) a
> static inline function.
>

Yeah, we can do that but it is not a performance-sensitive routine so
don't know if it is worth it.

> > > 5. As far as the test case for this issue is concerned, I'm fine with
> > > adding one using an INJECTION point because we seem to be having no
> > > consistent way to control postgres writing current snapshot to WAL.
> >
> > Since me and my colleagues can reproduce the issue consistently after 
> > applying
> > 0002 and it could be rare for concurrent xl_running_xacts to happen, we 
> > discussed[2] to
> > consider adding the INJECTION point after pushing the main fix.
>
> Right.
>
> > > 7.
> > > +/*
> > > + * We need to access the system tables during decoding to build the
> > > + * logical changes unless we are in fast-forward mode where no 
> > > changes
> > > are
> > > + * generated.
> > > + */
> > > +if (slot->data.database != MyDatabaseId && !fast_forward)
> > >
> > > May I know if we need this change for this fix?
> >
> > The slotsync worker needs to advance the slots from different databases in
> > fast_forward. So, we need to skip this check in fast_forward mode. The 
> > analysis can
> > be found in [3].
> -if (slot->data.database != MyDatabaseId)
> +/*
> + * We need to access the system tables during decoding to build the
> + * logical changes unless we are in fast_forward mode where no changes 
> are
> + * generated.
> + */
> +if (slot->data.database != MyDatabaseId && !fast_forward)
>  ereport(ERROR,
>
> It's not clear from the comment that we need it for a slotsync worker
> to advance the slots from different databases. Can this be put into
> the comment? Also, specify in the comment, why this is safe?
>

It is not specific to slot sync worker but specific to fast_forward
mode. There is already a comment "We need to access the system tables
during decoding to build the logical changes unless we are in
fast_forward mode where no changes are generated." telling why it is
safe. The point is we need database access to access system tables
while generating the logical changes and in fast-forward mode, we
don't generate logical changes so this check is not required. Do let
me if you have a different understanding or if my understanding is
incorrect.

-- 
With Regards,
Amit Kapila.




Re: remaining sql/json patches

2024-04-02 Thread jian he
On Tue, Apr 2, 2024 at 9:57 PM Amit Langote  wrote:
>
> Please let me know if you have further comments on 0001.  I'd like to
> get that in before spending more energy on 0002.
>

hi. some issues with the doc.
I think some of the "path expression" occurrences can be replaced by
"path_expression",
maybe not all of them.

+  
+   
+
+ context_item,
path_expression 
AS json_path_name
  PASSING {
value AS
varname } ,
...
+
+
+
+ The input data to query, the JSON path expression defining the query,
+ and an optional PASSING clause, which can provide data
+ values to the path_expression.
+ The result of the input data
+ evaluation is called the row pattern. The row
+ pattern is used as the source for row values in the constructed view.
+
+
+   

maybe
change this part "The input data to query, the JSON path expression
defining the query,"
to
`
context_item is the input data to query,
path_expression is the JSON path expression
defining the query,
`

+
+ Specifying FORMAT JSON makes it explcit that you
+ expect that the value to be a valid json object.
+
"explcit" change to "explicit", or should it be "explicitly"?
Also, FORMAT JSON can be overridden by OMIT QUOTES.
SELECT sub.* FROM JSON_TABLE('{"a":{"z1": "a"}}', '$.a' COLUMNS(xx
TEXT format json path '$.z1' omit quotes))sub;
it returns the literal 'a' without double quotes, which cannot be valid json.

create or replace FUNCTION test_format_json() returns table (thetype
text, is_ok bool) AS $$
declare
part1_sql text := $sql$SELECT sub.* FROM JSON_TABLE('{"a":{"z1":
"a"}}', '$.a'  COLUMNS(xx $sql$;
part2_sql text := $sql$ format json path '$.z1' omit quotes))sub $sql$;
run_status bool := true;
r record;
fin record;
BEGIN
for r in
select format_type(oid, -1) as aa
from pg_type where  typtype = 'b' and typarray != 0 and
typnamespace = 11 and typnotnull is false
loop
begin
-- raise notice '%',CONCAT_WS(' ', part1_sql, r.aa, part2_sql);
-- raise notice 'r.aa %', r.aa;
run_status := true;
execute CONCAT_WS(' ', part1_sql, r.aa, part2_sql) into fin;
return query select r.aa, run_status;
exception when others then
begin
run_status := false;
return query select r.aa, run_status;
end;
end;
end loop;
END;
$$ language plpgsql;
create table sss_1 as select * from test_format_json();
select * from sss_1 where is_ok is true;

Using the above query, I've figured out that FORMAT JSON can be applied to the
following types:
bytea
name
text
json
bpchar
character varying
jsonb
and these type's customized domain type.

overall, the idea is that:
 Specifying FORMAT JSON makes it explicit that you
 expect the value to be a valid json object.
FORMAT JSON can be overridden by OMIT QUOTES
specification, which can make the return value not a valid
json.
FORMAT JSON can only work with certain kinds of
data types.
---
+
+ Optionally, you can add ON ERROR clause to define
+ error behavior.
+
I think "error behavior" may refer to "what kind of error message it will omit"
but here, it's about what to do when an error happens.
so I guess it's misleading.

maybe we can explain it similarly to json_exists.
+
+ Optionally, you can add ON ERROR clause to define
+  the behavior if an error occurs.
+

+
+ The specified type should have a cast from the
+ boolean.
+
should be
+
+ The specified type should have a cast from the
+ boolean.
+


+
+ Inserts a SQL/JSON value into the output row.
+
maybe
+
+ Inserts a value whose data type is
type into the output row.
+

+
+ Inserts a boolean item into each output row.
+
maybe changed to:
+
+ Inserts a value whose data type is
type into the output row.
+

"name type EXISTS" branch mentioned: "The specified type should have a
cast from the boolean."
but "name type [FORMAT JSON [ENCODING UTF8]] [ PATH path_expression ]"
never mentioned the "type" parameter.
maybe add one para, something like:
"after apply path_expression, the yield value cannot be coerce to
type it will return null"




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-04-02 Thread shveta malik
On Tue, Apr 2, 2024 at 11:58 AM Bertrand Drouvot
 wrote:
>
> Hi,
>
> On Tue, Apr 02, 2024 at 12:07:54PM +0900, Masahiko Sawada wrote:
> > On Mon, Apr 1, 2024 at 12:18 PM Bharath Rupireddy
> >
> > FWIW, coming to this thread late, I think that the inactive_since
> > should not be synchronized from the primary. The wall clocks are
> > different on the primary and the standby so having the primary's
> > timestamp on the standby can confuse users, especially when there is a
> > big clock drift. Also, as Amit mentioned, inactive_since seems not to
> > be consistent with other parameters we copy. The
> > replication_slot_inactive_timeout feature should work on the standby
> > independent from the primary, like other slot invalidation mechanisms,
> > and it should be based on its own local clock.
>
> Thanks for sharing your thoughts! So, it looks like that most of us agree to 
> not
> sync inactive_since from the primary, I'm fine with that.

+1 on not syncing slots from primary.

> > If we want to invalidate the synced slots due to the timeout, I think
> > we need to define what is "inactive" for synced slots.
> >
> > Suppose that the slotsync worker updates the local (synced) slot's
> > inactive_since whenever releasing the slot, irrespective of the actual
> > LSNs (or other slot parameters) having been updated. I think that this
> > idea cannot handle a slot that is not acquired on the primary. In this
> > case, the remote slot is inactive but the local slot is regarded as
> > active.  WAL files are piled up on the standby (and on the primary) as
> > the slot's LSNs don't move forward. I think we want to regard such a
> > slot as "inactive" also on the standby and invalidate it because of
> > the timeout.
>
> I think that makes sense to somehow link inactive_since on the standby to
> the actual LSNs (or other slot parameters) being updated or not.
>
> > > > Now, the other concern is that calling GetCurrentTimestamp()
> > > > could be costly when the values for the slot are not going to be
> > > > updated but if that happens we can optimize such that before acquiring
> > > > the slot we can have some minimal pre-checks to ensure whether we need
> > > > to update the slot or not.
> >
> > If we use such pre-checks, another problem might happen; it cannot
> > handle a case where the slot is acquired on the primary but its LSNs
> > don't move forward. Imagine a logical replication conflict happened on
> > the subscriber, and the logical replication enters the retry loop. In
> > this case, the remote slot's inactive_since gets updated for every
> > retry, but it looks inactive from the standby since the slot LSNs
> > don't change. Therefore, only the local slot could be invalidated due
> > to the timeout but probably we don't want to regard such a slot as
> > "inactive".
> >
> > Another idea I came up with is that the slotsync worker updates the
> > local slot's inactive_since to the local timestamp only when the
> > remote slot might have got inactive. If the remote slot is acquired by
> > someone, the local slot's inactive_since is also NULL. If the remote
> > slot gets inactive, the slotsync worker sets the local timestamp to
> > the local slot's inactive_since. Since the remote slot could be
> > acquired and released before the slotsync worker gets the remote slot
> > data again, if the remote slot's inactive_since > the local slot's
> > inactive_since, the slotsync worker updates the local one.
>
> Then I think we would need to be careful about time zone comparison.

Yes. Also we need to consider the case when a user is relying on
pg_sync_replication_slots() and has not enabled the slot-sync worker. In
such a case, if the synced slot's inactive_since is derived from the
inactivity of the remote slot, it might not be updated very frequently
(it depends on when the user actually runs the SQL function) and thus may
be misleading. OTOH, if inactive_since of synced slots represents their
own inactivity, it will give correct info even when the SQL function is
run after a long time and the slot-sync worker is disabled.

> > IOW, we
> > detect whether the remote slot was acquired and released since the
> > last synchronization, by checking the remote slot's inactive_since.
> > This idea seems to handle these cases I mentioned unless I'm missing
> > something, but it requires for the slotsync worker to update
> > inactive_since in a different way than other parameters.
> >
> > Or a simple solution is that the slotsync worker updates
> > inactive_since as it does for non-synced slots, and disables
> > timeout-based slot invalidation for synced slots.

I like this idea better; it also takes care of the case when the user is
relying on the sync function rather than the worker and does not want the
slots to get invalidated between two sync function calls.

> Yeah, I think the main question to help us decide is: do we want to invalidate
> "inactive" synced slots locally (in addition to synchronizing the invalidation
> from the primary)?


Re: Popcount optimization using AVX512

2024-04-02 Thread Nathan Bossart
On Tue, Apr 02, 2024 at 05:20:20PM -0500, Nathan Bossart wrote:
> Sorry for the noise.  I noticed a couple of silly mistakes immediately
> after sending v21.

Sigh...  I missed a line while rebasing these patches, which seems to have
grossly offended cfbot.  Apologies again for the noise.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From bfe2b3158378fd822c17fb251178df7557065cfd Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Tue, 2 Apr 2024 15:54:49 -0500
Subject: [PATCH v23 1/3] inline pg_popcount for small numbers of bytes

---
 src/include/port/pg_bitutils.h | 34 --
 src/port/pg_bitutils.c | 12 ++--
 2 files changed, 38 insertions(+), 8 deletions(-)

diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index 53e5239717..1f487a4bc3 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -302,16 +302,46 @@ pg_ceil_log2_64(uint64 num)
 /* Attempt to use the POPCNT instruction, but perform a runtime check first */
 extern PGDLLIMPORT int (*pg_popcount32) (uint32 word);
 extern PGDLLIMPORT int (*pg_popcount64) (uint64 word);
-extern PGDLLIMPORT uint64 (*pg_popcount) (const char *buf, int bytes);
+extern PGDLLIMPORT uint64 (*pg_popcount_optimized) (const char *buf, int bytes);
 
 #else
 /* Use a portable implementation -- no need for a function pointer. */
 extern int	pg_popcount32(uint32 word);
 extern int	pg_popcount64(uint64 word);
-extern uint64 pg_popcount(const char *buf, int bytes);
+extern uint64 pg_popcount_optimized(const char *buf, int bytes);
 
 #endif			/* TRY_POPCNT_FAST */
 
+/*
+ * Returns the number of 1-bits in buf.
+ *
+ * If there aren't many bytes to process, the function call overhead of the
+ * optimized versions isn't worth taking, so we inline a loop that consults
+ * pg_number_of_ones in that case.  If there are many bytes to process, we
+ * accept the function call overhead because the optimized versions are likely
+ * to be faster.
+ */
+static inline uint64
+pg_popcount(const char *buf, int bytes)
+{
+	/*
+	 * We use 8 bytes as the threshold because that's where we'll first use
+	 * special instructions on 64-bit systems.  A threshold of 4 bytes might
+	 * make more sense on 32-bit systems, but it seems unlikely to make a
+	 * tremendous difference.
+	 */
+	if (bytes < 8)
+	{
+		uint64		popcnt = 0;
+
+		while (bytes--)
+			popcnt += pg_number_of_ones[(unsigned char) *buf++];
+		return popcnt;
+	}
+
+	return pg_popcount_optimized(buf, bytes);
+}
+
 /*
  * Rotate the bits of "word" to the right/left by n bits.
  */
diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c
index 28312f3dd9..6271acea60 100644
--- a/src/port/pg_bitutils.c
+++ b/src/port/pg_bitutils.c
@@ -118,7 +118,7 @@ static uint64 pg_popcount_fast(const char *buf, int bytes);
 
 int			(*pg_popcount32) (uint32 word) = pg_popcount32_choose;
 int			(*pg_popcount64) (uint64 word) = pg_popcount64_choose;
-uint64		(*pg_popcount) (const char *buf, int bytes) = pg_popcount_choose;
+uint64		(*pg_popcount_optimized) (const char *buf, int bytes) = pg_popcount_choose;
 #endif			/* TRY_POPCNT_FAST */
 
 #ifdef TRY_POPCNT_FAST
@@ -155,13 +155,13 @@ choose_popcount_functions(void)
 	{
 		pg_popcount32 = pg_popcount32_fast;
 		pg_popcount64 = pg_popcount64_fast;
-		pg_popcount = pg_popcount_fast;
+		pg_popcount_optimized = pg_popcount_fast;
 	}
 	else
 	{
 		pg_popcount32 = pg_popcount32_slow;
 		pg_popcount64 = pg_popcount64_slow;
-		pg_popcount = pg_popcount_slow;
+		pg_popcount_optimized = pg_popcount_slow;
 	}
 }
 
@@ -183,7 +183,7 @@ static uint64
 pg_popcount_choose(const char *buf, int bytes)
 {
 	choose_popcount_functions();
-	return pg_popcount(buf, bytes);
+	return pg_popcount_optimized(buf, bytes);
 }
 
 /*
@@ -387,11 +387,11 @@ pg_popcount64(uint64 word)
 }
 
 /*
- * pg_popcount
+ * pg_popcount_optimized
  *		Returns the number of 1-bits in buf
  */
 uint64
-pg_popcount(const char *buf, int bytes)
+pg_popcount_optimized(const char *buf, int bytes)
 {
 	return pg_popcount_slow(buf, bytes);
 }
-- 
2.25.1

>From da744d0614021cf002e4d9e292e5c874bd81a84e Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Wed, 27 Mar 2024 16:39:24 -0500
Subject: [PATCH v23 2/3] AVX512 popcount support

---
 config/c-compiler.m4 |  58 ++
 configure| 252 +++
 configure.ac |  51 ++
 meson.build  |  87 +
 src/Makefile.global.in   |   5 +
 src/include/pg_config.h.in   |  12 ++
 src/include/port/pg_bitutils.h   |  15 ++
 src/makefiles/meson.build|   4 +-
 src/port/Makefile|  11 ++
 src/port/meson.build |   6 +-
 src/port/pg_bitutils.c   |  29 ++-
 src/port/pg_popcount_avx512.c|  49 ++
 src/port/pg_popcount_avx512_choose.c |  71 
 src/test/regress/expected/bit.out|  24 +++
 

Re: using extended statistics to improve join estimates

2024-04-02 Thread Andy Fan


Tomas Vondra  writes:

> On 4/2/24 10:23, Andy Fan wrote:
>> 
>>> On Wed, Mar 02, 2022 at 11:38:21AM -0600, Justin Pryzby wrote:
 Rebased over 269b532ae and muted compiler warnings.
>> 
>> Thank you Justin for the rebase!
>> 
>> Hello Tomas,
>> 
>> Thanks for the patch! Before I review the path at the code level, I want
>> to explain my understanding about this patch first.
>> 
>
> If you want to work on this patch, that'd be cool. A review would be
> great, but if you want to maybe take over and try moving it forward,
> that'd be even better. I don't know when I'll have time to work on it
> again, but I'd promise to help you with working on it.

OK, I'll try to move it forward.

>
>> Before this patch, we already use MCV information for the eqjoinsel, it
>> works as combine the MCV on the both sides to figure out the mcv_freq
>> and then treat the rest equally, but this doesn't work for MCV in
>> extended statistics, this patch fill this gap. Besides that, since
>> extended statistics means more than 1 columns are involved, if 1+
>> columns are Const based on RestrictInfo, we can use such information to
>> filter the MCVs we are interesting, that's really cool. 
>> 
>
> Yes, I think that's an accurate description of what the patch does.

Great to know that:)

>
>> I did some more testing, all of them are inner join so far, all of them
>> works amazing and I am suprised this patch didn't draw enough
>> attention.

> I think it didn't go forward for a bunch of reasons:
>
..
>
> 3) Uncertainty about how applicable the patch is in practice.
>
> I suppose it was some combination of these reasons, not sure.
>
> As for the "practicality" mentioned in (3), it's been a while since I
> worked on the patch so I don't recall the details, but I think I've been
> thinking mostly about "star join" queries, where a big "fact" table
> joins to small dimensions. And in that case the fact table may have a
> MCV, but the dimensions certainly don't have any (because the join
> happens on a PK).
>
> But maybe that's a wrong way to think about it - it was clearly useful
> to consider the case with (per-attribute) MCVs on both sides as worth
> special handling. So why not to do that for multi-column MCVs, right?

Yes, that's what my current understanding is.

There are some cases where there are 2+ clauses between two tables AND
the row estimation is bad AND the plan is not the best one. In such
situations, I'd expect this patch to be helpful. The current case in hand
is on PG11, where there is no MCV information for extended statistics, so
I can't even verify manually whether the patch here would be useful. When
I see such a case next time on a newer version of PG, I can verify it
manually to see if the row estimation can be better.
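
For reference, the kind of case I have in mind looks roughly like this (a
sketch; the tables, columns, and statistics names are made up):

    CREATE STATISTICS t1_ab (mcv) ON a, b FROM t1;
    CREATE STATISTICS t2_ab (mcv) ON a, b FROM t2;
    ANALYZE t1, t2;
    EXPLAIN SELECT * FROM t1 JOIN t2 ON t1.a = t2.a AND t1.b = t2.b;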

>> At for the code level, I reviewed them in the top-down manner and almost
>> 40% completed. Here are some findings just FYI. For efficiency purpose,
>> I provide each feedback with a individual commit, after all I want to
>> make sure my comment is practical and coding and testing is a good way
>> to archive that. I tried to make each of them as small as possible so
>> that you can reject or accept them convinently.
>> 
>> 0001 is your patch, I just rebase them against the current master. 0006
>> is not much relevant with current patch, and I think it can be committed
>> individually if you are OK with that.
>> 
>> Hope this kind of review is helpful.
>> 
>
> Cool! There's obviously no chance to get this into v18, and I have stuff
> to do in this CF. But I'll take a look after that.

Good to know that. I will continue my work before that. 

-- 
Best Regards
Andy Fan





Re: Commitfest Manager for March

2024-04-02 Thread Sutou Kouhei
Hi,

In 
  "Re: Commitfest Manager for March" on Mon, 1 Apr 2024 11:57:27 +0300,
  Heikki Linnakangas  wrote:

> On 01/04/2024 09:05, Andrey M. Borodin wrote:
>> I think it makes sense to close this commitfest only after Feature
>> Freeze on April 8, 2024 at 0:00 AoE.
>> What do you think?
> 
> +1. IIRC that's how it's been done in last commitfest in previous
> years too.

Thanks for extending the period.

Could someone review my patches for this commitfest?

1. https://commitfest.postgresql.org/47/4681/
   Make COPY format extendable: Extract COPY TO format implementations
   Status: Needs review

2. https://commitfest.postgresql.org/47/4791/
   meson: Specify -Wformat as a common warning flag for extensions
   Status: Ready for Committer


Thanks,
-- 
kou




Re: Streaming I/O, vectored I/O (WIP)

2024-04-02 Thread Thomas Munro
On Tue, Apr 2, 2024 at 9:39 PM Thomas Munro  wrote:
> So this is the version I'm going to commit shortly, barring objections.

And done, after fixing a small snafu with smgr-only reads coming from
CreateAndCopyRelationData() (BM_PERMANENT would be
incorrectly/unnecessarily set for unlogged tables).

Here are the remaining patches discussed in this thread.  They give
tablespace-specific io_combine_limit, effective_io_readahead_window
(is this useful?), and up-to-1MB io_combine_limit (is this useful?).
I think the first two would probably require teaching reloption.c how
to use guc.c's parse_int() and unit flags, but I won't have time to
look at that for this release so I'll just leave these here.
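
For the record, the per-tablespace knob from the 0001 patch below would
presumably be used like the existing per-tablespace I/O settings, e.g. (a
sketch; the tablespace name is made up, and the value is in blocks since
reloptions.c doesn't accept units yet):

    ALTER TABLESPACE fast_ssd SET (io_combine_limit = 16);
    ALTER TABLESPACE fast_ssd RESET (io_combine_limit);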

On the subject of guc.c, this is a terrible error message... did I do
something wrong?

postgres=# set io_combine_limit = '42MB';
ERROR:  5376 8kB is outside the valid range for parameter
"io_combine_limit" (1 .. 32)
From 84b8280481312cdd1efcb7efa1182d4647cbe00a Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Sat, 30 Mar 2024 19:09:44 +1300
Subject: [PATCH v16 1/4] ALTER TABLESPACE ... SET (io_combine_limit = ...).

This is the per-tablespace version of the GUC of the same name.

XXX reloptions.c lacks the ability to accept units eg '64kB'!  Which is
why I haven't included it with the main feature commit.

Suggested-by: Tomas Vondra
Discussion: https://postgr.es/m/f603ac51-a7ff-496a-99c1-76673635692e%40enterprisedb.com
---
 doc/src/sgml/ref/alter_tablespace.sgml |  9 +++---
 src/backend/access/common/reloptions.c | 12 +++-
 src/backend/storage/aio/read_stream.c  | 39 --
 src/backend/utils/cache/spccache.c | 14 +
 src/bin/psql/tab-complete.c|  3 +-
 src/include/commands/tablespace.h  |  1 +
 src/include/utils/spccache.h   |  1 +
 7 files changed, 58 insertions(+), 21 deletions(-)

diff --git a/doc/src/sgml/ref/alter_tablespace.sgml b/doc/src/sgml/ref/alter_tablespace.sgml
index 6ec863400d1..faf0c6e7fbc 100644
--- a/doc/src/sgml/ref/alter_tablespace.sgml
+++ b/doc/src/sgml/ref/alter_tablespace.sgml
@@ -84,16 +84,17 @@ ALTER TABLESPACE name RESET ( ,
   ,
   ,
-  ).  This may be useful if
+  ),
+  ).  This may be useful if
   one tablespace is located on a disk which is faster or slower than the
   remainder of the I/O subsystem.
  
diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index d6eb5d85599..1e1c611fab2 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -371,6 +371,15 @@ static relopt_int intRelOpts[] =
 		0, 0, 0
 #endif
 	},
+	{
+		{
+			"io_combine_limit",
+			"Limit on the size of data reads and writes.",
+			RELOPT_KIND_TABLESPACE,
+			ShareUpdateExclusiveLock
+		},
+		-1, 1, MAX_IO_COMBINE_LIMIT
+	},
 	{
 		{
 			"parallel_workers",
@@ -2089,7 +2098,8 @@ tablespace_reloptions(Datum reloptions, bool validate)
 		{"random_page_cost", RELOPT_TYPE_REAL, offsetof(TableSpaceOpts, random_page_cost)},
 		{"seq_page_cost", RELOPT_TYPE_REAL, offsetof(TableSpaceOpts, seq_page_cost)},
 		{"effective_io_concurrency", RELOPT_TYPE_INT, offsetof(TableSpaceOpts, effective_io_concurrency)},
-		{"maintenance_io_concurrency", RELOPT_TYPE_INT, offsetof(TableSpaceOpts, maintenance_io_concurrency)}
+		{"maintenance_io_concurrency", RELOPT_TYPE_INT, offsetof(TableSpaceOpts, maintenance_io_concurrency)},
+		{"io_combine_limit", RELOPT_TYPE_INT, offsetof(TableSpaceOpts, io_combine_limit)},
 	};
 
 	return (bytea *) build_reloptions(reloptions, validate,
diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
index 4f21262ff5e..907c80e6bf9 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -114,6 +114,7 @@ struct ReadStream
 	int16		max_pinned_buffers;
 	int16		pinned_buffers;
 	int16		distance;
+	int16		io_combine_limit;
 	bool		advice_enabled;
 
 	/*
@@ -241,7 +242,7 @@ read_stream_start_pending_read(ReadStream *stream, bool suppress_advice)
 
 	/* This should only be called with a pending read. */
 	Assert(stream->pending_read_nblocks > 0);
-	Assert(stream->pending_read_nblocks <= io_combine_limit);
+	Assert(stream->pending_read_nblocks <= stream->io_combine_limit);
 
 	/* We had better not exceed the pin limit by starting this read. */
 	Assert(stream->pinned_buffers + stream->pending_read_nblocks <=
@@ -329,7 +330,7 @@ read_stream_look_ahead(ReadStream *stream, bool suppress_advice)
 		int16		buffer_index;
 		void	   *per_buffer_data;
 
-		if (stream->pending_read_nblocks == io_combine_limit)
+		if (stream->pending_read_nblocks == stream->io_combine_limit)
 		{
 			read_stream_start_pending_read(stream, suppress_advice);
 			suppress_advice = false;
@@ -389,7 +390,7 @@ read_stream_look_ahead(ReadStream *stream, bool suppress_advice)
 	 * signaled end-of-stream, we start the read immediately.
 	 */
 	if (stream->pending_read_nblocks > 0 &&
-		

Re: Building with musl in CI and the build farm

2024-04-02 Thread Thomas Munro
On Wed, Mar 27, 2024 at 11:27 AM Wolfgang Walther
 wrote:
> The animal runs in a docker container via GitHub Actions in [2].

Great idea :-)




Re: Statistics Import and Export

2024-04-02 Thread Corey Huinker
>
> Yeah, but that problem exists no matter what.  I haven't read enough
> of the patch to find where it's determining that, but I assume there's
> code in there to intuit the statistics storage type depending on the
> table column's data type and the statistics kind.
>

Correct. It borrows a lot from examine_attribute() and the *_typanalyze()
functions. Actually using VacAttrStats proved problematic, but that can be
revisited at some point.


> We could not trust the exporting side to tell us the correct answer;
> for one reason, it might be different across different releases.
> So "derive it reliably on the destination" is really the only option.
>

+1


> I think that it's impossible to do this in the general case, since
> type-specific typanalyze functions can store pretty nearly whatever
> they like.  However, the pg_stats view isn't going to show nonstandard
> statistics kinds anyway, so we are going to be lossy for custom
> statistics kinds.
>

Sadly true.


Re: Popcount optimization using AVX512

2024-04-02 Thread Nathan Bossart
On Tue, Apr 02, 2024 at 05:01:32PM -0500, Nathan Bossart wrote:
> In v21, 0001 is just the above inlining idea, which seems worth doing
> independent of $SUBJECT.  0002 and 0003 are the AVX-512 patches, which I've
> modified similarly to 0001, i.e., I've inlined the "fast" version in the
> function pointer to avoid the function call overhead when there are fewer
> than 64 bytes.  All of this overhead juggling should result in choosing the
> optimal popcount implementation depending on how many bytes there are to
> process, roughly speaking.

Sorry for the noise.  I noticed a couple of silly mistakes immediately
after sending v21.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From cfc5e9fe77f96225ec67a044377b10113c98ce0d Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Tue, 2 Apr 2024 15:54:49 -0500
Subject: [PATCH v22 1/3] inline pg_popcount for small numbers of bytes

---
 src/include/port/pg_bitutils.h | 34 --
 src/port/pg_bitutils.c | 12 ++--
 2 files changed, 38 insertions(+), 8 deletions(-)

diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index 53e5239717..1f487a4bc3 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -302,16 +302,46 @@ pg_ceil_log2_64(uint64 num)
 /* Attempt to use the POPCNT instruction, but perform a runtime check first */
 extern PGDLLIMPORT int (*pg_popcount32) (uint32 word);
 extern PGDLLIMPORT int (*pg_popcount64) (uint64 word);
-extern PGDLLIMPORT uint64 (*pg_popcount) (const char *buf, int bytes);
+extern PGDLLIMPORT uint64 (*pg_popcount_optimized) (const char *buf, int bytes);
 
 #else
 /* Use a portable implementation -- no need for a function pointer. */
 extern int	pg_popcount32(uint32 word);
 extern int	pg_popcount64(uint64 word);
-extern uint64 pg_popcount(const char *buf, int bytes);
+extern uint64 pg_popcount_optimized(const char *buf, int bytes);
 
 #endif			/* TRY_POPCNT_FAST */
 
+/*
+ * Returns the number of 1-bits in buf.
+ *
+ * If there aren't many bytes to process, the function call overhead of the
+ * optimized versions isn't worth taking, so we inline a loop that consults
+ * pg_number_of_ones in that case.  If there are many bytes to process, we
+ * accept the function call overhead because the optimized versions are likely
+ * to be faster.
+ */
+static inline uint64
+pg_popcount(const char *buf, int bytes)
+{
+	/*
+	 * We use 8 bytes as the threshold because that's where we'll first use
+	 * special instructions on 64-bit systems.  A threshold of 4 bytes might
+	 * make more sense on 32-bit systems, but it seems unlikely to make a
+	 * tremendous difference.
+	 */
+	if (bytes < 8)
+	{
+		uint64		popcnt = 0;
+
+		while (bytes--)
+			popcnt += pg_number_of_ones[(unsigned char) *buf++];
+		return popcnt;
+	}
+
+	return pg_popcount_optimized(buf, bytes);
+}
+
 /*
  * Rotate the bits of "word" to the right/left by n bits.
  */
diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c
index 28312f3dd9..6271acea60 100644
--- a/src/port/pg_bitutils.c
+++ b/src/port/pg_bitutils.c
@@ -118,7 +118,7 @@ static uint64 pg_popcount_fast(const char *buf, int bytes);
 
 int			(*pg_popcount32) (uint32 word) = pg_popcount32_choose;
 int			(*pg_popcount64) (uint64 word) = pg_popcount64_choose;
-uint64		(*pg_popcount) (const char *buf, int bytes) = pg_popcount_choose;
+uint64		(*pg_popcount_optimized) (const char *buf, int bytes) = pg_popcount_choose;
 #endif			/* TRY_POPCNT_FAST */
 
 #ifdef TRY_POPCNT_FAST
@@ -155,13 +155,13 @@ choose_popcount_functions(void)
 	{
 		pg_popcount32 = pg_popcount32_fast;
 		pg_popcount64 = pg_popcount64_fast;
-		pg_popcount = pg_popcount_fast;
+		pg_popcount_optimized = pg_popcount_fast;
 	}
 	else
 	{
 		pg_popcount32 = pg_popcount32_slow;
 		pg_popcount64 = pg_popcount64_slow;
-		pg_popcount = pg_popcount_slow;
+		pg_popcount_optimized = pg_popcount_slow;
 	}
 }
 
@@ -183,7 +183,7 @@ static uint64
 pg_popcount_choose(const char *buf, int bytes)
 {
 	choose_popcount_functions();
-	return pg_popcount(buf, bytes);
+	return pg_popcount_optimized(buf, bytes);
 }
 
 /*
@@ -387,11 +387,11 @@ pg_popcount64(uint64 word)
 }
 
 /*
- * pg_popcount
+ * pg_popcount_optimized
  *		Returns the number of 1-bits in buf
  */
 uint64
-pg_popcount(const char *buf, int bytes)
+pg_popcount_optimized(const char *buf, int bytes)
 {
 	return pg_popcount_slow(buf, bytes);
 }
-- 
2.25.1

>From a8024ebcc54b4ac0d3d145ade5d7cd85eb192afc Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Wed, 27 Mar 2024 16:39:24 -0500
Subject: [PATCH v22 2/3] AVX512 popcount support

---
 config/c-compiler.m4 |  58 ++
 configure| 252 +++
 configure.ac |  51 ++
 meson.build  |  87 +
 src/Makefile.global.in   |   5 +
 src/include/pg_config.h.in   |  12 ++
 src/include/port/pg_bitutils.h   |  15 

Re: Statistics Import and Export

2024-04-02 Thread Tom Lane
Jeff Davis  writes:
> We need to get the original element type on the import side somehow,
> right? Otherwise it will be hard to tell whether '{1, 2, 3, 4}' has
> element type "int4" or "text", which affects the binary representation
> of the anyarray value in pg_statistic.

Yeah, but that problem exists no matter what.  I haven't read enough
of the patch to find where it's determining that, but I assume there's
code in there to intuit the statistics storage type depending on the
table column's data type and the statistics kind.

> Either we need to get it at export time (which seems the most reliable
> in principle, but problematic for older versions) and pass it as an
> argument to pg_set_attribute_stats(); or we need to derive it reliably
> from the table schema on the destination side, right?

We could not trust the exporting side to tell us the correct answer;
for one reason, it might be different across different releases.
So "derive it reliably on the destination" is really the only option.

I think that it's impossible to do this in the general case, since
type-specific typanalyze functions can store pretty nearly whatever
they like.  However, the pg_stats view isn't going to show nonstandard
statistics kinds anyway, so we are going to be lossy for custom
statistics kinds.

regards, tom lane




Re: Popcount optimization using AVX512

2024-04-02 Thread Nathan Bossart
On Tue, Apr 02, 2024 at 01:40:21PM -0500, Nathan Bossart wrote:
> On Tue, Apr 02, 2024 at 01:43:48PM -0400, Tom Lane wrote:
>> I don't like the double evaluation of the macro argument.  Seems like
>> you could get the same results more safely with
>> 
>>  static inline uint64
>>  pg_popcount(const char *buf, int bytes)
>>  {
>>  if (bytes < 64)
>>  {
>>  uint64  popcnt = 0;
>> 
>>  while (bytes--)
>>  popcnt += pg_number_of_ones[(unsigned char) 
>> *buf++];
>> 
>>  return popcnt;
>>  }
>>  return pg_popcount_optimized(buf, bytes);
>>  }
> 
> Yeah, I like that better.  I'll do some testing to see what the threshold
> really should be before posting an actual patch.

My testing shows that inlining wins with fewer than 8 bytes for the current
"fast" implementation.  The "fast" implementation wins with fewer than 64
bytes compared to the AVX-512 implementation.  These results are pretty
intuitive because those are the points at which the optimizations kick in.

In v21, 0001 is just the above inlining idea, which seems worth doing
independent of $SUBJECT.  0002 and 0003 are the AVX-512 patches, which I've
modified similarly to 0001, i.e., I've inlined the "fast" version in the
function pointer to avoid the function call overhead when there are fewer
than 64 bytes.  All of this overhead juggling should result in choosing the
optimal popcount implementation depending on how many bytes there are to
process, roughly speaking.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From ce1180d557cbdf8cff33842ea2f1a22ba6676725 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Tue, 2 Apr 2024 15:54:49 -0500
Subject: [PATCH v21 1/3] inline pg_popcount for small numbers of bytes

---
 src/include/port/pg_bitutils.h | 34 --
 src/port/pg_bitutils.c | 10 +-
 2 files changed, 37 insertions(+), 7 deletions(-)

diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index 53e5239717..1f487a4bc3 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -302,16 +302,46 @@ pg_ceil_log2_64(uint64 num)
 /* Attempt to use the POPCNT instruction, but perform a runtime check first */
 extern PGDLLIMPORT int (*pg_popcount32) (uint32 word);
 extern PGDLLIMPORT int (*pg_popcount64) (uint64 word);
-extern PGDLLIMPORT uint64 (*pg_popcount) (const char *buf, int bytes);
+extern PGDLLIMPORT uint64 (*pg_popcount_optimized) (const char *buf, int bytes);
 
 #else
 /* Use a portable implementation -- no need for a function pointer. */
 extern int	pg_popcount32(uint32 word);
 extern int	pg_popcount64(uint64 word);
-extern uint64 pg_popcount(const char *buf, int bytes);
+extern uint64 pg_popcount_optimized(const char *buf, int bytes);
 
 #endif			/* TRY_POPCNT_FAST */
 
+/*
+ * Returns the number of 1-bits in buf.
+ *
+ * If there aren't many bytes to process, the function call overhead of the
+ * optimized versions isn't worth taking, so we inline a loop that consults
+ * pg_number_of_ones in that case.  If there are many bytes to process, we
+ * accept the function call overhead because the optimized versions are likely
+ * to be faster.
+ */
+static inline uint64
+pg_popcount(const char *buf, int bytes)
+{
+	/*
+	 * We use 8 bytes as the threshold because that's where we'll first use
+	 * special instructions on 64-bit systems.  A threshold of 4 bytes might
+	 * make more sense on 32-bit systems, but it seems unlikely to make a
+	 * tremendous difference.
+	 */
+	if (bytes < 8)
+	{
+		uint64		popcnt = 0;
+
+		while (bytes--)
+			popcnt += pg_number_of_ones[(unsigned char) *buf++];
+		return popcnt;
+	}
+
+	return pg_popcount_optimized(buf, bytes);
+}
+
 /*
  * Rotate the bits of "word" to the right/left by n bits.
  */
diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c
index 28312f3dd9..4720f8e419 100644
--- a/src/port/pg_bitutils.c
+++ b/src/port/pg_bitutils.c
@@ -118,7 +118,7 @@ static uint64 pg_popcount_fast(const char *buf, int bytes);
 
 int			(*pg_popcount32) (uint32 word) = pg_popcount32_choose;
 int			(*pg_popcount64) (uint64 word) = pg_popcount64_choose;
-uint64		(*pg_popcount) (const char *buf, int bytes) = pg_popcount_choose;
+uint64		(*pg_popcount_optimized) (const char *buf, int bytes) = pg_popcount_choose;
 #endif			/* TRY_POPCNT_FAST */
 
 #ifdef TRY_POPCNT_FAST
@@ -155,13 +155,13 @@ choose_popcount_functions(void)
 	{
 		pg_popcount32 = pg_popcount32_fast;
 		pg_popcount64 = pg_popcount64_fast;
-		pg_popcount = pg_popcount_fast;
+		pg_popcount_optimized = pg_popcount_fast;
 	}
 	else
 	{
 		pg_popcount32 = pg_popcount32_slow;
 		pg_popcount64 = pg_popcount64_slow;
-		pg_popcount = pg_popcount_slow;
+		pg_popcount_optimized = pg_popcount_slow;
 	}
 }
 
@@ -183,7 +183,7 @@ static uint64
 pg_popcount_choose(const char *buf, int bytes)
 {
 	

Re: Statistics Import and Export

2024-04-02 Thread Jeff Davis
On Tue, 2024-04-02 at 17:31 -0400, Tom Lane wrote:
> And that means that we don't need the sending
> side to know the element type anyway.

We need to get the original element type on the import side somehow,
right? Otherwise it will be hard to tell whether '{1, 2, 3, 4}' has
element type "int4" or "text", which affects the binary representation
of the anyarray value in pg_statistic.

Either we need to get it at export time (which seems the most reliable
in principle, but problematic for older versions) and pass it as an
argument to pg_set_attribute_stats(); or we need to derive it reliably
from the table schema on the destination side, right?

Regards,
Jeff Davis





Re: Statistics Import and Export

2024-04-02 Thread Corey Huinker
>
> side to know the element type anyway.  So, I apologize for sending
> us down a useless side path.  We may as well stick to the function
> signature as shown in the v15 patch --- although maybe variadic
> any is still worthwhile so that an unrecognized field name doesn't
> need to be a hard error?
>

Variadic is nearly done. This issue was the main blocking point. I can go
back to array_in() as we know that code works.


Re: Statistics Import and Export

2024-04-02 Thread Tom Lane
Jeff Davis  writes:
> On the export side, the problem is that the element type (and
> dimensionality and maybe hasnull) is an important part of the anyarray
> value, but it's not part of the output of anyarray_out(). For new
> versions, we can add a scalar function that simply outputs the
> information we need. For old versions, we can hack it by parsing the
> output of anyarray_send(), which contains the information we need
> (binary outputs are under-specified, but I believe they are specified
> enough in this case).

Yeah, I was thinking yesterday about pulling the anyarray columns in
binary and looking at the header fields.  However, I fear there is a
showstopper problem: anyarray_send will fail if the element type
doesn't have a typsend function, which is entirely possible for
user-defined types (and I'm not even sure we've provided them for
every type in the core distro).  I haven't thought of a good answer
to that other than a new backend function.  However ...

> On the import side, the problem is that there may not be an input
> function to go from a 1-D array of text to a 1-D array of any element
> type we want. For example, there's no input function that will create a
> 1-D array with element type float4[] (that's because Postgres doesn't
> really have arrays-of-arrays, it has multi-dimensional arrays).
> Instead, don't use the input function, pass each element of the 1-D
> text array to the element type's input function (which may be scalar or
> not) and then construct a 1-D array out of that with the appropriate
> element type (which may be scalar or not).

Yup.  I had hoped that we could avoid doing any array-munging inside
pg_set_attribute_stats, but this array-of-arrays problem seems to
mean we have to.  In turn, that means that the whole idea of
declaring the function inputs as anyarray rather than text[] is
probably pointless.  And that means that we don't need the sending
side to know the element type anyway.  So, I apologize for sending
us down a useless side path.  We may as well stick to the function
signature as shown in the v15 patch --- although maybe variadic
any is still worthwhile so that an unrecognized field name doesn't
need to be a hard error?

regards, tom lane




Re: Extension for PostgreSQL cast jsonb to hstore WIP

2024-04-02 Thread Andrew Dunstan


On 2024-04-02 Tu 11:43, ShadowGhost wrote:
At the moment, this cast supports only these structures, as it was 
enough for my tasks:

{str:numeric}
{str:str}
{str:bool}
{str:null}
But it's a great idea and I'll think about implementing it.



Please don't top-post on the PostgreSQL lists. See 



I don't think a cast that doesn't cater for all the forms json can take 
is going to work very well. At the very least you would need to error 
out in cases you didn't want to cover, and have tests for all of those 
errors. But the above is only a tiny fraction of those. If the error 
cases are going to be so much more than the cases that work it seems a 
bit pointless.



cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com


Re: Cutting support for OpenSSL 1.0.1 and 1.0.2 in 17~?

2024-04-02 Thread Daniel Gustafsson
> On 2 Apr 2024, at 20:55, Daniel Gustafsson  wrote:

> The attached removes 1.0.2 support (meson build parts untested yet)

And a rebased version which applies over the hmac_openssl.c changes earlier
today that I hadn't pulled in.

--
Daniel Gustafsson



v2-0001-Remove-support-for-OpenSSL-1.0.2-and-1.1.0.patch
Description: Binary data


Re: Statistics Import and Export

2024-04-02 Thread Jeff Davis
On Tue, 2024-04-02 at 12:59 -0400, Corey Huinker wrote:
>  However, some of the ANYARRAYs have element types that are
> themselves arrays, and near as I can tell, such a construct is not
> expressible in SQL. So, rather than getting an anyarray of an array
> type, you instead get an array of one higher dimension.

Fundamentally, you want to recreate the exact same anyarray values on
the destination system as they existed on the source. There's some
complexity to that on both the export side as well as the import side,
but I believe the problems are solvable.

On the export side, the problem is that the element type (and
dimensionality and maybe hasnull) is an important part of the anyarray
value, but it's not part of the output of anyarray_out(). For new
versions, we can add a scalar function that simply outputs the
information we need. For old versions, we can hack it by parsing the
output of anyarray_send(), which contains the information we need
(binary outputs are under-specified, but I believe they are specified
enough in this case). There may be other hacks to get the information
from the older systems; that's just an idea. To get the actual data,
doing histogram_bounds::text::text[] seems to be enough: that seems to
always give a one-dimensional array with element type "text", even if
the element type is an array. (Note: this means we need the function's
API to also include this extra information about the anyarray values,
so it might be slightly more complex than name/value pairs).
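
As a rough sketch of that export-side idea (the byte offsets into
anyarray_send()'s output and the double cast are the same trick shown
elsewhere in this thread; treat the query as an illustration, not a proposed
interface):

    -- the element type OID sits at bytes 9-12 of anyarray_send()'s output
    select substring(substring(anyarray_send(histogram_bounds)
                               from 9 for 4)::text, 2)::bit(32)::integer::regtype as elem_type,
           histogram_bounds::text::text[] as bounds_as_1d_text
    from pg_stats
    where histogram_bounds is not null
    limit 1;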

On the import side, the problem is that there may not be an input
function to go from a 1-D array of text to a 1-D array of any element
type we want. For example, there's no input function that will create a
1-D array with element type float4[] (that's because Postgres doesn't
really have arrays-of-arrays, it has multi-dimensional arrays).
Instead, don't use the input function, pass each element of the 1-D
text array to the element type's input function (which may be scalar or
not) and then construct a 1-D array out of that with the appropriate
element type (which may be scalar or not).
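
To make that recipe concrete, here is a hypothetical C sketch (not taken from
the submitted patches; the function name and its caller are invented for
illustration) that runs each text element through the element type's input
function and then builds a one-dimensional array, whatever the element type
happens to be -- scalar or array:

#include "postgres.h"
#include "fmgr.h"
#include "utils/array.h"
#include "utils/lsyscache.h"

static ArrayType *
rebuild_stats_array(char **elem_strings, int nelems, Oid elemtype)
{
	Oid			typinput;
	Oid			typioparam;
	int16		typlen;
	bool		typbyval;
	char		typalign;
	FmgrInfo	finfo;
	Datum	   *elems = palloc(nelems * sizeof(Datum));

	getTypeInputInfo(elemtype, &typinput, &typioparam);
	fmgr_info(typinput, &finfo);
	get_typlenbyvalalign(elemtype, &typlen, &typbyval, &typalign);

	/* feed each text element to the element type's own input function */
	for (int i = 0; i < nelems; i++)
		elems[i] = InputFunctionCall(&finfo, elem_strings[i], typioparam, -1);

	/* always build a 1-D array, even when elemtype is itself an array type */
	return construct_array(elems, nelems, elemtype, typlen, typbyval, typalign);
}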

Regards,
Jeff Davis





Re: Fix out-of-bounds in the function PQescapeinternal (src/interfaces/libpq/fe-exec.c)

2024-04-02 Thread Tom Lane
Ranier Vilela  writes:
> While I working in [1], Coverity reported some errors:
> src/bin/pg_basebackup/pg_createsubscriber.c
> CID 1542690: (#1 of 2): Out-of-bounds access (OVERRUN)
> alloc_strlen: Allocating insufficient memory for the terminating null of
> the string. [Note: The source code implementation of the function has been
> overridden by a builtin model.]
> CID 1542690: (#2 of 2): Out-of-bounds access (OVERRUN)
> alloc_strlen: Allocating insufficient memory for the terminating null of
> the string. [Note: The source code implementation of the function has been
> overridden by a builtin model.]

Yeah, we saw that in the community run too.  I'm tempted to call it
an AI hallucination.  The "Note" seems to mean that they're not
actually analyzing our code but some made-up substitute.

> The source of errors is the function PQescapeInternal.
> The slow path has bugs when num_quotes or num_backslashes are greater than
> zero.
> For each num_quotes or num_backslashes we need to allocate two more.

Nonsense.  The quote or backslash is already counted in input_len,
so we need to add just one more.

If there were anything wrong here, I'm quite sure our testing under
e.g. valgrind would have caught it years ago.  However, just to be
sure, I tried adding an Assert that the allocated space is filled
exactly, as attached.  It gets through check-world just fine.

regards, tom lane

diff --git a/src/interfaces/libpq/fe-exec.c b/src/interfaces/libpq/fe-exec.c
index c02a9180b2..43a4ce0458 100644
--- a/src/interfaces/libpq/fe-exec.c
+++ b/src/interfaces/libpq/fe-exec.c
@@ -4255,7 +4255,9 @@ PQescapeInternal(PGconn *conn, const char *str, size_t len, bool as_ident)
 
 	/* Closing quote and terminating NUL. */
 	*rp++ = quote_char;
-	*rp = '\0';
+	*rp++ = '\0';
+
+	Assert(rp == result + result_size);
 
 	return result;
 }


Re: WIP Incremental JSON Parser

2024-04-02 Thread Andrew Dunstan



On 2024-04-02 Tu 15:38, Jacob Champion wrote:

On Mon, Apr 1, 2024 at 4:53 PM Andrew Dunstan  wrote:

Anyway, here are new patches. I've rolled the new semantic test into the
first patch.

Looks good! I've marked RfC.



Thanks! I appreciate all the work you've done on this. I will give it 
one more pass and commit RSN.





json_lex() is not really a very hot piece of code.

Sure, but I figure if someone is trying to get the performance of the
incremental parser to match the recursive one, so we can eventually
replace it, it might get a little warmer. :)


I don't think this is where the performance penalty lies. Rather, I 
suspect it's the stack operations in the non-recursive parser itself. 
The speed test doesn't involve any partial token processing at all, and 
yet the non-recursive parser is significantly slower in that test.



I think it'd be good for a v1.x of this feature to focus on
simplification of the code, and hopefully consolidate and/or eliminate
some of the duplicated parsing work so that the mental model isn't
quite so big.

I'm not sure how you think that can be done.

I think we'd need to teach the lower levels of the lexer about
incremental parsing too, so that we don't have two separate sources of
truth about what ends a token. Bonus points if we could keep the parse
state across chunks to the extent that we didn't need to restart at
the beginning of the token every time. (Our current tools for this are
kind of poor, like the restartable state machine in PQconnectPoll.
While I'm dreaming, I'd like coroutines.) Now, whether the end result
would be more or less maintainable is left as an exercise...



I tried to disturb the main lexer processing as little as possible. We 
could possibly unify the two paths, but I have a strong suspicion that 
would result in a performance hit (the main part of the lexer doesn't 
copy any characters at all, it just keeps track of pointers into the 
input). And while the code might not undergo lots of change, the routine 
itself is quite performance critical.


Anyway, I think that's all something for another day.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: psql's FETCH_COUNT (cursor) is not being respected for CTEs

2024-04-02 Thread Daniel Verite
Tom Lane wrote:

> > I should say that I've noticed significant latency improvements with
> > FETCH_COUNT retrieving large resultsets, such that it would benefit
> > non-interactive use cases.
> 
> Do you have a theory for why that is?  It's pretty counterintuitive
> that it would help at all.

I've been thinking that it's a kind of pipeline/parallelism effect.
When libpq accumulates all rows in one resultset, if the network
or the server are not fast enough, it spends a certain amount of
time waiting for the data to come in.
But when it accumulates fewer rows and gives back control
to the app to display intermediate results, during that time the
network buffers can fill in, resulting, I assume, in less time waiting
overall.

I think the benefit is similar to what we get with \copy. In fact
with the  above-mentioned test, the execution times with
FETCH_COUNT=1000 look very close to \copy of the same query.


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite




Re: Combine Prune and Freeze records emitted by vacuum

2024-04-02 Thread Melanie Plageman
On Tue, Apr 02, 2024 at 01:24:27PM -0400, Melanie Plageman wrote:
> On Tue, Apr 2, 2024 at 9:11 AM Heikki Linnakangas  wrote:
> >
> > On 01/04/2024 20:22, Melanie Plageman wrote:
> > > Review for 0003-0006 (I didn't have any new thoughts on 0002). I know
> > > you didn't modify them much/at all, but I noticed some things in my code
> > > that could be better.
> >
> > Ok, here's what I have now. I made a lot of small comment changes here
> > and there, and some minor local refactorings, but nothing major. I lost
> > track of all the individual changes I'm afraid, so I'm afraid you'll
> > have to just diff against the previous version if you want to see what's
> > changed. I hope I didn't break anything.
> >
> > I'm pretty happy with this now. I will skim through it one more time
> > later today or tomorrow, and commit. Please review once more if you have
> > a chance.
> 
> Thanks!
> 
> 0001 looks good. Attached are some comment updates and such on top of
> 0001 and 0002.
> 
> I started some performance testing of 0002 but haven't finished yet. I
> wanted to provide my other review first.

I tried to do some performance tests of just on-access HOT pruning with
the patches in this thread applied. I'm not sure if I succeeded in being
targeted enough to have usable results.

Off-list Andres gave me some suggestions of how to improve my test case
and setup and this is what I ended up doing:


On-access pruning during a SELECT query:


# Make a table with a single not NULL column of a small datatype to fit
# as many tuples as possible on the page so each page we prune exercises
# those loops in heap_page_prune_and_freeze() and heap_prune_chain() as
# much as possible

psql -c "create table small(col smallint not null)"

# Insert data that is the same except for ~1 row per page with a different value

for i in $(seq 1000)
do
  psql -c "INSERT INTO small VALUES(2);" -c "INSERT INTO small SELECT 1 FROM 
(SELECT generate_series(1,220));"
done

# COPY this data to a file
psql -c "COPY small TO '/tmp/small.data';"

# Start postgres bound to a single CPU core

# Run the following script with pgbench

# Make the table an unlogged table so we don't see the effects of WAL
# writes in results
#
# Make sure autovacuum doesn't run on the table
  drop table if exists small;
  create unlogged table small(col smallint not null) with (autovacuum_enabled = false);
  copy small from '/tmp/small.data';
  update small set col = 9 where col = 2;
  select * from small where col = 0;

pgbench -n -f query.sql -t 100 -M prepared -r

# (I made sure that HOT pruning was actually happening for the SELECT
# query before running this with pgbench)

# There seemed to be no meaningful difference for this example with the
# patches:

on current master:
statement latencies in milliseconds and failures:
12.387   0  drop table if exists small;
 1.914   0  create unlogged table small(col smallint not null) with (autovacuum_enabled = false);
   100.152   0  copy small from '/tmp/small.data';
49.480   0  update small set col = 9 where col = 2;
46.835   0  select * from small where col = 0;

with the patches applied:
statement latencies in milliseconds and failures:
13.281   0  drop table if exists small;
 1.952   0  create unlogged table small(col smallint not null) with (autovacuum_enabled = false);
99.418   0  copy small from '/tmp/small.data';
47.397   0  update small set col = 9 where col = 2;
46.887   0  select * from small where col = 0;


On-access pruning during UPDATE:


# The idea is to test a query which would be calling the new
# heap_prune_record_unchanged_lp_normal() function a lot

# I made the same table but filled entirely with the same value

psql -c "create table small(col smallint not null)" \
-c "INSERT INTO small SELECT 1 FROM (SELECT generate_series(1, 
221000));"

# COPY this data to a file
psql -c "COPY small TO '/tmp/small_univalue.data';"

# Start postgres bound to a single CPU core

# Run the following script with pgbench

# Pick a low fillfactor so we have room for the HOT updates
  drop table if exists small;
  create unlogged table small(col smallint not null) with (autovacuum_enabled = false, fillfactor=25);
  copy small from '/tmp/small_univalue.data';
  update small set col = 3;
  update small set col = 4;
  update small set col = 5;
  update small set col = 6;

pgbench -n -f update.sql -t 20 -M prepared -r

# There again seems to be no meaningful difference with the patches
# applied

on master:
statement latencies in milliseconds and failures:
19.880   0  drop table if exists small;
 2.099   0  create unlogged table small(col 

Re: Reports on obsolete Postgres versions

2024-04-02 Thread Daniel Gustafsson
> On 2 Apr 2024, at 22:46, Bruce Momjian  wrote:
> On Tue, Apr  2, 2024 at 11:34:46AM +0200, Magnus Hagander wrote:

>> I do like the term "current"  better. It conveys (at least a bit) that we
>> really consider all the older ones to be, well, obsolete. The difference
>> "current vs obsolete" is stronger than "latest vs not quite latest".
> 
> Okay, I changed "superseded" to "old", and changed "latest" to
> "current", patch attached.

+1, LGTM

--
Daniel Gustafsson





Re: Reports on obsolete Postgres versions

2024-04-02 Thread Bruce Momjian
On Tue, Apr  2, 2024 at 11:34:46AM +0200, Magnus Hagander wrote:
> On Tue, Apr 2, 2024 at 9:24 AM Daniel Gustafsson  wrote:
> A few small comments:
> 
> +considers performing minor upgrades to be less risky than continuing to
> +run superseded minor versions.
> 
> I think "superseded minor versions" could be unnecessarily complicated for
> non-native speakers, I consider myself fairly used to reading english but
> still
> had to spend a few extra (brain)cycles parsing the meaning of it in this
> context.
> 
> + We recommend that users always run the latest minor release associated
> 
> Or perhaps "current minor release" which is the term we use in the table
> below
> on the same page?
> 
> I do like the term "current"  better. It conveys (at least a bit) that we
> really consider all the older ones to be, well, obsolete. The difference
> "current vs obsolete" is stronger than "latest vs not quite latest".

Okay, I changed "superseded" to "old", and changed "latest" to
"current", patch attached.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.
diff --git a/templates/support/versioning.html b/templates/support/versioning.html
index d48e11e0..0ed79f4f 100644
--- a/templates/support/versioning.html
+++ b/templates/support/versioning.html
@@ -45,15 +45,8 @@ number, e.g. 9.5.3 to 9.5.4.
 Upgrading
 
 
-  
-We always recommend that all users run the latest available minor
-release for whatever major version is in use.
-  
-
-
-
-Major versions usually change the internal format of system tables and data
-files.  These changes are often complex, so we do not maintain backward
+Major versions usually change the internal format of the system tables.
+These changes are often complex, so we do not maintain backward
 compatibility of all stored data.  A dump/reload of the database or use of the
 pg_upgrade module is required
 for major upgrades. We also recommend reading the
@@ -65,18 +58,25 @@ versions prior to doing so.
 
 
 
-Upgrading to a minor release does not normally require a dump and restore;  you
-can stop the database server, install the updated binaries, and restart the
-server.  For some releases, manual changes may be required to complete the
-upgrade, so always read the release notes before upgrading.
+Minor release upgrades do not require a dump and restore;  you simply stop
+the database server, install the updated binaries, and restart the server.
+Such upgrades might require manual changes to complete so always read
+the release notes first.
 
 
 
-While upgrading will always contain some level of risk, PostgreSQL minor releases
-fix only frequently-encountered bugs, security
-issues, and data corruption problems to reduce the risk associated with
-upgrading. For minor releases, the community considers not upgrading to be
-riskier than upgrading.
+Minor releases only fix frequently-encountered bugs, security issues, and data corruption
+problems, so such upgrades are very low risk.  The community
+considers performing minor upgrades to be less risky than continuing to
+run an old minor version.
+
+
+
+  
+We recommend that users always run the current minor release associated
+with their major version.
+  
 
 
 Releases


Re: add function argument names to regex* functions.

2024-04-02 Thread Tom Lane
jian he  writes:
> On Thu, Jan 18, 2024 at 4:17 PM Peter Eisentraut  wrote:
>> Reading back through the discussion, I wasn't quite able to interpret
>> the resolution regarding Oracle compatibility.  From the patch, it looks
>> like you chose not to adopt the parameter names from Oracle.  Was that
>> your intention?

> per committee message:
> https://git.postgresql.org/cgit/postgresql.git/commit/?id=6424337073589476303b10f6d7cc74f501b8d9d7
> Even if the names are all the same, our function is still not the same
> as oracle.

The fact that there's minor discrepancies in the regex languages
doesn't seem to me to have a lot of bearing on whether we should
follow Oracle's choices of parameter names.

However, if we do follow Oracle, it seems like we should do that
consistently, which this patch doesn't.  For instance, per [1]
Oracle calls the arguments of regex_substr

source_char,
pattern,
position,
occurrence,
match_param,
subexpr

while we have

string,
pattern,
start,
N,
flags,
subexpr

The patch proposes to replace "N" with "occurrence" but not touch
the other discrepancies, which seems to me to be a pretty poor
choice.  "occurrence" is very long and difficult to spell correctly,
and if you're not following Oracle slavishly, exactly what is the
argument in its favor?  I quite agree that Oracle's other choices
aren't improvements over ours, but neither is that one.

On the whole my inclination would be to stick to the names we have
in the documentation.  There might be an argument for changing "N"
to something lower-case so you don't have to quote it; but if we do,
I'd go for, say, "count".
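
FWIW, the reason the spelling matters at all is that these names become part
of the call syntax once named notation is allowed for them.  A hypothetical
call using the documentation's names, with "count" standing in for "N" as
suggested above (none of this is committed behavior), would look like

    select regexp_substr('fooBARbazBAR', 'bar', start => 1, count => 2, flags => 'i');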

regards, tom lane

[1] 
https://docs.oracle.com/en/database/oracle/oracle-database/23/sqlrf/REGEXP_SUBSTR.html#GUID-2903904D-455F-4839-A8B2-1731EF4BD099




Re: Popcount optimization using AVX512

2024-04-02 Thread Ants Aasma
On Tue, 2 Apr 2024 at 00:31, Nathan Bossart  wrote:
> On Tue, Apr 02, 2024 at 12:11:59AM +0300, Ants Aasma wrote:
> > What about using the masking capabilities of AVX-512 to handle the
> > tail in the same code path? Masked out portions of a load instruction
> > will not generate an exception. To allow byte level granularity
> > masking, -mavx512bw is needed. Based on wikipedia this will only
> > disable this fast path on Knights Mill (Xeon Phi), in all other cases
> > VPOPCNTQ implies availability of BW.
>
> Sounds promising.  IMHO we should really be sure that these kinds of loads
> won't generate segfaults and the like due to the masked-out portions.  I
> searched around a little bit but haven't found anything that seemed
> definitive.

After sleeping on the problem, I think we can avoid this question
altogether while making the code faster by using aligned accesses.
Loads that straddle cache line boundaries run internally as 2 load
operations. Gut feel says that there are enough out-of-order resources
available to make it not matter in most cases. But even so, not doing
the extra work is surely better. Attached is another approach that
does aligned accesses, and thereby avoids going outside bounds.

Would be interesting to see how well that fares in the small use case.
Anything that fits into one aligned cache line should be constant
speed, and there is only one branch, but the mask setup and folding
the separate popcounts together should add up to about 20-ish cycles
of overhead.

Regards,
Ants Aasma
diff --git a/src/port/pg_popcount_avx512.c b/src/port/pg_popcount_avx512.c
index f86558d1ee5..e1fbd98fa14 100644
--- a/src/port/pg_popcount_avx512.c
+++ b/src/port/pg_popcount_avx512.c
@@ -30,20 +30,44 @@
 uint64
 pg_popcount_avx512(const char *buf, int bytes)
 {
-	uint64		popcnt;
+	__m512i		val, cnt;
 	__m512i		accum = _mm512_setzero_si512();
+	const char *final;
+	int 		tail_idx;
+	__mmask64	mask = -1;
 
-	for (; bytes >= sizeof(__m512i); bytes -= sizeof(__m512i))
-	{
-		const		__m512i val = _mm512_loadu_si512((const __m512i *) buf);
-		const		__m512i cnt = _mm512_popcnt_epi64(val);
+	/*
+	 * Align buffer down to avoid double load overhead from unaligned access.
+	 * Calculate a mask to ignore preceding bytes. Find start offset of final
+	 * iteration and number of valid bytes making sure that final iteration
+	 * is not empty.
+	 */
+	mask <<= ((uintptr_t) buf) % sizeof(__m512i);
+	tail_idx = (((uintptr_t) buf + bytes - 1) % sizeof(__m512i)) + 1;
+	final = (const char *) TYPEALIGN_DOWN(sizeof(__m512i), buf + bytes - 1);
+	buf = (const char *) TYPEALIGN_DOWN(sizeof(__m512i), buf);
 
+	/*
+	 * Iterate through all but the final iteration. Starting from second
+	 * iteration, the start index mask is ignored.
+	 */
+	for (; buf < final; buf += sizeof(__m512i))
+	{
+		val = _mm512_maskz_loadu_epi8(mask, (const __m512i *) buf);
+		cnt = _mm512_popcnt_epi64(val);
 		accum = _mm512_add_epi64(accum, cnt);
-		buf += sizeof(__m512i);
+
+		mask = -1;
 	}
 
-	popcnt = _mm512_reduce_add_epi64(accum);
-	return popcnt + pg_popcount_fast(buf, bytes);
+	/* Final iteration needs to ignore bytes that are not within the length */
+	mask &= ((~0ULL) >> (64 - tail_idx));
+
+	val = _mm512_maskz_loadu_epi8(mask, (const __m512i *) buf);
+	cnt = _mm512_popcnt_epi64(val);
+	accum = _mm512_add_epi64(accum, cnt);
+
+	return _mm512_reduce_add_epi64(accum);
 }
 
 #endif			/* TRY_POPCNT_FAST */


Re: psql's FETCH_COUNT (cursor) is not being respected for CTEs

2024-04-02 Thread Tom Lane
"Daniel Verite"  writes:
>   Tom Lane wrote:
>> I do not buy that psql's FETCH_COUNT mode is a sufficient reason
>> to add it.  FETCH_COUNT mode is not something you'd use
>> non-interactively

> I should say that I've noticed significant latency improvements with
> FETCH_COUNT retrieving large resultsets, such that it would benefit
> non-interactive use cases.

Do you have a theory for why that is?  It's pretty counterintuitive
that it would help at all.

regards, tom lane




Re: psql's FETCH_COUNT (cursor) is not being respected for CTEs

2024-04-02 Thread Daniel Verite
Tom Lane wrote:

> I do not buy that psql's FETCH_COUNT mode is a sufficient reason
> to add it.  FETCH_COUNT mode is not something you'd use
> non-interactively

I should say that I've noticed significant latency improvements with
FETCH_COUNT retrieving large resultsets, such that it would benefit
non-interactive use cases.

For instance, with the current v7 patch, a query like the OP's initial
case and batches of 1000 rows:

$ cat fetchcount-test.sql

select repeat('a', 100) || '-' ||
i || '-' || repeat('b', 500) as total_pat
from generate_series(1, 500) as i
\g /dev/null

$ export TIMEFORMAT=%R

$ for s in $(seq 1 10); do time /usr/local/pgsql/bin/psql -At \
   -v FETCH_COUNT=1000 -f fetchcount-test.sql;  done

3.597
3.413
3.362
3.612
3.377
3.416
3.346
3.368
3.504
3.413

=> Average elapsed time = 3.44s

Now without FETCH_COUNT, fetching the 5 million rows in one resultset:

$ for s in $(seq 1 10); do time /usr/local/pgsql/bin/psql -At \
-f fetchcount-test.sql;  done

4.200
4.178
4.200
4.169
4.195
4.217
4.197
4.234
4.225
4.242

=> Average elapsed time = 4.20s

By comparison the unpatched version (cursor-based method)
gives these execution times with FETCH_COUNT=1000:

4.458
4.448
4.476
4.455
4.450
4.466
4.395
4.429
4.387
4.473

=> Average elapsed time = 4.43s

Now that's just one test, but don't these numbers look good?


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite




RE: Crash on UNION with PG 17

2024-04-02 Thread Regina Obe
> On Thu, 28 Mar 2024 at 04:34, Regina Obe  wrote:
> > CREATE TABLE edge_data AS
> > SELECT i AS edge_id, i + 1 AS start_node, i + 2 As end_node FROM
> > generate_series(1,10) AS i;
> >
> >   WITH edge AS (
> > SELECT start_node, end_node
> > FROM edge_data
> > WHERE edge_id = 1
> >   )
> >   SELECT start_node id FROM edge UNION
> >   SELECT end_node FROM edge;
> 
> As of d5d2205c8, this query should work as expected.
> 
> Thank you for letting us know about this.
> 
> David

Thanks for the fix.  I confirm it works now and our bots are green again.

Regina





Re: Fixing backslash dot for COPY FROM...CSV

2024-04-02 Thread Tom Lane
"Daniel Verite"  writes:
>   Robert Haas wrote:
>> Those links unfortunately seem not to be entirely specific to this
>> issue. Other, related things seem to be discussed there, and it's not
>> obvious that everyone agrees on what to do, or really that anyone
>> agrees on what to do.

> The proposed  patch addresses these cases by making the sequence
> \. non-special in CSV (in fact backslash itself is a normal character in
> CSV).
> It does not address the cases when the data is embedded after
> the COPY command or typed interactively in psql, since these cases
> require an explicit end-of-data marker, and CSV does not have
> the concept of an end-of-data marker.

I've looked over this patch and I generally agree that this is a
reasonable solution.  While it's barely possible that somebody
out there is depending on the current behavior of \. in CSV mode,
it seems considerably more likely that people who run into it
would consider it a bug.  In any case, the patch isn't proposed
for back-patch, and as cross-version incompatibilities go this
seems like a pretty small one.

I concur with Robert's doubts about some of the doc changes though.
In particular, since we're not changing the behavior for non-CSV
mode, we shouldn't remove any statements that apply to non-CSV mode.

I'm also wondering why the patch adds a test for
"PQprotocolVersion(conn) >= 3" in handleCopyIn.  As already
noted upthread, we ripped out all support for protocol 2.0
some time ago, so that sure looks like dead code.

regards, tom lane




Re: New Table Access Methods for Multi and Single Inserts

2024-04-02 Thread Jeff Davis
On Sun, 2024-03-31 at 21:18 +0530, Bharath Rupireddy wrote:
> if (table_modify_buffer_insert() is defined)
>    table_modify_buffer_insert(...);
> else
> {
>   myState->bistate = GetBulkInsertState();
>   table_tuple_insert(...);
> }

We can't alloc/free the bulk insert state for every insert call. I see
two options:

* Each caller needs to support two code paths: if the buffered insert
APIs are defined, then use those; otherwise the caller needs to manage
the bulk insert state itself and call the plain insert API.

* Have default implementation for the new API methods, so that the
default for the begin method would allocate the bulk insert state, and
the default for the buffered insert method would be to call plain
insert using the bulk insert state.

I'd prefer the latter, at least in the long term. But I haven't really
thought through the details, so perhaps we'd need to use the former.
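
To sketch what that second option could look like (purely illustrative: the
TableModifyState contents and the "default_" function names below are
assumptions, not what the patch actually defines):

typedef struct TableModifyState
{
	Relation	rel;
	CommandId	cid;
	int			options;
	BulkInsertState bistate;	/* used only by the fallback path */
} TableModifyState;

static TableModifyState *
default_table_modify_begin(Relation rel, CommandId cid, int options)
{
	TableModifyState *state = palloc0(sizeof(TableModifyState));

	state->rel = rel;
	state->cid = cid;
	state->options = options;
	state->bistate = GetBulkInsertState();
	return state;
}

static void
default_table_modify_buffer_insert(TableModifyState *state, TupleTableSlot *slot)
{
	/* no AM-side buffering: insert the tuple right away using the bulk state */
	table_tuple_insert(state->rel, slot, state->cid, state->options,
					   state->bistate);
}

static void
default_table_modify_end(TableModifyState *state)
{
	FreeBulkInsertState(state->bistate);
	pfree(state);
}

That way every caller could use the buffered API unconditionally, and AMs that
want real buffering would simply override the defaults.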

> > 
> > After we have these new APIs fully in place and used by COPY, what
> > will
> > happen to those other APIs? Will they be deprecated or will there
> > be a
> > reason to keep them?
> 
> Deprecated perhaps?

Including Alexander on this thread, because he's making changes to the
multi-insert API. We need some consensus on where we are going with
these APIs before we make more changes, and what incremental steps make
sense in v17.

Here's where I think this API should go:

1. Have table_modify_begin/end and table_modify_buffer_insert, like
those that are implemented in your patch.

2. Add some kind of flush callback that will be called either while the
tuples are being flushed or after the tuples are flushed (but before
they are freed by the AM). (Aside: do we need to call it while the
tuples are being flushed to get the right visibility semantics for
after-row triggers?)

3. Add table_modify_buffer_{update|delete} APIs.

4. Some kind of API tweaks to help manage memory when modifying
partitioned tables, so that the buffering doesn't get out of control.
Perhaps just reporting memory usage and allowing the caller to force
flushes would be enough.

5. Use these new methods for CREATE/REFRESH MATERIALIZED VIEW. This is
fairly straightforward, I believe, and handled by your patch. Indexes
are (re)built afterward, and no triggers are possible.

6. Use these new methods for CREATE TABLE ... AS. This is fairly
straightforward, I believe, and handled by your patch. No indexes or
triggers are possible.

7. Use these new methods for COPY. We have to be careful to avoid
regressions for the heap method, because it's already managing its own
buffers. If the AM manages the buffering, then it may require
additional copying of slots, which could be a disadvantage. To solve
this, we may need some minor API tweaks to avoid copying when the
caller guarantees that the memory will not be freed too early, or
perhaps expose the AM's memory context to copyfrom.c. Another thing to
consider is that the buffering in copyfrom.c is also used for FDWs, so
that buffering code path needs to be preserved in copyfrom.c even if
not used for AMs.

8. Use these new methods for INSERT INTO ... SELECT. One potential
challenge here is that execution nodes are not always run to
completion, so we need to be sure that the flush isn't forgotten in
that case.

9. Use these new methods for DELETE, UPDATE, and MERGE. MERGE can use
the buffer_insert/update/delete APIs; we don't need a separate merge
method. This probably requires that the AM maintain 3 separate buffers
to distinguish different kinds of changes at flush time (obviously
these can be initialized lazily to avoid overhead when not being used).

10. Use these new methods for logical apply.

11. Deprecate the multi_insert API.

Thoughts on this plan? Does your patch make sense in v17 as a stepping
stone, or should we try to make all of these API changes together in
v18?

Also, a sample AM code would be a huge benefit here. Writing a real AM
is hard, but perhaps we can at least have an example one to demonstrate
how to use these APIs?

Regards,
Jeff Davis





Re: WIP Incremental JSON Parser

2024-04-02 Thread Jacob Champion
On Mon, Apr 1, 2024 at 4:53 PM Andrew Dunstan  wrote:
> Anyway, here are new patches. I've rolled the new semantic test into the
> first patch.

Looks good! I've marked RfC.

> json_lex() is not really a very hot piece of code.

Sure, but I figure if someone is trying to get the performance of the
incremental parser to match the recursive one, so we can eventually
replace it, it might get a little warmer. :)

> > I think it'd be good for a v1.x of this feature to focus on
> > simplification of the code, and hopefully consolidate and/or eliminate
> > some of the duplicated parsing work so that the mental model isn't
> > quite so big.
>
> I'm not sure how you think that can be done.

I think we'd need to teach the lower levels of the lexer about
incremental parsing too, so that we don't have two separate sources of
truth about what ends a token. Bonus points if we could keep the parse
state across chunks to the extent that we didn't need to restart at
the beginning of the token every time. (Our current tools for this are
kind of poor, like the restartable state machine in PQconnectPoll.
While I'm dreaming, I'd like coroutines.) Now, whether the end result
would be more or less maintainable is left as an exercise...

Thanks!
--Jacob




Re: Missing LWLock protection in pgstat_reset_replslot()

2024-04-02 Thread Tom Lane
Michael Paquier  writes:
> On Fri, Mar 08, 2024 at 07:46:26PM +0900, Michael Paquier wrote:
>> Sounds good to me.

> I've applied the patch of this thread as b36fbd9f8da1, though I did
> not see a huge point in backpatching as at the end this is just a
> consistency improvement.

I've closed the CF entry for this [1] as committed.  Please re-open
it if there was something left to do here.

regards, tom lane

[1] https://commitfest.postgresql.org/47/4878/




Re: BUG: deadlock between autovacuum worker and client backend during removal of orphan temp tables with sequences

2024-04-02 Thread Tom Lane
Akshat Jaimini  writes:
> The code seems to implement the feature and has good and explanatory comments 
> associated with it.
> I believe we can go ahead with committing patch although I would request some 
> senior contributors to also take a look at this patch since I am relatively 
> new to patch reviews.

Looks like a good catch and a reasonable fix.  Pushed after rewriting
the comments a bit.

As far as this goes:

> I ran make installcheck-world after applying the patch and recompiling it. It 
> did fail for a particular test but from the logs it seems to be unrelated to 
> this particular patch since it fails for the following:

> ==
> select error_trap_test();
> -  error_trap_test  
> 
> - division_by_zero detected
> -(1 row)
> -
> +ERROR:  cannot start subtransactions during a parallel operation

... that's the test case from 0075d7894, and the failure is what
I'd expect from a backend older than that.  Maybe you forgot to
recompile/reinstall after updating past that commit?

regards, tom lane




Re: Cutting support for OpenSSL 1.0.1 and 1.0.2 in 17~?

2024-04-02 Thread Daniel Gustafsson
> On 30 Mar 2024, at 22:27, Thomas Munro  wrote:
> On Sun, Mar 31, 2024 at 9:59 AM Tom Lane  wrote:

Thanks a lot for bringing this up again Thomas, 1.0.2 has bitten me so many
times and I'd be thrilled to get rid of it.

>> I think it's probably true that <=1.0.2 is not in any distro that
>> we still need to pay attention to, but I reject the contention
>> that RHEL8 is not in that set.
> 
> Hmm, OK so it doesn't have 3 available in parallel from base repos.
> But it's also about to reach end of "full support" in 2 months[1], so
> if we applied the policies we discussed in the LLVM-vacuuming thread
> (to wit: build farm - EOL'd OSes), then...  One question I'm unclear
> on is whether v17 will be packaged for RHEL8.

While 1.1.1 is EOL in upstream, it won't buy us much to deprecate past it since
we don't really make use of 3.x specific functionality.  I wouldn't mind not
being on the hook to support an EOL version of OpenSSL for another 5 years, but
it also won't shift the needle too much.  For v18 I'd like to work on modernizing
our OpenSSL code to make more use of 3+ features/APIs, and that could be a good
point to cull 1.1.1 support.

Settling for removing support for 1.0.2, which is antiques roadshow material at
this point (no TLSv1.3 support for example), removes most of the compatibility
mess we have in libpq.  1.1.1 was not a deprecation point in OpenSSL but we can
define 1.1.0 as our compatibility level to build warning-free.

The attached removes 1.0.2 support (meson build parts untested yet) with a few
small touch ups of related documentation.  I haven't yet done the research on
where that leaves LibreSSL since we don't really define anywhere what we
support (so far we've gotten by assuming it's kind of sort of 1.0.2 for the parts
we care about, which is skating on fairly thin ice).

--
Daniel Gustafsson



v1-0001-Remove-support-for-OpenSSL-1.0.2-and-1.1.0.patch
Description: Binary data


Re: Popcount optimization using AVX512

2024-04-02 Thread Nathan Bossart
On Tue, Apr 02, 2024 at 01:43:48PM -0400, Tom Lane wrote:
> Alvaro Herrera  writes:
>> On 2024-Apr-02, Nathan Bossart wrote:
>>> Another idea I had is to turn pg_popcount() into a macro that just uses the
>>> pg_number_of_ones array when called for few bytes:
>>> 
>>> static inline uint64
>>> pg_popcount_inline(const char *buf, int bytes)
>>> {
>>> uint64  popcnt = 0;
>>> 
>>> while (bytes--)
>>> popcnt += pg_number_of_ones[(unsigned char) *buf++];
>>> 
>>> return popcnt;
>>> }
>>> 
>>> #define pg_popcount(buf, bytes) \
>>> ((bytes < 64) ? \
>>>  pg_popcount_inline(buf, bytes) : \
>>>  pg_popcount_optimized(buf, bytes))
>>> 
>>> But again, I'm not sure this is really worth it for the current use-cases.
> 
>> Eh, that seems simple enough, and then you can forget about that case.
> 
> I don't like the double evaluation of the macro argument.  Seems like
> you could get the same results more safely with
> 
>   static inline uint64
>   pg_popcount(const char *buf, int bytes)
>   {
>   if (bytes < 64)
>   {
>   uint64  popcnt = 0;
> 
>   while (bytes--)
>   popcnt += pg_number_of_ones[(unsigned char) 
> *buf++];
> 
>   return popcnt;
>   }
>   return pg_popcount_optimized(buf, bytes);
>   }

Yeah, I like that better.  I'll do some testing to see what the threshold
really should be before posting an actual patch.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




RFC: Additional Directory for Extensions

2024-04-02 Thread David E. Wheeler
Hackers,

In the Security lessons from liblzma thread[1], walther broached the subject of 
an extension directory path[1]:

> Also a configurable directoy to look up extensions, possibly even to be 
> changed at run-time like [2]. The patch says this:
> 
>> This directory is prepended to paths when loading extensions (control and 
>> SQL files), and to the '$libdir' directive when loading modules that back 
>> functions. The location is made configurable to allow build-time testing of 
>> extensions that do not have been installed to their proper location yet.
> 
> This seems like a great thing to have. This might also be relevant in 
> light of recent discussions in the ecosystem around extension management.


That quotation comes from this Debian patch[2] maintained by Christoph Berg. 
I’d like to formally propose integrating this patch into the core. And not only 
because it’s overhead for package maintainers like Christoph, but because a 
number of use cases have emerged since we originally discussed something like 
this back in 2013[3]:

Docker Immutability
---

Docker images must be immutable. In order for users of a Docker image to 
install extensions that persist, they must create a persistent volume, map it 
to SHAREDIR/extensions, and copy over all the core extensions (or muck with 
symlink magic[4]). This makes upgrades trickier, because the core extensions 
are mixed in with third party extensions. 

By supporting a second directory prepended to the list of directories to 
search, as the Debian patch does, users of Docker images can keep extensions 
they install separate from core extensions, in a directory mounted to a 
persistent volume with none of the core extensions. Along with tweaking 
dynamic_library_path to support additional directories for shared object 
libraries, which can also be mounted to a separate path, we can have a 
persistent and clean separation of immutable core extensions and extensions 
installed at runtime.
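
Concretely, a container could then carry something like this in its
postgresql.conf (a sketch only: the mount point is a placeholder, and
extension_destdir is the GUC added by the Debian patch referenced below, which
prepends the directory to the normal search paths rather than replacing them):

    extension_destdir = '/mnt/extensions'
    dynamic_library_path = '/mnt/extensions/lib:$libdir'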

Postgres.app


The Postgres.app project also supports installing extensions. However, because 
they must go into the SHAREDIR/extensions, once a user installs one the package 
has been modified and the Apple bundle signature will be broken. The OS will no 
longer be able to validate that the app is legit.

If the core supported an additional extension directory (and PKGLIBDIR), it would allow 
an immutable PostgreSQL base package and still allow extensions to be installed 
into directories outside the app bundle, and thus preserve bundle signing on 
macOS (and presumably other systems --- is this the nix issue, too?)

RFC
---

I know there was some objection to changes like this in the past, but the 
support I’m seeing in the liblzma thread for making pkglibdir configurable makes me 
optimistic that this might be the right time to support additional 
configuration for the extension directory, as well, starting with the Debian 
patch, perhaps.

Thoughts?

I would be happy to submit a clean version of the Debian patch[2].

Best,

David

[1] 
https://www.postgresql.org/message-id/99c41b46-616e-49d0-9ffd-a29432cec818%40technowledgy.de
[2] 
https://salsa.debian.org/postgresql/postgresql/-/blob/17/debian/patches/extension_destdir?ref_type=heads
[3] https://www.postgresql.org/message-id/flat/51ae0845.8010...@ocharles.org.uk
[4] https://speakerdeck.com/ongres/postgres-extensions-in-kubernetes?slide=14





Re: Fix resource leak (src/backend/libpq/be-secure-common.c)

2024-04-02 Thread Daniel Gustafsson
> On 2 Apr 2024, at 20:13, Ranier Vilela  wrote:

> Fix by freeing the pointer, like pclose_check (src/common/exec.c) similar 
> case.

Off the cuff, seems reasonable when loglevel is LOG.

--
Daniel Gustafsson





Re: using extended statistics to improve join estimates

2024-04-02 Thread Tomas Vondra
On 4/2/24 10:23, Andy Fan wrote:
> 
>> On Wed, Mar 02, 2022 at 11:38:21AM -0600, Justin Pryzby wrote:
>>> Rebased over 269b532ae and muted compiler warnings.
> 
> Thank you Justin for the rebase!
> 
> Hello Tomas,
> 
> Thanks for the patch! Before I review the path at the code level, I want
> to explain my understanding about this patch first.
> 

If you want to work on this patch, that'd be cool. A review would be
great, but if you want to maybe take over and try moving it forward,
that'd be even better. I don't know when I'll have time to work on it
again, but I'd promise to help you with working on it.

> Before this patch, we already use MCV information for the eqjoinsel, it
> works as combine the MCV on the both sides to figure out the mcv_freq
> and then treat the rest equally, but this doesn't work for MCV in
> extended statistics, this patch fill this gap. Besides that, since
> extended statistics means more than 1 columns are involved, if 1+
> columns are Const based on RestrictInfo, we can use such information to
> filter the MCVs we are interesting, that's really cool. 
> 

Yes, I think that's an accurate description of what the patch does.

> I did some more testing, all of them are inner join so far, all of them
> works amazing and I am suprised this patch didn't draw enough
> attention. I will test more after I go though the code.
> 

I think it didn't go forward for a bunch of reasons:

1) I got distracted by something else requiring immediate attention, and
forgot about this patch.

2) I got stuck on some detail of the patch, unsure which of the possible
solutions to try first.

3) Uncertainty about how applicable the patch is in practice.

I suppose it was some combination of these reasons, not sure.


As for the "practicality" mentioned in (3), it's been a while since I
worked on the patch so I don't recall the details, but I think I've been
thinking mostly about "star join" queries, where a big "fact" table
joins to small dimensions. And in that case the fact table may have a
MCV, but the dimensions certainly don't have any (because the join
happens on a PK).

But maybe that's a wrong way to think about it - it was clearly useful
to consider the case with (per-attribute) MCVs on both sides as worth
special handling. So why not to do that for multi-column MCVs, right?
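
For anyone who wants to reproduce the effect, the shape of query the patch is
aimed at is roughly this (a made-up illustration, not a case taken from the
patch's regression tests):

    create table t1 (a int, b int);
    create table t2 (a int, b int);
    insert into t1 select i % 10, i % 10 from generate_series(1, 100000) i;
    insert into t2 select i % 10, i % 10 from generate_series(1, 100000) i;
    create statistics s1 (mcv) on a, b from t1;
    create statistics s2 (mcv) on a, b from t2;
    analyze t1, t2;
    -- the patch can combine the two multi-column MCV lists, and the Const
    -- clause lets it consider only the matching MCV items
    explain select * from t1 join t2 using (a, b) where t1.a = 1;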

> At for the code level, I reviewed them in the top-down manner and almost
> 40% completed. Here are some findings just FYI. For efficiency purpose,
> I provide each feedback with a individual commit, after all I want to
> make sure my comment is practical and coding and testing is a good way
> to archive that. I tried to make each of them as small as possible so
> that you can reject or accept them convinently.
> 
> 0001 is your patch, I just rebase them against the current master. 0006
> is not much relevant with current patch, and I think it can be committed
> individually if you are OK with that.
> 
> Hope this kind of review is helpful.
> 

Cool! There's obviously no chance to get this into v17, and I have stuff
to do in this CF. But I'll take a look after that.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Fix resource leak (src/backend/libpq/be-secure-common.c)

2024-04-02 Thread Ranier Vilela
Hi,

Per Coverity.

Coverity reported a resource leak at the function
run_ssl_passphrase_command.
7. alloc_fn: Storage is returned from allocation function
wait_result_to_str.["show details"]

8. noescape: Assuming resource wait_result_to_str(pclose_rc) is not freed
or pointed-to as ellipsis argument to errdetail_internal.

CID 1533043: (#1 of 1): Resource leak (RESOURCE_LEAK)
9. leaked_storage: Failing to save or free storage allocated by
wait_result_to_str(pclose_rc) leaks it.

I think that Coverity is right.

Fix by freeing the pointer, like pclose_check (src/common/exec.c) similar
case.

Patch attached.

best regards,


fix-resource-leak-be-secure-common.patch
Description: Binary data


Re: Popcount optimization using AVX512

2024-04-02 Thread Tom Lane
Alvaro Herrera  writes:
> On 2024-Apr-02, Nathan Bossart wrote:
>> Another idea I had is to turn pg_popcount() into a macro that just uses the
>> pg_number_of_ones array when called for few bytes:
>> 
>>  static inline uint64
>>  pg_popcount_inline(const char *buf, int bytes)
>>  {
>>  uint64  popcnt = 0;
>> 
>>  while (bytes--)
>>  popcnt += pg_number_of_ones[(unsigned char) *buf++];
>> 
>>  return popcnt;
>>  }
>> 
>>  #define pg_popcount(buf, bytes) \
>>  ((bytes < 64) ? \
>>   pg_popcount_inline(buf, bytes) : \
>>   pg_popcount_optimized(buf, bytes))
>> 
>> But again, I'm not sure this is really worth it for the current use-cases.

> Eh, that seems simple enough, and then you can forget about that case.

I don't like the double evaluation of the macro argument.  Seems like
you could get the same results more safely with

static inline uint64
pg_popcount(const char *buf, int bytes)
{
if (bytes < 64)
{
uint64  popcnt = 0;

while (bytes--)
popcnt += pg_number_of_ones[(unsigned char) 
*buf++];

return popcnt;
}
return pg_popcount_optimized(buf, bytes);
}

regards, tom lane




Re: Popcount optimization using AVX512

2024-04-02 Thread Alvaro Herrera
On 2024-Apr-02, Nathan Bossart wrote:

> Another idea I had is to turn pg_popcount() into a macro that just uses the
> pg_number_of_ones array when called for few bytes:
> 
>   static inline uint64
>   pg_popcount_inline(const char *buf, int bytes)
>   {
>   uint64  popcnt = 0;
> 
>   while (bytes--)
>   popcnt += pg_number_of_ones[(unsigned char) *buf++];
> 
>   return popcnt;
>   }
> 
>   #define pg_popcount(buf, bytes) \
>   ((bytes < 64) ? \
>pg_popcount_inline(buf, bytes) : \
>pg_popcount_optimized(buf, bytes))
> 
> But again, I'm not sure this is really worth it for the current use-cases.

Eh, that seems simple enough, and then you can forget about that case.

-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/
"No hay hombre que no aspire a la plenitud, es decir,
la suma de experiencias de que un hombre es capaz"




Re: Combine Prune and Freeze records emitted by vacuum

2024-04-02 Thread Melanie Plageman
On Tue, Apr 2, 2024 at 9:11 AM Heikki Linnakangas  wrote:
>
> On 01/04/2024 20:22, Melanie Plageman wrote:
> > Review for 0003-0006 (I didn't have any new thoughts on 0002). I know
> > you didn't modify them much/at all, but I noticed some things in my code
> > that could be better.
>
> Ok, here's what I have now. I made a lot of small comment changes here
> and there, and some minor local refactorings, but nothing major. I lost
> track of all the individual changes I'm afraid, so I'm afraid you'll
> have to just diff against the previous version if you want to see what's
> changed. I hope I didn't break anything.
>
> I'm pretty happy with this now. I will skim through it one more time
> later today or tomorrow, and commit. Please review once more if you have
> a chance.

Thanks!

0001 looks good. Attached are some comment updates and such on top of
0001 and 0002.

I started some performance testing of 0002 but haven't finished yet. I
wanted to provide my other review first.

> > This probably doesn't belong here. I noticed spgdoinsert.c had a static
> > function for sorting OffsetNumbers, but I didn't see anything general
> > purpose anywhere else.
>
> I copied the spgdoinsert.c implementation to vacuumlazy.c as is. Would
> be nice to have just one copy of this in some common place, but I also
> wasn't sure where to put it.

I looked a bit through utils and common and didn't see anywhere
obvious to put it. We could make a new file? 0003 fixes where you
forgot to change the name of the qsort function, though.

- Melanie
From 3b69cf3123732c3296a784be8f4fc08ec024c0d5 Mon Sep 17 00:00:00 2001
From: Melanie Plageman 
Date: Tue, 2 Apr 2024 11:11:07 -0400
Subject: [PATCH v13 3/4] fix qsort func

---
 src/backend/access/heap/vacuumlazy.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index ace95a4de2..c3a9dc1ad6 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -46,6 +46,7 @@
 #include "commands/dbcommands.h"
 #include "commands/progress.h"
 #include "commands/vacuum.h"
+#include "common/int.h"
 #include "executor/instrument.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -1496,7 +1497,7 @@ lazy_scan_prune(LVRelState *vacrel,
 		 * sorted.
 		 */
 		qsort(presult.deadoffsets, presult.lpdead_items, sizeof(OffsetNumber),
-			  OffsetNumber_cmp);
+			  cmpOffsetNumbers);
 
 		dead_items_add(vacrel, blkno, presult.deadoffsets, presult.lpdead_items);
 	}
-- 
2.40.1

From 021407cee292c0b1ef5145f37aef889b68b739b0 Mon Sep 17 00:00:00 2001
From: Melanie Plageman 
Date: Tue, 2 Apr 2024 11:22:35 -0400
Subject: [PATCH v13 4/4] update few more outdated comments

---
 src/backend/access/heap/pruneheap.c | 53 +
 src/backend/storage/ipc/procarray.c |  6 ++--
 src/include/access/heapam.h |  2 +-
 3 files changed, 28 insertions(+), 33 deletions(-)

diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index 41c919b15b..ddc228c86d 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -328,7 +328,8 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
  *
  * presult contains output parameters needed by callers, such as the number of
  * tuples removed and the offsets of dead items on the page after pruning.
- * heap_page_prune_and_freeze() is responsible for initializing it.
+ * heap_page_prune_and_freeze() is responsible for initializing it. Required by
+ * all callers.
  *
  * reason indicates why the pruning is performed.  It is included in the WAL
  * record for debugging and analysis purposes, but otherwise has no effect.
@@ -393,6 +394,7 @@ heap_page_prune_and_freeze(Relation relation, Buffer buffer,
 	prstate.pagefrz.freeze_required = false;
 	if (prstate.freeze)
 	{
+		Assert(new_relfrozen_xid && new_relmin_mxid);
 		prstate.pagefrz.FreezePageRelfrozenXid = *new_relfrozen_xid;
 		prstate.pagefrz.NoFreezePageRelfrozenXid = *new_relfrozen_xid;
 		prstate.pagefrz.FreezePageRelminMxid = *new_relmin_mxid;
@@ -415,19 +417,19 @@ heap_page_prune_and_freeze(Relation relation, Buffer buffer,
 	prstate.deadoffsets = presult->deadoffsets;
 
 	/*
-	 * Caller may update the VM after we're done.  We keep track of whether
-	 * the page will be all_visible and all_frozen, once we're done with the
-	 * pruning and freezing, to help the caller to do that.
+	 * Caller may update the VM after we're done.  We can keep track of
+	 * whether the page will be all-visible and all-frozen after pruning and
+	 * freezing to help the caller to do that.
 	 *
 	 * Currently, only VACUUM sets the VM bits.  To save the effort, only do
-	 * only the bookkeeping if the caller needs it.  Currently, that's tied to
-	 * HEAP_PAGE_PRUNE_FREEZE, but it could be a separate flag, if you wanted
-	 * to update the VM bits without also freezing, or freezing without
+	 * the bookkeeping if the caller needs it.  

Re: Streaming read-ready sequential scan code

2024-04-02 Thread Heikki Linnakangas

On 01/04/2024 22:58, Melanie Plageman wrote:

Attached v7 has version 14 of the streaming read API as well as a few
small tweaks to comments and code.


I saw benchmarks in this thread to show that there's no regression when 
the data is in cache, but I didn't see any benchmarks demonstrating the 
benefit of this. So I ran this quick test:


-- create table ~1 GB table with only 1 row per page.
CREATE TABLE giga (i int, filler text) with (fillfactor=10);
insert into giga select g, repeat('x', 900) from generate_series(1, 
14) g;

vacuum freeze giga;

\timing on
select count(*) from giga;

The SELECT takes about 390 ms on 'master', and 230 ms with the patch.

This is pretty much the best case for this patch, real world gains will 
be much smaller. Nevertheless, nice speedup!


--
Heikki Linnakangas
Neon (https://neon.tech)





Re: Statistics Import and Export

2024-04-02 Thread Corey Huinker
I have refactored pg_set_relation_stats to be variadic, and I'm working on
pg_set_attribute_stats, but I'm encountering an issue with the anyarray
values.

Jeff suggested looking at anyarray_send as a way of extracting the type,
and with some extra twiddling we can get and cast the type. However, some
of the ANYARRAYs have element types that are themselves arrays, and near as
I can tell, such a construct is not expressible in SQL. So, rather than
getting an anyarray of an array type, you instead get an array of one
higher dimension.  Like so:

# select schemaname, tablename, attname,
   substring(substring(anyarray_send(histogram_bounds) from 9 for 4)::text,2)::bit(32)::integer::regtype,
   substring(substring(anyarray_send(histogram_bounds::text::text[][]) from 9 for 4)::text,2)::bit(32)::integer::regtype
  from pg_stats
  where histogram_bounds is not null
    and tablename = 'pg_proc' and attname = 'proargnames';

 schemaname | tablename |   attname   | substring | substring
------------+-----------+-------------+-----------+-----------
 pg_catalog | pg_proc   | proargnames | text[]    | text

Luckily, passing in such a value would have done all of the element
typechecking for us, so we would just move the data to an array of one less
dimension typed elem[]. If there's an easy way to do that, I don't know of
it.

What remains is just checking the input types against the expected type of
the array, stepping down the dimension if need be, and skipping if the type
doesn't meet expectations.


Re: On disable_cost

2024-04-02 Thread Tom Lane
Robert Haas  writes:
> On Tue, Apr 2, 2024 at 11:54 AM Tom Lane  wrote:
>> I suspect that it'd behave poorly when there are both disabled and
>> promoted sub-paths in a tree, for pretty much the same reasons you
>> explained just upthread.

> Hmm, can you explain further? I think essentially you'd be maximizing
> #(promoted nodes)-#(disabled nodes), but I have no real idea whether
> that behavior will be exactly what people want or extremely
> unintuitive or something in the middle. It seems like it should be
> fine if there's only promoting or only disabling or if we can respect
> both the promoting and the disabling, assuming we even want to have
> both, but I'm suspicious that it will be weird somehow in other cases.
> I can't say exactly in what way, though. Do you have more insight?

Not really.  But if you had, say, a join of a promoted path to a
disabled path, that would be treated as on-par with a join of two
regular paths, which seems like it'd lead to odd choices.  Maybe
it'd be fine, but my gut says it'd likely not act nicely.  As you
say, it's a lot easier to believe that only-promoted or only-disabled
situations would behave sanely.

regards, tom lane




Re: psql's FETCH_COUNT (cursor) is not being respected for CTEs

2024-04-02 Thread Tom Lane
"Daniel Verite"  writes:
> Updated patches are attached.

I started to look through this, and almost immediately noted

- 
-  Retrieving Query Results Row-by-Row
+ 
+  Retrieving Query Results in chunks

This is a bit problematic, because changing the sect1 ID will
change the page's URL, eg

https://www.postgresql.org/docs/current/libpq-single-row-mode.html

Aside from possibly breaking people's bookmarks, I'm pretty sure this
will cause the web docs framework to not recognize any cross-version
commonality of the page.  How ugly would it be if we left the ID
alone?  Another idea could be to leave the whole page alone and add
a new  for chunked mode.

But ... TBH I'm not convinced that we need the chunked mode at all.
We explicitly rejected that idea back when single-row mode was
designed, see around here:

https://www.postgresql.org/message-id/flat/50173BF7.1070801%40Yahoo.com#7f92ebad0143fb5f575ecb3913c5ce88

and I'm still very skeptical that there's much win to be had.
I do not buy that psql's FETCH_COUNT mode is a sufficient reason
to add it.  FETCH_COUNT mode is not something you'd use
non-interactively, and there is enough overhead elsewhere in psql
(notably in result-set formatting) that it doesn't seem worth
micro-optimizing the part about fetching from libpq.

(I see that there was some discussion in that old thread about
micro-optimizing single-row mode internally to libpq by making
PGresult creation cheaper, which I don't think anyone ever got
back to doing.  Maybe we should resurrect that.)
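
For context, a minimal sketch of how a client drives libpq's existing
single-row mode; it assumes an already-open connection, uses an arbitrary
query, and omits error handling:

#include <stdio.h>
#include <libpq-fe.h>

static void
fetch_row_by_row(PGconn *conn)
{
	PGresult   *res;

	if (!PQsendQuery(conn, "SELECT generate_series(1, 1000000)"))
		return;

	/* must be requested before the first PQgetResult() */
	if (!PQsetSingleRowMode(conn))
		fprintf(stderr, "could not enter single-row mode\n");

	while ((res = PQgetResult(conn)) != NULL)
	{
		if (PQresultStatus(res) == PGRES_SINGLE_TUPLE)
			printf("%s\n", PQgetvalue(res, 0, 0));
		/* a final PGRES_TUPLES_OK result marks the end of the set */
		PQclear(res);
	}
}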

regards, tom lane




Re: On disable_cost

2024-04-02 Thread Robert Haas
On Tue, Apr 2, 2024 at 11:54 AM Tom Lane  wrote:
> > I'm pretty sure negative costs are going to create a variety of
> > unpleasant planning artifacts.
>
> Indeed.  It might be okay to have negative values for disabled-ness
> if we treat disabled-ness as a "separate order of infinity", but
> I suspect that it'd behave poorly when there are both disabled and
> promoted sub-paths in a tree, for pretty much the same reasons you
> explained just upthread.

Hmm, can you explain further? I think essentially you'd be maximizing
#(promoted nodes)-#(disabled nodes), but I have no real idea whether
that behavior will be exactly what people want or extremely
unintuitive or something in the middle. It seems like it should be
fine if there's only promoting or only disabling or if we can respect
both the promoting and the disabling, assuming we even want to have
both, but I'm suspicious that it will be weird somehow in other cases.
I can't say exactly in what way, though. Do you have more insight?

> > I think the only reason we're
> > driving this off of costing today is that making add_path() more
> > complicated is unappealing, mostly on performance grounds, and if you
> > add disabled-ness (or promoted-ness) as a separate axis of value then
> > add_path() has to know about that on top of everything else.
>
> It doesn't seem to me that it's a separate axis of value, just a
> higher-order component of the cost metric.  Nonetheless, adding even
> a few instructions to add_path comparisons sounds expensive.  Maybe
> it'd be fine, but we'd need to do some performance testing.

Hmm, yeah. I'm not sure how much difference there is between these
things in practice. I didn't run down everything that was happening,
but I think what I did was equivalent to making it a higher-order
component of the cost metric, and it seemed like an awful lot of paths
were surviving anyway, e.g. index scans survived
enable_indexscan=false because they had a sort order, and I think
sequential scans were surviving enable_seqscan=false too, perhaps
because they had no startup cost. At any rate there's no question that
add_path() is hot.

--
Robert Haas
EDB: http://www.enterprisedb.com




Re: On disable_cost

2024-04-02 Thread Tom Lane
Robert Haas  writes:
> On Tue, Apr 2, 2024 at 10:04 AM Greg Sabino Mullane  
> wrote:
>> if (!enable_seqscan)
>> startup_cost += disable_cost;
>> else if (promote_seqscan)
>> startup_cost -= promotion_cost; // or replace "promote" with "encourage"?

> I'm pretty sure negative costs are going to create a variety of
> unpleasant planning artifacts.

Indeed.  It might be okay to have negative values for disabled-ness
if we treat disabled-ness as a "separate order of infinity", but
I suspect that it'd behave poorly when there are both disabled and
promoted sub-paths in a tree, for pretty much the same reasons you
explained just upthread.

> I think the only reason we're
> driving this off of costing today is that making add_path() more
> complicated is unappealing, mostly on performance grounds, and if you
> add disabled-ness (or promoted-ness) as a separate axis of value then
> add_path() has to know about that on top of everything else.

It doesn't seem to me that it's a separate axis of value, just a
higher-order component of the cost metric.  Nonetheless, adding even
a few instructions to add_path comparisons sounds expensive.  Maybe
it'd be fine, but we'd need to do some performance testing.
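
To make the "higher-order component" idea concrete, here is a minimal
standalone sketch; the struct, its disabled_nodes field, and the driver are
assumptions for illustration, not the real Path structure or add_path() code:

#include <stdio.h>

typedef struct ExamplePath
{
	int			disabled_nodes; /* disabled plan nodes below this path */
	double		total_cost;		/* ordinary cost estimate */
} ExamplePath;

static int
compare_paths(const ExamplePath *a, const ExamplePath *b)
{
	/* fewer disabled nodes always wins, regardless of raw cost */
	if (a->disabled_nodes != b->disabled_nodes)
		return (a->disabled_nodes < b->disabled_nodes) ? -1 : 1;
	if (a->total_cost != b->total_cost)
		return (a->total_cost < b->total_cost) ? -1 : 1;
	return 0;
}

int
main(void)
{
	ExamplePath seqscan = {1, 100.0};		/* cheap but disabled */
	ExamplePath indexscan = {0, 5000.0};	/* expensive but allowed */

	printf("%d\n", compare_paths(&seqscan, &indexscan));	/* prints 1 */
	return 0;
}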

regards, tom lane




Re: Popcount optimization using AVX512

2024-04-02 Thread Nathan Bossart
On Mon, Apr 01, 2024 at 05:11:17PM -0500, Nathan Bossart wrote:
> Here is a v19 of the patch set.  I moved out the refactoring of the
> function pointer selection code to 0001.  I think this is a good change
> independent of $SUBJECT, and I plan to commit this soon.  In 0002, I
> changed the syslogger.c usage of pg_popcount() to use pg_number_of_ones
> instead.  This is standard practice elsewhere where the popcount functions
> are unlikely to win.  I'll probably commit this one soon, too, as it's even
> more trivial than 0001.
>
> 0003 is the AVX512 POPCNT patch.  Besides refactoring out 0001, there are
> no changes from v18.  0004 is an early proof-of-concept for using AVX512
> for the visibility map code.  The code is missing comments, and I haven't
> performed any benchmarking yet, but I figured I'd post it because it
> demonstrates how it's possible to build upon 0003 in other areas.

I've committed the first two patches, and I've attached a rebased version
of the latter two.

> AFAICT the main open question is the function call overhead in 0003 that
> Alvaro brought up earlier.  After 0002 is committed, I believe the only
> in-tree caller of pg_popcount() with very few bytes is bit_count(), and I'm
> not sure it's worth expending too much energy to make sure there are
> absolutely no regressions there.  However, I'm happy to do so if folks feel
> that it is necessary, and I'd be grateful for thoughts on how to proceed on
> this one.

Another idea I had is to turn pg_popcount() into a macro that just uses the
pg_number_of_ones array when called for few bytes:

static inline uint64
pg_popcount_inline(const char *buf, int bytes)
{
	uint64		popcnt = 0;

	while (bytes--)
		popcnt += pg_number_of_ones[(unsigned char) *buf++];

	return popcnt;
}

#define pg_popcount(buf, bytes) \
	((bytes < 64) ? \
	 pg_popcount_inline(buf, bytes) : \
	 pg_popcount_optimized(buf, bytes))

But again, I'm not sure this is really worth it for the current use-cases.
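
For reference, a small self-contained sketch of the byte-wise lookup that
pg_popcount_inline() above relies on; the table name and building it at
startup are only for illustration (the in-tree pg_number_of_ones array is
precomputed):

#include <stdint.h>
#include <stdio.h>

static uint8_t number_of_ones[256];

static void
init_table(void)
{
	number_of_ones[0] = 0;
	for (int i = 1; i < 256; i++)
		number_of_ones[i] = number_of_ones[i >> 1] + (uint8_t) (i & 1);
}

static uint64_t
popcount_bytes(const char *buf, int bytes)
{
	uint64_t	popcnt = 0;

	while (bytes--)
		popcnt += number_of_ones[(unsigned char) *buf++];
	return popcnt;
}

int
main(void)
{
	init_table();
	/* 0xff contributes 8 one bits, 0x0f contributes 4: prints 12 */
	printf("%llu\n", (unsigned long long) popcount_bytes("\xff\x0f", 2));
	return 0;
}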

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From 3c5c3fdaffd623b513bcc476ee7c15f6379af1e7 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Wed, 27 Mar 2024 16:39:24 -0500
Subject: [PATCH v20 1/2] AVX512 popcount support

---
 config/c-compiler.m4 |  58 ++
 configure| 252 +++
 configure.ac |  51 ++
 meson.build  |  87 +
 src/Makefile.global.in   |   5 +
 src/include/pg_config.h.in   |  12 ++
 src/include/port/pg_bitutils.h   |  15 ++
 src/makefiles/meson.build|   4 +-
 src/port/Makefile|  11 ++
 src/port/meson.build |   6 +-
 src/port/pg_bitutils.c   |   7 +-
 src/port/pg_popcount_avx512.c|  49 ++
 src/port/pg_popcount_avx512_choose.c |  71 
 src/test/regress/expected/bit.out|  24 +++
 src/test/regress/sql/bit.sql |   4 +
 15 files changed, 651 insertions(+), 5 deletions(-)
 create mode 100644 src/port/pg_popcount_avx512.c
 create mode 100644 src/port/pg_popcount_avx512_choose.c

diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 3268a780bb..5fb60775ca 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -694,3 +694,61 @@ if test x"$Ac_cachevar" = x"yes"; then
 fi
 undefine([Ac_cachevar])dnl
 ])# PGAC_LOONGARCH_CRC32C_INTRINSICS
+
+# PGAC_XSAVE_INTRINSICS
+# -
+# Check if the compiler supports the XSAVE instructions using the _xgetbv
+# intrinsic function.
+#
+# An optional compiler flag can be passed as argument (e.g., -mxsave).  If the
+# intrinsic is supported, sets pgac_xsave_intrinsics and CFLAGS_XSAVE.
+AC_DEFUN([PGAC_XSAVE_INTRINSICS],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_xsave_intrinsics_$1])])dnl
+AC_CACHE_CHECK([for _xgetbv with CFLAGS=$1], [Ac_cachevar],
+[pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS $1"
+AC_LINK_IFELSE([AC_LANG_PROGRAM([#include ],
+  [return _xgetbv(0) & 0xe0;])],
+  [Ac_cachevar=yes],
+  [Ac_cachevar=no])
+CFLAGS="$pgac_save_CFLAGS"])
+if test x"$Ac_cachevar" = x"yes"; then
+  CFLAGS_XSAVE="$1"
+  pgac_xsave_intrinsics=yes
+fi
+undefine([Ac_cachevar])dnl
+])# PGAC_XSAVE_INTRINSICS
+
+# PGAC_AVX512_POPCNT_INTRINSICS
+# -
+# Check if the compiler supports the AVX512 POPCNT instructions using the
+# _mm512_setzero_si512, _mm512_loadu_si512, _mm512_popcnt_epi64,
+# _mm512_add_epi64, and _mm512_reduce_add_epi64 intrinsic functions.
+#
+# An optional compiler flag can be passed as argument
+# (e.g., -mavx512vpopcntdq).  If the intrinsics are supported, sets
+# pgac_avx512_popcnt_intrinsics and CFLAGS_POPCNT.
+AC_DEFUN([PGAC_AVX512_POPCNT_INTRINSICS],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx512_popcnt_intrinsics_$1])])dnl

Re: Detoasting optionally to make Explain-Analyze less misleading

2024-04-02 Thread Tom Lane
Matthias van de Meent  writes:
> Attached is v9, which is rebased on master to handle 24eebc65's
> changed signature of pq_sendcountedtext.
> It now also includes autocompletion, and a second patch which adds
> documentation to give users insights into this new addition to
> EXPLAIN.

I took a quick look through this.  Some comments in no particular
order:

Documentation is not optional, so I don't really see the point of
splitting this into two patches.

IIUC, it's not possible to use the SERIALIZE option when explaining
CREATE TABLE AS, because you can't install the instrumentation tuple
receiver when the IntoRel one is needed.  I think that's fine because
no serialization would happen in that case anyway, but should we
throw an error if that combination is requested?  Blindly reporting
that zero bytes were serialized seems like it'd confuse people.

I'd lose the stuff about measuring memory consumption.  Nobody asked
for that and the total is completely misleading, because in reality
we'll reclaim the memory used after each row.  It would allow cutting
the text-mode output down to one line, too, instead of having your
own format that's not like anything else.

I thought the upthread agreement was to display the amount of
data sent rounded to kilobytes, so why is the code displaying
an exact byte count?

I don't especially care for magic numbers like these:

+   /* see printtup.h why we add 18 bytes here. These are the infos
+* needed for each attribute plus the attribute's name */
+   receiver->metrics.bytesSent += (int64) namelen + 1 + 18;

If the protocol is ever changed in a way that invalidates this,
there's about 0 chance that somebody would remember to touch
this code.

However, isn't the bottom half of serializeAnalyzeStartup doing
exactly what the comment above it says we don't do, that is accounting
for the RowDescription message?  Frankly I agree with the comment that
it's not worth troubling over, so I'd just drop that code rather than
finding a solution for the magic-number problem.

Don't bother with duplicating valgrind-related logic in
serializeAnalyzeReceive --- that's irrelevant to actual users.

This seems like cowboy coding:

+   self->destRecevier.mydest = DestNone;

You should define a new value of the CommandDest enum and
integrate this receiver type into the support functions
in dest.c.

BTW, "destRecevier" is misspelled...

regards, tom lane




Re: Extension for PostgreSQL cast jsonb to hstore WIP

2024-04-02 Thread ShadowGhost
At the moment, this cast supports only these structures, as it was enough
for my tasks:
{str:numeric}
{str:str}
{str:bool}
{str:null}
But it's a great idea and I'll think about implementing it.

вт, 2 апр. 2024 г. в 19:48, Andrew Dunstan :

>
> On 2024-04-02 Tu 07:07, ShadowGhost wrote:
> > Hello all.
> > Recently, when working with the hstore and json formats, I came across
> > the fact that PostgreSQL has a cast of hstore to json, but there is no
> > reverse cast. I thought it might make it more difficult to work with
> > these formats. And I decided to make a cast json in the hstore. I used
> > the built-in jsonb structure to create it and may have introduced
> > methods to increase efficiency by 25% than converting the form
> > jsonb->text->hstore. Which of course is a good fact. I also wrote
> > regression tests to check the performance. I think this extension will
> > improve the work with jsonb and hstore in PostgreSQL.
> > If you've read this far, thank you for your interest, and I hope you
> > enjoy this extension!
> >
>
> One reason we don't have such a cast is that hstore has a flat
> structure, while json is tree structured, and it's not always an object
> / hash. Thus it's easy to reliably cast hstore to json but far less easy
> to cast json to hstore in the general case.
>
> What do you propose to do in the case or json consisting of scalars, or
> arrays, or with nested elements?
>
>
> cheers
>
>
> andrew
>
>
>
> --
> Andrew Dunstan
> EDB: https://www.enterprisedb.com
>
>


Re: [HACKERS] make async slave to wait for lsn to be replayed

2024-04-02 Thread Kartyshov Ivan

On 2024-04-02 13:15, Bharath Rupireddy wrote:

On Tue, Apr 2, 2024 at 3:41 PM Kartyshov Ivan
 wrote:


8 years, we tried to add this feature, and now we suggest the best way
for this feature is to commit the minimal version first.


Just curious, do you or anyone else have an immediate use for this
function? If yes, how are they achieving read-after-write-consistency
on streaming standbys in their application right now without a
function like this?


Right now, the application runs pg_current_wal_lsn() after the update and then
waits in a loop on pg_current_wal_flush_lsn() until it has caught up.

Or use slow synchronous_commit.
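
As a rough illustration of that polling pattern: the sketch below captures the
primary's WAL position after a write and polls the standby with
pg_last_wal_replay_lsn() until replay passes it; connection handles, the fixed
10 ms sleep, and the omission of error handling are all assumptions made for
brevity:

#include <string.h>
#include <unistd.h>
#include <libpq-fe.h>

static void
wait_for_replay(PGconn *primary, PGconn *standby)
{
	char		lsn[64];
	PGresult   *res;

	res = PQexec(primary, "SELECT pg_current_wal_lsn()");
	strncpy(lsn, PQgetvalue(res, 0, 0), sizeof(lsn) - 1);
	lsn[sizeof(lsn) - 1] = '\0';
	PQclear(res);

	for (;;)
	{
		const char *values[1] = {lsn};

		/* true once the standby has replayed past the captured position */
		res = PQexecParams(standby,
						   "SELECT pg_last_wal_replay_lsn() >= $1::pg_lsn",
						   1, NULL, values, NULL, NULL, 0);
		if (strcmp(PQgetvalue(res, 0, 0), "t") == 0)
		{
			PQclear(res);
			break;
		}
		PQclear(res);
		usleep(10000);			/* 10 ms between polls */
	}
}

The proposed pg_wal_replay_wait() would replace this loop with a single call
on the standby.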

--
Ivan Kartyshov
Postgres Professional: www.postgrespro.com




Re: Table AM Interface Enhancements

2024-04-02 Thread Jeff Davis
On Tue, 2024-04-02 at 11:49 +0300, Alexander Korotkov wrote:
> I don't like the idea that every custom table AM reltoptions should
> begin with StdRdOptions.  I would rather introduce the new data
> structure with table options, which need to be accessed outside of
> table AM.  Then reloptions will be a backbox only directly used in
> table AM, while table AM has a freedom on what to store in reloptions
> and how to calculate externally-visible options.  What do you think?

Hi Alexander!

I agree with all of that. It will take some refactoring to get there,
though.

One idea is to store StdRdOptions like normal, but if an unrecognized
option is found, ask the table AM if it understands the option. In that
case I think we'd just use a different field in pg_class so that it can
use whatever format it wants to represent its options.

Regards,
Jeff Davis





Re: Statistics Import and Export

2024-04-02 Thread Jeff Davis
On Tue, 2024-04-02 at 05:38 -0400, Corey Huinker wrote:
> Here's a one-liner patch for disabling update of pg_class
> relpages/reltuples/relallviible during a binary upgrade.

This change makes sense to me regardless of the rest of the work.
Updating the relpages/reltuples/relallvisible during pg_upgrade before
the data is there will store the wrong stats.

It could use a brief comment, though.

Regards,
Jeff Davis





Re: On disable_cost

2024-04-02 Thread Robert Haas
On Tue, Apr 2, 2024 at 10:04 AM Greg Sabino Mullane  wrote:
> So rather than listing all the things we don't want to happen, we need a way 
> to force (nay, highly encourage) a particular solution. As our costing is a 
> based on positive numbers, what if we did something like this in costsize.c?
>
>  Costdisable_cost = 1.0e10;
>  Costpromotion_cost = 1.0e10; // or higher or lower, depending on how 
> strongly we want to "beat" disable_costs effects.
> ...
>
> if (!enable_seqscan)
> startup_cost += disable_cost;
> else if (promote_seqscan)
> startup_cost -= promotion_cost; // or replace "promote" with 
> "encourage"?

I'm pretty sure negative costs are going to create a variety of
unpleasant planning artifacts. The large positive costs do, too, which
is where this whole discussion started. If I disable (or promote) some
particular plan, I want the rest of the plan tree to come out looking
as much as possible like what would have happened if the same
alternative had won organically on cost. I think the only reason we're
driving this off of costing today is that making add_path() more
complicated is unappealing, mostly on performance grounds, and if you
add disabled-ness (or promoted-ness) as a separate axis of value then
add_path() has to know about that on top of everything else. I think
the goal here is to come up with a more principled alternative that
isn't just based on whacking large numbers into the cost and hoping
something good happens ... but it is a whole lot easier to be unhappy
with the status quo than it is to come up with something that's
actually better.

I am planning to spend some more time thinking about it, though.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: psql not responding to SIGINT upon db reconnection

2024-04-02 Thread Tristan Partin

On Tue Apr 2, 2024 at 9:32 AM CDT, Robert Haas wrote:

On Mon, Apr 1, 2024 at 12:04 PM Tristan Partin  wrote:
> > This sentence seems a bit contorted to me, like maybe Yoda wrote it.
>
> Incorporated feedback, I have :).

Committed it, I did. My thanks for working on this issue, I extend.


Thanks to everyone for their reviews! Patch is in a much better place 
than when it started.


--
Tristan Partin
Neon (https://neon.tech)




Re: psql not responding to SIGINT upon db reconnection

2024-04-02 Thread Robert Haas
On Mon, Apr 1, 2024 at 12:04 PM Tristan Partin  wrote:
> > This sentence seems a bit contorted to me, like maybe Yoda wrote it.
>
> Incorporated feedback, I have :).

Committed it, I did. My thanks for working on this issue, I extend.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Synchronizing slots from primary to standby

2024-04-02 Thread Bharath Rupireddy
On Tue, Apr 2, 2024 at 7:25 PM Zhijie Hou (Fujitsu)
 wrote:
>
> > 1. Can we just remove pg_logical_replication_slot_advance and use
> > LogicalSlotAdvanceAndCheckSnapState instead? If worried about the
> > function naming, LogicalSlotAdvanceAndCheckSnapState can be renamed to
> > pg_logical_replication_slot_advance?
> >
> > + * Advance our logical replication slot forward. See
> > + * LogicalSlotAdvanceAndCheckSnapState for details.
> >   */
> >  static XLogRecPtr
> >  pg_logical_replication_slot_advance(XLogRecPtr moveto)
> >  {
>
> It was commented[1] that it's not appropriate for the
> pg_logical_replication_slot_advance to have an out parameter
> 'ready_for_decoding' which looks a bit awkward as the functionality doesn't
> match
> the name, and is also not consistent with the style of
> pg_physical_replication_slot_advance(). So, we added a new function.

I disagree here. A new function just for a parameter is not that great
IMHO. I'd just rename LogicalSlotAdvanceAndCheckSnapState(XLogRecPtr
moveto, bool *found_consistent_snapshot) to
pg_logical_replication_slot_advance(XLogRecPtr moveto, bool
*found_consistent_snapshot) and use it. If others don't like this, I'd
at least turn pg_logical_replication_slot_advance(XLogRecPtr moveto) a
static inline function.

> > 5. As far as the test case for this issue is concerned, I'm fine with
> > adding one using an INJECTION point because we seem to be having no
> > consistent way to control postgres writing current snapshot to WAL.
>
> Since my colleagues and I can reproduce the issue consistently after applying
> 0002, and it could be rare for a concurrent xl_running_xacts to happen, we
> discussed[2] adding the INJECTION point after pushing the main fix.

Right.

> > 7.
> > +/*
> > + * We need to access the system tables during decoding to build the
> > + * logical changes unless we are in fast-forward mode where no changes
> > are
> > + * generated.
> > + */
> > +if (slot->data.database != MyDatabaseId && !fast_forward)
> >
> > May I know if we need this change for this fix?
>
> The slotsync worker needs to advance the slots from different databases in
> fast_forward. So, we need to skip this check in fast_forward mode. The 
> analysis can
> be found in [3].
-if (slot->data.database != MyDatabaseId)
+/*
+ * We need to access the system tables during decoding to build the
+ * logical changes unless we are in fast_forward mode where no changes are
+ * generated.
+ */
+if (slot->data.database != MyDatabaseId && !fast_forward)
 ereport(ERROR,

It's not clear from the comment that we need it for a slotsync worker
to advance the slots from different databases. Can this be put into
the comment? Also, specify in the comment, why this is safe?

Also, if this change is needed for only slotsync workers, why not
protect it with IsSyncingReplicationSlots()? Otherwise, it might
impact non-slotsync callers, no?

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: On disable_cost

2024-04-02 Thread Greg Sabino Mullane
On Mon, Apr 1, 2024 at 7:54 PM Robert Haas  wrote:

> What I think we're mostly doing in the regression tests is shutting
> off every relevant type of plan except one. I theorize that what we
> actually want to do is tell the planner what we do want to happen,
> rather than what we don't want to happen, but we've got this weird set
> of GUCs that do the opposite of that and we're super-attached to them
> because they've existed forever.


So rather than listing all the things we don't want to happen, we need a
way to force (nay, highly encourage) a particular solution. As our costing
is a based on positive numbers, what if we did something like this in
costsize.c?

 Costdisable_cost = 1.0e10;
 Costpromotion_cost = 1.0e10; // or higher or lower, depending on
how strongly we want to "beat" disable_costs effects.
...

if (!enable_seqscan)
startup_cost += disable_cost;
else if (promote_seqscan)
startup_cost -= promotion_cost; // or replace "promote" with
"encourage"?


Cheers,
Greg


Re: Confusing #if nesting in hmac_openssl.c

2024-04-02 Thread Daniel Gustafsson
> On 2 Apr 2024, at 15:50, Tom Lane  wrote:

> I'll go ahead with what I have.

+1

--
Daniel Gustafsson





RE: Synchronizing slots from primary to standby

2024-04-02 Thread Zhijie Hou (Fujitsu)
On Tuesday, April 2, 2024 8:49 PM Bharath Rupireddy 
 wrote:
> 
> On Tue, Apr 2, 2024 at 2:11 PM Zhijie Hou (Fujitsu)
>  wrote:
> >
> > CFbot[1] complained about one query result's order in the tap-test, so I am
> > attaching a V7 patch set which fixed this. There are no changes in 0001.
> >
> > [1] https://cirrus-ci.com/task/6375962162495488
> 
> Thanks. Here are some comments:

Thanks for the comments.

> 
> 1. Can we just remove pg_logical_replication_slot_advance and use
> LogicalSlotAdvanceAndCheckSnapState instead? If worried about the
> function naming, LogicalSlotAdvanceAndCheckSnapState can be renamed to
> pg_logical_replication_slot_advance?
> 
> + * Advance our logical replication slot forward. See
> + * LogicalSlotAdvanceAndCheckSnapState for details.
>   */
>  static XLogRecPtr
>  pg_logical_replication_slot_advance(XLogRecPtr moveto)
>  {

It was commented[1] that it's not appropriate for the
pg_logical_replication_slot_advance to have an out parameter
'ready_for_decoding' which looks a bit awkward as the functionality doesn't match
the name, and is also not consistent with the style of
pg_physical_replication_slot_advance(). So, we added a new function.

> 
> 2.
> +if (!ready_for_decoding)
> +{
> +elog(DEBUG1, "could not find consistent point for synced
> slot; restart_lsn = %X/%X",
> + LSN_FORMAT_ARGS(slot->data.restart_lsn));
> 
> Can we specify the slot name in the message?

Added.

> 
> 3. Also, can the "could not find consistent point for synced slot;
> restart_lsn = %X/%X" be emitted at LOG level just like other messages
> in update_and_persist_local_synced_slot. Although, I see "XXX should
> this be changed to elog(DEBUG1) perhaps?", these messages need to be
> at LOG level as they help debug issues if at all they are hit.

Changed to LOG and reworded the message.

> 
> 4. How about using found_consistent_snapshot instead of
> ready_for_decoding? A general understanding is that the synced slots
> are not allowed for decoding (although with this fix, we do that for
> internal purposes), ready_for_decoding looks a bit misleading.

Agreed and renamed.

> 
> 5. As far as the test case for this issue is concerned, I'm fine with
> adding one using an INJECTION point because we seem to be having no
> consistent way to control postgres writing current snapshot to WAL.

Since my colleagues and I can reproduce the issue consistently after applying
0002, and it could be rare for a concurrent xl_running_xacts to happen, we
discussed[2] adding the INJECTION point after pushing the main fix.

> 
> 6. A nit: can we use "fast_forward mode" instead of "fast-forward
> mode" just to be consistent?
> + * logical changes unless we are in fast-forward mode where no changes
> are
> 
> 7.
> +/*
> + * We need to access the system tables during decoding to build the
> + * logical changes unless we are in fast-forward mode where no changes
> are
> + * generated.
> + */
> +if (slot->data.database != MyDatabaseId && !fast_forward)
> 
> May I know if we need this change for this fix?

The slotsync worker needs to advance the slots from different databases in
fast_forward. So, we need to skip this check in fast_forward mode. The
analysis can be found in [3].

Attach the V8 patch which addressed above comments.


[1] 
https://www.postgresql.org/message-id/CAA4eK1%2BwkaRi2BrLLC_0gKbHN68Awc9dRp811G3An6A6fuqdOg%40mail.gmail.com
[2] 
https://www.postgresql.org/message-id/ZgvI9iAUWCZ17z5V%40ip-10-97-1-34.eu-west-3.compute.internal
[3] 
https://www.postgresql.org/message-id/CAJpy0uCQ2PDCAqcnbdOz6q_ZqmBfMyBpVqKDqL_XZBP%3DeK-1yw%40mail.gmail.com

Best Regards,
Hou zj


v8-0002-test-the-data-loss-case.patch
Description: v8-0002-test-the-data-loss-case.patch


v8-0001-advance-the-restart_lsn-of-synced-slots-using-log.patch
Description:  v8-0001-advance-the-restart_lsn-of-synced-slots-using-log.patch


Re: Confusing #if nesting in hmac_openssl.c

2024-04-02 Thread Tom Lane
Daniel Gustafsson  writes:
> On 2 Apr 2024, at 02:01, Tom Lane  wrote:
>> I don't think
>> it is helpful to put the resource owner manipulations inside #ifdef
>> HAVE_HMAC_CTX_NEW and HAVE_HMAC_CTX_FREE --- ...
>> What do you think of rearranging it as attached?

> +1 on this patch, it makes the #ifdef soup more readable.

Thanks for looking at it.

> We could go even
> further and remove the HAVE_HMAC defines completely with USE_RESOWNER_FOR_HMAC
> being set by autoconf/meson?  I've attached an untested sketch diff to
> illustrate.

I'm inclined to think that won't work, because we need the HAVE_
macros separately to compile correct frontend code.

> A related tangent.  If we assembled the data to calculate on ourselves rather
> than rely on OpenSSL to do it with subsequent _update calls we could instead
> use the simpler HMAC() API from OpenSSL.  That would remove the need for the
> HMAC_CTX and resource owner tracking entirely and just have our pg_hmac_ctx.
> That's clearly not for this patch though, just thinking out loud that we set up
> OpenSSL infrastructure that we don't really use.

Simplifying like that could be good, but I'm not volunteering.
For the moment I'd just like to silence the buildfarm warning,
so I'll go ahead with what I have.

regards, tom lane




meson vs windows perl

2024-04-02 Thread Andrew Dunstan

meson.build has this code

    ldopts = run_command(perl, '-MExtUtils::Embed', '-e', 'ldopts',
                         check: true).stdout().strip()
    undesired = run_command(perl_conf_cmd, 'ccdlflags', check: true).stdout().split()
    undesired += run_command(perl_conf_cmd, 'ldflags', check: true).stdout().split()

    perl_ldopts = []
    foreach ldopt : ldopts.split(' ')
      if ldopt == '' or ldopt in undesired
        continue
      endif
      perl_ldopts += ldopt.strip('"')
    endforeach

    message('LDFLAGS recommended by perl: "@0@"'.format(ldopts))
    message('LDFLAGS for embedding perl: "@0@"'.format(' '.join(perl_ldopts)))


This code is seriously broken if perl reports items that include spaces,
because a) removing the quotes is quite wrong, and b) splitting on spaces
is also wrong.


Here's an example from one of my colleagues:


   C:\Program Files\Microsoft Visual Studio\2022\Professional>perl.EXE 
-MExtUtils::Embed -e ldopts
  -nologo -nodefaultlib -debug -opt:ref,icf -ltcg  
-libpath:"C:\edb\languagepack\v4\Perl-5.38\lib\CORE"
   -machine:AMD64 -subsystem:console,"5.02"  
"C:\edb\languagepack\v4\Perl-5.38\lib\CORE\perl538.lib"
   "C:\Program Files\Microsoft Visual 
Studio\2022\Professional\VC\Tools\MSVC\14.38.33130\lib\x64\oldnames.lib"
   "C:\Program Files (x86)\Windows 
Kits\10\\lib\10.0.22621.0\\um\x64\kernel32.lib"
   "C:\Program Files (x86)\Windows Kits\10\\lib\10.0.22621.0\\um\x64\user32.lib"
   "C:\Program Files (x86)\Windows Kits\10\\lib\10.0.22621.0\\um\x64\gdi32.lib"
   "C:\Program Files (x86)\Windows 
Kits\10\\lib\10.0.22621.0\\um\x64\winspool.lib"
   "C:\Program Files (x86)\Windows 
Kits\10\\lib\10.0.22621.0\\um\x64\comdlg32.lib"
   "C:\Program Files (x86)\Windows 
Kits\10\\lib\10.0.22621.0\\um\x64\advapi32.lib"
   "C:\Program Files (x86)\Windows 
Kits\10\\lib\10.0.22621.0\\um\x64\shell32.lib"
   "C:\Program Files (x86)\Windows Kits\10\\lib\10.0.22621.0\\um\x64\ole32.lib"
   "C:\Program Files (x86)\Windows 
Kits\10\\lib\10.0.22621.0\\um\x64\oleaut32.lib"
   "C:\Program Files (x86)\Windows 
Kits\10\\lib\10.0.22621.0\\um\x64\netapi32.lib"
   "C:\Program Files (x86)\Windows Kits\10\\lib\10.0.22621.0\\um\x64\uuid.lib"
   "C:\Program Files (x86)\Windows Kits\10\\lib\10.0.22621.0\\um\x64\ws2_32.lib"
   "C:\Program Files (x86)\Windows Kits\10\\lib\10.0.22621.0\\um\x64\mpr.lib"
   "C:\Program Files (x86)\Windows Kits\10\\lib\10.0.22621.0\\um\x64\winmm.lib"
   "C:\Program Files (x86)\Windows 
Kits\10\\lib\10.0.22621.0\\um\x64\version.lib"
   "C:\Program Files (x86)\Windows Kits\10\\lib\10.0.22621.0\\um\x64\odbc32.lib"
   "C:\Program Files (x86)\Windows 
Kits\10\\lib\10.0.22621.0\\um\x64\odbccp32.lib"
   "C:\Program Files (x86)\Windows 
Kits\10\\lib\10.0.22621.0\\um\x64\comctl32.lib"
   "C:\Program Files\Microsoft Visual 
Studio\2022\Professional\VC\Tools\MSVC\14.38.33130\lib\x64\msvcrt.lib"
   "C:\Program Files\Microsoft Visual 
Studio\2022\Professional\VC\Tools\MSVC\14.38.33130\lib\x64\vcruntime.lib"
   "C:\Program Files (x86)\Windows Kits\10\lib\10.0.22621.0\ucrt\x64\ucrt.lib"

And with that we get errors like

   cl : Command line warning D9024 : unrecognized source file type 
'C:\Program', object file assumed
   cl : Command line warning D9024 : unrecognized source file type 
'Files\Microsoft', object file assumed
   cl : Command line warning D9024 : unrecognized source file type 'Visual', 
object file assumed
   cl : Command line warning D9024 : unrecognized source file type 
'C:\Program', object file assumed
   cl : Command line warning D9024 : unrecognized source file type 'Files', 
object file assumed
   cl : Command line warning D9024 : unrecognized source file type 
'(x86)\Windows', object file assumed


It looks like we need to get smarter about how we process the ldopts and strip 
out the ccdlflags and ldflags
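
To illustrate the kind of quote-aware splitting that would keep entries such
as "C:\Program Files\..." in one piece, here is a small standalone sketch; it
only demonstrates the tokenisation rule and is not the meson code itself:

#include <stdio.h>
#include <string.h>

/* split a flag string on spaces, but keep double-quoted runs together */
static int
split_ldopts(char *line, char **tokens, int max_tokens)
{
	int			ntok = 0;
	char	   *p = line;

	while (*p && ntok < max_tokens)
	{
		while (*p == ' ')
			p++;				/* skip separators */
		if (*p == '\0')
			break;
		if (*p == '"')
		{
			tokens[ntok++] = ++p;	/* token starts after the quote */
			while (*p && *p != '"')
				p++;
		}
		else
		{
			tokens[ntok++] = p;
			while (*p && *p != ' ' && *p != '"')
				p++;
		}
		if (*p)
			*p++ = '\0';		/* terminate the token in place */
	}
	return ntok;
}

int
main(void)
{
	char		line[] = "-machine:AMD64 \"C:\\Program Files\\perl538.lib\" -ltcg";
	char	   *tokens[8];
	int			n = split_ldopts(line, tokens, 8);

	for (int i = 0; i < n; i++)
		printf("[%s]\n", tokens[i]);
	return 0;
}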

cheers

andrew

--
Andrew Dunstan
EDB:https://www.enterprisedb.com


Re: remaining sql/json patches

2024-04-02 Thread jian he
On Fri, Mar 22, 2024 at 12:08 AM Amit Langote  wrote:
>
> On Wed, Mar 20, 2024 at 9:53 PM Amit Langote  wrote:
> > I'll push 0001 tomorrow.
>
> Pushed that one.  Here's the remaining JSON_TABLE() patch.
>
I know v45 is very different from v47.
but v45 contains all the remaining features to be implemented.

I've attached 2 files.
v45-0001-propagate-passing-clause-to-every-json_ta.based_on_v45
after_apply_jsonpathvar.sql.

the first file should be applied after v45-0001-JSON_TABLE.patch
the second file has all kinds of tests to prove that
applying JsonPathVariable to the NESTED PATH is ok.

I know that v45 is not the whole patch we are going to push for postgres17.
I just want to point out that applying the PASSING clause to the NESTED PATH
works fine with V45.

that means, I think, we can safely apply PASSING clause to NESTED PATH for
feature "PLAN DEFAULT clause", "specific PLAN clause" and "sibling
NESTED COLUMNS clauses".


v45-0001-propagate-passing-clause-to-every-json_ta.based_on_v45
Description: Binary data


after_apply_jsonpathvar.sql
Description: application/sql


Re: Combine Prune and Freeze records emitted by vacuum

2024-04-02 Thread Heikki Linnakangas

On 01/04/2024 20:22, Melanie Plageman wrote:

Review for 0003-0006 (I didn't have any new thoughts on 0002). I know
you didn't modify them much/at all, but I noticed some things in my code
that could be better.


Ok, here's what I have now. I made a lot of small comment changes here 
and there, and some minor local refactorings, but nothing major. I lost 
track of all the individual changes, I'm afraid, so you'll 
have to just diff against the previous version if you want to see what's 
changed. I hope I didn't break anything.


I'm pretty happy with this now. I will skim through it one more time 
later today or tomorrow, and commit. Please review once more if you have 
a chance.



This probably doesn't belong here. I noticed spgdoinsert.c had a static
function for sorting OffsetNumbers, but I didn't see anything general
purpose anywhere else.


I copied the spgdoinsert.c implementation to vacuumlazy.c as is. Would 
be nice to have just one copy of this in some common place, but I also 
wasn't sure where to put it.
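
For reference, the comparator being talked about amounts to something like
this standalone sketch (the typedef and the main() driver are only for
illustration; in the tree OffsetNumber is the usual uint16 typedef):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

typedef uint16_t OffsetNumber;	/* stands in for the PostgreSQL typedef */

static int
cmp_offset_numbers(const void *a, const void *b)
{
	OffsetNumber offa = *(const OffsetNumber *) a;
	OffsetNumber offb = *(const OffsetNumber *) b;

	/* branch-free three-way comparison, safe for unsigned values */
	return (offa > offb) - (offa < offb);
}

int
main(void)
{
	OffsetNumber offsets[] = {42, 7, 19, 3};

	qsort(offsets, 4, sizeof(OffsetNumber), cmp_offset_numbers);
	for (int i = 0; i < 4; i++)
		printf("%u\n", (unsigned) offsets[i]);
	return 0;
}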


--
Heikki Linnakangas
Neon (https://neon.tech)
From be8891155c93f3555c49371f9804bdf5ba578f6e Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Tue, 2 Apr 2024 15:47:06 +0300
Subject: [PATCH v12 1/2] Refactor how heap_prune_chain() updates prunable_xid

In preparation of freezing and counting tuples which are not
candidates for pruning, split heap_prune_record_unchanged() into
multiple functions, depending on the kind of line pointer. That's not too
interesting right now, but makes the next commit smaller.

Recording the lowest soon-to-be prunable xid is one of the actions we
take for unchanged LP_NORMAL item pointers but not for others, so move
that to the new heap_prune_record_unchanged_lp_normal() function. The
next commit will add more actions to these functions.

Author: Melanie Plageman 
Discussion: https://www.postgresql.org/message-id/20240330055710.kqg6ii2cdojsxgje@liskov
---
 src/backend/access/heap/pruneheap.c | 125 
 1 file changed, 92 insertions(+), 33 deletions(-)

diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index ef563e19aa5..1b5bf990d21 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -78,7 +78,11 @@ static void heap_prune_record_redirect(PruneState *prstate,
 static void heap_prune_record_dead(PruneState *prstate, OffsetNumber offnum, bool was_normal);
 static void heap_prune_record_dead_or_unused(PruneState *prstate, OffsetNumber offnum, bool was_normal);
 static void heap_prune_record_unused(PruneState *prstate, OffsetNumber offnum, bool was_normal);
-static void heap_prune_record_unchanged(PruneState *prstate, OffsetNumber offnum);
+
+static void heap_prune_record_unchanged_lp_unused(Page page, PruneState *prstate, OffsetNumber offnum);
+static void heap_prune_record_unchanged_lp_normal(Page page, int8 *htsv, PruneState *prstate, OffsetNumber offnum);
+static void heap_prune_record_unchanged_lp_dead(Page page, PruneState *prstate, OffsetNumber offnum);
+static void heap_prune_record_unchanged_lp_redirect(PruneState *prstate, OffsetNumber offnum);
 
 static void page_verify_redirects(Page page);
 
@@ -311,7 +315,7 @@ heap_page_prune(Relation relation, Buffer buffer,
 		/* Nothing to do if slot doesn't contain a tuple */
 		if (!ItemIdIsUsed(itemid))
 		{
-			heap_prune_record_unchanged(, offnum);
+			heap_prune_record_unchanged_lp_unused(page, , offnum);
 			continue;
 		}
 
@@ -324,7 +328,7 @@ heap_page_prune(Relation relation, Buffer buffer,
 			if (unlikely(prstate.mark_unused_now))
 heap_prune_record_unused(, offnum, false);
 			else
-heap_prune_record_unchanged(, offnum);
+heap_prune_record_unchanged_lp_dead(page, , offnum);
 			continue;
 		}
 
@@ -434,7 +438,7 @@ heap_page_prune(Relation relation, Buffer buffer,
 			}
 		}
 		else
-			heap_prune_record_unchanged(, offnum);
+			heap_prune_record_unchanged_lp_normal(page, presult->htsv, , offnum);
 	}
 
 	/* We should now have processed every tuple exactly once  */
@@ -652,9 +656,6 @@ heap_prune_chain(Page page, BlockNumber blockno, OffsetNumber maxoff,
 		 */
 		chainitems[nchain++] = offnum;
 
-		/*
-		 * Check tuple's visibility status.
-		 */
 		switch (htsv_get_valid_status(htsv[offnum]))
 		{
 			case HEAPTUPLE_DEAD:
@@ -670,9 +671,6 @@ heap_prune_chain(Page page, BlockNumber blockno, OffsetNumber maxoff,
 			case HEAPTUPLE_RECENTLY_DEAD:
 
 /*
- * This tuple may soon become DEAD.  Update the hint field so
- * that the page is reconsidered for pruning in future.
- *
  * We don't need to advance the conflict horizon for
  * RECENTLY_DEAD tuples, even if we are removing them.  This
  * is because we only remove RECENTLY_DEAD tuples if they
@@ -681,8 +679,6 @@ heap_prune_chain(Page page, BlockNumber blockno, OffsetNumber maxoff,
  * tuple by virtue of being later in the chain.  We will have
  * advanced the conflict horizon for the DEAD 

Re: Synchronizing slots from primary to standby

2024-04-02 Thread Bharath Rupireddy
On Tue, Apr 2, 2024 at 2:11 PM Zhijie Hou (Fujitsu)
 wrote:
>
> CFbot[1] complained about one query result's order in the tap-test, so I am
> attaching a V7 patch set which fixed this. There are no changes in 0001.
>
> [1] https://cirrus-ci.com/task/6375962162495488

Thanks. Here are some comments:

1. Can we just remove pg_logical_replication_slot_advance and use
LogicalSlotAdvanceAndCheckSnapState instead? If worried about the
function naming, LogicalSlotAdvanceAndCheckSnapState can be renamed to
pg_logical_replication_slot_advance?

+ * Advance our logical replication slot forward. See
+ * LogicalSlotAdvanceAndCheckSnapState for details.
  */
 static XLogRecPtr
 pg_logical_replication_slot_advance(XLogRecPtr moveto)
 {

2.
+if (!ready_for_decoding)
+{
+elog(DEBUG1, "could not find consistent point for synced
slot; restart_lsn = %X/%X",
+ LSN_FORMAT_ARGS(slot->data.restart_lsn));

Can we specify the slot name in the message?

3. Also, can the "could not find consistent point for synced slot;
restart_lsn = %X/%X" be emitted at LOG level just like other messages
in update_and_persist_local_synced_slot. Although, I see "XXX should
this be changed to elog(DEBUG1) perhaps?", these messages need to be
at LOG level as they help debug issues if at all they are hit.

4. How about using found_consistent_snapshot instead of
ready_for_decoding? A general understanding is that the synced slots
are not allowed for decoding (although with this fix, we do that for
internal purposes), ready_for_decoding looks a bit misleading.

5. As far as the test case for this issue is concerned, I'm fine with
adding one using an INJECTION point because we seem to be having no
consistent way to control postgres writing current snapshot to WAL.

6. A nit: can we use "fast_forward mode" instead of "fast-forward
mode" just to be consistent?
+ * logical changes unless we are in fast-forward mode where no changes are

7.
+/*
+ * We need to access the system tables during decoding to build the
+ * logical changes unless we are in fast-forward mode where no changes are
+ * generated.
+ */
+if (slot->data.database != MyDatabaseId && !fast_forward)

May I know if we need this change for this fix?

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Extension for PostgreSQL cast jsonb to hstore WIP

2024-04-02 Thread Andrew Dunstan



On 2024-04-02 Tu 07:07, ShadowGhost wrote:

Hello all.
Recently, when working with the hstore and json formats, I came across 
the fact that PostgreSQL has a cast of hstore to json, but there is no 
reverse cast. I thought it might make it more difficult to work with 
these formats. And I decided to make a cast json in the hstore. I used 
the built-in jsonb structure to create it and may have introduced 
methods to increase efficiency by 25% than converting the form 
jsonb->text->hstore. Which of course is a good fact. I also wrote 
regression tests to check the performance. I think this extension will 
improve the work with jsonb and hstore in PostgreSQL.
If you've read this far, thank you for your interest, and I hope you 
enjoy this extension!




One reason we don't have such a cast is that hstore has a flat 
structure, while json is tree structured, and it's not always an object 
/ hash. Thus it's easy to reliably cast hstore to json but far less easy 
to cast json to hstore in the general case.


What do you propose to do in the case or json consisting of scalars, or 
arrays, or with nested elements?



cheers


andrew



--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: [HACKERS] make async slave to wait for lsn to be replayed

2024-04-02 Thread Alexander Korotkov
On Tue, Apr 2, 2024 at 2:47 PM Andy Fan  wrote:
> Bharath Rupireddy  writes:
>
> > On Tue, Apr 2, 2024 at 3:41 PM Kartyshov Ivan
> >  wrote:
> >>
> >> 8 years, we tried to add this feature, and now we suggest the best way
> >> for this feature is to commit the minimal version first.
> >
> > Just curious, do you or anyone else have an immediate use for this
> > function? If yes, how are they achieving read-after-write-consistency
> > on streaming standbys in their application right now without a
> > function like this?
>
> The link [1] may be helpful and I think the reason there is reasonable
> to me.
>
> Actually we also discussed how to make sure the "read your writes
> consistency" internally, and the solution here looks good to me.
>
> Glad to know that this patch will be committed very soon.
>
> [1]
> https://www.postgresql.org/message-id/CAPpHfdtuiL1x4APTs7u1fCmxkVp2-ZruXcdCfprDMdnOzvdC%2BA%40mail.gmail.com

Thank you for your feedback.

I also can confirm that a lot of users would be very happy to have
"read your writes consistency" and ready to do something to achieve
this at an application level.  However, they typically don't know what
exactly they need.

So, blogging about pg_wal_replay_wait() and spreading words about it
at conferences would be highly appreciated.

--
Regards,
Alexander Korotkov




Re: [PATCH] kNN for btree

2024-04-02 Thread Anton A. Melnikov

Hi, Andrey!

On 31.03.2024 12:22, Andrey M. Borodin wrote:




On 15 Jan 2024, at 13:11, Anton A. Melnikov  wrote:

If there are any ideas pro and contra would be glad to discuss them.


Hi, Anton!

This is kind of ancient thread. I've marked CF entry [0] as "Needs review" and 
you as an author (seems like you are going to be a point of correspondence on this 
feature).



That's right, I would like to bring the work on this feature to a positive
result.

First of all, let me share a simple test that allows one to estimate the effect
of applying this patch and that, I hope, can be considered a criterion for
future versions.

For all the tests below, one should set the following settings:
set enable_seqscan to false;
set enable_indexscan to true;
set enable_bitmapscan to false;
set enable_indexonlyscan to true;
set max_parallel_workers_per_gather = 0;

To carry out the test, one can use the table "events" mentioned in the first 
message of this thread, linked here [1].
psql -f events.dump
Then perform a query like that:
explain (costs off, analyze on) SELECT date FROM events ORDER BY date <-> 
'1957-10-04'::date ASC LIMIT 10;

When using the existing btree_gist extension and executing the preliminary commands:
create extension btree_gist;

CREATE INDEX event_date_idx ON events USING gist (date);


the test query gives:

postgres=# explain (costs off, analyze on) SELECT date FROM events ORDER BY date <-> '1957-10-04'::date ASC LIMIT 10;
                                            QUERY PLAN
---------------------------------------------------------------------------------------------------
 Limit (actual time=0.759..102.992 rows=10 loops=1)
   ->  Index Only Scan using event_date_idx on events (actual time=0.757..97.021 rows=10 loops=1)
         Order By: (date <-> '1957-10-04'::date)
         Heap Fetches: 0
 Planning Time: 0.504 ms
 Execution Time: 108.311 ms


Average value on my PC was 107+-1 ms.

When using the existing patch from [1] and creating a btree index:

CREATE INDEX event_date_idx ON events USING btree (date);


instead of the btree_gist one, the test query gives:

postgres=# explain (costs off, analyze on) SELECT date FROM events ORDER BY date <-> '1957-10-04'::date ASC LIMIT 10;
                                            QUERY PLAN
---------------------------------------------------------------------------------------------------
 Limit (actual time=0.120..48.817 rows=10 loops=1)
   ->  Index Only Scan using event_date_idx on events (actual time=0.117..42.610 rows=10 loops=1)
         Order By: (date <-> '1957-10-04'::date)
         Heap Fetches: 0
 Planning Time: 0.487 ms
 Execution Time: 54.463 ms


55+-1 ms on average.
The execution time is reduced by a factor of ~2, so the effect is obvious,
but the concerns about the implementation are reasonable too.


On 15.01.2024 11:11, Anton A. Melnikov wrote:

On 16.03.2020 16:17, Alexander Korotkov wrote:

After another try to polish this patch I figured out that the way it's
implemented is unnatural.  I see the two reasonable ways to implement
knn for B-tree, but current implementation matches none of them.

1) Implement knn as two simultaneous scans over B-tree: forward and
backward.  It's similar to what current patchset does.  But putting
this logic to B-tree seems artificial.  What B-tree does here is still
unidirectional scan.  On top of that we merge results of two
unidirectional scans.  The appropriate place to do this high-level
work is IndexScan node or even Optimizer/Executor (Merge node over to
IndexScan nodes), but not B-tree itself.
2) Implement arbitrary scans in B-tree using priority queue like GiST
and SP-GiST do.  That would lead to much better support for KNN.  We
can end up in supporting interesting cases like "ORDER BY col1 DESC,
col2 <> val1, col2 ASC" or something.  But that's requires way more
hacking in B-tree core.





At first i'm going to implement p.1). I think it's preferable for now
because it seems easier and faster to get a working version.



I was wrong here. Firstly, this variant turned out to be not so easy and fast,
and secondly, when I got the desired query plan, I was not happy with the
results:

In the case of btree_gist, splitting the query into two scans at the optimizer
level and adding a MergeAppend on top of them resulted in a ~13% slowdown in
query execution. The average time became ~121 ms.

 Limit (actual time=1.205..117.689 rows=10 loops=1)
   ->  Merge Append (actual time=1.202..112.260 rows=10 loops=1)
         Sort Key: ((events.date <-> '1957-10-04'::date))
         ->  Index Only Scan using event_date_idx on events (actual time=0.713..43.372 rows=42585 loops=1)
               Index Cond: (date < '1957-10-04'::date)
               Order By: (date <-> '1957-10-04'::date)
               Heap Fetches: 0
         ->  Index Only Scan using event_date_idx on events events_1 

Re: [HACKERS] make async slave to wait for lsn to be replayed

2024-04-02 Thread Alexander Korotkov
On Tue, Apr 2, 2024 at 1:11 PM Kartyshov Ivan
 wrote:
> On 2024-04-02 11:14, Andy Fan wrote:
> > If so, when users call pg_wal_replay_wait, they can get informed when
> > the wal is replaied to the target_lsn, but they can't know how long
> > time
> > it waits unless they check it in application side, I think such
> > information will be useful for monitor purpose sometimes.
> >
> > select pg_wal_replay_wait(lsn, 1000);  may just take 1ms in fact, in
> > this case, I want this function return 1.
>
> Hi Andy, to get timing we can use \timing in psql.
> Here is an example.
> postgres=# \timing
> Timing is on.
> postgres=# select 1;
>   ?column?
> --
>  1
> (1 row)
>
> Time: 0.536 ms
>
>
> >void
> And returning VOID is the best option, rather than returning TRUE|FALSE
> or timing. It left the logic of the procedure very simple, we get an
> error if LSN is not reached.
>
> 8 years, we tried to add this feature, and now we suggest the best way
> for this feature is to commit the minimal version first.
>
> Let's discuss further improvements in future versions.

+1,
It seems there is not yet a precedent of a builtin PostgreSQL function
returning its own duration.  And I don't think we need to introduce such
a precedent, at least for now.  This seems like placing the responsibility
for monitoring resource usage on the application.

I'd also like to note that pg_wal_replay_wait() comes with a dedicated
wait event.  So, one could monitor the average duration of these waits
using sampling of wait events.

--
Regards,
Alexander Korotkov




Re: Confusing #if nesting in hmac_openssl.c

2024-04-02 Thread Daniel Gustafsson
> On 2 Apr 2024, at 02:01, Tom Lane  wrote:

> hmac_openssl.c:90:1: warning: unused function 'ResourceOwnerRememberHMAC' 
> [-Wunused-function]
> hmac_openssl.c:95:1: warning: unused function 'ResourceOwnerForgetHMAC' 
> [-Wunused-function]
> 
> Looking at the code, this is all from commit e6bdfd970, and apparently
> batfish is our only animal that doesn't HAVE_HMAC_CTX_NEW.

Thanks for looking at this, it's been on my TODO for some time.  It's a warning
which only shows up when building against 1.0.2, the functions are present in
1.1.0 and onwards (while deprecated in 3.0).

> I don't think
> it is helpful to put the resource owner manipulations inside #ifdef
> HAVE_HMAC_CTX_NEW and HAVE_HMAC_CTX_FREE --- probably, it would never
> be the case that only one of those is defined,

Correct, no version of OpenSSL has only one of them defined.

> What do you think of rearranging it as attached?

+1 on this patch, it makes the #ifdef soup more readable.  We could go even
further and remove the HAVE_HMAC defines completely with USE_RESOWNER_FOR_HMAC
being set by autoconf/meson?  I've attached an untested sketch diff to
illustrate.

A related tangent.  If we assembled the data to calculate on ourselves rather
than rely on OpenSSL to do it with subsequent _update calls we could instead
use the simpler HMAC() API from OpenSSL.  That would remove the need for the
HMAC_CTX and resource owner tracking entirely and just have our pg_hmac_ctx.
That's clearly not for this patch though, just thinking out loud that we set up
OpenSSL infrastructure that we don't really use.
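
For reference, the one-shot API mentioned here looks like this minimal
standalone example (key and message are placeholders; HMAC() itself is
deprecated in OpenSSL 3.0 but still available):

#include <stdio.h>
#include <openssl/evp.h>
#include <openssl/hmac.h>

int
main(void)
{
	const unsigned char key[] = "secret-key";
	const unsigned char msg[] = "data to authenticate";
	unsigned char mac[EVP_MAX_MD_SIZE];
	unsigned int maclen = 0;

	/* one call over the fully assembled input, no HMAC_CTX needed */
	if (HMAC(EVP_sha256(), key, (int) (sizeof(key) - 1),
			 msg, sizeof(msg) - 1, mac, &maclen) == NULL)
		return 1;

	for (unsigned int i = 0; i < maclen; i++)
		printf("%02x", mac[i]);
	printf("\n");
	return 0;
}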

--
Daniel Gustafsson



hmac_context.diff
Description: Binary data


Re: [HACKERS] make async slave to wait for lsn to be replayed

2024-04-02 Thread Andy Fan


Bharath Rupireddy  writes:

> On Tue, Apr 2, 2024 at 3:41 PM Kartyshov Ivan
>  wrote:
>>
>> 8 years, we tried to add this feature, and now we suggest the best way
>> for this feature is to commit the minimal version first.
>
> Just curious, do you or anyone else have an immediate use for this
> function? If yes, how are they achieving read-after-write-consistency
> on streaming standbys in their application right now without a
> function like this?

The link [1] may be helpful and I think the reason there is reasonable
to me.

Actually we also discussed how to make sure the "read your writes
consistency" internally, and the solution here looks good to me.

Glad to know that this patch will be committed very soon. 

[1]
https://www.postgresql.org/message-id/CAPpHfdtuiL1x4APTs7u1fCmxkVp2-ZruXcdCfprDMdnOzvdC%2BA%40mail.gmail.com
 

-- 
Best Regards
Andy Fan





Re: Allow non-superuser to cancel superuser tasks.

2024-04-02 Thread Andrey M. Borodin



> On 2 Apr 2024, at 01:21, Nathan Bossart  wrote:
> 
> I haven't looked too closely, but I'm pretty skeptical that the test suite
> in your patch would be stable.  Unfortunately, I don't have any better
> ideas at the moment besides not adding a test for this new role.

We can add tests just like [0] with injection points.
I mean, replace that "sleep 1" with something like
"$node->wait_for_event('autovacuum worker', 'autovacuum-running');".
Currently we have no infrastructure to wait for autovacuum of a particular
table, but I think it's doable.
Also, I do not like that the test changes system-wide autovacuum settings;
AFAIR these settings can be set for a particular table.

Thanks!


Best regards, Andrey Borodin.

[0] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=eeefd4280f6



Extension for PostgreSQL cast jsonb to hstore WIP

2024-04-02 Thread ShadowGhost
Hello all.
Recently, when working with the hstore and json formats, I came across the
fact that PostgreSQL has a cast from hstore to json, but there is no reverse
cast, which I thought might make it harder to work with these formats.
So I decided to implement a cast from json(b) to hstore. I used the built-in
jsonb structure to create it, which may increase efficiency by about 25%
compared to converting via jsonb->text->hstore. I also wrote regression tests
to check the behavior. I think this extension will improve working with jsonb
and hstore in PostgreSQL.
If you've read this far, thank you for your interest, and I hope you enjoy
this extension!
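
Once installed, usage would look something like this (a sketch, assuming the
extension builds as posted):

    CREATE EXTENSION hstore;
    CREATE EXTENSION cast_jsonb_to_hstore;
    SELECT '{"a": "1", "b": "2"}'::jsonb::hstore;
    -- "a"=>"1", "b"=>"2"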
 Antoine
From 334bf26cbea8d3ccd1821a7ecd5bd1134d9b1641 Mon Sep 17 00:00:00 2001
From: Antoine Violin 
Date: Mon, 25 Mar 2024 17:34:23 +0700
Subject: [PATCH v1] Add cast jsonb to hstore

---
 contrib/cast_jsonb_to_hstore/Makefile |  18 +++
 .../cast_jsonb_to_hstore--1.0.sql |  16 +++
 .../cast_jsonb_to_hstore.c| 119 ++
 .../cast_jsonb_to_hstore.control  |   5 +
 .../expected/cast_jsonb_to_hstore.out |  71 +++
 .../sql/cast_jsonb_to_hstore.sql  |  27 
 6 files changed, 256 insertions(+)
 create mode 100644 contrib/cast_jsonb_to_hstore/Makefile
 create mode 100644 contrib/cast_jsonb_to_hstore/cast_jsonb_to_hstore--1.0.sql
 create mode 100644 contrib/cast_jsonb_to_hstore/cast_jsonb_to_hstore.c
 create mode 100644 contrib/cast_jsonb_to_hstore/cast_jsonb_to_hstore.control
 create mode 100644 contrib/cast_jsonb_to_hstore/expected/cast_jsonb_to_hstore.out
 create mode 100644 contrib/cast_jsonb_to_hstore/sql/cast_jsonb_to_hstore.sql

diff --git a/contrib/cast_jsonb_to_hstore/Makefile b/contrib/cast_jsonb_to_hstore/Makefile
new file mode 100644
index 00..96db73215a
--- /dev/null
+++ b/contrib/cast_jsonb_to_hstore/Makefile
@@ -0,0 +1,18 @@
+MODULES = cast_jsonb_to_hstore
+EXTENSION = cast_jsonb_to_hstore
+DATA = cast_jsonb_to_hstore--1.0.sql
+PGFILEDESC = "cast jsonb to hstore"
+REGRESS = cast_jsonb_to_hstore
+EXTRA_INSTALL = contrib/hstore
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+PG_CPPFLAGS = -I$(top_srcdir)/contrib
+subdir = contrib/cast_jsonb_to_hstore
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
\ No newline at end of file
diff --git a/contrib/cast_jsonb_to_hstore/cast_jsonb_to_hstore--1.0.sql b/contrib/cast_jsonb_to_hstore/cast_jsonb_to_hstore--1.0.sql
new file mode 100644
index 00..db31fedf48
--- /dev/null
+++ b/contrib/cast_jsonb_to_hstore/cast_jsonb_to_hstore--1.0.sql
@@ -0,0 +1,16 @@
+\echo Use "CREATE EXTENSION cast_jsonb_to_hstore" to load this file. \quit
+CREATE OR REPLACE FUNCTION jsonb_to_hstore(j0 jsonb)
+RETURNS hstore 
+AS '$libdir/cast_jsonb_to_hstore', 'jsonb_to_hstore'
+LANGUAGE C IMMUTABLE STRICT;
+
+CREATE OR REPLACE FUNCTION json_to_hstore(j0 json)
+RETURNS hstore AS 
+$BODY$
+SELECT hstore(array_agg(key), array_agg(value))
+FROM json_each_text(j0)
+$BODY$
+LANGUAGE 'sql' IMMUTABLE;
+
+CREATE CAST (jsonb AS hstore) WITH FUNCTION jsonb_to_hstore(jsonb) AS IMPLICIT;
+CREATE CAST (json AS hstore) WITH FUNCTION json_to_hstore(json) AS IMPLICIT;
\ No newline at end of file
diff --git a/contrib/cast_jsonb_to_hstore/cast_jsonb_to_hstore.c b/contrib/cast_jsonb_to_hstore/cast_jsonb_to_hstore.c
new file mode 100644
index 00..c174414896
--- /dev/null
+++ b/contrib/cast_jsonb_to_hstore/cast_jsonb_to_hstore.c
@@ -0,0 +1,119 @@
+#include "postgres.h"
+#include "hstore/hstore.h"
+#include "utils/jsonb.h"
+
+PG_MODULE_MAGIC;
+
+typedef int (*hstoreUniquePairs_t) (Pairs *a, int32 l, int32 *buflen);
+static hstoreUniquePairs_t hstoreUniquePairs_p;
+typedef HStore *(*hstorePairs_t) (Pairs *pairs, int32 pcount, int32 buflen);
+static hstorePairs_t hstorePairs_p;
+typedef size_t (*hstoreCheckKeyLen_t) (size_t len);
+static hstoreCheckKeyLen_t hstoreCheckKeyLen_p;
+typedef size_t (*hstoreCheckValLen_t) (size_t len);
+static hstoreCheckValLen_t hstoreCheckValLen_p;
+
+void
+_PG_init(void)
+{
+	AssertVariableIsOfType(&hstoreUniquePairs, hstoreUniquePairs_t);
+	hstoreUniquePairs_p = (hstoreUniquePairs_t)
+		load_external_function("$libdir/hstore", "hstoreUniquePairs",
+			   true, NULL);
+	AssertVariableIsOfType(&hstorePairs, hstorePairs_t);
+	hstorePairs_p = (hstorePairs_t)
+		load_external_function("$libdir/hstore", "hstorePairs",
+			   true, NULL);
+	AssertVariableIsOfType(&hstoreCheckKeyLen, hstoreCheckKeyLen_t);
+	hstoreCheckKeyLen_p = (hstoreCheckKeyLen_t)
+		load_external_function("$libdir/hstore", "hstoreCheckKeyLen",
+			   true, NULL);
+	AssertVariableIsOfType(&hstoreCheckValLen, hstoreCheckValLen_t);
+	hstoreCheckValLen_p = (hstoreCheckValLen_t)
+		load_external_function("$libdir/hstore", "hstoreCheckValLen",
+			   true, NULL);
+}
+
+#define hstoreUniquePairs 

Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

2024-04-02 Thread torikoshia

On 2024-04-01 11:31, Masahiko Sawada wrote:
On Fri, Mar 29, 2024 at 11:54 AM torikoshia 
 wrote:


On 2024-03-28 21:54, Masahiko Sawada wrote:
> On Thu, Mar 28, 2024 at 9:38 PM torikoshia 
> wrote:
>>
>> On 2024-03-28 10:20, Masahiko Sawada wrote:
>> > Hi,
>> >
>> > On Thu, Jan 18, 2024 at 5:33 PM Masahiko Sawada 
>> > wrote:
>> >>
>> >> On Thu, Jan 18, 2024 at 4:59 PM Alexander Korotkov
>> >>  wrote:
>> >> >
>> >> > On Thu, Jan 18, 2024 at 4:16 AM torikoshia  
wrote:
>> >> > > On 2024-01-18 10:10, jian he wrote:
>> >> > > > On Thu, Jan 18, 2024 at 8:57 AM Masahiko Sawada 

>> >> > > > wrote:
>> >> > > >> On Thu, Jan 18, 2024 at 6:38 AM Tom Lane  
wrote:
>> >> > > >> > Kyotaro-san's suggestion isn't bad, though I might shorten it to
>> >> > > >> > error_action {error|ignore|log} (or perhaps "stop" instead of 
"error")?
>> >> > > >> > You will need a separate parameter anyway to specify the 
destination
>> >> > > >> > of "log", unless "none" became an illegal table name when I 
wasn't
>> >> > > >> > looking.  I don't buy that one parameter that has some special 
values
>> >> > > >> > while other values could be names will be a good design.  
Moreover,
>> >> > > >> > what if we want to support (say) log-to-file along with 
log-to-table?
>> >> > > >> > Trying to distinguish a file name from a table name without any 
other
>> >> > > >> > context seems impossible.
>> >> > > >>
>> >> > > >> I've been thinking we can add more values to this option to log 
errors
>> >> > > >> not only to the server logs but also to the error table (not sure
>> >> > > >> details but I imagined an error table is created for each table on
>> >> > > >> error), without an additional option for the destination name. The
>> >> > > >> values would be like error_action 
{error|ignore|save-logs|save-table}.
>> >> > > >>
>> >> > > >
>> >> > > > another idea:
>> >> > > > on_error {error|ignore|other_future_option}
>> >> > > > if not specified then by default ERROR.
>> >> > > > You can also specify ERROR or IGNORE for now.
>> >> > > >
>> >> > > > I agree, the parameter "error_action" is better than "location".
>> >> > >
>> >> > > I'm not sure whether error_action or on_error is better, but either 
way
>> >> > > "error_action error" and "on_error error" seems a bit odd to me.
>> >> > > I feel "stop" is better for both cases as Tom suggested.
>> >> >
>> >> > OK.  What about this?
>> >> > on_error {stop|ignore|other_future_option}
>> >> > where other_future_option might be compound like "file 'copy.log'" or
>> >> > "table 'copy_log'".
>> >>
>> >> +1
>> >>
>> >
>> > I realized that the ON_ERROR syntax synopsis in the documentation is not
>> > correct. The option doesn't require the value to be quoted and the
>> > value can be omitted. The attached patch fixes it.
>> >
>> > Regards,
>>
>> Thanks!
>>
>> Attached patch fixes the doc, but I'm wondering perhaps it might be
>> better to modify the codes to prohibit abbreviation of the value.
>>
>> When seeing the query which abbreviates ON_ERROR value, I feel it's
>> not
>> obvious what happens compared to other options which tolerates
>> abbreviation of the value such as FREEZE or HEADER.
>>
>>COPY t1 FROM stdin WITH (ON_ERROR);
>>
>> What do you think?
>
> Indeed. Looking at options of other commands such as VACUUM and
> EXPLAIN, I can see that we can omit a boolean value, but non-boolean
> parameters require its value. The HEADER option is not a pure boolean
> parameter but we can omit the value. It seems to be for backward
> compatibility; it used to be a boolean parameter. I agree that the
> above example would confuse users.
>
> Regards,

Thanks for your comment!

Attached a patch which modifies the code to prohibit omission of its
value.

I was a little unsure about adding a regression test for this, but I
have not added one since other COPY options don't test the omission of
their values.


Probably we should change the doc as well, since the ON_ERROR value doesn't
necessarily need to be single-quoted?


Agreed.
Since this issue seems independent from the omission of the ON_ERROR
option value, I attached a separate patch.
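
For illustration, both of these spellings are accepted by the option parser
(a sketch, assuming a table t1 exists):

    COPY t1 FROM stdin WITH (ON_ERROR ignore);
    COPY t1 FROM stdin WITH (ON_ERROR 'ignore');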



The rest looks good to me.

Alexander, what do you think about this change as you're the committer
of this feature?



--
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation
From 1b4bec3c2223246ec59ffb9eb7de2f1de27315f7 Mon Sep 17 00:00:00 2001
From: Atsushi Torikoshi 
Date: Fri, 29 Mar 2024 11:36:12 +0900
Subject: [PATCH v1] Disallow ON_ERROR option without value

Currently the ON_ERROR option of COPY allows its value to be omitted,
but the syntax synopsis in the documentation requires it.

Since non-boolean parameters usually require a value and it's not
obvious what happens when the ON_ERROR value is omitted, this patch
disallows ON_ERROR without a value.
---
 src/backend/commands/copy.c | 9 +
 1 file changed, 1 insertion(+), 8 deletions(-)

diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 28cf8b040a..2719bf28b7 100644
--- 

Re: [HACKERS] make async slave to wait for lsn to be replayed

2024-04-02 Thread Bharath Rupireddy
On Tue, Apr 2, 2024 at 3:41 PM Kartyshov Ivan
 wrote:
>
> 8 years, we tried to add this feature, and now we suggest the best way
> for this feature is to commit the minimal version first.

Just curious, do you or anyone else have an immediate use for this
function? If yes, how are they achieving read-after-write-consistency
on streaming standbys in their application right now without a
function like this?

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: [HACKERS] make async slave to wait for lsn to be replayed

2024-04-02 Thread Kartyshov Ivan

On 2024-04-02 11:14, Andy Fan wrote:

If so, when users call pg_wal_replay_wait, they can get informed when
the WAL is replayed to the target_lsn, but they can't know how long it
waited unless they measure it on the application side; I think such
information will be useful for monitoring purposes sometimes.

select pg_wal_replay_wait(lsn, 1000); may in fact take just 1ms; in
this case, I want this function to return 1.


Hi Andy, to get the timing we can use \timing in psql.
Here is an example.
postgres=# \timing
Timing is on.
postgres=# select 1;
 ?column? 
----------
        1
(1 row)

Time: 0.536 ms



   void

And returning VOID is the best option, rather than returning TRUE|FALSE
or the timing. It keeps the logic of the procedure very simple: we get an
error if the LSN is not reached.

8 years, we tried to add this feature, and now we suggest the best way
for this feature is to commit the minimal version first.

Let's discuss further improvements in future versions.

--
Ivan Kartyshov
Postgres Professional: www.postgrespro.com




Re: Statistics Import and Export

2024-04-02 Thread Corey Huinker
Here's a one-liner patch for disabling the update of pg_class
relpages/reltuples/relallvisible during a binary upgrade.

This was causing pg_upgrade tests to fail in the existing stats import
work.
From 322db8c9e8796ce673f7d7b63bd64e4c9403a144 Mon Sep 17 00:00:00 2001
From: Corey Huinker 
Date: Mon, 1 Apr 2024 18:25:18 -0400
Subject: [PATCH v1] Disable updating pg_class for CREATE INDEX during binary
 upgrade.

There is no point in setting these values because the table will always
be empty.
---
 src/backend/catalog/index.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index b6a7c60e23..f83b35add5 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -2874,7 +2874,7 @@ index_update_stats(Relation rel,
 		dirty = true;
 	}
 
-	if (reltuples >= 0)
+	if ((reltuples >= 0) && (!IsBinaryUpgrade))
 	{
 		BlockNumber relpages = RelationGetNumberOfBlocks(rel);
 		BlockNumber relallvisible;
-- 
2.44.0



Re: Reports on obsolete Postgres versions

2024-04-02 Thread Magnus Hagander
On Tue, Apr 2, 2024 at 9:24 AM Daniel Gustafsson  wrote:

> > On 2 Apr 2024, at 00:56, Bruce Momjian  wrote:
>
> > I ended up writing the attached doc patch.  I found that some or our
> > text was overly-wordy, causing the impact of what we were trying to say
> > to be lessened.  We might want to go farther than this patch, but I
> > think it is an improvement.
>
> Agreed, this is a good incremental improvement over what we have.
>
> > I also moved the  text to the bottom of the section
>
> +1
>
> A few small comments:
>
> +considers performing minor upgrades to be less risky than continuing to
> +run superseded minor versions.
>
> I think "superseded minor versions" could be unnecessarily complicated for
> non-native speakers, I consider myself fairly used to reading english but
> still
> had to spend a few extra (brain)cycles parsing the meaning of it in this
> context.
>
> + We recommend that users always run the latest minor release associated
>
> Or perhaps "current minor release" which is the term we use in the table
> below
> on the same page?
>


I do like the term "current"  better. It conveys (at least a bit) that we
really consider all the older ones to be, well, obsolete. The difference
"current vs obsolete" is stronger than "latest vs not quite latest".

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: Reports on obsolete Postgres versions

2024-04-02 Thread Magnus Hagander
On Wed, Mar 13, 2024 at 7:47 PM Jeremy Schneider 
wrote:

>
> > On Mar 13, 2024, at 11:39 AM, Tom Lane  wrote:
> >
> > Jeremy Schneider  writes:
> >>> On 3/13/24 11:21 AM, Tom Lane wrote:
> >>> Agreed, we would probably add confusion not reduce it if we were to
> >>> change our longstanding nomenclature for this.
> >
> >> Before v10, the quarterly maintenance updates were unambiguously and
> >> always called patch releases
> >
> > I think that's highly revisionist history.  I've always called them
> > minor releases, and I don't recall other people using different
> > terminology.  I believe the leadoff text on
> >
> > https://www.postgresql.org/support/versioning/
> >
> > is much older than when we switched from two-part major version
> > numbers to one-part major version numbers.
>
> Huh, that wasn’t what I expected. I only started (in depth) working with
> PG around 9.6 and I definitely thought of “6” as the minor version. This is
> an interesting mailing list thread.
>

That common misunderstanding was, in fact, one of the reasons to go to
two-part version numbers instead of three: people did not realize that the
full "9.6" was the major version, and thus what was maintained and so on.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: pg_combinebackup --copy-file-range

2024-04-02 Thread Tomas Vondra
On 4/1/24 23:45, Thomas Munro wrote:
> ...
>>
>> I was very puzzled by the awful performance on ZFS. When every other fs
>> (EXT4/XFS/BTRFS) took 150-200 seconds to run pg_combinebackup, it took
>> 900-1000 seconds on ZFS, no matter what I did. I tried all the tuning
>> advice I could think of, with almost no effect.
>>
>> Ultimately I decided that it probably is the "no readahead" behavior
>> I've observed on ZFS. I assume it's because it doesn't use the page
>> cache where the regular readahead is detected etc. And there's no
>> prefetching in pg_combinebackup, so I decided to do an experiment and added
>> a trivial explicit prefetch when reconstructing the file - every time
>> we'd read data from a file, we do posix_fadvise for up to 128 blocks
>> ahead (similar to what bitmap heap scan code does). See 0002.
>>
>> And tadaaa - the duration dropped from 900-1000 seconds to only about
>> 250-300 seconds, so an improvement of a factor of 3-4x. I think this is
>> pretty massive.
> 
> Interesting.  ZFS certainly has its own prefetching heuristics with
> lots of logic and settings, but it could be that it's using
> strict-next-block detection of access pattern (ie what I called
> effective_io_readahead_window=0 in the streaming I/O thread) instead
> of a window (ie like the Linux block device level read ahead where,
> AFAIK, if you access anything in that sliding window it is triggered),
> and perhaps your test has a lot of non-contiguous but close-enough
> blocks?  (Also reminds me of the similar discussion on the BHS thread
> about distinguishing sequential access from
> mostly-sequential-but-with-lots-of-holes-like-Swiss-cheese, and the
> fine line between them.)
> 

I don't think the files have a lot of non-contiguous but close-enough
blocks (it's rather that we'd skip blocks that need to come from a later
incremental file). The backups are generated to have a certain fraction
of modified blocks.

For example, the smallest backup has 1% modified blocks, which means 99% of
blocks come from the base backup and 1% comes from the increment. And indeed,
the whole database is ~75GB and the backup is ~740MB. That means that on
average there will be a run of 99 blocks from the base backup, then 1 skipped
block (to come from the increment), and so on (99-1-99-1). So it's very
sequential, with almost no holes, and the increment is 100% sequential. And
it still does not seem to prefetch anything.
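
For reference, the explicit prefetch from 0002 mentioned above is essentially
just a posix_fadvise() hint issued before each read; a rough sketch (not the
actual patch, names made up):

    #include <fcntl.h>

    #define BLCKSZ 8192

    /* Hint the kernel about an upcoming read of a run of blocks. */
    static void
    prefetch_blocks(int fd, off_t first_block, unsigned nblocks)
    {
        /* Best effort only, so ignore any error: this is just a hint. */
        (void) posix_fadvise(fd, first_block * BLCKSZ,
                             (off_t) nblocks * BLCKSZ,
                             POSIX_FADV_WILLNEED);
    }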


> You could double-check this and related settings (for example I think
> it might disable itself automatically if you're on a VM with small RAM
> size):
> 
> https://openzfs.github.io/openzfs-docs/Performance%20and%20Tuning/Module%20Parameters.html#zfs-prefetch-disable
> 

I haven't touched that parameter at all, and it's "enabled" by default:

# cat /sys/module/zfs/parameters/zfs_prefetch_disable
0

While trying to make the built-in prefetch work I reviewed the other
parameters with the "prefetch" tag, without success. And I haven't seen
any advice on how to make it work ...

>> There's a couple more interesting ZFS details - the prefetching seems to
>> be necessary even when using copy_file_range() and don't need to read
>> the data (to calculate checksums). This is why the "manifest=off" chart
>> has the strange group of high bars at the end - the copy cases are fast
>> because prefetch happens, but if we switch to copy_file_range() there
>> are no prefetches and it gets slow.
> 
> Hmm, at a guess, it might be due to prefetching the dnode (root object
> for a file) and block pointers, ie the structure but not the data
> itself.
> 

Yeah, that's possible. But the effects are the same - it doesn't matter
what exactly is not prefetched. But perhaps we could prefetch just a
tiny part of the record, enough to prefetch the dnode+pointers, not the
whole record. Might save some space in ARC, perhaps?

>> This is a bit bizarre, especially because the manifest=on cases are
>> still fast, exactly because the pread + prefetching still happens. I'm
>> sure users would find this puzzling.
>>
>> Unfortunately, the prefetching is not beneficial for all filesystems.
>> For XFS it does not seem to make any difference, but on BTRFS it seems
>> to cause a regression.
>>
>> I think this means we may need a "--prefetch" option, that'd force
>> prefetching, probably both before pread and copy_file_range. Otherwise
>> people on ZFS are doomed and will have poor performance.
> 
> Seems reasonable if you can't fix it by tuning ZFS.  (Might also be an
> interesting research topic for a potential ZFS patch:
> prefetch_swiss_cheese_window_size.  I will not be nerd-sniped into
> reading the relevant source today, but I'll figure it out soonish...)
> 

It's entirely possible I'm just too stupid and it works just fine for
everyone else. But maybe not, and I'd say an implementation that is this
difficult to configure might as well not exist at all. The Linux read-ahead
works pretty well by default.

So I don't see how to make this work without explicit prefetch ... Of

Re: Synchronizing slots from primary to standby

2024-04-02 Thread Bertrand Drouvot
Hi,

On Tue, Apr 02, 2024 at 02:19:30PM +0530, Amit Kapila wrote:
> On Tue, Apr 2, 2024 at 1:54 PM Bertrand Drouvot
>  wrote:
> > What about adding a "wait" injection point in LogStandbySnapshot() to 
> > prevent
> > checkpointer/bgwriter to log a standby snapshot? Something among those 
> > lines:
> >
> >if (AmCheckpointerProcess() || AmBackgroundWriterProcess())
> >INJECTION_POINT("bgw-log-standby-snapshot");
> >
> > And make use of it in the test, something like:
> >
> >$node_primary->safe_psql('postgres',
> >"SELECT injection_points_attach('bgw-log-standby-snapshot', 
> > 'wait');");
> >
> 
> Sometimes we want the checkpoint to log the standby snapshot as we
> need it at a predictable time, maybe one can use
> pg_log_standby_snapshot() instead of that. Can we add an injection
> point as a separate patch/commit after a bit more discussion?

Sure, let's come back to this injection point discussion after the feature
freeze. BTW, I think it could also be useful to make use of injection point for
the test that has been added in 7f13ac8123.

I'll open a new thread for this at that time.

>. One other
> idea to make such tests predictable is to add a developer-specific GUC
> say debug_bg_log_standby_snapshot or something like that but injection
> point sounds like a better idea.

Agree.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Table AM Interface Enhancements

2024-04-02 Thread Alexander Korotkov
Hi, Jeff!

On Tue, Apr 2, 2024 at 8:19 AM Jeff Davis  wrote:
> On Sat, 2024-03-30 at 23:33 +0200, Alexander Korotkov wrote:
> > I've pushed 0001, 0002 and 0006.
>
> Sorry to jump in to this discussion so late. I had worked on something
> like the custom reloptions (0002) in the past, and there were some
> complications that don't seem to be addressed in commit c95c25f9af.
>
> * At minimum I think it needs some direction (comments, docs, tests)
> that show how it's supposed to be used.
>
> * The bytea returned by the reloptions() method is not in a trivial
> format. It's a StdRdOptions struct with string values stored after the
> end of the struct. To build the bytea internally, there's some
> infrastructure like allocateRelOptStruct() and fillRelOptions(), and
> it's not very easy to extend those to support a few custom options.
>
> * If we ever decide to add a string option to StdRdOptions, I think the
> design breaks, because the code that looks for those string values
> wouldn't know how to skip over the custom options. Perhaps we can just
> promise to never do that, but we should make it explicit somehow.
>
> * Most existing heap reloptions (other than fillfactor) are used by
> other parts of the system (like autovacuum) so should be considered
> valid for any AM. Most AMs will just want to add a handful of their own
> options on top, so it would be good to demonstrate how this should be
> done.
>
> * There are still places that are inappropriately calling
> heap_reloptions directly. For instance, in ProcessUtilitySlow(), it
> seems to assume that a toast table is a heap?

Thank you for the detailed explanation.  This piece definitely needs
more work.  I've just reverted the c95c25f9af.

I don't like the idea that every custom table AM's reloptions should
begin with StdRdOptions.  I would rather introduce a new data
structure for the table options which need to be accessed outside of
the table AM.  Then reloptions will be a black box used directly only
in the table AM, while the table AM has the freedom to decide what to
store in reloptions and how to calculate the externally-visible
options.  What do you think?

--
Regards,
Alexander Korotkov




Re: Synchronizing slots from primary to standby

2024-04-02 Thread Amit Kapila
On Tue, Apr 2, 2024 at 1:54 PM Bertrand Drouvot
 wrote:
>
> On Tue, Apr 02, 2024 at 07:20:46AM +, Zhijie Hou (Fujitsu) wrote:
> > I added one test in 040_standby_failover_slots_sync.pl in 0002 patch, which 
> > can
> > reproduce the data loss issue consistently on my machine.
>
> Thanks!
>
> > It may not reproduce
> > in some rare cases if concurrent xl_running_xacts are written by bgwriter, 
> > but
> > I think it's still valuable if it can verify the fix in most cases.
>
> What about adding a "wait" injection point in LogStandbySnapshot() to prevent
> checkpointer/bgwriter to log a standby snapshot? Something among those lines:
>
>if (AmCheckpointerProcess() || AmBackgroundWriterProcess())
>INJECTION_POINT("bgw-log-standby-snapshot");
>
> And make use of it in the test, something like:
>
>$node_primary->safe_psql('postgres',
>"SELECT injection_points_attach('bgw-log-standby-snapshot', 
> 'wait');");
>

Sometimes we want the checkpoint to log the standby snapshot as we
need it at a predictable time, maybe one can use
pg_log_standby_snapshot() instead of that. Can we add an injection
point as a separate patch/commit after a bit more discussion? I want
to discuss this in a separate thread so that later we should not get
an objection to adding an injection_point at this location. One other
idea to make such tests predictable is to add a developer-specific GUC
say debug_bg_log_standby_snapshot or something like that but injection
point sounds like a better idea.

-- 
With Regards,
Amit Kapila.




RE: Synchronizing slots from primary to standby

2024-04-02 Thread Zhijie Hou (Fujitsu)
On Tuesday, April 2, 2024 3:21 PM Zhijie Hou (Fujitsu)  
wrote:
> On Tuesday, April 2, 2024 8:35 AM Zhijie Hou (Fujitsu)
>  wrote:
> >
> > On Monday, April 1, 2024 7:30 PM Amit Kapila 
> > wrote:
> > >
> > > On Mon, Apr 1, 2024 at 6:26 AM Zhijie Hou (Fujitsu)
> > > 
> > > wrote:
> > > >
> > > > On Friday, March 29, 2024 2:50 PM Amit Kapila
> > > > 
> > > wrote:
> > > > >
> > > >
> > > > >
> > > > >
> > > > > 2.
> > > > > +extern XLogRecPtr
> > > > > +pg_logical_replication_slot_advance(XLogRecPtr
> > > moveto,
> > > > > +   bool *found_consistent_point);
> > > > > +
> > > > >
> > > > > This API looks a bit awkward as the functionality doesn't match
> > > > > the name. How about having a function with name
> > > > > LogicalSlotAdvanceAndCheckReadynessForDecoding(moveto,
> > > > > ready_for_decoding) with the same functionality as your patch
> > > > > has for
> > > > > pg_logical_replication_slot_advance() and then invoke it both
> > > > > from pg_logical_replication_slot_advance and slotsync.c. The
> > > > > function name is too big, we can think of a shorter name. Any ideas?
> > > >
> > > > How about LogicalSlotAdvanceAndCheckDecodingState() Or just
> > > > LogicalSlotAdvanceAndCheckDecoding()?
> > > >
> > >
> > > It is about snapbuild state, so how about naming the function as
> > > LogicalSlotAdvanceAndCheckSnapState()?
> >
> > It looks better to me, so changed.
> >
> > >
> > > I have made quite a few cosmetic changes in comments and code. See
> > > attached. This is atop your latest patch. Can you please review and
> > > include these changes in the next version?
> >
> > Thanks, I have reviewed and merged them.
> > Attach the V5 patch set which addressed above comments and ran pgindent.
> 
> I added one test in 040_standby_failover_slots_sync.pl in 0002 patch, which 
> can
> reproduce the data loss issue consistently on my machine. It may not
> reproduce in some rare cases if concurrent xl_running_xacts are written by
> bgwriter, but I think it's still valuable if it can verify the fix in most 
> cases. The test
> will fail if directly applied on HEAD, and will pass after applying atop of 
> 0001.

CFbot[1] complained about one query result's order in the tap-test, so I am
attaching a V7 patch set which fixed this. There are no changes in 0001.

[1] https://cirrus-ci.com/task/6375962162495488

Best Regards,
Hou zj



v7-0002-test-the-data-loss-case.patch
Description: v7-0002-test-the-data-loss-case.patch


v7-0001-advance-the-restart_lsn-of-synced-slots-using-log.patch
Description:  v7-0001-advance-the-restart_lsn-of-synced-slots-using-log.patch


Re: Streaming I/O, vectored I/O (WIP)

2024-04-02 Thread Thomas Munro
I had been planning to commit v14 this morning but got cold feet with
the BMR-based interface.  Heikki didn't like it much, and in the end,
neither did I.  I have now removed it, and it seems much better.  No
other significant changes, just parameter types and inlining details.
For example:

 * read_stream_begin_relation() now takes a Relation, like its name says
 * StartReadBuffers()'s operation takes smgr and optional rel
 * ReadBuffer_common() takes smgr and optional rel

ReadBuffer() (which calls ReadBuffer_common() which calls
StartReadBuffer() as before) now shows no regression in a tight loop
over ~1 million already-in-cache pages (something Heikki had observed
before and could only completely fix with a change that affected all
callers).  The same test using read_stream.c is still slightly slower,
~1 million already-in-cache pages going from 301ms -> 308ms, which seems
acceptable to me and could perhaps be chased down with more study of
inlining/specialisation.  As mentioned before, it doesn't seem to be
measurable once you actually do something with the pages.

In some ways BMR was better than the "fake RelationData" concept
(another attempt at wrestling with the relation vs storage duality,
that is, the online vs recovery duality).  But in other ways it was
worse: a weird inconsistent mixture of pass-by-pointer and
pass-by-value interfaces that required several code paths to handle it
being only partially initialised, which turned out to be wasted cycles
implicated in regressions, despite which it is not even very nice to
use anyway.  I'm sure it could be made to work better, but I'm not yet
sure it's really needed.  In later work for recovery I will need to
add a separate constructor read_stream_begin_smgr_something() anyway
for other reasons (multi-relation streaming, different callback) and
perhaps also a separate StartReadBuffersSmgr() if it saves measurable
cycles to strip out branches.  Maybe it was all just premature
pessimisation.

So this is the version I'm going to commit shortly, barring objections.
From 4c3ec42aabaaf0b54f8c4393bef3411fed3a054f Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Tue, 2 Apr 2024 14:40:40 +1300
Subject: [PATCH v15 1/4] Provide vectored variant of ReadBuffer().

Break ReadBuffer() up into two steps: StartReadBuffers() and
WaitReadBuffers().  This has two main advantages:

1.  Multiple consecutive blocks can be read with one system call.
2.  Advice (hints of future reads) can optionally be issued to the
kernel.

The traditional ReadBuffer() function is now implemented in terms of
those functions, to avoid duplication.

A new GUC io_combine_limit is defined, and the functions for limiting
per-backend pin counts are now made into public APIs.  Those are
provided for the benefit of callers of StartReadBuffers(), who should
respect them when deciding how many buffers to read at once.  A later
commit will add a higher level mechanism for doing that automatically
with a more practical interface.

With some more infrastructure in later work, StartReadBuffers() could
be extended to start real asynchronous I/O instead of just issuing
advice and leaving WaitReadBuffers() to do the work synchronously.

Author: Thomas Munro 
Author: Andres Freund  (some optimization tweaks)
Reviewed-by: Melanie Plageman 
Reviewed-by: Heikki Linnakangas 
Reviewed-by: Nazir Bilal Yavuz 
Reviewed-by: Dilip Kumar 
Reviewed-by: Andres Freund 
Tested-by: Tomas Vondra 
Discussion: https://postgr.es/m/CA+hUKGJkOiOCa+mag4BF+zHo7qo=o9CFheB8=g6ut5tum2g...@mail.gmail.com
---
 doc/src/sgml/config.sgml  |  14 +
 src/backend/storage/buffer/bufmgr.c   | 720 --
 src/backend/storage/buffer/localbuf.c |  14 +-
 src/backend/utils/misc/guc_tables.c   |  14 +
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 src/include/storage/bufmgr.h  |  50 ++
 src/tools/pgindent/typedefs.list  |   2 +
 7 files changed, 592 insertions(+), 223 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 0e9617bcff4..624518e0b01 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2708,6 +2708,20 @@ include_dir 'conf.d'

   
 
+  
+   io_combine_limit (integer)
+   
+io_combine_limit configuration parameter
+   
+   
+   
+
+ Controls the largest I/O size in operations that combine I/O.
+ The default is 128kB.
+
+   
+  
+
   
max_worker_processes (integer)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index f0f8d4259c5..944ee271ba4 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -19,6 +19,10 @@
  *		and pin it so that no one can destroy it while this process
  *		is using it.
  *
+ * StartReadBuffer() -- as above, with separate wait step
+ * StartReadBuffers() -- multiple block version
+ * WaitReadBuffers() 

Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-04-02 Thread Bertrand Drouvot
Hi,

On Tue, Apr 02, 2024 at 12:41:35PM +0530, Bharath Rupireddy wrote:
> On Tue, Apr 2, 2024 at 11:58 AM Bertrand Drouvot
>  wrote:
> >
> > > Or a simple solution is that the slotsync worker updates
> > > inactive_since as it does for non-synced slots, and disables
> > > timeout-based slot invalidation for synced slots.
> >
> > Yeah, I think the main question to help us decide is: do we want to 
> > invalidate
> > "inactive" synced slots locally (in addition to synchronizing the 
> > invalidation
> > from the primary)?
> 
> I think this approach looks way simpler than the other one. The other
> approach of linking inactive_since on the standby for synced slots to
> the actual LSNs (or other slot parameters) being updated or not looks
> more complicated, and might not go well with the end user.  However,
> we need to be able to say why we don't invalidate synced slots due to
> inactive timeout unlike the wal_removed invalidation that can happen
> right now on the standby for synced slots. This leads us to define
> actually what a slot being active means. Is syncing the data from the
> remote slot considered as the slot being active?
> 
> On the other hand, it may not sound great if we don't invalidate
> synced slots due to inactive timeout even though they hold resources
> such as WAL and XIDs.

Right and the "only" benefit then would be to give an idea as to when the last
sync did occur on the local slot.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: using extended statistics to improve join estimates

2024-04-02 Thread Andy Fan

> On Wed, Mar 02, 2022 at 11:38:21AM -0600, Justin Pryzby wrote:
>> Rebased over 269b532ae and muted compiler warnings.

Thank you Justin for the rebase!

Hello Tomas,

Thanks for the patch! Before I review the patch at the code level, I want
to explain my understanding of it first.

Before this patch, we already used MCV information for eqjoinsel: it
works by combining the MCVs from both sides to figure out the mcv_freq
and then treating the rest equally, but this doesn't work for MCVs in
extended statistics; this patch fills that gap. Besides that, since
extended statistics mean more than one column is involved, if one or more
columns are Const according to the RestrictInfo, we can use that
information to filter the MCVs we are interested in, which is really cool.
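
As a toy illustration (not from the patch), this is the kind of setup where
the combined extended MCVs can drive a join estimate:

    CREATE TABLE t1 (a int, b int);
    CREATE TABLE t2 (a int, b int);
    CREATE STATISTICS s1 (mcv) ON a, b FROM t1;
    CREATE STATISTICS s2 (mcv) ON a, b FROM t2;
    -- ... load correlated data ...
    ANALYZE t1, t2;
    EXPLAIN SELECT * FROM t1 JOIN t2 USING (a, b) WHERE t1.a = 1;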

I did some more testing, all inner joins so far; they all work amazingly
well and I am surprised this patch didn't draw more attention. I will test
more after I go through the code.

As for the code level, I reviewed it in a top-down manner and am almost
40% complete. Here are some findings, just FYI. For efficiency, I provide
each piece of feedback as an individual commit; after all, I want to make
sure my comments are practical, and coding plus testing is a good way to
achieve that. I tried to make each of them as small as possible so that
you can reject or accept them conveniently.

0001 is your patch; I just rebased it against the current master. 0006
is not really related to the current patch, and I think it can be committed
individually if you are OK with that.

Hope this kind of review is helpful.

-- 
Best Regards
Andy Fan

>From daa6c27bc7dd0631607f0f254cc15491633a9ccc Mon Sep 17 00:00:00 2001
From: Tomas Vondra 
Date: Mon, 13 Dec 2021 14:05:17 +0100
Subject: [PATCH v1 1/8] Estimate joins using extended statistics

Use extended statistics (MCV) to improve join estimates. In general this
is similar to how we use regular statistics - we search for extended
statistics (with MCV) covering all join clauses, and if we find such MCV
on both sides of the join, we combine those two MCVs.

Extended statistics allow a couple additional improvements - e.g. if
there are baserel conditions, we can use them to restrict the part of
the MCVs combined. This means we're building conditional probability
distribution and calculating conditional probability

P(join clauses | baserel conditions)

instead of just P(join clauses).

The patch also allows combining regular and extended MCV - we don't need
extended MCVs on both sides. This helps when one of the tables does not
have extended statistics (e.g. because there are no correlations).
---
 src/backend/optimizer/path/clausesel.c|  63 +-
 src/backend/statistics/extended_stats.c   | 805 ++
 src/backend/statistics/mcv.c  | 758 +
 .../statistics/extended_stats_internal.h  |  20 +
 src/include/statistics/statistics.h   |  12 +
 src/test/regress/expected/stats_ext.out   | 167 
 src/test/regress/sql/stats_ext.sql|  66 ++
 7 files changed, 1890 insertions(+), 1 deletion(-)

diff --git a/src/backend/optimizer/path/clausesel.c b/src/backend/optimizer/path/clausesel.c
index 0ab021c1e8..bedf76edae 100644
--- a/src/backend/optimizer/path/clausesel.c
+++ b/src/backend/optimizer/path/clausesel.c
@@ -48,6 +48,9 @@ static Selectivity clauselist_selectivity_or(PlannerInfo *root,
 			 JoinType jointype,
 			 SpecialJoinInfo *sjinfo,
 			 bool use_extended_stats);
+static inline bool treat_as_join_clause(PlannerInfo *root,
+		Node *clause, RestrictInfo *rinfo,
+		int varRelid, SpecialJoinInfo *sjinfo);
 
 /
  *		ROUTINES TO COMPUTE SELECTIVITIES
@@ -127,12 +130,53 @@ clauselist_selectivity_ext(PlannerInfo *root,
 	RangeQueryClause *rqlist = NULL;
 	ListCell   *l;
 	int			listidx;
+	bool		single_clause_optimization = true;
+
+	/*
+	 * The optimization of skipping to clause_selectivity_ext for single
+	 * clauses means we can't improve join estimates with a single join
+	 * clause but additional baserel restrictions. So we disable it when
+	 * estimating joins.
+	 *
+	 * XXX Not sure if this is the right way to do it, but more elaborate
+	 * checks would mostly negate the whole point of the optimization.
+	 * The (Var op Var) patch has the same issue.
+	 *
+	 * XXX An alternative might be making clause_selectivity_ext smarter
+	 * and make it use the join extended stats there. But that seems kinda
+	 * against the whole point of the optimization (skipping expensive
+	 * stuff) and it's making other parts more complex.
+	 *
+	 * XXX Maybe this should check if there are at least some restrictions
+	 * on some base relations, which seems important. But then again, that
+	 * seems to go against the idea of this check to be cheap. Moreover, it
+	 * won't work for OR clauses, which may have multiple parts but we still
+	 * see them 

Re: Synchronizing slots from primary to standby

2024-04-02 Thread Bertrand Drouvot
Hi,

On Tue, Apr 02, 2024 at 07:20:46AM +, Zhijie Hou (Fujitsu) wrote:
> I added one test in 040_standby_failover_slots_sync.pl in 0002 patch, which 
> can
> reproduce the data loss issue consistently on my machine.

Thanks!

> It may not reproduce
> in some rare cases if concurrent xl_running_xacts are written by bgwriter, but
> I think it's still valuable if it can verify the fix in most cases.

What about adding a "wait" injection point in LogStandbySnapshot() to prevent
checkpointer/bgwriter to log a standby snapshot? Something among those lines:

   if (AmCheckpointerProcess() || AmBackgroundWriterProcess())
   INJECTION_POINT("bgw-log-standby-snapshot");

And make use of it in the test, something like:

   $node_primary->safe_psql('postgres',
   "SELECT injection_points_attach('bgw-log-standby-snapshot', 
'wait');");

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: [HACKERS] make async slave to wait for lsn to be replayed

2024-04-02 Thread Andy Fan


Hi Alexander!

>> +
>> +pg_wal_replay_wait (
>> +  target_lsn pg_lsn,
>> +  timeout bigint 
>> DEFAULT 0)
>> +void
>> +   
>>
>> Should we return the milliseconds of waiting time?  I think this
>> information may be useful for customers if they want to know how long
>> it waited, for monitoring purposes.
>
> Please, check it more carefully.  In v17 timeout is in integer milliseconds.

I guess one of us misunderstood the other :( and I didn't check the
code very carefully.

Actually I meant the "return value" rather than the function argument. IIUC
the current return value is void, per the statement below.

>> +void

If so, when users call pg_wal_replay_wait, they can get informed when
the WAL is replayed to the target_lsn, but they can't know how long it
waited unless they measure it on the application side; I think such
information will be useful for monitoring purposes sometimes.

select pg_wal_replay_wait(lsn, 1000); may in fact take just 1ms; in
this case, I want this function to return 1.
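
Today that measurement has to happen on the client side, e.g. something like
this sketch in psql (pg_wal_replay_wait() being the function proposed in this
thread, the LSN made up):

    SELECT clock_timestamp() AS t0 \gset
    SELECT pg_wal_replay_wait('0/3000060'::pg_lsn, 1000);
    SELECT extract(epoch FROM clock_timestamp() - :'t0'::timestamptz) * 1000
           AS waited_ms;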

-- 
Best Regards
Andy Fan





Re: [HACKERS] make async slave to wait for lsn to be replayed

2024-04-02 Thread Alexander Korotkov
Hi, Andy!

On Tue, Apr 2, 2024 at 6:29 AM Andy Fan  wrote:
> >  Did you forget to attach the new patch?
> >
> > Yes, here it is.
> >
> > [4. text/x-diff; 
> > v17-0001-Implement-pg_wal_replay_wait-stored-procedure.patch]...
>
> +
> +pg_wal_replay_wait (
> +  target_lsn pg_lsn,
> +  timeout bigint 
> DEFAULT 0)
> +void
> +   
>
> Should we return the milliseconds of waiting time?  I think this
> information may be useful for customers if they want to know how long
> it waited, for monitoring purposes.

Please, check it more carefully.  In v17 timeout is in integer milliseconds.

--
Regards,
Alexander Korotkov




Re: Improve tab completion for ALTER DEFAULT PRIVILEGE and ALTER TABLE

2024-04-02 Thread Masahiko Sawada
On Mon, Apr 1, 2024 at 10:41 PM vignesh C  wrote:
>
> On Thu, 28 Mar 2024 at 13:05, Masahiko Sawada  wrote:
> >
> > Hi,
> >
> > Thank you for the patch!
> >
> > On Mon, Jul 3, 2023 at 12:12 AM vignesh C  wrote:
> > >
> > > Hi,
> > >
> > > Improved tab completion for "ALTER DEFAULT PRIVILEGE" and "ALTER TABLE":
> > > 1) GRANT, REVOKE and FOR USER keyword was not displayed in tab
> > > completion of alter default privileges like the below statement:
> > > ALTER DEFAULT PRIVILEGES GRANT INSERT ON tables TO PUBLIC;
> > > ALTER DEFAULT PRIVILEGES REVOKE INSERT ON tables FROM PUBLIC;
> > > ALTER DEFAULT PRIVILEGES FOR USER vignesh revoke INSERT ON tables FROM 
> > > dba1;
> >
> > +1
> >
> > >
> > > 2) USER was not displayed for "ALTER DEFAULT PRIVILEGES IN SCHEMA
> > > public FOR " like in below statement:
> > > ALTER DEFAULT PRIVILEGES IN SCHEMA public FOR USER dba1 GRANT INSERT
> > > ON TABLES TO PUBLIC;
> >
> > Since there is no difference between FOR USER and FOR ROLE, I'm not sure we
> > really want to support both in tab-completion.
>
> I have removed this change
>
> > >
> > > 3) "FOR GRANT OPTION" was not display for "ALTER DEFAULT PRIVILEGES
> > > REVOKE " like in below statement:
> > > alter default privileges revoke grant option for select ON tables FROM 
> > > dba1;
> >
> > +1. But the v3 patch doesn't cover the following case:
> >
> > =# alter default privileges for role masahiko revoke [tab]
> > ALL CREATE  DELETE  EXECUTE INSERT  MAINTAIN
> >  REFERENCES  SELECT  TRIGGER TRUNCATEUPDATE  USAGE
>
> Modified in the updated patch
>
> > And it doesn't cover MAINTAIN neither:
> >
> > =# alter default privileges revoke [tab]
> > ALL   DELETEGRANT OPTION FOR  REFERENCES
> >  TRIGGER   UPDATE
> > CREATEEXECUTE   INSERTSELECT
> >  TRUNCATE  USAGE
>
> Modified in the updated patch
>
> > The patch adds the completions for ALTER DEFAULT PRIVILEGES REVOKE,
> > but we handle such case in GRANT and REVOKE part:
> >
> > (around L3958)
> > /*
> >  * With ALTER DEFAULT PRIVILEGES, restrict completion to grantable
> >  * privileges (can't grant roles)
> >  */
> > if (HeadMatches("ALTER", "DEFAULT", "PRIVILEGES"))
> > COMPLETE_WITH("SELECT", "INSERT", "UPDATE",
> >   "DELETE", "TRUNCATE", "REFERENCES", "TRIGGER",
> >   "CREATE", "EXECUTE", "USAGE", "MAINTAIN", "ALL");
>
> The current patch handles the fix from here now.
>
> > Also, I think we can support WITH GRANT OPTION too. For example,
> >
> > =# alter default privileges for role masahiko grant all on tables to
> > public [tab]
>
> I have handled this in the updated patch
>
> > It's already supported in the GRANT statement.
> >
> > >
> > > 4) "DATA TYPE" was missing in "ALTER TABLE table-name ALTER COLUMN
> > > column-name SET" like in:
> > > ALTER TABLE t1 ALTER COLUMN c1 SET DATA TYPE text;
> > >
> >
> > +1. The patch looks good to me, so pushed.
>
> Thanks for committing this.
>
> The updated patch has the changes for the above comments.
>

Thank you for updating the patch.

I think it doesn't work well, as "GRANT OPTION FOR" is completed
twice. For example,

=# alter default privileges for user masahiko revoke [tab]
ALL   DELETEGRANT OPTION FOR  MAINTAIN
 SELECTTRUNCATE  USAGE
CREATEEXECUTE   INSERTREFERENCES
 TRIGGER   UPDATE
=# alter default privileges for user masahiko revoke grant option for [tab]
ALL   DELETEGRANT OPTION FOR  MAINTAIN
 SELECTTRUNCATE  USAGE
CREATEEXECUTE   INSERTREFERENCES
 TRIGGER   UPDATE

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Use streaming read API in ANALYZE

2024-04-02 Thread Nazir Bilal Yavuz
Hi,

Thanks for the review!

On Wed, 27 Mar 2024 at 23:15, Melanie Plageman
 wrote:
>
> On Tue, Mar 26, 2024 at 02:51:27PM +0300, Nazir Bilal Yavuz wrote:
> > Hi,
> >
> > On Wed, 28 Feb 2024 at 14:42, Nazir Bilal Yavuz  wrote:
> > >
> > >
> > > The new version of the streaming read API [1] is posted. I updated the
> > > streaming read API changes patch (0001), using the streaming read API
> > > in ANALYZE patch (0002) remains the same. This should make it easier
> > > to review as it can be applied on top of master
> > >
> > >
> >
> > The new version of the streaming read API is posted [1]. I rebased the
> > patch on top of master and v9 of the streaming read API.
> >
> > There is a minimal change in the 'using the streaming read API in ANALYZE
> > patch (0002)', I changed STREAMING_READ_FULL to STREAMING_READ_MAINTENANCE
> > to copy exactly the same behavior as before. Also, some benchmarking
> > results:
> >
> > I created a 22 GB table and set the size of shared buffers to 30GB, the
> > rest is default.
> >
> > ╔═══╦═╦╗
> > ║   ║  Avg Timings in ms  ║║
> > ╠═══╬══╦══╬╣
> > ║   ║  master  ║  patched ║ percentage ║
> > ╠═══╬══╬══╬╣
> > ║ Both OS cache and ║  ║  ║║
> > ║  shared buffers are clear ║ 513.9247 ║ 463.1019 ║%9.9║
> > ╠═══╬══╬══╬╣
> > ║   OS cache is loaded but  ║  ║  ║║
> > ║  shared buffers are clear ║ 423.1097 ║ 354.3277 ║%16.3   ║
> > ╠═══╬══╬══╬╣
> > ║ Shared buffers are loaded ║  ║  ║║
> > ║   ║  89.2846 ║  84.6952 ║%5.1║
> > ╚═══╩══╩══╩╝
> >
> > Any kind of feedback would be appreciated.
>
> Thanks for working on this!
>
> A general review comment: I noticed you have the old streaming read
> (pgsr) naming still in a few places (including comments) -- so I would
> just make sure and update everywhere when you rebase in Thomas' latest
> version of the read stream API.

Done.

>
> > From c7500cc1b9068ff0b704181442999cd8bed58658 Mon Sep 17 00:00:00 2001
> > From: Nazir Bilal Yavuz 
> > Date: Mon, 19 Feb 2024 14:30:47 +0300
> > Subject: [PATCH v3 2/2] Use streaming read API in ANALYZE
> >
> > --- a/src/backend/commands/analyze.c
> > +++ b/src/backend/commands/analyze.c
> > @@ -1102,6 +1102,26 @@ examine_attribute(Relation onerel, int attnum, Node 
> > *index_expr)
> >   return stats;
> >  }
> >
> > +/*
> > + * Prefetch callback function to get next block number while using
> > + * BlockSampling algorithm
> > + */
> > +static BlockNumber
> > +pg_block_sampling_streaming_read_next(StreamingRead *stream,
> > +   
> > void *user_data,
> > +   
> > void *per_buffer_data)
>
> I don't think you need the pg_ prefix

Done.

>
> > +{
> > + BlockSamplerData *bs = user_data;
> > + BlockNumber *current_block = per_buffer_data;
>
> Why can't you just do BufferGetBlockNumber() on the buffer returned from
> the read stream API instead of allocating per_buffer_data for the block
> number?

Done.

>
> > +
> > + if (BlockSampler_HasMore(bs))
> > + *current_block = BlockSampler_Next(bs);
> > + else
> > + *current_block = InvalidBlockNumber;
> > +
> > + return *current_block;
>
>
> I think we'd like to keep the read stream code in heapam-specific code.
> Instead of doing streaming_read_buffer_begin() here, you could put this
> in heap_beginscan() or initscan() guarded by
> scan->rs_base.rs_flags & SO_TYPE_ANALYZE

In the recent changes [1], heapam_scan_analyze_next_[block | tuple]
were removed from tableam. They are now called directly from
heapam-specific code. So, IMO, there is no need to do this now.

v4 is rebased on top of v14 streaming read API changes.

[1] 27bc1772fc814946918a5ac8ccb9b5c5ad0380aa

-- 
Regards,
Nazir Bilal Yavuz
Microsoft
From 725a3d875fb1d675f5d99d8602a87b5e47b765db Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Mon, 26 Feb 2024 23:48:31 +1300
Subject: [PATCH v4 1/2] v14 Streaming Read API

Part 1:

Provide vectored variant of ReadBuffer().

Break ReadBuffer() up into two steps: StartReadBuffers() and
WaitReadBuffers().  This has two advantages:

1.  Multiple consecutive blocks can be read with one system call.
2.  Advice (hints of future reads) can optionally be issued to the kernel.

The traditional ReadBuffer() function is now implemented in terms of
those functions, to avoid duplication.  For now we still only read a
block at a time so there is no change to generated system calls yet, but
later 
