Re: New Table Access Methods for Multi and Single Inserts

2024-03-01 Thread Bharath Rupireddy
On Mon, Jan 29, 2024 at 5:16 PM Bharath Rupireddy
 wrote:
>
> > Please find the attached v9 patch set.

I've had to rebase the patches due to commit 874d817; please find the
attached v11 patch set.
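
For context, the intended calling pattern looks roughly like the sketch
below. This is only an illustration pieced together from the heap-side
functions in the attached patch; heap_insert_end() and the exact flag
combination are assumptions on my part, not necessarily what the patch set
settles on.

/* Illustrative only; see tableam.h in the attached patch for the real API */
TableInsertState *istate;

istate = heap_insert_begin(rel, GetCurrentCommandId(true),
						   TABLEAM_MULTI_INSERTS |
						   TABLEAM_SKIP_MULTI_INSERTS_FLUSH |
						   TABLEAM_BULKWRITE_BUFFER_ACCESS_STRATEGY,
						   0);

for (int i = 0; i < ntuples; i++)
{
	TupleTableSlot *slot = heap_multi_insert_next_free_slot(istate);

	/* ... fill the slot with the next tuple ... */
	heap_multi_insert_v2(istate, slot); /* buffers; flushes when full */
}

heap_insert_end(istate);	/* assumed to flush any remaining buffered tuples */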

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
From c338f541e01850fa4bb423e09acce618be9e21ba Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Fri, 1 Mar 2024 14:23:07 +
Subject: [PATCH v11 1/4] New TAMs for inserts

---
 src/backend/access/heap/heapam.c | 224 +++
 src/backend/access/heap/heapam_handler.c |   9 +
 src/include/access/heapam.h  |  49 +
 src/include/access/tableam.h | 138 ++
 4 files changed, 420 insertions(+)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 707460a536..7df305380e 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -68,6 +68,7 @@
 #include "utils/datum.h"
 #include "utils/inval.h"
 #include "utils/lsyscache.h"
+#include "utils/memutils.h"
 #include "utils/relcache.h"
 #include "utils/snapmgr.h"
 #include "utils/spccache.h"
@@ -2446,6 +2447,229 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
 	pgstat_count_heap_insert(relation, ntuples);
 }
 
+/*
+ * Initialize the state required for inserting a single tuple or multiple
+ * tuples into a heap.
+ */
+TableInsertState *
+heap_insert_begin(Relation rel, CommandId cid, int am_flags, int insert_flags)
+{
+	TableInsertState *tistate;
+
+	tistate = palloc0(sizeof(TableInsertState));
+	tistate->rel = rel;
+	tistate->cid = cid;
+	tistate->am_flags = am_flags;
+	tistate->insert_flags = insert_flags;
+
+	if ((am_flags & TABLEAM_MULTI_INSERTS) != 0 ||
+		(am_flags & TABLEAM_BULKWRITE_BUFFER_ACCESS_STRATEGY) != 0)
+		tistate->am_data = palloc0(sizeof(HeapInsertState));
+
+	if ((am_flags & TABLEAM_MULTI_INSERTS) != 0)
+	{
+		HeapMultiInsertState *mistate;
+
+		mistate = palloc0(sizeof(HeapMultiInsertState));
+		mistate->slots = palloc0(sizeof(TupleTableSlot *) * HEAP_MAX_BUFFERED_SLOTS);
+
+		mistate->context = AllocSetContextCreate(CurrentMemoryContext,
+ "heap_multi_insert_v2 memory context",
+ ALLOCSET_DEFAULT_SIZES);
+
+		((HeapInsertState *) tistate->am_data)->mistate = mistate;
+	}
+
+	if ((am_flags & TABLEAM_BULKWRITE_BUFFER_ACCESS_STRATEGY) != 0)
+		((HeapInsertState *) tistate->am_data)->bistate = GetBulkInsertState();
+
+	return tistate;
+}
+
+/*
+ * Insert a single tuple into a heap.
+ */
+void
+heap_insert_v2(TableInsertState * state, TupleTableSlot *slot)
+{
+	bool		shouldFree = true;
+	HeapTuple	tuple = ExecFetchSlotHeapTuple(slot, true, &shouldFree);
+	BulkInsertState bistate = NULL;
+
+	Assert(state->am_data != NULL &&
+		   ((HeapInsertState *) state->am_data)->mistate == NULL);
+
+	/* Update tuple with table oid */
+	slot->tts_tableOid = RelationGetRelid(state->rel);
+	tuple->t_tableOid = slot->tts_tableOid;
+
+	if (state->am_data != NULL &&
+		((HeapInsertState *) state->am_data)->bistate != NULL)
+		bistate = ((HeapInsertState *) state->am_data)->bistate;
+
+	/* Perform insertion, and copy the resulting ItemPointer */
+	heap_insert(state->rel, tuple, state->cid, state->insert_flags,
+				bistate);
+	ItemPointerCopy(&tuple->t_self, &slot->tts_tid);
+
+	if (shouldFree)
+		pfree(tuple);
+}
+
+/*
+ * Create/return next free slot from multi-insert buffered slots array.
+ */
+TupleTableSlot *
+heap_multi_insert_next_free_slot(TableInsertState * state)
+{
+	TupleTableSlot *slot;
+	HeapMultiInsertState *mistate;
+
+	Assert(state->am_data != NULL &&
+		   ((HeapInsertState *) state->am_data)->mistate != NULL);
+
+	mistate = ((HeapInsertState *) state->am_data)->mistate;
+	slot = mistate->slots[mistate->cur_slots];
+
+	if (slot == NULL)
+	{
+		slot = table_slot_create(state->rel, NULL);
+		mistate->slots[mistate->cur_slots] = slot;
+	}
+	else
+		ExecClearTuple(slot);
+
+	return slot;
+}
+
+/*
+ * Store passed-in tuple into in-memory buffered slots. When full, insert
+ * multiple tuples from the buffers into heap.
+ */
+void
+heap_multi_insert_v2(TableInsertState * state, TupleTableSlot *slot)
+{
+	TupleTableSlot *dstslot;
+	HeapMultiInsertState *mistate;
+
+	Assert(state->am_data != NULL &&
+		   ((HeapInsertState *) state->am_data)->mistate != NULL);
+
+	mistate = ((HeapInsertState *) state->am_data)->mistate;
+	dstslot = mistate->slots[mistate->cur_slots];
+
+	if (dstslot == NULL)
+	{
+		dstslot = table_slot_create(state->rel, NULL);
+		mistate->slots[mistate->cur_slots] = dstslot;
+	}
+
+	/*
+	 * The caller may have obtained the slot via
+	 * heap_multi_insert_next_free_slot and filled it before passing it in,
+	 * so skip the copy in that case.
+	 */
+	if ((state->am_flags & TABLEAM_SKIP_MULTI_INSERTS_FLUSH) == 0)
+	{
+		ExecClearTuple(dstslot);
+		ExecCopySlot(dstslot, slot);
+	}
+	else
+		Assert(dstslot == slot);
+
+	mistate->cur_slots++;
+
+	/*
+	 * When passed-in slot is already materialized, memory 

Re: pread, pwrite, etc return ssize_t not int

2024-03-01 Thread Peter Eisentraut

On 01.03.24 22:23, Thomas Munro wrote:

For the overflow of the input length (size_t -> DWORD), I don't think we
actually need to do anything.  The size argument would be truncated, but
the callers would just repeat the calls with the remaining size, so in
effect they will read the data in chunks of rest + N * DWORD_MAX.  The
patch just changes this to chunks of N * 1GB + rest.


But implicit conversion size_t -> DWORD doesn't convert large numbers
to DWORD_MAX, it just cuts off the high bits, and that might leave you
with zero.  Zero has a special meaning (if we assume that kernel
doesn't reject a zero size argument outright, I dunno): if returned by
reads it indicates EOF, and if returned by writes a typical caller
would either loop forever making no progress or (in some of our code)
conjure up a fake ENOSPC.  Hence desire to impose a cap.


Right, my thinko.  Your patch is correct then.





RE: Synchronizing slots from primary to standby

2024-03-01 Thread Zhijie Hou (Fujitsu)
On Thursday, February 29, 2024 11:16 AM Zhijie Hou (Fujitsu) 
 wrote:
> 
> On Wednesday, February 28, 2024 7:36 PM Bertrand Drouvot
>  wrote:
> >
> > 4 ===
> >
> > Regarding the test, what about adding one to test the "new" behavior
> > discussed up-thread? (logical replication will wait if slot mentioned
> > in standby_slot_names is dropped and/or does not exist when the engine
> > starts?)
> 
> Will think about this.

I think we could add a test to check the warning message for dropped slot, but
since similar wait/warning functionality has been tested, I prefer to leave
this for now and can consider it again after the main patch and tests are
stabilized considering the previous experience of BF instability with this
feature.

Best Regards,
Hou zj




RE: Synchronizing slots from primary to standby

2024-03-01 Thread Zhijie Hou (Fujitsu)
On Friday, March 1, 2024 12:23 PM Peter Smith  wrote:
> 
> Here are some review comments for v102-0001.
> 
> ==
> doc/src/sgml/config.sgml
> 
> 1.
> +   
> +Lists the streaming replication standby server slot names that logical
> +WAL sender processes will wait for. Logical WAL sender processes will
> +send decoded changes to plugins only after the specified replication
> +slots confirm receiving WAL. This guarantees that logical replication
> +slots with failover enabled do not consume changes until those changes
> +are received and flushed to corresponding physical standbys. If a
> +logical replication connection is meant to switch to a physical standby
> +after the standby is promoted, the physical replication slot for the
> +standby should be listed here. Note that logical replication will not
> +proceed if the slots specified in the standby_slot_names do not exist or
> +are invalidated.
> +   
> 
> Should this also mention the effect this GUC has on those 2 SQL functions? 
> E.g.
> Commit message says:
> 
> Additionally, The SQL functions pg_logical_slot_get_changes and
> pg_replication_slot_advance are modified to wait for the replication slots
> mentioned in 'standby_slot_names' to catch up before returning.

Added.

> 
> ==
> src/backend/replication/slot.c
> 
> 2. validate_standby_slots
> 
> + else if (!ReplicationSlotCtl)
> + {
> + /*
> + * We cannot validate the replication slot if the replication slots'
> + * data has not been initialized. This is ok as we will validate the
> + * specified slot when waiting for them to catch up. See
> + * StandbySlotsHaveCaughtup for details.
> + */
> + }
> + else
> + {
> + /*
> + * If the replication slots' data have been initialized, verify if the
> + * specified slots exist and are logical slots.
> + */
> + LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
> 
> IMO that 2nd comment does not need to say "If the replication slots'
> data have been initialized," because that is implicit from the if/else.

Removed.

> 
> ~~~
> 
> 3. GetStandbySlotList
> 
> +List *
> +GetStandbySlotList(void)
> +{
> + if (RecoveryInProgress())
> + return NIL;
> + else
> + return standby_slot_names_list;
> +}
> 
> The 'else' is not needed. IMO is better without it, but YMMV.

Removed.

> 
> ~~~
> 
> 4. StandbySlotsHaveCaughtup
> 
> +/*
> + * Return true if the slots specified in standby_slot_names have caught up to
> + * the given WAL location, false otherwise.
> + *
> + * The elevel parameter determines the error level used for logging messages
> + * related to slots that do not exist, are invalidated, or are inactive.
> + */
> +bool
> +StandbySlotsHaveCaughtup(XLogRecPtr wait_for_lsn, int elevel)
> 
> /determines/specifies/

Changed.

> 
> ~
> 
> 5.
> + /*
> + * Don't need to wait for the standby to catch up if there is no value in
> + * standby_slot_names.
> + */
> + if (!standby_slot_names_list)
> + return true;
> +
> + /*
> + * If we are on a standby server, we do not need to wait for the standby to
> + * catch up since we do not support syncing slots to cascading standbys.
> + */
> + if (RecoveryInProgress())
> + return true;
> +
> + /*
> + * Return true if all the standbys have already caught up to the passed in
> + * WAL localtion.
> + */
> + if (!XLogRecPtrIsInvalid(standby_slot_oldest_flush_lsn) &&
> + standby_slot_oldest_flush_lsn >= wait_for_lsn)
> + return true;
> 
> 
> 5a.
> I felt all these comments should be worded in a consistent way like "Don't 
> need
> to wait ..."
> 
> e.g.
> 1. Don't need to wait for the standbys to catch up if there is no value in
> 'standby_slot_names'.
> 2. Don't need to wait for the standbys to catch up if we are on a standby 
> server,
> since we do not support syncing slots to cascading standbys.
> 3. Don't need to wait for the standbys to catch up if they are already beyond
> the specified WAL location.

Changed.

> 
> ~
> 
> 5b.
> typo
> /WAL localtion/WAL location/
> 

Fixed.

> ~~~
> 
> 6.
> + if (!slot)
> + {
> + /*
> + * It may happen that the slot specified in standby_slot_names GUC
> + * value is dropped, so let's skip over it.
> + */
> + ereport(elevel,
> + errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> + errmsg("replication slot \"%s\" specified in parameter %s does not exist",
> + name, "standby_slot_names"));
> + continue;
> + }
> 
> Is "is dropped" strictly the only reason? IIUC another reason is the slot 
> maybe
> just did not even exist in the first place but it was not detected before now
> because inititial validation was skipped.

Changed the comment.

> 
> ~~~
> 
> 7.
> + /*
> + * Return false if not all the standbys have caught up to the specified WAL
> + * location.
> + */
> + if (caught_up_slot_num != list_length(standby_slot_names_list))
> + return false;
> 
> Somehow it seems complicated to have a counter of the slots as you process
> then compare that counter to the list_length 

Re: Avoiding inadvertent debugging mode for pgbench

2024-03-01 Thread Euler Taveira
On Fri, Mar 1, 2024, at 8:07 PM, Tomas Vondra wrote:
> On 3/1/24 23:41, Nathan Bossart wrote:
> > 
> > I think this is a generally reasonable proposal, except I don't know
> > whether this breakage is acceptable.  AFAICT there are two fundamental
> > behavior changes folks would observe:
> > 
> > * "-d " would cease to emit the debugging output, and while
> >   enabling debug mode might've been unintentional in most cases, it might
> >   actually have been intentional in others.
> > 
> 
> I think this is the more severe of the two issues, because it's a silent
> change. Everything will seem to work, but the user won't get the debug
> info (if they actually wanted it).

Indeed. Hopefully the user will notice soon when inspecting the standard error
output.

> > * "-d" with no argument or with a following option would begin failing, and
> >   users would need to replace "-d" with "--debug".
> > 
> 
> I think this is fine.

Yeah. It will force the user to fix it immediately.

> > Neither of these seems particularly severe to me, especially for a
> > benchmarking program, but I'd be curious to hear what others think.
> > 
> 
> I agree the -d option may be confusing, but is it worth it? I don't
> know, it depends on how often people actually get confused by it, and I
> don't recall hitting this (nor hearing about others). To be honest I
> didn't even realize pgbench even has a debug switch ...

I'm one of those who habitually use -d to specify the database name. I
often include -d with pgbench and only then realize that I don't need the
debug output, because -d doesn't specify the database there.

> But I'd like to mention this is far from our only tool using "-d" to
> enable debug mode. A quick git-grep shows postgres, initdb,
> pg_archivecleanup and pg_combinebackup do the same thing. So maybe it's
> not that inconsistent overall.

As Greg said, none of these programs connects to a database.

I don't like breaking backward compatibility, but in this case I suspect it
is ok. I don't recall the last time I saw a script that makes use of the -d
option. How often do you need pgbench's debug output?


--
Euler Taveira
EDB   https://www.enterprisedb.com/


Re: Showing applied extended statistics in explain Part 2

2024-03-01 Thread Tomas Vondra
On 3/1/24 01:19, Tatsuro Yamada wrote:
> Hi,
> 
> This original patch made by Tomas improves the usability of extended 
> statistics, 
> so I rebased it on 362de947, and I'd like to re-start developing it.
>  
> The previous thread [1] raised some issues to solve. I'll address them as 
> best I can, but I think this feature is worth it even with some limitations.
> Please find the attached file.
> 

Thank you for the interest in moving this patch forward. And I agree
it's worth cutting some of the stuff if that's necessary to make it work.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Avoiding inadvertent debugging mode for pgbench

2024-03-01 Thread Tomas Vondra
On 3/1/24 23:41, Nathan Bossart wrote:
> On Thu, Feb 29, 2024 at 07:05:13PM -0500, Greg Sabino Mullane wrote:
>> Attached please find a patch to adjust the behavior of the pgbench program
>> and make it behave like the other programs that connect to a database
>> (namely, psql and pg_dump). Specifically, add support for using -d and
>> --dbname to specify the name of the database. This means that -d can no
>> longer be used to turn on debugging mode, and the long option --debug must
>> be used instead.
>>
>> This removes a long-standing footgun, in which people assume that the -d
>> option behaves the same as other programs. Indeed, because it takes no
>> arguments, and because the first non-option argument is the database name,
>> it still appears to work. However, people then wonder why pgbench is so
>> darn verbose all the time! :)
>>
>> This is a breaking change, but fixing it this way seems to have the least
>> total impact, as the number of people using the debug mode of pgbench is
>> likely quite small. Further, those already using the long option are
>> unaffected, and those using the short one simply need to replace '-d' with
>> '--debug', arguably making their scripts a little more self-documenting in
>> the process.
> 
> I think this is a generally reasonable proposal, except I don't know
> whether this breakage is acceptable.  AFAICT there are two fundamental
> behavior changes folks would observe:
> 
> * "-d " would cease to emit the debugging output, and while
>   enabling debug mode might've been unintentional in most cases, it might
>   actually have been intentional in others.
> 

I think this is the more severe of the two issues, because it's a silent
change. Everything will seem to work, but the user won't get the debug
info (if they actually wanted it).

> * "-d" with no argument or with a following option would begin failing, and
>   users would need to replace "-d" with "--debug".
> 

I think this is fine.

> Neither of these seems particularly severe to me, especially for a
> benchmarking program, but I'd be curious to hear what others think.
> 

I agree the -d option may be confusing, but is it worth it? I don't
know, it depends on how often people actually get confused by it, and I
don't recall hitting this (nor hearing about others). To be honest I
didn't even realize pgbench even has a debug switch ...

But I'd like to mention this is far from our only tool using "-d" to
enable debug mode. A quick git-grep shows postgres, initdb,
pg_archivecleanup and pg_combinebackup do the same thing. So maybe it's
not that inconsistent overall.

(Note: I didn't check if the other cases may lead to the same confusion
where people enable debug accidentally. Maybe not.)


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Parallel Bitmap Heap Scan reports per-worker stats in EXPLAIN ANALYZE

2024-03-01 Thread Melanie Plageman
On Sat, Feb 17, 2024 at 5:31 PM Tomas Vondra
 wrote:
>
> Hi David,
>
> Do you plan to continue working on this patch? I did take a look,
> and on the whole it looks reasonable - it modifies the right places etc.

I haven't started reviewing this patch yet, but I just ran into the
behavior that it fixes and was very outraged. +1 to fixing this.

- Melanie




Re: Avoiding inadvertent debugging mode for pgbench

2024-03-01 Thread Nathan Bossart
On Thu, Feb 29, 2024 at 07:05:13PM -0500, Greg Sabino Mullane wrote:
> Attached please find a patch to adjust the behavior of the pgbench program
> and make it behave like the other programs that connect to a database
> (namely, psql and pg_dump). Specifically, add support for using -d and
> --dbname to specify the name of the database. This means that -d can no
> longer be used to turn on debugging mode, and the long option --debug must
> be used instead.
> 
> This removes a long-standing footgun, in which people assume that the -d
> option behaves the same as other programs. Indeed, because it takes no
> arguments, and because the first non-option argument is the database name,
> it still appears to work. However, people then wonder why pgbench is so
> darn verbose all the time! :)
> 
> This is a breaking change, but fixing it this way seems to have the least
> total impact, as the number of people using the debug mode of pgbench is
> likely quite small. Further, those already using the long option are
> unaffected, and those using the short one simply need to replace '-d' with
> '--debug', arguably making their scripts a little more self-documenting in
> the process.

I think this is a generally reasonable proposal, except I don't know
whether this breakage is acceptable.  AFAICT there are two fundamental
behavior changes folks would observe:

* "-d " would cease to emit the debugging output, and while
  enabling debug mode might've been unintentional in most cases, it might
  actually have been intentional in others.

* "-d" with no argument or with a following option would begin failing, and
  users would need to replace "-d" with "--debug".

Neither of these seems particularly severe to me, especially for a
benchmarking program, but I'd be curious to hear what others think.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-01 Thread Nathan Bossart
On Wed, Feb 21, 2024 at 08:10:00PM +0530, Bharath Rupireddy wrote:
> I'm thinking the other way around - how about we revert
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=007693f2a3ac2ac19affcb03ad43cdb36ccff5b5,
> that is, put in place "conflict" as a boolean and introduce
> invalidation_reason the text form. So, for logical slots, whenever the
> "conflict" column is true, the reason is found in invaldiation_reason
> column? How does it sound? Again the debate might be "conflict" vs
> "invalidation", but that looks clean IMHO.

Would you ever see "conflict" as false and "invalidation_reason" as
non-null for a logical slot?

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Experiments with Postgres and SSL

2024-03-01 Thread Jacob Champion
On Wed, Feb 28, 2024 at 4:10 AM Heikki Linnakangas  wrote:
> I think we'd want to *avoid* changing the major protocol version in a
> way that would introduce a new roundtrip, though.

I'm starting to get up to speed with this patchset. So far I'm mostly
testing how it works; I have yet to take an in-depth look at the
implementation.

I'll squint more closely at the MITM-protection changes in 0008 later.
First impressions, though: it looks like that code has gotten much
less straightforward, which I think is dangerous given the attack it's
preventing. (Off-topic: I'm skeptical of future 0-RTT support. Our
protocol doesn't seem particularly replay-safe to me.)

If we're interested in ALPN negotiation in the future, we may also
want to look at GREASE [1] to keep those options open in the presence
of third-party implementations. Unfortunately OpenSSL doesn't do this
automatically yet.

If we don't have a reason not to, it'd be good to follow the strictest
recommendations from [2] to avoid cross-protocol attacks. (For anyone
currently running web servers and Postgres on the same host, they
really don't want browsers "talking" to their Postgres servers.) That
would mean checking the negotiated ALPN on both the server and client
side, and failing if it's not what we expect.
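
As a concrete illustration of the client-side half of that check, something
like the sketch below could run right after the handshake (illustrative
only -- the "postgresql" protocol identifier and the placement are
assumptions, not what the patchset currently does):

#include <openssl/ssl.h>
#include <string.h>

/* Fail the connection unless the server selected the ALPN protocol we
 * offered; returns 1 if the negotiated protocol matches. */
static int
alpn_matches_expectation(SSL *ssl)
{
	const unsigned char *proto;
	unsigned int protolen;

	SSL_get0_alpn_selected(ssl, &proto, &protolen);
	return proto != NULL &&
		protolen == strlen("postgresql") &&
		memcmp(proto, "postgresql", protolen) == 0;
}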

I'm not excited about the proliferation of connection options. I don't
have a lot of ideas on how to fix it, though, other than to note that
the current sslnegotiation option names are very unintuitive to me:
- "postgres": only legacy handshakes
- "direct": might be direct... or maybe legacy
- "requiredirect": only direct handshakes... unless other options are
enabled and then we fall back again to legacy? How many people willing
to break TLS compatibility with old servers via "requiredirect" are
going to be okay with lazy fallback to GSS or otherwise?

Heikki mentioned possibly hard-coding a TLS alert if direct SSL is
attempted without server TLS support. I think that's a cool idea, but
without an official "TLS not supported" alert code (which, honestly,
would be strange to standardize) I'm kinda -0.5 on it. If the client
tells me about a handshake_failure or similar, I'm going to start
investigating protocol versions and ciphersuites; I'm not going to
think to myself that maybe the server lacks TLS support altogether.
(Plus, we need to have a good error message when connecting to older
servers anyway. I think we should be able to key off of the EOF coming
back from OpenSSL; it'd be a good excuse to give that part of the code
some love.)

For the record, I'm adding some one-off tests for this feature to a
local copy of my OAuth pytest suite, which is designed to do the kinds
of testing you're running into trouble with. It's not in any way
viable for a PG17 commit, but if you're interested I can make the
patches available.

--Jacob

[1] https://www.rfc-editor.org/rfc/rfc8701.html
[2] https://alpaca-attack.com/libs.html




Re: Popcount optimization using AVX512

2024-03-01 Thread Nathan Bossart
Thanks for the new version of the patch.  I didn't see a commitfest entry
for this one, and unfortunately I think it's too late to add it for the
March commitfest.  I would encourage you to add it to July's commitfest [0]
so that we can get some routine cfbot coverage.

On Tue, Feb 27, 2024 at 08:46:06PM +, Amonson, Paul D wrote:
> After consulting some Intel internal experts on MSVC the linking issue as
> it stood was not resolved. Instead, I created a MSVC ONLY work-around.
> This adds one extra functional call on the Windows builds (The linker
> resolves a real function just fine but not a function pointer of the same
> name). This extra latency does not exist on any of the other platforms. I
> also believe I addressed all issues raised in the previous reviews. The
> new pg_popcnt_x86_64_accel.c file is now the ONLY file compiled with the
> AVX512 compiler flags. I added support for the MSVC compiler flag as
> well. Both meson and autoconf are updated with the new refactor.
> 
> I am attaching the new patch.

I think this patch might be missing the new files.

-#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + ((off_t) 1 << 62))
+#define LARGE_OFF_T ((((off_t) 1 << 31) << 31) - 1 + (((off_t) 1 << 31) << 31))

IME this means that the autoconf you are using has been patched.  A quick
search on the mailing lists seems to indicate that it might be specific to
Debian [1].

-static int pg_popcount32_slow(uint32 word);
-static int pg_popcount64_slow(uint64 word);
+int		pg_popcount32_slow(uint32 word);
+int		pg_popcount64_slow(uint64 word);
+uint64	pg_popcount_slow(const char *buf, int bytes);

This patch appears to do a lot of refactoring.  Would it be possible to
break out the refactoring parts into a prerequisite patch that could be
reviewed and committed independently from the AVX512 stuff?

-#if SIZEOF_VOID_P >= 8
+#if SIZEOF_VOID_P == 8
/* Process in 64-bit chunks if the buffer is aligned. */
-   if (buf == (const char *) TYPEALIGN(8, buf))
+   if (buf == (const char *)TYPEALIGN(8, buf))
{
-   const uint64 *words = (const uint64 *) buf;
+   const uint64 *words = (const uint64 *)buf;
 
while (bytes >= 8)
{
@@ -309,9 +213,9 @@ pg_popcount(const char *buf, int bytes)
bytes -= 8;
}
 
-   buf = (const char *) words;
+   buf = (const char *)words;
}
-#else
+#elif SIZEOF_VOID_P == 4
/* Process in 32-bit chunks if the buffer is aligned. */
if (buf == (const char *) TYPEALIGN(4, buf))
{

Most, if not all, of these changes seem extraneous.  Do we actually need to
more strictly check SIZEOF_VOID_P?

[0] https://commitfest.postgresql.org/48/
[1] https://postgr.es/m/20230211020042.uthdgj72kp3xlqam%40awork3.anarazel.de

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: pread, pwrite, etc return ssize_t not int

2024-03-01 Thread Thomas Munro
On Sat, Mar 2, 2024 at 3:12 AM Peter Eisentraut  wrote:
> 0001-Return-ssize_t-in-fd.c-I-O-functions.patch
>
> This patch looks correct to me.

Thanks, I'll push this one.

> 0002-Fix-theoretical-overflow-in-Windows-pg_pread-pg_pwri.patch
>
> I have two comments on that:
>
> For the overflow of the input length (size_t -> DWORD), I don't think we
> actually need to do anything.  The size argument would be truncated, but
> the callers would just repeat the calls with the remaining size, so in
> effect they will read the data in chunks of rest + N * DWORD_MAX.  The
> patch just changes this to chunks of N * 1GB + rest.

But implicit conversion size_t -> DWORD doesn't convert large numbers
to DWORD_MAX, it just cuts off the high bits, and that might leave you
with zero.  Zero has a special meaning (if we assume that kernel
doesn't reject a zero size argument outright, I dunno): if returned by
reads it indicates EOF, and if returned by writes a typical caller
would either loop forever making no progress or (in some of our code)
conjure up a fake ENOSPC.  Hence desire to impose a cap.
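
To make the failure mode concrete, here is a minimal sketch of the kind of
cap being discussed (illustrative only, not the patch itself; the 1GB
constant mirrors the "chunks of N * 1GB + rest" behavior mentioned above):

#include <stddef.h>

/* 1GB: comfortably below the 4GB DWORD limit, and never truncates to zero */
#define MAX_WIN32_IO_CHUNK ((size_t) 1024 * 1024 * 1024)

static size_t
cap_io_size(size_t size)
{
	/*
	 * An implicit size_t -> DWORD conversion keeps only the low 32 bits, so
	 * e.g. a request of exactly 4GB would become zero -- and a zero return
	 * from a read is taken as EOF by callers.  Capping keeps the request
	 * nonzero; callers already loop to handle the remainder.
	 */
	return (size > MAX_WIN32_IO_CHUNK) ? MAX_WIN32_IO_CHUNK : size;
}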

I'm on the fence about whether it's worth wasting any more energy on
this, I mean we aren't really going to read/write 4GB, so I'd be OK if
we just left this as an observation in the archives...

> The other issue, the possible overflow of size_t -> ssize_t is not
> specific to Windows.  We could install some protection against that on
> some other layer, but it's unclear how widespread that issue is or what
> the appropriate fix is.  POSIX says that passing in a size larger than
> SSIZE_MAX has implementation-defined effect.  The FreeBSD man page says
> that this will result in an EINVAL error.  So if we here truncate
> instead of error, we'd introduce a divergence.

Yeah, right, that's the caller's job to worry about on all platforms
so I was wrong to mention ssize_t in the comment.




Re: ALTER TABLE SET ACCESS METHOD on partitioned tables

2024-03-01 Thread Justin Pryzby
On Fri, Mar 01, 2024 at 10:56:50AM +0900, Michael Paquier wrote:
> On Thu, Feb 29, 2024 at 08:51:31AM -0600, Justin Pryzby wrote:
> > On Wed, Feb 28, 2024 at 05:08:49PM +0900, Michael Paquier wrote:
> >> I have implemented that so as we keep the default, historical
> >> behavior: if pg_class.relam is 0 for a partitioned table, use the AM
> >> defined by default_table_access_method.  The patch only adds a path to
> >> switch to a different AM than the GUC when creating a new partition if
> >> and only if a partitioned table has been manipulated with ALTER TABLE
> >> SET ACCESS METHOD to update its AM to something else than the GUC.
> >> Similarly to tablespaces, CREATE TABLE USING is *not* supported for
> >> partitioned tables, same behavior as previously.
> > 
> > This patch allows resetting relam=0 by running ALTER TABLE SET AM to the
> > same value as the GUC.  Maybe it'd be better to have an explicit SET
> > DEFAULT (as in b9424d01 and 4f622503).
> 
> Outside the scope of this patch's thread, this looks like a good idea
> even for tables/matviews.  And the semantics are pretty easy: if DEFAULT
> is specified, just set the access method to NULL in the parser and let
> tablecmds.c do the AM OID lookup in the prep phase if set to NULL.
> See 0001 attached.  This one looks pretty good taken as an independent
> piece.
> 
> When it comes to partitioned tables, there is a still a tricky case:
> what should we do when a user specifies a non-default value in the SET
> ACCESS METHOD clause and it matches default_table_access_method?

I don't think it's tricky - it seems more like a weird hack in the
previous patch version to make AMs behave like tablespaces, despite not
being completely parallel, due to the absence of a pg_default AM.

With the new 001, the hack can go away, and so it should.

> Should the relam be 0 or should we force relam to be the OID of the
> given value given by the query?

You said "force" it to be the user-specified value, but I think that's
not "forcing", it's respecting (but to take the user's desired value,
and conditionally store 0 instead, that could be described as
"forcing.")

> Implementation-wise, forcing the value to 0 is simpler, but I can get
> why it could be confusing as well, because the state of the catalogs
> does not reflect what was provided in the query.

> At the same time, the user has explicitly set the access method to be
> the same as the default, so perhaps 0 makes sense anyway in this case.

I think if the user sets something "explicitly", the catalog should
reflect what they set.  Tablespaces have dattablespace, but AMs don't --
it's a simpler case.

For 001: we don't *need* to support "ALTER SET AM default" for leaf
tables.  It doesn't do anything that's not already possible.  But, if
AMs for partitioned tables are optional rather than required, then it seems
to be needed to allow (re)setting relam=0.

But for partitioned tables, I think it should set relam=0 directly.
Currently it 1) falls through to default_table_am; and 2) detects that
it's the default, so then sets relam to 0.  Since InvalidOid 

On Fri, Mar 01, 2024 at 02:03:48PM +0900, Michael Paquier wrote:
> Fun topic, especially once coupled with the internals of tablecmds.c
> that uses InvalidOid for the new access AM as a special value to work
> as a no-op.

Since InvalidOid is already taken, I guess you might need to introduce a
boolean flag, like set_relam, indicating that the statement has an
ACCESS METHOD clause.

> + * method defined so as their children can inherit it; however, this is 
> handled

so that

> +  * Do nothing: access methods is a setting that partitions can

method (singular), or s/is/are/

On Wed, Feb 28, 2024 at 05:08:49PM +0900, Michael Paquier wrote:
> Similarly to tablespaces, CREATE TABLE USING is *not* supported for
> partitioned tables, same behavior as previously.

Maybe I misunderstood what you're trying to say, but CREATE..TABLESPACE
*is* supported for partitioned tables.  I'm not sure why it wouldn't be
supported to set the AM, too.

In any case, it'd be a bit confusing for the error message to still say:

postgres=# CREATE TABLE a(i int) PARTITION BY RANGE(a) USING heap2;
ERROR:  specifying a table access method is not supported on a partitioned table

-- 
Justin




Re: speed up a logical replica setup

2024-03-01 Thread Euler Taveira
On Thu, Feb 22, 2024, at 12:45 PM, Hayato Kuroda (Fujitsu) wrote:
> Based on the idea from Euler, I roughly implemented it. Thoughts?
> 
> 0001-0013 were not changed from the previous version.
> 
> V24-0014: addressed your comment in the replied e-mail.
> V24-0015: Add disconnect_database() again, per [3]
> V24-0016: addressed your comment in [4].
> V24-0017: addressed your comment in [5].
> V24-0018: addressed your comment in [6].

Thanks for your review. I'm attaching v25 that hopefully addresses all pending
points.

Regarding your comments [1] on v21, I included changes for almost all items
except 2, 20, 23, 24, and 25. I still think item 2 is not required because
pg_ctl will provide a suitable message. I decided not to rearrange the block of
SQL commands (item 20), mainly because the current arrangement avoids creating
these objects on node_f. Do we really need command_checks_all? Depending on the
output, it uses more cycles than command_ok.

In summary:

v24-0002: documentation is updated. I didn't apply this patch as-is. Instead, I
checked what you wrote and fixed some gaps in what I had written.
v24-0003: as I said, I don't think we need to add it; however, I won't fight
against it if people want to add this check.
v24-0004: I spent some time on it. This patch was not complete. I cleaned it up
and included the start_standby_server code. It starts the server using the
specified socket directory, port and username, preventing external client
connections during the execution.
v24-0005: partially applied
v24-0006: applied with cosmetic change
v24-0007: applied with cosmetic change
v24-0008: applied
v24-0009: applied with cosmetic change
v24-0010: not applied. Based on #15, I refactored this code a bit. pg_fatal is
not used when there is an open database connection; instead, pg_log_error() is
followed by disconnect_database(). For cases where it should exit immediately,
disconnect_database() has a new parameter (exit_on_error) that controls whether
it needs to call exit(1). I went ahead and did the same for connect_database().
v24-0011: partially applied. I included some of the suggestions (18, 19, and 
21).
v24-0012: not applied. On reflection, after working on v24-0004, the target
server will start with new parameters (that only accept local connections);
hence, during the execution it is no longer possible to detect whether the
target server is a primary for another server. I added a sentence about it in
the documentation (Warning section).
v24-0013: good catch. Applied.
v24-0014: partially applied. After some experiments I decided to use a small
number of attempts. The previous code didn't reset the counter when the
connection was reestablished. I included the documentation suggestion. I didn't
include the IF EXISTS in the DROP PUBLICATION because it doesn't solve the
issue. Instead, I refactored drop_publication() to not try again if the DROP
PUBLICATION failed.
v24-0015: not applied. I refactored the exit code to do the right thing: print
error message, disconnect database (if applicable) and exit.
v24-0016: not applied. But I checked whether the information was present in the
documentation; it is.
v24-0017: good catch. Applied.
v24-0018: not applied.

I removed almost all boolean returns and included the error logic inside the
functions; it reads better. I also changed the connect/disconnect_database
functions to include the error logic, with a new on_error_exit parameter for
it. I removed the action parameter from pg_ctl_status() -- I think someone
suggested it -- and moved the error message outside the function. I improved
the cleanup routine: it provides useful information if it cannot remove the
object (publication or replication slot) from the primary.

Since I applied v24-0004, I realized that extra service starts / stops are
required. It means pg_createsubscriber doesn't start the transformation with
the current standby settings. Instead, it stops the standby if it is running
and starts it with the provided command-line options (socket, port,
listen_addresses). This has a few drawbacks:
* See v24-0012. It cannot detect if the target server is a primary for another
  server. It is documented.
* I also removed the check that the standby is running. If the standby was
  stopped a long time ago, it will take some time to reach the start point.
* Dry run mode has to start / stop the service to work correctly. Is it an
  issue?

Although I included the --retain option, I'm now thinking about removing it.
If logging is enabled, the information produced during the pg_createsubscriber
run will be available, and the client log can be redirected to a file for
future inspection.

Comments?


[1] 
https://www.postgresql.org/message-id/TYCPR01MB12077756323B79042F29DDAEDF54C2%40TYCPR01MB12077.jpnprd01.prod.outlook.com


--
Euler Taveira
EDB   https://www.enterprisedb.com/
From 41f222b68ab4e365be0123075d54d5da8468bcd2 Mon Sep 17 00:00:00 2001
From: Euler Taveira 
Date: Mon, 5 Jun 2023 14:39:40 -0400
Subject: [PATCH v25] pg_createsubscriber: creates a 

Re: BitmapHeapScan streaming read user and prelim refactoring

2024-03-01 Thread Melanie Plageman
On Thu, Feb 29, 2024 at 7:29 PM Melanie Plageman
 wrote:
>
> On Thu, Feb 29, 2024 at 5:44 PM Tomas Vondra
>  wrote:
> >
> >
> >
> > On 2/29/24 22:19, Melanie Plageman wrote:
> > > On Thu, Feb 29, 2024 at 7:54 AM Tomas Vondra
> > >  wrote:
> > >>
> > >>
> > >>
> > >> On 2/29/24 00:40, Melanie Plageman wrote:
> > >>> On Wed, Feb 28, 2024 at 6:17 PM Tomas Vondra
> > >>>  wrote:
> > 
> > 
> > 
> >  On 2/28/24 21:06, Melanie Plageman wrote:
> > > On Wed, Feb 28, 2024 at 2:23 PM Tomas Vondra
> > >  wrote:
> > >>
> > >> On 2/28/24 15:56, Tomas Vondra wrote:
> >  ...
> > >>>
> > >>> Sure, I can do that. It'll take a couple hours to get the results, 
> > >>> I'll
> > >>> share them when I have them.
> > >>>
> > >>
> > >> Here are the results with only patches 0001 - 0012 applied (i.e. 
> > >> without
> > >> the patch introducing the streaming read API, and the patch switching
> > >> the bitmap heap scan to use it).
> > >>
> > >> The changes in performance don't disappear entirely, but the scale is
> > >> certainly much smaller - both in the complete results for all runs, 
> > >> and
> > >> for the "optimal" runs that would actually pick bitmapscan.
> > >
> > > Hmm. I'm trying to think how my refactor could have had this impact.
> > > It seems like all the most notable regressions are with 4 parallel
> > > workers. What do the numeric column labels mean across the top
> > > (2,4,8,16...) -- are they related to "matches"? And if so, what does
> > > that mean?
> > >
> > 
> >  That's the number of distinct values matched by the query, which should
> >  be an approximation of the number of matching rows. The number of
> >  distinct values in the data set differs by data set, but for 1M rows
> >  it's roughly like this:
> > 
> >  uniform: 10k
> >  linear: 10k
> >  cyclic: 100
> > 
> >  So for example matches=128 means ~1% of rows for uniform/linear, and
> >  100% for cyclic data sets.
> > >>>
> > >>> Ah, thank you for the explanation. I also looked at your script after
> > >>> having sent this email and saw that it is clear in your script what
> > >>> "matches" is.
> > >>>
> >  As for the possible cause, I think it's clear most of the difference
> >  comes from the last patch that actually switches bitmap heap scan to 
> >  the
> >  streaming read API. That's mostly expected/understandable, although we
> >  probably need to look into the regressions or cases with e_i_c=0.
> > >>>
> > >>> Right, I'm mostly surprised about the regressions for patches 0001-0012.
> > >>>
> > >>> Re eic 0: Thomas Munro and I chatted off-list, and you bring up a
> > >>> great point about eic 0. In old bitmapheapscan code eic 0 basically
> > >>> disabled prefetching but with the streaming read API, it will still
> > >>> issue fadvises when eic is 0. That is an easy one line fix. Thomas
> > >>> prefers to fix it by always avoiding an fadvise for the last buffer in
> > >>> a range before issuing a read (since we are about to read it anyway,
> > >>> best not fadvise it too). This will fix eic 0 and also cut one system
> > >>> call from each invocation of the streaming read machinery.
> > >>>
> >  To analyze the 0001-0012 patches, maybe it'd be helpful to run tests 
> >  for
> >  individual patches. I can try doing that tomorrow. It'll have to be a
> >  limited set of tests, to reduce the time, but might tell us whether 
> >  it's
> >  due to a single patch or multiple patches.
> > >>>
> > >>> Yes, tomorrow I planned to start trying to repro some of the "red"
> > >>> cases myself. Any one of the commits could cause a slight regression
> > >>> but a 3.5x regression is quite surprising, so I might focus on trying
> > >>> to repro that locally and then narrow down which patch causes it.
> > >>>
> > >>> For the non-cached regressions, perhaps the commit to use the correct
> > >>> recheck flag (0004) when prefetching could be the culprit. And for the
> > >>> cached regressions, my money is on the commit which changes the whole
> > >>> control flow of BitmapHeapNext() and the next_block() and next_tuple()
> > >>> functions (0010).
> > >>>
> > >>
> > >> I do have some partial results, comparing the patches. I only ran one of
> > >> the more affected workloads (cyclic) on the xeon, attached is a PDF
> > >> comparing master and the 0001-0014 patches. The percentages are timing
> > >> vs. the preceding patch (green - faster, red - slower).
> > >
> > > Just confirming: the results are for uncached?
> > >
> >
> > Yes, cyclic data set, uncached case. I picked this because it seemed
> > like one of the most affected cases. Do you want me to test some other
> > cases too?
>
> So, I actually may have found the source of at least part of the
> regression with 0010. I was able to reproduce the regression with
> patch 0010 applied for the uncached 

Re: [Patch] add multiple client certificate selection feature

2024-03-01 Thread Cary Huang
Hello

I would like to share a version 2 patch for the multiple client certificate
selection feature, with several enhancements over v1. I removed the extra
parameters "sslcertdir" and "sslkeydir". Instead, I reuse the existing sslcert,
sslkey and sslpassword parameters but allow multiple entries to be supplied,
separated by commas. This way, we are able to use a different sslpassword to
decrypt different sslkey files based on the selected certificate. This was not
possible in v1.

When a client performs a TLS handshake with a server that requires a client
certificate, the client obtains a list of trusted CA names from the server and
tries to match it against the list of certificates provided via the sslcert
option. A client certificate is chosen if its issuer matches one of the
server's trusted CA names. Once a certificate is chosen, the corresponding
private key and sslpassword (if required) will be used to establish a secured
TLS connection.
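
Roughly, the selection logic works like the following sketch (illustrative
only, not the patch's actual code; OpenSSL's SSL_get_client_CA_list() returns
the CA names the server sent in its certificate request):

#include <openssl/ssl.h>
#include <openssl/x509.h>

/* Pick the first candidate certificate whose issuer appears in the
 * server's list of acceptable CA names; NULL if none matches. */
static X509 *
choose_client_cert(SSL *ssl, X509 *candidates[], int ncandidates)
{
	STACK_OF(X509_NAME) *ca_names = SSL_get_client_CA_list(ssl);

	for (int i = 0; i < ncandidates; i++)
	{
		X509_NAME *issuer = X509_get_issuer_name(candidates[i]);

		for (int j = 0; j < sk_X509_NAME_num(ca_names); j++)
		{
			if (X509_NAME_cmp(issuer, sk_X509_NAME_value(ca_names, j)) == 0)
				return candidates[i];	/* issuer trusted by the server */
		}
	}
	return NULL;
}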

The feature is useful when a libpq client needs to communicate with multiple
TLS-enabled PostgreSQL server instances that have different TLS certificate
setups. Instead of making the application figure out which certificate to send
to which server, we can configure all possible certificate candidates in libpq
and have it choose the best one to use.

 

Hello Daniel

Sorry to bother you. I am just wondering about your opinion on this feature.
Should it be added to the commitfest for review? Testing this feature involves
certificates issued by different root CAs to exercise its ability to pick the
right certificate, so the existing SSL TAP tests' certificate generation script
needs an update. I have not done that yet, because I would like to discuss it
with you first.

Any comments and recommendations are welcome. Thank you!





Best regards

Cary Huang

v2-0001-multiple_client_certificate_selection_support.patch
Description: Binary data


Re: index prefetching

2024-03-01 Thread Peter Geoghegan
On Fri, Mar 1, 2024 at 10:18 AM Tomas Vondra
 wrote:
> But I have a very hard time figuring out what the MVP version should be,
> because I have a very limited understanding of how much control the index
> AM ought to have :-( And it'd be a bit silly to do something in v17,
> only to have to rip it out in v18 because it turned out to not get the
> split right.

I suspect that you're overestimating the difficulty of getting the
layering right (at least relative to the difficulty of everything
else).

The executor proper doesn't know anything about pins on leaf pages
(and in reality nbtree usually doesn't hold any pins these days). All
the executor knows is that it had better not be possible for an
in-flight index scan to get confused by concurrent TID recycling by
VACUUM. When amgettuple/btgettuple is called, nbtree usually just
returns TIDs it collected from a just-scanned leaf page.

This sort of stuff already lives in the index AM. It seems to me that
everything at the API and executor level can continue to work in
essentially the same way as it always has, with only minimal revision
to the wording around buffer pins (in fact that really should have
happened back in 2015, as part of commit 2ed5b87f).  The hard part
will be figuring out how to make the physical index scan prefetch
optimally, in a way that balances various considerations. These
include:

* Managing heap prefetch distance.

* Avoiding making kill_prior_tuple significantly less effective
(perhaps the new design could even make it more effective, in some
scenarios, by holding onto multiple buffer pins based on a dynamic
model).

* Figuring out how many leaf pages it makes sense to read ahead of
accessing the heap, since there is no fixed relationship between the
number of leaf pages we need to scan to collect a given number of
distinct heap blocks that we need for prefetching. (This is made more
complicated by things like LIMIT, but is actually an independent
problem.)

So I think that you need to teach index AMs to behave roughly as if
multiple leaf pages were read as one single leaf page, at least in
terms of things like how the BTScanOpaqueData.currPos state is
managed. I imagine that currPos will need to be filled with TIDs from
multiple index pages, instead of just one, with entries that are
organized in a way that preserves the illusion of one continuous scan
from the point of view of the executor proper. By the time we actually
start really returning TIDs via btgettuple, it looks like we scanned
one giant leaf page instead of several (the exact number of leaf pages
scanned will probably have to be indeterminate, because it'll depend
on things like heap prefetch distance).

The good news (assuming that I'm right here) is that you don't need to
have specific answers to most of these questions in order to commit a
v1 of index prefeteching. ISTM that all you really need is to have
confidence that the general approach that I've outlined is the right
approach, long term (certainly not nothing, but I'm at least
reasonably confident here).

> The hard thing is what to do about cases where neither of this helps.
> The example I keep thinking about is IOS - if we don't do prefetching,
> it's not hard to construct cases where regular index scan gets much
> faster than IOS (with many not-all-visible pages). But we can't just
> prefetch all pages, because that'd hurt IOS cases with most pages fully
> visible (when we don't need to actually access the heap).
>
> I managed to deal with this in the executor-level version, but I'm not
> sure how to do this if the control moves closer to the index AM.

The reality is that nbtree already knows about index-only scans. It
has to, because it wouldn't be safe to drop the pin on a leaf page's
buffer when the scan is "between pages" in the specific case of
index-only scans (so the _bt_killitems code path used when
kill_prior_tuple has index tuples to kill knows about index-only
scans).

I actually added commentary to the nbtree README that goes into TID
recycling by VACUUM not too long ago. This includes stuff about how
LP_UNUSED items in the heap are considered dead to all index scans
(which can actually try to look at a TID that just became LP_UNUSED in
the heap!), even though LP_UNUSED items don't prevent VACUUM from
setting heap pages all-visible. This seemed like the only way of
explaining the _bt_killitems IOS issue, that actually seemed to make
sense.

What you really want to do here is to balance costs and benefits.
That's just what's required. The fact that those costs and benefits
span multiple levels of abstractions makes it a bit awkward, but
doesn't (and can't) change the basic shape of the problem.

--
Peter Geoghegan




Re: Statistics Import and Export

2024-03-01 Thread Stephen Frost
Greetings,

On Fri, Mar 1, 2024 at 12:14 Nathan Bossart 
wrote:

> On Thu, Feb 29, 2024 at 10:55:20PM -0500, Corey Huinker wrote:
> >> That’s certainly a fair point and my initial reaction (which could
> >> certainly be wrong) is that it’s unlikely to be an issue- but also, if you
> >> feel you could make it work with an array and passing all the attribute
> >> info in with one call, which I suspect would be possible but just a bit
> >> more complex to build, then sure, go for it. If it ends up being overly
> >> unwieldy then perhaps the per-attribute call would be better and we could
> >> perhaps acquire the lock before the function calls..?  Doing a check to see
> >> if we have already locked it would be cheaper than trying to acquire a new
> >> lock, I’m fairly sure.
> >
> > Well the do_analyze() code was already ok with acquiring the lock once for
> > non-inherited stats and again for inherited stats, so the locks were
> > already not the end of the world. However, that's at most a 2x of the
> > locking required, and this would be natts * x, quite a bit more. Having the
> > procedures check for a pre-existing lock seems like a good compromise.
>
> I think this is a reasonable starting point.  If the benchmarks show that
> the locking is a problem, we can reevaluate, but otherwise IMHO we should
> try to keep it as simple/flexible as possible.


Yeah, this was my general feeling as well.  If it does become an issue, it
certainly seems like we would have ways to improve it in the future. Even
with this locking it is surely going to be better than having to re-analyze
the entire database which is where we are at now.

Thanks,

Stephen



Re: Statistics Import and Export

2024-03-01 Thread Nathan Bossart
On Thu, Feb 29, 2024 at 10:55:20PM -0500, Corey Huinker wrote:
>> That’s certainly a fair point and my initial reaction (which could
>> certainly be wrong) is that it’s unlikely to be an issue- but also, if you
>> feel you could make it work with an array and passing all the attribute
>> info in with one call, which I suspect would be possible but just a bit
>> more complex to build, then sure, go for it. If it ends up being overly
>> unwieldy then perhaps the  per-attribute call would be better and we could
>> perhaps acquire the lock before the function calls..?  Doing a check to see
>> if we have already locked it would be cheaper than trying to acquire a new
>> lock, I’m fairly sure.
> 
> Well the do_analyze() code was already ok with acquiring the lock once for
> non-inherited stats and again for inherited stats, so the locks were
> already not the end of the world. However, that's at most a 2x of the
> locking required, and this would be natts * x, quite a bit more. Having the
> procedures check for a pre-existing lock seems like a good compromise.

I think this is a reasonable starting point.  If the benchmarks show that
the locking is a problem, we can reevaluate, but otherwise IMHO we should
try to keep it as simple/flexible as possible.
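
For the archives, the "check for a pre-existing lock" compromise could look
roughly like this fragment (CheckRelationLockedByMe() is the existing lmgr
helper; the lock mode and function shape here are illustrative, not the
proposed patch):

/* Requires storage/lmgr.h; lock the relation only if this backend
 * doesn't already hold the lock (or a stronger one). */
static void
lock_rel_for_stats_once(Relation rel)
{
	if (!CheckRelationLockedByMe(rel, ShareUpdateExclusiveLock, true))
		LockRelationOid(RelationGetRelid(rel), ShareUpdateExclusiveLock);
}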

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: BitmapHeapScan streaming read user and prelim refactoring

2024-03-01 Thread Tomas Vondra


On 3/1/24 17:51, Melanie Plageman wrote:
> On Fri, Mar 1, 2024 at 9:05 AM Tomas Vondra
>  wrote:
>>
>> On 3/1/24 02:18, Melanie Plageman wrote:
>>> On Thu, Feb 29, 2024 at 6:44 PM Tomas Vondra
>>>  wrote:

 On 2/29/24 23:44, Tomas Vondra wrote:
 1) On master there's clear difference between eic=0 and eic=1 cases, but
 on the patched build there's literally no difference - for example the
 "uniform" distribution is clearly not great for prefetching, but eic=0
 regresses to eic=1 poor behavior).
>>>
>>> Yes, so eic=0 and eic=1 are identical with the streaming read API.
>>> That is, eic 0 does not disable prefetching. Thomas is going to update
>>> the streaming read API to avoid issuing an fadvise for the last block
>>> in a range before issuing a read -- which would mean no prefetching
>>> with eic 0 and eic 1. Not doing prefetching with eic 1 actually seems
>>> like the right behavior -- which would be different than what master
>>> is doing, right?
>>
>> I don't think we should stop doing prefetching for eic=1, or at least
>> not based just on these charts. I suspect these "uniform" charts are not
>> a great example for the prefetching, because it's about distribution of
>> individual rows, and even a small fraction of rows may match most of the
>> pages. It's great for finding strange behaviors / corner cases, but
>> probably not a sufficient reason to change the default.
> 
> Yes, I would like to see results from a data set where selectivity is
> more correlated to pages/heap fetches. But, I'm not sure I see how
> that is related to prefetching when eic = 1.
> 

OK, I'll make that happen.

>> I think it makes sense to issue a prefetch one page ahead, before
>> reading/processing the preceding one, and it's fairly conservative
>> setting, and I assume the default was chosen for a reason / after
>> discussion.
> 
> Yes, I suppose the overhead of an fadvise does not compare to the IO
> latency of synchronously reading that block. Actually, I bet the
> regression I saw by accidentally moving BitmapAdjustPrefetchIterator()
> after table_scan_bitmap_next_block() would be similar to the
> regression introduced by making eic = 1 not prefetch.
> 
> When you think about IO concurrency = 1, it doesn't imply prefetching
> to me. But, I think we want to do the right thing and have parity with
> master.
> 

Just to be sure we're on the same page regarding what eic=1 means,
consider a simple sequence of pages: A, B, C, D, E, ...

With the current "master" code, eic=1 means we'll issue a prefetch for B
and then read+process A. And then issue prefetch for C and read+process
B, and so on. It's always one page ahead.

Yes, if the page is already in memory, the fadvise is just overhead. It
may happen for various reasons (say, read-ahead). But it's just this one
case, I'd bet in other cases eic=1 would be a win.
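
Schematically, that "one page ahead" pattern is just (plain POSIX sketch,
not the actual BitmapHeapNext() code):

#include <sys/types.h>
#include <fcntl.h>
#include <unistd.h>

#define BLCKSZ 8192

/* Hint the *next* block with fadvise, then synchronously read and
 * process the current one -- always exactly one page ahead. */
static void
scan_one_ahead(int fd, const off_t *blocks, int nblocks, char *page)
{
	for (int i = 0; i < nblocks; i++)
	{
		if (i + 1 < nblocks)
			(void) posix_fadvise(fd, blocks[i + 1] * BLCKSZ, BLCKSZ,
								 POSIX_FADV_WILLNEED);
		(void) pread(fd, page, BLCKSZ, blocks[i] * BLCKSZ);
		/* ... process tuples on the page ... */
	}
}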

>> My suggestion would be to keep the master behavior unless not practical,
>> and then maybe discuss changing the details later. The patch is already
>> complicated enough, better to leave that discussion for later.
> 
> Agreed. Speaking of which, we need to add back use of tablespace IO
> concurrency for the streaming read API (which is used by
> BitmapHeapScan in master).
> 

+1

>>> With very low selectivity, you are less likely to get readahead
>>> (right?) and similarly less likely to be able to build up > 8kB IOs --
>>> which is one of the main value propositions of the streaming read
>>> code. I imagine that this larger read benefit is part of why the
>>> performance is better at higher selectivities with the patch. This
>>> might be a silly experiment, but we could try decreasing
>>> MAX_BUFFERS_PER_TRANSFER on the patched version and see if the
>>> performance gains go away.
>>
>> Sure, I can do that. Do you have any particular suggestion what value to
>> use for MAX_BUFFERS_PER_TRANSFER?
> 
> I think setting it to 1 would be the same as master -- always doing
> only 8kB reads. The only thing about that is that I imagine the other
> streaming read code has some overhead which might end up being a
> regression on balance even with the prefetching if we aren't actually
> using the ranges/vectored capabilities of the streaming read
> interface. Maybe if you just run it for one of the very obvious
> performance improvement cases? I can also try this locally.
> 

OK, I'll try with 1, and then we can adjust.

>> I'll also try to add a better version of uniform, where the selectivity
>> matches more closely to pages, not rows.
> 
> This would be great.
> 


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: BitmapHeapScan streaming read user and prelim refactoring

2024-03-01 Thread Melanie Plageman
On Fri, Mar 1, 2024 at 9:05 AM Tomas Vondra
 wrote:
>
> On 3/1/24 02:18, Melanie Plageman wrote:
> > On Thu, Feb 29, 2024 at 6:44 PM Tomas Vondra
> >  wrote:
> >>
> >> On 2/29/24 23:44, Tomas Vondra wrote:
> >> 1) On master there's clear difference between eic=0 and eic=1 cases, but
> >> on the patched build there's literally no difference - for example the
> >> "uniform" distribution is clearly not great for prefetching, but eic=0
> >> regresses to eic=1 poor behavior.
> >
> > Yes, so eic=0 and eic=1 are identical with the streaming read API.
> > That is, eic 0 does not disable prefetching. Thomas is going to update
> > the streaming read API to avoid issuing an fadvise for the last block
> > in a range before issuing a read -- which would mean no prefetching
> > with eic 0 and eic 1. Not doing prefetching with eic 1 actually seems
> > like the right behavior -- which would be different than what master
> > is doing, right?
>
> I don't think we should stop doing prefetching for eic=1, or at least
> not based just on these charts. I suspect these "uniform" charts are not
> a great example for the prefetching, because it's about distribution of
> individual rows, and even a small fraction of rows may match most of the
> pages. It's great for finding strange behaviors / corner cases, but
> probably not a sufficient reason to change the default.

Yes, I would like to see results from a data set where selectivity is
more correlated to pages/heap fetches. But, I'm not sure I see how
that is related to prefetching when eic = 1.

> I think it makes sense to issue a prefetch one page ahead, before
> reading/processing the preceding one, and it's a fairly conservative
> setting, and I assume the default was chosen for a reason / after
> discussion.

Yes, I suppose the overhead of an fadvise does not compare to the IO
latency of synchronously reading that block. Actually, I bet the
regression I saw by accidentally moving BitmapAdjustPrefetchIterator()
after table_scan_bitmap_next_block() would be similar to the
regression introduced by making eic = 1 not prefetch.

When you think about IO concurrency = 1, it doesn't imply prefetching
to me. But, I think we want to do the right thing and have parity with
master.

> My suggestion would be to keep the master behavior unless not practical,
> and then maybe discuss changing the details later. The patch is already
> complicated enough, better to leave that discussion for later.

Agreed. Speaking of which, we need to add back use of tablespace IO
concurrency for the streaming read API (which is used by
BitmapHeapScan in master).

> > With very low selectivity, you are less likely to get readahead
> > (right?) and similarly less likely to be able to build up > 8kB IOs --
> > which is one of the main value propositions of the streaming read
> > code. I imagine that this larger read benefit is part of why the
> > performance is better at higher selectivities with the patch. This
> > might be a silly experiment, but we could try decreasing
> > MAX_BUFFERS_PER_TRANSFER on the patched version and see if the
> > performance gains go away.
>
> Sure, I can do that. Do you have any particular suggestion what value to
> use for MAX_BUFFERS_PER_TRANSFER?

I think setting it to 1 would be the same as master -- always doing
only 8kB reads. The only thing about that is that I imagine the other
streaming read code has some overhead which might end up being a
regression on balance even with the prefetching if we aren't actually
using the ranges/vectored capabilities of the streaming read
interface. Maybe if you just run it for one of the very obvious
performance improvement cases? I can also try this locally.

> I'll also try to add a better version of uniform, where the selectivity
> matches more closely to pages, not rows.

This would be great.

- Melanie




Re: make BuiltinTrancheNames less ugly

2024-03-01 Thread Tristan Partin

On Fri Mar 1, 2024 at 8:00 AM CST, Alvaro Herrera wrote:

On 2024-Feb-23, Heikki Linnakangas wrote:

> On 12/02/2024 19:01, Tristan Partin wrote:
> > On Wed Jan 24, 2024 at 8:09 AM CST, Alvaro Herrera wrote:
> > > IMO it would be less ugly to have the origin file lwlocknames.txt be
> > > not a text file but a .h with a macro that can be defined by
> > > interested parties so that they can extract what they want from the
> > > file, like PG_CMDTAG or PG_KEYWORD.  Using Perl for this seems dirty...
> > 
> > I really like this idea, and would definitely be more inclined to see

> > a solution like this.
> 
> +1 to that idea from me too. Seems pretty straightforward.


OK, here's a patch that does it.  I have not touched Meson yet.

I'm pretty disappointed at not being able to remove
generate-lwlocknames.pl (it now generates the .h, no longer the .c
file), but I can't find a way to do the equivalent #defines in CPP ...
it'd require invoking the C preprocessor twice.  Maybe an intermediate
.h file would solve the problem, but I have no idea how that would work
with Meson.  I guess I'll do it in Make and let somebody suggest a Meson
way.


I can help you with Meson if you get the Make implementation done.

--
Tristan Partin
Neon (https://neon.tech)




Re: double precisoin type

2024-03-01 Thread Tom Lane
Fabrice Chapuis  writes:
> Documentation says:
> double precision 8 bytes variable-precision, inexact 15 decimal digits
> precision

The documentation is stating the minimum number of decimal digits
that will be accurately reproduced.  You got 16 reproduced correctly
in this example, but you were lucky.

float8out has a different rule, which is to emit enough digits to
describe the actually-stored binary value unambiguously, so that
dump and reload will not change the stored value.
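
A quick way to see the 15 vs. 17 digit distinction outside the server
(plain C sketch; the printed values assume a typical IEEE-754 double):

#include <stdio.h>

int
main(void)
{
	double		d = 123456789.123456789123456;

	printf("%.15g\n", d);		/* 123456789.123457 -- 15 digits, always reproduced */
	printf("%.17g\n", d);		/* 123456789.12345679 -- enough to round-trip */
	return 0;
}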

regards, tom lane




Re: POC, WIP: OR-clause support for indexes

2024-03-01 Thread Alexander Korotkov
Hi, Andrei,
Hi, Alena!

On Thu, Feb 29, 2024 at 10:59 AM Andrei Lepikhov 
wrote:

> On 28/2/2024 17:27, Alena Rybakina wrote:
> > Maybe like that:
> >
> > It also considers the way to generate a path using BitmapScan indexes,
> > converting the transformed expression into expressions separated by "OR"
> > operations, and if it turns out to be the best and finally selects the
> > best one.
> Thanks,
> I spent some time describing the feature with documentation.
> A condensed description of the GUC is in the runtime-config file.
> Feature description has spread between TransformOrExprToANY and
> generate_saop_pathlist routines.
> Also, I've made tiny changes in the code to look more smoothly.
> All modifications are integrated into the two new patches.
>
> Feel free to add, change or totally rewrite these changes.
>

I'm going to review and revise the patch.

One question I still have.

>/*
> * Transformation only works with both side type is not
> * { array | composite | domain | record }.

Why do we limit transformation for these types?  Also, it doesn't seem the
current code restricts anything except composite/record.

--
Regards,
Alexander Korotkov


Re: index prefetching

2024-03-01 Thread Tomas Vondra
On 2/15/24 21:30, Peter Geoghegan wrote:
> On Thu, Feb 15, 2024 at 3:13 PM Andres Freund  wrote:
>>> This is why I don't think that the tuples with lower page offset
>>> numbers are in any way significant here.  The significant part is
>>> whether or not you'll actually need to visit more than one leaf page
>>> in the first place (plus the penalty from not being able to reorder
>>> the work across page boundaries in your initial v1 of prefetching).
>>
>> To me this your phrasing just seems to reformulate the issue.
> 
> What I said to Tomas seems very obvious to me. I think that there
> might have been some kind of miscommunication (not a real
> disagreement). I was just trying to work through that.
> 
>> In practical terms you'll have to wait for the full IO latency when fetching
>> the table tuple corresponding to the first tid on a leaf page. Of course
>> that's also the moment you had to visit another leaf page. Whether the stall
>> is due to visiting another leaf page or due to processing the first entry on
>> such a leaf page is a distinction without a difference.
> 
> I don't think anybody said otherwise?
> 
>>>> That's certainly true / helpful, and it makes the "first entry" issue
>>>> much less common. But the issue is still there. Of course, this says
>>>> nothing about the importance of the issue - the impact may easily be so
>>>> small it's not worth worrying about.
>>>
>>> Right. And I want to be clear: I'm really *not* sure how much it
>>> matters. I just doubt that it's worth worrying about in v1 -- time
>>> grows short. Although I agree that we should commit a v1 that leaves
>>> the door open to improving matters in this area in v2.
>>
>> I somewhat doubt that it's realistic to aim for 17 at this point.
> 
> That's a fair point. Tomas?
> 

I think that's a fair assessment.

To me it seems doing the prefetching solely at the executor level is not
really workable. And if it can be made to work, there's far too many
open questions to do that in the last commitfest.

I think the consensus is at least some of the logic/control needs to
move back to the index AM. Maybe there's some minimal part that we could
do for v17, even if it has various limitations, and then improve that in
v18. Say, doing the leaf-page-at-a-time and passing a little bit of
information from the index scan to drive this.
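
Roughly something like this, to illustrate (all names here are invented, not
from any posted patch): the AM hands over the TIDs of the current leaf page,
and the heap side prefetches from that batch only.

#include "postgres.h"
#include "access/itup.h"
#include "storage/bufmgr.h"

/* Hypothetical sketch of the "leaf-page-at-a-time" hand-off. */
typedef struct IndexLeafBatch
{
	ItemPointerData tids[MaxIndexTuplesPerPage];	/* TIDs from current leaf */
	int			ntids;
} IndexLeafBatch;

static void
prefetch_from_leaf_batch(Relation heapRel, IndexLeafBatch *batch, int distance)
{
	/* never reorders work across a leaf page boundary */
	for (int i = 0; i < Min(batch->ntids, distance); i++)
		PrefetchBuffer(heapRel, MAIN_FORKNUM,
					   ItemPointerGetBlockNumber(&batch->tids[i]));
}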

But I have a very hard time figuring out what the MVP version should be,
because I have a very limited understanding of how much control the index
AM ought to have :-( And it'd be a bit silly to do something in v17,
only to have to rip it out in v18 because it turned out to not get the
split right.

>> We seem to
>> still be doing fairly fundamental architectural work. I think it might be the
>> right thing even for 18 to go for the simpler only-a-single-leaf-page
>> approach though.
> 
> I definitely think it's a good idea to have that as a fall back
> option. And to not commit ourselves to having something better than
> that for v1 (though we probably should commit to making that possible
> in v2).
> 

Yeah, I agree with that.

>> I wonder if there are prerequisites that can be tackled for 17. One idea is to
>> work on infrastructure to provide executor nodes with information about the
>> number of tuples likely to be fetched - I suspect we'll trigger regressions
>> without that in place.
> 
> I don't think that there'll be regressions if we just take the simpler
> only-a-single-leaf-page approach. At least it seems much less likely.
> 

I'm sure we could pass additional information from the index scans to
improve that further. But I think the gradual ramp-up would deal with
most regressions. At least that's my experience from benchmarking the
early version.

The hard thing is what to do about cases where neither of this helps.
The example I keep thinking about is IOS - if we don't do prefetching,
it's not hard to construct cases where regular index scan gets much
faster than IOS (with many not-all-visible pages). But we can't just
prefetch all pages, because that'd hurt IOS cases with most pages fully
visible (when we don't need to actually access the heap).
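
The per-page decision itself is simple enough -- a sketch of what the
executor-level version effectively did (VM_ALL_VISIBLE and PrefetchBuffer
are the real primitives; the helper is invented for illustration):

#include "postgres.h"
#include "access/visibilitymap.h"
#include "storage/bufmgr.h"

/* Sketch: only prefetch heap pages the IOS will actually have to visit. */
static void
ios_maybe_prefetch(Relation heapRel, BlockNumber blkno, Buffer *vmbuf)
{
	if (!VM_ALL_VISIBLE(heapRel, blkno, vmbuf))
		PrefetchBuffer(heapRel, MAIN_FORKNUM, blkno);
}

The hard part is not this check but deciding how far ahead to run it, which
is exactly what moves around with the AM/executor split.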

I managed to deal with this in the executor-level version, but I'm not
sure how to do this if the control moves closer to the index AM.

>> One way to *sometimes* process more than a single leaf page, without having to
>> redesign kill_prior_tuple, would be to use the visibilitymap to check if the
>> target pages are all-visible. If all the table pages on a leaf page are
>> all-visible, we know that we don't need to kill index entries, and thus can
>> move on to the next leaf page
> 
> It's possible that we'll need a variety of different strategies.
> nbtree already has two such strategies in _bt_killitems(), in a way.
> Though its "Modified while not pinned means hinting is not safe" path
> (LSN doesn't match canary value path) seems pretty naive. The
> prefetching stuff might present us with a good opportunity to replace
> that with something 

Re: index prefetching

2024-03-01 Thread Tomas Vondra
Hi,

Thanks for looking at the patch!


On 3/1/24 09:20, Jakub Wartak wrote:
> On Wed, Jan 24, 2024 at 7:13 PM Tomas Vondra
>  wrote:
> [
>>
>> (1) Melanie actually presented a very different way to implement this,
>> relying on the StreamingRead API. So chances are this struct won't
>> actually be used.
> 
> Given lots of effort already spent on this and the fact that this thread
> is actually two:
> 
> a. index/table prefetching since Jun 2023 till ~Jan 2024
> b. afterwards index/table prefetching with Streaming API, but there
> are some doubts of whether it could happen for v17 [1]
> 
> ... it would be a pity to not take benefits of such work (even if
> Streaming API wouldn't be ready for this; although there's lots of
> movement in the area), so I've played a little with the earlier
> implementation from [2] without streaming API as it already received
> feedback, it demonstrated big benefits, and earlier it got attention
> at the pgcon unconference. Perhaps, some of those comments might be passed
> later to the "b"-patch (once that's feasible):
> 

TBH I don't have a clear idea what to do. It'd be cool to have at least
some benefits in v17, but I don't know how to do that in a way that
would be useful in the future.

For example, the v20240124 patch implements this in the executor, but
based on the recent discussions it seems that's not the right layer -
the index AM needs to have some control, and I'm not convinced it's
possible to improve it in that direction (even ignoring the various
issues we identified in the executor-based approach).

I think it might be more practical to do this from the index AM, even if
it has various limitations. Ironically, that's what I proposed at pgcon,
but mostly because it was the quick way to do this.

> 1. v20240124-0001-Prefetch-heap-pages-during-index-scans.patch does
> not apply cleanly anymore, due show_buffer_usage() being quite
> recently refactored in 5de890e3610d5a12cdaea36413d967cf5c544e20 :
> 
> patching file src/backend/commands/explain.c
> Hunk #1 FAILED at 3568.
> Hunk #2 FAILED at 3679.
> 2 out of 2 hunks FAILED -- saving rejects to file
> src/backend/commands/explain.c.rej
> 
> 2. v2 applies (fixup), but it would nice to see that integrated into
> main patch (it adds IndexOnlyPrefetchInfo) into one patch
> 

Yeah, but I think it was an old patch version, no point in rebasing that
forever. Also, I'm not really convinced the executor-level approach is
the right path forward.

> 3. execMain.c :
> 
> + * XXX It might be possible to improve the prefetching code
> to handle this
> + * by "walking back" the TID queue, but it's not clear if
> it's worth it.
> 
> Shouldn't we just remove the XXX? The walking-back seems to be niche,
> and so are fetches using cursors, when looking at real-world users'
> queries? (support-case bias here, when looking at people's pg_stat_activity)
> 
> 4. Wouldn't it be better to leave PREFETCH_LRU_SIZE at static of 8,
> but base PREFETCH_LRU_COUNT on effective_io_concurrency instead?
> (allowing it to follow dynamically; the more prefetches the user wants
> to perform, the more you spread them across shared LRUs and the more
> memory for history is required?)
> 
> + * XXX Maybe we could consider effective_cache_size when sizing the cache?
> + * Not to size the cache for that, ofc, but maybe as a guidance of how many
> + * heap pages it might keep. Maybe just a fraction fraction of the value,
> + * say Max(8MB, effective_cache_size / max_connections) or something.
> + */
> +#definePREFETCH_LRU_SIZE8/* slots in one LRU */
> +#definePREFETCH_LRU_COUNT128 /* number of LRUs */
> +#definePREFETCH_CACHE_SIZE(PREFETCH_LRU_SIZE *
> PREFETCH_LRU_COUNT)
> 

I don't see why would this be related to effective_io_concurrency? It's
merely about how many recently accessed pages we expect to find in the
page cache. It's entirely separate from the prefetch distance.
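
For illustration, a toy version of that check (sketch only -- the patch
spreads this across multiple hashed LRUs, this collapses it to one):

#include "postgres.h"
#include "storage/block.h"

#define TOY_LRU_SIZE	8

typedef struct ToyPrefetchLru
{
	BlockNumber blocks[TOY_LRU_SIZE];
	int			next;			/* round-robin eviction slot */
} ToyPrefetchLru;

/* Returns true if blkno was prefetched recently; otherwise records it. */
static bool
toy_lru_probe(ToyPrefetchLru *lru, BlockNumber blkno)
{
	for (int i = 0; i < TOY_LRU_SIZE; i++)
	{
		if (lru->blocks[i] == blkno)
			return true;		/* likely still in page cache: skip fadvise */
	}
	lru->blocks[lru->next] = blkno;
	lru->next = (lru->next + 1) % TOY_LRU_SIZE;
	return false;
}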

> BTW:
> + * heap pages it might keep. Maybe just a fraction fraction of the value,
> that's a duplicated "fraction" word over there.
> 
> 5.
> + * XXX Could it be harmful that we read the queue backwards?
> Maybe memory
> + * prefetching works better for the forward direction?
> 
> I wouldn't care, we are optimizing I/O (and context-switching) which
> weighs much more than memory access direction impact and Dilipi
> earlier also expressed no concern, so maybe it could be also removed
> (one less "XXX" to care about)
> 

Yeah, I think it's negligible. Probably a microoptimization we can
investigate later, I don't want to complicate the code unnecessarily.

> 6. in IndexPrefetchFillQueue()
> 
> +while (!PREFETCH_QUEUE_FULL(prefetch))
> +{
> +IndexPrefetchEntry *entry
> += prefetch->next_cb(scan, direction, prefetch->data);
> 
> If we are at it... that's a strange split and assignment not indented :^)
> 
> 7. in 

double precisoin type

2024-03-01 Thread Fabrice Chapuis
Hello,

postgres [1264904]=# select 123456789.123456789123456::double precision;
┌────────────────────┐
│       float8       │
├────────────────────┤
│ 123456789.12345679 │
└────────────────────┘
(1 row)

I do not understand why this number is truncated at 123456789.12345679 --
that is 17 digits, not 15 digits.

Any idea?

Fabrice

Documentation says:
double precision 8 bytes variable-precision, inexact 15 decimal digits
precision


Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-01 Thread Bharath Rupireddy
On Thu, Feb 22, 2024 at 1:44 PM Bertrand Drouvot
 wrote:
>
> > > Does that make sense to you to use "conflict" as value in "invalidation_reason"
> > > when the slot has "conflict_reason" not NULL?
> >
> > I'm thinking the other way around - how about we revert
> > https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=007693f2a3ac2ac19affcb03ad43cdb36ccff5b5,
> > that is, put in place "conflict" as a boolean and introduce
> > invalidation_reason the text form. So, for logical slots, whenever the
> > "conflict" column is true, the reason is found in invaldiation_reason
> > column? How does it sound?
>
> Yeah, I think that looks fine too. We would need more changes (like take care of
> ddd5f4f54a for example).
>
> CC'ing Amit, Hou-San and Shveta to get their point of view (as the ones behind
> 007693f2a3 and ddd5f4f54a).

Yeah, let's wait for what others think about it.

FWIW, I've had to rebase the patches due to 943f7ae1c. Please see the
attached v6 patch set.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com


v6-0001-Track-invalidation_reason-in-pg_replication_slots.patch
Description: Binary data


v6-0002-Add-XID-based-replication-slot-invalidation.patch
Description: Binary data


v6-0003-Track-inactive-replication-slot-information.patch
Description: Binary data


v6-0004-Add-inactive_timeout-based-replication-slot-inval.patch
Description: Binary data


Re: Make query cancellation keys longer

2024-03-01 Thread Peter Eisentraut

On 29.02.24 22:25, Heikki Linnakangas wrote:
Currently, cancel request key is a 32-bit token, which isn't very much 
entropy. If you want to cancel another session's query, you can 
brute-force it. In most environments, an unauthorized cancellation of a 
query isn't very serious, but it nevertheless would be nice to have more 
protection from it. The attached patch makes it longer. It is an 
optional protocol feature, so it's fully backwards-compatible with 
clients that don't support longer keys.


My intuition would be to make this a protocol version bump, not an 
optional feature.  I think this is something that everyone should 
eventually be using, not a niche feature that you explicitly want to 
opt in to.


One complication with this was that because we no longer know how long 
the key should be, 4-bytes or something longer, until the backend has 
performed the protocol negotiation, we cannot generate the key in the 
postmaster before forking the process anymore.


Maybe this would be easier if it's a protocol version number change, 
since that is sent earlier than protocol extensions?






Re: pread, pwrite, etc return ssize_t not int

2024-03-01 Thread Peter Eisentraut

On 27.02.24 12:21, Thomas Munro wrote:

Patches attached.

PS Correction to my earlier statement about POSIX: the traditional K&R
interfaces were indeed in the original POSIX.1 1988 but it was the
1990 edition (approximately coinciding with standard C) that adopted
void, size_t, const and invented ssize_t.


0001-Return-ssize_t-in-fd.c-I-O-functions.patch

This patch looks correct to me.

0002-Fix-theoretical-overflow-in-Windows-pg_pread-pg_pwri.patch

I have two comments on that:

For the overflow of the input length (size_t -> DWORD), I don't think we 
actually need to do anything.  The size argument would be truncated, but 
the callers would just repeat the calls with the remaining size, so in 
effect they will read the data in chunks of rest + N * DWORD_MAX.  The 
patch just changes this to chunks of N * 1GB + rest.
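
The caller-side pattern being relied on there looks roughly like this
(plain POSIX sketch, not the actual fd.c code):

#include <unistd.h>
#include <sys/types.h>

/* Sketch: a short read just makes the caller loop with the remainder. */
static ssize_t
read_fully(int fd, char *buf, size_t len, off_t offset)
{
	size_t		done = 0;

	while (done < len)
	{
		ssize_t		n = pread(fd, buf + done, len - done, offset + done);

		if (n < 0)
			return -1;			/* caller inspects errno */
		if (n == 0)
			break;				/* EOF */
		done += (size_t) n;
	}
	return (ssize_t) done;
}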


The other issue, the possible overflow of size_t -> ssize_t is not 
specific to Windows.  We could install some protection against that on 
some other layer, but it's unclear how widespread that issue is or what 
the appropriate fix is.  POSIX says that passing in a size larger than 
SSIZE_MAX has implementation-defined effect.  The FreeBSD man page says 
that this will result in an EINVAL error.  So if we here truncate 
instead of error, we'd introduce a divergence.






Re: BitmapHeapScan streaming read user and prelim refactoring

2024-03-01 Thread Tomas Vondra



On 3/1/24 02:18, Melanie Plageman wrote:
> On Thu, Feb 29, 2024 at 6:44 PM Tomas Vondra
>  wrote:
>>
>> On 2/29/24 23:44, Tomas Vondra wrote:
>>>
>>> ...
>>>
>>>>>
>>>>> I do have some partial results, comparing the patches. I only ran one of
>>>>> the more affected workloads (cyclic) on the xeon, attached is a PDF
>>>>> comparing master and the 0001-0014 patches. The percentages are timing
>>>>> vs. the preceding patch (green - faster, red - slower).
>>>>
>>>> Just confirming: the results are for uncached?

>>>
>>> Yes, cyclic data set, uncached case. I picked this because it seemed
>>> like one of the most affected cases. Do you want me to test some other
>>> cases too?
>>>
>>
>> BTW I decided to look at the data from a slightly different angle and
>> compare the behavior with increasing effective_io_concurrency. Attached
>> are charts for three "uncached" cases:
>>
>>  * uniform, work_mem=4MB, workers_per_gather=0
>>  * linear-fuzz, work_mem=4MB, workers_per_gather=0
>>  * uniform, work_mem=4MB, workers_per_gather=4
>>
>> Each page has charts for master and patched build (with all patches). I
>> think there's a pretty obvious difference in how increasing e_i_c
>> affects the two builds:
> 
> Wow! These visualizations make it exceptionally clear. I want to go to
> the Vondra school of data visualizations for performance results!
> 

Welcome to my lecture on how to visualize data. The process has about
four simple steps:

1) collect data for a lot of potentially interesting cases
2) load them into excel / google sheets / ...
3) slice and dice them into charts that you understand / can explain
4) every now and then there's something you can't understand / explain

Thank you for attending my lecture ;-) No homework today.

>> 1) On master there's clear difference between eic=0 and eic=1 cases, but
>> on the patched build there's literally no difference - for example the
>> "uniform" distribution is clearly not great for prefetching, but eic=0
>> regresses to eic=1 poor behavior.
> 
> Yes, so eic=0 and eic=1 are identical with the streaming read API.
> That is, eic 0 does not disable prefetching. Thomas is going to update
> the streaming read API to avoid issuing an fadvise for the last block
> in a range before issuing a read -- which would mean no prefetching
> with eic 0 and eic 1. Not doing prefetching with eic 1 actually seems
> like the right behavior -- which would be different than what master
> is doing, right?
> 

I don't think we should stop doing prefetching for eic=1, or at least
not based just on these charts. I suspect these "uniform" charts are not
a great example for the prefetching, because it's about distribution of
individual rows, and even a small fraction of rows may match most of the
pages. It's great for finding strange behaviors / corner cases, but
probably not a sufficient reason to change the default.

I think it makes sense to issue a prefetch one page ahead, before
reading/processing the preceding one, and it's a fairly conservative
setting, and I assume the default was chosen for a reason / after
discussion.

My suggestion would be to keep the master behavior unless not practical,
and then maybe discuss changing the details later. The patch is already
complicated enough, better to leave that discussion for later.

> Hopefully this fixes the clear difference between master and the
> patched version at eic 0.
> 
>> 2) For some reason, the prefetching with eic>1 perform much better with
>> the patches, except for with very low selectivity values (close to 0%).
>> Not sure why this is happening - either the overhead is much lower
>> (which would matter on these "adversarial" data distribution, but how
>> could that be when fadvise is not free), or it ends up not doing any
>> prefetching (but then what about (1)?).
> 
> For the uniform with four parallel workers, eic == 0 being worse than
> master makes sense for the above reason. But I'm not totally sure why
> eic == 1 would be worse with the patch than with master. Both are
> doing a (somewhat useless) prefetch.
> 

Right.

> With very low selectivity, you are less likely to get readahead
> (right?) and similarly less likely to be able to build up > 8kB IOs --
> which is one of the main value propositions of the streaming read
> code. I imagine that this larger read benefit is part of why the
> performance is better at higher selectivities with the patch. This
> might be a silly experiment, but we could try decreasing
> MAX_BUFFERS_PER_TRANSFER on the patched version and see if the
> performance gains go away.
> 

Sure, I can do that. Do you have any particular suggestion what value to
use for MAX_BUFFERS_PER_TRANSFER?

I'll also try to add a better version of uniform, where the selectivity
matches more closely to pages, not rows.

>> 3) I'm not sure about the linear-fuzz case, the only explanation I have
>> we're able to skip almost all of the prefetches (and read-ahead likely
>> works pretty well here).
> 
> I started 

Re: make BuiltinTrancheNames less ugly

2024-03-01 Thread Alvaro Herrera
On 2024-Feb-23, Heikki Linnakangas wrote:

> On 12/02/2024 19:01, Tristan Partin wrote:
> > On Wed Jan 24, 2024 at 8:09 AM CST, Alvaro Herrera wrote:
> > > IMO it would be less ugly to have the origin file lwlocknames.txt be
> > > not a text file but a .h with a macro that can be defined by
> > > interested parties so that they can extract what they want from the
> > > file, like PG_CMDTAG or PG_KEYWORD.  Using Perl for this seems dirty...
> > 
> > I really like this idea, and would definitely be more inclined to see
> > a solution like this.
> 
> +1 to that idea from me too. Seems pretty straightforward.

OK, here's a patch that does it.  I have not touched Meson yet.

I'm pretty disappointed at not being able to remove
generate-lwlocknames.pl (it now generates the .h, no longer the .c
file), but I can't find a way to do the equivalent #defines in CPP ...
it'd require invoking the C preprocessor twice.  Maybe an intermediate
.h file would solve the problem, but I have no idea how that would work
with Meson.  I guess I'll do it in Make and let somebody suggest a Meson
way.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"El hombre nunca sabe de lo que es capaz hasta que lo intenta" (C. Dickens)
>From c70ae4eff6ec8e8965207d6e0d6d8fcdcd91cf16 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Fri, 1 Mar 2024 13:03:10 +0100
Subject: [PATCH] Rework lwlocknames.txt to become lwlocklist.h

This saves some code and some space in BuiltinTrancheNames, as foreseen
in commit 74a730631065.
---
 src/backend/Makefile  |  4 +-
 src/backend/storage/lmgr/.gitignore   |  1 -
 src/backend/storage/lmgr/Makefile |  9 +-
 .../storage/lmgr/generate-lwlocknames.pl  | 52 ++--
 src/backend/storage/lmgr/lwlock.c | 15 ++--
 src/backend/storage/lmgr/lwlocknames.txt  | 60 --
 .../utils/activity/wait_event_names.txt   |  8 +-
 src/include/storage/lwlocklist.h  | 82 +++
 8 files changed, 124 insertions(+), 107 deletions(-)
 delete mode 100644 src/backend/storage/lmgr/lwlocknames.txt
 create mode 100644 src/include/storage/lwlocklist.h

diff --git a/src/backend/Makefile b/src/backend/Makefile
index d66e2a4b9f..34bb6393c4 100644
--- a/src/backend/Makefile
+++ b/src/backend/Makefile
@@ -110,8 +110,8 @@ $(top_builddir)/src/port/libpgport_srv.a: | submake-libpgport
 parser/gram.h: parser/gram.y
 	$(MAKE) -C parser gram.h
 
-storage/lmgr/lwlocknames.h: storage/lmgr/generate-lwlocknames.pl storage/lmgr/lwlocknames.txt utils/activity/wait_event_names.txt
-	$(MAKE) -C storage/lmgr lwlocknames.h lwlocknames.c
+storage/lmgr/lwlocknames.h: storage/lmgr/generate-lwlocknames.pl $(top_srcdir)/src/include/storage/lwlocklist.h utils/activity/wait_event_names.txt
+	$(MAKE) -C storage/lmgr lwlocknames.h
 
 utils/activity/wait_event_types.h: utils/activity/generate-wait_event_types.pl utils/activity/wait_event_names.txt
 	$(MAKE) -C utils/activity wait_event_types.h pgstat_wait_event.c wait_event_funcs_data.c
diff --git a/src/backend/storage/lmgr/.gitignore b/src/backend/storage/lmgr/.gitignore
index dab4c3f580..8e5b734f15 100644
--- a/src/backend/storage/lmgr/.gitignore
+++ b/src/backend/storage/lmgr/.gitignore
@@ -1,3 +1,2 @@
-/lwlocknames.c
 /lwlocknames.h
 /s_lock_test
diff --git a/src/backend/storage/lmgr/Makefile b/src/backend/storage/lmgr/Makefile
index 1aef423384..9d7dbe5abd 100644
--- a/src/backend/storage/lmgr/Makefile
+++ b/src/backend/storage/lmgr/Makefile
@@ -18,7 +18,6 @@ OBJS = \
 	lmgr.o \
 	lock.o \
 	lwlock.o \
-	lwlocknames.o \
 	predicate.o \
 	proc.o \
 	s_lock.o \
@@ -35,11 +34,7 @@ s_lock_test: s_lock.c $(top_builddir)/src/common/libpgcommon.a $(top_builddir)/s
 		$(TASPATH) -L $(top_builddir)/src/common -lpgcommon \
 		-L $(top_builddir)/src/port -lpgport -lm -o s_lock_test
 
-# see notes in src/backend/parser/Makefile
-lwlocknames.c: lwlocknames.h
-	touch $@
-
-lwlocknames.h: $(top_srcdir)/src/backend/storage/lmgr/lwlocknames.txt $(top_srcdir)/src/backend/utils/activity/wait_event_names.txt generate-lwlocknames.pl
+lwlocknames.h: $(top_srcdir)/src/include/storage/lwlocklist.h $(top_srcdir)/src/backend/utils/activity/wait_event_names.txt generate-lwlocknames.pl
 	$(PERL) $(srcdir)/generate-lwlocknames.pl $^
 
 check: s_lock_test
@@ -47,4 +42,4 @@ check: s_lock_test
 
 clean:
 	rm -f s_lock_test
-	rm -f lwlocknames.h lwlocknames.c
+	rm -f lwlocknames.h
diff --git a/src/backend/storage/lmgr/generate-lwlocknames.pl b/src/backend/storage/lmgr/generate-lwlocknames.pl
index 7b93ecf6c1..f46634a180 100644
--- a/src/backend/storage/lmgr/generate-lwlocknames.pl
+++ b/src/backend/storage/lmgr/generate-lwlocknames.pl
@@ -1,6 +1,6 @@
 #!/usr/bin/perl
 #
-# Generate lwlocknames.h and lwlocknames.c from lwlocknames.txt
+# Generate lwlocknames.h from lwlocklist.h
 # Copyright (c) 2000-2024, PostgreSQL Global Development Group
 
 use strict;
@@ -14,26 +14,22 @@ my $continue = "\n";
 
 

Re: Commitfest Manager for March

2024-03-01 Thread Andrey M. Borodin



> On 1 Mar 2024, at 17:29, Daniel Gustafsson  wrote:
> 
> The call for a CFM volunteer is still open.

I always wanted to try. And most of the stuff I'm interested in is already 
committed.

But given the importance of the last commitfest before feature freeze, we
might be interested in a more experienced CFM.
If I can do something useful - I'm up for it.


Best regards, Andrey Borodin.



Re: btree: implement dynamic prefix truncation (was: Improving btree performance through specializing by key shape, take 2)

2024-03-01 Thread Matthias van de Meent
On Wed, 24 Jan 2024 at 13:02, Matthias van de Meent
 wrote:
> > 1.
> > Commit message refers to a non-existing reference '(see [0])'.
>
> Noted, I'll update that.
>
> > 2.
> > +When we do a binary search on a sorted set (such as a BTree), we know that 
> > a
> > +tuple will be smaller than its left neighbour, and larger than its right
> > +neighbour.
> >
> > I think this should be 'larger than left neighbour and smaller than
> > right neighbour' instead of the other way around.
>
> Noted, will be fixed, too.

Attached is version 15 of this patch, with the above issues fixed.
It's also rebased on top of 655dc310 from this morning, so it should
stay good for some time again.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)


v15-0001-btree-Implement-dynamic-prefix-compression.patch
Description: Binary data


Re: Commitfest Manager for March

2024-03-01 Thread Daniel Gustafsson
> On 29 Feb 2024, at 22:26, Daniel Gustafsson  wrote:

> We are now hours away from starting the last commitfest for v17

It is now March 1 in all timezones, so I have switched 202403 to In Progress
and 202307 to Open.  There are a total of 331 patches registered with 286 of
those in an open state; 24 of those have been around for 10 CFs or more.

The call for a CFM volunteer is still open.

--
Daniel Gustafsson





Re: src/include/miscadmin.h outdated comments

2024-03-01 Thread Matthias van de Meent
On Fri, 1 Mar 2024 at 12:19, jian he  wrote:
>
> hi.
>
> /*
>  *   globals.h -- *
>  
> */
>
The above comment in src/include/miscadmin.h is not accurate?
We don't have a globals.h file?

The header of the file describes the following:

 * miscadmin.h
 *  This file contains general postgres administration and initialization
 *  stuff that used to be spread out between the following files:
 *        globals.h        global variables
 *        pdir.h           directory path crud
 *        pinit.h          postgres initialization
 *        pmod.h           processing modes
 *  Over time, this has also become the preferred place for widely known
 *  resource-limitation stuff, such as work_mem and check_stack_depth().

So, presumably that section is what once was globals.h.

As to whether the comment should remain now that it's been 17 years
since those files were merged, I'm not sure: while the comment has
mostly historic value, there is something to be said about it
delineating sections of the file.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




src/include/miscadmin.h outdated comments

2024-03-01 Thread jian he
hi.

/*****************************************************************************
 *    globals.h --                                                           *
 *****************************************************************************/

The above comment in src/include/miscadmin.h is not accurate?
We don't have a globals.h file?




Re: Regardign RecentFlushPtr in WalSndWaitForWal()

2024-03-01 Thread Matthias van de Meent
On Mon, 26 Feb 2024 at 12:46, shveta malik  wrote:
>
> Hi hackers,
>
> I would like to understand why we have code [1] that retrieves
> RecentFlushPtr in WalSndWaitForWal() outside of the loop. We utilize
> RecentFlushPtr later within the loop, but prior to that, we already
> have [2]. Wouldn't [2] alone be sufficient?
>
> Just to check the impact, I ran 'make check-world' after removing [1],
> and did not see any issue exposed by the test at-least.

Yeah, that seems accurate.

> Any thoughts?
[...]
> [2]:
> /* Update our idea of the currently flushed position. */
> else if (!RecoveryInProgress())

I can't find where this "else" of this "else if" clause came from, as
this piece of code hasn't changed in years. But apart from that, your
observation seems accurate, yes.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




Re: Volatile write caches on macOS and Windows, redux

2024-03-01 Thread Thomas Munro
Rebased over 8d140c58.


v2-0001-Make-wal_sync_method-fdatasync-the-default-on-all.patch
Description: Binary data


v2-0002-Remove-fsync_writethrough-add-fsync-full-macOS-on.patch
Description: Binary data


Re: Regardign RecentFlushPtr in WalSndWaitForWal()

2024-03-01 Thread Bertrand Drouvot
Hi,

On Mon, Feb 26, 2024 at 05:16:39PM +0530, shveta malik wrote:
> Hi hackers,
> 
> I would like to understand why we have code [1] that retrieves
> RecentFlushPtr in WalSndWaitForWal() outside of the loop. We utilize
> RecentFlushPtr later within the loop, but prior to that, we already
> have [2]. Wouldn't [2] alone be sufficient?
> 
> Just to check the impact, I ran 'make check-world' after removing [1],
> and did not see any issue exposed by the test at-least.
> 
> Any thoughts?
> 
> [1]:
> /* Get a more recent flush pointer. */
> if (!RecoveryInProgress())
> RecentFlushPtr = GetFlushRecPtr(NULL);
> else
> RecentFlushPtr = GetXLogReplayRecPtr(NULL);
> 
> [2]:
> /* Update our idea of the currently flushed position. */
> else if (!RecoveryInProgress())
> RecentFlushPtr = GetFlushRecPtr(NULL);
> else
> RecentFlushPtr = GetXLogReplayRecPtr(NULL);
> 

It seems to me that [2] alone could be sufficient.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Missing LWLock protection in pgstat_reset_replslot()

2024-03-01 Thread Bertrand Drouvot
Hi hackers,

I think that pgstat_reset_replslot() is missing LWLock protection. Indeed, we
don't have any guarantee that the slot is active (which would prevent it from
being dropped/recreated) while this function is executed.

Attached a patch to add the missing protection.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
>From 26007012272631a6d1024c143c8dd3eed9430c03 Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot 
Date: Fri, 1 Mar 2024 10:04:17 +
Subject: [PATCH v1] Adding LWLock protection in pgstat_reset_replslot()

pgstat_reset_replslot() is missing LWLock protection, as we don't have any
guarantee that the slot is active (which would prevent it from being
dropped/recreated) while this function is executed.
---
 src/backend/utils/activity/pgstat_replslot.c | 4 
 1 file changed, 4 insertions(+)
 100.0% src/backend/utils/activity/

diff --git a/src/backend/utils/activity/pgstat_replslot.c b/src/backend/utils/activity/pgstat_replslot.c
index 70cabf2881..a113e1c486 100644
--- a/src/backend/utils/activity/pgstat_replslot.c
+++ b/src/backend/utils/activity/pgstat_replslot.c
@@ -46,6 +46,8 @@ pgstat_reset_replslot(const char *name)
 
 	Assert(name != NULL);
 
+	LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
+
 	/* Check if the slot exits with the given name. */
 	slot = SearchNamedReplicationSlot(name, true);
 
@@ -65,6 +67,8 @@ pgstat_reset_replslot(const char *name)
 	/* reset this one entry */
 	pgstat_reset(PGSTAT_KIND_REPLSLOT, InvalidOid,
  ReplicationSlotIndex(slot));
+
+	LWLockRelease(ReplicationSlotControlLock);
 }
 
 /*
-- 
2.34.1



Re: Improve readability by using designated initializers when possible

2024-03-01 Thread Peter Eisentraut

On 01.03.24 05:08, Michael Paquier wrote:

On Thu, Feb 29, 2024 at 12:41:38PM +0100, Peter Eisentraut wrote:

On 27.02.24 08:57, Alvaro Herrera wrote:

On 2024-Feb-27, Michael Paquier wrote:

These would cause compilation failures.  Saying that, this is a very
nice cleanup, so I've fixed these and applied the patch after checking
that the one-one replacements were correct.


Oh, I thought we were going to get rid of ObjectClass altogether -- I
mean, have getObjectClass() return ObjectAddress->classId, and then
define the OCLASS values for each catalog OID [... tries to ...]  But
this(*) doesn't work for two reasons:


I have long wondered what the point of ObjectClass is.  I find the extra
layer of redirection, which is used only in small parts of the code, and the
similarity to ObjectType confusing.  I happened to have a draft patch for
its removal lying around, so I'll show it here, rebased over what has
already been done in this thread.


The elimination of getObjectClass() seems like a good end goal IMO, so
the direction of the patch is interesting.  Would object_type_map and
ObjectProperty follow the same idea of relying on the catalogs OID
instead of the OBJECT_*?

Note that there are still two dependencies to getObjectClass() in
event_trigger.c and dependency.c.


Oops, there was a second commit in my branch that I neglected to send 
in.  Here is my complete patch set.
From 00fe9c6163261c1fdfc806934903bcd60b857633 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Sun, 11 Sep 2022 22:26:55 +0200
Subject: [PATCH v2 1/2] Don't use ObjectClass for event trigger support

---
 src/backend/catalog/dependency.c |   2 +-
 src/backend/commands/event_trigger.c | 122 +++
 src/include/commands/event_trigger.h |   2 +-
 3 files changed, 12 insertions(+), 114 deletions(-)

diff --git a/src/backend/catalog/dependency.c b/src/backend/catalog/dependency.c
index 2eb41d537b..192dedb3cb 100644
--- a/src/backend/catalog/dependency.c
+++ b/src/backend/catalog/dependency.c
@@ -208,7 +208,7 @@ deleteObjectsInList(ObjectAddresses *targetObjects, Relation *depRel,
 		if (extra->flags & DEPFLAG_REVERSE)
 			normal = true;
 
-		if (EventTriggerSupportsObjectClass(getObjectClass(thisobj)))
+		if (EventTriggerSupportsObject(thisobj))
 		{
 			EventTriggerSQLDropAddObject(thisobj, original, normal);
 		}
diff --git a/src/backend/commands/event_trigger.c b/src/backend/commands/event_trigger.c
index c8b662131c..f2188c7b82 100644
--- a/src/backend/commands/event_trigger.c
+++ b/src/backend/commands/event_trigger.c
@@ -1141,128 +1141,26 @@ EventTriggerSupportsObjectType(ObjectType obtype)
case OBJECT_EVENT_TRIGGER:
/* no support for event triggers on event triggers */
return false;
-   case OBJECT_ACCESS_METHOD:
-   case OBJECT_AGGREGATE:
-   case OBJECT_AMOP:
-   case OBJECT_AMPROC:
-   case OBJECT_ATTRIBUTE:
-   case OBJECT_CAST:
-   case OBJECT_COLUMN:
-   case OBJECT_COLLATION:
-   case OBJECT_CONVERSION:
-   case OBJECT_DEFACL:
-   case OBJECT_DEFAULT:
-   case OBJECT_DOMAIN:
-   case OBJECT_DOMCONSTRAINT:
-   case OBJECT_EXTENSION:
-   case OBJECT_FDW:
-   case OBJECT_FOREIGN_SERVER:
-   case OBJECT_FOREIGN_TABLE:
-   case OBJECT_FUNCTION:
-   case OBJECT_INDEX:
-   case OBJECT_LANGUAGE:
-   case OBJECT_LARGEOBJECT:
-   case OBJECT_MATVIEW:
-   case OBJECT_OPCLASS:
-   case OBJECT_OPERATOR:
-   case OBJECT_OPFAMILY:
-   case OBJECT_POLICY:
-   case OBJECT_PROCEDURE:
-   case OBJECT_PUBLICATION:
-   case OBJECT_PUBLICATION_NAMESPACE:
-   case OBJECT_PUBLICATION_REL:
-   case OBJECT_ROUTINE:
-   case OBJECT_RULE:
-   case OBJECT_SCHEMA:
-   case OBJECT_SEQUENCE:
-   case OBJECT_SUBSCRIPTION:
-   case OBJECT_STATISTIC_EXT:
-   case OBJECT_TABCONSTRAINT:
-   case OBJECT_TABLE:
-   case OBJECT_TRANSFORM:
-   case OBJECT_TRIGGER:
-   case OBJECT_TSCONFIGURATION:
-   case OBJECT_TSDICTIONARY:
-   case OBJECT_TSPARSER:
-   case OBJECT_TSTEMPLATE:
-   case OBJECT_TYPE:
-   case OBJECT_USER_MAPPING:
-   case OBJECT_VIEW:
+   default:
return true;
-
-   /*
-* There's intentionally no default: case here; we want 
the
-* 

Re: Synchronizing slots from primary to standby

2024-03-01 Thread Amit Kapila
On Fri, Mar 1, 2024 at 11:41 AM Masahiko Sawada  wrote:
>
> On Fri, Mar 1, 2024 at 12:42 PM Zhijie Hou (Fujitsu)
>  wrote:
> ---
> I was a bit surprised by the fact that standby_slot_names value is
> handled in a different way than a similar parameter
> synchronous_standby_names. For example, the following command doesn't
> work unless there is a replication slot 'slot1, slot2':
>
> =# alter system set standby_slot_names to 'slot1, slot2';
> ERROR:  invalid value for parameter "standby_slot_names": ""slot1, slot2""
> DETAIL:  replication slot "slot1, slot2" does not exist
>
> Whereas "alter system set synchronous_standby_names to stb1, stb2;"
> can correctly separate the string into 'stb1' and 'stb2'.
>
> Probably it would be okay since this behavior of standby_slot_names is
> the same as other GUC parameters that accept a comma-separated string.
> But I was confused a bit the first time I used it.
>

I think it is better to keep the behavior in this regard the same as
'synchronous_standby_names' because both have similarities w.r.t
replication.

-- 
With Regards,
Amit Kapila.




Re: Synchronizing slots from primary to standby

2024-03-01 Thread Ajin Cherian
On Thu, Feb 29, 2024 at 12:34 PM Zhijie Hou (Fujitsu) <
houzj.f...@fujitsu.com> wrote:

> On Monday, February 26, 2024 7:52 PM Amit Kapila 
> wrote:
> >
> > On Mon, Feb 26, 2024 at 7:49 AM Zhijie Hou (Fujitsu) <
> houzj.f...@fujitsu.com>
> > wrote:
> > >
> > > Attach the V98 patch set which addressed above comments.
> > >
> >
> > Few comments:
> > =
> > 1.
> >  WalSndWaitForWal(XLogRecPtr loc)
> >  {
> >   int wakeEvents;
> > + bool wait_for_standby = false;
> > + uint32 wait_event;
> > + List*standby_slots = NIL;
> >   static XLogRecPtr RecentFlushPtr = InvalidXLogRecPtr;
> >
> > + if (MyReplicationSlot->data.failover && replication_active)
> > + standby_slots = GetStandbySlotList(true);
> > +
> >   /*
> > - * Fast path to avoid acquiring the spinlock in case we already know we
> > - * have enough WAL available. This is particularly interesting if we're
> > - * far behind.
> > + * Check if all the standby servers have confirmed receipt of WAL up to
> > + * RecentFlushPtr even when we already know we have enough WAL available.
> > + *
> > + * Note that we cannot directly return without checking the status of
> > + * standby servers because the standby_slot_names may have changed, which
> > + * means there could be new standby slots in the list that have not yet
> > + * caught up to the RecentFlushPtr.
> >   */
> > - if (RecentFlushPtr != InvalidXLogRecPtr &&
> > - loc <= RecentFlushPtr)
> > - return RecentFlushPtr;
> > + if (!XLogRecPtrIsInvalid(RecentFlushPtr) && loc <= RecentFlushPtr) {
> > + FilterStandbySlots(RecentFlushPtr, &standby_slots);
> >
> > I think even if the slot list is not changed, we will always process each slot
> > mentioned in standby_slot_names once. Can't we cache the previous list of
> > slots we have already waited for? In that case, we won't even need to copy
> > the list via GetStandbySlotList() unless we need to wait.
> >
> > 2.
> >  WalSndWaitForWal(XLogRecPtr loc)
> >  {
> > + /*
> > + * Update the standby slots that have not yet caught up to the flushed
> > + * position. It is good to wait up to RecentFlushPtr and then let it
> > + * send the changes to logical subscribers one by one which are
> > + * already covered in RecentFlushPtr without needing to wait on every
> > + * change for standby confirmation.
> > + */
> > + if (wait_for_standby)
> > + FilterStandbySlots(RecentFlushPtr, &standby_slots);
> > +
> >   /* Update our idea of the currently flushed position. */
> > - if (!RecoveryInProgress())
> > + else if (!RecoveryInProgress())
> >   RecentFlushPtr = GetFlushRecPtr(NULL);
> >   else
> >   RecentFlushPtr = GetXLogReplayRecPtr(NULL); ...
> > /*
> > * If postmaster asked us to stop, don't wait anymore.
> > *
> > * It's important to do this check after the recomputation of
> > * RecentFlushPtr, so we can send all remaining data before shutting
> > * down.
> > */
> > if (got_STOPPING)
> > break;
> >
> > I think because 'wait_for_standby' may not be set in the first or consecutive
> > cycles, we may send the WAL to the logical subscriber before sending it to the
> > physical subscriber during shutdown.
>
> Here is the v101 patch set which addressed above comments.
>
> This version will cache the oldest standby slot's LSN each time we waited for
> them to catch up. The cached LSN is invalidated when we reload the GUC config.
> In the WalSndWaitForWal function, instead of traversing the entire standby list
> each time, we can check the cached LSN to quickly determine if the standbys
> have caught up. When a shutdown signal is received, we continue to wait for the
> standby slots to catch up. When waiting for the standbys to catch up after
> receiving the shutdown signal, an ERROR is reported if any slots are dropped,
> invalidated, or inactive. This measure is taken to prevent the walsender from
> waiting indefinitely.
>
>
Thanks for the patch.
I did some performance test runs on patch v101 with synchronous_commit
turned on, to check how much the logical replication changes affect
transaction speed on the primary compared to HEAD. In all configurations,
there is a primary, a logical subscriber and a physical standby, and the
logical subscriber is listed in synchronous_standby_names. This means no
transaction on the primary will be committed until the logical subscriber
has confirmed receipt of it.


Machine details:
Intel(R) Xeon(R) CPU E7-4890 v2 @ 2.80GHz, 800GB RAM

My additional configuration on each instance is:
shared_buffers = 40GB
max_worker_processes = 32
max_parallel_maintenance_workers = 24
max_parallel_workers = 32
synchronous_commit = off
checkpoint_timeout = 1d
max_wal_size = 24GB
min_wal_size = 15GB
autovacuum = off

All tests are done using pgbench running for 15 minutes:
Creating tables: pgbench -p 6972 postgres -qis 2
Running benchmark: pgbench postgres -p 6972 -c 10 -j 3 -T 900 -P 5

HEAD code-
 Primary with Synchronous_commit=on, physical standby with
hot_standby_feedback=off
RUN1 (TPS) RUN2 (TPS) AVERAGE 

Re: index prefetching

2024-03-01 Thread Jakub Wartak
On Wed, Jan 24, 2024 at 7:13 PM Tomas Vondra
 wrote:
[
>
> (1) Melanie actually presented a very different way to implement this,
> relying on the StreamingRead API. So chances are this struct won't
> actually be used.

Given lots of effort already spent on this and the fact that this thread
is actually two:

a. index/table prefetching since Jun 2023 till ~Jan 2024
b. afterwards index/table prefetching with Streaming API, but there
are some doubts of whether it could happen for v17 [1]

... it would be a pity to not take benefits of such work (even if
Streaming API wouldn't be ready for this; although there's lots of
movement in the area), so I've played a little with the earlier
implementation from [2] without streaming API as it already received
feedback, it demonstrated big benefits, and earlier it got attention
at the pgcon unconference. Perhaps, some of those comments might be passed
later to the "b"-patch (once that's feasible):

1. v20240124-0001-Prefetch-heap-pages-during-index-scans.patch does
not apply cleanly anymore, due show_buffer_usage() being quite
recently refactored in 5de890e3610d5a12cdaea36413d967cf5c544e20 :

patching file src/backend/commands/explain.c
Hunk #1 FAILED at 3568.
Hunk #2 FAILED at 3679.
2 out of 2 hunks FAILED -- saving rejects to file
src/backend/commands/explain.c.rej

2. v2 applies (fixup), but it would nice to see that integrated into
main patch (it adds IndexOnlyPrefetchInfo) into one patch

3. execMain.c :

+ * XXX It might be possible to improve the prefetching code
to handle this
+ * by "walking back" the TID queue, but it's not clear if
it's worth it.

Shouldn't we just remove the XXX? The walking-back seems to be niche,
and so are fetches using cursors, when looking at real-world users'
queries? (support-case bias here, when looking at people's pg_stat_activity)

4. Wouldn't it be better to leave PREFETCH_LRU_SIZE at static of 8,
but base PREFETCH_LRU_COUNT on effective_io_concurrency instead?
(allowing it to follow dynamically; the more prefetches the user wants
to perform, the more you spread them across shared LRUs and the more
memory for history is required?)

+ * XXX Maybe we could consider effective_cache_size when sizing the cache?
+ * Not to size the cache for that, ofc, but maybe as a guidance of how many
+ * heap pages it might keep. Maybe just a fraction fraction of the value,
+ * say Max(8MB, effective_cache_size / max_connections) or something.
+ */
+#definePREFETCH_LRU_SIZE8/* slots in one LRU */
+#definePREFETCH_LRU_COUNT128 /* number of LRUs */
+#definePREFETCH_CACHE_SIZE(PREFETCH_LRU_SIZE *
PREFETCH_LRU_COUNT)

BTW:
+ * heap pages it might keep. Maybe just a fraction fraction of the value,
that's a duplicated "fraction" word over there.

5.
+ * XXX Could it be harmful that we read the queue backwards?
Maybe memory
+ * prefetching works better for the forward direction?

I wouldn't care, we are optimizing I/O (and context-switching) which
weighs much more than memory access direction impact and Dilipi
earlier also expressed no concern, so maybe it could be also removed
(one less "XXX" to care about)

6. in IndexPrefetchFillQueue()

+while (!PREFETCH_QUEUE_FULL(prefetch))
+{
+IndexPrefetchEntry *entry
+= prefetch->next_cb(scan, direction, prefetch->data);

If we are at it... that's a strange split and assignment not indented :^)

7. in IndexPrefetchComputeTarget()

+ * XXX We cap the target to plan_rows, because it's pointless to prefetch
+ * more than we expect to use.

That's a nice fact that's already in patch, so XXX isn't needed?

8.
+ * XXX Maybe we should reduce the value with parallel workers?

I was assuming it could be a good idea, but the same
(eic/actual_parallel_works_per_gather) doesn't seem to be performed for
bitmap heap scan prefetches, so no?

9.
+/*
+ * No prefetching for direct I/O.
+ *
+ * XXX Shouldn't we do prefetching even for direct I/O? We would only
+ * pretend doing it now, ofc, because we'd not do posix_fadvise(), but
+ * once the code starts loading into shared buffers, that'd work.
+ */
+if ((io_direct_flags & IO_DIRECT_DATA) != 0)
+return 0;

It's redundant (?) and could be removed as
PrefetchBuffer()->PrefetchSharedBuffer() already has this at line 571:

#ifdef USE_PREFETCH
	/*
	 * Try to initiate an asynchronous read.  This returns false in
	 * recovery if the relation file doesn't exist.
	 */
	if ((io_direct_flags & IO_DIRECT_DATA) == 0 &&		/* <-- line 571 */
		smgrprefetch(smgr_reln, forkNum, blockNum, 1))
	{
		result.initiated_io = true;
	}
#endif

Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-01 Thread Masahiko Sawada
On Thu, Feb 29, 2024 at 8:43 PM John Naylor  wrote:
>
> On Tue, Feb 20, 2024 at 1:59 PM Masahiko Sawada  wrote:
>
> > - v63-0008 patch fixes a bug in tidstore.
>
> - page->nwords = wordnum + 1;
> - Assert(page->nwords = WORDS_PER_PAGE(offsets[num_offsets - 1]));
> + page->nwords = wordnum;
> + Assert(page->nwords == WORDS_PER_PAGE(offsets[num_offsets - 1]));
>
> Yikes, I'm guessing this failed in non-assert builds? I wonder why
> my compiler didn't yell at me... Have you tried a tidstore-debug build
> without asserts?

Yes. I didn't get any failures.

>
> > - v63-0009 patch is a draft idea of cleanup memory context handling.
>
> Thanks, looks pretty good!
>
> + ts->rt_context = AllocSetContextCreate(CurrentMemoryContext,
> +"tidstore storage",
>
> "tidstore storage" sounds a bit strange -- maybe look at some other
> context names for ideas.

Agreed. How about "tidstore's radix tree"?

>
> - leaf.alloc = (RT_PTR_ALLOC) MemoryContextAlloc(tree->leaf_ctx, allocsize);
> + leaf.alloc = (RT_PTR_ALLOC) MemoryContextAlloc(tree->leaf_ctx != NULL
> +? tree->leaf_ctx
> +: tree->context, allocsize);
>
> Instead of branching here, can we copy "context" to "leaf_ctx" when
> necessary (those names should look more like each other, btw)? I think
> that means anything not covered by this case:
>
> +#ifndef RT_VARLEN_VALUE_SIZE
> + if (sizeof(RT_VALUE_TYPE) > sizeof(RT_PTR_ALLOC))
> + tree->leaf_ctx = SlabContextCreate(ctx,
> +RT_STR(RT_PREFIX) "radix_tree leaf contex",
> +RT_SLAB_BLOCK_SIZE(sizeof(RT_VALUE_TYPE)),
> +sizeof(RT_VALUE_TYPE));
> +#endif /* !RT_VARLEN_VALUE_SIZE */
>
> ...also, we should document why we're using slab here. On that, I
> don't recall why we are? We've never had a fixed-length type test case
> on 64-bit, so it wasn't because it won through benchmarking. It seems
> a hold-over from the days of "multi-value leaves". Is it to avoid the
> possibility of space wastage with non-power-of-two size types?

Yes, it matches my understanding.

>
> For this stanza that remains unchanged:
>
> for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
> {
>   MemoryContextDelete(tree->node_slabs[i]);
> }
>
> if (tree->leaf_ctx)
> {
>   MemoryContextDelete(tree->leaf_ctx);
> }
>
> ...is there a reason we can't just delete tree->ctx, and let that
> recursively delete child contexts?

I thought that, considering RT_CREATE doesn't create its own memory
context but just uses the passed context, it might be surprising
to delete the passed context in the radix tree code. For example, if a
caller creates a radix tree (or tidstore) on a memory context and
wants to recreate it again and again, they would also need to re-create
the memory context along with it. It might be okay if we document this
side effect in comments on RT_CREATE, though. This is the same reason why
we don't destroy tree->dsa in RT_FREE(). And, as for RT_FREE_RECURSE(),

On Fri, Mar 1, 2024 at 1:15 PM John Naylor  wrote:
>
> I'm looking at RT_FREE_RECURSE again (only used for DSA memory), and
> I'm not convinced it's freeing all the memory. It's been many months
> since we discussed this last, but IIRC we cannot just tell DSA to free
> all its segments, right?

Right.

>  Is there currently anything preventing us
> from destroying the whole DSA area at once?

When it comes to tidstore and parallel vacuum, we initialize DSA and
create a tidstore there at the beginning of the lazy vacuum, and
recreate the tidstore again after the heap vacuum. So I don't want to
destroy the whole DSA when destroying the tidstore. Otherwise, we will
need to create a new DSA and pass its handle somehow.

Probably the bitmap scan case is similar. Given that bitmap scan
(re)creates tidbitmap in the same DSA multiple times, it's better to
avoid freeing the whole DSA.
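
To sketch the lifecycle I have in mind (the tidstore function names are
approximate and used only for illustration; dsa_create/dsa_detach are the
real DSA calls):

#include "utils/dsa.h"

/* Sketch: the DSA area outlives each tidstore built inside it. */
static void
vacuum_lifecycle_sketch(int tranche_id, size_t max_bytes)
{
	dsa_area   *area = dsa_create(tranche_id);	/* lives for the whole vacuum */
	TidStore   *ts;				/* hypothetical type name */

	ts = TidStoreCreate(max_bytes, area);	/* approximate API name */
	/* ... accumulate dead TIDs, vacuum indexes, vacuum heap ... */
	TidStoreDestroy(ts);		/* must free only the store's DSA memory */

	ts = TidStoreCreate(max_bytes, area);	/* recreate in the same area */
	/* ... second pass ... */
	TidStoreDestroy(ts);

	dsa_detach(area);			/* the area itself goes away at the end */
}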

>
> + /* The last level node has pointers to values */
> + if (shift == 0)
> + {
> +   dsa_free(tree->dsa, ptr);
> +   return;
> + }
>
> IIUC, this doesn't actually free leaves, it only frees the last-level
> node. And, this function is unaware of whether children could be
> embedded values. I'm thinking we need to get rid of the above
> pre-check and instead, each node kind to have something like (e.g.
> node4):
>
> RT_PTR_ALLOC child = n4->children[i];
>
> if (shift > 0)
>   RT_FREE_RECURSE(tree, child, shift - RT_SPAN);
> else if (!RT_CHILDPTR_IS_VALUE(child))
>   dsa_free(tree->dsa, child);
>
> ...or am I missing something?

You're not missing anything. RT_FREE_RECURSE() has not been updated
for a long time. If we still need to use RT_FREE_RECURSE(), it should
be updated.

> Thirdly, cosmetic: With the introduction of single-value leaves, it
> seems we should do s/RT_NODE_PTR/RT_CHILD_PTR/ -- what do you think?

Agreed.

On Fri, Mar 1, 2024 at 3:58 PM John Naylor  wrote:
>
> I wrote:
>
> > Secondly, I thought about my recent work to skip checking if we first
> > need to create a root node, and that has a harmless (for vacuum at
> > least) but slightly untidy behavior: When RT_SET is first called, and
> >