Re: Incorrect handling of OOM in WAL replay leading to data loss

2023-08-07 Thread Michael Paquier
On Tue, Aug 01, 2023 at 04:39:54PM -0700, Jeff Davis wrote:
> On Tue, 2023-08-01 at 16:14 +0300, Aleksander Alekseev wrote:
> > Probably I'm missing something, but if memory allocation is required
> > during WAL replay and it fails, wouldn't it be a better solution to
> > log the error and terminate the DBMS immediately?
> 
> We need to differentiate between:
> 
> 1. No valid record exists and it must be the end of WAL; LOG and start
> up.
> 
> 2. A valid record exists and we are unable to process it (e.g. due to
> OOM); PANIC.

Yes, though there is a bit more to it.  The origin of the introduction
of palloc(MCXT_ALLOC_NO_OOM) partially comes from this thread, which
reported a problem after we switched from malloc() to palloc() when
xlogreader.c got introduced:
https://www.postgresql.org/message-id/CAHGQGwE46cJC4rJGv+kVMV8g6BxHm9dBR_7_QdPjvJUqdt7m=q...@mail.gmail.com

And the malloc() behavior when replaying WAL records is even older
than that.

In the end, we want to be able to give more options to anybody looking
at WAL records, and let them make decisions based on the error reached
and the state of the system.  For example, it does not make much sense
to fail hard on OOM when replaying records in standby mode, because we
can just loop and retry.  The same can actually be said for crash
recovery.  On OOM, the startup process currently considers that we have
an invalid record, which is incorrect.  We could fail hard with a FATAL
so as to replay again (which sounds like the natural option), or we
could loop over the record that failed its allocation and retry it.  In
any case, we need to give more information back to the system so that
it can make better decisions about what it should do.
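
To illustrate the kind of outcome I have in mind, here is a purely
hypothetical sketch -- the enum, helper and message texts below are all
invented, only meant to show the reader handing back *why* a read failed
so that the caller can decide:

#include "postgres.h"

/* Hypothetical sketch only: these names do not exist in the tree. */
typedef enum XLogReaderErrorCode
{
	XLREAD_END_OF_WAL,			/* no valid record found: end of WAL */
	XLREAD_OOM,					/* a valid record exists but allocation failed */
	XLREAD_INVALID_RECORD		/* the record itself is bogus */
} XLogReaderErrorCode;

static bool						/* true = keep replaying, false = stop */
decide_after_failed_read(XLogReaderErrorCode code, bool in_standby_mode)
{
	switch (code)
	{
		case XLREAD_END_OF_WAL:
			return false;		/* LOG and finish recovery */
		case XLREAD_OOM:
			if (in_standby_mode)
				return true;	/* loop and retry the same record */
			/* in crash recovery, restart and retry rather than PANIC */
			elog(FATAL, "out of memory while replaying WAL");
			break;
		case XLREAD_INVALID_RECORD:
			elog(PANIC, "invalid WAL record");	/* never treat OOM as this */
			break;
	}
	return false;
}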
--
Michael




Re: Synchronizing slots from primary to standby

2023-08-07 Thread Drouvot, Bertrand

Hi,

On 8/8/23 7:01 AM, shveta malik wrote:
> On Mon, Aug 7, 2023 at 3:17 PM Drouvot, Bertrand wrote:
> >
> > Hi,
> >
> > On 8/4/23 1:32 PM, shveta malik wrote:
> > > On Fri, Aug 4, 2023 at 2:44 PM Drouvot, Bertrand wrote:
> > > > On 7/28/23 4:39 PM, Bharath Rupireddy wrote:
>
> Agreed. That is why in v10,v11 patches, we have different infra for
> sync-slot worker i.e. it is not relying on "logical replication
> background worker" anymore.

Yeah, saw that; looks like the right way to go to me.

> > Maybe we should start some tests/benchmark with only one sync worker
> > to get numbers and start from there?
>
> Yes, we can do that performance testing to figure out the difference
> between the two modes. I will try to get some statistics on this.

Great, thanks!

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: WIP: new system catalog pg_wait_event

2023-08-07 Thread Drouvot, Bertrand

Hi,

On 8/8/23 5:05 AM, Michael Paquier wrote:
> On Tue, Aug 08, 2023 at 11:53:32AM +0900, Kyotaro Horiguchi wrote:
> > As I mentioned in another thread, I'm uncertain about our stance on
> > the class id of the wait event. If a class acts as a namespace, we
> > should include it in the view. Otherwise, if the class id is just an
> > attribute of the wait event, we should make the event name unique.
>
> Including the class name in the view makes the most sense to me, FWIW,
> as it could be also possible that one reuses an event name in the
> existing in-core list, but for extensions.  That's of course not
> something I would recommend.

Thanks Kyotaro-san and Michael for the feedback. I do agree and will
add the class name.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Fix badly generated entries in wait_event_names.txt

2023-08-07 Thread Drouvot, Bertrand

Hi,

On 8/8/23 1:25 AM, Michael Paquier wrote:
> On Mon, Aug 07, 2023 at 10:34:43AM +0200, Drouvot, Bertrand wrote:
> > fa88928470 introduced src/backend/utils/activity/wait_event_names.txt that has
> > been auto-generated. The auto-generation had parsing errors leading to bad
> > entries in wait_event_names.txt (when the same "WAIT_EVENT" name can be found as
> > a substring of another one, then descriptions were merged in one of them).
>
> Thanks for the patch, applied.

Thanks!

> I have double-checked the contents of
> the tables and the five entries you have pointed out are the only ones
> with duplicated descriptions.

Yeah, did the same on my side before submitting the patch, and all looks
good now.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: 2023-08-10 release announcement draft

2023-08-07 Thread Erik Rijkers

On 8/8/23 at 03:15, Jonathan S. Katz wrote:

> Please provide your feedback no later than August 10, 2023 0:00 AoE[1].

'You us' should be
'You use'
   (2x)


Erik




Re: proposal: psql: show current user in prompt

2023-08-07 Thread Pavel Stehule
po 31. 7. 2023 v 17:46 odesílatel Jelte Fennema  napsal:

> On Mon, 24 Jul 2023 at 21:16, Pavel Stehule 
> wrote:
> > I don't understand how it can be possible to do it without.  I need to
> > process possible errors, and then I need to read and synchronize the
> > protocol. I didn't inject this feature into some other flow, so I need
> > to implement a complete process.
>
> I think I might be missing the reason for this then. Could you explain
> a bit more why you didn't inject the feature into another flow?
> Because it seems like it would be better to inserting the logic for
> handling the new response packet in pqParseInput3(), and then wait on
> the result with PQexecFinish(). This would allow sending these
> messages in a pipelined mode.
>
> > For example, PQsetClientEncoding does a PQexec call, which is much more
> > expensive.
>
> Yeah, but you'd usually only call that once for the life of the
> connection. But honestly it would still be good if there was a
> pipelined version of that function.
>
> > Unfortunately, for this feature, I cannot change some local state
> > variables, but I need to change the state of the server. A dedicated
> > message is necessary, because we don't want to be limited by the current
> > transaction state, and then we cannot reuse PQexec.
>
> I guess this is your reasoning for why it needs its own state machine,
> but I don't think I understand the problem. Could you expand a bit
> more on what you mean? Note that different message types use
> PQexecFinish to wait for their result, e.g. PQdescribePrepared and
> PQclosePrepared use PQexecFinish too and those wait for a
> RowDescription and a Close message respectively. I added the logic for
> PQclosePrepared recently; that patch might be some helpful example
> code:
> https://github.com/postgres/postgres/commit/28b5726561841556dc3e00ffe26b01a8107ee654


The reason why I implemented a separate flow is usage from psql and
independence from transaction state.  It is used for the \set command, which
is non-transactional, not SQL. If I inject this message into some other flow,
I lose this independence. The proposed message can be injected into other
flows too, I think, but for the proposed psql feature it is not practical.
Without independence from transaction state and SQL, I could just implement
an SQL function that sets reporting for any GUC, which is simpler than
extending the protocol.

Regards

Pavel



>
> > The API can be changed from PQlinkParameterStatus(PGconn *conn, const
> > char *paramName)
> >
> > to
> >
> > PQlinkParameterStatus(PGconn *conn, int nParamNames, const char
> > *paramNames)
>
> That would definitely address the issue with the many round trips
> being needed. But it would still leave the issue of introducing a
> second state machine in the libpq code. So if it's possible to combine
> this new code into the existing state machine, then that seems a lot
> better.
>


[PATCH] update the comment in SnapshotSetCommandId

2023-08-07 Thread Xiaoran Wang
Hi,

This patch updates the comment in 'SnapshotSetCommandId' to explain why it
is not necessary to update the 'curcid' of the CatalogSnapshot.

Best regards, xiaoran


0001-Update-the-comment-in-SnapshotSetCommandId.patch
Description: 0001-Update-the-comment-in-SnapshotSetCommandId.patch


Re: Synchronizing slots from primary to standby

2023-08-07 Thread shveta malik
On Mon, Aug 7, 2023 at 3:17 PM Drouvot, Bertrand
 wrote:
>
> Hi,
>
> On 8/4/23 1:32 PM, shveta malik wrote:
> > On Fri, Aug 4, 2023 at 2:44 PM Drouvot, Bertrand
> >  wrote:
> >> On 7/28/23 4:39 PM, Bharath Rupireddy wrote:
>
> >> Sorry to be late, but I gave a second thought and I wonder if we really 
> >> need this design.
> >> (i.e start a logical replication background worker on the standby to sync 
> >> the slots).
> >>
> >> Wouldn't that be simpler to "just" update the sync slots "metadata"
> >> as the https://github.com/EnterpriseDB/pg_failover_slots module (mentioned 
> >> by Peter
> >> up-thread) is doing?
> >> (making use of LogicalConfirmReceivedLocation(), 
> >> LogicalIncreaseXminForSlot()
> >> and LogicalIncreaseRestartDecodingForSlot(), If I read 
> >> synchronize_one_slot() correctly).
> >>
> >
> > Agreed. It would be simpler to just update the metadata. I think you
> > have not got chance to review the latest posted patch ('v10-0003')
> > yet, it does the same.
>
> Thanks for the feedback! Yeah, I did not look at v10 in details and was
> looking at the email thread only.
>
> Indeed, I now see that 0003 does update the metadata in local_slot_advance(),
> that's great!
>
> >
> > But I do not quite get it as in how can we do it w/o starting a
> > background worker?
>
> Yeah, agree that we still need background workers.
> What I meant was to avoid to use "logical replication background worker"
> (aka through logicalrep_worker_launch()) to sync the slots.
>

Agreed. That is why in v10,v11 patches, we have different infra for
sync-slot worker i.e. it is not relying on "logical replication
background worker" anymore.

> > The question here is how many background workers we
> > need to have. Will one be sufficient or do we need one per db (as done
> > earlier by the original patches in this thread) or are we good with
> > dividing work among some limited number of workers?
> >
> > I feel syncing all slots in one worker may increase the lag between
> > subsequent syncs for a particular slot, and if the number of slots is
> > huge, the chances of losing slot data are higher in case of failure.
> > Starting one worker per db also might not be that efficient, as it will
> > increase the load on the system (both in terms of background workers and
> > network traffic), especially for a case where the number of dbs is
> > large. Thus starting at most 'n' workers, where 'n' is decided by a GUC,
> > and dividing the work/DBs among them looks like a better option to me.
> > Please see the discussion in and around the email at [1]
> >
> > [1]: 
> > https://www.postgresql.org/message-id/CAJpy0uCT%2BnpL4eUvCWiV_MBEri9ixcUgJVDdsBCJSqLd0oD1fQ%40mail.gmail.com
>
> Thanks for the link! If I read the email thread correctly, this discussion
> was before V10 (which is the first version making use of
> LogicalConfirmReceivedLocation(), LogicalIncreaseXminForSlot() and
> LogicalIncreaseRestartDecodingForSlot()), i.e. before the metadata-only
> sync had been implemented.
>
> While I agree that the approach to split the sync load among workers with the 
> new
> max_slot_sync_workers GUC seems reasonable without the sync only feature (pre 
> V10),
> I'm not sure that with the metadata sync only in place the extra complexity 
> to manage multiple
> sync workers is needed.
>
> Maybe we should start some tests/benchmark with only one sync worker to get 
> numbers
> and start from there?

Yes, we can do that performance testing to figure out the difference
between the two modes. I will try to get some statistics on this.

thanks
Shveta




Re: pg_upgrade fails with in-place tablespace

2023-08-07 Thread Rui Zhao
Thank you for your reply. I have implemented the necessary changes in 
accordance with your suggestions.
On Tue, Aug 08, 2023 at 09:57:00AM +0800, Michael Paquier wrote:
> I don't have a good feeling about enforcing allow_in_place_tablespaces
> in the connection creating the tablespace, as it can be useful to let
> the restore of the dump fail if this GUC is disabled in the new
> cluster, so that one can check that no in-place tablespaces are left
> around on the new cluster restoring the contents of the dump.
I have set allow_in_place_tablespaces to true only during a binary upgrade.
Please review my latest patch.
--
Best regards,
Rui Zhao


0001-Fix-pg_upgrade-with-in-place-tablespaces.patch
Description: Binary data


Re: Use of additional index columns in rows filtering

2023-08-07 Thread Peter Geoghegan
On Mon, Aug 7, 2023 at 3:18 PM Peter Geoghegan  wrote:
> Even my patch cannot always make SAOP clauses into index quals. There
> are specific remaining gaps that I hope that your patch will still
> cover. The simplest example is a similar NOT IN() inequality, like
> this:
>
> select
>   ctid, *
> from
>   tenk1
> where
>   thousand = 42
>   and
>   tenthous not in (1, 3, 42, 43, 44, 45, 46, 47, 48, 49, 50);
>
> There is no way that my patch can handle this case. And your patch
> seems to be unable to do better than master here, either -- just like
> with the "tenthous in ( )" variant. Once again, the inequality SAOP
> also ends up as table filter quals, not index filter quals.
>
> It would also be nice if we found a way of doing this, while still
> reliably avoiding all visibility checks (just like "real index quals"
> will) -- since that should be safe in this specific case.

Actually, this isn't limited to SAOP inequalities. It appears as if
*any* simple inequality has the same limitation. So, for example, the
following query can only use table filters with the patch (never index
filters):

select
  ctid, *
from
  tenk1
where
  thousand = 42 and tenthous != 1;

This variant will use index filters, as expected (though with some
risk of heap accesses when VM bits aren't set):

select
  ctid, *
from
  tenk1
where
  thousand = 42 and tenthous is distinct from 1;

Offhand I suspect that it's a similar issue to the one you described for SAOPs.

I see that get_op_btree_interpretation() will treat != as a kind of
honorary member of an opfamily whose = operator has our != operator as
its negator. Perhaps we should be finding a way to pass != quals into
the index AM so that they become true index quals (obviously they
would only be index filter predicates, never access predicates). That
has the advantage of working in a way that's analogous to the way that
index quals already avoid visibility checks.
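
For what it's worth, a rough sketch of how such a '!=' could be recognized
on the planner side, relying on the get_op_btree_interpretation() behavior
described above (illustrative only, not a worked-out patch):

#include "postgres.h"
#include "nodes/pg_list.h"
#include "nodes/primnodes.h"
#include "utils/lsyscache.h"

/*
 * Sketch: decide whether an operator is usable as a "negated equality"
 * candidate.  get_op_btree_interpretation() reports '!=' operators as
 * ROWCOMPARE_NE entries when their negator is a btree equality member.
 */
static bool
is_btree_ne_operator(Oid opno)
{
	List	   *interps = get_op_btree_interpretation(opno);
	ListCell   *lc;

	foreach(lc, interps)
	{
		OpBtreeInterpretation *interp = (OpBtreeInterpretation *) lfirst(lc);

		if (interp->strategy == ROWCOMPARE_NE)
			return true;		/* candidate for an index filter predicate */
	}
	return false;
}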

-- 
Peter Geoghegan




Re: Performance degradation on concurrent COPY into a single relation in PG16.

2023-08-07 Thread Masahiko Sawada
On Tue, Aug 8, 2023 at 3:10 AM Andres Freund  wrote:
>
> Hi,
>
> On 2023-08-07 23:05:39 +0900, Masahiko Sawada wrote:
> > On Mon, Aug 7, 2023 at 3:16 PM David Rowley  wrote:
> > >
> > > On Wed, 2 Aug 2023 at 13:35, David Rowley  wrote:
> > > > So, it looks like this item can be closed off.  I'll hold off from
> > > > doing that for a few days just in case anyone else wants to give
> > > > feedback or test themselves.
> > >
> > > Alright, closed.
> >
> > IIUC the problem with multiple concurrent COPY is not resolved yet.
>
> Yea - it was just hard to analyze until the other regressions were fixed.
>
>
> > The result of nclients = 1 became better thanks to recent fixes, but
> > there still seems to be the performance regression at nclient = 2~16
> > (on RHEL 8 and 9). Andres reported[1] that after changing
> > MAX_BUFFERED_TUPLES to 5000 the numbers became a lot better but it
> > would not be the solution, as he mentioned.
>
> I think there could be a quite simple fix: Track by how much we've extended
> the relation previously in the same bistate. If we already extended by many
> blocks, it's very likely that we'll do so further.
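
For illustration, a rough sketch of that heuristic (the struct field and
the doubling/cap values here are invented, not taken from the prototype
patch):

#include "postgres.h"
#include "storage/bufmgr.h"

/* Sketch only: "recent_extended_by" is an invented field. */
typedef struct BulkInsertStateData
{
	BufferAccessStrategy strategy;
	Buffer		current_buf;
	uint32		recent_extended_by; /* blocks added by the last extension */
} BulkInsertStateData;

static int
num_blocks_to_request(BulkInsertStateData *bistate, int blocks_needed)
{
	/*
	 * If this bistate already extended the relation by many blocks, it is
	 * very likely to keep doing so; request a larger chunk up front instead
	 * of taking the extension path again and again for small amounts.
	 */
	int			extend_by = Max(blocks_needed,
								bistate->recent_extended_by * 2);

	extend_by = Min(extend_by, 64);		/* keep some upper bound */
	bistate->recent_extended_by = extend_by;
	return extend_by;
}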
>
> A simple prototype patch attached. The results for me are promising. I copied
> a smaller file [1], to have more accurate throughput results at shorter runs
> (15s).

Thank you for the patch!

>
> HEAD before:
> clients  tps
>  1        41
>  2        76
>  4       136
>  8       248
> 16       360
> 32       375
> 64       317
>
>
> HEAD after:
> clients  tps
>  1        43
>  2        80
>  4       155
>  8       280
> 16       369
> 32       405
> 64       344
>
> Any chance you could re-run your benchmark? I don't see as much of a
> regression vs 16 as you...

Sure. The results are promising for me too:

 nclients = 1, execution time = 13.743
 nclients = 2, execution time = 7.552
 nclients = 4, execution time = 4.758
 nclients = 8, execution time = 3.035
 nclients = 16, execution time = 2.172
 nclients = 32, execution time = 1.959
nclients = 64, execution time = 1.819
nclients = 128, execution time = 1.583
nclients = 256, execution time = 1.631

Here are results of the same benchmark test you used:

w/o patch:
clients     tps
1   66.702
2   59.696
4   73.783
8   168.115
16  400.134
32  574.098
64  565.373
128 526.303
256 591.751

w/ patch:
clients   tps
1   67.735
2   122.534
4   240.707
8   398.944
16  541.097
32  643.083
64  614.775
128 616.007
256 577.885

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Synchronizing slots from primary to standby

2023-08-07 Thread shveta malik
On Tue, Aug 1, 2023 at 4:52 PM shveta malik  wrote:
>
> On Thu, Jul 27, 2023 at 12:13 PM shveta malik  wrote:
> >
> > On Thu, Jul 27, 2023 at 10:55 AM Amit Kapila  
> > wrote:
> > >
> > > On Wed, Jul 26, 2023 at 10:31 AM shveta malik  
> > > wrote:
> > > >
> > > > On Mon, Jul 24, 2023 at 9:00 AM Amit Kapila  
> > > > wrote:
> > > > >
> > > > > On Mon, Jul 24, 2023 at 8:03 AM Bharath Rupireddy
> > > > >  wrote:
> > > > > >
> > > > > > Is having one (or a few more - not
> > > > > > necessarily one for each logical slot) worker for all logical slots
> > > > > > enough?
> > > > > >
> > > > >
> > > > > I guess for a large number of slots there is a possibility of a large
> > > > > gap in syncing the slots which probably means we need to retain
> > > > > corresponding WAL for a much longer time on the primary. If we can
> > > > > prove that the gap won't be large enough to matter then this would be
> > > > > probably worth considering otherwise, I think we should find a way to
> > > > > scale the number of workers to avoid the large gap.
> > > > >
> > > >
> > > > How about this:
> > > >
> > > > 1) On standby, spawn 1 worker per database in the start (as it is
> > > > doing currently).
> > > >
> > > > 2) Maintain statistics on activity against each primary's database on
> > > > standby by any means. Could be by maintaining 'last_synced_time' and
> > > > 'last_activity_seen time'.  The last_synced_time is updated every time
> > > > we sync/recheck slots for that particular database. The
> > > > 'last_activity_seen_time' changes only if we get any slot on that
> > > > database where actually confirmed_flush or say restart_lsn has changed
> > > > from what was maintained already.
> > > >
> > > > 3) If at any moment, we find that 'last_synced_time' -
> > > > 'last_activity_seen' goes beyond a threshold, that means that DB is
> > > > not active currently. Add it to list of inactive DB
> > > >
> > >
> > > I think we should also increase the next_sync_time if in current sync,
> > > there is no update.
> >
> > +1
> >
> > >
> > > > 4) Launcher on the other hand is always checking if it needs to spawn
> > > > any other extra worker for any new DB. It will additionally check if
> > > > number of inactive databases (maintained on standby) has gone higher
> > > > (> some threshold), then it brings down the workers for those and
> > > > starts a common worker which takes care of all such inactive databases
> > > > (or merge all in 1), while workers for active databases remain as such
> > > > (i.e. one per db). Each worker maintains the list of DBs which it is
> > > > responsible for.
> > > >
> > > > 5) If in the list of these inactive databases, we again find any
> > > > active database using the above logic, then the launcher will spawn a
> > > > separate worker for that.
> > > >
> > >
> > > I wonder if we anyway need some sort of design like this, because we
> > > shouldn't allow spawning as many workers as the number of databases.
> > > There has to be some existing or new GUC like max_sync_slot_workers
> > > which decides the number of workers.
> > >
> >
> > Currently it does not have any such GUC for sync-slot workers. It
> > mainly uses the logical-rep-worker framework for the sync-slot worker
> > part and thus it relies on 'max_logical_replication_workers' GUC. Also
> > it errors out if 'max_replication_slots' is set to zero. I think it is
> > not the correct way of doing things for sync-slot. We can have a new
> > GUC (max_sync_slot_workers) as you suggested and if the number of
> > databases < max_sync_slot_workers, then we can start 1 worker per
> > dbid, else divide the work equally among the max sync-workers
> > possible. And for inactive database cases, we can increase the
> > next_sync_time rather than starting a special worker to handle all the
> > inactive databases.  Thoughts?
> >
>
> Attaching the PoC patch (0003) where attempts to implement the basic
> infrastructure for the suggested design. Rebased the existing patches
> (0001 and 0002) as well.
>
> This patch adds a new GUC max_slot_sync_workers; the default and max
> values are kept at 2 and 50 respectively for this PoC patch. Now the
> replication launcher divides the work equally among these slot-sync
> workers. Let us say there are multiple slots on the primary belonging
> to 10 DBs and the new GUC on the standby is set at its default value
> of 2; then each worker on the standby will manage 5 dbs individually
> and will keep syncing the slots for them. If a new DB is found by the
> replication launcher, it will assign this new db to the worker
> handling the minimum number of dbs currently (or the first worker in
> case of an equal count), and that worker will pick up the new db the
> next time it tries to sync the slots.
> I have kept the changes in separate patches (003) for ease of review.
> Since this is just a PoC patch, many things are yet to be done
> appropriately, will cover those in next versions.
>
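
For illustration, the launcher-side assignment described above amounts to
something like the following sketch; all names here are invented, and the
actual patches may structure this differently:

#include "postgres.h"

/* Sketch only: SlotSyncWorkerInfo and its fields are invented names. */
typedef struct SlotSyncWorkerInfo
{
	int			ndbs;			/* how many databases this worker handles */
	Oid			dbids[FLEXIBLE_ARRAY_MEMBER];
} SlotSyncWorkerInfo;

/*
 * Assign a newly discovered database to the sync worker currently handling
 * the fewest databases (the first such worker on a tie).  The worker picks
 * the new database up on its next sync cycle.
 */
static void
assign_db_to_slot_sync_worker(SlotSyncWorkerInfo **workers, int nworkers, Oid dbid)
{
	SlotSyncWorkerInfo *best = workers[0];

	for (int i = 1; i < nworkers; i++)
	{
		if (workers[i]->ndbs < best->ndbs)
			best = workers[i];
	}
	best->dbids[best->ndbs++] = dbid;
}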

Attaching new set of patches which attempt to implement below 

Re: [PoC] Reducing planning time when tables have many partitions

2023-08-07 Thread Andrey Lepikhov

On 7/8/2023 19:15, Ashutosh Bapat wrote:
> On Mon, Aug 7, 2023 at 2:21 PM Andrey Lepikhov wrote:
> > >> Do you think that the memory measurement patch I have shared in
> > >> those threads is useful in itself? If so, I will start another
> > >> proposal to address it.
> > >
> > > For me, who is developing the planner in this thread, the memory
> > > measurement patch is useful. However, most users do not care about
> > > memory usage, so there is room for consideration. For example, making
> > > the metrics optional in EXPLAIN ANALYZE outputs might be better.
> > >
> > +1. Any memory-related info in the output of EXPLAIN ANALYZE makes
> > tests more complex because of architecture dependency.
>
> As far as the tests go, the same is the case with planning time and
> execution time. They change even without changing the architecture. But
> we have tests which mask the actual values. Something similar will be
> done for the planning memory.

It is a positive thing to be able to access some planner internals from
the console, of course. My point concerns the structure of the EXPLAIN
output and is driven by two reasons:
1. I use the EXPLAIN command daily to identify performance issues and
the optimiser's weak points. In my experience, when you have an
'explain analyze' containing more than 100 lines, you try removing
unnecessary information to improve observability. It would be better to
be able to see an EXPLAIN with different levels of output detail.
Flexibility here sometimes saves a lot of manual work.
2. When writing extensions that have an explain analyze in their
regression tests, we must create masking functions just to make the
tests more stable. That additional work can be avoided with another
option, like MEMUSAGE ON/OFF.

So, in my opinion, it would be better to introduce this new output data
guarded by an additional option.

> I will propose it as a separate patch in the next commitfest and will
> seek opinions from other hackers.

Cool, good news.

--
regards,
Andrey Lepikhov
Postgres Professional





Re: [PATCH] Add loongarch native checksum implementation.

2023-08-07 Thread YANG Xudong

Thanks for the comment. I have updated the patch to v3. Please have a look.


On 2023/8/7 19:01, John Naylor wrote:

> On Fri, Jun 16, 2023 at 8:28 AM YANG Xudong wrote:
> > > +# If the intrinsics are supported, sets pgac_loongarch_crc32c_intrinsics,
> > > +# and CFLAGS_CRC.
> > >
> > > Same here -- it seems we don't need to set CFLAGS_CRC at all. Can you
> > > confirm?
> > >
> >
> > We don't need to set CFLAGS_CRC as commented. I have updated the
> > configure script to make it align with the logic in meson build script.
>
> (Looking again at v2)
>
> The compilation test is found in c-compiler.m4, which still has all
> logic for CFLAGS_CRC, including saving and restoring the old CFLAGS. Can
> this also be simplified?

Fixed the function in c-compiler.m4 by removing the function argument
and the logic of handling CFLAGS and CFLAGS_CRC.

> I diff'd pg_crc32c_loongarch.c with the current other files, and found
> it is structurally the same as the Arm implementation. That's logical if
> memory alignment is important.
>
>    /*
> - * ARMv8 doesn't require alignment, but aligned memory access is
> - * significantly faster. Process leading bytes so that the loop below
> - * starts with a pointer aligned to eight bytes.
> + * Aligned memory access is significantly faster.
> + * Process leading bytes so that the loop below starts with a pointer aligned to eight bytes.
>
> Can you confirm the alignment requirement -- it's not clear what the
> intention is since "doesn't require" wasn't carried over. Is there any
> documentation (or even a report in some other context) about aligned vs
> unaligned memory access performance?


It is in the official document that the alignment is not required.

https://github.com/loongson/la-softdev-convention/blob/master/la-softdev-convention.adoc#74-unaligned-memory-access-support


However, I found this patch in LKML that shows great performance gain 
when using aligned memory access similar to this patch.


https://lore.kernel.org/lkml/20230410115734.93365-1-wang...@loongson.cn/

So I guess using aligned memory access is necessary and I have updated 
the comment in the code.
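
For readers following along, the structure under discussion looks roughly
like the following -- a sketch, not the attached patch, using the same
intrinsics probed by the configure test in the patch:

#include "postgres.h"
#include "port/pg_crc32c.h"

/* Sketch of the usual shape of such an implementation. */
pg_crc32c
pg_comp_crc32c_loongarch(pg_crc32c crc, const void *data, size_t len)
{
	const unsigned char *p = data;
	const unsigned char *pend = p + len;

	/*
	 * Process leading bytes one at a time until the pointer is aligned to
	 * eight bytes, since aligned 8-byte loads are significantly faster.
	 */
	while (!PointerIsAligned(p, uint64) && p < pend)
		crc = __builtin_loongarch_crcc_w_b_w(*p++, crc);

	/* Main loop: one 64-bit chunk per instruction. */
	while (p + 8 <= pend)
	{
		crc = __builtin_loongarch_crcc_w_d_w(*(const uint64 *) p, crc);
		p += 8;
	}

	/* Remaining trailing bytes. */
	while (p < pend)
		crc = __builtin_loongarch_crcc_w_b_w(*p++, crc);

	return crc;
}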





> --
> John Naylor
> EDB: http://www.enterprisedb.com

From ced3f65d7445bdcca0628c4f5073b5657a81cd28 Mon Sep 17 00:00:00 2001
From: YANG Xudong 
Date: Tue, 8 Aug 2023 10:41:58 +0800
Subject: [PATCH v3] Add loongarch native checksum implementation.

---
 config/c-compiler.m4   | 29 ++
 configure  | 69 
 configure.ac   | 33 +++
 meson.build| 24 +++
 src/include/pg_config.h.in |  3 ++
 src/include/port/pg_crc32c.h   |  9 +
 src/port/meson.build   |  3 ++
 src/port/pg_crc32c_loongarch.c | 73 ++
 8 files changed, 228 insertions(+), 15 deletions(-)
 create mode 100644 src/port/pg_crc32c_loongarch.c

diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 5be8f0f08d..ad6e90 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -661,3 +661,32 @@ if test x"$Ac_cachevar" = x"yes"; then
 fi
 undefine([Ac_cachevar])dnl
 ])# PGAC_ARMV8_CRC32C_INTRINSICS
+
+# PGAC_LOONGARCH_CRC32C_INTRINSICS
+# ---
+# Check if the compiler supports the LoongArch CRCC instructions, using
+# __builtin_loongarch_crcc_w_b_w, __builtin_loongarch_crcc_w_h_w,
+# __builtin_loongarch_crcc_w_w_w and __builtin_loongarch_crcc_w_d_w
+# intrinsic functions.
+#
+# If the intrinsics are supported, sets pgac_loongarch_crc32c_intrinsics.
+AC_DEFUN([PGAC_LOONGARCH_CRC32C_INTRINSICS],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_loongarch_crc32c_intrinsics])])dnl
+AC_CACHE_CHECK(
+  [for __builtin_loongarch_crcc_w_b_w, __builtin_loongarch_crcc_w_h_w, 
__builtin_loongarch_crcc_w_w_w and __builtin_loongarch_crcc_w_d_w],
+  [Ac_cachevar],
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([],
+  [unsigned int crc = 0;
+   crc = __builtin_loongarch_crcc_w_b_w(0, crc);
+   crc = __builtin_loongarch_crcc_w_h_w(0, crc);
+   crc = __builtin_loongarch_crcc_w_w_w(0, crc);
+   crc = __builtin_loongarch_crcc_w_d_w(0, crc);
+   /* return computed value, to prevent the above being optimized away */
+   return crc == 0;])],
+  [Ac_cachevar=yes],
+  [Ac_cachevar=no])])
+if test x"$Ac_cachevar" = x"yes"; then
+  pgac_loongarch_crc32c_intrinsics=yes
+fi
+undefine([Ac_cachevar])dnl
+])# PGAC_LOONGARCH_CRC32C_INTRINSICS
diff --git a/configure b/configure
index 963fbbcf1e..c5eb2814f1 100755
--- a/configure
+++ b/configure
@@ -18047,6 +18047,47 @@ fi
 
 fi
 
+# Check for LoongArch CRC intrinsics to do CRC calculations.
+#
+# Check if __builtin_loongarch_crcc_* intrinsics can be used
+# with the default compiler flags.
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for 

Re: WIP: new system catalog pg_wait_event

2023-08-07 Thread Michael Paquier
On Tue, Aug 08, 2023 at 11:53:32AM +0900, Kyotaro Horiguchi wrote:
> As I mentioned in another thread, I'm uncertain about our stance on
> the class id of the wait event. If a class acts as a namespace, we
> should include it in the view. Otherwise, if the class id is just an
> attribute of the wait event, we should make the event name unique.

Including the class name in the view makes the most sense to me, FWIW,
as it could be also possible that one reuses an event name in the
existing in-core list, but for extensions.  That's of course not
something I would recommend.
--
Michael




Re: WIP: new system catalog pg_wait_event

2023-08-07 Thread Kyotaro Horiguchi
At Mon, 7 Aug 2023 17:11:50 +0200, "Drouvot, Bertrand" 
 wrote in 
> That way I think it's flexible enough to add more code if needed in
> the SRF.
> 
> The patch also:
> 
> - updates the doc
> - works with autoconf and meson
> - adds a simple test
> 
> I'm adding a new CF entry for it.

As I mentioned in another thread, I'm uncertain about our stance on
the class id of the wait event. If a class acts as a namespace, we
should include it in the view. Otherwise, if the class id is just an
attribute of the wait event, we should make the event name unique.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Extract numeric filed in JSONB more effectively

2023-08-07 Thread Andy Fan
Hi:

On Mon, Aug 7, 2023 at 7:51 PM Andy Fan  wrote:

> Hi Jian:
>
> Thanks for the review!
>
>> compared with jsonb_numeric. I am wondering if you need to free *jb.
>> elog(INFO,"jb=%p arg pointer=%p ", jb, PG_GETARG_POINTER(0));
>> says these two are not the same.
>>
>
> Thanks for pointing this out; I am not sure what to do right now.
> Basically the question is whether we should free memory that is
> allocated in a function call.  The argument for doing so is obvious,
> but the argument for NOT doing so is that the memory is usually
> allocated under an ExprContext memory context, which will be reset once
> the current tuple is processed, and MemoryContextReset is more
> efficient than pfrees;
>

I just found Andres's opinion on this; it looks like he would suggest
not freeing it [1], and the reason is similar here [2], so I would like
to keep it as it is.

[1]
https://www.postgresql.org/message-id/20230216213554.vintskinrqqrxf6d%40awork3.anarazel.de

[2]
https://www.postgresql.org/message-id/20230217202626.ihd55rgxgkr2uqim%40awork3.anarazel.de


-- 
Best Regards
Andy Fan


Re: [PATCH] [zh_CN.po] fix a typo in simplified Chinese translation file

2023-08-07 Thread David Rowley
On Wed, 2 Aug 2023 at 15:45, jian he  wrote:
> I think it's pretty obvious. Anyway, I created a commitfest entry.
> https://commitfest.postgresql.org/44/4470/

I saw that there were two CF entries for this patch. I marked one as
committed and the other as withdrawn.

For the future, I believe the discussion for translations goes on in
the pgsql-translators mailing list [1]. Ordinarily, there'd be no CF
entry for translation updates.

David

[1] https://www.postgresql.org/list/pgsql-translators/




Re: Cirrus-ci is lowering free CI cycles - what to do with cfbot, etc?

2023-08-07 Thread Andres Freund
Hi,

On 2023-08-07 19:15:41 -0700, Andres Freund wrote:
> The set of free CI providers has shrunk since we chose cirrus, as have the
> "free" resources provided. I started, quite incomplete as of now, wiki page at
> [4].

Oops, as Thomas just noticed, I left off that link:

[4] https://wiki.postgresql.org/wiki/CI_Providers

Greetings,

Andres Freund




Cirrus-ci is lowering free CI cycles - what to do with cfbot, etc?

2023-08-07 Thread Andres Freund
Hi,

As some of you might have seen when running CI, cirrus-ci is restricting how
much CI cycles everyone can use for free (announcement at [1]). This takes
effect September 1st.

This obviously has consequences both for individual users of CI as well as
cfbot.


The first thing I think we should do is to lower the cost of CI. One thing I
had not entirely realized previously, is that macos CI is by far the most
expensive CI to provide. That's not just the case with cirrus-ci, but also
with other providers.  See the series of patches described later in the email.


To me, the situation for cfbot is different than the one for individual
users.

IMO, for the individual user case it's important to use CI for "free", without
a whole lot of complexity. Which imo rules out approaches like providing
$cloud_provider compute accounts; that's too much setup work.  With the
improvements detailed below, cirrus' free CI would last about ~65 runs /
month.

For cfbot I hope we can find funding to pay for compute to use for CI. The, by
far, most expensive bit is macos. To a significant degree due to macos
licensing terms not allowing more than 2 VMs on a physical host :(.


The reason we chose cirrus-ci were

a) Ability to use full VMs, rather than a pre-selected set of VMs, which
   allows us to test a larger number

b) Ability to link to log files, without requiring an account. E.g. github
   actions doesn't allow viewing logs unless logged in.

c) Amount of compute available.


The set of free CI providers has shrunk since we chose cirrus, as have the
"free" resources provided. I started, quite incomplete as of now, wiki page at
[4].


Potential paths forward for individual CI:

- migrate wholesale to another CI provider

- split CI tasks across different CI providers, rely on github et al
  displaying the CI status for different platforms

- give up


Potential paths forward for cfbot, in addition to the above:

- Pay for compute / ask the various cloud providers to grant us compute
  credits. At least some of the cloud providers can be used via cirrus-ci.

- Host (some) CI runners ourselves. Particularly with macos and windows, that
  could provide significant savings.

- Build our own system, using buildbot, jenkins or whatnot.


Opinions as to what to do?



The attached series of patches:

1) Makes startup of macos instances faster, using more efficient caching of
   the required packages. Also submitted as [2].

2) Introduces a template initdb that's reused during the tests. Also submitted
   as [3]

3) Remove use of -DRANDOMIZE_ALLOCATED_MEMORY from macos tasks. It's
   expensive. And CI also uses asan on linux, so I don't think it's really
   needed.

4) Switch tasks to use debugoptimized builds. Previously many tasks used -Og,
   to get decent backtraces etc. But the amount of CPU burned that way is too
   large. One issue with that is that use of ccache becomes much more crucial,
   uncached build times do significantly increase.

5) Move use of -Dsegsize_blocks=6 from macos to linux

   Macos is expensive, -Dsegsize_blocks=6 slows things down. Alternatively we
   could stop covering both meson and autoconf segsize_blocks. It does affect
   runtime on linux as well.

6) Disable write cache flushes on windows

   It's a bit ugly to do this without using the UI... Shaves off about 30s
   from the tests.

7) pg_regress only checked once a second whether postgres had started up, but
   it's usually much faster. Use pg_ctl's logic.  It might be worth replacing
   the use of psql with directly using libpq in pg_regress instead; it looks
   like the overhead of repeatedly starting psql is noticeable.


FWIW: with the patches applied, the "credit costs" in cirrus CI are roughly
like the following (depends on caching etc):

task costs in credits
linux-sanity: 0.01
linux-compiler-warnings: 0.05
linux-meson: 0.07
freebsd   : 0.08
linux-autoconf: 0.09
windows   : 0.18
macos : 0.28
total task runtime is 40.8
cost in credits is 0.76, monthly credits of 50 allow approx 66.10 runs/month


Greetings,

Andres Freund

[1] https://cirrus-ci.org/blog/2023/07/17/limiting-free-usage-of-cirrus-ci/
[2] 
https://www.postgresql.org/message-id/20230805202539.r3umyamsnctysdc7%40awork3.anarazel.de
[3] https://postgr.es/m/20220120021859.3zpsfqn4z7ob7...@alap3.anarazel.de
>From 00c48a90f3acc6cba6af873b29429ee6d4ba38a6 Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Thu, 3 Aug 2023 23:29:13 -0700
Subject: [PATCH v1 1/9] ci: macos: used cached macports install

This substantially speeds up the mac CI time.

Discussion: https://postgr.es/m/20230805202539.r3umyamsnctys...@awork3.anarazel.de
---
 .cirrus.yml  | 63 +++---
 src/tools/ci/ci_macports_packages.sh | 97 
 2 files changed, 122 insertions(+), 38 deletions(-)
 create mode 100755 src/tools/ci/ci_macports_packages.sh

diff --git a/.cirrus.yml b/.cirrus.yml
index 

Re: 2023-08-10 release announcement draft

2023-08-07 Thread Jonathan S. Katz

On 8/7/23 9:53 PM, David Rowley wrote:

On Tue, 8 Aug 2023 at 13:49, Jonathan S. Katz  wrote:


On 8/7/23 9:45 PM, David Rowley wrote:


* Fix a performance regression when running concurrent
[`COPY`](https://www.postgresql.org/docs/16/sql-copy.html) statements on a
single table.


I think this is still outstanding. A bit of work has been done for the
int parsing regression but it seems there's still a performance
regression when running multiple COPYs on the same table, per [1].


Hm, the open item was closed[1] -- was that premature, or is this a new
issue (have not yet read the thread you referenced)?


I closed it thinking that enough had been done to resolve the
performance regression. In the linked thread, Sawada-san shows that
that's not the case. So, yes, premature.  I've reverted the change to
the open items list now.


Got it. I reverted it as well from the release announcement. Reattaching 
with the clean copy.


(Aside: I'm super excited for this PG16 improvement + fixed regression, 
as lately I've had to do some bulk imports on a single table that could 
really benefit from this :)


Jonathan

The PostgreSQL Global Development Group has released an update to all supported
versions of PostgreSQL, including 15.4, 14.9, 13.12, 12.16, and 11.21, as well
as the third beta release of PostgreSQL 16. This release fixes over 40 bugs
reported over the last several months.

If you use [BRIN]((https://www.postgresql.org/docs/current/brin-intro.html))
indexes to  look up `NULL` values, you will need to 
[reindex](https://www.postgresql.org/docs/current/sql-reindex.html)
them after upgrading to this release. You us
[`REINDEX 
CONCURRENTLY`](https://www.postgresql.org/docs/current/sql-reindex.html)
to avoid blocking writes to the affected index and table, for example:

```
REINDEX INDEX CONCURRENTLY your_index_name;
```

For the full list of changes, please review the
[release notes](https://www.postgresql.org/docs/release/).

PostgreSQL 11 EOL Notice


PostgreSQL 11 will stop receiving fixes on November 9, 2023. If you are
running PostgreSQL 11 in a production environment, we suggest that you make
plans to upgrade to a newer, supported version of PostgreSQL. Please see our
[versioning policy](https://www.postgresql.org/support/versioning/) for more
information.

A Note on the PostgreSQL 16 Beta


This release marks the third beta release of PostgreSQL 16 and puts the
community one step closer to general availability tentatively around the end of
the third quarter.

In the spirit of the open source PostgreSQL community, we strongly encourage you
to test the new features of PostgreSQL 16 on your systems to help us eliminate
bugs or other issues that may exist. While we do not advise you to run
PostgreSQL 16 Beta 3 in production environments, we encourage you to find ways
to run your typical application workloads against this beta release.

Your testing and feedback will help the community ensure that PostgreSQL 16
upholds our standards of delivering a stable, reliable release of the world's
most advanced open source relational database. Please read more about our
[beta testing process](https://www.postgresql.org/developer/beta/) and how you
can contribute:

  
[https://www.postgresql.org/developer/beta/](https://www.postgresql.org/developer/beta/)

You can find information about all of the PostgreSQL 16 features and changes in
the [release notes](https://www.postgresql.org/docs/16/release-16.html):

  
[https://www.postgresql.org/docs/16/release-16.html](https://www.postgresql.org/docs/16/release-16.html)

Bug Fixes and Improvements
--
 
This update fixes over 40 bugs that were reported in the last several months.
The issues listed below affect PostgreSQL 15. Some of these issues may also
affect other supported versions of PostgreSQL.

* Fix for handling of `NULL` values in 
[BRIN](https://www.postgresql.org/docs/current/brin-intro.html)
indexes. This fix does not apply to existing BRIN indexes -- you will need to
run [`REINDEX`](https://www.postgresql.org/docs/current/sql-reindex.html) to fix
any BRIN indexes used to search for NULL values.
* Avoid leaving a corrupted database behind when DROP DATABASE is interrupted.
* Several fixes for partitioned indexes.
* Fix [`ALTER EXTENSION ... SET 
SCHEMA`](https://www.postgresql.org/docs/current/sql-alterextension.html)
to error if the extension contains any objects outside the extension's schema.
* Fix dependency tracking of access methods for tables.
* Don't use partial unique indexes for uniqueness proofs in the planner.
* Correctly handle sub-SELECTs in RLS policy expressions and security-barrier
views when expanding rule actions.
* Fix race conditions in conflict detection for `SERIALIZABLE` transaction 
isolation mode.
* Fix intermittent failures when trying to update a field of a composite column
that requires [out-of-line 

Re: pg_upgrade fails with in-place tablespace

2023-08-07 Thread Michael Paquier
On Wed, Aug 02, 2023 at 10:38:00PM +0800, Rui Zhao wrote:
> However, when using pg_dump to back up this in-place tablespace, it
> is dumped with a different path: 
> CREATE TABLESPACE space_test LOCATION 'pg_tblspc/'
> This can be confusing because it does not match the initial path
> that we created. Additionally, pg_restore cannot restore this
> in-place tablespace because the CREATE TABLESPACE command only
> supports an empty path for creating in-place tablespaces.

Yes, the dump part is a different issue.  Using a relative path is not
fine when restoring the tablespace.

> You are correct. I have made the necessary modifications to ensure
> compatibility with in-place tablespaces. Please find the updated
> code attached. 

+   /* Allow to create in-place tablespace. */
+   if (!is_absolute_path(spclocation))
+   appendPQExpBuffer(buf, "SET allow_in_place_tablespaces = true;\n");
+
appendPQExpBuffer(buf, "CREATE TABLESPACE %s", fspcname);
appendPQExpBuffer(buf, " OWNER %s", fmtId(spcowner));

As you say, this change in pg_dumpall.c takes care of an issue related
to the dumps of in-place tablespaces, not only pg_upgrade.  See for
example on HEAD:
$ psql -c "CREATE TABLESPACE popo LOCATION ''"
$ pg_dumpall  | grep TABLESPACE
CREATE TABLESPACE popo OWNER postgres LOCATION 'pg_tblspc/16385';

And restoring this dump would be incorrect.  I think that we'd better
do something like the following for the dump part:
@@ -1286,7 +1286,12 @@ dumpTablespaces(PGconn *conn)
appendPQExpBuffer(buf, " OWNER %s", fmtId(spcowner));
 
appendPQExpBufferStr(buf, " LOCATION ");
-   appendStringLiteralConn(buf, spclocation, conn);
+
+   /* In-place tablespaces use a relative path */
+   if (is_absolute_path(spclocation))
+   appendStringLiteralConn(buf, spclocation, conn);
+   else
+   appendPQExpBufferStr(buf, "'';\n");

I don't have a good feeling about enforcing allow_in_place_tablespaces
in the connection creating the tablespace, as it can be useful to let
the restore of the dump fail if this GUC is disabled in the new
cluster, so that one can check that no in-place tablespaces are left
around on the new cluster restoring the contents of the dump.
--
Michael




Re: 2023-08-10 release announcement draft

2023-08-07 Thread David Rowley
On Tue, 8 Aug 2023 at 13:49, Jonathan S. Katz  wrote:
>
> On 8/7/23 9:45 PM, David Rowley wrote:
>
> >> * Fix a performance regression when running concurrent
> >> [`COPY`](https://www.postgresql.org/docs/16/sql-copy.html) statements on a
> >> single table.
> >
> > I think this is still outstanding. A bit of work has been done for the
> > int parsing regression but it seems there's still a performance
> > regression when running multiple COPYs on the same table, per [1].
>
> Hm, the open item was closed[1] -- was that premature, or is this a new
> issue (have not yet read the thread you referenced)?

I closed it thinking that enough had been done to resolve the
performance regression. In the linked thread, Sawada-san shows that
that's not the case. So, yes, premature.  I've reverted the change to
the open items list now.

David




Re: 2023-08-10 release announcement draft

2023-08-07 Thread Jonathan S. Katz

On 8/7/23 9:45 PM, David Rowley wrote:


* Fix a performance regression when running concurrent
[`COPY`](https://www.postgresql.org/docs/16/sql-copy.html) statements on a
single table.


I think this is still outstanding. A bit of work has been done for the
int parsing regression but it seems there's still a performance
regression when running multiple COPYs on the same table, per [1].


Hm, the open item was closed[1] -- was that premature, or is this a new 
issue (have not yet read the thread you referenced)?



or an previous major version of PostgreSQL, you will need to use a strategy


"a previous".


Thanks for the catch -- fixed locally.

Jonathan

[1] 
https://wiki.postgresql.org/index.php?title=PostgreSQL_16_Open_Items#resolved_before_16beta3




Re: 2023-08-10 release announcement draft

2023-08-07 Thread David Rowley
On Tue, 8 Aug 2023 at 13:15, Jonathan S. Katz  wrote:
> Attached is the release announcement draft for the 2023-08-10 update
> release, which also includes the release of PostgreSQL 16 Beta 3.

Thanks for drafting this.

> * Fix a performance regression when running concurrent
> [`COPY`](https://www.postgresql.org/docs/16/sql-copy.html) statements on a
> single table.

I think this is still outstanding. A bit of work has been done for the
int parsing regression but it seems there's still a performance
regression when running multiple COPYs on the same table, per [1].

> or an previous major version of PostgreSQL, you will need to use a strategy

"a previous".

David

[1] 
https://postgr.es/m/CAD21AoC0GvCEZbDreoAcj=i0ljncqepqbh_cxcubkezygyw...@mail.gmail.com




2023-08-10 release announcement draft

2023-08-07 Thread Jonathan S. Katz

Hi,

Attached is the release announcement draft for the 2023-08-10 update 
release, which also includes the release of PostgreSQL 16 Beta 3.


Please provide your feedback no later than August 10, 2023 0:00 AoE[1].

Thanks,

Jonathan

[1] https://en.wikipedia.org/wiki/Anywhere_on_Earth
The PostgreSQL Global Development Group has released an update to all supported
versions of PostgreSQL, including 15.4, 14.9, 13.12, 12.16, and 11.21, as well
as the third beta release of PostgreSQL 16. This release fixes over 40 bugs
reported over the last several months.

If you use [BRIN]((https://www.postgresql.org/docs/current/brin-intro.html))
indexes to  look up `NULL` values, you will need to 
[reindex](https://www.postgresql.org/docs/current/sql-reindex.html)
them after upgrading to this release. You us
[`REINDEX 
CONCURRENTLY`](https://www.postgresql.org/docs/current/sql-reindex.html)
to avoid blocking writes to the affected index and table, for example:

```
REINDEX INDEX CONCURRENTLY your_index_name;
```

For the full list of changes, please review the
[release notes](https://www.postgresql.org/docs/release/).

PostgreSQL 11 EOL Notice


PostgreSQL 11 will stop receiving fixes on November 9, 2023. If you are
running PostgreSQL 11 in a production environment, we suggest that you make
plans to upgrade to a newer, supported version of PostgreSQL. Please see our
[versioning policy](https://www.postgresql.org/support/versioning/) for more
information.

A Note on the PostgreSQL 16 Beta


This release marks the third beta release of PostgreSQL 16 and puts the
community one step closer to general availability tentatively around the end of
the third quarter.

In the spirit of the open source PostgreSQL community, we strongly encourage you
to test the new features of PostgreSQL 16 on your systems to help us eliminate
bugs or other issues that may exist. While we do not advise you to run
PostgreSQL 16 Beta 3 in production environments, we encourage you to find ways
to run your typical application workloads against this beta release.

Your testing and feedback will help the community ensure that PostgreSQL 16
upholds our standards of delivering a stable, reliable release of the world's
most advanced open source relational database. Please read more about our
[beta testing process](https://www.postgresql.org/developer/beta/) and how you
can contribute:

  
[https://www.postgresql.org/developer/beta/](https://www.postgresql.org/developer/beta/)

You can find information about all of the PostgreSQL 16 features and changes in
the [release notes](https://www.postgresql.org/docs/16/release-16.html):

  
[https://www.postgresql.org/docs/16/release-16.html](https://www.postgresql.org/docs/16/release-16.html)

Bug Fixes and Improvements
--
 
This update fixes over 40 bugs that were reported in the last several months.
The issues listed below affect PostgreSQL 15. Some of these issues may also
affect other supported versions of PostgreSQL.

* Fix for handling of `NULL` values in 
[BRIN](https://www.postgresql.org/docs/current/brin-intro.html)
indexes. This fix does not apply to existing BRIN indexes -- you will need to
run [`REINDEX`](https://www.postgresql.org/docs/current/sql-reindex.html) to fix
any BRIN indexes used to search for NULL values.
* Avoid leaving a corrupted database behind when DROP DATABASE is interrupted.
* Several fixes for partitioned indexes.
* Fix [`ALTER EXTENSION ... SET 
SCHEMA`](https://www.postgresql.org/docs/current/sql-alterextension.html)
to error if the extension contains any objects outside the extension's schema.
* Fix dependency tracking of access methods for tables.
* Don't use partial unique indexes for uniqueness proofs in the planner.
* Correctly handle sub-SELECTs in RLS policy expressions and security-barrier
views when expanding rule actions.
* Fix race conditions in conflict detection for `SERIALIZABLE` transaction 
isolation mode.
* Fix intermittent failures when trying to update a field of a composite column
that requires [out-of-line 
TOASTing](https://www.postgresql.org/docs/current/storage-toast.html).
* Fix several memory leaks that occurred during the lifespan of a query.
* Accept fractional seconds in the input to the [jsonpath 
`datetime()`](https://www.postgresql.org/docs/current/functions-json.html#FUNCTIONS-SQLJSON-PATH-OPERATORS)
method.
* Increase token limit in `pg_hba.conf` and `pg_ident.conf` to 10,240 bytes.
* An out-of-memory error from JIT will now cause a PostgreSQL `FATAL` error 
instead of a C++ exception.
* Allow `VACUUM` to continue after detecting certain types of B-tree index 
corruption. While this fix allows VACUUM to continue, you still need to 
`REINDEX` to fix the broken index.
* Avoid double replay of prepared transactions during crash recovery.
* Ensure that checkpoint calls `fsync` on a newly created but empty table.
* Silence "missing contrecord" errors to avoid logging 

Re: Support to define custom wait events for extensions

2023-08-07 Thread Michael Paquier
On Tue, Aug 08, 2023 at 09:39:02AM +0900, Masahiro Ikeda wrote:
> I am thinking a bit that we also need another hash where the key
> is a custom string. For extensions that have no dependency on
> shared_preload_libraries, I think we need to avoid
> WaitEventExtensionNew() being called repeatedly and a new eventId
> being issued each time.
> 
> So, is it better to have another hash where the key is
> a custom string and uniqueness is identified by it, to determine
> if a new eventId should be issued?

Yeah, I was also considering if something like that is really
necessary, but I cannot stop worrying about adding more contention to
the hash table lookup each time an extension needs to retrieve an
event ID to use for WaitLatch() or such.  The results of the hash
table lookups could be cached in each backend, but it still creates an
extra cost when combined with queries running in parallel on
pg_stat_activity that do the opposite lookup event_id -> event_name.
My suggestion adds more load to AddinShmemInitLock instead.

Hence, I was just thinking about relying on AddinShmemInitLock to
insert new entries in the hash table, only if its shmem state is not
found when calling ShmemInitStruct().  Or perhaps it is just OK to not
care about the impact event_name -> event_id lookup for fresh
connections, and just bite the bullet with two lookup keys instead of
relying on AddinShmemInitLock for the addition of new entries in the
hash table?  Hmm, perhaps you're right with your approach here, at the
end.
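
As an illustration of the backend-side caching mentioned above, an
extension could do something like this -- a sketch, assuming the
WaitEventExtensionNew(const char *) signature discussed in this thread;
the event name is made up:

#include "postgres.h"
#include "storage/latch.h"
#include "utils/wait_event.h"

static uint32 my_wait_event = 0;	/* backend-local cache of the event ID */

static void
my_extension_wait(long timeout_ms)
{
	/* first use in this backend: look up (or allocate) the event ID once */
	if (my_wait_event == 0)
		my_wait_event = WaitEventExtensionNew("MyExtensionMainWait");

	(void) WaitLatch(MyLatch,
					 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
					 timeout_ms,
					 my_wait_event);
	ResetLatch(MyLatch);
}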
--
Michael




Re: Check volatile functions in ppi_clauses for memoize node

2023-08-07 Thread Richard Guo
On Mon, Aug 7, 2023 at 6:19 PM David Rowley  wrote:

> On Fri, 4 Aug 2023 at 22:26, Richard Guo  wrote:
> > explain (costs off)
> > select * from t t1 left join lateral
> > (select t1.a as t1a, t2.a as t2a from t t2) s
> > on t1.a = s.t2a + random();
> >   QUERY PLAN
> > ---
> >  Nested Loop Left Join
> >->  Seq Scan on t t1
> >->  Memoize
> >  Cache Key: t1.a
> >  Cache Mode: binary
> >  ->  Seq Scan on t t2
> >Filter: (t1.a = (a + random()))
> > (7 rows)
> >
> > According to the theory we should not use memoize node for this query
> > because of the volatile function in the inner side.  So propose a patch
> > to fix that.
>
> Thanks for the patch.  I've pushed a variation of it.
>
> I didn't really see the need to make a single list with all the
> RestrictInfos. It seems fine just to write code and loop over the
> ppi_clauses checking for volatility.
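
For anyone reading along, the check being described is essentially a loop
of this shape (a sketch, not the committed code):

#include "postgres.h"
#include "nodes/pathnodes.h"
#include "optimizer/optimizer.h"

/*
 * Sketch: refuse to build a Memoize path if any parameterized clause
 * contains a volatile function, e.g. t1.a = t2.a + random().
 */
static bool
ppi_clauses_are_stable(ParamPathInfo *param_info)
{
	ListCell   *lc;

	foreach(lc, param_info->ppi_clauses)
	{
		RestrictInfo *rinfo = lfirst_node(RestrictInfo, lc);

		if (contain_volatile_functions((Node *) rinfo->clause))
			return false;
	}
	return true;
}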


That looks good.  Thanks for pushing it!

Thanks
Richard


Re: pgbench: allow to exit immediately when any client is aborted

2023-08-07 Thread Yugo NAGATA
Hello Fabien,

On Mon, 7 Aug 2023 12:17:38 +0200 (CEST)
Fabien COELHO  wrote:

> 
> Hello Yugo-san,
> 
> > I attached v2 patch including the documentation and some comments
> > in the code.
> 
> I've looked at this patch.

Thank you for your review!

> 
> I'm unclear whether it does what it says: "exit immediately on abort", I 
> would expect a cold call to "exit" (with a clear message obviously) when 
> the abort occurs.
> 
> Currently it skips to "done" which starts by closing this particular 
> thread client connections, then it will call "exit" later, so it is not 
> "immediate".

There are cases where "goto done" is used where some error like
"invalid socket: ..." happens. I would like to make pgbench exit in
such cases, too, so I chose to handle errors below the "done:" label.
However, we can change it to call "exit" instead of "goto done" at each
place. Do you think this is better?
 
> I do not think that this cleanup is necessary, because anyway all other 
> threads will be brutally killed by the exit called by the aborted thread, 
> so why bothering to disconnect only some connections?

Agreed. This disconnection is not necessary anyway even when we would like
to handle it below "done".

>  /* If any client is aborted, exit immediately. */
>  if (state[i].state != CSTATE_FINISHED)
> 
> For this comment, I would prefer "if ( ... == CSTATE_ABORTED)" rather that 
> implying that not finished means aborted, and if you follow my previous 
> remark then this code can be removed.

Ok. If we handle errors like "invalid socket:" (mentioned above) after
skipping to "done", we should set the status to CSTATE_ABORTED before the jump. 
Otherwise, if we choose to call "exit" immediately at each error instead of
skipping to "done", we can remove this as you say.
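
For clarity, the "exit immediately" variant would look roughly like
this at each such error site (a sketch only; the exact message wording
and exit code are still open):

    /* e.g. in threadRun(), where the current code does "goto done" on a
     * fatal error such as an invalid socket */
    pg_log_error("client %d aborted: invalid socket", st->id);
    exit(2);        /* abort the whole pgbench run immediately */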

> Typo: "going to exist" -> "going to exit". Note that it is not "the whole 
> thread" but "the program" which is exiting.

I'll fix.
 
> There is no test.

I'll add a test.

Regards,
Yugo Nagata

 
> -- 
> Fabien.


-- 
Yugo NAGATA 




Re: Support to define custom wait events for extensions

2023-08-07 Thread Masahiro Ikeda

On 2023-08-08 08:54, Michael Paquier wrote:

> On Wed, Aug 02, 2023 at 06:34:15PM +0900, Masahiro Ikeda wrote:
>> On 2023-08-01 12:23, Andres Freund wrote:
>>> This is why the scheme as implemented doesn't really make sense to me.
>>> It'd be much easier to use if we had a shared hashtable with the
>>> identifiers than what's been merged now.
>>>
>>> In plenty of cases it's not realistic for an extension library to run
>>> in each backend, while still needing to wait for things.
>>
>> OK, I'll try to make a PoC patch.
>
> Hmm.  There are a few things to take into account here:
> - WaitEventExtensionShmemInit() should gain a dshash_create(), to make
> sure that the shared table is around, and we are going to have a
> reference to it in WaitEventExtensionCounterData, saved from
> dshash_get_hash_table_handle().
> - The hash table entries could just use nextId as key to look at the
> entries, with entries added during WaitEventExtensionNew(), and use as
> contents the name of the wait event.  We are going to need a fixed
> size for these custom strings, but perhaps a hard limit of 256
> characters for each entry of the hash table is more than enough for
> most users?
> - WaitEventExtensionRegisterName() could be removed, I guess, replaced
> by a single WaitEventExtensionNew(), as of:
> uint32 WaitEventExtensionNew(const char *event_name);
> - GetWaitEventExtensionIdentifier() needs to switch to a lookup of the
> shared hash table, based on the eventId.
>
> All that would save from the extra WaitEventExtensionRegisterName()
> needed in the backends to keep a track of the names, indeed.


Thank you for pointing out the direction of the implementation.

I am also wondering whether we need a second hash where the key is the
custom string. For extensions that do not depend on
shared_preload_libraries, we need to avoid WaitEventExtensionNew()
being called repeatedly and issuing a new eventId each time.

So, would it be better to have another hash keyed by the custom string,
using it to determine whether a new eventId needs to be issued?
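
To make that concrete, something along these lines is what I have in
mind (only a sketch: the struct, table, and helper names are made up,
locking of the shared table is elided, and I use a plain shared
dynahash rather than dshash for brevity):

    static HTAB *WaitEventExtensionHashByName = NULL;  /* created at shmem init */

    typedef struct WaitEventExtensionEntryByName
    {
        char        name[NAMEDATALEN];  /* key: custom wait event name */
        uint16      event_id;           /* value: assigned event ID */
    } WaitEventExtensionEntryByName;

    uint32
    WaitEventExtensionNew(const char *event_name)
    {
        bool        found;
        WaitEventExtensionEntryByName *entry;

        /* Look up by name first, so repeated registrations from backends
         * that are not preloaded reuse the same event ID. */
        entry = hash_search(WaitEventExtensionHashByName, event_name,
                            HASH_ENTER, &found);
        if (!found)
            entry->event_id = WaitEventExtensionNextId();   /* placeholder */

        return PG_WAIT_EXTENSION | entry->event_id;
    }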

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION




Re: Using defines for protocol characters

2023-08-07 Thread Tatsuo Ishii
> On Mon, 7 Aug 2023 at 16:50, Tatsuo Ishii  wrote:
> 
>> > On Mon, Aug 07, 2023 at 04:02:08PM -0400, Tom Lane wrote:
>> >> Dave Cramer  writes:
>> >>> On Mon, 7 Aug 2023 at 12:59, Robert Haas 
>> wrote:
>>  PqMsgEmptyQueryResponse or something like that seems better, if we
>>  want to keep the current capitalization. I'm not a huge fan of the way
>>  we vary our capitalization conventions so much all over the code base,
>>  but I think we would at least do well to keep it consistent from one
>>  end of a certain identifier to the other.
>> >>
>> >>> I don't have a strong preference, but before I make the changes I'd
>> like to
>> >>> get consensus.
>> >>> Can we vote or whatever it takes to decide on a naming pattern that is
>> >>> acceptable ?
>> >>
>> >> I'm good with Robert's proposal above.
>> >
>> > +1
>>
>> +1.
>>
>> Also we need to decide what to do with them:
>>
>> > #define PQMSG_REQ_PREPARED 'S'
>> > #define PQMSG_REQ_PORTAL 'P'
>>
>> If we go "PqMsgEmptyQueryResponse", probably we should go something
>> like for these?
>>
>> #define PqMsgReqPrepared 'S'
>> #define PqMsgReqPortal 'P'
>>
> 
> I went with PqMsgPortalSubCommand and PqMsgPreparedSubCommand
> 
> See attached patch

Looks good to me.

Best regards,
--
Tatsuo Ishii
SRA OSS LLC
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp




Re: Using defines for protocol characters

2023-08-07 Thread Dave Cramer
On Mon, 7 Aug 2023 at 16:50, Tatsuo Ishii  wrote:

> > On Mon, Aug 07, 2023 at 04:02:08PM -0400, Tom Lane wrote:
> >> Dave Cramer  writes:
> >>> On Mon, 7 Aug 2023 at 12:59, Robert Haas 
> wrote:
>  PqMsgEmptyQueryResponse or something like that seems better, if we
>  want to keep the current capitalization. I'm not a huge fan of the way
>  we vary our capitalization conventions so much all over the code base,
>  but I think we would at least do well to keep it consistent from one
>  end of a certain identifier to the other.
> >>
> >>> I don't have a strong preference, but before I make the changes I'd
> like to
> >>> get consensus.
> >>> Can we vote or whatever it takes to decide on a naming pattern that is
> >>> acceptable ?
> >>
> >> I'm good with Robert's proposal above.
> >
> > +1
>
> +1.
>
> Also we need to decide what to do with them:
>
> > #define PQMSG_REQ_PREPARED 'S'
> > #define PQMSG_REQ_PORTAL 'P'
>
> If we go "PqMsgEmptyQueryResponse", probably we should go something
> like for these?
>
> #define PqMsgReqPrepared 'S'
> #define PqMsgReqPortal 'P'
>

I went with PqMsgPortalSubCommand and PqMsgPreparedSubCommand

See attached patch

Dave


0001-Created-protocol.h.patch
Description: Binary data


Re: Support to define custom wait events for extensions

2023-08-07 Thread Michael Paquier
On Wed, Aug 02, 2023 at 06:34:15PM +0900, Masahiro Ikeda wrote:
> On 2023-08-01 12:23, Andres Freund wrote:
>> This is why the scheme as implemented doesn't really make sense to me.
>> It'd be
>> much easier to use if we had a shared hashtable with the identifiers
>> than
>> what's been merged now.
>> 
>> In plenty of cases it's not realistic for an extension library to run in
>> each
>> backend, while still needing to wait for things.
> 
> OK, I'll try to make a PoC patch.

Hmm.  There are a few things to take into account here:
- WaitEventExtensionShmemInit() should gain a dshash_create(), to make
sure that the shared table is around, and we are going to have a
reference to it in WaitEventExtensionCounterData, saved from
dshash_get_hash_table_handle().
- The hash table entries could just use nextId as key to look at the
entries, with entries added during WaitEventExtensionNew(), and use as
contents the name of the wait event.  We are going to need a fixed
size for these custom strings, but perhaps a hard limit of 256
characters for each entry of the hash table is more than enough for
most users?
- WaitEventExtensionRegisterName() could be removed, I guess, replaced
by a single WaitEventExtensionNew(), as of:
uint32 WaitEventExtensionNew(const char *event_name);
- GetWaitEventExtensionIdentifier() needs to switch to a lookup of the
shared hash table, based on the eventId.

All that would save from the extra WaitEventExtensionRegisterName()
needed in the backends to keep a track of the names, indeed.
--
Michael


signature.asc
Description: PGP signature


Re: cpluspluscheck vs ICU

2023-08-07 Thread Andres Freund
Hi,

On 2023-03-10 20:10:30 -0800, Andres Freund wrote:
> On 2023-03-10 19:37:27 -0800, Andres Freund wrote:
> > I just hit this once more - and I figured out a fairly easy fix:
> > 
> > We just need a
> >   #ifndef U_DEFAULT_SHOW_DRAFT
> >   #define U_DEFAULT_SHOW_DRAFT 0
> >   #endif
> > before including unicode/ucol.h.
> > 
> > At first I was looking at
> >   #define U_SHOW_CPLUSPLUS_API 0
> > and
> >   #define U_HIDE_INTERNAL_API 1
> > which both work, but they are documented to be internal.
> 
> Err. Unfortunately only the U_SHOW_CPLUSPLUS_API approach actually works. The
> others don't, not quite sure what I was doing earlier.
> 
> So it's either relying on a define marked as internal, or the below:
> 
> > Alternatively we could emit U_DEFAULT_SHOW_DRAFT 0 into pg_config.h to avoid
> > that issue.
> > 
> > 
> > The only other thing I see is to do something like:
> > [ugly]
> > which seems mighty ugly.

The ICU docs talk about it like it's not really internal:
https://github.com/unicode-org/icu/blob/720e5741ccaa112c4faafffdedeb7459b66c5673/docs/processes/release/tasks/healthy-code.md#test-icu4c-headers

So I'm inclined to go with that solution.

Any comments? Arguments against?

Greetings,

Andres Freund




Re: Fix badly generated entries in wait_event_names.txt

2023-08-07 Thread Michael Paquier
On Mon, Aug 07, 2023 at 10:34:43AM +0200, Drouvot, Bertrand wrote:
> fa88928470 introduced src/backend/utils/activity/wait_event_names.txt that has
> been auto-generated. The auto-generation had parsing errors leading to bad
> entries in wait_event_names.txt (when the same "WAIT_EVENT" name can be found 
> as
> a substring of another one, then descriptions were merged in one of them).

Thanks for the patch, applied.  I have double-checked the contents of
the tables and the five entries you have pointed out are the only ones
with duplicated descriptions.
--
Michael


signature.asc
Description: PGP signature


Re: Faster "SET search_path"

2023-08-07 Thread Jeff Davis
On Mon, 2023-08-07 at 15:39 -0700, Nathan Bossart wrote:
> 0001 and 0002 LGTM.

I'll probably go ahead with 0001 soon. Simple and effective.

But for 0002, I was thinking about trying to optimize so
check_search_path() only gets called once at the beginning, rather than
for each function invocation. It's a bit of a hack right now, but it
should be correct because it's just doing a syntactic validity check
and that doesn't depend on catalog contents. If I can find a
reasonably-nice way to do this, then there's no reason to optimize the
check itself, and we wouldn't have to maintain that separate parsing
code.

I might just avoid guc.c entirely and directly set
namespace_search_path and baseSearchPathValid=false. The main thing we
lose there is the GUC stack (AtEOXact_GUC()), but there's already a
PG_TRY/PG_FINALLY block in fmgr_security_definer, so we can use that to
change it back safely. (I think? I need to convince myself that it's
safe to skip the work in guc.c, at least for the special case of
search_path, and that it won't be too fragile.)
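
In rough terms, the experiment looks like this inside
fmgr_security_definer() (a sketch; it assumes baseSearchPathValid,
which is currently static in namespace.c, is exposed or wrapped in a
helper, and fcache->search_path is a made-up field):

    const char *save_path = namespace_search_path;

    PG_TRY();
    {
        namespace_search_path = fcache->search_path;    /* from proconfig */
        baseSearchPathValid = false;                    /* force recompute */

        result = FunctionCallInvoke(fcinfo);
    }
    PG_FINALLY();
    {
        namespace_search_path = save_path;
        baseSearchPathValid = false;
    }
    PG_END_TRY();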

I started down this road and it gets even closer in performance to the
plain function call. I don't know if we'll ever call it zero cost, but
safety has a cost sometimes.

>   0003 is looking pretty good, too, but I think we
> should get some more eyes on it, given the complexity.

I'm happy with 0003 and I'm fairly sure we want it, but I agree that
another set of eyes would be nice so I'll give it a bit more time. I
could also improve the testing; any suggestions here are welcome.

> I wonder if it'd be faster to downcase first and then pass it to
> hash_bytes().  This would require modifying the key, which might be a
> deal-breaker, but maybe there's some way to improve find_option() for
> all
> GUCs.

I thought about that and I don't think modifying the argument is a good
API.

For the match function, I added a fast path for strcmp()==0 with a
fallback to case folding (on the assumption that the case is consistent
in most cases, especially built-in ones). For the hash function, I used
a stack buffer and downcased the first K bytes into that, then did
hash_bytes(buff, K). Those improved things a bit but it wasn't an
exciting amount so I didn't post it. I made a patch to port guc_hashtab
to simplehash because it seems like a better fit than hsearch.h; but
again, I didn't see exciting numbers from doing so, so I didn't post
it.
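
For reference, the stack-buffer hash I experimented with looks more or
less like this (a sketch; the buffer size and truncation to the first
64 bytes are arbitrary choices here):

    /* hash_bytes() is from common/hashfn.h */
    static uint32
    guc_name_hash_folded(const char *name)
    {
        char        buf[64];
        int         len = 0;

        /* downcase ASCII into a stack buffer, then hash the folded bytes */
        while (name[len] != '\0' && len < (int) sizeof(buf))
        {
            char        ch = name[len];

            if (ch >= 'A' && ch <= 'Z')
                ch += 'a' - 'A';
            buf[len] = ch;
            len++;
        }
        return hash_bytes((const unsigned char *) buf, len);
    }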

> 
Maybe we could use perfect hashing for the built-in GUCs, and fall back
to simplehash for the custom GUCs?

Regardless, I'm mostly interested in search_path because I want to get
to the point where we can recommend that users specify it for every
function that depends on it. A special case to facilitate that without
compromising (much) on performance seems reasonable to me.

Regards,
Jeff Davis





Re: A failure in 031_recovery_conflict.pl on Debian/s390x

2023-08-07 Thread Andres Freund
Hi,

On 2023-08-07 12:57:40 +0200, Christoph Berg wrote:
> Re: Thomas Munro
> > Thanks for testing!  Would you mind trying v8 from that thread?  V7
> > had a silly bug (I accidentally deleted a 'case' label while cleaning
> > some stuff up, resulting in the above error...)
> 
> v8 worked better. It succeeded a few times (at least 12, my screen
> scrollback didn't catch more) before erroring like this:

> [10:21:58.410](0.151s) ok 15 - startup deadlock: logfile contains terminated 
> connection due to recovery conflict
> [10:21:58.463](0.053s) not ok 16 - startup deadlock: stats show conflict on 
> standby
> [10:21:58.463](0.000s)
> [10:21:58.463](0.000s) #   Failed test 'startup deadlock: stats show conflict 
> on standby'
> #   at t/031_recovery_conflict.pl line 332.
> [10:21:58.463](0.000s) #  got: '0'
> # expected: '1'

Hm, that could just be a "harmless" race. Does it still happen if you apply
the attached patch in addition?

Greetings,

Andres Freund
diff --git i/src/backend/utils/activity/pgstat_database.c w/src/backend/utils/activity/pgstat_database.c
index 7149f22f729..bb36d73ec04 100644
--- i/src/backend/utils/activity/pgstat_database.c
+++ w/src/backend/utils/activity/pgstat_database.c
@@ -81,12 +81,22 @@ void
 pgstat_report_recovery_conflict(int reason)
 {
 	PgStat_StatDBEntry *dbentry;
+	PgStat_EntryRef *entry_ref;
+	PgStatShared_Database *sharedent;
 
 	Assert(IsUnderPostmaster);
 	if (!pgstat_track_counts)
 		return;
 
-	dbentry = pgstat_prep_database_pending(MyDatabaseId);
+	/*
+	 * Update the shared stats directly - recovery conflicts should never be
+	 * common enough for that to be a problem.
+	 */
+	entry_ref =
+		pgstat_get_entry_ref_locked(PGSTAT_KIND_DATABASE, MyDatabaseId, InvalidOid, false);
+
+	sharedent = (PgStatShared_Database *) entry_ref->shared_stats;
+	dbentry = &sharedent->stats;
 
 	switch (reason)
 	{
@@ -116,6 +126,8 @@ pgstat_report_recovery_conflict(int reason)
 			dbentry->conflict_startup_deadlock++;
 			break;
 	}
+
+	pgstat_unlock_entry(entry_ref);
 }
 
 /*


Re: Using defines for protocol characters

2023-08-07 Thread Tatsuo Ishii
> On Mon, Aug 07, 2023 at 04:02:08PM -0400, Tom Lane wrote:
>> Dave Cramer  writes:
>>> On Mon, 7 Aug 2023 at 12:59, Robert Haas  wrote:
 PqMsgEmptyQueryResponse or something like that seems better, if we
 want to keep the current capitalization. I'm not a huge fan of the way
 we vary our capitalization conventions so much all over the code base,
 but I think we would at least do well to keep it consistent from one
 end of a certain identifier to the other.
>> 
>>> I don't have a strong preference, but before I make the changes I'd like to
>>> get consensus.
>>> Can we vote or whatever it takes to decide on a naming pattern that is
>>> acceptable ?
>> 
>> I'm good with Robert's proposal above.
> 
> +1

+1.

Also we need to decide what to do with them:

> #define PQMSG_REQ_PREPARED 'S'
> #define PQMSG_REQ_PORTAL 'P'

If we go "PqMsgEmptyQueryResponse", probably we should go something
like for these?

#define PqMsgReqPrepared 'S'
#define PqMsgReqPortal 'P'

Best regards,
--
Tatsuo Ishii
SRA OSS LLC
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp




Re: Using defines for protocol characters

2023-08-07 Thread Peter Smith
+#ifndef _PROTOCOL_H
+#define _PROTOCOL_H
+
+#define PQMSG_REQ_BIND  'B'
+#define PQMSG_REQ_CLOSE 'C'
+#define PQMSG_REQ_DESCRIBE  'D'
+#define PQMSG_REQ_EXECUTE   'E'
+#define PQMSG_REQ_FUNCTION_CALL 'F'
+#define PQMSG_REQ_FLUSH_DATA'H'
+#define PQMSG_REQ_BACKEND_KEY_DATA  'K'
+#define PQMSG_REQ_PARSE 'P'
+#define PQMSG_REQ_AUTHENTICATION'R'
+#define PQMSG_REQ_SYNC_DATA 'S'
+#define PQMSG_REQ_SIMPLE_QUERY  'Q'
+#define PQMSG_REQ_TERMINATE 'X'
+#define PQMSG_REQ_COPY_FAIL 'f'
+#define PQMSG_REQ_COPY_DONE 'c'
+#define PQMSG_REQ_COPY_DATA 'd'
+#define PQMSG_REQ_COPY_PROGRESS 'p'
+#define PQMSG_REQ_PREPARED  'S'
+#define PQMSG_REQ_PORTAL'P'
+
+
+/*
+Responses
+*/
+#define PQMSG_RESP_PARSE_COMPLETE   '1'
+#define PQMSG_RESP_BIND_COMPLETE'2'
+#define PQMSG_RESP_CLOSE_COMPLETE   '3'
+#define PQMSG_RESP_NOTIFY   'A'
+#define PQMSG_RESP_COMMAND_COMPLETE 'C'
+#define PQMSG_RESP_DATA_ROW 'D'
+#define PQMSG_RESP_ERROR'E'
+#define PQMSG_RESP_COPY_IN  'G'
+#define PQMSG_RESP_COPY_OUT 'H'
+#define PQMSG_RESP_EMPTY_QUERY  'I'
+#define PQMSG_RESP_NOTICE   'N'
+#define PQMSG_RESP_PARALLEL_PROGRESS 'P'
+#define PQMSG_RESP_FUNCTION_CALL'V'
+#define PQMSG_RESP_PARAMETER_STATUS 'S'
+#define PQMSG_RESP_ROW_DESCRIPTION  'T'
+#define PQMSG_RESP_COPY_BOTH'W'
+#define PQMSG_RESP_READY_FOR_QUERY  'Z'
+#define PQMSG_RESP_NO_DATA  'n'
+#define PQMSG_RESP_PASSWORD 'p'
+#define PQMSG_RESP_PORTAL_SUSPENDED 's'
+#define PQMSG_RESP_PARAMETER_DESCRIPTION 't'
+#define PQMSG_RESP_NEGOTIATE_PROTOCOL'v'
+#endif

Was ordering-by-value intended here? If yes, then FYI both of those
groups of #defines are very nearly, but not quite, in that order.

--
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Faster "SET search_path"

2023-08-07 Thread Nathan Bossart
0001 and 0002 LGTM.  0003 is looking pretty good, too, but I think we
should get some more eyes on it, given the complexity.

On Mon, Aug 07, 2023 at 12:57:27PM -0700, Jeff Davis wrote:
> (Aside: part of the reason set_config_option() is slow is because of
> the lookup in guc_hashtab. That's slower than some other hash lookups
> because it does case-folding, which needs to be done in both the hash
> function and also the match function. The hash lookups in
> SearchPathCache are significantly faster. I also have a patch to change
> guc_hashtab to simplehash, but I didn't see much difference.)

I wonder if it'd be faster to downcase first and then pass it to
hash_bytes().  This would require modifying the key, which might be a
deal-breaker, but maybe there's some way to improve find_option() for all
GUCs.

> But in general I'd prefer to optimize cases that are going to work
> nicely by default for a lot of users (especially switching to a safe
> search path), without the need for obscure knowledge about the
> performance implications of the session search_path. And to the extent
> that we do optimize for the pre-existing search_path, I'd like to
> understand the inherent overhead of changing the search path versus
> incidental overhead.

+1

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Use of additional index columns in rows filtering

2023-08-07 Thread Peter Geoghegan
On Mon, Aug 7, 2023 at 12:34 PM Tomas Vondra
 wrote:
> But then we call get_index_paths/build_index_path a little bit later,
> and that decides to skip "lower SAOP" (which seems a bit strange,
> because the column is "after" the equality, but meh). Anyway, at this
> point we already decided what's a filter, ignoring the index clauses,
> and not expecting any backsies.

I'm not surprised that it's due to the issue around "lower SAOP"
clauses within get_index_paths/build_index_path. That whole approach
seems rather ad-hoc to me. As you probably realize already, my own
patch has to deal with lots of issues in the same area.

> The simples fix seems to be to add these skipped SAOP clauses as
> filters. We know it can be evaluated on the index ...

Right. Obviously, my preferred solution to the problem at hand is to
make everything into index quals via an approach like the one from my
patch -- that works sensibly, no matter the length of the SAOP arrays.
But even if you're willing to assume that that work will be in place
for 17, there are still certain remaining gaps, that also seem
important.

Even my patch cannot always make SAOP clauses into index quals. There
are specific remaining gaps that I hope that your patch will still
cover. The simplest example is a similar NOT IN() inequality, like
this:

select
  ctid, *
from
  tenk1
where
  thousand = 42
  and
  tenthous not in (1, 3, 42, 43, 44, 45, 46, 47, 48, 49, 50);

There is no way that my patch can handle this case. Where your patch
seems to be unable to do better than master here, either -- just like
with the "tenthous in ( )" variant. Once again, the inequality SAOP
also ends up as table filter quals, not index filter quals.

It would also be nice if we found a way of doing this, while still
reliably avoiding all visibility checks (just like "real index quals"
will) -- since that should be safe in this specific case.

The MDAM paper describes a scheme for converting NOT IN() clauses into
DNF single value predicates. But that's not going to happen for 17,
and doesn't seem all that helpful with a query like this in any case.
But it does suggest an argument in favor of visibility checks not
being truly required for SAOP inequalities like this one (when they
appear in index filters). I'm not sure if that idea is too particular
to SAOP inequalities to be interesting -- just a suggestion.

-- 
Peter Geoghegan




Re: Configurable FP_LOCK_SLOTS_PER_BACKEND

2023-08-07 Thread Andres Freund
Hi,

On 2023-08-07 14:36:48 -0700, Andres Freund wrote:
> What if fast path locks entered PROCLOCK into the shared hashtable, just like
> with normal locks, the first time a lock is acquired by a backend. Except that
> we'd set a flag indicating the lock is a fastpath lock. When the lock is
> released, neither the LOCALLOCK nor the PROCLOCK entry would be
> removed. Instead, the LOCK/PROCLOCK would be modified to indicate that the
> lock is not held anymore.
> 
> That itself wouldn't buy us much - we'd still need to do a lookup in the
> shared hashtable. But, by the time we decide whether to use fast path locks,
> we've already done a hash lookup in the LOCALLOCK hashtable. Because the
> PROCLOCK entry would continue to exist, we can use LOCALLOCK->proclock to get
> the PROCLOCK entry without a shared hash table lookup.
> 
> Acquiring a strong lock on a fastpath lock would basically entail modifying
> all the relevant PROCLOCKs atomically to indicate that fast path locks aren't
> possible anymore.  Acquiring a fast path lock would just require atomically
> modifying the PROCLOCK to indicate that the lock is held.
> 
> On a first blush, this sounds like it could end up being fairly clean and
> generic?

On 2023-08-07 13:05:32 -0400, Robert Haas wrote:
> Of course, another thing we could do is try to improve the main lock
> manager somehow. I confess that I don't have a great idea for that at
> the moment, but the current locking scheme there is from a very, very
> long time ago and clearly wasn't designed with modern hardware in
> mind.

I think the biggest flaw of the locking scheme is that the LockHash locks
protect two, somewhat independent, things:
1) the set of currently lockable objects, i.e. the entries in the hash table 
[partition]
2) the state of all the locks [in a partition]

It'd not be that hard to avoid the shared hashtable lookup in a number of
cases, e.g. by keeping LOCALLOCK entries around for longer, as I suggest
above.  But we can't, in general, avoid the lock on the partition anyway, as
the each lock's state is also protected by the partition lock.

The amount of work to do a lookup in the shared hashtable and/or create a new
entry therein is quite bounded.  But the work for acquiring a lock is much less
so. We'll e.g. often have to iterate over the set of lock holders etc.

I think we ought to investigate whether pushing down the locking for the "lock
state" into the individual locks is worth it. That way the partitioned lock
would just protect the hashtable.

The biggest issue I see is deadlock checking. Right now acquiring all lock
partitions gives you a consistent view of all the non-fastpath locks - and
fastpath locks can't participate in deadlocks. Any scheme that makes "lock
state" locking in general more granular, will make it next to impossible to
have a similarly consistent view of all locks.  I'm not sure the current
degree of consistency is required however - the lockers participating in a
lock cycle, pretty much by definition, are blocked.


A secondary issue is that making the locks more granular could affect the
happy path measurably - we'd need two atomics for each heavyweight lock
acquisition, not one.  But if we cached the lookup in the shared hashtable,
we'd commonly be able to skip the hashtable lookup...

Greetings,

Andres Freund




Re: Using defines for protocol characters

2023-08-07 Thread Nathan Bossart
On Mon, Aug 07, 2023 at 04:02:08PM -0400, Tom Lane wrote:
> Dave Cramer  writes:
>> On Mon, 7 Aug 2023 at 12:59, Robert Haas  wrote:
>>> PqMsgEmptyQueryResponse or something like that seems better, if we
>>> want to keep the current capitalization. I'm not a huge fan of the way
>>> we vary our capitalization conventions so much all over the code base,
>>> but I think we would at least do well to keep it consistent from one
>>> end of a certain identifier to the other.
> 
>> I don't have a strong preference, but before I make the changes I'd like to
>> get consensus.
>> Can we vote or whatever it takes to decide on a naming pattern that is
>> acceptable ?
> 
> I'm good with Robert's proposal above.

+1

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Configurable FP_LOCK_SLOTS_PER_BACKEND

2023-08-07 Thread Andres Freund
Hi,

On 2023-08-07 13:05:32 -0400, Robert Haas wrote:
> I would also argue that the results are actually not that great,
> because once you get past 64 partitions you're right back where you
> started, or maybe worse off. To me, there's nothing magical about
> cases between 16 and 64 relations that makes them deserve special
> treatment - plenty of people are going to want to use hundreds of
> partitions, and even if you only use a few dozen, this isn't going to
> help as soon as you join two or three partitioned tables, and I
> suspect it hurts whenever it doesn't help.
> 
> I think we need a design that scales better. I don't really know what
> that would look like, exactly, but your idea of a hybrid approach
> seems like it might be worth further consideration. We don't have to
> store an infinite number of fast-path locks in an array that we search
> linearly, and it might be better that if we moved to some other
> approach we could avoid some of the regression.

My gut feeling is that the state for fast path locks doesn't live in quite the
right place.

What if fast path locks entered PROCLOCK into the shared hashtable, just like
with normal locks, the first time a lock is acquired by a backend. Except that
we'd set a flag indicating the lock is a fastpath lock. When the lock is
released, neither the LOCALLOCK nor the PROCLOCK entry would be
removed. Instead, the LOCK/PROCLOCK would be modified to indicate that the
lock is not held anymore.

That itself wouldn't buy us much - we'd still need to do a lookup in the
shared hashtable. But, by the time we decide whether to use fast path locks,
we've already done a hash lookup in the LOCALLOCK hashtable. Because the
PROCLOCK entry would continue to exist, we can use LOCALLOCK->proclock to get
the PROCLOCK entry without a shared hash table lookup.

Acquiring a strong lock on a fastpath lock would basically entail modifying
all the relevant PROCLOCKs atomically to indicate that fast path locks aren't
possible anymore.  Acquiring a fast path lock would just require atomically
modifying the PROCLOCK to indicate that the lock is held.

On a first blush, this sounds like it could end up being fairly clean and
generic?

Greetings,

Andres Freund




Re: Configurable FP_LOCK_SLOTS_PER_BACKEND

2023-08-07 Thread Matt Smiley
>
> Why would the access frequency be uniform? In particular, there's a huge
> variability in how long the locks need to exist
>

As a supporting data point, our example production workload shows a 3x
difference between the most versus least frequently contended lock_manager
lock:
https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2301#note_1365630726

Since we deterministically distribute relations among those 16 lock_manager
lwlocks by hashing their lock tag, we can probably assume a roughly uniform
number of relations are being managed by each lock_manager lock, but the
demand (and contention) for them is non-uniform. This 3x spread
corroborates the intuition that some relations are locked more frequently
than others (that being both a schema- and workload-specific property).
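
(For anyone who has not looked at that code, the mapping is essentially
the following, paraphrasing the existing lock.h/lock.c macros rather
than quoting them:)

    /* which of the 16 lock_manager lwlocks covers a given relation's lock? */
    static LWLock *
    lock_manager_lwlock_for(Oid relid)
    {
        LOCKTAG     tag;
        uint32      hashcode;

        SET_LOCKTAG_RELATION(tag, MyDatabaseId, relid);
        hashcode = LockTagHashCode(&tag);
        /* NUM_LOCK_PARTITIONS is 16 by default */
        return LockHashPartitionLock(hashcode);
    }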

Since we're contemplating a new hashing scheme, I wonder how we could
accommodate that kind of asymmetry, where some relations are locked more
frequently than others.


Re: Configurable FP_LOCK_SLOTS_PER_BACKEND

2023-08-07 Thread Andres Freund
Hi,

On 2023-08-07 13:59:26 -0700, Matt Smiley wrote:
> I have not yet written a reproducer since we see this daily in production.
> I have a sketch of a few ways that I think will reproduce the behavior
> we're observing, but haven't had time to implement it.
> 
> I'm not sure if we're seeing this behavior in production

It might be worth for you to backpatch

commit 92daeca45df
Author: Andres Freund 
Date:   2022-11-21 20:34:17 -0800

Add wait event for pg_usleep() in perform_spin_delay()

into 12. That should be low risk and have only trivially resolvable
conflicts. Alternatively, you could use bpftrace et al to set a userspace
probe on perform_spin_delay().


> , but it's definitely an interesting find.  Currently we are running
> postgres 12.11, with an upcoming upgrade to 15 planned.  Good to know
> there's a potential improvement waiting in 16.  I noticed that in
> LWLockAcquire the call to LWLockDequeueSelf occurs (
> https://github.com/postgres/postgres/blob/REL_12_11/src/backend/storage/lmgr/lwlock.c#L1218)
> directly between the unsuccessful attempt to immediately acquire the lock
> and reporting the backend's wait event.

That's normal.



> > I'm also wondering if it's possible that the reason for the throughput
> > drops
> > are possibly correlated with heavyweight contention or higher frequency
> > access
> > to the pg_locks view. Deadlock checking and the locks view acquire locks on
> > all lock manager partitions... So if there's a bout of real lock contention
> > (for longer than deadlock_timeout)...
> >
> 
> Great questions, but we ruled that out.  The deadlock_timeout is 5 seconds,
> so frequently hitting that would massively violate SLO and would alert the
> on-call engineers.  The pg_locks view is scraped a couple times per minute
> for metrics collection, but the lock_manager lwlock contention can be
> observed thousands of times every second, typically with very short
> durations.  The following example (captured just now) shows the number of
> times per second over a 10-second window that any 1 of the 16
> "lock_manager" lwlocks was contended:

Some short-lived contention is fine and expected - the question is how long
the waits are...

Unfortunately my experience is that the overhead of bpftrace means that
analyzing things like this with bpftrace is very hard... :(.


> > Given that most of your lock manager traffic comes from query planning -
> > have you evaluated using prepared statements more heavily?
> >
> 
> Yes, there are unrelated obstacles to doing so -- that's a separate can of
> worms, unfortunately.  But in this pathology, even if we used prepared
> statements, the backend would still need to reacquire the same locks during
> each executing transaction.  So in terms of lock acquisition rate, whether
> it's via the planner or executor doing it, the same relations have to be
> locked.

Planning will often lock more database objects than query execution. Which can
keep you using fastpath locks for longer.

Greetings,

Andres Freund




Re: Configurable FP_LOCK_SLOTS_PER_BACKEND

2023-08-07 Thread Matt Smiley
Hi Andres, thanks for helping!  Great questions, replies are inline below.

On Sun, Aug 6, 2023 at 1:00 PM Andres Freund  wrote:

> Hm, I'm curious whether you have a way to trigger the issue outside of your
> prod environment. Mainly because I'm wondering if you're potentially
> hitting
> the issue fixed in a4adc31f690 - we ended up not backpatching that fix, so
> you'd not see the benefit unless you reproduced the load in 16+.
>

Thanks for sharing this!

I have not yet written a reproducer since we see this daily in production.
I have a sketch of a few ways that I think will reproduce the behavior
we're observing, but haven't had time to implement it.

I'm not sure if we're seeing this behavior in production, but it's
definitely an interesting find.  Currently we are running postgres 12.11,
with an upcoming upgrade to 15 planned.  Good to know there's a potential
improvement waiting in 16.  I noticed that in LWLockAcquire the call to
LWLockDequeueSelf occurs (
https://github.com/postgres/postgres/blob/REL_12_11/src/backend/storage/lmgr/lwlock.c#L1218)
directly between the unsuccessful attempt to immediately acquire the lock
and reporting the backend's wait event.  The distinctive indicators we have
been using for this pathology are that "lock_manager" wait_event and its
associated USDT probe (
https://github.com/postgres/postgres/blob/REL_12_11/src/backend/storage/lmgr/lwlock.c#L1236-L1237),
both of which occur after whatever overhead is incurred by
LWLockDequeueSelf.  As you mentioned in your commit message, that overhead
is hard to detect.  My first impression is that whatever overhead it incurs
is in addition to what we are investigating.


> I'm also wondering if it's possible that the reason for the throughput
> drops
> are possibly correlated with heavyweight contention or higher frequency
> access
> to the pg_locks view. Deadlock checking and the locks view acquire locks on
> all lock manager partitions... So if there's a bout of real lock contention
> (for longer than deadlock_timeout)...
>

Great questions, but we ruled that out.  The deadlock_timeout is 5 seconds,
so frequently hitting that would massively violate SLO and would alert the
on-call engineers.  The pg_locks view is scraped a couple times per minute
for metrics collection, but the lock_manager lwlock contention can be
observed thousands of times every second, typically with very short
durations.  The following example (captured just now) shows the number of
times per second over a 10-second window that any 1 of the 16
"lock_manager" lwlocks was contended:

msmiley@patroni-main-2004-103-db-gprd.c.gitlab-production.internal:~$ sudo
./bpftrace -e 'usdt:/usr/lib/postgresql/12/bin/postgres:lwlock__wait__start
/str(arg0) == "lock_manager"/ { @[arg1] = count(); } interval:s:1 {
print(@); clear(@); } interval:s:10 { exit(); }'
Attaching 5 probes...
@[0]: 12122
@[0]: 12888
@[0]: 13011
@[0]: 13348
@[0]: 11461
@[0]: 10637
@[0]: 10892
@[0]: 12334
@[0]: 11565
@[0]: 11596

Typically that contention only lasts a couple microseconds.  But the long
tail can sometimes be much slower.  Details here:
https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2301#note_1365159507
.

Given that most of your lock manager traffic comes from query planning -
> have
> you evaluated using prepared statements more heavily?
>

Yes, there are unrelated obstacles to doing so -- that's a separate can of
worms, unfortunately.  But in this pathology, even if we used prepared
statements, the backend would still need to reacquire the same locks during
each executing transaction.  So in terms of lock acquisition rate, whether
it's via the planner or executor doing it, the same relations have to be
locked.


Re: Configurable FP_LOCK_SLOTS_PER_BACKEND

2023-08-07 Thread Robert Haas
On Mon, Aug 7, 2023 at 3:48 PM Tomas Vondra
 wrote:
> Why would the access frequency be uniform? In particular, there's a huge
> variability in how long the locks need to exist - IIRC we may be keeping
> locks for tables for a long time, but not for indexes. From this POV it
> might be better to do fast-path locking for indexes, no?

If you're not using explicit transactions, you take a bunch of locks
at the start of a statement and then release all of them at the end.
None of the locks stick around, so the fast-path locking structure goes
through cycles where it starts out empty, fills up to N items, and
then goes back to empty. If you visualize it as a cache, we're
flushing the entire cache at the end of every operation.

If you run multiple statements in a transaction, the locks will be
kept until the end of the transaction, once acquired. So then you
could start with a small number and gradually accumulate more. But
then you're going to release them all at once at the end.

The main thing that matters here seems to be whether or not all of the
locks can go through the fast-path mechanism, or how many have to go
through the regular mechanism. It shouldn't matter, AFAICS, *which
ones* go through the fast-path mechanism. If you think it does, I'd
like to hear why - it's possible I'm missing something here.

> Maybe, but isn't that mostly what the regular non-fast-path locking
> does? Wouldn't that defeat the whole purpose of fast-path locking?

I don't think so. The main lock manager has two flaws that hinder
performance in comparison with the fast-path mechanism. The first, but
less important, one is that the fast-path data structures are just a lot
simpler. For access to a small number of fixed-size elements, a C
array is hard to beat, and the main lock manager data structures are a
lot more complex. The second one, which I think is more important, is
that we've essentially flipped the ordering of the primary key. In the
main lock manager, you start by hashing the locked object and that
gives you a partition number and you then take that partition lock.
Then, you iterate through a list of backends that have that object
locked. This means that if a lot of people are taking locks on the
same object, even if there's no actual conflict between the lock
modes, you still get a lot of contention. But in the fast-path
mechanism, it's reversed: first, you go to the shared memory *for your
backend* and then you search through it for the particular locked
object at issue. So basically the main lock manager treats the primary
key as (what, who) while the fast-path mechanism treats it as (who,
what). And that gets rid of a ton of contention because then different
backends locking the same object (in sufficiently weak lock modes)
never touch the same cache lines, so there's actually zero contention.
That is, I believe, the most important thing about the fast-path
locking system.
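
To make the (who, what) point concrete, the per-backend side is just a
small array in the backend's own PGPROC that only its owner scans,
roughly like this (a simplified paraphrase of the slot search in
lock.c, where FAST_PATH_GET_BITS is private; not the literal code):

    /* find an existing slot for relid, or the first free one, in MY entries */
    uint32      unused_slot = FP_LOCK_SLOTS_PER_BACKEND;

    for (uint32 f = 0; f < FP_LOCK_SLOTS_PER_BACKEND; f++)
    {
        if (FAST_PATH_GET_BITS(MyProc, f) == 0)
        {
            if (unused_slot == FP_LOCK_SLOTS_PER_BACKEND)
                unused_slot = f;            /* remember first free slot */
        }
        else if (MyProc->fpRelId[f] == relid)
        {
            /* lock already tracked here; just set the mode bit */
        }
    }
    /* if unused_slot is still FP_LOCK_SLOTS_PER_BACKEND, all 16 slots are
     * in use and the lock has to go through the main lock manager instead */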

What I've just said is slightly complicated by the existence of
FastPathStrongRelationLockData, which is concurrently accessed by all
backends when using fast-path locking, but it's only read-only as
nobody actually takes a strong lock (like an AccessExclusiveLock on a
table). So you probably do get some cache line effects there, but
because it's read-only, they don't cause too much of a headache.

We do have to be careful that the overhead of checking multiple
locking data structures doesn't add up to a problem, for sure. But
there can still, I believe, be a lot of benefit in dividing up access
first by "who" and then by "what" for weak relation locks even if the
per-backend data structures become more complex. Or at least I hope
so.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Using defines for protocol characters

2023-08-07 Thread Tom Lane
Dave Cramer  writes:
> On Mon, 7 Aug 2023 at 12:59, Robert Haas  wrote:
>> PqMsgEmptyQueryResponse or something like that seems better, if we
>> want to keep the current capitalization. I'm not a huge fan of the way
>> we vary our capitalization conventions so much all over the code base,
>> but I think we would at least do well to keep it consistent from one
>> end of a certain identifier to the other.

> I don't have a strong preference, but before I make the changes I'd like to
> get consensus.
> Can we vote or whatever it takes to decide on a naming pattern that is
> acceptable ?

I'm good with Robert's proposal above.

regards, tom lane




Re: Using defines for protocol characters

2023-08-07 Thread Dave Cramer
On Mon, 7 Aug 2023 at 12:59, Robert Haas  wrote:

> On Mon, Aug 7, 2023 at 2:25 PM Tom Lane  wrote:
> > +1.  For ease of greppability, maybe even PQMSG_EmptyQueryResponse
> > and so on?  Then one grep would find both uses of the constants and
> > code/docs references.  Not sure if the prefix should be all-caps or
> > not if we go this way.
>
> PqMsgEmptyQueryResponse or something like that seems better, if we
> want to keep the current capitalization. I'm not a huge fan of the way
> we vary our capitalization conventions so much all over the code base,
> but I think we would at least do well to keep it consistent from one
> end of a certain identifier to the other.
>

I don't have a strong preference, but before I make the changes I'd like to
get consensus.

Can we vote or whatever it takes to decide on a naming pattern that is
acceptable ?

Dave


Re: Faster "SET search_path"

2023-08-07 Thread Jeff Davis
On Wed, 2023-08-02 at 01:14 -0400, Isaac Morland wrote:
> > > On Sat, 2023-07-29 at 12:44 -0400, Isaac Morland wrote:
> > > > Essentially, "just" observe efficiently (somehow) that no
> > > > change is
> > > > needed, and skip changing it?

...

> Speaking as someone who uses a lot of stored procedures, many of
> which call one another, I definitely want this optimization.

If we try to avoid the set_config_option() entirely, we'd have to
introduce the invalidation logic into fmgr.c somehow, and there would
be a performance cliff for users who change their session's
search_path.

Instead, in v4 (attached) I introduced a fast path (patch 0004,
experimental) for setting search_path that uses the same low-level
invalidation logic in namespace.c, but avoids most of the overhead of
set_config_option().

(Aside: part of the reason set_config_option() is slow is because of
the lookup in guc_hashtab. That's slower than some other hash lookups
because it does case-folding, which needs to be done in both the hash
function and also the match function. The hash lookups in
SearchPathCache are significantly faster. I also have a patch to change
guc_hashtab to simplehash, but I didn't see much difference.)

New timings (with session's search_path set to a, b):

  inc()plain function: 4.1s
  inc_secdef() SECURITY DEFINER:   4.3s
  inc_wm() SET work_mem = '4GB':   6.3s
  inc_ab() SET search_path = a, b: 5.4s
  inc_bc() SET search_path = b, c: 5.6s

I reported the numbers a bit differently here. All numbers are using v4
of the patch. The plain function is a baseline; SECURITY DEFINER is for
measuring the overhead of the fmgr_security_definer wrapper; inc_ab()
is for setting the search_path to the same value it's already set to;
and inc_bc() is for setting the search_path to a new value.

There's still a difference between inc() (with no proconfigs) and
inc_ab(), so you can certainly argue that some or all of that can be
avoided if the search_path is unchanged. For instance, adding the new
GUC level causes AtEOXact_GUC to be run, and that does some work. Maybe
it's even possible to avoid the fmgr_security_definer wrapper entirely
(though it would be tricky to do so).

But in general I'd prefer to optimize cases that are going to work
nicely by default for a lot of users (especially switching to a safe
search path), without the need for obscure knowledge about the
performance implications of the session search_path. And to the extent
that we do optimize for the pre-existing search_path, I'd like to
understand the inherent overhead of changing the search path versus
incidental overhead.


-- 
Jeff Davis
PostgreSQL Contributor Team - AWS


From f45ee8c5ea6517907000d1d3566dcc766ca25732 Mon Sep 17 00:00:00 2001
From: Jeff Davis 
Date: Thu, 27 Jul 2023 15:11:42 -0700
Subject: [PATCH v4 1/4] Transform proconfig for faster execution.

Store function config settings as a list of (name, value) pairs to
avoid the need to parse for each function invocation.
---
 src/backend/utils/fmgr/fmgr.c | 29 +++---
 src/backend/utils/misc/guc.c  | 45 ---
 src/include/utils/guc.h   |  2 ++
 3 files changed, 59 insertions(+), 17 deletions(-)

diff --git a/src/backend/utils/fmgr/fmgr.c b/src/backend/utils/fmgr/fmgr.c
index 9208c31fe0..6b2d7d7be3 100644
--- a/src/backend/utils/fmgr/fmgr.c
+++ b/src/backend/utils/fmgr/fmgr.c
@@ -612,7 +612,8 @@ struct fmgr_security_definer_cache
 {
 	FmgrInfo	flinfo;			/* lookup info for target function */
 	Oid			userid;			/* userid to set, or InvalidOid */
-	ArrayType  *proconfig;		/* GUC values to set, or NULL */
+	List	   *configNames;	/* GUC names to set, or NIL */
+	List	   *configValues;	/* GUC values to set, or NIL */
 	Datum		arg;			/* passthrough argument for plugin modules */
 };
 
@@ -634,6 +635,8 @@ fmgr_security_definer(PG_FUNCTION_ARGS)
 	FmgrInfo   *save_flinfo;
 	Oid			save_userid;
 	int			save_sec_context;
+	ListCell   *lc1;
+	ListCell   *lc2;
 	volatile int save_nestlevel;
 	PgStat_FunctionCallUsage fcusage;
 
@@ -666,8 +669,11 @@ fmgr_security_definer(PG_FUNCTION_ARGS)
 								&isnull);
 		if (!isnull)
 		{
+			ArrayType *array;
 			oldcxt = MemoryContextSwitchTo(fcinfo->flinfo->fn_mcxt);
-			fcache->proconfig = DatumGetArrayTypePCopy(datum);
+			array = DatumGetArrayTypeP(datum);
+			TransformGUCArray(array, &fcache->configNames,
+							  &fcache->configValues);
 			MemoryContextSwitchTo(oldcxt);
 		}
 
@@ -680,7 +686,7 @@ fmgr_security_definer(PG_FUNCTION_ARGS)
 
 	/* GetUserIdAndSecContext is cheap enough that no harm in a wasted call */
 	GetUserIdAndSecContext(&save_userid, &save_sec_context);
-	if (fcache->proconfig)		/* Need a new GUC nesting level */
+	if (fcache->configNames != NIL)		/* Need a new GUC nesting level */
 		save_nestlevel = NewGUCNestLevel();
 	else
 		save_nestlevel = 0;		/* keep compiler quiet */
@@ -689,12 +695,17 @@ fmgr_security_definer(PG_FUNCTION_ARGS)
 		SetUserIdAndSecContext(fcache->userid,
 			   

Re: Configurable FP_LOCK_SLOTS_PER_BACKEND

2023-08-07 Thread Tomas Vondra



On 8/7/23 21:21, Robert Haas wrote:
> On Mon, Aug 7, 2023 at 3:02 PM Tomas Vondra
>  wrote:
>>> I would also argue that the results are actually not that great,
>>> because once you get past 64 partitions you're right back where you
>>> started, or maybe worse off. To me, there's nothing magical about
>>> cases between 16 and 64 relations that makes them deserve special
>>> treatment - plenty of people are going to want to use hundreds of
>>> partitions, and even if you only use a few dozen, this isn't going to
>>> help as soon as you join two or three partitioned tables, and I
>>> suspect it hurts whenever it doesn't help.
>>
>> That's true, but doesn't that apply to any cache that can overflow? You
>> could make the same argument about the default value of 16 slots - why
>> not to have just 8?
> 
> Yes and no. I mean, there are situations where when the cache
> overflows, you still get a lot of benefit out of the entries that you
> are able to cache, as when the frequency of access follows some kind
> of non-uniform distribution, Zipfian or decreasing geometrically or
> whatever. There are also situations where you can just make the cache
> big enough that as a practical matter it's never going to overflow. I
> can't think of a PostgreSQL-specific example right now, but if you
> find that a 10-entry cache of other people living in your house isn't
> good enough, a 200-entry cache should solve the problem for nearly
> everyone alive. If that doesn't cause a resource crunch, crank up the
> cache size and forget about it. But here we have neither of those
> situations. The access frequency is basically uniform, and the cache
> size needed to avoid overflows seems to be unrealistically large, at
> least given the current design. So I think that in this case upping
> the cache size figures to be much less effective than in some other
> cases.
> 

Why would the access frequency be uniform? In particular, there's a huge
variability in how long the locks need to exist - IIRC we may be keeping
locks for tables for a long time, but not for indexes. From this POV it
might be better to do fast-path locking for indexes, no?

> It's also a bit questionable whether "cache" is even the right word
> here. I'd say it isn't, because it's not like the information in the
> fast-path locking structures is a subset of the full information
> stored elsewhere. Whatever information is stored there is canonical
> for those entries.
> 

Right. Calling this a cache might be a bit misleading.

>> Yes, I agree. I don't know if this particular design would be the right
>> one (1000 elements seems a bit too much for something included right in
>> PGPROC). But yeah, something that flips from linear search to something
>> else would be reasonable.
> 
> Yeah ... or there could be a few slots in the PGPROC and then a bit
> indicating whether to jump to a larger shared memory structure located
> in a separate array. Not sure exactly.
> 

Maybe, but isn't that mostly what the regular non-fast-path locking
does? Wouldn't that defeat the whole purpose of fast-path locking?

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Use of additional index columns in rows filtering

2023-08-07 Thread Tomas Vondra



On 8/7/23 02:38, Peter Geoghegan wrote:
> On Sun, Aug 6, 2023 at 3:28 PM Peter Geoghegan  wrote:
>> I decided to verify my understanding by checking what would happen
>> when I ran the OR-heavy tenk1 regression test query against a
>> combination of your patch, and v7 of the OR-to-SAOP transformation
>> patch. (To be clear, this is without my patch.)
> 
> I also spotted what looks like it might be a problem with your patch
> when looking at this query (hard to be sure if it's truly a bug,
> though).
> 
> I manually SAOP-ify the OR-heavy tenk1 regression test query like so:
> 
> select
>   *
> from
>   tenk1
> where
>   thousand = 42
>   and tenthous in (1, 3, 42);
> 
> Sure enough, I continue to get 7 buffer hits with this query. Just
> like with the BitmapOr plan (and exactly like the original query with
> the OR-to-SAOP transformation patch in place).
> 
> As I continue to add SAOP constants to the original "tenthous" IN(),
> eventually the planner switches over to not using index quals on the
> "tenthous" low order index column (they're only used on the high order
> "thousand" index column). Here's where the switch to only using the
> leading column from the index happens for me:
> 
> select
>   *
> from
>   tenk1
> where
>   thousand = 42
>   and
>   tenthous in (1, 3, 42, 43, 44, 45, 46, 47, 48, 49, 50);
> 
> This plan switchover isn't surprising in itself -- it's one of the most
> important issues addressed by my SAOP patch. However, it *is* a little
> surprising that your patch doesn't even manage to use "Index Filter"
> quals. It appears that it is only capable of using table filter quals.
> Obviously, the index has all the information that expression
> evaluation needs, and yet I see "Filter: (tenk1.tenthous = ANY
> ('{1,3,42,43,44,45,46,47,48,49,50}'::integer[]))". So no improvement
> over master here.
> 
> Interestingly enough, your patch only has this problem with SAOPs, at
> least that I know of -- the spelling/style matters. If I add many
> additional "tenthous" constants to the original version of the query
> from the regression tests in the same way, but using the "longform"
> (tenthous = 1 or tenthous = 3 ...) spelling, then your patch does
> indeed use index filters/expression evaluation. Just like the original
> "risky" plan (it's just a much bigger expression, with many more ORs).
> 

Right. This happens because the matching of SAOP to indexes happens in
multiple places. Firstly, create_index_paths() matches the clauses to
the index by calling

match_restriction_clauses_to_index
 -> match_clauses_to_index
 -> match_clause_to_index

Which is where we also decide which *unmatched* clauses can be filters.
And this *does* match the SAOP to the index key, hence no index filter.

But then we call get_index_paths/build_index_path a little bit later,
and that decides to skip "lower SAOP" (which seems a bit strange,
because the column is "after" the equality, but meh). Anyway, at this
point we already decided what's a filter, ignoring the index clauses,
and not expecting any backsies.

The simplest fix seems to be to add these skipped SAOP clauses as
filters. We know it can be evaluated on the index ...


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Configurable FP_LOCK_SLOTS_PER_BACKEND

2023-08-07 Thread Matt Smiley
Thank you Tomas!  I really appreciate your willingness to dig in here and
help us out!  The rest of my replies are inline below.

On Thu, Aug 3, 2023 at 1:39 PM Tomas Vondra 
wrote:

> The analysis in the linked gitlab issue is pretty amazing. I wasn't
> planning to argue against the findings anyway, but plenty of data
> supporting the conclusions is good.
>

Thank you!  I totally agree, having supporting data is so helpful.

I'm not an expert on locking, so some of the stuff I say may be
> trivially obvious - it's just me thinking about ...
>

Absolutely makes sense to check assumptions, etc.  Thanks for being open!
For what it's worth, I've also been working with Postgres for many years,
and I love that it keeps teaching me new things, this topic being just the
latest.

I wonder what's the rough configuration of those systems, though. Both
> the hardware and PostgreSQL side. How many cores / connections, etc.?
>

Each of the postgres hosts had 96 vCPUs and at peak handled roughly 80
concurrently active connections.

For purposes of reproducing the pathology, I think we can do so with a
single postgres instance.  We will need a high enough query rate to push
the bottleneck to lock_manager lwlock contention.  The simplest way to do
so is probably to give it a small dataset that fits easily in cache and run
several concurrent client connections doing cheap single-row queries, each
in its own transaction, against a target table that has either many indexes
or partitions or both.

For context, here's a brief summary of the production environment where we
first observed this pathology:
The writable primary postgres instance has several streaming replicas, used
for read-only portions of the workload.  All of them run on equivalent
hardware.  Most of the research focuses on the streaming replica postgres
instances, although the same pathology has been observed in the writable
primary node as well. The general topology is thousands of client
connections fanning down into several pgbouncer instances per Postgres
instance.  From each Postgres instance's perspective, its workload
generally has a daily peak of roughly 80 concurrently active backends
supporting a throughput of 75K transactions per second, where most transactions
run a single query.

Yes, I agree with that. Partitioning makes this issue worse, I guess.
> Schemas with indexes on every column are disturbingly common these days
> too, which hits the issue too ...
>

Agreed.


> Those are pretty great pieces of information. I wonder if some of the
> measurements may be affecting the observation (by consuming too much
> CPU, making the contention worse), but overall it seems convincing.
>

Yes, definitely overhead is a concern, glad you asked!

Here are my notes on the overhead for each bpftrace script:
https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2301#note_1357834678

Here is a summary of where that overhead comes from:
https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2301#note_1365310956

Here are more generic benchmark results for uprobe overhead:
https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/1383

Briefly, we generally  expect the instrumentation overhead to be roughly
1-2 microseconds per call to the instrumented instruction.  It partly
depends on what we're doing in the instrumentation, but most of that
overhead is just the interrupt-handling to transfer control flow to/from
the BPF code.

Would it be difficult to sample just a small fraction of the calls? Say,
> 1%, to get good histograms/estimated with acceptable CPU usage.
>

That would be great, but since the overhead comes mostly from the control
transfer, it wouldn't help to put sampling logic in the tracer itself.  The
main way to mitigate that overhead is to choose instrumentation points
where the call rate is tolerably low.  That's why the only instrumentation
I used for more than a few seconds at a time were the "low overhead"
scripts that instrument only the stalled call to LWLockAcquire.


> In any case, it's a great source of information to reproduce the issue
> and evaluate possible fixes.
>

Thanks, that's my hope!


> After looking at the code etc. I think the main trade-off here is going
> to be the cost of searching the fpRelId array. At the moment it's
> searched linearly, which is cheap for 16 locks. But at some point it'll
> become as expensive as updating the slowpath, and the question is when.
>
> I wonder if we could switch to a more elaborate strategy if the number
> of locks is high enough. Say, a hash table, or some hybrid approach.
>

Interesting idea!  I was hoping a linear search would stay cheap enough but
you're right, it's going to become too inefficient at some point.  It might
make sense to start with just blackbox timing or throughput measurements,
because directly measuring that search duration may not be cheap.  To
observe durations via BPF, we have to instrument 2 points (e.g. function
entry and return, or more generally the 

Re: Configurable FP_LOCK_SLOTS_PER_BACKEND

2023-08-07 Thread Robert Haas
On Mon, Aug 7, 2023 at 3:02 PM Tomas Vondra
 wrote:
> > I would also argue that the results are actually not that great,
> > because once you get past 64 partitions you're right back where you
> > started, or maybe worse off. To me, there's nothing magical about
> > cases between 16 and 64 relations that makes them deserve special
> > treatment - plenty of people are going to want to use hundreds of
> > partitions, and even if you only use a few dozen, this isn't going to
> > help as soon as you join two or three partitioned tables, and I
> > suspect it hurts whenever it doesn't help.
>
> That's true, but doesn't that apply to any cache that can overflow? You
> could make the same argument about the default value of 16 slots - why
> not have just 8?

Yes and no. I mean, there are situations where when the cache
overflows, you still get a lot of benefit out of the entries that you
are able to cache, as when the frequency of access follows some kind
of non-uniform distribution, Zipfian or decreasing geometrically or
whatever. There are also situations where you can just make the cache
big enough that as a practical matter it's never going to overflow. I
can't think of a PostgreSQL-specific example right now, but if you
find that a 10-entry cache of other people living in your house isn't
good enough, a 200-entry cache should solve the problem for nearly
everyone alive. If that doesn't cause a resource crunch, crank up the
cache size and forget about it. But here we have neither of those
situations. The access frequency is basically uniform, and the cache
size needed to avoid overflows seems to be unrealistically large, at
least given the current design. So I think that in this case upping
the cache size figures to be much less effective than in some other
cases.

It's also a bit questionable whether "cache" is even the right word
here. I'd say it isn't, because it's not like the information in the
fast-path locking structures is a subset of the full information
stored elsewhere. Whatever information is stored there is canonical
for those entries.

> Yes, I agree. I don't know if this particular design would be the right
> one (1000 elements seems a bit too much for something included right in
> PGPROC). But yeah, something that flips from linear search to something
> else would be reasonable.

Yeah ... or there could be a few slots in the PGPROC and then a bit
indicating whether to jump to a larger shared memory structure located
in a separate array. Not sure exactly.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Configurable FP_LOCK_SLOTS_PER_BACKEND

2023-08-07 Thread Tomas Vondra
On 8/7/23 18:56, Nathan Bossart wrote:
> On Mon, Aug 07, 2023 at 12:51:24PM +0200, Tomas Vondra wrote:
>> The bad news is this seems to have negative impact on cases with few
>> partitions, that'd fit into 16 slots. Which is not surprising, as the
>> code has to walk longer arrays, it probably affects caching etc. So this
>> would hurt the systems that don't use that many relations - not much,
>> but still.
>>
>> The regression appears to be consistently ~3%, and v2 aimed to improve
>> that - at least for the case with just 100 rows. It even gains ~5% in a
>> couple cases. It's however a bit strange v2 doesn't really help the two
>> larger cases.
>>
>> Overall, I think this seems interesting - it's hard to not like doubling
>> the throughput in some cases. Yes, it's 100 rows only, and the real
>> improvements are bound to be smaller, it would help short OLTP queries
>> that only process a couple rows.
> 
> Indeed.  I wonder whether we could mitigate the regressions by using SIMD
> intrinsics in the loops.  Or auto-vectorization, if that is possible.
> 

Maybe, but from what I know about SIMD it would require a lot of changes
to the design, so that the loops don't mix accesses to different PGPROC
fields (fpLockBits, fpRelId) and so on. But I think it'd be better to
just stop walking the whole array regularly.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Configurable FP_LOCK_SLOTS_PER_BACKEND

2023-08-07 Thread Tomas Vondra
On 8/7/23 19:05, Robert Haas wrote:
> On Mon, Aug 7, 2023 at 6:51 AM Tomas Vondra
>  wrote:
>> The regression appears to be consistently ~3%, and v2 aimed to improve
>> that - at least for the case with just 100 rows. It even gains ~5% in a
>> couple cases. It's however a bit strange v2 doesn't really help the two
>> larger cases.
> 
> To me, that's an absolutely monumental regression for a change like
> this. Sure, lots of people have partitioned tables. But also, lots of
> people don't. Penalizing very simple queries by 2-3% seems completely
> over the top to me. I can't help wondering whether there's actually
> something wrong with the test, or the coding, because that seems huge
> to me.
> 

I'm the first to admit the coding (in my patches) is far from perfect,
and this may easily be a consequence of that. My motivation was to get
some quick measurements for the "bad case".

> I would also argue that the results are actually not that great,
> because once you get past 64 partitions you're right back where you
> started, or maybe worse off. To me, there's nothing magical about
> cases between 16 and 64 relations that makes them deserve special
> treatment - plenty of people are going to want to use hundreds of
> partitions, and even if you only use a few dozen, this isn't going to
> help as soon as you join two or three partitioned tables, and I
> suspect it hurts whenever it doesn't help.
> 

That's true, but doesn't that apply to any cache that can overflow? You
could make the same argument about the default value of 16 slots - why
not have just 8?

FWIW I wasn't really suggesting we should increase the value to 64, I
was just trying to get a better idea of the costs at play (fast-path
cache maintenance and regular locking).

> I think we need a design that scales better. I don't really know what
> that would look like, exactly, but your idea of a hybrid approach
> seems like it might be worth further consideration. We don't have to
> store an infinite number of fast-path locks in an array that we search
> linearly, and it might be better that if we moved to some other
> approach we could avoid some of the regression. You mentioned a hash
> table; a partially associative cache might also be worth considering,
> like having an array of 1k slots but dividing it logically into 128
> bins of 16 slots each and only allowing an OID to be recorded in the
> bin whose low 7 bits match the low 7 bits of the OID.

Yes, I agree. I don't know if this particular design would be the right
one (1000 elements seems a bit too much for something included right in
PGPROC). But yeah, something that flips from linear search to something
else would be reasonable.

> 
> But maybe first we need to understand where all the CPU cycles are
> going, because maybe that's optimizing completely the wrong thing and,
> again, it seems like an awfully large hit.
> 

Right. We're mostly just guessing what the issue is.

> Of course, another thing we could do is try to improve the main lock
> manager somehow. I confess that I don't have a great idea for that at
> the moment, but the current locking scheme there is from a very, very
> long time ago and clearly wasn't designed with modern hardware in
> mind.
> 

No idea, but I'd bet some of the trade-offs may need re-evaluation.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Using defines for protocol characters

2023-08-07 Thread Robert Haas
On Mon, Aug 7, 2023 at 2:25 PM Tom Lane  wrote:
> +1.  For ease of greppability, maybe even PQMSG_EmptyQueryResponse
> and so on?  Then one grep would find both uses of the constants and
> code/docs references.  Not sure if the prefix should be all-caps or
> not if we go this way.

PqMsgEmptyQueryResponse or something like that seems better, if we
want to keep the current capitalization. I'm not a huge fan of the way
we vary our capitalization conventions so much all over the code base,
but I think we would at least do well to keep it consistent from one
end of a certain identifier to the other.
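
To make that concrete (purely illustrative, not the actual patch: the PqMsg
prefix is just the spelling suggested above, and the byte values are the
ones given in the protocol documentation), a strict translation could look
like:

#define PqMsgQuery                    'Q'	/* frontend */
#define PqMsgEmptyQueryResponse       'I'	/* backend */
#define PqMsgNegotiateProtocolVersion 'v'	/* backend */
#define PqMsgReadyForQuery            'Z'	/* backend */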

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: [PATCH] psql: Add tab-complete for optional view parameters

2023-08-07 Thread Christoph Heiss

Hi all,
sorry for the long delay.

On Mon, Jan 09, 2023 at 04:32:09PM +0100, Jim Jones wrote:
> However, an "ALTER TABLE  S" does not complete the open
> parenthesis "(" from "SET (", as suggested in "ALTER VIEW  ".
>
> postgres=# ALTER VIEW w SET
> Display all 187 possibilities? (y or n)
>
> Is it intended to behave like this? If so, an "ALTER VIEW 
> RES" does complete the open parenthesis -> "RESET (".

On Sun, Jan 29, 2023 at 10:19:12AM +, Mikhail Gribkov wrote:
> The patch has potential, although I have to agree with Jim Jones,
> it obviously has a problem with "alter view  set"
> handling.
> [..]
> I think it may worth looking at "alter materialized view"  completion
> tree and making "alter view" the same way.

Thank you both for reviewing/testing and the suggestions. Yeah,
definitely, that sounds very sensible.

I've attached a new revision, rebased and addressing the above by
aligning it with how "ALTER MATERIALIZED VIEW" works, such that "SET ("
and "SET SCHEMA" won't compete anymore. So that should now work more
as expected.

postgres=# ALTER MATERIALIZED VIEW m
ALTER COLUMN             CLUSTER ON               DEPENDS ON EXTENSION
NO DEPENDS ON EXTENSION  OWNER TO                 RENAME
RESET (                  SET

postgres=# ALTER MATERIALIZED VIEW m SET
(                ACCESS METHOD    SCHEMA           TABLESPACE
WITHOUT CLUSTER

postgres=# ALTER VIEW v
ALTER COLUMN  OWNER TO      RENAME        RESET (       SET

postgres=# ALTER VIEW v SET
(       SCHEMA

postgres=# ALTER VIEW v SET (
CHECK_OPTION      SECURITY_BARRIER  SECURITY_INVOKER

On Fri, Jan 06, 2023 at 12:18:44PM +, Dean Rasheed wrote:
> Hmm, I don't think we should be offering "check_option" as a tab
> completion for CREATE VIEW at all, since that would encourage users to
> use non-SQL-standard syntax, rather than CREATE VIEW ... WITH
> [CASCADED|LOCAL] CHECK OPTION.

Left that part in for now. I would argue that it is a well-documented
combination and as such users would expect it to turn up in the
tab-complete as well. OTOH not against removing it either, if there are
others voicing the same opinion ..

Thanks,
Christoph
From 3663eb0b5008d632972d4b66a105fc08cfff13fb Mon Sep 17 00:00:00 2001
From: Christoph Heiss 
Date: Mon, 7 Aug 2023 20:37:19 +0200
Subject: [PATCH v3] psql: Add tab-complete for optional view parameters

This adds them in the same matter as it works for storage parameters of
tables.

Signed-off-by: Christoph Heiss 
---
 src/bin/psql/tab-complete.c | 36 +++-
 1 file changed, 31 insertions(+), 5 deletions(-)

diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index 779fdc90cb..83ec1508bb 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -1329,6 +1329,13 @@ static const char *const table_storage_parameters[] = {
NULL
 };
 
+/* Optional parameters for CREATE VIEW and ALTER VIEW */
+static const char *const view_optional_parameters[] = {
+   "check_option",
+   "security_barrier",
+   "security_invoker",
+   NULL
+};
 
 /* Forward declaration of functions */
 static char **psql_completion(const char *text, int start, int end);
@@ -2216,8 +2223,7 @@ psql_completion(const char *text, int start, int end)
COMPLETE_WITH("TO");
/* ALTER VIEW  */
else if (Matches("ALTER", "VIEW", MatchAny))
-   COMPLETE_WITH("ALTER COLUMN", "OWNER TO", "RENAME",
- "SET SCHEMA");
+   COMPLETE_WITH("ALTER COLUMN", "OWNER TO", "RENAME", "RESET (", 
"SET");
/* ALTER VIEW xxx RENAME */
else if (Matches("ALTER", "VIEW", MatchAny, "RENAME"))
COMPLETE_WITH_ATTR_PLUS(prev2_wd, "COLUMN", "TO");
@@ -2233,6 +2239,16 @@ psql_completion(const char *text, int start, int end)
/* ALTER VIEW xxx RENAME COLUMN yyy */
else if (Matches("ALTER", "VIEW", MatchAny, "RENAME", "COLUMN", 
MatchAnyExcept("TO")))
COMPLETE_WITH("TO");
+   /* ALTER VIEW xxx SET ( yyy [= zzz] ) */
+   else if (Matches("ALTER", "VIEW", MatchAny, "SET"))
+   COMPLETE_WITH("(", "SCHEMA");
+   /* ALTER VIEW xxx SET|RESET ( yyy [= zzz] ) */
+   else if (Matches("ALTER", "VIEW", MatchAny, "SET|RESET", "("))
+   COMPLETE_WITH_LIST(view_optional_parameters);
+   else if (Matches("ALTER", "VIEW", MatchAny, "SET", "(", "check_option", 
"="))
+   COMPLETE_WITH("local", "cascaded");
+   else if (Matches("ALTER", "VIEW", MatchAny, "SET", "(", 
"security_barrier|security_invoker", "="))
+   COMPLETE_WITH("true", "false");
 
/* ALTER MATERIALIZED VIEW  */
else if (Matches("ALTER", "MATERIALIZED", "VIEW", MatchAny))
@@ -3525,13 +3541,23 @@ psql_completion(const char *text, int start, int end)
}
 
 /* CREATE VIEW --- is allowed inside CREATE SCHEMA, so use TailMatches */
-   /* Complete CREATE [ OR REPLACE ] VIEW  with AS */
+   

Re: Using defines for protocol characters

2023-08-07 Thread Nathan Bossart
On Mon, Aug 07, 2023 at 02:24:58PM -0400, Tom Lane wrote:
> Robert Haas  writes:
>> IMHO, the correspondence between the names in the patch and the
>> traditional names in the documentation could be stronger. For example,
>> the documentation mentions EmptyQueryResponse and
>> NegotiateProtocolVersion, but in this patch those become
>> PQMSG_RESP_NEGOTIATE_PROTOCOL and PQMSG_RESP_EMPTY_QUERY. I think we
>> could consider directly using the names from the documentation, right
>> down to capitalization, perhaps with a prefix, but I'm also totally
>> fine with this use of uppercase letters and underscores. But why not
>> do a strict translation, like EmptyQueryResponse ->
>> PQMSG_EMPTY_QUERY_RESPONSE, NegotiateProtocolVersion ->
>> PQMSG_NEGOTIATE_PROTOCOL_VERSION? To me at least, the current patch is
>> inventing new and slightly different names for things that already
>> have names...
> 
> +1.  For ease of greppability, maybe even PQMSG_EmptyQueryResponse
> and so on?  Then one grep would find both uses of the constants and
> code/docs references.  Not sure if the prefix should be all-caps or
> not if we go this way.

+1 for Tom's suggestion.  I have no strong opinion about the prefix, but I
do think easing greppability is worthwhile.

As mentioned earlier [0], I think we should also consider putting the
definitions in pqcomm.h.

[0] https://postgr.es/m/20230803185356.GA1144430%40nathanxps13

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Partial aggregates pushdown

2023-08-07 Thread Bruce Momjian
On Mon, Jul 10, 2023 at 07:35:27AM +, 
fujii.y...@df.mitsubishielectric.co.jp wrote:
> > > I will add a postgres_fdw option "check_partial_aggregate_support".
> > > This option is false, default.
> > > Only if this option is true, postgres_fdw connect to the remote server 
> > > and get the version of the remote server.
> > > And if the version of the remote server is less than PG17, then partial 
> > > aggregate push down to the remote server is
> > disable.
> > 
> > Great!
> I have modified the program except for the point "if the version of the 
> remote server is less than PG17".
> Instead, we have addressed the following.
> "If check_partial_aggregate_support is true and the remote server version is 
> older than the local server
> version, postgres_fdw does not assume that the partial aggregate function is 
> on the remote server unless
> the partial aggregate function and the aggregate function match."
> The reason for this is to maintain compatibility with any aggregate function 
> that does not support partial
> aggregate in one version of V1 (V1 is PG17 or higher), even if the next 
> version supports partial aggregate.
> For example, string_agg does not support partial aggregation in PG15, but it 
> will support partial aggregation
> in PG16.

Just to clarify, I think you are saying:

If check_partial_aggregate_support is true and the remote server
version is older than the local server version, postgres_fdw
checks if the partial aggregate function exists on the remote
server during planning and only uses it if it does.

I tried to phrase it in a positive way, and mentioned the plan time
distinction.  Also, I am sorry I was away for most of July and am just
getting to this.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Re: Using defines for protocol characters

2023-08-07 Thread Tom Lane
Robert Haas  writes:
> IMHO, the correspondence between the names in the patch and the
> traditional names in the documentation could be stronger. For example,
> the documentation mentions EmptyQueryResponse and
> NegotiateProtocolVersion, but in this patch those become
> PQMSG_RESP_NEGOTIATE_PROTOCOL and PQMSG_RESP_EMPTY_QUERY. I think we
> could consider directly using the names from the documentation, right
> down to capitalization, perhaps with a prefix, but I'm also totally
> fine with this use of uppercase letters and underscores. But why not
> do a strict translation, like EmptyQueryResponse ->
> PQMSG_EMPTY_QUERY_RESPONSE, NegotiateProtocolVersion ->
> PQMSG_NEGOTIATE_PROTOCOL_VERSION? To me at least, the current patch is
> inventing new and slightly different names for things that already
> have names...

+1.  For ease of greppability, maybe even PQMSG_EmptyQueryResponse
and so on?  Then one grep would find both uses of the constants and
code/docs references.  Not sure if the prefix should be all-caps or
not if we go this way.

regards, tom lane




Re: Performance degradation on concurrent COPY into a single relation in PG16.

2023-08-07 Thread Andres Freund
Hi,

On 2023-08-07 23:05:39 +0900, Masahiko Sawada wrote:
> On Mon, Aug 7, 2023 at 3:16 PM David Rowley  wrote:
> >
> > On Wed, 2 Aug 2023 at 13:35, David Rowley  wrote:
> > > So, it looks like this item can be closed off.  I'll hold off from
> > > doing that for a few days just in case anyone else wants to give
> > > feedback or test themselves.
> >
> > Alright, closed.
>
> IIUC the problem with multiple concurrent COPY is not resolved yet.

Yea - it was just hard to analyze until the other regressions were fixed.


> The result of nclients = 1 became better thanks to recent fixes, but
> there still seems to be the performance regression at nclient = 2~16
> (on RHEL 8 and 9). Andres reported[1] that after changing
> MAX_BUFFERED_TUPLES to 5000 the numbers became a lot better but it
> would not be the solution, as he mentioned.

I think there could be a quite simple fix: Track by how much we've extended
the relation previously in the same bistate. If we already extended by many
blocks, it's very likely that we'll do so further.

A simple prototype patch is attached. The results for me are promising. I copied
a smaller file [1], to have more accurate throughput results at shorter runs
(15s).

HEAD before:
clients  tps
1 41
2 76
4136
8248
16   360
32   375
64   317


HEAD after:
clients  tps
1 43
2 80
4155
8280
16   369
32   405
64   344

Any chance you could re-run your benchmark? I don't see as much of a regression vs 16
as you...

Greetings,

Andres Freund

[1] COPY (SELECT generate_series(1, 10)) TO '/tmp/data.copy';
diff --git i/src/include/access/hio.h w/src/include/access/hio.h
index 228433ee4a2..5ae39aec7f8 100644
--- i/src/include/access/hio.h
+++ w/src/include/access/hio.h
@@ -41,6 +41,7 @@ typedef struct BulkInsertStateData
 	 */
 	BlockNumber next_free;
 	BlockNumber last_free;
+	uint32		already_extended_by;
 } BulkInsertStateData;
 
 
diff --git i/src/backend/access/heap/heapam.c w/src/backend/access/heap/heapam.c
index 7ed72abe597..6a66214a580 100644
--- i/src/backend/access/heap/heapam.c
+++ w/src/backend/access/heap/heapam.c
@@ -1776,6 +1776,7 @@ GetBulkInsertState(void)
 	bistate->current_buf = InvalidBuffer;
 	bistate->next_free = InvalidBlockNumber;
 	bistate->last_free = InvalidBlockNumber;
+	bistate->already_extended_by = 0;
 	return bistate;
 }
 
diff --git i/src/backend/access/heap/hio.c w/src/backend/access/heap/hio.c
index c275b08494d..a2be4273df1 100644
--- i/src/backend/access/heap/hio.c
+++ w/src/backend/access/heap/hio.c
@@ -283,6 +283,13 @@ RelationAddBlocks(Relation relation, BulkInsertState bistate,
 		 */
 		extend_by_pages += extend_by_pages * waitcount;
 
+		/*
+		 * If we previously extended using the same bistate, it's very likely
+		 * we'll extend some more. Try to extend by as many pages.
+		 */
+		if (bistate)
+			extend_by_pages = Max(extend_by_pages, bistate->already_extended_by);
+
 		/*
 		 * Can't extend by more than MAX_BUFFERS_TO_EXTEND_BY, we need to pin
 		 * them all concurrently.
@@ -409,6 +416,7 @@ RelationAddBlocks(Relation relation, BulkInsertState bistate,
 		/* maintain bistate->current_buf */
 		IncrBufferRefCount(buffer);
 		bistate->current_buf = buffer;
+		bistate->already_extended_by += extend_by_pages;
 	}
 
 	return buffer;


Re: Using defines for protocol characters

2023-08-07 Thread Robert Haas
On Mon, Aug 7, 2023 at 1:19 PM Dave Cramer  wrote:
> Any other changes required ?

IMHO, the correspondence between the names in the patch and the
traditional names in the documentation could be stronger. For example,
the documentation mentions EmptyQueryResponse and
NegotiateProtocolVersion, but in this patch those become
PQMSG_RESP_NEGOTIATE_PROTOCOL and PQMSG_RESP_EMPTY_QUERY. I think we
could consider directly using the names from the documentation, right
down to capitalization, perhaps with a prefix, but I'm also totally
fine with this use of uppercase letters and underscores. But why not
do a strict translation, like EmptyQueryResponse ->
PQMSG_EMPTY_QUERY_RESPONSE, NegotiateProtocolVersion ->
PQMSG_NEGOTIATE_PROTOCOL_VERSION? To me at least, the current patch is
inventing new and slightly different names for things that already
have names...

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Using defines for protocol characters

2023-08-07 Thread Dave Cramer
On Mon, 7 Aug 2023 at 03:10, Alvaro Herrera  wrote:

> On 2023-Aug-07, Peter Smith wrote:
>
> > I guess, your patch would not be much different; you can still have
> > all the nice names and assign the appropriate values to the enum
> > values same as now, but using an enum you might also gain
> > type-checking in the code and also get warnings for the "switch"
> > statements if there are any cases accidentally omitted.
>
> Hmm, I think omitting a 'default' clause (which is needed when you want
> warnings for missing clauses) in a switch that handles protocol traffic
> is not great, because the switch would misbehave when the network
> counterpart sends a broken message.  I'm not sure we want to do that.
> It could become a serious security problem if confronted with a
> malicious libpq.
>
>
Any other changes required ?

Dave

> --
> Álvaro Herrera PostgreSQL Developer  —
> https://www.EnterpriseDB.com/
>


Re: Configurable FP_LOCK_SLOTS_PER_BACKEND

2023-08-07 Thread Robert Haas
On Mon, Aug 7, 2023 at 6:51 AM Tomas Vondra
 wrote:
> The regression appears to be consistently ~3%, and v2 aimed to improve
> that - at least for the case with just 100 rows. It even gains ~5% in a
> couple cases. It's however a bit strange v2 doesn't really help the two
> larger cases.

To me, that's an absolutely monumental regression for a change like
this. Sure, lots of people have partitioned tables. But also, lots of
people don't. Penalizing very simple queries by 2-3% seems completely
over the top to me. I can't help wondering whether there's actually
something wrong with the test, or the coding, because that seems huge
to me.

I would also argue that the results are actually not that great,
because once you get past 64 partitions you're right back where you
started, or maybe worse off. To me, there's nothing magical about
cases between 16 and 64 relations that makes them deserve special
treatment - plenty of people are going to want to use hundreds of
partitions, and even if you only use a few dozen, this isn't going to
help as soon as you join two or three partitioned tables, and I
suspect it hurts whenever it doesn't help.

I think we need a design that scales better. I don't really know what
that would look like, exactly, but your idea of a hybrid approach
seems like it might be worth further consideration. We don't have to
store an infinite number of fast-path locks in an array that we search
linearly, and it might be better that if we moved to some other
approach we could avoid some of the regression. You mentioned a hash
table; a partially associative cache might also be worth considering,
like having an array of 1k slots but dividing it logically into 128
bins of 16 slots each and only allowing an OID to be recorded in the
bin whose low 7 bits match the low 7 bits of the OID.
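
As a rough sketch of what that lookup could look like (the names and the
flat-array layout are made up here for illustration, not the existing PGPROC
fields, and it assumes the usual Postgres typedefs such as Oid), with 128
bins of 16 slots stored in one array:

/* Illustrative sketch only -- not the actual PGPROC layout. */
#define FP_LOCK_BINS          128
#define FP_LOCK_SLOTS_PER_BIN 16

static inline int
fp_find_slot(const Oid *fpRelId, Oid relid)
{
	/* the low 7 bits of the OID select one of the 128 bins */
	int		base = (relid & (FP_LOCK_BINS - 1)) * FP_LOCK_SLOTS_PER_BIN;

	/* insert and lookup only ever scan the 16 slots of that bin */
	for (int i = 0; i < FP_LOCK_SLOTS_PER_BIN; i++)
	{
		if (fpRelId[base + i] == relid)
			return base + i;
	}
	return -1;			/* not found: caller falls back to the main lock table */
}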

But maybe first we need to understand where all the CPU cycles are
going, because maybe that's optimizing completely the wrong thing and,
again, it seems like an awfully large hit.

Of course, another thing we could do is try to improve the main lock
manager somehow. I confess that I don't have a great idea for that at
the moment, but the current locking scheme there is from a very, very
long time ago and clearly wasn't designed with modern hardware in
mind.

--
Robert Haas
EDB: http://www.enterprisedb.com




Re: Configurable FP_LOCK_SLOTS_PER_BACKEND

2023-08-07 Thread Nathan Bossart
On Mon, Aug 07, 2023 at 12:51:24PM +0200, Tomas Vondra wrote:
> The bad news is this seems to have negative impact on cases with few
> partitions, that'd fit into 16 slots. Which is not surprising, as the
> code has to walk longer arrays, it probably affects caching etc. So this
> would hurt the systems that don't use that many relations - not much,
> but still.
> 
> The regression appears to be consistently ~3%, and v2 aimed to improve
> that - at least for the case with just 100 rows. It even gains ~5% in a
> couple cases. It's however a bit strange v2 doesn't really help the two
> larger cases.
> 
> Overall, I think this seems interesting - it's hard to not like doubling
> the throughput in some cases. Yes, it's 100 rows only, and the real
> improvements are bound to be smaller, it would help short OLTP queries
> that only process a couple rows.

Indeed.  I wonder whether we could mitigate the regressions by using SIMD
intrinsics in the loops.  Or auto-vectorization, if that is possible.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: generic plans and "initial" pruning

2023-08-07 Thread Robert Haas
On Mon, Aug 7, 2023 at 11:44 AM Tom Lane  wrote:
> Right, I doubt that changing that is going to work out well.
> Hash joins might have issues with it too.

I thought about the case, because Hash and Hash Join are such closely
intertwined nodes, but I don't see any problem there. It doesn't
really look like it would matter in what order things got cleaned up.
Unless I'm missing something, all of the data structures are just
independent things that we have to get rid of sometime.

> Could it work to make the patch force child cleanup before parent,
> instead of after?  Or would that break other places?

To me, it seems like the overwhelming majority of the code simply
doesn't care. You could pick an order out of a hat and it would be
100% OK. But I haven't gone and looked through it with this specific
idea in mind.

> On the whole though I think it's probably a good idea to leave
> parent nodes in control of the timing, so I kind of side with
> your later comment about whether we want to change this at all.

My overall feeling here is that what Gather and Gather Merge is doing
is pretty weird. I think I kind of knew that at the time this was all
getting implemented and reviewed, but I wasn't keen to introduce more
infrastructure changes than necessary given that parallel query, as a
project, was still pretty new and I didn't want to give other hackers
more reasons to be unhappy with what was already a lot of very
wide-ranging change to the system. A good number of years having gone
by now, and other people having worked on that code some more, I'm not
too worried about someone calling for a wholesale revert of parallel
query. However, there's a second problem here as well, which is that
I'm still not sure what the right thing to do is. We've fiddled around
with the shutdown sequence for parallel query a number of times now,
and I think there's still stuff that doesn't work quite right,
especially around getting all of the instrumentation data back to the
leader. I haven't spent enough time on this recently enough to be sure
what if any problems remain, though.

So on the one hand, I don't really like the fact that we have an
ad-hoc recursion arrangement here, instead of using
planstate_tree_walker or, as Amit proposes, a List. Giving subordinate
nodes control over the ordering when they don't really need it just
means we have more code with more possibility for bugs and less
certainty about whether the theoretical flexibility is doing anything
in practice. But on the other hand, because we know that at least for
the Gather/GatherMerge case it seems like it probably matters
somewhat, it definitely seems appealing not to change anything as part
of this patch set that we don't really have to.

I've had it firmly in my mind here that we were going to need to
change something somehow -- I mean, the possibility of returning in
the middle of node initialization seems like a pretty major change to
the way this stuff works, and it seems hard for me to believe that we
can just do that and not have to adjust any code anywhere else. Can it
really be true that we can do that and yet not end up creating any
states anywhere with which the current cleanup code is unprepared to
cope? Maybe, but it would seem like rather good luck if that's how it
shakes out. Still, at the moment, I'm having a hard time understanding
what this particular change buys us.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: initial pruning in parallel append

2023-08-07 Thread Robert Haas
On Mon, Aug 7, 2023 at 10:25 AM Amit Langote  wrote:
> Note we’re talking here about “initial” pruning that occurs during 
> ExecInitNode(). Workers are only launched during ExecGather[Merge]() which 
> thereafter do ExecInitNode() on their copy of the plan tree.  So if we 
> are to pass the pruning results for cross-checking, it will have to be from 
> the leader to workers.

That doesn't seem like a big problem because there aren't many node
types that do pruning, right? I think we're just talking about Append
and MergeAppend, or something like that, right? You just need the
ExecWhateverEstimate function to budget some DSM space to store the
information, which can basically just be a bitmap over the set of
child plans, and the ExecWhateverInitializeDSM copies the information
into that DSM space, and ExecWhateverInitializeWorker() copies the
information from the shared space back into the local node (or maybe
just points to it, if the representation is sufficiently compatible).
I feel like this is an hour or two's worth of coding, unless I'm
missing something, and WAY easier than trying to reason about what
happens if expression evaluation isn't as stable as we'd like it to
be.
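
For illustration, the shape would be roughly the sketch below. The Whatever*
names and the valid_subplan_bytes array (one byte per child plan) are made up
here; only the shm_toc_* calls and the ParallelContext /
ParallelWorkerContext arguments are existing infrastructure, and the usual
postgres.h / access/parallel.h includes are assumed:

/* Illustrative sketch only; "Whatever" stands in for Append and friends. */
void
ExecWhateverEstimate(WhateverState *node, ParallelContext *pcxt)
{
	/* one byte per child plan: 1 = survived initial pruning */
	shm_toc_estimate_chunk(&pcxt->estimator, node->nplans);
	shm_toc_estimate_keys(&pcxt->estimator, 1);
}

void
ExecWhateverInitializeDSM(WhateverState *node, ParallelContext *pcxt)
{
	char	   *shared = shm_toc_allocate(pcxt->toc, node->nplans);

	/* leader copies its pruning result into the DSM chunk */
	memcpy(shared, node->valid_subplan_bytes, node->nplans);
	shm_toc_insert(pcxt->toc, node->ps.plan->plan_node_id, shared);
}

void
ExecWhateverInitializeWorker(WhateverState *node, ParallelWorkerContext *pwcxt)
{
	/* workers reuse the leader's result instead of pruning again */
	node->valid_subplan_bytes =
		shm_toc_lookup(pwcxt->toc, node->ps.plan->plan_node_id, false);
}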

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: generic plans and "initial" pruning

2023-08-07 Thread Tom Lane
Robert Haas  writes:
> Second, I wondered whether the ordering of cleanup operations could be
> an issue. Right now, a node can position cleanup code before, after,
> or both before and after recursing to child nodes, whereas with this
> design change, the cleanup code will always be run before recursing to
> child nodes. Here, I think we have problems. Both ExecGather and
> ExecEndGatherMerge intentionally clean up the children before the
> parent, so that the child shutdown happens before
> ExecParallelCleanup(). Based on the comment and commit
> acf555bc53acb589b5a2827e65d655fa8c9adee0, this appears to be
> intentional, and you can sort of see why from looking at the stuff
> that happens in ExecParallelCleanup().

Right, I doubt that changing that is going to work out well.
Hash joins might have issues with it too.

Could it work to make the patch force child cleanup before parent,
instead of after?  Or would that break other places?

On the whole though I think it's probably a good idea to leave
parent nodes in control of the timing, so I kind of side with
your later comment about whether we want to change this at all.

regards, tom lane




Re: generic plans and "initial" pruning

2023-08-07 Thread Robert Haas
On Thu, Aug 3, 2023 at 4:37 AM Amit Langote  wrote:
> Here's a patch set where the refactoring to move the ExecutorStart()
> calls to be closer to GetCachedPlan() (for the call sites that use a
> CachedPlan) is extracted into a separate patch, 0002.  Its commit
> message notes an aspect of this refactoring that I feel a bit nervous
> about -- needing to also move the CommandCounterIncrement() call from
> the loop in PortalRunMulti() to PortalStart() which now does
> ExecutorStart() for the PORTAL_MULTI_QUERY case.

I spent some time today reviewing 0001. Here are a few thoughts and
notes about things that I looked at.

First, I wondered whether it was really adequate for ExecEndPlan() to
just loop over estate->es_plan_nodes and call it good. Put
differently, is it possible that we could ever have more than one
relevant EState, say for a subplan or an EPQ execution or something,
so that this loop wouldn't cover everything? I found nothing to make
me think that this is a real danger.

Second, I wondered whether the ordering of cleanup operations could be
an issue. Right now, a node can position cleanup code before, after,
or both before and after recursing to child nodes, whereas with this
design change, the cleanup code will always be run before recursing to
child nodes. Here, I think we have problems. Both ExecGather and
ExecEndGatherMerge intentionally clean up the children before the
parent, so that the child shutdown happens before
ExecParallelCleanup(). Based on the comment and commit
acf555bc53acb589b5a2827e65d655fa8c9adee0, this appears to be
intentional, and you can sort of see why from looking at the stuff
that happens in ExecParallelCleanup(). If the instrumentation data
vanishes before the child nodes have a chance to clean things up,
maybe EXPLAIN ANALYZE won't reflect that instrumentation any more. If
the DSA vanishes, maybe we'll crash if we try to access it. If we
actually reach DestroyParallelContext(), we're just going to start
killing the workers. None of that sounds like what we want.

The good news, of a sort, is that I think this might be the only case
of this sort of problem. Most nodes recurse at the end, after doing
all the cleanup, so the behavior won't change. Moreover, even if it
did, most cleanup operations look pretty localized -- they affect only
the node itself, and not its children. A somewhat interesting case is
nodes associated with subplans. Right now, because of the coding of
ExecEndPlan, nodes associated with subplans are all cleaned up at the
very end, after everything that's not inside of a subplan. But with
this change, they'd get cleaned up in the order of initialization,
which actually seems more natural, as long as it doesn't break
anything, which I think it probably won't, since as I mention in most
cases node cleanup looks quite localized, i.e. it doesn't care whether
it happens before or after the cleanup of other nodes.

I think something will have to be done about the parallel query stuff,
though. I'm not sure exactly what. It is a little weird that Gather
and Gather Merge treat starting and killing workers as a purely
"private matter" that they can decide to handle without the executor
overall being very much aware of it. So maybe there's a way that some
of the cleanup logic here could be hoisted up into the general
executor machinery, that is, first end all the nodes, and then go
back, and end all the parallelism using, maybe, another list inside of
the estate. However, I think that the existence of ExecShutdownNode()
is a complication here -- we need to make sure that we don't break
either the case where that happen before overall plan shutdown, or the
case where it doesn't.

Third, a couple of minor comments on details of how you actually made
these changes in the patch set. Personally, I would remove all of the
"is closed separately" comments that you added. I think it's a
violation of the general coding principle that you should make the
code look like it's always been that way. Sure, in the immediate
future, people might wonder why you don't need to recurse, but 5 or 10
years from now that's just going to be clutter. Second, in the cases
where the ExecEndNode functions end up completely empty, I would
suggest just removing the functions entirely and making the switch
that dispatches on the node type have a switch case that lists all the
nodes that don't need a callback here and say /* Nothing to do for these
node types */ break;. This will save a few CPU cycles and I think it
will be easier to read as well.

Fourth, I wonder whether we really need this patch at all. I initially
thought we did, because if we abandon the initialization of a plan
partway through, then we end up with a plan that is in a state that
previously would never have occurred, and we still have to be able to
clean it up. However, perhaps it's a difference without a distinction.
Say we have a partial plan tree, where not all of the PlanState nodes
ever got created. We then just call 

Re: WIP: new system catalog pg_wait_event

2023-08-07 Thread Drouvot, Bertrand

Hi,

On 8/7/23 10:23 AM, Drouvot, Bertrand wrote:

Hi,

On 8/4/23 5:08 PM, Tom Lane wrote:

"Drouvot, Bertrand"  writes:

Now that fa88928470 generates automatically code and documentation
related to wait events, why not exposing the wait events description
through a system catalog relation?


I think you'd be better off making this a view over a set-returning
function.  The nearby work to allow run-time extensibility of the
set of wait events is not going to be happy with a static catalog.


Oh right, good point, thanks!: I'll come back with a new patch version to make
use of SRF instead.


Please find attached v2 making use of SRF.

v2:

- adds a new pg_wait_event.c that acts as the "template" for the SRF
- generates pg_wait_event_insert.c (through generate-wait_event_types.pl) that
is included into pg_wait_event.c

That way I think it's flexible enough to add more code if needed in the SRF.

The patch also:

- updates the doc
- works with autoconf and meson
- adds a simple test

I'm adding a new CF entry for it.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

From 63f1316b2c160b72bdc1bcff6272e2888fce0db4 Mon Sep 17 00:00:00 2001
From: bdrouvotAWS 
Date: Sat, 5 Aug 2023 12:39:42 +
Subject: [PATCH v2] pg_wait_event

Adding a new system view, namely pg_wait_event, that describes the wait events.
---
 doc/src/sgml/system-views.sgml| 55 ++
 src/backend/catalog/system_views.sql  |  3 +
 src/backend/utils/activity/.gitignore |  1 +
 src/backend/utils/activity/Makefile   |  6 +-
 .../activity/generate-wait_event_types.pl | 76 +--
 src/backend/utils/activity/meson.build|  1 +
 src/backend/utils/activity/pg_wait_event.c| 41 ++
 src/include/catalog/pg_proc.dat   |  6 ++
 src/include/utils/meson.build |  4 +-
 src/test/regress/expected/rules.out   |  3 +
 src/test/regress/expected/sysviews.out|  8 ++
 src/test/regress/sql/sysviews.sql |  4 +
 src/tools/msvc/clean.bat  |  1 +
 13 files changed, 183 insertions(+), 26 deletions(-)
  19.8% doc/src/sgml/
  60.6% src/backend/utils/activity/
   5.3% src/include/catalog/
   5.7% src/test/regress/expected/
   3.0% src/test/regress/sql/
   5.3% src/

diff --git a/doc/src/sgml/system-views.sgml b/doc/src/sgml/system-views.sgml
index 57b228076e..e9cbeff682 100644
--- a/doc/src/sgml/system-views.sgml
+++ b/doc/src/sgml/system-views.sgml
@@ -221,6 +221,11 @@
   views
  
 
+ 
+  pg_wait_event
+  wait events
+ 
+
 

   
@@ -4825,4 +4830,54 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx
   
  
 
+
+ 
+  pg_wait_event
+
+  
+   pg_wait_event
+  
+
+  
+   The view pg_wait_event provides description about 
the
+   wait events.
+  
+
+  
+   pg_wait_event Columns
+   
+
+ 
+  
+   Column Type
+  
+  
+   Description
+  
+ 
+
+
+
+ 
+  
+   wait_event_name text
+  
+  
+   Wait event name
+  
+ 
+
+ 
+  
   description text
+  
+  
+   Wait event description
+  
+ 
+
+   
+  
+ 
+
 
diff --git a/src/backend/catalog/system_views.sql 
b/src/backend/catalog/system_views.sql
index af65af6bdd..f86a4dd770 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1342,3 +1342,6 @@ CREATE VIEW pg_stat_subscription_stats AS
 ss.stats_reset
 FROM pg_subscription as s,
  pg_stat_get_subscription_stats(s.oid) as ss;
+
+CREATE VIEW pg_wait_event AS
+SELECT * FROM pg_get_wait_events() AS we;
diff --git a/src/backend/utils/activity/.gitignore 
b/src/backend/utils/activity/.gitignore
index d77079285b..ad089a0b63 100644
--- a/src/backend/utils/activity/.gitignore
+++ b/src/backend/utils/activity/.gitignore
@@ -1,2 +1,3 @@
 /pgstat_wait_event.c
 /wait_event_types.h
+/pg_wait_event_insert.c
diff --git a/src/backend/utils/activity/Makefile 
b/src/backend/utils/activity/Makefile
index f1117745d4..8595e6ea77 100644
--- a/src/backend/utils/activity/Makefile
+++ b/src/backend/utils/activity/Makefile
@@ -32,10 +32,14 @@ OBJS = \
pgstat_subscription.o \
pgstat_wal.o \
pgstat_xact.o \
+   pg_wait_event.o \
wait_event.o
 
 include $(top_srcdir)/src/backend/common.mk
 
+pg_wait_event.o: pg_wait_event_insert.c
+pg_wait_event_insert.c: wait_event_types.h
+
 wait_event.o: pgstat_wait_event.c
 pgstat_wait_event.c: wait_event_types.h
touch $@
@@ -44,4 +48,4 @@ wait_event_types.h: 
$(top_srcdir)/src/backend/utils/activity/wait_event_names.tx
$(PERL) $(srcdir)/generate-wait_event_types.pl --code $<
 
 maintainer-clean: clean
-   rm -f wait_event_types.h pgstat_wait_event.c
+   rm -f wait_event_types.h pgstat_wait_event.c pg_wait_event_insert.c
diff --git 

Re: initial pruning in parallel append

2023-08-07 Thread Amit Langote
On Mon, Aug 7, 2023 at 22:29 Tom Lane  wrote:

> Robert Haas  writes:
> > ... Second, we've generally made a
> > decision up until now that we don't want to have a hard dependency on
> > stable functions actually being stable. If they aren't, and for
> > example you're using index expressions, your queries may return wrong
> > answers, but you won't get weird internal error messages, and you
> > won't get a crash. I think the bar for this feature is the same.
>
> Yeah, I agree --- wrong answers may be acceptable in such a case, but
> crashes or unintelligible error messages aren't.  There are practical
> reasons for being tolerant here, notably that it's not that easy
> for users to get their volatility markings right.
>
> In the case at hand, I think that means that allowing workers to do
> pruning would require hardening the parallel append code against the
> situation where their pruning results vary.  Maybe, instead of passing
> the pruning results *down*, we could pass them *up* to the leader and
> then throw an error if they're different?


Note we’re talking here about “initial” pruning that occurs during
ExecInitNode(). Workers are only launched during ExecGather[Merge]() which
thereafter do ExecInitNode() on their copy of the plan tree.  So if we
are to pass the pruning results for cross-checking, it will have to be from
the leader to workers.

> --
Thanks, Amit Langote
EDB: http://www.enterprisedb.com


Re: Performance degradation on concurrent COPY into a single relation in PG16.

2023-08-07 Thread Masahiko Sawada
On Mon, Aug 7, 2023 at 3:16 PM David Rowley  wrote:
>
> On Wed, 2 Aug 2023 at 13:35, David Rowley  wrote:
> > So, it looks like this item can be closed off.  I'll hold off from
> > doing that for a few days just in case anyone else wants to give
> > feedback or test themselves.
>
> Alright, closed.

IIUC the problem with multiple concurrent COPY is not resolved yet.
I've run the same benchmark that I used for the first report:

* PG15 (cb2ae5741f)
 nclients = 1, execution time = 15.213
 nclients = 2, execution time = 9.470
 nclients = 4, execution time = 6.508
 nclients = 8, execution time = 4.526
 nclients = 16, execution time = 3.788
 nclients = 32, execution time = 3.837
 nclients = 64, execution time = 4.286
 nclients = 128, execution time = 4.388
 nclients = 256, execution time = 4.333

* PG16 (67a007dc0c)
 nclients = 1, execution time = 14.494
 nclients = 2, execution time = 12.962
 nclients = 4, execution time = 17.757
 nclients = 8, execution time = 10.865
 nclients = 16, execution time = 7.371
 nclients = 32, execution time = 4.929
 nclients = 64, execution time = 2.212
 nclients = 128, execution time = 2.020
 nclients = 256, execution time = 2.196

The result of nclients = 1 became better thanks to recent fixes, but
there still seems to be the performance regression at nclient = 2~16
(on RHEL 8 and 9). Andres reported[1] that after changing
MAX_BUFFERED_TUPLES to 5000 the numbers became a lot better but it
would not be the solution, as he mentioned.

Regards,

[1] 
https://www.postgresql.org/message-id/20230711185159.v2j5vnyrtodnwhgz%40awork3.anarazel.de

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




postgres_fdw could support row comparison pushdown

2023-08-07 Thread Alexander Pyhalov

Hi.

postgres_fdw currently doesn't handle RowCompareExpr, which prevents
keyset pagination queries from being executed efficiently over a sharded
table.
The attached patch adds handling of RowCompareExpr in deparse.c, so that
we can push down conditions like
WHERE (created, id) > ('2023-01-01 00:00:00'::timestamp, 12345) to the
foreign server.


I'm not sure under which conditions it's possible for a RowCompareExpr to
have opnos with different names or namespaces, but a comment in
ruleutils.c suggests that this is possible, so I've added a check for this
in foreign_expr_walker().

--
Best regards,
Alexander Pyhalov,
Postgres Professional

From 655148c85768afbbfc034e6f5dc5a5a6d72139b8 Mon Sep 17 00:00:00 2001
From: Alexander Pyhalov 
Date: Mon, 7 Aug 2023 11:47:31 +0300
Subject: [PATCH] postgres_fdw: support RowCompareExpr pushdown

---
 contrib/postgres_fdw/deparse.c| 139 ++
 .../postgres_fdw/expected/postgres_fdw.out|  49 ++
 contrib/postgres_fdw/sql/postgres_fdw.sql |   9 ++
 3 files changed, 197 insertions(+)

diff --git a/contrib/postgres_fdw/deparse.c b/contrib/postgres_fdw/deparse.c
index 09d6dd60ddc..5eed3ac981c 100644
--- a/contrib/postgres_fdw/deparse.c
+++ b/contrib/postgres_fdw/deparse.c
@@ -166,6 +166,7 @@ static void deparseBoolExpr(BoolExpr *node, deparse_expr_cxt *context);
 static void deparseNullTest(NullTest *node, deparse_expr_cxt *context);
 static void deparseCaseExpr(CaseExpr *node, deparse_expr_cxt *context);
 static void deparseArrayExpr(ArrayExpr *node, deparse_expr_cxt *context);
+static void deparseRowCompareExpr(RowCompareExpr *node, deparse_expr_cxt *context);
 static void printRemoteParam(int paramindex, Oid paramtype, int32 paramtypmod,
 			 deparse_expr_cxt *context);
 static void printRemotePlaceholder(Oid paramtype, int32 paramtypmod,
@@ -287,6 +288,51 @@ is_foreign_expr(PlannerInfo *root,
 	return true;
 }
 
+/*
+ * Determines if the names and namespaces of operations match
+ */
+static bool
+opnames_match(List *opnos)
+{
+	Oid			oprns = InvalidOid;
+	char	   *oprname = NULL;
+	ListCell   *lc;
+	bool		match = true;
+
+	foreach(lc, opnos)
+	{
+		HeapTuple	opertup;
+		Form_pg_operator operform;
+		Oid			opno = lfirst_oid(lc);
+
+		opertup = SearchSysCache1(OPEROID, ObjectIdGetDatum(opno));
+		if (!HeapTupleIsValid(opertup))
+			elog(ERROR, "cache lookup failed for operator %u", opno);
+		operform = (Form_pg_operator) GETSTRUCT(opertup);
+		/* First op */
+		if (oprname == NULL)
+		{
+			oprname = pstrdup(NameStr(operform->oprname));
+			oprns = operform->oprnamespace;
+		}
+		else
+		{
+			Assert(OidIsValid(oprns));
+			if (oprns != operform->oprnamespace || (strcmp(oprname, NameStr(operform->oprname)) != 0))
+match = false;
+		}
+		ReleaseSysCache(opertup);
+
+		if (!match)
+			break;
+	}
+
+	if (oprname)
+		pfree(oprname);
+
+	return match;
+}
+
 /*
  * Check if expression is safe to execute remotely, and return true if so.
  *
@@ -872,6 +918,44 @@ foreign_expr_walker(Node *node,
 	state = FDW_COLLATE_UNSAFE;
 			}
 			break;
+		case T_RowCompareExpr:
+			{
+RowCompareExpr *rce = (RowCompareExpr *) node;
+ListCell   *lc;
+
+if (list_length(rce->opnos) == 0)
+	return false;
+
+/*
+ * Only shippable operators can be sent to remote.
+ */
+foreach(lc, rce->opnos)
+{
+	if (!is_shippable(lfirst_oid(lc), OperatorRelationId, fpinfo))
+		return false;
+}
+
+/* If opnos names do not match, can't deparse such expression */
+if (!opnames_match(rce->opnos))
+	return false;
+
+/*
+ * Recurse to arguments
+ */
+if (!foreign_expr_walker((Node *) rce->largs,
+		 glob_cxt, &inner_cxt, case_arg_cxt))
+	return false;
+
+if (!foreign_expr_walker((Node *) rce->rargs,
+		 glob_cxt, &inner_cxt, case_arg_cxt))
+	return false;
+
+/* Output is always boolean and so noncollatable. */
+collation = InvalidOid;
+state = FDW_COLLATE_NONE;
+
+			}
+			break;
 		case T_List:
 			{
 List	   *l = (List *) node;
@@ -2785,6 +2869,9 @@ deparseExpr(Expr *node, deparse_expr_cxt *context)
 		case T_ArrayExpr:
 			deparseArrayExpr((ArrayExpr *) node, context);
 			break;
+		case T_RowCompareExpr:
+			deparseRowCompareExpr((RowCompareExpr *) node, context);
+			break;
 		case T_Aggref:
 			deparseAggref((Aggref *) node, context);
 			break;
@@ -3508,6 +3595,58 @@ deparseArrayExpr(ArrayExpr *node, deparse_expr_cxt *context)
 		 deparse_type_name(node->array_typeid, -1));
 }
 
+/*
+ * Deparse RowCompareExpr
+ */
+static void
+deparseRowCompareExpr(RowCompareExpr *node, deparse_expr_cxt *context)
+{
+	StringInfo	buf = context->buf;
+	bool		first;
+	ListCell   *arg;
+	HeapTuple	opertup;
+	Form_pg_operator operform;
+	Oid			opno = linitial_oid(node->opnos);
+
+	/* Deparse the first argument */
+	appendStringInfoString(buf, "(ROW(");
+
+	first = true;
+	foreach(arg, node->largs)
+	{
+		if (!first)
+			appendStringInfoString(buf, ", ");
+		deparseExpr((Expr *) 

Re: [PATCH] Support % wildcard in extension upgrade filenames

2023-08-07 Thread Sandro Santilli
On Tue, Aug 01, 2023 at 08:24:15PM +0200, Daniel Gustafsson wrote:
> > On 28 Jun 2023, at 10:29, Daniel Gustafsson  wrote:
> > 
> >> On 31 May 2023, at 21:07, Sandro Santilli  wrote:
> >> On Thu, Apr 27, 2023 at 12:49:57PM +0200, Sandro Santilli wrote:
> > 
> >>> I'm happy to bring back the control-file switch if there's an
> >>> agreement about that.
> >> 
> >> I'm attaching an up-to-date version of the patch with the control-file
> >> switch back in, so there's an explicit choice by extension developers.
> > 
> > This version fails the extension test since the comment from the .sql file 
> > is
> > missing from test_extensions.out.  Can you fix that in a rebase such that 
> > the
> > CFBot can have a green build of this patch?
> 
> With no update to the thread and the patch still not applying I'm marking this
> returned with feedback.  Please feel free to resubmit to a future CF when 
> there
> is a new version of the patch.

Updated version of the patch is attached, regresses cleanly.
Will try to submit this to future CF from the website, let me know
if I'm doing it wrong.

--strk;
From 9b8d1f45e5cd4b1e286d1c82d8ebca4fcc25eb92 Mon Sep 17 00:00:00 2001
From: Sandro Santilli 
Date: Wed, 14 Sep 2022 11:10:10 +0200
Subject: [PATCH v4] Allow wildcard (%) in extension upgrade paths

A wildcard character "%" will be accepted in the
"source" side of the upgrade script and be considered
usable to upgrade any version to the "target" side.

Using wildcards needs to be explicitly requested by
extensions via a "wildcard_upgrades" setting in their
control file.

Includes regression test and documentation.
---
 doc/src/sgml/extend.sgml  | 14 +
 src/backend/commands/extension.c  | 58 +--
 src/test/modules/test_extensions/Makefile |  7 ++-
 .../expected/test_extensions.out  | 18 ++
 .../test_extensions/sql/test_extensions.sql   |  9 +++
 .../test_ext_wildcard1--%--2.0.sql|  6 ++
 .../test_ext_wildcard1--1.0.sql   |  6 ++
 .../test_ext_wildcard1.control|  4 ++
 8 files changed, 113 insertions(+), 9 deletions(-)
 create mode 100644 src/test/modules/test_extensions/test_ext_wildcard1--%--2.0.sql
 create mode 100644 src/test/modules/test_extensions/test_ext_wildcard1--1.0.sql
 create mode 100644 src/test/modules/test_extensions/test_ext_wildcard1.control

diff --git a/doc/src/sgml/extend.sgml b/doc/src/sgml/extend.sgml
index 218940ee5c..3d4003eaef 100644
--- a/doc/src/sgml/extend.sgml
+++ b/doc/src/sgml/extend.sgml
@@ -822,6 +822,20 @@ RETURNS anycompatible AS ...

   
  
+
+ 
+  wildcard_upgrades (boolean)
+  
+   
+This parameter, if set to true (which is not the
+default), allows ALTER EXTENSION to consider
+a wildcard character % as matching any version of
+the extension. Such wildcard match will only be used when no
+perfect match is found for a version.
+   
+  
+ 
+
 
 
 
diff --git a/src/backend/commands/extension.c b/src/backend/commands/extension.c
index 535072d181..c05055f5a0 100644
--- a/src/backend/commands/extension.c
+++ b/src/backend/commands/extension.c
@@ -88,6 +88,7 @@ typedef struct ExtensionControlFile
 	bool		relocatable;	/* is ALTER EXTENSION SET SCHEMA supported? */
 	bool		superuser;		/* must be superuser to install? */
 	bool		trusted;		/* allow becoming superuser on the fly? */
+	bool		wildcard_upgrades;  /* allow using wildcards in upgrade scripts */
 	int			encoding;		/* encoding of the script file, or -1 */
 	List	   *requires;		/* names of prerequisite extensions */
 	List	   *no_relocate;	/* names of prerequisite extensions that
@@ -132,6 +133,7 @@ static void ApplyExtensionUpdates(Oid extensionOid,
   bool cascade,
   bool is_create);
 static char *read_whole_file(const char *filename, int *length);
+static bool file_exists(const char *name);
 
 
 /*
@@ -584,6 +586,14 @@ parse_extension_control_file(ExtensionControlFile *control,
 		 errmsg("parameter \"%s\" requires a Boolean value",
 item->name)));
 		}
+		else if (strcmp(item->name, "wildcard_upgrades") == 0)
+		{
+			if (!parse_bool(item->value, >wildcard_upgrades))
+ereport(ERROR,
+		(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+		 errmsg("parameter \"%s\" requires a Boolean value",
+item->name)));
+		}
 		else if (strcmp(item->name, "encoding") == 0)
 		{
 			control->encoding = pg_valid_server_encoding(item->value);
@@ -656,6 +666,7 @@ read_extension_control_file(const char *extname)
 	control->relocatable = false;
 	control->superuser = true;
 	control->trusted = false;
+	control->wildcard_upgrades = false;
 	control->encoding = -1;
 
 	/*
@@ -913,7 +924,15 @@ execute_extension_script(Oid extensionOid, ExtensionControlFile *control,
 	if (from_version == NULL)
 		elog(DEBUG1, "executing extension script for \"%s\" version '%s'", control->name, version);
 	else
+	{
+		if ( 

Re: Oversight in reparameterize_path_by_child leading to executor crash

2023-08-07 Thread Ashutosh Bapat
On Fri, Aug 4, 2023 at 6:08 AM Richard Guo  wrote:

>
> On Thu, Aug 3, 2023 at 7:20 PM Ashutosh Bapat <
> ashutosh.bapat@gmail.com> wrote:
>
>> However, if reparameterization can *not* happen at the planning time, we
>> have chosen a plan which can not be realised meeting a dead end. So as long
>> as we can ensure that the reparameterization is possible we can delay
>> actual translations. I think it should be possible to decide whether
>> reparameterization is possible based on the type of path alone. So seems
>> doable.
>>
>
> That has been done in v2 patch.  See the new added function
> path_is_reparameterizable_by_child().
>

Thanks. The patch looks good overall. I like it because of its potential to
reduce memory consumption in reparameterization. That's cake on top of
cream :)

Some comments here.

+ {
+ FLAT_COPY_PATH(new_path, path, Path);
+
+ if (path->pathtype == T_SampleScan)
+ {
+ Index scan_relid = path->parent->relid;
+ RangeTblEntry *rte;
+
+ /* it should be a base rel with a tablesample clause... */
+ Assert(scan_relid > 0);
+ rte = planner_rt_fetch(scan_relid, root);
+ Assert(rte->rtekind == RTE_RELATION);
+ Assert(rte->tablesample != NULL);
+
+ ADJUST_CHILD_EXPRS(rte->tablesample, TableSampleClause *);
+ }
+ }

This change makes this function usable only after the final plan has been
selected. If some code accidentally uses it earlier, it would corrupt
rte->tablesample without being detected easily. I think we should mention
this in the function prologue and move it somewhere else. The other option
is to a) leave rte->tablesample unchanged here with a comment as to why we
aren't changing it, or b) reparameterize the tablesample expression in
create_samplescan_plan() if the path is parameterized by the parent. We call
reparameterize_path_by_child() in create_nestloop_plan() as this patch does
right now. I like that we are delaying reparameterization. It saves a bunch
of memory. I haven't measured it, but from my recent experiments I know it
will be a lot.

  * If the inner path is parameterized, it is parameterized by the
- * topmost parent of the outer rel, not the outer rel itself.  Fix
- * that.
+ * topmost parent of the outer rel, not the outer rel itself.  We will

There's something wrong with the original sentence (which was probably
written
by me back then :)). I think it should have read "If the inner path is
parameterized by the topmost parent of the outer rel instead of the outer
rel
itself, fix that.". If you agree, the new comment should change it
accordingly.

@@ -2430,6 +2430,16 @@ create_nestloop_path(PlannerInfo *root,
 {
  NestPath   *pathnode = makeNode(NestPath);
  Relids inner_req_outer = PATH_REQ_OUTER(inner_path);
+ Relids outerrelids;
+
+ /*
+ * Paths are parameterized by top-level parents, so run parameterization
+ * tests on the parent relids.
+ */
+ if (outer_path->parent->top_parent_relids)
+ outerrelids = outer_path->parent->top_parent_relids;
+ else
+ outerrelids = outer_path->parent->relids;

This looks like an existing bug. Did partitionwise join paths end up with
extra restrict clauses without this fix? I am not able to see why this
change is needed by the rest of the changes in the patch.

Anyway, let's rename outerrelids to something else e.g.  outer_param_relids
to reflect
its true meaning.

+bool
+path_is_reparameterizable_by_child(Path *path)
+{
+ switch (nodeTag(path))
+ {
+ case T_Path:
... snip ...
+ case T_GatherPath:
+ return true;
+ default:
+
+ /* We don't know how to reparameterize this path. */
+ return false;
+ }
+
+ return false;
+}

How do we make sure that any changes to reparameterize_path_by_child() are
reflected here? One way is to call this function first thing in
reparameterize_path_by_child() and return if the path cannot be
reparameterized, as sketched below.
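
A minimal sketch of that guard, just to illustrate the idea (this is not the
actual patch, and the final function layout may well differ; it assumes the
usual PostgreSQL planner headers):

/* Hypothetical sketch only; not the actual patch. */
Path *
reparameterize_path_by_child(PlannerInfo *root, Path *path,
							 RelOptInfo *child_rel)
{
	Path	   *new_path = NULL;

	/*
	 * Refuse path types that the translation switch below does not handle,
	 * so that this switch and path_is_reparameterizable_by_child() cannot
	 * silently drift apart.
	 */
	if (!path_is_reparameterizable_by_child(path))
		return NULL;

	/* ... existing per-path-type translation logic fills in new_path ... */

	return new_path;
}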

I haven't gone through each path's translation to make sure that it's
possible to do that during create_nestloop_plan(). I will do that in my next
round of review if time permits. But AFAIR, each of the translations is
possible during create_nestloop_plan(), where it happens in this patch.

-#define ADJUST_CHILD_ATTRS(node) \
+#define ADJUST_CHILD_EXPRS(node, fieldtype) \
  ((node) = \
- (List *) adjust_appendrel_attrs_multilevel(root, (Node *) (node), \
- child_rel, \
- child_rel->top_parent))
+ (fieldtype) adjust_appendrel_attrs_multilevel(root, (Node *) (node), \
+   child_rel, \
+   child_rel->top_parent))
+
+#define ADJUST_CHILD_ATTRS(node) ADJUST_CHILD_EXPRS(node, List *)

IIRC, ATTRS here is taken from the function being called. I think we should
just change the macro definition, not its name. If you would like to have a
separate macro for List nodes, create ADJUST_CHILD_ATTRS_IN_LIST or some
such.

-- 
Best Wishes,
Ashutosh Bapat


Re: initial pruning in parallel append

2023-08-07 Thread Tom Lane
Robert Haas  writes:
> ... Second, we've generally made a
> decision up until now that we don't want to have a hard dependency on
> stable functions actually being stable. If they aren't, and for
> example you're using index expressions, your queries may return wrong
> answers, but you won't get weird internal error messages, and you
> won't get a crash. I think the bar for this feature is the same.

Yeah, I agree --- wrong answers may be acceptable in such a case, but
crashes or unintelligible error messages aren't.  There are practical
reasons for being tolerant here, notably that it's not that easy
for users to get their volatility markings right.

In the case at hand, I think that means that allowing workers to do
pruning would require hardening the parallel append code against the
situation where their pruning results vary.  Maybe, instead of passing
the pruning results *down*, we could pass them *up* to the leader and
then throw an error if they're different?

regards, tom lane




Re: initial pruning in parallel append

2023-08-07 Thread Robert Haas
On Tue, Jun 27, 2023 at 9:23 AM Amit Langote  wrote:
> Maybe that stuff could be resurrected, though I was wondering if the
> risk of the same initial pruning steps returning different results
> when performed repeatedly in *one query lifetime* aren't pretty
> minimal or maybe rather non-existent?  I think that's because
> performing initial pruning steps entails computing constant and/or
> stable expressions and comparing them with an unchanging set of
> partition bound values, with comparison functions whose result is also
> presumed to be stable. Then there's also the step of mapping the
> partition indexes as they appear in the PartitionDesc to the indexes
> of their subplans under Append/MergeAppend using the information
> contained in PartitionPruneInfo (subplan_map) and the result of
> mapping should be immutable too.
>
> I considered that the comparison functions that
> match_clause_to_partition_key() obtains by calling get_opfamily_proc()
> may in fact not be stable, though that doesn't seem to be a worry at
> least with the out-of-the-box pg_amproc collection:

I think it could be acceptable if a stable function not actually being
stable results in some kind of internal error message, hopefully one
that in some way hints to the user what the problem was. But crashing
because some expression was supposed to be stable and wasn't is a
bridge too far, especially if, as I think would be the case here, the
crash happens in a part of the code that is far removed from where the
problem was introduced.

The real issue here is about how much trust you can place in a given
invariant. If, in a single function, we initialize a value to 0 and
thereafter only ever increment it, we can logically reason that if we
ever see a value less than zero, there must have been an overflow. It
is true that if our function calls some other function, that other
function could access data through a garbage pointer and possibly
corrupt the value of our function's local variable, but that's
extremely unlikely, and we can basically decide that we're not going
to care about it, because such code is likely to crash anyway before
too long.

But now consider an invariant that implicates a larger amount of code
e.g. you must always hold a buffer pin before accessing the buffer
contents. In many cases, it's fairly easy to verify that this must be
so in any given piece of code, but there are problems: some code that
does buffer access is complicated enough that it's hard to fully
verify, especially when buffer pins are held across long periods, and
what is probably worse, there are tons of different places in the code
that access buffers. Hence, we've had bugs in this area, and likely
will have bugs in this area again. In theory, with a sufficient amount
of really careful work, you can find all of the problems, but in
practice it's pretty difficult. Nonetheless, we just have to
accept the risk that we're going to crash if a bug in this area does
exist, because there's no real way to cope with the contents of the
buffer that you're accessing being swapped out while you're in the
middle of looking at it, or even modifying it.

But the present case is different in a couple of ways. First, there's
probably even more code involved, including a good bit of it that's
not in core but is user-defined. Second, we've generally made a
decision up until now that we don't want to have a hard dependency on
stable functions actually being stable. If they aren't, and for
example you're using index expressions, your queries may return wrong
answers, but you won't get weird internal error messages, and you
won't get a crash. I think the bar for this feature is the same.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Minor configure/meson cleanup

2023-08-07 Thread Tristan Partin

I agree that the changes look good.

--
Tristan Partin
Neon (https://neon.tech)




Re: [PoC] Reducing planning time when tables have many partitions

2023-08-07 Thread Ashutosh Bapat
On Mon, Aug 7, 2023 at 2:21 PM Andrey Lepikhov 
wrote:

> >> Do you think that the memory measurement patch I have shared in those
> threads is useful in itself? If so, I will start another proposal to
> address it.
> >
> > For me, who is developing the planner in this thread, the memory
> > measurement patch is useful. However, most users do not care about
> > memory usage, so there is room for consideration. For example, making
> > the metrics optional in EXPLAIN ANALYZE outputs might be better.
> >
> +1. Any memory-related info in the output of EXPLAIN ANALYZE makes tests
> more complex because of architecture dependency.
>
>
As far as the tests go, the same is the case with planning time and
execution time. They change even without changing the architecture. But we
have tests which mask the actual values. Something similar will be done to
the planning memory.

I will propose it as a separate patch in the next commitfest and will seek
opinions from other hackers.

-- 
Best Wishes,
Ashutosh Bapat


Re: Extract numeric filed in JSONB more effectively

2023-08-07 Thread Andy Fan
Hi Jian:

Thanks for the review!

compared with jsonb_numeric. I am wondering if you need a free *jb.
> elog(INFO,"jb=%p arg pointer=%p ", jb, PG_GETARG_POINTER(0));
> says there two are not the same.
>

Thanks for pointing this out; I am not sure what to do right now.
Basically the question is whether we should free the memory that
is allocated in a function call.  The argument for doing it is obvious,
but the argument for NOT doing it is that the memory is usually allocated
under an ExprContext memory context, which will be reset once the current
tuple is processed, and MemoryContextReset is more efficient than
individual pfrees.

I checked how most of the functions handle freeing their memory: besides
the ones you mentioned, the numeric_gt/ne/etc. functions also free it
directly.  But functions like jsonb_object_field_text,
jsonb_array_element, and jsonb_array_element_text don't.
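
For reference, the usual idiom when a function does choose to free its input
is PG_FREE_IF_COPY(), which only pfrees the detoasted copy when it differs
from the original argument datum, i.e. exactly the situation the elog()
output showed.  A purely hypothetical illustration (the function name and the
placeholder body are invented, not the patch's actual code):

/* Hypothetical illustration only; not the patch's actual function body. */
Datum
jsonb_field_example(PG_FUNCTION_ARGS)
{
	Jsonb	   *jb = PG_GETARG_JSONB_P(0);	/* may be a detoasted copy */
	text	   *key = PG_GETARG_TEXT_PP(1);
	Datum		result = (Datum) 0;

	/* ... extract the requested field into result (must not point into jb) ... */

	/* pfree the copies only if they differ from the original arguments */
	PG_FREE_IF_COPY(jb, 0);
	PG_FREE_IF_COPY(key, 1);

	if (result == (Datum) 0)
		PG_RETURN_NULL();
	PG_RETURN_DATUM(result);
}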

I'd like to hear more opinions from more experienced people;
this issue has confused me before, and I'm neutral on it now.
Once we reach an agreement on this, I will update the patch
accordingly.

-- 
Best Regards
Andy Fan


Re: [BUG] Fix DETACH with FK pointing to a partitioned table fails

2023-08-07 Thread Alvaro Herrera
On 2023-Aug-07, tender wang wrote:

> The foreign key still works even though the partition was detached. Is this
> behavior expected?

Well, there's no reason for it not to, right?  For example, if you
detach a partition and then attach it again, you don't have to scan the
partition on attach, because you know the constraint has remained valid
all along.

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"In fact, the basic problem with Perl 5's subroutines is that they're not
crufty enough, so the cruft leaks out into user-defined code instead, by
the Conservation of Cruft Principle."  (Larry Wall, Apocalypse 6)




Re: [BUG] Fix DETACH with FK pointing to a partitioned table fails

2023-08-07 Thread tender wang
The foreign key still works even though the partition was detached. Is this
behavior expected?
I can't find the answer in the documentation. If it is expected behavior,
please ignore the bug I reported a few days ago.

tender wang  于2023年8月4日周五 17:04写道:

> While looking over DetachPartitionFinalize(), I found another bug below:
>
> postgres=# CREATE TABLE p ( id bigint PRIMARY KEY ) PARTITION BY list (id);
> CREATE TABLE
> postgres=# CREATE TABLE p_1 PARTITION OF p FOR VALUES IN (1);
> CREATE TABLE
> postgres=# CREATE TABLE r_1 (
> postgres(# id   bigint PRIMARY KEY,
> postgres(# p_id bigint NOT NULL
> postgres(#   );
> CREATE TABLE
> postgres=#  CREATE TABLE r (
> postgres(# id   bigint PRIMARY KEY,
> postgres(# p_id bigint NOT NULL,
> postgres(# FOREIGN KEY (p_id) REFERENCES p (id)
> postgres(#   ) PARTITION BY list (id);
> CREATE TABLE
> postgres=# ALTER TABLE r ATTACH PARTITION r_1 FOR VALUES IN (1);
> ALTER TABLE
> postgres=# ALTER TABLE r DETACH PARTITION r_1;
> ALTER TABLE
> postgres=# insert into r_1 values(1,1);
> ERROR:  insert or update on table "r_1" violates foreign key constraint
> "r_p_id_fkey"
> DETAIL:  Key (p_id)=(1) is not present in table "p".
>
> After detach operation, r_1 is normal relation and the inherited foreign
> key 'r_p_id_fkey' should be removed.
>
>
> tender wang  于2023年8月3日周四 17:34写道:
>
>> I think the code to determine whether the FK of a partition is inherited
>> or not is not sufficient.
>> For example, in this case, foreign key r_1_p_id_fkey1 is not inherited
>> from the parent.
>>
>> If conform->conparentid (in the DetachPartitionFinalize function) is valid,
>> we should recheck the confrelid (pg_constraint) field.
>>
>> I tried to fix this problem in the attached patch.
>> Any thoughts?
>>
>> Alvaro Herrera  于2023年8月3日周四 17:02写道:
>>
>>> On 2023-Aug-03, tender wang wrote:
>>>
>>> > I think  old "sub-FK" should not be dropped, that will be violates
>>> foreign
>>> > key constraint.
>>>
>>> Yeah, I've been playing more with the patch and it is definitely not
>>> doing the right things.  Just eyeballing the contents of pg_trigger and
>>> pg_constraint for partitions added by ALTER...ATTACH shows that the
>>> catalog contents are inconsistent with those added by CREATE TABLE
>>> PARTITION OF.
>>>
>>> --
>>> Álvaro Herrera PostgreSQL Developer  —
>>> https://www.EnterpriseDB.com/
>>>
>>


Re: [PATCH] Add loongarch native checksum implementation.

2023-08-07 Thread John Naylor
On Fri, Jun 16, 2023 at 8:28 AM YANG Xudong  wrote:
> > +# If the intrinsics are supported, sets
pgac_loongarch_crc32c_intrinsics,
> > +# and CFLAGS_CRC.
> >
> > +# Check if __builtin_loongarch_crcc_* intrinsics can be used
> > +# with the default compiler flags.
> > +# CFLAGS_CRC is set if the extra flag is required.
> >
> > Same here -- it seems we don't need to set CFLAGS_CRC at all. Can you
> > confirm?
> >
>
> We don't need to set CFLAGS_CRC as commented. I have updated the
> configure script to make it align with the logic in meson build script.

(Looking again at v2)

The compilation test is found in c-compiler.m4, which still has all logic
for CFLAGS_CRC, including saving and restoring the old CFLAGS. Can this
also be simplified?

I diff'd pg_crc32c_loongarch.c with the current other files, and found it
is structurally the same as the Arm implementation. That's logical if
memory alignment is important.

  /*
- * ARMv8 doesn't require alignment, but aligned memory access is
- * significantly faster. Process leading bytes so that the loop below
- * starts with a pointer aligned to eight bytes.
+ * Aligned memory access is significantly faster.
+ * Process leading bytes so that the loop below starts with a pointer
aligned to eight bytes.

Can you confirm the alignment requirement -- it's not clear what the
intention is since "doesn't require" wasn't carried over. Is there any
documentation (or even a report in some other context) about aligned vs
unaligned memory access performance?
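
For concreteness, the pattern under discussion looks roughly like the
following sketch, mirroring the ARMv8 implementation.  Beyond the
__builtin_loongarch_crcc_* prefix mentioned above, the exact builtin names
and their argument order here are assumptions:

/* Sketch only: alignment handling around the 8-byte main loop. */
#include <stddef.h>
#include <stdint.h>

uint32_t
crc32c_loongarch_sketch(uint32_t crc, const void *data, size_t len)
{
	const unsigned char *p = data;
	const unsigned char *pend = p + len;

	/* Process leading bytes so the main loop reads 8-byte-aligned words. */
	while (p < pend && ((uintptr_t) p & 7) != 0)
		crc = __builtin_loongarch_crcc_w_b_w(*p++, crc);

	/* Main loop: one aligned 8-byte word per iteration. */
	while (pend - p >= 8)
	{
		crc = __builtin_loongarch_crcc_w_d_w(*(const uint64_t *) p, crc);
		p += 8;
	}

	/* Remaining trailing bytes. */
	while (p < pend)
		crc = __builtin_loongarch_crcc_w_b_w(*p++, crc);

	return crc;
}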

--
John Naylor
EDB: http://www.enterprisedb.com


Re: A failure in 031_recovery_conflict.pl on Debian/s390x

2023-08-07 Thread Christoph Berg
Re: Thomas Munro
> Thanks for testing!  Would you mind trying v8 from that thread?  V7
> had a silly bug (I accidentally deleted a 'case' label while cleaning
> some stuff up, resulting in the above error...)

v8 worked better. It succeeded a few times (at least 12, my screen
scrollback didn't catch more) before erroring like this:

[10:21:58.410](0.151s) ok 15 - startup deadlock: logfile contains terminated 
connection due to recovery conflict
[10:21:58.463](0.053s) not ok 16 - startup deadlock: stats show conflict on 
standby
[10:21:58.463](0.000s)
[10:21:58.463](0.000s) #   Failed test 'startup deadlock: stats show conflict 
on standby'
#   at t/031_recovery_conflict.pl line 332.
[10:21:58.463](0.000s) #  got: '0'
# expected: '1'

Christoph
2023-08-07 10:21:53.073 UTC [3132659] LOG:  starting PostgreSQL 17devel (Debian 
17~~devel-1) on s390x-ibm-linux-gnu, compiled by gcc (Debian 13.2.0-1) 13.2.0, 
64-bit
2023-08-07 10:21:53.074 UTC [3132659] LOG:  listening on Unix socket 
"/tmp/Ew0sehtBKm/.s.PGSQL.52786"
2023-08-07 10:21:53.075 UTC [3132664] LOG:  database system was shut down at 
2023-08-07 10:21:52 UTC
2023-08-07 10:21:53.086 UTC [3132659] LOG:  database system is ready to accept 
connections
2023-08-07 10:21:53.523 UTC [3132690] 031_recovery_conflict.pl LOG:  statement: 
CREATE TABLESPACE test_recovery_conflict_tblspc LOCATION ''
2023-08-07 10:21:53.536 UTC [3132700] 031_recovery_conflict.pl LOG:  received 
replication command: SHOW data_directory_mode
2023-08-07 10:21:53.536 UTC [3132700] 031_recovery_conflict.pl STATEMENT:  SHOW 
data_directory_mode
2023-08-07 10:21:53.536 UTC [3132700] 031_recovery_conflict.pl LOG:  received 
replication command: SHOW wal_segment_size
2023-08-07 10:21:53.536 UTC [3132700] 031_recovery_conflict.pl STATEMENT:  SHOW 
wal_segment_size
2023-08-07 10:21:53.536 UTC [3132700] 031_recovery_conflict.pl LOG:  received 
replication command: IDENTIFY_SYSTEM
2023-08-07 10:21:53.536 UTC [3132700] 031_recovery_conflict.pl STATEMENT:  
IDENTIFY_SYSTEM
2023-08-07 10:21:53.536 UTC [3132700] 031_recovery_conflict.pl LOG:  received 
replication command: BASE_BACKUP ( LABEL 'pg_basebackup base backup',  
PROGRESS,  CHECKPOINT 'fast',  WAIT 0,  MANIFEST 'yes',  TARGET 'client')
2023-08-07 10:21:53.536 UTC [3132700] 031_recovery_conflict.pl STATEMENT:  
BASE_BACKUP ( LABEL 'pg_basebackup base backup',  PROGRESS,  CHECKPOINT 'fast', 
 WAIT 0,  MANIFEST 'yes',  TARGET 'client')
2023-08-07 10:21:53.542 UTC [3132662] LOG:  checkpoint starting: immediate 
force wait
2023-08-07 10:21:53.548 UTC [3132662] LOG:  checkpoint complete: wrote 7 
buffers (5.5%); 0 WAL file(s) added, 0 removed, 1 recycled; write=0.001 s, 
sync=0.001 s, total=0.007 s; sync files=0, longest=0.000 s, average=0.000 s; 
distance=11350 kB, estimate=11350 kB; lsn=0/260, redo lsn=0/228
2023-08-07 10:21:53.561 UTC [3132701] 031_recovery_conflict.pl LOG:  received 
replication command: SHOW data_directory_mode
2023-08-07 10:21:53.561 UTC [3132701] 031_recovery_conflict.pl STATEMENT:  SHOW 
data_directory_mode
2023-08-07 10:21:53.566 UTC [3132701] 031_recovery_conflict.pl LOG:  received 
replication command: CREATE_REPLICATION_SLOT "pg_basebackup_3132701" TEMPORARY 
PHYSICAL ( RESERVE_WAL)
2023-08-07 10:21:53.566 UTC [3132701] 031_recovery_conflict.pl STATEMENT:  
CREATE_REPLICATION_SLOT "pg_basebackup_3132701" TEMPORARY PHYSICAL ( 
RESERVE_WAL)
2023-08-07 10:21:53.576 UTC [3132701] 031_recovery_conflict.pl LOG:  received 
replication command: IDENTIFY_SYSTEM
2023-08-07 10:21:53.576 UTC [3132701] 031_recovery_conflict.pl STATEMENT:  
IDENTIFY_SYSTEM
2023-08-07 10:21:53.581 UTC [3132701] 031_recovery_conflict.pl LOG:  received 
replication command: START_REPLICATION SLOT "pg_basebackup_3132701" 0/200 
TIMELINE 1
2023-08-07 10:21:53.581 UTC [3132701] 031_recovery_conflict.pl STATEMENT:  
START_REPLICATION SLOT "pg_basebackup_3132701" 0/200 TIMELINE 1
2023-08-07 10:21:53.774 UTC [3132700] 031_recovery_conflict.pl LOG:  temporary 
file: path "base/pgsql_tmp/pgsql_tmp3132700.0", size 137324
2023-08-07 10:21:53.774 UTC [3132700] 031_recovery_conflict.pl STATEMENT:  
BASE_BACKUP ( LABEL 'pg_basebackup base backup',  PROGRESS,  CHECKPOINT 'fast', 
 WAIT 0,  MANIFEST 'yes',  TARGET 'client')
2023-08-07 10:21:54.341 UTC [3132754] standby LOG:  received replication 
command: IDENTIFY_SYSTEM
2023-08-07 10:21:54.341 UTC [3132754] standby STATEMENT:  IDENTIFY_SYSTEM
2023-08-07 10:21:54.348 UTC [3132754] standby LOG:  received replication 
command: START_REPLICATION 0/300 TIMELINE 1
2023-08-07 10:21:54.348 UTC [3132754] standby STATEMENT:  START_REPLICATION 
0/300 TIMELINE 1
2023-08-07 10:21:54.361 UTC [3132760] 031_recovery_conflict.pl LOG:  statement: 
CREATE DATABASE test_db
2023-08-07 10:21:54.395 UTC [3132770] 031_recovery_conflict.pl LOG:  statement: 
CREATE TABLE test_recovery_conflict_table1(a int, b int);
2023-08-07 10:21:54.396 UTC [3132770] 031_recovery_conflict.pl LOG:  statement: 
INSERT INTO 

RE: [PoC] pg_upgrade: allow to upgrade publisher node

2023-08-07 Thread Hayato Kuroda (Fujitsu)
Dear Amit, Julien,

> > > >
> > > > Unless I'm missing something I don't see what prevents something to
> connect
> > > > using the replication protocol and issue any query or even create new
> > > > replication slots?
> > > >
> > >
> > > I think the point is that if we have any slots where we have not
> > > consumed the pending WAL (other than the expected like
> > > SHUTDOWN_CHECKPOINT) or if there are invalid slots then the upgrade
> > > won't proceed and we will request user to remove such slots or ensure
> > > that WAL is consumed by slots. So, I think in the case you mentioned,
> > > the upgrade won't succeed.
> >
> > What if new slots are added while the old instance is started in the middle 
> > of
> > pg_upgrade, *after* the various checks are done?
> >
> 
> They won't be copied but I think that won't be any different than
> other objects like tables. Anyway, I have another idea which is to not
> allow creating slots during binary upgrade unless one specifically
> requests it by having an API like binary_upgrade_allow_slot_create()
> similar to existing APIs binary_upgrade_*.

I checked that part and confirmed that objects created after the dump
were not copied to the new node. PSA scripts to emulate my test.

# tested steps

-1. applied v18 patch set
0. modified source to create objects during upgrade and install:

```
@@ -188,6 +188,9 @@ check_and_dump_old_cluster(bool live_check)
if (!user_opts.check)
generate_old_dump();
 
+   printf("XXX: start to sleep\n");
+   sleep(35);
+
```

1. prepared a node which had a replication slot
2. did pg_upgrade, the process will sleep 35 seconds during that
3. connected to the in-upgrading node by the command:

```
psql "host=`pwd` user=postgres port=50432 replication=database"
```

4. created a table and replication slot. Note that for binary upgrade, it was 
very
  hard to create tables manually. For me, table "bar" and slot "test" were 
created.
5. waited until the upgrade completed and the new node booted.
6. confirmed that the created tables and slots were not found on the new node.

```
new_publisher=# \d
Did not find any relations.

new_publisher=# SELECT slot_name FROM pg_replication_slots WHERE slot_name = 
'test';
 slot_name 
---
(0 rows)
```

You can execute test_01.sh first, and then execute test_02.sh while the first 
terminal is stuck.


Note that such creations can theoretically occur, but it is very rare.
Because of the following lines in start_postmaster(), TCP/IP connections are
refused and only the superuser can connect to the server.

```
#if !defined(WIN32)
/* prevent TCP/IP connections, restrict socket access */
strcat(socket_string,
   " -c listen_addresses='' -c unix_socket_permissions=0700");

/* Have a sockdir?  Tell the postmaster. */
if (cluster->sockdir)
snprintf(socket_string + strlen(socket_string),
 sizeof(socket_string) - strlen(socket_string),
 " -c %s='%s'",
 (GET_MAJOR_VERSION(cluster->major_version) <= 
902) ?
 "unix_socket_directory" : 
"unix_socket_directories",
 cluster->sockdir);
#endif
```

Moreover, the socket directory is set to the caller's current directory, and the
port number is also different from the setting written in postgresql.conf.
I think there is little chance that replication slots are accidentally created
during the upgrade.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



test_01.sh
Description: test_01.sh


test_02.sh
Description: test_02.sh


Re: Check volatile functions in ppi_clauses for memoize node

2023-08-07 Thread David Rowley
On Fri, 4 Aug 2023 at 22:26, Richard Guo  wrote:
> explain (costs off)
> select * from t t1 left join lateral
> (select t1.a as t1a, t2.a as t2a from t t2) s
> on t1.a = s.t2a + random();
>   QUERY PLAN
> ---
>  Nested Loop Left Join
>->  Seq Scan on t t1
>->  Memoize
>  Cache Key: t1.a
>  Cache Mode: binary
>  ->  Seq Scan on t t2
>Filter: (t1.a = (a + random()))
> (7 rows)
>
> According to the theory we should not use memoize node for this query
> because of the volatile function in the inner side.  So propose a patch
> to fix that.

Thanks for the patch.  I've pushed a variation of it.

I didn't really see the need to make a single list with all the
RestrictInfos. It seems fine just to write code and loop over the
ppi_clauses checking for volatility.
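
For readers following along, the check is roughly of this shape (a sketch
with an invented helper name, not necessarily the committed code; it assumes
the usual planner headers):

/* Sketch: reject memoization if any parameterized clause is volatile. */
static bool
ppi_clauses_are_cacheable(ParamPathInfo *param_info)
{
	ListCell   *lc;

	if (param_info == NULL)
		return true;			/* nothing to check */

	foreach(lc, param_info->ppi_clauses)
	{
		RestrictInfo *rinfo = lfirst_node(RestrictInfo, lc);

		if (contain_volatile_functions((Node *) rinfo->clause))
			return false;		/* caching results would be unsafe */
	}

	return true;
}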

David




Re: pgbench: allow to exit immediately when any client is aborted

2023-08-07 Thread Fabien COELHO



Hello Yugo-san,


I attached v2 patch including the documentation and some comments
in the code.


I've looked at this patch.

I'm unclear whether it does what it says ("exit immediately on abort"); I
would expect a cold call to "exit" (with a clear message, obviously) when
the abort occurs.


Currently it skips to "done", which starts by closing this particular
thread's client connections, and then it calls "exit" later, so it is not
"immediate".


I do not think that this cleanup is necessary, because all other threads
will be brutally killed anyway by the exit called from the aborted thread,
so why bother disconnecting only some connections?


/* If any client is aborted, exit immediately. */
if (state[i].state != CSTATE_FINISHED)

For this comment, I would prefer "if ( ... == CSTATE_ABORTED)" rather than
implying that not finished means aborted; and if you follow my previous
remark then this code can be removed.
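
If the check is kept at all, something along these lines would match the
suggestion (purely illustrative; variable names are taken from pgbench.c,
and the exit status used here is an assumption):

/* Illustrative sketch only, not the patch itself. */
for (int i = 0; i < nstate; i++)
{
	if (state[i].state == CSTATE_ABORTED)
	{
		pg_log_error("client %d aborted; exiting", i);
		exit(2);				/* exit status chosen here is an assumption */
	}
}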


Typo: "going to exist" -> "going to exit". Note that it is not "the whole 
thread" but "the program" which is exiting.


There is no test.

--
Fabien.




Re: [PoC] pg_upgrade: allow to upgrade publisher node

2023-08-07 Thread Amit Kapila
On Mon, Aug 7, 2023 at 1:06 PM Julien Rouhaud  wrote:
>
> On Mon, Aug 07, 2023 at 12:42:33PM +0530, Amit Kapila wrote:
> > On Mon, Aug 7, 2023 at 11:29 AM Julien Rouhaud  wrote:
> > >
> > > Unless I'm missing something I don't see what prevents something to 
> > > connect
> > > using the replication protocol and issue any query or even create new
> > > replication slots?
> > >
> >
> > I think the point is that if we have any slots where we have not
> > consumed the pending WAL (other than the expected like
> > SHUTDOWN_CHECKPOINT) or if there are invalid slots then the upgrade
> > won't proceed and we will request user to remove such slots or ensure
> > that WAL is consumed by slots. So, I think in the case you mentioned,
> > the upgrade won't succeed.
>
> What if new slots are added while the old instance is started in the middle of
> pg_upgrade, *after* the various checks are done?
>

They won't be copied but I think that won't be any different than
other objects like tables. Anyway, I have another idea which is to not
allow creating slots during binary upgrade unless one specifically
requests it by having an API like binary_upgrade_allow_slot_create()
similar to existing APIs binary_upgrade_*.

> > > Note also that as complained a few years ago nothing prevents a bgworker 
> > > from
> > > spawning up during pg_upgrade and possibly corrupt the upgraded cluster if
> > > multixid are assigned.  If publications are preserved wouldn't it mean 
> > > that
> > > such bgworkers could also lead to data loss?
> > >
> >
> > Is it because such workers would write some WAL which slots may not
> > process? If so, I think it is equally dangerous as other problems that
> > can arise due to such a worker. Do you think of any special handling
> > here?
>
> Yes, and there were already multiple reports of multixact corruption due to
> bgworker activity during pg_upgrade (see
> https://www.postgresql.org/message-id/20210121152357.s6eflhqyh4g5e...@dalibo.com
> for instance).  I think we should once and for all fix this whole class of
> problem one way or another.
>

I don't object to doing something like we discussed in the thread you
linked but don't see the link with this work. Surely, the extra
WAL/XIDs generated during the upgrade will cause data inconsistency
which is no different after this patch.

-- 
With Regards,
Amit Kapila.




Re: Minor configure/meson cleanup

2023-08-07 Thread Peter Eisentraut

On 07.08.23 11:18, Thomas Munro wrote:

0001: There is no point in searching -lrt and -lposix4 for fdatasync()
if it's supposed to help Solaris only.  They moved all the realtime
stuff into the main C library at least as far back as Solaris
10/OpenSolaris.

0002: There is no point in probing -lposix4 for anything.  That was
superseded by -lrt even longer ago.

We could go further:  I suspect the only remaining -lrt search we
still need in practice is for shm_open() on glibc < 2.34, but here I
just wanted to clean out some old Sun stuff that was called out by
name.


Looks sensible to me.




Re: Synchronizing slots from primary to standby

2023-08-07 Thread Drouvot, Bertrand

Hi,

On 8/4/23 1:32 PM, shveta malik wrote:

On Fri, Aug 4, 2023 at 2:44 PM Drouvot, Bertrand
 wrote:

On 7/28/23 4:39 PM, Bharath Rupireddy wrote:



Sorry to be late, but I gave it a second thought and I wonder if we really need
this design
(i.e., starting a logical replication background worker on the standby to sync
the slots).

Wouldn't that be simpler to "just" update the sync slots "metadata"
as the https://github.com/EnterpriseDB/pg_failover_slots module (mentioned by 
Peter
up-thread) is doing?
(making use of LogicalConfirmReceivedLocation(), LogicalIncreaseXminForSlot()
and LogicalIncreaseRestartDecodingForSlot(), If I read synchronize_one_slot() 
correctly).



Agreed. It would be simpler to just update the metadata. I think you
have not got chance to review the latest posted patch ('v10-0003')
yet, it does the same.


Thanks for the feedback! Yeah, I did not look at v10 in details and was
looking at the email thread only.

Indeed, I now see that 0003 does update the metadata in local_slot_advance(),
that's great!
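
Just to spell out what I mean by a metadata-only sync step, here is a purely
hypothetical sketch using those primitives (the struct and field names are
invented, and this is not the actual 0003 code):

/* Hypothetical sketch only; struct and field names are invented. */
typedef struct RemoteSlotInfo
{
	XLogRecPtr	confirmed_lsn;
	XLogRecPtr	restart_lsn;
	TransactionId catalog_xmin;
} RemoteSlotInfo;

static void
sync_slot_metadata(RemoteSlotInfo *remote)
{
	/* Assumes the local slot has already been acquired as MyReplicationSlot. */
	LogicalConfirmReceivedLocation(remote->confirmed_lsn);
	LogicalIncreaseXminForSlot(remote->confirmed_lsn, remote->catalog_xmin);
	LogicalIncreaseRestartDecodingForSlot(remote->confirmed_lsn,
										  remote->restart_lsn);
}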



But I do not quite get it as in how can we do it w/o starting a
background worker? 


Yeah, agreed that we still need background workers.
What I meant was to avoid using a "logical replication background worker"
(i.e., one launched through logicalrep_worker_launch()) to sync the slots.


The question here is how many background workers we
need to have. Will one be sufficient or do we need one per db (as done
earlier by the original patches in this thread) or are we good with
dividing work among some limited number of workers?

I feel syncing all slots in one worker may increase the lag between
subsequent syncs for a particular slot and if the number of slots are
huge, the chances of losing the slot-data is more in case of failure.
Starting one worker per db also might not be that efficient as it will
increase load on the system (both in terms of background worker and
network traffic) especially for a case where the number of dbs are
more. Thus starting max 'n' number of workers where 'n' is decided by
GUC and dividing the work/DBs among these looks a better option to me.
Please see the discussion in and around the email at [1]

[1]: 
https://www.postgresql.org/message-id/CAJpy0uCT%2BnpL4eUvCWiV_MBEri9ixcUgJVDdsBCJSqLd0oD1fQ%40mail.gmail.com


Thanks for the link! If I read the email thread correctly, this discussion
happened before V10 (which is the first version making use of
LogicalConfirmReceivedLocation(), LogicalIncreaseXminForSlot(), and
LogicalIncreaseRestartDecodingForSlot()), that is, before the metadata-only
sync was implemented.

While I agree that the approach of splitting the sync load among workers with
the new max_slot_sync_workers GUC seems reasonable without the metadata-only
sync feature (pre V10), I'm not sure that, with the metadata-only sync in
place, the extra complexity of managing multiple sync workers is needed.

Maybe we should start some tests/benchmark with only one sync worker to get 
numbers
and start from there?

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Extract numeric filed in JSONB more effectively

2023-08-07 Thread jian he
Hi.

+Datum
+jsonb_object_field_type(PG_FUNCTION_ARGS)
+{
+ Jsonb*jb = PG_GETARG_JSONB_P(0);
+ text*key = PG_GETARG_TEXT_PP(1);
+ Oid targetOid = PG_GETARG_OID(2);

Compared with jsonb_numeric, I am wondering if you need to free *jb.
elog(INFO,"jb=%p arg pointer=%p ", jb, PG_GETARG_POINTER(0));
says these two are not the same.




Minor configure/meson cleanup

2023-08-07 Thread Thomas Munro
0001: There is no point in searching -lrt and -lposix4 for fdatasync()
if it's supposed to help Solaris only.  They moved all the realtime
stuff into the main C library at least as far back as Solaris
10/OpenSolaris.

0002: There is no point in probing -lposix4 for anything.  That was
superseded by -lrt even longer ago.

We could go further:  I suspect the only remaining -lrt search we
still need in practice is for shm_open() on glibc < 2.34, but here I
just wanted to clean out some old Sun stuff that was called out by
name.
From 9c8663239ff44a75a615ec5ad81bd8dcad3378a6 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Mon, 7 Aug 2023 13:54:54 +1200
Subject: [PATCH 1/2] Don't probe extra libraries for fdatasync.

Commit d2e15083 got rid of the main configure probe and HAVE_FDATASYNC
macro, but we still searched -lrt and -lposix4 for old Solaris systems.
It's in the C library on modern Solaris, as on other supported systems.
---
 configure| 57 
 configure.ac |  2 --
 meson.build  |  1 -
 3 files changed, 60 deletions(-)

diff --git a/configure b/configure
index 2e518c8007..9ea188bc9b 100755
--- a/configure
+++ b/configure
@@ -12195,63 +12195,6 @@ if test "$ac_res" != no; then :
 
 fi
 
-# Solaris:
-{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for library containing fdatasync" >&5
-$as_echo_n "checking for library containing fdatasync... " >&6; }
-if ${ac_cv_search_fdatasync+:} false; then :
-  $as_echo_n "(cached) " >&6
-else
-  ac_func_search_save_LIBS=$LIBS
-cat confdefs.h - <<_ACEOF >conftest.$ac_ext
-/* end confdefs.h.  */
-
-/* Override any GCC internal prototype to avoid an error.
-   Use char because int might match the return type of a GCC
-   builtin and then its argument prototype would still apply.  */
-#ifdef __cplusplus
-extern "C"
-#endif
-char fdatasync ();
-int
-main ()
-{
-return fdatasync ();
-  ;
-  return 0;
-}
-_ACEOF
-for ac_lib in '' rt posix4; do
-  if test -z "$ac_lib"; then
-ac_res="none required"
-  else
-ac_res=-l$ac_lib
-LIBS="-l$ac_lib  $ac_func_search_save_LIBS"
-  fi
-  if ac_fn_c_try_link "$LINENO"; then :
-  ac_cv_search_fdatasync=$ac_res
-fi
-rm -f core conftest.err conftest.$ac_objext \
-conftest$ac_exeext
-  if ${ac_cv_search_fdatasync+:} false; then :
-  break
-fi
-done
-if ${ac_cv_search_fdatasync+:} false; then :
-
-else
-  ac_cv_search_fdatasync=no
-fi
-rm conftest.$ac_ext
-LIBS=$ac_func_search_save_LIBS
-fi
-{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_search_fdatasync" >&5
-$as_echo "$ac_cv_search_fdatasync" >&6; }
-ac_res=$ac_cv_search_fdatasync
-if test "$ac_res" != no; then :
-  test "$ac_res" = "none required" || LIBS="$ac_res $LIBS"
-
-fi
-
 # Cygwin:
 { $as_echo "$as_me:${as_lineno-$LINENO}: checking for library containing shmget" >&5
 $as_echo_n "checking for library containing shmget... " >&6; }
diff --git a/configure.ac b/configure.ac
index 3ebe1a796d..e36c1aee5a 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1286,8 +1286,6 @@ AC_SEARCH_LIBS(getopt_long, [getopt gnugetopt])
 AC_SEARCH_LIBS(shm_open, rt)
 AC_SEARCH_LIBS(shm_unlink, rt)
 AC_SEARCH_LIBS(clock_gettime, [rt posix4])
-# Solaris:
-AC_SEARCH_LIBS(fdatasync, [rt posix4])
 # Cygwin:
 AC_SEARCH_LIBS(shmget, cygipc)
 # *BSD:
diff --git a/meson.build b/meson.build
index 04ea348852..b7e67264e1 100644
--- a/meson.build
+++ b/meson.build
@@ -2405,7 +2405,6 @@ func_checks = [
   # required. Just checking for dlsym() ought to suffice.
   ['dlsym', {'dependencies': [dl_dep], 'define': false}],
   ['explicit_bzero'],
-  ['fdatasync', {'dependencies': [rt_dep, posix4_dep], 'define': false}], # Solaris
   ['getifaddrs'],
   ['getopt', {'dependencies': [getopt_dep, gnugetopt_dep], 'skip': always_replace_getopt}],
   ['getopt_long', {'dependencies': [getopt_dep, gnugetopt_dep], 'skip': always_replace_getopt_long}],
-- 
2.41.0

From 865bb8a85627c14ec69a8a919486f06cbd45b0fb Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Mon, 7 Aug 2023 13:56:41 +1200
Subject: [PATCH 2/2] Remove traces of Sun -lposix4.

This was a library on ancient Solaris systems, which was eventually
replaced with -lrt.  Both are obsolete, as the relevant symbols moved
into the C library long ago.
---
 configure| 2 +-
 configure.ac | 2 +-
 meson.build  | 3 +--
 3 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/configure b/configure
index 9ea188bc9b..3ee4801a35 100755
--- a/configure
+++ b/configure
@@ -12163,7 +12163,7 @@ return clock_gettime ();
   return 0;
 }
 _ACEOF
-for ac_lib in '' rt posix4; do
+for ac_lib in '' rt; do
   if test -z "$ac_lib"; then
 ac_res="none required"
   else
diff --git a/configure.ac b/configure.ac
index e36c1aee5a..314bdb5273 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1285,7 +1285,7 @@ AC_SEARCH_LIBS(socket, [socket ws2_32])
 AC_SEARCH_LIBS(getopt_long, [getopt gnugetopt])
 AC_SEARCH_LIBS(shm_open, rt)
 AC_SEARCH_LIBS(shm_unlink, rt)
-AC_SEARCH_LIBS(clock_gettime, [rt posix4])

Re: Using defines for protocol characters

2023-08-07 Thread Alvaro Herrera
On 2023-Aug-07, Peter Smith wrote:

> I guess, your patch would not be much different; you can still have
> all the nice names and assign the appropriate values to the enum
> values same as now, but using an enum you might also gain
> type-checking in the code and also get warnings for the "switch"
> statements if there are any cases accidentally omitted.

Hmm, I think omitting a 'default' clause (which is needed when you want
warnings for missing clauses) in a switch that handles protocol traffic
is not great, because the switch would misbehave when the network
counterpart sends a broken message.  I'm not sure we want to do that.
It could become a serious security problem if confronted with a
malicious libpq.
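
To make that concrete, here is a purely illustrative sketch (the
message-type names are placeholders in the spirit of the proposed defines,
not necessarily the final spelling):

/*
 * Illustrative only: a switch over a byte read from the wire still wants a
 * default clause, because the peer can send any value at all.
 */
switch (msgtype)				/* one byte read from the frontend */
{
	case PqMsg_Query:
		/* handle simple query */
		break;

	case PqMsg_Terminate:
		/* handle client disconnect */
		break;

		/* ... other known message types ... */

	default:
		ereport(FATAL,
				(errcode(ERRCODE_PROTOCOL_VIOLATION),
				 errmsg("invalid frontend message type %d", msgtype)));
}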

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/




Re: [PoC] pg_upgrade: allow to upgrade publisher node

2023-08-07 Thread Amit Kapila
On Mon, Aug 7, 2023 at 2:02 PM Masahiko Sawada  wrote:
>
> On Mon, Aug 7, 2023 at 12:54 PM Amit Kapila  wrote:
> >
> > On Sun, Aug 6, 2023 at 6:02 PM Masahiko Sawada  
> > wrote:
> > >
> > > IIUC the above query checks if the WAL record written at the slot's
> > > confirmed_flush_lsn is a CHECKPOINT_SHUTDOWN, but there is no check if
> > > this WAL record is the latest record.
> > >
> >
> > Yeah, I also think there should be some way to ensure this. How about
> > passing the number of records to read to this API? Actually, that will
> > address my other concern as well where the current API can lead to
> > reading an unbounded number of records if the confirmed_flush_lsn
> > location is far behind the CHECKPOINT_SHUTDOWN. Do you have any better
> > ideas to address it?
>
> It makes sense to me to limit the number of WAL records to read. But
> as I mentioned below, if there is a chance of any WAL activity during
> the upgrade, I'm not sure what limit to set.
>

In that case, we won't be able to pass the number of records. We need
to check based on the type of records.

>
> > However, I am not
> > sure if there can't be any additional WAL from checkpointer or
> > bgwriter. Checkpointer has a code that ensures that if there is no
> > important WAL activity then it would be skipped. Similarly, bgwriter
> > also doesn't LOG xl_running_xacts unless there is an important
> > activity.
>
> WAL records for hint bit updates could be generated even in upgrading mode?
>

Do you mean these records can be generated during reading catalog tables?

> > I feel if there is a chance of any WAL activity during the
> > upgrade, we need to either change the check to ensure such WAL records
> > are expected or document the same in some way.
>
> Yes, but how does it work with the above idea of limiting the number
> of WAL records to read? If XLOG_FPI_FOR_HINT can still be generated in
> the upgrade mode, we cannot predict how many such records are
> generated after the latest CHECKPOINT_SHUTDOWN.
>

Right, as said earlier, in that case, we need to rely on the type of records.

> I'm not really sure we should always perform the slot's
> confirmed_flush_lsn check by default in the first place. With this
> check, the upgrade won't be able to proceed if there is any logical
> slot that is not used by logical replication (or something streaming
> the changes using walsender), right? For example, if a user uses a
> program that periodically consumes the changes from the logical slot,
> the slot would not be able to pass the check even if the user executed
> pg_logical_slot_get_changes() just before shutdown. The backend
> process who consumes the changes is always terminated before the
> shutdown checkpoint. On the other hand, I think there are cases where
> the user can ensure that no meaningful WAL records are generated after
> the last pg_logical_slot_get_changes(). I'm concerned that this check
> might make upgrading such cases cumbersome unnecessarily.
>

You are right and I have mentioned the same case today in my response
to Jonathan but do you have better ideas to deal with such slots than
to give an ERROR?

-- 
With Regards,
Amit Kapila.




Re: [PoC] Reducing planning time when tables have many partitions

2023-08-07 Thread Andrey Lepikhov

On 7/8/2023 15:19, Yuya Watari wrote:

Hello,

Thank you for your reply.

On Thu, Aug 3, 2023 at 10:29 PM Ashutosh Bapat
 wrote:

If you think that the verification is important to catch bugs, you may want to 
encapsulate it with an #ifdef .. #endif such that the block within is not 
compiled by default. See OPTIMIZER_DEBUG for example.


In my opinion, verifying the iteration results is only necessary to
avoid introducing bugs while developing this patch. The verification
is too excessive for regular development of PostgreSQL. I agree that
we should avoid a significant degradation in assert enabled builds, so
I will consider removing it.
I should admit that these checks have helped me while backpatching this
feature to pg v13 (users crave a speedup of query planning a lot). Maybe
it is a sign of a lack of tests, but in fact, they have already helped.


One more thing: I think you should add comments to
add_child_rel_equivalences() and add_child_join_rel_equivalences()
about replacing:

if (bms_is_subset(cur_em->em_relids, top_parent_relids) &&
!bms_is_empty(cur_em->em_relids))
and
if (bms_overlap(cur_em->em_relids, top_parent_relids))

with different logic. What was changed? Adding some comments will help
future developers understand this part of the code more easily.



Do you think that the memory measurement patch I have shared in those threads 
is useful in itself? If so, I will start another proposal to address it.


For me, who is developing the planner in this thread, the memory
measurement patch is useful. However, most users do not care about
memory usage, so there is room for consideration. For example, making
the metrics optional in EXPLAIN ANALYZE outputs might be better.

+1. Any memory-related info in the output of EXPLAIN ANALYZE makes tests 
more complex because of architecture dependency.


--
regards,
Andrey Lepikhov
Postgres Professional





Re: Using defines for protocol characters

2023-08-07 Thread Peter Smith
Hi, I wondered if any consideration was given to using an enum instead
of all the #defines.

I guess, your patch would not be much different; you can still have
all the nice names and assign the appropriate values to the enum
values same as now, but using an enum you might also gain
type-checking in the code and also get warnings for the "switch"
statements if there are any cases accidentally omitted.

For example, see typedef enum LogicalRepMsgType [1].
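
Something like the following, where the enumerator names are just made up for
illustration and the values are the existing frontend protocol characters:

/* Hypothetical sketch; enumerator names invented for illustration. */
typedef enum ProtocolMsgType
{
	PROTO_MSG_QUERY = 'Q',		/* simple query */
	PROTO_MSG_PARSE = 'P',		/* extended query: parse */
	PROTO_MSG_BIND = 'B',		/* extended query: bind */
	PROTO_MSG_EXECUTE = 'E',	/* extended query: execute */
	PROTO_MSG_SYNC = 'S',		/* extended query: sync */
	PROTO_MSG_TERMINATE = 'X',	/* client disconnect */
} ProtocolMsgType;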

--
[1] 
https://github.com/postgres/postgres/blob/eeb4eeea2c525c51767ffeafda0070b946f26ae8/src/include/replication/logicalproto.h#L57C31-L57C31

Kind Regards,
Peter Smith
Fujitsu Australia




Fix badly generated entries in wait_event_names.txt

2023-08-07 Thread Drouvot, Bertrand

Hi hackers,

fa88928470 introduced src/backend/utils/activity/wait_event_names.txt that has
been auto-generated. The auto-generation had parsing errors leading to bad
entries in wait_event_names.txt (when the same "WAIT_EVENT" name can be found as
a substring of another one, then descriptions were merged in one of them).

Fixing those bad entries in the attached.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.comFrom fe669c646ea93b66b36d0aa97bb5665f0dd734b7 Mon Sep 17 00:00:00 2001
From: bdrouvotAWS 
Date: Mon, 7 Aug 2023 07:12:11 +
Subject: [PATCH v1] Fix badly generated entries in wait_event_names.txt

fa88928470 introduced src/backend/utils/activity/wait_event_names.txt that has
been auto-generated. The auto-generation had parsing errors leading to bad
entries in wait_event_names.txt (when the same "WAIT_EVENT" name can be found as
a substring of another one, then descriptions were merged in one of them.)

Fixing those bad entries.
---
 src/backend/utils/activity/wait_event_names.txt | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)
 100.0% src/backend/utils/activity/

diff --git a/src/backend/utils/activity/wait_event_names.txt 
b/src/backend/utils/activity/wait_event_names.txt
index 2ea4789b00..fcd9d2c63c 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -137,7 +137,7 @@ WAIT_EVENT_REPLICATION_ORIGIN_DROP  ReplicationOriginDrop   
"Waiting for a replicat
 WAIT_EVENT_REPLICATION_SLOT_DROP   ReplicationSlotDrop "Waiting for a 
replication slot to become inactive so it can be dropped."
 WAIT_EVENT_RESTORE_COMMAND RestoreCommand  "Waiting for  to complete."
 WAIT_EVENT_SAFE_SNAPSHOT   SafeSnapshot"Waiting to obtain a valid 
snapshot for a READ ONLY DEFERRABLE transaction."
-WAIT_EVENT_SYNC_REPSyncRep "Waiting for confirmation from a remote server 
during synchronous replication. Waiting to read or update information about the 
state of synchronous replication."
+WAIT_EVENT_SYNC_REPSyncRep "Waiting for confirmation from a remote server 
during synchronous replication."
 WAIT_EVENT_WAL_RECEIVER_EXIT   WalReceiverExit "Waiting for the WAL receiver 
to exit."
 WAIT_EVENT_WAL_RECEIVER_WAIT_START WalReceiverWaitStart"Waiting for 
startup process to send initial data for streaming replication."
 WAIT_EVENT_XACT_GROUP_UPDATE   XactGroupUpdate "Waiting for the group leader 
to update transaction status at end of a parallel operation."
@@ -177,9 +177,9 @@ WAIT_EVENT_BUFFILE_READ BufFileRead "Waiting for a 
read from a buffered file."
 WAIT_EVENT_BUFFILE_WRITE   BufFileWrite"Waiting for a write to a 
buffered file."
 WAIT_EVENT_BUFFILE_TRUNCATEBufFileTruncate "Waiting for a buffered file to 
be truncated."
 WAIT_EVENT_CONTROL_FILE_READ   ControlFileRead "Waiting for a read from the 
pg_control file."
-WAIT_EVENT_CONTROL_FILE_SYNC   ControlFileSync "Waiting for the 
pg_control file to reach durable storage. Waiting for an 
update to the pg_control file to reach durable storage."
+WAIT_EVENT_CONTROL_FILE_SYNC   ControlFileSync "Waiting for the 
pg_control file to reach durable storage."
 WAIT_EVENT_CONTROL_FILE_SYNC_UPDATEControlFileSyncUpdate   "Waiting for an 
update to the pg_control file to reach durable storage."
-WAIT_EVENT_CONTROL_FILE_WRITE  ControlFileWrite"Waiting for a write to 
the pg_control file. Waiting for a write to update the 
pg_control file."
+WAIT_EVENT_CONTROL_FILE_WRITE  ControlFileWrite"Waiting for a write to 
the pg_control file."
 WAIT_EVENT_CONTROL_FILE_WRITE_UPDATE   ControlFileWriteUpdate  "Waiting for a 
write to update the pg_control file."
 WAIT_EVENT_COPY_FILE_READ  CopyFileRead"Waiting for a read during a 
file copy operation."
 WAIT_EVENT_COPY_FILE_WRITE CopyFileWrite   "Waiting for a write during a 
file copy operation."
@@ -241,9 +241,9 @@ WAIT_EVENT_WAL_COPY_WRITE   WALCopyWrite"Waiting for a 
write when creating a new
 WAIT_EVENT_WAL_INIT_SYNC   WALInitSync "Waiting for a newly 
initialized WAL file to reach durable storage."
 WAIT_EVENT_WAL_INIT_WRITE  WALInitWrite"Waiting for a write while 
initializing a new WAL file."
 WAIT_EVENT_WAL_READWALRead "Waiting for a read from a WAL file."
-WAIT_EVENT_WAL_SYNCWALSync "Waiting for a WAL file to reach durable 
storage. Waiting for data to reach durable storage while assigning a new WAL 
sync method."
+WAIT_EVENT_WAL_SYNCWALSync "Waiting for a WAL file to reach durable 
storage."
 WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN  WALSyncMethodAssign "Waiting for 
data to reach durable storage while assigning a new WAL sync method."
-WAIT_EVENT_WAL_WRITE   WALWrite"Waiting for a write to a WAL file. 
Waiting for WAL buffers to be written to disk."
+WAIT_EVENT_WAL_WRITE   WALWrite"Waiting for a write to a WAL file."
 
 
 #
-- 
2.34.1



Re: [PoC] pg_upgrade: allow to upgrade publisher node

2023-08-07 Thread Masahiko Sawada
On Mon, Aug 7, 2023 at 12:54 PM Amit Kapila  wrote:
>
> On Sun, Aug 6, 2023 at 6:02 PM Masahiko Sawada  wrote:
> >
> > On Wed, Aug 2, 2023 at 5:13 PM Hayato Kuroda (Fujitsu)
> >  wrote:
> > >
> > > > 4.
> > > > + /*
> > > > + * Check that all logical replication slots have reached the current 
> > > > WAL
> > > > + * position.
> > > > + */
> > > > + res = executeQueryOrDie(conn,
> > > > + "SELECT slot_name FROM pg_catalog.pg_replication_slots "
> > > > + "WHERE (SELECT count(record_type) "
> > > > + " FROM pg_catalog.pg_get_wal_records_content(confirmed_flush_lsn,
> > > > pg_catalog.pg_current_wal_insert_lsn()) "
> > > > + " WHERE record_type != 'CHECKPOINT_SHUTDOWN') <> 0 "
> > > > + "AND temporary = false AND wal_status IN ('reserved', 'extended');");
> > > >
> > > > I think this can unnecessarily lead to reading a lot of WAL data if
> > > > the confirmed_flush_lsn for a slot is too much behind. Can we think of
> > > > improving this by passing the number of records to read which in this
> > > > case should be 1?
> > >
> > > I checked and pg_wal_record_info() seemed to be used for the purpose. I 
> > > tried to
> > > move the functionality to core.
> >
> > IIUC the above query checks if the WAL record written at the slot's
> > confirmed_flush_lsn is a CHECKPOINT_SHUTDOWN, but there is no check if
> > this WAL record is the latest record.
> >
>
> Yeah, I also think there should be some way to ensure this. How about
> passing the number of records to read to this API? Actually, that will
> address my other concern as well where the current API can lead to
> reading an unbounded number of records if the confirmed_flush_lsn
> location is far behind the CHECKPOINT_SHUTDOWN. Do you have any better
> ideas to address it?

It makes sense to me to limit the number of WAL records to read. But
as I mentioned below, if there is a chance of any WAL activity during
the upgrade, I'm not sure what limit to set.

>
> > Therefore, I think it's quite
> > possible that slot's confirmed_flush_lsn points to previous
> > CHECKPOINT_SHUTDOWN, for example, in cases where the subscription was
> > disabled after the publisher shut down and then some changes are made
> > on the publisher. We might want to add that check too but it would not
> > work. Because some WAL records could be written (e.g., by autovacuums)
> > during pg_upgrade before checking the slot's confirmed_flush_lsn.
> >
>
> I think autovacuum is not enabled during the upgrade. See comment "Use
> -b to disable autovacuum." in start_postmaster().

Right, thanks.

> However, I am not
> sure if there can't be any additional WAL from checkpointer or
> bgwriter. Checkpointer has a code that ensures that if there is no
> important WAL activity then it would be skipped. Similarly, bgwriter
> also doesn't LOG xl_running_xacts unless there is an important
> activity.

WAL records for hint bit updates could be generated even in upgrading mode?

> I feel if there is a chance of any WAL activity during the
> upgrade, we need to either change the check to ensure such WAL records
> are expected or document the same in some way.

Yes, but how does it work with the above idea of limiting the number
of WAL records to read? If XLOG_FPI_FOR_HINT can still be generated in
the upgrade mode, we cannot predict how many such records are
generated after the latest CHECKPOINT_SHUTDOWN.

I'm not really sure we should always perform the slot's
confirmed_flush_lsn check by default in the first place. With this
check, the upgrade won't be able to proceed if there is any logical
slot that is not used by logical replication (or something streaming
the changes using walsender), right? For example, if a user uses a
program that periodically consumes the changes from the logical slot,
the slot would not be able to pass the check even if the user executed
pg_logical_slot_get_changes() just before shutdown. The backend
process who consumes the changes is always terminated before the
shutdown checkpoint. On the other hand, I think there are cases where
the user can ensure that no meaningful WAL records are generated after
the last pg_logical_slot_get_changes(). I'm concerned that this check
might make upgrading such cases cumbersome unnecessarily.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



