pg_upgrade - typo in verbose log

2023-05-10 Thread Peter Smith
Hi --

While reviewing another patch for the file info.c, I noticed a
misplaced colon (':') in the verbose logs for print_rel_infos().

PSA patch to fix it.

--
Kind Regards,
Peter Smith.
Fujitsu Australia


v1-0001-pg_upgrade-typo-in-verbose-log.patch
Description: Binary data


Re: Discussion: psql \et -> edit the trigger function

2023-05-10 Thread Pavel Stehule
On Wed, May 10, 2023 at 7:08 PM, Kirk Wolak wrote:

> On Wed, May 10, 2023 at 12:20 PM Pavel Stehule 
> wrote:
>
>> Hi
>>
>> On Wed, May 10, 2023 at 5:33 PM, Kirk Wolak wrote:
>>
>>> We already have
>>> \ef
>>> \ev
>>>
>>> The use case here is simply that it saves me from:
>>> \d 
>>> [scroll through all the fields]
>>> [often scroll right]
>>> select function name
>>> \ef [paste function name]
>>>
>>> and tab completion is much narrower
>>>
>>> When doing conversions and reviews all of this stuff has to be reviewed.
>>> Oftentimes, renamed, touched.
>>>
>>> I am 100% willing to write the code, docs, etc. but would appreciate
>>> feedback.
>>>
>>
>> \et can be a little bit confusing, because it looks like editing a
>> trigger, not a trigger function
>>
>> what about \eft triggername
>>
>> ?
>>
> Pavel, I am "torn" because, with my OCD, I would expect
> \eft 
> to list functions that RETURN TRIGGER, as opposed to the level of
> indirection I was aiming for.
>
> where
> \et 
>   Would specifically let me complete the Trigger_Name, but find the
> function
>
> It makes me wonder now if:
> \etf
>
> is better for this (edit trigger function... given the trigger name).
> And as another poster suggested, as we do the AUTOCOMPLETE for that, we
> could address it for:
> \ef?
>
> because:
> \eft 
> is valuable as well, and deserves to work just like all the \ef? items
>
> It seems like a logical way to break it down.
>

This is a problem, and it isn't easy to find a design that is consistent
and useful. Maybe Tom's proposal "\st" is best, although the "t" can be
ambiguous - it can mean "t" as in table, trigger, or type.

Personally, I don't like editing DDL in psql or pgAdmin. In all my training
I say "don't do it". But on the other hand, I agree it can be handy for
prototyping or for some experimenting.

I think implementing "\st triggername" would be a good start, and we can
continue refining the design later.

My comments:

* Maybe "\str" would be better than just "\st". "\st" alone can be
confusing - at minimum because we use "t" as the symbol for tables

* I think the arguments could be: tablename, triggername, or [tablename
triggername]

It could display more than one trigger when the specification is general
or the result is not unique.

Regards

Pavel

> regards
>>
>> Pavel
>>
>>
>>
>>>
>>> Kirk...
>>>
>>


Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-05-10 Thread Amit Kapila
On Fri, Apr 28, 2023 at 2:35 PM Hayato Kuroda (Fujitsu)
 wrote:
>
> Dear hackers,
>
> I rebased and refined my PoC. The following are the changes:
>

1. Is my understanding correct that this patch creates the delay files
for each transaction? If so, did you consider other approaches such as
using one file to avoid creating many files?
2. For streaming transactions, the changes are first written to a temp
file and then moved to the delay file. That seems like double work. Is it
possible to unify them such that when min_apply_delay is specified, we
just use the delay file, without sacrificing advantages like a streamed
sub-abort being able to truncate the changes?
3. Ideally, there shouldn't be a performance impact of this feature on
regular transactions because the delay file is created only when
min_apply_delay is active, but it would be better to do some testing to
confirm that.

Overall, I think such an approach can address comments by Sawada-San
[1] but not sure if Sawada-San or others have any better ideas to
achieve this feature. It would be good to see what others think of
this approach.

[1] - 
https://www.postgresql.org/message-id/CAD21AoAeG2%2BRsUYD9%2BmEwr8-rrt8R1bqpe56T2D%3DeuO-Qs-GAg%40mail.gmail.com

-- 
With Regards,
Amit Kapila.




Re: walsender performance regression due to logical decoding on standby changes

2023-05-10 Thread Masahiko Sawada
On Wed, May 10, 2023 at 7:33 PM Amit Kapila  wrote:
>
> On Wed, May 10, 2023 at 3:41 PM Zhijie Hou (Fujitsu)
>  wrote:
> >
> > On Wednesday, May 10, 2023 2:36 PM Bharath Rupireddy 
> >  wrote:
> > >
> > >
> > > > My current guess is that mis-using a condition variable is the best
> > > > bet. I think it should work to use ConditionVariablePrepareToSleep()
> > > > before a WalSndWait(), and then ConditionVariableCancelSleep(), i.e. to
> > > > never use ConditionVariableSleep(). The latch set from
> > > > ConditionVariableBroadcast() would still cause the necessary wakeup.
> > >
> > > How about something like the attached? Recovery and subscription tests
> > > don't complain with the patch.
> >
> > Thanks for the patch. I noticed one place where the logic is different
> > from before, and just want to confirm:
> >
> > if (AllowCascadeReplication())
> > -   WalSndWakeup(switchedTLI, true);
> > +   ConditionVariableBroadcast(>cv);
> >
> > After the change, we wake up physical walsenders regardless of the
> > switchedTLI flag. Is this intentional? If so, I think it would be
> > better to update the comments above this.
> >
>
> This raises the question of whether we need this condition variable
> logic for physical walsenders?

It sounds like a good idea. We can have two condition variables, one for
logical and one for physical walsenders, and selectively wake up the
walsenders sleeping on each condition variable. It should work, though it
seems like a bit of a hack.

Regards,


-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: benchmark results comparing versions 15.2 and 16

2023-05-10 Thread Michael Paquier
On Tue, May 09, 2023 at 09:48:24AM -0700, Andres Freund wrote:
> On 2023-05-08 12:11:17 -0700, Andres Freund wrote:
>> I can reproduce a significant regression due to f193883fc of a workload just
>> running
>> SELECT CURRENT_TIMESTAMP;
>> 
>> A single session running it on my workstation via pgbench -Mprepared gets
>> before:
>> tps = 89359.128359 (without initial connection time)
>> after:
>> tps = 83843.585152 (without initial connection time)
>> 
>> Obviously this is an extreme workload, but that nevertheless seems too large
>> to just accept...
> 
> Added an open item for this.

Thanks for the report, I'll come back to it and look at it at the
beginning of next week.  In the worst case, that would mean a revert
of this refactoring, I assume.
--
Michael




Re: Redundant strlen(query) in get_rel_infos

2023-05-10 Thread Michael Paquier
On Thu, May 11, 2023 at 01:06:42PM +1000, Peter Smith wrote:
> While reviewing another patch to the file info.c, I noticed there seem
> to be some unnecessary calls to strlen(query) in get_rel_infos()
> function.
> 
> i.e. The query is explicitly initialized to an empty string
> immediately prior, so why the strlen?

It just looks like this was copied from a surrounding area like
get_db_infos().  Keeping the code as it is would be no big deal either,
but yes, we could just remove them and save the two calls.  So OK by me.
--
Michael




Re: [PATCH] Add native windows on arm64 support

2023-05-10 Thread Michael Paquier
On Thu, May 11, 2023 at 12:17:25PM +1200, Thomas Munro wrote:
> Azure does have an image "Microsoft Windows 11 Preview arm64" to run
> on Ampere CPUs, which says it's a pre-release version intended for
> validation, which sounds approximately like what we want.  I will try
> to find out more.

Interesting.  Thanks for mentioning it.
--
Michael




RE: Time delayed LR (WAS Re: logical replication restrictions)

2023-05-10 Thread Hayato Kuroda (Fujitsu)
Dear Amit-san, Bharath,

Thank you for giving your opinion!

> Some of the other solutions, like MySQL, also have this feature. See
> [2]; you can also read about the other use cases in that article. It
> seems pglogical has this feature, and there is customer demand for the
> same [3].

Additionally, Db2 [1] seems to have a similar feature. If we extend the
scope to DBaaS offerings, RDS for MySQL [2] and TencentDB [3] have it as
well. These may indicate the need for delayed replication.

[1]: 
https://www.ibm.com/docs/en/db2/11.5?topic=parameters-hadr-replay-delay-hadr-replay-delay
[2]: 
https://aws.amazon.com/jp/blogs/database/recover-from-a-disaster-with-delayed-replication-in-amazon-rds-for-mysql/
[3]: https://www.tencentcloud.com/document/product/236/41085

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-05-10 Thread Amit Kapila
On Wed, May 10, 2023 at 5:35 PM Bharath Rupireddy
 wrote:
>
> On Fri, Apr 28, 2023 at 2:35 PM Hayato Kuroda (Fujitsu)
>  wrote:
> >
> > Dear hackers,
> >
> > I rebased and refined my PoC. The following are the changes:
>
> Thanks.
>
> Apologies for being late here. Please bear with me if I'm repeating
> any of the discussed points.
>
> I'm mainly trying to understand the production level use-case behind
> this feature, and for that matter, recovery_min_apply_delay. AFAIK,
> people try to keep the replication lag as small as possible, i.e.
> near zero, to avoid extreme problems on production servers - WAL
> file growth, blocked vacuum, crashes and downtime.
>
> The proposed feature commit message and existing docs about
> recovery_min_apply_delay justify the reason as 'offering opportunities
> to correct data loss errors'. If someone wants to enable
> recovery_min_apply_delay/min_apply_delay on production servers, I'm
> guessing their values will be in hours, not in minutes; for the simple
> reason that when a data loss occurs, people/infrastructure monitoring
> postgres need to know it first and need time to respond with
> corrective actions to recover data loss. When these parameters are
> set, the primary server mustn't be generating too much WAL to avoid
> eventual crash/downtime. Who would really want to be so defensive
> against somebody who may or may not accidentally cause data loss and
> enable these features on production servers (especially when these can
> take down the primary server) and live happily with the induced
> replication lag?
>
> AFAIK, PITR is what people use for recovering from data loss errors in
> production.
>

I think PITR is not a preferred way to achieve this because it can be
quite time-consuming. See how GitLab [1] uses delayed replication in
PostgreSQL. This is one of the use cases I came across, but I am sure
there will be others as well; otherwise, we would not have introduced
this feature in the first place.

Some of the other solutions, like MySQL, also have this feature. See [2];
you can also read about the other use cases in that article. It seems
pglogical has this feature, and there is customer demand for the same [3].

> IMO, before we even go implement the apply delay feature for logical
> replication, it's worth understanding whether induced replication lag
> has any production-level significance.
>

I think the main thing here is to come up with the right design for this
feature. In the last release, we found some blocking problems with the
patch proposed at that time, but Kuroda-San came up with a new patch with
a different design based on the discussion here. I haven't looked at it
yet, though.


[1] - 
https://about.gitlab.com/blog/2019/02/13/delayed-replication-for-disaster-recovery-with-postgresql/
[2] - https://dev.mysql.com/doc/refman/8.0/en/replication-delayed.html
[3] - 
https://www.postgresql.org/message-id/73b06a32-56ab-4056-86ff-e307f3c316f1%40www.fastmail.com

-- 
With Regards,
Amit Kapila.




RE: [PoC] pg_upgrade: allow to upgrade publisher node

2023-05-10 Thread Wei Wang (Fujitsu)
On Tue, May 9, 2023 at 17:44 PM Hayato Kuroda (Fujitsu) 
 wrote:
> Thank you for reviewing! PSA new version.

Thanks for your patches.
Here are some comments on the 0001 patch:

1. In the function getLogicalReplicationSlots
```
+   /*
+    * Note: Currently we do not have any options to include/exclude slots
+    * in dumping, so all the slots must be selected.
+    */
+   slotinfo[i].dobj.dump = DUMP_COMPONENT_ALL;
```
I think currently we are only dumping the definition of logical replication
slots. It seems better to set it as DUMP_COMPONENT_DEFINITION here.

2. In the function dumpLogicalReplicationSlot
```
+   ArchiveEntry(fout, slotinfo->dobj.catId, slotinfo->dobj.dumpId,
+                ARCHIVE_OPTS(.tag = slotname,
+                             .description = "REPLICATION SLOT",
+                             .section = SECTION_POST_DATA,
+                             .createStmt = query->data));
```
I think if we do not set the dropStmt member in the ARCHIVE_OPTS macro
here, then when the option "--logical-replication-slots-only" and the
option "-c/--clean" are specified together, "-c/--clean" will not work.

I think we could use the function pg_drop_replication_slot to set this
member. Then, in the main function in the pg_dump.c file, we should add a
check to prevent specifying the options "--logical-replication-slots-only"
and "--if-exists" together.
Or, we could simply add a check to prevent specifying the options
"--logical-replication-slots-only" and "-c/--clean" together.
What do you think?

Regards,
Wang wei


RE: [PoC] pg_upgrade: allow to upgrade publisher node

2023-05-10 Thread Wei Wang (Fujitsu)
On Thu, May 11, 2023 at 10:12 AM Peter Smith  wrote:
> Hi Kuroda-san. I checked the v11-0001 again.
> 
> Here are a few more review comments.
> 
> ==
> src/bin/pg_dump/pg_dump.c
> 
> 1. help
> 
>   printf(_("  --insertsdump data as INSERT
> commands, rather than COPY\n"));
>   printf(_("  --load-via-partition-rootload partitions via the
> root table\n"));
> + printf(_("  --logical-replication-slots-only\n"
> + "   dump only logical replication slots,
> no schema or data\n"));
>   printf(_("  --no-commentsdo not dump comments\n"));
> 
> Now you removed the PG Docs for the internal pg_dump option based on
> my previous review comment (see [2]#1). So should this "help" text
> also be removed, so that the option is completely invisible to the
> user? I am not sure, but if you do choose to remove this help then
> probably a comment should be added here to explain why it is
> deliberately not listed.

I'm not sure there is any reason not to expose this new option. Do we
have concerns that users who use this new option by mistake may cause
data inconsistencies?

BTW, I think all the options of pg_dump (please see the long_options
array in the main function of the pg_dump.c file) are currently exposed
to the user.

Regards,
Wang wei


Redundant strlen(query) in get_rel_infos

2023-05-10 Thread Peter Smith
Hi.

While reviewing another patch to the file info.c, I noticed there seem
to be some unnecessary calls to strlen(query) in get_rel_infos()
function.

i.e. The query is explicitly initialized to an empty string
immediately prior, so why the strlen?

PSA patch for this.

--
Kind Regards,
Peter Smith.
Fujitsu Australia


v1-0001-Remove-redundant-strlen.patch
Description: Binary data


Re: [PoC] pg_upgrade: allow to upgrade publisher node

2023-05-10 Thread Peter Smith
Hi Kuroda-san. I checked the v11-0001 again.

Here are a few more review comments.

==
src/bin/pg_dump/pg_dump.c

1. help

  printf(_("  --insertsdump data as INSERT
commands, rather than COPY\n"));
  printf(_("  --load-via-partition-rootload partitions via the
root table\n"));
+ printf(_("  --logical-replication-slots-only\n"
+ "   dump only logical replication slots,
no schema or data\n"));
  printf(_("  --no-commentsdo not dump comments\n"));

Now you removed the PG Docs for the internal pg_dump option based on
my previous review comment (see [2]#1). So should this "help" text also
be removed, so that the option is completely invisible to the user? I am
not sure, but if you do choose to remove this help then probably a
comment should be added here to explain why it is deliberately not
listed.

==
src/bin/pg_upgrade/check.c

2. check_new_cluster

Although you wrote "Added", I don't think my previous comment ([1]#8)
was addressed yet.

What I meant to ask was: can that call to get_logical_slot_infos()
be done later, only when you know that option was specified?

e.g

BEFORE
get_logical_slot_infos(_cluster);
...
if (user_opts.include_logical_slots)
check_for_parameter_settings(_cluster);

SUGGESTION
if (user_opts.include_logical_slots)
{
get_logical_slot_infos(_cluster);
check_for_parameter_settings(_cluster);
}

==
src/bin/pg_upgrade/info.c

3. get_db_and_rel_infos

> src/bin/pg_upgrade/info.c
>
> 11. get_db_and_rel_infos
>
> + {
>   get_rel_infos(cluster, >dbarr.dbs[dbnum]);
>
> + /*
> + * Additionally, slot_arr must be initialized because they will be
> + * checked later.
> + */
> + cluster->dbarr.dbs[dbnum].slot_arr.nslots = 0;
> + cluster->dbarr.dbs[dbnum].slot_arr.slots = NULL;
> + }
>
> 11a.
> I think probably it would have been easier to just use 'pg_malloc0'
> instead of 'pg_malloc' in the get_db_infos, then this code would not
> be necessary.
I was not sure whether it was OK to change it like that because of
performance concerns. But OK, fixed.
> 11b.
> BTW, shouldn't this function also be calling free_logical_slot_infos()
> too? That will also have the same effect (initializing the slot_arr)
> but without having to change anything else.

~

Above is your reply ([2]11a). If you were not sure about the pg_malloc0
change, then I think the suggestion ([1]#12b) achieves the same thing and
initializes those fields. You did not reply to 12b, so I wondered if you
had accidentally missed that point.

~~~

4. get_logical_slot_infos

+ for (dbnum = 0; dbnum < cluster->dbarr.ndbs; dbnum++)
+ {
+ DbInfo *pDbInfo = >dbarr.dbs[dbnum];
+
+ if (pDbInfo->slot_arr.slots)
+ free_logical_slot_infos(>slot_arr);

Maybe it is OK, but it seems unusual that get_logical_slot_infos() also
does a free. I didn't notice this same pattern in the other get_XXX
functions. Why is it needed? Even if pDbInfo->slot_arr.slots is not NULL,
is the information stale, or will you just end up re-fetching the same
info?

==
.../pg_upgrade/t/003_logical_replication_slots.pl

5.
+# Preparations for the subsequent test. The case max_replication_slots is set
+# to 0 is prohibit.

/prohibit/prohibited/

--
[1] My v10 review -
https://www.postgresql.org/message-id/CAHut%2BPtpQaKVfqr-8KUtGZqei1C9gWF0%2BY8n1UafvAQeS4G_hg%40mail.gmail.com
[2] Kuroda-san's reply to my v10 review -
https://www.postgresql.org/message-id/TYAPR01MB5866A537AC9AD46E49345A78F5769%40TYAPR01MB5866.jpnprd01.prod.outlook.com

Kind Regards,
Peter Smith.
Fujitsu Australia




Re: 2023-05-11 release announcement draft

2023-05-10 Thread Jonathan S. Katz

On 5/7/23 10:34 PM, David Rowley wrote:


* Fix partition pruning bug with the boolean "IS NOT TRUE" and "IS NOT
FALSE" conditions. NULL partitions were accidentally pruned when they
shouldn't have been.


Thanks for the additional explanation. I took your suggestion verbatim.

Thanks,

Jonathan





Re: [PATCH] Add native windows on arm64 support

2023-05-10 Thread Thomas Munro
On Thu, May 11, 2023 at 11:34 AM Michael Paquier  wrote:
> On Wed, May 10, 2023 at 09:24:39AM -0400, Andrew Dunstan wrote:
> > We will definitely want buildfarm support. I don't have such a machine and
> > am not likely to have one any time soon. I do run drongo and fairywren on an
> > EC2 instance, but AWS doesn't seem to have support for Windows on ARM. Maybe
> > it's available on Azure (maybe then some of our colleagues working at
> > Microsoft could arrange such a beast for me to set up potential buildfarm
> > animal, or else run one themselves.)
>
> Unfortunately AWS does not offer ARM on Windows with an EC2, that's
> something I have also looked at setting up :/

Azure does have an image "Microsoft Windows 11 Preview arm64" to run
on Ampere CPUs, which says it's a pre-release version intended for
validation, which sounds approximately like what we want.  I will try
to find out more.




Re: [PATCH] Add native windows on arm64 support

2023-05-10 Thread Michael Paquier
On Wed, May 10, 2023 at 09:24:39AM -0400, Andrew Dunstan wrote:
> We will definitely want buildfarm support. I don't have such a machine and
> am not likely to have one any time soon. I do run drongo and fairywren on an
> EC2 instance, but AWS doesn't seem to have support for Windows on ARM. Maybe
> it's available on Azure (maybe then some of our colleagues working at
> Microsoft could arrange such a beast for me to set up potential buildfarm
> animal, or else run one themselves.)

Unfortunately AWS does not offer ARM on Windows with an EC2, that's
something I have also looked at setting up :/

Anyway, Niyat has mentioned me that he was working on that, but it
seems that it was off-list.  Niyat, could you confirm this part?
--
Michael




Re: WAL Insertion Lock Improvements

2023-05-10 Thread Michael Paquier
On Wed, May 10, 2023 at 10:40:20PM +0530, Bharath Rupireddy wrote:
> test-case 2: -T900, WAL ~256 bytes - ran for about 3.5 hours and the
> more than 3X improvement in TPS is seen - 3.11X @ 512 3.79 @ 768, 3.47
> @ 1024, 2.27 @ 2048, 2.77 @ 4096
>
> [...]
>
> test-case 2: -t100, WAL ~256 bytes - ran for more than 12 hours
> and the maximum improvement is 1.84X @ 1024 client.

Thanks.  So that's pretty close to what I was seeing when it comes to
this message size, where you see much stronger effects with at least
512 clients.  All of these tests have been using fsync = on, I assume.
I think that disabling fsync, or just mounting pg_wal on a tmpfs, should
show the same pattern for larger record sizes (after 1kB of message size
the curve begins to go down with ~512 clients).
--
Michael




Re: benchmark results comparing versions 15.2 and 16

2023-05-10 Thread David Rowley
On Thu, 11 May 2023 at 01:00, Alexander Lakhin  wrote:
> This time `git bisect` pointed at 3c6fc5820. Having compared execution plans
> (both attached), I see the following differences (3c6fc5820~1 vs 3c6fc5820):

Based on what you've sent, I'm uninspired to try to do anything about
it.  The patched version finds a plan that's cheaper.  The row estimates
are miles off in both plans, and I'm not sure what we're supposed to do
here.  It's pretty hard to make changes to the planner's path generation
without risking that a bad plan is chosen, where one wasn't beforehand,
when the row estimates are bad.

Is the new plan still slower if you increase work_mem so that the sort
no longer goes to disk?  Maybe the planner would have picked Hash
Aggregate if the row estimates had been such that cost_tuplesort()
knew that the sort would have gone to disk.

David




Re: base backup vs. concurrent truncation

2023-05-10 Thread Greg Stark
Including the pre-truncation length in the wal record is the obviously
solid approach and I none of the below is a good substitution for it.
But

On Tue, 25 Apr 2023 at 13:30, Andres Freund  wrote:
>
> It isn't - but the alternatives aren't great either. It's not that easy to hit
> this scenario, so I think something along these lines is more palatable than
> adding a pass through the entire data directory.

Doing one pass through the entire data directory on startup before
deciding the directory is consistent doesn't sound like a crazy idea.
It's pretty easy to imagine bugs in backup software that leave out
files in the middle of tables -- some of us don't even have to
imagine...

Similarly checking for a stray next segment whenever extending a file
to maximum segment size seems like a reasonable thing to check for
too.

These kinds of checks are the kind of paranoia that catches filesystem
bugs, backup software bugs, cron jobs, etc that we don't even know to
watch for.

-- 
greg




Re: Unlinking Parallel Hash Join inner batch files sooner

2023-05-10 Thread Jehan-Guillaume de Rorthais
Hi,

Thanks for working on this!

On Wed, 10 May 2023 15:11:20 +1200
Thomas Munro  wrote:

> One complaint about PHJ is that it can, in rare cases, use a
> surprising amount of temporary disk space where non-parallel HJ would
> not.  When it decides that it needs to double the number of batches to
> try to fit each inner batch into memory, and then again and again
> depending on your level of bad luck, it leaves behind all the earlier
> generations of inner batch files to be cleaned up at the end of the
> query.  That's stupid.  Here's a patch to unlink them sooner, as a
> small improvement.

This patch can indeed save a decent amount of temporary disk space.

Considering its complexity is (currently?) quite low, it's worth it.

> The reason I didn't do this earlier is that sharedtuplestore.c
> continues the pre-existing tradition where each parallel process
> counts what it writes against its own temp_file_limit.  At the time I
> thought I'd need to have one process unlink all the files, but if a
> process were to unlink files that it didn't create, that accounting
> system would break.  Without some new kind of shared temp_file_limit
> mechanism that doesn't currently exist, per-process counters could go
> negative, creating free money.  In the attached patch, I realised
> something that I'd missed before: there is a safe point for each
> backend to unlink just the files that it created, and there is no way
> for a process that created files not to reach that point.

Indeed.

For what it's worth, from my new and inexperienced understanding of the
parallel mechanism, waiting for all workers to reach
WAIT_EVENT_HASH_GROW_BATCHES_REPARTITION, after re-dispatching the old
batches into new ones, seems like a safe place to instruct each worker to
clean up its old temp files.

> Here's an example query that tries 8, 16 and then 32 batches on my
> machine, because reltuples is clobbered with a bogus value.

Nice!

Regards,




RFI: Extending the TOAST Pointer

2023-05-10 Thread Nikita Malakhov
Hi hackers!

There have been several discussions where the limitations of the existing
TOAST pointers were mentioned [1], [2], [3], and from time to time this
topic appears in other places.

We proposed a fresh approach to the TOAST mechanics in [2], but
unfortunately the patch was met quite unfriendly, and after several
iterations it was rejected, although we still have hopes for it and have
several very promising features based on it.

Anyway, the old TOAST pointer is also the cause of problems like [4], and
this part of PostgreSQL is crying out to be revised and improved.

TOAST begins with the pointer to the externalized value - the TOAST
Pointer - which is very limited in terms of what data it can store, and
all TOAST improvements require a revision of this pointer structure. So
we decided to ask the community for thoughts and ideas on how to rework
this pointer.
The TOAST Pointer (the varatt_external structure) stores 4 fields:
[varlena header][<4b - original data size><4b - size in TOAST table><4b -
TOAST table OID><4b - ID of chunk>]
In [2] we proposed a new Custom TOAST pointer structure whose main
feature is extensibility:
[varlena header][<2b - total size of the TOAST pointer><4b - size of
original data><4b - OID of algorithm used for TOASTing>]
The Custom TOAST Pointer is distinguished from the regular one by the
va_flag field, which is part of the varlena header, so the new pointer
format does not interfere with the old (regular) one.
The first field is necessary because the Custom TOAST pointer has
variable length, due to the tail used for inline storage, and the
original value size is used by the executor. The third field could
be a subject for discussion.

Thoughts? Objections?

[1] [PATCH] Infinite loop while acquiring new TOAST Oid

[2] Pluggable Toaster

[3] [PATCH] Compression dictionaries for JSONB

[4] BUG #16722: PG hanging on COPY when table has close to 2^32 toasts in
the table.


-- 
Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company
https://postgrespro.ru/


Re: Subscription suborigin?

2023-05-10 Thread Bruce Momjian


Never mind --- I just realized "sub" is the table prefix.  :-(

---

On Wed, May 10, 2023 at 03:36:31PM -0400, Bruce Momjian wrote:
> This commit:
> 
>   commit 366283961a
>   Author: Amit Kapila 
>   Date:   Thu Jul 21 08:47:38 2022 +0530
>   
>   Allow users to skip logical replication of data having origin.
> 
> has this change:
> 
>   diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
>   index 670a5406d6..a186e35f00 100644
>   --- a/doc/src/sgml/catalogs.sgml
>   +++ b/doc/src/sgml/catalogs.sgml
>   @@ -7943,6 +7943,20 @@ SCRAM-SHA-256$iteration 
> count:
>   see .
>  
> 
>   +
>   + 
>   +  
> -->   +   suborigin text
>   +  
>   +  
> -->   +   The origin value must be either none or
>   +   any. The default is any.
>   +   If none, the subscription will request the 
> publisher
>   +   to only send changes that don't have an origin. If
>   +   any, the publisher sends changes regardless 
> of their
>   +   origin.
>   +  
>   + 
>
>   
>  
> 
> Is 'suborigin' the right column name, and if so, should "The origin
> value" be "The suborigin value"?
> 
> -- 
>   Bruce Momjian  https://momjian.us
>   EDB  https://enterprisedb.com
> 
>   Embrace your flaws.  They make you human, rather than perfect,
>   which you will never be.
> 
> 

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Embrace your flaws.  They make you human, rather than perfect,
  which you will never be.




Subscription suborigin?

2023-05-10 Thread Bruce Momjian
This commit:

commit 366283961a
Author: Amit Kapila 
Date:   Thu Jul 21 08:47:38 2022 +0530

Allow users to skip logical replication of data having origin.

has this change:

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 670a5406d6..a186e35f00 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -7943,6 +7943,20 @@ SCRAM-SHA-256$iteration 
count:
see .
   
  
+
+ 
+  
--> +   suborigin text
+  
+  
--> +   The origin value must be either none or
+   any. The default is any.
+   If none, the subscription will request the 
publisher
+   to only send changes that don't have an origin. If
+   any, the publisher sends changes regardless 
of their
+   origin.
+  
+ 
 

   

Is 'suborigin' the right column name, and if so, should "The origin
value" be "The suborigin value"?

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Embrace your flaws.  They make you human, rather than perfect,
  which you will never be.




Re: smgrzeroextend clarification

2023-05-10 Thread Andres Freund
Hi,

On 2023-05-10 11:50:14 +0200, Peter Eisentraut wrote:
> I was looking at the new smgrzeroextend() function in the smgr API.  The
> documentation isn't very extensive:
> 
> /*
>  *  smgrzeroextend() -- Add new zeroed out blocks to a file.
>  *
>  *  Similar to smgrextend(), except the relation can be extended by
>  *  multiple blocks at once and the added blocks will be filled with
>  *  zeroes.
>  */
> 
> The documentation of smgrextend() is:
> 
> /*
>  *  smgrextend() -- Add a new block to a file.
>  *
>  *  The semantics are nearly the same as smgrwrite(): write at the
>  *  specified position.  However, this is to be used for the case of
>  *  extending a relation (i.e., blocknum is at or beyond the current
>  *  EOF).  Note that we assume writing a block beyond current EOF
>  *  causes intervening file space to become filled with zeroes.
>  */
> 
> So if you want to understand what smgrzeroextend() does, you need to
> mentally combine the documentation of three different functions.  Could we
> write documentation for each function that stands on its own?  And document
> the function arguments, like what does blocknum and nblocks mean?

I guess it couldn't hurt. But if we go down that route, we basically need to
rewrite all the function headers in smgr.c, I think.


> Moreover, the text "except the relation can be extended by multiple blocks
> at once and the added blocks will be filled with zeroes" doesn't make much
> sense as a differentiation, because smgrextend() does that as well.

Hm? smgrextend() writes a single block, and it's filled with the caller
provided buffer.


> AFAICT, the differences between smgrextend() and smgrzeroextend() are:
> 
> 1. smgrextend() writes a payload block in addition to extending the file,
> smgrzeroextend() just extends the file without writing a payload.
> 
> 2. smgrzeroextend() uses various techniques (posix_fallocate() etc.) to make
> sure the extended space is actually reserved on disk, smgrextend() does not.
> 
> #1 seems fine, but the naming of the APIs does not reflect that at all.
> 
> If we think that #2 is important, maybe smgrextend() should do that as well?
> Or at least explain why it's not needed?

smgrextend() does #2 - it just does it by writing data.

The FileFallocate() path in smgrzeroextend() tries to avoid writing data if
extending by sufficient blocks - not having dirty data in the kernel page
cache can substantially reduce the IO usage.

Whereas the FileZero() path just optimizes the number of syscalls (and cache
misses etc).

Greetings,

Andres Freund




Re: BUG: Postgres 14 + vacuum_defer_cleanup_age + FOR UPDATE + UPDATE

2023-05-10 Thread Andres Freund
Hi,

On 2023-05-10 17:44:07 +0200, Peter Eisentraut wrote:
> On 12.03.23 00:41, Andres Freund wrote:
> > Hi,
> > 
> > On 2023-03-11 15:34:55 -0800, Mark Dilger wrote:
> > > > On Mar 11, 2023, at 3:22 PM, Andres Freund  wrote:
> > > > 
> > > > Something like the attached.
> > > 
> > > I like that your patch doesn't make the test longer.  I assume you've 
> > > already run the tests and that it works.
> > 
> > I did check that, yes :). My process of writing perl is certainly, uh,
> > iterative. No way I would get anything close to working without testing it.
> > 
> > CI now finished the tests as well:
> > https://cirrus-ci.com/build/6675457702100992
> > 
> > So I'll go ahead and push that.
> 
> There is a small issue with this commit (a4f23f9b3c).
> 
> In src/bin/pg_amcheck/t/004_verify_heapam.pl, there is code to detect
> whether the page layout matches expectations and if not it calls plan
> skip_all.

Some of these skip_all's don't seem like a good idea. Why is a broken
relfrozenxid a cause for skipping a test? But anyway, that's really unrelated
to the topic at hand.


> This commit adds a test
> 
> is(scalar @lp_off, $ROWCOUNT, "acquired row offsets");
> 
> *before* that skip_all call.  This appears to be invalid.  If the skip_all
> happens, you get a complaint like
> 
> t/004_verify_heapam.pl (Wstat: 0 Tests: 1 Failed: 0)
>   Parse errors: Bad plan.  You planned 0 tests but ran 1.
> 
> We could move the is() test after all the skip_all's.  Any thoughts?

I think the easiest fix is to just die if we can't get the offsets - it's not
like we can really continue afterwards...

Greetings,

Andres Freund




Re:Re: Feature: Add reloption support for table access method

2023-05-10 Thread 吴昊
> > The rest of this mail is to talk about the first issue. It looks reasonable 
> > to add a similar callback in
> > struct TableAmRoutine, and parse reloptions by the callback. This patch is 
> > in the attachment file.
>
> Why did you add relkind to the callbacks? The callbacks are specific to a
> certain relkind, so I don't think that makes sense.
An implementation of a table access method may be used for
tables/toast/matviews, and different relkinds may define different sets
of reloptions. If they have the same reloption set, just ignore the
relkind parameter.
> I don't think we really need GetTableAmRoutineByAmId() that raises nice
> errors etc - as the AM has already been converted to an oid, we shouldn't need
> to recheck?

When defining a relation, the function knows only the access method name,
not the AM routine struct. The AmRoutine needs to be looked up by its
access method name or OID. The existing function calculates the AmRoutine
from the handler OID, not the AM OID.

> > +bytea *
> > +table_reloptions_am(Oid accessMethodId, Datum reloptions, char relkind, 
> > bool validate)
> >  {
> > ...
> > 
> > +   /* built-in table access method put here to fetch TAM fast */
> > +   case HEAP_TABLE_AM_OID:
> > +   tam = GetHeapamTableAmRoutine();
> > +   break;
> > default:
> > -   /* other relkinds are not supported */
> > -   return NULL;
> > +   tam = GetTableAmRoutineByAmId(accessMethodId);
> > +   break;

> Why do we need this fastpath? This shouldn't be something called at a
> meaningful frequency?
OK, it makes sense.

> > }
> > +   return table_reloptions(tam->amoptions, reloptions, relkind, validate);
> > }
>
> I'd just pass the tam, instead of an individual function.
It's aligned with index_reloptions, and the function extractRelOptions
also uses an individual function rather than a pointer to the AmRoutine
struct.



> Did you mean table instead of relation in the comment?


Yes, the comment wasn't updated.


 
> > Another thing about reloption is that the current API passes a parameter 
> > `validate` to tell the parse
> > functioin to check and whether to raise an error. It doesn't have enough 
> > context when these reloptioins
> > are used:
> > 1. CREATE TABLE ... WITH(...)
> > 2. ALTER TABLE ... SET ...
> > 3. ALTER TABLE ... RESET ...
> > The reason why the context matters is that some reloptions are disallowed 
> > to change after creating
> > the table, while some reloptions are allowed.
>
> What kind of reloption are you thinking of here?

DRAFT: The amoptions in TableAmRoutine may change to
```
bytea  *(*amoptions)(Datum reloptions, char relkind, ReloptionContext context);

enum ReloptionContext {
    RELOPTION_INIT,     /* CREATE TABLE ... WITH (...) */
    RELOPTION_SET,      /* ALTER TABLE ... SET ... */
    RELOPTION_RESET,    /* ALTER TABLE ... RESET ... */
    RELOPTION_EXTRACT,  /* build reloptions from pg_class.reloptions */
};
```
The callback always validates the reloptions unless the context is
RELOPTION_EXTRACT. If the TAM disallows updating some reloptions, it may
throw an error when the context is RELOPTION_SET or RELOPTION_RESET.
The similar callback `amoptions` in IndexAmRoutine could also apply this
change. BTW, it's hard to find an appropriate header file to define
ReloptionContext, which is used by both index and table AMs.

Regards,
Hao Wu






de-catalog one error message

2023-05-10 Thread Alvaro Herrera
Hi

While translating the v15 message catalog (yes, I'm quite late!), I
noticed that commit 1f39bce02154 introduced three copies of the
following message in hashagg_batch_read():

+   ereport(ERROR,
+   (errcode_for_file_access(),
+errmsg("unexpected EOF for tape %d: requested %zu bytes, read 
%zu bytes",
+   tapenum, sizeof(uint32), nread)));

These messages should only arise when a hash spill file has gone
corrupt: as I understand, this cannot happen merely because of a full
disk,  because that should fail during _write_ of the file, not read.
And no other user-caused causes should exist.

Therefore, I propose to turn these messages into errmsg_internal(), so
that we do not translate them.  They could even be an elog() and ditch
the errcode, but I see no reason to go that far.  Or we could make them
ERRCODE_DATA_CORRUPTED.


BTW, the aforementioned commit actually appeared in v13, and I
translated the message there at the time.  The reason I noticed this now
is that the %d was changed to %p by commit c4649cce39a4, and it was that
change that triggered me now.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"Small aircraft do not crash frequently ... usually only once!"
  (ponder, http://thedailywtf.com/)




Re: Discussion: psql \et -> edit the trigger function

2023-05-10 Thread Tom Lane
I wrote:
> Hmm, I wonder how useful that's really going to be, considering
> that trigger names aren't unique across tables.  Wouldn't it
> need to be more like "\et table-name trigger-name"?

Different line of thought: \et seems awfully single-purpose.
Perhaps we should think more of "\st table-name trigger-name"
(show trigger), which perhaps could print something along the
lines of

CREATE TRIGGER after_ins_stmt_trig AFTER INSERT ON main_table
FOR EACH STATEMENT EXECUTE FUNCTION trigger_func('after_ins_stmt');

CREATE FUNCTION public.trigger_func()
 RETURNS trigger
... the rest like \sf for the trigger function

If you indeed want to edit the function, it's a quick copy-and-paste
from here.  But if you just want to see the trigger definition,
this is more wieldy than looking at the whole "\d table-name"
output.  Also we have less of an overloading problem with the
command name.

regards, tom lane




createuser --member and PG 16

2023-05-10 Thread Bruce Momjian
In writing the PG 16 release notes, I came upon an oddity in our new
createuser syntax, specifically --role and --member.  It turns out that
--role matches CREATE ROLE ... ROLE IN (and has prior to PG 16) while
the new --member option matches CREATE ROLE ... ROLE.  The PG 16 feature
discussion thread is here:


https://www.postgresql.org/message-id/flat/69a9851035cf0f0477bcc5d742b031a3%40oss.nttdata.com

This seems like it will be forever confusing to people.  I frankly don't
know why --role matching CREATE ROLE ... ROLE IN was not already
confusing in pre-PG 16.  Any new ideas for improvement?

At a minimum I would like to apply the attached doc patch to PG 16 to
improve awkward wording in CREATE ROLE and createuser.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Embrace your flaws.  They make you human, rather than perfect,
  which you will never be.
diff --git a/doc/src/sgml/ref/create_role.sgml b/doc/src/sgml/ref/create_role.sgml
index 4a84461b28..69ea9060bf 100644
--- a/doc/src/sgml/ref/create_role.sgml
+++ b/doc/src/sgml/ref/create_role.sgml
@@ -285,10 +285,11 @@ in sync when changing the above synopsis!
   IN ROLE role_name
   

-The IN ROLE clause lists one or more existing
-roles to which the new role will be immediately added as a new
-member.  (Note that there is no option to add the new role as an
-administrator; use a separate GRANT command to do that.)
+The IN ROLE clause causes the new role to
+be automatically added as a member of the specified existing
+roles. (Note that there is no option to add the new role as an
+administrator; use a separate GRANT command
+to do that.)

   
  
@@ -306,9 +307,9 @@ in sync when changing the above synopsis!
   ROLE role_name
   

-The ROLE clause lists one or more existing
-roles which are automatically added as members of the new role.
-(This in effect makes the new role a group.)
+The ROLE clause causes one or more specified
+existing roles to be automatically added as members of the new
+role.  (This in effect makes the new role a group.)

   
  
diff --git a/doc/src/sgml/ref/createuser.sgml b/doc/src/sgml/ref/createuser.sgml
index c5c74b86a2..58ed111642 100644
--- a/doc/src/sgml/ref/createuser.sgml
+++ b/doc/src/sgml/ref/createuser.sgml
@@ -85,11 +85,10 @@ PostgreSQL documentation
   --admin=role
   

-Indicates a role that will be immediately added as a member of the new
+Indicates an existing role that will be automatically added as a member of the new
 role with admin option, giving it the right to grant membership in the
-new role to others.  Multiple roles to add as members (with admin
-option) of the new role can be specified by writing multiple
--a switches.
+new role to others.  Multiple existing roles can be specified by
+writing multiple -a switches.

   
  
@@ -153,11 +152,10 @@ PostgreSQL documentation
   --role=role
   

- Indicates a role to which this role will be added immediately as a new
- member. Multiple roles to which this role will be added as a member
- can be specified by writing multiple
- -g switches.
- 
+Indicates the new role should be automatically added as a member
+of the specified existing role. Multiple existing roles can be
+specified by writing multiple -g switches.
+   
   
  
 
@@ -227,9 +225,9 @@ PostgreSQL documentation
   --member=role
   

-Indicates role that will be immediately added as a member of the new
-role.  Multiple roles to add as members of the new role can be specified
-by writing multiple -m switches.
+Indicates the specified existing role should be automatically
+added as a member of the new role. Multiple existing roles can
+be specified by writing multiple -m switches.

   
  


Re: Discussion: psql \et -> edit the trigger function

2023-05-10 Thread Tom Lane
Kirk Wolak  writes:
> \et 
>   Would specifically let me complete the Trigger_Name, but find the function

Hmm, I wonder how useful that's really going to be, considering
that trigger names aren't unique across tables.  Wouldn't it
need to be more like "\et table-name trigger-name"?

Also, in a typical database I bet a large fraction of pg_trigger.tgname
entries are going to be "RI_ConstraintTrigger_something".  Are we going
to suppress those?

regards, tom lane




Re: WAL Insertion Lock Improvements

2023-05-10 Thread Bharath Rupireddy
On Tue, May 9, 2023 at 9:24 AM Bharath Rupireddy
 wrote:
>
> On Tue, May 9, 2023 at 9:02 AM Michael Paquier  wrote:
> >
> > On Mon, May 08, 2023 at 04:04:10PM -0700, Nathan Bossart wrote:
> > > On Mon, May 08, 2023 at 05:57:09PM +0530, Bharath Rupireddy wrote:
> > >> test-case 1: -T5, WAL ~16 bytes
> > >> test-case 1: -t1000, WAL ~16 bytes
> > >
> > > I wonder if it's worth doing a couple of long-running tests, too.
> >
> > Yes, 5s or 1000 transactions per client is too small, though it shows
> > that things are going in the right direction.
>
> I'll pick a test case that generates a reasonable amount of WAL 256
> bytes. What do you think of the following?
>
> test-case 2: -T900, WAL ~256 bytes (for c in 1 2 4 8 16 32 64 128 256
> 512 768 1024 2048 4096 - takes 3.5hrs)
> test-case 2: -t100, WAL ~256 bytes
>
> If okay, I'll fire the tests.

test-case 2: -T900, WAL ~256 bytes - ran for about 3.5 hours and a more
than 3X improvement in TPS is seen - 3.11X @ 512, 3.79X @ 768, 3.47X @
1024, 2.27X @ 2048, 2.77X @ 4096 clients.

test-case 2: -T900, WAL ~256 bytes
clients    HEAD      PATCHED
1          1394      1351
2          1551      1445
4          3104      2881
8          5974      5774
16         12154     11319
32         22438     21606
64         43689     40567
128        80726     77993
256        139987    141638
512        60108     187126
768        51188     194406
1024       48766     169353
2048       46617     105961
4096       44163     122697

test-case 2: -t100, WAL ~256 bytes - ran for more than 12 hours and the
maximum improvement is 1.84X @ 1024 clients.

test-case 2: -t100, WAL ~256 bytes
clients    HEAD      PATCHED
1          1454      1500
2          1657      1612
4          3223      3224
8          6305      6295
16         12447     12260
32         24855     24335
64         45229     44386
128        80752     79518
256        120663    119083
512        149546    159396
768        118298    181732
1024       101829    187492
2048       107506    191378
4096       125130    186728

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com


Re: Discussion: psql \et -> edit the trigger function

2023-05-10 Thread Kirk Wolak
On Wed, May 10, 2023 at 12:20 PM Pavel Stehule 
wrote:

> Hi
>
> st 10. 5. 2023 v 17:33 odesílatel Kirk Wolak  napsal:
>
>> We already have
>> \ef
>> \ev
>>
>> The use case here is simply that it saves me from:
>> \d 
>> [scroll through all the fields]
>> [often scroll right]
>> select function name
>> \ef [paste function name]
>>
>> and tab completion is much narrower
>>
>> When doing conversions and reviews all of this stuff has to be reviewed.
>> Oftentimes, renamed, touched.
>>
>> I am 100% willing to write the code, docs, etc. but would appreciate
>> feedback.
>>
>
> \et can be little bit confusing, because looks like editing trigger, not
> trigger function
>
> what \eft triggername
>
> ?
>
Pavel, I am "torn" because of my OCD; I would expect
\eft 
to list functions that RETURN TRIGGER as opposed to the level of
indirection I was aiming for.

where
\et 
  Would specifically let me complete the Trigger_Name, but find the function

It makes me wonder, now if:
\etf

Is better for this (edit trigger function... given the trigger name).
And as another poster suggested.  As we do the AUTOCOMPLETE for that, we
could address it for:
\ef?

because:
\eft 
is valuable as well, and deserves to work just like all \ef? items

It seems like a logical way to break it down.

> regards
>
> Pavel
>
>
>
>>
>> Kirk...
>>
>


Re: [PATCH] Use RelationClose rather than table_close in heap_create_with_catalog

2023-05-10 Thread Tom Lane
Bharath Rupireddy  writes:
> And, the /* do not unlock till end of xact */, it looks like it's been
> there from day 1. It may be indicating that the ref count of the new
> relation created in heap_create_with_catalog() will be decremented at
> the end of xact, but I'm not sure what it means.

Hmm, I think it's been copied-and-pasted from somewhere.  It's quite
common for us to not release locks on modified tables until end of
transaction.  However, that's not what's happening here, because we
actually *don't have any such lock* at this point, as you can easily
prove by stepping through this code and watching the contents of
pg_locks from another session.  We do acquire AccessExclusiveLock
on the new table eventually, but not till control returns to
DefineRelation.

I'm not real sure that I like the proposed code change: it's unclear
to me whether it's an unwise piercing of a couple of abstraction
layers or an okay change because those abstraction layers haven't
yet been applied to the new relation at all.  However, I think the
existing comment is actively misleading and needs to be changed.
Maybe something like

/*
 * Close the relcache entry, since we return only an OID not a
 * relcache reference.  Note that we do not yet hold any lockmanager
 * lock on the new rel, so there's nothing to release.
 */
table_close(new_rel_desc, NoLock);

/*
 * ok, the relation has been cataloged, so close catalogs and return
 * the OID of the newly created relation.
 */
table_close(pg_class_desc, RowExclusiveLock);

Given these comments, maybe changing the first call to RelationClose
would be sensible, but I'm still not quite convinced.

regards, tom lane




Re: Discussion: psql \et -> edit the trigger function

2023-05-10 Thread Dagfinn Ilmari Mannsåker
Kirk Wolak  writes:

> We already have
> \ef
> \ev
>
> The use case here is simply that it saves me from:
> \d 
> [scroll through all the fields]
> [often scroll right]
> select function name
> \ef [paste function name]
>
> and tab completion is much narrower

I think it would make more sense to model it on the filtering letters
available for \df:

  \df[anptw][S+] [FUNCPTRN [TYPEPTRN ...]]
   list [only agg/normal/procedure/trigger/window] functions


I just noticed that tab completion after e.g. \dft does not take the
function type restriction into account, so the solution for \ef
should be made to work for both. I wonder if it would even be possible
to share the tab completion filtering conditions with the actual
implementation of \df.

Also, I notice that \df only tab completes functions (i.e. not
procedures), although it actually returns all routines.

> When doing conversions and reviews all of this stuff has to be reviewed.
> Oftentimes, renamed, touched.
>
> I am 100% willing to write the code, docs, etc. but would appreciate
> feedback.

I'm happy to assist with and review at least the tab completion parts of
this effort.

> Kirk...

- ilmari




Re: Discussion: psql \et -> edit the trigger function

2023-05-10 Thread Pavel Stehule
Hi

st 10. 5. 2023 v 17:33 odesílatel Kirk Wolak  napsal:

> We already have
> \ef
> \ev
>
> The use case here is simply that it saves me from:
> \d 
> [scroll through all the fields]
> [often scroll right]
> select function name
> \ef [paste function name]
>
> and tab completion is much narrower
>
> When doing conversions and reviews all of this stuff has to be reviewed.
> Oftentimes, renamed, touched.
>
> I am 100% willing to write the code, docs, etc. but would appreciate
> feedback.
>

\et can be little bit confusing, because looks like editing trigger, not
trigger function

what \eft triggername

?

regards

Pavel



>
> Kirk...
>


Re: BUG: Postgres 14 + vacuum_defer_cleanup_age + FOR UPDATE + UPDATE

2023-05-10 Thread Peter Eisentraut

On 12.03.23 00:41, Andres Freund wrote:

Hi,

On 2023-03-11 15:34:55 -0800, Mark Dilger wrote:

On Mar 11, 2023, at 3:22 PM, Andres Freund  wrote:

Something like the attached.


I like that your patch doesn't make the test longer.  I assume you've already 
run the tests and that it works.


I did check that, yes :). My process of writing perl is certainly, uh,
iterative. No way I would get anything close to working without testing it.

CI now finished the tests as well:
https://cirrus-ci.com/build/6675457702100992

So I'll go ahead and push that.


There is a small issue with this commit (a4f23f9b3c).

In src/bin/pg_amcheck/t/004_verify_heapam.pl, there is code to detect 
whether the page layout matches expectations and if not it calls plan 
skip_all.


This commit adds a test

is(scalar @lp_off, $ROWCOUNT, "acquired row offsets");

*before* that skip_all call.  This appears to be invalid.  If the 
skip_all happens, you get a complaint like


t/004_verify_heapam.pl (Wstat: 0 Tests: 1 Failed: 0)
  Parse errors: Bad plan.  You planned 0 tests but ran 1.

We could move the is() test after all the skip_all's.  Any thoughts?





Discussion: psql \et -> edit the trigger function

2023-05-10 Thread Kirk Wolak
We already have
\ef
\ev

The use case here is simply that it saves me from:
\d 
[scroll through all the fields]
[often scroll right]
select function name
\ef [paste function name]

and tab completion is much narrower

When doing conversions and reviews all of this stuff has to be reviewed.
Oftentimes, renamed, touched.

I am 100% willing to write the code, docs, etc. but would appreciate
feedback.

Kirk...


Re: [PATCH] Use RelationClose rather than table_close in heap_create_with_catalog

2023-05-10 Thread tender wang
Bharath Rupireddy wrote on Wednesday, May 10, 2023 at 22:17:

> On Sat, Mar 18, 2023 at 12:34 PM Xiaoran Wang  wrote:
> >
> > Hi hackers,
> >
> >  In heap_create_with_catalog, the Relation new_rel_desc is created
> > by RelationBuildLocalRelation, not table_open. So it's better to
> > call RelationClose to release it.
> >
> > What's more, the comment for it seems useless, just delete it.
>
> Essentially, all the close functions are the same with NoLock, IOW,
> table_close(relation, NoLock) = relation_close(relation, NoLock) =
> RelationClose(relation). Therefore, table_close(new_rel_desc, NoLock);
> looks fine to me.

  Agreed.

And, the /* do not unlock till end of xact */, it looks like it's been
> there from day 1. It may be indicating that the ref count of the new
> relation created in heap_create_with_catalog() will be decremented at
> the end of xact, but I'm not sure what it means.
>
  Me too

> --
> Bharath Rupireddy
> PostgreSQL Contributors Team
> RDS Open Source Databases
> Amazon Web Services: https://aws.amazon.com
>
>
>


Re: [PATCH] Use RelationClose rather than table_close in heap_create_with_catalog

2023-05-10 Thread Bharath Rupireddy
On Sat, Mar 18, 2023 at 12:34 PM Xiaoran Wang  wrote:
>
> Hi hackers,
>
>  In heap_create_with_catalog, the Relation new_rel_desc is created
> by RelationBuildLocalRelation, not table_open. So it's better to
> call RelationClose to release it.
>
> What's more, the comment for it seems useless, just delete it.

Essentially, all the close functions are the same with NoLock, IOW,
table_close(relation, NoLock) = relation_close(relation, NoLock) =
RelationClose(relation). Therefore, table_close(new_rel_desc, NoLock);
looks fine to me.

And, the /* do not unlock till end of xact */, it looks like it's been
there from day 1. It may be indicating that the ref count of the new
relation created in heap_create_with_catalog() will be decremented at
the end of xact, but I'm not sure what it means.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: [PATCH] Use RelationClose rather than table_close in heap_create_with_catalog

2023-05-10 Thread Xiaoran Wang
The routine table_close takes 2 params: Relation and LOCKMODE. It first
calls RelationClose to decrease the relation cache reference count, then
deals with the lock on the table based on the LOCKMODE param.

In heap_create_with_catalog, the Relation new_rel_desc is only a local
relation cache entry, created by RelationBuildLocalRelation. No other
processes can see this relation, as the transaction is not committed, so
there is no lock on it.

There is no problem with releasing the relation cache entry by
table_close(new_rel_desc, NoLock) here. However, from my point of view,
the line
table_close(new_rel_desc, NoLock);  /* do not unlock till end of xact */
is a little confusing since there is no lock on the relation at all.
So I think it's better to use RelationClose here.

From: tender wang 
Sent: Wednesday, May 10, 2023 10:57 AM
To: Xiaoran Wang 
Cc: PostgreSQL-development 
Subject: Re: [PATCH] Use RelationClose rather than table_close in 
heap_create_with_catalog



Xiaoran Wang <wxiao...@vmware.com> wrote on Saturday, March 18, 2023 at 15:04:
Hi hackers,

 In heap_create_with_catalog, the Relation new_rel_desc is created
by RelationBuildLocalRelation, not table_open. So it's better to
call RelationClose to release it.
Why is it better to call RelationClose? Is there a problem with using
table_close()?
What's more, the comment for it seems useless, just delete it.

Thanks!

regards, tender wang



Re: Tables getting stuck at 's' state during logical replication

2023-05-10 Thread Bharath Rupireddy
On Tue, May 9, 2023 at 4:38 PM Amit Kapila  wrote:
>
> On Fri, May 5, 2023 at 7:27 PM Padmavathi G  wrote:
> >
> > Some background on the setup on which I am trying to carry out the upgrade:
> >
> > We have a pod in a kubernetes cluster which contains the postgres 11 image. 
> > We are following the logical replication process for upgrade
> >
> > Steps followed for logical replication:
> >
> > 1. Created a new pod in the same kubernetes cluster with the latest 
> > postgres 15 image
> > 2. Created a publication  (say publication 1) in the old pod including all 
> > tables in a database
> > 3. Created a subscription (say subscription 1) in the new pod for the above 
> > mentioned publication
> > 4. When monitoring the subscription via pg_subscription_rel in the 
> > subscriber, I noticed that out of 45 tables 20 were in the 'r' state and 25 
> > were in 's' state and they remained in the same state for almost 2 days, 
> > there was no improvement in the state. But the logs showed that the tables 
> > which had 's' state also had "synchronization workers for  
> > finished".
> >
>
> I think the problem happened in this step where some of the tables
> remained in 's' state. It is not clear why this could happen because
> apply worker should eventually update these relations state to 'r'. If
> this is reproducible, we can try two things to further investigate the
> issue: (a) Disable and Enable the subscription once and see if that
> helps; (b) try increasing the LOG level to DEBUG2 and see if we get
> any useful LOG message.

It looks like the respective apply workers for the tables whose state
was 's' are not able to get to UpdateSubscriptionRelState with
SUBREL_STATE_READY. In addition to what Amit suggested above, is it
possible for you to reproduce this problem on upstream code (reproducing
on HEAD would be ideal, otherwise PG 15 is enough)? If yes, you can add
custom debug messages to see what's happening.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: [PATCH] Add native windows on arm64 support

2023-05-10 Thread Andrew Dunstan


On 2023-05-08 Mo 15:58, Tom Lane wrote:

Andres Freund  writes:

I don't really have feelings either way - but haven't we gone further and even
backpatched things like spinlock support for new arches in the past?

Mmmm ... don't really think those cases were comparable.  We weren't
adding support for a whole new OS.  Now, you might argue that Windows
on arm64 will be just like Windows on x86_64, but I think the jury
is still out on that.  Microsoft was so Intel-only for so many years
that I bet they've had to change quite a bit to make it go on ARM.

Also, the cases of back-patched spinlock support that I can find
in the last few years were pretty low-risk.  I'll grant that
c32fcac56 was a bit blue-sky because hardly anybody had RISC-V
at that point, but by the same token anybody relying on it at the
time would be dealing with a beta-grade OS too.  On the other hand,
1c72d82c2 was immediately testable in the buildfarm, and f3bd00c01
was importing code already verified by our OpenBSD packagers.

As I said upthread, this seems like something to put in at the
beginning of a dev cycle, not post-feature-freeze.




We will definitely want buildfarm support. I don't have such a machine 
and am not likely to have one any time soon. I do run drongo and 
fairywren on an EC2 instance, but AWS doesn't seem to have support for 
Windows on ARM. Maybe it's available on Azure (if so, perhaps some of our
colleagues working at Microsoft could arrange such a beast for me to set
up as a potential buildfarm animal, or else run one themselves).




cheers


andrew

--
Andrew Dunstan
EDB:https://www.enterprisedb.com


Re: benchmark results comparing versions 15.2 and 16

2023-05-10 Thread Alexander Lakhin

08.05.2023 16:00, Alexander Lakhin wrote:

... Having compared 15.3 (56e869a09) with master
(58f5edf84) I haven't seen significant regressions except a few minor ones.
First regression observed with a simple pgbench test:


Another noticeable, but not critical performance degradation is revealed by
query 87 from TPC-DS (I use s64da-benchmark):
https://github.com/swarm64/s64da-benchmark-toolkit/blob/master/benchmarks/tpcds/queries/queries_10/87.sql

With `prepare_benchmark --scale-factor=2`, `run_benchmark --scale-factor=10`
I get on master:
2023-05-10 09:27:52,888 INFO    : finished 80/103: query 87 of stream  0: 2.26s 
OK
but on REL_15_STABLE:
2023-05-10 08:13:40,648 INFO    : finished 80/103: query 87 of stream  0: 1.94s 
OK

This time `git bisect` pointed at 3c6fc5820. Having compared execution plans
(both attached), I see the following differences (3c6fc5820~1 vs 3c6fc5820):
->  Subquery Scan on "*SELECT* 1"  (cost=149622.00..149958.68 rows=16834 width=21) (actual time=1018.606..1074.468 
rows=93891 loops=1)

 ->  Unique  (cost=149622.00..149790.34 rows=16834 width=17) (actual 
time=1018.604..1064.790 rows=93891 loops=1)
  ->  Sort  (cost=149622.00..149664.09 rows=16834 width=17) (actual 
time=1018.603..1052.591 rows=94199 loops=1)
   ->  Gather  (cost=146588.59..148440.33 rows=16834 width=17) (actual 
time=880.899..913.978 rows=94199 loops=1)
vs
->  Subquery Scan on "*SELECT* 1"  (cost=147576.79..149829.53 rows=16091 width=21) (actual time=1126.489..1366.751 
rows=93891 loops=1)

 ->  Unique  (cost=147576.79..149668.62 rows=16091 width=17) (actual 
time=1126.487..1356.938 rows=93891 loops=1)
  ->  Gather Merge  (cost=147576.79..149547.94 rows=16091 width=17) (actual 
time=1126.487..1345.253 rows=94204 loops=1)
   ->  Unique  (cost=146576.78..146737.69 rows=16091 width=17) (actual 
time=1124.426..1306.532 rows=47102 loops=2)
    ->  Sort  (cost=146576.78..146617.01 rows=16091 width=17) (actual 
time=1124.424..1245.110 rows=533434 loops=2)

->  Subquery Scan on "*SELECT* 2"  (cost=52259.82..52428.16 rows=8417 width=21) (actual time=653.640..676.879 rows=62744 loops=1)
 ->  Unique  (cost=52259.82..52343.99 rows=8417 width=17) (actual time=653.639..670.405 rows=62744 loops=1)
  ->  Sort  (cost=52259.82..52280.86 rows=8417 width=17) (actual time=653.637..662.428 rows=62744 loops=1)
   ->  Gather  (cost=50785.20..51711.07 rows=8417 width=17) (actual time=562.158..571.737 rows=62744 loops=1)
    ->  HashAggregate  (cost=49785.20..49869.37 rows=8417 width=17) (actual time=538.263..544.336 rows=31372 loops=2)
     ->  Nested Loop  (cost=0.85..49722.07 rows=8417 width=17) (actual time=2.049..469.747 rows=284349 loops=2)
vs
->  Subquery Scan on "*SELECT* 2"  (cost=48503.68..49630.12 rows=8046 width=21) (actual time=700.050..828.388 rows=62744 loops=1)
 ->  Unique  (cost=48503.68..49549.66 rows=8046 width=17) (actual time=700.047..821.836 rows=62744 loops=1)
  ->  Gather Merge  (cost=48503.68..49489.31 rows=8046 width=17) (actual time=700.047..814.136 rows=62744 loops=1)
   ->  Unique  (cost=47503.67..47584.13 rows=8046 width=17) (actual time=666.348..763.403 rows=31372 loops=2)
    ->  Sort  (cost=47503.67..47523.78 rows=8046 width=17) (actual time=666.347..730.336 rows=284349 loops=2)
     ->  Nested Loop  (cost=0.85..46981.72 rows=8046 width=17) (actual time=1.852..454.111 rows=284349 loops=2)

->  Subquery Scan on "*SELECT* 3"  (cost=50608.83..51568.70 rows=7165 width=21) (actual time=302.571..405.305 rows=23737 loops=1)
 ->  Unique  (cost=50608.83..51497.05 rows=7165 width=17) (actual time=302.568..402.818 rows=23737 loops=1)
  ->  Gather Merge  (cost=50608.83..51443.31 rows=7165 width=17) (actual time=302.567..372.246 rows=287761 loops=1)
   ->  Sort  (cost=49608.81..49616.27 rows=2985 width=17) (actual time=298.204..310.075 rows=95920 loops=3)
    ->  Nested Loop  (cost=2570.65..49436.52 rows=2985 width=17) (actual time=3.205..229.192 rows=95920 loops=3)
vs
->  Subquery Scan on "*SELECT* 3"  (cost=50541.84..51329.11 rows=5708 width=21) (actual time=302.042..336.820 rows=23737 loops=1)
 ->  Unique  (cost=50541.84..51272.03 rows=5708 width=17) (actual time=302.039..334.329 rows=23737 loops=1)
  ->  Gather Merge  (cost=50541.84..51229.22 rows=5708 width=17) (actual time=302.039..331.296 rows=24128 loops=1)
   ->  Unique  (cost=49541.81..49570.35 rows=2854 width=17) (actual time=298.771..320.560 rows=8043 loops=3)
    ->  Sort  (cost=49541.81..49548.95 rows=2854 width=17) (actual time=298.770..309.603 rows=95920 loops=3)
     ->  Nested Loop  (cost=2570.52..49378.01 rows=2854 width=17) (actual time=3.209..230.291 rows=95920 loops=3)

From the commit message and the discussion [1], it's not clear to me whether
this plan change is expected. Perhaps it's too minor an issue to bring attention
to, but maybe this information could be useful for v16 performance analysis.

[1] 
https://postgr.es/m/CAApHDvo8Lz2H=42urbbfp65ltceuoh288mt7dsg2_ewtw1a...@mail.gmail.com

Best regards,
Alexander 

Re: smgrzeroextend clarification

2023-05-10 Thread Bharath Rupireddy
On Wed, May 10, 2023 at 3:20 PM Peter Eisentraut
 wrote:
>
> I was looking at the new smgrzeroextend() function in the smgr API.  The
> documentation isn't very extensive:
>
> /*
>   *  smgrzeroextend() -- Add new zeroed out blocks to a file.
>   *
>   *  Similar to smgrextend(), except the relation can be extended by
>   *  multiple blocks at once and the added blocks will be filled with
>   *  zeroes.
>   */
>
> The documentation of smgrextend() is:
>
> /*
>   *  smgrextend() -- Add a new block to a file.
>   *
>   *  The semantics are nearly the same as smgrwrite(): write at the
>   *  specified position.  However, this is to be used for the case of
>   *  extending a relation (i.e., blocknum is at or beyond the current
>   *  EOF).  Note that we assume writing a block beyond current EOF
>   *  causes intervening file space to become filled with zeroes.
>   */
>
> So if you want to understand what smgrzeroextend() does, you need to
> mentally combine the documentation of three different functions.  Could
> we write documentation for each function that stands on its own?  And
> document the function arguments, like what does blocknum and nblocks mean?

Why not?

> Moreover, the text "except the relation can be extended by multiple
> blocks at once and the added blocks will be filled with zeroes" doesn't
> make much sense as a differentiation, because smgrextend() does that as
> well.

Not exactly. smgrextend() doesn't write a zeroed block on its own; it
writes the content that's passed to it via 'buffer'. It's just that
some smgrextend() callers pass in a zero buffer. smgrzeroextend(), on
the other hand, writes zeroed blocks on its own, something like
smgrextend() called in a loop with a zeroed 'buffer'. Therefore, the
existing wording seems fine to me.

> AFAICT, the differences between smgrextend() and smgrzeroextend() are:
>
> 1. smgrextend() writes a payload block in addition to extending the
> file, smgrzeroextend() just extends the file without writing a payload.

I think how smgrzeroextend() extends the zeroed blocks is internal to
it, and mdzeroextend() happens to use fallocate (if available). The
existing wording around smgrzeroextend() seems fine to me.

> 2. smgrzeroextend() uses various techniques (posix_fallocate() etc.) to
> make sure the extended space is actually reserved on disk, smgrextend()
> does not.

It's not smgrzeroextend() per se, it is mdzeroextend() that uses
fallocate() if available.

Overall, +0.5 from me if we want to avoid comment traversals to
understand what these functions do and be more descriptive; we might
end up duplicating comments. But I'm fine with keeping "except the
relation can be extended by multiple ..." before smgrzeroextend().

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Memory leak from ExecutorState context?

2023-05-10 Thread Jehan-Guillaume de Rorthais
Thank you for your review!

On Mon, 8 May 2023 11:56:48 -0400
Melanie Plageman  wrote:

> ...
> > 4. accessor->read_buffer is currently allocated in accessor->context as
> > well.
> > 
> >This buffer holds tuple read from the fileset. This is still a buffer,
> > but not related to any file anymore...
> > 
> > Because of 3 and 4, and the gross context granularity of SharedTuplestore
> > related-code, I'm now wondering if "fileCxt" shouldn't be renamed "bufCxt".
> 
> I think bufCxt is a potentially confusing name. The context contains
> pointers to the batch files and saying those are related to buffers is
> confusing. It also sounds like it could include any kind of buffer
> related to the hashtable or hash join.
> 
> Perhaps we could call this memory context the "spillCxt"--indicating it
> is for the memory required for spilling to permanent storage while
> executing hash joins.

"Spilling" seems fair and a large enough net to grab everything around temp
files and accessing them.

> I discuss this more in my code review below.
> 
> > From c5ed2ae2c2749af4f5058b012dc5e8a9e1529127 Mon Sep 17 00:00:00 2001
> > From: Jehan-Guillaume de Rorthais 
> > Date: Mon, 27 Mar 2023 15:54:39 +0200
> > Subject: [PATCH 2/3] Allocate hash batches related BufFile in a dedicated
> >  context  
> 
> > diff --git a/src/backend/executor/nodeHash.c
> > b/src/backend/executor/nodeHash.c index 5fd1c5553b..a4fbf29301 100644
> > --- a/src/backend/executor/nodeHash.c
> > +++ b/src/backend/executor/nodeHash.c
> > @@ -570,15 +574,21 @@ ExecHashTableCreate(HashState *state, List
> > *hashOperators, List *hashCollations, 
> > if (nbatch > 1 && hashtable->parallel_state == NULL)
> > {
> > +   MemoryContext oldctx;
> > +
> > /*
> >  * allocate and initialize the file arrays in hashCxt (not
> > needed for
> >  * parallel case which uses shared tuplestores instead of
> > raw files) */
> > +   oldctx = MemoryContextSwitchTo(hashtable->fileCxt);
> > +
> > hashtable->innerBatchFile = palloc0_array(BufFile *,
> > nbatch); hashtable->outerBatchFile = palloc0_array(BufFile *, nbatch);
> > /* The files will not be opened until needed... */
> > /* ... but make sure we have temp tablespaces established
> > for them */  
> 
> I haven't studied it closely, but I wonder if we shouldn't switch back
> into the oldctx before calling PrepareTempTablespaces().
> PrepareTempTablespaces() is explicit about what memory context it is
> allocating in, however, I'm not sure it makes sense to call it in the
> fileCxt. If you have a reason, you should add a comment and do the same
> in ExecHashIncreaseNumBatches().

I had no reason. I caught it in ExecHashIncreaseNumBatches() while reviewing
myself, but not here.

Line moved in v6.

> > @@ -934,13 +943,16 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
> >hashtable, nbatch, hashtable->spaceUsed);
> >  #endif
> >  
> > -   oldcxt = MemoryContextSwitchTo(hashtable->hashCxt);
> > -
> > if (hashtable->innerBatchFile == NULL)
> > {
> > +   MemoryContext oldcxt =
> > MemoryContextSwitchTo(hashtable->fileCxt); +
> > /* we had no file arrays before */
> > hashtable->innerBatchFile = palloc0_array(BufFile *,
> > nbatch); hashtable->outerBatchFile = palloc0_array(BufFile *, nbatch);
> > +  
> > +   MemoryContextSwitchTo(oldcxt);
> > +
> > /* time to establish the temp tablespaces, too */
> > PrepareTempTablespaces();
> > }
> > @@ -951,8 +963,6 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)  
> 
> I don't see a reason to call repalloc0_array() in a different context
> than the initial palloc0_array().

Unless I'm wrong, wrapping the whole if/else block in the hashtable->fileCxt
memory context seemed useless, as repalloc() actually reallocates in the
context where the original allocation occurred. So I only wrapped the palloc()
calls.

> > diff --git a/src/backend/executor/nodeHashjoin.c
> > ...
> > @@ -1308,21 +1310,27 @@ ExecParallelHashJoinNewBatch(HashJoinState
> > *hjstate)
> >   * The data recorded in the file for each tuple is its hash value,  
> 
> It doesn't sound accurate to me to say that it should be called *in* the
> HashBatchFiles context. We now switch into that context, so you can
> probably remove this comment and add a comment above the switch into the
> filecxt which explains that the temp file buffers should be allocated in
> the filecxt (both because they need to be allocated in a sufficiently
> long-lived context and to increase visibility of their memory overhead).

Indeed. Comment moved and reworded in v6.

> > diff --git a/src/include/executor/hashjoin.h
> > b/src/include/executor/hashjoin.h index 8ee59d2c71..74867c3e40 100644
> > --- a/src/include/executor/hashjoin.h
> > +++ b/src/include/executor/hashjoin.h
> > @@ -25,10 +25,14 @@
> >   *
> >   * Each active hashjoin has a HashJoinTable control block, which is

Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-05-10 Thread Bharath Rupireddy
On Fri, Apr 28, 2023 at 2:35 PM Hayato Kuroda (Fujitsu)
 wrote:
>
> Dear hackers,
>
> I rebased and refined my PoC. Followings are the changes:

Thanks.

Apologies for being late here. Please bear with me if I'm repeating
any of the discussed points.

I'm mainly trying to understand the production-level use-case behind
this feature, and for that matter, recovery_min_apply_delay. AFAIK,
people try to keep the replication lag as small as possible, i.e.
near zero, to avoid the extreme problems on production servers - WAL
file growth, blocked vacuum, crash and downtime.

The proposed feature commit message and existing docs about
recovery_min_apply_delay justify the reason as 'offering opportunities
to correct data loss errors'. If someone wants to enable
recovery_min_apply_delay/min_apply_delay on production servers, I'm
guessing their values will be in hours, not in minutes; for the simple
reason that when a data loss occurs, people/infrastructure monitoring
postgres need to know it first and need time to respond with
corrective actions to recover data loss. When these parameters are
set, the primary server mustn't be generating too much WAL to avoid
eventual crash/downtime. Who would really want to be so defensive
against somebody who may or may not accidentally cause data loss and
enable these features on production servers (especially when these can
take down the primary server) and live happily with the induced
replication lag?

AFAIK, PITR is what people use for recovering from data loss errors in
production.

IMO, before we even go implement the apply delay feature for logical
replication, it's worth understanding whether induced replication lags
have any production-level significance. We can also debate whether
providing apply delay hooks is any better, with simple out-of-the-box
extensions, as opposed to the core providing these features.

Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: WAL Insertion Lock Improvements

2023-05-10 Thread Michael Paquier
On Mon, May 08, 2023 at 05:57:09PM +0530, Bharath Rupireddy wrote:
> Note that I've used pg_logical_emit_message() for ease of
> understanding about the txns generating various amounts of WAL, but
> the pattern is the same if txns are generating various amounts of WAL
> say with inserts.

Sounds good to me to just rely on that for some comparison numbers.

+* NB: LWLockConflictsWithVar (which is called from
+* LWLockWaitForVar) relies on the spinlock used above in this
+* function and doesn't use a memory barrier.

This patch adds the following comment in WaitXLogInsertionsToFinish()
because lwlock.c on HEAD mentions that:
/*
 * Test first to see if it the slot is free right now.
 *
 * XXX: the caller uses a spinlock before this, so we don't need a memory
 * barrier here as far as the current usage is concerned.  But that might
 * not be safe in general.
 */

Should this be something we'd better be noisy about at the top of
LWLockWaitForVar()?  We don't want to add a memory barrier at the
beginning of LWLockConflictsWithVar(), still it strikes me that
somebody that aims at using LWLockWaitForVar() may miss this point
because LWLockWaitForVar() is the routine published in lwlock.h, not
LWLockConflictsWithVar().  This does not need to be really
complicated, say a note at the top of LWLockWaitForVar() along the
lines of (?):
"Be careful that LWLockConflictsWithVar() does not include a memory
barrier, hence the caller of this function may want to rely on an
explicit barrier or a spinlock to avoid memory ordering issues."

>> +* NB: Note the use of pg_atomic_exchange_u64 as opposed to just
>> +* pg_atomic_write_u64 to update the value. Since pg_atomic_exchange_u64 
>> is
>> +* a full barrier, we're guaranteed that the subsequent shared memory
>> +* reads/writes, if any, happen after we reset the lock variable.
>>
>> This mentions that the subsequent read/write operations are safe, so
>> this refers to anything happening after the variable is reset.  As
>> a full barrier, should be also mention that this is also ordered with
>> respect to anything that the caller did before clearing the variable?
>> From this perspective using pg_atomic_exchange_u64() makes sense to me
>> in LWLockReleaseClearVar().
>
> Wordsmithed that comment a bit.

-* Set the variable's value before releasing the lock, that prevents race
-* a race condition wherein a new locker acquires the lock, but hasn't yet
-* set the variables value.
[...]
+* NB: pg_atomic_exchange_u64 is used here as opposed to just
+* pg_atomic_write_u64 to update the variable. Since pg_atomic_exchange_u64
+* is a full barrier, we're guaranteed that all loads and stores issued
+* prior to setting the variable are completed before any loads or stores
+* issued after setting the variable.

This is the same explanation as LWLockUpdateVar(), except that we
lose the details explaining why we are doing the update before
releasing the lock.

It took me some time, but I have been able to deploy a big box to see
the effect of this patch at a rather large scale (64 vCPU, 512G of
memory), with the following test characteristics for HEAD and v6:
- TPS comparison with pgbench and pg_logical_emit_message().
- Record sizes of 16, 64, 256, 1k, 4k and 16k.
- Clients and jobs equal at 4, 16, 64, 256, 512, 1024, 2048, 4096.
- Runs of 3 mins for each of the 48 combinations, meaning 96 runs in
total.

And here are the results I got:
message_size_b| 16 | 64 |256 |   1024 |  4096 |   16k
--|||||---|---
head_4_clients|   3026 |   2965 |   2846 |   2880 |  2778 |  2412
head_16_clients   |  12087 |  11287 |  11670 |  11100 |  9003 |  5608
head_64_clients   |  42995 |  44005 |  43592 |  35437 | 21533 | 11273
head_256_clients  | 106775 | 109233 | 104201 |  80759 | 42118 | 16254
head_512_clients  | 153849 | 156950 | 142915 |  99288 | 57714 | 16198
head_1024_clients | 122102 | 123895 | 114248 | 117317 | 62270 | 16261
head_2048_clients | 126730 | 115594 | 109671 | 119564 | 62454 | 16298
head_4096_clients | 111564 | 111697 | 119164 | 123483 | 62430 | 16140
v6_4_clients  |   2893 |   2917 |   3087 |   2904 |  2786 |  2262
v6_16_clients |  12097 |  11387 |  11579 |  11242 |  9228 |  5661
v6_64_clients |  45124 |  46533 |  42275 |  36124 | 21696 | 11386
v6_256_clients| 121500 | 125732 | 104328 |  78989 | 41949 | 16254
v6_512_clients| 164120 | 174743 | 146677 |  98110 | 60228 | 16171
v6_1024_clients   | 168990 | 180710 | 149894 | 117431 | 62271 | 16259
v6_2048_clients   | 165426 | 162893 | 146322 | 132476 | 62468 | 16274
v6_4096_clients   | 161283 | 158732 | 162474 | 135636 | 62461 | 16030

These tests are not showing me any degradation, and a correlation
between the record size and the number of clients where the TPS begins
to show a difference between HEAD and v6 of the patch.  In short the
shorter the record, the 

Re: Subscription statistics are not dropped at DROP SUBSCRIPTION in some cases

2023-05-10 Thread Melih Mutlu
Hi,

Masahiko Sawada wrote the following on Mon, 8 May 2023 at 10:24:

> I've attached the patch. Feedback is very welcome.
>

Thanks for the patch, nice catch.
I can confirm that the issue exists on HEAD and gets resolved by this
patch. Also it looks like stats are really not affected if the
transaction fails for some reason, as you explained.
IMO, the patch will be OK after a commit message is added.

Thanks,
-- 
Melih Mutlu
Microsoft


Re: Add two missing tests in 035_standby_logical_decoding.pl

2023-05-10 Thread Amit Kapila
On Tue, May 9, 2023 at 12:44 PM Drouvot, Bertrand
 wrote:
>
> On 5/9/23 8:02 AM, Amit Kapila wrote:
> > On Mon, May 8, 2023 at 1:45 PM Drouvot, Bertrand
> >  wrote:
>
> >
> > Why not initialize the cascading standby node just before the standby
> > promotion test: "Test standby promotion and logical decoding behavior
> > after the standby gets promoted."? That way we will avoid any unknown
> > side-effects of cascading standby and it will anyway look more logical
> > to initialize it where the test needs it.
> >
>
> Yeah, that's even better. Moved the physical slot creation on the standby
> and the cascading standby initialization where "strictly" needed in V2
> attached.
>
> Also ensuring that hsf is set to on on the cascading standby to be on the
> safe side of thing.
>

Pushed this yesterday.

-- 
With Regards,
Amit Kapila.




Re: walsender performance regression due to logical decoding on standby changes

2023-05-10 Thread Amit Kapila
On Wed, May 10, 2023 at 3:41 PM Zhijie Hou (Fujitsu)
 wrote:
>
> On Wednesday, May 10, 2023 2:36 PM Bharath Rupireddy 
>  wrote:
> >
> >
> > > My current guess is that mis-using a condition variable is the best bet. I
> > > think it should work to use ConditionVariablePrepareToSleep() before a
> > > WalSndWait(), and then ConditionVariableCancelSleep(). I.e. to never use
> > > ConditionVariableSleep(). The latch set from ConditionVariableBroadcast()
> > > would still cause the necessary wakeup.
> >
> > How about something like the attached? Recovery and subscription tests
> > don't complain with the patch.
>
> Thanks for the patch. I noticed one place where the logic is different from
> before, and just to confirm:
>
> if (AllowCascadeReplication())
> -   WalSndWakeup(switchedTLI, true);
> +   ConditionVariableBroadcast(&WalSndCtl->cv);
>
> After the change, we wake up physical walsenders regardless of the switchedTLI flag.
> Is this intentional? If so, I think it would be better to update the
> comments above this.
>

This raises the question of whether we need this condition variable
logic for physical walsenders?

-- 
With Regards,
Amit Kapila.




RE: walsender performance regression due to logical decoding on standby changes

2023-05-10 Thread Zhijie Hou (Fujitsu)

On Wednesday, May 10, 2023 2:36 PM Bharath Rupireddy 
 wrote:
> 
> On Wed, May 10, 2023 at 12:33 AM Andres Freund 
> wrote:
> >
> > Unfortunately I have found the following commit to have caused a
> performance
> > regression:
> >
> > commit e101dfac3a53c20bfbf1ca85d30a368c2954facf
> >
> > The problem is that, on a standby, after the change - as needed to for the
> > approach to work - the call to WalSndWakeup() in ApplyWalRecord()
> happens for
> > every record, instead of only happening when the timeline is changed (or
> WAL
> > is flushed or ...).
> >
> > WalSndWakeup() iterates over all walsender slots, regardless of whether in
> > use. For each of the walsender slots it acquires a spinlock.
> >
> > When replaying a lot of small-ish WAL records I found the startup process to
> > spend the majority of the time in WalSndWakeup(). I've not measured it very
> > precisely yet, but the overhead is significant (~35% slowdown), even with 
> > the
> > default max_wal_senders. If that's increased substantially, it obviously 
> > gets
> > worse.
> 
> I played it with a simple primary -> standby1 -> standby2 setup. I ran
> a pgbench script [1] on primary and counted the number of times
> WalSndWakeup() gets called from ApplyWalRecord() and the number of
> times spinlock is acquired/released in WalSndWakeup(). It's a whopping
> 21 million times spinlock is acquired/released on the standby 1 and
> standby 2 for just a < 5min of pgbench run on the primary:
> 
> standby 1:
> 2023-05-10 05:32:43.249 UTC [1595600] LOG:  FOO WalSndWakeup() in
> ApplyWalRecord() was called 2176352 times
> 2023-05-10 05:32:43.249 UTC [1595600] LOG:  FOO spinlock
> acquisition/release count in WalSndWakeup() is 21763530
> 
> standby 2:
> 2023-05-10 05:32:43.249 UTC [1595625] LOG:  FOO WalSndWakeup() in
> ApplyWalRecord() was called 2176352 times
> 2023-05-10 05:32:43.249 UTC [1595625] LOG:  FOO spinlock
> acquisition/release count in WalSndWakeup() is 21763530
> 
> In this case, there is no timeline switch or no logical decoding on
> the standby or such, but WalSndWakeup() gets called because the
> standby can't make out if the slot is for logical or physical
> replication unless spinlock is acquired. Before e101dfac3a,
> WalSndWakeup() was getting called only when there was a timeline
> switch.
> 
> > The only saving grace is that this is not an issue on the primary.
> 
> Yeah.
> 
> > I don't think the approach of not having any sort of "registry" of whether
> > anybody is waiting for the replay position to be updated is
> > feasible. Iterating over all walsenders slots is just too expensive -
> > WalSndWakeup() shows up even if I remove the spinlock (which we likely
> could,
> > especially when just checking if the the walsender is connected).
> 
> Right.
> 
> > My current guess is that mis-using a condition variable is the best bet. I
> > think it should work to use ConditionVariablePrepareToSleep() before a
> > WalSndWait(), and then ConditionVariableCancelSleep(). I.e. to never use
> > ConditionVariableSleep(). The latch set from ConditionVariableBroadcast()
> > would still cause the necessary wakeup.
> 
> How about something like the attached? Recovery and subscription tests
> don't complain with the patch.

Thanks for the patch. I noticed one place where the logic is different from
before, and just to confirm:

if (AllowCascadeReplication())
-   WalSndWakeup(switchedTLI, true);
+   ConditionVariableBroadcast(&WalSndCtl->cv);

After the change, we wake up physical walsenders regardless of the switchedTLI flag.
Is this intentional? If so, I think it would be better to update the comments
above this.

Best Regards,
Hou zj


Re: walsender performance regression due to logical decoding on standby changes

2023-05-10 Thread Drouvot, Bertrand

Hi,

On 5/10/23 8:36 AM, Bharath Rupireddy wrote:

On Wed, May 10, 2023 at 12:33 AM Andres Freund  wrote:


Unfortunately I have found the following commit to have caused a performance
regression:

commit e101dfac3a53c20bfbf1ca85d30a368c2954facf

The problem is that, on a standby, after the change - as needed to for the
approach to work - the call to WalSndWakeup() in ApplyWalRecord() happens for
every record, instead of only happening when the timeline is changed (or WAL
is flushed or ...).

WalSndWakeup() iterates over all walsender slots, regardless of whether in
use. For each of the walsender slots it acquires a spinlock.

When replaying a lot of small-ish WAL records I found the startup process to
spend the majority of the time in WalSndWakeup(). I've not measured it very
precisely yet, but the overhead is significant (~35% slowdown), even with the
default max_wal_senders. If that's increased substantially, it obviously gets
worse.




Thanks Andres for the call out! I do agree that this is a concern.


The only saving grace is that this is not an issue on the primary.


Yeah.


+1




My current guess is that mis-using a condition variable is the best bet. I
think it should work to use ConditionVariablePrepareToSleep() before a
WalSndWait(), and then ConditionVariableCancelSleep(). I.e. to never use
ConditionVariableSleep(). The latch set from ConditionVariableBroadcast()
would still cause the necessary wakeup.


Yeah, I think that "mis-using" a condition variable is a valid option. Unless 
I'm missing
something, I don't think there is anything wrong with using a CV that way (aka 
not using
ConditionVariableTimedSleep() or ConditionVariableSleep() in this particular 
case).



How about something like the attached? Recovery and subscription tests
don't complain with the patch.


Thanks Bharath for looking at it!

I launched a full Cirrus CI test with it but it failed in one environment (did
not look into details, just sharing this here):
https://cirrus-ci.com/task/6570140767092736

Also I have a few comments:

@@ -1958,7 +1959,7 @@ ApplyWalRecord(XLogReaderState *xlogreader, XLogRecord 
*record, TimeLineID *repl
 * --
 */
if (AllowCascadeReplication())
-   WalSndWakeup(switchedTLI, true);
+   ConditionVariableBroadcast(&WalSndCtl->cv);

I think the comment above this change needs to be updated.


@@ -3368,9 +3370,13 @@ WalSndWait(uint32 socket_events, long timeout, uint32 
wait_event)
WaitEvent   event;

ModifyWaitEvent(FeBeWaitSet, FeBeWaitSetSocketPos, socket_events, NULL);
+
+   ConditionVariablePrepareToSleep(&WalSndCtl->cv);
if (WaitEventSetWait(FeBeWaitSet, timeout, &event, 1, wait_event) == 1 &&
(event.events & WL_POSTMASTER_DEATH))
proc_exit(1);
+
+   ConditionVariableCancelSleep();

Maybe worth updating the comment above WalSndWait() too? (to explain why CV
handling is part of it).


@@ -108,6 +109,8 @@ typedef struct
 */
boolsync_standbys_defined;

+   ConditionVariable cv;

Worth giving it a more meaningful name? (to give a clue as to when it is used)


I think we also need to update the comment above WalSndWakeup():

"
 * For cascading replication we need to wake up physical walsenders separately
 * from logical walsenders (see the comment before calling WalSndWakeup() in
 * ApplyWalRecord() for more details).
"

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




smgrzeroextend clarification

2023-05-10 Thread Peter Eisentraut
I was looking at the new smgrzeroextend() function in the smgr API.  The 
documentation isn't very extensive:


/*
 *  smgrzeroextend() -- Add new zeroed out blocks to a file.
 *
 *  Similar to smgrextend(), except the relation can be extended by
 *  multiple blocks at once and the added blocks will be filled with
 *  zeroes.
 */

The documentation of smgrextend() is:

/*
 *  smgrextend() -- Add a new block to a file.
 *
 *  The semantics are nearly the same as smgrwrite(): write at the
 *  specified position.  However, this is to be used for the case of
 *  extending a relation (i.e., blocknum is at or beyond the current
 *  EOF).  Note that we assume writing a block beyond current EOF
 *  causes intervening file space to become filled with zeroes.
 */

So if you want to understand what smgrzeroextend() does, you need to 
mentally combine the documentation of three different functions.  Could 
we write documentation for each function that stands on its own?  And 
document the function arguments, like what does blocknum and nblocks mean?


Moreover, the text "except the relation can be extended by multiple 
blocks at once and the added blocks will be filled with zeroes" doesn't 
make much sense as a differentiation, because smgrextend() does that as 
well.


AFAICT, the differences between smgrextend() and smgrzeroextend() are:

1. smgrextend() writes a payload block in addition to extending the 
file, smgrzeroextend() just extends the file without writing a payload.


2. smgrzeroextend() uses various techniques (posix_fallocate() etc.) to 
make sure the extended space is actually reserved on disk, smgrextend() 
does not.


#1 seems fine, but the naming of the APIs does not reflect that at all.

If we think that #2 is important, maybe smgrextend() should do that as 
well?  Or at least explain why it's not needed?





Re: Feature: Add reloption support for table access method

2023-05-10 Thread Jelte Fennema
I'm definitely in favor of this general idea of supporting custom
reloptions; we could use that for Citus's columnar TableAM. There have
been at least two other discussions on this topic:
1. 
https://www.postgresql.org/message-id/flat/CAFF0-CG4KZHdtYHMsonWiXNzj16gWZpduXAn8yF7pDDub%2BGQMg%40mail.gmail.com
2. 
https://www.postgresql.org/message-id/flat/429fb58fa3218221bb17c7bf9e70e1aa6cfc6b5d.camel%40j-davis.com

I haven't looked at the patch in detail, but it's probably good to
check what parts of those previous discussions apply to it, and
whether there are any ideas from those previous attempts that this
patch could borrow.

On Tue, 9 May 2023 at 22:59, Andres Freund  wrote:
>
> Hi,
>
> On 2023-05-05 16:44:39 +0800, 吴昊 wrote:
> > When I wrote an extension to implement a new storage via a table access
> > method, I found some issues: the existing code has strong assumptions
> > about heap tables. Here are 3 issues that I currently have:
> >
> >
> > 1. Index access methods have a callback to handle reloptions, but table
> > access methods don't. We can't add storage-specific parameters by the
> > `WITH` clause when creating a table.
>
> Makes sense to add that.
>
>
> > 2. Existing code strongly assumes that the data of a table is structured
> > as a series of physical files named by a hard-coded rule:
> > <relfilenode>[.<segno>]. It may only fit heap-like tables. A new storage
> > may have its own structure for how the data files are organized. The
> > problem happens when dropping a table.
>
> I agree that it's not great, but I don't think it's particularly easy to fix
> (because things like transactional DDL require fairly tight integration). Most
> of the time it should be possible can also work around the limitations though.
>
>
> > 3. The existing code also assumes that the data file consists of a series
> > of fixed-size blocks. That may not be desired by other storage. Are there
> > any suggestions on this situation?
>
> That's a requirement of using the buffer manager, but if you don't want to
> rely on that, you can use a different pattern. There's some limitations
> (format of TIDs, most prominently), but you should be able to deal with that.
>
> I don't think it would make sense to support other block sizes in the buffer
> manager.
>
>
> > The rest of this mail is about the first issue. It looks reasonable to
> > add a similar callback to struct TableAmRoutine, and parse reloptions via
> > the callback. This patch is in the attached file.
>
> Why did you add relkind to the callbacks? The callbacks are specific to a
> certain relkind, so I don't think that makes sense.
>
> I don't think we really need GetTableAmRoutineByAmId() that raises nice
> errors etc - as the AM has already been converted to an oid, we shouldn't need
> to recheck?
>
>
>
> > +bytea *
> > +table_reloptions_am(Oid accessMethodId, Datum reloptions, char relkind, 
> > bool validate)
> >  {
> > ...
> >
> > + /* built-in table access method put here to fetch TAM fast */
> > + case HEAP_TABLE_AM_OID:
> > + tam = GetHeapamTableAmRoutine();
> > + break;
> >   default:
> > - /* other relkinds are not supported */
> > - return NULL;
> > + tam = GetTableAmRoutineByAmId(accessMethodId);
> > + break;
>
> Why do we need this fastpath? This shouldn't be something called at a
> meaningful frequency?
>
>
> > }
> > + return table_reloptions(tam->amoptions, reloptions, relkind, 
> > validate);
> > }
>
> I'd just pass the tam, instead of an individual function.
>
> > @@ -866,6 +866,11 @@ typedef struct TableAmRoutine
> > 
> >  struct SampleScanState *scanstate,
> > 
> >  TupleTableSlot *slot);
> >
> > + /*
> > +  * This callback is used to parse reloptions for 
> > relation/matview/toast.
> > +  */
> > + bytea   *(*amoptions)(Datum reloptions, char relkind, bool 
> > validate);
> > +
> >  } TableAmRoutine;
>
> Did you mean table instead of relation in the comment?
>
>
>
> > Another thing about reloption is that the current API passes a parameter 
> > `validate` to tell the parse
> > function whether to check and raise an error. It doesn't have enough 
> > context about when these reloptions
> > are used:
> > 1. CREATE TABLE ... WITH(...)
> > 2. ALTER TABLE ... SET ...
> > 3. ALTER TABLE ... RESET ...
> > The reason why the context matters is that some reloptions are disallowed 
> > to change after creating
> > the table, while some reloptions are allowed.
>
> What kind of reloption are you thinking of here?
>
> Greetings,
>
> Andres Freund
>
>




Re: MERGE lacks ruleutils.c decompiling support!?

2023-05-10 Thread Alvaro Herrera
BTW, I spent some time adding a cross-check to see if the code was at
least approximately correct for all the queries in the MERGE regression
tests, and couldn't find any failures.  I then extended the test to the
other optimizable commands, and couldn't find any problems there either.

My approach was perhaps a bit simple-minded: I just patched
pg_analyze_and_rewrite_fixedparams() to call pg_get_querydef() after
parse_analyze_fixedparams(), then ran the main regression tests.  No
crashes.  Also had it output as a WARNING together with the
query_string, so that I could eyeball for any discrepancies; I couldn't
find any queries that produce wrong contents, though this was just
manual examination of the resulting noise logs.

I suppose a better approach might be to run the query produced by
pg_get_querydef() again through parse analysis and see if it produces
the same; or better yet, discard the original parsed query and parse the
pg_get_querydef().  I didn't try to do this.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"No deja de ser humillante para una persona de ingenio saber
que no hay tonto que no le pueda enseñar algo." (Jean B. Say)




Re: Allow pg_archivecleanup to remove backup history files

2023-05-10 Thread Bharath Rupireddy
On Tue, May 9, 2023 at 7:03 PM torikoshia  wrote:
>
> Attached a patch with documentation and regression tests.

Thanks. I think pg_archivecleanup cleaning up history files makes it a
complete feature as there's no need to write custom code/scripts over
and above what pg_archivecleanup provides. It will help those who are
using pg_archivecleanup for cleaning up older WAL files, say from
their archive location.

Just curious to know the driving point behind this proposal - is
pg_archivecleanup deployed in production that was unable to clean up
the history files and there were many such history files left? It will
help us know how pg_archivecleanup is being used.

I'm wondering if making -x generic, with '-x .backup', is simpler
than adding another option?

Comments on the patch:
1. Why just only the backup history files? Why not remove the timeline
history files too? Is it because there may not be as many tli switches
happening as backups?
2. +sub remove_backuphistoryfile_run_check
+{
Why invent a new function when run_check() can be made generic with a
few arguments passed? For instance, run_check() can receive
pg_archivecleanup command args, what to use for create_files(), in the
error condition if the pg_archivecleanup command args contain 'b',
then use a different message "$test_name: first older WAL file was
cleaned up" or "$test_name: first .backup file was cleaned up".
Otherwise, just modify the messages to be:
"$test_name: first older file %s was cleaned up", $files[0]);
"$test_name: second older file %s was cleaned up", $files[1]);
"$test_name: restart file %s was not cleaned up", $files[2]);
"$test_name: newer file %s was not cleaned up", $files[3]);
"$test_name: unrelated file %s was not cleaned up", $files[4]);

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: pg_upgrade and logical replication

2023-05-10 Thread Peter Smith
On Mon, Apr 24, 2023 at 4:19 PM Julien Rouhaud  wrote:
>
> Hi,
>
> On Thu, Apr 13, 2023 at 03:26:56PM +1000, Peter Smith wrote:
> >
> > 1.
> > All the comments look alike, so it is hard to know what is going on.
> > If each of the main test parts could be highlighted then the test code
> > would be easier to read IMO.
> >
> > Something like below:
> > [...]
>
> I added a bit more comments about what is being tested.  I'm not sure that a
> big TEST CASE prefix is necessary, as it's not really multiple separated test
> cases and other stuff can be tested in between.  Also AFAICT no other TAP test
> currently needs this kind of banner, even if they're testing more complex
> scenarios.

Hmm, I think there are plenty of examples of subscription TAP tests
having some kind of highlighted comments as suggested, for better
readability.

e.g. See src/test/subscription
t/014_binary.pl
t/015_stream.pl
t/016_stream_subxact.pl
t/018_stream_subxact_abort.pl
t/021_twophase.pl
t/022_twophase_cascade.pl
t/023_twophase_stream.pl
t/028_row_filter.pl
t/030_origin.pl
t/031_column_list.pl
t/032_subscribe_use_index.pl

A simple highlighted comment to separate the main test parts is all
that is needed.


> > 4b.
> > All these messages like "Table t1 should still have 2 rows on the new
> > subscriber" don't seem very helpful. e.g. They are not saying anything
> > about WHAT this is testing or WHY it should still have 2 rows.
>
> I don't think that those messages are supposed to say what or why something is
> tested, just give a quick context / reference on the test in case it's broken.
> The comments are there to explain in more details what is tested and/or why.
>

But, why can’t they do both? They can be a quick reference *and* at
the same time give some more meaning to the error log.  Otherwise,
these messages might as well just say ‘ref1’, ‘ref2’, ‘ref3’...

--
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: pg_upgrade and logical replication

2023-05-10 Thread Peter Smith
Here are some review comments for the v5-0001 patch code.

==
General

1. ALTER SUBSCRIPTION name ADD TABLE (relid = XYZ, state = 'x' [, lsn = 'X/Y'])

I was a bit confused by this relation 'state' mentioned in multiple
places. IIUC the pg_upgrade logic is going to reject anything with a
non-READY (not 'r') state anyhow, so what is the point of having all
the extra grammar/parse_subscription_options etc to handle setting the
state when only possible value must be 'r'?

~~~

2. state V relstate

I still feel code readability suffers a bit by calling some fields/vars
a generic 'state' instead of the more descriptive 'relstate'. Maybe
it's just me.

Previously commented same (see [1]#3, #4, #5)

==
doc/src/sgml/ref/pgupgrade.sgml

3.
+   
+Fully preserve the logical subscription state if any.  That includes
+the underlying replication origin with their remote LSN and the list of
+relations in each subscription so that replication can be simply
+resumed if the subscriptions are reactivated.
+   

I think the "if any" part is not necessary. If you remove those words,
then the rest of the sentence can be simplified.

SUGGESTION
Fully preserve the logical subscription state, which includes the
underlying replication origin's remote LSN, and the list of relations
in each subscription. This allows replication to simply resume when
the subscriptions are reactivated.

~~~

4.
+   
+If this option isn't used, it is up to the user to reactivate the
+subscriptions in a suitable way; see the subscription part in  for more information.
+   

The link still renders strangely as previously reported (see [1]#2b).

~~~

5.
+   
+If this option is used and any of the subscription on the old cluster
+has an unknown remote_lsn (0/0), or has any relation
+in a state different from r (ready), the
+pg_upgrade run will error.
+   

5a.
/subscription/subscriptions/

~

5b
"has any relation in a state different from r" --> "has any relation
with state other than r"

==
src/backend/commands/subscriptioncmds.c

6.
+ if (strlen(state_str) != 1)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("invalid relation state: %s", state_str)));

Is this relation state validation overly simplistic, by only checking
for length 1? Shouldn't this just be asserting the relstate must be
'r'?

==
src/bin/pg_dump/pg_dump.c

7. getSubscriptionTables

+/*
+ * getSubscriptionTables
+ *   get information about the given subscription's relations
+ */
+void
+getSubscriptionTables(Archive *fout)
+{
+ SubscriptionInfo *subinfo;
+ SubRelInfo *rels = NULL;
+ PQExpBuffer query;
+ PGresult   *res;
+ int i_srsubid;
+ int i_srrelid;
+ int i_srsubstate;
+ int i_srsublsn;
+ int i_nrels;
+ int i,
+ cur_rel = 0,
+ ntups,
+ last_srsubid = InvalidOid;

Why are some of the above single int declarations and others compound
declarations? Why not make them all consistent?

~

8.
+ appendPQExpBuffer(query, "SELECT srsubid, srrelid, srsubstate, srsublsn,"
+   " count(*) OVER (PARTITION BY srsubid) AS nrels"
+   " FROM pg_subscription_rel"
+   " ORDER BY srsubid");

Should this SQL be schema-qualified like pg_catalog.pg_subscription_rel?

~

9.
+ for (i = 0; i < ntups; i++)
+ {
+ int cur_srsubid = atooid(PQgetvalue(res, i, i_srsubid));

Should 'cur_srsubid' be declared Oid to match the atooid?

~~~

10. getSubscriptions

+ if (PQgetisnull(res, i, i_suboriginremotelsn))
+ subinfo[i].suboriginremotelsn = NULL;
+ else
+ subinfo[i].suboriginremotelsn =
+ pg_strdup(PQgetvalue(res, i, i_suboriginremotelsn));
+
+ /*
+ * For now assume there's no relation associated with the
+ * subscription. Later code might update this field and allocate
+ * subrels as needed.
+ */
+ subinfo[i].nrels = 0;

The wording "For now assume there's no" kind of gives an ambiguous
interpretation for this comment. IMO it sounds like this is the
"current" logic but some future PG version may behave differently - I
don't think that is the intended meaning at all.

SUGGESTION.
Here we just initialize nrels to say there are 0 relations associated
with the subscription. If necessary, subsequent logic will update this
field and allocate the subrels.

~~~

11. dumpSubscription

+ for (i = 0; i < subinfo->nrels; i++)
+ {
+ appendPQExpBuffer(query, "\nALTER SUBSCRIPTION %s ADD TABLE "
+   "(relid = %u, state = '%c'",
+   qsubname,
+   subinfo->subrels[i].srrelid,
+   subinfo->subrels[i].srsubstate);
+
+ if (subinfo->subrels[i].srsublsn[0] != '\0')
+ appendPQExpBuffer(query, ", LSN = '%s'",
+   subinfo->subrels[i].srsublsn);
+
+ appendPQExpBufferStr(query, ");");
+ }

I previously asked ([1]#11) about how can this ALTER SUBSCRIPTION
TABLE code happen unless 'preserve_subscriptions' is true, and you
confirmed "It indirectly is, as in that case subinfo->nrels is
guaranteed to be 0. I just tried to keep the code simpler and avoid
too many nested conditions."

~

If you are worried about 

Re: walsender performance regression due to logical decoding on standby changes

2023-05-10 Thread Drouvot, Bertrand

Hi,

On 5/9/23 11:00 PM, Andres Freund wrote:

Hi,

On 2023-05-09 13:38:24 -0700, Jeff Davis wrote:

On Tue, 2023-05-09 at 12:02 -0700, Andres Freund wrote:

I don't think the approach of not having any sort of "registry" of
whether
anybody is waiting for the replay position to be updated is
feasible. Iterating over all walsender slots is just too expensive -


Would it work to use a shared counter for the waiters (or, two
counters, one for physical and one for logical), and just early exit if
the count is zero?


That doesn't really fix the problem - once you have a single walsender
connected, performance is bad again.



Just to clarify, do you mean that if there is only one remaining active 
walsender that, say,
has been located at slot n, then we’d still have to loop from 0 to n in 
WalSndWakeup()?

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: walsender performance regression due to logical decoding on standby changes

2023-05-10 Thread Bharath Rupireddy
On Wed, May 10, 2023 at 12:33 AM Andres Freund  wrote:
>
> Unfortunately I have found the following commit to have caused a performance
> regression:
>
> commit e101dfac3a53c20bfbf1ca85d30a368c2954facf
>
> The problem is that, on a standby, after the change - as needed for the
> approach to work - the call to WalSndWakeup() in ApplyWalRecord() happens for
> every record, instead of only happening when the timeline is changed (or WAL
> is flushed or ...).
>
> WalSndWakeup() iterates over all walsender slots, regardless of whether in
> use. For each of the walsender slots it acquires a spinlock.
>
> When replaying a lot of small-ish WAL records I found the startup process to
> spend the majority of the time in WalSndWakeup(). I've not measured it very
> precisely yet, but the overhead is significant (~35% slowdown), even with the
> default max_wal_senders. If that's increased substantially, it obviously gets
> worse.

I played it with a simple primary -> standby1 -> standby2 setup. I ran
a pgbench script [1] on primary and counted the number of times
WalSndWakeup() gets called from ApplyWalRecord() and the number of
times the spinlock is acquired/released in WalSndWakeup(). The spinlock
is acquired/released a whopping 21 million times on standby 1 and
standby 2 for just a < 5 min pgbench run on the primary:

standby 1:
2023-05-10 05:32:43.249 UTC [1595600] LOG:  FOO WalSndWakeup() in
ApplyWalRecord() was called 2176352 times
2023-05-10 05:32:43.249 UTC [1595600] LOG:  FOO spinlock
acquisition/release count in WalSndWakeup() is 21763530

standby 2:
2023-05-10 05:32:43.249 UTC [1595625] LOG:  FOO WalSndWakeup() in
ApplyWalRecord() was called 2176352 times
2023-05-10 05:32:43.249 UTC [1595625] LOG:  FOO spinlock
acquisition/release count in WalSndWakeup() is 21763530

In this case, there is no timeline switch or no logical decoding on
the standby or such, but WalSndWakeup() gets called because the
standby can't make out if the slot is for logical or physical
replication unless the spinlock is acquired. Before e101dfac3a,
WalSndWakeup() was getting called only when there was a timeline
switch.

> The only saving grace is that this is not an issue on the primary.

Yeah.

> I don't think the approach of not having any sort of "registry" of whether
> anybody is waiting for the replay position to be updated is
> feasible. Iterating over all walsender slots is just too expensive -
> WalSndWakeup() shows up even if I remove the spinlock (which we likely could,
> especially when just checking if the walsender is connected).

Right.

> My current guess is that mis-using a condition variable is the best bet. I
> think it should work to use ConditionVariablePrepareToSleep() before a
> WalSndWait(), and then ConditionVariableCancelSleep(). I.e. to never use
> ConditionVariableSleep(). The latch set from ConditionVariableBroadcast()
> would still cause the necessary wakeup.

How about something like the attached? Recovery and subscription tests
don't complain with the patch.

[1]
./pg_ctl -D data -l logfile stop
./pg_ctl -D sbdata1 -l logfile1 stop
./pg_ctl -D sbdata2 -l logfile2 stop

cd /home/ubuntu/postgres
rm -rf inst
make distclean
./configure --prefix=$PWD/inst/ CFLAGS="-O3" > install.log && make -j
8 install > install.log 2>&1 &

primary -> standby1 -> standby2

free -m
sudo su -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
free -m

cd inst/bin
./initdb -D data
mkdir archived_wal

cat << EOF >> data/postgresql.conf
shared_buffers = '8GB'
wal_buffers = '1GB'
max_wal_size = '16GB'
max_connections = '5000'
archive_mode = 'on'
archive_command='cp %p /home/ubuntu/postgres/inst/bin/archived_wal/%f'
EOF

./pg_ctl -D data -l logfile start
./pg_basebackup -D sbdata1
./psql -c "select pg_create_physical_replication_slot('sb1_slot',
true, false)" postgres

cat << EOF >> sbdata1/postgresql.conf
port=5433
primary_conninfo='host=localhost port=5432 dbname=postgres user=ubuntu
application_name=sb1'
primary_slot_name='sb1_slot'
restore_command='cp /home/ubuntu/postgres/inst/bin/archived_wal/%f %p'
EOF

touch sbdata1/standby.signal

./pg_ctl -D sbdata1 -l logfile1 start
./pg_basebackup -D sbdata2 -p 5433
./psql -p 5433 -c "select
pg_create_physical_replication_slot('sb2_slot', true, false)" postgres

cat << EOF >> sbdata2/postgresql.conf
port=5434
primary_conninfo='host=localhost port=5433 dbname=postgres user=ubuntu
application_name=sb2'
primary_slot_name='sb2_slot'
restore_command='cp /home/ubuntu/postgres/inst/bin/archived_wal/%f %p'
EOF

touch sbdata2/standby.signal

./pg_ctl -D sbdata2 -l logfile2 start
./psql -c "select pid, backend_type from pg_stat_activity where
backend_type in ('walreceiver', 'walsender');" postgres
./psql -p 5433 -c "select pid, backend_type from pg_stat_activity
where backend_type in ('walreceiver', 'walsender');" postgres
./psql -p 5434 -c "select pid, backend_type from pg_stat_activity
where backend_type in ('walreceiver', 'walsender');" postgres

ulimit -S -n 5000
./pgbench --initialize