Re: Jumble the CALL command in pg_stat_statements

2023-09-27 Thread Michael Paquier
On Wed, Sep 13, 2023 at 11:09:19PM +, Imseih (AWS), Sami wrote:
> I do have a patch for this with test cases, 
> 0001-v1-Jumble-the-SET-command.patch
> If you feel this needs a separate discussion I can start one.

Agreed that this should have its own thread with a proper subject.

> In the patch, the custom _jumbleVariableSetStmt jumbles
>  the kind, name, is_local and number of arguments (in case of a list)
> and tracks the locations for normalization.

There is much more going on here, like FunctionSetResetClause, or
AlterSystemStmt with its generic_reset.

+    foreach(l, expr->args)
+    {
+        A_Const    *ac = (A_Const *) lfirst(l);
+
+        if (ac->type != T_String)
+            RecordConstLocation(jstate, ac->location);
+    }

Even this part, I am not sure if it is always correct.  Couldn't we
have cases where String's A_Const had better be recorded as const?
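
For illustration, the stricter alternative implied by that question would
record every A_Const, strings included - a hypothetical, untested sketch
rather than what the patch does:

    foreach(l, expr->args)
    {
        A_Const    *ac = (A_Const *) lfirst(l);

        /* record strings too, so they normalize like other constants */
        RecordConstLocation(jstate, ac->location);
    }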

> CALL with OUT or INOUT args is a bit strange, because
> as the doc [1] mentions "Arguments must be supplied for all procedure 
> parameters 
> that lack defaults, including OUT parameters. However, arguments 
> matching OUT parameters are not evaluated, so it's customary
> to just write NULL for them."
> 
> so for pgss, passing a NULL or some other value into OUT/INOUT args should 
> be normalized like IN args.

I've been studying this one, and I can see why you're right here.
This feels much more natural to include.  The INOUT parameters get
registered twice at the same position, and the duplicates are
discarded by pg_stat_statements, which is OK.  The patch is
straightforward for the CALL part, so I have applied it.
--
Michael




Re: Partial aggregates pushdown

2023-09-27 Thread Alexander Pyhalov

fujii.y...@df.mitsubishielectric.co.jp wrote on 2023-09-28 07:40:


I'm not sure that I like this mechanics of adding sort group clauses -
it seems we do additional work in core which is of use only for
one extension, but at least it seems to be working.

We cannot deparse the original sort group clauses and pathtarget
when performing partial aggregate pushdown from any FDW.
So I think the additional sort group clauses and pathtarget are
needed by any FDW, not only postgres_fdw.



Hi.
It seems to me that *fdw postfixes don't clarify things, but just make
the naming uglier.


+ * Adding these Vars and PlaceHolderVars to PathTarget,
+ * FDW cannot deparse this by the original List of SortGroupClauses.
+ * So, before this adding process,
+ * setGroupClausePartial generates another Pathtarget and another
+ * List of SortGroupClauses for FDW.

It seems that something like:

/*
 * Modified PathTarget cannot be used by FDW as-is to deparse this
 * statement.  So, before modifying PathTarget, setGroupClausePartial
 * generates another PathTarget and another List of SortGroupClauses
 * to make deparsing possible.
 */

sounds better.

--
Best regards,
Alexander Pyhalov,
Postgres Professional




Re: [PGdocs] fix description for handling of non-ASCII characters

2023-09-27 Thread Karl O. Pinc
On Thu, 28 Sep 2023 12:54:33 +0900
Michael Paquier  wrote:

> I was looking at this thread overall, the three v3 flavors of the doc
> changes and v4.
> 
> -application_name value. Other characters will be
> -replaced with question marks (?).
> +application_name value.
> +Other characters are replaced with <link
> +linkend="sql-syntax-strings-escape">C-style hexadecimal escapes</link>.
> 
> The simplicity of the change in v4 seems like the best approach to me,
> so +1 for that (including the mention to "C-style").

I agree with Tom that it's not worth spending anyone's attention
on bytes vs. characters.

So I'm marking the patch ready for committer.
(I have not tried the version that patches against PGv16.)

Thank you everyone, especially Hayato, for spending time
and making this better.

Regards,

Karl 
Free Software:  "You don't pay back, you pay forward."
 -- Robert A. Heinlein




Re: Eager page freeze criteria clarification

2023-09-27 Thread Melanie Plageman
On Fri, Sep 8, 2023 at 12:07 AM Andres Freund  wrote:
>
> Hi,
>
> On 2023-09-06 10:35:17 -0400, Robert Haas wrote:
> > On Wed, Sep 6, 2023 at 1:09 AM Andres Freund  wrote:
> > > Yea, it'd be useful to have a reasonably approximate wall clock time
> > > for the last modification of a page. We just don't have infrastructure
> > > for determining that. We'd need an LSN->time mapping (xid->time
> > > wouldn't be particularly useful, I think).
> > >
> > > A very rough approximate modification time can be computed by assuming
> > > an even rate of WAL generation, and using the LSN at the time of the
> > > last vacuum and the time of the last vacuum, to compute the approximate
> > > age.
> > >
> > > For a while I thought that'd not give us anything that just using LSNs
> > > gives us, but I think it might allow coming up with a better cutoff
> > > logic: Instead of using a cutoff like "page LSN is older than 10% of
> > > the LSNs since the last vacuum of the table", it would allow us to
> > > approximate "page has not been modified in the last 15 seconds" or
> > > such.  I think that might help avoid unnecessary freezing on tables
> > > with very frequent vacuuming.
> >
> > Yes. I'm uncomfortable with the last-vacuum-LSN approach mostly
> > because of the impact on very frequently vacuumed tables, and
> > secondarily because of the impact on very infrequently vacuumed
> > tables.
> >
> > Downthread, I proposed using the RedoRecPtr of the latest checkpoint
> > rather than the LSN of the previous vacuum. I still like that idea.
>
> Assuming that "downthread" references
> https://postgr.es/m/CA%2BTgmoYb670VcDFbekjn2YQOKF9a7e-kBFoj2WJF1HtH7YPaWQ%40mail.gmail.com
> could you sketch out the logic you're imagining a bit more?
>
>
> > It's a value that we already have, with no additional bookkeeping. It
> > varies over a much narrower range than the interval between vacuums on
> > a table. The vacuum interval could be as short as tens of seconds or as
> > long as years, while the checkpoint interval is almost always going to
> > be between a few minutes at the low end and some tens of minutes at
> > the high end, hours at the very most. That's quite appealing.
>
> The reason I was thinking of using the "lsn at the end of the last vacuum"
> is that it seems to be more adaptive to the frequency of vacuuming.
>
> On the one hand, if a table is rarely autovacuumed because it is huge,
> (InsertLSN-RedoRecPtr) might or might not be representative of the workload
> over a longer time. On the other hand, if a page in a frequently vacuumed
> table has an LSN from around the last vacuum (or even before), it should be
> frozen, but will appear to be recent in RedoRecPtr-based heuristics?
>
>
> Perhaps we can mix both approaches. We can use the LSN and time of the last
> vacuum to establish an LSN->time mapping that's reasonably accurate for a
> relation. For infrequently vacuumed tables we can use the time between
> checkpoints to establish a *more aggressive* cutoff for freezing than what a
> percent-of-time-since-last-vacuum approach would provide. If e.g. a table
> gets vacuumed every 100 hours and checkpoint timeout is 1 hour, no realistic
> percent-of-time-since-last-vacuum setting will allow freezing, as all dirty
> pages will be too new. To allow freezing a decent proportion of those, we
> could allow freezing pages that lived longer than ~20% of the
> time-between-recent-checkpoints.
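
For concreteness, a rough standalone sketch of that mixed cutoff (the names,
the 10%/20% constants, and the way the two measures are blended are
illustrative assumptions, not the actual proposal):

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint64_t XLogRecPtr;

    /*
     * Approximate "was this page modified recently?" by assuming an even
     * rate of WAL generation since the last vacuum, clamped so that even
     * rarely vacuumed tables can freeze pages older than a fraction of the
     * recent inter-checkpoint WAL volume.
     */
    static bool
    page_is_recently_modified(XLogRecPtr insert_lsn,
                              XLogRecPtr page_lsn,
                              XLogRecPtr last_vacuum_lsn,
                              XLogRecPtr lsn_per_checkpoint)
    {
        XLogRecPtr  page_age = insert_lsn - page_lsn;
        XLogRecPtr  cutoff = (insert_lsn - last_vacuum_lsn) / 10;   /* 10% */

        /* clamp to ~20% of the recent inter-checkpoint distance */
        if (cutoff > lsn_per_checkpoint / 5)
            cutoff = lsn_per_checkpoint / 5;

        return page_age < cutoff;
    }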

One big sticking point for me (brought up elsewhere in this thread, but,
AFAICT never resolved) is that it seems like the fact that we mark pages
all-visible even when not freezing them means that no matter what
heuristic we use, we won't have the opportunity to freeze the pages we
most want to freeze.

Getting to the specific proposal here -- having two+ heuristics and
using them based on table characteristics:

Take the append-only case. Let's say that we had a way to determine if
an insert-only table should use the vacuum LSN heuristic or older than X
checkpoints heuristic. Given the default
autovacuum_vacuum_insert_threshold, every 1000 tuples inserted,
autovacuum will vacuum the table. If you insert to the table often, then
it is a frequently vacuumed table and, if I'm understanding the proposal
correctly, you would want to use a vacuum LSN-based threshold to
determine which pages are old enough to freeze. Using only the
page-older-than-X%-of-LSNs-since-last-vacuum heuristic and assuming no
concurrent workloads, each autovacuum will leave X% of the pages
unfrozen. However, those pages will likely be marked all visible. That
means that the next non-aggressive autovacuum will not even scan those
pages (assuming that run of pages exceeds the skip pages threshold). So
some chunk of pages will get left behind on every autovacuum.

On the other hand, let's say that the table is not frequently inserted
to, so we consider it an infrequently updated table and want to use the
checkpoint-based heuristic. If the page hasn't

Re: [PoC] pg_upgrade: allow to upgrade publisher node

2023-09-27 Thread Bharath Rupireddy
On Mon, Sep 25, 2023 at 2:06 PM Amit Kapila  wrote:
>
> > > [1] 
> > > https://www.postgresql.org/message-id/CAA4eK1%2BLtWDKXvxS7gnJ562VX%2Bs3C6%2B0uQWamqu%3DUuD8hMfORg%40mail.gmail.com
> >
> > I see. IIUC, without that commit e0b2eed [1], it may happen that the
> > slot's on-disk confirmed_flush LSN value can be higher than the WAL
> > LSN that's flushed to disk, no?
> >
>
> No, without that commit, there is a very high possibility that even if
> we have sent the WAL to the subscriber and got the acknowledgment of
> the same, we would miss updating it before shutdown. This would lead
> to upgrade failures because upgrades have no way to later identify
> whether the remaining WAL records are sent to the subscriber.

Thanks for clarifying. I'm trying to understand what happens without
commit e0b2eed0 with an illustration:

step 1: publisher - confirmed_flush LSN in the on-disk replication slot
structure is 80
step 2: publisher - sends WAL at LSN 100
step 3: subscriber - acknowledges the apply LSN or confirmed_flush LSN as 100
step 4: publisher - shuts down without writing the new confirmed_flush
LSN as 100 to disk, note that commit e0b2eed0 is not in place
step 5: publisher - restarts
step 6: subscriber - upon publisher restart, the subscriber requests
WAL from publisher from LSN 100 as it tracks the last applied LSN in
replication origin

Now, if the pg_upgrade with the patch in this thread is run on
publisher after step 4, it complains with "The slot \"%s\" has not
consumed the WAL yet".

Is my above understanding right?

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Synchronizing slots from primary to standby

2023-09-27 Thread Amit Kapila
On Wed, Sep 27, 2023 at 3:13 PM Drouvot, Bertrand
 wrote:
>
> On 9/19/23 6:50 AM, shveta malik wrote:
> >
> > 1) patch001: wait for physical-standby confirmation logic is now
> > integrated with WalSndWaitForWal(). Now walsender waits for physical
> > standby's confirmation to take changes up to RecentFlushPtr in
> > WalSndWaitForWal(). This allows walsender to send the changes to
> > logical subscribers one by one which are already covered in
> > RecentFlushPtr without needing to wait on every commit for physical
> > standby confirmation.
>
> +   /* XXX: Is waiting for 1 second before retrying enough or more or less? */
> +   (void) WaitLatch(MyLatch,
> +                    WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
> +                    1000L,
> +                    WAIT_EVENT_WAL_SENDER_WAIT_FOR_STANDBY_CONFIRMATION);
>
> I think it would be better to let the physical walsender(s) wake up those
> logical walsender(s) (instead of waiting for 1 sec or such). Maybe we could
> introduce a new CV that would broadcast in PhysicalConfirmReceivedLocation()
> when restart_lsn is changed, what do you think?
>

Yes, I also think there should be some way for physical walsender to
wake up logical walsenders instead of just waiting. By the way, do you
think we need a GUC like standby_slot_names (please see discussion
[1])?
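
For concreteness, a rough sketch of that CV-based wakeup (the CV field and
the StandbyConfirmedFlush() check are hypothetical names for illustration,
not the actual patch):

    /* Hypothetical CV in shared memory, e.g. added to WalSndCtlData: */
    ConditionVariable   wal_confirm_rcv_cv;

    /* Physical walsender: in PhysicalConfirmReceivedLocation(), after the
     * slot's restart_lsn has advanced: */
    ConditionVariableBroadcast(&WalSndCtl->wal_confirm_rcv_cv);

    /* Logical walsender: in WalSndWaitForWal(), instead of a fixed 1s sleep: */
    ConditionVariablePrepareToSleep(&WalSndCtl->wal_confirm_rcv_cv);
    while (!StandbyConfirmedFlush(RecentFlushPtr))  /* hypothetical check */
        ConditionVariableSleep(&WalSndCtl->wal_confirm_rcv_cv,
                               WAIT_EVENT_WAL_SENDER_WAIT_FOR_STANDBY_CONFIRMATION);
    ConditionVariableCancelSleep();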

> Still regarding preventing the logical replication to go ahead of
> physical replication standbys specified in standby_slot_names: we
> currently don't impose this limitation to pg_logical_slot_get_changes
> and friends (that don't start a dedicated walsender).
>
> Shouldn't we also prevent them to go ahead of physical replication
> standbys specified in standby_slot_names?
>

Yes, I also think similar handling is required in
pg_logical_slot_get_changes_guts(). We do call GetFlushRecPtr(), so
the handling similar to what the patch is trying to do in
WalSndWaitForWal() can be done.

[1] - 
https://www.postgresql.org/message-id/CAJpy0uA%2Bt3XP2M0qtEmrOG1gSwHghjHPno5AtwTXM-94-%2Bc6JQ%40mail.gmail.com

-- 
With Regards,
Amit Kapila.




Re: pg_upgrade and logical replication

2023-09-27 Thread vignesh C
On Mon, 25 Sept 2023 at 10:05, Amit Kapila  wrote:
>
> On Fri, Sep 22, 2023 at 4:36 AM Michael Paquier  wrote:
> >
> > On Thu, Sep 21, 2023 at 02:35:55PM +0530, Amit Kapila wrote:
> > > It is because after upgrade of both publisher and subscriber, the
> > > subscriptions won't work. Both publisher and subscriber should work,
> > > otherwise, the logical replication set up won't work. I think we can
> > > probably do this, if we can document clearly how the user can make
> > > their logical replication set up work after upgrade.
> >
> > Yeah, well, this comes back to my original point that the upgrade of
> > publisher nodes and subscriber nodes should be treated as two
> > different problems or we're mixing apples and oranges (and a node
> > could have both subscriber and publishers).  While being able to
> > support both is a must, it is going to be a two-step process at the
> > end, with the subscribers done first and the publishers done after.
> > That's also kind of the point that Julien makes in top message of this
> > thread.
> >
> > I agree that docs are lacking in the proposed patch in terms of
> > restrictions, assumptions and process flow, but taken in isolation the
> > problem of the publishers is not something that this patch has to take
> > care of.
> >
>
> I also don't think that this patch has to solve the problem of
> publishers in any way but as per my understanding, if due to some
> reason we are not able to do the upgrade of publishers, this can add
> more steps for users than they have to do now for logical replication
> set up after upgrade. This is because now after restoring the
subscription rels and origin, as soon as we start replication after
> creating the slots on the publisher, we will never be able to
> guarantee data consistency. So, they need to drop the entire
> subscription setup including truncating the relations, and then set it
> up from scratch which also means they need to somehow remember or take
> a dump of the current subscription setup. According to me, the key
> point is to have a mechanism to set up slots correctly to allow
> replication (or subscriptions) to work after the upgrade. Without
> that, it appears to me that we are restoring a subscription where it
> can start from some random LSN and can easily lead to data consistency
> issues where it can miss some of the updates.
>
> This is the primary reason why I prioritized to work on the publisher
> side before getting this patch done, otherwise, the solution for this
> patch was relatively clear. I am not sure but I guess this could be
> the reason why originally we left it in the current state, otherwise,
restoring subscription rels or origin doesn't seem to be too much of
> an additional effort than what we are doing now.

I have tried to analyze the steps for upgrading the subscriber with
HEAD and with the upgrade patches. Here are the steps:
Current steps to upgrade subscriber in HEAD:
1) Upgrade the subscriber server
2) Start subscriber server
3) Truncate the tables
4) Alter the subscriptions to point to new slots in the subscriber
5) Enable the subscriptions
6) Alter subscription to refresh the publications

Steps to upgrade if we commit only the subscriber upgrade patch:
1) Upgrade the subscriber server
2) Start subscriber server
3) Truncate the tables
Note: We will have to drop the subscriptions as we have made changes
to pg_subscription_rel.
4) But drop subscription will throw error:
postgres=# DROP SUBSCRIPTION test1 cascade;
ERROR:  could not drop replication slot "test1" on publisher: ERROR:
replication slot "test1" does not exist
5) Alter the subscription to set slot_name to none
6) Make a note of all the subscriptions that are present
7) drop the subscriptions
8) Create the subscriptions

The number of steps will increase in this case.

Steps to upgrade if we commit the publisher upgrade patch first and then
the subscriber upgrade patch:
1) Upgrade the subscriber server
2) Start subscriber server
3) Enable the subscription
4) Alter subscription to refresh the publications

Based on the above, I also feel it is better to get the publisher
upgrade patch committed first, as: a) it will reduce the data copying
time (as truncate is not required); b) the number of steps will be
reduced; c) all the use cases will be handled.

Regards,
Vignesh




Re: Requiring recovery.signal or standby.signal when recovering with a backup_label

2023-09-27 Thread Michael Paquier
On Thu, Sep 28, 2023 at 12:58:51PM +0900, Kyotaro Horiguchi wrote:
> The attached is a quick mock-up, but providing an approximation of my
> thoughts. (For example, end_of_backup_reached could potentially be
> extended to the ArchiveRecoveryRequested case and we could simplify
> the condition..)

I am not sure why this is related to this thread..

 static XLogRecPtr backupStartPoint;
 static XLogRecPtr backupEndPoint;
 static bool backupEndRequired = false;
+static bool backupEndReached = false;

Anyway, peeking at your suggestion, this is actually outlining the
main issue I have with this code currently.  We have so many static
booleans to control one behavior over the other that we always try to
make this code more complicated, while we should try to make it
simpler instead. 
--
Michael




Re: Set enable_seqscan doesn't take effect?

2023-09-27 Thread David G. Johnston
On Wednesday, September 27, 2023, jacktby jacktby  wrote:

> postgres=# SET enable_seqscan = off;
> SET
> postgres=# explain select * from t;
>                                QUERY PLAN
> --------------------------------------------------------------------------
>  Seq Scan on t  (cost=10000000000.00..10000000023.60 rows=1360 width=32)


It wouldn't cost 10 billion to return the first tuple if that setting
wasn't working.

That is the “discouragement” the documentation is referring to.

I do agree the wording in the docs could be improved since it is a bit
self-contradictory and unspecific, but it is explicitly clear a plan with
sequential scan can still be chosen even with this set to off.
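
For reference, the mechanism behind that "discouragement": the planner does
not remove disabled path types, it just adds a huge constant to their
startup cost, roughly as in costsize.c (paraphrased, not a complete
function):

    /* disabled plan types are penalized, not removed */
    Cost        disable_cost = 1.0e10;

    void
    cost_seqscan(Path *path, PlannerInfo *root, RelOptInfo *baserel,
                 ParamPathInfo *param_info)
    {
        Cost        startup_cost = 0;
        ...
        if (!enable_seqscan)
            startup_cost += disable_cost;   /* hence cost=10000000000.00.. */
        ...
    }

If every alternative is disabled (here the table has no indexes), the
penalized sequential scan still wins.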

David J.


Re: [PGdocs] fix description for handling of non-ASCII characters

2023-09-27 Thread Peter Smith
v4 LGTM.

==
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Requiring recovery.signal or standby.signal when recovering with a backup_label

2023-09-27 Thread Kyotaro Horiguchi
Sorry, it seems that I posted at the wrong position..

At Thu, 28 Sep 2023 12:58:51 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> At Fri, 10 Mar 2023 15:59:04 +0900, Michael Paquier  
> wrote in 
> > My apologies for the long message, but this deserves some attention,
> > IMHO.
> > 
> > So, any thoughts?
> 
> Sorry for being late. However, I agree with David's concern regarding
> the unnecessary inconvenience it introduces.  I'd like to maintain the
> functionality.
> 
> While I agree that InArchiveRecovery should be activated only if
> ArchiveRecoveryRequested is true, I oppose the notion that
> the mere presence of backup_label should be interpreted as a request
> for archive recovery (even if it is implied in a comment in
> InitWalRecovery()). Instead, I propose that we separate backup_label
> and archive recovery, in other words, we should not turn on
> InArchiveRecovery if !ArchiveRecoveryRequested, regardless of the
> presence of backup_label. We can know the minimum required recovery
> LSN by the XLOG_BACKUP_END record.
> 
> The attached is a quick mock-up, but providing an approximation of my
> thoughts. (For example, end_of_backup_reached could potentially be
> extended to the ArchiveRecoveryRequested case and we could simplify
> the condition..)

regards

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index fcbde10529..e4af945319 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5489,17 +5489,15 @@ StartupXLOG(void)
 	set_ps_display("");
 
 	/*
-	 * When recovering from a backup (we are in recovery, and archive recovery
-	 * was requested), complain if we did not roll forward far enough to reach
-	 * the point where the database is consistent.  For regular online
-	 * backup-from-primary, that means reaching the end-of-backup WAL record
-	 * (at which point we reset backupStartPoint to be Invalid), for
-	 * backup-from-replica (which can't inject records into the WAL stream),
-	 * that point is when we reach the minRecoveryPoint in pg_control (which
-	 * we purposefully copy last when backing up from a replica).  For
-	 * pg_rewind (which creates a backup_label with a method of "pg_rewind")
-	 * or snapshot-style backups (which don't), backupEndRequired will be set
-	 * to false.
+	 * When recovering from a backup, complain if we did not roll forward far
+	 * enough to reach the point where the database is consistent.  For regular
+	 * online backup-from-primary, that means reaching the end-of-backup WAL
+	 * record, for backup-from-replica (which can't inject records into the WAL
+	 * stream), that point is when we reach the minRecoveryPoint in pg_control
+	 * (which we purposefully copy last when backing up from a replica).  For
+	 * pg_rewind (which creates a backup_label with a method of "pg_rewind") or
+	 * snapshot-style backups (which don't), backupEndRequired will be set to
+	 * false.
 	 *
 	 * Note: it is indeed okay to look at the local variable
 	 * LocalMinRecoveryPoint here, even though ControlFile->minRecoveryPoint
@@ -5508,7 +5506,7 @@ StartupXLOG(void)
 	 */
 	if (InRecovery &&
 		(EndOfLog < LocalMinRecoveryPoint ||
-		 !XLogRecPtrIsInvalid(ControlFile->backupStartPoint)))
+		 (haveBackupLabel && !endOfRecoveryInfo->end_of_backup_reached)))
 	{
 		/*
 		 * Ran off end of WAL before reaching end-of-backup WAL record, or
@@ -5516,16 +5514,13 @@ StartupXLOG(void)
 		 * recover from an online backup but never called pg_backup_stop(), or
 		 * you didn't archive all the WAL needed.
 		 */
-		if (ArchiveRecoveryRequested || ControlFile->backupEndRequired)
-		{
-			if (!XLogRecPtrIsInvalid(ControlFile->backupStartPoint) || ControlFile->backupEndRequired)
-				ereport(FATAL,
-						(errmsg("WAL ends before end of online backup"),
-						 errhint("All WAL generated while online backup was taken must be available at recovery.")));
-			else
-				ereport(FATAL,
-						(errmsg("WAL ends before consistent recovery point")));
-		}
+		if (haveBackupLabel && !endOfRecoveryInfo->end_of_backup_reached)
+			ereport(FATAL,
+					(errmsg("WAL ends before end of online backup"),
+					 errhint("All WAL generated while online backup was taken must be available at recovery.")));
+		else
+			ere

Re: Eager page freeze criteria clarification

2023-09-27 Thread Peter Geoghegan
On Wed, Sep 27, 2023 at 6:35 PM Andres Freund  wrote:
> >   if insert LSN - RedoRecPtr < insert LSN - page LSN
> >   page is older than the most recent checkpoint start, so freeze it
> >   regardless of whether or not it would emit an FPI
> >
> > What aggressiveness levels should there be? What should change at each
> > level? What criteria should pages have to meet to be subject to the
> > aggressiveness level?
>
> I'm thinking something very roughly along these lines could make sense:
>
> page_lsn_age = insert_lsn - page_lsn;

While there is no reason to not experiment here, I have my doubts
about what you've sketched out. Most importantly, it doesn't have
anything to say about the cost of not freezing -- just the cost of
freezing. But isn't the main problem *not* freezing when we could and
should have? (Of course the cost of freezing is very relevant, but
it's still secondary.)

But even leaving that aside, I just don't get why this will work with
the case that you yourself emphasized earlier on: a workload with
inserts plus "hot tail" updates. If you run TPC-C according to spec,
there is about 12 or 14 hours between the initial inserts into the
orders and order lines table (by a new order transaction), and the
subsequent updates (from the delivery transaction). When I run the
benchmark, I usually don't stick with the spec (it's rather limiting
on modern hardware), so it's more like 2 - 4 hours before each new
order is delivered (meaning updated in those two big tables). Either
way, it's a fairly long time relative to everything else.

Won't the algorithm that you've sketched always think that
"unfreezing" pages doesn't affect recently frozen pages with such a
workload? Isn't the definition of "recently frozen" that emerges from
this algorithm not in any way related to the order delivery time, or
anything like that? You know, rather like vacuum_freeze_min_age.

Separately, at one point you also said "Yes. If the ratio of
opportunistically frozen pages (which I'd define as pages that were
frozen not because they strictly needed to) vs the number of unfrozen
pages increases, we need to make opportunistic freezing less
aggressive and vice versa".

Can we expect a discount for freezing that happened to be very cheap
anyway, when that doesn't work out?

What about a page that we would have had to have frozen anyway (based
on the conventional vacuum_freeze_min_age criteria) not too long after
it was frozen by this new mechanism, that nevertheless became unfrozen
some time later? That is, a page where "the unfreezing" cannot
reasonably be blamed on the initial so-called opportunistic freezing,
because really it was a total accident involving when VACUUM showed
up? You know, just like we'd expect with the TPC-C tables.

Aside: "unfrozen pages" seems to refer to pages that were frozen, and
became unfrozen. Not pages that are simply frozen. Lots of
opportunities for confusion here.

I'm not saying that it's wrong to freeze like this in the specific case of
TPC-C. But do you really need to invent all this complicated
infrastructure, just to avoid freezing the same pages again in a tight
loop?

On a positive note, I like that what you've laid out freezes eagerly
when an FPI won't result -- this much we can all agree on. I guess
that that part is becoming uncontroversial.

--
Peter Geoghegan




Re: Requiring recovery.signal or standby.signal when recovering with a backup_label

2023-09-27 Thread Kyotaro Horiguchi
At Fri, 10 Mar 2023 15:59:04 +0900, Michael Paquier  wrote 
in 
> My apologies for the long message, but this deserves some attention,
> IMHO.
> 
> So, any thoughts?

Sorry for being late. However, I agree with David's concern regarding
the unnecessary inconvenience it introduces.  I'd like to maintain the
functionality.

While I agree that InArchiveRecovery should be activated only if
ArchiveRecoveryRequested is true, I oppose the notion that
the mere presence of backup_label should be interpreted as a request
for archive recovery (even if it is implied in a comment in
InitWalRecovery()). Instead, I propose that we separate backup_label
and archive recovery, in other words, we should not turn on
InArchiveRecovery if !ArchiveRecoveryRequested, regardless of the
presence of backup_label. We can know the minimum required recovery
LSN by the XLOG_BACKUP_END record.

The attached is a quick mock-up, but providing an approximation of my
thoughts. (For example, end_of_backup_reached could potentially be
extended to the ArchiveRecoveryRequested case and we could simplify
the condition..)

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index fcbde10529..e4af945319 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5489,17 +5489,15 @@ StartupXLOG(void)
 	set_ps_display("");
 
 	/*
-	 * When recovering from a backup (we are in recovery, and archive recovery
-	 * was requested), complain if we did not roll forward far enough to reach
-	 * the point where the database is consistent.  For regular online
-	 * backup-from-primary, that means reaching the end-of-backup WAL record
-	 * (at which point we reset backupStartPoint to be Invalid), for
-	 * backup-from-replica (which can't inject records into the WAL stream),
-	 * that point is when we reach the minRecoveryPoint in pg_control (which
-	 * we purposefully copy last when backing up from a replica).  For
-	 * pg_rewind (which creates a backup_label with a method of "pg_rewind")
-	 * or snapshot-style backups (which don't), backupEndRequired will be set
-	 * to false.
+	 * When recovering from a backup, complain if we did not roll forward far
+	 * enough to reach the point where the database is consistent.  For regular
+	 * online backup-from-primary, that means reaching the end-of-backup WAL
+	 * record, for backup-from-replica (which can't inject records into the WAL
+	 * stream), that point is when we reach the minRecoveryPoint in pg_control
+	 * (which we purposefully copy last when backing up from a replica).  For
+	 * pg_rewind (which creates a backup_label with a method of "pg_rewind") or
+	 * snapshot-style backups (which don't), backupEndRequired will be set to
+	 * false.
 	 *
 	 * Note: it is indeed okay to look at the local variable
 	 * LocalMinRecoveryPoint here, even though ControlFile->minRecoveryPoint
@@ -5508,7 +5506,7 @@ StartupXLOG(void)
 	 */
 	if (InRecovery &&
 		(EndOfLog < LocalMinRecoveryPoint ||
-		 !XLogRecPtrIsInvalid(ControlFile->backupStartPoint)))
+		 (haveBackupLabel && !endOfRecoveryInfo->end_of_backup_reached)))
 	{
 		/*
 		 * Ran off end of WAL before reaching end-of-backup WAL record, or
@@ -5516,16 +5514,13 @@ StartupXLOG(void)
 		 * recover from an online backup but never called pg_backup_stop(), or
 		 * you didn't archive all the WAL needed.
 		 */
-		if (ArchiveRecoveryRequested || ControlFile->backupEndRequired)
-		{
-			if (!XLogRecPtrIsInvalid(ControlFile->backupStartPoint) || ControlFile->backupEndRequired)
-				ereport(FATAL,
-						(errmsg("WAL ends before end of online backup"),
-						 errhint("All WAL generated while online backup was taken must be available at recovery.")));
-			else
-				ereport(FATAL,
-						(errmsg("WAL ends before consistent recovery point")));
-		}
+		if (haveBackupLabel && !endOfRecoveryInfo->end_of_backup_reached)
+			ereport(FATAL,
+					(errmsg("WAL ends before end of online backup"),
+					 errhint("All WAL generated while online backup was taken must be available at recovery.")));
+		else
+			ereport(FATAL,
+					(errmsg("WAL ends before consistent recovery point")));
 	}
 
 	/*
diff --git a/src/backend/access/transam/xlog

Re: [PGdocs] fix description for handling of non-ASCII characters

2023-09-27 Thread Michael Paquier
On Thu, Sep 28, 2023 at 03:23:30AM +, Hayato Kuroda (Fujitsu) wrote:
> Hmm, cfbot got angry because it tried to apply both patches. To avoid
> that, I have reposted a renamed patch.
> (I'm happy if we can specify the target branch of patches)

I was looking at this thread overall, the three v3 flavors of the doc
changes and v4.

-application_name value. Other characters will be
-replaced with question marks (?).
+application_name value.
+Other characters are replaced with C-style hexadecimal escapes.

The simplicity of the change in v4 seems like the best approach to me,
so +1 for that (including the mention to "C-style").
--
Michael




Re: Set enable_seqscan doesn't take effect?

2023-09-27 Thread jacktby jacktby


> 2023年9月28日 01:07,Andres Freund  写道:
> 
> Hi,
> 
> On 2023-09-28 00:37:41 +0800, jacktby jacktby wrote:
>> postgres=# SET enable_seqscan = off;
>> SET
>> postgres=# explain select * from t;
>>                            QUERY PLAN
>> --------------------------------------------------------------------------
>> Seq Scan on t  (cost=10000000000.00..10000000023.60 rows=1360 width=32)
>> (1 row)
>> 
>> postgres=#  select * from t;
>> a   
>> ---
>> [1,2]
>> (1 row)
> 
> Sorry to be the grump here:
> 
> You start several threads a week, often clearly not having done much, if any,
> prior research. Often even sending the same question to multiple lists. It
> should not be hard to find an explanation for the behaviour you see here.
> 
> pgsql-hackers isn't a "do my work for me service". We're hacking on
> postgres. It's fine to occasionally ask for direction, but you're very clearly
> exceeding that.
> 
> Greetings,
> 
> Andres Freund
I’m so sorry for that. I think I’m not very familiar with pg, so I ask many 
naive questions. And I apologize for my behavior.




RE: [PGdocs] fix description for handling of non-ASCII characters

2023-09-27 Thread Hayato Kuroda (Fujitsu)
> I attached two patches - one for HEAD and another one for the
> REL_16_STABLE branch. As shown below, PG16 has the same behavior.

Hmm, cfbot got angry because it tried to apply both patches. To avoid that,
I have reposted a renamed patch.
(I'm happy if we can specify the target branch of patches)
(I'm happy if we can specify the target branch of patches)

Sorry for the inconvenience.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

From 8f56b1cd93157a3c8b8f1c703c2752c0727e798c Mon Sep 17 00:00:00 2001
From: Hayato Kuroda 
Date: Wed, 27 Sep 2023 07:31:07 +
Subject: [PATCH v4] Fix description for handling of non-printable ASCII
 characters
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

45b1a67a changed the behavior when characters that are not printable ASCII were
used for three configuration parameters (application_name, cluster_name, and
postgres_fdw.application_name), but it was not documented. This commit fixes
that.

PG15 and prior:

```
postgres=# SET application_name TO 'あああ';
SET
postgres=# SHOW application_name ;
 application_name
------------------
 ?
(1 row)
```

PG16 and later:

```
postgres=# SET application_name TO 'あああ';
SET
postgres=# SHOW application_name ;
   application_name
--------------------------------------
 \xe3\x81\x82\xe3\x81\x82\xe3\x81\x82
(1 row)
```
---
 doc/src/sgml/config.sgml   | 15 +--
 doc/src/sgml/postgres-fdw.sgml |  6 +++---
 2 files changed, 12 insertions(+), 9 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 4647539e40..126c147fc6 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -7012,8 +7012,9 @@ local0.*/var/log/postgresql
 and included in CSV log entries.  It can also be included in regular
 log entries via the  parameter.
 Only printable ASCII characters may be used in the
-application_name value. Other characters will be
-replaced with question marks (?).
+application_name value.
+Other characters are replaced with C-style hexadecimal escapes.

   
  
@@ -8009,10 +8010,12 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
 The name can be any string of less
 than NAMEDATALEN characters (64 characters in a standard
 build). Only printable ASCII characters may be used in the
-cluster_name value. Other characters will be
-replaced with question marks (?).  No name is shown
-if this parameter is set to the empty string '' (which is
-the default). This parameter can only be set at server start.
+cluster_name value.
+Other characters are replaced with C-style hexadecimal escapes.
+No name is shown if this parameter is set to the empty string
+'' (which is the default).
+This parameter can only be set at server start.

   
  
diff --git a/doc/src/sgml/postgres-fdw.sgml b/doc/src/sgml/postgres-fdw.sgml
index 5062d712e7..c177fd41bc 100644
--- a/doc/src/sgml/postgres-fdw.sgml
+++ b/doc/src/sgml/postgres-fdw.sgml
@@ -1067,9 +1067,9 @@ postgres=# SELECT postgres_fdw_disconnect_all();
   of any length and contain even non-ASCII characters.  However when
   it's passed to and used as application_name
   in a foreign server, note that it will be truncated to less than
-  NAMEDATALEN characters and anything other than
-  printable ASCII characters will be replaced with question
-  marks (?).
+  NAMEDATALEN characters.
+  Anything other than printable ASCII characters are replaced with C-style hexadecimal escapes.
   See  for details.
  
 
-- 
2.27.0



v4-0001-Fix-description-for-handling-of-non-printable-ASC.patch
Description:  v4-0001-Fix-description-for-handling-of-non-printable-ASC.patch


Re: Streaming I/O, vectored I/O (WIP)

2023-09-27 Thread Heikki Linnakangas

On 31/08/2023 07:00, Thomas Munro wrote:

Currently PostgreSQL reads (and writes) data files 8KB at a time.
That's because we call ReadBuffer() one block at a time, with no
opportunity for lower layers to do better than that.  This thread is
about a model where you say which block you'll want next with a
callback, and then you pull the buffers out of a "stream".


I love this idea! Makes it a lot easier to perform prefetch, as 
evidenced by the 011-WIP-Use-streaming-reads-in-bitmap-heapscan.patch:


 13 files changed, 289 insertions(+), 637 deletions(-)

I'm a bit disappointed and surprised by 
v1-0009-WIP-Use-streaming-reads-in-vacuum.patch though:


 4 files changed, 244 insertions(+), 78 deletions(-)

The current prefetching logic in vacuumlazy.c is pretty hairy, so I 
hoped that this would simplify it. I didn't look closely at that patch, 
so maybe it's simpler even though it's more code.



There are more kinds of streaming I/O that would be useful, such as
raw unbuffered files, and of course writes, and I've attached some
early incomplete demo code for writes (just for fun), but the main
idea I want to share in this thread is the idea of replacing lots of
ReadBuffer() calls with the streaming model.


All this makes sense. Some random comments on the patches:


+   /* Avoid a slightly more expensive kernel call if there is no benefit. */
+   if (iovcnt == 1)
+       returnCode = pg_pread(vfdP->fd,
+                             iov[0].iov_base,
+                             iov[0].iov_len,
+                             offset);
+   else
+       returnCode = pg_preadv(vfdP->fd, iov, iovcnt, offset);


How about pushing down this optimization to pg_preadv() itself? 
pg_preadv() is currently just a macro if the system provides preadv(),
but it could be a "static inline" that does the above dance. I think 
that optimization is platform-dependent anyway, pread() might not be any 
faster on some OSs. In particular, if the system doesn't provide 
preadv() and we use the implementation in src/port/preadv.c, it's the 
same kernel call anyway.
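
Something like this hedged sketch, assuming preadv() is available (when it
isn't, the src/port/preadv.c fallback already boils down to the same
pread() call):

    #include <sys/uio.h>
    #include <unistd.h>

    static inline ssize_t
    pg_preadv(int fd, const struct iovec *iov, int iovcnt, off_t offset)
    {
        /* single-vector fast path folded into the wrapper itself */
        if (iovcnt == 1)
            return pread(fd, iov[0].iov_base, iov[0].iov_len, offset);
        return preadv(fd, iov, iovcnt, offset);
    }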



v1-0002-Provide-vectored-variants-of-smgrread-and-smgrwri.patch


No smgrextendv()? I guess none of the patches here needed it.


/*
 * Prepare to read a block.  The buffer is pinned.  If this is a 'hit', then
 * the returned buffer can be used immediately.  Otherwise, a physical read
 * should be completed with CompleteReadBuffers().  PrepareReadBuffer()
 * followed by CompleteReadBuffers() is equivalent to ReadBuffer(), but the
 * caller has the opportunity to coalesce reads of neighboring blocks into one
 * CompleteReadBuffers() call.
 *
 * *found is set to true for a hit, and false for a miss.
 *
 * *allocated is set to true for a miss that allocates a buffer for the first
 * time.  If there are multiple calls to PrepareReadBuffer() for the same
 * block before CompleteReadBuffers() or ReadBuffer_common() finishes the
 * read, then only the first such call will receive *allocated == true, which
 * the caller might use to issue just one prefetch hint.
 */
Buffer
PrepareReadBuffer(BufferManagerRelation bmr,
  ForkNumber forkNum,
  BlockNumber blockNum,
  BufferAccessStrategy strategy,
  bool *found,
  bool *allocated)



If you decide you don't want to perform the read, after all, is there a 
way to abort it without calling CompleteReadBuffers()? Looking at the 
later patch that introduces the streaming read API, seems that it 
finishes all the reads, so I suppose we don't need an abort function. 
Does it all get cleaned up correctly on error?



/*
 * Convert an array of buffer address into an array of iovec objects, and
 * return the number that were required.  'iov' must have enough space for up
 * to PG_IOV_MAX elements.
 */
static int
buffers_to_iov(struct iovec *iov, void **buffers, int nblocks)

The comment is a bit inaccurate. There's an assertion that nblocks
<= PG_IOV_MAX, so while it's true that 'iov' must have enough space for
up to PG_IOV_MAX elements, that's only because we also assume that
nblocks <= PG_IOV_MAX.


I don't see anything in the callers (mdreadv() and mdwritev()) to 
prevent them from passing nblocks > PG_IOV_MAX.
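
If such callers are possible, one illustrative guard (process_chunk() is a
hypothetical placeholder, not the patch's code) would be to chunk the
request:

    while (nblocks > 0)
    {
        int         chunk = Min(nblocks, PG_IOV_MAX);

        /* build iovecs for 'chunk' buffers and issue one preadv/pwritev */
        process_chunk(buffers, chunk);

        buffers += chunk;
        nblocks -= chunk;
    }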


in streaming_read.h:


typedef bool (*PgStreamingReadBufferDetermineNextCB) (PgStreamingRead *pgsr,
                                                      uintptr_t pgsr_private,
                                                      void *per_io_private,
                                                      BufferManagerRelation *bmr,
                                                      ForkNumber *forkNum,
                                                      BlockNumber *blockNum,
                                                      ReadBufferMode *mode);


I was surprised that 'bmr', 'forkNum' and 'mode' are given separately on 
each read. I see that you used that in the WAL replay prefetching, so I 
guess that makes sense.
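
As a usage sketch, a trivial sequential-scan callback matching that typedef
could look like this (hypothetical; the WIP API may well differ):

    typedef struct SeqStreamState
    {
        BlockNumber next;       /* next block to hand out */
        BlockNumber nblocks;    /* relation size at stream start */
    } SeqStreamState;

    static bool
    seq_determine_next(PgStreamingRead *pgsr, uintptr_t pgsr_private,
                       void *per_io_private,
                       BufferManagerRelation *bmr, ForkNumber *forkNum,
                       BlockNumber *blockNum, ReadBufferMode *mode)
    {
        SeqStreamState *state = (SeqStreamState *) pgsr_private;

        if (state->next >= state->nblocks)
            return false;       /* end of stream */

        *blockNum = state->next++;
        *forkNum = MAIN_FORKNUM;
        *mode = RBM_NORMAL;
        /* *bmr is left as set up when the stream was created */
        return true;
    }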



extern void pg_streaming_read_prefetch(PgStreamingRead *pgsr);
extern Buffer pg_streaming_read

RE: [PGdocs] fix description for handling of non-ASCII characters

2023-09-27 Thread Hayato Kuroda (Fujitsu)
Dear Tom,

Thank you for giving a comment!
New patch is available in [1].

> I'd be inclined to keep the text as simple as possible and not focus on
> the distinction between bytes and characters.
>

Okay, in the latest version, the word "byte" was removed.

[1]: 
https://www.postgresql.org/message-id/TYAPR01MB5866DD962CA4FC03E338C6BBF5C1A%40TYAPR01MB5866.jpnprd01.prod.outlook.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED





RE: [PGdocs] fix description for handling of non-ASCII characters

2023-09-27 Thread Hayato Kuroda (Fujitsu)
Dear Peter,

Thank you for reviewing!

> > > > > A. A patch which completely follows your comments. The name is
> > > > > "v3-0001-...patch". Cfbot tests it.
> > > > > B. A patch which completely follows Peter's comments [1]. The
> > > > > name is "Peter_v3-txt".
> > > > > C. A patch which follows both comments. Based on
> > > > > b, but some comments (Don't use the future tense, "Other
> > > > > characters"->"The bytes of other characters"...) were picked. The
> > > > > name is "Both_v3-txt".
> > > >
> > > > I also like C.  Fewer words is better.  So long
> > > > as nothing is left unsaid fewer words make for clarity.
> > > >
> > > > However, in the last hunk, "of other than" does not read well.
> > > > Instead of writing
> > > > "and the bytes of other than printable ASCII characters"
> > > > you want "and the bytes that are not printable ASCII characters".
> > > > That would be my suggestion.
> > > >
> > >
> > > I also prefer Option C, but...

Okay, C was chosen.


> > >
> > > ~~~
> > >
> > > +application_name value.
> > > +The bytes of other characters are replaced with
> > > +C-style escaped hexadecimal
> > > +byte values.
> > >
> > > V
> > >
> > > +cluster_name value.
> > > +The bytes of other characters are replaced with
> > > +C-style escaped hexadecimal
> > > +byte values.
> > >
> > > V
> > >
> > > +  NAMEDATALEN characters and the bytes of other than
> > > +  printable ASCII characters are replaced with <link
> > > +  linkend="sql-syntax-strings-escape">C-style escaped hexadecimal byte
> > > +  values.
> > >
> > >
> > > IIUC all of these 3 places can have exactly the same wording change
> > > (e.g. like Karl's last suggestion [1]).
> > >
> > > SUGGESTION
> > > Any bytes that are not printable ASCII characters are replaced with
> > > C-style escaped hexadecimal byte values.

Hmm, I felt that using exactly the same wording seemed strange here, so similar
words were used. Also, based on the comment [1], "byte" was removed.

> 
> I had in mind something like a SHIFT-JIS encoding where a single
> "character" may include some trail bytes that happen to be in the
> ASCII printable range. AFAIK because the new logic is processing
> bytes, not characters, I thought the end result could be a mix of
> escaped and unescaped bytes for the single SJIS character. In that
> context, I felt "The bytes of other characters" was not quite
> accurate.
> 
> But now looking at PostgreSQL-supported character sets [1] I saw SJIS
> is not supported anyhow. Unfortunately, I am not familiar enough with
> other encodings to know if there is still a chance of similar
> printable ASCII trail bytes so I am fine with whatever wording is
> chosen.

Based on the discussion [1], I did not change that part.

> 
> > But because I like short sentences I now think that it's a good
> > idea to break the long sentence of the last hunk into two.
> > Add a period and use the Peter's SUGGESTION above as the
> > text for the second sentence.
> >
> > Is this desireable?
> >
> 
> +1.

OK, divided.

New patch is available in [2].

[1]: https://www.postgresql.org/message-id/803569.1695863971%40sss.pgh.pa.us
[2]: 
https://www.postgresql.org/message-id/TYAPR01MB5866DD962CA4FC03E338C6BBF5C1A%40TYAPR01MB5866.jpnprd01.prod.outlook.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



RE: [PGdocs] fix description for handling of non-ASCII characters

2023-09-27 Thread Hayato Kuroda (Fujitsu)
Dear Karl,

Thank you for giving comments! PSA new version.
I attached two patches - one for HEAD and another one for the REL_16_STABLE
branch. As shown below, PG16 has the same behavior.

```
psql (16beta3)
Type "help" for help.

postgres=# SET application_name TO 'あああ';
SET
postgres=# SHOW application_name ;
   application_name   
--------------------------------------
 \xe3\x81\x82\xe3\x81\x82\xe3\x81\x82
(1 row)
```


> > > > A. A patch which completely follows your comments. The name is
> > > > "v3-0001-...patch". Cfbot tests it.
> > > > B. A patch which completely follows Peter's comments [1]. The
> > > > name is "Peter_v3-txt".
> > > > C. A patch which follows both comments. Based on
> > > > b, but some comments (Don't use the future tense, "Other
> > > > characters"->"The bytes of other characters"...) were picked. The
> > > > name is "Both_v3-txt".
> > >
> > > I also like C.  Fewer words is better.  So long
> > > as nothing is left unsaid fewer words make for clarity.

Okay, basically I used C.

> > >
> > > However, in the last hunk, "of other than" does not read well.
> > > Instead of writing
> > > "and the bytes of other than printable ASCII characters"
> > > you want "and the bytes that are not printable ASCII characters".
> > > That would be my suggestion.
> > >
> >
> > I also prefer Option C, but...
> >
> > ~~~
> >
> > +application_name value.
> > +The bytes of other characters are replaced with
> > +C-style escaped hexadecimal
> > +byte values.
> >
> > V
> >
> > +cluster_name value.
> > +The bytes of other characters are replaced with
> > +C-style escaped hexadecimal
> > +byte values.
> >
> > V
> >
> > +  NAMEDATALEN characters and the bytes of other than
> > +  printable ASCII characters are replaced with <link
> > +  linkend="sql-syntax-strings-escape">C-style escaped hexadecimal byte
> > +  values.
> >
> >
> > IIUC all of these 3 places can have exactly the same wording change
> > (e.g. like Karl's last suggestion [1]).
> >
> > SUGGESTION
> > Any bytes that are not printable ASCII characters are replaced with
> > C-style escaped hexadecimal byte values.
> 
> I don't see the utility in having exactly the same phrase everywhere,
> especially since the last hunk is modifying the end of a long
> sentence.  (Apologies if I'm mis-reading what Peter wrote above.)

Right, here we cannot use exactly the same sentence.

> 
> I like short sentences.  So I prefer "The bytes of other characters"
> rather than "Any bytes that are not printable ASCII characters"
> for the first 2 hunks.  In context I don't see the need to repeat
> the whole "printable ASCII characters" part that appears in the
> preceding sentence of both hunks.  "Other" is clear, IMHO.

Based on the suggestion [1], I removed the word "byte".
(Sorry, but a comment from senior members has higher priority)

> 
> But because I like short sentences I now think that it's a good
> idea to break the long sentence of the last hunk into two.
> Add a period and use the Peter's SUGGESTION above as the
> text for the second sentence.

Right, the sentence is separated into two.

[1]: https://www.postgresql.org/message-id/803569.1695863971%40sss.pgh.pa.us

Best Regards,
Hayato Kuroda
FUJITSU LIMITED




REL_16_v4-0001-Fix-description-for-handling-of-non-printable-ASC.patch
Description:  REL_16_v4-0001-Fix-description-for-handling-of-non-printable-ASC.patch


v4-0001-Fix-description-for-handling-of-non-printable-ASC.patch
Description:  v4-0001-Fix-description-for-handling-of-non-printable-ASC.patch


Re: RFC: Logging plan of the running query

2023-09-27 Thread Andrey Lepikhov

On 28/9/2023 09:04, torikoshia wrote:

On 2023-09-25 18:49, Andrey Lepikhov wrote:

On 25/9/2023 14:21, torikoshia wrote:

On 2023-09-20 14:39, Lepikhov Andrei wrote:
Hmm, as a test, I made sure to call ProcessLogQueryPlanInterrupt() on
all CFI using
v28-0002-Testing-attempt-logging-plan-on-ever-CFI-call.patch, and
then ran the following query, but it did not cause any problems.


```
=# CREATE TABLE test();
=# CREATE OR REPLACE FUNCTION ddl() RETURNS void AS $$
BEGIN
   EXECUTE format('ALTER TABLE test ADD COLUMN x integer;');
   PERFORM pg_sleep(5);
END; $$ LANGUAGE plpgsql VOLATILE;
=# SELECT ddl();
```

Is this the case you're worrying about?


I didn't find a problem either. I just feel uncomfortable if, at the
moment of interruption, we have a descriptor of a query other than the
one that has been executing and holding resources.


I think that "descriptor" here refers to ActiveQueryDesc, in which case
it is updated at the beginning of ExecutorRun(), so I am wondering if
the situation you're worried about would not occur.
As you can see, in my example we have only DDL and no queries with
plans. In this case postgres doesn't call ExecutorRun() just because it
doesn't have a plan. But locks will be obtained.


--
regards,
Andrey Lepikhov
Postgres Professional





Re: Index range search optimization

2023-09-27 Thread Peter Geoghegan
On Wed, Sep 27, 2023 at 9:41 AM Alexander Korotkov  wrote:
> Fixed the typo inficating => indicating, as pointed out by Pavel.
> Peter, what do you think about the current shape of the patch?

I'll try to get to this tomorrow. I'm rather busy with moving home at
the moment, unfortunately.


-- 
Peter Geoghegan




Re: RFC: Logging plan of the running query

2023-09-27 Thread torikoshia

On 2023-09-25 18:49, Andrey Lepikhov wrote:

On 25/9/2023 14:21, torikoshia wrote:

On 2023-09-20 14:39, Lepikhov Andrei wrote:
Hmm, as a test, I made sure to call ProcessLogQueryPlanInterrupt() on
all CFI using
v28-0002-Testing-attempt-logging-plan-on-ever-CFI-call.patch, and then
ran the following query, but it did not cause any problems.


```
=# CREATE TABLE test();
=# CREATE OR REPLACE FUNCTION ddl() RETURNS void AS $$
BEGIN
   EXECUTE format('ALTER TABLE test ADD COLUMN x integer;');
   PERFORM pg_sleep(5);
END; $$ LANGUAGE plpgsql VOLATILE;
=# SELECT ddl();
```

Is this the case you're worrying about?


I didn't find a problem either. I just feel uncomfortable if, at the
moment of interruption, we have a descriptor of a query other than the
one that has been executing and holding resources.


I think that "descriptor" here refers to ActiveQueryDesc, in which case
it is updated at the beginning of ExecutorRun(), so I am wondering if
the situation you're worried about would not occur.

-
@@ -303,10 +306,21 @@ ExecutorRun(QueryDesc *queryDesc, ScanDirection direction,
 			uint64 count, bool execute_once)
 {
+	/*
+	 * Update ActiveQueryDesc here to enable retrieval of the currently
+	 * running queryDesc for nested queries.
+	 */
+	QueryDesc  *save_ActiveQueryDesc;
+
+	save_ActiveQueryDesc = ActiveQueryDesc;
+	ActiveQueryDesc = queryDesc;
-
--
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation




Re: [PGdocs] fix description for handling of non-ASCII characters

2023-09-27 Thread Peter Smith
On Thu, Sep 28, 2023 at 11:19 AM Tom Lane  wrote:
>
> ... trailing bytes that could be mistaken for ASCII are precisely
> the property that causes us to reject an encoding as not backend-safe.

Oh, that is good to know. Thanks for the information.

==
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: pg_stat_get_activity(): integer overflow due to (int) * (int) for MemoryContextAllocHuge()

2023-09-27 Thread Andres Freund
Hi,

On 2023-09-28 07:53:45 +0900, Michael Paquier wrote:
> On Wed, Sep 27, 2023 at 10:29:25AM -0700, Andres Freund wrote:
> > Frankly, it seems like a quite bad idea to have such a high limit for
> > pgstat_track_activity_query_size. The overhead such a high value has will
> > surprise people...
> 
> Still it could have some value for some users with large analytical
> queries where the syslogger is not going to be a bottleneck?  It seems
> too late to me to change that, but perhaps the docs could be improved
> to tell that using a too high value can have performance consequences,
> while mentioning the maximum value.

I don't think the issue is syslogger, the problem is that suddenly accessing
pg_stat_activity requires gigabytes of memory. That's insane.

Greetings,

Andres Freund




Re: Eager page freeze criteria clarification

2023-09-27 Thread Andres Freund
On 2023-09-27 19:09:41 -0400, Melanie Plageman wrote:
> On Wed, Sep 27, 2023 at 3:25 PM Robert Haas  wrote:
> >
> > On Wed, Sep 27, 2023 at 12:34 PM Andres Freund  wrote:
> > > One way to deal with that would be to not track the average age in
> > > LSN-difference-bytes, but convert the value to some age metric at that
> > > time. If we e.g. were to convert the byte-age into an approximate age in
> > > checkpoints, with quadratic bucketing (e.g. 0 -> current checkpoint, 1 -> 
> > > 1
> > > checkpoint, 2 -> 2 checkpoints ago, 3 -> 4 checkpoints ago, ...), using a 
> > > mean
> > > of that age would probably be fine.
> >
> > Yes. I think it's possible that we could even get by with just two
> > buckets. Say current checkpoint and not. Or current-or-previous
> > checkpoint and not. And just look at what percentage of accesses fall
> > into this first bucket -- it should be small or we're doing it wrong.
> > It seems like the only thing we actually need to avoid is freezing the
> > same pages over and over again in a tight loop.
>
> At the risk of seeming too execution-focused, I want to try and get more
> specific.

I think that's a good intuition :)

> Here is a description of an example implementation to test my
> understanding:
>
> In table-level stats, save two numbers: younger_than_cpt/older_than_cpt
> storing the number of instances of unfreezing a page which is either
> younger or older than the start of the most recent checkpoint at the
> time of its unfreezing.

> This has the downside of counting most unfreezings directly after a
> checkpoint in the older_than_cpt bucket. That is: older_than_cpt !=
> longer_frozen_duration at certain times in the checkpoint cycle.

Yea - I don't think just using before/after checkpoint is a good measure. As
you say, it'd be quite jumpy around checkpoints - even though the freezing
behaviour hasn't materially changed. I think using the *distance* between
checkpoints would be a more reliable measure, i.e. if (insert_lsn - page_lsn)
< recent_average_lsn_diff_between_checkpoints, then it's recently modified,
otherwise not.

One problem with using checkpoints "distances" to control things is
forced/immediate checkpoints. The fact that a base backup was started (and
thus a checkpoint completed much earlier than it would have otherwise)
shouldn't make our system assume that the overall behaviour is quite different
going forward.


> Now, I'm trying to imagine how this would interact in a meaningful way
> with opportunistic freezing behavior during vacuum.
>
> You would likely want to combine it with one of the other heuristics we
> discussed.
>
> For example:
> For a table with only 20% younger unfreezings, when vacuuming that page,

Fwiw, I wouldn't say that unfreezing 20% of recently frozen pages is a low
value.


>   if insert LSN - RedoRecPtr < insert LSN - page LSN
>   page is older than the most recent checkpoint start, so freeze it
>   regardless of whether or not it would emit an FPI
>
> What aggressiveness levels should there be? What should change at each
> level? What criteria should pages have to meet to be subject to the
> aggressiveness level?

I'm thinking something very roughly along these lines could make sense:

page_lsn_age = insert_lsn - page_lsn;

if (dirty && !fpi)
{
    /*
     * If we can freeze without an FPI, be quite aggressive about
     * opportunistically freezing. We just need to prevent freezing
     * when the table is constantly being rewritten. It's ok to make
     * mistakes initially - the rate of unfreezes will quickly stop us
     * from making mistakes as often.
     */
#define NO_FPI_FREEZE_FACTOR 10.0
    if (page_lsn_age >
        average_lsn_bytes_per_checkpoint * (1 - recent_unfreeze_ratio) *
        NO_FPI_FREEZE_FACTOR)
        freeze = true;
}
else
{
    /*
     * Freezing would emit an FPI and/or dirty the page, making freezing
     * quite a bit more costly. Be more hesitant about freezing recently
     * modified data, unless it's very rare that we unfreeze recently
     * modified data. For insert-only/mostly tables, unfreezes should be
     * rare, so we'll still freeze most of the time.
     */
#define FPI_FREEZE_FACTOR 1
    if (page_lsn_age >
        average_lsn_bytes_per_checkpoint * (1 - recent_unfreeze_ratio) *
        FPI_FREEZE_FACTOR)
        freeze = true;
}

Greetings,

Andres Freund




Re: Avoid a possible out-of-bounds access (src/backend/optimizer/util/relnode.c)

2023-09-27 Thread David Rowley
On Thu, 28 Sept 2023 at 02:37, Ranier Vilela  wrote:
>> Please check [1] for the mention of:
>>
>> "The fastest way to get your patch rejected is to make unrelated
>> changes. Reformatting lines that haven't changed, changing unrelated
>> comments you felt were poorly worded, touching code not necessary to
>> your change, etc. Each patch should have the minimum set of changes
>> required to work robustly. If you do not follow the code formatting
>> suggestions above, expect your patch to be returned to you with the
>> feedback of "follow the code conventions", quite likely without any
>> other review."
>
> Forgive my impulsiveness, anyone who loves perfect, well written code, would 
> understand.

Perhaps, but the committers on this project seem to be against the
blunderbuss approach to committing patches.  You might meet less
resistance around here if you assume all of their weapons have a scope
and that they like to aim for something before pulling the trigger.

Personally, this seems like a good idea to me and I'd like to follow
it too.  If you'd like to hold out for a committer whose weapon of
choice is the blunderbuss, then I can back off and let you wait for one
to come along. Just let me know.

> Do you have an objection to fixing the function find_base_rel_ignore_join?
> Or is it included in unrelated changes?

Well, the topic seems to be adding additional safety to prevent
accessing negative entries for simple_rel_array.  I can't think why
fixing the same theoretical hazard in find_base_rel_ignore_join()
would be unrelated.  I hope you can see the difference here. Randomly
adjusting function signatures because you happen to be changing some
code within that function does not, in my book, seem related.  You
seem to agree with this given you mentioned "Of course that has
nothing to do with it."
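
For reference, one shape such a guard can take (a sketch modeled on the
planner's find_base_rel(), not necessarily the exact committed wording):

RelOptInfo *
find_base_rel(PlannerInfo *root, int relid)
{
    /*
     * A single unsigned comparison rejects both negative and too-large
     * relids before simple_rel_array is indexed.
     */
    if ((uint32) relid < (uint32) root->simple_rel_array_size)
    {
        RelOptInfo *rel = root->simple_rel_array[relid];

        if (rel)
            return rel;
    }

    elog(ERROR, "no relation entry for relid %d", relid);

    return NULL;                /* keep compiler quiet */
}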

David




Re: Eager page freeze criteria clarification

2023-09-27 Thread Peter Geoghegan
On Wed, Sep 27, 2023 at 5:42 PM Melanie Plageman
 wrote:
> On Wed, Sep 27, 2023 at 5:27 PM Peter Geoghegan  wrote:
> > What about my idea of holding back when some tuples are already frozen
> > from before? Admittedly that's still a fairly raw idea, but something
> > along those lines seems promising.
>
> Originally, when you proposed this, the idea was to avoid freezing a
> page that has some tuples already frozen because that would mean we
> have frozen it and it was unfrozen after.

Right.

> But, today, I was thinking about what heuristics to combine with one
> that considers the average amount of time pages in a table spend
> frozen [1], and I think it might be interesting to use the "some
> tuples on the page are frozen" test in the opposite way than it was
> originally proposed.
>
> Take a case where you insert a row and then update it once and repeat
> forever. Let's say you freeze the page before you've filled it up,
> and, on the next update, the page is unfrozen. Most of the tuples on
> the page will still be frozen. The next time the page is vacuumed, the
> fact that the page has a lot of frozen tuples actually means that it
> is probably worth freezing the remaining few tuples and marking the
> page frozen.

Assuming that you can get away with not writing an FPI, yeah,
probably. Otherwise, maybe, less clear.

> Basically, I'm wondering if we can use the page-is-partially-frozen
> test to catch cases where we are willing to freeze a page a couple of
> times to make sure it gets frozen.

That doesn't seem like the opposite of my suggestion (except maybe in
a mechanical sort of way). To me this sounds like a refinement of the
same idea (not surprising that it would need to be refined).

You seem to be suggesting that we would freeze if either no tuples
were frozen at all, or if almost all tuples were already frozen -- but
not if the page was "somewhere in between". I think I see where you're
going with that.

-- 
Peter Geoghegan




Re: Eager page freeze criteria clarification

2023-09-27 Thread Peter Geoghegan
On Wed, Sep 27, 2023 at 6:07 PM Andres Freund  wrote:
> > I would be sure to look out for new inserts that "unfreeze" pages, too
> > -- ideally you'd have instrumentation that caught that, in order to
> > get a general sense of the extent of the problem in each of your
> > chosen representative workloads. This is particularly likely to be a
> > concern when there is enough space on a heap page to fit one more heap
> > tuple, that's smaller than most other tuples. The FSM will "helpfully"
> > make sure of it. This problem isn't rare at all, unfortunately.
>
> I'm not as convinced as you are that this is a problem / that the solution
> won't cause more problems than it solves. Users are concerned when free space
> can't be used - you don't have to look further than the discussion in the last
> weeks about adding the ability to disable HOT to fight bloat.

All that I said was that it would be a good idea to keep an eye out
for it during performance validation work.

> I do agree that the FSM code tries way too hard to fit things onto early pages
> - it e.g. can slow down concurrent copy workloads by 3-4x due to contention in
> the FSM - and that it has more size classes than necessary, but I don't think
> just closing frozen pages against further insertions of small tuples will
> cause its own set of issues.

Of course it'll cause other issues. Of course it's very subtle.

> I think at the very least there'd need to be something causing pages to reopen
> once the aggregate unused space in the table reaches some threshold.

Of course that's true. ISTM that you might well need some kind of
hysteresis to avoid pages ping-ponging. If it isn't sticky, it might
never settle, or take way too long to settle.

-- 
Peter Geoghegan




Re: [PGdocs] fix description for handling pf non-ASCII characters

2023-09-27 Thread Tom Lane
Peter Smith  writes:
> I had in mind something like a SHIFT-JIS encoding where a single
> "character" may include some trail bytes that happen to be in the
> ASCII printable range. AFAIK because the new logic is processing
> bytes, not characters, I thought the end result could be a mix of
> escaped and unescaped bytes for the single SJIS character.

It will not, because ...

> But now looking at PostgreSQL-supported character sets [1] I saw SJIS
> is not supported anyhow. Unfortunately, I am not familiar enough with
> other encodings to know if there is still a chance of similar
> printable ASCII trail bytes so I am fine with whatever wording is
> chosen.

... trailing bytes that could be mistaken for ASCII are precisely
the property that causes us to reject an encoding as not backend-safe.
So this code doesn't need to consider that hazard, and processing the
string byte-by-byte is perfectly OK.

I'd be inclined to keep the text as simple as possible and not focus on
the distinction between bytes and characters.

regards, tom lane




Re: [PGdocs] fix description for handling pf non-ASCII characters

2023-09-27 Thread Peter Smith
On Thu, Sep 28, 2023 at 10:30 AM Karl O. Pinc  wrote:
>
> On Thu, 28 Sep 2023 09:49:03 +1000
> Peter Smith  wrote:
>
> > On Wed, Sep 27, 2023 at 11:59 PM Karl O. Pinc 
> > wrote:
> > >
> > > On Wed, 27 Sep 2023 12:58:54 +
> > > "Hayato Kuroda (Fujitsu)"  wrote:
> > >
> > > > > Should the committer be interested, your patch applies cleanly
> > > > > and the docs build as expected.
> > > >
> > > > Yeah, but cfbot accepted previous version. Did you have anything
> > > > in your mind?
> > >
> > > No.  I'm letting the committer know everything I've checked
> > > so that they can decide what they want to check.
> > >
> > > > Hmm, what you said looked right. But as Peter pointed out [1], the
> > > fix seems too much. So I attached three versions of patches. What do
> > > you think? For me, type C is best.
> > > >
> > > > A. A patch which completely follows your comments. The name is
> > > > "v3-0001-...patch". Cfbot tests it.
> > > > B. A patch which completely follows Peter's comments [1]. The
> > > > name is "Peter_v3-txt".
> > > > C. A patch which follows both comments. Based on
> > > > b, but some comments (Don't use the future tense, "Other
> > > > characters"->"The bytes of other characters"...) were picked. The
> > > > name is "Both_v3-txt".
> > >
> > > I also like C.  Fewer words is better.  So long
> > > as nothing is left unsaid fewer words make for clarity.
> > >
> > > However, in the last hunk, "of other than" does not read well.
> > > Instead of writing
> > > "and the bytes of other than printable ASCII characters"
> > > you want "and the bytes that are not printable ASCII characters".
> > > That would be my suggestion.
> > >
> >
> > I also prefer Option C, but...
> >
> > ~~~
> >
> > +application_name value.
> > +The bytes of other characters are replaced with
> > +C-style escaped hexadecimal
> > +byte values.
> >
> > V
> >
> > +cluster_name value.
> > +The bytes of other characters are replaced with
> > +C-style escaped hexadecimal
> > +byte values.
> >
> > V
> >
> > +  NAMEDATALEN characters and the bytes of other than
> > +  printable ASCII characters are replaced with C-style escaped hexadecimal byte
> > +  values.
> >
> >
> > IIUC all of these 3 places can have exactly the same wording change
> > (e.g. like Karl's last suggestion [1]).
> >
> > SUGGESTION
> > Any bytes that are not printable ASCII characters are replaced with
> > C-style escaped hexadecimal
> > byte values.
>
> I don't see the utility in having exactly the same phrase everywhere,
> especially since the last hunk is modifying the end of a long
> sentence.  (Apologies if I'm mis-reading what Peter wrote above.)
>
> I like short sentences.  So I prefer "The bytes of other characters"
> rather than "Any bytes that are not printable ASCII characters"
> for the first 2 hunks.  In context I don't see the need to repeat
> the whole "printable ASCII characters" part that appears in the
> preceding sentence of both hunks.  "Other" is clear, IMHO.
>

I had in mind something like a SHIFT-JIS encoding where a single
"character" may include some trail bytes that happen to be in the
ASCII printable range. AFAIK because the new logic is processing
bytes, not characters, I thought the end result could be a mix of
escaped and unescaped bytes for the single SJIS character. In that
context, I felt "The bytes of other characters" was not quite
accurate.

But now looking at PostgreSQL-supported character sets [1] I saw SJIS
is not supported anyhow. Unfortunately, I am not familiar enough with
other encodings to know if there is still a chance of similar
printable ASCII trail bytes so I am fine with whatever wording is
chosen.

> But because I like short sentences I now think that it's a good
> idea to break the long sentence of the last hunk into two.
> Add a period and use the Peter's SUGGESTION above as the
> text for the second sentence.
>
> Is this desirable?
>

+1.

==
[1] https://www.postgresql.org/docs/current/multibyte.html

Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Set enable_seqscan doesn't take effect?

2023-09-27 Thread David Rowley
On Thu, 28 Sept 2023 at 13:47, jacktby jacktby  wrote:
>
> postgres=# SET enable_seqscan = off;
> SET
> postgres=# explain select * from t;
>QUERY PLAN
> -
>  Seq Scan on t  (cost=10000000000.00..10000000023.60 rows=1360 width=32)
> (1 row)
>
> postgres=#  select * from t;
>a
> ---
>  [1,2]
> (1 row)>

It may be worth checking what the manual says about this. I guess if
you assume the meaning from the GUC name, then it might be surprising.
If you're still surprised after reading the manual then please report
back.

David




Re: Add test module for Table Access Method

2023-09-27 Thread Michael Paquier
On Sat, Jun 03, 2023 at 07:42:36PM -0400, Fabrízio de Royes Mello wrote:
> So in order to improve things a bit in this area I'm proposing to add a
> test module for Table Access Method similar what we already have for Index
> Access Method.
> 
> This code is based on the "blackhole_am" implemented by Michael Paquier:
> https://github.com/michaelpq/pg_plugins/tree/main/blackhole_am

dummy_index_am has included from the start additional coverage for the
various internal add_*_reloption routines, which were never covered in
the core tree.  Unless I am missing something, I am not seeing the same
kind of extra usefulness in the patch you've sent here.
--
Michael




Re: Eager page freeze criteria clarification

2023-09-27 Thread Andres Freund
Hi,

On 2023-09-27 17:43:04 -0700, Peter Geoghegan wrote:
> On Wed, Sep 27, 2023 at 5:20 PM Melanie Plageman
>  wrote:
> > > Can you define "unfreeze"? I don't know if this newly invented term
> > > refers to unsetting a page that was marked all-frozen following (say)
> > > an UPDATE, or if it refers to choosing to not freeze when the option
> > > was available (in the sense that it was possible to do it and fully
> > > mark the page all-frozen in the VM). Or something else.
> >
> > By "unfreeze", I mean unsetting a page all frozen in the visibility
> > map when modifying the page for the first time after it was last
> > frozen.
>
> I see. So I guess that Andres meant that you'd track that within all
> backends, using pgstats infrastructure (when he summarized your call
> earlier today)?

That call was just between Robert and me (and not dedicated just to this
topic, fwiw).

Yes, I was thinking of tracking that in pgstat. I can imagine occasionally
rolling it over into pg_class, to better deal with crashes / failovers, but am
fairly agnostic on whether that's really useful / necessary.


> And that that information would be an important input for VACUUM, as opposed
> to something that it maintained itself?

Yes. If the ratio of opportunistically frozen pages (which I'd define as pages
that were frozen not because they strictly needed to be) vs the number of
unfrozen pages increases, we need to make opportunistic freezing less
aggressive and vice versa.


> ISTM that the concept of "unfreezing" a page is equivalent to
> "opening" the page that was "closed" at some point (by VACUUM). It's
> not limited to freezing per se -- it's "closed for business until
> further notice", which is a slightly broader concept (and one not
> unique to Postgres). You don't just need to be concerned about updates
> and deletes -- inserts are also a concern.
>
> I would be sure to look out for new inserts that "unfreeze" pages, too
> -- ideally you'd have instrumentation that caught that, in order to
> get a general sense of the extent of the problem in each of your
> chosen representative workloads. This is particularly likely to be a
> concern when there is enough space on a heap page to fit one more heap
> tuple, that's smaller than most other tuples. The FSM will "helpfully"
> make sure of it. This problem isn't rare at all, unfortunately.

I'm not as convinced as you are that this is a problem / that the solution
won't cause more problems than it solves. Users are concerned when free space
can't be used - you don't have to look further than the discussion in the last
weeks about adding the ability to disable HOT to fight bloat.

I do agree that the FSM code tries way too hard to fit things onto early pages
- it e.g. can slow down concurrent copy workloads by 3-4x due to contention in
the FSM - and that it has more size classes than necessary, but I don't think
just closing frozen pages against further insertions of small tuples will
cause its own set of issues.

I think at the very least there'd need to be something causing pages to reopen
once the aggregate unused space in the table reaches some threshold.

Greetings,

Andres Freund




Re: Eager page freeze criteria clarification

2023-09-27 Thread Peter Geoghegan
On Wed, Sep 27, 2023 at 5:20 PM Melanie Plageman
 wrote:
> > Can you define "unfreeze"? I don't know if this newly invented term
> > refers to unsetting a page that was marked all-frozen following (say)
> > an UPDATE, or if it refers to choosing to not freeze when the option
> > was available (in the sense that it was possible to do it and fully
> > mark the page all-frozen in the VM). Or something else.
>
> By "unfreeze", I mean unsetting a page all frozen in the visibility
> map when modifying the page for the first time after it was last
> frozen.

I see. So I guess that Andres meant that you'd track that within all
backends, using pgstats infrastructure (when he summarized your call
earlier today)? And that that information would be an important input
for VACUUM, as opposed to something that it maintained itself?

> I would probably call choosing not to freeze when the option is
> available "no freeze". I have been thinking of what to call it because
> I want to add some developer stats for myself indicating why a page
> that was freezable was not frozen.

I think that having that sort of information available via custom
instrumentation (just for the performance validation side) makes a lot
of sense.

ISTM that the concept of "unfreezing" a page is equivalent to
"opening" the page that was "closed" at some point (by VACUUM). It's
not limited to freezing per se -- it's "closed for business until
further notice", which is a slightly broader concept (and one not
unique to Postgres). You don't just need to be concerned about updates
and deletes -- inserts are also a concern.

I would be sure to look out for new inserts that "unfreeze" pages, too
-- ideally you'd have instrumentation that caught that, in order to
get a general sense of the extent of the problem in each of your
chosen representative workloads. This is particularly likely to be a
concern when there is enough space on a heap page to fit one more heap
tuple, that's smaller than most other tuples. The FSM will "helpfully"
make sure of it. This problem isn't rare at all, unfortunately.

> > The choice to freeze or not freeze pretty much always relies on
> > guesswork about what'll happen to the page in the future, no?
> > Obviously we wouldn't even apply the FPI trigger criteria if we could
> > somehow easily determine that it won't work out (to some degree that's
> > what conditioning it on being able to set the all-frozen VM bit
> > actually does).
>
> I suppose you are thinking of "opportunistic" as freezing whenever we
> aren't certain it is the right thing to do simply because we have the
> opportunity to do it?

I have heard the term "opportunistic freezing" used to refer to
freezing that takes place outside of VACUUM before now. You know,
something perfectly analogous to pruning in VACUUM versus
opportunistic pruning. (I knew that you can't have meant that -- my
point is that the terminology in this area has problems.)

> I want a way to express "freeze when freeze min age doesn't require it"

That makes sense when you consider where we are right now, but it'll
sound odd in a world where freezing via min_freeze_age is the
exception rather than the rule. If anything, it would make more sense
if the traditional min_freeze_age trigger criteria was the type of
freezing that needed its own adjective.

-- 
Peter Geoghegan




Re: Eager page freeze criteria clarification

2023-09-27 Thread Melanie Plageman
On Wed, Sep 27, 2023 at 5:27 PM Peter Geoghegan  wrote:
>
> On Wed, Sep 27, 2023 at 1:45 PM Andres Freund  wrote:
> > I am much more concerned about cases where
> > opportunistic freezing requires an FPI - it'll often *still* be the right
> > choice to freeze the page, but we need a way to prevent that from causing a
> > lot of WAL in worse cases.
>
> What about my idea of holding back when some tuples are already frozen
> from before? Admittedly that's still a fairly raw idea, but something
> along those lines seems promising.

Originally, when you proposed this, the idea was to avoid freezing a
page that has some tuples already frozen because that would mean we
have frozen it and it was unfrozen after.

But, today, I was thinking about what heuristics to combine with one
that considers the average amount of time pages in a table spend
frozen [1], and I think it might be interesting to use the "some
tuples on the page are frozen" test in the opposite way than it was
originally proposed.

Take a case where you insert a row and then update it once and repeat
forever. Let's say you freeze the page before you've filled it up,
and, on the next update, the page is unfrozen. Most of the tuples on
the page will still be frozen. The next time the page is vacuumed, the
fact that the page has a lot of frozen tuples actually means that it
is probably worth freezing the remaining few tuples and marking the
page frozen.

Basically, I'm wondering if we can use the page-is-partially-frozen
test to catch cases where we are willing to freeze a page a couple of
times to make sure it gets frozen.

- Melanie

[1] 
https://www.postgresql.org/message-id/CAAKRu_Y0nLmQ%3DYS1c2ORzLi7bu3eWjdx%2B32BuFc0Tho2o7E3rw%40mail.gmail.com




Re: Fail hard if xlogreader.c fails on out-of-memory

2023-09-27 Thread Michael Paquier
On Tue, Sep 26, 2023 at 06:28:30PM -0700, Noah Misch wrote:
> On Wed, Sep 27, 2023 at 11:06:37AM +1300, Thomas Munro wrote:
>> What Michael wants to do now is remove the 2004-era assumption that
>> malloc failure implies bogus data.  It must be pretty unlikely in a 64
>> bit world with overcommitted virtual memory, but a legitimate
>> xl_tot_len can falsely end recovery and lose data, as reported from a
>> production case analysed by his colleagues.  In other words, we can
>> actually distinguish between lack of resources and recycled bogus
>> data, so why treat them the same?
> 
> Indeed.  Hard failure is fine, and ENOMEM=end-of-WAL definitely isn't.

Are there any more comments and/or suggestions here?

If none, I propose to apply the patch to switch to palloc() instead of
palloc_extended(NO_OOM) in this code around the beginning of next
week, down to 12.
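
For reference, a sketch of what such a change might look like in
xlogreader.c's record-buffer allocation (abbreviated; the actual patch
may differ):

-    state->readRecordBuf =
-        (char *) palloc_extended(newSize, MCXT_ALLOC_NO_OOM);
+    /*
+     * palloc() raises ERROR (or exits, in frontend code) on OOM, so an
+     * allocation failure can no longer masquerade as end-of-WAL.
+     */
+    state->readRecordBuf = (char *) palloc(newSize);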
--
Michael




Re: [PGdocs] fix description for handling pf non-ASCII characters

2023-09-27 Thread Karl O. Pinc
On Thu, 28 Sep 2023 09:49:03 +1000
Peter Smith  wrote:

> On Wed, Sep 27, 2023 at 11:59 PM Karl O. Pinc 
> wrote:
> >
> > On Wed, 27 Sep 2023 12:58:54 +
> > "Hayato Kuroda (Fujitsu)"  wrote:
> >  
> > > > Should the committer be interested, your patch applies cleanly
> > > > and the docs build as expected.  
> > >
> > > Yeah, but cfbot accepted previous version. Did you have anything
> > > in your mind?  
> >
> > No.  I'm letting the committer know everything I've checked
> > so that they can decide what they want to check.
> >  
> > > Hmm, what you said looked right. But as Peter pointed out [1], the
> > > fix seems too much. So I attached three versions of patches. What do
> > > you think? For me, type C is best.
> > >
> > > A. A patch which completely follows your comments. The name is
> > > "v3-0001-...patch". Cfbot tests it.
> > > B. A patch which completely follows Peter's comments [1]. The
> > > name is "Peter_v3-txt".
> > > C. A patch which follows both comments. Based on
> > > b, but some comments (Don't use the future tense, "Other
> > > characters"->"The bytes of other characters"...) were picked. The
> > > name is "Both_v3-txt".  
> >
> > I also like C.  Fewer words is better.  So long
> > as nothing is left unsaid fewer words make for clarity.
> >
> > However, in the last hunk, "of other than" does not read well.
> > Instead of writing
> > "and the bytes of other than printable ASCII characters"
> > you want "and the bytes that are not printable ASCII characters".
> > That would be my suggestion.
> >  
> 
> I also prefer Option C, but...
> 
> ~~~
> 
> +application_name value.
> +The bytes of other characters are replaced with
> +C-style escaped hexadecimal
> +byte values.
> 
> V
> 
> +cluster_name value.
> +The bytes of other characters are replaced with
> +C-style escaped hexadecimal
> +byte values.
> 
> V
> 
> +  NAMEDATALEN characters and the bytes of other than
> +  printable ASCII characters are replaced with C-style escaped hexadecimal byte
> +  values.
> 
> 
> IIUC all of these 3 places can have exactly the same wording change
> (e.g. like Karl's last suggestion [1]).
> 
> SUGGESTION
> Any bytes that are not printable ASCII characters are replaced with
> C-style escaped hexadecimal
> byte values.

I don't see the utility in having exactly the same phrase everywhere,
especially since the last hunk is modifying the end of a long
sentence.  (Apologies if I'm mis-reading what Peter wrote above.)

I like short sentences.  So I prefer "The bytes of other characters"
rather than "Any bytes that are not printable ASCII characters"
for the first 2 hunks.  In context I don't see the need to repeat
the whole "printable ASCII characters" part that appears in the
preceding sentence of both hunks.  "Other" is clear, IMHO.

But because I like short sentences I now think that it's a good
idea to break the long sentence of the last hunk into two.
Add a period and use the Peter's SUGGESTION above as the
text for the second sentence.

Is this desirable?

Regards,

Karl 
Free Software:  "You don't pay back, you pay forward."
 -- Robert A. Heinlein

P.S. Hayato, it is good practice to cc everybody who has
replied to a thread.  At least I think that's what I see,
it's not just people being lazy with reply-all. So I'm
adding Tom Lane back to the thread.  He can tell us otherwise
if I'm wrong to add him back.




Re: Eager page freeze criteria clarification

2023-09-27 Thread Melanie Plageman
On Wed, Sep 27, 2023 at 7:39 PM Peter Geoghegan  wrote:
>
> On Wed, Sep 27, 2023 at 4:09 PM Melanie Plageman
>  wrote:
> > At the risk of seeming too execution-focused, I want to try and get more
> > specific. Here is a description of an example implementation to test my
> > understanding:
> >
> > In table-level stats, save two numbers: younger_than_cpt/older_than_cpt
> > storing the number of instances of unfreezing a page which is either
> > younger or older than the start of the most recent checkpoint at the
> > time of its unfreezing
>
> Can you define "unfreeze"? I don't know if this newly invented term
> refers to unsetting a page that was marked all-frozen following (say)
> an UPDATE, or if it refers to choosing to not freeze when the option
> was available (in the sense that it was possible to do it and fully
> mark the page all-frozen in the VM). Or something else.

By "unfreeze", I mean unsetting a page all frozen in the visibility
map when modifying the page for the first time after it was last
frozen.

I would probably call choosing not to freeze when the option is
available "no freeze". I have been thinking of what to call it because
I want to add some developer stats for myself indicating why a page
that was freezable was not frozen.

> I also find the term "opportunistic freezing" confusing. What's
> opportunistic about it? It seems as if the term has been used as a
> synonym of "freezing not triggered by min_freeze_age" on this thread,
> or "freezing that is partly conditioned on setting the page all-frozen
> in the VM", but I'm not even sure of that.

By "opportunistic freezing", I refer to freezing not triggered by
min_freeze_age. Though we only do so now if the page could be set
all-frozen, that is not a requirement for use of the term, IMO. I
assume that we are not considering changing any behavior related to
freezing when a tuple is older than freeze_min_age, so I think of that
freezing as "required". And any freezing of newer tuples is
opportunistic.

> The choice to freeze or not freeze pretty much always relies on
> guesswork about what'll happen to the page in the future, no?
> Obviously we wouldn't even apply the FPI trigger criteria if we could
> somehow easily determine that it won't work out (to some degree that's
> what conditioning it on being able to set the all-frozen VM bit
> actually does).

I suppose you are thinking of "opportunistic" as freezing whenever we
aren't certain it is the right thing to do simply because we have the
opportunity to do it? If so, I'm not sure we need that distinction
currently. It seems more useful in a narrative context as a way of
characterizing a situation than it does when categorizing different
behaviors.

I want a way to express "freeze when freeze min age doesn't require it"

- Melanie




Re: Should consider materializing the cheapest inner path in consider_parallel_nestloop()

2023-09-27 Thread David Rowley
On Fri, 8 Sept 2023 at 19:14, Richard Guo  wrote:
> explain select * from partsupp join lineitem on l_partkey > ps_partkey;
>   QUERY PLAN
> --
>  Gather  (cost=0.00..1807085.44 rows=16047 width=301)
>Workers Planned: 4
>->  Nested Loop  (cost=0.00..1807085.44 rows=40116667 width=301)
>  Join Filter: (lineitem.l_partkey > partsupp.ps_partkey)
>  ->  Parallel Seq Scan on lineitem  (cost=0.00..1518.44 rows=15044 width=144)
>  ->  Materialize  (cost=0.00..307.00 rows=8000 width=157)
>->  Seq Scan on partsupp  (cost=0.00..267.00 rows=8000 width=157)
> (7 rows)
>
> The execution time (ms) are (avg of 3 runs):
>
> unpatched:  71769.21
> patched:65510.04

This gap would be wider if the partsupp Seq Scan were filtering off
some rows and wider still if you added more rows to lineitem.
However, a clauseless seqscan is not the most compelling use case
below a material node. The inner side of the nested loop could be some
subquery that takes 6 days to complete. Running the 6 day query ~15044
times seems like something that would be good to avoid.
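
(For what it's worth, running Richard's query under EXPLAIN (ANALYZE)
makes the saving visible: the inner Seq Scan reports loops=1, while the
per-outer-row rescans land on the cheap Materialize node as loops=N.)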

It seems worth considering Material paths to me.  I think that the
above example could be tuned any way you like to make it look better
or worse.

David




Re: [PGdocs] fix description for handling pf non-ASCII characters

2023-09-27 Thread Peter Smith
On Wed, Sep 27, 2023 at 11:59 PM Karl O. Pinc  wrote:
>
> On Wed, 27 Sep 2023 12:58:54 +
> "Hayato Kuroda (Fujitsu)"  wrote:
>
> > > Should the committer be interested, your patch applies cleanly
> > > and the docs build as expected.
> >
> > Yeah, but cfbot accepted previous version. Did you have anything in
> > your mind?
>
> No.  I'm letting the committer know everything I've checked
> so that they can decide what they want to check.
>
> > Hmm, what you said looked right. But as Peter pointed out [1], the
> > fix seems too much. So I attached three versions of patches. What do
> > you think? For me, type C is best.
> >
> > A. A patch which completely follows your comments. The name is
> > "v3-0001-...patch". Cfbot tests it.
> > B. A patch which completely follows Peter's comments [1]. The name is
> > "Peter_v3-txt".
> > C. A patch which follows both comments. Based on
> > b, but some comments (Don't use the future tense, "Other
> > characters"->"The bytes of other characters"...) were picked. The
> > name is "Both_v3-txt".
>
> I also like C.  Fewer words is better.  So long
> as nothing is left unsaid fewer words make for clarity.
>
> However, in the last hunk, "of other than" does not read well.
> Instead of writing
> "and the bytes of other than printable ASCII characters"
> you want "and the bytes that are not printable ASCII characters".
> That would be my suggestion.
>

I also prefer Option C, but...

~~~

+application_name value.
+The bytes of other characters are replaced with
+C-style escaped hexadecimal
+byte values.

V

+cluster_name value.
+The bytes of other characters are replaced with
+C-style escaped hexadecimal
+byte values.

V

+  NAMEDATALEN characters and the bytes of other than
+  printable ASCII characters are replaced with C-style escaped hexadecimal byte
+  values.


IIUC all of these 3 places can have exactly the same wording change
(e.g. like Karl's last suggestion [1]).

SUGGESTION
Any bytes that are not printable ASCII characters are replaced with
C-style escaped hexadecimal
byte values.

==
[1] 
https://www.postgresql.org/message-id/20230927085924.4198c3d2%40slate.karlpinc.com

Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Should consider materializing the cheapest inner path in consider_parallel_nestloop()

2023-09-27 Thread David Rowley
On Fri, 8 Sept 2023 at 09:41, Robert Haas  wrote:
>
> On Tue, Sep 5, 2023 at 8:07 AM Richard Guo  wrote:
> > Yeah, this seems an omission in commit 45be99f8.
>
> It's been a while, but I think I omitted this deliberately because I
> didn't really understand the value of it and wanted to keep the
> planning cost down.

I think the value is potentially not having to repeatedly execute some
expensive rescan of the nested loop's inner side once for each outer-side
tuple.

The planning cost is something to consider for sure, but it seems
strange that we deemed it worthy to consider material paths for the
non-parallel input paths but draw the line for the parallel/partial
ones. It seems to me that the additional costs and the possible
benefits are the same for both.

David




Re: Eager page freeze criteria clarification

2023-09-27 Thread Peter Geoghegan
On Wed, Sep 27, 2023 at 4:09 PM Melanie Plageman
 wrote:
> At the risk of seeming too execution-focused, I want to try and get more
> specific. Here is a description of an example implementation to test my
> understanding:
>
> In table-level stats, save two numbers: younger_than_cpt/older_than_cpt
> storing the number of instances of unfreezing a page which is either
> younger or older than the start of the most recent checkpoint at the
> time of its unfreezing

Can you define "unfreeze"? I don't know if this newly invented term
refers to unsetting a page that was marked all-frozen following (say)
an UPDATE, or if it refers to choosing to not freeze when the option
was available (in the sense that it was possible to do it and fully
mark the page all-frozen in the VM). Or something else.

I also find the term "opportunistic freezing" confusing. What's
opportunistic about it? It seems as if the term has been used as a
synonym of "freezing not triggered by min_freeze_age" on this thread,
or "freezing that is partly conditioned on setting the page all-frozen
in the VM", but I'm not even sure of that.

The choice to freeze or not freeze pretty much always relies on
guesswork about what'll happen to the page in the future, no?
Obviously we wouldn't even apply the FPI trigger criteria if we could
somehow easily determine that it won't work out (to some degree that's
what conditioning it on being able to set the all-frozen VM bit
actually does).

-- 
Peter Geoghegan




Re: Eager page freeze criteria clarification

2023-09-27 Thread Melanie Plageman
On Wed, Sep 27, 2023 at 3:25 PM Robert Haas  wrote:
>
> On Wed, Sep 27, 2023 at 12:34 PM Andres Freund  wrote:
> > One way to deal with that would be to not track the average age in
> > LSN-difference-bytes, but convert the value to some age metric at that
> > time. If we e.g. were to convert the byte-age into an approximate age in
> > checkpoints, with quadratic bucketing (e.g. 0 -> current checkpoint, 1 -> 1
> > checkpoint, 2 -> 2 checkpoints ago, 3 -> 4 checkpoints ago, ...), using a
> > mean of that age would probably be fine.
>
> Yes. I think it's possible that we could even get by with just two
> buckets. Say current checkpoint and not. Or current-or-previous
> checkpoint and not. And just look at what percentage of accesses fall
> into this first bucket -- it should be small or we're doing it wrong.
> It seems like the only thing we actually need to avoid is freezing the
> same ages over and over again in a tight loop.

At the risk of seeming too execution-focused, I want to try and get more
specific. Here is a description of an example implementation to test my
understanding:

In table-level stats, save two numbers: younger_than_cpt/older_than_cpt
storing the number of instances of unfreezing a page which is either
younger or older than the start of the most recent checkpoint at the
time of its unfreezing

Upon update or delete (and insert?), if the page being modified is
frozen and
  if insert LSN - RedoRecPtr > insert LSN - old page LSN
  page is younger, younger_than_cpt += 1

  otherwise, older_than_cpt += 1

The ratio of younger/total and older/total can be used to determine how
aggressive opportunistic freezing will be.

This has the downside of counting most unfreezings directly after a
checkpoint in the older_than_cpt bucket. That is: older_than_cpt !=
longer_frozen_duration at certain times in the checkpoint cycle.
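
A sketch of that counting rule (PageGetLSN() and GetRedoRecPtr() are real
helpers; the stats counters and the surrounding hook are hypothetical).
Note that "insert LSN - RedoRecPtr > insert LSN - page LSN" reduces to
"page LSN > RedoRecPtr":

    XLogRecPtr  page_lsn = PageGetLSN(page);

    /* last modified after the latest checkpoint started? */
    if (page_lsn > GetRedoRecPtr())
        stats->younger_than_cpt++;      /* hypothetical counter */
    else
        stats->older_than_cpt++;        /* hypothetical counter */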

Now, I'm trying to imagine how this would interact in a meaningful way
with opportunistic freezing behavior during vacuum.

You would likely want to combine it with one of the other heuristics we
discussed.

For example:
For a table with only 20% younger unfreezings, when vacuuming that page,

  if insert LSN - RedoRecPtr < insert LSN - page LSN
  page is older than the most recent checkpoint start, so freeze it
  regardless of whether or not it would emit an FPI

What aggressiveness levels should there be? What should change at each
level? What criteria should pages have to meet to be subject to the
aggressiveness level?

I have some ideas, but I'd like to try an algorithm along these lines
with an updated work queue workload and the insert-only workload. And I
want to make sure I understand the proposal first.

- Melanie




Re: pg_stat_get_activity(): integer overflow due to (int) * (int) for MemoryContextAllocHuge()

2023-09-27 Thread Michael Paquier
On Wed, Sep 27, 2023 at 10:29:25AM -0700, Andres Freund wrote:
> I don't think going for size_t is a viable path for fixing this. I'm pretty
> sure the initial patch would trigger a type mismatch from guc_tables.c - we
> don't have infrastructure for size_t GUCs.

Nothing marked as PGDLLIMPORT uses size_t in the tree currently, FWIW.

> Perhaps we ought to error out (in BackendStatusShmemSize() or such) if
> pgstat_track_activity_query_size * MaxBackends >= 4GB?

Yeah, agreed that putting a check like that could catch errors more
quickly.
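
A minimal sketch of that check, leaning on the existing mul_size()
helper, which ereports on overflow instead of silently wrapping
(abbreviated - the real function sums several more components):

Size
BackendStatusShmemSize(void)
{
    /* int * int would wrap; mul_size() fails loudly instead */
    return mul_size(pgstat_track_activity_query_size, NumBackendStatSlots);
}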

> Frankly, it seems like a quite bad idea to have such a high limit for
> pgstat_track_activity_query_size. The overhead such a high value has will
> surprise people...

Still it could have some value for some users with large analytical
queries where the syslogger is not going to be a bottleneck?  It seems
too late to me to change that, but perhaps the docs could be improved
to tell that using a too high value can have performance consequences,
while mentioning the maximum value.
--
Michael




Re: Eager page freeze criteria clarification

2023-09-27 Thread Peter Geoghegan
On Wed, Sep 27, 2023 at 2:26 PM Peter Geoghegan  wrote:
> On Wed, Sep 27, 2023 at 1:45 PM Andres Freund  wrote:
> > I think we need to make vacuums on large tables much more aggressive than 
> > they
> > are now, independent of opportunistic freezing heuristics. It's idiotic that
> > on large tables we delay vacuuming until multi-pass vacuums are pretty much
> > guaranteed.
>
> Not having to do all of the freezing at once will often still make
> sense in cases where we "lose".

One more thing on this, and the subject of large table that keep
getting larger (including those with a "hot tail" of updates):

Since autovacuum runs against such tables at geometric intervals (as
determined by autovacuum_vacuum_insert_scale_factor), the next VACUUM
is always going to be longer and more expensive than this VACUUM,
forever (ignoring the influence of aggressive mode for a second). This
would even be true if we didn't have the related problem of
autovacuum_vacuum_insert_scale_factor not accounting for the fact that
when VACUUM starts and when VACUUM ends aren't exactly the same thing
in large tables [1] -- that aspect just makes the problem even worse.
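
(Concretely, as a back-of-envelope illustration: with
autovacuum_vacuum_insert_scale_factor = 0.2, an append-only table is
vacuumed each time it grows by another 20%, so successive VACUUMs see
roughly 1.0x, 1.2x, 1.44x, 1.73x, ... of the original size, each one
scanning more newly inserted pages than the last.)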

Basically, the whole "wait and see" approach makes zero sense here
because we really do need to be aggressive about freezing just to keep
up with the workload. The number of pages we'll scan in the next
VACUUM will always be significantly larger, even if we're very
aggressive about freezing (theoretically it might not be, but then
what VACUUM does doesn't matter that much either way). Time is very
much not on our side here. So we need to anticipate what happens next
with the workload, and how that affects VACUUM in the future -- not
just how VACUUM affects the workload. (VACUUM is just another part of
the workload, in fact.)

[1] 
https://www.postgresql.org/message-id/CAH2-Wzn=bZ4wynYB0hBAeF4kGXGoqC=PZVKHeerBU-je9AQF=g...@mail.gmail.com
-- 
Peter Geoghegan




Re: Streaming I/O, vectored I/O (WIP)

2023-09-27 Thread Thomas Munro
On Thu, Sep 28, 2023 at 9:13 AM Andres Freund  wrote:
> On 2023-09-27 21:33:15 +0300, Heikki Linnakangas wrote:
> > Looking at the later patch that introduces the streaming read API, seems
> > that it finishes all the reads, so I suppose we don't need an abort
> > function. Does it all get cleaned up correctly on error?
>
> I think it should.  The buffer error handling is one of the areas where I
> really would like to have some way of testing the various cases, it's easy to
> get things wrong, and basically impossible to write reliable tests for with
> our current infrastructure.

One thing to highlight is that this patch doesn't create a new state
in that case.  In master, we already have the concept of a buffer with
BM_TAG_VALID but not BM_VALID and not BM_IO_IN_PROGRESS, reachable if
there is an I/O error.  Eventually another reader will try the I/O
again, or the buffer will fall out of the pool.  With this patch it's
the same, it's just a wider window: more kinds of errors might be
thrown in code between Prepare() and Complete() before we even have
BM_IO_IN_PROGRESS.  So there is nothing extra to clean up.  Right?

Yeah, it would be nice to test buffer pool logic directly.  Perhaps
with a C unit test framework[1] and pluggable smgr[2] we could mock up
cases like I/O errors...

> > > typedef bool (*PgStreamingReadBufferDetermineNextCB) (PgStreamingRead *pgsr,
> > > uintptr_t pgsr_private,
> > > void *per_io_private,
> > > BufferManagerRelation *bmr,
> > > ForkNumber *forkNum,
> > > BlockNumber *blockNum,
> > > ReadBufferMode *mode);
> >
> > I was surprised that 'bmr', 'forkNum' and 'mode' are given separately on
> > each read. I see that you used that in the WAL replay prefetching, so I
> > guess that makes sense.
>
> Yea, that's the origin - I don't like it, but I don't really have a better
> idea.

Another idea I considered was that streams could be associated with a
single relation, but recovery could somehow manage a set of them.
From a certain point of view, that makes sense (we could be redoing
work that was created by multiple concurrent streams at 'do' time, and
with the approach shown here some clustering opportunities available
at do time are lost at redo time), but it's not at all clear that it's
worth the overheads or complexity, and I couldn't immediately figure
out how to do it.  But I doubt there would ever be any other users of
a single stream with multiple relations, and I agree that this is
somehow not quite satisfying...  Perhaps we should think about that
some more...

[1] 
https://www.postgresql.org/message-id/flat/CA%2BhUKG%2BajSQ_8eu2AogTncOnZ5me2D-Cn66iN_-wZnRjLN%2Bicg%40mail.gmail.com
[2] 
https://www.postgresql.org/message-id/flat/CAEze2WgMySu2suO_TLvFyGY3URa4mAx22WeoEicnK=pcnwe...@mail.gmail.com




Re: Eager page freeze criteria clarification

2023-09-27 Thread Peter Geoghegan
On Wed, Sep 27, 2023 at 1:45 PM Andres Freund  wrote:
> On 2023-09-27 13:14:41 -0700, Peter Geoghegan wrote:
> > As a general rule, I think that we're better of gambling against
> > future FPIs, and then pulling back if we go too far. The fact that we
> > went one VACUUM operation without the workload unsetting an
> > all-visible page isn't that much of a signal about what might happen
> > to that page.
>
> I think we can afford to be quite aggressive about opportunistically freezing
> when doing so wouldn't emit an FPI.

I agree. This part is relatively easy -- it is more or less v2 of the
FPI thing. The only problem that I see is that it just isn't that
compelling on its own -- it just doesn't seem ambitious enough.

> I am much more concerned about cases where
> opportunistic freezing requires an FPI - it'll often *still* be the right
> choice to freeze the page, but we need a way to prevent that from causing a
> lot of WAL in worse cases.

What about my idea of holding back when some tuples are already frozen
from before? Admittedly that's still a fairly raw idea, but something
along those lines seems promising.

If you limit yourself to what I've called the easy cases, then you
can't really expect to make a dent in the problems that we see with
large tables -- where FPIs are pretty much the norm rather than the
exception. Again, that's fine, but I'd be disappointed if you and
Melanie can't do better than that for 17.

> > 2. Large tables (i.e. the tables where it really matters) just don't
> > have that many VACUUM operations, relative to everything else.
>
> I think we need to make vacuums on large tables much more aggressive than they
> are now, independent of opportunistic freezing heuristics. It's idiotic that
> on large tables we delay vacuuming until multi-pass vacuums are pretty much
> guaranteed.

Not having to do all of the freezing at once will often still make
sense in cases where we "lose". It's hard to precisely describe how to
assess such things (what's the break even point?), but that makes it
no less true. Constantly losing by a small amount is usually better
than massive drops in performance.

> The current logic made some sense when we didn't have the VM, but now
> autovacuum scheduling is influenced by the portion of the table that that
> vacuum will never look at, which makes no sense.

Yep:

https://www.postgresql.org/message-id/CAH2-Wz=MGFwJEpEjVzXwEjY5yx=uunpza6bt4dsmasrgluq...@mail.gmail.com

> > Who says we'll get more than one opportunity per page with these tables,
> > even with this behavior of scanning all-visible pages in non-aggressive
> > VACUUMs?  Big append-only tables simply won't get the opportunity to catch
> > up in the next non-aggressive VACUUM if there simply isn't one.
>
> I agree that we need to freeze pages in append only tables ASAP. I don't think
> they're that hard a case to detect though. The harder case is the - IME very
> common - case of tables that are largely immutable but have a moving tail
> that's hotly updated.

Doing well with such "moving tail of updates" tables is exactly what
doing well on TPC-C requires. It's easy to detect that a table is
either one of these two things. Which of the two it is specifically
presents greater difficulties. So I don't see strict append-only
versus "append and update each row once" as all that different,
practically speaking, from the point of view of VACUUM.

Often, the individual pages from TPC-C's order lines table look very
much like pgbench_history style pages would -- just because VACUUM is
either ahead or behind the current "hot tail" position with almost all
pages that it scans, due to the current phase of the moon. This is
partly why I place so much emphasis on teaching VACUUM to understand
that what it sees in each page might well have a lot to do with when
it happened to show up, as opposed to something about the
workload/table itself.

BTW, if you're worried about the "hot tail" case in particular then
you should definitely put the FSM stuff in scope here -- it's a huge
part of the overall problem, since it makes the pages take so much
longer to "settle" than you might expect when just considering the
workload abstractly.

-- 
Peter Geoghegan




Re: Annoying build warnings from latest Apple toolchain

2023-09-27 Thread Tom Lane
I wrote:
> I've not yet looked at the meson build infrastructure to
> see if it needs a corresponding change.

I think it doesn't, as long as all the relevant build targets
write their dependencies with "frontend_code" before "libpq".
(The need for this is, of course, documented nowhere.  The state
of the documentation for our meson build system is abysmal.)

However, it's hard to test this, because the meson build
seems completely broken on current macOS:

-
$ meson setup build --prefix=$HOME/pginstall
The Meson build system
Version: 0.64.1
Source dir: /Users/tgl/pgsql
Build dir: /Users/tgl/pgsql/build
Build type: native build
Project name: postgresql
Project version: 17devel

meson.build:9:0: ERROR: Unable to detect linker for compiler `cc -Wl,--version`
stdout: 
stderr: ld: unknown options: --version 
clang: error: linker command failed with exit code 1 (use -v to see invocation)

A full log can be found at /Users/tgl/pgsql/build/meson-logs/meson-log.txt
-

That log file offers no more detail than the terminal output did.

(I also tried with a more recent meson version, 1.1.1, with
the same result.)

regards, tom lane




Re: Checks in RegisterBackgroundWorker.()

2023-09-27 Thread Heikki Linnakangas
Here's a new version of these patches. I fixed one comment and ran 
pgindent, no other changes.


--
Heikki Linnakangas
Neon (https://neon.tech)
From 1984e81c4191427fa2e358a664291edbf8d566ea Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Wed, 27 Sep 2023 23:39:09 +0300
Subject: [PATCH v2 1/3] Clarify the checks in RegisterBackgroundWorker.

In EXEC_BACKEND or single-user mode, we process
shared_preload_libraries at postmaster startup as usual, but also at
backend startup. When a library calls RegisterBackgroundWorker() when
being loaded into a backend process, we go through the motions to add
the worker to BackgroundWorkerList, even though that is a
postmaster-private data structure. Make it return early when called in
a backend process, without changing BackgroundWorkerList.

You could argue that it was intentional: In non-EXEC_BACKEND mode, the
backend processes inherit BackgroundWorkerList at fork(), so it does
make some sense to initialize it to the same state in EXEC_BACKEND
mode, too. It's clearly a postmaster-private structure, though, and
all the functions that use it are clearly marked as "should only be
called in postmaster".

You could also argue that libraries should not call
RegisterBackgroundWorker() during backend startup. It's too late to
correctly register any static background workers at that stage. But
it's a common pattern in extensions, and it doesn't seem worth the
churn to require all extensions to change it.

Another sloppiness was the exception for "internal" background
workers. We checked that RegisterBackgroundWorker() was called during
shared_preload_libraries processing, or the background worker was an
internal one. That exception was made in commit 665d1fad99 to allow
postmaster to register the logical apply launcher in
ApplyLauncherRegister(). The way the check was written, it would not
complain if you registered an internal background worker in a regular
backend process. But it would complain if postmaster registered a
background worker defined in a shared library, outside
shared_preload_libraries processing. I think the correct rule is that
you can only register static background workers in the postmaster
process, and only before the bgworker shared memory array has been
initialized. Check for that more directly.
---
 src/backend/postmaster/bgworker.c | 44 +++
 1 file changed, 33 insertions(+), 11 deletions(-)

diff --git a/src/backend/postmaster/bgworker.c b/src/backend/postmaster/bgworker.c
index 505e38376c3..31350d64013 100644
--- a/src/backend/postmaster/bgworker.c
+++ b/src/backend/postmaster/bgworker.c
@@ -880,21 +880,43 @@ RegisterBackgroundWorker(BackgroundWorker *worker)
 	RegisteredBgWorker *rw;
 	static int	numworkers = 0;
 
-	if (!IsUnderPostmaster)
-		ereport(DEBUG1,
-(errmsg_internal("registering background worker \"%s\"", worker->bgw_name)));
-
-	if (!process_shared_preload_libraries_in_progress &&
-		strcmp(worker->bgw_library_name, "postgres") != 0)
+	/*
+	 * Static background workers can only be registered in the postmaster
+	 * process.
+	 */
+	if (IsUnderPostmaster || !IsPostmasterEnvironment)
 	{
-		if (!IsUnderPostmaster)
-			ereport(LOG,
-	(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-	 errmsg("background worker \"%s\": must be registered in shared_preload_libraries",
-			worker->bgw_name)));
+		/*
+		 * In EXEC_BACKEND or single-user mode, we process
+		 * shared_preload_libraries in backend processes too.  We cannot
+		 * register static background workers at that stage, but many
+		 * libraries' _PG_init() functions don't distinguish whether they're
+		 * being loaded in the postmaster or in a backend, they just check
+		 * process_shared_preload_libraries_in_progress.  It's a bit sloppy,
+		 * but for historical reasons we tolerate it.  In EXEC_BACKEND mode,
+		 * the background workers should already have been registered when the
+		 * library was loaded in postmaster.
+		 */
+		if (process_shared_preload_libraries_in_progress)
+			return;
+		ereport(LOG,
+(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("background worker \"%s\": must be registered in shared_preload_libraries",
+		worker->bgw_name)));
 		return;
 	}
 
+	/*
+	 * Cannot register static background workers after calling
+	 * BackgroundWorkerShmemInit().
+	 */
+	if (BackgroundWorkerData != NULL)
+		elog(ERROR, "cannot register background worker \"%s\" after shmem init",
+			 worker->bgw_name);
+
+	ereport(DEBUG1,
+			(errmsg_internal("registering background worker \"%s\"", worker->bgw_name)));
+
 	if (!SanityCheckBackgroundWorker(worker, LOG))
 		return;
 
-- 
2.39.2

From c28a16ccbc10eea01478a0f839bad0404870aba1 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Thu, 24 Aug 2023 18:08:44 +0300
Subject: [PATCH v2 2/3] Allocate Backend structs in PostmasterContext.

The child processes don't need them. By allocating them in
PostmasterContext, the memory gets free'd and is made available for
other stuff in the child process.

Re: Eager page freeze criteria clarification

2023-09-27 Thread Andres Freund
Hi,

On 2023-09-27 13:14:41 -0700, Peter Geoghegan wrote:
> On Wed, Sep 27, 2023 at 1:03 PM Andres Freund  wrote:
> > I suspect that medium term the better approach would be to be much more
> > aggressive about setting all-visible, including as part of page-level
> > visibility checks, and to deal with the concern of vacuum not processing 
> > such
> > pages soon by not just looking at unmarked pages, but also a portion of the
> > all-visible-but-not-frozen pages (the more all-visible-but-not-frozen pages
> > there are, the more of them each vacuum should process). I think all-visible
> > is too important for index only scans, for us to be able to remove it, or
> > delay setting it until freezing makes sense.
> >
> > My confidence in my gut feeling isn't all that high here, though.
> 
> I think that this is a bad idea. I see two main problems with
> "splitting the difference" like this:
> 
> 1. If we randomly scan some all-visible pages in non-aggressive
> VACUUMs, then we're sure to get FPIs in order to be able to freeze.
> 
> As a general rule, I think that we're better of gambling against
> future FPIs, and then pulling back if we go too far. The fact that we
> went one VACUUM operation without the workload unsetting an
> all-visible page isn't that much of a signal about what might happen
> to that page.

I think we can afford to be quite aggressive about opportunistically freezing
when doing so wouldn't emit an FPI. I am much more concerned about cases where
opportunistic freezing requires an FPI - it'll often *still* be the right
choice to freeze the page, but we need a way to prevent that from causing a
lot of WAL in worse cases.

I think the cases where no-FPI-required freezing is a problem are small,
frequently updated tables, which are very frequently vacuumed.



> 2. Large tables (i.e. the tables where it really matters) just don't
> have that many VACUUM operations, relative to everything else.

I think we need to make vacuums on large tables much more aggressive than they
are now, independent of opportunistic freezing heuristics. It's idiotic that
on large tables we delay vacuuming until multi-pass vacuums are pretty much
guaranteed.  We don't have the stats to detect that, but if we had them, we
should trigger autovacuums when dead items + dead tuples get to a significant
fraction of what can be tracked in m_w_m.

The current logic made some sense when we didn't have the VM, but now
autovacuum scheduling is influenced by the portion of the table that that
vacuum will never look at, which makes no sense. One can argue that that
frozen-portion is a proxy for the size of the indexes that would need to be
scanned - but that also doesn't make much sense to me, because the number &
size of indexes depends a lot on the workload and because we skip index
vacuums when not necessary.

I guess we could just use (relation_size - skipped_pages) *
autovacuum_vacuum_scale_factor as the threshold, but we don't have a cheap way
to compute the number of frozen pages right now (we do have relallvisible for
non-aggressive vacuums).
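
To make the arithmetic concrete, a minimal sketch of that adjusted trigger
could look like the following; relation_needs_vacuum() and its dead_tuples
argument are hypothetical, relallvisible stands in for "pages vacuum would
skip", and autovacuum_vac_thresh/autovacuum_vac_scale are the existing GUC
variables:

/*
 * Hedged sketch: scale the autovacuum threshold by the part of the
 * table a vacuum would actually scan, rather than the whole relation.
 */
static bool
relation_needs_vacuum(Form_pg_class classForm, double dead_tuples)
{
	int32		scannable = classForm->relpages - classForm->relallvisible;
	double		tuples_per_page = (classForm->reltuples > 0 &&
								   classForm->relpages > 0) ?
		classForm->reltuples / classForm->relpages : 0.0;
	double		threshold = autovacuum_vac_thresh +
		autovacuum_vac_scale * scannable * tuples_per_page;

	return dead_tuples > threshold;
}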


> Who says we'll get more than one opportunity per page with these tables,
> even with this behavior of scanning all-visible pages in non-aggressive
> VACUUMs?  Big append-only tables simply won't get the opportunity to catch
> up in the next non-aggressive VACUUM if there simply isn't one.

I agree that we need to freeze pages in append only tables ASAP. I don't think
they're that hard a case to detect though. The harder case is the - IME very
common - case of tables that are largely immutable but have a moving tail
that's hotly updated.

Greetings,

Andres Freund




Re: Annoying build warnings from latest Apple toolchain

2023-09-27 Thread Tom Lane
I wrote:
> Since updating to Xcode 15.0, my macOS machines have been
> spitting a bunch of linker-generated warnings. ...
> some program links complain

> ld: warning: ignoring duplicate libraries: '-lpgcommon', '-lpgport'

I found that this is being caused by the libpq_pgport hack in
Makefile.global.in, which ensures that libpgcommon and libpgport
get linked before libpq.  The comment freely admits that it results in
linking libpgcommon and libpgport twice.  Now, AFAICS that whole hack
is unnecessary on any platform where we know how to do symbol export
control, because then libpq won't expose any of the troublesome
symbols to begin with.  So we can resolve the problem by just not
doing that on macOS, as in the attached draft patch.  I've confirmed
that this suppresses the duplicate-libraries warnings on Xcode 15.0
without creating any issues on older macOS (though I'm only in a
position to test as far back as Catalina).

The patch is written to change things only on macOS, but I wonder
if we should be more aggressive and change it for all platforms
where we have symbol export control (which is almost everything
these days).  I doubt it'd make any noticeable difference in
build time, but doing that would give us more test coverage
which would help expose any weak spots.

I've not yet looked at the meson build infrastructure to
see if it needs a corresponding change.

regards, tom lane

diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index 18240b5fef..cded651755 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -589,19 +589,31 @@ endif
 libpq = -L$(libpq_builddir) -lpq
 
 # libpq_pgport is for use by client executables (not libraries) that use libpq.
-# We force clients to pull symbols from the non-shared libraries libpgport
+# We want clients to pull symbols from the non-shared libraries libpgport
 # and libpgcommon rather than pulling some libpgport symbols from libpq just
 # because libpq uses those functions too.  This makes applications less
-# dependent on changes in libpq's usage of pgport (on platforms where we
-# don't have symbol export control for libpq).  To do this we link to
+# dependent on changes in libpq's usage of pgport.  To do this we link to
 # pgport before libpq.  This does cause duplicate -lpgport's to appear
-# on client link lines, since that also appears in $(LIBS).
+# on client link lines, since that also appears in $(LIBS).  On platforms
+# where we have symbol export control for libpq, the whole exercise is
+# unnecessary because libpq won't expose any of these symbols.  Currently,
+# only macOS warns about duplicate library references, so we only suppress
+# the duplicates on macOS.
+ifeq ($(PORTNAME),darwin)
+libpq_pgport = $(libpq)
+else ifdef PGXS
+libpq_pgport = -L$(libdir) -lpgcommon -lpgport $(libpq)
+else
+libpq_pgport = -L$(top_builddir)/src/common -lpgcommon -L$(top_builddir)/src/port -lpgport $(libpq)
+endif
+
 # libpq_pgport_shlib is the same idea, but for use in client shared libraries.
+# We need those clients to use the shlib variants, even on macOS.  (Ideally,
+# users of this macro would strip libpgport and libpgcommon from $(LIBS),
+# but no harm is done if they don't.)
 ifdef PGXS
-libpq_pgport = -L$(libdir) -lpgcommon -lpgport $(libpq)
 libpq_pgport_shlib = -L$(libdir) -lpgcommon_shlib -lpgport_shlib $(libpq)
 else
-libpq_pgport = -L$(top_builddir)/src/common -lpgcommon -L$(top_builddir)/src/port -lpgport $(libpq)
 libpq_pgport_shlib = -L$(top_builddir)/src/common -lpgcommon_shlib -L$(top_builddir)/src/port -lpgport_shlib $(libpq)
 endif
 


Re: Eager page freeze criteria clarification

2023-09-27 Thread Peter Geoghegan
On Wed, Sep 27, 2023 at 1:03 PM Andres Freund  wrote:
> I suspect that medium term the better approach would be to be much more
> aggressive about setting all-visible, including as part of page-level
> visibility checks, and to deal with the concern of vacuum not processing such
> pages soon by not just looking at unmarked pages, but also a portion of the
> all-visible-but-not-frozen pages (the more all-visible-but-not-frozen pages
> there are, the more of them each vacuum should process). I think all-visible
> is too important for index only scans, for us to be able to remove it, or
> delay setting it until freezing makes sense.
>
> My confidence in my gut feeling isn't all that high here, though.

I think that this is a bad idea. I see two main problems with
"splitting the difference" like this:

1. If we randomly scan some all-visible pages in non-aggressive
VACUUMs, then we're sure to get FPIs in order to be able to freeze.

As a general rule, I think that we're better off gambling against
future FPIs, and then pulling back if we go too far. The fact that we
went one VACUUM operation without the workload unsetting an
all-visible page isn't that much of a signal about what might happen
to that page.

2. Large tables (i.e. the tables where it really matters) just don't
have that many VACUUM operations, relative to everything else. Who
says we'll get more than one opportunity per page with these tables,
even with this behavior of scanning all-visible pages in
non-aggressive VACUUMs?

Big append-only tables simply won't get the opportunity to catch up in
the next non-aggressive VACUUM if there simply isn't one.

-- 
Peter Geoghegan




Re: Streaming I/O, vectored I/O (WIP)

2023-09-27 Thread Andres Freund
Hi,

On 2023-09-27 21:33:15 +0300, Heikki Linnakangas wrote:
> I'm a bit disappointed and surprised by
> v1-0009-WIP-Use-streaming-reads-in-vacuum.patch though:
> 
>  4 files changed, 244 insertions(+), 78 deletions(-)
> 
> The current prefetching logic in vacuumlazy.c is pretty hairy, so I hoped
> that this would simplify it. I didn't look closely at that patch, so maybe
> it's simpler even though it's more code.

A good chunk of the changes is pretty boring stuff. A good chunk of the
remainder could be simplified a lot - it's partially there because vacuumlazy
changed a lot over the last couple years and because a bit more refactoring is
needed.  I do think it's actually simpler in some ways - besides being more
efficient...


> > v1-0002-Provide-vectored-variants-of-smgrread-and-smgrwri.patch
> 
> No smgrextendv()? I guess none of the patches here needed it.

I can't really imagine needing it anytime soon - due to the desire to avoid
ENOSPC for pages in the buffer pool the common pattern is to extend relations
with zeroes on disk, then populate those buffers in memory. It's possible that
you could use something like smgrextendv() when operating directly on the smgr
level - but then I suspect you're going to be better off to extend the
relation to the right size in one operation and then just use smgrwritev() to
write out the contents.


> > /*
> >  * Prepare to read a block.  The buffer is pinned.  If this is a 'hit', then
> >  * the returned buffer can be used immediately.  Otherwise, a physical read
> >  * should be completed with CompleteReadBuffers().  PrepareReadBuffer()
> >  * followed by CompleteReadBuffers() is equivalent to ReadBuffer(), but the
> >  * caller has the opportunity to coalesce reads of neighboring blocks into 
> > one
> >  * CompleteReadBuffers() call.
> >  *
> >  * *found is set to true for a hit, and false for a miss.
> >  *
> >  * *allocated is set to true for a miss that allocates a buffer for the 
> > first
> >  * time.  If there are multiple calls to PrepareReadBuffer() for the same
> >  * block before CompleteReadBuffers() or ReadBuffer_common() finishes the
> >  * read, then only the first such call will receive *allocated == true, 
> > which
> >  * the caller might use to issue just one prefetch hint.
> >  */
> > Buffer
> > PrepareReadBuffer(BufferManagerRelation bmr,
> >   ForkNumber forkNum,
> >   BlockNumber blockNum,
> >   BufferAccessStrategy strategy,
> >   bool *found,
> >   bool *allocated)
> > 
> 
> If you decide you don't want to perform the read, after all, is there a way
> to abort it without calling CompleteReadBuffers()?

When would that be needed?
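
(For reference, a usage sketch of the quoted two-step API; the thread only
shows PrepareReadBuffer(), so CompleteReadBuffers()'s argument list below,
MAX_COALESCE, and the surrounding variables are assumptions:)

	Buffer		bufs[MAX_COALESCE];
	int			nblocks = 0;

	/* Pin a run of neighboring blocks, stopping at the first cache hit. */
	for (BlockNumber blkno = start; blkno < start + want; blkno++)
	{
		bool		found,
					allocated;
		Buffer		buf = PrepareReadBuffer(bmr, MAIN_FORKNUM, blkno,
											strategy, &found, &allocated);

		if (found)
		{
			/* hit: usable immediately, stop coalescing here */
			ReleaseBuffer(buf);
			break;
		}
		bufs[nblocks++] = buf;
	}

	/* One call completes the physical read for all the misses at once. */
	if (nblocks > 0)
		CompleteReadBuffers(bmr, bufs, MAIN_FORKNUM, start, nblocks,
							false, strategy);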


> Looking at the later patch that introduces the streaming read API, seems
> that it finishes all the reads, so I suppose we don't need an abort
> function. Does it all get cleaned up correctly on error?

I think it should.  The buffer error handling is one of the areas where I
really would like to have some way of testing the various cases, it's easy to
get things wrong, and basically impossible to write reliable tests for with
our current infrastructure.


> > typedef bool (*PgStreamingReadBufferDetermineNextCB) (PgStreamingRead *pgsr,
> > uintptr_t pgsr_private,
> > void *per_io_private,
> > BufferManagerRelation *bmr,
> > ForkNumber *forkNum,
> > BlockNumber *blockNum,
> > ReadBufferMode *mode);
> 
> I was surprised that 'bmr', 'forkNum' and 'mode' are given separately on
> each read. I see that you used that in the WAL replay prefetching, so I
> guess that makes sense.

Yea, that's the origin - I don't like it, but I don't really have a better
idea.


> > extern void pg_streaming_read_prefetch(PgStreamingRead *pgsr);
> > extern Buffer pg_streaming_read_buffer_get_next(PgStreamingRead *pgsr, void 
> > **per_io_private);
> > extern void pg_streaming_read_reset(PgStreamingRead *pgsr);
> > extern void pg_streaming_read_free(PgStreamingRead *pgsr);
> 
> Do we need to expose pg_streaming_read_prefetch()? It's only used in the WAL
> replay prefetching patch, and only after calling pg_streaming_read_reset().
> Could pg_streaming_read_reset() call pg_streaming_read_prefetch() directly?
> Is there any need to "reset" the stream, without also starting prefetching?

Heh, I think this is a discussion Thomas and I were having before...

Greetings,

Andres Freund




Re: Eager page freeze criteria clarification

2023-09-27 Thread Robert Haas
On Wed, Sep 27, 2023 at 12:34 PM Andres Freund  wrote:
> What do you mean by "always freeze aggressively" - do you mean 'aggressive'
> autovacuums? Or opportunistic freezing being aggressive? I don't know why the
> former would be the case?

I meant the latter.

> > When it grows large enough, we suddenly stop freezing it aggressively, and
> > now it starts experiencing vacuums that do a whole bunch of work all at
> > once. A user who notices that is likely to be pretty confused about what
> > happened, and maybe not too happy when they find out.
>
> Hm - isn't my proposal exactly the other way round? I'm proposing that a page
> is frozen more aggressively if not already in shared buffers - which will
> become more common once the table has grown "large enough"?

OK, but it's counterintuitive either way, IMHO.

> I think there were three main ideas that we discussed:
>
> 1) We don't need to be accurate in the freezing decisions for individual
>pages, we "just" need to avoid the situation that over time we commonly
>freeze pages that will be updated again "soon".
> 2) It might be valuable to adjust the "should freeze page opportunistically"
>based on feedback.
> 3) We might need to classify the workload for a table and use different
>heuristics for different workloads.

I agree with all of that. Good summary.

> One way to deal with that would be to not track the average age in
> LSN-difference-bytes, but convert the value to some age metric at that
> time. If we e.g. were to convert the byte-age into an approximate age in
> checkpoints, with quadratic bucketing (e.g. 0 -> current checkpoint, 1 -> 1
> checkpoint, 2 -> 2 checkpoints ago, 3 -> 4 checkpoints ago, ...), using a mean
> of that age would probably be fine.

Yes. I think it's possible that we could even get by with just two
buckets. Say current checkpoint and not. Or current-or-previous
checkpoint and not. And just look at what percentage of accesses fall
into this first bucket -- it should be small or we're doing it wrong.
It seems like the only thing we actually need to avoid is freezing the
same pages over and over again in a tight loop.
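
To illustrate, a sketch of that bucketing; age_bucket() and NUM_AGE_BUCKETS
are hypothetical names, and with NUM_AGE_BUCKETS = 2 this degenerates into
exactly the "current checkpoint or not" test:

/*
 * Hedged sketch: map "checkpoints since the page was last frozen" onto
 * a few buckets whose width doubles with age (0, 1, 2, 4, 8, ...).
 */
static int
age_bucket(uint32 checkpoints_ago)
{
	int			bucket = 0;

	while (checkpoints_ago > 0 && bucket < NUM_AGE_BUCKETS - 1)
	{
		bucket++;
		checkpoints_ago >>= 1;
	}
	return bucket;
}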

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Eager page freeze criteria clarification

2023-09-27 Thread Robert Haas
On Wed, Sep 27, 2023 at 2:36 PM Melanie Plageman
 wrote:
> It seems like the ideal freeze pattern for an insert-only table would be
> to freeze as soon as the page is full before any checkpoints which could
> force you to emit an FPI.

Yes. So imagine we have two freeze criteria:

1. Do not ever opportunistically freeze pages.
2. Opportunistically freeze pages whenever we can completely freeze the page.

For a table where the same rows are frequently updated or where the
table is frequently truncated, we want to apply (1). For an
insert-only table, we want to apply (2). Maybe I'm all wet here, but
it's starting to seem to me that in almost all other cases, we also
want to apply (2). If that's true, then the only thing we need to do
here is identify the special cases where (1) is correct.

It'd be even better if we could adapt this behavior down to the page
level - imagine a large mostly cold table with a hot tail. In a
perfect world we apply (1) to the tail and (2) to the rest. But this
might be too much of a stretch to actually get right.

> One big sticking point for me (brought up elsewhere in this thread, but,
> AFAICT never resolved) is that it seems like the fact that we mark pages
> all-visible even when not freezing them means that no matter what
> heuristic we use, we won't have the opportunity to freeze the pages we
> most want to freeze.

The only solution to this problem that I can see is what Peter
proposed earlier: if we're not prepared to freeze the page, then don't
mark it all-visible either. This might be the right thing to do, and
if it is, we could even go further and get rid of the two as separate
concepts completely. However, I think it would be OK to leave all of
that to one side for the moment, *especially* if we adopt some
proposal that does a lot more opportunistic freezing than we do
currently. Because then the problem just wouldn't come up nearly as
much as it does now. One patch can't fix everything, and shouldn't
try.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Eager page freeze criteria clarification

2023-09-27 Thread Peter Geoghegan
On Wed, Sep 27, 2023 at 11:36 AM Melanie Plageman
 wrote:
> One big sticking point for me (brought up elsewhere in this thread, but,
> AFAICT never resolved) is that it seems like the fact that we mark pages
> all-visible even when not freezing them means that no matter what
> heuristic we use, we won't have the opportunity to freeze the pages we
> most want to freeze.

I think of it as being a bit like an absorbing state from a Markov
chain. It's a perfect recipe for weird path dependent behavior, which
seems to make your task just about impossible. It literally makes
vacuuming more frequently lead to less useful work being performed! It
does so *reliably*, in the simplest of cases.

Ideally, you'd be designing an algorithm for a system where we could
expect to get approximately the desired outcome whenever we have
approximately the right idea. A system where mistakes tend to "cancel
each other out" over time. As long as you have "set all-visible but
don't freeze" behavior as your starting point, then ISTM that your
algorithm almost has to make no mistakes at all to really improve on
what we have right now. As things stand, when your algorithm decides
to not freeze (for pages that are still set all-visible in the VM),
you really can't afford to be wrong. (I think that you get this
already, but it's a point worth emphasizing.)

-- 
Peter Geoghegan




Re: Eager page freeze criteria clarification

2023-09-27 Thread Andres Freund
Hi,

On 2023-09-27 15:45:06 -0400, Robert Haas wrote:
> > One big sticking point for me (brought up elsewhere in this thread, but,
> > AFAICT never resolved) is that it seems like the fact that we mark pages
> > all-visible even when not freezing them means that no matter what
> > heuristic we use, we won't have the opportunity to freeze the pages we
> > most want to freeze.
> 
> The only solution to this problem that I can see is what Peter
> proposed earlier: if we're not prepared to freeze the page, then don't
> mark it all-visible either. This might be the right thing to do, and
> if it is, we could even go further and get rid of the two as separate
> concepts completely.

I suspect that medium term the better approach would be to be much more
aggressive about setting all-visible, including as part of page-level
visibility checks, and to deal with the concern of vacuum not processing such
pages soon by not just looking at unmarked pages, but also a portion of the
all-visible-but-not-frozen pages (the more all-visible-but-not-frozen pages
there are, the more of them each vacuum should process). I think all-visible
is too important for index only scans, for us to be able to remove it, or
delay setting it until freezing makes sense.

My confidence in my gut feeling isn't all that high here, though.


> However, I think it would be OK to leave all of
> that to one side for the moment, *especially* if we adopt some
> proposal that does a lot more opportunistic freezing than we do
> currently. Because then the problem just wouldn't come up nearly as
> much as it does now. One patch can't fix everything, and shouldn't
> try.

Agreed.

Greetings,

Andres Freund




Re: Eager page freeze criteria clarification

2023-09-27 Thread Peter Geoghegan
On Wed, Sep 27, 2023 at 12:45 PM Robert Haas  wrote:
> > One big sticking point for me (brought up elsewhere in this thread, but,
> > AFAICT never resolved) is that it seems like the fact that we mark pages
> > all-visible even when not freezing them means that no matter what
> > heuristic we use, we won't have the opportunity to freeze the pages we
> > most want to freeze.
>
> The only solution to this problem that I can see is what Peter
> proposed earlier: if we're not prepared to freeze the page, then don't
> mark it all-visible either. This might be the right thing to do, and
> if it is, we could even go further and get rid of the two as separate
> concepts completely. However, I think it would be OK to leave all of
> that to one side for the moment, *especially* if we adopt some
> proposal that does a lot more opportunistic freezing than we do
> currently. Because then the problem just wouldn't come up nearly as
> much as it does now. One patch can't fix everything, and shouldn't
> try.

I guess that it might make sense to leave the issue of
all-visible-only pages out of it for the time being. I would just
point out that that means that you'll be limited to adding strictly
opportunistic behaviors, where it just so happens that VACUUM stumbles
upon pages where freezing early likely makes sense. Basically, an
incremental improvement on the FPI thing -- not the sort of thing
where I'd expect Melanie's current approach of optimizing a whole
cross-section of representative workloads to really help with.

I have a separate concern about it that I'll raise shortly, in my
response to Andres.

-- 
Peter Geoghegan




Re: Eager page freeze criteria clarification

2023-09-27 Thread Peter Geoghegan
On Wed, Sep 27, 2023 at 10:46 AM Andres Freund  wrote:
> I don't disagree that we should do something in that direction - I just don't
> see the raw number of unfrozen pages being useful in that regard. If you have
> a database where no pages live long, we don't need to freeze
> opportunistically, yet the fraction of unfrozen pages will be huge.

We don't know how to reliably predict the future. We can do our best
to ameliorate problems with such workloads, using a slew of different
strategies (including making the behaviors configurable, holding off
on freezing a second or a third time, etc). Such workloads are not
very common, and won't necessarily suffer too much. I think that
they're pretty much limited to queue-like tables.

Any course of action will likely have some downsides.
Melanie/you/whomever will need to make a trade-off, knowing that
somebody somewhere isn't going to be completely happy about it. I just
don't think that anybody will ever be able to come up with an
algorithm that can generalize well enough for things to not work out
that way. I really care about the problems in this area being
addressed more comprehensively, so I'm certainly not going to be the
one that just refuses to accept a trade-off (within reason, of
course).

> > > If we want to take global freeze debt into account, which I think is a 
> > > good
> > > idea, we'll need a smarter way to represent the debt than just the number 
> > > of
> > > unfrozen pages.  I think we would need to track the age of unfrozen pages 
> > > in
> > > some way. If there are a lot of unfrozen pages with a recent xid, then 
> > > it's
> > > fine, but if they are older and getting older, it's a problem and we need 
> > > to
> > > be more aggressive.
> >
> > Tables like pgbench_history will have lots of unfrozen pages with a
> > recent XID that get scanned during every VACUUM. We should be freezing
> > such pages at the earliest opportunity.
>
> I think we ought to be able to freeze tables with as simple a workload as
> pgbench_history has aggressively without taking a global freeze debt into
> account.

Agreed.

> We definitely *also* should take the number of unfrozen pages into account. I
> just don't determining freeze debt primarily using the number of unfrozen
> pages will be useful. The presence of unfrozen pages that are likely to be
> updated again soon is not a problem and makes the simple metric pretty much
> useless.

I never said "primarily".

> > To be clear, that doesn't mean that XID age shouldn't play an
> > important role in helping VACUUM to differentiate between pages that
> > should not be frozen and pages that should be frozen.
>
> I think we need to take it into account to determine a useful freeze debt on a
> table level (and potentially system wide too).

If we reach the stage where XID age starts to matter, we're no longer
talking about performance stability IMV. We're talking about avoiding
wraparound, which is mostly a different problem.

Recall that I started this whole line of discussion about debt in
reaction to a point of Robert's about lazy freezing. My original point
was that (as a general rule) we can afford to be lazy when we're not
risking too much -- when the eventual cost of catching up (having
gotten it wrong) isn't too painful (i.e. doesn't lead to a big balloon
payment down the road). Laziness is the thing that requires
justification. Putting off vital maintenance work for as long as
possible only makes sense under fairly limited circumstances.

A complicating factor here is the influence of the free space map. The
way that that works (the total lack of hysteresis to discourage using
space from old pages) is probably something that makes the problems
with freezing harder to solve. Maybe that should be put in scope.

> Assuming we could compute it cheaply enough, if we had an approximate median
> oldest-64bit-xid-on-page and the number of unfrozen pages, we could
> differentiate between tables that have lots of recent unfrozen pages (the
> median will be low) and pages with lots of unfrozen pages that are unlikely to
> be updated again (the median will be high and growing).  Something like the
> median 64bit xid would be interesting because it'd not get "invalidated" if
> relfrozenxid is increased.

I'm glad that you're mostly of the view that we should be freezing a
lot more aggressively overall, but I think that you're still too
focussed on avoiding small problems. I understand why novel new
problems are generally more of a concern than established old
problems, but there needs to be a sense of proportion. Performance
stability is incredibly important, and isn't zero cost.

-- 
Peter Geoghegan




Re: dikkop seems unhappy because of openssl stuff (FreeBSD 14-BETA1)

2023-09-27 Thread Tomas Vondra
On 9/27/23 15:38, Tom Lane wrote:
> Tomas Vondra  writes:
>> On 9/26/23 23:50, Tom Lane wrote:
>>> Huh.  I'm baffled as to what's up there.  Is it possible that this is
>>> actually a hardware-based difference?  I didn't think there was much
>>> difference between Pi 3B and Pi 4, but we're running out of other
>>> explanations.
> 
>> Hmm, yeah. Which FreeBSD image did you install? armv7 or aarch64?
> 
> https://download.freebsd.org/releases/arm64/aarch64/ISO-IMAGES/14.0/FreeBSD-14.0-BETA3-arm64-aarch64-RPI.img.xz
> 

Thanks, that's the image I've used. This is really strange ...


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Eager page freeze criteria clarification

2023-09-27 Thread Andres Freund
Hi,

On 2023-09-27 10:25:00 -0700, Peter Geoghegan wrote:
> On Wed, Sep 27, 2023 at 10:01 AM Andres Freund  wrote:
> > On 2023-09-26 09:07:13 -0700, Peter Geoghegan wrote:
> > I don't think doing this on a system wide basis with a metric like #unfrozen
> > pages is a good idea. It's quite common to have short lived data in some
> > tables while also having long-lived data in other tables. Making 
> > opportunistic
> > freezing more aggressive in that situation will just hurt, without a benefit
> > (potentially even slowing down the freezing of older data!). And even 
> > within a
> > single table, making freezing more aggressive because there's a decent sized
> > part of the table that is updated regularly and thus not frozen, doesn't 
> > make
> > sense.
>
> I never said that #unfrozen pages should be the sole criterion, for
> anything. Just that it would influence the overall strategy, making
> the system veer towards more aggressive freezing. It would complement
> a more sophisticated algorithm that decides whether or not to freeze a
> page based on its individual characteristics.
>
> For example, maybe the page-level algorithm would have a random
> component. That could potentially be where the global (or at least
> table level) view gets to influence things -- the random aspect is
> weighed using the global view of debt. That kind of thing seems like
> an interesting avenue of investigation.

I don't disagree that we should do something in that direction - I just don't
see the raw number of unfrozen pages being useful in that regard. If you have
a database where no pages live long, we don't need to freeze
opportunistically, yet the fraction of unfrozen pages will be huge.


> > If we want to take global freeze debt into account, which I think is a good
> > idea, we'll need a smarter way to represent the debt than just the number of
> > unfrozen pages.  I think we would need to track the age of unfrozen pages in
> > some way. If there are a lot of unfrozen pages with a recent xid, then it's
> > fine, but if they are older and getting older, it's a problem and we need to
> > be more aggressive.
>
> Tables like pgbench_history will have lots of unfrozen pages with a
> recent XID that get scanned during every VACUUM. We should be freezing
> such pages at the earliest opportunity.

I think we ought to be able to freeze tables with as simple a workload as
pgbench_history has aggressively without taking a global freeze debt into
account.


> > The problem I see is how to track the age of unfrozen data -
> > it'd be easy enough to track the mean(oldest-64bit-xid-on-page), but then we
> > again have the issue of rare outliers moving the mean too much...
>
> I think that XID age is mostly not very important compared to the
> absolute amount of unfrozen pages, and the cost profile of freezing
> now versus later. (XID age *is* important in emergencies, but that's
> mostly not what we're discussing right now.)

We definitely *also* should take the number of unfrozen pages into account. I
just don't think determining freeze debt primarily using the number of unfrozen
pages will be useful. The presence of unfrozen pages that are likely to be
updated again soon is not a problem and makes the simple metric pretty much
useless.


> To be clear, that doesn't mean that XID age shouldn't play an
> important role in helping VACUUM to differentiate between pages that
> should not be frozen and pages that should be frozen.

I think we need to take it into account to determine a useful freeze debt on a
table level (and potentially system wide too).

Assuming we could compute it cheaply enough, if we had an approximate median
oldest-64bit-xid-on-page and the number of unfrozen pages, we could
differentiate between tables that have lots of recent unfrozen pages (the
median will be low) and pages with lots of unfrozen pages that are unlikely to
be updated again (the median will be high and growing).  Something like the
median 64bit xid would be interesting because it'd not get "invalidated" if
relfrozenxid is increased.

Greetings,

Andres Freund




Re: pg_stat_get_activity(): integer overflow due to (int) * (int) for MemoryContextAllocHuge()

2023-09-27 Thread Andres Freund
Hi,

On 2023-09-27 15:42:15 +0100, Peter Eisentraut wrote:
> On 27.09.23 09:08, Michael Paquier wrote:
> > On Wed, Sep 27, 2023 at 08:41:55AM +0200, Jakub Wartak wrote:
> > > Attached patch adjusts pgstat_track_activity_query_size to be of
> > > size_t from int and fixes the issue.
> > 
> > This cannot be backpatched, and using size_t is not really needed as
> > track_activity_query_size is capped at 1MB.  Why don't you just tweak
> > the calculation done in pgstat_read_current_status() instead, say with
> > a cast?
> 
> I think it's preferable to use the right type to begin with, rather than
> fixing it up afterwards with casts.

I don't think going for size_t is a viable path for fixing this. I'm pretty
sure the initial patch would trigger a type mismatch from guc_tables.c - we
don't have infrastructure for size_t GUCs.

Nor do I think either of the patches here is a complete fix - there will still
be overflows on 32bit systems.

Perhaps we ought to error out (in BackendStatusShmemSize() or such) if
pgstat_track_activity_query_size * MaxBackends >= 4GB?
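
For illustration, the allocation at issue multiplies two ints before the
implicit widening to Size; routing it through mul_size(), which ereports on
overflow, avoids the 32-bit wraparound. A hedged sketch (the surrounding code
is abbreviated, and the memory context name is assumed):

	localactivity = (char *)
		MemoryContextAllocHuge(snap_context,	/* context name assumed */
							   mul_size(pgstat_track_activity_query_size,
										NumBackendStatSlots));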

Frankly, it seems like a quite bad idea to have such a high limit for
pgstat_track_activity_query_size. The overhead such a high value has will
surprise people...

Greetings,

Andres Freund




Re: Eager page freeze criteria clarification

2023-09-27 Thread Peter Geoghegan
On Wed, Sep 27, 2023 at 10:01 AM Andres Freund  wrote:
> On 2023-09-26 09:07:13 -0700, Peter Geoghegan wrote:
> I don't think doing this on a system wide basis with a metric like #unfrozen
> pages is a good idea. It's quite common to have short lived data in some
> tables while also having long-lived data in other tables. Making opportunistic
> freezing more aggressive in that situation will just hurt, without a benefit
> (potentially even slowing down the freezing of older data!). And even within a
> single table, making freezing more aggressive because there's a decent sized
> part of the table that is updated regularly and thus not frozen, doesn't make
> sense.

I never said that #unfrozen pages should be the sole criterion, for
anything. Just that it would influence the overall strategy, making
the system veer towards more aggressive freezing. It would complement
a more sophisticated algorithm that decides whether or not to freeze a
page based on its individual characteristics.

For example, maybe the page-level algorithm would have a random
component. That could potentially be where the global (or at least
table level) view gets to influence things -- the random aspect is
weighed using the global view of debt. That kind of thing seems like
an interesting avenue of investigation.

> If we want to take global freeze debt into account, which I think is a good
> idea, we'll need a smarter way to represent the debt than just the number of
> unfrozen pages.  I think we would need to track the age of unfrozen pages in
> some way. If there are a lot of unfrozen pages with a recent xid, then it's
> fine, but if they are older and getting older, it's a problem and we need to
> be more aggressive.

Tables like pgbench_history will have lots of unfrozen pages with a
recent XID that get scanned during every VACUUM. We should be freezing
such pages at the earliest opportunity.

> The problem I see is how to track the age of unfrozen data -
> it'd be easy enough to track the mean(oldest-64bit-xid-on-page), but then we
> again have the issue of rare outliers moving the mean too much...

I think that XID age is mostly not very important compared to the
absolute amount of unfrozen pages, and the cost profile of freezing
now versus later. (XID age *is* important in emergencies, but that's
mostly not what we're discussing right now.)

To be clear, that doesn't mean that XID age shouldn't play an
important role in helping VACUUM to differentiate between pages that
should not be frozen and pages that should be frozen. But even there
it should probably be treated as a cue. And, the relationship between
the XIDs on the page is probably more important than their absolute
age (or relationship to XIDs that appear elsewhere more generally).

For example, if a page is filled with heap tuples whose XIDs all match
(i.e. tuples that were all inserted by the same transaction), or XIDs
that are at least very close together, then that could make VACUUM
more enthusiastic about freezing now. OTOH if the XIDs are more
heterogeneous (but never very old), or if some xmin fields were
frozen, then VACUUM should show much less enthusiasm for freezing if
it's expensive.
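
A sketch of how such a homogeneity cue could be measured; the function name
is hypothetical, the page and transaction macros are the existing ones, and
the raw 32-bit distance ignores wraparound for the sake of illustration:

static TransactionId
page_xmin_spread(Page page)
{
	TransactionId minxid = InvalidTransactionId;
	TransactionId maxxid = InvalidTransactionId;
	OffsetNumber off,
				maxoff = PageGetMaxOffsetNumber(page);

	for (off = FirstOffsetNumber; off <= maxoff; off = OffsetNumberNext(off))
	{
		ItemId		itemid = PageGetItemId(page, off);
		HeapTupleHeader htup;
		TransactionId xmin;

		if (!ItemIdIsNormal(itemid))
			continue;
		htup = (HeapTupleHeader) PageGetItem(page, itemid);
		xmin = HeapTupleHeaderGetXmin(htup);
		if (!TransactionIdIsNormal(xmin))
			continue;			/* frozen or bootstrap xmin: skip */
		if (!TransactionIdIsValid(minxid) || TransactionIdPrecedes(xmin, minxid))
			minxid = xmin;
		if (!TransactionIdIsValid(maxxid) || TransactionIdFollows(xmin, maxxid))
			maxxid = xmin;
	}

	/* 0 means "all frozen or all identical": the most freeze-friendly case */
	return TransactionIdIsValid(minxid) ? maxxid - minxid : 0;
}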

-- 
Peter Geoghegan




Re: Add pg_basetype() function to obtain a DOMAIN base type

2023-09-27 Thread Alexander Korotkov
Hi, Steve!

On Tue, Sep 19, 2023 at 8:36 PM Steve Chavez  wrote:
>
> Just to give a data point for the need of this function:
>
> https://dba.stackexchange.com/questions/231879/how-to-get-the-basetype-of-a-domain-in-pg-type
>
> This is also a common use case for services/extensions that require postgres 
> metadata for their correct functioning, like postgREST or pg_graphql.
>
> Here's a query for getting domain base types, taken from the postgREST 
> codebase:
> https://github.com/PostgREST/postgrest/blob/531a183b44b36614224fda432335cdaa356b4a0a/src/PostgREST/SchemaCache.hs#L342-L364
>
> So having `pg_basetype` would be really helpful in those cases.
>
> Looking forward to hearing any feedback. Or if this would be a bad idea.

I think this is a good idea.  It's nice to have a simple (and fast)
built-in function to call instead of investing complex queries over
the system catalog.

The one thing triggering my perfectionism is that the patch does two
syscache lookups instead of one.  In order to fit into one syscache
lookup we could add "bool missing_ok" argument to
getBaseTypeAndTypmod().  However, getBaseTypeAndTypmod() is heavily
used in our codebase.  So, changing its signature would be invasive.
Could we invent getBaseTypeAndTypmodExtended() (ideas for a better
name?) that does all the job and supports "bool missing_ok" argument,
and have getBaseTypeAndTypmod() as a wrapper with the same signature?
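
A sketch of that arrangement, modeled on the existing domain-resolution loop
in lsyscache.c; the Extended name and the missing_ok handling are the
proposal here, not committed code:

Oid
getBaseTypeAndTypmodExtended(Oid typid, int32 *typmod, bool missing_ok)
{
	/* Loop to find the bottom base type in a stack of domains. */
	for (;;)
	{
		HeapTuple	tup;
		Form_pg_type typTup;

		tup = SearchSysCache1(TYPEOID, ObjectIdGetDatum(typid));
		if (!HeapTupleIsValid(tup))
		{
			if (missing_ok)
				return InvalidOid;	/* let the caller cope */
			elog(ERROR, "cache lookup failed for type %u", typid);
		}
		typTup = (Form_pg_type) GETSTRUCT(tup);
		if (typTup->typtype != TYPTYPE_DOMAIN)
		{
			/* Not a domain, so done. */
			ReleaseSysCache(tup);
			break;
		}

		typid = typTup->typbasetype;
		*typmod = typTup->typtypmod;

		ReleaseSysCache(tup);
	}

	return typid;
}

/* The old entry point keeps its signature as a thin wrapper. */
Oid
getBaseTypeAndTypmod(Oid typid, int32 *typmod)
{
	return getBaseTypeAndTypmodExtended(typid, typmod, false);
}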

--
Regards,
Alexander Korotkov




Re: [PGdocs] fix description for handling pf non-ASCII characters

2023-09-27 Thread Karl O. Pinc
On Wed, 27 Sep 2023 12:58:54 +
"Hayato Kuroda (Fujitsu)"  wrote:

> > Should the committer be interested, your patch applies cleanly
> > and the docs build as expected.  
> 
> Yeah, but cfbot accepted the previous version. Did you have anything in
> mind?

No.  I'm letting the committer know everything I've checked
so that they can decide what they want to check.

> Hmm, what you said looked right. But as Peter pointed out [1], the
> fix seems too much. So I attached three version of patches. How do
> you think? For me, type C is best.
> 
> A. A patch which completely follows your comments. The name is
> "v3-0001-...patch". Cfbot tests it.
> B. A patch which completely follows Peter's comments [1]. The name is
> "Peter_v3-txt". 
> C. A patch which follows both comments. Based on
> b, but some comments (Don't use the future tense, "Other
> characters"->"The bytes of other characters"...) were picked. The
> name is "Both_v3-txt".

I also like C.  Fewer words is better.  So long
as nothing is left unsaid fewer words make for clarity.

However, in the last hunk, "of other than" does not read well.
Instead of writing
"and the bytes of other than printable ASCII characters"
you want "and the bytes that are not printable ASCII characters".
That would be my suggestion.  

Regards,

Karl 
Free Software:  "You don't pay back, you pay forward."
 -- Robert A. Heinlein




Re: Set enable_seqscan doesn't take effect?

2023-09-27 Thread Andres Freund
Hi,

On 2023-09-28 00:37:41 +0800, jacktby jacktby wrote:
> postgres=# SET enable_seqscan = off;
> SET
> postgres=# explain select * from t;
>QUERY PLAN
> ------------------------------------------------------------------
>  Seq Scan on t  (cost=10000000000.00..10000000023.60 rows=1360 width=32)
> (1 row)
> 
> postgres=#  select * from t;
>a   
> ---
>  [1,2]
> (1 row)

Sorry to be the grump here:

You start several threads a week, often clearly not having done much, if any,
prior research. Often even sending the same question to multiple lists. It
should not be hard to find an explanation for the behaviour you see here.

pgsql-hackers isn't a "do my work for me service". We're hacking on
postgres. It's fine to occasionally ask for direction, but you're very clearly
exceeding that.

Greetings,

Andres Freund




Re: Eager page freeze criteria clarification

2023-09-27 Thread Andres Freund
Hi,

On 2023-09-26 09:07:13 -0700, Peter Geoghegan wrote:
> On Tue, Sep 26, 2023 at 8:19 AM Andres Freund  wrote:
> > However, I'm not at all convinced doing this on a system wide level is a 
> > good
> > idea. Databases do often contain multiple types of workloads at the same
> > time. E.g., we want to freeze aggressively in a database that has the bulk 
> > of
> > its size in archival partitions but has lots of unfrozen data in an active
> > partition. And databases have often loads of data that's going to change
> > frequently / isn't long lived, and we don't want to super aggressively 
> > freeze
> > that, just because it's a large portion of the data.
> 
> I didn't say that we should always have most of the data in the
> database frozen, though. Just that we can reasonably be more lazy
> about freezing the remainder of pages if we observe that most pages
> are already frozen. How they got that way is another discussion.
> 
> I also think that the absolute amount of debt (measured in physical
> units such as unfrozen pages) should be kept under control. But that
> isn't something that can ever be expected to work on the basis of a
> simple threshold -- if only because autovacuum scheduling just doesn't
> work that way, and can't really be adapted to work that way.

I don't think doing this on a system wide basis with a metric like #unfrozen
pages is a good idea. It's quite common to have short lived data in some
tables while also having long-lived data in other tables. Making opportunistic
freezing more aggressive in that situation will just hurt, without a benefit
(potentially even slowing down the freezing of older data!). And even within a
single table, making freezing more aggressive because there's a decent sized
part of the table that is updated regularly and thus not frozen, doesn't make
sense.

If we want to take global freeze debt into account, which I think is a good
idea, we'll need a smarter way to represent the debt than just the number of
unfrozen pages.  I think we would need to track the age of unfrozen pages in
some way. If there are a lot of unfrozen pages with a recent xid, then it's
fine, but if they are older and getting older, it's a problem and we need to
be more aggressive.  The problem I see is how to track the age of unfrozen data -
it'd be easy enough to track the mean(oldest-64bit-xid-on-page), but then we
again have the issue of rare outliers moving the mean too much...

Greetings,

Andres Freund




Re: Unlinking Parallel Hash Join inner batch files sooner

2023-09-27 Thread Heikki Linnakangas

On 11/05/2023 00:00, Jehan-Guillaume de Rorthais wrote:

On Wed, 10 May 2023 15:11:20 +1200
Thomas Munro  wrote:

The reason I didn't do this earlier is that sharedtuplestore.c
continues the pre-existing tradition where each parallel process
counts what it writes against its own temp_file_limit.  At the time I
thought I'd need to have one process unlink all the files, but if a
process were to unlink files that it didn't create, that accounting
system would break.  Without some new kind of shared temp_file_limit
mechanism that doesn't currently exist, per-process counters could go
negative, creating free money.  In the attached patch, I realised
something that I'd missed before: there is a safe point for each
backend to unlink just the files that it created, and there is no way
for a process that created files not to reach that point.


Indeed.

For what it worth, from my new and non-experienced understanding of the
parallel mechanism, waiting for all workers to reach
WAIT_EVENT_HASH_GROW_BATCHES_REPARTITION, after re-dispatching old batches in
new ones, seems like a safe place to instruct each workers to clean their old
temp files.


Looks good to me too at a quick glance. There's this one "XXX free" 
comment though:



for (int i = 1; i < old_nbatch; ++i)
{
ParallelHashJoinBatch *shared =
NthParallelHashJoinBatch(old_batches, i);
SharedTuplestoreAccessor *accessor;

accessor = sts_attach(ParallelHashJoinBatchInner(shared),
  ParallelWorkerNumber 
+ 1,
  &pstate->fileset);
sts_dispose(accessor);
/* XXX free */
}


I think that's referring to the fact that sts_dispose() doesn't free the 
'accessor', or any of the buffers etc. that it contains. That's a 
pre-existing problem, though: ExecParallelHashRepartitionRest() already 
leaks the SharedTuplestoreAccessor structs and their buffers etc. of the 
old batches. I'm a little surprised there isn't already an sts_free() 
function.


Another thought is that it's a bit silly to have to call sts_attach() 
just to delete the files. Maybe sts_dispose() should take the same three 
arguments that sts_attach() does, instead.


So that freeing would be nice to tidy up, although the amount of memory 
leaked is tiny so might not be worth it, and it's a pre-existing issue. 
I'm marking this as Ready for Committer.


--
Heikki Linnakangas
Neon (https://neon.tech)





Re: Index range search optimization

2023-09-27 Thread Alexander Korotkov
On Mon, Sep 25, 2023 at 1:18 PM Alexander Korotkov 
wrote:

> On Mon, Sep 25, 2023 at 12:58 PM Pavel Borisov 
> wrote:
> > One more doubt about naming. Calling function
> > _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
> > ScanDirection dir, bool *continuescan, bool requiredMatchedByPrecheck)
> > as
> > (void) _bt_checkkeys(scan, itup, indnatts, dir,
> > &requiredMatchedByPrecheck, false);
> > looks little bit misleading because of coincidence of names of 5 and 6
> > arguments.
>
> I've added the comment clarifying this argument usage.
>

Fixed the typo inficating => indicating, as pointed out by Pavel.
Peter, what do you think about the current shape of the patch?

--
Regards,
Alexander Korotkov


0001-Skip-checking-of-scan-keys-required-for-direction-v7.patch
Description: Binary data


Set enable_seqscan doesn't take effect?

2023-09-27 Thread jacktby jacktby
postgres=# SET enable_seqscan = off;
SET
postgres=# explain select * from t;
   QUERY PLAN
------------------------------------------------------------------
 Seq Scan on t  (cost=10000000000.00..10000000023.60 rows=1360 width=32)
(1 row)

postgres=#  select * from t;
   a   
---
 [1,2]
(1 row)




Re: Eager page freeze criteria clarification

2023-09-27 Thread Andres Freund
Hi,

(I wrote the first part of the email before Robert and I chatted on a call, I
left it in the email for posterity)

On 2023-09-26 13:49:32 -0400, Robert Haas wrote:
> On Tue, Sep 26, 2023 at 11:11 AM Andres Freund  wrote:
> > As long as the most extreme cases are prevented, unnecessarily freezing is 
> > imo
> > far less harmful than freezing too little.
> >
> > I'm worried that using something as long as 100-200%
> > time-between-recent-checkpoints won't handle insert-mostly workload well,
> > which IME are also workloads suffering quite badly under our current scheme 
> > -
> > and which are quite common.
>
> I wrote about this problem in my other reply and I'm curious as to
> your thoughts about it. Basically, suppose we forget about all of
> Melanie's tests except for three cases: (1) an insert-only table, (2)
> an update-heavy workload with uniform distribution, and (3) an
> update-heavy workload with skew. In case (1), freezing is good. In
> case (2), freezing is bad. In case (3), freezing is good for cooler
> pages and bad for hotter ones. I postulate that any
> recency-of-modification threshold that handles (1) well will handle
> (2) poorly, and that the only way to get both right is to take some
> other factor into account. You seem to be arguing that we can just
> freeze aggressively in case (2) and it won't cost much, but it doesn't
> sound to me like Melanie believes that and I don't think I do either.

I don't believe we can freeze aggressively in all cases of 2) without causing
problems. A small-ish table that's vacuumed constantly, where all rows are
constantly frozen and then updated, will suffer a lot from the WAL
overhead. Whereas superfluously freezing a row in a table with many millions
of rows, where each row is only occasionally updated, due to the update rate
being much smaller than the number of rows, will have acceptable overhead.

What I *do* believe is that for all but the most extreme cases, it's safer to
freeze too much than to freeze too little. There definitely are negative
consequences, but they're more bounded and less surprising than not freezing
for ages and then suddenly freezing everything at once.

Whether 2) really exists in the real world for huge tables, is of course
somewhat debatable...



> > > This doesn't seem completely stupid, but I fear it would behave
> > > dramatically differently on a workload a little smaller than s_b vs.
> > > one a little larger than s_b, and that doesn't seem good.
> >
> > Hm. I'm not sure that that's a real problem. In the case of a workload 
> > bigger
> > than s_b, having to actually read the page again increases the cost of
> > freezing later, even if the workload is just a bit bigger than s_b.
>
> That is true, but I don't think it means that there is no problem. It
> could lead to a situation where, for a while, a table never needs any
> significant freezing, because we always freeze aggressively.

What do you mean by "always freeze aggressively" - do you mean 'aggressive'
autovacuums? Or opportunistic freezing being aggressive? I don't know why the
former would be the case?


> When it grows large enough, we suddenly stop freezing it aggressively, and
> now it starts experiencing vacuums that do a whole bunch of work all at
> once. A user who notices that is likely to be pretty confused about what
> happened, and maybe not too happy when they find out.

Hm - isn't my proposal exactly the other way round? I'm proposing that a page
is frozen more aggressively if not already in shared buffers - which will
become more common once the table has grown "large enough"?

(the remainder was written after that call)


I think there were three main ideas that we discussed:

1) We don't need to be accurate in the freezing decisions for individual
   pages, we "just" need to avoid the situation that over time we commonly
   freeze pages that will be updated again "soon".
2) It might be valuable to adjust the "should freeze page opportunistically"
   based on feedback.
3) We might need to classify the workload for a table and use different
   heuristics for different workloads.


For 2), one of the mechanisms we discussed was to collect information about
the "age" of a page when we "unfreeze" it. If we frequently unfreeze pages
that were recently frozen, we need to be less aggressive in opportunistic
freezing going forward. If that never happens, we can be more aggressive.

The basic idea for classifying the age of a page when unfreezing is to use
"insert_lsn - page_lsn", pretty simple. We can convert that into time using
the averaged WAL generation rate.  What's a bit harder is figuring out how to
usefully aggregate the age across multiple "unfreezes".
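
A sketch of the per-page measurement just described (the averaged-WAL-rate
input and the function name are assumptions; GetXLogInsertRecPtr() and
PageGetLSN() are the existing primitives):

/*
 * Hedged sketch: LSN distance between "now" and the page's last
 * modification, converted to seconds via an averaged WAL rate.
 */
static double
unfreeze_age_seconds(Page page, double wal_bytes_per_sec)
{
	XLogRecPtr	insert_lsn = GetXLogInsertRecPtr();
	uint64		bytes_behind = insert_lsn - PageGetLSN(page);

	return wal_bytes_per_sec > 0 ? bytes_behind / wal_bytes_per_sec : 0.0;
}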

I was initially thinking of just using the mean, but Robert was rightly
concerned that the mean would be moved a lot when occasionally freezing
very old pages, potentially leading to opportunistically freezing young pages
too aggressively. The median would be a better choice, but ...

Re: Index AmInsert Parameter Confused?

2023-09-27 Thread jacktby jacktby


> On 27 Sep 2023, at 18:08, Matthias van de Meent wrote:
> 
> On Wed, 27 Sept 2023 at 05:03, jacktby jacktby  wrote:
>> 
>> 
>> 
>>> On 27 Sep 2023, at 00:45, Matthias van de Meent wrote:
>>> 
>>> On Tue, 26 Sept 2023 at 18:38, jacktby jacktby  wrote:
 
 typedef bool (*aminsert_function) (Relation indexRelation,
 Datum *values,
 bool *isnull,
 ItemPointer heap_tid,
 Relation heapRelation,
 IndexUniqueCheck checkUnique,
 bool indexUnchanged,
 struct IndexInfo *indexInfo);
 
 Why is there a heap_tid? We haven’t inserted the value, so where does it 
 come from?
>>> 
>>> Index insertion only happens after the TableAM tuple has been
>>> inserted. As indexes refer to locations in the heap, this TID contains
>>> the TID of the table tuple that contains the indexed values, so that
>>> the index knows which tuple to refer to.
>>> 
>>> Note that access/amapi.h describes only index AM APIs; it does not
>>> cover the table AM APIs descibed in access/tableam.h
>>> 
>>> Kind regards,
>>> 
>>> Matthias van de Meent
>> 1. Thanks, so if we insert a tuple into a table which has an index on it, 
>> pg will insert the tuple into the heap first, and then give the heap TID 
>> from the heap to the index AM API, right?
> 
> Correct. I think this is also detailed in various places of the
> documentation, yes.
> 
>> 2. I’m trying to implement a new index, but I just need the data held in 
>> the index table; I hope it’s not inserted into the heap, because all the 
>> data I want can be in the index table.
> 
> In PostgreSQL, a table maintains the source of truth for the data, and
> indexes are ephemeral data structures that improve the speed of
> querying the data in their table. As such, dropping an index should
> not impact the availability of the table's data.
> If the only copy of your (non-derived) data is in the index, then it
> is likely that some normal table operations will result in failures
> due to the tableAM/indexAM breaking built-in assumptions about access
> methods and data availability.
> 
> Kind regards,
> 
> Matthias van de Meent
> Neon (https://neon.tech)
So do I need to free the ItemPointer and Values in the function I implement 
myself?
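
(The call order confirmed earlier in this message looks roughly like this in
executor terms; a simplified, hedged sketch of the PG16 ExecInsert() flow,
with options and error handling omitted:)

	/* The table AM insert fills in the tuple's TID (slot->tts_tid)... */
	table_tuple_insert(resultRelationDesc, slot, estate->es_output_cid,
					   0, NULL);

	/* ...and only then does each index's aminsert callback see that TID. */
	recheckIndexes = ExecInsertIndexTuples(resultRelInfo, slot, estate,
										   false, false, NULL, NIL, false);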

Re: Is this a problem in GenericXLogFinish()?

2023-09-27 Thread Heikki Linnakangas

On 27/09/2023 18:47, Robert Haas wrote:

On Wed, Sep 27, 2023 at 11:03 AM Jeff Davis  wrote:

So it looks like it's intentionally registering a clean buffer so that
it can take a cleanup lock for reasons other than cleaning (or even
modifying) the page. I would think that there's a better way of
accomplishing that goal, so perhaps we can fix that?


I had forgotten some of the details of how this works until you
reminded me, but now that you've jarred my memory, I remember some of
it.

When Amit Kapila and I were working on the project to add WAL-logging
to hash indexes, we ran into some problems with concurrency control
for individual buckets within the hash index. Before that project,
this was handled using heavyweight locks, one per bucket. That got
changed in 6d46f4783efe457f74816a75173eb23ed8930020 for the reasons
explained in that commit message. Basically, instead of taking
heavyweight locks, we started taking cleanup locks on the primary
bucket pages. I always thought that was a little awkward, but I didn't
quite see how to do better. I don't think that I gave much thought at
the time to the consequence you've uncovered here, namely that it
means we're sometimes locking one page (the primary bucket page)
because we want to do something to another bucket page (some page in
the linked list of pages that are part of that bucket) and for that to
be safe, we need to lock out concurrent scans of that bucket.

I guess I don't know of any easy fix here. :-(


That seems OK. Maybe there would be a better way to do that, but there's 
nothing wrong with that approach per se.


We could define a new REGBUF_NO_CHANGE flag to signal that the buffer is 
registered just for locking purposes, and not actually modified by the 
WAL record.
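
A sketch of that flag; the 0x20 value and the BufferIsDirty() cross-check
helper are assumptions (the value just has to avoid the existing REGBUF_*
bits in xloginsert.h):

#define REGBUF_NO_CHANGE	0x20	/* registered only so redo can re-take the lock */

	/* hash.c / hashovfl.c call sites would then read something like: */
	XLogRegisterBuffer(0, bucket_buf, REGBUF_STANDARD | REGBUF_NO_CHANGE);

	/* ...letting a cross-check assert that every other registration is dirty */
	Assert((flags & REGBUF_NO_CHANGE) != 0 || BufferIsDirty(buffer));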


--
Heikki Linnakangas
Neon (https://neon.tech)





Re: Is this a problem in GenericXLogFinish()?

2023-09-27 Thread Robert Haas
On Wed, Sep 27, 2023 at 11:03 AM Jeff Davis  wrote:
> So it looks like it's intentionally registering a clean buffer so that
> it can take a cleanup lock for reasons other than cleaning (or even
> modifying) the page. I would think that there's a better way of
> accomplishing that goal, so perhaps we can fix that?

I had forgotten some of the details of how this works until you
reminded me, but now that you've jarred my memory, I remember some of
it.

When Amit Kapila and I were working on the project to add WAL-logging
to hash indexes, we ran into some problems with concurrency control
for individual buckets within the hash index. Before that project,
this was handled using heavyweight locks, one per bucket. That got
changed in 6d46f4783efe457f74816a75173eb23ed8930020 for the reasons
explained in that commit message. Basically, instead of taking
heavyweight locks, we started taking cleanup locks on the primary
bucket pages. I always thought that was a little awkward, but I didn't
quite see how to do better. I don't think that I gave much thought at
the time to the consequence you've uncovered here, namely that it
means we're sometimes locking one page (the primary bucket page)
because we want to do something to another bucket page (some page in
the linked list of pages that are part of that bucket) and for that to
be safe, we need to lock out concurrent scans of that bucket.

I guess I don't know of any easy fix here. :-(

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: SLRUs in the main buffer pool - Page Header definitions

2023-09-27 Thread Robert Haas
On Fri, Sep 8, 2023 at 8:56 AM Stephen Frost  wrote:
> If we're going to effectively segregate the buffer pool into SLRU parts
> vs. everything else and then use the existing strategies for SLRUs and
> have that be different from what everything else is using ... then
> that doesn't seem like it's really changing things.  What would be the
> point of moving the SLRUs into the main buffer pool then?

I think that integrating the SLRUs into the same memory allocation as
shared_buffers could potentially be pretty helpful even if the buffers
are managed differently. For instance, each SLRU could evaluate (using
some algorithm) whether it needs more or fewer buffers than it has
currently and either steal more buffers from shared_buffers or give
some back. That does seem less elegant than having everything running
through a single system, but getting to the point where everything
runs through a single system may be difficult, and such an
intermediate state could have a lot of advantages with, perhaps, less
risk of breaking stuff.

As far as I am aware, the optimal amount of memory for any particular
SLRU is almost always quite small compared to shared_buffers, but it
can be quite large compared to what we allocate currently. To do
anything about that, we clearly need to fix the algorithms to scale
better. But even once we've done that, I don't think we want to
allocate the largest amount of clog buffers that anyone could ever
need in all instances, and the same for every other SLRU. That seems
wasteful. We could expose a separate tunable for each one, but that is
a lot for users to tune correctly, and it implies using the same value
for the whole lifetime of the server. Letting the value float around
dynamically would make a lot of sense especially for things like
pg_multixact -- if there's a lot of multixacts, grab some more
buffers, if there aren't, release some or even all of them. The same
kind of thing can happen with other SLRUs -- e.g. as the oldest
running xact gets older, you need more subxact buffers; when
long-running transactions end, you need fewer.
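
To make that concrete, here is a rough sketch of the kind of feedback loop
this could be -- every name in it (the counters, the target, the
steal/return helpers) is invented for illustration, not an existing API:

static void
slru_adjust_buffers(SlruCtl ctl)
{
    /* fraction of recent accesses that missed this SLRU's buffers */
    double  miss_rate = ctl->recent_misses / (double) Max(ctl->recent_accesses, 1);

    if (miss_rate > SLRU_TARGET_MISS_RATE)
        slru_steal_from_shared_buffers(ctl, ctl->nbuffers);     /* grow */
    else if (miss_rate < SLRU_TARGET_MISS_RATE / 4)
        slru_return_to_shared_buffers(ctl, ctl->nbuffers / 4);  /* shrink */
}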

Again, I'm not sure what the right thing to do here actually is, just
that I wouldn't be too quick to reject a partial integration into
shared_buffers.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Is this a problem in GenericXLogFinish()?

2023-09-27 Thread Jeff Davis
On Wed, 2023-09-27 at 10:36 -0400, Robert Haas wrote:
> On Tue, Sep 26, 2023 at 9:36 PM Jeff Davis  wrote:
> > That site is pretty trivial to fix, but there are also a couple
> > places
> > in hash.c and hashovfl.c that are registering a clean page and it's
> > not
> > clear to me exactly what's going on.
> 
> Huh, I wonder if that's just a bug. Do you know where exactly?

hashovfl.c:665 and hash.c:831

In both cases it's registering the bucket_buf, and has a comment like:

 /*
  * bucket buffer needs to be registered to ensure that we can
  * acquire a cleanup lock on it during replay.
  */

And various redo routines have comments like:

 /*
  * Ensure we have a cleanup lock on primary bucket page before we
start
  * with the actual replay operation.  This is to ensure that neither a
  * scan can start nor a scan can be already-in-progress during the
replay
  * of this operation.  If we allow scans during this operation, then
they
  * can miss some records or show the same record multiple times.
  */

So it looks like it's intentionally registering a clean buffer so that
it can take a cleanup lock for reasons other than cleaning (or even
modifying) the page. I would think that there's a better way of
accomplishing that goal, so perhaps we can fix that?
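
(For context, the replay side acquires that lock through
XLogReadBufferForRedoExtended() with get_cleanup_lock = true; a simplified
sketch of the pattern used in the hash redo routines:)

/*
 * Take a cleanup lock on the primary bucket page before replaying, so
 * that no scan can start or be in progress concurrently.
 */
if (XLogReadBufferForRedoExtended(record, 0, RBM_NORMAL,
                                  true,    /* get_cleanup_lock */
                                  &bucketbuf) == BLK_NEEDS_REDO)
{
    /* ... apply the changes ... */
}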

Regards,
Jeff Davis





Re: pg_stat_get_activity(): integer overflow due to (int) * (int) for MemoryContextAllocHuge()

2023-09-27 Thread Peter Eisentraut

On 27.09.23 09:08, Michael Paquier wrote:

On Wed, Sep 27, 2023 at 08:41:55AM +0200, Jakub Wartak wrote:

Attached patch adjusts pgstat_track_activity_query_size to be of
size_t from int and fixes the issue.


This cannot be backpatched, and using size_t is not really needed as
track_activity_query_size is capped at 1MB.  Why don't you just tweak
the calculation done in pgstat_read_current_status() instead, say with
a cast?


I think it's preferable to use the right type to begin with, rather than 
fixing it up afterwards with casts.
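
For reference, a minimal sketch of the overflow in question (variable names
are illustrative): with both operands of type int, the multiplication is
performed in 32-bit int arithmetic and can wrap before the result is
widened to Size for MemoryContextAllocHuge().

int     nslots = NumBackendStatSlots;
int     qsize  = pgstat_track_activity_query_size;

ptr = MemoryContextAllocHuge(ctx, nslots * qsize);          /* can overflow */
ptr = MemoryContextAllocHuge(ctx, (Size) nslots * qsize);   /* the cast fix */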






Re: Is this a problem in GenericXLogFinish()?

2023-09-27 Thread Robert Haas
On Tue, Sep 26, 2023 at 9:36 PM Jeff Davis  wrote:
> That site is pretty trivial to fix, but there are also a couple places
> in hash.c and hashovfl.c that are registering a clean page and it's not
> clear to me exactly what's going on.

Huh, I wonder if that's just a bug. Do you know where exactly?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: remaining sql/json patches

2023-09-27 Thread Erik Rijkers

On 9/27/23 15:55, Amit Langote wrote:

On Thu, Sep 21, 2023 at 9:41 PM Amit Langote  wrote:


I don't know, maybe it's worthwhile to fix this (admittedly trivial)
failure in the tests? It's been there for a while.


Thanks,

Erik

diff -U3 
/home/aardvark/pg_stuff/pg_sandbox/pgsql.json_table/src/test/regress/expected/jsonb_sqljson.out
 
/home/aardvark/pg_stuff/pg_sandbox/pgsql.json_table/src/test/regress/results/jsonb_sqljson.out
--- 
/home/aardvark/pg_stuff/pg_sandbox/pgsql.json_table/src/test/regress/expected/jsonb_sqljson.out
 2023-09-27 16:10:52.361367183 +0200
+++ 
/home/aardvark/pg_stuff/pg_sandbox/pgsql.json_table/src/test/regress/results/jsonb_sqljson.out
  2023-09-27 16:18:50.929382498 +0200
@@ -994,14 +994,14 @@
 FROM information_schema.check_constraints
 WHERE constraint_name LIKE 'test_jsonb_constraint%'
 ORDER BY 1;
-   check_clause
   
---
- ((JSON_EXISTS((js)::jsonb, 'strict $."a"' RETURNING integer TRUE ON ERROR) < 
2))
- ((JSON_QUERY((js)::jsonb, '$."a"' RETURNING character(5) OMIT QUOTES EMPTY 
ARRAY ON EMPTY) > ('a'::bpchar COLLATE "C")))
- ((JSON_QUERY((js)::jsonb, '$."a"' RETURNING jsonb WITH CONDITIONAL WRAPPER 
EMPTY OBJECT ON ERROR) < '[10]'::jsonb))
- ((JSON_VALUE((js)::jsonb, '$."a"' RETURNING integer DEFAULT (('12'::text || 
i))::integer ON EMPTY ERROR ON ERROR) > i))
- ((js IS JSON))
- (JSON_EXISTS((js)::jsonb, '$."a"' PASSING (i + 5) AS int, (i)::text AS txt, 
ARRAY[1, 2, 3] AS arr))
+  check_clause 
 
+
+ (JSON_EXISTS((js)::jsonb, 'strict $."a"' RETURNING integer TRUE ON ERROR) < 2)
+ (JSON_QUERY((js)::jsonb, '$."a"' RETURNING character(5) OMIT QUOTES EMPTY 
ARRAY ON EMPTY) > ('a'::bpchar COLLATE "C"))
+ (JSON_QUERY((js)::jsonb, '$."a"' RETURNING jsonb WITH CONDITIONAL WRAPPER 
EMPTY OBJECT ON ERROR) < '[10]'::jsonb)
+ (JSON_VALUE((js)::jsonb, '$."a"' RETURNING integer DEFAULT (('12'::text || 
i))::integer ON EMPTY ERROR ON ERROR) > i)
+ (js IS JSON)
+ JSON_EXISTS((js)::jsonb, '$."a"' PASSING (i + 5) AS int, (i)::text AS txt, 
ARRAY[1, 2, 3] AS arr)
 (6 rows)
 
 SELECT pg_get_expr(adbin, adrelid)


Re: Avoid a possible out-of-bounds access (src/backend/optimizer/util/relnode.c)

2023-09-27 Thread Abhijit Menon-Sen
At 2023-09-27 10:37:45 -0300, ranier...@gmail.com wrote:
>
> Forgive my impulsiveness; anyone who loves perfect, well-written code
> would understand.

I actually find this characterisation offensive.

Being scrupulous about not grouping random drive-by changes together
with the primary change is right up there in importance with writing
good commit messages to help future readers to reduce their WTFs/minute
score one, five, seven, or ten years later.

Ignoring that concern once is thoughtless. Ignoring it over and over
again is disrespectful. Casually deriding it by equating it to hating
"perfect, well-written code" is gross.

-- Abhijit




Re: pg_upgrade and logical replication

2023-09-27 Thread Amit Kapila
On Wed, Sep 27, 2023 at 3:37 PM vignesh C  wrote:
>
> On Tue, 26 Sept 2023 at 10:58, vignesh C  wrote:
> >
> > On Wed, 20 Sept 2023 at 16:54, Amit Kapila  wrote:
> > >
> > > On Fri, Sep 15, 2023 at 3:08 PM vignesh C  wrote:
> > > >
> > > > The attached v8 version patch has the changes for the same.
> > > >
> > >
> > > Is the check to ensure remote_lsn is valid correct in function
> > > check_for_subscription_state()? How about the case where the apply
> > > worker didn't receive any change but just marked the relation as
> > > 'ready'?
> >
> > I agree that remote_lsn will not be valid in the case when all the
> > tables are in ready state and there are no changes to be sent by the
> > walsender to the worker. I was not sure if this check is required in
> > this case in the check_for_subscription_state function. I was thinking
> > that this check could be removed.
> > I'm also checking why the tables should only be in ready state, the
> > check that is there in the same function, can we support upgrades when
> > the tables are in syncdone state or not. I will post my analysis once
> > I have finished checking on the same.
>
> Once the table is in SUBREL_STATE_SYNCDONE state, the apply worker
> will check if the apply worker has some LSN records that need to be
> applied to reach the LSN of the table. Once the required WAL is
> applied, the table state will be changed from SUBREL_STATE_SYNCDONE to
> SUBREL_STATE_READY state. Since there is a chance that in this case
> the apply worker has to apply some transactions to get all the tables
> in READY state, I felt the minimum requirement should be that at least
> all the tables should be in READY state for the upgrade of the
> subscriber.
>

I don't think this theory is completely correct because the pending
WAL can be applied even after an upgrade.

-- 
With Regards,
Amit Kapila.




Re: dikkop seems unhappy because of openssl stuff (FreeBSD 14-BETA1)

2023-09-27 Thread Tom Lane
Tomas Vondra  writes:
> On 9/26/23 23:50, Tom Lane wrote:
>> Huh.  I'm baffled as to what's up there.  Is it possible that this is
>> actually a hardware-based difference?  I didn't think there was much
>> difference between Pi 3B and Pi 4, but we're running out of other
>> explanations.

> Hmm, yeah. Which FreeBSD image did you install? armv7 or aarch64?

https://download.freebsd.org/releases/arm64/aarch64/ISO-IMAGES/14.0/FreeBSD-14.0-BETA3-arm64-aarch64-RPI.img.xz

regards, tom lane




Re: Avoid a possible out-of-bounds access (src/backend/optimizer/util/relnode.c)

2023-09-27 Thread Ranier Vilela
Em qua., 27 de set. de 2023 às 04:35, David Rowley 
escreveu:

> On Wed, 27 Sept 2023 at 06:10, Ranier Vilela  wrote:
> > As suggested, casting is the best option that does not add any overhead
> and improves the robustness of the find_base_rel function.
> > I propose patch v1, with the addition of a fix for the
> find_base_rel_ignore_join function,
> > which despite not appearing in Coverity reports, suffers from the same
> problem.
>
> Can you confirm that this silences the Coverity warning?
>
CID#1518088
This is a historical version of the file displaying the issue before it was
in the Fixed state.


> I think it probably warrants a comment to mention why we cast to uint32.
>
> e.g. /* perform an unsigned comparison so that we also catch negative
> relid values */
>
I'm ok.
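
For reference, a sketch of the resulting function with that comment and
cast in place (simplified from relnode.c):

RelOptInfo *
find_base_rel(PlannerInfo *root, int relid)
{
    /* use an unsigned comparison so negative relids are caught too */
    if ((uint32) relid < (uint32) root->simple_rel_array_size)
    {
        RelOptInfo *rel = root->simple_rel_array[relid];

        if (rel)
            return rel;
    }

    elog(ERROR, "no relation entry for relid %d", relid);
    return NULL;                /* keep compiler quiet */
}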


>
> > Taking advantage, I also propose a scope reduction,
> >  as well as the const of the root parameter, which is very appropriate.
>
> Can you explain why adding the const qualifier is "very appropriate"
> to catching negative relids?
>
Of course that has nothing to do with it.


> Please check [1] for the mention of:
>
> "The fastest way to get your patch rejected is to make unrelated
> changes. Reformatting lines that haven't changed, changing unrelated
> comments you felt were poorly worded, touching code not necessary to
> your change, etc. Each patch should have the minimum set of changes
> required to work robustly. If you do not follow the code formatting
> suggestions above, expect your patch to be returned to you with the
> feedback of "follow the code conventions", quite likely without any
> other review."
>
Forgive my impulsiveness; anyone who loves perfect, well-written code
would understand.

Do you have an objection to fixing the function find_base_rel_ignore_join?
Or does that count as one of the unrelated changes?

Ranier Vilela


Re: Should consider materializing the cheapest inner path in consider_parallel_nestloop()

2023-09-27 Thread tender wang
Hi Tom,
   Do you have any comments or suggestions on this issue? Thanks.

Richard Guo  于2023年9月8日周五 14:06写道:

>
> On Fri, Sep 8, 2023 at 3:15 AM Robert Haas  wrote:
>
>> The example query provided here seems rather artificial. Surely few
>> people write a join clause that references neither of the tables being
>> joined. Is there a more realistic case where this makes a big
>> difference?
>
>
> Yes, the given example query is not that convincing.  I tried a query
> with plans as below (after some GUC settings) which might be more
> realistic in the real world.
>
> unpatched:
>
> explain select * from partsupp join lineitem on l_partkey > ps_partkey;
>   QUERY PLAN
>
> --
>  Gather  (cost=0.00..5522666.44 rows=16047 width=301)
>Workers Planned: 4
>->  Nested Loop  (cost=0.00..5522666.44 rows=40116667 width=301)
>  Join Filter: (lineitem.l_partkey > partsupp.ps_partkey)
>  ->  Parallel Seq Scan on lineitem  (cost=0.00..1518.44 rows=15044
> width=144)
>  ->  Seq Scan on partsupp  (cost=0.00..267.00 rows=8000 width=157)
> (6 rows)
>
> patched:
>
> explain select * from partsupp join lineitem on l_partkey > ps_partkey;
>   QUERY PLAN
>
> --
>  Gather  (cost=0.00..1807085.44 rows=16047 width=301)
>Workers Planned: 4
>->  Nested Loop  (cost=0.00..1807085.44 rows=40116667 width=301)
>  Join Filter: (lineitem.l_partkey > partsupp.ps_partkey)
>  ->  Parallel Seq Scan on lineitem  (cost=0.00..1518.44 rows=15044
> width=144)
>  ->  Materialize  (cost=0.00..307.00 rows=8000 width=157)
>->  Seq Scan on partsupp  (cost=0.00..267.00 rows=8000
> width=157)
> (7 rows)
>
> The execution time (ms) are (avg of 3 runs):
>
> unpatched:  71769.21
> patched:65510.04
>
> So we can see some (~9%) performance gains in this case.
>
> Thanks
> Richard
>


RE: [PGdocs] fix description for handling pf non-ASCII characters

2023-09-27 Thread Hayato Kuroda (Fujitsu)
Dear Tom,

> No, I'm pretty sure you're mistaken.  It's been a long time since
> high school English, but the way I think this works is that "that"
> introduces a restrictive clause, which narrows the scope of what
> you are saying.  That is, you say "that" when you want to talk
> about only the bytes of the string that aren't ASCII.  But "which"
> introduces a non-restrictive clause that adds information or
> commentary.  If you say "bytes of the string which are not ASCII",
> you are effectively making a side assertion that no byte of the
> string is ASCII.  Which is not the meaning you want here.
> 
> A smell test that works for native speakers (not sure how helpful
> it is for others) is: if the sentence would read well with commas
> or parens added before and after the clause, then it's probably
> non-restrictive and should use "which".  If it looks wrong that way
> then it's a restrictive clause and should use "that".

Thanks for giving your opinion. The suggestion is quite helpful for me as
a non-native speaker. I would be happy if you could check my patch [1].

[1]: 
https://www.postgresql.org/message-id/TYAPR01MB58663EB061888B2715A39217F5C2A%40TYAPR01MB5866.jpnprd01.prod.outlook.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED





RE: [PGdocs] fix description for handling pf non-ASCII characters

2023-09-27 Thread Hayato Kuroda (Fujitsu)
Dear Peter,

Thank you for reviewing!

> 
> TBH, I felt the new text descriptions deviated a bit too much from the
> originals. IMO only quite a small tweak was needed, so my suggested
> text in the comments below reflects that.

Good point, my patch may be too much.

> Commit message.
> 
> 1.
> missing description

Added.
If we should use only printable ascii as a commit message, I can use '\x03'
instead of 'あああ'.

> src/sgml/config.sgml
> 
> 2. application_name:
> 
> -Only printable ASCII characters may be used in the
> -application_name value. Other characters will
> be
> -replaced with question marks (?).
> +Characters that are not printable ASCII, like 
> \x03,
> +are replaced with the PostgreSQL
> +C-style escaped
> hexadecimal byte value.
> 
> BEFORE
> Other characters will be replaced with question marks (?).
> 
> SUGGESTION
> Other characters will be replaced with  linkend="sql-syntax-strings-escape">C-style escaped hexadecimal byte
> values.
> 
> ~~~
> 
> 3. cluster_name:
> 
> -build). Only printable ASCII characters may be used in the
> -cluster_name value. Other characters will be
> -replaced with question marks (?).  No name is
> shown
> -if this parameter is set to the empty string
> '' (which is
> -the default). This parameter can only be set at server start.
> +build).
> +Characters that are not printable ASCII, like 
> \x03,
> +are replaced with the PostgreSQL
> +C-style escaped
> hexadecimal byte value.
> +No name is shown if this parameter is set to the empty string
> +'' (which is the default). This parameter can only
> +be set at server start.
> 
> 
> 
> ==
> src/sgml/postgres-fdw.sgml
> 
> 4.
>   
>postgres_fdw.application_name can be any
> string
> -  of any length and contain even non-ASCII characters.  However when
> -  it's passed to and used as application_name
> +  of any length and contain even characters that are not printable ASCII.
> +  However when it's passed to and used as
> application_name
>in a foreign server, note that it will be truncated to less than
>NAMEDATALEN characters and anything other than
> -  printable ASCII characters will be replaced with question
> -  marks (?).
> +  printable ASCII characters are replaced with the
> PostgreSQL
> +  C-style escaped
> hexadecimal byte value.
>See  for details.
>   
> 
> ~
> 
> AFAICT the first change wasn't necessary.
> 
> ~
> 
> As for the 2nd change:
> 
> BEFORE
> ... and anything other than printable ASCII characters will be
> replaced with question marks (?).
> 
> SUGGESTION
> ... and anything other than printable ASCII characters will be
> replaced with C-style
> escaped hexadecimal byte values.

They seem good, but they conflict with Karl's comments.
I made three patches based on comments [1], could you check?

[1]: 
https://www.postgresql.org/message-id/TYAPR01MB58663EB061888B2715A39217F5C2A%40TYAPR01MB5866.jpnprd01.prod.outlook.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


RE: [PGdocs] fix description for handling pf non-ASCII characters

2023-09-27 Thread Hayato Kuroda (Fujitsu)
Dear Karl,

Thank you for reviewing!

> A related thing that's nice to have is to limit the line
> length of the documentation source to 80 characters or less.
> 79 is probably best.  Since the source text around your patch
> conforms to this convention you should also.

IIUC it is not a hard limit, but I followed it.

> Should the committer be interested, your patch applies cleanly
> and the docs build as expected.

Yeah, but cfbot accepted the previous version. Did you have anything in mind?

> Also, based on the comments in the
> patch which changed the system's behavior, I believe that
> your patch updates all the relevant places in the documentation.

Thanks. Actually, I think it should be backpatched to PG16 because the
commit was done last year. I will make the patch for it once the wording
is decided.

> 
> I now think that you should consider another change to your wording.
> Instead of starting with "Characters that are not printable ASCII ..."
> consider writing "The bytes of the string which are not printable ASCII
> ...".  Notice above that characters (e.g. あ) generate output for
> each non-ASCII byte (e.g. \xe3\x81\x82).  So my thought is that
> the docs should be talking about bytes.
> 
> For the last hunk you'd change around "anything".  Write:
> "... it will be truncated to less than NAMEDATALEN characters and
> the bytes of the string which are not printable ASCII characters ...".
> 

Hmm, what you said looks right. But as Peter pointed out [1], the fix may
be too much. So I attached three versions of the patch. What do you think?
For me, version C is best.

A. A patch which completely follows your comments. The name is 
"v3-0001-...patch".
   Cfbot tests it.
B. A patch which completely follows Peter's comments [1]. The name is 
"Peter_v3-txt".
C. A patch which follows both comments. Based on B, but some comments
   (don't use the future tense, "Other characters" -> "The bytes of other
   characters", ...) were picked up. The name is "Both_v3-txt".

[1]: 
https://www.postgresql.org/message-id/CAHut%2BPvEbKC8ABA_daX-XPNOTFzuAmHGhjPj%3DtPZYQskRHECOg%40mail.gmail.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


From c3a2c68fd74bce14af6df7c773317d496cf75209 Mon Sep 17 00:00:00 2001
From: Hayato Kuroda 
Date: Wed, 27 Sep 2023 07:31:07 +
Subject: [PATCH v32] Fix description for handling of non-printable ASCII
 characters
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

45b1a67a changed the behavior when characters that are not printable ASCII were
used for three configuration parameters (application_name, cluster_name, and
postgres_fdw.application_name), but it was not documented. This commit fixes
that.

PG15 and prior:

```
postgres=# SET application_name TO 'あああ';
SET
postgres=# SHOW application_name ;
 application_name
--
 ?
(1 row)
```

PG16 and later:

```
postgres=# SET application_name TO 'あああ';
SET
postgres=# SHOW application_name ;
   application_name
--
 \xe3\x81\x82\xe3\x81\x82\xe3\x81\x82
(1 row)
```
---
 doc/src/sgml/config.sgml   | 17 +++--
 doc/src/sgml/postgres-fdw.sgml |  7 ---
 2 files changed, 15 insertions(+), 9 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 38684af5b1..587bdf3d72 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -6893,8 +6893,10 @@ local0.*/var/log/postgresql
 and included in CSV log entries.  It can also be included in regular
 log entries via the  parameter.
 Only printable ASCII characters may be used in the
-application_name value. Other characters will be
-replaced with question marks (?).
+application_name value.
+The bytes of other characters are replaced with
+C-style escaped hexadecimal
+byte values.

   
  
@@ -7890,10 +7892,13 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH 
csv;
 The name can be any string of less
 than NAMEDATALEN characters (64 characters in a 
standard
 build). Only printable ASCII characters may be used in the
-cluster_name value. Other characters will be
-replaced with question marks (?).  No name is shown
-if this parameter is set to the empty string '' 
(which is
-the default). This parameter can only be set at server start.
+cluster_name value.
+The bytes of other characters are replaced with
+C-style escaped hexadecimal
+byte values.
+No name is shown if this parameter is set to the empty string
+'' (which is the default).
+This parameter can only be set at server start.

   
  
diff --git a/doc/src/sgml/postgres-fdw.sgml b/doc/src/sgml/postgres-fdw.sgml
index 5062d712e7..9894d61a98 100644
--- a/doc/src/sgml/postgres-fdw.sgml
+++ b/doc/src/sgml/postgres-fdw.sgml
@@ -1067,9 +1067,10 @@ pos

Re: should frontend tools use syncfs() ?

2023-09-27 Thread Peter Eisentraut

On 17.08.23 04:50, Michael Paquier wrote:

Yeah, this crossed my mind.  Do you know of any existing examples of
options with links to a common section?  One problem with this approach is
that there are small differences in the wording for some of the frontend
utilities, so it might be difficult to cleanly unite these sections.

The closest thing I can think of is Color Support in section
Appendixes, that describes something shared across a lot of binaries
(that would be 6 tools with this patch).


I think it's a bit much to add a whole appendix for that little content.

We have a collection of platform-specific notes in chapter 19, including 
file-system-related notes in section 19.2.  Maybe it could be put there?






Re: Testing autovacuum wraparound (including failsafe)

2023-09-27 Thread Masahiko Sawada
Sorry for the late reply.

On Sun, Sep 3, 2023 at 2:48 PM Noah Misch  wrote:
>
> On Wed, Jul 12, 2023 at 01:47:51PM +0200, Daniel Gustafsson wrote:
> > > On 12 Jul 2023, at 09:52, Masahiko Sawada  wrote:
> > > Agreed. The timeout can be set by manually setting
> > > PG_TEST_TIMEOUT_DEFAULT, but I bump it to 10 min by default. And it
> > > now require setting PG_TET_EXTRA to run it.
> >
> > +# bump the query timeout to avoid false negatives on slow test syetems.
> > typo: s/syetems/systems/
> >
> >
> > +# bump the query timeout to avoid false negatives on slow test syetems.
> > +$ENV{PG_TEST_TIMEOUT_DEFAULT} = 600;
> > Does this actually work?  Utils.pm read the environment variable at compile
> > time in the BEGIN block so this setting won't be seen?  A quick testprogram
> > seems to confirm this but I might be missing something.
>
> The correct way to get a longer timeout is "IPC::Run::timer(4 *
> $PostgreSQL::Test::Utils::timeout_default);".  Even if changing env worked,
> that would be removing the ability for even-slower systems to set timeouts
> greater than 10min.

Agreed.

I've attached new version patches. 0001 patch adds an option to
background_psql to specify the timeout seconds, and 0002 patch is the
main regression test patch.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


v5-0001-Add-option-to-specify-timeout-seconds-to-Backgrou.patch
Description: Binary data


v5-0002-Add-tests-for-XID-wraparound.patch
Description: Binary data


Re: Partial aggregates pushdown

2023-09-27 Thread Alexander Pyhalov

fujii.y...@df.mitsubishielectric.co.jp wrote on 2023-09-27 01:35:

Hi Mr.Momjian, Mr.Pyhalov.

Tuesday, 26 September 2023 22:15 Alexander Pyhalov 
:

Do you mean that extra->partial_target->sortgrouprefs is not replaced,
and so we preserve tlesortgroupref numbers?

Yes, that is correct.


I'm suspicious about rewriting extra->partial_target->exprs with
partial_target->exprs - I'm still not sure why we don't lose information
added by add_column_to_pathtarget() to extra->partial_target->exprs?



Hi.

In postgres_fdw.sql

"Partial aggregates are unsafe to push down having clause when there are 
partial aggregates" - this comment likely should be fixed.


Some comments should be added to setGroupClausePartial() and to 
make_partial_grouping_target() - especially why setGroupClausePartial()

is called prior to add_new_columns_to_pathtarget().

I'm not sure that I like this mechanics of adding sort group clauses - 
it seems we do in core additional work, which is of use only for

one extension, but at least it seems to be working.

--
Best regards,
Alexander Pyhalov,
Postgres Professional




Re: Synchronizing slots from primary to standby

2023-09-27 Thread shveta malik
On Mon, Sep 25, 2023 at 7:46 PM Hayato Kuroda (Fujitsu)
 wrote:
>
> Dear Ajin, Shveta,
>
> Thank you for rebasing the patch set! Here are new comments for v19_2-0001.
>

Thank you, Kuroda-san, for the feedback. Most of these are addressed in
v20. Please find my responses inline.

> 01. WalSndWaitForStandbyNeeded()
>
> ```
> if (SlotIsPhysical(MyReplicationSlot))
> return false;
> ```
>
> Is there a possibility that physical walsenders call this function?
> IIUC following is a stacktrace for the function, so the only logical 
> walsenders use it.
> If so, it should be Assert() instead of an if statement.
>
> logical_read_xlog_page()
> WalSndWaitForWal()
> WalSndWaitForStandbyNeeded()

It will only be called from logical-walsenders. Modified as you suggested.

>
> 02. WalSndWaitForStandbyNeeded()
>
> Can we set shouldwait in SlotSyncInitConfig()? synchronize_slot_names_list is
> searched whenever the function is called, but it is not changed automatically.
> If the slotname is compared with the list in the SlotSyncInitConfig(), the
> liner search can be reduced.

standby_slot_names and synchronize_slot_names will be removed in the
next version as per discussion in [1] and thus SlotSyncInitConfig()
will not be needed. It will be replaced by new functionality. So I am
currently leaving it as is.

>
> 03. WalSndWaitForStandbyConfirmation()
>
> We should add ProcessRepliesIfAny() during the loop, otherwise the walsender
> overlooks the death of an apply worker.
>

Done.

> 04. WalSndWaitForStandbyConfirmation()
>
> Not sure, but do we have to return early if walsenders got 
> PROCSIG_WALSND_INIT_STOPPING
> signal? I thought that if physical walsenders get stuck, logical walsenders 
> wait
> forever. At that time we cannot stop the primary server even if "pg_ctl stop"
> is executed.
>

Yes, right. I have added CHECK_FOR_INTERRUPTS() and 'got_STOPPING'
handling now, which I think should suffice to process
PROCSIG_WALSND_INIT_STOPPING.

> 05. SlotSyncInitConfig()
>
> Why don't we free the memory for rawname, old standby_slot_names_list, and 
> synchronize_slot_names_list?
> They seem to be overwritten.
>

Skipped for the time being due to reasons stated in pt 2.

> 06. SlotSyncInitConfig()
>
> Both physical and logical walsenders call the func, but physical one do not 
> use
> lists, right? If so, can we add a quick exit for physical walsenders?
> Or, we should carefully remove where physical calls it.
>
> 07. StartReplication()
>
> I think we do not have to call SlotSyncInitConfig().
> Alternative approach is written in above.
>

I have removed it  from StartReplication()

> 08. the other
>
> Also, I found the unexpected behavior after both 0001 and 0002 were applied.
> Was it normal or not?
>
> 1. constructed below setup
> (ensured that logical slot existed on secondary)
> 2. stopped the primary
> 3. promoted the secondary server
> 4. disabled a subscription once
> 5. changed the connection string for subscriber
> 6. Inserted data to new primary
> 7. enabled the subscription again
> 8. got an ERROR: replication slot "sub" does not exist
>
> I expected that the logical replication would be restarted, but it could not.
> Was it real issue or my fault? The error would appear in secondary.log.
>
> ```
> Setup:
> primary--->secondary
>|
>|
> subscriber
> ```

I have attached the new test script (v2); can you please try that on
the v20 set of patches. We should let the slot creation complete first
on the standby and then try promotion. I have added a few extra lines in
v2 of your script for that. In the test case, the primary's restart_lsn
was lagging behind the new slot's restart_lsn on the standby, and thus
the standby was waiting for the primary to catch up. Meanwhile the
standby got promoted and thus slot creation got aborted. That is the
reason you were not able to get logical replication working on the new
primary. v20 has improved handling and better logging for such a case.
Please try the attached test script on v20.


[1]: 
https://www.postgresql.org/message-id/CAJpy0uA%2Bt3XP2M0qtEmrOG1gSwHghjHPno5AtwTXM-94-%2Bc6JQ%40mail.gmail.com

thanks
Shveta
#!/bin/bash

port_primary=5431
port_secondary=5432
port_subscriber=5433

echo '=='
echo '=Clean up='
echo '=='

pg_ctl stop -D data_primary
pg_ctl stop -D data_secondary
pg_ctl stop -D data_subscriber

rm -rf data_* *log

echo '==='
echo '=Set up primary server='
echo '==='

initdb -D data_primary -U postgres

cat << EOF >> data_primary/postgresql.conf
wal_level = logical
port = $port_primary
standby_slot_names = 'physical'
synchronize_slot_names = 'sub'
log_replication_commands = 'on'
EOF

cat << EOF >> data_primary/pg_hba.conf
host all,replication all 0.0.0.0/0 trust
EOF

pg_ctl -D data_primary start -w -l primary.log
psql -U postgres -p $port_primary -c "CREATE TABLE tbl (id int primary key);"
psql -U postgres -p $port_primary -c "INSERT INTO tbl VALUES (1)"
psql -U postgres -p $port_primary -c "CREAT

Re: logfmt and application_context

2023-09-27 Thread Étienne BERSAC
Hi,

On Wednesday, 27 September 2023 at 10:14 +0200, Daniel Gustafsson wrote:
> Being a common format in ingestion tools makes it interesting though, but I
> wonder if those tools aren't already supporting CSV such that adding logfmt
> won't move the compatibility markers much?

Compared to CSV, logfmt has explicitly named fields. This helps tools
apply generic rules (this is the pid, this is the timestamp, etc.)
without any configuration. Loki and Grafana index a subset of known
fields. This is harder to achieve with a bunch of semicolon-separated
values.

Compared to JSON, logfmt is terser and easier for human eyes and
fingers.
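
For illustration, a hypothetical PostgreSQL log line in logfmt (the field
names here are invented for the example, not a proposed schema):

ts=2023-09-27T10:14:00.123Z pid=12345 user=postgres db=app level=LOG msg="connection received"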

This is why I think logfmt for PostgreSQL could be a good option.

Regards,
Étienne




Re: Memory consumed by child SpecialJoinInfo in partitionwise join planning

2023-09-27 Thread Amit Langote
On Wed, Sep 27, 2023 at 8:07 PM Ashutosh Bapat
 wrote:
> On Wed, Sep 27, 2023 at 2:30 PM Amit Langote  wrote:
> > Just out of curiosity, is their not being present in join_info_list
> > problematic in some manner, such as missed optimization opportunities
> > for child joins?  I noticed there is a loop over join_info_list in
> > add_paths_to_join_rel(), which does get called for child joinrels.  I
> > know this a bit off-topic for the thread, but thought to ask in case
> > you've already thought about it.
>
> The function has a comment and code to take care of this at the very beginning
> /*
> * PlannerInfo doesn't contain the SpecialJoinInfos created for joins
> * between child relations, even if there is a SpecialJoinInfo node for
> * the join between the topmost parents. So, while calculating Relids set
> * representing the restriction, consider relids of topmost parent of
> * partitions.
> */
> if (joinrel->reloptkind == RELOPT_OTHER_JOINREL)
> joinrelids = joinrel->top_parent_relids;
> else
> joinrelids = joinrel->relids;

Ah, that's accounted for.  Thanks.


--
Thanks, Amit Langote
EDB: http://www.enterprisedb.com




Re: Memory consumed by child SpecialJoinInfo in partitionwise join planning

2023-09-27 Thread Ashutosh Bapat
On Wed, Sep 27, 2023 at 2:30 PM Amit Langote  wrote:
>
> Here are some comments.

Thanks for your review.

>
> Please merge 0003 into 0002.

Done.

>
> +   /*
> +* But the list of operator OIDs and the list of expressions may be
> +* referenced somewhere else. Do not free those.
> +*/
>
> I don't see build_child_join_sjinfo() copying any list of OIDs, so I'm
> not sure what the comment is referring to.  Also, instead of "the list
> of expressions", it might be more useful to mention semi_rhs_expr,
> because that's the only one being copied.

The list of OIDs is semi_operators. It's copied as is from the parent
SpecialJoinInfo. But the way it's mentioned in the comment is confusing.
I have modified the prologue of the function to provide general guidance
on what can be freed and what should not be, and then specific examples.
Let me know if this is clearer.

>
> The comment for SpecialJoinInfo mentions that they are stored in
> PlannerInfo.join_info_list, but the child SpecialJoinInfos used by
> partitionwise joins are not, which I noticed has not been mentioned
> anywhere.  Should we make a note of that in the SpecialJoinInfo's
> comment?

Good point. Since SpecialJoinInfos for JOIN_INNER are mentioned there,
it makes sense to mention transient child SpecialJoinInfos as well.
Done.

>
> Just out of curiosity, is their not being present in join_info_list
> problematic in some manner, such as missed optimization opportunities
> for child joins?  I noticed there is a loop over join_info_list in
> add_paths_to_join_rel(), which does get called for child joinrels.  I
> know this a bit off-topic for the thread, but thought to ask in case
> you've already thought about it.

The function has a comment and code to take care of this at the very beginning
/*
* PlannerInfo doesn't contain the SpecialJoinInfos created for joins
* between child relations, even if there is a SpecialJoinInfo node for
* the join between the topmost parents. So, while calculating Relids set
* representing the restriction, consider relids of topmost parent of
* partitions.
*/
if (joinrel->reloptkind == RELOPT_OTHER_JOINREL)
joinrelids = joinrel->top_parent_relids;
else
joinrelids = joinrel->relids;

PFA latest patch set
0001 - same as previous patch set
0002 - 0002 and 0003 from previous patch set squashed together
0003 - changes addressing above comments.

-- 
Best Wishes,
Ashutosh Bapat
From d7e65bf7c277c3dd3afbd1b7a152b86d24bd6904 Mon Sep 17 00:00:00 2001
From: Ashutosh Bapat 
Date: Wed, 27 Sep 2023 16:29:11 +0530
Subject: [PATCH 3/3] Address Amit's comments

---
 src/backend/optimizer/path/joinrels.c | 24 +---
 src/include/nodes/pathnodes.h |  4 
 2 files changed, 17 insertions(+), 11 deletions(-)

diff --git a/src/backend/optimizer/path/joinrels.c b/src/backend/optimizer/path/joinrels.c
index 540eda612f..1bcc60d62b 100644
--- a/src/backend/optimizer/path/joinrels.c
+++ b/src/backend/optimizer/path/joinrels.c
@@ -1733,8 +1733,19 @@ build_child_join_sjinfo(PlannerInfo *root, SpecialJoinInfo *parent_sjinfo,
 }
 
 /*
- * free_child_sjinfo_members
- *		Free memory consumed by members of a child SpecialJoinInfo.
+ * free_child_sjinfo_members Free memory consumed by members of a child
+ * SpecialJoinInfo.
+ *
+ * The candidate members are the ones which are obtained by translating the
+ * corresponding parent members. However the members which may be potentially
+ * referenced elsewhere should not be freed.
+ *
+ * For example, all the Relids are freed since that are used only for comparison
+ * and are not referenced elsewhere. But semi_rhs_exprs are not freed, even if
+ * they are translated, since they may be referenced in paths or other objects.
+ *
+ * semi_operators is not freed as well since it's copied as is from parent
+ * SpecialJoinInfo and is not a translation.
  */
 static void
 free_child_sjinfo_members(SpecialJoinInfo *child_sjinfo)
@@ -1746,19 +1757,10 @@ free_child_sjinfo_members(SpecialJoinInfo *child_sjinfo)
 	if (child_sjinfo->jointype == JOIN_INNER)
 		return;
 
-	/*
-	 * The relids are used only for comparison and their references are not
-	 * stored anywhere. Free those.
-	 */
 	bms_free(child_sjinfo->min_lefthand);
 	bms_free(child_sjinfo->min_righthand);
 	bms_free(child_sjinfo->syn_lefthand);
 	bms_free(child_sjinfo->syn_righthand);
-
-	/*
-	 * But the list of operator OIDs and the list of expressions may be
-	 * referenced somewhere else. Do not free those.
-	 */
 }
 
 /*
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 5702fbba60..4ec007efa1 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -2839,6 +2839,10 @@ typedef struct PlaceHolderVar
  * cost estimation purposes it is sometimes useful to know the join size under
  * plain innerjoin semantics.  Note that lhs_strict and the semi_xxx fields
  * are not set meaningfully within such structs.
+ *
+ * We also create transient SpecialJoinInfos for child joins by transla

Re: ubsan

2023-09-27 Thread Alexander Lakhin

Hello Andres,

22.11.2022 02:15, Andres Freund wrote:

Hi,

On 2022-09-29 18:17:55 -0700, Andres Freund wrote:

Attached is a rebased version of this patch. Hopefully with a reasonable
amount of comments?  I kind of wanted to add a comment to reached_main, but it
just seems to end up restating the variable name...

I've now pushed a version of this with a few cleanups, mostly in
.cirrus.yml.  It'll be interesting to see how many additional problems
it finds via cfbot.


I've just discovered that the function __ubsan_default_options() is
incompatible with -fsanitize=hwaddress:
$ tmp_install/usr/local/pgsql/bin/postgres
Segmentation fault

Program received signal SIGSEGV, Segmentation fault.
0x00555639e3ec in __hwasan_check_x0_0 ()
(gdb) bt
#0  0x00555639e3ec in __hwasan_check_x0_0 ()
#1  0x00555697b5a8 in __ubsan_default_options () at main.c:446
#2  0x005556367e48 in InitializeFlags ()
    at 
/home/builder/.termux-build/libllvm/src/compiler-rt/lib/hwasan/hwasan.cpp:133
#3  __hwasan_init () at 
/home/builder/.termux-build/libllvm/src/compiler-rt/lib/hwasan/hwasan.cpp:351
#4  0x007ff7f4929c in __dl__ZL13call_functionPKcPFviPPcS2_ES0_ () from 
/system/bin/linker64
#5  0x007ff7f4900c in __dl__ZL10call_arrayIPFviPPcS1_EEvPKcPT_mbS5_ () from 
/system/bin/linker64
#6  0x007ff7f45670 in 
__dl__ZL29__linker_init_post_relocationR19KernelArgumentBlockR6soinfo ()
   from /system/bin/linker64
#7  0x007ff7f449c8 in __dl___linker_init () from /system/bin/linker64
#8  0x007ff7f4b208 in __dl__start () from /system/bin/linker64

I use clang version 16.0.6, Target: aarch64-unknown-linux-android24.

With just 'return ""' in __ubsan_default_options(), I've managed to run
`make check` (there is also an issue with check_stack_depth(),
but that's another story)...
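
For reference, a sketch of that 'return ""' workaround -- judging from the
backtrace, the function is itself instrumented and runs before the hwasan
runtime is initialized, so keeping the body trivial avoids the crash (this
reading is an assumption on my part):

const char *
__ubsan_default_options(void)
{
    /* return a constant; don't touch anything that needs runtime setup */
    return "";
}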

Best regards,
Alexander





Re: Index AmInsert Parameter Confused?

2023-09-27 Thread Matthias van de Meent
On Wed, 27 Sept 2023 at 05:03, jacktby jacktby  wrote:
>
>
>
> > 2023年9月27日 00:45,Matthias van de Meent  写道:
> >
> > On Tue, 26 Sept 2023 at 18:38, jacktby jacktby  wrote:
> >>
> >> typedef bool (*aminsert_function) (Relation indexRelation,
> >>  Datum *values,
> >>  bool *isnull,
> >>  ItemPointer heap_tid,
> >>  Relation heapRelation,
> >>  IndexUniqueCheck checkUnique,
> >>  bool indexUnchanged,
> >>  struct IndexInfo *indexInfo);
> >>
> >> Why is there a heap_tid, We haven’t inserted the value, so where does it 
> >> from ?
> >
> > Index insertion only happens after the TableAM tuple has been
> > inserted. As indexes refer to locations in the heap, this TID contains
> > the TID of the table tuple that contains the indexed values, so that
> > the index knows which tuple to refer to.
> >
> > Note that access/amapi.h describes only index AM APIs; it does not
> > cover the table AM APIs descibed in access/tableam.h
> >
> > Kind regards,
> >
> > Matthias van de Meent
> 1. Thanks, so if we insert a tuple into a table which has an index on it,
> pg will insert the tuple into the heap first, and then give the heap TID
> from the heap to the index AM API, right?

Correct. I think this is also detailed in various places of the
documentation, yes.
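
A simplified sketch of that flow (based on the executor's insert path;
slot setup and error handling are elided):

/* 1. insert into the table AM; this assigns the tuple's TID */
table_tuple_insert(rel, slot, cid, 0, NULL);

/* 2. hand that TID to each index's aminsert via index_insert() */
index_insert(indexRel, values, isnull,
             &slot->tts_tid,        /* heap_tid: where the heap tuple went */
             rel, checkUnique, indexUnchanged, indexInfo);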

> 2. I’m trying to implement a new index, but I just need the data held in 
> index table, I hope it’s not inserted into heap, because the all data I want 
> can be in index table.

In PostgreSQL, a table maintains the source of truth for the data, and
indexes are ephemeral data structures that improve the speed of
querying the data in their table. As such, dropping an index should
not impact the availability of the table's data.
If the only copy of your (non-derived) data is in the index, then it
is likely that some normal table operations will result in failures
due to the tableAM/indexAM breaking built-in assumptions about access
methods and data availability.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




Re: pg_upgrade and logical replication

2023-09-27 Thread vignesh C
On Tue, 26 Sept 2023 at 10:58, vignesh C  wrote:
>
> On Wed, 20 Sept 2023 at 16:54, Amit Kapila  wrote:
> >
> > On Fri, Sep 15, 2023 at 3:08 PM vignesh C  wrote:
> > >
> > > The attached v8 version patch has the changes for the same.
> > >
> >
> > Is the check to ensure remote_lsn is valid correct in function
> > check_for_subscription_state()? How about the case where the apply
> > worker didn't receive any change but just marked the relation as
> > 'ready'?
>
> I agree that remote_lsn will not be valid in the case when all the
> tables are in ready state and there are no changes to be sent by the
> walsender to the worker. I was not sure if this check is required in
> this case in the check_for_subscription_state function. I was thinking
> that this check could be removed.
> I'm also checking why the tables should only be in ready state, the
> check that is there in the same function, can we support upgrades when
> the tables are in syncdone state or not. I will post my analysis once
> I have finished checking on the same.

Once the table is in SUBREL_STATE_SYNCDONE state, the apply worker
will check if the apply worker has some LSN records that need to be
applied to reach the LSN of the table. Once the required WAL is
applied, the table state will be changed from SUBREL_STATE_SYNCDONE to
SUBREL_STATE_READY state. Since there is a chance that in this case
the apply worker has to apply some transactions to get all the tables
in READY state, I felt the minimum requirement should be that at least
all the tables should be in READY state for the upgrade of the
subscriber.

Regards,
Vignesh




Re: Synchronizing slots from primary to standby

2023-09-27 Thread Drouvot, Bertrand

Hi,

On 9/19/23 6:50 AM, shveta malik wrote:

On Wed, Sep 13, 2023 at 5:19 PM Amit Kapila  wrote:


On Wed, Sep 13, 2023 at 4:54 PM shveta malik  wrote:


PFA  v17. It has below changes:



@@ -2498,6 +2500,13 @@ ReorderBufferProcessTXN(ReorderBuffer *rb,
ReorderBufferTXN *txn,
   }
   else
   {
+ /*
+ * Before we send out the last set of changes to logical decoding
+ * output plugin, wait for specified streaming replication standby
+ * servers (if any) to confirm receipt of WAL upto commit_lsn.
+ */
+ WaitForStandbyLSN(commit_lsn);

It seems the first patch has a wait logic for every commit. I think it
is better to integrate this wait with WalSndWaitForWal() as suggested
by Andres in his email[1].

[1] - 
https://www.postgresql.org/message-id/20220207204557.74mgbhowydjco4mh%40alap3.anarazel.de

--


Sure Amit. PFA  v18. It addresses below:

1) patch001: the wait-for-physical-standby-confirmation logic is now
integrated with WalSndWaitForWal(). Now the walsender waits for the
physical standby's confirmation for changes up to RecentFlushPtr in
WalSndWaitForWal(). This allows the walsender to send changes to logical
subscribers one by one once they are already covered by RecentFlushPtr,
without needing to wait on every commit for physical standby
confirmation.


+   /* XXX: Is waiting for 1 second before retrying enough or more or less? 
*/
+   (void) WaitLatch(MyLatch,
+WL_LATCH_SET | WL_TIMEOUT | 
WL_EXIT_ON_PM_DEATH,
+1000L,
+
WAIT_EVENT_WAL_SENDER_WAIT_FOR_STANDBY_CONFIRMATION);

I think it would be better to let the physical walsender(s) wake up the
logical walsender(s) (instead of waiting for 1 second or so). Maybe we
could introduce a new CV that would be broadcast in
PhysicalConfirmReceivedLocation() when restart_lsn changes; what do you
think?
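
A sketch of what that could look like (the CV field name is invented for
the example):

/* in PhysicalConfirmReceivedLocation(), after restart_lsn has advanced */
ConditionVariableBroadcast(&WalSndCtl->wal_confirm_rcv_cv);

/* and in the logical walsender, instead of the timed WaitLatch() */
ConditionVariableSleep(&WalSndCtl->wal_confirm_rcv_cv,
                       WAIT_EVENT_WAL_SENDER_WAIT_FOR_STANDBY_CONFIRMATION);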

Still regarding preventing logical replication from getting ahead of the
physical replication standbys specified in standby_slot_names: we
currently don't impose this limitation on pg_logical_slot_get_changes()
and friends (which don't start a dedicated walsender).

Shouldn't we also prevent them from getting ahead of the physical
replication standbys specified in standby_slot_names?

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



