Re: make MaxBackends available in _PG_init

2022-01-25 Thread Bossart, Nathan
On 1/25/22, 12:01 AM, "Michael Paquier"  wrote:
> So, where are we on this patch?  It looks like there is an agreement
> that MaxBackends is used widely enough that it justifies the use of a
> separate function to set and get a better value computed.  There may
> be other parameters that could use a brush up, but most known cases
> would be addressed here.  v4 looks rather straight-forward, at quick
> glance.

I think the patch is in decent shape.  There may be a few remaining
places where GetMaxBackends() is called repeatedly in the same
function, but IIRC v4 already clears up the obvious ones.  I don't
know if this is worth worrying about too much, but I can create a new
version if you think it is important.

Nathan



Re: Correct error message for end-of-recovery record TLI

2022-01-25 Thread Bossart, Nathan
On 1/24/22, 8:42 PM, "Amul Sul"  wrote:
> On Tue, Jan 25, 2022 at 10:08 AM Michael Paquier  wrote:
>>
>> On Wed, Dec 01, 2021 at 07:09:34PM +, Bossart, Nathan wrote:
>> > The patch no longer applied, so I went ahead and rebased it.
>>
>> This was on the CF stack for some time, so applied.  I have also
>> changed the messages produced for the shutdown and online checkpoint
> records as they used the same messages, so that one can get more
> context depending on the record types.
>
> A ton of thanks for the improvement & the commit.

+1, thanks!

Nathan



Re: Bug in ProcArrayApplyRecoveryInfo for snapshots crossing 4B, breaking replicas

2022-01-24 Thread Bossart, Nathan
On 1/22/22, 4:43 PM, "Tomas Vondra"  wrote:
> There's a bug in ProcArrayApplyRecoveryInfo, introduced by 8431e296ea,
> which may cause failures when starting a replica, making it unusable.
> The commit message for 8431e296ea is not very clear about what exactly
> is being done and why, but the root cause is that while processing
> RUNNING_XACTS, the XIDs are sorted like this:
>
>  /*
>   * Sort the array so that we can add them safely into
>   * KnownAssignedXids.
>   */
>  qsort(xids, nxids, sizeof(TransactionId), xidComparator);
>
> where "safely" likely means "not violating the ordering expected by
> KnownAssignedXidsAdd". Unfortunately, xidComparator compares the values
> as plain uint32 values, while KnownAssignedXidsAdd actually calls
> TransactionIdFollowsOrEquals() and compares the logical XIDs :-(

Wow, nice find.

> This likely explains why we never got any reports about this - most
> systems probably don't leave transactions running for this long, so the
> probability is much lower. And replica restarts are generally not that
> common events either.

I'm aware of one report with the same message [0], but I haven't read
closely enough to determine whether it is the same issue.  It looks
like that particular report was attributed to backup_label being
removed.

> Attached patch is fixing this by just sorting the XIDs logically. The
> xidComparator is meant for places that can't do logical ordering. But
> these XIDs come from RUNNING_XACTS, so they actually come from the same
> wraparound epoch (so sorting logically seems perfectly fine).

The patch looks reasonable to me.

Nathan

[0] https://postgr.es/m/1476795473014.15979.2188%40webmail4



Re: Catalog version access

2022-01-24 Thread Bossart, Nathan
Thanks for taking a look!

On 1/23/22, 7:31 PM, "Michael Paquier"  wrote:
> On Mon, Aug 16, 2021 at 06:12:54PM +0000, Bossart, Nathan wrote:
>> I was looking at the --check switch for the postgres binary recently
>> [0], and this sounds like something that might fit in nicely there.
>> In the attached patch, --check will also check the control file if one
>> exists.
>
> This would not work on a running postmaster as CreateDataDirLockFile()
> is called beforehand, but we want this capability, no?

I was not under the impression this was a requirement, based on the
use-case discussed upthread [0].  

> Abusing a bootstrap option for this purpose does not look like a good
> idea, to be honest, especially for something only used internally now
> and undocumented, but we want to use something aimed at an external
> use with some documentation.  Using a separate switch would be more
> adapted IMO.

This is the opposite of what Magnus proposed earlier [1].  Do we need
a new pg_ctl option for this?  It seems like it is really only
intended for use by PostgreSQL developers, but perhaps there are other
use-cases I am not thinking of.  In any case, the pg_ctl option would
probably end up using --check (or another similar mode) behind the
scenes.

> Also, I think that we should be careful with the read of
> the control file to avoid false positives?  We can rely on an atomic
> read/write thanks to its maximum size of 512 bytes, but this looks
> a lot like what has been done recently with postgres -C for runtime
> GUCs, that *require* a read of the control file before grabbing the
> values we are interested in.

Sorry, I'm not following this one.  In my proposed patch, the control
file (if one exists) is read after CreateDataDirLockFile(), just like
PostmasterMain().

Nathan

[0] https://postgr.es/m/20210222022407.ecaygvx2ise6uwyz%40alap3.anarazel.de
[1] https://postgr.es/m/CABUevEySovaEDci7c0DXOrV6c7JzWqYzfVwOiRUJxMao%3D9seEw%40mail.gmail.com



Re: do only critical work during single-user vacuum?

2022-01-21 Thread Bossart, Nathan
On 1/21/22, 2:43 PM, "John Naylor"  wrote:
> - to have a simple, easy to type, command

AFAICT the disagreement is really just about the grammar.
Sawada-san's idea would look something like

VACUUM (FREEZE, INDEX_CLEANUP OFF, MIN_XID_AGE 16, MIN_MXID_AGE 16);

while your proposal looks more like

VACUUM (WRAPAROUND);

The former is highly configurable, but it is probably annoying to type
at 3 AM, and the interaction between the two *_AGE options is not
exactly intuitive (although I expect MIN_XID_AGE to be sufficient in
most cases).  The latter is not as configurable, but it is much easier
to type at 3 AM.

I think simplicity is a good goal, but I don't know if the difference
between the two approaches outweighs the benefits of configurability.
If you are in an emergency situation, you already will have to take
down the server, connect in single-user mode to the database(s) that
need vacuuming, and actually do the vacuuming.  The wraparound
WARNING/ERROR already has a HINT that describes the next steps
required.  Perhaps it would be enough to also emit an example VACUUM
command to use.

I think folks will find the configurability useful, too.  With
MIN_XID_AGE, it's super easy to have pg_cron vacuum everything over
500M on the weekend (and also do index cleanup), which may allow you
to use more relaxed autovacuum settings during the week.  The docs
already have suggestions for manually vacuuming when the load is low
[0], so I think it is reasonable to build additional support for this
use-case.

Nathan

[0] https://www.postgresql.org/docs/devel/routine-vacuuming.html#VACUUM-FOR-SPACE-RECOVERY



Re: Avoid erroring out when unable to remove or parse logical rewrite files to save checkpoint work

2022-01-20 Thread Bossart, Nathan
Thanks for your feedback.

On 1/20/22, 11:47 AM, "Andres Freund"  wrote:
> It seems odd to change a bunch of things to not be errors anymore, but then
> add new sources of errors. If we try to deal with concurrent deletions or
> permission issues - otherwise what's the point of making unlink() not an error
> anymore - why do we expect to be able to lstat()?

My reasoning for making lstat() an ERROR was that there's a chance we
need to fsync() the file, and if we can't fsync() a file for whatever
reason, we definitely want to ERROR.  I suppose we could conditionally
ERROR based on the file name, but I don't know if that's really worth
the complexity.

> I'd be more on board accepting some selective errors. E.g. not erroring on
> ENOENT, but continuing to error on others (most likely ENOACCESS). I think we
> should *not* change the

I think this approach would work for the use-case Bharath mentioned
upthread.  In any case, if deleting a file fails because the file was
already deleted, there's no point in ERROR-ing.  I think filtering
errors is a bit trickier for lstat().  If we would've fsync'd the file
but lstat() gives us ENOENT, we may have a problem.  (However, there's
also a good chance we wouldn't notice such problems if the race didn't
occur.)  I'll play around with it.

>> + /*
>> +  * We just log a message if a file doesn't fit the pattern, it's
>> +  * probably some editor's lock/state file or similar...
>> +  */
>
> An editor's lock file that starts with map- would presumably be the whole
> filename plus an additional file-ending. But this check won't catch those.

Right, it will either fsync() or unlink() those.  I'll work on the
comment.  Or do you think it's worth validating that there are no
trailing characters?  I looked into that a bit earlier, and the code
felt excessive to me, but I don't have a strong opinion here.

>> +  * Unlike failures to unlink() old logical rewrite files, we must
>> +  * ERROR if we're unable to sync transient state down to disk.  This
>> +  * allows replay to assume that everything written out before
>> +  * checkpoint start is persisted.
>> +  */
>
> Hm, not quite happy with the second bit. Checkpointed state being durable
> isn't about replay directly, it's about the basic property of a checkpoint
> being fulfilled?

I'll work on this.  FWIW I modeled this based on the comment above the
function.  Do you think that should be adjusted as well?

>> + /* persist directory entries to disk */
>> + fsync_fname("pg_logical/mappings", true);
>
> This is probably worth backpatching, if you split it out I'll do so.

Sure thing, will do shortly.

Nathan



Re: Document atthasmissing default optimization avoids verification table scan

2022-01-20 Thread Bossart, Nathan
On 1/19/22, 5:15 PM, "James Coleman"  wrote:
> I'm open to the idea of wordsmithing here, of course, but I strongly
> disagree that this is irrelevant data. There are plenty of
> optimizations Postgres could theoretically implement but doesn't, so
> measuring what should happen by what you think is obvious ("told it to
> populate with default values - why do you need to check") is clearly
> not valid.
>
> This patch actually came out of our specifically needing to verify
> that this is true before an op precisely because docs aren't actually
> clear and because we can't risk a large table scan under an exclusive
> lock. We're clearly not the only ones with that question; it came up
> in a comment on this blog post announcing the newly committed feature
> [1].

My initial reaction was similar to David's.  It seems silly to
document that we don't do something that seems obviously unnecessary.
However, I think you make a convincing argument for including it.  I
agree with David's feedback on where this information should go.

Nathan



Re: Add checkpoint and redo LSN to LogCheckpointEnd log message

2022-01-19 Thread Bossart, Nathan
On 1/3/22, 5:52 PM, "Kyotaro Horiguchi"  wrote:
> It seems to me "LSN" or just "location" is more confusing or
> mysterious than "REDO LSN" for the average user. If we want to avoid
> being technically too detailed, we would use just "start LSN=%X/%X,
> end LSN=%X/%X".  And it is equivalent to "WAL range=[%X/%X, %X/%X]".

My first instinct was that this should stay aligned with
pg_controldata, but that would mean using "location=%X/%X, REDO
location=%X/%X," which doesn't seem terribly descriptive.  IIUC the
"checkpoint location" is the LSN of the WAL record for the checkpoint,
and the "checkpoint's REDO location" is the LSN where checkpoint
creation began (i.e., what you must retain for crash recovery).  My
vote is for "start=%X/%X, end=%X/%X."

Nathan



Re: Document atthasmissing default optimization avoids verification table scan

2022-01-19 Thread Bossart, Nathan
On 9/24/21, 7:30 AM, "James Coleman"  wrote:
> When PG11 added the ability for ALTER TABLE ADD COLUMN to set a constant
> default value without rewriting the table the doc changes did not note
> how the new feature interplayed with ADD COLUMN DEFAULT NOT NULL.
> Previously such a new column required a verification table scan to
> ensure no values were null. That scan happens under an exclusive lock on
> the table, so it can have a meaningful impact on database "accessible
> uptime".

I'm likely misunderstanding, but are you saying that adding a new
column with a default value and a NOT NULL constraint used to require
a verification scan?

+ Additionally adding a column with a constant default value avoids a
+ table scan to verify no NULL values are present.

Should this clarify that it's referring to NOT NULL constraints?

Nathan



Re: Avoid erroring out when unable to remove or parse logical rewrite files to save checkpoint work

2022-01-19 Thread Bossart, Nathan
On 1/19/22, 11:08 AM, "Andres Freund"  wrote:
> On 2022-01-19 13:34:21 -0500, Tom Lane wrote:
>> As far as the patch itself goes, I agree that failure to unlink
>> is noncritical, because such a file would have no further effect
>> and we can just ignore it.
>
> I don't agree. We iterate through the directory regularly on systems with
> catalog changes + logical decoding. An ever increasing list of gunk will make
> that more and more expensive.  And I haven't heard a meaningful reason why we
> would have map-* files that we can't remove.

I think the other side of this is that we don't want checkpointing to
continually fail because of a noncritical failure.  That could also
lead to problems down the road.

> Ignoring failures like this just makes problems much harder to debug and they
> tend to bite harder for it.

If such noncritical failures happened regularly, the server logs
would likely become filled with messages about them.  Perhaps users
may not notice for a while, but I don't think the proposed patch would
make debugging excessively difficult.

Nathan



Re: do only critical work during single-user vacuum?

2022-01-19 Thread Bossart, Nathan
On 1/19/22, 11:15 AM, "John Naylor"  wrote:
> This seems to be the motivating reason for wanting new configurability
> on the server side. In any case, new knobs are out of scope for this
> thread. If the use case is compelling enough, may I suggest starting a
> new thread?

Sure.  Perhaps the new top-level command will use these new options
someday.

> Regarding the thread subject, I've been playing with the grammar, and
> found it's quite easy to have
>
> VACUUM FOR WRAPAROUND
> or
> VACUUM FOR EMERGENCY
>
> since FOR is a reserved word (and following that can be an IDENT plus
> a strcmp check) and cannot conflict with table names. This sounds a
> bit more natural than VACUUM LIMIT. Opinions?

I personally think VACUUM FOR WRAPAROUND is the best of the options
provided thus far.

Nathan



Re: do only critical work during single-user vacuum?

2022-01-19 Thread Bossart, Nathan
On 1/18/22, 9:47 PM, "Masahiko Sawada"  wrote:
> IIUC what we want to do here are two things: (1) select only old
> tables and (2) set INDEX_CLEANUP = off, TRUNCATE = off, and FREEZE =
> on. VACUUM LIMIT statement does both things at the same time. Although
> I’m concerned a bit about its flexibility, it’s a reasonable solution.
>
> On the other hand, it’s probably also useful to do either one thing in
> some cases. For instance, having a selector for (1) would be useful,
> and having a new option like FAST_FREEZE for (2) would also be useful.
> Given there is already a way for (2) (it does not default though), I
> think it might also be a good start inventing something for (1). For
> instance, a selector for VACUUM statement I came up with is:
>
> VACUUM (verbose on) TABLES WITH (min_xid_age = 16);
> or
> VACUUM (verbose on) TABLES WITH (min_age = failsafe_limit);
>
> We can expand it in the future to select tables by, for example, dead
> tuple ratio, size, etc.
>
> It's a random thought but maybe worth considering.

That's an interesting idea.  A separate selector clause could also
allow users to choose how the options interact (e.g., whether they are
OR'd or AND'd).

Nathan



Re: O(n) tasks cause lengthy startups and checkpoints

2022-01-18 Thread Bossart, Nathan
On 1/14/22, 11:26 PM, "Bharath Rupireddy"  wrote:
> On Sat, Jan 15, 2022 at 12:46 AM Bossart, Nathan  wrote:
>> I'd personally like to avoid creating two code paths for the same
>> thing.  Are there really cases when this one extra auxiliary process
>> would be too many?  And if so, how would a user know when to adjust
>> this GUC?  I understand the point that we should introduce new
>> processes sparingly to avoid burdening low-end systems, but I don't
>> think we should be afraid to add new ones when it is needed.
>
> IMO, having a GUC for enabling/disabling this new worker and it's
> related code would be a better idea. The reason is that if the
> postgres has no replication slots at all(which is quite possible in
> real stand-alone production environments) or if the file enumeration
> (directory traversal and file removal) is fast enough on the servers,
> there's no point having this new worker, the checkpointer itself can
> take care of the work as it is doing today.

IMO introducing a GUC wouldn't be doing users many favors.  Their
cluster might work just fine for a long time before they begin
encountering problems during startups/checkpoints.  Once the user
discovers the underlying reason, they have to then find a GUC for
enabling a special background worker that makes this problem go away.
Why not just fix the problem for everybody by default?

I've been thinking about what other approaches we could take besides
creating more processes.  The root of the problem seems to be that
there are a number of tasks that are performed synchronously that can
take a long time.  The process approach essentially makes these tasks
asynchronous so that they do not block startup and checkpointing.  But
perhaps this can be done in an existing process, possibly even the
checkpointer.  Like the current WAL pre-allocation patch, we could do
this work when the checkpointer isn't checkpointing, and we could also
do small amounts of work in CheckpointWriteDelay() (or a new function
called in a similar way).  In theory, this would help avoid delaying
checkpoints too long while doing cleanup at every opportunity to lower
the chances it falls far behind.

Nathan



Re: docs: pg_replication_origin_oid() description does not match behaviour

2022-01-18 Thread Bossart, Nathan
On 1/17/22, 5:24 PM, "Michael Paquier"  wrote:
> On Tue, Jan 18, 2022 at 10:19:41AM +0900, Ian Lawrence Barwick wrote:
>> Given that the code has remained unchanged since the function was
>> introduced in 9.5, it seems reasonable to change the documentation
>> to match the function behaviour rather than the other way round.
>
> Obviously.  I'll go fix that as you suggest, if there are no
> objections.

+1

Nathan



Re: Add sub-transaction overflow status in pg_stat_activity

2022-01-14 Thread Bossart, Nathan
On 1/14/22, 8:26 AM, "Tom Lane"  wrote:
> Julien Rouhaud  writes:
>> Like many I previously had to investigate a slowdown due to sub-transaction
>> overflow, and even with the information available in a monitoring view (I had
>> to rely on a quick hackish extension as I couldn't patch postgres) it's not
>> terribly fun to do this way.  On top of that log analyzers like pgBadger
>> could help to highlight such a problem.
>
> It feels to me like far too much effort is being invested in fundamentally
> the wrong direction here.  If the subxact overflow business is causing
> real-world performance problems, let's find a way to fix that, not put
> effort into monitoring tools that do little to actually alleviate anyone's
> pain.

+1

An easy first step might be to increase PGPROC_MAX_CACHED_SUBXIDS and
NUM_SUBTRANS_BUFFERS.  This wouldn't be a long-term solution to all
such performance problems, and we'd still probably want the proposed
monitoring tools, but maybe it'd kick the can down the road a bit.
Perhaps another improvement could be to store the topmost transaction
along with the parent transaction in the subtransaction log to avoid
the loop in SubTransGetTopmostTransaction().

Nathan



Re: O(n) tasks cause lengthy startups and checkpoints

2022-01-14 Thread Bossart, Nathan
On 1/14/22, 3:43 AM, "Maxim Orlov"  wrote:
> The code seems to be in good condition. All the tests are running ok with no 
> errors.

Thanks for your review.

> I like the whole idea of shifting additional checkpointer jobs as much as 
> possible to another worker. In my view, it is more appropriate to call this 
> worker "bg cleaner" or "bg file cleaner" or smth.
>
> It could be useful for systems with high load, which may deal with deleting 
> many files at once, but I'm not sure about "small" installations. Extra bg 
> worker need more resources to do occasional deletion of small amounts of 
> files. I really do not know how to do it better, maybe to have two different 
> code paths switched by GUC?

I'd personally like to avoid creating two code paths for the same
thing.  Are there really cases when this one extra auxiliary process
would be too many?  And if so, how would a user know when to adjust
this GUC?  I understand the point that we should introduce new
processes sparingly to avoid burdening low-end systems, but I don't
think we should be afraid to add new ones when it is needed.

That being said, if making the extra worker optional addresses the
concerns about resource usage, maybe we should consider it.  Justin
suggested using something like max_parallel_maintenance_workers
upthread [0].

> Should we also think about adding WAL preallocation into custodian worker 
> from the patch "Pre-alocationg WAL files" [1] ?

This was brought up in the pre-allocation thread [1].  I don't think
the custodian process would be the right place for it, and I'm also
not as concerned about it because it will generally be a small, fixed,
and configurable amount of work.  In any case, I don't sense a ton of
support for a new auxiliary process in this thread, so I'm hesitant to
go down the same path for pre-allocation.

Nathan

[0] https://postgr.es/m/20211213171935.GX17618%40telsasoft.com
[1] https://postgr.es/m/B2ACCC5A-F9F2-41D9-AC3B-251362A0A254%40amazon.com



Re: Add sub-transaction overflow status in pg_stat_activity

2022-01-13 Thread Bossart, Nathan
Thanks for the new patch!

+   
+Returns a record of information about the backend's subtransactions.
+The fields returned are subxact_count identifies
+number of active subtransaction and subxact_overflow
+ shows whether the backend's subtransaction cache is
+overflowed or not.
+   
+   

nitpick: There is an extra "" here.

Would it be more accurate to say that subxact_count is the number of
subxids that are cached?  It can only ever go up to
PGPROC_MAX_CACHED_SUBXIDS.

Nathan



Re: do only critical work during single-user vacuum?

2022-01-13 Thread Bossart, Nathan
On 1/13/22, 4:58 AM, "John Naylor"  wrote:
> On Wed, Jan 12, 2022 at 12:26 PM Bossart, Nathan  wrote:
>> As I've stated upthread, Sawada-san's suggested approach was my
>> initial reaction to this thread.  I'm not wedded to the idea of adding
>> new options, but I think there are a couple of advantages.  For both
>> single-user mode and normal operation (which may be in imminent
>> wraparound danger), you could use the same command:
>>
>> VACUUM (MIN_XID_AGE 16, ...);
>
> My proposed top-level statement can also be used in normal operation,
> so the only possible advantage is configurability. But I don't really
> see any advantage in that -- I don't think we should be moving in the
> direction of adding more-intricate ways to paper over the deficiencies
> in autovacuum scheduling. (It could be argued that I'm doing exactly
> that in this whole thread, but [imminent] shutdown situations have
> other causes besides deficient scheduling.)

The new top-level command would be configurable, right?  Your patch
uses autovacuum_freeze_max_age/autovacuum_multixact_freeze_max_age, so
the behavior of this new command now depends on the values of
parameters that won't obviously be related to it.  If these parameters
are set very low (e.g., the default values), then this command will
end up doing far more work than is probably necessary.

If we did go the route of using a parameter to determine which tables
to vacuum, I think vacuum_failsafe_age is a much better candidate, as
it defaults to a much higher value that is more likely to prevent
doing extra work.  That being said, I don't know if overloading
parameters is the right way to go.

>> (As an aside, we'd need to figure out how XID and MXID options would
>> work together.  Presumably most users would want to OR them.)
>>
>> This doesn't really tie in super nicely with the failsafe mechanism,
>> but adding something like a FAILSAFE option doesn't seem right to me,
>
> I agree -- it would be awkward and messy as an option. However, I see
> the same problem with xid/mxid -- I would actually argue they are not
> even proper options; they are "selectors". Your comments above about
> 1) needing to OR them and 2) emitting a message when a VACUUM command
> doesn't actually do anything are evidence of that fact.

That's a fair point.  But I don't think these problems are totally
intractable.  We already emit "skipping" messages from VACUUM
sometimes, and interactions between VACUUM options exist today, too.
For example, FREEZE is redundant when FULL is specified, and
INDEX_CLEANUP is totally ignored when FULL is used.

>> The other advantage I see with age-related options is that it can be
>> useful for non-imminent-wraparound situations as well.  For example,
>> maybe a user just wants to manually vacuum everything (including
>> indexes) with an age above 500M on the weekends.
>
> There is already vaccumdb for that, and I think it's method of
> selecting tables is sound -- I'm not convinced that pushing table
> selection to the server command as "options" is an improvement.

I guess I'm ultimately imagining the new options as replacing the
vacuumdb implementation.  IOW vacuumdb would just use MIN_(M)XID_AGE
behind the scenes (as would a new top-level command).

Nathan



Re: Avoid erroring out when unable to remove or parse logical rewrite files to save checkpoint work

2022-01-13 Thread Bossart, Nathan
On 1/12/22, 10:03 PM, "Bharath Rupireddy"  wrote:
> On Thu, Jan 13, 2022 at 3:47 AM Bossart, Nathan  wrote:
>> The only feedback I have for the patch is that I don't think the new
>> comments are necessary.
>
> I borrowed the comments as-is from the CheckPointSnapBuild introduced
> by the commit b89e15105. IMO, let the comments be there as they
> explain why we are not emitting ERRORs, however I will leave it to the
> committer to decide on that.

Huh, somehow I missed that when I looked at it yesterday.  I'm going
to bump this one to ready-for-committer, then.

Nathan



Re: parse/analyze API refactoring

2022-01-12 Thread Bossart, Nathan
On 12/28/21, 8:25 AM, "Peter Eisentraut"  wrote:
> (The "withcb" naming maybe isn't great; better ideas welcome.)

FWIW I immediately understood that this meant "with callback," so it
might be okay.

> Not included in this patch set, but food for further thought:  The
> pg_analyze_and_rewrite_*() functions aren't all that useful (anymore).
> One might as well write
>
>  pg_rewrite_query(parse_analyze_xxx(...))

I had a similar thought while reading through the patches.  If further
deduplication isn't too much trouble, I'd vote for that.

Nathan



Re: Avoid erroring out when unable to remove or parse logical rewrite files to save checkpoint work

2022-01-12 Thread Bossart, Nathan
On 12/31/21, 4:44 AM, "Bharath Rupireddy"  wrote:
> Currently the server is erroring out when unable to remove/parse a
> logical rewrite file in CheckPointLogicalRewriteHeap wasting the
> amount of work the checkpoint has done and preventing the checkpoint
> from finishing. This is unlike CheckPointSnapBuild does for snapshot
> files i.e. it just emits a message at LOG level and continues if it is
> unable to parse or remove the file. Attaching a small patch applying
> the same idea to the mapping files.

This seems reasonable to me.  AFAICT moving on to other files after an
error shouldn't cause any problems.  In fact, it's probably beneficial
to try to clean up as much as possible so that the files do not
continue to build up.

The only feedback I have for the patch is that I don't think the new
comments are necessary.

Nathan



Re: Add index scan progress to pg_stat_progress_vacuum

2022-01-12 Thread Bossart, Nathan
On 1/11/22, 11:46 PM, "Masahiko Sawada"  wrote:
> Regarding the new pg_stat_progress_vacuum_index view, why do we need
> to have a separate view? Users will have to check two views. If this
> view is expected to be used together with and joined to
> pg_stat_progress_vacuum, why don't we provide one view that has full
> information from the beginning? Especially, I think it's not useful
> that the total number of indexes to vacuum (num_indexes_to_vacuum
> column) and the current number of indexes that have been vacuumed
> (index_ordinal_position column) are shown in separate views.

I suppose we could add all of the new columns to
pg_stat_progress_vacuum and just set columns to NULL as appropriate.
But is that really better than having a separate view?

> Also, I’m not sure how useful index_tuples_removed is; what can we
> infer from this value (without a total number)?

I think the idea was that you can compare it against max_dead_tuples
and num_dead_tuples to get an estimate of the current cycle progress.
Otherwise, it just shows that progress is being made.

Nathan

[0] https://postgr.es/m/7874FB21-FAA5-49BD-8386-2866552656C7%40amazon.com



Re: do only critical work during single-user vacuum?

2022-01-12 Thread Bossart, Nathan
On 1/12/22, 7:43 AM, "John Naylor"  wrote:
> On Wed, Jan 12, 2022 at 1:49 AM Masahiko Sawada  wrote:
>> As another idea, we might be able to add a new option that takes an
>> optional integer value, like VACUUM (MIN_XID), VACUUM (MIN_MXID), and
>> VACUUM (MIN_XID 50). We vacuum only tables whose age is older than
>> the given value. If the value is omitted, we vacuum only tables whose
>> age exceeds a threshold (say autovacuum_freeze_max_age * 0.95), which
>> can be used in an emergency case and output in GetNewTransactionID()
>> WARNINGs output. vacuumdb’s --min-xid-age and --min-mxid-age can use
>> this option instead of fetching the list of tables from the server.
>
> That could work, and maybe also have general purpose, but I see two
> problems if I understand you correctly:
>
> - If we have a default threshold when the values are omitted, that
> implies we need to special-case single-user mode with non-obvious
> behavior, which is not ideal, as Andres mentioned upthread. (Or, now
> manual VACUUM by default would not do anything, except in extreme
> cases, which is worse.)

I agree, I don't think such options should have a default value.

> - In the single-user case, the admin would still need to add
> INDEX_CLEANUP = off for minimum downtime, and it should be really
> simple.
>
> - For the general case, we would now have the ability to vacuum a
> table, and possibly have no effect at all. That seems out of place
> with the other options.

Perhaps a message would be emitted when tables are specified but
skipped due to the min-xid-age option.

As I've stated upthread, Sawada-san's suggested approach was my
initial reaction to this thread.  I'm not wedded to the idea of adding
new options, but I think there are a couple of advantages.  For both
single-user mode and normal operation (which may be in imminent
wraparound danger), you could use the same command:

VACUUM (MIN_XID_AGE 16, ...);

(As an aside, we'd need to figure out how XID and MXID options would
work together.  Presumably most users would want to OR them.)

This doesn't really tie in super nicely with the failsafe mechanism,
but adding something like a FAILSAFE option doesn't seem right to me,
as it's basically just an alias for a bunch of other options.  In my
mind, even a new top-level command would just be an alias for the
aforementioned command.  Of course, providing a new option is not
quite as simple as opening up single-user mode and typing "BAIL OUT,"
but I don't know if it is prohibitively complicated for end users.
They'll already have had to figure out how to start single-user mode
in the first place, and we can have nice ERROR/WARNING messages that
provide a suggested VACUUM command.

The other advantage I see with age-related options is that it can be
useful for non-imminent-wraparound situations as well.  For example,
maybe a user just wants to manually vacuum everything (including
indexes) with an age above 500M on the weekends.

Another idea is to do both.  We could add age-related options, and we
could also add a "BAIL OUT" command that is just an alias for a
special VACUUM command that we feel will help get things under control
as quickly as possible.

Nathan



Re: Add index scan progress to pg_stat_progress_vacuum

2022-01-11 Thread Bossart, Nathan
On 1/11/22, 12:33 PM, "Imseih (AWS), Sami"  wrote:
> What about something like  "The number of indexes that are eligible for 
> vacuuming".
> This covers the cases where either an individual index is skipped or the 
> entire "index vacuuming" phase is skipped.

Hm.  I don't know if "eligible" is the right word.  An index can be
eligible for vacuuming but skipped because we set INDEX_CLEANUP to
false.  Maybe we should just stick with "The number of indexes that
will be vacuumed."  The only thing we may want to clarify is whether
this value will change in some cases (e.g., vacuum failsafe takes
effect).

Nathan



Re: Add jsonlog log_destination for JSON server logs

2022-01-11 Thread Bossart, Nathan
On 1/10/22, 4:51 AM, "Michael Paquier"  wrote:
> The issue comes from an incorrect change in syslogger_parseArgs()
> where I missed that the incrementation of argv by 3 has no need to be
> changed.  A build with -DEXEC_BACKEND is enough to show the failure,
> which caused a crash when starting up the syslogger because of a NULL
> pointer dereference.  The attached v9 should be enough to switch the
> CF bot to green.

I've been looking at the latest patch set intermittently and playing
around with jsonlog a little.  It seems to work well, and I don't have
any significant comments about the code.  0001 and 0002 seem
straightforward and uncontroversial.  IIUC 0003 simply introduces
jsonlog using the existing framework.

I wonder if we should consider tracking each log destination as a set
of function pointers.  The main logging code would just loop through
the enabled log destinations and use these functions, and it otherwise
would be completely detached (i.e., no "if jsonlog" blocks).  This
might open up the ability to define custom log destinations via
modules, too.  However, I don't know if there's any real demand for
something like this, and it should probably be done separately from
introducing jsonlog, anyway.

Nathan



Re: improve CREATE EXTENSION error message

2022-01-11 Thread Bossart, Nathan
On 1/11/22, 11:23 AM, "Tom Lane"  wrote:
> "Bossart, Nathan"  writes:
>> Okay, the message looks like this in v5:
>
>> postgres=# CREATE EXTENSION does_not_exist;
>> ERROR:  extension "does_not_exist" is not available
>> DETAIL:  Could not open extension control file 
>> "/usr/local/pgsql/share/extension/does_not_exist.control": No such file or 
>> directory.
>> HINT:  The extension must first be installed on the system where 
>> PostgreSQL is running.
>
> Nobody complained about that wording, so pushed.

Thanks!

Nathan



Re: Add index scan progress to pg_stat_progress_vacuum

2022-01-11 Thread Bossart, Nathan
On 1/10/22, 5:01 PM, "Imseih (AWS), Sami"  wrote:
> I have attached the 3rd revision of the patch which also includes the 
> documentation changes. Also attached is a rendered html of the docs for 
> review.
>
> "max_index_vacuum_cycle_time" has been removed.
> "index_rows_vacuumed" renamed to "index_tuples_removed". "tuples" is more
> consistent with the terminology used.
> "vacuum_cycle_ordinal_position" renamed to "index_ordinal_position".

Thanks for the new version of the patch!

nitpick: I get one whitespace error when applying the patch.

Applying: Expose progress for the "vacuuming indexes" phase of a VACUUM 
operation.
.git/rebase-apply/patch:44: tab in indent.
   Whenever  is triggered, 
index
warning: 1 line adds whitespace errors.

+ 
+  
+   num_indexes_to_vacuum bigint
+  
+  
+   The number of indexes that will be vacuumed. Only indexes with
+   pg_index.indisready set to "true" will be vacuumed.
+   Whenever  is triggered, index
+   vacuuming will be bypassed.
+  
+ 
+
+   
+  

We may want to avoid exhaustively listing the cases when this value
will be zero.  I would suggest saying, "When index cleanup is skipped,
this value will be zero" instead.

+ 
+  
+   relid oid
+  
+  
+   OID of the table being vacuumed.
+  
+ 

Do we need to include this field?  I would expect indexrelid to go
here.

+ 
+  
+   leader_pid bigint
+  
+  
+   Process ID of the parallel group leader. This field is 
NULL
+   if this process is a parallel group leader or the
+   vacuuming indexes phase is not performed in parallel.
+  
+ 

Are there cases where the parallel group leader will have an entry in
this view when parallelism is enabled?

+ 
+  
+   index_ordinal_position bigint
+  
+  
+   The order in which the index is being vacuumed. Indexes are vacuumed by 
OID in ascending order.
+  
+ 

Should we include the bit about the OID ordering?  I suppose that is
unlikely to change in the near future, but I don't know if it is
relevant information.  Also, do we need to include the "index_"
prefix?  This view is specific for indexes.  (I have the same question
for index_tuples_removed.)

Should this new table go after the "VACUUM phases" table?  It might
make sense to keep the phases table closer to where it is referenced.

+/* Advertise the number of indexes to vacuum if we are not in failsafe 
mode */
+if (!lazy_check_wraparound_failsafe(vacrel))
+pgstat_progress_update_param(PROGRESS_VACUUM_TOTAL_INDEX_VACUUM, 
vacrel->nindexes);

Shouldn't this be 0 when INDEX_CLEANUP is off, too?

+#define PROGRESS_VACUUM_CURRENT_INDRELID 7
+#define PROGRESS_VACUUM_LEADER_PID   8
+#define PROGRESS_VACUUM_INDEX_ORDINAL9
+#define PROGRESS_VACUUM_TOTAL_INDEX_VACUUM   10
+#define PROGRESS_VACUUM_DEAD_TUPLES_VACUUMED 11

nitpick: I would suggest the following names to match the existing
style:

PROGRESS_VACUUM_NUM_INDEXES_TO_VACUUM
PROGRESS_VACUUM_INDEX_LEADER_PID
PROGRESS_VACUUM_INDEX_INDEXRELID
PROGRESS_VACUUM_INDEX_ORDINAL_POSITION
PROGRESS_VACUUM_INDEX_TUPLES_REMOVED

Nathan



Re: Should we improve "PID XXXX is not a PostgreSQL server process" warning for pg_terminate_backend(<>)?

2022-01-11 Thread Bossart, Nathan
On 1/11/22, 10:06 AM, "John Naylor"  wrote:
> I pushed this with one small change -- I felt the comment didn't need
> to explain the warning message, since it now simply matches the coding
> more exactly. Also, v5 was a big enough change from v4 that I put
> Nathan as the first author.

Thanks!

Nathan



Re: Throttling WAL inserts when the standby falls behind more than the configured replica_lag_in_bytes

2022-01-10 Thread Bossart, Nathan
I noticed this thread and thought I'd share my experiences building
something similar for Multi-AZ DB clusters [0].  It's not a strict RPO
mechanism, but it does throttle backends in an effort to keep the
replay lag below a configured maximum.  I can share the code if there
is interest.

I wrote it as a new extension, and except for one piece that I'll go
into later, I was able to avoid changes to core PostgreSQL code.  The
extension manages a background worker that periodically assesses the
state of the designated standbys and updates an atomic in shared
memory that indicates how long to delay.  A transaction callback
checks this value and sleeps as necessary.  Delay can be injected for
write-enabled transactions on the primary, read-only transactions on
the standbys, or both.  The extension is heavily configurable so that
it can meet the needs of a variety of workloads.

One interesting challenge I encountered was accurately determining the
amount of replay lag.  The problem was twofold.  First, if there is no
activity on the primary, there will be nothing to replay on the
standbys, so the replay lag will appear to grow unbounded.  To work
around this, the extension's background worker periodically creates an
empty COMMIT record.  Second, if a standby reconnects after a long
time, the replay lag won't be accurate for some time.  Instead, the
replay lag will slowly increase until it reaches the correct value.
Since the delay calculation looks at the trend of the replay lag, this
apparent unbounded growth causes it to inject far more delay than is
necessary.  My guess is that this is related to 9ea3c64, and maybe it
is worth rethinking that logic.  For now, the extension just
periodically reports the value of GetLatestXTime() from the standbys
to the primary to get an accurate reading.  This is done via a new
replication callback mechanism (which requires core PostgreSQL
changes).  I can share this patch along with the extension, as I bet
there are other applications for it.

I should also note that the extension only considers "active" standbys
and primaries.  That is, ones with an active WAL sender or WAL
receiver.  This avoids the need to guess what should be done during a
network partition, but it also means that we must gracefully handle
standbys reconnecting with massive amounts of lag.  The extension is
designed to slowly ramp up the amount of injected delay until the
standby's apply lag is trending down at a sufficient rate.

I see that an approach was suggested upthread for throttling based on
WAL distance instead of per-transaction.  While the transaction
approach works decently well for certain workloads (e.g., many small
transactions like those from pgbench), it might require further tuning
for very large transactions or workloads with a variety of transaction
sizes.  For that reason, I would definitely support building a way to
throttle based on WAL generation.  It might be a good idea to avoid
throttling critical activity such as anti-wraparound vacuuming, too.

Nathan

[0] 
https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/multi-az-db-clusters-concepts.html



Re: make MaxBackends available in _PG_init

2022-01-10 Thread Bossart, Nathan
On 1/7/22, 12:27 PM, "Robert Haas"  wrote:
> On Fri, Jan 7, 2022 at 1:09 PM Bossart, Nathan  wrote:
>> While that approach would provide a way to safely retrieve the value,
>> I think it would do little to prevent the issue in practice.  If the
>> size of the patch is a concern, we could also convert MaxBackends into
>> a macro for calling GetMaxBackends().  This could also be a nice way
>> to avoid breaking extensions that are using it correctly while
>> triggering ERRORs for extensions that are using it incorrectly.  I've
>> attached a new version of the patch that does it this way.
>
> That's too magical for my taste.

Fair point.  v4 [0] is the less magical version.

Nathan

[0] 
https://postgr.es/m/attachment/125445/v4-0001-Disallow-external-access-to-MaxBackends.patch



Re: Add index scan progress to pg_stat_progress_vacuum

2022-01-10 Thread Bossart, Nathan
On 1/6/22, 6:14 PM, "Imseih (AWS), Sami"  wrote:
> I am hesitant to make column name changes for obvious reasons, as it breaks 
> existing tooling. However, I think there is a really good case to change 
> "index_vacuum_count" as the name is confusing. 
> "index_vacuum_cycles_completed" is the name I suggest if we agree to rename.
>
> For the new column, "num_indexes_to_vacuum" is good with me. 

Yeah, I think we can skip renaming index_vacuum_count for now.  In any
case, it would probably be good to discuss that in a separate thread.

Nathan



Re: Worth using personality(ADDR_NO_RANDOMIZE) for EXEC_BACKEND on linux?

2022-01-07 Thread Bossart, Nathan
On 11/23/21, 11:29 AM, "Thomas Munro"  wrote:
> Here's a patch for Linux and also FreeBSD.  The latter OS decided to
> turn on ASLR by default recently, causing my workstation to fail like
> this quite reliably, which reminded me to follow up with this.  It
> disables ASLR in pg_ctl and pg_regress, which is enough for check and
> check-world, but doesn't help you if you run the server directly
> (unlike the different hack done for macOS).
>
> For whatever random reason the failures are rarer on Linux (could be
> my imagination, but I think they might be clustered, I didn't look
> into the recipe for the randomness), but even without reproducing a
> failure it's clear to see using pmap that this has the right effect.
> I didn't bother with a check for the existence of ADDR_NO_RANDOMIZE
> because it's since 2.6.12 which is definitely ancient enough.

FWIW I just found this patch very useful for testing some EXEC_BACKEND
stuff on Linux.

Nathan



Re: Disallow quorum uncommitted (with synchronous standbys) txns in logical replication subscribers

2022-01-07 Thread Bossart, Nathan
On 1/6/22, 11:25 PM, "Jeff Davis"  wrote:
> On Wed, 2022-01-05 at 23:59 -0800, SATYANARAYANA NARLAPURAM wrote:
>> I would like to propose a GUC send_Wal_after_quorum_committed which
>> when set to ON, walsenders corresponds to async standbys and logical
>> replication workers wait until the LSN is quorum committed on the
>> primary before sending it to the standby. This not only simplifies
>> the post failover steps but avoids unnecessary downtime for the async
>> replicas. Thoughts?
>
> Do we need a GUC? Or should we just always require that sync rep is
> satisfied before sending to async replicas?
>
> It feels like the sync quorum should always be ahead of the async
> replicas. Unless I'm missing a use case, or there is some kind of
> performance gotcha.

I don't have a strong opinion on whether there needs to be a GUC, but
+1 for the ability to enforce sync quorum before sending WAL to async
standbys.  I think that would be a reasonable default behavior.

Nathan



Re: Postgres restart in the middle of exclusive backup and the presence of backup_label file

2022-01-07 Thread Bossart, Nathan
On 1/7/22, 5:52 AM, "David Steele"  wrote:
> On 1/6/22 20:20, Euler Taveira wrote:
>> On Thu, Jan 6, 2022, at 9:48 PM, Bossart, Nathan wrote:
>>> After a quick glance, I didn't see an easy way to hold a session open
>>> while the test does other things.  If there isn't one, modifying
>>> backup_fs_hot() to work with non-exclusive mode might be more trouble
>>> than it is worth.
>>
>> You can use IPC::Run to start psql in background. See examples in
>> src/test/recovery.
>
> I don't think updating backup_fs_hot() is worth it here.
>
> backup_fs_cold() works just fine for this case and if there is a need
> for backup_fs_hot() in the future it can be implemented as needed.

Thanks for the pointer on IPC::Run.  I had a feeling I was missing
something obvious!

I think I agree with David that it still isn't worth it for just this
one test.  Of course, it would be great to test the non-exclusive
backup logic as much as possible, but I'm not sure that this
particular test will provide any sort of meaningful coverage.

Nathan



Re: Postgres restart in the middle of exclusive backup and the presence of backup_label file

2022-01-06 Thread Bossart, Nathan
On 12/2/21, 1:34 PM, "Bossart, Nathan"  wrote:
> On 12/2/21, 9:50 AM, "David Steele"  wrote:
>> On 12/2/21 12:38, David Steele wrote:
>>> On 12/2/21 11:00, Andrew Dunstan wrote:
>>>>
>>>> Should we really be getting rid of
>>>> PostgreSQL::Test::Cluster::backup_fs_hot() ?
>>>
>>> Agreed, it would be better to update backup_fs_hot() to use exclusive
>>> mode and save out backup_label instead.
>>
>> Oops, of course I meant non-exclusive mode.
>
> +1.  I'll fix that in the next revision.

I finally got around to looking into this, and I think I found why it
was done this way in 2018.  backup_fs_hot() runs pg_start_backup(),
closes the session, copies the data, and then runs pg_stop_backup() in
a different session.  This doesn't work with non-exclusive mode
because the backup will be aborted when the session that runs
pg_start_backup() is closed.  pg_stop_backup() will fail with a
"backup is not in progress" error.  Furthermore,
010_logical_decoding_timelines.pl seems to be the only test that uses
backup_fs_hot().

After a quick glance, I didn't see an easy way to hold a session open
while the test does other things.  If there isn't one, modifying
backup_fs_hot() to work with non-exclusive mode might be more trouble
than it is worth.

Nathan



Re: XMAX_LOCK_ONLY and XMAX_COMMITTED (fk/multixact code)

2022-01-06 Thread Bossart, Nathan
On 12/21/21, 11:42 AM, "Mark Dilger"  wrote:
> +   /* pre-v9.3 lock-only bit pattern */
> +   ereport(ERROR,
> +   (errcode(ERRCODE_DATA_CORRUPTED),
> +errmsg_internal("found tuple with HEAP_XMAX_COMMITTED 
> and"
> +"HEAP_XMAX_EXCL_LOCK set and "
> +"HEAP_XMAX_IS_MULTI unset")));
> +   }
> +
>
> I find this bit hard to understand.  Does the comment mean to suggest that 
> the *upgrade* process should have eliminated all pre-v9.3 bit patterns, and 
> therefore any such existing patterns are certainly corruption, or does it 
> mean that data written by pre-v9.3 servers (and not subsequently updated) is 
> defined as corrupt, or  ?
>
> I am not complaining that the logic is wrong, just trying to wrap my head 
> around what the comment means.

This is just another way that a tuple may be marked locked-only, and
we want to explicitly disallow locked-only + xmax-committed.  This bit
pattern may be present on servers that were pg_upgraded from pre-v9.3
versions.  See commits 0ac5ad5 and 74ebba8 for more detail.

Nathan



Re: Add index scan progress to pg_stat_progress_vacuum

2022-01-06 Thread Bossart, Nathan
On 12/29/21, 8:44 AM, "Imseih (AWS), Sami"  wrote:
> In "pg_stat_progress_vacuum", introduce 2 columns:
>
> * total_index_vacuum : This is the # of indexes that will be vacuumed. Keep 
> in mind that if failsafe mode kicks in mid-flight to the vacuum, Postgres may 
> choose to forgo index scans. This value will be adjusted accordingly.
> * max_index_vacuum_cycle_time : The total elapsed time for a index vacuum 
> cycle is calculated and this value will be updated to reflect the longest 
> vacuum cycle. Until the first cycle completes, this value will be 0. The 
> purpose of this column is to give the user an idea of how long an index 
> vacuum cycle takes to complete.

I think that total_index_vacuum is a good thing to have.  I would
expect this to usually just be the number of indexes on the table, but
as you pointed out, this can be different when we are skipping
indexes.  My only concern with this new column is the potential for
confusion when compared with the index_vacuum_count value.
index_vacuum_count indicates the number of vacuum cycles completed,
but total_index_vacuum indicates the number of indexes that will be
vacuumed.  However, the names sound like they could refer to the same
thing to me.  Perhaps we should rename index_vacuum_count to
index_vacuum_cycles/index_vacuum_cycle_count, and the new column
should be something like num_indexes_to_vacuum or index_vacuum_total.

I don't think we need the max_index_vacuum_cycle_time column.  While
the idea is to give users a rough estimate for how long an index cycle
will take, I don't think it will help generate any meaningful
estimates for how much longer the vacuum operation will take.  IIUC we
won't have any idea how many total index vacuum cycles will be needed.
Even if we did, the current cycle could take much more or much less
time.  Also, none of the other progress views seem to provide any
timing information, which I suspect is by design to avoid inaccurate
estimates.

> Introduce a new view called "pg_stat_progress_vacuum_index". This view will 
> track the progress of a worker ( or leader PID ) while it's vacuuming an 
> index. It will expose some key columns:
>
> * pid: The PID of the worker process
>
> * leader_pid: The PID of the leader process. This is the column that can be 
> joined with "pg_stat_progress_vacuum". leader_pid and pid can have the same 
> value as a leader can also perform an index vacuum.
>
> * indrelid: The relid of the index currently being vacuumed
>
> * vacuum_cycle_ordinal_position: The processing position of the index being 
> vacuumed. This can be useful to determine how many indexes out of the total 
> indexes ( pg_stat_progress_vacuum.total_index_vacuum ) have been vacuumed
>
> * index_tuples_vacuumed: This is the number of index tuples vacuumed for the 
> index overall. This is useful to show that the vacuum is actually doing work, 
> as the # of tuples keeps increasing. 

Should we also provide some information for determining the progress
of the current cycle?  Perhaps there should be an
index_tuples_vacuumed_current_cycle column that users can compare with
the num_dead_tuples value in pg_stat_progress_vacuum.  However,
perhaps the number of tuples vacuumed in the current cycle can already
be discovered via index_tuples_vacuumed % max_dead_tuples.

+void
+rusage_adjust(const PGRUsage *ru0, PGRUsage *ru1)
+{
+   if (ru1->tv.tv_usec < ru0->tv.tv_usec)
+   {
+   ru1->tv.tv_sec--;
+   ru1->tv.tv_usec += 1000000;
+   }
+   if (ru1->ru.ru_stime.tv_usec < ru0->ru.ru_stime.tv_usec)
+   {
+   ru1->ru.ru_stime.tv_sec--;
+   ru1->ru.ru_stime.tv_usec += 1000000;
+   }
+   if (ru1->ru.ru_utime.tv_usec < ru0->ru.ru_utime.tv_usec)
+   {
+   ru1->ru.ru_utime.tv_sec--;
+   ru1->ru.ru_utime.tv_usec += 1000000;
+   }
+}

I think this function could benefit from a comment.  Without going
through it line by line, it is not clear to me exactly what it is
doing.

I know we're still working on what exactly this stuff should look
like, but I would suggest adding the documentation changes in the near
future.

Nathan



Re: skip replication slot snapshot/map file removal during end-of-recovery checkpoint

2022-01-05 Thread Bossart, Nathan
On 12/23/21, 3:17 AM, "Bharath Rupireddy"  wrote:
> Currently the end-of-recovery checkpoint can be much slower, impacting
> the server availability, if there are many replication slot files
> .snap or map- to be enumerated and deleted. How about skipping
> the .snap and map- file handling during the end-of-recovery
> checkpoint? It makes the server available faster and the next regular
> checkpoint can deal with these files. If required, we can have a GUC
> (skip_replication_slot_file_handling or some other better name) to
> control this default being the existing behavior.

I suggested something similar as a possibility in the other thread
where these tasks are being discussed [0].  I think it is worth
considering, but IMO it is not a complete solution to the problem.  If
there are frequently many such files to delete and regular checkpoints
are taking longer, the shutdown/end-of-recovery checkpoint could still
take a while.  I think it would be better to separate these tasks from
checkpointing instead.

Nathan

[0] https://postgr.es/m/A285A823-0AF2-4376-838E-847FA4710F9A%40amazon.com



Re: Pre-allocating WAL files

2022-01-05 Thread Bossart, Nathan
On 12/30/21, 3:52 AM, "Maxim Orlov"  wrote:
> I did check the patch too and found it to be ok. Check and check-world are 
> passed. 
> Overall idea seems to be good in my opinion, but I'm not sure where is the 
> optimal place to put the pre-allocation.
>
> On Thu, Dec 30, 2021 at 2:46 PM Pavel Borisov  wrote:
>> I've checked the patch v7. It applies cleanly, code is good, check-world 
>> tests passed without problems. 
>> I think it's ok to use checkpointer for this and the overall patch can be 
>> committed. But the seen performance gain makes me think again before adding 
>> this feature. I did tests myself a couple of months ago and got similar 
>> results.
>> Really don't know whether is it worth the effort.

Thank you both for your review.

Nathan



Re: Strange path from pgarch_readyXlog()

2021-12-30 Thread Bossart, Nathan
On 12/29/21, 3:11 PM, "Tom Lane"  wrote:
> "Bossart, Nathan"  writes:
>> This crossed my mind, too.  I also think one of the arrays can be
>> eliminated in favor of just using the heap (after rebuilding with a
>> reversed comparator).  Here is a minimally-tested patch that
>> demonstrates what I'm thinking.
>
> I already pushed a patch that de-static-izes those arrays, so this
> needs rebased at least.  However, now that you mention it it does
> seem like maybe the intermediate arch_files[] array could be dropped
> in favor of just pulling the next file from the heap.
>
> The need to reverse the heap's sort order seems like a problem
> though.  I really dislike the kluge you used here with a static flag
> that inverts the comparator's sort order behind the back of the
> binary-heap mechanism.  It seems quite accidental that that doesn't
> fall foul of asserts or optimizations in binaryheap.c.  For
> instance, if binaryheap_build decided it needn't do anything when
> bh_has_heap_property is already true, this code would fail.  In any
> case, we'd need to spend O(n) time inverting the heap's sort order,
> so this'd likely be slower than the current code.
>
> On the whole I'm inclined not to bother trying to optimize this
> further.  The main thing that concerned me is that somebody would
> bump up NUM_FILES_PER_DIRECTORY_SCAN to the point where the static
> space consumption becomes really problematic, and we've fixed that.

Your assessment seems reasonable to me.  If there was a better way to
adjust the comparator for the heap, maybe there would be a stronger
case for this approach.  I certainly don't think it's worth inventing
something for just this use-case.

Thanks for fixing this!

Nathan



Re: Strange path from pgarch_readyXlog()

2021-12-29 Thread Bossart, Nathan
On 12/29/21, 12:22 PM, "Thomas Munro"  wrote:
> Isn't this a corrupted pathname?
>
> 2021-12-29 03:39:55.708 CST [79851:1] WARNING:  removal of orphan
> archive status file
> "pg_wal/archive_status/00010003.0028.backup00010004.ready"
> failed too many times, will try again later

I bet this was a simple mistake in beb4e9b.

Nathan

diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 434939be9b..b5b0d4e12f 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -113,7 +113,7 @@ static PgArchData *PgArch = NULL;
  * is empty, at which point another directory scan must be performed.
  */
 static binaryheap *arch_heap = NULL;
-static char arch_filenames[NUM_FILES_PER_DIRECTORY_SCAN][MAX_XFN_CHARS];
+static char arch_filenames[NUM_FILES_PER_DIRECTORY_SCAN][MAX_XFN_CHARS + 1];
 static char *arch_files[NUM_FILES_PER_DIRECTORY_SCAN];
 static int arch_files_size = 0;



Re: Add index scan progress to pg_stat_progress_vacuum

2021-12-15 Thread Bossart, Nathan
On 12/1/21, 3:02 PM, "Imseih (AWS), Sami"  wrote:
> The current implementation of pg_stat_progress_vacuum does not
> provide progress on which index is being vacuumed making it
> difficult for a user to determine if the "vacuuming indexes" phase
> is making progress. By exposing which index is being scanned as well
> as the total progress the scan has made for the current cycle, a
> user can make better estimations on when the vacuum will complete.

+1

> The proposed patch adds 4 new columns to pg_stat_progress_vacuum:
>
> 1. indrelid - the relid of the index being vacuumed
> 2. index_blks_total - total number of blocks to be scanned in the
> current cycle
> 3. index_blks_scanned - number of blocks scanned in the current
> cycle
> 4. leader_pid - if the pid for the pg_stat_progress_vacuum entry is
> a leader or a vacuum worker. This patch places an entry for every
> worker pid ( if parallel ) as well as the leader pid

nitpick: Shouldn't index_blks_scanned be index_blks_vacuumed?  IMO it
is more analogous to heap_blks_vacuumed.

This will tell us which indexes are currently being vacuumed and the
current progress of those operations, but it doesn't tell us which
indexes have already been vacuumed or which ones are pending vacuum.
I think such information is necessary to truly understand the current
progress of vacuuming indexes, and I can think of a couple of ways we
might provide it:

  1. Make the new columns you've proposed return arrays.  This isn't
 very clean, but it would keep all the information for a given
 vacuum operation in a single row.  The indrelids column would be
 populated with all the indexes that have been vacuumed, need to
 be vacuumed, or are presently being vacuumed.  The other index-
 related columns would then have the associated stats and the
 worker PID (which might be the same as the pid column depending
 on whether parallel index vacuum was being done).  Alternatively,
 the index column could have an array of records, each containing
 all the information for a given index.
  2. Create a new view for just index vacuum progress information.
 This would have similar information as 1.  There would be an
 entry for each index that has been vacuumed, needs to be
 vacuumed, or is currently being vacuumed.  And there would be an
 easy way to join with pg_stat_progress_vacuum (e.g., leader_pid,
 which again might be the same as our index vacuum PID depending
 on whether we were doing parallel index vacuum).  Note that it
 would be possible for the PID of these entries to be null before
 and after we process the index.
  3. Instead of adding columns to pg_stat_progress_vacuum, adjust the
 current ones to be more general, and then add new entries for
 each of the indexes that have been, need to be, or currently are
 being vacuumed.  This is the most similar option to your current
 proposal, but instead of introducing a column like
 index_blks_total, we'd rename heap_blks_total to blks_total and
 use that for both the heap and indexes.  I think we'd still want
 to add a leader_pid column.  Again, we have to be prepared for
 the PID to be null in this case.  Or we could just make the pid
 column always refer to the leader, and we could introduce a
 worker_pid column.  That might create confusion, though.

I wish option #1 was cleaner, because I think it would be really nice
to have all this information in a single row.  However, I don't expect
much support for a 3-dimensional view, so I suspect option #2
(creating a separate view for index vacuum progress) is the way to go.
The other benefit of option #2 versus option #3 or your original
proposal is that it cleanly separates the top-level vacuum operations
and the index vacuum operations, which are related at the moment, but
which might not always be tied so closely together.

Nathan



Re: archive modules

2021-12-15 Thread Bossart, Nathan
On 11/2/21, 8:07 AM, "Bossart, Nathan"  wrote:
> The main motivation is provide a way to archive without shelling out.
> This reduces the amount of overhead, which can improve archival rate
> significantly.  It should also make it easier to archive more safely.
> For example, many of the common shell commands used for archiving
> won't fsync the data, but it isn't too hard to do so via C.  The
> current proposal doesn't introduce any extra infrastructure for
> batching or parallelism, but it is probably still possible.  I would
> like to eventually add batching, but for now I'm only focused on
> introducing basic archive module support.

As noted above, the latest patch set (v11) doesn't add any batching or
parallelism.  Now that beb4e9b is committed (which causes the archiver
to gather multiple files to archive in each scan of archive_status),
it seems like a good time to discuss this a bit further.  I think
there are some interesting design considerations.

As is, the current archive module infrastructure in the v11 patch set
should help reduce the amount of overhead per-file quite a bit, and I
observed a noticeable speedup with a basic file-copying archive
strategy (although this is likely not representative of real-world
workloads).  I believe it would be possible for archive module authors
to implement batching/parallelism, but AFAICT it would still require
hacks similar to what folks do today with archive_command.  For
example, you could look ahead in archive_status, archive a bunch of
files in a batch or in parallel with background workers, and then
quickly return true when the archive_library is called for later files
in the batch.

Alternatively, we could offer some kind of built-in batching support
in the archive module infrastructure.  One simple approach would be to
just have pgarch_readyXlog() optionally return the entire list of
files gathered from the directory scan of archive_status (presently up
to 64 files).  Or we could provide a GUC like archive_batch_size that
would allow users to limit how many files are sent to the
archive_library each time.  This list would be given to
pgarch_archiveXlog(), which would return which files were successfully
archived and which failed.  I think this could be done for
archive_command as well, although it might be tricky to determine
which files were archived successfully.  To handle that, we might just
need to fail the whole batch if the archive_command return value
indicates failure.
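For illustration only, a batch-aware archiving callback might look
something like the sketch below.  All names here are invented; the v11
patch set's actual callback archives one file at a time.

```c
/*
 * Hypothetical sketch only -- not the API from the v11 patch set.
 * A batch-aware archiving callback would receive the list of files
 * gathered by pgarch_readyXlog() and report per-file status, so the
 * archiver knows which .ready files it may rename to .done.
 */
typedef bool (*ArchiveBatchCB) (const char **files,  /* file names to archive */
                                int nfiles,          /* up to archive_batch_size */
                                bool *archived);     /* out: per-file results */
```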

Another interesting change is that the special timeline file handling
added in beb4e9b becomes less useful.  Presently, if a timeline
history file is marked ready for archival, we force pgarch_readyXlog()
to do a new scan of archive_status the next time it is called in order
to pick it up as soon as possible (ordinarily it just returns the
files gathered in a previous scan until it runs out).  If we are
sending a list of files to the archive module, it will be more
difficult to ensure timeline history files are picked up so quickly.
Perhaps this is a reasonable tradeoff to make when archive batching is
enabled.

I think the retry logic can stay roughly the same.  If any files in a
batch cannot be archived, wait a second before retrying.  If that
happens a few times in a row, stop archiving for a bit.  It wouldn't
be quite as precise as what's there today because the failures could
be for different files each time, but I don't know if that is terribly
important.

Finally, I wonder if batching support is something we should bother
with at all for the first round of archive module support.  I believe
it is something that could be easily added later, although it might
require archive modules to adjust the archiving callback to accept and
return a list of files.  IMO the archive modules infrastructure is
still an improvement even without batching, and it seems to fit nicely
into the existing behavior of the archiver process.  I'm curious what
others think about all this.

Nathan



Re: O(n) tasks cause lengthy startups and checkpoints

2021-12-14 Thread Bossart, Nathan
On 12/14/21, 12:09 PM, "Bossart, Nathan"  wrote:
> On 12/14/21, 9:00 AM, "Bruce Momjian"  wrote:
>> Have we changed temporary file handling in any recent major releases,
>> meaning is this a current problem or one already improved in PG 14.
>
> I haven't noticed any recent improvements while working in this area.

On second thought, the addition of the remove_temp_files_after_crash
GUC is arguably an improvement since it could prevent files from
accumulating after repeated crashes.

Nathan



Re: O(n) tasks cause lengthy startups and checkpoints

2021-12-14 Thread Bossart, Nathan
On 12/14/21, 9:00 AM, "Bruce Momjian"  wrote:
> Have we changed temporary file handling in any recent major releases,
> meaning is this a current problem or one already improved in PG 14.

I haven't noticed any recent improvements while working in this area.

Nathan



Re: O(n) tasks cause lengthy startups and checkpoints

2021-12-13 Thread Bossart, Nathan
On 12/13/21, 12:37 PM, "Robert Haas"  wrote:
> On Mon, Dec 13, 2021 at 1:21 PM Bossart, Nathan  wrote:
>> > But against all that, if these tasks are slowing down checkpoints and
>> > that's avoidable, that seems pretty important too. Interestingly, I
>> > can't say that I've ever seen any of these things be a problem for
>> > checkpoint or startup speed. I wonder why you've had a different
>> > experience.
>>
>> Yeah, it's difficult for me to justify why users should suffer long
>> periods of downtime because startup or checkpointing is taking a very
>> long time doing things that are arguably unrelated to startup and
>> checkpointing.
>
> Well sure. But I've never actually seen that happen.

I'll admit that surprises me.  As noted elsewhere [0], we were seeing
this enough with pgsql_tmp that we started moving the directory aside
before starting the server.  Discussions about handling this usually
prompt questions about why there are so many temporary files in the
first place (which is fair).  FWIW all four functions noted in my
original message [1] are things I've seen firsthand affecting users.
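For context, the move-aside workaround looks roughly like this.  The
sketch is simplified (it ignores per-tablespace pgsql_tmp directories)
and uses a throwaway directory so the steps can be shown without a real
cluster.

```shell
# Sketch of the move-aside workaround (simplified: ignores per-tablespace
# pgsql_tmp directories).  Using a throwaway directory here so the steps
# can be demonstrated without a real cluster.
PGDATA=$(mktemp -d)
mkdir -p "$PGDATA/base/pgsql_tmp"
touch "$PGDATA/base/pgsql_tmp/pgsql_tmp12345.0"   # leftover temp file

# Rename the directory aside so startup needn't scan and delete its
# contents, recreate an empty one, then remove the old files at leisure.
mv "$PGDATA/base/pgsql_tmp" "$PGDATA/base/pgsql_tmp.old"
mkdir "$PGDATA/base/pgsql_tmp"
rm -rf "$PGDATA/base/pgsql_tmp.old"
```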

Nathan

[0] https://postgr.es/m/E7573D54-A8C9-40A8-89D7-0596A36ED124%40amazon.com
[1] https://postgr.es/m/C1EE64B0-D4DB-40F3-98C8-0CED324D34CB%40amazon.com



Re: Add sub-transaction overflow status in pg_stat_activity

2021-12-13 Thread Bossart, Nathan
On 12/13/21, 6:30 AM, "Dilip Kumar"  wrote:
> On Tue, Dec 7, 2021 at 11:11 AM Justin Pryzby  wrote:
>> Since I think this field is usually not interesting to most users of
>> pg_stat_activity, maybe this should instead be implemented as a function like
>> pg_backend_get_subxact_status(pid).
>>
>> People who want to could use it like:
>> SELECT * FROM pg_stat_activity psa, pg_backend_get_subxact_status(pid) sub;
>
> I have provided two functions: one for the subtransaction count, and
> the other for whether the subtransaction cache is overflowed.  We can
> use them like this.  If we think this is a better way to do it, then
> we can also add another function for the lastOverflowedXid.

The general approach looks good to me.  I think we could have just one
function for all three values, though.
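As a sketch, a single set-returning function would keep the style of
join Justin suggested usable.  The function and column names below are
purely illustrative, not from the posted patch.

```sql
-- Hypothetical single function returning all three values per backend;
-- the names below are invented for illustration only.
SELECT psa.pid,
       sub.subxact_count,
       sub.subxact_overflowed,
       sub.last_overflowed_xid
FROM pg_stat_activity AS psa,
     pg_stat_get_backend_subxact_status(psa.pid) AS sub;
```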

Nathan



Re: O(n) tasks cause lengthy startups and checkpoints

2021-12-13 Thread Bossart, Nathan
On 12/13/21, 9:20 AM, "Justin Pryzby"  wrote:
> On Mon, Dec 13, 2021 at 08:53:37AM -0500, Robert Haas wrote:
>> Another issue is that we don't want to increase the number of
>> processes without bound. Processes use memory and CPU resources and if
>> we run too many of them it becomes a burden on the system. Low-end
>> systems may not have too many resources in total, and high-end systems
>> can struggle to fit demanding workloads within the resources that they
>> have. Maybe it would be cheaper to do more things at once if we were
>> using threads rather than processes, but that day still seems fairly
>> far off.
>
> Maybe that's an argument that this should be a dynamic background worker
> instead of an auxiliary process.  Then maybe it would be controlled by
> max_parallel_maintenance_workers (or something similar).  The checkpointer
> would need to do these tasks itself if parallel workers were disabled or
> couldn't be launched.

I think this is an interesting idea.  I dislike the prospect of having
two code paths for all this stuff, but if it addresses the concerns
about resource usage, maybe it's worth it.

Nathan



Re: O(n) tasks cause lengthy startups and checkpoints

2021-12-13 Thread Bossart, Nathan
On 12/13/21, 5:54 AM, "Robert Haas"  wrote:
> I don't know whether this kind of idea is good or not.

Thanks for chiming in.  I have an almost-complete patch set that I'm
hoping to post to the lists in the next couple of days.

> One thing we've seen a number of times now is that entrusting the same
> process with multiple responsibilities often ends poorly. Sometimes
> it's busy with one thing when another thing really needs to be done
> RIGHT NOW. Perhaps that won't be an issue here since all of these
> things are related to checkpointing, but then the process name should
> reflect that rather than making it sound like we can just keep piling
> more responsibilities onto this process indefinitely. At some point
> that seems bound to become an issue.

Two of the tasks are cleanup tasks that checkpointing handles at the
moment, and two are cleanup tasks that are done at startup.  For now,
all of these tasks are somewhat nonessential.  There's no requirement
that any of these tasks complete in order to finish startup or
checkpointing.  In fact, outside of preventing the server from running
out of disk space, I don't think there's any requirement that these
tasks run at all.  IMO this would have to be a core tenet of a new
auxiliary process like this.

That being said, I totally understand your point.  If there were a
dozen such tasks handled by a single auxiliary process, that could
cause a new set of problems.  Your checkpointing and startup might be
fast, but you might run out of disk space because our cleanup process
can't handle it all.  So a new worker could end up becoming an
availability risk as well.

> Another issue is that we don't want to increase the number of
> processes without bound. Processes use memory and CPU resources and if
> we run too many of them it becomes a burden on the system. Low-end
> systems may not have too many resources in total, and high-end systems
> can struggle to fit demanding workloads within the resources that they
> have. Maybe it would be cheaper to do more things at once if we were
> using threads rather than processes, but that day still seems fairly
> far off.

I do agree that it is important to be very careful about adding new
processes, and if a better idea for how to handle these tasks emerges,
I will readily abandon my current approach.  Upthread, Andres
mentioned optimizing unnecessary snapshot files, and I mentioned
possibly limiting how much time startup and checkpoints spend on these
tasks.  I don't have too many details for the former, and for the
latter, I'm worried about not being able to keep up.  But if the
prospect of adding a new auxiliary process for this stuff is a non-
starter, perhaps I should explore that approach some more.

> But against all that, if these tasks are slowing down checkpoints and
> that's avoidable, that seems pretty important too. Interestingly, I
> can't say that I've ever seen any of these things be a problem for
> checkpoint or startup speed. I wonder why you've had a different
> experience.

Yeah, it's difficult for me to justify why users should suffer long
periods of downtime because startup or checkpointing is taking a very
long time doing things that are arguably unrelated to startup and
checkpointing.

Nathan



Re: do only critical work during single-user vacuum?

2021-12-09 Thread Bossart, Nathan
On 12/9/21, 5:27 PM, "Peter Geoghegan"  wrote:
> I imagine that this new function (to handle maintenance tasks in the
> event of a wraparound emergency) would output information about its
> progress. For example, it would make an up-front decision about which
> tables needed to be vacuumed in order for the current DB's
> datfrozenxid to be sufficiently new, before it started anything (with
> handling for edge-cases with many tables, perhaps). It might also show
> the size of each table, and show another line for each table that has
> been processed so far, as a rudimentary progress indicator.

I like the idea of having a built-in function that does the bare
minimum to resolve wraparound emergencies, and I think providing some
sort of simple progress indicator (even if rudimentary) would be very
useful.  I imagine the decision logic could be pretty simple.  If
we're only interested in getting the cluster out of a wraparound
emergency, we can probably just look for all tables with an age over
~2B.
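For example, the candidate set could be gathered with something along
these lines (the threshold here is illustrative):

```sql
-- Tables (plus matviews and toast tables) whose relfrozenxid age is
-- past ~2B, i.e. the ones that must be vacuumed to escape the
-- wraparound emergency.
SELECT oid::regclass AS relation,
       age(relfrozenxid) AS xid_age
FROM pg_class
WHERE relkind IN ('r', 'm', 't')
  AND age(relfrozenxid) > 2000000000
ORDER BY age(relfrozenxid) DESC;
```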

Nathan



Re: do only critical work during single-user vacuum?

2021-12-09 Thread Bossart, Nathan
On 12/9/21, 5:06 PM, "Bossart, Nathan"  wrote:
> On 12/9/21, 4:36 PM, "Peter Geoghegan"  wrote:
>> We could then apply this criteria in new code that implements this
>> "big red button" (maybe this is a new option for the postgres
>> executable, a little like --single?). Something that's reasonably
>> targeted, and dead simple to use.
>
> +1

As Andres noted, such a feature might be useful during normal
operation, too.  Perhaps the vacuumdb --min-xid-age stuff should be
moved to a new VACUUM option.

Nathan



Re: do only critical work during single-user vacuum?

2021-12-09 Thread Bossart, Nathan
On 12/9/21, 4:36 PM, "Peter Geoghegan"  wrote:
> We could then apply this criteria in new code that implements this
> "big red button" (maybe this is a new option for the postgres
> executable, a little like --single?). Something that's reasonably
> targeted, and dead simple to use.

+1

Nathan



Re: do only critical work during single-user vacuum?

2021-12-09 Thread Bossart, Nathan
On 12/9/21, 12:33 PM, "Bossart, Nathan"  wrote:
> On 12/9/21, 11:34 AM, "John Naylor"  wrote:
>> Now that we have a concept of a fail-safe vacuum, maybe it would be
>> beneficial to skip a vacuum in single-user mode if the fail-safe
>> criteria were not met at the beginning of vacuuming a relation. This
>> is not without risk, of course, but it should be much faster than
>> today and once up and running the admin would have a chance to get a
>> handle on things. Thoughts?
>
> Would the --min-xid-age and --no-index-cleanup vacuumdb options help
> with this?

Sorry, I'm not sure what I was thinking.  Of course you cannot use
vacuumdb in single-user mode.  But I think something like
--min-xid-age in VACUUM is what you are looking for.

Nathan



Re: do only critical work during single-user vacuum?

2021-12-09 Thread Bossart, Nathan
On 12/9/21, 11:34 AM, "John Naylor"  wrote:
> When a user must shut down and restart in single-user mode to run
> vacuum on an entire database, that does a lot of work that's
> unnecessary for getting the system online again, even without
> index_cleanup. We had a recent case where a single-user vacuum took
> around 3 days to complete.
>
> Now that we have a concept of a fail-safe vacuum, maybe it would be
> beneficial to skip a vacuum in single-user mode if the fail-safe
> criteria were not met at the beginning of vacuuming a relation. This
> is not without risk, of course, but it should be much faster than
> today and once up and running the admin would have a chance to get a
> handle on things. Thoughts?

Would the --min-xid-age and --no-index-cleanup vacuumdb options help
with this?

Nathan



Re: Is it correct to update db state in control file as "shutting down" during end-of-recovery checkpoint?

2021-12-08 Thread Bossart, Nathan
On 12/8/21, 3:29 AM, "Bharath Rupireddy" 
 wrote:
> Thanks for your thoughts. I'm fine either way, hence attaching two
> patches here with and I will leave it for the committer 's choice.
> 1) v1-0001-Add-DB_IN_END_OF_RECOVERY_CHECKPOINT-state-for-co.patch --
> adds new db state DB_IN_END_OF_RECOVERY_CHECKPOINT for control file.
> 2) v1-0001-Skip-control-file-db-state-updation-during-end-of.patch --
> just skips setting db state to DB_SHUTDOWNING and DB_SHUTDOWNED in
> case of end-of-recovery checkpoint so that the state will be
> DB_IN_CRASH_RECOVERY which then changes to DB_IN_PRODUCTION.

I've bumped this one to ready-for-committer.  For the record, my
preference is the second patch (for the reasons discussed upthread).
Both patches might benefit from a small comment or two, too.

Nathan



Re: Is it correct to update db state in control file as "shutting down" during end-of-recovery checkpoint?

2021-12-07 Thread Bossart, Nathan
On 12/7/21, 8:42 PM, "Bharath Rupireddy" 
 wrote:
> On Wed, Dec 8, 2021 at 9:49 AM Bossart, Nathan  wrote:
>> I think that's alright.  The only other small suggestion I have would
>> be to say "during end-of-recovery checkpoint" instead of "while in
>> end-of-recovery checkpoint."
>
> "while in" is being used by DB_IN_CRASH_RECOVERY and
> DB_IN_ARCHIVE_RECOVERY messages. I don't think it's a good idea to
> deviate from that and use "during".

Fair enough.  I don't have a strong opinion about this.

>> Another option we might want to consider is to just skip updating the
>> state entirely for end-of-recovery checkpoints.  The state would
>> instead go straight from DB_IN_CRASH_RECOVERY to DB_IN_PRODUCTION.  I
>> don't know if it's crucial to have a dedicated control file state for
>> end-of-recovery checkpoints.
>
> Please note that end-of-recovery can take a while in production
> systems (we have observed such things working with our customers) and
> anything can happen during that period of time. The end-of-recovery
> checkpoint is not something that gets finished momentarily. Therefore,
> having a separate db state in the control file is useful.

Is there some useful distinction between the states for users?  ISTM
that users will be waiting either way, and I don't know that an extra
control file state will help all that much.  The main reason I bring
up this option is that the list of states is pretty short and appears
to be intended to indicate the high-level status of the server.  Most
of the states are over 20 years old, and the newest one is over 10
years old, so I don't think new states can be added willy-nilly.

Of course, I could be off-base and others might agree that this new
state would be nice to have.

Nathan



Re: Is it correct to update db state in control file as "shutting down" during end-of-recovery checkpoint?

2021-12-07 Thread Bossart, Nathan
On 12/7/21, 5:21 PM, "Bharath Rupireddy" 
 wrote:
> On Wed, Dec 8, 2021 at 2:50 AM Bossart, Nathan  wrote:
>> I noticed that some (but not all) of the surrounding messages say
>> "last known up at" the control file time.  I'm curious why you chose
>> not to use that phrasing in this case.
>
> If state is DB_IN_END_OF_RECOVERY_CHECKPOINT that means the db was
> interrupted while in end-of-recovery checkpoint, so I used the
> phrasing similar to DB_IN_CRASH_RECOVERY and DB_IN_ARCHIVE_RECOVERY
> cases. I would like to keep it as-is (in the v1 patch) unless anyone
> has other thoughts here?
> (errmsg("database system was interrupted while in recovery at %s",
> (errmsg("database system was interrupted while in recovery at log time %s",

I think that's alright.  The only other small suggestion I have would
be to say "during end-of-recovery checkpoint" instead of "while in
end-of-recovery checkpoint."

Another option we might want to consider is to just skip updating the
state entirely for end-of-recovery checkpoints.  The state would
instead go straight from DB_IN_CRASH_RECOVERY to DB_IN_PRODUCTION.  I
don't know if it's crucial to have a dedicated control file state for
end-of-recovery checkpoints.
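Concretely, the second option would amount to something like this in
CreateCheckPoint() (a sketch of the idea, not an exact patch):

```c
/*
 * Sketch only: skip the DB_SHUTDOWNING update for end-of-recovery
 * checkpoints, so the control file stays at DB_IN_CRASH_RECOVERY
 * until the system moves to DB_IN_PRODUCTION.
 */
if (shutdown && !(flags & CHECKPOINT_END_OF_RECOVERY))
{
    LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
    ControlFile->state = DB_SHUTDOWNING;
    UpdateControlFile();
    LWLockRelease(ControlFileLock);
}
```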

Nathan



Re: Should we improve "PID XXXX is not a PostgreSQL server process" warning for pg_terminate_backend(<>)?

2021-12-07 Thread Bossart, Nathan
On 12/7/21, 5:21 PM, "Bharath Rupireddy" 
 wrote:
> On Wed, Dec 8, 2021 at 4:17 AM Bossart, Nathan  wrote:
>> I agree with Tom.  I would just s/server/backend/ (as per the
>> attached) and call it a day.
>
> Thanks. v5 patch looks good to me.

I've marked the commitfest entry as ready-for-committer.

Nathan



Re: Is it correct to update db state in control file as "shutting down" during end-of-recovery checkpoint?

2021-12-07 Thread Bossart, Nathan
On 12/7/21, 12:10 AM, "Bharath Rupireddy" 
 wrote:
> Here's a patch that I've come up with. Please see if this looks okay
> and let me know if we want to take it forward so that I can add a CF
> entry.

Overall, the patch seems reasonable to me.

+       case DB_IN_END_OF_RECOVERY_CHECKPOINT:
+           ereport(LOG,
+                   (errmsg("database system was interrupted while in end-of-recovery checkpoint at %s",
+                           str_time(ControlFile->time))));
+           break;

I noticed that some (but not all) of the surrounding messages say
"last known up at" the control file time.  I'm curious why you chose
not to use that phrasing in this case.

Nathan



Re: Pre-allocating WAL files

2021-12-07 Thread Bossart, Nathan
On 12/7/21, 9:35 AM, "Bossart, Nathan"  wrote:
> On 12/7/21, 12:29 AM, "Bharath Rupireddy" 
>  wrote:
>> Why can't the walwriter pre-allocate some of the WAL segments instead
>> of a new background process? Of course, it might delay the main
>> functionality of the walwriter i.e. flush and sync the WAL files, but
>> having checkpointer do the pre-allocation makes it do another extra
>> task. Here the amount of walwriter work vs checkpointer work, I'm not
>> sure which one does more work compared to the other.
>
> The argument against adding it to the WAL writer is that we want it to
> run with low latency to flush asynchronous commits.  If we added WAL
> pre-allocation to the WAL writer, there could periodically be large
> delays.

To your point on trying to avoid giving the checkpointer extra tasks
(basically what we are talking about on the other thread [0]), WAL
pre-allocation might not be of much concern because it will generally
be a small, fixed (and configurable) amount of work, and it can be
performed concurrently with the checkpoint.  Plus, WAL pre-allocation
should ordinarily be phased out as WAL segments become eligible for
recycling.  IMO it's not comparable to tasks like
CheckPointSnapBuild() that can delay checkpointing for a long time.

Nathan

[0] 
https://www.postgresql.org/message-id/flat/C1EE64B0-D4DB-40F3-98C8-0CED324D34CB%40amazon.com



Re: Alter all tables in schema owner fix

2021-12-07 Thread Bossart, Nathan
On 12/7/21, 2:47 AM, "Greg Nancarrow"  wrote:
> On Tue, Dec 7, 2021 at 9:01 PM Amit Kapila  wrote:
>>
>> Thanks, the patch looks mostly good to me. I have slightly modified it
>> to incorporate one of Michael's suggestions, ran pgindent, and
>> modified the commit message.
>>
>
> LGTM, except in the patch commit message I'd change "Create
> Publication" to "CREATE PUBLICATION".

LGTM, too.

Nathan



Re: Pre-allocating WAL files

2021-12-07 Thread Bossart, Nathan
On 12/7/21, 12:29 AM, "Bharath Rupireddy" 
 wrote:
> Why can't the walwriter pre-allocate some of the WAL segments instead
> of a new background process? Of course, it might delay the main
> functionality of the walwriter i.e. flush and sync the WAL files, but
> having checkpointer do the pre-allocation makes it do another extra
> task. Here the amount of walwriter work vs checkpointer work, I'm not
> sure which one does more work compared to the other.

The argument against adding it to the WAL writer is that we want it to
run with low latency to flush asynchronous commits.  If we added WAL
pre-allocation to the WAL writer, there could periodically be large
delays.

> Another idea could be to let the walwriter or checkpointer pre-allocate
> the WAL files, whichever seems free as-of-the-moment when the WAL
> segment pre-allocation request comes. We can go further to let the
> user choose which process, i.e. checkpointer or walwriter, does the
> pre-allocation, with a GUC maybe?

My latest patch set [0] adds WAL pre-allocation to the checkpointer.
In that patch set, WAL pre-allocation is done both outside of
checkpoints as well as during checkpoints (via
CheckPointWriteDelay()).  

Nathan

[0] 
https://www.postgresql.org/message-id/CB15BEBD-98FC-4E72-86AE-513D34014176%40amazon.com



Re: Add sub-transaction overflow status in pg_stat_activity

2021-12-07 Thread Bossart, Nathan
On 12/6/21, 8:19 PM, "Dilip Kumar"  wrote:
> If the subtransaction cache is overflowed in some of the transactions
> then it will affect all the concurrent queries as they need to access
> the SLRU for checking the visibility of each tuple.  But currently
> there is no way to identify whether in any backend subtransaction is
> overflowed or what is the current active subtransaction count.
> Attached patch adds subtransaction count and subtransaction overflow
> status in pg_stat_activity.  I have implemented this because of the
> recent complaint about the same [1]

I'd like to give a general +1 to this effort.  Thanks for doing this!
I've actually created a function to provide this information in the
past, so I will help review.

Nathan



Re: Do we need pre-allocate WAL files during end-of-recovery checkpoint?

2021-12-06 Thread Bossart, Nathan
On 12/6/21, 4:54 AM, "Bharath Rupireddy" 
 wrote:
> The function PreallocXlogFiles doesn't get called during
> end-of-recovery checkpoint in CreateCheckPoint, see [1]. The server
> becomes operational after the end-of-recovery checkpoint and may need
> WAL files. However, I'm not sure how beneficial it is going to be if
> the WAL is pre-allocated (as PreallocXlogFiles just allocates only 1
> extra WAL file).

There is another thread for adding more effective WAL pre-allocation
[0] that you might be interested in.

Nathan

[0] 
https://www.postgresql.org/message-id/flat/20201225200953.jjkrytlrzojbndh5%40alap3.anarazel.de



Re: Is it correct to update db state in control file as "shutting down" during end-of-recovery checkpoint?

2021-12-06 Thread Bossart, Nathan
On 12/6/21, 4:34 AM, "Bharath Rupireddy" 
 wrote:
> While the database is performing end-of-recovery checkpoint, the
> control file gets updated with db state as "shutting down" in
> CreateCheckPoint (see the code snippet at [1]) and at the end it sets
> it back to "shut down" for a brief moment and then finally to "in
> production". If the end-of-recovery checkpoint takes a lot of time or
> the db goes down during the end-of-recovery checkpoint for whatever
> reasons, the control file ends up having the wrong db state.
>
> Should we add a new db state something like
> DB_IN_END_OF_RECOVERY_CHECKPOINT/"in end-of-recovery checkpoint" or
> something else to represent the correct state?

This seems like a reasonable change to me.  From a quick glance, it
looks like it should be a simple fix that wouldn't add too much
divergence between the shutdown and end-of-recovery checkpoint code
paths.

Nathan



Re: O(n) tasks cause lengthy startups and checkpoints

2021-12-06 Thread Bossart, Nathan
On 12/6/21, 3:44 AM, "Bharath Rupireddy" 
 wrote:
> On Fri, Dec 3, 2021 at 11:50 PM Bossart, Nathan  wrote:
>> I might hack something together for the separate worker approach, if
>> for no other reason than to make sure I really understand how these
>> functions work.  If/when a better idea emerges, we can alter course.
>
> Thanks. As I said upthread we've been discussing the approach of
> offloading some of the checkpoint tasks like (deleting snapshot files)
> internally for quite some time and I would like to share a patch that
> adds a new background cleaner process (currently able to delete the
> logical replication snapshot files, if required can be extended to do
> other tasks as well). I don't mind if it gets rejected. Please have a
> look.

Thanks for sharing!  I've also spent some time on a patch set, which I
intend to share once I have handling for all four tasks (so far I have
handling for CheckPointSnapBuild() and RemovePgTempFiles()).  I'll
take a look at your patch as well.

Nathan



Re: pg_replslotdata - a tool for displaying replication slot information

2021-12-06 Thread Bossart, Nathan
On 12/5/21, 11:10 PM, "Michael Paquier"  wrote:
> On Thu, Dec 02, 2021 at 08:32:08AM +0530, Bharath Rupireddy wrote:
>> On Thu, Dec 2, 2021 at 4:22 AM Andres Freund  wrote:
>>>> I don't have any other compelling use-cases at the moment, but I will say
>>>> that it is typically nice from an administrative standpoint to be able to
>>>> inspect things like this without logging into a running server.
>>>
>>> Weighed against the cost of maintaining (including documentation) a separate
>>> tool this doesn't seem sufficient reason.
>> 
>> IMHO, this shouldn't be a reason to not get something useful (useful
>> IMO and few others in this thread) into the core. The maintenance of
>> the tools generally is low compared to the core server features once
>> they get reviewed and committed.
>
> Well, a bit less maintenance is always better than more maintenance.
> An extra cost that you may be missing is related to the translation of
> the documentation, as well as the translation of any new strings this
> would require.  FWIW, I don't directly see a use for this tool that
> could not be solved with an online server.

Bharath, perhaps you should maintain this outside of core PostgreSQL
for now.  If some compelling use-cases ever surface that make it seem
worth the added maintenance burden, this thread could probably be
revisited.

Nathan



Re: O(n) tasks cause lengthy startups and checkpoints

2021-12-03 Thread Bossart, Nathan
On 12/3/21, 5:57 AM, "Bharath Rupireddy" 
 wrote:
> On Fri, Dec 3, 2021 at 3:01 AM Bossart, Nathan  wrote:
>>
>> On 12/1/21, 6:48 PM, "Bharath Rupireddy" 
>>  wrote:
>> > +1 for the overall idea of making the checkpoint faster. In fact, we
>> > here at our team have been thinking about this problem for a while. If
>> > there are a lot of files that checkpoint has to loop over and remove,
>> > IMO, that task can be delegated to someone else (maybe a background
>> > worker called background cleaner or bg cleaner, of course, we can have
>> > a GUC to enable or disable it). The checkpoint can just write some
>>
>> Right.  IMO it isn't optimal to have critical things like startup and
>> checkpointing depend on somewhat-unrelated tasks.  I understand the
>> desire to avoid adding additional processes, and maybe it is a bigger
>> hammer than what is necessary to reduce the impact, but it seemed like
>> a natural solution for this problem.  That being said, I'm all for
>> exploring other ways to handle this.
>
> Having a generic background cleaner process (controllable via a few
> GUCs), which can delete a bunch of files (snapshot, mapping, old WAL,
> temp files etc.) or some other task on behalf of the checkpointer,
> seems to be the easiest solution.
>
> I'm too open for other ideas.

I might hack something together for the separate worker approach, if
for no other reason than to make sure I really understand how these
functions work.  If/when a better idea emerges, we can alter course.

Nathan



Re: should we document an example to set multiple libraries in shared_preload_libraries?

2021-12-03 Thread Bossart, Nathan
On 12/3/21, 6:21 AM, "Bharath Rupireddy" 
 wrote:
> +1 to add here in the "Parameter Names and Values section", but do we
> want to backlink every string parameter to this section? I think it
> needs more effort. IMO, we can just backlink for
> shared_preload_libraries alone. Thoughts?

IMO this is most important for GUC_LIST_QUOTE parameters, of which
there are only a handful.  I don't think adding a link to every string
parameter is necessary.

Nathan



Re: Skip vacuum log report code in lazy_scan_heap() if possible

2021-12-03 Thread Bossart, Nathan
On 12/3/21, 7:40 AM, "Peter Geoghegan"  wrote:
> If my patch to unite vacuum verbose and the autovacuum logging gets
> in, then this issue also goes away. 

Perhaps this patch should be marked Rejected in favor of that one,
then.

Nathan



Re: Alter all tables in schema owner fix

2021-12-03 Thread Bossart, Nathan
On 12/2/21, 11:57 PM, "tanghy.f...@fujitsu.com"  wrote:
> Thanks for your patch.
> I tested it and it fixed this problem as expected. It also passed "make 
> check-world".

+1, the patch looks good to me, too.  My only other suggestion would
be to move IsSchemaPublication() to pg_publication.c.

Nathan



Re: Alter all tables in schema owner fix

2021-12-02 Thread Bossart, Nathan
On 12/2/21, 7:07 PM, "vignesh C"  wrote:
> Currently while changing the owner of ALL TABLES IN SCHEMA
> publication, it is not checked if the new owner has superuser
> permission or not. Added a check to throw an error if the new owner
> does not have superuser permission.
> Attached patch has the changes for the same. Thoughts?

Yeah, the documentation clearly states that "the new owner of a FOR
ALL TABLES or FOR ALL TABLES IN SCHEMA publication must be a
superuser" [0].

+/*
+ * Check if any schema is associated with the publication.
+ */
+static bool
+CheckSchemaPublication(Oid pubid)

I don't think the name CheckSchemaPublication() accurately describes
what this function is doing.  I would suggest something like
PublicationHasSchema() or PublicationContainsSchema().  Also, much of
this new function appears to be copied from GetPublicationSchemas().
Should we just use that instead?

+CREATE ROLE regress_publication_user3 LOGIN SUPERUSER;
+GRANT regress_publication_user2 TO regress_publication_user3;
+SET ROLE regress_publication_user3;
+SET client_min_messages = 'ERROR';
+CREATE PUBLICATION testpub4 FOR ALL TABLES IN SCHEMA pub_test;
+RESET client_min_messages;
+SET ROLE regress_publication_user;
+ALTER ROLE regress_publication_user3 NOSUPERUSER;
+SET ROLE regress_publication_user3;

I think this test setup can be simplified a bit:

CREATE ROLE regress_publication_user3 LOGIN;
GRANT regress_publication_user2 TO regress_publication_user3;
SET client_min_messages = 'ERROR';
CREATE PUBLICATION testpub4 FOR ALL TABLES IN SCHEMA pub_test;
RESET client_min_messages;
ALTER PUBLICATION testpub4 OWNER TO regress_publication_user3;
SET ROLE regress_publication_user3;

Nathan

[0] https://www.postgresql.org/docs/devel/sql-alterpublication.html



Re: Temporary tables versus wraparound... again

2021-12-02 Thread Bossart, Nathan
On 10/12/21, 3:07 PM, "Greg Stark"  wrote:
> Here's an updated patch. I added some warning messages to autovacuum.

I think this is a useful optimization, and I intend to review the patch
more closely soon.  It looks reasonable to me after a quick glance.

> One thing I learned trying to debug this situation in production is
> that it's nigh impossible to find the pid of the session using a
> temporary schema. The number in the schema refers to the backendId in
> the sinval stuff for which there's no external way to look up the
> corresponding pid. It would have been very helpful if autovacuum had
> just told me which backend pid to kill.

I certainly think it would be good to have autovacuum log the PID, but
IIUC a query like this would help you map the backend ID to the PID:

SELECT bid, pg_stat_get_backend_pid(bid) AS pid
FROM pg_stat_get_backend_idset() bid;

Nathan



Re: should we document an example to set multiple libraries in shared_preload_libraries?

2021-12-02 Thread Bossart, Nathan
On 12/1/21, 5:59 AM, "Bharath Rupireddy" 
 wrote:
> On Wed, Dec 1, 2021 at 6:45 PM Tom Lane  wrote:
>> Considering the vanishingly small number of actual complaints we've
>> seen about this, that sounds ridiculously over-engineered.
>> A documentation example should be sufficient.
>
> Thanks. Here's the v1 patch adding examples in the documentation.

I think the problems you noted upthread are shared for all GUCs with
type GUC_LIST_QUOTE (e.g., search_path, temp_tablespaces).  Perhaps
the documentation for each of these GUCs should contain a short blurb
about how to properly SET a list of values.

Also upthread, I see that you gave the following example for an
incorrect way to set shared_preload_libraries:

ALTER SYSTEM SET shared_preload_libraries =
auth_delay,pg_stat_statements,sepgsql;  --> wrong

Why is this wrong?  It seems to work okay for me.

Nathan



Re: Skip vacuum log report code in lazy_scan_heap() if possible

2021-12-02 Thread Bossart, Nathan
On 10/29/21, 10:49 AM, "Bossart, Nathan"  wrote:
> On 10/29/21, 3:54 AM, "Greg Nancarrow"  wrote:
>> When recently debugging a vacuum-related issue, I noticed that there
>> is some log-report processing/formatting code at the end of
>> lazy_scan_heap() that, unless the vacuum VERBOSE option is specified,
>> typically results in no log output (as the log-level is then DEBUG2).
>> I think it is worth skipping this code if ultimately nothing is going
>> to be logged (and I found that even for a tiny database, a VACUUM may
>> execute this code hundreds of times).
>> I have attached a simple patch for this.
>
> I think this logging only happens once per table, so I'm not sure it's
> really worth it to short-circuit here.  If it was per-page, IMO there
> would be a much stronger case for it.  That being said, I don't think
> the proposed patch would hurt anything.

Since I have no further comments, I went ahead and marked this one as
ready-for-committer.

Nathan



Re: Postgres restart in the middle of exclusive backup and the presence of backup_label file

2021-12-02 Thread Bossart, Nathan
On 12/2/21, 9:50 AM, "David Steele"  wrote:
> On 12/2/21 12:38, David Steele wrote:
>> On 12/2/21 11:00, Andrew Dunstan wrote:
>>>
>>> Should we really be getting rid of
>>> PostgreSQL::Test::Cluster::backup_fs_hot() ?
>>
>> Agreed, it would be better to update backup_fs_hot() to use exclusive
>> mode and save out backup_label instead.
>
> Oops, of course I meant non-exclusive mode.

+1.  I'll fix that in the next revision.

Nathan



Re: O(n) tasks cause lengthy startups and checkpoints

2021-12-02 Thread Bossart, Nathan
On 12/1/21, 6:48 PM, "Bharath Rupireddy" 
 wrote:
> +1 for the overall idea of making the checkpoint faster. In fact, we
> here at our team have been thinking about this problem for a while. If
> there are a lot of files that checkpoint has to loop over and remove,
> IMO, that task can be delegated to someone else (maybe a background
> worker called background cleaner or bg cleaner, of course, we can have
> a GUC to enable or disable it). The checkpoint can just write some

Right.  IMO it isn't optimal to have critical things like startup and
checkpointing depend on somewhat-unrelated tasks.  I understand the
desire to avoid adding additional processes, and maybe it is a bigger
hammer than what is necessary to reduce the impact, but it seemed like
a natural solution for this problem.  That being said, I'm all for
exploring other ways to handle this.

> Another idea could be to parallelize the checkpoint i.e. IIUC, the
> tasks that checkpoint do in CheckPointGuts are independent and if we
> have some counters like (how many snapshot/mapping files that the
> server generated)

Could you elaborate on this?  Is your idea that the checkpointer would
create worker processes like autovacuum does?

Nathan



Re: O(n) tasks cause lengthy startups and checkpoints

2021-12-02 Thread Bossart, Nathan
On 12/1/21, 6:06 PM, "Euler Taveira"  wrote:
> Saying that a certain task is O(n) doesn't mean it needs a separate process to
> handle it. Did you have a use case or even better numbers (% of checkpoint /
> startup time) that makes your proposal worthwhile?

I don't have specific numbers on hand, but each of the four functions
I listed is something I routinely see impacting customers.

> For (3), there is already a GUC that would avoid the slowdown during startup.
> Use it if you think the startup time is more important than disk space
> occupied by useless files.

Setting remove_temp_files_after_crash to false only prevents temp file
cleanup during restart after a backend crash.  RemovePgTempFiles() is
still called during all other startups.

> For (4), you are forgetting that the on-disk state of replication slots is
> stored in the pg_replslot/SLOTNAME/state. It seems you cannot just rename the
> replication slot directory and copy the state file. What happens if there is a
> crash before copying the state file?

Good point.  I think it's possible to deal with this, though.  Perhaps
the files that should be deleted on startup should go in a separate
directory, or maybe we could devise a way to ensure the state file is
copied even if there is a crash at an inconvenient time.

Nathan



Re: O(n) tasks cause lengthy startups and checkpoints

2021-12-01 Thread Bossart, Nathan
On 12/1/21, 2:56 PM, "Andres Freund"  wrote:
> On 2021-12-01 20:24:25 +0000, Bossart, Nathan wrote:
>> I realize adding a new maintenance worker might be a bit heavy-handed,
>> but I think it would be nice to have somewhere to offload tasks that
>> really shouldn't impact startup and checkpointing.  I imagine such a
>> process would come in handy down the road, too.  WDYT?
>
> -1. I think the overhead of an additional worker is disproportional here. And
> there's simplicity benefits in having a predictable cleanup interlock as well.

Another idea I had was to put some upper limit on how much time is
spent on such tasks.  For example, a checkpoint would only spend X
minutes on CheckPointSnapBuild() before giving up until the next one.
I think the main downside of that approach is that it could lead to
unbounded growth, so perhaps we would limit (or even skip) such tasks
only for end-of-recovery and shutdown checkpoints.  Perhaps the
startup tasks could be limited in a similar fashion.

> I think particularly for the snapshot stuff it'd be better to optimize away
> unnecessary snapshot files, rather than making the cleanup more asynchronous.

I can look into this.  Any pointers would be much appreciated.

Nathan



Re: Fix inappropriate uses of PG_GETARG_UINT32()

2021-12-01 Thread Bossart, Nathan
On 12/1/21, 10:29 AM, "Peter Eisentraut"  
wrote:
> The attached patch fixes this by accepting the argument using
> PG_GETARG_INT32(), doing some checks, and then casting it to unsigned
> for the rest of the code.
>
> The patch also fixes another inappropriate use in an example in the
> documentation.  These two were the only inappropriate uses I found,
> after we had fixed a few recently.

LGTM

Nathan



O(n) tasks cause lengthy startups and checkpoints

2021-12-01 Thread Bossart, Nathan
Hi hackers,

Thanks to 61752af, SyncDataDirectory() can make use of syncfs() to
avoid individually syncing all database files after a crash.  However,
as noted earlier this year [0], there are still a number of O(n) tasks
that affect startup and checkpointing that I'd like to improve.
Below, I've attempted to summarize each task and to offer ideas for
improving matters.  I'll likely split each of these into its own
thread, given there is community interest for such changes.

1) CheckPointSnapBuild(): This function loops through
   pg_logical/snapshots to remove all snapshots that are no longer
   needed.  If there are many entries in this directory, this can take
   a long time.  The note above this function indicates that this is
   done during checkpoints simply because it is convenient.  IIUC
   there is no requirement that this function actually completes for a
   given checkpoint.  My current idea is to move this to a new
   maintenance worker.
2) CheckPointLogicalRewriteHeap(): This function loops through
   pg_logical/mappings to remove old mappings and flush all remaining
   ones.  IIUC there is no requirement that the "remove old mappings"
   part must complete for a given checkpoint, but the "flush all
   remaining" portion allows replay after a checkpoint to only "deal
   with the parts of a mapping that have been written out after the
   checkpoint started."  Therefore, I think we should move the "remove
   old mappings" part to a new maintenance worker (probably the same
   one as for 1), and we should consider using syncfs() for the "flush
   all remaining" part.  (I suspect the main argument against the
   latter will be that it could cause IO spikes.)
3) RemovePgTempFiles(): This step can delay startup if there are many
   temporary files to individually remove.  This step is already
   optionally done after a crash via the remove_temp_files_after_crash
   GUC.  I propose that we have startup move the temporary file
   directories aside and create new ones, and then a separate worker
   (probably the same one from 1 and 2) could clean up the old files.
4) StartupReorderBuffer(): This step deletes logical slot data that
   has been spilled to disk.  This code appears to be written to avoid
   deleting different types of files in these directories, but AFAICT
   there shouldn't be any other files.  Therefore, I think we could do
   something similar to 3 (i.e., move the directories aside during
   startup and clean them up via a new maintenance worker).

I realize adding a new maintenance worker might be a bit heavy-handed,
but I think it would be nice to have somewhere to offload tasks that
really shouldn't impact startup and checkpointing.  I imagine such a
process would come in handy down the road, too.  WDYT?

Nathan

[0] https://postgr.es/m/32B59582-AA6C-4609-B08F-2256A271F7A5%40amazon.com



Re: SKIP LOCKED assert triggered

2021-12-01 Thread Bossart, Nathan
On 11/12/21, 8:56 AM, "Simon Riggs"  wrote:
> The combination of these two statements in a transaction hits an
> Assert in heapam.c at line 4770 on REL_14_STABLE

I've been unable to reproduce this.  Do you have any tips for how to
do so?  Does there need to be some sort of concurrent workload?

Nathan



Re: Postgres restart in the middle of exclusive backup and the presence of backup_label file

2021-12-01 Thread Bossart, Nathan
On 12/1/21, 8:27 AM, "David Steele"  wrote:
> On 11/30/21 18:31, Bossart, Nathan wrote:
>> Do you think it's still worth trying to make it safe, or do you think
>> we should just remove exclusive mode completely?
>
> My preference would be to remove it completely, but I haven't gotten a
> lot of traction so far.

In this thread, I count 6 people who seem alright with removing it,
and 2 who might be opposed, although I don't think anyone has
explicitly stated they are against it.

Nathan



Re: Postgres restart in the middle of exclusive backup and the presence of backup_label file

2021-11-30 Thread Bossart, Nathan
On 11/30/21, 2:58 PM, "David Steele"  wrote:
> I did figure out how to keep the safe part of exclusive backup (not
> having to maintain a connection) while removing the dangerous part
> (writing backup_label into PGDATA), but it was a substantial amount of
> work and I felt that it had little chance of being committed.

Do you think it's still worth trying to make it safe, or do you think
we should just remove exclusive mode completely?

> Attaching the thread [1] that I started with a patch to remove exclusive
> backup for reference.

Ah, good, some light reading.  :)

Nathan



Re: Postgres restart in the middle of exclusive backup and the presence of backup_label file

2021-11-30 Thread Bossart, Nathan
On 11/30/21, 2:27 PM, "Tom Lane"  wrote:
> If we're willing to outright remove it, I don't have any great objection.
> My original two cents was that we shouldn't put effort into improving it;
> but removing it isn't that.

I might try to put a patch together for the January commitfest, given
there is enough support.

Nathan



Re: Postgres restart in the middle of exclusive backup and the presence of backup_label file

2021-11-30 Thread Bossart, Nathan
On 11/30/21, 9:51 AM, "Stephen Frost"  wrote:
> I disagree that that’s a satisfactory approach. It certainly wasn’t
> intended or documented as part of the original feature and therefore
> to call it satisfactory strikes me quite strongly as revisionist
> history. 

It looks like the exclusive way has been marked deprecated in all
supported versions along with a note that it will eventually be
removed.  If it's not going to be removed out of fear of breaking
backward compatibility, I think the documentation should be updated to
say that.  However, unless there is something that is preventing users
from switching to the non-exclusive approach, I think it is reasonable
to begin thinking about removing it.

Nathan



Re: pg_replslotdata - a tool for displaying replication slot information

2021-11-30 Thread Bossart, Nathan
On 11/30/21, 6:14 AM, "Peter Eisentraut"  
wrote:
> On 23.11.21 06:09, Bharath Rupireddy wrote:
>> The replication slots data is stored in binary format on the disk under
>> the pg_replslot/<> directory which isn't human readable. If
>> the server is crashed/down (for whatever reasons) and unable to come up,
>> currently there's no way for the user/admin/developer to know what were
>> all the replication slots available at the time of server crash/down to
>> figure out what's the restart lsn, xid, two phase info or types of slots
>> etc.
>
> What do you need that for?  You can't do anything with a replication
> slot while the server is down.

One use-case might be to discover the value you need to set for
max_replication_slots, although it's pretty trivial to discover the
number of replication slots by looking at the folder directly.
However, you also need to know how many replication origins there are,
and AFAIK there isn't an easy way to read the replorigin_checkpoint
file at the moment.  IMO a utility like this should also show details
for the replication origins.  I don't have any other compelling use-cases
at the moment, but I will say that it is typically nice from an
administrative standpoint to be able to inspect things like this
without logging into a running server.

Nathan



Re: improve CREATE EXTENSION error message

2021-11-29 Thread Bossart, Nathan
On 11/29/21, 2:32 PM, "Chapman Flack"  wrote:
> If it were me, I would combine that DETAIL and HINT as one larger HINT,
> and use DETAIL for specific details about what actually happened (such
> as the exact filename sought and the %m).
>
> The need for those details doesn't go away; they're still what you need
> when what went wrong is some other freak occurrence the hint doesn't
> explain.

How's this?

postgres=# CREATE EXTENSION does_not_exist;
ERROR:  extension "does_not_exist" is not available
DETAIL:  Extension control file 
"/usr/local/pgsql/share/extension/does_not_exist.control" does not exist.
HINT:  The extension must first be installed on the system where 
PostgreSQL is running.  The pg_available_extensions view lists the available 
extensions.

Nathan



Re: improve CREATE EXTENSION error message

2021-11-29 Thread Bossart, Nathan
On 11/29/21, 2:13 PM, "Bossart, Nathan"  wrote:
> Alright, here's v3.  In this version, I actually removed the message
> about the control file entirely, so now the error message looks like
> this:
>
> postgres=# CREATE EXTENSION does_not_exist;
> ERROR:  extension "does_not_exist" is not available
> DETAIL:  The extension must first be installed on the system where 
> PostgreSQL is running.
> HINT:  The pg_available_extensions view lists the extensions that are 
> available for installation.
>
> I can add the control file part back if we think it's necessary.

Hm.  I should probably adjust the hint to avoid confusion from
"installed on the system" and "available for installation."  Maybe
something like

The pg_available_extensions view lists the available extensions.

Nathan



Re: improve CREATE EXTENSION error message

2021-11-29 Thread Bossart, Nathan
On 11/29/21, 1:38 PM, "Chapman Flack"  wrote:
> On 11/29/21 16:31, Daniel Gustafsson wrote:
>> That's a good point, the hint is targeting users who might not even know that
>> an extension needs to be physically and separately installed on the machine
>> before it can be installed in their database; so maybe using "installed" here
>> isn't entirely helpful at all.  That being said I'm at a loss for a more
>> suitable word, "available" perhaps?
>
> Maybe a larger break with the "This means the extension something something"
> formulation, and more on the lines of
>
> HINT:  an extension must first be present (for example, installed with a
>  package manager) on the system where PostgreSQL is running.

I like this idea.  I can do it this way in the next revision if others
agree.

Nathan



Re: improve CREATE EXTENSION error message

2021-11-29 Thread Bossart, Nathan
On 11/29/21, 1:03 PM, "Tom Lane"  wrote:
> If we issue the hint only for errno == ENOENT, I think we could be
> less wishy-washy (and if it's not ENOENT, the hint is likely
> inappropriate anyway).  I'm thinking something more like
>
> HINT:  This means the extension is not installed on the system.

Good idea.

> I'm not quite satisfied with the "on the system" wording, but I'm
> not sure of what would be better.  I agree that we can't just say
> "is not installed", because people will confuse that with whether
> it is installed within the database.

Right.  The only other idea I have at the moment is to say something
like

This means the extension is not available[ on the system].

I don't know whether that is actually any less confusing, though.

Nathan



Re: improve CREATE EXTENSION error message

2021-11-29 Thread Bossart, Nathan
On 11/29/21, 12:33 PM, "Daniel Gustafsson"  wrote:
> I haven't given the suggested wording too much thought, but in general that
> sounds like a good idea.

Thanks.  I'm flexible with the wording.

Nathan



Re: Postgres restart in the middle of exclusive backup and the presence of backup_label file

2021-11-29 Thread Bossart, Nathan
On 11/26/21, 7:33 AM, "Tom Lane"  wrote:
> Michael Paquier  writes:
>> On Thu, Nov 25, 2021 at 06:19:03PM -0800, SATYANARAYANA NARLAPURAM wrote:
>>> If we are keeping it then why not make it better?
>
>> Well, non-exclusive backups are better by design in many aspects, so I
>> don't quite see the point in spending time on something that has more
>> limitations than what's already in place.
>
> IMO the main reason for keeping it is backwards compatibility for users
> who have a satisfactory backup arrangement using it.  That same argument
> implies that we shouldn't change how it works (at least, not very much).

The issues with exclusive backups seem to be fairly well-documented
(e.g., c900c15), but perhaps there should also be a note in the
"Backup Control Functions" table [0].

Nathan

[0] 
https://www.postgresql.org/docs/devel/functions-admin.html#FUNCTIONS-ADMIN-BACKUP



Re: XMAX_LOCK_ONLY and XMAX_COMMITTED (fk/multixact code)

2021-11-29 Thread Bossart, Nathan
On 11/25/21, 9:16 AM, "Mark Dilger"  wrote:
>> On Nov 24, 2021, at 12:53 PM, Bossart, Nathan  wrote:
>>
>> Another option we might consider is only checking for the
>> HEAP_XMAX_LOCK_ONLY bit instead of everything in
>> HEAP_XMAX_IS_LOCKED_ONLY.  IIUC everything else is only expected to
>> happen for upgrades from v9.2 or earlier, so it might be pretty rare
>> at this point.  Otherwise, I'll extract the exact bit pattern for the
>> error message as you suggest.
>
> I would prefer to detect and report any "can't happen" bit patterns without
> regard for how likely the pattern may be.  The difficulty is in proving that a
> bit pattern is disallowed.  Just because you can't find a code path in the
> current code base that would create a pattern doesn't mean it won't have
> legitimately been created by some past release or upgrade path.  As such, any
> prohibitions explicitly in the backend, such as Asserts around a condition,
> are really valuable.  You can know that the pattern is disallowed, since the
> server would Assert on it if encountered.
>
> Aside from that, I don't really buy the argument that databases upgraded from 
> v9.2 or earlier are rare.  Even if servers *running* v9.2 or earlier are (or 
> become) rare, servers initialized that far back which have been upgraded one 
> or more times since then may be common.

Okay, I'll do it that way in the next revision.

Nathan



Re: XMAX_LOCK_ONLY and XMAX_COMMITTED (fk/multixact code)

2021-11-24 Thread Bossart, Nathan
On 11/23/21, 4:59 PM, "Mark Dilger"  wrote:
>> On Nov 23, 2021, at 4:51 PM, Bossart, Nathan  wrote:
>>
>> This is a good point.  Right now, you'd have to manually inspect the
>> infomask field to determine the exact form of the invalid combination.
>> My only worry with this is that we'd want to make sure these checks
>> stayed consistent with the definition of HEAP_XMAX_IS_LOCKED_ONLY in
>> htup_details.h.  I'm guessing HEAP_XMAX_IS_LOCKED_ONLY is unlikely to
>> change all that often, though.
>
> I expect that your patch will contain some addition to the amcheck (or 
> pg_amcheck) tests, so if we ever allow some currently disallowed bit 
> combination, we'd be reminded of the need to update amcheck.  So I'm not too 
> worried about that aspect of this.
>
> I prefer not to have to get a page (or full file) from a customer when the 
> check reports corruption.  Even assuming they are comfortable giving that, 
> which they may not be, it is an extra round trip to the customer asking for 
> stuff.

Another option we might consider is only checking for the
HEAP_XMAX_LOCK_ONLY bit instead of everything in
HEAP_XMAX_IS_LOCKED_ONLY.  IIUC everything else is only expected to
happen for upgrades from v9.2 or earlier, so it might be pretty rare
at this point.  Otherwise, I'll extract the exact bit pattern for the
error message as you suggest.

Nathan



Re: XMAX_LOCK_ONLY and XMAX_COMMITTED (fk/multixact code)

2021-11-23 Thread Bossart, Nathan
On 11/23/21, 4:36 PM, "Mark Dilger"  wrote:
> I have to wonder if, when corruption is reported for conditions like this:
>
> +   if ((ctx->tuphdr->t_infomask & HEAP_XMAX_COMMITTED) &&
> +   HEAP_XMAX_IS_LOCKED_ONLY(ctx->tuphdr->t_infomask))
>
> if the first thing we're going to want to know is which branch of the 
> HEAP_XMAX_IS_LOCKED_ONLY macro evaluated true?  Should we split this check to 
> do each branch of the macro separately, such as:
>
> if (ctx->tuphdr->t_infomask & HEAP_XMAX_COMMITTED)
> {
>     if (ctx->tuphdr->t_infomask & HEAP_XMAX_LOCK_ONLY)
>         ... report something ...
>     else if ((ctx->tuphdr->t_infomask & (HEAP_XMAX_IS_MULTI | HEAP_LOCK_MASK)) == HEAP_XMAX_EXCL_LOCK)
>         ... report a different thing ...
> }

This is a good point.  Right now, you'd have to manually inspect the
infomask field to determine the exact form of the invalid combination.
My only worry with this is that we'd want to make sure these checks
stayed consistent with the definition of HEAP_XMAX_IS_LOCKED_ONLY in
htup_details.h.  I'm guessing HEAP_XMAX_IS_LOCKED_ONLY is unlikely to
change all that often, though.

Nathan



Re: XMAX_LOCK_ONLY and XMAX_COMMITTED (fk/multixact code)

2021-11-23 Thread Bossart, Nathan
The archives seem unhappy with my attempt to revive this old thread,
so here is a link to it in case anyone is looking for more context:


https://www.postgresql.org/message-id/flat/3476708e-7919-20cb-ca45-6603470565f7%40amazon.com

Nathan



Re: Sequence's value can be rollback after a crashed recovery.

2021-11-23 Thread Bossart, Nathan
On 11/23/21, 1:41 PM, "Tom Lane"  wrote:
> I wrote:
>> I wonder though if we shouldn't try to improve the existing text.
>> The phrasing "never rolled back" seems like it's too easily
>> misinterpreted.  Maybe rewrite the  block like
>> ...
>
> A bit of polishing later, maybe like the attached.

The doc updates look good to me.  Yesterday I suggested possibly
adding a way to ensure that nextval() called in an uncommitted
transaction was persistent, but I think we'd have to also ensure that
synchronous replication waits for those records, too.  Anyway, I don't
think it is unreasonable to require the transaction to be committed to
avoid duplicates from nextval().

Nathan



Re: Sequence's value can be rollback after a crashed recovery.

2021-11-22 Thread Bossart, Nathan
On 11/22/21, 5:10 AM, "Laurenz Albe"  wrote:
> On Mon, 2021-11-22 at 15:43 +0800, Andy Fan wrote:
>> The performance argument was expected before this writing. If we look at
>> nextval_internal more carefully, we can find it would not flush the xlog
>> every time, even when the sequence's cache size is 1. Currently it happens
>> every 32 times on nextval_internal in the worst case.
>
> Right, I didn't think of that.  Still, I'm -1 on this performance regression.

I periodically hear rumblings about this behavior as well.  At the
very least, it certainly ought to be documented if it isn't yet.  I
wouldn't mind trying my hand at that.  Perhaps we could also add a new
configuration parameter if users really want to take the performance
hit.

Nathan



Re: Improving psql's \password command

2021-11-22 Thread Bossart, Nathan
On 11/21/21, 11:15 AM, "Tom Lane"  wrote:
> "Bossart, Nathan"  writes:
>> Yeah, I was looking for a way to simplify this, too.  Your approach
>> seems reasonable enough to me.
>
> Hearing no contrary opinions, pushed that way.

Thanks!

Nathan


